Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
988 views
in Technique[技术] by (71.8m points)

performance - Python efficient way to check if very large string contains a substring

Python is not my best language, and so I'm not all that good at finding the most efficient solutions to some of my problems. I have a very large string (coming from a 30 MB file) and I need to check if that file contains a smaller substring (this string is only a few dozen characters). The way I am currently doing it is:

if small_string in large_string:
    # logic here

But this seems to be very inefficient because it will check every possible sequence of characters in the file. I know that there will only be an exact match on a newline, so would it be better to read the file in as a list and iterate through that list to match?

EDIT: To clear up some confusion on "matching on a newline only", here's an example:

small_string = "This is a line"
big_string = "This is a line
This is another line
This is yet another"

If I'm not mistake, the in keyword will check all sequences, not just every line.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Is it really slow? You're talking about 30MB string; let's try it with even bigger string:

In [12]: string="agu82934u"*50*1024*1024+"string to be found"

In [13]: len(string)
Out[13]: 471859218

In [14]: %timeit "string to be found" in string
1 loops, best of 3: 335 ms per loop

In [15]: %timeit "string not to be found" in string
1 loops, best of 3: 200 ms per loop

I wouldn't say that 335 ms is much time looking for substring in 450MB string.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...