I'm just grepping some Xliff files for the pattern
grep 'approved="no"' FILE
def grep(pattern, file_path):
ret = False
with codecs.open(file_path, "r", encoding="utf-8") as f:
while 1 and not ret:
lines = f.readlines(100000)
if not lines:
for line in lines:
if re.search(pattern, line):
ret = True
Python, being an interpreted language vs. a compiled C version of
grep will always be slower.
Apart from that your Python implementation is not the same as your
grep example. It is not returning the matching lines, it is merely testing to see if the pattern matches the characters on any one line. A closer comparison would be:
grep -q 'approved="no"' FILE
which will return as soon as a match is found and not produce any output.
You can substantially speed up your code by writing your
grep() function more efficiently:
def grep_1(pattern, file_path): with io.open(file_path, "r", encoding="utf-8") as f: while True: lines = f.readlines(100000) if not lines: return False if re.search(pattern, ''.join(lines)): return True
io instead of
codecs which I found was a little faster. The while loop condition does not need to check
ret and you can return from the function as soon as the result is known. There's no need to run
re.search() for each individual ilne - just join the lines and perform a single search.
At the cost of memory usage you could try this:
import io def grep_2(pattern, file_path): with io.open(file_path, "r", encoding="utf-8") as f: return re.search(pattern, f.read())
If memory is an issue you could
mmap the file and run the regex search on the
import io import mmap def grep_3(pattern, file_path): with io.open(file_path, "r", encoding="utf-8") as f: return re.search(pattern, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))
mmap will efficiently read the data from the file in pages without consuming a lot of memory. Also, you'll probably find that
mmap runs faster than the other solutions.
timeit for each of these functions shows that this is the case:
10 loops, best of 3: 639 msec per loop # grep() 10 loops, best of 3: 78.7 msec per loop # grep_1() 10 loops, best of 3: 19.4 msec per loop # grep_2() 100 loops, best of 3: 5.32 msec per loop # grep_3()
The file was
/usr/share/dict/words containing approx 480,000 lines and the search pattern was
zymurgies, which occurs near the end of the file. For comparison, when pattern is near the start of the file, e.g.
abaciscus, the times are:
10 loops, best of 3: 62.6 msec per loop # grep() 1000 loops, best of 3: 1.6 msec per loop # grep_1() 100 loops, best of 3: 14.2 msec per loop # grep_2() 10000 loops, best of 3: 37.2 usec per loop # grep_3()
which again shows that the
mmap version is fastest.
Now comparing the
grep command with the Python
$ time grep -q zymurgies /usr/share/dict/words real 0m0.010s user 0m0.007s sys 0m0.003s $ time python x.py grep_3 # uses mmap real 0m0.023s user 0m0.019s sys 0m0.004s
Which is not too bad considering the advantages that