Lexicon - 16 days ago
Python Question

Looking for a Pattern 512 bytes at a time and printing match.

I need to find a string that matches a particular pattern in a very large log file (several GB). The problem is that I can only look at 512 bytes of the file at a time. When I search 512 bytes at a time, the pattern isn't always found, because it may straddle two chunks. For example, if the pattern is "potato", the first part of the word may sit at the end of one chunk while the rest sits at the beginning of the next.

Ideally, I'd like to substitute the pattern for a regex and only print the string that the pattern matched. I'd love to see how others would approach a problem like this. Any help would be greatly appreciated.

import sys
import re

file = open(sys.argv[1], "rb")
pattern = re.compile('potato')

try:
    chunk= file.read(512)
    while byte != "":
        if pattern.search(chunk):
            print chunk
            # TODO: Print only the part that matched pattern
        chunk = file.read(100)
finally:
    file.close()

Answer

First, create a group using parentheses.

Read the file in chunks, and search in the tail of the previous chunk plus the current chunk (not in both whole chunks, because a match would then be reported again on the next read). I keep only about the pattern's length from the end of the previous chunk, which may be wrong for a real regular expression whose matches can vary in length.

Then if you find a match, just print the first and only group, like this:

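import re
import sys

file = open(sys.argv[1], "rb")
ptrn = b"potato"   # bytes, because the file is opened in binary mode
pattern = re.compile(b"(" + ptrn + b")", re.DOTALL)  # group & multi-line match
overlap = len(ptrn) - 1   # longest tail that cannot hold a whole match

prev = b""
try:
    while True:
        chunk = file.read(512)
        if not chunk:
            break
        m = pattern.search(prev + chunk)
        if m:
            # Print only the part that matched pattern
            print(m.group(1))
        # keep end of previous chunk; one byte less than the pattern, so a
        # match lying entirely at the end of a chunk is not printed twice
        prev = chunk[-overlap:] if overlap else b""
finally:
    file.close()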

Notes:

  • Since you can encounter an end-of-line inside a chunk when reading like this instead of line by line, I suggest the re.DOTALL flag for a multi-line match
  • There's probably a typo in the while condition: you probably meant chunk and not byte. I have fixed that and simplified the read loop (your loop also went on reading 100 bytes at a time instead of 512)
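
For a real regular expression, whose matches vary in length, the fixed-length tail above is not enough. Here is a minimal sketch of one way to extend the idea, not a definitive implementation. It assumes you can put an upper bound on the length of any match: the names iter_matches and max_match_len are hypothetical, and the pattern pota+to is only an illustration. A match that starts too close to the end of the buffered data is not reported yet, because it could continue into the next chunk. It is reported on a later pass, once enough bytes have been read.

```python
import io
import re


def iter_matches(stream, pattern, chunk_size=512, max_match_len=64):
    """Yield every match of a compiled bytes regex found in a binary stream.

    Assumption (not from the original post): no match is longer than
    max_match_len bytes. Longer matches may be truncated or missed.
    """
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        buf += chunk
        # A match starting at or after `limit` could still grow once the
        # next chunk arrives, so it is left for the next pass.
        limit = len(buf) if at_eof else len(buf) - max_match_len + 1
        keep_from = max(limit, 0)
        for m in pattern.finditer(buf):
            if m.start() >= limit:
                break
            yield m.group(0)
            # Never keep bytes that belong to a match already reported.
            keep_from = max(keep_from, m.end())
        if at_eof:
            break
        buf = buf[keep_from:]


if __name__ == "__main__":
    # "potato" straddles the first 512-byte boundary in this sample data.
    sample = b"x" * 509 + b"potato" + b"y" * 600 + b"potaaato" + b"z" * 10
    print(list(iter_matches(io.BytesIO(sample), re.compile(b"pota+to"))))
```

The buffer never grows beyond one chunk plus max_match_len - 1 bytes. Using finditer instead of search also reports every match in a chunk, not only the first one.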