Ev. Kounis - 6 months ago
Python Question

A rather special parsing of a txt file

Alright, good people of Stack Overflow, my question is on the broad subject of parsing. The information I want to obtain appears at multiple positions in a text file, each occurrence marked by begin and end headers (special strings). I want to get everything that's between these headers. The code I have implemented so far seems terribly inefficient (although not slow) and, as you can see below, makes use of two while statements.

with open(sessionFile, 'r') as inp_ses:
    curr_line = inp_ses.readline()
    while 'ga_group_create' not in curr_line:
        curr_line = inp_ses.readline()
    set_name = curr_line.split("\"")[1]
    recording = []
    curr_line = inp_ses.readline()
    # now looking for the next instance
    while 'ga_group_create' not in curr_line:
        recording.append(curr_line)
        curr_line = inp_ses.readline()


Pay no attention to the fact that the begin and end headers are the same string (just call them "begin" and "end"). The code above gives me the text between the headers only the first time they appear. I can modify it to give me the rest by keeping track of variables that increment on every instance, modifying my while statements, etc., but all this feels like trying to reinvent the wheel, and in a very bad way too.

Is there anything out there I can make use of?

Answer

I agree regex is a good way to go here, but this is a more direct application to your problem:

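import re

# DOTALL lets '.' match newlines, so one match can span several lines
options = re.DOTALL | re.MULTILINE
with open('parsexample.txt') as f:
    contents = f.read()
# the greedy (.*) runs from the first marker to the last one in the file
m = re.search('ga_group_create(.*)ga_group_create', contents,
              options)
# split() breaks on any whitespace; use splitlines() to keep whole lines
lines_in_between = m.group(1).split()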
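
To see what the greedy pattern captures without needing a file on disk, here is a minimal sketch on a made-up string. The sample text and the variable names are assumptions for illustration only. It also guards against re.search returning None when the markers are missing.

```python
import re

# Made-up sample standing in for the file contents
sample = (
    'ga_group_create "a"\n'
    'line 1\n'
    'ga_group_create "b"\n'
    'line 2\n'
    'ga_group_create "c"\n'
)

m = re.search('ga_group_create(.*)ga_group_create', sample, re.DOTALL)
# re.search returns None when nothing matches, so guard before using m
captured = m.group(1) if m else ''
# With three markers, the greedy capture swallows the middle one as well
middle_marker_included = 'ga_group_create "b"' in captured
print(middle_marker_included)
```

With exactly two markers this is what you want; with more, the capture runs through every marker in between.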

If you have a couple of these groups, you can iterate through them:

for m in re.finditer('ga_group_create(.*?)ga_group_create', contents, options):
    print(m.group(1).split())

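Notice I've used *? to do non-greedy matching, so each match stops at the nearest closing marker instead of running on to the last one in the file.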
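
One caveat if the begin and end headers really are the same string, as in the question: each match consumes its closing marker, so that marker cannot also open the next match, and only every other block comes back. A possible workaround is a lookahead, which checks for the next marker without consuming it. The sketch below assumes a made-up sample and assumes the set name sits in double quotes on the marker line (mirroring the question's split on the quote character). The name recordings is hypothetical.

```python
import re

# Made-up sample; the quoted name on each marker line mirrors the
# question's curr_line.split("\"")[1], but the exact layout is assumed.
sample = (
    'header junk\n'
    'ga_group_create "first"\n'
    'a 1\n'
    'a 2\n'
    'ga_group_create "second"\n'
    'b 1\n'
    'ga_group_create "third"\n'
    'c 1\n'
)

# Consuming the closing marker pairs the markers up, so blocks are skipped
consuming = re.findall('ga_group_create(.*?)ga_group_create', sample, re.DOTALL)

# (?=...) is a lookahead: it checks for the next marker (or end of text, \Z)
# without consuming it, so that marker can open the following match.
pattern = re.compile(
    r'ga_group_create[^\n]*?"([^"\n]*)"[^\n]*\n(.*?)(?=ga_group_create|\Z)',
    re.DOTALL,
)

# Map each quoted set name to the lines that follow its marker line
recordings = {
    m.group(1): m.group(2).splitlines()
    for m in pattern.finditer(sample)
}
print(len(consuming), recordings)
```

Note that this treats the text after the last marker as a block running to the end of the file; drop the \Z alternative if a block should only count when another marker follows it.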