stackErr - 4 days ago
Python Question

Parsing structured data from a file in Python

The file has the following format:


Component_name - version - author@email.com - multi-line comment with new lines and other white space characters

\t ...continue multi-line comment

Component_name2 - version - author2@email.com - possibly multi-line comment with new lines and other white space characters

Component_name - version - author@email.com - possibly multi-line comment with new lines and other white space characters 2

Component_name - version - author2@email.com - possibly multi-line comment with new lines and other white space characters 2

and so on...


After parsing, the output should be grouped by component_name:

output = [
"component_name" -> ["version - author@email.com - comment 1", "version - author@email.com - comment 2", ...],
"component_name2" -> [...],
...
]


This is what I have so far to parse it:

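import re
from itertools import groupby

# textFile is assumed to hold the whole file contents as one string
reTemp = r"[\w\_\-]*( \- )(\d*\.?){3}( \- )[\w\d\_\-\.\@]*( \- )[\S ]*"
numData = 4
reFormat = re.compile(reTemp)

textFileLines = textFile.split("\n")
# keep only lines that look like a record header, split into 4 fields
temp = [x.split(" - ", numData - 1) for x in textFileLines if re.search(reFormat, x)]
m = filter(None, temp) # remove all empty lists
# note: groupby only merges consecutive items that share a key
group = groupby(m, lambda y: y[0].strip())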


This works for single-line comments but fails on multi-line comments. I am also not sure whether regex is the right tool for this. Is there a better, more Pythonic way to do it?

EDIT:


  • Multi-line comments continue on a new line that starts with a tab (\t), as in the first entry above

  • Comments are GIT commit messages and can contain JSON or code

  • Entries are separated by a newline character


Answer

I've had to deal with structured data files like this and ended up writing a state machine to parse the file. Something like this (rough pseudocode):

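records = []
record = None
for line in file:
    if line matches new_record_regex:
        if record is not None:
            records.append(record)    # store the previous, now complete, record
        record = {"name": field0, "version": field1, "author": field2, "comment": field3}
    elif record is not None:
        record["comment"] += line     # continuation of a multi-line comment
if record is not None:
    records.append(record)            # don't forget the last record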
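For what it's worth, here is a minimal runnable sketch of that state machine, plus the grouping by component name the question asks for. The names HEADER_RE, parse_records and group_by_component are mine, and the header pattern is only a guess from the sample in the question (name without spaces, fields separated by " - ", an "@" in the author field), so adjust it to your real data. It also assumes that lines starting with a tab are always continuations and that blank lines can be dropped.

```python
import re
from collections import defaultdict

# Hypothetical header pattern inferred from the sample in the question:
# "name - version - author@email - first line of comment"
HEADER_RE = re.compile(
    r"^(?P<name>[\w-]+) - (?P<version>\S+) - (?P<author>\S+@\S+) - (?P<comment>.*)$"
)


def parse_records(text):
    """Return a list of dicts with keys name, version, author, comment."""
    records = []
    current = None  # record being built; None until the first header is seen
    for line in text.splitlines():
        # A leading tab marks a continuation line, so never treat it as a header.
        match = None if line.startswith("\t") else HEADER_RE.match(line)
        if match:
            if current is not None:
                records.append(current)  # previous record is complete
            current = match.groupdict()
        elif current is not None and line.strip():
            # Any other non-blank line is appended to the current comment.
            current["comment"] += "\n" + line.strip()
    if current is not None:
        records.append(current)  # flush the last record
    return records


def group_by_component(records):
    """Map component name -> list of 'version - author - comment' strings."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["name"]].append(
            f'{rec["version"]} - {rec["author"]} - {rec["comment"]}'
        )
    return dict(grouped)
```

Usage would be something like group_by_component(parse_records(open(path).read())). Collecting into a dict avoids the sorting that itertools.groupby would need, since groupby only merges adjacent items with the same key.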