siamii siamii - 2 months ago 13
Python Question

parse xml like document with regex

I have a file that has many xml-like elements such as this one:

<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958

I need to parse the docid and the text. What's a suitable regular expression for that?

I've tried this but it doesn't work:

collectionText = open('documents.txt').read()
docsPattern = r'<document docid=(\d+)>(.)*</document>'
docTuples = re.findall(docsPattern, collectionText)

EDIT: I've modified the pattern like this:

<document docid=(\d+)>(.*)</document>

This matches the whole document unfortunately not the individual document elements.

EDIT2: The correct implementation from Ahmad's and Acorn's answer is:

collectionText = open('documents.txt').read()
docsPattern = r'<document docid=(\d+)>(.*?)</document>'
docTuples = re.findall(docsPattern, collectionText, re.DOTALL)


Your pattern is greedy, so if you have multiple <document> elements it will end up matching all of them.

You can make it non-greedy by using .*?, which means "match zero or more characters, as few as possible." The updated pattern is:

<document docid=(\d+)>(.*?)</document>