Matt Warren - 2 months ago
Python Question

pyparsing capturing groups of arbitrary text with given headers as nested lists

I have a text file that looks similar to this:


section header 1:

some words can be anything

more words could be anything at all

etc etc lala

some other header:

as before could be anything

hey isnt this fun


I am trying to construct a grammar with pyparsing that produces the following list structure when the parsed results are requested as a list (i.e. the following should be printed when iterating through the elements of parsed.asList()):


['section header 1:',[['some words can be anything'],['more words could be anything at all'],['etc etc lala']]]

['some other header:',[['as before could be anything'],['hey isnt this fun']]]


The header names are all known beforehand, and individual headers may or may not appear. If they do appear, there is always at least one line of content.

The problem is that I am having trouble getting the parser to recognise where 'section header 1:' ends and 'some other header:' begins. I end up with a parsed.asList() looking like this:


['section header 1:',[['some words can be anything'],['more words could be anything at all'],['etc etc lala'],['some other header'],['as before could be anything'],['hey isnt this fun']]]


(i.e. 'section header 1:' is recognised correctly, but everything following it gets added to section header 1, including further header lines.)

I've tried various things, and played with leaveWhitespace() and LineEnd() in various ways, but I can't figure it out.

The base parser I am hacking about with is below (a contrived example; in reality this lives in a class definition, etc.):

from pyparsing import Group, Literal, OneOrMore, Word, ZeroOrMore, printables

header_1_line = Literal('section header 1:')
text_line = Group(OneOrMore(Word(printables)))
header_1_block = Group(header_1_line + Group(OneOrMore(text_line)))

header_2_line = Literal('some other header:')
header_2_block = Group(header_2_line + Group(OneOrMore(text_line)))

overall_structure = ZeroOrMore(header_1_block | header_2_block)


and is being called with

parsed=overall_structure.parseFile()


Cheers, Matt.

Answer

Matt -

Welcome to pyparsing! You have fallen into one of the most common pitfalls in working with pyparsing, and that is that people are smarter than computers. When you look at your input text, you can easily see which text can be headers and which text can't be. Unfortunately, pyparsing is not so intuitive, so you have to tell it explicitly what can and can't be text.

When you look at your sample text, you are not accepting just any line of text as possible text within a section. How do you know that 'some other header:' is not valid as text? Because you know that this string matches one of the known header strings. But in your current code, you have told pyparsing that any collection of Word(printables) is valid text, even if that collection is a valid section header.
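A quick way to see this for yourself is the minimal sketch below. It reuses the text_line definition from the question; the sample string is made up for illustration and assumes pyparsing's default whitespace handling, where newlines are skipped like spaces.

```python
from pyparsing import Group, OneOrMore, Word, printables

# Same definition as in the question: one or more whitespace-separated words.
text_line = Group(OneOrMore(Word(printables)))

# Illustrative input (not from the real file): a text line, a header, more text.
sample = "some words can be anything\nsome other header:\nas before could be anything"

# Newlines count as skippable whitespace by default, so OneOrMore keeps going
# across line breaks and never notices the header line.
words = text_line.parseString(sample).asList()[0]
print(words)
print("header:" in words)
```

The header's words end up in the same group as the ordinary text, because nothing in the grammar says a header is not text.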

To fix this, you have to add some explicit lookahead to your parser. Pyparsing offers two lookahead constructs: NotAny (negative lookahead) and FollowedBy (positive lookahead). Neither consumes any input. NotAny can be abbreviated using the '~' operator, so we can write this pseudocode expression for text:

text = ~any_section_header + everything_up_to_the_end_of_the_line
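If the two lookahead constructs are new to you, here is a small sketch of how they behave on their own. The names header, not_header_word, word_before_colon and the helper matches() are hypothetical and only exist for this illustration.

```python
from pyparsing import FollowedBy, Literal, ParseException, Word, alphas

header = Literal("header:")

# Negative lookahead: '~header' is shorthand for NotAny(header). It succeeds
# only if 'header' does NOT match at this position, and consumes nothing.
not_header_word = ~header + Word(alphas + ":")

# Positive lookahead: FollowedBy succeeds only if its expression DOES match
# next, and it also consumes nothing and adds no tokens.
word_before_colon = Word(alphas) + FollowedBy(Literal(":"))


def matches(expr, text):
    """Hypothetical helper: True if expr parses the start of text."""
    try:
        expr.parseString(text)
        return True
    except ParseException:
        return False


print(matches(not_header_word, "hello"))    # an ordinary word is accepted
print(matches(not_header_word, "header:"))  # the header itself is rejected
print(word_before_colon.parseString("name:").asList())
print(matches(word_before_colon, "name"))   # no ':' follows, so no match
```

Note that the ':' checked by FollowedBy does not appear in the results; the lookahead only peeks at the input.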

Here is a complete parser using negative lookahead to make sure you read each section, breaking on section headings:

from pyparsing import ParserElement, LineEnd, Literal, restOfLine, ZeroOrMore, Group, StringEnd

test = """
section header 1:
 some words can be anything
 more words could be anything at all
 etc etc lala 

some other header:
 as before could be anything
 hey isnt this fun
"""
# skip only spaces and tabs as whitespace, so that newlines stay significant
ParserElement.setDefaultWhitespaceChars(" \t")
NL = LineEnd().suppress()
END = StringEnd()

header_1=Literal('section header 1:') 
header_2=Literal('some other header:')
any_header = (header_1 | header_2)
# text isn't just anything! don't accept header line, and stop at the end of the input string
text=Group(~any_header + ~END + restOfLine) 

overall_structure = ZeroOrMore(Group(any_header +
                                     Group(ZeroOrMore(text))))
overall_structure.ignore(NL)

from pprint import pprint
pprint(overall_structure.parseString(test).asList())

In my first attempt, I forgot to also look for the end of string, so my restOfLine expression looped forever. By adding a second lookahead for the string end, my program terminates successfully. Exercise left for you: instead of enumerating all possible headers, define a header line as any line that ends with a ':'.
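If you want to check your work on that exercise, one possible sketch follows. It is not the only way to do it, and it rests on an assumption that goes beyond the original problem statement: a header is any line whose last non-blank character is ':', so a body line that happens to end in a colon would be misread as a header. It uses Regex instead of Literal and restOfLine, and relies on pyparsing's default whitespace skipping.

```python
import re
from pyparsing import Group, Regex, ZeroOrMore

test = """
section header 1:
 some words can be anything
 more words could be anything at all
 etc etc lala

some other header:
 as before could be anything
 hey isnt this fun
"""

# Assumption: a header is a line ending in ':' (trailing blanks allowed).
# The lookahead keeps those trailing blanks out of the matched token.
header = Regex(r".*:(?=[ \t]*$)", flags=re.MULTILINE)

# A body line is any non-empty line that is not a header. The pattern ".*\S"
# needs at least one character, so it cannot match at the end of the input,
# and it leaves trailing blanks out of the token.
text = Group(~header + Regex(r".*\S"))

overall_structure = ZeroOrMore(Group(header + Group(ZeroOrMore(text))))

result = overall_structure.parseString(test).asList()
for section in result:
    print(section)
```

Because the body pattern must consume at least one character, this version does not need the extra lookahead for the end of the string.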

Good luck with your pyparsing efforts, -- Paul