I have a text file that looks similar to this:
section header 1:
some words can be anything
more words could be anything at all
etc etc lala
some other header:
as before could be anything
hey isnt this fun
I want to parse it into:
['section header 1:',[['some words can be anything'],['more words could be anything at all'],['etc etc lala']]]
['some other header:',[['as before could be anything'],['hey isnt this fun']]]
but instead I get:
['section header 1:',[['some words can be anything'],['more words could be anything at all'],['etc etc lala'],['some other header'],['as before could be anything'],['hey isnt this fun']]]
header_1_line=Literal('section header 1:')
header_2_line=Literal('some other header:')
Welcome to pyparsing! You have fallen into one of the most common pitfalls in working with pyparsing, and that is that people are smarter than computers. When you look at your input text, you can easily see which text can be headers and which text can't be. Unfortunately, pyparsing is not so intuitive, so you have to tell it explicitly what can and can't be text.
When you look at your sample text, you are not accepting just any line of text as possible text within a section. How do you know that 'some other header:' is not valid as text? Because you know that the string matches one of the known header strings. But in your current code, you have told pyparsing that any collection of Word(printables) is valid text, even if that collection is a valid section header.
To fix this, you have to add some explicit lookahead to your parser. Pyparsing offers two lookahead constructs: NotAny (negative lookahead) and FollowedBy (positive lookahead). NotAny can be abbreviated using the '~' operator, so we can write this pseudocode expression for text:
text = ~any_section_header + everything_up_to_the_end_of_the_line
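To make the pseudocode concrete, here is a minimal sketch of both lookaheads applied to a single line. The helper `is_text` and the `peek` expression are names made up for this illustration; the header literals are the ones from the question.

```python
from pyparsing import FollowedBy, Literal, ParseException, restOfLine

header_1 = Literal('section header 1:')
header_2 = Literal('some other header:')
any_header = header_1 | header_2

# Negative lookahead: accept the rest of the line only if it does
# NOT start with one of the known headers.
text = ~any_header + restOfLine

def is_text(line):
    """Hypothetical helper: True if `line` parses as plain text."""
    try:
        text.parseString(line)
    except ParseException:
        return False
    return True

print(is_text('some words can be anything'))  # True
print(is_text('some other header:'))          # False

# Positive lookahead: FollowedBy checks for a header without consuming
# it, so restOfLine still sees the whole line.
peek = FollowedBy(any_header) + restOfLine
print(peek.parseString('some other header:').asList())  # ['some other header:']
```

Neither lookahead consumes input or adds tokens of its own; each only decides whether parsing may continue at the current position.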
Here is a complete parser using negative lookahead to make sure you read each section, breaking on section headings:
from pyparsing import ParserElement, LineEnd, Literal, restOfLine, ZeroOrMore, Group, StringEnd
from pprint import pprint

test = """
section header 1:
some words can be anything
more words could be anything at all
etc etc lala
some other header:
as before could be anything
hey isnt this fun
"""

ParserElement.defaultWhitespaceChars=(" \t")
NL = LineEnd().suppress()
END = StringEnd()

header_1=Literal('section header 1:')
header_2=Literal('some other header:')
any_header = (header_1 | header_2)

# text isn't just anything! don't accept header line, and stop at the end of the input string
text=Group(~any_header + ~END + restOfLine)

overall_structure = ZeroOrMore(Group(any_header + Group(ZeroOrMore(text))))
overall_structure.ignore(NL)

pprint(overall_structure.parseString(test).asList())
In my first attempt, I forgot to also look for the end of string, so my restOfLine expression looped forever. By adding a second lookahead for the string end, my program terminates successfully. Exercise left for you: instead of enumerating all possible headers, define a header line as any line that ends with a ':'.
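If you get stuck on the exercise, here is one possible approach, offered as a sketch rather than the definitive answer. It assumes a header is a whole line whose last character is ':' (no trailing spaces), and it keeps pyparsing's default whitespace skipping so newlines are skipped automatically. The names `header`, `text_line`, `section`, and `parser` are mine, not from the code above.

```python
from pyparsing import Group, Regex, StringEnd, ZeroOrMore

test = """\
section header 1:
some words can be anything
more words could be anything at all
etc etc lala
some other header:
as before could be anything
hey isnt this fun
"""

# Assumption: a header is an entire line that ends with ':'.
# The regex lookahead requires the ':' to sit right before a newline
# or the end of the input, so 'note: see below' is not a header.
header = Regex(r'[^\n]*:(?=\n|$)')

# A text line is any non-empty line that is not a header; the second
# lookahead stops the repetition at the end of the input.
text_line = ~header + ~StringEnd() + Regex(r'[^\n]+')

section = Group(header + Group(ZeroOrMore(Group(text_line))))
parser = ZeroOrMore(section) + StringEnd()

result = parser.parseString(test).asList()
for header_text, lines in result:
    print(header_text, lines)
```

The trade-off is that any line ending in a colon is now treated as a header, so this only suits input where body text never ends a line with ':'.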
Good luck with your pyparsing efforts, -- Paul