Dave Dave - 2 months ago 16
Python Question

How to use pyparsing LineStart?

I'm trying to use pyparsing to parse key:value pairs from the comments in a document. A key starts at the beginning of a line, and a value follows. Values may be continued on multiple lines that begin with whitespace.

import pyparsing as pp

instring = """
-- This is (a) #%^& comment

/*
name1: val
name2: val2 with $*&#@) junk
name3: val3: with @)(*% multi-
  line: content
*/
"""

comment1 = pp.Literal("--") + pp.originalTextFor(pp.SkipTo(pp.LineEnd())).setDebug()
identifier = pp.Word(pp.alphanums + "_").setDebug()
meta1 = pp.LineStart() + identifier + pp.Literal(":") + pp.SkipTo(pp.LineEnd())
meta2 = pp.LineStart() + pp.White() + pp.SkipTo(pp.LineEnd())
metaval = meta1 + pp.ZeroOrMore(meta2)
metalist = pp.ZeroOrMore(comment1) + pp.Literal("/*") + pp.OneOrMore(metaval) + pp.Literal("*/")

if __name__ == "__main__":
    p = metalist.parseString(instring)
    print(p)


Fails with:

Matched {Empty SkipTo:(LineEnd) Empty} -> ['This is (a) #%^& comment']

  File "C:\Users\user\py3\lib\site-packages\pyparsing.py", line 2305, in parseImpl
    raise ParseException(instring, loc, self.errmsg, self)
pyparsing.ParseException: Expected start of line (at char 32), (line:4, col:1)


The answer to "pyparsing whitespace match issues" says:

LineStart has always been difficult to work with, but ...


If the parser is at line 4 column 1 (the first key:value pair), then why is it not finding a start of line? What is the correct pyparsing syntax to recognize lines beginning with no whitespace and lines beginning with whitespace?

Answer

I think my confusion with LineStart comes from this: for LineEnd, I can look for a '\n' character, but there is no separate character for LineStart. So LineStart checks whether the current parser location is positioned just after a '\n'; if it is currently on a '\n', it moves past it and continues. Unfortunately, I implemented this in a place that messes up the reported location, so you get those weird errors that read like "failed to find a start of line on line X col 1," which really does sound like a position where a start of line should have matched. I also think I need to revisit this implicit newline-skipping or, for that matter, all whitespace-skipping in LineStart.
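To make the check described above concrete, here is a minimal plain-Python sketch of the idea. It is only an illustration, not pyparsing's actual source; the names at_line_start and match_line_start are hypothetical helpers invented for this example.

```python
def at_line_start(text, loc):
    """True if loc is offset 0 or immediately follows a newline."""
    return loc == 0 or text[loc - 1] == "\n"


def match_line_start(text, loc):
    """Return the location after a 'start of line' match, or None on failure.

    Illustrative only (not pyparsing code): succeed when already at a line
    start, or when sitting on a newline, in which case step past it.
    """
    if at_line_start(text, loc):
        return loc
    if loc < len(text) and text[loc] == "\n":
        return loc + 1  # implicit newline skip
    return None


if __name__ == "__main__":
    sample = "ab\ncd"
    for position in range(len(sample)):
        print(position, repr(sample[position]), match_line_start(sample, position))
```

The implicit skip in the second branch is the part that makes error locations confusing: the match can begin on one line and be reported against the next.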

For now, I've gotten your code to work by expanding your line-starting expression slightly, as:

LS = pp.Optional(pp.LineEnd()) + pp.LineStart()

and replaced the LineStart references in meta1 and meta2 with LS:

comment1 = pp.Literal("--") + pp.originalTextFor(pp.SkipTo(pp.LineEnd())).setDebug()
identifier = pp.Word(pp.alphanums + "_").setDebug()
meta1 = LS + identifier + pp.Literal(":") + pp.SkipTo(pp.LineEnd())
meta2 = LS + pp.White() + pp.SkipTo(pp.LineEnd())
metaval = meta1 + pp.ZeroOrMore(meta2)
metalist = pp.ZeroOrMore(comment1) + pp.Literal("/*") + pp.OneOrMore(metaval) + pp.Literal("*/")

If this situation with LineStart leaves you uncomfortable, here is another tactic you can try: using a parse-time condition to only accept identifiers that start in column 1:

comment1 = pp.Literal("--") + pp.originalTextFor(pp.SkipTo(pp.LineEnd())).setDebug()

identifier = pp.Word(pp.alphanums + "_").setName("identifier")
identifier.addCondition(lambda instring,loc,toks: pp.col(loc,instring) == 1)

meta1 = identifier + pp.Literal(":") + pp.SkipTo(pp.LineEnd()).setDebug()
meta2 = pp.White().setDebug() + pp.SkipTo(pp.LineEnd()).setDebug()
metaval = meta1 + pp.ZeroOrMore(meta2, stopOn=pp.Literal('*/'))
metalist = pp.ZeroOrMore(comment1) + pp.Literal("/*") + pp.LineEnd() + pp.OneOrMore(metaval) + pp.Literal("*/")

This code does away with LineStart completely, while I figure out just what I want this particular token to do. I also had to modify the ZeroOrMore repetition in metaval so that */ would not be accidentally processed as continued comment content.
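The column-1 condition depends on turning a parse location into a 1-based column number. As a rough sketch of that calculation using only the standard library (column_of and starts_in_column_one are hypothetical names for this example, and the result is assumed to agree with what pp.col(loc, instring) reports for ordinary locations):

```python
def column_of(text, loc):
    """1-based column of offset loc, counted from the preceding newline."""
    # rfind returns -1 when there is no earlier newline, so the first
    # character of the text also lands in column 1.
    return loc - text.rfind("\n", 0, loc)


def starts_in_column_one(text, loc):
    """The same test the parse-time condition applies to each identifier."""
    return column_of(text, loc) == 1


if __name__ == "__main__":
    sample = "/*\nname1: val\n  cont"
    print(starts_in_column_one(sample, sample.index("name1")))
    print(starts_in_column_one(sample, sample.index("cont")))
```

A key such as name1 passes the check, while an indented continuation line does not, which is what lets the grammar tell the two kinds of line apart without LineStart.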

Thanks for your patience with this - I am not keen to rush out a patched LineStart and then find that I have overlooked compatibility issues or edge cases that put me right back in the current less-than-great state of this class. But I'll put some effort into clarifying this behavior before putting out 2.1.10.
