I'm trying to parse PDFs with PLY lex/yacc, and I've hit a snag with the yacc parsing rules that govern NUMBER tokens.
def p_value_list(t):
    r'''value_list : value_list value
                   | value'''
    if len(t) == 3:
        t[0] = {'type': 'value_list', 'children': t[1]['children'] + [t[2]]}
    else:
        t[0] = {'type': 'value_list', 'children': [t[1]]}
def p_value(t):
    r'''value : dictionary
              | array
              | indirect_reference
              | NUMBER
              | HEX
              | STREAM
              | TEXT
              | BOOL
              | empty'''
    t[0] = t[1]
def p_indirect_reference(t):
    r'''indirect_reference : NUMBER NUMBER KEY_R'''
    t[0] = {'type': 'indirect_reference', 'children': [t[1], t[2]]}
def p_array(t):
    r'''array : LBRACKET value_list RBRACKET'''
    t[0] = {'type': 'array', 'children': t[2]}
The trouble shows up in arrays of NUMBERS. Once the parser has seen NUMBER NUMBER, it seems to commit to an indirect_reference (NUMBER NUMBER KEY_R) and errors out if the next token is another NUMBER rather than KEY_R. For example, [0 0 0] produces:

ERROR: Error : obj_list NUMBER NUMBER OBJ LTLT key_value_list ID LBRACKET NUMBER NUMBER . LexToken(NUMBER,'0',1,166273)

The same thing happens when the array ends right after two NUMBERs, so that the parser sees RBRACKET where it expects KEY_R, e.g. for [ 486 173]:

ERROR: Error : obj_list NUMBER NUMBER OBJ LTLT key_value_list ID LBRACKET NUMBER NUMBER . LexToken(RBRACKET,']',1,88)
It's not really an ambiguity; the grammar is perfectly unambiguous. However, it is not LR(1), and an LR(1) parser cannot figure out whether to shift or reduce an integer if it is followed by another integer. (The grammar is LR(2), since the second-next token is sufficient to decide: if it is R, the integer is the first token in an indirect reference and should be shifted.)
Fixing the problem is a lot trickier. In theory, you can convert any LR(2) grammar to an LR(1) grammar, but the transformation is cumbersome and I don't know of any automated tool which does it. The basic idea is to extend productions so as to avoid the need for a reduction until enough context has been encountered, and then fix the parse tree as necessary.
(For some other possible solutions to this problem, see below. The last option presented is my personal favourite.)
Here's an illustration of the LR(2)→LR(1) technique; you can see how NUMBER tokens are simply accrued until their disposition is known:
value_not_number
        : dictionary
        | array
        | HEX
        | STREAM
        | TEXT
        | BOOL
        | empty
value_list
        : value_list_ends_non_number
        | value_list_ends_one_number
        | value_list_ends_two_numbers
        | value_list_ends_indref
value_list_ends_one_number
        : value_list_ends_non_number NUMBER
value_list_ends_two_numbers
        : value_list_ends_non_number NUMBER NUMBER
        | value_list_ends_two_numbers NUMBER
value_list_ends_indref
        : value_list_ends_two_numbers 'R'
value_list_ends_non_number
        : value_list value_not_number
array   : '[' value_list ']'
Note that the parse tree generated by the grammar is not quite accurate, since it will parse 0 0 R as a value_list_ends_two_numbers followed by an R. In order to retrieve the real parse tree, the reduction action for value_list_ends_indref needs to steal the last two numbers from its first child, use them to manufacture an indirect reference object, and then add that to the end of the first child. So the PLY grammar with actions might look like this:
def p_unit_productions(t):
    r'''value_not_number : dictionary
                         | array
                         | HEX
                         | STREAM
                         | TEXT
                         | BOOL
       value_list : value_list_ends_non_number
                  | value_list_ends_one_number
                  | value_list_ends_two_numbers
                  | value_list_ends_indref'''
    t[0] = t[1]

def p_value_list_empty(t):
    r'''value_list : '''
    t[0] = []

def p_value_list_push(t):
    r'''value_list_ends_one_number : value_list_ends_non_number NUMBER
       value_list_ends_two_numbers : value_list_ends_two_numbers NUMBER
       value_list_ends_non_number : value_list value_not_number'''
    t[0] = t[1]
    t[0].append(t[2])

def p_value_list_push2(t):
    r'''value_list_ends_two_numbers : value_list_ends_non_number NUMBER NUMBER'''
    t[0] = t[1]
    t[0].append(t[2])
    t[0].append(t[3])

def p_indirect_reference(t):
    r'''value_list_ends_indref : value_list_ends_two_numbers 'R' '''
    t[0] = t[1]
    gen = t[0].pop()
    obj = t[0].pop()
    t[0].append({'type': 'indirect_reference',
                 'children': [obj, gen]})

def p_array(t):
    r'''array : '[' value_list ']' '''
    t[0] = {'type': 'array', 'children': t[2]}
That's not quite the same as yours: I dropped the value_list object type (which seemed to me unnecessary), and I dropped value: empty, on the assumption that your empty is an empty nonterminal, as per the PLY manual; that would lead to a real ambiguity, since you can't know how many empty strings are found between two nonterminals. (I haven't tested that PLY fragment; let me know if there are problems. I could easily have made some typos.)
As suggested in a comment, one alternative is to use the lexical scanner to recognize the entire sequence which creates the need for extended lookahead (in this case, an indirect reference).

In effect, this uses the lexical scanner to implement a fallback solution. This is a common solution when there are only a few LR(2) states in the grammar. For example, bison itself uses this technique to distinguish between the left-hand side of a production and the use of a symbol on the right-hand side. (You can tell the difference when you see, or fail to see, the colon after the symbol, but the colon is not yet visible when the symbol itself is the first lookahead.)
In some cases that may seem like a simpler solution, but in other cases it results in simply exporting the complexity from the parser to the scanner. In this particular case, for example, you probably use the longest-match lexical disambiguation algorithm to distinguish between an R and an identifier token which starts with R, and you probably have a rule which ignores whitespace. Neither of those will help you with the rule to recognize indirect references: you need to explicitly match the whitespace, and you need to explicitly verify that whatever follows the R cannot form a longer identifier. That's not horribly complicated, but neither is it trivial; you'll need to do some work to create appropriate test coverage.
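To make that concrete, here is a minimal sketch of the kind of pattern such a scanner rule would need. This is my own illustration, not part of the grammar above; the name INDIRECT_REF and the exact identifier alphabet in the lookahead are assumptions:

```python
import re

# Hypothetical pattern for a complete indirect reference: two integers
# with explicit intervening whitespace, then an 'R' that is not followed
# by further identifier characters (so 'Rect' is not misread as 'R').
INDIRECT_REF = re.compile(r'(\d+)\s+(\d+)\s+R(?![0-9A-Za-z_])')

m = INDIRECT_REF.match('12 0 R>>')
print(m.group(1), m.group(2))                    # the object and generation numbers
print(INDIRECT_REF.match('12 0 Rect') is None)   # 'Rect' must not match
```

In PLY, the same regex would go in the docstring of a t_INDIRECT_REF token function defined before the plain NUMBER rule (PLY tries function rules in order of definition), and its action would build the indirect-reference node directly.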
As the number of possible extended-lookahead states increases, you'll find this solution increasingly complicated. (Not that the parser solution outlined above is great, either.)

Since the grammar itself is not ambiguous, you could use a GLR parser instead of an LR(1) parser, if you had a parser generator which could produce GLR parsers. The GLR algorithm can parse any context-free grammar, but at a cost: it is slightly slower in common cases and much slower in edge cases (for some grammars, it could require quadratic or even cubic time), and typical implementations delay semantic actions until the ambiguity is resolved, which means that they don't necessarily execute in the expected order.
In this case, there is a much stronger disincentive for this solution: PLY doesn't produce GLR parsers. However, bison does have a GLR option, and it is relatively simple to wrap a bison parser into a Python module (which will quite possibly be faster than the PLY parser, if that matters).
PDF is derived from PostScript, which is a stack language. Stack languages generally don't require parsers; they have just a lexical scanner and a stack. Each token has an associated action; for many tokens (literals, for example), the action is simply to push the token onto the stack. Other tokens may have more complicated actions.

The ] token, for example, creates an array by popping the stack until it finds a [, placing the popped elements into a newly created array object, which it then pushes onto the stack. Similarly, the R token's action would be to pop two things off the top of the stack; check that they are numbers; and then construct an indirect reference object using the two numbers, which it would then push onto the stack.
PLY won't help you much in the construction of a PDF stack parser, but it will at least build the lexer for you, which is actually the most complicated part of the parser. There are not many stack operations required, and none of them are particularly challenging.
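As a rough illustration of the stack approach (this is my sketch, not production code; the token representation and error handling are assumed), the whole "parser" can be a single loop over the tokens the lexer produces:

```python
# Toy sketch of a PDF-style stack interpreter. Tokens are assumed to
# arrive already classified by the lexer; here plain Python values stand
# in for them, with the strings '[', ']' and 'R' as the operator tokens.
def interpret(tokens):
    stack = []
    for tok in tokens:
        if tok == ']':
            # Pop back to the matching '[' and build an array node.
            items = []
            while stack[-1] != '[':
                items.append(stack.pop())
            stack.pop()          # discard the '['
            items.reverse()
            stack.append({'type': 'array', 'children': items})
        elif tok == 'R':
            gen = stack.pop()    # generation number
            obj = stack.pop()    # object number
            stack.append({'type': 'indirect_reference',
                          'children': [obj, gen]})
        else:
            stack.append(tok)    # literals are simply pushed
    return stack

# e.g. the input '[486 173] 12 0 R' after tokenization:
result = interpret(['[', 486, 173, ']', 12, 0, 'R'])
```

After the loop, the stack holds the finished objects, here an array node followed by an indirect-reference node; no lookahead question ever arises, because R just operates on whatever the two preceding values were.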