warl0ck - 1 month ago
Python Question

antlr4 + python: debug token match

I'm using ANTLR4 with the Python target to match a phrase like this:

select 1 from dual where id=.0union select 1


The tokens are:

['select', '1', 'from', 'dual', 'where', 'id', '=', '.0union', 'select', '1']


My problem is that the .0 and union tokens have been merged into a single token, .0union, and ANTLR reports an error like this:

line 1:32 mismatched input '=' expecting {<EOF>, '&&', <INVALID>, ';', <INVALID>, <INVALID>, <INVALID>, <INVALID>, <INVALID>, <INVALID>, <INVALID>, <INVALID>, <INVALID>, <INVALID>, <INVALID>, <INVALID>}


Any ideas on debugging it?

Is there any way to debug the token match process?

Answer

As we found out in a private discussion, this problem has to do with how the dot-identifier rule is defined in the grammar. There is a conflict between input like .0union and .union. The first should be treated as a decimal number followed by a keyword, while the second should be taken as a whole and tagged as a dot-identifier. Because the lexer prefers the longest match, a dot-identifier rule that accepts a digit right after the dot consumes .0union as a single token. So the solution is to not allow a digit directly after the dot in a dot-identifier (such input should always resolve to a decimal):

FLOAT_NUMBER: DECIMAL_NUMBER [eE] (MINUS_OPERATOR | PLUS_OPERATOR)? DIGITS;
DECIMAL_NUMBER: DIGITS? DOT_SYMBOL DIGITS;

// Special rule that should also match all keywords if they are directly preceded by a dot.
// Hence it's defined before all keywords.
DOT_IDENTIFIER: DOT_SYMBOL LETTER_WHEN_UNQUOTED_NO_DIGIT LETTER_WHEN_UNQUOTED*;

fragment LETTER_WHEN_UNQUOTED:
    DIGIT
    | LETTER_WHEN_UNQUOTED_NO_DIGIT
;

fragment LETTER_WHEN_UNQUOTED_NO_DIGIT:
    [a-zA-Z_$\u0080-\uffff]
;
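
To see why the change helps, here is a rough Python sketch that imitates the lexer's longest-match behaviour with plain regular expressions. It is not ANTLR and not the real grammar: the rule set is cut down to a handful of patterns, WORD, INT and SYMBOL are hypothetical stand-ins for the keyword, identifier and operator rules, and the "old" DOT_IDENTIFIER pattern (digits allowed right after the dot) is an assumption about what the rule looked like before the fix.

```python
import re

# Simplified stand-ins for the lexer rules above. OLD_DOT_IDENTIFIER is an
# assumed reconstruction of the rule before the fix; it is not shown in the answer.
DECIMAL_NUMBER = r"[0-9]*\.[0-9]+"
OLD_DOT_IDENTIFIER = r"\.[0-9a-zA-Z_$\u0080-\uffff]+"
NEW_DOT_IDENTIFIER = r"\.[a-zA-Z_$\u0080-\uffff][0-9a-zA-Z_$\u0080-\uffff]*"


def rules(dot_identifier):
    """Build an ordered rule list; WORD, INT and SYMBOL are hypothetical helpers."""
    return [
        ("DECIMAL_NUMBER", DECIMAL_NUMBER),
        ("DOT_IDENTIFIER", dot_identifier),
        ("WORD", r"[a-zA-Z_$][0-9a-zA-Z_$]*"),  # keywords and identifiers lumped together
        ("INT", r"[0-9]+"),
        ("SYMBOL", r"="),
    ]


def tokenize(text, rule_list):
    """Longest match wins; on a tie the rule listed first wins."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rule_list]
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_name, best_text = None, ""
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Strictly longer only, so earlier rules keep priority on ties.
            if match and len(match.group()) > len(best_text):
                best_name, best_text = name, match.group()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}: {text[pos:]!r}")
        tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens


if __name__ == "__main__":
    phrase = "id=.0union select 1"
    print(tokenize(phrase, rules(OLD_DOT_IDENTIFIER)))
    print(tokenize(phrase, rules(NEW_DOT_IDENTIFIER)))
```

With the assumed old pattern, .0union comes out as a single DOT_IDENTIFIER because that match is longer than the decimal .0. With the new pattern, the dot-identifier cannot start with a digit, so the decimal rule takes .0 and union is left to be matched on its own, while input like .union is still tagged as a dot-identifier.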