Ronn - 28 days ago
Java Question

While tokenizing the string 40 println "Hello ",(5+6-4), "-4" comes out as a single token instead of two separate ones

I am writing a lexer in Java for a custom base language. For the following line:
40 println "Hello ",(5+6-4)
I want the output to be:

40
println
"Hello "
,
(
5
+
6
-
4
)


Everything else comes out right, but for some reason I am getting - and 4 together, as the single token "-4".

Regex used:

For Numbers
-?[0-9]+


Special operator / Characters:
[\\[|\\]|/|.|$|*|-|+|=|>|<|#|(|)|%|,|!|||&|||{|}]


When I use the number regex without the leading "-", I get an error at index 89, which is the start of ?[0-9]+:

Exception in thread "main" java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 89
((?<Reserved>\bPRINTLN\b|\bPRINT\b|\bINTEGER\b|\bINPUT\b|\bEND\b|\bLET\b))|((?<Constants>?[0-9]+))|((?<Special>[\[|\]|/|.|$|*|-|+|=|>|<|#|(|)|%|,|!|||&|||{|}]))|((?<Literals>"[^"]*"))|((?<Identifiers>\w+))


I am storing the regex in a string and using named capturing groups to identify the tokens.

Answer

The problem is this part of your regex: (?<Constants>?[0-9]+). The ? right after the capture group name is dangling: once the - is removed, there is nothing left in front of it for the quantifier to apply to.
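
To see this in isolation, here is a small sketch that asks Pattern whether each variant of the Constants group compiles. The class name DanglingQuantifierDemo and the helper compiles are made up for illustration; only Pattern and PatternSyntaxException come from java.util.regex.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class DanglingQuantifierDemo {
    // Returns true if the regex compiles, false if Pattern rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            System.out.println("Rejected: " + e.getDescription());
            return false;
        }
    }

    public static void main(String[] args) {
        // '?' directly after the group name has nothing to quantify.
        System.out.println(compiles("(?<Constants>?[0-9]+)"));
        // Here '?' applies to '-', so the pattern is valid (but swallows the minus sign).
        System.out.println(compiles("(?<Constants>-?[0-9]+)"));
        // No sign at all: the minus sign is left for another alternative to match.
        System.out.println(compiles("(?<Constants>[0-9]+)"));
    }
}
```

Only the first variant should be rejected; the other two compile, and the last one is the form used in the fix.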

Also, there is no need to separate the members of a character class with |. Inside [...], | is just an ordinary character, not an alternation operator.
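
A quick way to convince yourself of this is to check what a class with and without | actually accepts. This is only a sketch; the class name CharClassPipeDemo and the helper matches are invented for the example.

```java
import java.util.regex.Pattern;

class CharClassPipeDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // '|' inside the brackets is a literal member, so "|" itself matches.
        System.out.println(matches("[+|=]", "|"));  // true
        // Without it, the class still matches '+' and '=' but no longer '|'.
        System.out.println(matches("[+=]", "|"));   // false
        System.out.println(matches("[+=]", "="));   // true
    }
}
```

So keep a single | in the class only if | is itself one of the characters you want to tokenize.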

Based on the error you shared, the following would be what you want:

    // Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
    String regex = "((?<Reserved>\\bPRINTLN\\b|\\bPRINT\\b|\\bINTEGER\\b|\\bINPUT\\b|\\bEND\\b|\\bLET\\b))|((?<Constants>[0-9]+))|((?<Special>[\\[\\]/.$*\\-+=><#()%,!|&{}]))|((?<Literals>\"[^\"]*\"))|((?<Identifiers>\\w+))";
    String s = "40 println \"Hello \",(5+6-4) ";
    Matcher matcher = Pattern.compile(regex).matcher(s);
    // Print each token on its own line, in the order it appears in s.
    while (matcher.find()) {
        System.out.println(matcher.group());
    }
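
Since you mention using the named groups to identify the tokens, here is a sketch of how the same loop could report which group matched. The class name NamedGroupLexer, the KINDS array and the "Kind:text" output format are my own choices for illustration, and I have dropped the redundant outer parentheses from the pattern; the group names are the ones from your regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupLexer {
    // Group names, in the same order as the alternatives in the pattern.
    static final String[] KINDS = {"Reserved", "Constants", "Special", "Literals", "Identifiers"};

    static final Pattern TOKEN = Pattern.compile(
        "(?<Reserved>\\b(?:PRINTLN|PRINT|INTEGER|INPUT|END|LET)\\b)"
        + "|(?<Constants>[0-9]+)"
        + "|(?<Special>[\\[\\]/.$*\\-+=><#()%,!|&{}])"
        + "|(?<Literals>\"[^\"]*\")"
        + "|(?<Identifiers>\\w+)");

    // Returns one "Kind:text" entry per token found in the line.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            for (String kind : KINDS) {
                // Only the alternative that actually matched has a non-null group.
                if (m.group(kind) != null) {
                    tokens.add(kind + ":" + m.group(kind));
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        tokenize("40 println \"Hello \",(5+6-4)").forEach(System.out::println);
    }
}
```

Note that the lowercase println is reported as Identifiers here, because the reserved words in the pattern are uppercase. If your language is meant to be case-insensitive, compiling with Pattern.CASE_INSENSITIVE is one option.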

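I removed the dangling ? mentioned above, removed the | separators inside the character class, and escaped the - inside the character class (alternatively, you can move the - to the end of the class). Because the number pattern no longer accepts a leading -, the - in 6-4 is now matched by the Special group and comes out as its own token.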
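
On the - inside the character class: as far as I can tell, in your original class the sequence |-| is read as a range from | to |, so a bare - was never a member of the class at all. That would explain why - only ever showed up glued to the number. A minimal sketch comparing the three spellings (the class name CharClassHyphenDemo and the helper matches are made up for the example):

```java
import java.util.regex.Pattern;

class CharClassHyphenDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unescaped '-' between two '|' characters forms the range '|' to '|'.
        System.out.println(matches("[*|-|+]", "-"));  // false
        // An escaped '-' is a literal member of the class.
        System.out.println(matches("[*\\-+]", "-"));  // true
        // A '-' placed last in the class is also taken literally.
        System.out.println(matches("[*+-]", "-"));    // true
    }
}
```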