Andreas W. Wylach - 4 years ago
C++ Question

Flex lexer rule with positive lookahead assertion on alphanumeric strings containing hyphens and slashes

I am having trouble building a flex lexer rule with a positive lookahead assertion for a certain type of token and could use some help. I am sure I am missing something simple here.

The token strings I want to match look like this:

33-abc-13/12
99-ab-33
o3sehh04/00
glu6-840d/00
vm-22hd
xyz-3


The token to match is a string containing letters and digits, with slashes and/or hyphens and, in rare cases, a dot, possibly something like xx-3006/10.00


What must not be matched (because other rules cover these cases) are tokens such as:

numeric370
hyphen-term
plainterm
00/40


What I tried so far is this rule with a lookahead:

([a-z0-9/-]*)/[-/]+[0-9/-]+


With the above rule I get results that come close to what I would like to achieve. It matches all of the strings listed above, but the last character or digit is dropped. The matched tokens look like:

33-abc-13/1
o3sehh04/0
...


Unfortunately, the rule also matches 00/40 (resulting in 00/4).

So my question is: what am I missing here? It would be nice to cover these cases with one rule, if that is possible and fast enough. I am aware that the order of the rules in the lexer script matters, so this rule would be one of the first in the entire set.
If a single rule is not possible, breaking it down into several rules would be another way to go.

For this project I use the RE-flex package (https://github.com/Genivia/RE-flex) because it covers the flex lexer interface and provides Unicode support (I need to work with wchar_t character strings).
My lexer is a whitespace tokenizer with token classification. It was originally built on the flex 2.5 package a few years back. I've refactored a few things in the token processing and moved to RE-flex because it gives me more options. The tokenizer input strings are short, simple text snippets that do not exceed a length of, let's say, 250-300 characters. So much for the background.

NOTE: I use regex101.com to check and experiment when building rules before I translate them for the lexer. It helps a little to get going in the right direction, but that's all.

Any help is greatly appreciated, thanks for your efforts in advance!

Update:
Based on rici's answer the final pattern now looks like this:

[a-z0-9/.-]*[/.-][0-9/-]+


This now also covers tokens containing a dot (.), for example:

xx33-4.00
f/44-7.87
...


As for the sentence separator problem mentioned in my comment below: the cause was simply a . in the last character group of the pattern. I removed it and now it works as expected.
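A quick way to sanity-check the final pattern outside the lexer is a small C++ helper built on std::regex. This is only a sketch under assumptions: the names matched_prefix and check_final_pattern are made up for illustration, std::regex stands in for the RE-flex matcher (which may behave differently), and match_continuous is used to imitate a lexer that matches at the current input position.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the part of `token` that the final pattern matches when anchored
// at the start of the token, or an empty string if there is no match.
// (Hypothetical helper; std::regex is a stand-in for the real lexer.)
std::string matched_prefix(const std::string& token) {
    static const std::regex pattern("[a-z0-9/.-]*[/.-][0-9/-]+");
    std::smatch m;
    if (std::regex_search(token, m, pattern,
                          std::regex_constants::match_continuous)) {
        return m.str(0);
    }
    return "";
}

// Hypothetical self-check using tokens taken from the question.
void check_final_pattern() {
    assert(matched_prefix("33-abc-13/12") == "33-abc-13/12");
    assert(matched_prefix("xx33-4.00") == "xx33-4.00");
    assert(matched_prefix("plainterm").empty());
    assert(matched_prefix("hyphen-term").empty());
    // 00/40 still matches this pattern; it has to be claimed by another rule.
    assert(matched_prefix("00/40") == "00/40");
}
```

One thing this kind of check makes visible: because the pattern has to end in digits, slashes or hyphens, a token that ends in letters, such as vm-22hd, is only matched up to vm-22. Whether that matters depends on the other rules in the lexer.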

Answer Source

I don't know anything about RE-flex (although it looks cool) but assuming it really is compatible with flex, the same approach should work: forget about forward lookahead assertions (since the string matched will not include the lookahead pattern, and you want to match the whole string) and put the rule after all the other rules which might match the same thing.

The flex rule is:

  • the pattern which has the longest match wins, but
  • if two or more patterns both match the longest match, the first pattern in the file wins.

So, for example, say you have the patterns:

[0-9]+("/"[0-9]+)*          { return SLASHED_NUMBERS; }
[a-z0-9/-]*[/-][0-9/-]+     { return GENERAL_TOKEN;   }

[Note 1]

Both of those will match 00/40, so if that is the token at the input point, that token will be detected as SLASHED_NUMBERS (the first rule in the file). On the other hand, if you have 00/49-23, it will be detected as GENERAL_TOKEN because that rule matched more characters.
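To make the two-part rule concrete, here is a rough C++ emulation of it using std::regex. It is a sketch, not how flex or RE-flex work internally: the Rule struct and the classify function are invented for illustration, the flex quoting "/" is written as a plain / because std::regex has no such quoting, and std::regex is a backtracking matcher rather than a longest-match engine (for these two patterns the greedy match should coincide with the longest one).

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// One lexer rule: a token name plus its pattern (hypothetical type).
struct Rule {
    std::string name;
    std::regex pattern;
};

// Picks the rule with the longest match at the start of `input`.
// On a tie the earlier rule wins, because a later rule only replaces
// the current winner when its match is strictly longer.
std::string classify(const std::string& input) {
    static const std::vector<Rule> rules = {
        {"SLASHED_NUMBERS", std::regex("[0-9]+(/[0-9]+)*")},
        {"GENERAL_TOKEN",   std::regex("[a-z0-9/-]*[/-][0-9/-]+")},
    };
    std::string winner;
    std::size_t best_len = 0;
    for (const Rule& rule : rules) {
        std::smatch m;
        if (std::regex_search(input, m, rule.pattern,
                              std::regex_constants::match_continuous)) {
            const std::size_t len = static_cast<std::size_t>(m.length(0));
            if (len > best_len) {
                best_len = len;
                winner = rule.name;
            }
        }
    }
    return winner;  // empty if no rule matched
}

// The two inputs discussed above.
void check_rule_precedence() {
    assert(classify("00/40") == "SLASHED_NUMBERS");   // tie: first rule wins
    assert(classify("00/49-23") == "GENERAL_TOKEN");  // longer match wins
}
```

Swapping the two entries in the rules vector would make 00/40 come out as GENERAL_TOKEN, which is why the position of the general rule in the file matters.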


Notes

  1. I based that on your regex. I didn't understand "a rare cases a dot" and it doesn't seem to be reflected in your pattern; furthermore, your pattern seems to be more specific than just "letters, numbers, hyphens and slashes", but I'm not sure exactly what the specifics are.