user6513847 - 1 year ago
Python Question

Why does lazy regex capture extra words?

I am using the following lazy regex to find the word before and after "=". I am not sure why it is capturing extra words.


The text is in format

my name = jil
part = #2

So I want to capture name = jil

Am I doing something wrong here, or can I do it in a different manner?

Note: special characters can appear before and after "=".

Answer Source

You're looking for: (\S+)\s*\=\s*(\S+)

\S matches non-whitespace, and will allow for ./\#@&, etc in the capture group.

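\w, by contrast, matches only word characters (letters, digits, and underscore), so it would not capture something like #2. The pattern above matches the last word before the equals sign and the first word after it, and because it uses \s* rather than \s+ it works with or without whitespace around the =.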
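As a quick check, here is a minimal sketch that runs the suggested pattern over the two sample lines from the question. The variable names are only illustrative, and the two lines are assumed to arrive as one newline-separated string.

```python
import re

# Pattern from the answer; the escaped \= behaves the same as a plain =.
PATTERN = re.compile(r"(\S+)\s*\=\s*(\S+)")

# Sample text from the question, assumed to be one newline-separated string.
text = "my name = jil\npart = #2"

# findall returns one (before, after) tuple per match.
pairs = PATTERN.findall(text)
print(pairs)  # [('name', 'jil'), ('part', '#2')]

# \s* also allows no whitespace at all around the equals sign.
tight = PATTERN.findall("key=value")
print(tight)  # [('key', 'value')]
```

Because \S+ cannot cross whitespace, only the token directly next to the = ends up in each group, no matter how many words come before it on the line.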

Your pattern doesn't work because the regex engine scans from left to right: as soon as it finds any amount of whitespace (\s+), the .*? starts consuming characters until it reaches a " =". So it matches everything between the first whitespace on the line and the " =".

Lazy evaluation doesn't go back to find the smallest set it can; it just goes until it reaches the first complete match and stops:

dog dog dog dog = cat cat cat cat

A lazy capture of \s+(.*?)\s+= gives us dog dog dog, because that's the first match it finds: starting from the " " after the first dog and ending at the first " =". The second group does what you expect, because it doesn't have the extra requirement that it ends on a space with an equals sign.
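The dog/cat example can be reproduced with Python's re module. This is a small sketch: the first pattern is the one quoted above, and the second is just the left half of the suggested (\S+)\s*\=\s*(\S+) for comparison.

```python
import re

line = "dog dog dog dog = cat cat cat cat"

# The match starts at the first whitespace it can, then .*? grows one
# character at a time until \s+= can match.
lazy = re.search(r"\s+(.*?)\s+=", line)
print(lazy.group(1))  # dog dog dog

# \S+ cannot cross whitespace, so only the word next to the = is captured.
last_word = re.search(r"(\S+)\s*=", line)
print(last_word.group(1))  # dog
```

The lazy quantifier only controls where the match ends, not where it starts, which is why the extra words on the left get captured.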

After the =, the lazy quantifier limits the capture to only the first word, as that is the first point at which it gets a match. A greedy version would keep consuming characters and find the longest string that is followed by \s+.
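To see the lazy/greedy difference on the right-hand side, here is a hedged sketch. The two patterns are assumptions made up for illustration (the question's original regex is not shown); both require trailing whitespace after the capture.

```python
import re

line = "dog dog dog dog = cat cat cat cat"

# Lazy: stops at the first point where \s+ can match after the capture.
lazy_after = re.search(r"=\s+(.*?)\s+", line).group(1)
print(lazy_after)  # cat

# Greedy: grabs the rest of the line, then backtracks only as far as
# needed for \s+ to match, giving the longest capture followed by whitespace.
greedy_after = re.search(r"=\s+(.*)\s+", line).group(1)
print(greedy_after)  # cat cat cat
```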

tl;dr: lazy evaluation won't go back to find the smallest match; it grabs the first match it finds when scanning from left to right. d+?og will match ddddddog in its entirety: the match has to start at the first d, so the engine has to gobble all the other ds to reach the og, and it's too lazy to go back and see whether it really needed to eat all those extra characters.
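The d+?og claim is easy to verify in Python. A minimal sketch, with a bare d+? added as a contrast to show that the quantifier really is lazy when nothing has to follow it:

```python
import re

s = "ddddddog"

# The match is anchored at the leftmost possible start (index 0), so the
# lazy d+? is forced to expand across every d before "og" can match.
full = re.search(r"d+?og", s)
print(full.group(), full.start())  # ddddddog 0

# With nothing required after it, the same lazy quantifier takes one d.
single = re.search(r"d+?", s)
print(single.group())  # d
```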