user6513847 - 5 months ago
Python Question

Why does regex capture extra words after the "=" sign?

I am using the following regex to find the word before and after "=". I am not sure why it is capturing extra words.

r'\s+(.*?)\s+\=\s+(.*?)\s+'


The text is in format

my name = jil
part = #2


So I want to capture name = jil

Am I doing something wrong here, or can I do it in a different manner?

Note: Before and after "=" we can have special characters.

Answer

You're looking for: (\S+)\s*\=\s*(\S+)

\S matches non-whitespace, and will allow for ./\#@&, etc in the capture group.

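\w, by contrast, matches only word characters (letters, digits and underscore), so it would miss a value like #2. The pattern above matches the last run of non-whitespace before an equals sign and the first run after it, with or without whitespace around the =, because it uses \s* rather than \s+.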
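Here is a quick sketch to check this against the sample text from the question, assuming Python's re module. The variable names are just for illustration, and the \w variant is included only to show what it misses:

```python
import re

# Sample text from the question.
text = "my name = jil\npart = #2"

# \S+ takes the last run of non-whitespace before "=" and the first after it.
pairs = re.findall(r'(\S+)\s*=\s*(\S+)', text)
print(pairs)      # [('name', 'jil'), ('part', '#2')]

# Same pattern with \w+ instead: "#2" is not made of word characters,
# so the second line produces no match at all.
word_only = re.findall(r'(\w+)\s*=\s*(\w+)', text)
print(word_only)  # [('name', 'jil')]
```

Note that the backslash before = is not needed; = has no special meaning in a regex, so \= and = behave the same.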

It doesn't work because the regex engine works left to right: once it finds a run of whitespace (\s+), it starts taking in characters with .*? and keeps going until it reaches a " =". So the first group captures everything between the first whitespace on the line and the " =", not just the last word.

Lazy matching doesn't go back to find the shortest possible capture; it starts at the earliest position that can match and stops at the first complete match:

dog dog dog dog = cat cat cat cat

A lazy capture with \s+(.*?)\s+= gives us dog dog dog, because that's the first match it finds: it starts at the " " after the first dog and ends at the first " =". The second group does what you expect, because it doesn't have the extra requirement of ending on a space followed by an equals sign.
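You can see this for yourself with a small sketch, assuming Python's re module (the names line and m are only for illustration):

```python
import re

line = "dog dog dog dog = cat cat cat cat"

# Leftmost match wins: the leading \s+ locks onto the first space in the line,
# and the lazy group then has to grow until " =" can follow it.
m = re.search(r'\s+(.*?)\s+=', line)
print(m.start())   # 3, the space right after the first "dog"
print(m.group(1))  # dog dog dog
```

The lazy quantifier only controls how far the group grows from a given starting point; it does not move the starting point to the right to make the capture shorter.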

After the =, the lazy quantifier limits the capture to the first word only, as that is the first point at which it gets a match. A greedy version would keep taking in characters and find the longest string that is followed by \s+.
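A short comparison of the two, again only a sketch assuming Python's re module and the same example line:

```python
import re

line = "dog dog dog dog = cat cat cat cat"

# Lazy: stop at the first point where whitespace can follow the capture.
lazy = re.search(r'=\s+(.*?)\s+', line).group(1)
print(lazy)    # cat

# Greedy: take as much as possible, then back off until \s+ can still match.
greedy = re.search(r'=\s+(.*)\s+', line).group(1)
print(greedy)  # cat cat cat
```

The greedy capture stops one word short of the end here because the line has no trailing whitespace, so the last \s+ has to match the space before the final cat.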