David Faux - 3 months ago 10
Python Question

How do I separate words using regex in python while considering words with apostrophes?

I tried to find standalone m's with a Python regex by using word boundaries and findall. These m's should either have whitespace on both sides or begin/end the string:

import re

r = re.compile("\\bm\\b")
re.findall(r, someString)  # someString is the text being searched


However, this method also finds m's within words like "I'm", since an apostrophe is a non-word character and therefore creates a word boundary next to the m. How do I write a regex that doesn't treat apostrophes as word boundaries?

I've tried this:

r = re.compile("(\\sm\\s) | (^m) | (m$)")
re.findall(r, someString)


but that just doesn't match any m. Odd.

Answer

Use lookaround assertions:

>>> import re
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "I'm a boy")
[]
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "I m a boy")
['m']
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "mama")
['m']
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "pm")
['m']
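Note that the last two results come from the ^m and m$ branches, which match an m at the start or end of the string even when it is part of a longer word. If the goal is only an m that stands alone, one possible variant (a sketch, not part of the original answer) uses negative lookarounds on \S instead. The names loose and strict and the sample inputs are assumptions for illustration.

```python
import re

# The pattern shown above: whitespace on both sides, or start/end of string.
loose = re.compile(r"(?<=\s)m(?=\s)|^m|m$")

# Assumed stricter variant: no non-whitespace character directly before
# or after the m. (?<!\S) also succeeds at the start of the string and
# (?!\S) at the end, so no separate ^ and $ branches are needed.
strict = re.compile(r"(?<!\S)m(?!\S)")

for text in ["I'm a boy", "I m a boy", "mama", "pm", "m"]:
    print(repr(text), loose.findall(text), strict.findall(text))

# "I'm a boy"  []     []
# 'I m a boy'  ['m']  ['m']
# 'mama'       ['m']  []
# 'pm'         ['m']  []
# 'm'          ['m']  ['m']
```

Whether "mama" and "pm" should match depends on what you want; the stricter pattern follows the wording of the question (whitespace on both sides, or the string boundary).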

(?=...)

Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.

(?<=...)

Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in abcdef, ...

(from the Python re documentation, "Regular expression syntax")
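The two examples quoted from the documentation can be tried directly. This is a small sketch; the variable names and the "Isaac Newton" input are assumptions for illustration.

```python
import re

# Lookahead: 'Isaac ' matches only when 'Asimov' follows.
# 'Asimov' is checked but not included in the match.
ahead = re.search(r"Isaac (?=Asimov)", "Isaac Asimov")
print(repr(ahead.group()))  # 'Isaac '

# No match when something else follows.
no_match = re.search(r"Isaac (?=Asimov)", "Isaac Newton")
print(no_match)  # None

# Lookbehind: 'def' matches only when 'abc' comes right before it.
# 'abc' is checked but not included in the match.
behind = re.search(r"(?<=abc)def", "abcdef")
print(behind.group(), behind.start())  # def 3
```

Because neither assertion consumes text, re.findall returns just the m in the answer above, without the surrounding whitespace.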

BTW, if you use a raw string (r'this is a raw string'), you don't need to escape the backslash.

>>> r'\s' == '\\s'
True
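To make the raw-string point concrete, here is a small sketch using the patterns from the question. It also shows a likely reason the second attempt found nothing: without the re.VERBOSE flag, the spaces around | are literal characters that must appear in the input. The sample text and the names with_spaces and no_spaces are assumptions for illustration.

```python
import re

# Raw and escaped spellings produce the same pattern string.
assert r"\bm\b" == "\\bm\\b"

# Without the r prefix or the doubled backslash, "\b" is a backspace
# character, not a word-boundary assertion.
assert "\b" == "\x08"

text = "I m a boy"  # hypothetical sample input

# Second attempt from the question: the spaces around | are part of
# the pattern, so each branch also requires a literal space.
with_spaces = re.compile(r"(\sm\s) | (^m) | (m$)")
print(with_spaces.findall(text))  # []

# Same alternation with the spaces and groups removed.
no_spaces = re.compile(r"\sm\s|^m|m$")
print(no_spaces.findall(text))  # [' m ']
```

The second result includes the surrounding whitespace because \s consumes it, which is what the lookaround version avoids.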