plfrick plfrick - 1 year ago 73
Python Question

Regular expressions in python to match Twitter handles

I'm trying to use regular expressions to capture all Twitter handles within a tweet body. The challenge is that I'm trying to get handles that

  1. Contain a specific string

  2. Are of unknown length

  3. May be followed by either

    • punctuation

    • whitespace

    • or the end of string.

For example, for each of these strings, I've marked in brackets what I'd like to return.

"@handle what is your problem?" [RETURN '@handle']

"what is your problem @handle?" [RETURN '@handle']

"@123handle what is your problem @handle123?" [RETURN '@123handle', '@handle123']

This is what I have so far:

>>> import re
>>> re.findall(r'(@.*handle.*?)\W','hi @123handle, hello @handle123')
['@123handle']
# This misses the handles that are followed by end-of-string

I tried modifying it to include an alternation (|) that also allows the end of the string. Instead, it just returns the whole string.

>>> re.findall(r'(@.*handle.*?)(?=\W|$)','hi @123handle, hello @handle123')
['@123handle, hello @handle123']
# This looks like it is too greedy and ends up returning too much

How can I write an expression that will satisfy both conditions?

I've looked at a couple other places, but am still stuck.

Answer Source

It seems you are trying to match strings that start with @, followed by 0+ word chars, then handle, and then 0+ more word chars:

    @\w*handle\w*

or - to avoid matching @+word chars in emails:

    \B@\w*handle\w*

The \B non-word boundary requires a non-word char or the start of the string right before the @.
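To see what the \B guard changes, here is a small sketch comparing @\w*handle\w* with \B@\w*handle\w* on text that contains an email address. The sample string and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text: one email address and one real handle.
text = "mail me at joe@myhandle.com or ping @myhandle"

# Without \B, the "@myhandle" inside the email address is picked up too.
plain = re.findall(r'@\w*handle\w*', text)

# With \B, the @ must not be preceded by a word character,
# so the email address is skipped.
guarded = re.findall(r'\B@\w*handle\w*', text)

print(plain)    # ['@myhandle', '@myhandle']
print(guarded)  # ['@myhandle']
```

The "e" of "joe" is a word char, so there is a word boundary right before that @ and \B fails there.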

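Note that .* is a greedy dot pattern: it matches any characters other than a newline, as many as possible, which is why your pattern runs across the comma and the spaces. \w* also matches 0+ characters, as many as possible, but only word characters. In Python 2 that is the [a-zA-Z0-9_] set if the re.UNICODE flag is not used (and it is not used in your code); in Python 3, \w also matches Unicode letters and digits for str patterns unless re.ASCII is passed.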
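As a rough side-by-side of the two behaviours, the sketch below runs the lookahead pattern from the question and the \w* pattern on the same input string from the question. The variable names are only illustrative.

```python
import re

# Input string taken from the question.
s = 'hi @123handle, hello @handle123'

# Greedy .* can cross the comma and the spaces, so both handles
# end up inside a single match.
dot_version = re.findall(r'(@.*handle.*?)(?=\W|$)', s)

# \w* stops at the first non-word character, so each handle
# is matched on its own, including the one at end of string.
word_version = re.findall(r'@\w*handle\w*', s)

print(dot_version)   # ['@123handle, hello @handle123']
print(word_version)  # ['@123handle', '@handle123']
```

Because \w* cannot consume punctuation or whitespace, no lookahead or end-of-string alternative is needed to stop the match.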

Python demo:

import re
# Match @, optional word chars, the literal "handle", then optional word chars
p = re.compile(r'@\w*handle\w*')
test_str = "@handle what is your problem?\nwhat is your problem @handle?\n@123handle what is your problem @handle123?\n"
print(p.findall(test_str))
# => ['@handle', '@handle', '@123handle', '@handle123']