noahandthewhale - 1 month ago
Python Question

How to account for accent characters for regex in Python?

I currently use re.findall to find and extract the words that follow the '#' character (hashtags) in a string:

hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)


It searches str1 and finds all the hashtags. This works, but it doesn't account for accented characters such as áéíóúñü¿.

If one of these letters is in str1, the match stops at the letter before it. So, for example, #yogenfrüz would be saved as #yogenfr.
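
A minimal reproduction of the truncation, as a sketch. The sample text in str1 and the name ascii_hashtags are made up for illustration and are not from the original post:

```python
import re

# Hypothetical sample text, chosen only to show the truncation.
str1 = "Dessert at #yogenfrüz, then #mañana we rest"

# ASCII-only character class: matching stops at the first
# character outside A-Z, a-z, 0-9 and underscore.
ascii_hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)
print(ascii_hashtags)  # ['yogenfr', 'ma']
```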

I need to be able to account for all accented letters used in German, Dutch, French and Spanish, so that I can save hashtags like #yogenfrüz.


How can I go about doing this?

Answer

Try the following:

hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)
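
As a rough sketch of how this behaves, assuming Python 3 and a made-up sample string: with a str pattern, \w already matches Unicode word characters, so re.UNICODE is the default and passing it is harmless but redundant. On Python 2 the input would need to be a unicode object rather than a byte string for the flag to have the intended effect. Note that ¿ is punctuation, not a letter, so \w does not match it.

```python
import re

# Hypothetical sample text, not from the original question.
str1 = "Dessert at #yogenfrüz, then #mañana_2 we rest ¿#ok?"

# \w matches letters (including accented ones), digits and underscore.
hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)
print(hashtags)  # ['yogenfrüz', 'mañana_2', 'ok']

# On Python 3 the flag is implied for str patterns, so this is equivalent.
same_without_flag = re.findall(r'#(\w+)', str1)
```

If only specific languages should be accepted, \w may be broader than wanted, since it also matches letters from other scripts such as Greek or Cyrillic.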

Regex101 Demo

EDIT: Check the useful comment below from Martijn Pieters.

Comments