sun qingyao - 9 months ago
Python Question

Matching same characters in a row using regex

I want to match "three uppercase letters, one lowercase letter, and three uppercase letters" using a regular expression. What makes this difficult is that the uppercase letters within each run of three must all be the same letter. For example, I expect

, but not

Here is what I've tried:

print(re.findall("[A-Z]{3}[a-z][A-Z]{3}", l))

However, this is not what I want, because it matches
as well.

Answer

You can use capture groups and backreferences:

re.findall(r"(([A-Z])\2\2[a-z]([A-Z])\3\3)", string)

Note, however, that when the pattern contains groups, re.findall() returns the groups (as a tuple per match when there is more than one group) instead of the full matches. So to get the matched strings you need to enclose the whole pattern in parentheses and take the first group:

>>> import re
>>> s = "AAAbCCC AAAbCCD"
>>> [groups[0] for groups in re.findall(r"(([A-Z])\2\2[a-z]([A-Z])\3\3)", s)]
['AAAbCCC']
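
To make the group numbering concrete, here is a small sketch that prints what re.findall() returns for this pattern before the first element is picked out. The variable names (pattern, raw, whole) are only for illustration and are not part of the original answer.

```python
import re

s = "AAAbCCC AAAbCCD"
pattern = r"(([A-Z])\2\2[a-z]([A-Z])\3\3)"

# With several groups, findall() yields one tuple per match:
# group 1 is the whole match (outer parentheses),
# group 2 is the letter repeated before the lowercase letter,
# group 3 is the letter repeated after it.
raw = re.findall(pattern, s)
print(raw)    # [('AAAbCCC', 'A', 'C')]

# Keep only group 1, i.e. the full matched string.
whole = [groups[0] for groups in raw]
print(whole)  # ['AAAbCCC']
```

Because the outer parentheses open first, they become group 1, which is why the inner backreferences are \2 and \3 rather than \1 and \2.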

You can also use re.finditer(), which returns an iterator over the match objects:

>>> [match.group() for match in re.finditer(r"(([A-Z])\2\2[a-z]([A-Z])\3\3)", s)]
['AAAbCCC']
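
As a rough side-by-side, the sketch below runs the pattern from the question and a backreference pattern over the same text. It is one possible way to write it, not the only one: the names NAIVE, SAME_RUNS, find_runs and the sample string are made up for this example, and named groups are used in place of \2 and \3 purely for readability. Since match.group(0) always holds the full match, the outer parentheses are not needed with re.finditer().

```python
import re

# Pattern from the question: any three uppercase letters on each side.
NAIVE = re.compile(r"[A-Z]{3}[a-z][A-Z]{3}")

# Backreference version: each side must repeat one captured letter.
# (?P=left){2} means "the letter captured as 'left', two more times".
SAME_RUNS = re.compile(
    r"(?P<left>[A-Z])(?P=left){2}[a-z](?P<right>[A-Z])(?P=right){2}"
)

def find_runs(text, pattern=SAME_RUNS):
    """Return the full text of every non-overlapping match (hypothetical helper)."""
    return [m.group(0) for m in pattern.finditer(text)]

sample = "AAAbCCC AAAbCCD ABCdEFG"  # illustrative input
naive_hits = find_runs(sample, NAIVE)
strict_hits = find_runs(sample)

print(naive_hits)   # ['AAAbCCC', 'AAAbCCD', 'ABCdEFG']
print(strict_hits)  # ['AAAbCCC']
```

The naive pattern accepts all three words because [A-Z]{3} only counts letters, while the backreference pattern rejects any run whose letters differ.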