sun qingyao - 1 month ago
Python Question

Matching same characters in a row using regex

I want to match "three uppercase letters, one lowercase letter, and three uppercase letters" using a regular expression. What makes this difficult is that the adjacent uppercase letters must be the same. For example, I want to match AAAbCCC, but not AAAbCCD or ABAbCDC.

Here is what I've tried:

print(re.findall("[A-Z]{3}[a-z][A-Z]{3}", l))


However, this is not what I want, because it matches AAAbCCD and ABAbCDC as well.

Answer

You can use capture groups and backreferences:

re.findall(r"(([A-Z])\2\2[a-z]([A-Z])\3\3)", string)
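To see why the backreferences enforce "same letter", here is a minimal sketch that isolates one half of the pattern. The names triple and results are made up for illustration; since there is no outer group here, the letter is group 1 and the backreference is \1.

```python
import re

# ([A-Z]) captures one uppercase letter as group 1;
# \1\1 then requires that exact same letter twice more.
triple = re.compile(r"([A-Z])\1\1")

# fullmatch() succeeds only if the whole candidate fits the pattern.
results = {c: bool(triple.fullmatch(c)) for c in ["AAA", "AAB", "ABA", "CCC"]}

for candidate, ok in results.items():
    print(candidate, ok)
```

Only AAA and CCC pass, because [A-Z]{3} accepts any three uppercase letters while the backreference accepts only a repeat of the captured one.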

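Note, however, that when the pattern contains capture groups, re.findall() returns the groups (as a list of tuples when there is more than one group) instead of the full matches. To get the matched strings, enclose the whole pattern in an outer pair of parentheses and take the first element of each tuple. The outer pair becomes group 1, which is why the letter groups are referenced as \2 and \3: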

>>> import re
>>> s = "AAAbCCC AAAbCCD"
>>> [groups[0] for groups in re.findall(r"(([A-Z])\2\2[a-z]([A-Z])\3\3)", s)]
['AAAbCCC']
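As a rough illustration of what re.findall() hands back before the first element is picked out, the sketch below prints the raw result for the same input. The variable names pattern, raw and whole are illustrative only.

```python
import re

s = "AAAbCCC AAAbCCD"
pattern = r"(([A-Z])\2\2[a-z]([A-Z])\3\3)"

# With three capture groups, findall() yields one 3-tuple per match:
# (whole match, first repeated letter, second repeated letter).
raw = re.findall(pattern, s)
print(raw)

# Keep only group 1, the outer parentheses around the whole pattern.
whole = [groups[0] for groups in raw]
print(whole)
```

Each tuple holds groups 1 to 3 in order, so index 0 is the full seven-character string.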

You can also use re.finditer(), which returns an iterator over the match objects:

>>> [match.group(1) for match in re.finditer(r"(([A-Z])\2\2[a-z]([A-Z])\3\3)", s)]
['AAAbCCC']
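Because match objects always expose the full match as group(0), a variant without the outer parentheses should also work with re.finditer(). This is a sketch under that assumption, with the backreferences renumbered to \1 and \2; the names pattern and matches are illustrative.

```python
import re

s = "AAAbCCC AAAbCCD ABAbCDC"

# No outer group: the two letter groups are now 1 and 2.
pattern = re.compile(r"([A-Z])\1\1[a-z]([A-Z])\2\2")

# group(0) is the entire matched text, regardless of capture groups.
matches = [m.group(0) for m in pattern.finditer(s)]
print(matches)
```

This keeps the pattern slightly shorter, at the cost of not being usable with re.findall() directly, which would return only the two captured letters.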