Simplicity - 6 months ago
Python Question

Regular expression - range of the match

I have the following regular expression:

re.findall(r'(\b[A-Za-z][a-z]{3,10}\b)', string_var)


I expected this regular expression to return matches ranging in length from 3 to 10. However, it returns matches for words ranging in length from 4 to 11.

Should we thus read the above regular expression as matching words that start with an upper case or lower case letter, followed by lowercase letters numbering from 3 to 10? In other words, is the first letter the extra character that extends the range?

Thanks.

Answer

Yes.

Your regex is

(\b[A-Za-z][a-z]{3,10}\b)

Now, the grouping parens don't affect what is matched, so we can ignore them. And \b is a "zero-width" assertion: it matches at a word boundary, the position between a word character and a non-word character (or the start or end of the string), so it doesn't consume any characters. We can ignore it as well when counting the length of the match. That leaves this:

[A-Za-z][a-z]{3,10}

This is two character classes, with a repetition specifier suffix on the second:

  1. [A-Za-z] - matches one character, upper or lower case Latin alphabetic.

  2. [a-z]{3,10} - matches at least 3, at most 10 characters, lowercase a-z

So in total, you are matching 1 + [3,10] characters. Your minimal match will be 4 characters, and your maximal match will be 11.
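
If you want to check the arithmetic yourself, here is a minimal sketch. The sample words are made up for illustration (lengths 3, 4, 5, 11 and 12), and the adjusted quantifier {2,9} is only my assumption about what you intended, namely total lengths from 3 to 10.

```python
import re

# Hypothetical sample text: words of length 3, 4, 5, 11 and 12.
words = ["cat", "tree", "hello", "abcdefghijk", "abcdefghijkl"]
text = " ".join(words)

# The original pattern, with and without the grouping parens.
grouped = re.findall(r'(\b[A-Za-z][a-z]{3,10}\b)', text)
matches = re.findall(r'\b[A-Za-z][a-z]{3,10}\b', text)

# Distinct lengths of the matched words.
lengths = sorted({len(m) for m in matches})

# Assumed intent: total length 3 to 10, so the quantifier covers
# only the 2 to 9 letters that follow the first one.
adjusted = re.findall(r'\b[A-Za-z][a-z]{2,9}\b', text)

print(grouped == matches)  # the parens do not change what is matched
print(matches)
print(lengths)
print(adjusted)
```

With this sample, the original pattern skips the 3-letter and 12-letter words, while the adjusted one picks up the 3-letter word and drops the 11-letter one.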
