alphanumeric alphanumeric - 1 month ago 7
Python Question

How to match multiple separator characters using a regular expression

To extract the first three letters 'abc' and the three sets of three-digit numbers in

abc_000_111_222
I am using the following expression:

import re
text = 'abc_000_111_222'
print(re.findall(r'^[a-z]{3}_[0-9]{3}_[0-9]{3}_[0-9]{3}', text))


But the expression returns an empty list when minuses or periods are used instead of underscores:
abc.000.111.222
or
abc-000-111-222
or any combination of them, like:
abc_000.111-222


Sure, I could use a simple replace call to unify the text variable:
text = text.replace('-', '_').replace('.', '_')


But I wonder whether, instead of replacing, I could modify the regular expression so that it recognizes underscores, minuses and periods.

Answer

You can use a regex character class, written with [...]. For your case, it can be [_.-]. Note the hyphen at the end: placed between two other characters, it would be read as a range, as in [a-z]. A period inside a character class is a literal period, so it needs no escaping.
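To illustrate how the position of the hyphen changes the meaning of the class, here is a minimal sketch. It assumes Python 3 (for re.fullmatch), and the names literal and ranged are made up for this example.

```python
import re

# Hyphen at the end of the class: a literal '-'.
literal = re.compile(r'[_.-]')
# Hyphen between two characters: a range. '+-.' covers '+', ',', '-' and '.'.
ranged = re.compile(r'[+-.]')

print(bool(literal.fullmatch('-')))  # True
print(bool(literal.fullmatch('.')))  # True, the period is literal inside [...]
print(bool(literal.fullmatch('x')))  # False, the period is not a wildcard here
print(bool(literal.fullmatch(',')))  # False
print(bool(ranged.fullmatch(',')))   # True, ',' falls inside the range
```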

You can use a regex like this:

print(re.findall(r'^[a-z]{3}[_.-][0-9]{3}[_.-][0-9]{3}[_.-][0-9]{3}', text))
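As a quick check, the sketch below runs that pattern over the separator variants from the question. The last sample, with spaces, is an extra input added here only to show a non-match. The names pattern, samples and results are illustrative.

```python
import re

# Same pattern as above, compiled once so it can be reused.
pattern = re.compile(r'^[a-z]{3}[_.-][0-9]{3}[_.-][0-9]{3}[_.-][0-9]{3}')

samples = [
    'abc_000_111_222',
    'abc.000.111.222',
    'abc-000-111-222',
    'abc_000.111-222',
    'abc 000 111 222',  # spaces are not in the class, so this one fails
]

# Map each sample to what findall returns for it.
results = {s: pattern.findall(s) for s in samples}
for sample, found in results.items():
    print(sample, found)
```

The first four samples each come back as a one-element list holding the whole string; the last one gives an empty list.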


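Btw, you can shorten your regex to something like this. The repeated group is written as non-capturing, (?:...), because re.findall returns only the group's contents instead of the whole match when the pattern contains a capturing group:

print(re.findall(r'^[a-z]{3}[_.-](?:\d{3}[_.-]){2}\d{3}', text))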
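Since the question asks for the letters and the three numbers as separate pieces, one possible approach is to capture each piece in its own group and read them with match.groups(). This is only a sketch; parts_pattern and split_parts are hypothetical names, not part of the original answer.

```python
import re

# Parentheses capture each piece; the separators stay outside the groups.
parts_pattern = re.compile(r'^([a-z]{3})[_.-](\d{3})[_.-](\d{3})[_.-](\d{3})')

def split_parts(text):
    """Return (letters, n1, n2, n3), or None when text does not match."""
    match = parts_pattern.match(text)
    return match.groups() if match else None

print(split_parts('abc_000.111-222'))  # ('abc', '000', '111', '222')
print(split_parts('abc 000 111 222'))  # None
```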

As a side note, if you want all three separators to be the same character, you can capture the first one in a group and refer back to it with \1, like this:

^[a-z]{3}([_.-])[0-9]{3}\1[0-9]{3}\1[0-9]{3}
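A short sketch of the backreference version in use follows. It relies on re.match rather than re.findall, because with a capturing group in the pattern findall would return only the captured separator. The name same_sep is made up for this example.

```python
import re

# \1 must repeat whatever separator the first group matched.
same_sep = re.compile(r'^[a-z]{3}([_.-])[0-9]{3}\1[0-9]{3}\1[0-9]{3}')

print(bool(same_sep.match('abc-000-111-222')))     # True
print(bool(same_sep.match('abc_000.111-222')))     # False, separators differ
print(same_sep.match('abc.000.111.222').group(0))  # abc.000.111.222
print(same_sep.findall('abc.000.111.222'))         # ['.'], only the group
```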