Thoughtcraft - 11 months ago
Python Question

How to use Boolean OR inside a regex

I want to use a regex to find a substring, followed by a variable number of characters, followed by any of several substrings.

An re.findall of


should give me:


I have tried all of the following without success:

import re
re.findall('(ATG.*TAA)|(ATG.*TAG)', string2)
re.findall('ATG.*(TAA|TAG)', string2)
re.findall('ATG.*((TAA)|(TAG))', string2)
re.findall('ATG.*(TAA)|(TAG)', string2)
re.findall('ATG.*(TAA)|ATG.*(TAG)', string2)
re.findall('(ATG.*)(TAA)|(ATG.*)(TAG)', string2)
re.findall('(ATG.*)TAA|(ATG.*)TAG', string2)

What am I missing here?

Answer

This is not entirely straightforward, because a) you want overlapping matches, and b) you want greedy, non-greedy, and everything in between.
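To illustrate both points, here is a small sketch on a made-up input (`seq` is an assumption, not the string from the question). It shows that a capturing group makes `re.findall` return only the group, and that a single pattern yields either the longest or the shortest match, never all of them:

```python
import re

seq = "ATGCCTAACCTAG"  # hypothetical example string, not from the question

# With a capturing group, findall returns the group, not the whole match.
grouped = re.findall(r'ATG.*(TAA|TAG)', seq)

# A non-capturing group returns the whole match; greedy .* takes the longest.
greedy = re.findall(r'ATG.*(?:TAA|TAG)', seq)

# Lazy .*? stops at the first possible ending; findall never overlaps matches.
lazy = re.findall(r'ATG.*?(?:TAA|TAG)', seq)

print(grouped)  # ['TAG']
print(greedy)   # ['ATGCCTAACCTAG']
print(lazy)     # ['ATGCCTAA']
```

Neither the greedy nor the lazy call returns both candidates at once, which is why a different approach is needed.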

As long as the strings are fairly short, you can check every substring:

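import re

# The trailing $ forces the match to end exactly at endpos.
p = re.compile(r'ATG.*TA[GA]$')

# s is the input string to search.
for start in range(len(s) - 5):  # a match is at least 6 letters long
    for end in range(start + 6, len(s) + 1):  # endpos is exclusive
        if p.match(s, pos=start, endpos=end):
            print(s[start:end])
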
This prints:


Since you appear to work with DNA sequences or something like that, make sure to check out Biopython, too.
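For longer strings, running the regex on every substring may get slow. One possible alternative, sketched here under the assumption that the string contains no newlines, is to collect every position where `ATG` starts and every position where `TAA` or `TAG` ends, then pair them up. The function name `find_candidates` and the sample input are hypothetical:

```python
import re

def find_candidates(s):
    """Return every substring that starts with ATG and ends with TAA or TAG."""
    # Zero-width lookaheads match at every position, so overlaps are found.
    starts = [m.start() for m in re.finditer(r'(?=ATG)', s)]
    # Store the index just past each end marker (TAA or TAG).
    ends = [m.start() + 3 for m in re.finditer(r'(?=TA[AG])', s)]
    # Keep only pairs where the end marker comes after the full ATG.
    return [s[i:j] for i in starts for j in ends if j - i >= 6]

example = "ATGCCTAACCTAG"  # made-up input for illustration
print(find_candidates(example))  # ['ATGCCTAA', 'ATGCCTAACCTAG']
```

This scans the string twice with the regex engine and does the pairing in plain Python, so the results come out grouped by start position.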