Thoughtcraft - 2 months ago
Python Question

How to use Boolean OR inside a regex

I want to use a regex to find a substring, followed by a variable number of characters, followed by any of several substrings.

For example, re.findall on

"ATGTCAGGTAAGCTTAGGGCTTTAGGATT"


should give me:

['ATGTCAGGTAA', 'ATGTCAGGTAAGCTTAG', 'ATGTCAGGTAAGCTTAGGGCTTTAG']


I have tried all of the following without success:

import re
string2 = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
re.findall('(ATG.*TAA)|(ATG.*TAG)', string2)
re.findall('ATG.*(TAA|TAG)', string2)
re.findall('ATG.*((TAA)|(TAG))', string2)
re.findall('ATG.*(TAA)|(TAG)', string2)
re.findall('ATG.*(TAA)|ATG.*(TAG)', string2)
re.findall('(ATG.*)(TAA)|(ATG.*)(TAG)', string2)
re.findall('(ATG.*)TAA|(ATG.*)TAG', string2)


What am I missing here?

Answer

This is not straightforward with a single regex, because a) you want overlapping matches, whereas re.findall only returns non-overlapping ones, and b) you want the greedy match, the non-greedy match, and everything in between.
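
To make the two problems concrete, here is a small sketch of what three variants of the attempted patterns return (the variable names captured, greedy and lazy are only for illustration). A capturing group makes re.findall return the group's text instead of the whole match, and even with a non-capturing group you get either the longest or the shortest candidate, never the one in the middle:

```python
import re

s = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"

# A capturing group makes findall return only the text of that group.
captured = re.findall(r'ATG.*(TAA|TAG)', s)

# Non-capturing group with greedy .*: only the longest candidate is found.
greedy = re.findall(r'ATG.*(?:TAA|TAG)', s)

# Non-capturing group with lazy .*?: only the shortest candidate is found.
lazy = re.findall(r'ATG.*?(?:TAA|TAG)', s)

print(captured)  # ['TAG']
print(greedy)    # ['ATGTCAGGTAAGCTTAGGGCTTTAG']
print(lazy)      # ['ATGTCAGGTAA']
```

In all three cases the regex engine consumes the text from the single ATG onwards in one match, so 'ATGTCAGGTAAGCTTAG' can never show up in the result.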

As long as the strings are fairly short, you can check every substring:

import re
s = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
p = re.compile(r'ATG.*TA[GA]$')

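for start in range(len(s) - 5):  # a match is at least 6 letters long (ATG + TAA/TAG)
    for end in range(start + 6, len(s) + 1):  # + 1 so a match ending on the last letter is found
        # with endpos set, $ anchors at endpos, so the match must end exactly there
        if p.match(s, pos=start, endpos=end):
            print(s[start:end])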

This prints:

ATGTCAGGTAA
ATGTCAGGTAAGCTTAG
ATGTCAGGTAAGCTTAGGGCTTTAG
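
If the strings get longer, running the regex on every substring becomes slow. One possible alternative, sketched here under the assumption that you only need fixed start and stop patterns, is to collect all start positions and all stop positions with zero-width lookaheads and then pair them up. The function name find_all_spans and its parameters are made up for this example. Unlike the `.*` in the pattern above, it also does not care whether there are newlines between start and stop:

```python
import re

def find_all_spans(s, start_pat=r'ATG', stop_pat=r'TA[AG]'):
    """Return every substring that begins with start_pat and ends with stop_pat."""
    # Lookaheads are zero-width, so overlapping occurrences are all reported.
    # Group 1 captures the actual text so we know where each occurrence ends.
    starts = [(m.start(1), m.end(1)) for m in re.finditer(f'(?=({start_pat}))', s)]
    stops = [(m.start(1), m.end(1)) for m in re.finditer(f'(?=({stop_pat}))', s)]
    return [
        s[begin:stop_end]
        for begin, start_end in starts
        for stop_begin, stop_end in stops
        if stop_begin >= start_end  # the stop must come after the whole start substring
    ]

print(find_all_spans("ATGTCAGGTAAGCTTAGGGCTTTAGGATT"))
# ['ATGTCAGGTAA', 'ATGTCAGGTAAGCTTAG', 'ATGTCAGGTAAGCTTAGGGCTTTAG']
```

This returns a list, like the re.findall call in the question, and the results are ordered by start position and then by length.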

Since you appear to be working with DNA sequences, make sure to check out Biopython, too.