DrJessop - 9 months ago 58
Python Question

# DNA Motif Enumeration with Try/Except and Loops - Python3

I've been struggling with a particular bioinformatics problem for the last few days, and I was wondering if somebody could either find the fault in my logic or fault in my code (or both).
The function is supposed to find all k-mers with a Hamming distance of at most d from all of the strings of DNA.
This is what I'm trying to do:

1. Start with an iteration of all possible k-mers and compare them to each string

2. This implies I need another loop that goes through the strings of DNA as well

3. I make a while loop for
`c+k <= len(DNA[0])-1`
.
`c+k`
is my window of size
`k`
, and I want to find at least one window in each string of DNA where my combo has a Hamming distance from that string equal to or less than an arbitrary
`d`
. If the Hamming distance meets the criterion, then the while loop breaks, allowing the next string to be compared. If it doesn't, then the window is changed, and if
`c+k==len(DNA[0])-1`
, and the Hamming distance still doesn't meet the criterion, I create a name error
`int(a)`
and the exception kills the
`inner_loop`
.

However, my function returns nothing except
`set()`
, which I don't understand.

``````import itertools
def combination(k):
bases=['A','T','G','C']
combo=[''.join(p) for p in itertools.product(bases, repeat=k)]
return combo

def hammingDistance(Pattern, seq):
if Pattern == seq:
return 0
else:
dist=0
for i in range(len(seq)):
if Pattern[i] != seq[i]:
dist += 1
return dist

def motif_enumeration(k, d, DNA):
combos = combination(k)
global pattern
for combo in combos:
try:
inner_loop(k, d, DNA, combo)
except:
continue

return set(pattern)

def inner_loop(k, d, DNA, combo):
global pattern
for strings in DNA:
inner_loop_two(k, d, DNA, combo, strings)

def inner_loop_two(k, d, DNA, combo, strings):
global pattern
c=0
while c+k < len(DNA[0]):
print(combo, strings[c:c+k], hammingDistance(combo, strings[c:c+k]))
if d >= hammingDistance(combo, strings[c:c+k]) and strings == DNA[len(DNA)-1]:
#if we've reached the last string and the condition is met,
#that means that the combo is suitable for each string of DNA
pattern += [combo]
elif d >= hammingDistance(combo, strings[c:c+k]):
#condition is met for one string, now move onto next
break
elif d < hammingDistance(combo, strings[c:c+k]) and c+k == len(DNA[0])-1:
#Name error causes this inner loop two to crash, thus causing the first inner loop
#to pass
int(a)
elif d < hammingDistance(combo, strings[c:c+k]):
#change the window to see if the combo is valid later in the string
c += 1

pattern = []
DNA=['ATTTGGC', 'TGCCTTA', 'CGGTATC', 'GAAAATT']
print(motif_enumeration(3,1,DNA))
print(pattern)
``````

I thought that since my second inner loop crashed, this would cause my first inner loop to pass, and then another combo in
`motif_enumeration`
would be tested, but the first conditional in my
`inner_loop_two`
never prints anything. I noticed also that when the inner loop crashes and
`motif_enumeration`
continues, it continues for both the outer and the inner loop. Here's an example of what I mean...

``````AAA ATT 2
AAA TTT 3
AAA TTG 3
AAA TGG 3
AAT ATT 1
AAT TGC 3
AAT GCC 3
AAT CCT 2
AAT CTT 2
AAG ATT 2
AAG TTT 3
AAG TTG 2
AAG TGG 2
AAC ATT 2
AAC TTT 3
AAC TTG 3
AAC TGG 3
ATA ATT 1 etc...
``````

My expected output is
`pattern=[ATA, ATT, GTT, TTT]`

The core component of the logic is that we want to collect a combo into the pattern set if the combo matches at any position on all of the target strings. We can use Python's `all` and `any` functions to do this. Those functions work efficiently because they stop testing as soon as the result is decided.

``````import itertools

def combination(k):
return (''.join(p) for p in itertools.product('ATCG', repeat=k))

def hamming_distance(pattern, seq):
return sum(c1 != c2 for c1, c2 in zip(pattern, seq))

def window(s, k):
for i in range(1 + len(s) - k):
yield s[i:i+k]

def motif_enumeration(k, d, DNA):
pattern = set()
for combo in combination(k):
if all(any(hamming_distance(combo, pat) <= d
for pat in window(string, k)) for string in DNA):
return pattern

DNA = ['ATTTGGC', 'TGCCTTA', 'CGGTATC', 'GAAAATT']
print(motif_enumeration(3, 1, DNA))
``````

output

``````{'GTT', 'ATA', 'TTT', 'ATT'}
``````

I've made a few other changes to your code. We can calculate the Hamming distance efficiently by passing a generator to the `sum` function. And we can save time & RAM by converting the combo tuples to strings with a generator rather than putting them into a list.

The `motif_enumeration` function can be condensed even further into a set comprehension, but I must admit that it is rather dense, and even harder to read than the previous version. It may be slightly more efficient, though.

``````def motif_enumeration(k, d, DNA):
return {combo for combo in combination(k)
if all(any(hamming_distance(combo, pat) <= d
for pat in window(string, k)) for string in DNA)}
``````

And here's a slightly more readable version, where I've given `motif_enumeration` a helper function `in_window` to perform the inner test.

``````# Return True if combo is within d in any window of string
def in_window(combo, string, k, d):
return any(hamming_distance(combo, pat) <= d for pat in window(string, k))

def motif_enumeration(k, d, DNA):
pattern = set()
for combo in combination(k):
if all(in_window(combo, string, k, d) for string in DNA):