 Iv&#225;n Castro - 3 years ago 190
Python Question

# A more complex version of "How can I tell if a string repeats itself in Python?"

I was reading this post and I wonder if someone can find the way to catch repetitive motifs into a more complex string.

For example, find all the repetitive motifs in

``````string = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
``````

Here the repetitive motifs:
'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'

So, the output should be something like this:

``````output = {'ACGT': {'repeat': 2,
'region': (5,13)},
'GT': {'repeat': 3,
'region': (19,24)},
'TATACG': {'repeat': 2,
'region': (29,40)}}
``````

This example comes from a typical biological phenomena termed microsatellite which is present into the DNA.

UPDATE 1: Asterisks were removed from the string variable. It was a mistake.

UPDATE 2: Single character motif doesn't count. For example: in ACGUGAAAGUC, the 'A' motif is not taken into account. Kasramvd

you can use a recursion function as following :

``````import re
def finder(st,past_ind=0,l=[]):
m=re.search(r'(.+)\1+',st)
if m:
i,j=m.span()
sub=st[i:j]
ind = (sub+sub).find(sub, 1)
sub=sub[:ind]
if len(sub)>1:
l.append([sub,(i+past_ind+1,j+past_ind+1)])
past_ind+=j
return finder(st[j:],past_ind)
else:
return l

s='AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
print finder(s)
``````

result:

``````[['ACGT', (5, 13)], ['GT', (19, 25)], ['TATACG', (29, 41)]]
``````

# answer to previous question for following string :

``````s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
``````

You can use both answers from mentioned question and some extra recipes :

First you can split the string with `**` then create a new list contain the repeated strings with `r'(.+)\1+'` regex :

So the result will be :

``````>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> new
['AAA', 'ACGTACGT', 'TT', 'GTGTGT', 'CCCC', 'TATACGTATACG', 'TTT']
``````

Note about `'ACGTACGT'` that missed the `A` at the end!

Then you can use `principal_period`'s function to get the repeated sub strings :

``````def principal_period(s):
i = (s+s).find(s, 1, -1)
return None if i == -1 else s[:i]

>>> for i in new:
...    p=principal_period(i)
...    if p is not None and len(p)>1:
...        l.append(p)
...        sub.append(i)
...
``````

So you will have the repeated strings in `l` and main strings in `sub` :

``````>>> l
['ACGT', 'GT', 'TATACG']
>>> sub
['ACGTACGT', 'GTGTGT', 'TATACGTATACG']
``````

Then you need a the `region` that you can do it with `span` method :

``````>>> for t in sub:
...    regons.append(re.search(t,s).span())

>>> regons
[(6, 14), (24, 30), (38, 50)]
``````

And at last you can zip the 3 list `regon`,`sub`,`l` and use a dict comprehension to create the expected result :

``````>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}
``````

The main code :

``````>>> s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
>>> sub=[]
>>> l=[]
>>> regon=[]
>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> for i in new:
...    p=principal_period(i)
...    if p is not None and len(p)>1:
...        l.append(p)
...        sub.append(i)
...

>>> for t in sub:
...    regons.append(re.search(t,s).span())
...
>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download