Gia Constantina - 3 months ago
Python Question

Find number of breaks in a sequence

I have a program that parses allele sequences. I am trying to write code that determines whether an allele is complete. To do so, I need to count the number of breaks in the reference sequence. A break is a run of one or more '-' characters. If there is more than one break, I want the program to say "Incomplete Allele."

How can I count the number of breaks in the sequence?

Here is an example of a "broken" sequence:

>DQB1*04:02:01
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
--ATGTCTTGGAAGAAGGCTTTGCGGAT-------CCCTGGAGGCCTTCGGGTAGCAACT
GTGACCTT----GATGCTGGCGATGCTGAGCACCCCGGTGGCTGAGGGCAGAGACTCTCC
CGAGGATTTCGTGTTCCAGTTTAAGGGCATGTGCTACTTCACCAACGGGACCGAGCGCGT
GCGGGGTGTGACCAGATACATCTATAACCGAGAGGAGTACGCGCGCTTCGACAGCGACGT
GGGGGTGTATCGGGCGGTGACGCCGCTGGGGCGGCTTGACGCCGAGTACTGGAATAGCCA
GAAGGACATCCTGGAGGAGGACCGGGCGTCGGTGGACACCGTATGCAGACACAACTACCA
GTTGGAGCTCCGCACGACCTTGCAGCGGCGA-----------------------------
-----------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---GTGGAGCCCACAGTGACCATCTCCCCATCCAGGACAGAGGCCCTCAACCACCACAAC
CTGCTGGTCTGCTCAGTGACAGATTTCTATCCAGCCCAGATCAAAGTCCGGTGGTTTCGG
AATGACCAGGAGGAGACAACTGGCGTTGTGTCCACCCCCCTTATTAGGAACGGTGACTGG
ACCTTCCAGATCCTGGTGATGCTGGAAATGACTCCCCAGCGTGGAGACGTCTACACCTGC
CACGTGGAGCACCCCAGCCTCCAGAACCCCATCATCGTGGAGTGGCGGGCTCAGTCTGAA
TCTGCCCAGAGCAAGATGCTGAGTGG----CATTGGAGGCTTCGTGCTGGGGCTGATCTT
CCTCGGGCTGGGCCTTATTATC--------------CATCACAGGAGTCAGAAAGGGCTC
CTGCACTGA---------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------


The code I have so far is as follows:

idx=[]
for m in range(len(sequence)):
    for n in re.finditer('-',sequence[0]):
        idx.append(n.start())
counter=0
min_val=[]
for n in range(len(idx)):
    if counter==idx[n]:
        counter=counter+1
    elif counter !=0:
        min_val.append(idx[n-1])
        counter=0


My reasoning was that if I could find the start positions of the '-' characters, I could see how many separate runs appear within the sequence and whether they break it at all. However, I know the code above has some flaws.

Answer

It seems like you can just count the occurrences of -+, i.e. a run of one or more - symbols. The only problem is the line breaks, but you could either incorporate those into the regex, or split and join the string before matching.

>>> import re
>>> sequence = """>DQB1*04:02:01....."""
>>> joined = ''.join(sequence.splitlines())
>>> sum(1 for m in re.finditer("-+", joined))
7

Note: This count includes the runs of - at the start and end of the sequence.
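
If the leading and trailing runs should not count as breaks, one option is to trim them before matching. The following is only a sketch under a few assumptions: the helper names, the "Complete Allele" label, and the `max_breaks` threshold are illustrative and not from the question, and header lines are taken to be those starting with `>`.

```python
import re

def count_internal_breaks(fasta_text):
    """Count runs of '-' that have bases on both sides."""
    # Skip FASTA header lines (starting with '>') and join the sequence lines.
    body = "".join(
        line.strip() for line in fasta_text.splitlines()
        if not line.startswith(">")
    )
    # Trim the leading and trailing gap so only interior runs are counted.
    return len(re.findall(r"-+", body.strip("-")))

def classify(fasta_text, max_breaks=1):
    """Hypothetical rule: more than max_breaks interior breaks is incomplete."""
    if count_internal_breaks(fasta_text) > max_breaks:
        return "Incomplete Allele"
    return "Complete Allele"

demo = ">demo\n---ATG--\n--CCT-GA\n----\n"
print(count_internal_breaks(demo))  # 2
print(classify(demo))               # Incomplete Allele
```

For the sequence in the question, this should give 7 - 2 = 5 interior breaks, since the count of 7 includes one run at each end. Whether "more than one break" is meant to include those end runs is something only you can decide, which is why the threshold is a parameter here.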

Or reverse the approach: Instead of counting the gaps, count the groups:

>>> sum(1 for m in re.finditer("[GATC]+", joined))
6
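
Both counts above rely on joining the lines first. The other option, folding the line breaks into the regex, might look roughly like this. It is a sketch that assumes plain `\n` line endings and a header line containing no `-`; the names `GAP_RUN` and `count_gap_runs` are made up for illustration.

```python
import re

# A run of '-' that may continue across a line break.
# Assumes '\n' line endings and no '-' in the header line.
GAP_RUN = re.compile(r"-(?:\n?-)*")

def count_gap_runs(text):
    """Count runs of '-' in text, treating a run split over lines as one."""
    return sum(1 for _ in GAP_RUN.finditer(text))

demo = ">demo\n---ATG--\n--CCT-GA\n----\n"
print(count_gap_runs(demo))                   # 4

# Same result as the split-and-join approach on this input.
joined = "".join(demo.splitlines())
print(len(re.findall(r"-+", joined)))         # 4
```

Joining first is probably the simpler choice in practice; the single-regex variant mainly avoids building a second copy of the string.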