user2300042 - 4 months ago
Python Question

Calculate the length of a sequence after adding the lengths of the previous sequences

I want to determine the length of each sequence in a multi-FASTA file. I got this Biopython code from the Biopython manual:

from Bio import SeqIO
import sys
cmdargs = str(sys.argv)
for seq_record in SeqIO.parse(str(sys.argv[1]), "fasta"):
    output_line = '%s\t%i' % (seq_record.id, len(seq_record))
    print(output_line)


My input file is like:

>Protein1
MNT
>Protein2
TSMN
>Protein3
TTQRT


And the code yields:

Protein1 3
Protein2 4
Protein3 5


But I want the start and end position of each sequence, counted cumulatively after adding the lengths of all previous sequences. It would look like:

Protein1 1-3
Protein2 4-7
Protein3 8-12


I don't know which line of the code above I need to change to get that output. I'd appreciate any help with this, thanks!

Answer

Getting the running total length is easy: keep a counter outside the loop and add each sequence's length to it.

from Bio import SeqIO
import sys

total_len = 0  # running total of all sequence lengths seen so far
for seq_record in SeqIO.parse(sys.argv[1], "fasta"):
    total_len += len(seq_record)
    output_line = '%s\t%i' % (seq_record.id, total_len)
    print(output_line)

To get a range:

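from Bio import SeqIO
import sys

total_len = 0  # running total of all sequence lengths seen so far
for seq_record in SeqIO.parse(sys.argv[1], "fasta"):
    previous_total_len = total_len  # end position of the previous sequence
    total_len += len(seq_record)    # end position of the current sequence
    # The range is 1-based and inclusive: it starts one past the previous end.
    output_line = '%s\t%i-%i' % (seq_record.id, previous_total_len + 1, total_len)
    print(output_line)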
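
If you want to check the range arithmetic without a FASTA file, here is a minimal sketch of the same logic as a plain generator. The function name `cumulative_ranges` is made up for illustration and is not part of Biopython; it assumes you pass it (id, length) pairs, which you could build from the parser with something like `((rec.id, len(rec)) for rec in SeqIO.parse(path, "fasta"))`.

```python
def cumulative_ranges(records):
    """Yield (name, start, end) tuples with 1-based, inclusive positions.

    `records` is any iterable of (name, length) pairs. This is a
    hypothetical helper for illustration, not a Biopython function.
    """
    offset = 0  # total length of all sequences seen so far
    for name, length in records:
        start = offset + 1   # one past the end of the previous sequence
        offset += length     # end position of the current sequence
        yield name, start, offset


# Lengths taken from the example input in the question.
example = [("Protein1", 3), ("Protein2", 4), ("Protein3", 5)]
for name, start, end in cumulative_ranges(example):
    print("%s\t%i-%i" % (name, start, end))
```

With the three example proteins this prints the ranges 1-3, 4-7 and 8-12, matching the output asked for in the question. Note that an empty sequence would produce a start one greater than its end, so you may want to handle that case separately.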