Jon Jon - 8 months ago 75
Python Question

Python: Removing characters from beginnings of sequences in fasta format

I have sequences in FASTA format that contain 17 bp primers at the beginning. The primers sometimes have mismatches, so I want to remove the first 17 characters of each sequence while leaving the FASTA header lines untouched.

The sequences look like this:

> name_name_number_etc
SEQUENCEFOLLOWSHERE
> name_number_etc
SEQUENCEFOLLOWSHERE
> name_name_number_etc
SEQUENCEFOLLOWSHERE


How can I do this in python?

Thanks! Jon

Answer Source

If I understand correctly, you need to remove the primer, i.e. the first 17 characters, from each sequence, and a sequence may span several lines. That makes the task a bit harder than it looks: a simple solution exists, but it can fail in some situations.

My suggestion is to use Biopython to parse the FASTA file. Straight from the tutorial:

from Bio import SeqIO

# Print the ID, sequence and length of every record in the file
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

Then write each record back out with the first 17 letters of its sequence removed. I don't have Biopython installed on my current machine, but if you take a look at the tutorial, it shouldn't take more than 15 lines of code in total.
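As a rough sketch of that step, assuming Biopython is installed: slicing a SeqRecord keeps its id and description, so each record can be cut and passed straight to SeqIO.write. The function name trim_primers and its parameters are made up for illustration, and I have not run this against your data.

```python
from Bio import SeqIO


def trim_primers(in_path, out_path, primer_len=17):
    """Write a copy of a FASTA file with the first primer_len letters
    of every sequence removed (hypothetical helper, not a Biopython API)."""
    # record[primer_len:] returns a new SeqRecord with the same id and
    # description but a shortened sequence.
    trimmed = (record[primer_len:] for record in SeqIO.parse(in_path, "fasta"))
    # SeqIO.write returns the number of records written.
    return SeqIO.write(trimmed, out_path, "fasta")
```

Calling trim_primers("input.fasta", "trimmed.fasta") would then write the trimmed records to a new file; both file names are placeholders.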

If you want to go hardcore and do it manually, you have to do something like this (modified from the first poster's code):

# Trim the first 17 characters of the line that follows each header
first_line = False
with open('sequence.fsa') as f:
    for line in f:
        if line.startswith(">"):
            first_line = True
            print(line, end="")
        else:
            if first_line:
                print(line[17:], end="")
            else:
                print(line, end="")
            first_line = False
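The loop above only cuts the first line after each header, so it goes wrong when a sequence is wrapped into lines shorter than 17 characters. One way around that, as a minimal sketch using only the standard library, is to join the lines of each record before cutting. The name trim_fasta is hypothetical, and the output is unwrapped (one line per sequence), which is an assumption about what you want.

```python
def trim_fasta(lines, n=17):
    """Yield (header, sequence) pairs with the first n characters of
    each sequence removed. `lines` is any iterable of text lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)[n:]
            header, chunks = line, []
        else:
            chunks.append(line)
    # Emit the last record in the input.
    if header is not None:
        yield header, "".join(chunks)[n:]


# Small in-memory example: one record wrapped over three short lines.
example = [">seq1 demo\n", "A" * 10 + "\n", "C" * 10 + "\n", "GGGG\n"]
for header, seq in trim_fasta(example):
    print(header)
    print(seq)
```

For a real file, pass the open file object instead of the list, for example trim_fasta(open('sequence.fsa')).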