Jessica - 7 months ago
Python Question

Pulling out only FASTA sequences in Python

I want to read in a large file that contains FASTA sequences (one header line with one line of sequence below it), mixed with other random junk and irregular spacing between lines.

I want to read each line and, if a line starts with the ">" symbol, which is how a FASTA header starts, pull that header out along with the next line, which is the sequence.

I have a small example data file to show:

> 1
GCTAGCGCCACCatgactcccgcatttatcttgtgcatgctctt
>2
GCTAGCACCATGGAGACAGACACACTCCTGCTATGGGTACTGCTGCTCTG
>3
GCTAGCACCATGGAGACAGACACACTCCTGCTATG


Task 2: Subclone the synthesized

junk

junk

>4
GCTAGCACCATGGAGACAGAC


My code:

f=open("File.fasta", "r")
fastaseq = open("OnlyFastaseq.fasta", "w")

for line in f:
    line = line.strip('\n')
    if line.startswith(">"):
        title = line.rstrip()
        seq = f.readline()
        seq = seq.rstrip()
        fastaseq.write(title+"\n"+seq+"\n")


Desired output:

> 1
GCTAGCGCCACCatgactcccgcatttatcttgtgcatgctctt
>2
GCTAGCACCATGGAGACAGACACACTCCTGCTATGGGTACTGCTGCTCTG
>3
GCTAGCACCATGGAGACAGACACACTCCTGCTATG
>4
GCTAGCACCATGGAGACAGAC


The result contains most of the header and sequence pairs, but for the '>3' header it didn't pull out the next line (the sequence):

> 1
GCTAGCGCCACCatgactcccgcatttatcttgtgcatgctctt
>2
GCTAGCACCATGGAGACAGACACACTCCTGCTATGGGTACTGCTGCTCTG
>3
>4
GCTAGCACCATGGAGACAGAC

Answer

You can filter those out by iterating over the input, finding the lines that start with >, and writing each such line and the next one from the input file, e.g.:

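with open('File.fasta') as fin, open('OnlyFastaseq.fasta', 'w') as fout:
    for line in fin:
        # A FASTA header starts with '>'
        if line.startswith('>'):
            fout.write(line)
            # Write the line right after the header; the '' default
            # avoids StopIteration if the header is the last line
            fout.write(next(fin, ''))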
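
The question doesn't show why the '>3' header lost its sequence. One possible cause, and this is only an assumption, is a blank or whitespace-only line between that header and its sequence in the real file. Taking the line immediately after the header would then write an empty line. A minimal sketch that takes the first non-blank line after each header instead might look like this. The names `extract_fasta` and `filter_fasta` are made up for illustration, and the sketch still assumes each sequence fits on one line.

```python
def extract_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of text lines.

    Assumption: each header is followed by a single sequence line,
    possibly with blank lines in between.
    """
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank or whitespace-only lines
        if line.startswith('>'):
            header = line  # remember the header until its sequence shows up
        elif header is not None:
            # first non-blank line after a header is treated as the sequence
            yield header, line
            header = None
        # anything else (no pending header) is junk and is ignored


def filter_fasta(src_path, dst_path):
    """Copy only the header/sequence pairs from src_path to dst_path."""
    with open(src_path) as fin, open(dst_path, 'w') as fout:
        for header, seq in extract_fasta(fin):
            fout.write(header + '\n' + seq + '\n')
```

Calling `filter_fasta('File.fasta', 'OnlyFastaseq.fasta')` would then write the filtered file. Note that if junk text directly follows a header (with no sequence in between), this sketch would mistake the junk for the sequence, so checking that the line contains only nucleotide letters may be worth adding for messier input.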