user3224522 - 6 months ago
Python Question

Extract a part of fasta sequence based on bp coordinates

I have a huge FASTA file, but I only need to extract part of it, given the start and end base-pair coordinates of my sequence. The output should also be in FASTA format, with 60 bp per line. This is my attempt; please let me know if it looks OK. Any suggestions to improve it are welcome.

from Bio import SeqIO

line_width = 60
# The output file name is a placeholder
with open('full_chr.fa') as inFile, open('part_chr.fa', 'w') as fw:
    for record in SeqIO.parse(inFile, 'fasta'):
        fw.write(">" + record.id + "\n")
        # Slice bounds are 0-based and end-exclusive
        fww = str(record.seq[600130000:602000000])
        for i in range(0, len(fww), line_width):
            fw.write(fww[i:i+line_width] + '\n')


It's as easy as:

from Bio import SeqIO

# SeqIO.read expects a file containing exactly one record
record = SeqIO.read("Chromosome.fas", "fasta")

with open("output.fas", "w") as out:
    SeqIO.write(record[100:500], out, "fasta")
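
Keep in mind that record[100:500] follows Python slice semantics (0-based, end-exclusive), so counting from 1 it returns bases 101 through 500. If your coordinates are 1-based and inclusive, as is common in annotation formats such as GFF (an assumption about your data), a small helper can make the conversion explicit. This is only a sketch; bp_to_slice is a hypothetical name, not part of Biopython.

```python
def bp_to_slice(start_bp, end_bp):
    """Convert 1-based, inclusive bp coordinates to a Python slice.

    Hypothetical helper for illustration; not a Biopython function.
    """
    if start_bp < 1 or end_bp < start_bp:
        raise ValueError("expected 1 <= start_bp <= end_bp")
    # Shift the start down by one; the end stays because slices are end-exclusive
    return slice(start_bp - 1, end_bp)


seq = "ACGTACGTAC"
region = seq[bp_to_slice(3, 6)]  # bases 3 through 6, counting from 1
```

The same slice object should work on a SeqRecord, so record[bp_to_slice(101, 500)] selects the same bases as record[100:500].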

SeqIO.write already wraps sequence lines at 60 characters. If you want a different line width, use the FastaWriter object. This is an example for 80 bp lines:

from Bio import SeqIO
from Bio.SeqIO.FastaIO import FastaWriter

record = SeqIO.read("Chromosome.fas", "fasta")

with open("output.fas", "w") as out:
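    writer = FastaWriter(out, wrap=80)
    # write_file takes an iterable of records, so wrap the slice in a list
    writer.write_file([record[100:500]])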
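
If you would rather not depend on a writer class, the wrapping itself can be sketched in plain Python. This is a minimal illustration, not Biopython's implementation; format_fasta and the header "chr1_part" are made-up names, and the sequence is assumed to be an ordinary string already in memory.

```python
def format_fasta(header, sequence, width=60):
    """Return one FASTA record as text, wrapped at `width` characters per line.

    Hypothetical helper for illustration; not a Biopython function.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    lines = [">" + header]
    # Cut the sequence into consecutive chunks of at most `width` characters
    for i in range(0, len(sequence), width):
        lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"


# A 160-character sequence wrapped at 60 gives lines of 60, 60 and 40 characters
text = format_fasta("chr1_part", "ACGT" * 40, width=60)
```

You would then write the result with out.write(text), passing str(record.seq[100:500]) as the sequence.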