wl284 wl284 - 26 days ago 13
Python Question

How to retrieve all fasta records that contain any of a list of strings in the header

I want a script that can extract all fasta records from a file which contain any of a list of strings in the header.

So if I had a list in one file like this:

2.A.1.13.5
3.A.1.208.23


And a fasta file like this:

>gnl|TC-DB|O60645|1.F.2.1.2 Exocyst complex component 3 OS=Homo sapiens GN=EXOC3 PE=1 SV=2
MQCEDSTSFFTMKETDREAVATAVQRVAGMLQRPDQLDKVEQYRRREARKKASVEARLKA
>gnl|TC-DB|O60669|2.A.1.13.5 Monocarboxylate transporter 2 - Homo sapiens (Human).
MPPMPSAPPVHPPPDGGWGWIVVGAAFISIGFSYAFPKAVTVFFKEIQQIFHTTYSEIAW
>gnl|TC-DB|O60683|3.A.20.1.1 Peroxisome biogenesis factor 10 OS=Homo sapiens GN=PEX10 PE=1 SV=1
MAPAAASPPEVIRAAQKDEYYRGGLRSAAGGALHSLAGARKWLEWRKEVELLSDVAYFGL
>gnl|TC-DB|O60684|1.I.1.1.3 Importin subunit alpha-7 OS=Homo sapiens GN=KPNA6 PE=1 SV=1
METMASPGKDNYRMKSYKNNALNPEEMRRRREEEGIQLRKQKREQQLFKRRNVELINEEA
>gnl|TC-DB|O60706|3.A.1.208.23 ATP-binding cassette sub-family C member 9 OS=Homo sapiens GN=ABCC9 PE=1 SV=2
MSLSFCGNNISSYNINDGVLQNSCFVDALNLVPHVFLLFITFPILFIGWGSQSSKVQIHH
>gnl|TC-DB|O60721|3.A.1.208.23 Sodium/potassium/calcium exchanger 1 OS=Homo sapiens GN=SLC24A1 PE=1 SV=1
MGKLIRMGPQERWLLRTKRLHWSRLLFLLGMLIIGSTYQHLRRPRGLSSLWAAVSSHQPI
>gnl|TC-DB|O60779|2.A.1.13.5 Thiamine transporter 1 (THTR-1) (ThTr1) (Thiamine carrier 1) (TC1) - Homo sapiens (Human).
MDVPGPVSRRAAAAAATVLLRTARVRRECWFLPTALLCAYGFFASLRPSEPFLTPYLLGP


Then the script would print the 2nd, 5th, 6th and 7th records, like so:

>gnl|TC-DB|O60669|2.A.1.13.5 Monocarboxylate transporter 2 - Homo sapiens (Human).
MPPMPSAPPVHPPPDGGWGWIVVGAAFISIGFSYAFPKAVTVFFKEIQQIFHTTYSEIAW
>gnl|TC-DB|O60706|3.A.1.208.23 ATP-binding cassette sub-family C member 9 OS=Homo sapiens GN=ABCC9 PE=1 SV=2
MSLSFCGNNISSYNINDGVLQNSCFVDALNLVPHVFLLFITFPILFIGWGSQSSKVQIHH
>gnl|TC-DB|O60721|3.A.1.208.23 Sodium/potassium/calcium exchanger 1 OS=Homo sapiens GN=SLC24A1 PE=1 SV=1
MGKLIRMGPQERWLLRTKRLHWSRLLFLLGMLIIGSTYQHLRRPRGLSSLWAAVSSHQPI
>gnl|TC-DB|O60779|2.A.1.13.5 Thiamine transporter 1 (THTR-1) (ThTr1) (Thiamine carrier 1) (TC1) - Homo sapiens (Human).
MDVPGPVSRRAAAAAATVLLRTARVRRECWFLPTALLCAYGFFASLRPSEPFLTPYLLGP


Here's the kind of approach I was taking but I could be miles off as not really sure what I'm doing. I've been told BioPython is good for handling fasta format files but still trying to get to grasps with it.

from Bio import SeqIO
import sys

headers = []

search_list = open(sys.argv[1])
for line in search_list:
headers.append(line.rstrip())
search_list.close()

print search_list

for seq_record in SeqIO.parse(sys.argv[2], "fasta"):
#print seq_record
for a in headers:
head = str(a)
if head in seq_record:
print seq_record


Cheers in advance for any help!

Answer

Your code has almost everything. You just need to check if the header is in the description, then print the SeqRecord elements and you are done.

from Bio import SeqIO
import sys

headers = list()

with open(sys.argv[1], 'r') as search_list:
    for line in search_list:
        headers.append(line.rstrip())

for seq_record in SeqIO.parse(sys.argv[2], "fasta"):
    for head in headers:
        if head in seq_record.description:
            print(seq_record.format('fasta'))
            break

Small note: I added a break statement after the first found element, in case two headers are found it would print it twice.

Comments