jds jds - 3 months ago 12
Python Question

Merge two large text files by common row to one mapping file

I have two text files that have similar formatting. The first (732KB):


The second (5.26GB):

>Stool268_1 HWI-ST155_0605:1:1101:1194:2070#CTGTCTCTCCTA

Note the key difference is the header for each entry (lib_1749 vs. Stool268_1). What I need is to create a mapping file between the headers of one file and the headers of the second using the sequence (e.g.,
) as a key.

Note as one final complication the mapping is not going to be 1-to-1 there will be multiple entries of the form Stool****** for each entry of lib****. This is because the length of the key in the first file was trimmed to have 200 characters but in the second file it can be longer.

For smaller files I would just do something like this in python but I often have trouble because these files are so big and cannot be read into memory at one time. Usually I try unix utilities but in this case I cannot think of how to accomplish this.

Thank you!


BioPython should be able to read in large FASTA files.

from Bio import SeqIO
from collections import defaultdict

mapping = defaultdict(list)

for stool_record in SeqIO.parse('stool.fasta', 'fasta'):
    stool_seq = str(stool_record.seq)

    for lib_record in SeqIO.parse('libs.fasta', 'fasta'):
        lib_seq = str(lib_record.seq)

        if stool_seq.startswith(lib_seq):