seandavi - 7 months ago
Python Question

Very large input and piping using subprocess.Popen

I have a pretty simple problem. I have a large file that goes through three steps: a decoding step using an external program, some processing in Python, and then re-encoding using another external program. I have been using subprocess.Popen() to do this in Python rather than building Unix pipes in the shell. However, all the data are buffered in memory. Is there a Pythonic way of doing this task, or am I better off falling back to a simple Python script that reads from stdin and writes to stdout, with Unix pipes on either side?

    import os, sys, subprocess

    def main(infile,reflist):
        print infile,reflist
        samtoolsin = subprocess.Popen(["samtools","view",infile],
                                      stdout=subprocess.PIPE,bufsize=1)
        samtoolsout = subprocess.Popen(["samtools","import",reflist,"-",
                                        infile+".tmp"],stdin=subprocess.PIPE,bufsize=1)
        for line in samtoolsin.stdout.read():
            if(line.startswith("@")):
                samtoolsout.stdin.write(line)
            else:
                linesplit = line.split("\t")
                if(linesplit[10]=="*"):
                    linesplit[9]="*"
                samtoolsout.stdin.write("\t".join(linesplit))

Answer

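Try this small change and see whether memory use improves. Calling samtoolsin.stdout.read() pulls the decoder's entire output into memory as one string, and the for loop then walks that string one character at a time instead of one line at a time. Iterating over the pipe object itself yields one line at a time as the data arrives:

    for line in samtoolsin.stdout: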
        if(line.startswith("@")):
            samtoolsout.stdin.write(line)
        else:
            linesplit = line.split("\t")
            if(linesplit[10]=="*"):
                linesplit[9]="*"
            samtoolsout.stdin.write("\t".join(linesplit))
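For reference, here is a minimal sketch of the whole streaming pipeline. It is written for Python 3.7+ (an assumption; the question's code is Python 2), and the names fix_record and stream_through are hypothetical helpers, not part of subprocess or samtools. It also closes the encoder's stdin when the input runs out, which the loop above does not show, so that the downstream program sees end-of-file and can finish.

```python
import subprocess
import sys


def fix_record(line):
    """Apply the per-line edit from the question to one line of text.

    Header lines (starting with "@") pass through unchanged. For other
    lines, if the 11th tab-separated field is "*", the 10th is set to "*".
    Unlike the original loop, the trailing newline is stripped before
    splitting, so the comparison also works when field 11 is the last one.
    """
    if line.startswith("@"):
        return line
    fields = line.rstrip("\n").split("\t")
    if fields[10] == "*":
        fields[9] = "*"
    return "\t".join(fields) + "\n"


def stream_through(decode_cmd, encode_cmd, transform=fix_record):
    """Pipe decode_cmd's stdout through transform into encode_cmd's stdin.

    Hypothetical helper: both commands are argument lists as accepted by
    subprocess.Popen. Only one line is held in Python at a time.
    """
    decoder = subprocess.Popen(decode_cmd, stdout=subprocess.PIPE, text=True)
    encoder = subprocess.Popen(encode_cmd, stdin=subprocess.PIPE, text=True)
    for line in decoder.stdout:  # iterate the pipe, do not call .read()
        encoder.stdin.write(transform(line))
    encoder.stdin.close()  # signal end-of-file to the encoder
    decoder.stdout.close()
    # Return both exit codes so the caller can check for failures.
    return decoder.wait(), encoder.wait()
```

With the commands from the question, the call would look like stream_through(["samtools", "view", infile], ["samtools", "import", reflist, "-", infile + ".tmp"]). The same fix_record function could equally be used in a small filter script that loops over sys.stdin and writes to sys.stdout, if you prefer to keep the piping in the shell.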