Jamie Leigh - 15 days ago
Python Question

reading a file outside a function to iterate through

I have created a function that I want to run over an entire file, but I am having some trouble: I am only getting output from the last line of the file.

I have two input files. The idea is to take the lines from one file, collect certain terms and add them to a dictionary, then search the second file for the corresponding lines and print the output. I know the problem is most likely the placement of my function call.

The matrix file looks like this

Sp_ds Sp_hs Sp_log Sp_plat
c3833_g1_i2 4.00 0.07 16.84 26.37
c4832_g1_i1 24.55 116.87 220.53 28.82
c5161_g1_i1 107.49 89.39 26.95 698.97
c4399_g1_i2 27.91 72.57 5.56 36.58
c5916_g1_i1 82.57 19.03 48.55 258.22


The Blast file looks like this

c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00 372 0 0 1 372 1 372 0.0 754
c1000_g1_i1|m.799 gi|48474761|sp|O94288.1|NOC3_SCHPO 100.00 747 0 0 5 751 1 747 0.0 1506
c1001_g1_i1|m.800 gi|259016383|sp|O42919.3|RT26A_SCHPO 100.00 268 0 0 1 268 1 268 0.0 557
c1002_g1_i1|m.801 gi|1723464|sp|Q10302.1|YD49_SCHPO 100.00 646 0 0 1 646 1 646 0.0 1310
c1003_g1_i1|m.803 gi|74631197|sp|Q6BDR8.1|NSE4_SCHPO 100.00 246 0 0 1 246 1 246 1e-179 502
c1004_g1_i1|m.804 gi|74676184|sp|O94325.1|PEX5_SCHPO 100.00 598 0 0 1 598 1 598 0.0 1227
c1005_g1_i1|m.805 gi|9910811|sp|O42832.2|SPB1_SCHPO 100.00 802 0 0 1 802 1 802 0.0 1644
c1006_g1_i1|m.806 gi|74627042|sp|O94631.1|MRM1_SCHPO 100.00 255 0 0 1 255 47 301 0.0 525
c1007_g1_i1|m.807 gi|20137702|sp|O74370.1|ISY1_SCHPO 100.00 201 0 0 1 201 1 201 4e-146 412


The program I have so far is this

def parse_blast(blast_line="NA"):
    transcript = blast_line[0][0]
    swissProt = blast_line[1][3]
    return(transcript, swissProt)

blast = open("/scratch/RNASeq/blastp.outfmt6")
for line in blast:
    line= [item.split('|') for item in line.split()]
    (transcript, swissProt) = parse_blast(blast_line = line)


transcript_to_protein = {}
transcript_to_protein[transcript] = swissProt
if transcript in transcript_to_protein:
    protein = transcript_to_protein.get(transcript)

matrix = open("/scratch/RNASeq/diffExpr.P1e-3_C2.matrix")
for line in matrix:
    matrixFields = line.rstrip("\n").split("\t")
    transcript = matrixFields[0]
    Sp_ds = matrixFields[1]
    Sp_hs = matrixFields[2]
    Sp_log = matrixFields[3]
    Sp_plat = matrixFields[4]

tab = "\t"
fields = (protein,Sp_ds,Sp_hs,Sp_log,Sp_plat)
out = open("parsed_blast.txt","w")
out.write(tab.join(fields))
matrix.close()
blast.close()
out.close()

Answer

It's an indentation problem: the only statements inside your first loop are these two.

for line in blast:
    line = [item.split('|') for item in line.split()]
    (transcript, swissProt) = parse_blast(blast_line = line)

So you keep looping until the last line without saving the values you get, and only the values from the last line are left when the loop ends. I think you should change your indentation to this

transcript_to_protein = {}  # 1. declare the dictionary

for line in blast:
    line = [item.split('|') for item in line.split()]
    (transcript, swissProt) = parse_blast(blast_line = line)
    transcript_to_protein[transcript] = swissProt  # 2. add the data to the dictionary

This will solve the problem with your first file, but not with your second, as you don't use the dictionary inside the second loop.
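To see the first fix in isolation, here is a minimal sketch that uses two lines copied from the Blast sample in the question instead of opening the real file. The list blast_lines and the name swiss_prot are only there for illustration; they are not part of your program.

```python
# Two sample lines copied from the Blast file shown in the question.
blast_lines = [
    "c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00 372 0 0 1 372 1 372 0.0 754",
    "c1000_g1_i1|m.799 gi|48474761|sp|O94288.1|NOC3_SCHPO 100.00 747 0 0 5 751 1 747 0.0 1506",
]

def parse_blast(blast_line):
    # blast_line is a list of lists: each whitespace-separated
    # field of the line, split again on '|'.
    transcript = blast_line[0][0]
    swiss_prot = blast_line[1][3]
    return transcript, swiss_prot

transcript_to_protein = {}
for line in blast_lines:
    fields = [item.split('|') for item in line.split()]
    transcript, swiss_prot = parse_blast(fields)
    # Stored on every iteration, so no line is lost.
    transcript_to_protein[transcript] = swiss_prot

print(transcript_to_protein)
# {'c0_g1_i1': 'Q9HGP0.1', 'c1000_g1_i1': 'O94288.1'}

# After the loop the plain variable only holds the last value.
print(transcript)  # c1000_g1_i1
```

The dictionary keeps one entry per line, while transcript on its own only remembers the last line, which is the behaviour you were seeing.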

So you have to move these lines inside the second loop

if transcript in transcript_to_protein:
    protein = transcript_to_protein.get(transcript)
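One thing to watch once that check sits inside the loop: when a transcript is not in the dictionary, protein is not reassigned and still holds the value from the previous row. A small sketch of the pitfall and one possible way around it follows. The IDs and protein names here are made up purely for illustration, and skipping unmatched rows is an assumption about what you want.

```python
# Hypothetical mapping and IDs, made up for illustration only.
transcript_to_protein = {"t1": "P1", "t3": "P3"}
matrix_ids = ["t1", "t2", "t3"]

stale = []
for transcript in matrix_ids:
    if transcript in transcript_to_protein:
        protein = transcript_to_protein.get(transcript)
    # "t2" has no entry, so it silently reuses P1 from the previous row.
    stale.append(protein)

safe = []
for transcript in matrix_ids:
    # dict.get returns None when the key is missing.
    protein = transcript_to_protein.get(transcript)
    if protein is None:
        continue  # assumption: rows without a Blast hit are skipped
    safe.append(protein)

print(stale)  # ['P1', 'P1', 'P3']
print(safe)   # ['P1', 'P3']
```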

I think you get the idea. I will leave the rest for you to do: a few lines need to be moved before the loops, and one or two need to go inside the second loop.
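If you want something to compare your result against, here is one possible sketch of the whole thing, not the only way to do it. It rests on a few assumptions: the helper names build_lookup and write_matches are hypothetical, the matrix is tab-separated (as your split("\t") suggests), rows with fewer than five fields (such as the header) are skipped, transcripts without a Blast hit are skipped, and each output row ends with a newline. The sample uses in-memory text in place of the real files; the Blast line is copied from the question, but the matrix rows are invented so that one ID matches.

```python
import io

def parse_blast(blast_line):
    """Return (transcript, SwissProt ID) from one pre-split Blast line."""
    transcript = blast_line[0][0]
    swiss_prot = blast_line[1][3]
    return transcript, swiss_prot

def build_lookup(blast):
    """Map every transcript ID in the Blast file to its SwissProt ID."""
    transcript_to_protein = {}
    for line in blast:
        if not line.strip():
            continue  # assumption: ignore blank lines
        fields = [item.split("|") for item in line.split()]
        transcript, swiss_prot = parse_blast(fields)
        transcript_to_protein[transcript] = swiss_prot
    return transcript_to_protein

def write_matches(matrix, transcript_to_protein, out):
    """Write one tab-separated row per matrix line that has a Blast hit."""
    for line in matrix:
        matrix_fields = line.rstrip("\n").split("\t")
        if len(matrix_fields) < 5:
            continue  # assumption: header or malformed row
        protein = transcript_to_protein.get(matrix_fields[0])
        if protein is None:
            continue  # assumption: skip transcripts with no Blast hit
        out.write("\t".join([protein] + matrix_fields[1:5]) + "\n")

# In-memory stand-ins for the two input files and the output file.
blast = io.StringIO(
    "c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO "
    "100.00 372 0 0 1 372 1 372 0.0 754\n"
)
matrix = io.StringIO(
    "Sp_ds\tSp_hs\tSp_log\tSp_plat\n"
    "c0_g1_i1\t1.00\t2.00\t3.00\t4.00\n"      # invented values
    "c9999_g1_i1\t5.00\t6.00\t7.00\t8.00\n"   # invented, no Blast hit
)
out = io.StringIO()

write_matches(matrix, build_lookup(blast), out)
print(out.getvalue(), end="")
```

With the real files you would pass the handles from open() instead of the StringIO objects, ideally inside a with statement so they are closed for you, and open parsed_blast.txt for writing once, before the second loop, not after it.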
