DNAngel DNAngel -4 years ago 69
Python Question

Concatenate multiple text files of DNA sequences in Python or R?

I was wondering how to concatenate exon/DNA fasta files using python or R.

Example files:



So far I really liked using R ape package for the cbind method, solely because of the
fill.with.gaps=TRUE
attribute. I really need gaps inserted when a species is missing an exon.

my code:

ex1 <- read.dna("exon1.txt", format="fasta")
ex2 <- read.dna("exon2.txt", format="fasta")
output <- cbind(ex1, ex2, fill.with.gaps=TRUE)
write.dna(output, "Output.txt", format="fasta")


Example:

exon1.txt

>sp1
AAAA
>sp2
CCCC


exon2.txt

>sp1
AGG-G
>sp2
CTGAT
>sp3
CTTTT


Output file:

>sp1
AAAAAGG-G
>sp2
CCCCCTGAT
>sp3
----CTTTT


So far I am having trouble trying to apply this technique when I have multiple exon files (trying to figure out a loop to open and execute the cbind method for all files ending with .fa in the directory), and sometimes not all files have exons that are all identical in length - hence DNAbin stops working.

So far I have:

file_list <- list.files(pattern=".fa")

myFunc <- function(x) {
for (file in file_list) {
x <- read.dna(file, format="fasta")
out <- cbind(x, fill.with.gaps=TRUE)
write.dna(out, "Output.txt", format="fasta")
}
}


However when I run this and I check my output text file, it misses many exons and I think that is because not all files have the same exon length...or my script is failing somewhere and I can't figure it out :(

Any ideas? I can also try python.

Answer Source

I just came out with this answer in Python 3:

def read_fasta(fasta): #Function that reads the files
  output = {}
  for line in fasta.split("\n"):
    line = line.strip()
    if not line:
      continue
    if line.startswith(">"):
      active_sequence_name = line[1:]
      if active_sequence_name not in output:
        output[active_sequence_name] = []
      continue
    sequence = line
    output[active_sequence_name].append(sequence)
  return output

with open("exon1.txt", 'r') as file: # read exon1.txt
  file1 = read_fasta(file.read())
with open("exon2.txt", 'r') as file: # read exon2.txt
  file2 = read_fasta(file.read())

finaldict = {}                                     #Concatenate the
for i in list(file1.keys()) + list(file2.keys()):  #both files content
  if i not in file1.keys():
    file1[i] = ["-" * len(file2[i][0])]
  if i not in file2.keys():
    file2[i] = ["-" * len(file1[i][0])]
  finaldict[i] = file1[i] + file2[i]

with open("output.txt", 'w') as file:  # output that in file 
  for k, i in finaldict.items():       # named output.txt
    file.write(">{}\n{}\n".format(k, "".join(i))) #proper formatting

It's pretty hard to comment and explain it completely, and it might not help you, but this is better than nothing :P

I used Ɓukasz Rogalski's code from answer to Reading a fasta file format into Python dict.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download