RicardoC RicardoC - 10 days ago 6
R Question

Applying an R script to multiple files

I have an R script that reads a certain type of file (nexus files of phylogenetic trees), whose name ends in *.trees.txt. It then applies a number of functions from an R package called bGMYC, available here and creates 3 pdf files. I would like to know what I should do to make the script loop through the files for each of 14 species.

The input files are in a separate folder for each species, but I can put them all in one folder if that facilitates the task. Ideally, I would like to output the pdf files to a folder for each species, different from the one containing the input file.

Here's the script

# Call Tree file
trees <- read.nexus("L_boscai_1411_test2.trees.txt")

# To use with different species, substitute "L_boscai_1411_test2.trees.txt" by the path to each species tree

#Store the number of tips of the tree
ntips <- length(trees$tip.label[[1]])

#Apply bgmyc.single
results.single <- bgmyc.singlephy(trees[[1]], mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))

#Create the 1st pdf
pdf('results_single_boscai.pdf')
plot(results.single)
dev.off()

#Sample 50 trees
n <- sample(1:length(trees), 50)
trees.sample <- trees[n]

#Apply bgmyc.multiphylo
results.multi <- bgmyc.multiphylo(trees.sample, mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))

#Create 2nd pdf
pdf('results_boscai.pdf') # Substitute 'results_boscai.pdf' by "*speciesname.pdf"
plot(results.multi)
dev.off()

#Apply bgmyc.spec and spec.probmat
results.spec <- bgmyc.spec(results.multi)
results.probmat <- spec.probmat(results.multi)

#Create 3rd pdf
pdf('trees_boscai.pdf') # Substitute 'trees_boscai.pdf' by "trees_speciesname.pdf"
for (i in 1:50) plot(results.probmat, trees.sample[[i]])
dev.off()


I've read several posts with a similar question, but they almost always involve .csv files, refer to multiple files in a single folder, have a simpler script or do not need to output files to separate folders, so I couldn't find a solution to my specific problem.

Shsould I use a
for
loop or could I create a function out of this script and use
lapply
or another sort of
apply
? Could you provide me with sample code for your proposed solution or point me to a tutorial or another reference?

Thanks for your help.

Answer

It really depends on the way you want to run it. If you are using linux / command line job submission, it might be best to look at

How can I read command line parameters from an R script?

If you are using GUI (Rstudio...) you might not be familiar with this, so I would solve the problem as a function or a loop.

First, get all your file names.

files = list.files(path = "your/folder")
# Now you have list of your file name as files. Just call each name one at a time
# and use for loop or apply (anything of your choice)

And since you would need to name pdf files, you can use your file name or index (e.g loop counter) and append to the desired file name. (e.g. paste("single_boscai", "i"))

In your case,

 files = list.files(path = "your/folder")
    # Use pattern = "" if you want to do string matching, and extract
    # only matching files from the source folder.

    genPDF = function(input) {
      # Read the file
      trees <- read.nexus(input)
      # Store the index (numeric)
      index = which(files == input)

      #Store the number of tips of the tree
      ntips <- length(trees$tip.label[[1]])

      #Apply bgmyc.single
      results.single <- bgmyc.singlephy(trees[[1]], mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))

      #Create the 1st pdf
      outname = paste('results_single_boscai', index, '.pdf', sep = "")
      pdf(outnam)
      plot(results.single)
      dev.off()

      #Sample 50 trees
      n <- sample(1:length(trees), 50)
      trees.sample <- trees[n]

      #Apply  bgmyc.multiphylo  
      results.multi <- bgmyc.multiphylo(trees.sample, mcmc=150000, burnin=40000, thinning=100, t1=2, t2=ntips, start=c(1,1,ntips/2))

      #Create 2nd pdf
      outname = paste('results_boscai', index, '.pdf', sep = "")
      pdf(outname)     # Substitute 'results_boscai.pdf' by "*speciesname.pdf"
      plot(results.multi)
      dev.off()

      #Apply bgmyc.spec and spec.probmat 
      results.spec <- bgmyc.spec(results.multi)
      results.probmat <- spec.probmat(results.multi)

      #Create 3rd pdf
      outname = paste('trees_boscai', index, '.pdf', sep = "")
      pdf(outname)     # Substitute 'trees_boscai.pdf' by "trees_speciesname.pdf"
      for (i in 1:50) plot(results.probmat, trees.sample[[i]])
      dev.off()
    }


    for (i in 1:length(files)) {
      genPDF(files[i])
    }