Alex - 5 months ago

Question

Perform a function on multiple CSV files iteratively

I have a lot of CSV files stored in a folder, i.e. file1.csv, file2.csv, file3.csv, etc.
Each CSV file contains the same measurements for each object.
A file looks like this:

ID  time  measurement1  measurement2  measurement3
1   5     12            324           123
1   6     123           654           45
1   3     346           556           548


another one looks like this:

ID  time  measurement1  measurement2  measurement3
2   2     234           345           253
2   8     35            998           316
2   17    515           1005          323
2   50    156           155           616


And so on. In addition, I have a data frame with several calculations I want to perform for each of my objects (files), which looks like this:

calc <- data.frame(mean1 = mean(measurement1), var1 = var(measurement1),
                   sd1 = sd(measurement1), mean2 = mean(measurement2),
                   var2 = var(measurement2), sd2 = sd(measurement2))


Etc. What I want to do is find a way to read each CSV file iteratively and perform these calculations for each object. Finally, I would like to either export the results to a separate CSV file (to have the information I need in one place), or print them in the R console and copy them from there to a text or Excel file.
I am working in R.
Can anyone provide any help with this?
Thank you!

Answer

Something like this:

library(dplyr)

Option 1: Read in all files, combine into a single data frame and then summarize

With this method, all the data files are loaded into an R list.

# Read every CSV in the working directory into a named list of data frames
dat = sapply(list.files(pattern="csv$"), function(file) {
  df = read.csv(file, stringsAsFactors=FALSE, header=TRUE)
  df$source = file   # record which file each row came from
  df
}, simplify=FALSE)

dat = bind_rows(dat)
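An equivalent sketch of the read-and-combine step using purrr (assuming that package is installed): `map_dfr` reads each file and row-binds the results in one call, with `.id = "source"` recording the file each row came from. The sample-file setup here is illustration only.

```r
library(purrr)

# Create two small sample files in a temporary folder (illustration only)
dir <- file.path(tempdir(), "csv_demo")
dir.create(dir, showWarnings = FALSE)
write.csv(data.frame(ID = 1, time = c(5, 6), measurement1 = c(12, 123)),
          file.path(dir, "file1.csv"), row.names = FALSE)
write.csv(data.frame(ID = 2, time = c(2, 8), measurement1 = c(234, 35)),
          file.path(dir, "file2.csv"), row.names = FALSE)

files <- list.files(dir, pattern = "csv$", full.names = TRUE)

# Read every file and row-bind into one data frame;
# .id = "source" stores the file path each row came from
dat <- map_dfr(set_names(files), read.csv, .id = "source")
```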

Summarize by ID:

dat.summary = dat %>% group_by(ID) %>%
  summarise_each(funs(mean(., na.rm=TRUE), var(., na.rm=TRUE), sd(., na.rm=TRUE)),
                 -time, -source)

Or in the newer dplyr idiom:

dat.summary = dat %>% group_by(ID) %>%
  summarise_at(vars(matches("measurement")),
               list(mean = ~mean(., na.rm=TRUE), var = ~var(., na.rm=TRUE),
                    sd = ~sd(., na.rm=TRUE)))
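In dplyr 1.0 and later, the scoped verbs like `summarise_at` are themselves superseded by `across()`. A sketch of the same summary; the small data frame here only stands in for the combined `dat`:

```r
library(dplyr)

# Illustrative combined data frame standing in for `dat`
dat <- data.frame(ID = c(1, 1, 2, 2),
                  time = c(5, 6, 2, 8),
                  measurement1 = c(12, 123, 234, 35),
                  measurement2 = c(324, 654, 345, 998))

# across() applies the named function list to every matching column;
# result columns are named measurement1_mean, measurement1_var, etc.
dat.summary <- dat %>%
  group_by(ID) %>%
  summarise(across(matches("measurement"),
                   list(mean = ~mean(.x, na.rm = TRUE),
                        var  = ~var(.x, na.rm = TRUE),
                        sd   = ~sd(.x, na.rm = TRUE))))
```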

Option 2: Read and summarize each individual file, then bind the individual summaries into a single summary data frame

This way you load only one data file into memory at a time.

dat.summary = sapply(list.files(pattern="csv$"), function(file) {
  df = read.csv(file, stringsAsFactors=FALSE, header=TRUE)

  # Summarise by ID
  df %>% group_by(ID) %>%
    summarise_at(vars(matches("measurement")),
                 funs(mean(., na.rm=TRUE), var(., na.rm=TRUE), sd(., na.rm=TRUE)))
}, simplify=FALSE)   # keep the per-file summaries as a list for bind_rows

dat.summary = bind_rows(dat.summary)

Now save the summary:

write.csv(dat.summary, "my_summary.csv", row.names=FALSE)

or

library(xlsx)

write.xlsx(dat.summary, "my_summary.xlsx", row.names=FALSE)
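Note that the xlsx package depends on Java. If that is a problem, openxlsx (assuming it is installed) is a Java-free alternative with a similar call; the small data frame here only stands in for your real `dat.summary`:

```r
library(openxlsx)

# Illustrative summary standing in for dat.summary
dat.summary <- data.frame(ID = 1:2, measurement1_mean = c(67.5, 134.5))

out <- file.path(tempdir(), "my_summary.xlsx")
write.xlsx(dat.summary, out)   # openxlsx::write.xlsx, no Java required
```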