jrzelling jrzelling - 3 months ago 25
R Question

Merge entire directory of data frames by common column name

All the data frames have a common element

"hostname"
but when I try this line I get the error:

temp.table <- merge(all.data.frames[1], all.data.frames[2], by = "hostname", all = TRUE)


Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column


Here is my code that reads the directory to list all the data frames and I'm trying to merge them all on a common variable:

setwd("C:/temp")
all.files <- ls()
all.data.frames <- all.files[sapply(all.files, function(x) is.data.frame(get(x)))]


This makes a list equivalent to:

c("data.frame1.dt", "data.frame2.dt", "data.frame3.dt", ...)


Here is some test data:

data.frame1.dt <- cbind("hostname" = c("a", "b", "c"), "username" = c("dkfj", "adfkjds", "adklfh"), "tele" = c(12,43,56))

data.frame2.dt <- cbind("hostname" = c("d", "b", "q"), "username" = c("dkfjdfs", "adfdtkjds", "adasdklfh"), "scan" = c(132,403,546))

data.frame3.dt <- cbind("hostname" = c("a", "q", "x", "Ip1"), "timer" = c("Log1", "Log3", "Log5", "LAG"), "cpu" = c("1343232-a","4354342-03", "5dfasd46", "Nothing"))


The merging of the data works like this:

xx <- merge(data.frame1.dt, data.frame2.dt, by = "hostname", all = TRUE)

total.join <- merge(xx, data.frame3.dt, by = "hostname", all = TRUE)


Giving the result:

hostname username.x tele username.y scan timer cpu
1 a dkfj 12 <NA> <NA> Log1 1343232-a
2 b adfkjds 43 adfdtkjds 403 <NA> <NA>
3 c adklfh 56 <NA> <NA> <NA> <NA>
4 d <NA> <NA> dkfjdfs 132 <NA> <NA>
5 q <NA> <NA> adasdklfh 546 Log3 4354342-03
6 Ip1 <NA> <NA> <NA> <NA> LAG Nothing
7 x <NA> <NA> <NA> <NA> Log5 5dfasd46


How can I do this merge on the entire directory of data.frame files?

Pj_ Pj_
Answer

You haven't created data frames properly. Change from cbind() to data.frame() and you will get the desired result.

typeof(data.frame1.dt) will give you an idea of what mistake you are making.

data.frame1.dt <- data.frame(hostname = c("a", "b", "c"), username = c("dkfj", "adfkjds", "adklfh"), tele = c(12,43,56))

data.frame2.dt <- data.frame(hostname = c("d", "b", "q"), username = c("dkfjdfs", "adfdtkjds", "adasdklfh"), scan = c(132,403,546))

data.frame3.dt <- data.frame(hostname = c("a", "q", "x", "Ip1"), timer = c("Log1", "Log3", "Log5", "LAG"), cpu = c("1343232-a","4354342-03", "5dfasd46", "Nothing"))


xx <- merge(data.frame1.dt, data.frame2.dt, by = "hostname", all = TRUE)

total.join <- merge(xx, data.frame3.dt, by = "hostname", all = TRUE)