bshor bshor - 1 month ago 18
R Question

Simultaneously merge multiple data.frames in a list

I have a list of many data.frames that I want to merge. The issue here is that each data.frame differs in terms of the number of rows and columns, but they all share the key variables (which I've called

"var1"
and
"var2"
in the code below). If the data.frames were identical in terms of columns, I could merely
rbind
, for which plyr's rbind.fill would do the job, but that's not the case with these data.

Because the
merge
command only works on 2 data.frames, I turned to the Internet for ideas. I got this one from here, which worked perfectly in R 2.7.2, which is what I had at the time:

merge.rec <- function(.list, ...){
if(length(.list)==1) return(.list[[1]])
Recall(c(list(merge(.list[[1]], .list[[2]], ...)), .list[-(1:2)]), ...)
}


And I would call the function like so:

df <- merge.rec(my.list, by.x = c("var1", "var2"),
by.y = c("var1", "var2"), all = T, suffixes=c("", ""))


But in any R version after 2.7.2, including 2.11 and 2.12, this code fails with the following error:

Error in match.names(clabs, names(xi)) :
names do not match previous names


(Incidently, I see other references to this error elsewhere with no resolution).

Is there any way to solve this?

Answer

Reduce makes this fairly easy:

merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)

Here's a fully example using some mock data:

set.seed(1)
list.of.data.frames = list(data.frame(x=1:10, a=1:10), data.frame(x=5:14, b=11:20), data.frame(x=sample(20, 10), y=runif(10)))
merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)
tail(merged.data.frame)
#    x  a  b         y
#12 12 NA 18        NA
#13 13 NA 19        NA
#14 14 NA 20 0.4976992
#15 15 NA NA 0.7176185
#16 16 NA NA 0.3841037
#17 19 NA NA 0.3800352

And here's an example using these data to replicate my.list:

merged.data.frame = Reduce(function(...) merge(..., by=match.by, all=T), my.list)
merged.data.frame[, 1:12]

#  matchname party st district chamber senate1993 name.x v2.x v3.x v4.x senate1994 name.y
#1   ALGIERE   200 RI      026       S         NA   <NA>   NA   NA   NA         NA   <NA>
#2     ALVES   100 RI      019       S         NA   <NA>   NA   NA   NA         NA   <NA>
#3    BADEAU   100 RI      032       S         NA   <NA>   NA   NA   NA         NA   <NA>

Note: It looks like this is arguably a bug in merge. The problem is there is no check that adding the suffixes (to handle overlapping non-matching names) actually makes them unique. At a certain point it uses [.data.frame which does make.unique the names, causing the rbind to fail.

# first merge will end up with 'name.x' & 'name.y'
merge(my.list[[1]], my.list[[2]], by=match.by, all=T)
# [1] matchname    party        st           district     chamber      senate1993   name.x      
# [8] votes.year.x senate1994   name.y       votes.year.y
#<0 rows> (or 0-length row.names)
# as there is no clash, we retain 'name.x' & 'name.y' and get 'name' again
merge(merge(my.list[[1]], my.list[[2]], by=match.by, all=T), my.list[[3]], by=match.by, all=T)
# [1] matchname    party        st           district     chamber      senate1993   name.x      
# [8] votes.year.x senate1994   name.y       votes.year.y senate1995   name         votes.year  
#<0 rows> (or 0-length row.names)
# the next merge will fail as 'name' will get renamed to a pre-existing field.

Easiest way to fix is to not leave the field renaming for duplicates fields (of which there are many here) up to merge. Eg:

my.list2 = Map(function(x, i) setNames(x, ifelse(names(x) %in% match.by,
      names(x), sprintf('%s.%d', names(x), i))), my.list, seq_along(my.list))

The merge/Reduce will then work fine.