gdy gdy - 12 days ago 6
R Question

Merge select columns from multiple tables using common identifiers in R

I would like to combine (merge) select columns from multiple tables with following organization.

Here's two datasets as examples that I want to combine

"dataset1"

A B C D E F (header)

1 2 3 4 5 F1(1st row)

6 7 8 9 10 F2(2nd row)

11 12 13 14 15 F3 (3rd row)

....

"dataset2"

A B C D E F (header)

16 17 18 19 20 F1(1st row)

21 22 23 24 25 F2(2nd row)

26 27 28 29 30 F3 (3rd row)

....


Here, header for all different datasets (I have more than 100 datasets) are identical, and I want to use names in F columns (F1, F2, F3...more than F200) as unique identifier.
For example, If I combine column "A" from all different datasets using column F as identifier, the results should look like this. Also to distinguish where the data come from, header also needs to be changed to dataset ID.

dataset1 dataset2 F (header)

1 16 F1 (1st row)

6 21 F2 (2nd row)

11 26 F3 (3rd row)

....


Note that all datasets I have contain different numbers of row, so that some data point values corresponding to F1~F200 could be missing. in this case I want to put NA or leave it as empty.

To this end, I tried following code

x <- merge(dataset1, dataset2, by="F", all=T)


But this way, I cannot extract only column A, rather it merges evert columns.

Similarly, I tried also

x <- Reduce(function(x, y) merge(x, y, all=TRUE, by=("F")), list(dataset1, dataset2))


This gave me actually identical results as previous code. To further extract only column A using this code, I tried following one, but did not worked.

x <- Reduce(function(x, y) merge(x, y, all=TRUE, by=("F")), list(dataset1[,1], dataset2[,1))


And I have no idea how to change name of header into the name of data set which came from.

Please understand I just started to learn R basics.

I'm using RStudio 0.98507 and currently all datasets (more than hundred) were loaded and in present in "Global Environment"

Thank you very much!

Answer

Here's one solution with the following four sample data frames:

dataset1 <- data.frame(A = c(1, 6, 11), 
                       B = c(2, 7, 12), 
                       C = c(3, 8, 12), 
                       D = c(4, 9, 13), 
                       E = c(5, 10, 14), 
                       F = c("F1", "F2", "F3"))

dataset2 <- data.frame(A = c(16, 21, 26),
                       B = c(17, 22, 27), 
                       C = c(18, 23, 28),
                       D = c(19, 24, 29),
                       E = c(20, 25, 30), 
                       F = c("F1", "F2", "F3"))

dataset3 <- data.frame(A = c(30, 61),
                       B = c(57, 90), 
                       C = c(38, 33),
                       D = c(2, 16),
                       E = c(77, 25), 
                       F = c("F1", "F2"))

dataset4 <- data.frame(A = c(36, 61),
                       B = c(47, 30), 
                       C = c(37, 33),
                       D = c(45, 10),
                       E = c(66, 29), 
                       F = c("F1", "F2"))

First combine them into a list:

datasets <- list(dataset1, dataset2, dataset3, dataset4)

Then rename all the columns except the F column. This is because later when we merge the data frames together, if the columns all have the same names then merge will try to differentiate them by adding .x or .y to the names -- which is fine when you're only merging two data sets, but gets confusing with more than two.

for (i in seq_along(datasets)) {
  for (j in seq_along(colnames(datasets[[i]]))) {
    if (colnames(datasets[[i]])[j] != "F") {
      colnames(datasets[[i]])[j] <- paste(colnames(datasets[[i]])[j], i, sep = ".")
    }
  }
}

This gives us data frames whose column headers look like this:

datasets[[1]]
##   A.1 B.1 C.1 D.1 E.1  F
## 1   1   2   3   4   5 F1
## 2   6   7   8   9  10 F2
## 3  11  12  12  13  14 F3

Then use Reduce:

df <- Reduce(function(x, y) merge(x, y, all = TRUE, by = "F"), datasets)

And select the columns you want, in this case all the columns with A in the column name:

df[, c("F", grep("A", names(df), value = TRUE))]
##    F A.1 A.2 A.3 A.4
## 1 F1   1  16  30  36
## 2 F2   6  21  61  61
## 3 F3  11  26  NA  NA
Comments