lll lll - 1 month ago 3
R Question

R: why when merging data - got a result that is larger than both of the data

I have two data sets like the following:


T306 986566
U078 94333
T324 97170
R354 97489
T538 987411
R141 94334


A001 2535
A002 1200
A003 1560
A004 2575
A005 5575
A006 1140

dataA has about 400,000 rows and dataB has about 40,000 rows, but when i did merge using the following function, I got a data that is about 500,000. I am wondering why this is happening, and both data has only one column that shares name - newPCODE1. I tried to google around but did not find much information. Would appreciate any insights!

combo_new = merge(dataA,dataB, by = "newPCODE1")


Each newPCODE1 match will add a row to your merged data frame, including repeats of a code. For example, if T306 appears twice in dataA and three times in dataB, you'll get six rows in your merged data frame from that value of newPCODE1 (because the first T306 in dataA matches three rows in dataB and the second T306 in dataA matches those same three rows in dataB).

To calculate the number of rows in the merged data frame:

First, only newPCODE1 values that are in both data frames will be returned from the merge (in the form used in your question). In other words, we need the intersection of newPCODE1 from the two data frames:

common.codes = intersect(dataA$newPCODE1,dataB$newPCODE1)

The number of rows returned for each value of newPCODE1 will be the product of the number of rows with that value of newPCODE1 in each of the two data frames. The total number of rows in the merged data frame is the sum of these products:

sum(table(dataA$newPCODE1[dataA$newPCODE1 %in% common.codes]) * 
      table(dataB$newPCODE1[dataB$newPCODE1 %in% common.codes]))

In the example below, note that the merged data frame has 25 rows, even though the original two data frames have only a total of 20. If all 10 rows had the same newPCODE1 value in both data frames, the merged data frame would have had 100 rows. (If, say, T306 appeared in all 400,000 rows of dataA and all 40,000 rows of dataB, your merged data frame would have 400,000*40,000 = 16 billion rows!)

dataA = data.frame(newPCODE1=c(1,3,4,4,5,5,6,6,6,6), value1=letters[1:10])
dataB = data.frame(newPCODE1=c(3,4,5,5,5,6,6,6,6,10), value2=LETTERS[1:10])
merge(dataA,dataB, by="newPCODE1")
   newPCODE1 value1 value2
1          3      b      A
2          4      c      B
3          4      d      B
4          5      e      C
5          5      e      D
6          5      e      E
7          5      f      C
8          5      f      D
9          5      f      E
10         6      g      G
11         6      g      H
12         6      g      I
13         6      g      F
14         6      h      G
15         6      h      H
16         6      h      I
17         6      h      F
18         6      i      G
19         6      i      H
20         6      i      I
21         6      i      F
22         6      j      G
23         6      j      H
24         6      j      I
25         6      j      F
common.codes = intersect(dataA$newPCODE1,dataB$newPCODE1)

sum(table(dataA$newPCODE1[dataA$newPCODE1 %in% common.codes]) * 
      table(dataB$newPCODE1[dataB$newPCODE1 %in% common.codes]))
[1] 25

If you want to keep rows from one or both data frames even when there is no matching row in the other data frame, you can do this:

merge(dataA,dataB, by="newPCODE1", all.x=TRUE)   # Keep all rows from first data frame
merge(dataA,dataB, by="newPCODE1", all.y=TRUE)   # Keep all rows from second data frame
merge(dataA,dataB, by="newPCODE1", all=TRUE)     # Keep all rows from both data frames