lll - 1 year ago 69
R Question

# R: why when merging data - got a result that is larger than both of the data

I have two data sets like the following:

dataA:

``````newPCODE1   INT
T306     986566
U078     94333
T324     97170
R354     97489
T538     987411
R141     94334
...
``````

dataB:

`````` newPCODE1 PRICE
A001  2535
A002  1200
A003  1560
A004  2575
A005  5575
A006  1140
....
``````

dataA has about 400,000 rows and dataB has about 40,000 rows, but when i did merge using the following function, I got a data that is about 500,000. I am wondering why this is happening, and both data has only one column that shares name - newPCODE1. I tried to google around but did not find much information. Would appreciate any insights!

``````combo_new = merge(dataA,dataB, by = "newPCODE1")
``````

Each `newPCODE1` match will add a row to your merged data frame, including repeats of a code. For example, if `T306` appears twice in `dataA` and three times in `dataB`, you'll get six rows in your merged data frame from that value of `newPCODE1` (because the first `T306` in `dataA` matches three rows in `dataB` and the second `T306` in `dataA` matches those same three rows in `dataB`).

To calculate the number of rows in the merged data frame:

First, only `newPCODE1` values that are in both data frames will be returned from the `merge` (in the form used in your question). In other words, we need the intersection of `newPCODE1` from the two data frames:

``````common.codes = intersect(dataA\$newPCODE1,dataB\$newPCODE1)
``````

The number of rows returned for each value of `newPCODE1` will be the product of the number of rows with that value of `newPCODE1` in each of the two data frames. The total number of rows in the merged data frame is the sum of these products:

``````sum(table(dataA\$newPCODE1[dataA\$newPCODE1 %in% common.codes]) *
table(dataB\$newPCODE1[dataB\$newPCODE1 %in% common.codes]))
``````

In the example below, note that the merged data frame has 25 rows, even though the original two data frames have only a total of 20. If all 10 rows had the same `newPCODE1` value in both data frames, the merged data frame would have had 100 rows. (If, say, `T306` appeared in all 400,000 rows of `dataA` and all 40,000 rows of `dataB`, your merged data frame would have 400,000*40,000 = 16 billion rows!)

``````dataA = data.frame(newPCODE1=c(1,3,4,4,5,5,6,6,6,6), value1=letters[1:10])
dataB = data.frame(newPCODE1=c(3,4,5,5,5,6,6,6,6,10), value2=LETTERS[1:10])
merge(dataA,dataB, by="newPCODE1")
``````
``````   newPCODE1 value1 value2
1          3      b      A
2          4      c      B
3          4      d      B
4          5      e      C
5          5      e      D
6          5      e      E
7          5      f      C
8          5      f      D
9          5      f      E
10         6      g      G
11         6      g      H
12         6      g      I
13         6      g      F
14         6      h      G
15         6      h      H
16         6      h      I
17         6      h      F
18         6      i      G
19         6      i      H
20         6      i      I
21         6      i      F
22         6      j      G
23         6      j      H
24         6      j      I
25         6      j      F
``````
``````common.codes = intersect(dataA\$newPCODE1,dataB\$newPCODE1)

sum(table(dataA\$newPCODE1[dataA\$newPCODE1 %in% common.codes]) *
table(dataB\$newPCODE1[dataB\$newPCODE1 %in% common.codes]))
``````
``````[1] 25
``````

If you want to keep rows from one or both data frames even when there is no matching row in the other data frame, you can do this:

``````merge(dataA,dataB, by="newPCODE1", all.x=TRUE)   # Keep all rows from first data frame
merge(dataA,dataB, by="newPCODE1", all.y=TRUE)   # Keep all rows from second data frame
merge(dataA,dataB, by="newPCODE1", all=TRUE)     # Keep all rows from both data frames
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download