lll - 7 months ago 32

R Question

I have two data sets like the following:

dataA:

`newPCODE1 INT`

T306 986566

U078 94333

T324 97170

R354 97489

T538 987411

R141 94334

...

dataB:

`newPCODE1 PRICE`

A001 2535

A002 1200

A003 1560

A004 2575

A005 5575

A006 1140

....

dataA has about 400,000 rows and dataB has about 40,000 rows, but when i did merge using the following function, I got a data that is about 500,000. I am wondering why this is happening, and both data has only one column that shares name - newPCODE1. I tried to google around but did not find much information. Would appreciate any insights!

`combo_new = merge(dataA,dataB, by = "newPCODE1")`

Answer

Each `newPCODE1`

match will add a row to your merged data frame, including repeats of a code. For example, if `T306`

appears twice in `dataA`

and three times in `dataB`

, you'll get six rows in your merged data frame from that value of `newPCODE1`

(because the first `T306`

in `dataA`

matches three rows in `dataB`

and the second `T306`

in `dataA`

matches those same three rows in `dataB`

).

To calculate the number of rows in the merged data frame:

First, only `newPCODE1`

values that are in both data frames will be returned from the `merge`

(in the form used in your question). In other words, we need the intersection of `newPCODE1`

from the two data frames:

```
common.codes = intersect(dataA$newPCODE1,dataB$newPCODE1)
```

The number of rows returned for each value of `newPCODE1`

will be the product of the number of rows with that value of `newPCODE1`

in each of the two data frames. The total number of rows in the merged data frame is the sum of these products:

```
sum(table(dataA$newPCODE1[dataA$newPCODE1 %in% common.codes]) *
table(dataB$newPCODE1[dataB$newPCODE1 %in% common.codes]))
```

In the example below, note that the merged data frame has 25 rows, even though the original two data frames have only a total of 20. If all 10 rows had the same `newPCODE1`

value in both data frames, the merged data frame would have had 100 rows. (If, say, `T306`

appeared in all 400,000 rows of `dataA`

and all 40,000 rows of `dataB`

, your merged data frame would have 400,000*40,000 = 16 billion rows!)

```
dataA = data.frame(newPCODE1=c(1,3,4,4,5,5,6,6,6,6), value1=letters[1:10])
dataB = data.frame(newPCODE1=c(3,4,5,5,5,6,6,6,6,10), value2=LETTERS[1:10])
merge(dataA,dataB, by="newPCODE1")
```

`newPCODE1 value1 value2 1 3 b A 2 4 c B 3 4 d B 4 5 e C 5 5 e D 6 5 e E 7 5 f C 8 5 f D 9 5 f E 10 6 g G 11 6 g H 12 6 g I 13 6 g F 14 6 h G 15 6 h H 16 6 h I 17 6 h F 18 6 i G 19 6 i H 20 6 i I 21 6 i F 22 6 j G 23 6 j H 24 6 j I 25 6 j F`

```
common.codes = intersect(dataA$newPCODE1,dataB$newPCODE1)
sum(table(dataA$newPCODE1[dataA$newPCODE1 %in% common.codes]) *
table(dataB$newPCODE1[dataB$newPCODE1 %in% common.codes]))
```

`[1] 25`

If you want to keep rows from one or both data frames even when there is no matching row in the other data frame, you can do this:

```
merge(dataA,dataB, by="newPCODE1", all.x=TRUE) # Keep all rows from first data frame
merge(dataA,dataB, by="newPCODE1", all.y=TRUE) # Keep all rows from second data frame
merge(dataA,dataB, by="newPCODE1", all=TRUE) # Keep all rows from both data frames
```