arcee123 arcee123 - 1 month ago 7
R Question

how to perform a left-join MERGE in R

I have the below code:

options(java.parameters = "-Xmx4000m")
require(xlsx)
library(plyr)

setwd("~/PycharmProjects/CatScrape")
rm(list=ls(all=TRUE))
jgc <- function() .jcall("java/lang/System", method = "gc")

Master <- read.xlsx2("MASTER.xlsx", sheetIndex = 2, startRow = 1, colIndex=4,endRow = 10000, as.data.frame = TRUE, header=TRUE)

Dutch_Stage <- read.xlsx2("languages/Dutch.xlsx", sheetIndex = 1, startRow = 1, colIndex=c(5,8),endRow = 10000, header=TRUE)
Dutch <- unique(Dutch_Stage)
rm(Dutch_Stage)
Dutch <- rename(Dutch, c("Key.s."="Key", "Status"="Dutch"))
jgc()
output <- merge(Master, Dutch, by="Key", all.Master = TRUE)

## OUTPUT RECORD NUMBER MATCHES MASTER

Finnish_Stage <- read.xlsx2("languages/Finnish.xlsx", sheetIndex = 1, startRow = 1, colIndex=c(5,8),endRow = 10000, header=TRUE)
Finnish <- unique(Finnish_Stage)
rm(Finnish_Stage)
Finnish <- rename(Finnish, c("Key.s."="Key", "Status"="Finnish"))
jgc()
output <- merge(output, Finnish, by="Key", all.output = TRUE)


## OUTPUT RECORD NUMBER INCREASES by 6


I have 12 more files to add, and when that happens, I end up with 25 times the number of records.

In this case,
all.output = TRUE
is set to all of the files, and my goal is to just show the records from Master, and the associations to those records. I don't want the additional records.

This makes me think this is not a true "left join". How do I make it just a "LEFT JOIN"?

Thanks

Answer

A little too long for a comment so I'll give my two cents in an answer:

Overall its hard to answer your question directly without sample data. But, generally if you want a left join you should do:

merge(x,y,all.x=T). 

The above code is saying keep all the observations from your x dataset (your left dataset)

However, I suspect that that this won't solve your issue since you are getting MORE and not LESS observations than expected. I think your issue is likely with unique(Finnish_Stage). Using this unique statement doesn't necessarily mean your Key variable only has 1 observation for each value (if you have other variables mapped to the same key value that could be an issue). For example try this: unique(data.frame(Key=c(1,1,2),value=c(1,2,3)))

Joins (including left joins) will merge everything together. Maybe someone else can explain this in words slightly better, but I think an example is the best way to show what happens:

Data:

d1 <- data.frame(y1=c(1,2,3),y2=c(4,5,6),y3=c(7,8,9))
d2 <- data.frame(y1=c(1,1,2))

Original Join (inner join)

merge(d1,d2,all.output=TRUE)

  y1 y2 y3
1  1  4  7
2  1  4  7
3  2  5  8

Here you can see that we have a duplicate row. This is because there were two y1=1 values in your d2 dataset so the join will merge these on twice.

The same thing also happens for a left join:

Left join

merge(d1,d2,all.x=TRUE)

  y1 y2 y3
1  1  4  7
2  1  4  7
3  2  5  8
4  3  6  9

So what's likely happening is you have multiple observations for each of your Key values which is then causing these to be merged on multiple times giving you more observations than you expect.

To solve this, I just make sure you should only ever see 1 value of your Key variable in the dataset you want to merge on (and if do expect this, make sure you nodupe based on only the Key variable). If you would expect to have more than 1 Key value, then the merge is actually working the way it should.

tldr: Left joins don't have to return the same number of rows as the original dataset.