AGUY AGUY - 1 month ago 10
R Question

Select only the first row when merging data frames with multiple matches

I have two data frames, "data" and "scores", and want to merge them on the "id" column:

data = data.frame(id = c(1,2,3,4,5),
state = c("KS","MN","AL","FL","CA"))
scores = data.frame(id = c(1,1,1,2,2,3,3,3),
score = c(66,75,78,86,85,76,75,90))
merge(data, scores, by = "id")
semi_join(data, scores, by = "id")


In the "scores" data, there are "id" with multiple observations, where each match gets a row following the join. See
?merge
:


If there is more than one match, all possible matches contribute one row each.


However, I want keep only the row corresponding to the first match from the
scores
table.

A semi join would have been nice, but I'm not able to select the score from the right table.

Any suggestions?

Answer Source

Using data.table along with mult = "first" and nomatch = 0L:

require(data.table)
setDT(scores); setDT(data) # convert to data.tables by reference

scores[data, mult = "first", on = "id", nomatch=0L]
#    id score state
# 1:  1    66    KS
# 2:  2    86    MN
# 3:  3    76    AL

For each row on data's id column, the matching rows in scores' id column are found, and the first one alone is retained (because mult = "first"). If there are no matches, they're removed (because of nomatch = 0L).