Matt Chambers Matt Chambers - 2 months ago 14
R Question

Data.table left join and aggregate/concatenate/group_concat

I have tables like:

x = data.table(Id=c(1,1,2,3,3,4), Name=c("A", "A", "B", "C", "C", "D"), TxId=c(10, 11, 20, 30, 31, 40))
#Id Name TxId
#1: 1 A 10
#2: 1 A 11
#3: 2 B 20
#4: 3 C 30
#5: 3 C 31
#6: 4 D 40

y = data.table(Name=c("A", "B", "B", "C"), Family=c("A-alpha", "B-beta", "B-gamma", "C-delta"))
# Name Family
#1: A A-alpha
#2: B B-beta
#3: B B-gamma
#4: C C-delta


I can do the left-join and concatenation, but I only want one output row for each row in X.

# Left join X to Y on Name column
xy = y[x, on="Name"]
# Name Family Id TxId
#1: A A-alpha 1 10
#2: A A-alpha 1 11
#3: B B-beta 2 20
#4: B B-gamma 2 20
#5: C C-delta 3 30
#6: C C-delta 3 31
#7: D NA 4 40

# Concatenate Family column
xy[, Family:=paste0(Family, collapse=", "), by=c("Name", "TxId")]
# Name Family Id TxId
#1: A A-alpha 1 10
#2: A A-alpha 1 11
#3: B B-beta, B-gamma 2 20
#4: B B-beta, B-gamma 2 20
#5: C C-delta 3 30
#6: C C-delta 3 31
#7: D NA 4 40


How do I get rid of the extra row for B? I want it to be unique on Id/TxId. i.e.

# Name Family Id TxId
#1: A A-alpha 1 10
#2: A A-alpha 1 11
#3: B B-beta, B-gamma 2 20
#5: C C-delta 3 30
#6: C C-delta 3 31
#7: D NA 4 40


If I do as eddi comments:

xy[, .(Family=paste0(Family, collapse=", "), by=c("Name", "TxId")])


I get the correct result. But if I try to add other columns, it doesn't work (I get the same result as if I had done the
:=
version):

xy[, .(Id, Family=paste0(Family, collapse=", ")), by=c("Name", "TxId")]

Answer

Please, try

xy[, .(Family = paste0(Family, collapse = ", "), by = c("Id", "Name", "TxId")]

I'll try to explain:
If Id is part of the group-by it will only appear once for each unique value of Id (to be exact, for each unique combination of Id, Name, TxId). If Id is included in the j-expression, i.e., .(Id, Family = paste0(Family, collapse = ", ") then every row of Id will be included in the result set despite Family is being aggregated.