Chris Chris -4 years ago 91
R Question

Matching unique and non-unique values between data.tables and update data.table

I have been working on matching values between two large data tables (around 25 million records each) based on unique and non-unique values. I posted a question yesterday on how to update values in a data.table using another data.table and the answers have made me wonder how I can improve the efficiency of this matching. In my example

dt1
contains a non-unique area code and a type which can be from 1 to 10 (along with some blank columns).
dt2
contains the same non-unique area codes (although not the same number as in
dt1
), types from 1 to 10 and a unique ID (
U_ID
).

What I would like to do is find the
AREA_CD
and
TYPE
rows in
dt1
and
dt2
that match and copy the
U_ID
for those rows from
dt2
to
dt1
. The issue is that dt1 and dt2 do not have the same number of instances of each unique combination. For example
AREA_CD
'A1' and
TYPE
'1' occur 21 times in
dt1
and only 20 times in
dt2
. In these instances then the minimum number of rows (so 20 in this case) would have the match operation applied leaving 1 row in
dt1
unmodified (if
dt2
had more rows than
dt1
then the number of rows in
dt1
would be used).

Here is an example of my dataset. My actual datasets have around 25,000,000 rows and contain around 10,000 unique areas and types from 1 to 10.

require("data.table")

df1 <-data.frame(AREA_CD = c(rep("A1", 205), rep("A2", 145), rep("A3", 250), rep("A4", 100), rep("A5", 300)), TYPE = rep(1:10), ALLOCATED = 0, U_ID = 0, ID_CD = c(1:1000))
df1$ID_CD <- interaction( "ID", df1$ID_CD, sep = "")

df2 <-data.frame(U_ID = c(1:1000), AREA_CD = c(rep("A1", 200), rep("A2", 155), rep("A3", 245), rep("A4", 90), rep("A5", 310)), TYPE = rep(1:10), ASSIGNED = 0)
df2$U_ID <- interaction( "U", df2$U_ID, sep = "")

dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)


The output I am looking for would look something like this:

for(o in 1:5){
Ao <- paste("A",o,sep="")
for(i in 1:10){
R.Num <- min(nrow(with(df1, df1[AREA_CD == Ao & TYPE == i ,])), nrow(with(df2, df2[AREA_CD == Ao & TYPE == i ,])))
df1[df1$AREA_CD == Ao & df1$TYPE == i,][1:R.Num,"U_ID"] <- as.character(df2[df2$AREA_CD == Ao & df2$TYPE == i,][1:R.Num,"U_ID"])
}}


I hope that makes sense.

Answer Source

For each group of AREA_CD and TYPE, the OP wants to match the rows one to one in the order they appear in both data.tables. E.g., the first row with AREA_CD == "A1" & TYPE == 1 in dt1 shall match the first row with AREA_CD == "A1" & TYPE == 1 in dt2, then the second rows and so forth.

This can be done in a join operation if a row index I (or running count) within each group is added:

# add row indices
dt1[, I := seq_len(.N), by = .(AREA_CD, TYPE)]
dt2[, I := seq_len(.N), by = .(AREA_CD, TYPE)]

# alternative code: same result as above but more concise
dt1[, I := rowid(AREA_CD, TYPE)]
dt2[, I := rowid(AREA_CD, TYPE)]


# right join (all rows of dt1 are used)
dt0 <- dt2[dt1, on = .(AREA_CD, TYPE, I)]

# show result for one group
dt0[AREA_CD == "A1" & TYPE == 1, ]
#    U_ID AREA_CD TYPE ASSIGNED  I ALLOCATED i.U_ID ID_CD
# 1:   U1      A1    1        0  1         0      0   ID1
# 2:  U11      A1    1        0  2         0      0  ID11
# 3:  U21      A1    1        0  3         0      0  ID21
#...
#19: U181      A1    1        0 19         0      0 ID181
#20: U191      A1    1        0 20         0      0 ID191
#21:   NA      A1    1       NA 21         0      0 ID201

Note that the last row has a NA in some columns. This is due to the different number of rows for this group in dt1 and dt2. dt1 has 21 rows while dt2 has only 20 rows. So, the last row of dt1 has no match in dt2.

Alternatively, a inner join will only return rows with a match in both dt1 and dt2:

# inner join
dt0 <- dt2[dt1, on = .(AREA_CD, TYPE, I), nomatch = 0]

# show result for one group
dt0[AREA_CD == "A1" & TYPE == 1, ]
#    U_ID AREA_CD TYPE ASSIGNED  I ALLOCATED i.U_ID ID_CD
# 1:   U1      A1    1        0  1         0      0   ID1
# 2:  U11      A1    1        0  2         0      0  ID11
# 3:  U21      A1    1        0  3         0      0  ID21
#...
#18: U171      A1    1        0 18         0      0 ID171
#19: U181      A1    1        0 19         0      0 ID181
#20: U191      A1    1        0 20         0      0 ID191

Now, only 20 rows are returned for this group.


Data

dt1 <- data.table(
  AREA_CD = factor(c(rep("A1", 205), rep("A2", 145), rep("A3", 250), rep("A4", 100), rep("A5", 300))), 
  TYPE = rep(1:10), 
  ALLOCATED = 0, 
  U_ID = 0, 
  ID_CD = factor(paste0("ID", 1:1000)))
dt2 <- data.table(
  U_ID = factor(paste0("U", 1:1000)), 
  AREA_CD = factor(c(rep("A1", 200), rep("A2", 155), rep("A3", 245), rep("A4", 90), rep("A5", 310))), 
  TYPE = rep(1:10), 
  ASSIGNED = 0)

Note that ID_CD and U_ID are created using paste0() instead of interaction() which turned out to be rather slow.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download