i.Mik - 1 month ago
R Question

Why does merging data.tables not work for long-integer keys in R, and how can it be avoided?

I am working with a dataset that has some fairly absurdly long IDs. After not getting the result I expected, I traced the issue to a step of code where I merged two data tables.

Example:

options("scipen" = 999) ## The only way I know to disable scientific notation.
library(data.table)
table1 <- data.table("ID" = sample(400000000000:400000000004, size = 5),
                     "colA" = rep('a', 5))
table2 <- data.table("ID" = sample(400000000000:400000000001, size = 2),
                     "colB" = rep('b', 2))
merge(table1, table2, by = "ID", all.x = F)


Since each ID is unique in each table, I would expect the output to be two rows long (returning only the matching rows of table2, due to all.x = F).

Instead, I get:

             ID colA colB
1: 400000000000    a    b
2: 400000000000    a    b
3: 400000000001    a    b
4: 400000000001    a    b


At a certain point, it looks like R stops treating these integers as distinct for calculation purposes. It looks like right_join() in dplyr doesn't have this particular problem, and in this case it's not crucial by any means for me to keep the data.table class... but I'd still be interested in knowing whether it's avoidable and why this limitation pops up (and why right_join() avoids it).
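For reference, these IDs are too large to fit in R's 32-bit integer type, so the sequence in my example is actually stored as doubles (not sure if that's relevant):

ids <- 400000000000:400000000004
typeof(ids)      ## "double" -- the values exceed .Machine$integer.max (2147483647)
is.integer(ids)  ## FALSE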

Answer

From the merge documentation:

Merges on numeric columns: Columns of numeric types (i.e., double) have their last two bytes rounded off while computing order, by default, to avoid any unexpected behaviour due to limitations in representing floating point numbers precisely. For large numbers (integers > 2^31), we recommend using bit64::integer64. Have a look at setNumericRounding to learn more.
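You can check how many trailing bytes are currently being rounded off with getNumericRounding() (the counterpart to setNumericRounding in data.table):

getNumericRounding()  ## 2 means the last two bytes are rounded off, which reproduces the merge above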

The solution here is to use:

setNumericRounding(0)  ## do not round off any bytes when computing order on numeric columns
merge(table1, table2, by = "ID", all.x = F)
             ID colA colB
1: 400000000000    a    b
2: 400000000001    a    b
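Note that setNumericRounding() changes a session-wide setting. If you'd rather not touch it, the bit64::integer64 route the documentation recommends should also work; a rough sketch, assuming the bit64 package is installed:

library(bit64)

## Store the IDs as true 64-bit integers so that ordering and joining
## are exact and no byte rounding is involved.
table1[, ID := as.integer64(ID)]
table2[, ID := as.integer64(ID)]
merge(table1, table2, by = "ID", all.x = FALSE)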