AntoniosK AntoniosK - 1 month ago 13
R Question

dplyr, data.table and setDT interaction issue

This is a very simplified version of a problem in a much bigger scale.
The objective is to use

data.table
structure and
dplyr
commands for faster ordering and grouping on multiple columns.

The correct version is the following:

library(dplyr)
library(data.table)
library(dtplyr)
library(lubridate)

# data set
dt = data.frame(id = c("a","b", "a"),
date = ymd(c("2016-01-03","2016-01-02","2016-01-01")),
value = c(10,5,9), stringsAsFactors = F)

# process to get the id of the largest value
(setDT(dt, key=c("id","value")) %>% select(id,value) %>% arrange(desc(value)) %>% slice(1))$id -> picked_id

# return all rows of this id
dt %>% filter(id %in% picked_id)

# id date value
# 1: a 2016-01-01 9
# 2: a 2016-01-03 10


But when I try to use
setDT
in a different position in my script I get a different result:

dt = data.frame(id = c("a","b", "a"),
date = ymd(c("2016-01-03","2016-01-02","2016-01-01")),
value = c(10,5,9), stringsAsFactors = F)

(dt %>% select(id,value) %>% setDT(., key=c("id","value")) %>% arrange(desc(value)) %>% slice(1))$id -> picked_id

dt %>% filter(id %in% picked_id)

# id date value
# 1 a 2016-01-03 9
# 2 a 2016-01-02 10


Obviously, there are other much more understandable scripts for this simple task, but I want to understand why this issue occurs here.

Answer

You cannot safely mix (i) data.table's functions that modify by reference into (ii) a dplyr chain, which is designed to never modify by reference. See what happens here:

library(dplyr)
library(data.table)
library(dtplyr)
library(lubridate)

dt = data.frame(id = c("a","b", "a"),
                date = ymd(c("2016-01-03","2016-01-02","2016-01-01")),
                value = c(10,5,9), stringsAsFactors = FALSE)

dt


  id       date value
1  a 2016-01-03    10
2  b 2016-01-02     5
3  a 2016-01-01     9


dt %>% select(id,value) %>% setDT(., key=c("id","value"))

dt


  id       date value
1  a 2016-01-03     9
2  a 2016-01-02    10
3  b 2016-01-01     5

So the selected columns have been modified by setDT call. You could view this as a bug in dtplyr's implementation of select or as a misuse by the OP. Anyways, I would stick with one paradigm at a time (personally, I just use data.table with magrittr and never hit these problems). For now, you can add a copy to your chain,

dt %>% select(id,value) %>% copy %>% setDT(., key=c("id","value"))

But I guess you'd need to do this all over the place.