research111 research111 - 1 month ago 17
R Question

Conditional data manipulation using data.table in R

I have 2 dataframes, testx and testy

testx

testx <- structure(list(group = 1:2), .Names = "group", class = "data.frame", row.names = c(NA,

-2L))


testy

testy <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),

time = c(1L, 3L, 4L, 1L, 4L, 5L, 1L, 5L, 7L), value = c(50L,

52L, 10L, 4L, 84L, 2L, 25L, 67L, 37L)), .Names = c("group",

"time", "value"), class = "data.frame", row.names = c(NA, -9L

))


Based on this topic, I add missing time values using the following code, which works perfectly.

data <- setDT(testy, key='time')[, .SD[J(min(time):max(time))], by = group]


Now I would like to only add these missing time values IF the value for group appears in testx. In this example, I thus only want to add missing time values for groups matching the values for group in the file testx.

data <- setDT(testy, key='time')[,if(testy[group %in% testx[, group]]) .SD[J(min(time):max(time))], by = group]


The error I get is "undefined columns selected". I looked here, here and here, but I don't see why my code isn't working. I am doing this on large datasets, why I prefer using
data.table
.

Answer

You don't need to refer testy when you are within testy[] and are using group by, directly using group as a variable gives correct result, you need an extra else statement to return rows where group is not within testx if you want to keep all records in testy:

testy[, {if(group %in% testx$group) .SD[J(min(time):max(time))] else .SD}, by = group]

#     group time value
#  1:     1    1    50
#  2:     1    2    NA
#  3:     1    3    52
#  4:     1    4    10
#  5:     2    1     4
#  6:     2    2    NA
#  7:     2    3    NA
#  8:     2    4    84
#  9:     2    5     2
# 10:     3    1    25
# 11:     3    5    67
# 12:     3    7    37
Comments