LyssBucks LyssBucks - 4 months ago 17
R Question

na.locf fill NAs up to maxgap even if gap > maxgap, with groups

I've seen a solution to this, but can't get it to work for groups
(Fill NA in a time series only to a limited number), and thought there must be a neater way to do this also?

Say I have the following dt:

dt <- data.table(ID = c(rep("A",10),rep("B",10)), Price = c(seq(1,10,1), seq(11,20,1)))
dt[c(1:2, 5:10) ,2] <- NA
dt[c(11:13, 15:19) ,2] <- NA
dt
ID Price
1: A NA
2: A NA
3: A 3
4: A 4
5: A NA
6: A NA
7: A NA
8: A NA
9: A NA
10: A NA
11: B NA
12: B NA
13: B NA
14: B 14
15: B NA
16: B NA
17: B NA
18: B NA
19: B NA
20: B 20


What I would like to do, is to fill NAs both forward and back from the most recent non-NA value, but only up to a maximum of two rows forward or back.

I also need it to be done by group (ID).

I have tried using na.locf/na.approx with maxgap = x etc, but it does not fill NAs where the gap between non-NA values is greater than maxgap. Whereas I want to fill these forward and back even if the gap between non-NA values is greater than maxgap, but only by two rows.

The final result should looks something like:

ID Price Price_Fill
1: A NA 3
2: A NA 3
3: A 3 3
4: A 4 4
5: A NA 4
6: A NA 4
7: A NA NA
8: A NA NA
9: A NA NA
10: A NA NA
11: B NA NA
12: B NA 14
13: B NA 14
14: B 14 14
15: B NA 14
16: B NA 14
17: B NA NA
18: B NA 20
19: B NA 20
20: B 20 20


In reality, my data set is massive, and I want to be able to fill NAs forward and back for up to 672 rows, but no more, by group.

Thanks!

Answer

For the example showed, we group by 'ID', get the shift of 'Price' with n = 0:2, and type as 'lead' to create 3 temporary columns, get the pmax from this, use the output to do the shift with type = 'lag' (by default it is 'lag') and same n, get the pmin and assign it as 'Price_Fill'

dt[, Price_Fill := do.call(pmin, c(shift(do.call(pmax, c(shift(Price, n = 0:2, 
                  type = "lead"), na.rm=TRUE)), n= 0:2), na.rm = TRUE)) , by = ID]
dt
#    ID Price Price_Fill
#1:  A    NA          3
#2:  A    NA          3
#3:  A     3          3
#4:  A     4          4
#5:  A    NA          4
#6:  A    NA          4
#7:  A    NA         NA
#8:  A    NA         NA
#9:  A    NA         NA
#10: A    NA         NA
#11: B    NA         NA
#12: B    NA         14
#13: B    NA         14
#14: B    14         14
#15: B    NA         14
#16: B    NA         14
#17: B    NA         NA
#18: B    NA         20
#19: B    NA         20
#20: B    20         20

A more general approach would be to do the pmin/pmax on .I as the 'Price' can be different and not the sequence number as showed in the OP's post.

i1 <- dt[,  do.call(pmin, c(shift(do.call(pmax, c(shift(NA^(is.na(Price))* 
    .I, n = 0:2, type = "lead"), na.rm = TRUE)), n = 0:2), na.rm = TRUE)), ID]$V1

dt$Price_Fill <  dt$Price[i1]
dt$Price_Fill
#[1]  3  3  3  4  4  4 NA NA NA NA NA 14 14 14 14 14 NA 20 20 20

i.e. suppose we change the 'Price', it will be different

dt$Price[3] <- 10
dt$Price[14] <- 7
dt$Price_Fill <- dt$Price[i1]
dt$Price_Fill
#[1] 10 10 10  4  4  4 NA NA NA NA NA  7  7  7  7  7 NA 20 20 20