LyssBucks - 1 year ago 114
R Question

# na.locf fill NAs up to maxgap even if gap > maxgap, with groups

I've seen a solution to this, but can't get it to work for groups
(Fill NA in a time series only to a limited number), and thought there must be a neater way to do this also?

Say I have the following dt:

``````dt <- data.table(ID = c(rep("A",10),rep("B",10)), Price = c(seq(1,10,1), seq(11,20,1)))
dt[c(1:2, 5:10) ,2] <- NA
dt[c(11:13, 15:19) ,2] <- NA
dt
ID Price
1:  A    NA
2:  A    NA
3:  A     3
4:  A     4
5:  A    NA
6:  A    NA
7:  A    NA
8:  A    NA
9:  A    NA
10:  A    NA
11:  B    NA
12:  B    NA
13:  B    NA
14:  B    14
15:  B    NA
16:  B    NA
17:  B    NA
18:  B    NA
19:  B    NA
20:  B    20
``````

What I would like to do, is to fill NAs both forward and back from the most recent non-NA value, but only up to a maximum of two rows forward or back.

I also need it to be done by group (ID).

I have tried using na.locf/na.approx with maxgap = x etc, but it does not fill NAs where the gap between non-NA values is greater than maxgap. Whereas I want to fill these forward and back even if the gap between non-NA values is greater than maxgap, but only by two rows.

The final result should looks something like:

``````    ID Price Price_Fill
1:  A    NA          3
2:  A    NA          3
3:  A     3          3
4:  A     4          4
5:  A    NA          4
6:  A    NA          4
7:  A    NA         NA
8:  A    NA         NA
9:  A    NA         NA
10:  A    NA         NA
11:  B    NA         NA
12:  B    NA         14
13:  B    NA         14
14:  B    14         14
15:  B    NA         14
16:  B    NA         14
17:  B    NA         NA
18:  B    NA         20
19:  B    NA         20
20:  B    20         20
``````

In reality, my data set is massive, and I want to be able to fill NAs forward and back for up to 672 rows, but no more, by group.

Thanks!

For the example showed, we group by 'ID', get the `shift` of 'Price' with `n = 0:2`, and `type` as 'lead' to create 3 temporary columns, get the `pmax` from this, use the output to do the `shift` with `type = 'lag'` (by default it is 'lag') and same `n`, get the `pmin` and assign it as 'Price_Fill'

``````dt[, Price_Fill := do.call(pmin, c(shift(do.call(pmax, c(shift(Price, n = 0:2,
type = "lead"), na.rm=TRUE)), n= 0:2), na.rm = TRUE)) , by = ID]
dt
#    ID Price Price_Fill
#1:  A    NA          3
#2:  A    NA          3
#3:  A     3          3
#4:  A     4          4
#5:  A    NA          4
#6:  A    NA          4
#7:  A    NA         NA
#8:  A    NA         NA
#9:  A    NA         NA
#10: A    NA         NA
#11: B    NA         NA
#12: B    NA         14
#13: B    NA         14
#14: B    14         14
#15: B    NA         14
#16: B    NA         14
#17: B    NA         NA
#18: B    NA         20
#19: B    NA         20
#20: B    20         20
``````

A more general approach would be to do the `pmin/pmax` on `.I` as the 'Price' can be different and not the sequence number as showed in the OP's post.

``````i1 <- dt[,  do.call(pmin, c(shift(do.call(pmax, c(shift(NA^(is.na(Price))*
.I, n = 0:2, type = "lead"), na.rm = TRUE)), n = 0:2), na.rm = TRUE)), ID]\$V1

dt\$Price_Fill <  dt\$Price[i1]
dt\$Price_Fill
#[1]  3  3  3  4  4  4 NA NA NA NA NA 14 14 14 14 14 NA 20 20 20
``````

i.e. suppose we change the 'Price', it will be different

``````dt\$Price[3] <- 10
dt\$Price[14] <- 7
dt\$Price_Fill <- dt\$Price[i1]
dt\$Price_Fill
#[1] 10 10 10  4  4  4 NA NA NA NA NA  7  7  7  7  7 NA 20 20 20
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download