
# Check overlapping begin and end times by group in R (runs incorrectly when data have NA)

This is a follow-up to this previous question, but I've come across an issue with the answer that was provided there due to `NA` values:

``````
library(data.table)

ID    <- c(rep(1, 4), rep(3, 5), rep(4, 4), rep(5, 5))
Begin <- c(0, 2.5, NA, 3, 7, 8, 7, 25, 25, 10, 15, 0, 0, 1, NA, 10, 11, 13)
End   <- c(1.5, 3.5, NA, 6, 12, 8, 11, 29, 35, 12, 19, NA, 28, 5, 20, 30, 20, 25)
df <- data.table(ID, Begin, End)

df[, Begin_New := {
  # highest End seen in the earlier rows of the group
  high_so_far = shift(cummax(End), fill = Begin[1L])
  # rows whose Begin falls before that running maximum
  w = which(Begin < high_so_far)
  Begin[w] = high_so_far[w]
  Begin
}, by = ID]

# Result:
#     ID Begin  End Begin_New
#  1:  1   0.0  1.5       0.0
#  2:  1   2.5  3.5       2.5
#  3:  1    NA   NA        NA
#  4:  1   3.0  6.0       3.0  # <~~ should be 3.5
#  5:  3   7.0 12.0       7.0
#  6:  3   8.0  8.0      12.0
#  7:  3   7.0 11.0      12.0
#  8:  3  25.0 29.0      25.0
#  9:  3  25.0 35.0      29.0
# 10:  4  10.0 12.0      10.0
# 11:  4  15.0 19.0      15.0
# 12:  4   0.0   NA      19.0
# 13:  4   0.0 28.0       0.0  # <~~ should be 19.0
# 14:  5   1.0  5.0       1.0
# 15:  5    NA 20.0        NA
# 16:  5  10.0 30.0      20.0
# 17:  5  11.0 20.0      30.0
# 18:  5  13.0 25.0      30.0
``````

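I am trying to resolve overlaps within each ID: if a row's Begin is less than the previous End (the highest End so far in that ID), Begin_New should be set to that End; otherwise Begin_New keeps the value of Begin. The check should carry on down the rows of each ID, but when End contains an NA the code no longer handles the rows that follow it (see the two marked rows). I have tried a couple of variations, but none of them work.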

You can add another step before `cummax`:

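``````
df[, Begin_New := {
  # new step: treat a missing End as 0 so that cummax() does not return NA
  # (this assumes that no real End value is negative)
  End[is.na(End)] = 0
  high_so_far = shift(cummax(End), fill = Begin[1L])
  w = which(Begin < high_so_far)
  Begin[w] = high_so_far[w]
  Begin
}, by = ID][]
``````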
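For anyone who wants to check the fix end to end, here is a minimal, self-contained sketch. It is a variation rather than the exact step above: the helper name `adjust_begin` is made up for illustration, and it uses `-Inf` instead of `0` as the placeholder, on the assumption that you want the logic to hold even if times can be negative.

```r
library(data.table)

ID    <- c(rep(1, 4), rep(3, 5), rep(4, 4), rep(5, 5))
Begin <- c(0, 2.5, NA, 3, 7, 8, 7, 25, 25, 10, 15, 0, 0, 1, NA, 10, 11, 13)
End   <- c(1.5, 3.5, NA, 6, 12, 8, 11, 29, 35, 12, 19, NA, 28, 5, 20, 30, 20, 25)
df <- data.table(ID, Begin, End)

# Hypothetical helper (not part of data.table): same logic as the answer,
# but a missing End becomes -Inf so it can never raise the running maximum.
adjust_begin <- function(Begin, End) {
  End[is.na(End)] <- -Inf
  high_so_far <- shift(cummax(End), fill = Begin[1L])
  w <- which(Begin < high_so_far)   # NA comparisons are dropped by which()
  Begin[w] <- high_so_far[w]
  Begin
}

df[, Begin_New := adjust_begin(Begin, End), by = ID]

# The two rows flagged in the question
print(df[c(4, 13)])
```

With the sample data, rows 4 and 13 should now show `3.5` and `19` in `Begin_New`. The original `End` column is left untouched, because the replacement only happens on the copy inside the function.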

How I got this: to troubleshoot problems like this, I run `j` step by step for one group at a time, like so:

``````
df[, if (.GRP == 1L) {
  high_so_far = shift(cummax(End), fill = Begin[1L])
  print(high_so_far)
  # w = which(Begin < high_so_far)
  # Begin[w] = high_so_far[w]
  # Begin
}, by = ID][]

# 0.0 1.5 3.5  NA
``````

So I can see that this is where the problem occurs: once `cummax` meets an `NA`, that element and every later one are `NA` too. I then read `?cummax` to see if there is an `na.rm` option. Not finding one there, I can think about what other step I can take before or after this one to finagle the desired result.
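To make the `NA` behaviour concrete, here is a small base R sketch on the `End` values of the first group only. The variable names `plain`, `End_filled` and `patched` are just for illustration.

```r
End <- c(1.5, 3.5, NA, 6)   # End values for ID 1

# The NA carries forward into every later element of the running maximum.
plain <- cummax(End)

# Replacing the NA beforehand keeps the running maximum intact.
End_filled <- End
End_filled[is.na(End_filled)] <- 0
patched <- cummax(End_filled)

print(plain)    # 1.5 3.5  NA  NA
print(patched)  # 1.5 3.5 3.5 6.0
```

Any comparison against an `NA` running maximum is itself `NA`, and `which()` silently drops those, which is why the affected rows kept their original `Begin`.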

If I hadn't found the issue at this step, then I could gradually uncomment later lines and add more `print` statements. Or I could change `.GRP==1` to some other group.
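If there are many groups, stepping through them one by one gets tedious. One possible shortcut, sketched here as an assumption about what you might want rather than as part of the approach above, is to compute the intermediate result for every group and only report whether it contains an `NA`. The column name `has_na` is made up for the example.

```r
library(data.table)

ID    <- c(rep(1, 4), rep(3, 5), rep(4, 4), rep(5, 5))
Begin <- c(0, 2.5, NA, 3, 7, 8, 7, 25, 25, 10, 15, 0, 0, 1, NA, 10, 11, 13)
End   <- c(1.5, 3.5, NA, 6, 12, 8, 11, 29, 35, 12, 19, NA, 28, 5, 20, 30, 20, 25)
df <- data.table(ID, Begin, End)

# One row per group: does the shifted running maximum contain an NA?
check <- df[, .(has_na = anyNA(shift(cummax(End), fill = Begin[1L]))), by = ID]
print(check)
```

With the sample data this should flag IDs 1 and 4, the two groups with a missing `End`, which are then the ones worth stepping through with `.GRP`.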