Stat Stat - 3 months ago 18
R Question

Check overlap begin and end time by group in R (Run incorrectly when data have NA)

This is a followup to this previous question, but I've come across an issue with the answer that was provided there due to NA:

library(data.table)
ID <- c(rep(1,4), rep(3, 5), rep(4,4),rep(5,5))
Begin <- c(0,2.5,NA,3,7,8,7,25,25,10,15,0,0,1,NA,10,11,13)
End <- c(1.5,3.5,NA,6,12,8,11,29,35, 12,19,NA,28,5,20,30,20,25)
df <- data.table(ID, Begin, End)
df[, Begin_New := {
  high_so_far = shift(cummax(End), fill=Begin[1L])
  w = which(Begin < high_so_far)
  Begin[w] = high_so_far[w]
  Begin
}, by=ID]

    ID Begin  End Begin_New
 1:  1   0.0  1.5       0.0
 2:  1   2.5  3.5       2.5
 3:  1    NA   NA        NA
 4:  1   3.0  6.0       3.0* # <~~ should be 3.5
 5:  3   7.0 12.0       7.0
 6:  3   8.0  8.0      12.0
 7:  3   7.0 11.0      12.0
 8:  3  25.0 29.0      25.0
 9:  3  25.0 35.0      29.0
10:  4  10.0 12.0      10.0
11:  4  15.0 19.0      15.0
12:  4   0.0   NA      19.0
13:  4   0.0 28.0       0.0* # <~~ should be 19.0
14:  5   1.0  5.0       1.0
15:  5    NA 20.0        NA
16:  5  10.0 30.0      20.0
17:  5  11.0 20.0      30.0
18:  5  13.0 25.0      30.0


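I am trying to check for overlaps within each ID: if a row's Begin is less than the previous End, Begin_New should be set equal to that previous End, and the check should continue until Begin is greater than End. But when the End variable contains NA, the code does not handle it, and the rows after the NA are not adjusted. I have tried a couple of other approaches, but none of them worked.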

Answer

You can add another step before cummax:

df[, Begin_New := {
  End[is.na(End)] = 0 # <- new step here
  high_so_far = shift(cummax(End), fill=Begin[1L])
  w = which(Begin < high_so_far)
  Begin[w] = high_so_far[w]
  Begin
}, by=ID][]
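Filling with 0 relies on End never being negative. If that cannot be guaranteed, one possible variant (a sketch, not part of the original answer) fills with -Inf instead, since -Inf can never raise the running maximum. The block below rebuilds the question's data so that it runs on its own; the column name Begin_New2 and the local name end_filled are only illustrative.

```r
library(data.table)

# Data copied from the question
df <- data.table(
  ID    = c(rep(1, 4), rep(3, 5), rep(4, 4), rep(5, 5)),
  Begin = c(0, 2.5, NA, 3, 7, 8, 7, 25, 25, 10, 15, 0, 0, 1, NA, 10, 11, 13),
  End   = c(1.5, 3.5, NA, 6, 12, 8, 11, 29, 35, 12, 19, NA, 28, 5, 20, 30, 20, 25)
)

df[, Begin_New2 := {
  # Treat a missing End as "no constraint": -Inf never raises the running maximum
  end_filled  <- replace(End, is.na(End), -Inf)
  high_so_far <- shift(cummax(end_filled), fill = Begin[1L])
  # which() drops NA comparisons, so rows with a missing Begin stay NA
  w <- which(Begin < high_so_far)
  replace(Begin, w, high_so_far[w])
}, by = ID]

print(df)
```

On this data the result should match the 0-fill version, with rows 4 and 13 becoming 3.5 and 19.0.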

How I got this: to troubleshoot problems like this, I run j in steps, for one group at a time, like this:

df[, if (.GRP == 1L){
  high_so_far = shift(cummax(End), fill=Begin[1L])
  print(high_so_far)
  # w = which(Begin < high_so_far)
  # Begin[w] = high_so_far[w]
  # Begin
}, by=ID][]

# 0.0 1.5 3.5  NA

So I can see that this is where the problem occurs and address it by reading ?cummax to see if there is an na.rm option. Not finding one there, I can think about what other step I can take before or after this one to finagle the desired result.
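To see the behavior in isolation, cummax can be run directly on the End values of the first group. This is a minimal base R sketch; the variable names are illustrative, and the 0 fill assumes, as in the answer, that End is never negative.

```r
# End values of the group with ID 1, copied from the question's data
end_vals <- c(1.5, 3.5, NA, 6)

# cummax() has no na.rm argument: once an NA appears,
# every later element of the result is NA as well
print(cummax(end_vals))
# [1] 1.5 3.5  NA  NA

# Replacing NA with 0 first keeps the running maximum going
# (assumption: End is never negative)
end_filled <- replace(end_vals, is.na(end_vals), 0)
print(cummax(end_filled))
# [1] 1.5 3.5 3.5 6.0
```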

If I hadn't found the issue at this step, then I could gradually uncomment later lines and add more print statements. Or I could change .GRP==1 to some other group.
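The same inspection can also be done outside of j by pulling one group's rows out of the table and running each step on ordinary vectors. Here is a sketch of that approach for the group with ID 4, rebuilding the question's data so that it runs on its own; the names g and running_max are illustrative.

```r
library(data.table)

# Data copied from the question
df <- data.table(
  ID    = c(rep(1, 4), rep(3, 5), rep(4, 4), rep(5, 5)),
  Begin = c(0, 2.5, NA, 3, 7, 8, 7, 25, 25, 10, 15, 0, 0, 1, NA, 10, 11, 13),
  End   = c(1.5, 3.5, NA, 6, 12, 8, 11, 29, 35, 12, 19, NA, 28, 5, 20, 30, 20, 25)
)

# Pull out a single group and step through the calculation line by line
g <- df[ID == 4]

running_max <- cummax(g$End)
print(running_max)   # NA from the third element onward

high_so_far <- shift(running_max, fill = g$Begin[1L])
print(high_so_far)   # the NA has moved to the last position

# Comparisons against NA give NA, and which() silently drops them,
# so the last row of the group is never selected for adjustment
print(g$Begin < high_so_far)
print(which(g$Begin < high_so_far))
```

Changing the ID in the subset is the counterpart of changing .GRP in the in-table version.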
