SMB SMB - 2 months ago 11
R Question

comparing immediate past values to current value in R, missing data too

I am using Rstudio (version .99.903), have a PC (windows 8). I have a question that is really difficult for me. Here is what the data looks like:

Number Trial ID Open date Enrollment
420 NCT00091442 9 1/28/2005 0.2
1476 NCT00301457 26 2/22/2008 1
10559 NCT01307397 34 7/28/2011 0.6
6794 NCT00948675 53 5/12/2010 0
6451 NCT00917384 53 8/17/2010 0.3
8754 NCT01168973 53 1/19/2011 0.2
8578 NCT01140347 53 12/30/2011 2.4
11655 NCT01358877 53 4/2/2012 0.3
428 NCT00091442 55 9/7/2005 0.1
112 NCT00065325 62 10/15/2003 0.2
477 NCT00091442 62 11/11/2005 0.1
16277 NCT01843374 62 12/16/2013 0.2
17386 NCT01905657 62 1/8/2014 0.6
411 NCT00091442 66 1/12/2005 0


What I need to do is compare the enrollment of each date within ID to the one prior to it. If there is no date within the ID prior to it, then the comparison should not be made. For instance, for ID 26, there would be no comparison. Similarly, for ID 53, there would be no comparison for 5/12/2010, but I would like to compare 8/17/2010 to 5/12/2010, and then 1/19/2011 to 8/17/2010 (but not also to 5/12/2010). The output would ideally be a dot plot that plots the value of the current against the value of the prior (prior would be on y axis, current on the x axis). Finally, I would need to generate a column that subtracts the current value from the one just prior...

There are >20,000 data points. I've tried to write a script to look back to the prior, but I haven't been been able to control for ID. Also, I imagine it wouldn't be much different if I looked back one year, two years, five years, etc...?

Any help would be much appreciated.

Answer

As for the data process, I think what you want is to get difference in days between two dates. You can achieve this in a few ways. Here, I chose to use shift() in the data.table package. You specify type = "lag" and in the function and handle subtraction. You do this operation for each ID by specifying by = ID. I cannot visualize what kind of plot you have in your mind. I am happy to support you if you can clarify what you meant in your question.

library(tidyverse)
library(data.table)

setDT(mydf)[, Opendate := as.IDate(Opendate, format = "%m/%d/%Y")][,
    out := as.numeric(Opendate - shift(Opendate, type = "lag")), by = ID][,
    out := coalesce(out, 0)]


#    Number       Trial ID   Opendate Enrollment  out
# 1:    420 NCT00091442  9 2005-01-28        0.2    0
# 2:   1476 NCT00301457 26 2008-02-22        1.0    0
# 3:  10559 NCT01307397 34 2011-07-28        0.6    0
# 4:   6794 NCT00948675 53 2010-05-12        0.0    0
# 5:   6451 NCT00917384 53 2010-08-17        0.3   97
# 6:   8754 NCT01168973 53 2011-01-19        0.2  155
# 7:   8578 NCT01140347 53 2011-12-30        2.4  345
# 8:  11655 NCT01358877 53 2012-04-02        0.3   94
# 9:    428 NCT00091442 55 2005-09-07        0.1    0
#10:    112 NCT00065325 62 2003-10-15        0.2    0
#11:    477 NCT00091442 62 2005-11-11        0.1  758
#12:  16277 NCT01843374 62 2013-12-16        0.2 2957
#13:  17386 NCT01905657 62 2014-01-08        0.6   23
#14:    411 NCT00091442 66 2005-01-12        0.0    0

DATA

mydf <- structure(list(Number = c(420L, 1476L, 10559L, 6794L, 6451L, 
8754L, 8578L, 11655L, 428L, 112L, 477L, 16277L, 17386L, 411L), 
Trial = structure(c(2L, 3L, 8L, 5L, 4L, 7L, 6L, 9L, 2L, 1L, 
2L, 10L, 11L, 2L), .Label = c("NCT00065325", "NCT00091442", 
"NCT00301457", "NCT00917384", "NCT00948675", "NCT01140347", 
"NCT01168973", "NCT01307397", "NCT01358877", "NCT01843374", 
"NCT01905657"), class = "factor"), ID = c(9L, 26L, 34L, 53L, 
53L, 53L, 53L, 53L, 55L, 62L, 62L, 62L, 62L, 66L), Opendate = structure(c(3L, 
9L, 12L, 11L, 13L, 2L, 8L, 10L, 14L, 5L, 6L, 7L, 4L, 1L), .Label = c("1/12/2005", 
"1/19/2011", "1/28/2005", "1/8/2014", "10/15/2003", "11/11/2005", 
"12/16/2013", "12/30/2011", "2/22/2008", "4/2/2012", "5/12/2010", 
"7/28/2011", "8/17/2010", "9/7/2005"), class = "factor"), 
Enrollment = c(0.2, 1, 0.6, 0, 0.3, 0.2, 2.4, 0.3, 0.1, 0.2, 
0.1, 0.2, 0.6, 0)), .Names = c("Number", "Trial", "ID", "Opendate", 
"Enrollment"), class = "data.frame", row.names = c(NA, -14L))