user3032689 user3032689 - 2 months ago 16
R Question

Transform into balanced panel data

I have an unbalanced panel like the following example:

test <- read.table(
text = "
A 2010-01-01 1 rdm
A 2010-01-10 2 dfg
A 2010-01-14 3 fdgfd
A 2010-02-15 4 fdgfd
A 2010-08-17 5 dg
A 2010-12-19 6 dfg
B 2009-01-01 1 dfg
B 2010-01-01 2 ydg
B 2010-01-10 3 fdgfd
B 2010-01-14 4 dfg
B 2010-02-15 5 dfg
",header=F)
library(data.table)
setDT(test)
names(test) <- c("ID", "date", "nr", "namecol")


I would like to balance it with regard to date, i.e. every individual (A, B, etc.) has NA rows for the dates where there is no data. I do not know the minimum date per group or the minimum date across groups to begin with. Same with the maximum, but maybe it is faster to just choose a maximum which equals a certain date (as compared to calculating it across groups).
The desired output is:

out <- read.table(
text = "
A 2009-01-01 NA NA
A 2010-01-01 1 rdm
A 2010-01-10 2 dfg
A 2010-01-14 3 fdgfd
A 2010-02-15 4 fdgfd
A 2010-08-17 5 dg
A 2010-12-19 6 dfg
B 2009-01-01 1 dfg
B 2010-01-01 2 ydg
B 2010-01-10 3 fdgfd
B 2010-01-14 4 dfg
B 2010-02-15 5 dfg
B 2010-08-17 NA NA
B 2010-12-19 NA NA
",header=F)
setDT(out)
names(out) <- c("ID", "date", "nr", "namecol")


My data set is very large, so I believe it would be best to do this in
data.table
(or
plyr
,
reshape2
) or something similar suitable.

Answer

We do a cross join (CJ) with unique 'ID', and 'date' of the dataset after setting the key columns as 'ID' and 'date' and then do a join with the original dataset.

setDT(test, key = c("ID", "date"))[CJ(ID, date, unique=TRUE)]
#    ID       date nr namecol
# 1:  A 2009-01-01 NA      NA
# 2:  A 2010-01-01  1     rdm
# 3:  A 2010-01-10  2     dfg
# 4:  A 2010-01-14  3   fdgfd
# 5:  A 2010-02-15  4   fdgfd
# 6:  A 2010-08-17  5      dg
# 7:  A 2010-12-19  6     dfg
# 8:  B 2009-01-01  1     dfg
# 9:  B 2010-01-01  2     ydg
#10:  B 2010-01-10  3   fdgfd
#11:  B 2010-01-14  4     dfg
#12:  B 2010-02-15  5     dfg
#13:  B 2010-08-17 NA      NA
#14:  B 2010-12-19 NA      NA

data

test <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "B", "B", 
"B", "B", "B"), date = structure(c(14610, 14619, 14623, 14655, 
14838, 14962, 14245, 14610, 14619, 14623, 14655), class = "Date"), 
nr = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L), namecol = c("rdm", 
"dfg", "fdgfd", "fdgfd", "dg", "dfg", "dfg", "ydg", "fdgfd", 
"dfg", "dfg")), .Names = c("ID", "date", "nr", "namecol"),
 row.names = c(NA, -11L), class = "data.frame")