Konrad Konrad - 4 months ago 11
R Question

Automatically expending data frame with NAs values across any number of columns for missing dates

I'm interested in expending a data frame with missing values across any number of columns for the periods where data is missing following the data units.

Example



The problem can be easily illustrated on with use of a simple example.

Data



The generated data contains some time series observations and dates missing on random.

# Data generation

# Seed
set.seed(1)

# Size
sizeDf <- 10

# Populate data frame
dta <- data.frame(
dates = seq(
from = Sys.Date() - (sizeDf - 1),
to = Sys.Date(),
by = 1
),
varA = runif(n = sizeDf),
varB = runif(n = sizeDf),
varC = runif(n = sizeDf)
)

# Delete rows
dta <-
dta[-sample(1:sizeDf, replace = TRUE, size = round(sqrt(sizeDf), 0)),]


Preview



>> dta
dates varA varB varC
1 2016-07-28 0.26550866 0.2059746 0.93470523
2 2016-07-29 0.37212390 0.1765568 0.21214252
3 2016-07-30 0.57285336 0.6870228 0.65167377
4 2016-07-31 0.90820779 0.3841037 0.12555510
7 2016-08-03 0.94467527 0.7176185 0.01339033
8 2016-08-04 0.66079779 0.9919061 0.38238796
9 2016-08-05 0.62911404 0.3800352 0.86969085
10 2016-08-06 0.06178627 0.7774452 0.34034900


Key characteristics



From the perspective of the proposed analysis, the key characteristics are:


  • The date units, days in that case

  • Randomly missing dates



Missing dates

seq(
from = Sys.Date() - (sizeDf - 1),
to = Sys.Date(),
by = 1
)[!(seq(
from = Sys.Date() - (sizeDf - 1),
to = Sys.Date(),
by = 1
) %in% dta$dates)]

"2016-08-01" "2016-08-02"


Desired results



The newly created data frame should look like that:

>> dtaNew
dates varA varB varC
1 2016-07-28 0.3337749 0.32535215 0.8762692
2 2016-07-29 0.4763512 0.75708715 0.7789147
3 2016-07-30 0.8921983 0.20269226 0.7973088
4 2016-07-31 0.8643395 0.71112122 0.4552745
5 2016-08-01 NA NA NA
6 2016-08-02 NA NA NA
7 2016-08-03 0.9606180 0.14330438 0.6049333
8 2016-08-04 0.4346595 0.23962942 0.6547239
9 2016-08-05 0.7125147 0.05893438 0.3531973
10 2016-08-06 0.3999944 0.64228826 0.2702601


This simply obtained with use of:

dtaNew[dtaNew$dates %in% missDates, 2:4] <- NA


where the
missDates
is taken from the previous
seq
.




Attempts



Creating vector with all the dates is simple:

allDates <- seq(from = min(dta$dates), to = max(dta$dates), by = 1)


but obviously I cannot just push it to the data frame:

>> dta$allDates <- allDates
Error in `$<-.data.frame`(`*tmp*`, "allDates", value = c(17010, 17011, :
replacement has 10 rows, data has 8


The possible solution could use the loop that would push the row with
NA
values to the data frame row by row for each of the dates identified as missing but this is grossly inefficient and messy.




To sum up, I'm interested in achieving the following:


  1. Expanding the data frame with all the dates following the same unit. I.e. for missing daily data days are added, for missing quarterly data quarters are added.

  2. I would like to then push the NA values across all the columns in the data frame for where the missing date was found


Answer

If I understand your question, you can use rbind.fill from the plyr package to get your desired output:

sizeDf <- 10
# Populate data frame
dta <- data.frame(
  dates = seq(
    from = Sys.Date() - (sizeDf - 1),
    to = Sys.Date(),
    by = 1
  ),
  varA = runif(n = sizeDf),
  varB = runif(n = sizeDf),
  varC = runif(n = sizeDf)
)

# Delete rows
dta <-dta[-sample(1:sizeDf, replace = TRUE, size = round(sqrt(sizeDf), 0)),]

#Get missing dates
missing_dates <- seq(from=min(dta$dates), to=max(dta$dates), by=1)[!(seq(from=min(dta$dates), to=max(dta$dates), by=1) %in% dta$dates)]

#Create the new dataset by using plyr's rbind.fill function
dta_new <- plyr::rbind.fill(dta,data.frame(dates=missing_dates))

#Order the data by the dates column
dta_new <- dta_new[order(dta_new$dates),]

#Print it
print(dta_new, row.names = F, right = F)


 dates      varA        varB      varC      
 2016-07-28 0.837859418 0.2966637 0.61245244
 2016-07-29 0.144884547 0.9284294 0.11033990
 2016-07-30          NA        NA         NA
 2016-07-31          NA        NA         NA
 2016-08-01 0.003167049 0.9096805 0.29239470
 2016-08-02 0.574859760 0.1466993 0.69541969
 2016-08-03          NA        NA         NA
 2016-08-04 0.748639215 0.9602836 0.67681826
 2016-08-05 0.983939562 0.4867804 0.35270309
 2016-08-06 0.383366957 0.2241982 0.09244522

I hope this helps.