Konrad - 2 months ago 8
R Question

Automatically expanding data frame with NAs values across any number of columns for missing dates

I'm interested in expanding a data frame with missing values across any number of columns for the periods where data is missing following the data units.

Example

The problem can be easily illustrated on with use of a simple example.

Data

The generated data contains some time series observations and dates missing on random.

# Data generation

# Seed
set.seed(1)

# Size
sizeDf <- 10

# Populate data frame
dta <- data.frame(
dates = seq(
from = Sys.Date() - (sizeDf - 1),
to = Sys.Date(),
by = 1
),
varA = runif(n = sizeDf),
varB = runif(n = sizeDf),
varC = runif(n = sizeDf)
)

# Delete rows
dta <-
dta[-sample(1:sizeDf, replace = TRUE, size = round(sqrt(sizeDf), 0)),]

Preview

>> dta
dates varA varB varC
1 2016-07-28 0.26550866 0.2059746 0.93470523
2 2016-07-29 0.37212390 0.1765568 0.21214252
3 2016-07-30 0.57285336 0.6870228 0.65167377
4 2016-07-31 0.90820779 0.3841037 0.12555510
7 2016-08-03 0.94467527 0.7176185 0.01339033
8 2016-08-04 0.66079779 0.9919061 0.38238796
9 2016-08-05 0.62911404 0.3800352 0.86969085
10 2016-08-06 0.06178627 0.7774452 0.34034900

Key characteristics

From the perspective of the proposed analysis, the key characteristics are:

• The date units, days in that case

• Randomly missing dates

Missing dates

seq(
from = Sys.Date() - (sizeDf - 1),
to = Sys.Date(),
by = 1
)[!(seq(
from = Sys.Date() - (sizeDf - 1),
to = Sys.Date(),
by = 1
) %in% dta\$dates)]

"2016-08-01" "2016-08-02"

Desired results

The newly created data frame should look like that:

>> dtaNew
dates varA varB varC
1 2016-07-28 0.3337749 0.32535215 0.8762692
2 2016-07-29 0.4763512 0.75708715 0.7789147
3 2016-07-30 0.8921983 0.20269226 0.7973088
4 2016-07-31 0.8643395 0.71112122 0.4552745
5 2016-08-01 NA NA NA
6 2016-08-02 NA NA NA
7 2016-08-03 0.9606180 0.14330438 0.6049333
8 2016-08-04 0.4346595 0.23962942 0.6547239
9 2016-08-05 0.7125147 0.05893438 0.3531973
10 2016-08-06 0.3999944 0.64228826 0.2702601

This simply obtained with use of:

dtaNew[dtaNew\$dates %in% missDates, 2:4] <- NA

where the
missDates
is taken from the previous
seq
.

Attempts

Creating vector with all the dates is simple:

allDates <- seq(from = min(dta\$dates), to = max(dta\$dates), by = 1)

but obviously I cannot just push it to the data frame:

>> dta\$allDates <- allDates
Error in `\$<-.data.frame`(`*tmp*`, "allDates", value = c(17010, 17011, :
replacement has 10 rows, data has 8

The possible solution could use the loop that would push the row with
NA
values to the data frame row by row for each of the dates identified as missing but this is grossly inefficient and messy.

To sum up, I'm interested in achieving the following:

1. Expanding the data frame with all the dates following the same unit. I.e. for missing daily data days are added, for missing quarterly data quarters are added.

2. I would like to then push the NA values across all the columns in the data frame for where the missing date was found

Answer

If I understand your question, you can use rbind.fill from the plyr package to get your desired output:

sizeDf <- 10
# Populate data frame
dta <- data.frame(
dates = seq(
from = Sys.Date() - (sizeDf - 1),
to = Sys.Date(),
by = 1
),
varA = runif(n = sizeDf),
varB = runif(n = sizeDf),
varC = runif(n = sizeDf)
)

# Delete rows
dta <-dta[-sample(1:sizeDf, replace = TRUE, size = round(sqrt(sizeDf), 0)),]

#Get missing dates
missing_dates <- seq(from=min(dta\$dates), to=max(dta\$dates), by=1)[!(seq(from=min(dta\$dates), to=max(dta\$dates), by=1) %in% dta\$dates)]

#Create the new dataset by using plyr's rbind.fill function
dta_new <- plyr::rbind.fill(dta,data.frame(dates=missing_dates))

#Order the data by the dates column
dta_new <- dta_new[order(dta_new\$dates),]

#Print it
print(dta_new, row.names = F, right = F)

dates      varA        varB      varC
2016-07-28 0.837859418 0.2966637 0.61245244
2016-07-29 0.144884547 0.9284294 0.11033990
2016-07-30          NA        NA         NA
2016-07-31          NA        NA         NA
2016-08-01 0.003167049 0.9096805 0.29239470
2016-08-02 0.574859760 0.1466993 0.69541969
2016-08-03          NA        NA         NA
2016-08-04 0.748639215 0.9602836 0.67681826
2016-08-05 0.983939562 0.4867804 0.35270309
2016-08-06 0.383366957 0.2241982 0.09244522

I hope this helps.

Comments