lowndrul lowndrul - 1 year ago 114
R Question

How to fill in missing dates in range by group

I have a

of groups and dates. How do I fill in all the missing dates in the min-max date range for each group?

Ideally I would do this in
. But ultimately, I'd just like to do this efficiently with as few lines of (readable) code as possible. Below is a minimal example. I actually have many dates and groups. Both of my approaches look kind of ugly. There has to be a better way, right?

#### setup ####

df <- data.frame(the_group = rep(LETTERS[1:2], each=3), date = Sys.Date() + c(0:2, 1:3), stringsAsFactors = F) %>%
tbl_df() %>%
slice(-2) # represents that I may be missing data in a range!

#### dplyr approach with cross join dummy ####
full_seq <- data.frame(cross_join_dummy = 1, date = seq.Date(from=min(df$date), to=max(df$date), by = "day"))

range_by_group <- df %>%
group_by(the_group) %>%
summarise(min_date = min(date), max_date = max(date)) %>%
ungroup() %>%
mutate(cross_join_dummy = 1)

desired <- range_by_group %>%
inner_join(full_seq, by="cross_join_dummy") %>%
filter(date >= min_date, date <= max_date) %>%
select(the_group, date)

#### sqldf approach ####
full_seq <- data.frame(date = as.character(seq.Date(from=min(df$date), to=max(df$date), by="day")))

df <- df %>%
mutate(date = as.character(date))

range_by_group <- sqldf("
SELECT the_group, MIN(date) AS min_date, MAX(date) AS max_date
GROUP BY the_group

desired <- sqldf("
SELECT rbg.the_group, fs.date
FROM range_by_group rbg
JOIN full_seq fs
ON fs.date BETWEEN rbg.min_date AND rbg.max_date

Answer Source

1) no packages - by

This does not use any packages. by splits df by df$the_group and then performs the indicated operation on each one. do.call("rbind", ...) puts the groups back together.

seq_date <- function(x) seq(min(x), max(x), by = "day")
do.call("rbind", by(df, df$the_group, with, 
  data.frame(the_group = the_group[1], date = seq_date(date))))

2) data.table Here is a solution using data.table. seq_date is from (1)


dt <- as.data.table(df)
dt[, list(date = seq_date(date)), by = the_group]

3) tidyverse This uses map_df from purrr to apply the function given in formula notation over the groups and to put the result together into a data frame. data_frame is from the tibble package. seq_date is from (1).


df %>% 
   split(.$the_group) %>% 
   map_df(~ data_frame(the_group = .$the_group[1], date = seq_date(.$date)))

4) tapply

4a) tapply - tidyr/reshape2 seq_date is from (1).


df %>%
   { tapply(.$date, .$the_group, seq_date, simplify = FALSE) } %>%
   melt %>%

4b) tapply - no packages The last line pieces together the output of tapply avoiding the need for any packages. seq_date is from (1).

ta <- tapply(df$date, df$the_group, seq_date, simplify = FALSE)
data.frame(the_group = rep(names(ta), lengths(ta)), date = do.call("c", ta))

4c) tapply - lattice We can use the lattice package's make.groups on ta from (4b). lattice is pre-installed with R so it does not involve installing any additional packages. Unfortunately make.groups removes the Date class attribute so we have to put it back. Also make.groups uses which and data column names so we fix up the column names.

with(do.call("make.groups", ta), 
  data.frame(the_group = which, date = structure(data, class = "Date")))

4d) tapply - no packages - stack We can use stack to convert ta from (4b) to the desired form provided we remove the "Date" class first. Then after applying stack we can restore the "Date" class. stack uses hard coded column names which we replace using setNames.

stack_dates <- function(x) 
  transform(stack(lapply(x, as.vector)), values = structure(values, class = "Date"))
setNames(stack_dates(ta)[2:1], c("the_group", "date"))
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download