M.Qasim M.Qasim - 4 months ago 19
R Question

Estimate gaps between multiple dates

Is there any way to find gaps between multiple timelines. For example my data looks like the following:


df <- data.frame(Language = structure(c(rep("English",7), rep("German",5), rep("French", 10)), class = "character"),
Students = c(LETTERS[1:7], LETTERS[1:5], LETTERS[1:10]),
Start = structure(c(16713,16713,16713,16744,16713,16714,16754,16729,16729,16729,16750,16769,
16724,16724,16745,16724,16759,16766,16723,16722,16736,16796), class = "Date"),
End = structure(c(16762,16720,16762,16755,16720,16764,16762,16765,16765,16749,16761,16770,16758,
16744,16758,16764,16765,16766,16726,16723,16758,16806), class = "Date"))

ddply(df, .(Language), summarise,
FirstDay = min(Start),
LastDay = max(End),
Duration = LastDay - FirstDay)

plot(gvisTimeline(data=df, rowlabel = "Class", start = "Start", end = "End", options=list(width=600, height=1000) ))

I am after calculating gaps when none students are taking a class. Gaps are highlighted in red in the following chart.

enter image description here


This is a fairly classic problem. The solution about this is to filter rows based on if the start date is larger than previous maximum end date assuming rows are sorted by start date before hand. lag function and cummax() can be used to find out previous max end date, and since cummax() is not defined for Date class, we can convert it to integer, apply cummax and then convert it back:

df %>% 
       arrange(Start) %>% group_by(Language) %>% 
       mutate(End_Max = lag(as.Date(cummax(as.integer(End)), "1970-01-01"))) %>% 
       filter(Start > End_Max + 1) %>% select(Language, End_Max, Start)

# Source: local data frame [2 x 3]
# Groups: Language [2]

#  Language    End_Max      Start
#    <fctr>     <date>     <date>
#1   German 2015-11-26 2015-11-30
#2   French 2015-11-27 2015-12-27