Talvien Talvien - 4 days ago 7
R Question

R - count of items in line chart: match DateTime to count of items

I have a dataframe with the following structure:



df <- structure(list(Name = structure(1:9, .Label = c("task 1", "task 2",
"task 3", "task 4", "task 5", "task 6", "task 7", "task 8", "task 9"
), class = "factor"), Start = structure(c(1479799800, 1479800100,
1479800400, 1479800700, 1479801000, 1479801300, 1479801600, 1479801900,
1479802200), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1479801072,
1479800892, 1479801492, 1479802092, 1479802692, 1479803292, 1479803892,
1479804492, 1479805092), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("Name",
"Start", "End"), row.names = c(NA, -9L), class = "data.frame")




Now I want to count the items in column "Name" over time. They all have a start and end datetimes, which are formated as POSIXct.

With help of this solution here on SO I was able to do so (or at least I think I was) with following code:



library(data.table)
setDT(df)
dates = seq(min(df$Start), max(df$End), by = "min")
lookup = data.table(Start = dates, End = dates, key = c("Start", "End"))
ans = foverlaps(df, lookup, type = "any", which = TRUE)
library(ggplot2)
ggplot(ans[, .N, by = yid], aes(x = yid, y = N)) + geom_line()




Problem now:

How do I match my DateTime-scale to those integer values on the x-axis? Or is there a faster and better solution to solve my problem?

I tried to use
x = as.POSIXct(yid, format = "%Y-%m-%dT%H:%M:%S", origin = min(df$Start))
within the
aes
of the
ggplot()
. But that didn't work.

EDIT:

Here is some real life data, to see the problem of items where 0 is the count. They seem to be displayed as the latest not-0-count-value.



structure(list(Name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L,
7L, 7L, 8L, 8L, 8L, 8L, 8L, 9L, 10L), .Label = c("Project 1",
"Project 2", "Project 3", "Project 4",
"Project 5", "Project 6",
"Project 7", "Project 8", "Project 9",
"Project 1 10"), class = "factor"), Start = structure(c(1480115232,
1480115251, 1480115534, 1480197600, 1480197984, 1480200077, 1480200091,
1480197600, 1480200360, 1480132200, 1480132800, 1480132200, 1480133593,
1480118400, 1480119077, 1480120366, 1480128360, 1480128447, 1480129869,
1480129894, 1480118400, 1480130633, 1480130635, 1480130635, 1480136130,
1480136130, 1480136131, 1480143085, 1480143097, 1480130635, 1480143828,
1480114800, 1480114843, 1480114884, 1480114800, 1480114958, 1480116981,
1480117118, 1480119076, 1480115660, 1480115782, 1480115783, 1480119735,
1480119809, 1480115783, 1480120598, 1480122000, 1480141200), class = c("POSIXct",
"POSIXt"), tzone = ""), End = structure(c(1480115282, 1480115565,
1480116661, 1480198015, 1480200108, 1480200122, 1480200391, 1480201340,
1480201340, 1480132831, 1480133624, 1480135685, 1480135685, 1480119108,
1480120396, 1480128391, 1480128478, 1480129900, 1480129925, 1480130664,
1480130665, 1480130665, 1480130666, 1480136161, 1480136161, 1480136162,
1480143116, 1480143128, 1480143859, 1480149022, 1480149022, 1480114874,
1480114915, 1480114989, 1480115540, 1480115540, 1480117149, 1480119107,
1480122615, 1480115813, 1480115814, 1480119766, 1480119840, 1480120629,
1480121032, 1480121032, 1480122341, 1480142377), class = c("POSIXct",
"POSIXt"), tzone = "")), .Names = c("Name", "Start", "End"), row.names = c(NA,
-48L), class = c("data.table", "data.frame"), .internal.selfref = < pointer:0x0000000000150788 > )



Answer

In the tidyverse framework, this is a slightly different task -

  1. Generate the sames dates variable you have.
  2. Construct a data frame with all dates and all times (cartesian join)
  3. Filter out the rows that are not in the interval for each task
  4. Add up the tasks for each minute that remain
  5. Plot.

That looks something like this --

library(tidyverse)
library(lubridate)

dates = seq(min(df$Start), max(df$End), by = "min")
df %>% 
  mutate(key = 1) %>% 
  left_join(data_frame(key = 1, times = dates)) %>% 
  mutate(include = times %within% interval(Start, End)) %>% 
  filter(include) %>% 
  group_by(times) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(times, count)) +
  geom_line()
#> Joining, by = "key"

enter image description here

If you need it to be faster, it will almost certainly be faster using your original data.table code.

Consider this.

library(data.table)
setDT(df)
dates = seq(min(df$Start), max(df$End), by = "min")
lookup = data.table(Start = dates, End = dates, key = c("Start", "End"))
ans = foverlaps(df, lookup, type = "any", which = TRUE)

ans[, .N, by = yid] %>%
  mutate(time = min(df$Start) + minutes(yid)) %>%
  ggplot(aes(time, N)) +
  geom_line()

Now we use data.table to calculate the overlap, and then index time off the starting minute. Once we add a new column with the times, we can plot.

enter image description here

Comments