Todd Young - 8 months ago
R Question

Calculating average time difference by group using dplyr

Say I have the following data frame representing the dates on which users at various companies registered an application:

df <- data.frame(user = c("Tia", "Sam", "Matt", "Brandy", "Joe", "Nariko"),
                 company = c("Intel", "Intel", "Nvidia", "Nvidia", "Nvidia", "Google"),
                 registrationDate = as.Date(c("2015-01-04", "2015-01-04", "2015-01-19",
                                              "2015-01-20", "2015-01-20", "2015-01-25")),
                 stringsAsFactors = FALSE)

How do I create a vector that gives, for each company, the average time difference between its users' registration dates?

I am having some trouble getting simple summary statistics by company over the date variable. For example, when I try to find the maximum registration date for each company using dplyr:

df %>%
  group_by(company) %>%
  mutate(maxDate = max(registrationDate))

I get the maximum date over the entire registrationDate vector, replicated for each row in the data frame. It is as though max() ignores the grouping set up by group_by().
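One common cause of this symptom is that another attached package masks dplyr's verbs. This is an assumption here, since the question does not say which packages are loaded. For example, if plyr is attached after dplyr, a bare mutate() resolves to plyr's version, which does not honour group_by(). A minimal sketch that sidesteps this by calling dplyr's functions with an explicit namespace:

```r
library(dplyr)

df <- data.frame(user = c("Tia", "Sam", "Matt", "Brandy", "Joe", "Nariko"),
                 company = c("Intel", "Intel", "Nvidia", "Nvidia", "Nvidia", "Google"),
                 registrationDate = as.Date(c("2015-01-04", "2015-01-04", "2015-01-19",
                                              "2015-01-20", "2015-01-20", "2015-01-25")),
                 stringsAsFactors = FALSE)

# Namespace-qualified calls ensure dplyr's verbs are used even when another
# attached package (for example plyr) defines functions with the same names.
res <- df %>%
  dplyr::group_by(company) %>%
  dplyr::mutate(maxDate = max(registrationDate))

res
```

With dplyr's mutate() in effect, maxDate is computed within each company rather than over the whole column.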

df %>% group_by(company) %>% 
  mutate(AvgTime = (max(registrationDate)-min(registrationDate))/length(company))

    user company registrationDate        AvgTime
1    Tia   Intel       2015-01-04 0.0000000 days
2    Sam   Intel       2015-01-04 0.0000000 days
3   Matt  Nvidia       2015-01-19 0.3333333 days
4 Brandy  Nvidia       2015-01-20 0.3333333 days
5    Joe  Nvidia       2015-01-20 0.3333333 days
6 Nariko  Google       2015-01-25 0.0000000 days
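
The output above repeats the value on every row, whereas the question asks for a vector with one value per company. The following is a sketch of one way to get that, under the assumption that "average time difference" means the mean gap between consecutive registration dates within a company. The names avg_gap, n_users, avgGapDays and avgTime are made up for illustration. Note that this divides the date span by the number of gaps (n - 1) rather than by the number of users, so Nvidia comes out at 0.5 days instead of 0.3333333.

```r
library(dplyr)

df <- data.frame(user = c("Tia", "Sam", "Matt", "Brandy", "Joe", "Nariko"),
                 company = c("Intel", "Intel", "Nvidia", "Nvidia", "Nvidia", "Google"),
                 registrationDate = as.Date(c("2015-01-04", "2015-01-04", "2015-01-19",
                                              "2015-01-20", "2015-01-20", "2015-01-25")),
                 stringsAsFactors = FALSE)

# summarise() collapses each group to a single row.
avg_gap <- df %>%
  dplyr::group_by(company) %>%
  dplyr::summarise(
    n_users = dplyr::n(),
    # Mean gap in days between consecutive registrations;
    # NA when a company has only one user, since there is no gap to measure.
    avgGapDays = if (dplyr::n() > 1) {
      mean(as.numeric(diff(sort(registrationDate))))
    } else {
      NA_real_
    }
  )

# Named numeric vector: one value per company.
avgTime <- setNames(avg_gap$avgGapDays, avg_gap$company)
avgTime
```

A company with a single user, such as Google here, gets NA because no difference can be computed. Whether that should be NA or 0 depends on how the result will be used.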