the_darkside the_darkside - 27 days ago 8
R Question

Aggregating combination sequences

I have sample dataset that I would like to aggregate by

user_id
. Each record represents a sign-up.

> test
user_id time plan
1 1 2017-06-23 20:00:00 monthly
2 2 2017-07-20 20:00:00 monthly
3 3 2017-06-03 20:00:00 monthly
4 1 2017-07-03 20:00:00 monthly
5 2 2017-05-11 20:00:00 yearly
6 3 2017-07-27 20:00:00 yearly
7 1 2017-05-09 20:00:00 yearly
8 2 2017-01-15 19:00:00 yearly
9 3 2017-08-18 20:00:00 yearly
10 1 2017-01-30 19:00:00 monthly


Each user has signed up for different plans in a different order(
time
).
For example, the sequence of user 1 is
monthly-yearly-monthly- monthly
, therefore user 1 has switched two times.

User 2 has
yearly-yearly-monthly
, therefore user 2 has switched once

User 3 has gone from
monthly-yearly-yearly
, therefore user 3 has switched once.

> test[order(test$time),]
user_id time plan
8 2 2017-01-15 19:00:00 yearly
10 1 2017-01-30 19:00:00 monthly
7 1 2017-05-09 20:00:00 yearly
5 2 2017-05-11 20:00:00 yearly
3 3 2017-06-03 20:00:00 monthly
1 1 2017-06-23 20:00:00 monthly
4 1 2017-07-03 20:00:00 monthly
2 2 2017-07-20 20:00:00 monthly
6 3 2017-07-27 20:00:00 yearly
9 3 2017-08-18 20:00:00 yearly


My objective is to summarize the combinations of switches, in other words, to summarize how many users have gone from
yearly
to
monthly
, how many have gone from
monthly
to
yearly
, and how many have switched plans multiple times. The output of the following dataset might look something like this:

> output
type count
1 monthly-yearly 1
2 yearly-monthly 1
3 multiple 1


How would one go about grouping by
user_id
and then reducing the sequences of strings in R to either
multiple
,
monthly-yearly
, or
yearly-monthly
? Any suggestions or advice would be appreciated.

The dataset above:

> dput(test)
structure(list(user_id = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1), time = structure(c(1498262400,
1500595200, 1496534400, 1499126400, 1494547200, 1501200000, 1494374400,
1484524800, 1503100800, 1485820800), class = c("POSIXct", "POSIXt"
)), plan = c("monthly", "monthly", "monthly", "monthly", "yearly",
"yearly", "yearly", "yearly", "yearly", "monthly")), .Names = c("user_id",
"time", "plan"), row.names = c(NA, -10L), class = "data.frame")

Answer Source

Here is one way to do it using dplyr and the useful rle function (run-length encoding)..

library(dplyr)

output <- test %>% group_by(user_id) %>% #group by id
      arrange(time) %>%                  #sort by date
      summarise(first=first(plan),switches=length(rle(plan)$values)) %>% 
                                         #find first plan and number of switches
      mutate(type=ifelse(switches>2,"multiple",
                     ifelse(first=="monthly","monthly-yearly","yearly-monthly"))) %>% 
                                         #convert these to your three types
      count(type)                        #short for group_by and n()

output
            type     n
           <chr> <int>
1 monthly-yearly     1
2       multiple     1
3 yearly-monthly     1