the_darkside the_darkside - 1 month ago 14
R Question

tracking a cohort over time in R

I have a sample dataset of user ids and months in which a transaction was made. My goal is to calculate, month over month, how many of the original users made transactions. In other words, how many users that were new in January also made transactions in February, March, and April. How many users that were new in February made transactions in March and April, and so on.

> data
date user_id
1 Jan 2017 1
2 Jan 2017 2
3 Jan 2017 3
4 Jan 2017 4
5 Jan 2017 5
6 Feb 2017 1
7 Feb 2017 3
8 Feb 2017 5
9 Feb 2017 7
10 Feb 2017 9
11 Mar 2017 2
12 Mar 2017 4
13 Mar 2017 6
14 Mar 2017 8
15 Mar 2017 10
16 Apr 2017 1
17 Apr 2017 3
18 Apr 2017 6
19 Apr 2017 9
20 Apr 2017 12


The output of this dataset would look something like this:

> output
Jan Feb Mar Apr
Jan 5 3 2 2
Feb NA 2 0 1
Mar NA NA 3 1
Apr NA NA NA 1


So far the only way I can think of doing this is to split the dataset and then calculate the unique ids for each month that are not present in the previous months, but this method is verbose and is not suited for a large dataset with many months.

subsets <-split(data, data$date, drop=TRUE)

for (i in 1:length(subsets)) {
assign(paste0("M", i), as.data.frame(subsets[[i]]))
}

M1_ids <- unique(M1$user_id)
M2_ids <- unique(M2$user_id)
M3_ids <- unique(M3$user_id)
M4_ids <- unique(M4$user_id)


M2_ids <- unique(setdiff(M2_ids, unique(M1_ids)))
M3_ids <- unique(setdiff(M3_ids, unique(c(M2_ids, M1_ids))))
M4_ids <- unique(setdiff(M4_ids, unique(c(M3_ids, M2_ids, M1_ids))))


Is there a way in R to come up with the above output with a shorter method using
dplyr
or even base R? The real data set has many years and months.

The format of the data is as follows:

> sapply(data, class)
date user_id
"yearmon" "integer"


And the sample data:

> dput(data)
structure(list(date = structure(c(2017, 2017, 2017, 2017, 2017,
2017.08333333333, 2017.08333333333, 2017.08333333333, 2017.08333333333,
2017.08333333333, 2017.16666666667, 2017.16666666667, 2017.16666666667,
2017.16666666667, 2017.16666666667, 2017.25, 2017.25, 2017.25,
2017.25, 2017.25), class = "yearmon"), user_id = c(1L, 2L, 3L,
4L, 5L, 1L, 3L, 5L, 7L, 9L, 2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L,
9L, 12L)), .Names = c("date", "user_id"), row.names = c(NA, -20L
), class = "data.frame")

Answer Source

Here's an example:

library(data.table)
library(zoo)
data <- structure(list(date = structure(c(2017, 2017, 2017, 2017, 2017, 
2017.08333333333, 2017.08333333333, 2017.08333333333, 2017.08333333333, 
2017.08333333333, 2017.16666666667, 2017.16666666667, 2017.16666666667, 
2017.16666666667, 2017.16666666667, 2017.25, 2017.25, 2017.25, 
2017.25, 2017.25), class = "yearmon"), user_id = c(1L, 2L, 3L, 
4L, 5L, 1L, 3L, 5L, 7L, 9L, 2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L, 
9L, 12L)), .Names = c("date", "user_id"), row.names = c(NA, -20L
), class = "data.frame")
data <- data[c(1,1:nrow(data)),]
setDT(data)
(cohorts <- dcast(unique(data)[,cohort:=min(date),by=user_id],cohort~date))
#      cohort Jan 2017 Feb 2017 Mrz 2017 Apr 2017
# 1: Jan 2017        5        3        2        2
# 2: Feb 2017        0        2        0        1
# 3: Mrz 2017        0        0        3        1
# 4: Apr 2017        0        0        0        1

m <- as.matrix(cohorts[,-1])
rownames(m) <- cohorts[[1]]
m[lower.tri(m)] <- NA
names(dimnames(m)) <- c("cohort", "yearmon") 
m
#           yearmon
# cohort     Jan 2017 Feb 2017 Mrz 2017 Apr 2017
#   Jan 2017        5        3        2        2
#   Feb 2017       NA        2        0        1
#   Mrz 2017       NA       NA        3        1
#   Apr 2017       NA       NA       NA        1