the_darkside - 6 months ago 41
R Question

# tracking a cohort over time in R

I have a sample dataset of user ids and months in which a transaction was made. My goal is to calculate, month over month, how many of the original users made transactions. In other words, how many users that were new in January also made transactions in February, March, and April. How many users that were new in February made transactions in March and April, and so on.

``````> data
date user_id
1  Jan 2017       1
2  Jan 2017       2
3  Jan 2017       3
4  Jan 2017       4
5  Jan 2017       5
6  Feb 2017       1
7  Feb 2017       3
8  Feb 2017       5
9  Feb 2017       7
10 Feb 2017       9
11 Mar 2017       2
12 Mar 2017       4
13 Mar 2017       6
14 Mar 2017       8
15 Mar 2017      10
16 Apr 2017       1
17 Apr 2017       3
18 Apr 2017       6
19 Apr 2017       9
20 Apr 2017      12
``````

The output of this dataset would look something like this:

``````> output
Jan Feb Mar Apr
Jan   5   3   2   2
Feb  NA   2   0   1
Mar  NA  NA   3   1
Apr  NA  NA  NA   1
``````

So far the only way I can think of doing this is to split the dataset and then calculate the unique ids for each month that are not present in the previous months, but this method is verbose and is not suited for a large dataset with many months.

``````subsets <-split(data, data\$date, drop=TRUE)

for (i in 1:length(subsets)) {
assign(paste0("M", i), as.data.frame(subsets[[i]]))
}

M1_ids <- unique(M1\$user_id)
M2_ids <- unique(M2\$user_id)
M3_ids <- unique(M3\$user_id)
M4_ids <- unique(M4\$user_id)

M2_ids <- unique(setdiff(M2_ids, unique(M1_ids)))
M3_ids <- unique(setdiff(M3_ids, unique(c(M2_ids, M1_ids))))
M4_ids <- unique(setdiff(M4_ids, unique(c(M3_ids, M2_ids, M1_ids))))
``````

Is there a way in R to come up with the above output with a shorter method using
`dplyr`
or even base R? The real data set has many years and months.

The format of the data is as follows:

``````> sapply(data, class)
date   user_id
"yearmon" "integer"
``````

And the sample data:

``````> dput(data)
structure(list(date = structure(c(2017, 2017, 2017, 2017, 2017,
2017.08333333333, 2017.08333333333, 2017.08333333333, 2017.08333333333,
2017.08333333333, 2017.16666666667, 2017.16666666667, 2017.16666666667,
2017.16666666667, 2017.16666666667, 2017.25, 2017.25, 2017.25,
2017.25, 2017.25), class = "yearmon"), user_id = c(1L, 2L, 3L,
4L, 5L, 1L, 3L, 5L, 7L, 9L, 2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L,
9L, 12L)), .Names = c("date", "user_id"), row.names = c(NA, -20L
), class = "data.frame")
``````

Here's an example:

``````library(data.table)
library(zoo)
data <- structure(list(date = structure(c(2017, 2017, 2017, 2017, 2017,
2017.08333333333, 2017.08333333333, 2017.08333333333, 2017.08333333333,
2017.08333333333, 2017.16666666667, 2017.16666666667, 2017.16666666667,
2017.16666666667, 2017.16666666667, 2017.25, 2017.25, 2017.25,
2017.25, 2017.25), class = "yearmon"), user_id = c(1L, 2L, 3L,
4L, 5L, 1L, 3L, 5L, 7L, 9L, 2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L,
9L, 12L)), .Names = c("date", "user_id"), row.names = c(NA, -20L
), class = "data.frame")
data <- data[c(1,1:nrow(data)),]
setDT(data)
(cohorts <- dcast(unique(data)[,cohort:=min(date),by=user_id],cohort~date))
#      cohort Jan 2017 Feb 2017 Mrz 2017 Apr 2017
# 1: Jan 2017        5        3        2        2
# 2: Feb 2017        0        2        0        1
# 3: Mrz 2017        0        0        3        1
# 4: Apr 2017        0        0        0        1

m <- as.matrix(cohorts[,-1])
rownames(m) <- cohorts[[1]]
m[lower.tri(m)] <- NA
names(dimnames(m)) <- c("cohort", "yearmon")
m
#           yearmon
# cohort     Jan 2017 Feb 2017 Mrz 2017 Apr 2017
#   Jan 2017        5        3        2        2
#   Feb 2017       NA        2        0        1
#   Mrz 2017       NA       NA        3        1
#   Apr 2017       NA       NA       NA        1
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download