L. Molenaar - 2 months ago
R Question

How to generate a session number in clickstream data in R?

I want to add a new variable that indicates the session number of each click.

My dataset looks like this (each row represents a click):

head(test)

  CustomerID UserID Page
1          1      1    A
2          1      1    B
3          1      1    C
4          1      1    D
5          2      2    A
6          2      2    B


Because each user can generate multiple clickstreams, I want to assign a session number to each click. The rule is: each time a user (UserID) appears with a new CustomerID, that starts a new session for that user.

I would want it like this:

   CustomerID UserID Page Session
1           1      1    A       1
2           1      1    B       1
3           1      1    C       1
4           1      1    D       1
5           2      2    A       1
6           2      2    B       1
7           2      2    E       1
8           2      2    F       1
9           3      3    A       1
10          3      3    B       1
11          3      3    C       1
12          3      3    G       1
13          3      3    H       1
14          3      3    I       1
15          4      4    A       1
16          4      4    B       1
17          4      4    C       1
18          4      4    D       1
19          4      4    E       1
20          5      5    A       1
21          5      5    B       1
22          6      6    A       1
23          6      6    B       1
24          7      1    A       2
25          7      1    B       2
26          8      2    A       2
27          8      2    B       2
28          8      2    C       2
29          8      2    G       2
30          8      2    H       2


I tried to solve it with group_by() and mutate(). However, I think I need something like an ifelse() statement to assign the right session numbers. I hope someone can help me out!

What I've tried:

test <- test %>% group_by(CustomerID, UserID) %>% mutate(Session = )

Answer

Maybe this helps (assuming 'CustomerID' and 'UserID' are ordered, so that a drop in 'UserID' from one row to the next marks the start of a new round of sessions):

library(dplyr)
# Increment the session counter whenever UserID is lower than in the previous row
test %>%
     mutate(Session = cumsum(c(TRUE, diff(UserID) < 0)))
#    CustomerID UserID Page Session
#1           1      1    A       1
#2           1      1    B       1
#3           1      1    C       1
#4           1      1    D       1
#5           2      2    A       1
#6           2      2    B       1
#7           2      2    E       1
#8           2      2    F       1
#9           3      3    A       1
#10          3      3    B       1
#11          3      3    C       1
#12          3      3    G       1
#13          3      3    H       1
#14          3      3    I       1
#15          4      4    A       1
#16          4      4    B       1
#17          4      4    C       1
#18          4      4    D       1
#19          4      4    E       1
#20          5      5    A       1
#21          5      5    B       1
#22          6      6    A       1
#23          6      6    B       1
#24          7      1    A       2
#25          7      1    B       2
#26          8      2    A       2
#27          8      2    B       2
#28          8      2    C       2
#29          8      2    G       2
#30          8      2    H       2
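
To see why the cumsum()/diff() expression produces these numbers, it can help to run it step by step on a short vector. The sketch below uses a made-up `user_id` vector (hypothetical, not taken from the question's data), and the intermediate names `step` and `new_pass` are only there for illustration.

```r
# Hypothetical, shortened UserID column: users 1, 2, 3 visit, then 1 and 2 return.
user_id <- c(1L, 1L, 2L, 3L, 3L, 1L, 2L, 2L)

# diff() gives the change between consecutive rows (length n - 1).
step <- diff(user_id)

# A negative step means the UserID sequence has dropped back to a lower ID.
# The leading TRUE makes the first row open session 1.
new_pass <- c(TRUE, step < 0)

# cumsum() counts how many such drops (plus the initial TRUE) have occurred so far.
session <- cumsum(new_pass)

print(session)
```

Note that the counter only moves when UserID decreases. If a user came back directly after their own previous visit, or after a user with a lower ID, there would be no drop and the session number would not change, which is why the ordering assumption matters.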

Data

test <- structure(list(CustomerID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 7L, 
7L, 8L, 8L, 8L, 8L, 8L), UserID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L), Page = c("A", "B", "C", "D", "A", 
"B", "E", "F", "A", "B", "C", "G", "H", "I", "A", "B", "C", "D", 
"E", "A", "B", "A", "B", "A", "B", "A", "B", "C", "G", "H")),
.Names = c("CustomerID", 
"UserID", "Page"), row.names = c("1", "2", "3", "4", "5", "6", 
"7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", 
"18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", 
"29", "30"), class = "data.frame")