svnnf svnnf - 2 months ago 14
R Question

Splitting and aggregating a sequence of event logs into Intervals

Thanks to the help of other users i successfully could divide my data set into sequences and aggregate the responses for each sequence. A sequence is defined by the occurrence of a Stimuli (A or B) [before either one of those Stimuli in a user occurred it is the so called 0 Sequence]. This means that each user possibly has multiple sequences according to the amount of stimulis he perceives. Each User has event logs and i split the eventlogs according to the criteria above. I used the following code:

#change the date into posixct format
df$Date <- as.POSIXct(strptime(master$Date,"%d.%m.%Y %H:%M"))

#arrange the dataframe according to User and Date
df <- arrange(df, User,Date)

#create a unique ID for each stimuli combination
df$stims <- with(df, paste(cumsum(StimuliA), cumsum(StimuliB), sep="_"))

#aggregate all the eventlog rows according to the stimuli IDs
df1 <- aggregate(. ~ User + stims, data=df, sum)


Source: Summarize and count data in R with dplyr

Dataset:

structure(list(User = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), Date = c("02.12.2015 20:16", "03.12.2015 20:17",
"02.12.2015 20:44", "03.12.2015 09:32", "03.12.2015 09:33", "07.12.2015 08:18",
"08.12.2015 19:40", "08.12.2015 19:43", "22.12.2015 18:22", "22.12.2015 18:23",
"23.12.2015 14:18", "05.01.2016 11:35", "05.01.2016 13:21", "05.01.2016 13:22",
"05.01.2016 13:22", "04.08.2016 08:25"), StimuliA = c(0L, 0L,
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L), StimuliB = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L),
R2 = c(1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L,
0L, 0L, 0L), R3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 1L, 0L, 1L, 0L), R4 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), R5 = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), R6 = c(0L,
0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
), R7 = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 1L, 0L, 0L), User_Seq = c("1_0_0", "1_0_0", "1_0_0",
"1_0_0", "1_0_0", "1_1_0", "1_1_0", "1_1_0", "1_1_0", "1_1_0",
"1_2_0", "1_2_1", "1_2_1", "1_2_1", "1_2_1", "1_2_2")), .Names = c("User",
"Date", "StimuliA", "StimuliB", "R2", "R3", "R4", "R5", "R6",
"R7", "User_Seq"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-16L), spec = structure(list(cols = structure(list(User = structure(list(), class = c("collector_integer",
"collector")), Date = structure(list(), class = c("collector_character",
"collector")), StimuliA = structure(list(), class = c("collector_integer",
"collector")), StimuliB = structure(list(), class = c("collector_integer",
"collector")), R2 = structure(list(), class = c("collector_integer",
"collector")), R3 = structure(list(), class = c("collector_integer",
"collector")), R4 = structure(list(), class = c("collector_integer",
"collector")), R5 = structure(list(), class = c("collector_integer",
"collector")), R6 = structure(list(), class = c("collector_integer",
"collector")), R7 = structure(list(), class = c("collector_integer",
"collector")), User_Seq = structure(list(), class = c("collector_character",
"collector"))), .Names = c("User", "Date", "StimuliA", "StimuliB",
"R2", "R3", "R4", "R5", "R6", "R7", "User_Seq")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))


My goal is to adapt this code to create the same summary of sequences but split the responses into two parts. One for the first week after the date of the stimuli and then aggregate all the other "lagged" responses in that sequence besides.

I illustrated this in the sample below. It would also possible to do this in a long format with an additional column which identifies the lagged responses with 1/0 and the same date, but the optimal output would be to have it in a wide format.

User Da StimuliA StimuliB Seq_ID R2 R3 R4 R5 R6 R7 R2l R3l R4l R5l R6l R7l
1 02.12.2015 20:16 0 0 1_0_0 4 0 0 0 1 0 0 0 0 0 0 0
1 07.12.2015 08:18 1 0 1_1_0 1 0 0 0 0 1 2 0 0 0 0 0
1 23.12.2015 14:18 1 0 1_2_0 0 0 0 0 0 0 0 0 0 0 0 0
1 05.01.2016 11:35 0 1 1_2_1 0 2 0 0 0 1 0 1 0 0 0 0
1 04.08.2016 08:25 0 1 1_2_2 0 0 0 0 0 0 0 0 0 0 0 0


f.e As you can see here the line 9 & 10 from the sample were aggregated in R2l (Resoibse 2 lagged) because they occured one week after the 07.12.2015 08:18.

Answer Source

I found a solution for my problem. Basically i organize it by sequence id (Seqid) and Date and group it for the seqid. Then i create a new coloumn with the minimum date 7 days later. After that simply compare this earliest date plus 7 days with each normal date and put a 0 for first week values and 1 for the others.

df <- df %>%
        arrange(seqid, Date) %>% 
        group_by(seqid) %>%
        mutate(Date7 = (min(Date) + 604800)) %>%
        mutate(Group = ifelse(Date7>Date,0,1))

After that simply reshape it to a wide format like in the question.