Smasell Smasell - 3 months ago 12
R Question

How to know last log before every event? R language

Here is my table:

user_id event timestamp
Rob business 111111
Rob progress 111112
Rob business 222222
Mike progress 111111
Mike progress 222222
Rob progress 000001
Mike business 333333
Mike progress 444444
Lee progress 111111
Lee progress 222222
Mike business 333334


Dput table:

dput(input)
df <- structure(list(user_id = structure(c(3L, 3L, 3L, 2L, 2L, 3L, 2L, 2L, 1L, 1L, 2L),
.Label = c("Lee", "Mike", "Rob"), class = "factor"),
event = structure(c(1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L),
.Label = c("business", "progress"), class = "factor"),
timestamp = c(111111,111112, 222222, 111111, 222222, 1, 333333, 444444, 111111, 222222, 333334)),
.Names = c("user_id", "event", "timestamp"), row.names = c(NA, -11L), class = "data.frame")


I want to know last
progress
event before every
business
event happens for every
user_id
(output):

user_id event timestamp
Mike progress 222222
Mike progress 222222
Rob progress 111112
Rob progress 1


Thx for help!

Answer

As long as I'm understanding the problem correctly, this looks like it could be solved with some use of the lag function and dplyr.

Here's an example:

# Set up the data structure
df <- structure(list(user_id = structure(c(3L, 3L, 3L, 2L, 2L, 3L, 2L, 
    2L, 1L, 1L), .Label = c("Lee", "Mike", "Rob"), class = "factor"), 
    event = structure(c(1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("business", 
    "progress"), class = "factor"), timestamp = c(111111,111112, 222222, 
    111111, 222222, 1, 333333, 444444, 111111, 222222)), .Names = c("user_id", 
    "event", "timestamp"), row.names = c(NA, -10L), class = "data.frame")

# Perform the manipulation
df %>% 
    arrange(user_id, timestamp) %>% # Sort by user and timestamp
    group_by(user_id) %>% # Group/partition by each user
    mutate(last_event = lag(event, 1), # Find the last event
           last_timestamp = lag(timestamp, 1)) %>% # And the time it occurred
    filter(event == "business") %>% # Chop down to just the business events - as that's what we're interested in
    select(user_id, last_event, last_timestamp) %>% # Select the fields of interest
    rename(event = last_event, # Tidy up the field names
           timestamp = last_timestamp)
  user_id    event timestamp
   <fctr>   <fctr>     <dbl>
1    Mike progress    222222
2     Rob progress         1
3     Rob progress    111112

This approach will not work if the event preceding each business event is not progress, however. A simple fix is just to filter down to business and progress events only, though:

df %>% 
    filter(event == "business"|event == "progress") %>% 
    arrange(user_id, timestamp) %>% 
    group_by(user_id) %>% 
    mutate(last_event = lag(event, 1),
           last_timestamp = lag(timestamp, 1)) %>% 
    filter(event == "business") %>% 
    select(user_id, last_event, last_timestamp) %>% 
    rename(event = last_event, 
           timestamp = last_timestamp)

On this data set the output will be the same, but if other events creep in this might be a necessary step.

Comments