Mehdi Farhangian Mehdi Farhangian - 4 months ago 10
R Question

Map two data frames in a certain condition

I already asked this question

But now my current problem is slightly different that I couldn't use this solution or figure out . I'd like the data from data set 1 that occurred before dataset 2 and this is my data:

# Dataset 1 (dts1)

UserID date Hour Events
1 5 25/07/2016 02:31 8
2 5 30/07/2016 02:42 6
3 4 23/07/2016 07:52 9
4 14 24/07/2016 03:02 5
5 17 25/07/2016 09:12 10
6 4 22/07/2016 03:22 4


and

# Dataset 2 (dts2)

UserID date Hour transactions
1 5 25/07/2016 02:29 4
2 4 24/07/2016 02:42 2
3 5 25/07/2016 02:52 3
4 6 24/07/2016 03:02 4
5 6 25/07/2016 03:12 1
6 14 26/07/2016 03:22 3


So, I wish to compare those dataset from dataset 1 and only add those happened before dataset 2. In other words, I want to make sure that I do not count those events that happened after the last transaction of a user. The Ideal output is as follows:

#output

UserID Events transaction

5 8 4,3
4 9,4 2
14 5 3
17 10 NA


In the above example, I made sure that I removed event 6 for user 5 because it happened after his last transaction.

Answer

This is a modification of @dimitris_ps answer for your previous question. If he chooses to answer, I will gladly delete mine.

The main difference between this problem and your previous problem is that we now want all dts1 events that are before the last dts2 transaction for each specific UserID. Therefore, we want to group_by the UserID first and then filter for only those rows where the dts1 event times are less than the last dts2 transaction time. Then we can summarise both unique Events and transactions, still grouped by UserID.

The code is:

library(dplyr)

## I will not use the lubridate package, instead I will convert the time
## using as.POSIXct
dts1$time <- as.POSIXct(paste(dts1$date, dts1$Hour), format="%d/%m/%Y %H:%M")
dts2$time <- as.POSIXct(paste(dts2$date, dts2$Hour), format="%d/%m/%Y %H:%M")

# first join the two data.frames by UserID.
result <- left_join(dts1, dts2, by="UserID") %>%

# all subsequent processing is grouped by the UserID because we 
# want to compare the last transaction time to the Event times
# for each UserID.
group_by(UserID) %>%

# apply the filtering condition dts1 Event must be before last dts2 transaction.
# Note that we keep rows for which there is no row in  
# dts2 for a UserID in dts1. This is the case for UserID=17.
filter(is.na(time.y) | last(time.y) > time.x) %>% 

# summarise Events and transactions
summarise(Events = toString(unique(Events)), transactions = toString(unique(transactions)))

The results are:

print(result)
## A tibble: 4 x 3
##  UserID Events transactions
##   <int>  <chr>        <chr>
##1      4   9, 4            2
##2      5      8         4, 3
##3     14      5            3
##4     17     10           NA

Hope this helps.

Comments