dimebucker91 - 1 year ago 44

R Question

I have data that looks like:

`player event diff`

A x NA

A y 2

A z 240

A w 3

A x 9

B x NA

B y 3

B z 120

C x NA

C x 8

What I did to get this was to group by the player column and take the difference between time events, hence the NA's for diff column whenever a new player has an event.

What I want to do is to partition the data into player specific interactions that are within a few minutes of each other (say a cutoff of diff = 20). What I want in the end is to have:

`player event diff interaction`

A x NA 1

A y 2 1

A z 240 2

A w 3 2

A x 9 2

B x NA 1

B y 3 1

B z 120 2

C x NA 1

C x 8 1

So basically the interactions are grouped based on having the same player and the difference being less than 20, otherwise a new interaction is started. A new interaction is also started if an NA is present. I'm not really sure how to do this in a fast/efficient way as I've got a large data set with many players. My preference is for a

`dplyr`

Answer Source

You can replace NA with 0(or other number that is below your threshold) using `coalesce`

in the diff column and do a cumsum on the `diff >= 20`

condition, which will give a distinct id whenever `diff`

exceed some threshold:

```
library(dplyr)
df %>% group_by(player) %>%
mutate(interaction = cumsum(coalesce(diff, 0L) >= 20) + 1)
# Source: local data frame [10 x 4]
# Groups: player [3]
# player event diff interaction
# <fctr> <fctr> <int> <dbl>
# 1 A x NA 1
# 2 A y 2 1
# 3 A z 240 2
# 4 A w 3 2
# 5 A x 9 2
# 6 B x NA 1
# 7 B y 3 1
# 8 B z 120 2
# 9 C x NA 1
# 10 C x 8 1
```