kungfujam kungfujam - 2 months ago 12
R Question

dplyr mutate calling another dataframe

I would like to mutate a dataframe by applying a function which calls out to another dataframe. I can acheive this in a few different ways, but would like to know how to do this 'properly'.

Here is an example of what I'm trying to do. I have a dataframe with some start times, and a second with some timed observations. I would like to return a dataframe with the start times, and the number of observations that occur within some window after the start time. e.g.

set.seed(1337)
df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
df2 <- data.frame(time=runif(100)*100)
lapply(df1$start_time, function(s) sum(df2$time>s & df2$time<(s+15)))


The best I've got so far with dplyr is the following (but this loses the identity variables):

df1 %>%
rowwise() %>%
do(count = filter(df2, time>.$start_time, time < (.$start_time + 15))) %>%
mutate(n=nrow(count))


output:

Source: local data frame [3 x 2]
Groups: <by row>

# A tibble: 3 × 2
count n
<list> <int>
1 <data.frame [17 × 1]> 17
2 <data.frame [18 × 1]> 18
3 <data.frame [10 × 1]> 10


I was expecting to be able to do this:

df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
df2 <- data.frame(time=runif(100)*100)
df1 %>%
group_by(id) %>%
mutate(count = nrow(filter(df2, time>start_time, time<(start_time+15))))


but this returns the error:

Error: comparison (6) is possible only for atomic and list types


What is the dplyr way of doing this?

Answer

Another slightly different approach using dplyr:

result <- df1 %>% group_by(id) %>% 
                  summarise(count = length(which(df2$time > start_time &
                                                 df2$time < (start_time+15))))

print(result)
### A tibble: 3 x 2
##      id count
##  <fctr> <int>
##1      A    17
##2      B    18
##3      C    10

I believe you can use length and which to count the number of occurrences for which your condition is true for each id in df1. Then, group by id and use this to summarise.


If there are possibly more that one start_time per id, then you can use the same function but rowwise and with mutate:

result <- df1 %>% rowwise() %>% 
                  mutate(count = length(which(df2$time > start_time & 
                                              df2$time < (start_time+15))))
print(result)
##Source: local data frame [3 x 3]
##Groups: <by row>
##
### A tibble: 3 x 3
##      id start_time count
##  <fctr>      <dbl> <int>
##1      A         10    17
##2      B         20    18
##3      C         30    10