Harlan - 3 years ago 118
R Question

efficient tidy R technique for processing aligned data frame groups

I'm trying to find an efficient (and ideally tidy) way to process a pair of grouped data_frames. The setup looks more-or-less like this:

```r
A = crossing(idx = 1:1e5, asdf = 1:rpois(1, 50))
B = tibble(idx = sample(1:1e5, replace = TRUE), yet_more_stuff = "whatever")
proc_one_group <- function(one_A, one_B) { ... }
# example:
proc_one_group(filter(A, idx == 50), filter(B, idx == 50))
```

So, my processing operation, which is fairly complex, operates on one `idx` at a time, from two separate data frames: one has one or more (usually dozens of) rows per `idx`, and the other can have zero, one, or multiple rows per `idx`.

One way I know to do this is the following, but it's very slow, because the `filter` operation for every value requires a full table scan and subset.

```r
map_df(unique(A$idx), ~ proc_one_group(filter(A, idx == .), filter(B, idx == .)))
```

I also know I can use `split` to create a list of sub-data frames relatively efficiently, but I don't know a good way to then do O(1) lookups by index into the two `data_frame`s.
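For the `split` route, one way to get O(1) lookups (a sketch, not from the original post; the toy data sizes and the `get_B` helper are made up for illustration) is to move the split groups into a hashed environment, since R environments hash their names:

```r
library(tidyverse)

set.seed(42)
A <- crossing(idx = 1:100, asdf = 1:4)
B <- tibble(idx = sample(1:100, 60, replace = TRUE),
            yet_more_stuff = "whatever")

# split() once, then move the per-idx groups into a hashed environment,
# so each lookup hashes the name instead of scanning a named list.
B_groups <- new.env(hash = TRUE)
list2env(split(B, B$idx), envir = B_groups)

get_B <- function(i) {
  key <- as.character(i)
  if (exists(key, envir = B_groups, inherits = FALSE)) {
    get(key, envir = B_groups)
  } else {
    NULL  # mimic "no rows for this idx"
  }
}
```

`get_B(B$idx[1])` returns the rows of `B` for that `idx`; an `idx` absent from `B` (e.g. 101 here) returns `NULL`.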

What I sort of want is the first step of a `left_join`, where it figures out the matching subgroups of rows for each index, but instead of actually creating a single `data_frame` of the cartesian combination of each pair of groups, it just gives me the pair of subgroups that I can process as needed. (A full `left_join` doesn't help me here.)
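As an aside (assuming a dplyr version of at least 0.8 is available, which postdates some answers of this era): `nest_join()` does essentially this first step, keeping each row of the left table and attaching the matching rows of the right table as a list column instead of materializing the cartesian product. A minimal sketch with toy stand-ins for `A` and `B`:

```r
library(tidyverse)

# Toy stand-ins, much smaller than the real 1e5-row data
set.seed(1)
A <- crossing(idx = 1:100, asdf = 1:5)
B <- tibble(idx = sample(1:100, 80, replace = TRUE),
            yet_more_stuff = "whatever")

# Nest A's rows per idx, then attach B's matching rows as a list column.
# An idx with no rows in B gets a zero-row tibble, not NULL.
paired <- A %>%
  nest(data_a = asdf) %>%                      # tidyr >= 1.0 nesting syntax
  nest_join(B, by = "idx", name = "data_b")
```

Each row of `paired` then carries exactly the pair of subgroups for one `idx`.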

Any ideas?

One possibility would be to nest your two data frames first, before joining:

```r
library(tidyverse)

set.seed(1234)

A = crossing(idx = 1:1e5, asdf = 1:rpois(1, 50))
B = tibble(idx = sample(1:1e5, replace = TRUE), yet_more_stuff = "whatever")

proc_one_group <- function(one_A, one_B) { ... }

nest_A <- A %>%
  group_by(idx) %>%
  nest(.key = "data_a")
nest_B <- B %>%
  group_by(idx) %>%
  nest(.key = "data_b")

all_data <- full_join(nest_A, nest_B, by = "idx")
all_data
#> # A tibble: 100,000 x 3
#>      idx            data_a           data_b
#>    <int>            <list>           <list>
#>  1     1 <tibble [41 x 1]>           <NULL>
#>  2     2 <tibble [41 x 1]> <tibble [2 x 1]>
#>  3     3 <tibble [41 x 1]> <tibble [2 x 1]>
#>  4     4 <tibble [41 x 1]> <tibble [1 x 1]>
#>  5     5 <tibble [41 x 1]>           <NULL>
#>  6     6 <tibble [41 x 1]>           <NULL>
#>  7     7 <tibble [41 x 1]> <tibble [2 x 1]>
#>  8     8 <tibble [41 x 1]>           <NULL>
#>  9     9 <tibble [41 x 1]> <tibble [1 x 1]>
#> 10    10 <tibble [41 x 1]> <tibble [1 x 1]>
#> # ... with 99,990 more rows
``````

This results in a single data frame, with the data for each `idx` from data frame `A` in `data_a` and the data from data frame `B` in `data_b`. Once this is done, the large data frames no longer have to be filtered for every `idx`. Note that `data_b` is `NULL` for any `idx` that has no rows in `B`, so `proc_one_group` needs to handle a `NULL` second argument.

```r
results <- map2_df(all_data$data_a, all_data$data_b, proc_one_group)
```
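To make that last step concrete (a runnable sketch with a made-up `proc_one_group`, since the real one isn't shown): pass the two list columns to `map2_df()` and guard against the `NULL` cells that `full_join()` leaves in `data_b`:

```r
library(tidyverse)

set.seed(1234)
A <- crossing(idx = 1:50, asdf = 1:3)
B <- tibble(idx = sample(1:50, 30, replace = TRUE),
            yet_more_stuff = "whatever")

# Toy proc_one_group: just counts rows on each side, treating a NULL
# data_b cell as an empty group.
proc_one_group <- function(one_A, one_B) {
  tibble(n_a = nrow(one_A),
         n_b = if (is.null(one_B)) 0L else nrow(one_B))
}

all_data <- full_join(nest(A, data_a = asdf),
                      nest(B, data_b = yet_more_stuff),
                      by = "idx")

results <- map2_df(all_data$data_a, all_data$data_b, proc_one_group)
```

Each call to `proc_one_group` now receives its pre-grouped pair directly, with no per-`idx` filtering.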