Harlan - 3 years ago
R Question

efficient tidy R technique for processing aligned data frame groups

I'm trying to find an efficient (and ideally tidy) way to process a pair of grouped data_frames. The setup looks more-or-less like this:

A = crossing(idx=1:1e5, asdf=seq(1:rpois(1,50)))
B = data_frame(idx=sample(1:1e5, replace=TRUE), yet_more_stuff='whatever')
proc_one_group <- function(one_A, one_b) { ... }
# example:
proc_one_group(filter(A, idx==50), filter(B, idx==50))


So, my processing operation, which is fairly complex, operates on one idx at a time, from two separate data frames, where one of them has one or more (usually dozens of) rows per idx, and the other can have zero, one, or multiple rows per idx.

One way I know I can do this is the following, but it's very slow, because the filter operation on every value requires a full table scan and subset.

map_df(unique(A$idx), ~ proc_one_group(filter(A, idx==.), filter(B, idx==.)))


I also know I can use split to create a list of subframes of data_frames relatively efficiently, but I don't know a good way to then do O(1) lookups by index into the two data_frames.
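For reference, the split-based approach the question alludes to looks roughly like the sketch below, with toy data so it runs standalone (the tiny sizes and the stand-in proc_one_group are made up for illustration; lookups into a named list are by name rather than strictly O(1), but each frame is only scanned once):

```r
library(tidyverse)

# Toy stand-ins for the question's A and B (sizes shrunk for illustration)
A <- crossing(idx = 1:5, asdf = 1:3)
B <- tibble(idx = c(2, 2, 4), yet_more_stuff = "whatever")

# Stand-in for the real (complex) per-group operation: just count rows.
# one_B is NULL when an idx has no rows in B, so handle that case.
proc_one_group <- function(one_A, one_B) {
  tibble(idx = one_A$idx[1],
         n_a = nrow(one_A),
         n_b = if (is.null(one_B)) 0L else nrow(one_B))
}

# Split each frame once; the results are lists named by idx
a_groups <- split(A, A$idx)
b_groups <- split(B, B$idx)

# Look up each group pair by name; b_groups[[.]] is NULL for absent idx
res <- map_df(names(a_groups),
              ~ proc_one_group(a_groups[[.]], b_groups[[.]]))
```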

What I sort of want is the first step of a left_join, where it figures out the subgroups of indexes from each group, but instead of actually creating a single data_frame of the cartesian combination of each group, it just gives me the pair of subgroups that I can process as needed. (A full left_join doesn't help me here.)

Any ideas?

Answer Source

One possibility would be to nest your two data frames first, before joining:

library(tidyverse)

set.seed(1234)

A = crossing(idx = 1:1e5, asdf = seq(1:rpois(1, 50)))
B = data_frame(idx = sample(1:1e5, replace = TRUE), yet_more_stuff = "whatever")

proc_one_group <- function(one_A, one_B) { ... }

nest_A <- A %>%
  group_by(idx) %>%
  nest(.key = "data_a")
nest_B <- B %>%
  group_by(idx) %>%
  nest(.key = "data_b")

all_data <- full_join(nest_A, nest_B, by = "idx")
all_data
#> # A tibble: 100,000 x 3
#>      idx            data_a           data_b
#>    <int>            <list>           <list>
#>  1     1 <tibble [41 x 1]>           <NULL>
#>  2     2 <tibble [41 x 1]> <tibble [2 x 1]>
#>  3     3 <tibble [41 x 1]> <tibble [2 x 1]>
#>  4     4 <tibble [41 x 1]> <tibble [1 x 1]>
#>  5     5 <tibble [41 x 1]>           <NULL>
#>  6     6 <tibble [41 x 1]>           <NULL>
#>  7     7 <tibble [41 x 1]> <tibble [2 x 1]>
#>  8     8 <tibble [41 x 1]>           <NULL>
#>  9     9 <tibble [41 x 1]> <tibble [1 x 1]>
#> 10    10 <tibble [41 x 1]> <tibble [1 x 1]>
#> # ... with 99,990 more rows

This results in a single data frame, with the data for each idx from data frame A in data_a, and the data from data frame B in data_b. Once this is done, the large data frames don't have to be filtered for each case in the map_df call.

map2_df(all_data$data_a, all_data$data_b, proc_one_group)
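If you want to keep idx attached to each group's result, a mutate + map2 variant of the same nest-then-join idea works too. A minimal runnable sketch with toy data (the small sizes and the row-counting proc_one_group are placeholders; rename is used instead of nest's .key argument so it behaves the same across tidyr versions):

```r
library(tidyverse)

# Toy data mirroring the answer's setup at small scale
A <- crossing(idx = 1:5, asdf = 1:3)
B <- tibble(idx = c(2, 2, 4), yet_more_stuff = "whatever")

# Stand-in per-group operation; data_b is NULL for unmatched idx
proc_one_group <- function(one_A, one_B) {
  tibble(n_a = nrow(one_A),
         n_b = if (is.null(one_B)) 0L else nrow(one_B))
}

all_data <- full_join(
  A %>% group_by(idx) %>% nest() %>% rename(data_a = data),
  B %>% group_by(idx) %>% nest() %>% rename(data_b = data),
  by = "idx"
) %>%
  ungroup()

# mutate + map2 keeps idx alongside each group's result
final <- all_data %>%
  mutate(result = map2(data_a, data_b, proc_one_group)) %>%
  select(idx, result) %>%
  unnest(result)
```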