Misha Misha - 1 month ago 14
R Question

List columns and multidplyr

I am new to multidplyr. I have a dataset similar to what this creates:

library(multidplyr)
library(tidyverse)
library(nycflights13)
f<-flights %>% group_by(month) %>% nest()


Now I'd like to operate on each of these tibbles on a different node.

cluster <- create_cluster(12)
f2<-partition(f,month,cluster=cluster)


Everything seems OK up to this point, but when I run:

models <- f2 %>%
  do(mod = lm(arr_delay ~ dep_delay, data = .))


I get the following error msg:

Error in checkForRemoteErrors(lapply(cl, recvResult)) :
12 nodes produced errors; first error: object 'arr_delay' not found


Now if I try

f2 %>% browser(.)


and then try .$, I do not have access to any of the columns.

Any ideas how these columns can be accessed?

Answer

This question has two parts:

1. Why are you getting an error using do?

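The "proper" way to apply a function to a nested column (a "list column") is to use map rather than do. After nest(), f has only two columns, month and data, so arr_delay is not visible when do evaluates the formula against the data frame itself; the flight columns sit inside the tibbles stored in data. multidplyr isn't really the issue here, since plain dplyr code gives the same error.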

f <- flights %>% group_by(month) %>% nest()

models <- f %>%
  do(mod = lm(arr_delay ~ dep_delay, data = .))

Error in eval(expr, envir, enclos) : object 'arr_delay' not found
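
One way to see where the error comes from is to inspect the nested frame. This is a minimal sketch that reuses the f from the question; it only checks which columns exist at each level.

```r
library(tidyverse)
library(nycflights13)

f <- flights %>% group_by(month) %>% nest()

# Top level: one row per month, with the grouping column and the list column
names(f)

# The original flight columns live inside each tibble of the list column
"arr_delay" %in% names(f)            # not a top-level column
"arr_delay" %in% names(f$data[[1]])  # present inside the nested tibble
```

Because lm looks up arr_delay in whatever is passed as data, it has to be given one of the inner tibbles, not the nested frame.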

Using map from purrr, on the other hand, works fine.

models <- f %>%
  mutate(model = map(data, ~ lm(arr_delay ~ dep_delay, data = .)))

Using your multidplyr code with mutate and map also works just fine.
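
A sketch of what that might look like, assuming the same older multidplyr API as in the question (create_cluster, and partition with a cluster argument). The cluster_library call is an assumption: the workers may start without purrr attached, in which case map would not be found on them.

```r
library(multidplyr)
library(tidyverse)
library(nycflights13)

f <- flights %>% group_by(month) %>% nest()

cluster <- create_cluster(12)
# Assumption: purrr has to be attached on every worker so that map() resolves there
cluster_library(cluster, "purrr")

f2 <- partition(f, month, cluster = cluster)

# Fit one model per month on the workers, then bring the results back locally
models <- f2 %>%
  mutate(model = map(data, ~ lm(arr_delay ~ dep_delay, data = .))) %>%
  collect()
```

After collect(), models is an ordinary local tibble again, so models$model[[1]] can be inspected as usual.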

2. How can I view the data in a party_df?

You can't easily do that. Remember that the data are not held in your current R session but on the nodes. You can retrieve the column names with this small utility function:

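names.party_df <- function(x) {
  # Runs on a worker: evaluate the symbol there and return its column names
  fun <- function(x) names(eval(x))
  # Call fun on every node with the shard's variable name; keep the first node's answer
  multidplyr::cluster_call(x$cluster, fun, as.name(x$name))[[1]]
}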

But to access the full data, you'll most likely need to collect it back into your session. Alternatively, in RStudio you can use View, but note that this doesn't work well on large data sets.