Andy.Jian Andy.Jian - 10 months ago 34
R Question

How can I efficiently subset for big data.frame objects in the list?

I have data.frame object in the list, I am gonna filter out based on last column (A.K.A, score) of each respectively. subsetting for list is intuitive for me, but I want to have two different set (i.e., pass/fail) as result of filtration for each data.frame object. I think the way I used is not elegant, and seeking better/efficient solution for it. Can anyone point me how to achieve more elegant solution for this sort of task? Thanks a lot !

toy data:

mylist <- list(df1=data.frame( from=seq(1, by=4, len=16), to=seq(3, by=4, len=16), score=sample(30, 16)),
df2=data.frame( from=seq(3, by=7, len=20), to=seq(6, by=7, len=20), score=sample(30, 20)),
df3=data.frame( from=seq(4, by=8, len=25), to=seq(7, by=8, len=25), score=sample(30, 25)))


my initial attempt:

pass <- lapply(mylist, function(ele_) {
ans <- subset(ele_, ele_$score > 20)
ans
})


turns out I want also to have its opposite set where instances didn't meet the filtering condition, and put pass, fail set for each data.frame object into one list.

desired output: each data.frame object can have pass, fail set after subsetting.

what's better for doing this if I have very big data.frame object in the list? Does anyone have knows useful trick for this?

Answer Source

May be the data.table option would be more efficient

library(data.table)
lapply(mylist, function(x) setDT(x)[score > 20])

Or use filter from dplyr with map from purrr

library(dplyr)
library(purrr)
mylist %>% 
      map(filter, score > 20)

In addition to the above methods of subsetting each list element, we can also rbind the datasets into a single dataset (rbindlist from data.table or bind_rows from dplyr with an identifier column) and subset by group.

rbindlist(mylist, idcol= 'grp')[score > 20, .SD , by = .(grp)]

Or with dplyr

mylist %>% 
    bind_rows(., .id = 'grp') %>%
    group_by(grp) %>%
    filter(score > 20)

If the intention is to separate the data.frame into a list of 2 (> 20 and < 20 for 'score')

lapply(mylist, function(x) split(x, c("FAIL", "PASS")[(x$score > 20)+1]))