user88911 - 2 months ago
R Question

How to efficiently iterate over a list of data.frames as input for a custom function?

I have a list of data.frames as input for my custom function, and I expect the function to return multiple lists of data.frames. I made some changes to the function, but it returns unexpected output. I would like cleaner code that produces neat output. My current approach is not elegant, and I am looking for a more efficient and robust solution. Can anyone suggest how to improve the function? Where did I go wrong in my code? Any hints?


myList <- list(
  foo  = data.frame(start = seq(1, by = 4, length.out = 6), stop = seq(3, by = 4, length.out = 6)),
  bar  = data.frame(start = seq(5, by = 2, length.out = 7), stop = seq(7, by = 2, length.out = 7)),
  bleh = data.frame(start = seq(1, by = 5, length.out = 5), stop = seq(3, by = 5, length.out = 5))
)

My initial custom function was:

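library(dplyr)  # assumption: setdiff() below is dplyr's row-wise data-frame method

func <- function(set, idx = 1L) {
  entry <- set[[idx]]
  # difference of the entry with itself (an empty data frame)
  self_ <- setdiff(entry, entry)
  # rows of the entry that are missing from each of the other data frames
  res <- lapply(set[-idx], function(ele_) {
    setdiff(entry, ele_)
  })
  result <- c(list(self_), res)
  names(result) <- c(names(set[idx]), names(set[-idx]))
  result
}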

and I call it like this:

res_1 <- func(set = myList, idx = 1L)
res_2 <- func(set = myList, idx = 2L)
res_3 <- func(set = myList, idx = 3L)

Now I want to implement the following improved function instead:

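func <- function(set) {
  # check input param
  stopifnot(is.list(set))
  output <- list()
  for (id in seq_along(set)) {  # seq_along() already yields 1..length(set)
    entry <- set[[id]]
    self_ <- setdiff(entry, entry)  # assumes dplyr's row-wise setdiff()
    res <- lapply(set[-id], function(ele_) {
      setdiff(entry, ele_)
    })
    ans <- c(list(self_), res)
    names(ans) <- c(names(set[id]), names(set[-id]))
    output[[id]] <- ans  # [[ ]] stores the whole list as one element
  }
  output
}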

Desired output:

I expect my custom function to return multiple lists of data.frame objects. Putting these lists into another, bigger list is not elegant (I would like to avoid nesting lists everywhere). Is there a better structure for this? Which data structure in R is more suitable for storing a very large collection of lists of data.frames? Can anyone give me some ideas? Thanks in advance.


I'm still having a little trouble understanding your intent, but here's a suggestion for a cleaner solution.

First, it's often much easier to store data as a flat data frame:

library(plyr)  # provides ldply() and ddply()
df <- ldply(df.list, rbind, .id = 'group1')

   group1 V1 V2
1       a  1  1
2       a  1  0
3       a  1  4
4       a  2  5
18      c  4  3
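
If you would rather not depend on plyr, the same flattening can be sketched in base R. This is only an illustration applied to the question's myList (so the columns are start and stop rather than V1 and V2); the name flat is my own choice, not something from the original code.

```r
# The question's example list
myList <- list(
  foo  = data.frame(start = seq(1, by = 4, length.out = 6), stop = seq(3, by = 4, length.out = 6)),
  bar  = data.frame(start = seq(5, by = 2, length.out = 7), stop = seq(7, by = 2, length.out = 7)),
  bleh = data.frame(start = seq(1, by = 5, length.out = 5), stop = seq(3, by = 5, length.out = 5))
)

# Tag each data frame with the name of its list element, then stack the pieces
flat <- do.call(rbind, lapply(names(myList), function(nm) {
  data.frame(group1 = nm, myList[[nm]])
}))

head(flat, 3)
```

Each row now carries its group label, so later steps can split or subset on group1 instead of indexing into a list.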

Then we can use plyr to loop through the combinations of the two groups and compute their set differences:

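df.setdiff <- ddply(df, .(group1), function(x) {
    # all rows that belong to the other groups
    comparisons <- subset(df, group1 != x$group1[1])
    colnames(comparisons) <- c('group2', 'V1', 'V2')
    res <- ddply(comparisons, .(group2), function(y) {
        # assumption: row-wise set difference, i.e. dplyr's setdiff() for data frames
        return(dplyr::setdiff(x[c('V1', 'V2')], y[c('V1', 'V2')]))
    })
    return(res)
})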

This produces a single data frame:

   group1 group2 V1 V2
1       a      b  1  1
2       a      b  1  0
3       a      b  1  4
4       a      b  2  5
5       a      b  3  0
6       a      b  0  2
7       a      c  1  4
8       a      c  2  5
9       a      c  3  0
10      a      c  0  2
24      c      b  0  3

Some comparisons appear twice, since each group can appear in the "group1" or "group2" column, and my code does not skip those duplications, but this should get you started.
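
If you prefer to keep the nested-list shape from the question, here is a minimal base-R sketch of the pairwise comparison. The helpers row_setdiff and pairwise_setdiff are hypothetical names of mine, and the row comparison pastes each row into a text key, which is a simplification that assumes the columns print unambiguously (fine for the numeric start/stop columns here).

```r
# The question's example list
myList <- list(
  foo  = data.frame(start = seq(1, by = 4, length.out = 6), stop = seq(3, by = 4, length.out = 6)),
  bar  = data.frame(start = seq(5, by = 2, length.out = 7), stop = seq(7, by = 2, length.out = 7)),
  bleh = data.frame(start = seq(1, by = 5, length.out = 5), stop = seq(3, by = 5, length.out = 5))
)

# Hypothetical helper: rows of x that do not occur in y (same columns assumed)
row_setdiff <- function(x, y) {
  key_x <- do.call(paste, c(x, sep = "\r"))
  key_y <- do.call(paste, c(y, sep = "\r"))
  unique(x[!key_x %in% key_y, , drop = FALSE])
}

# Hypothetical helper: for every element, its set difference against each other element
pairwise_setdiff <- function(set) {
  stopifnot(is.list(set), !is.null(names(set)))
  out <- lapply(seq_along(set), function(i) {
    lapply(set[-i], function(other) row_setdiff(set[[i]], other))
  })
  names(out) <- names(set)
  out
}

res <- pairwise_setdiff(myList)
res$foo$bar  # rows of foo that are not in bar
```

The result is indexed by name on both levels (res$foo$bar, res$bar$foo, and so on), so there is no need for an idx argument or for a separate call per element.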