Jaroslav - 1 month ago
R Question

Re-assembling a dataframe after a split

I am having trouble applying a split to a data.frame and then assembling aggregated results back into a different data.frame. I tried the 'unsplit' function, but I can't figure out how to use it to get the desired result.

Let me demonstrate on the common 'mtcars' data. Say the result I ultimately want is a data frame with two variables: cyl (cylinders) and mean_mpg (the mean of mpg over the group of cars sharing the same number of cylinders).

So the initial split goes like this:

spl <- split(mtcars, mtcars$cyl)


The result of which looks something like this:

$`4`
            mpg cyl  disp hp drat    wt  qsec vs am gear carb
Datsun 710 22.8   4 108.0 93 3.85 2.320 18.61  1  1    4    1
Merc 240D  24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
...

$`6`
               mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
...

$`8`
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
...


Now I want to do something along the lines of:

df <- as.data.frame(lapply(spl, function(x) mean(x$mpg)), col.names=c("cyl", "mean_mpg"))


However, doing the above results in:

        X4       X6   X8
1 26.66364 19.74286 15.1


While I'd want df to look like this:

  cyl mean_mpg
1   4 26.66364
2   6 19.74286
3   8 15.10000


Thanks, J.

Answer

If you are only interested in reassembling a split, look at (2), (4) and (4a). If the underlying question is really about how to perform aggregations over groups, then all of them may be of interest:

1) aggregate

Normally one uses aggregate, as already mentioned in the comments. Simplifying @alistaire's code slightly:

aggregate(mpg ~ cyl, mtcars, mean)
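
The aggregate call above names the result column mpg, while the question asks for mean_mpg. A minimal sketch of one way to get the requested names, assuming a rename after the fact is acceptable (the variable name res is only for illustration):

```r
# Group means of mpg by cyl, using base R only.
# setNames() renames the two result columns to match the question.
res <- setNames(
  aggregate(mpg ~ cyl, data = mtcars, FUN = mean),
  c("cyl", "mean_mpg")
)
print(res)
```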

2) split/lapply/do.call

@rawr has also given a split/lapply/do.call solution in the comments, which we can simplify slightly as well:

spl <- split(mtcars, mtcars$cyl)
do.call("rbind", lapply(spl, with, data.frame(cyl = cyl[1], mpg = mean(mpg))))
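
The lapply(spl, with, ...) call is compact but can be hard to read, because the data.frame(...) expression is evaluated inside each piece of the split. A sketch of what I believe is an equivalent, more explicit form, using an anonymous function; the names rows and res and the column name mean_mpg are my own choices, not part of the original code:

```r
# Split mtcars into one data frame per cylinder count.
spl <- split(mtcars, mtcars$cyl)

# Build a one-row data frame per group: the group's cyl value and its mean mpg.
rows <- lapply(spl, function(d) {
  data.frame(cyl = d$cyl[1], mean_mpg = mean(d$mpg))
})

# Stack the one-row data frames and drop the "4", "6", "8" row names.
res <- do.call(rbind, rows)
rownames(res) <- NULL
print(res)
```

Resetting the row names is optional; without it the rows are labelled by the names of the split.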

3) do.call/by

The last one could alternatively be rewritten in terms of by:

do.call("rbind", by(mtcars, mtcars$cyl, with, data.frame(cyl = cyl[1], mpg = mean(mpg))))

4) split/lapply/unsplit

Another possibility is to use split and unsplit:

spl <- split(mtcars, mtcars$cyl)
L <- lapply(spl, with, data.frame(cyl = cyl[1], mpg = mean(mpg), row.names = cyl[1]))
unsplit(L, sapply(L, "[[", "cyl"))

4a) Or, if row names are sufficient to identify the groups:

spl <- split(mtcars, mtcars$cyl)
L <- lapply(spl, with, data.frame(mpg = mean(mpg), row.names = cyl[1]))
unsplit(L, sapply(L, rownames))
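
For context on (4) and (4a): unsplit is documented as reversing the effect of split, so its more typical use is putting per-row results back into the original row order. A minimal sketch of that use, where the column name mean_mpg and the variable names f, spl2 and back are my own choices for illustration:

```r
# Grouping vector with one entry per row of mtcars.
f <- mtcars$cyl
spl <- split(mtcars, f)

# Add the group mean to every row of each piece (rows are kept, not collapsed).
spl2 <- lapply(spl, function(d) transform(d, mean_mpg = mean(mpg)))

# Reassemble: same rows and row order as mtcars, plus the new column.
back <- unsplit(spl2, f)
head(back[, c("cyl", "mpg", "mean_mpg")])
```

Here f has one entry per row of mtcars, whereas in (4) and (4a) it has one entry per group, which is why those calls return a three-row data frame.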

None of the above use any packages, but there are also many packages that can do aggregations, including dplyr, data.table and sqldf:

5) dplyr

library(dplyr)
mtcars %>%
       group_by(cyl) %>%
       summarize(mpg = mean(mpg)) %>%
       ungroup()

6) data.table

library(data.table)
as.data.table(mtcars)[, list(mpg = mean(mpg)), by = "cyl"]

7) sqldf

library(sqldf)
sqldf("select cyl, avg(mpg) mpg from mtcars group by cyl")