kjo - 2 months ago 6
R Question

# Applying data.frame-consuming functions over groups of rows

For example, suppose I have some `data.frame` `df`:

``````df <- read.table(text = "
P    Q    R
c    1   10
a    1    0
a    2    0
b    2    0
b    1   10
c    2   10
b    1    0
a    2   10
",
stringsAsFactors = FALSE,
header = TRUE)
``````

...and some function `foo` that takes a `data.frame` as argument.

One can imagine splitting `df` into smaller `data.frame`s according to the value in one of its columns, say `P`, and applying `foo` to each of those smaller `data.frame`s.

Below I show the best I can come up with to solve this problem, but I suspect that more streamlined solutions already exist to perform such a natural operation. If so, my question is: what are they?

NB: I show two use-cases below; the first one of the two is the one that I expect can be improved significantly. As for the second one, I think my solution for it may already be about as good as it'll get; I include this use-case just in case my guess is wrong.

My solution depends on whether `foo` is a function that I call for its return value, or one that I call only for its side effects.

For the former case (`foo` called for its value), suppose that `foo` is this:

``````## returns a one-row data.frame corresponding to a random row of
## dataframe
## NB: this is *just an example* for the sake of this question
foo <- function (dataframe) {
  dataframe[sample(nrow(dataframe), 1), ]
}
``````

...then my solution would be this:

``````set.seed(0)
sapply(unique(df$P), function (value) foo(df[df$P == value, ]),
       simplify = FALSE)
## $c
##   P Q  R
## 6 c 2 10
##
## $a
##   P Q R
## 2 a 1 0
##
## $b
##   P Q  R
## 5 b 1 10
``````

For the latter case (`foo` called for its side effects), suppose that `foo` is this:

``````## prints to stdout a one-row data.frame corresponding to a random
## row of dataframe
## NB: this is *just an example* for the sake of this question
foo <- function (dataframe) {
  cat(str(dataframe[sample(nrow(dataframe), 1), ]))
}
``````

...then my solution would be this:

``````set.seed(0)
for (value in unique(df$P)) foo(df[df$P == value, ])
## 'data.frame':    1 obs. of  3 variables:
##  $ P: chr "c"
##  $ Q: int 2
##  $ R: int 10
## 'data.frame':    1 obs. of  3 variables:
##  $ P: chr "a"
##  $ Q: int 1
##  $ R: int 0
## 'data.frame':    1 obs. of  3 variables:
##  $ P: chr "b"
##  $ Q: int 1
##  $ R: int 10
``````

You can achieve both of your use cases with the function `by`. To replicate your results, however, we change your functions to return (or print) the last row of each group instead of a randomly selected row. This is necessary because `by` visits the groups in sorted order of `P` (`a`, `b`, `c`) rather than in order of first appearance (`c`, `a`, `b`), so with the same seed the random draws would generally land on different groups. In a real use case, this ordering should not matter; it matters here only because your results depend on a random number generator.

``````foo <- function (dataframe) {
  dataframe[nrow(dataframe), ]
}

out1 <- sapply(unique(df$P), function (value) foo(df[df$P == value, ]),
               simplify = FALSE)
``````
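One way to check how the two approaches order the groups is to compare `unique(df$P)` with the group labels that `by` produces. The following is a minimal sketch; `loop_order` and `by_order` are names introduced here for illustration, and `df` is rebuilt so the snippet runs on its own:

```r
# Same data as in the question, rebuilt so this snippet is self-contained.
df <- data.frame(
  P = c("c", "a", "a", "b", "b", "c", "b", "a"),
  Q = c(1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L),
  R = c(10L, 0L, 0L, 0L, 10L, 10L, 0L, 10L),
  stringsAsFactors = FALSE
)

# The sapply()/for versions visit groups in order of first appearance.
loop_order <- unique(df$P)

# by() converts the grouping vector to a factor, so its groups follow
# the sorted factor levels.
by_order <- dimnames(by(df, df$P, nrow))[[1]]

print(loop_order)  # "c" "a" "b"
print(by_order)    # "a" "b" "c"
```

Within each group the rows keep their original relative order, so taking the last row of a group picks the same row under either approach.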

The result `out1` is a `list`:

``````str(out1)  ## this displays the structure of the out1 object
##List of 3
## $ c:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "c"
##  ..$ Q: int 2
##  ..$ R: int 10
## $ a:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "a"
##  ..$ Q: int 2
##  ..$ R: int 10
## $ b:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "b"
##  ..$ Q: int 1
##  ..$ R: int 0
``````

We can achieve the same result using `by`. It returns an object of class `by`, which is a `list` underneath, with the groups in sorted order:

``````by.out1 <- with(df, by(df, P, foo))
str(by.out1)
##List of 3
## $ a:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "a"
##  ..$ Q: int 2
##  ..$ R: int 10
## $ b:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "b"
##  ..$ Q: int 1
##  ..$ R: int 0
## $ c:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "c"
##  ..$ Q: int 2
##  ..$ R: int 10
## - attr(*, "dim")= int 3
## - attr(*, "dimnames")=List of 1
##  ..$ P: chr [1:3] "a" "b" "c"
## - attr(*, "call")= language by.data.frame(data = df, INDICES = P, FUN = foo)
## - attr(*, "class")= chr "by"
``````

Here, `with` evaluates the `by` call in an environment constructed from `df`. This allows us to refer to the columns of `df` (such as `P`) by bare name, without quotes or a `df$` prefix.
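For comparison, the same grouping can be written without `with`, or with `split` plus `lapply`, both from base R. This is a sketch rather than a recommendation; `by_plain` and `split_out` are illustrative names, and `df` and `foo` are redefined so the snippet stands alone:

```r
# Same data as in the question, rebuilt so this snippet is self-contained.
df <- data.frame(
  P = c("c", "a", "a", "b", "b", "c", "b", "a"),
  Q = c(1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L),
  R = c(10L, 0L, 0L, 0L, 10L, 10L, 0L, 10L),
  stringsAsFactors = FALSE
)

# Returns the last row of the data.frame it is given.
foo <- function(dataframe) {
  dataframe[nrow(dataframe), ]
}

# Pass the grouping vector explicitly instead of relying on with().
by_plain <- by(df, df$P, foo)

# split() cuts df into a named list of data.frames, one per value of P;
# lapply() then applies foo to each piece and returns a plain list.
split_out <- lapply(split(df, df$P), foo)

str(split_out)
```

`split_out` is a plain named list, without the `dim`, `dimnames`, and `call` attributes that a `by` object carries.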

For your second use case (which displays to console via `cat`):

``````foo <- function (dataframe) {
  cat(str(dataframe[nrow(dataframe), ]))
}

for (value in unique(df$P)) foo(df[df$P == value, ])
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "c"
## $ Q: int 2
## $ R: int 10
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "a"
## $ Q: int 2
## $ R: int 10
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "b"
## $ Q: int 1
## $ R: int 0
``````

Again, we can achieve the same result with `by`:

``````with(df, by(df, P, foo))
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "a"
## $ Q: int 2
## $ R: int 10
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "b"
## $ Q: int 1
## $ R: int 0
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "c"
## $ Q: int 2
## $ R: int 10
``````
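When a function is called only for its side effects, the value returned by `by` is of no interest, and at an interactive prompt it may be auto-printed after the side-effect output. Below is a minimal sketch of two ways to avoid that, assuming a hypothetical printing function `show_last` (it calls `str` directly, since `str` already prints and returns `NULL`, so wrapping it in `cat` adds nothing):

```r
# Same data as in the question, rebuilt so this snippet is self-contained.
df <- data.frame(
  P = c("c", "a", "a", "b", "b", "c", "b", "a"),
  Q = c(1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L),
  R = c(10L, 0L, 0L, 0L, 10L, 10L, 0L, 10L),
  stringsAsFactors = FALSE
)

# Hypothetical side-effect-only function: prints the structure of the
# last row of the data.frame it is given.
show_last <- function(dataframe) {
  str(dataframe[nrow(dataframe), ])
}

# Option 1: keep by(), but discard its return value quietly.
invisible(by(df, df$P, show_last))

# Option 2: loop over the pieces produced by split(); a for loop
# returns nothing visible, so there is nothing to suppress.
for (piece in split(df, df$P)) show_last(piece)
```

Both options visit the groups in sorted order of `P`.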

The function `by` is in the `base` package of R. As mentioned by Dave2e, many other packages offer similar data manipulation capabilities. Some provide more syntactic sugar for ease of use, others are better optimized, and some do both. Among them are `plyr`, `dplyr`, and `data.table`. I leave it to you to explore these.
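As a starting point for that exploration, here is a hedged sketch of how the value-returning case might look in `dplyr` and `data.table`. It assumes a `dplyr` version that provides `group_modify`, and it skips either part if the package is not installed; `last_row`, `res_dplyr`, and `res_dt` are names introduced here:

```r
# Same data as in the question, rebuilt so this snippet is self-contained.
df <- data.frame(
  P = c("c", "a", "a", "b", "b", "c", "b", "a"),
  Q = c(1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L),
  R = c(10L, 0L, 0L, 0L, 10L, 10L, 0L, 10L),
  stringsAsFactors = FALSE
)

# Returns the last row of the data.frame it is given.
last_row <- function(dataframe) {
  dataframe[nrow(dataframe), ]
}

if (requireNamespace("dplyr", quietly = TRUE)) {
  # group_modify() calls the function once per group, passing the group's
  # rows (without the grouping column) and the group key, then row-binds
  # the results and adds the key back.
  res_dplyr <- dplyr::group_modify(
    dplyr::group_by(df, P),
    function(rows, key) last_row(rows)
  )
  print(res_dplyr)
}

if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
  # .SD holds the rows of the current group, without the grouping column.
  res_dt <- dt[, last_row(.SD), by = P]
  print(res_dt)
}
```

Unlike `by`, both calls return a single table with one row per group rather than a list of one-row `data.frame`s.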