kjo kjo - 2 months ago 6
R Question

Applying data.frame-consuming functions over groups of rows

For example, suppose I have some data.frame df:

df <- read.table(text = "
P Q R
c 1 10
a 1 0
a 2 0
b 2 0
b 1 10
c 2 10
b 1 0
a 2 10
",
stringsAsFactors = FALSE,
header = TRUE)


...and some function foo that takes a data.frame as its argument.

One can imagine splitting df into smaller data.frames according to the value in one of its columns, say P, and applying foo to each of those smaller data.frames.

Below I show the best I can come up with to solve this problem, but I suspect that more streamlined solutions already exist to perform such a natural operation. If so, my question is: what are they?

NB: I show two use-cases below; the first one of the two is the one that I expect can be improved significantly. As for the second one, I think my solution for it may already be about as good as it'll get; I include this use-case just in case my guess is wrong.




My solution depends on whether foo is a function that I call for its return value, or one that I call only for its side effects.

For the former case (foo called for its value), suppose that foo is this:

## returns a one-row data.frame corresponding to a random row of
## dataframe
## NB: this is *just an example* for the sake of this question
foo <- function (dataframe) {
  dataframe[sample(nrow(dataframe), 1), ]
}


...then my solution would be this:

set.seed(0)
sapply(unique(df$P), function (value) foo(df[df$P == value, ]),
simplify = FALSE)
## $c
## P Q R
## 6 c 2 10
##
## $a
## P Q R
## 2 a 1 0
##
## $b
## P Q R
## 5 b 1 10


For the latter case (foo called for its side effect), suppose that foo is this:

## prints to stdout a one-row data.frame corresponding to a random
## row of dataframe
## NB: this is *just an example* for the sake of this question
foo <- function (dataframe) {
  cat(str(dataframe[sample(nrow(dataframe), 1), ]))
}


...then my solution would be this:

set.seed(0)
for (value in unique(df$P)) foo(df[df$P == value, ])
## 'data.frame': 1 obs. of 3 variables:
## $ P: chr "c"
## $ Q: int 2
## $ R: int 10
## 'data.frame': 1 obs. of 3 variables:
## $ P: chr "a"
## $ Q: int 1
## $ R: int 0
## 'data.frame': 1 obs. of 3 variables:
## $ P: chr "b"
## $ Q: int 1
## $ R: int 10

Answer

You can handle both of your use cases with the base R function by. To make the results comparable, however, I change your functions to return or print the last row of each group instead of a randomly selected row. This is necessary because by visits the groups in sorted order of the grouping values (a, b, c), whereas your unique()-based code visits them in order of first appearance (c, a, b); the order of rows within each group is unchanged. With a random selection, the random number generator is therefore consumed in a different sequence and different rows get picked. In a real use case, this difference in group order should not matter.
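
As a quick check on how the groups are ordered, the sketch below rebuilds df from the question and compares the group order seen by the unique()-based code with the one by uses. The variable names loop_order, by_order, and rows_seen are introduced here only for illustration.

```r
df <- read.table(text = "
P Q R
c 1 10
a 1 0
a 2 0
b 2 0
b 1 10
c 2 10
b 1 0
a 2 10
", stringsAsFactors = FALSE, header = TRUE)

# Group order used by the unique()-based code: order of first appearance
loop_order <- unique(df$P)

# Group order used by by(): sorted values of the grouping column
by_order <- dimnames(by(df, df$P, nrow))[[1]]

# Row names that each group's data.frame carries inside by();
# within a group, the original row order is kept
rows_seen <- by(df, df$P, function(d) rownames(d))

print(loop_order)        # "c" "a" "b"
print(by_order)          # "a" "b" "c"
print(rows_seen[["a"]])  # "2" "3" "8"
```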

In your first use case:

foo <- function (dataframe) {
  dataframe[nrow(dataframe), ]
}

out1 <- sapply(unique(df$P), function (value) foo(df[df$P == value, ]),
               simplify = FALSE)

The result out1 is a list:

str(out1)  ## this displays the structure of the out1 object
##List of 3
## $ c:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "c"
##  ..$ Q: int 2
##  ..$ R: int 10
## $ a:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "a"
##  ..$ Q: int 2
##  ..$ R: int 10
## $ b:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "b"
##  ..$ Q: int 1
##  ..$ R: int 0

We can obtain the same per-group results using by, which returns an object of class by (a list with some extra attributes). Note that the groups now appear in sorted order (a, b, c):

by.out1 <- with(df, by(df, P, foo))
str(by.out1)
##List of 3
## $ a:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "a"
##  ..$ Q: int 2
##  ..$ R: int 10
## $ b:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "b"
##  ..$ Q: int 1
##  ..$ R: int 0
## $ c:'data.frame':    1 obs. of  3 variables:
##  ..$ P: chr "c"
##  ..$ Q: int 2
##  ..$ R: int 10
## - attr(*, "dim")= int 3
## - attr(*, "dimnames")=List of 1
##  ..$ P: chr [1:3] "a" "b" "c"
## - attr(*, "call")= language by.data.frame(data = df, INDICES = P, FUN = foo)
## - attr(*, "class")= chr "by"

Here, we are using by with with to execute the by within the environment constructed from df. This allows us to specify the columns of df (such as P) by name without quotes.
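
If you prefer not to use with, an equivalent call passes the grouping vector explicitly as df$P. The sketch below repeats df and the last-row version of foo so that it runs on its own; the name combined is introduced only for this illustration, and do.call(rbind, ...) is just one possible way to collect the per-group rows into a single data.frame.

```r
df <- read.table(text = "
P Q R
c 1 10
a 1 0
a 2 0
b 2 0
b 1 10
c 2 10
b 1 0
a 2 10
", stringsAsFactors = FALSE, header = TRUE)

# Last-row version of foo, as in the answer
foo <- function (dataframe) {
  dataframe[nrow(dataframe), ]
}

# Same grouping as with(df, by(df, P, foo)), with the column spelled out
by.out1 <- by(df, df$P, foo)

# A single group's result can be extracted by name
print(by.out1[["b"]])

# Stack the one-row results into one data.frame (one row per group)
combined <- do.call(rbind, by.out1)
print(combined)
```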

For your second use case (which prints to the console):

foo <- function (dataframe) {
  cat(str(dataframe[nrow(dataframe), ]))
}

for (value in unique(df$P)) foo(df[df$P == value, ])
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "c"
## $ Q: int 2
## $ R: int 10
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "a"
## $ Q: int 2
## $ R: int 10
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "b"
## $ Q: int 1
## $ R: int 0

Again, we can achieve the same result with by:

with(df, by(df, P, foo))
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "a"
## $ Q: int 2
## $ R: int 10
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "b"
## $ Q: int 1
## $ R: int 0
##'data.frame': 1 obs. of  3 variables:
## $ P: chr "c"
## $ Q: int 2
## $ R: int 10
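
One caveat when by is called at the top level purely for side effects: it still returns a by object (here, a list holding one NULL per group), and R may auto-print that object after the output of foo. A minimal sketch of two ways to avoid this, assuming the last-row printing version of foo, is to assign the result or to wrap the call in invisible(). The name res is introduced only for this illustration.

```r
df <- read.table(text = "
P Q R
c 1 10
a 1 0
a 2 0
b 2 0
b 1 10
c 2 10
b 1 0
a 2 10
", stringsAsFactors = FALSE, header = TRUE)

# str() prints its argument's structure itself and returns NULL invisibly
foo <- function (dataframe) {
  str(dataframe[nrow(dataframe), ])
}

# Option 1: assigning the result suppresses auto-printing of the by object
res <- by(df, df$P, foo)

# Option 2: discard the result explicitly
invisible(by(df, df$P, foo))
```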

The function by is part of base R. As mentioned by Dave2e, many other packages offer similar data manipulation capabilities. Some provide more syntactic sugar for ease of use, others are better optimized, and some do both. Among them are plyr, dplyr, and data.table. I leave it to you to explore these.
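
Staying in base R, the same split-apply pattern can also be spelled out with split and lapply, which returns a plain named list rather than a by object. This is a sketch of an alternative, not a replacement for the by calls above; the names pieces and out2 are introduced only for this illustration.

```r
df <- read.table(text = "
P Q R
c 1 10
a 1 0
a 2 0
b 2 0
b 1 10
c 2 10
b 1 0
a 2 10
", stringsAsFactors = FALSE, header = TRUE)

# Last-row version of foo, as in the answer
foo <- function (dataframe) {
  dataframe[nrow(dataframe), ]
}

# split() cuts df into a named list of data.frames, one per value of P
pieces <- split(df, df$P)

# lapply() then applies foo to each piece and keeps the names
out2 <- lapply(pieces, foo)
str(out2)
```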
