pv7 pv7 - 27 days ago 8
R Question

Grouping and summarising when the columns returned are not known in advance

I have a dataframe --say x -- that feeds a function which returns a subset depending on the value of a column x$id.

This subset y includes a column y$room that contains a different mix of values depending on the x$id value.

The subset is then spread with tidyr and the values of the y$room become columns.

Then the resulting extended df --say ext_y-- must be grouped by a column y_ext$visit and summary statistics should be calculated for the remaining columns by a special function.

The obvious problem is that these columns are not known in advance and therefore can not be defined by their names within the function.

The alternative of using the indexes of the columns instead of the names does not seem to work with dplyr, when group_by is involved.

Do you have any ideas how this problem could be tackled?

The dataframe has many thousands rows, so I will give you only a glimpse:

> tail(y)
id visit room value
11940 14 2 living room 19
11941 14 2 living room 16
11942 14 2 living room 15
11943 14 2 living room 22
11944 14 2 living room 25
11945 14 2 living room 20

> unique(x$id)
[1] 14 20 41 44 46 54 64 74 104 106
> unique(x$visit)
[1] 0 1 2
> unique(x$room)
[1] "bedroom" "living room" "family room" "study room" "den"
[6] "tv room" "office" "hall" "kitchen" "dining room"
> summary(x$value)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.750 7.875 17.410 16.000 1775.000


For a given id the spread() of tidyr returns only a subset of the room values in x. E.g. for id = 54:

> y<- out
> y$row <- 1 : nrow(y)
> y_ext <- spread(y, room, value)
> head(y_ext)
id visit row bedroom family room living room
1 14 0 1 6.00 NA NA
2 14 0 2 6.00 NA NA
3 14 0 3 2.75 NA NA
4 14 0 4 2.75 NA NA
5 14 0 5 2.75 NA NA
6 14 0 6 2.75 NA NA


Now, I must compose a function that groups the result by visit and summarises the columns that are returned for each group in the following form:

visit bedroom family room living room
1 0 NA 2.79 3.25
2 1 NA NA 4.53
3 2 4.19 3.77 NA


As I mentioned above, I do not know in advance which columns will be returned for a given id and this complicates the problem. Of course a short cut would be
to check and find out for each id which columns are returned and then create an
if structure that directs each id to the appropriate code, but this is not very elegant, I am afraid.

Hope this helped to give you a better picture.

Answer

Alright, this was interesting enough to me that I made some sample data myself:

allRooms <-
  c("Living", "Dining", "Bedroom", "Master", "Family", "Garage", "Office")

set.seed(8675309)

df <-
  data_frame(
    id = sample(1:5, nSamples, TRUE)
    , visit = sample(1:3, nSamples, TRUE)
    , room = sample(allRooms, nSamples, TRUE)
    , value = round(rnorm(nSamples, 20, 5))
  )

The way I see it, there are three approaches, in ascending order of reasonabality. First option, is to follow your basic layout. Here, I am splitting the df by the id, spreading as instructed, then using summarise_all to do the summation, removing the need to identify room names explicitly.

df %>%
  split(.$id) %>%
  lapply(function(x){
    x %>%
      select(-id) %>%
      mutate(row = 1:n()) %>%
      spread(room, value) %>%
      select(-row) %>%
      group_by(visit) %>%
      summarise_all(sum, na.rm = TRUE)
  })

This returns the following (note unique columns):

$`1`
# A tibble: 3 × 6
  visit Bedroom Dining Garage Master Office
  <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1       0     27     27      0      0
2     2      22     19      0     20     23
3     3       0      0      0     27      0

$`2`
# A tibble: 3 × 6
  visit Bedroom Dining Family Living Office
  <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1      15      0      0      0     17
2     2       0     14     42     30      0
3     3      15     13     18      0     20

$`3`
# A tibble: 3 × 6
  visit Bedroom Dining Living Master Office
  <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1      24      0     36      0     28
2     2       0      0     15     30      0
3     3       0     25     21      0     15

$`4`
# A tibble: 3 × 7
  visit Bedroom Dining Garage Living Master Office
  <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1       0      0     23     20      0     24
2     2       0     28     22      0      0      0
3     3      24      0     36      0     16      0

$`5`
# A tibble: 3 × 8
  visit Bedroom Dining Family Garage Living Master Office
  <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1      23      0      0     21      0     16      0
2     2      44     14     41      0     26      0     18
3     3      21     19      0      0     25     19      0

However, because you had to add the row in to get a spread to work (without it, there are not unique entries), the spread actually doesn't help. You can get the same thing a lot more easily if you do the summarising first, like so:

df %>%
  split(.$id) %>%
  lapply(function(x){
    x %>%
      select(-id) %>%
      group_by(visit, room) %>%
      summarise(Sum = sum(value)) %>%
      spread(room, Sum, 0)
  })

Note that it gives 0 for rooms with no visits because of that last 0 for the fill argument. If you would rather that returns NA, you can leave the default.

Finally, it is unclear why you would want to do this separately in the first place. It may make far more sense to just do this all in one big group_by and handle the missings as needed after the fact. To wit, here is a lot less code to get the same summaries.

df %>%
  group_by(id, visit, room) %>%
  summarise(sum = sum(value)) %>%
  spread(room, sum)

gives

      id visit Bedroom Dining Family Garage Living Master Office
*  <int> <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1      1     1      NA     27     NA     27     NA     NA     NA
2      1     2      22     19     NA     NA     NA     20     23
3      1     3      NA     NA     NA     NA     NA     27     NA
4      2     1      15     NA     NA     NA     NA     NA     17
5      2     2      NA     14     42     NA     30     NA     NA
6      2     3      15     13     18     NA     NA     NA     20
7      3     1      24     NA     NA     NA     36     NA     28
8      3     2      NA     NA     NA     NA     15     30     NA
9      3     3      NA     25     NA     NA     21     NA     15
10     4     1      NA     NA     NA     23     20     NA     24
11     4     2      NA     28     NA     22     NA     NA     NA
12     4     3      24     NA     NA     36     NA     16     NA
13     5     1      23     NA     NA     21     NA     16     NA
14     5     2      44     14     41     NA     26     NA     18
15     5     3      21     19     NA     NA     25     19     NA

If you want to filter down to just one id, use filter after the fact, then remove columns with all NA entries. (Note, you would likely save the output once, then pass it through the last two lines once for each id of interest, e.g., when printing)

df %>%
  group_by(id, visit, room) %>%
  summarise(sum = sum(value)) %>%
  spread(room, sum) %>%
  filter(id == 1) %>%
  select_if(function(col) mean(is.na(col)) != 1)

gives

     id visit Bedroom Dining Garage Master Office
  <int> <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1     1      NA     27     27     NA     NA
2     1     2      22     19     NA     20     23
3     1     3      NA     NA     NA     27     NA
Comments