Jake Fisher - 25 days ago
R Question

Why do group_by and group_by_ give different answers when summarizing by two variables?

In the following example, I want to create a summary statistic by two variables. When I do it with dplyr::group_by, I get the correct answer, but when I do it with dplyr::group_by_, it summarizes one level more than I want it to.

library(dplyr)
set.seed(919)
df <- data.frame(
a = c(1, 1, 1, 2, 2, 2),
b = c(3, 3, 4, 4, 5, 5),
x = runif(6)
)

# Gives correct answer
df %>%
group_by(a, b) %>%
summarize(total = sum(x))

# Source: local data frame [4 x 3]
# Groups: a [?]
#
# a b total
# <dbl> <dbl> <dbl>
# 1 1 3 1.5214746
# 2 1 4 0.7150204
# 3 2 4 0.1234555
# 4 2 5 0.8208454

# Wrong answer -- too many levels summarized
df %>%
group_by_(c("a", "b")) %>%
summarize(total = sum(x))
# # A tibble: 2 × 2
# a total
# <dbl> <dbl>
# 1 1 2.2364950
# 2 2 0.9443009


What's going on?

Answer

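group_by_ treats each unnamed argument as a single grouping expression, so c("a", "b") passed as one argument is not expanded into two variables; as the output in the question shows, only a ends up being used. If you want to use a vector of variable names, pass it to the .dots parameter: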

df %>%
  group_by_(.dots = c("a", "b")) %>%
  summarize(total = sum(x))

#Source: local data frame [4 x 3]
#Groups: a [?]

#      a     b     total
#  <dbl> <dbl>     <dbl>
#1     1     3 1.5214746
#2     1     4 0.7150204
#3     2     4 0.1234555
#4     2     5 0.8208454

Or pass each column name as a separate string argument, mirroring how you would list bare column names in the NSE version:

df %>%
  group_by_("a", "b") %>%
  summarize(total = sum(x))

#Source: local data frame [4 x 3]
#Groups: a [?]

#      a     b     total
#  <dbl> <dbl>     <dbl>
#1     1     3 1.5214746
#2     1     4 0.7150204
#3     2     4 0.1234555
#4     2     5 0.8208454
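
As a side note, the underscore verbs such as group_by_ were deprecated in later dplyr releases. The following is a minimal sketch of the same grouping by a character vector, assuming dplyr 1.0.0 or later (for across(), all_of(), and the .groups argument). The name grp_cols is hypothetical and only used for illustration.

```r
library(dplyr)

set.seed(919)
df <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),
  b = c(3, 3, 4, 4, 5, 5),
  x = runif(6)
)

# Hypothetical variable holding the grouping column names as strings
grp_cols <- c("a", "b")

res <- df %>%
  # across(all_of(...)) selects the columns named in the character vector
  group_by(across(all_of(grp_cols))) %>%
  # .groups = "drop" returns an ungrouped result
  summarize(total = sum(x), .groups = "drop")

# One row per distinct (a, b) pair
nrow(res)
# 4
```

Because all_of() errors on names that are not columns of the data, a typo in grp_cols fails loudly instead of silently grouping by fewer variables.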