jenswirf - 3 months ago 87
R Question

Relative frequencies / proportions with dplyr

Suppose I want to calculate the proportion of different values within each group. For example, using the mtcars data, how do I calculate the relative frequency of number of gears by am (automatic/manual) in one go with dplyr?

library(dplyr)
data(mtcars)
mtcars = tbl_dt(mtcars)

# calculate frequency
mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n())

# am gear  n
#  0    3 15
#  0    4  4
#  1    4  8
#  1    5  5


What I would like to achieve (prettified):

am gear  n rel.freq
 0    3 15      79%
 0    4  4      21%
 1    4  8      62%
 1    5  5      38%


EDIT:

For completeness, I'll post my not-so-pretty attempt using the data.table special symbol .N:

mtcars %>%
  group_by(am) %>%
  mutate(total = .N) %>%
  group_by(am, gear, total) %>%
  summarise(n = n()) %>%
  mutate(rel.freq = n / total)

Answer

Try this:

mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))

#   am gear  n      freq
# 1  0    3 15 0.7894737
# 2  0    4  4 0.2105263
# 3  1    4  8 0.6153846
# 4  1    5  5 0.3846154

From the dplyr vignette: "When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset". So after the summarise, the last grouping variable, gear, is peeled off and the result is grouped only by am (you can check this by calling groups on the resulting data). The mutate then evaluates sum(n) within each am group, which is why the frequencies add up to 1 for each value of am.
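
To see the peeling in action, here is a minimal sketch that makes the dropped grouping level explicit. It assumes dplyr 1.0.0 or later, where summarise accepts a .groups argument and group_vars is available. The names by_am_gear and result are only illustrative.

```r
library(dplyr)

# Count rows per (am, gear). "drop_last" spells out the peeling behaviour
# described above (assumption: dplyr >= 1.0.0 for the .groups argument).
by_am_gear <- mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n(), .groups = "drop_last")

# Only am remains as a grouping variable.
group_vars(by_am_gear)
# [1] "am"

# sum(n) is therefore evaluated within each am level.
result <- by_am_gear %>%
  mutate(freq = n / sum(n))

# The frequencies add up to 1 within each am level.
result %>% summarise(total = sum(freq))
```

If the result should not stay grouped afterwards, adding ungroup() at the end of the pipeline is a common way to avoid surprises in later steps.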

For rounding and prettification, please refer to the nice answer by @Tyler Rinker.
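
As a rough idea of what such prettification could look like, here is one possible sketch using base R's round and paste0. It is not necessarily the approach taken in the referenced answer. The column name rel.freq follows the question, pretty_tbl is a made-up name, and the .groups argument again assumes dplyr 1.0.0 or later.

```r
library(dplyr)

pretty_tbl <- mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n(), .groups = "drop_last") %>%
  # Still grouped by am here, so sum(n) is the per-am total.
  # Round to whole percentages and append a percent sign.
  mutate(rel.freq = paste0(round(100 * n / sum(n)), "%"))

pretty_tbl$rel.freq
# [1] "79%" "21%" "62%" "38%"
```

Note that rel.freq becomes a character column this way, so it is best kept for display and computed last.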