AnnaZ AnnaZ - 29 days ago 9
R Question

"R for Data Science" book (Wickham): cannot reproduce example

I am following H. Wickham's R for Data Science and cannot get a code snippet from the book to work.
I am referring to this section of the book and the graph that follows it (plot).

I literally copied and pasted the part of the code from the book, but it does not work as expected.

library(tidyverse)
library(forcats)

by_age <- gss_cat %>%
filter(!is.na(age)) %>%
group_by(age, marital) %>%
count() %>%
mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop, color = marital)) +
geom_line(na.rm = TRUE)


And even if I add ungroup() %>% right before mutate(), it plots something, but not what is in the book (the pattern is slightly different).

I would greatly appreciate it if someone could explain this paradox.

The main issue is that the prop values are all equal to 1 in my case. As a result, I get just horizontal lines on the plot.

Thank you!

tidyverse version: 1.1.1
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Answer Source

This looks to be a rather simple issue with the code. Yes, it should probably be fixed by Hadley and co, but it's not a big deal.

If you start by printing by_age in the console you should see:

# A tibble: 351 x 4
# Groups:   age, marital [351]

So, the tibble is still grouped by both age and marital after count(). Each (age, marital) group then contains exactly one row, so the sum(n) inside mutate() is taken over that single row. That means sum(n) == n, and therefore prop == 1 for every row.
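The effect is easy to see on a tiny made-up table. The sketch below is only an illustration: `toy` is hypothetical data standing in for gss_cat, and the explicit group_by(age, marital) after count() is there to reproduce the grouping shown in the printout above, regardless of how a given dplyr version treats groups in count().

```r
library(dplyr)

# Hypothetical toy data standing in for gss_cat
toy <- tibble(
  age     = c(20, 20, 20, 30, 30),
  marital = c("Married", "Married", "Divorced", "Married", "Divorced")
)

# One row per (age, marital) pair, grouped by both columns
counts <- toy %>%
  count(age, marital) %>%
  group_by(age, marital)

# Every group has a single row, so sum(n) == n and prop is always 1
grouped_prop <- counts %>%
  mutate(prop = n / sum(n))

# With ungroup(), sum(n) runs over the whole table instead,
# so prop becomes a share of all rows, not a share within each age
ungrouped_prop <- counts %>%
  ungroup() %>%
  mutate(prop = n / sum(n))

print(grouped_prop)
print(ungrouped_prop)
```

The second result is probably why adding ungroup() alone draws something, just not the book's plot: the proportions are then relative to the full data set rather than to each age.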

You were on the right track with ungroup(). However, the desired calculation is the proportion of each marital status within each age. So, add a group_by(age) between count() and mutate() and you are golden.

by_age <-  gss_cat %>%
  filter(!is.na(age)) %>%
  group_by(age, marital) %>%
  count() %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop, color = marital)) +
  geom_line(na.rm = TRUE)

Results in:

[Image: result]
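As a quick sanity check, and only as a sketch, the proportions should add up to 1 within every age once the data is grouped by age alone. The version below calls count(age, marital) directly on the ungrouped data, which is an alternative spelling of the same counting step; `check` is just an illustrative name.

```r
library(dplyr)
library(forcats)  # provides the gss_cat data set

by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%       # one row per (age, marital) pair
  group_by(age) %>%             # proportions are computed within each age
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Sum the proportions per age; each total should be 1 (up to rounding)
check <- by_age %>%
  group_by(age) %>%
  summarise(total = sum(prop))

print(all(abs(check$total - 1) < 1e-8))
```

If this prints TRUE and prop is no longer 1 on every row, the grouping is doing what the plot needs.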