Christoph - 4 months ago
R Question

Behavior of ggplot2 aes() in combination with facet_grid() when passing a variable with dollar sign notation to aes()

I am currently doing some analysis with ggplot2 for a project, and by chance I stumbled across some behavior that seems weird to me and that I cannot explain. When I write aes(x = cyl, ...), the plot looks different from the one I get when I pass the same variable as aes(x = mtcars$cyl, ...). When I remove facet_grid(am ~ .), both graphs are the same again. The code below is modeled on the code in my project and reproduces the behavior:

library(dplyr)
library(ggplot2)

data = mtcars

test.data = data %>%
  select(-hp)

# Graph 1: x is passed with dollar sign notation
ggplot(test.data, aes(x = test.data$cyl, y = mpg)) +
  geom_point() +
  facet_grid(am ~ .) +
  labs(title = "graph 1 - dollar sign notation")

# Graph 2: x is passed as a bare column name
ggplot(test.data, aes(x = cyl, y = mpg)) +
  geom_point() +
  facet_grid(am ~ .) +
  labs(title = "graph 2 - no dollar sign notation")


Here is the picture of graph 1:
[image: graph 1 - dollar sign notation]

Here is the picture of graph 2:
[image: graph 2 - no dollar sign notation]

I found that I can work around this problem by using aes_string() instead of aes() and passing the variable names as strings, but I would like to understand why ggplot behaves this way.

Thanks a lot in advance for any help! I feel very uncomfortable when I do not understand something like this properly...

Answer

tl;dr

Never use [ or $ inside aes().
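
As a rough sketch of what to write instead, the two forms below refer to the column through the data that ggplot2 is working with rather than through an outside vector. This assumes ggplot2 3.0.0 or later for the .data pronoun, and x_var is a hypothetical name introduced only for this illustration:

```r
library(ggplot2)

# Bare column name: looked up inside the data ggplot2 is working with
p_bare <- ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point() +
  facet_grid(am ~ .)

# Column name held in a string (x_var is a hypothetical variable for this sketch)
x_var <- "cyl"
p_pronoun <- ggplot(mtcars, aes(x = .data[[x_var]], y = mpg)) +
  geom_point() +
  facet_grid(am ~ .)
```

Both forms should give the same panels. The .data[[...]] form is mainly useful when the column name is only known as a string, which is the situation where aes_string() is often reached for.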


Consider this illustrative example, where the faceting variable f is purposely in a non-obvious order with respect to x:

d <- data.frame(x=1:10, f=rev(letters[gl(2,5)]))

Now contrast what happens with these two plots:

p1 <- ggplot(d) +
  facet_grid(.~f, labeller = label_both) +
  geom_text(aes(x, y=0, label=x, colour=f)) +
  ggtitle("good mapping") 

p2 <- ggplot(d) +
  facet_grid(.~f, labeller = label_both) +
  geom_text(aes(d$x, y=0, label=x, colour=f)) +
  ggtitle("$ corruption") 


We can get a better idea of what's happening by looking at the data.frame that ggplot2 creates internally for each panel:

 ggplot_build(p1)[["data"]][[1]][,c("x","PANEL")]

    x PANEL
1   6     1
2   7     1
3   8     1
4   9     1
5  10     1
6   1     2
7   2     2
8   3     2
9   4     2
10  5     2

 ggplot_build(p2)[["data"]][[1]][,c("x", "PANEL")]

    x PANEL
1   1     1
2   2     1
3   3     1
4   4     1
5   5     1
6   6     2
7   7     2
8   8     2
9   9     2
10 10     2

The second plot has the wrong mapping: when ggplot creates the data.frame for each panel, it picks up the x values in the "wrong" order.

This occurs because using $ breaks the link between the variables being mapped. d$x is evaluated to a plain vector, so ggplot has to treat it as an independent variable that, for all it knows, could come from an arbitrary, disconnected source. Since the data.frame in this example is not ordered according to the factor f, the per-panel subsets that ggplot builds internally line up with that vector in the wrong order.
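
The mismatch can be imitated in base R without ggplot2 at all. The sketch below is a simplified model and not ggplot2's actual internals: it assumes that the per-panel step amounts to sorting the rows by the faceting variable, and the names by_panel, good_x and bad_x are made up for the illustration.

```r
# Same toy data as in the example above
d <- data.frame(x = 1:10, f = rev(letters[gl(2, 5)]))

# Assumed stand-in for the per-panel step: sort the rows by the faceting variable
by_panel <- d[order(d$f), ]

# A column looked up inside the reordered data travels with its rows
good_x <- by_panel$x

# A vector fetched from the original data frame keeps the original row order
bad_x <- d$x

# Side by side: good_x follows f, bad_x does not
comparison <- data.frame(f = by_panel$f, good_x = good_x, bad_x = bad_x)
print(comparison)
```

In this model, good_x comes out as 6 to 10 followed by 1 to 5, matching the first table above, while bad_x stays at 1 to 10 and ends up attached to the wrong rows, matching the second.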