dzeltzer - 1 year ago 158
R Question

# ggplot stat_summary_bin glitch?

I was happy to discover that ggplot has binned scatter plots, which are useful for exploring and visualizing relationships in large data. Yet the top bin appears to misbehave. Here's an example: all bin averages are roughly linearly aligned, as they should be, but the top one is off on both dimensions.

The code:

``````
library(ggplot2)

# simulate an example of linear data
set.seed(1)
N <- 10^4
x <- runif(N)
y <- x + rnorm(N)
dt <- data.frame(x = x, y = y)

ggplot(dt, aes(x, y)) +
  geom_point(alpha = 0.1, size = 0.01) +
  stat_summary_bin(fun.y = 'mean', bins = 10, color = 'orange', size = 5, geom = 'point')
``````

Is there a simple workaround (and where should this be posted)?

`stat_summary_bin` is actually excluding the two rows with the largest x-values from the regular bins, and those two rows end up with bin = `NA`. The mean of those two excluded values is plotted as a separate bin to the right of the regular bins. First I show what is going wrong in your original plot, then I provide a workaround to get the desired behavior.

### What's going wrong in the original plot

To see what's going wrong in your original plot, create a plot with two calls to `stat_summary_bin` where we calculate the mean of each bin and the number of values in each bin. Then use `ggplot_build` to capture all of the internal data that ggplot generated to create the plot.

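``````
p1 = ggplot(dt, aes(x, y)) +
  geom_point(alpha = 0.1, size = 0.01) +
  # layer 2: mean of y in each bin, drawn as text
  stat_summary_bin(fun.y = mean, bins = 10, size = 5, geom = 'text',
                   aes(label = ..y..)) +
  # layer 3: number of values in each bin, drawn as text
  stat_summary_bin(fun.y = length, bins = 10, size = 5, geom = 'text',
                   aes(label = ..y.., y = 0))

# capture the data ggplot computed for each layer
p1b = ggplot_build(p1)
``````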

Now let's look at the data for the `mean` and `length` layers, respectively. These are elements 2 and 3 of `p1b$data`, because element 1 belongs to the `geom_point` layer. I've printed only bins 9 through 11 (the three right-most bins) for brevity. Bin 11 is the "extra" bin: it contains only 2 values (its `label` is `2` in the second table below), and the mean of those two values is `-0.1309998`, as shown in the first table below.

``````
p1b$data[[2]][9:11, c(1, 2, 4, 6, 7)]
``````
``````
        label bin          y         x      width
9   0.8158320   9  0.8158320 0.8498505 0.09998242
10  0.9235531  10  0.9235531 0.9498329 0.09998242
11 -0.1309998  11 -0.1309998 1.0498154 0.09998244
``````
``````
p1b$data[[3]][9:11, c(1, 2, 4, 6, 7)]
``````
``````
   label bin    y         x      width
9   1025   9 1025 0.8498505 0.09998242
10  1042  10 1042 0.9498329 0.09998242
11     2  11    2 1.0498154 0.09998244
``````

Which two values are those? It looks like they come from the two rows with the highest x values in the original data frame:

``````
mean(dt[order(-dt$x), "y"][1:2])
``````
``````
[1] -0.1309998
``````
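To look at the two rows themselves rather than only their mean, a minimal base-R sketch along these lines could be used. It rebuilds `dt` from the question's simulation so that it runs on its own; `top2` is a name introduced here purely for illustration.

```r
# Rebuild the simulated data from the question so this runs standalone
set.seed(1)
N <- 10^4
x <- runif(N)
y <- x + rnorm(N)
dt <- data.frame(x = x, y = y)

# The two rows with the largest x values (hypothetical helper name: top2)
top2 <- dt[order(-dt$x), ][1:2, ]
print(top2)

# Mean of y over those two rows, the quantity compared with the extra bin
print(mean(top2$y))
```

Printing `top2` shows how close the two x values sit to the upper end of the data, which is where the regular bins stop.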

I'm not sure how `stat_summary_bin` is managing to bin the data such that the two highest x values are excluded.

### Workaround to get the desired behavior

A workaround is to summarize the data yourself, so you'll have complete control over how the bins are created. The example below uses your original code and then plots pre-summarized values in blue, so you can compare the behavior. I've included the `dplyr` package so that I can use the chaining operator (`%>%`) to summarize the data on the fly:

``````
library(dplyr)

ggplot(dt, aes(x, y)) +
  geom_point(alpha = 0.1, size = 0.01) +
  # original binned means (orange)
  stat_summary_bin(fun.y = 'mean', bins = 10, color = 'orange', size = 5, geom = 'point') +
  # pre-summarized bin means with explicit breaks (blue)
  geom_point(data = dt %>%
               group_by(bins = cut(x, breaks = seq(min(x), max(x), length.out = 11),
                                   include.lowest = TRUE)) %>%
               summarise(x = mean(x), y = mean(y)),
             aes(x, y), size = 3, color = "blue") +
  theme_bw()
``````
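As a sanity check on the manual binning, the sketch below uses only base R to confirm that breaks spanning `min(x)` to `max(x)` leave no row without a bin, and to show why `include.lowest=TRUE` matters. `cut()` builds right-closed intervals by default, so without that argument the smallest x value falls outside the first interval. The names `breaks`, `bin_closed` and `bin_open` are introduced here for illustration, and the data is rebuilt so the block runs on its own.

```r
# Rebuild the simulated data from the question so this runs standalone
set.seed(1)
N <- 10^4
x <- runif(N)
y <- x + rnorm(N)
dt <- data.frame(x = x, y = y)

# Eleven equally spaced breaks give ten bins covering the full range of x
breaks <- seq(min(dt$x), max(dt$x), length.out = 11)

# With include.lowest = TRUE the first interval is closed on both sides
bin_closed <- cut(dt$x, breaks = breaks, include.lowest = TRUE)

# Default behaviour: intervals are (a, b], so min(x) gets no bin
bin_open <- cut(dt$x, breaks = breaks)

# Number of rows left without a bin under each variant
print(sum(is.na(bin_closed)))
print(sum(is.na(bin_open)))

# Rows per bin and mean of y per bin, without dplyr
print(table(bin_closed))
print(tapply(dt$y, bin_closed, mean))
```

The `tapply()` call gives the same per-bin means of `y` as the `dplyr` pipeline, which may be convenient if you would rather not add a package dependency.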
