I was happy to discover that ggplot has binned scatter plots, which are useful for exploring and visualizing relationships in large datasets. Yet the top bin appears to misbehave. In the example below, all bin averages are roughly linearly aligned, as they should be, except for the top one, which is off on both dimensions:
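library(ggplot2)

# simulate an example of linear data
N <- 10^4
x <- runif(N)
y <- x + rnorm(N)
dt <- data.frame(x=x, y=y)

# scatter plot with the mean of y in each of 10 x-bins overlaid in orange
# (fun.y was renamed to fun in ggplot2 3.3.0)
ggplot(dt, aes(x, y)) +
  geom_point(alpha = 0.1, size = 0.01) +
  stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point')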
stat_summary_bin is actually excluding the two rows with the largest x-values from the bins, and those two values are ending up with bin = NA. The mean of those two excluded values is plotted as a separate bin to the right of the regular bins. First, I show what is going wrong in your original plot; then I provide a workaround to get the desired behavior.
To see what's going wrong in your original plot, create a plot with two calls to stat_summary_bin, one calculating the mean of each bin and the other the number of values in each bin. Then use ggplot_build to capture all of the internal data that ggplot generated to create the plot.
p1 = ggplot(dt, aes(x, y)) +
  geom_point(alpha = 0.1, size = 0.01) +
  stat_summary_bin(fun.y=mean, bins=10, size=5, geom='text', aes(label=..y..)) +
  stat_summary_bin(fun.y=length, bins=10, size=5, geom='text', aes(label=..y.., y=0))

p1b = ggplot_build(p1)
Now let's look at the data for the mean and length layers, respectively. I've printed only bins 9 through 11 (the three right-most bins) for brevity. Bin 11 is the "extra" bin: it contains only 2 values (its label is 2 in the second table below), and the mean of those two values is -0.1309998, as can be seen in the first table below.
        label bin          y         x      width
9   0.8158320   9  0.8158320 0.8498505 0.09998242
10  0.9235531  10  0.9235531 0.9498329 0.09998242
11 -0.1309998  11 -0.1309998 1.0498154 0.09998244

   label bin    y         x      width
9   1025   9 1025 0.8498505 0.09998242
10  1042  10 1042 0.9498329 0.09998242
11     2  11    2 1.0498154 0.09998244
Which two values are those? It looks like they come from the two rows with the highest x values in the original data frame:
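One way to check this is to sort the data by x and average the y values of the top two rows. The snippet below is only a sketch: it re-simulates data in the same way as the question, with a `set.seed` call that I've added for reproducibility, so its numbers will not match the tables above. The name `top2` is just an illustrative choice.

```r
# Re-simulate data like in the question (the seed is an added assumption)
set.seed(1)
N <- 10^4
dt <- data.frame(x = runif(N))
dt$y <- dt$x + rnorm(N)

# Pick out the two rows with the largest x values
top2 <- dt[order(dt$x, decreasing = TRUE)[1:2], ]
top2

# Mean of their y values; compare this with the y value of the "extra" bin
mean(top2$y)
```

If the mean printed here matches the y value that ggplot_build reports for the extra bin on the same data, those are the two rows that were left out of the regular bins.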
I'm not sure how stat_summary_bin is managing to bin the data such that the two highest x values are excluded.
A workaround is to summarize the data yourself, so you'll have complete control over how the bins are created. The example below uses your original code and then plots pre-summarized values in blue, so you can compare the behavior. I've included the dplyr package so that I can use the chaining operator (%>%) to summarize the data on the fly:
library(dplyr)

ggplot(dt, aes(x, y)) +
  geom_point(alpha = 0.1, size = 0.01) +
  # original binned means (orange)
  stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point') +
  # pre-summarized means over 10 equal-width bins spanning the full range of x (blue)
  geom_point(data=dt %>%
               group_by(bins=cut(x, breaks=seq(min(x), max(x), length.out=11), include.lowest=TRUE)) %>%
               summarise(x=mean(x), y=mean(y)),
             aes(x, y), size=3, color="blue") +
  theme_bw()
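To see why the manual binning should not drop any rows, it can help to look at the `cut` call on its own. The sketch below uses base R only and re-simulates x with a seed that I've added. The names `bins_closed` and `bins_open` are illustrative. By default `cut` builds intervals that are open on the left, so the smallest value falls outside the first bin unless `include.lowest=TRUE` is set.

```r
# Re-simulate x like in the question (the seed is an added assumption)
set.seed(1)
N <- 10^4
x <- runif(N)

# 11 break points give 10 equal-width bins from min(x) to max(x)
breaks <- seq(min(x), max(x), length.out = 11)

# With include.lowest = TRUE the first interval is closed on both ends
bins_closed <- cut(x, breaks = breaks, include.lowest = TRUE)

# Default behaviour: intervals are (a, b], so min(x) is not covered
bins_open <- cut(x, breaks = breaks)

# Count the values that were not assigned to any bin in each case
sum(is.na(bins_closed))
sum(is.na(bins_open))

# Number of values per bin for the manual binning
table(bins_closed)
```

Checking for NA bins like this before summarizing is a cheap way to confirm that every row contributes to exactly one bin average.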