dzeltzer - 5 months ago 37

R Question

I was happy to discover that ggplot2 has binned scatter plots, which are useful for exploring and visualizing relationships in large data sets. Yet the top bin appears to misbehave. In the example below, all bin averages are roughly linearly aligned, as they should be, but the top one is off on both dimensions.

The code:

```
library(ggplot2)

# simulate an example of linear data
set.seed(1)
N <- 10^4
x <- runif(N)
y <- x + rnorm(N)
dt <- data.frame(x=x, y=y)

ggplot(dt, aes(x, y)) +
  geom_point(alpha = 0.1, size = 0.01) +
  stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point')
```

Is there a simple workaround (and where should this be posted)?

Answer

`stat_summary_bin` is actually excluding the two rows with the largest x-values from the bins, and those two values end up with bin = `NA`. The mean of those two excluded values is plotted as a separate bin to the right of the regular bins. First I show what is going wrong in your original plot, then I provide a workaround to get the desired behavior.

To see what's going wrong in your original plot, create a plot with two calls to `stat_summary_bin`, one calculating the mean of each bin and the other the number of values in each bin. Then use `ggplot_build` to capture all of the internal data that ggplot generated to create the plot.

```
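p1 = ggplot(dt, aes(x, y)) +
  geom_point(alpha = 0.1, size = 0.01) +
  # layer 2: label each bin with its mean
  stat_summary_bin(fun.y=mean, bins=10, size=5, geom='text',
                   aes(label=..y..)) +
  # layer 3: label each bin with its row count, drawn at y = 0
  stat_summary_bin(fun.y=length, bins=10, size=5, geom='text',
                   aes(label=..y.., y=0))

# capture the data ggplot computed for each layer
p1b = ggplot_build(p1)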
```

Now let's look at the data for the `mean` and `length` layers, respectively. I've printed only bins 9 through 11 (the three right-most bins) for brevity. Bin 11 is the "extra" bin. You can see that it contains only 2 values (its `label` is `2` in the second table below) and that the mean of those two values is `-0.1309998`, as shown in the first table below.

```
p1b$data[[2]][9:11,c(1,2,4,6,7)]

        label bin          y         x      width
9   0.8158320   9  0.8158320 0.8498505 0.09998242
10  0.9235531  10  0.9235531 0.9498329 0.09998242
11 -0.1309998  11 -0.1309998 1.0498154 0.09998244
```

```
p1b$data[[3]][9:11,c(1,2,4,6,7)]

   label bin    y         x      width
9   1025   9 1025 0.8498505 0.09998242
10  1042  10 1042 0.9498329 0.09998242
11     2  11    2 1.0498154 0.09998244
```

Which two values are those? It looks like they come from the two rows with the highest x values in the original data frame:

```
mean(dt[order(-dt$x), "y"][1:2])
```

`[1] -0.1309998`
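The same question (does every row land in a bin?) can also be checked outside ggplot. The following is a minimal base-R sketch, not a reproduction of what `stat_summary_bin` does internally: it assumes ten equal-width bins spanning the observed range of `x`, and the names `breaks` and `bin` are chosen here for illustration.

```r
# Recreate the simulated data from the question
set.seed(1)
N <- 10^4
x <- runif(N)
y <- x + rnorm(N)
dt <- data.frame(x = x, y = y)

# Ten equal-width bins spanning the observed range of x (an assumption,
# not necessarily the breaks ggplot computes)
breaks <- seq(min(dt$x), max(dt$x), length.out = 11)

# cut() uses right-closed intervals by default; include.lowest = TRUE also
# closes the first interval on the left so that min(x) is kept
bin <- cut(dt$x, breaks = breaks, include.lowest = TRUE)

# Rows per bin; an NA entry would appear here if any row fell outside every bin
table(bin, useNA = "ifany")

# Number of rows that were not assigned to any bin
sum(is.na(bin))
```

Because both end points of the range are included in a bin, no row should come back as `NA` with these breaks, which is what makes the manual approach a useful baseline for comparison.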

I'm not sure how `stat_summary_bin` is managing to bin the data such that the two highest x values are excluded.

A workaround is to summarize the data yourself, so you'll have complete control over how the bins are created. The example below uses your original code and then plots pre-summarized values in blue, so you can compare the behavior. I've included the `dplyr` package so that I can use the chaining operator (`%>%`) to summarize the data on the fly:

```
library(dplyr)

ggplot(dt, aes(x, y)) +
  geom_point(alpha = 0.1, size = 0.01) +
  # original binned means (orange)
  stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point') +
  # manually binned means (blue); include.lowest=TRUE keeps min(x) in the first bin
  geom_point(data=dt %>%
               group_by(bins=cut(x, breaks=seq(min(x), max(x), length.out=11), include.lowest=TRUE)) %>%
               summarise(x=mean(x), y=mean(y)),
             aes(x, y), size=3, color="blue") +
  theme_bw()
```
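If you would rather not depend on `dplyr`, the same per-bin summary can be computed in base R. This is only a sketch under the same assumption of equal-width bins over the range of `x`; `binned_means` is a hypothetical helper name introduced here, not a ggplot2 or dplyr function.

```r
# Hypothetical helper: mean of x and y, plus row count, for each of
# n_bins equal-width bins over the observed range of data$x
binned_means <- function(data, n_bins = 10) {
  breaks <- seq(min(data$x), max(data$x), length.out = n_bins + 1)
  # include.lowest = TRUE keeps min(x) in the first bin
  bin <- cut(data$x, breaks = breaks, include.lowest = TRUE)
  data.frame(
    bin = levels(bin),
    x   = as.vector(tapply(data$x, bin, mean)),  # NA for an empty bin
    y   = as.vector(tapply(data$y, bin, mean)),
    n   = as.vector(table(bin))
  )
}

# Recreate the simulated data from the question
set.seed(1)
N <- 10^4
x <- runif(N)
y <- x + rnorm(N)
dt <- data.frame(x = x, y = y)

summary_dt <- binned_means(dt)
summary_dt
```

The resulting data frame can be passed to a layer such as `geom_point(data = summary_dt, aes(x, y))` in place of the inline `dplyr` pipeline, and the `n` column makes it easy to confirm that the bin counts add up to the number of rows.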