dww - 9 months ago 84

R Question

I'm trying to create a horizontal boxplot with a logarithmic axis using ggplot2, but the whiskers are the wrong length.

A minimal reproducible example:

Some data:

```
library(ggplot2)
library(reshape2)
set.seed(1234)
my.df <- data.frame(a = rnorm(1000,150,50), b = rnorm(1000,500,150))
my.df$a[which(my.df$a < 5)] <- 5
my.df$b[which(my.df$b < 5)] <- 5
```

If I plot this using base R `boxplot()`, I get the plot I expect:

```
boxplot(my.df, log="x", horizontal=T)
```

But with ggplot,

```
my.df.long <- melt(my.df, value.name = "vals")
ggplot(my.df.long, aes(x=variable, y=vals)) +
  geom_boxplot() +
  scale_y_log10(breaks=c(5,10,20,50,100,200,500,1000), limits=c(5,1000)) +
  theme_bw() + coord_flip()
```

I get this plot, in which the whiskers are the wrong length (see for example how there are many additional outliers below the whiskers and none above).

Note that, without log axes, ggplot draws the whiskers at the correct length:

```
ggplot(my.df.long, aes(x=variable, y=vals)) +
  geom_boxplot() +
  theme_bw() + coord_flip()
```

How do I produce a horizontal logarithmic boxplot using ggplot with the correct length whiskers? Preferably with the whiskers extending to 1.5 times the IQR.

As explained here, it is possible to use `coord_trans(y = "log10")` instead of `scale_y_log10`, but `coord_trans` cannot be combined with `coord_flip`.

Answer

The problem is that `scale_y_log10` transforms the data before the stats are calculated. This does not matter for the median and percentile points, because e.g. 10^log10(median) is still the median value, which will be plotted in the correct location. But it *does* matter for the whiskers, which are calculated using 1.5*IQR, because 10^(1.5*IQR(log10(x))) is not equal to 1.5*IQR(x). So the calculation fails for the whiskers.

This error becomes evident if we compare

```
boxplot.stats(my.df$b)$stats
# [1] 117.4978 407.3983 502.0460 601.2937 873.0992
10^boxplot.stats(log10(my.df$b))$stats
# [1] 231.1603 407.3983 502.0459 601.2935 975.1906
```

Here we see that the median and percentile points are identical (up to rounding), but the whisker ends (the first and last elements of the stats vector) differ.
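To make the mismatch more concrete, here is a small base-R sketch (my own illustration, not taken from the linked answer) that computes the 1.5*IQR fences both ways for `my.df$b`. The names `fences`, `raw_fences` and `log_fences` are made up for this example. It uses the `fivenum()` hinges, which is what `boxplot.stats()` uses for the box.

```r
# Rebuild the example data from the question
set.seed(1234)
my.df <- data.frame(a = rnorm(1000, 150, 50), b = rnorm(1000, 500, 150))
my.df$b[my.df$b < 5] <- 5

# Hypothetical helper: lower and upper 1.5 * IQR fences from the hinges
fences <- function(x) {
  h <- fivenum(x)[c(2, 4)]            # lower and upper hinge
  c(lower = h[1] - 1.5 * diff(h),
    upper = h[2] + 1.5 * diff(h))
}

# Fences computed on the raw data (what base boxplot() uses)
raw_fences <- fences(my.df$b)

# Fences computed on log10 data, then back-transformed
# (what happens when the scale transforms before the stats)
log_fences <- 10^fences(log10(my.df$b))

raw_fences
log_fences
```

For positive data both back-transformed fences sit above their raw-scale counterparts, so more low points and fewer high points get flagged as outliers. That matches the extra outliers below the whiskers in the question's plot.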

This detailed and useful answer by @eipi10 shows how to calculate the stats yourself and force ggplot to use them rather than the stats it calculates internally on the log-transformed data. With this approach, it is relatively simple to calculate the correct statistics on the original scale and plot those instead.

```
# Function to use boxplot.stats to set the box-and-whisker locations.
# x arrives already log10-transformed by scale_y_log10, so back-transform,
# compute the stats on the original scale, then log10 again for plotting.
mybxp = function(x) {
  bxp = log10(boxplot.stats(10^x)[["stats"]])
  names(bxp) = c("ymin","lower", "middle","upper","ymax")
  return(bxp)
}
# Function to use boxplot.stats for the outliers
myout = function(x) {
  data.frame(y=log10(boxplot.stats(10^x)[["out"]]))
}
ggplot(my.df.long, aes(x=variable, y=vals)) + theme_bw() + coord_flip() +
  scale_y_log10(breaks=c(5,10,20,50,100,200,500,1000), limits=c(5,1000)) +
  stat_summary(fun.data=mybxp, geom="boxplot") +
  stat_summary(fun.data=myout, geom="point")
```

This produces the correct plot.
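If you want to check the helper without drawing anything, something like the following should work. It is only a sketch: it assumes, as described above, that under `scale_y_log10` the summary function receives log10-transformed values, and it mimics that by hand. The names `from_helper` and `direct` are just for this example.

```r
# Same helper as in the answer's code
mybxp <- function(x) {
  bxp <- log10(boxplot.stats(10^x)[["stats"]])
  names(bxp) <- c("ymin", "lower", "middle", "upper", "ymax")
  bxp
}

# Rebuild the example data from the question
set.seed(1234)
my.df <- data.frame(a = rnorm(1000, 150, 50), b = rnorm(1000, 500, 150))
my.df$b[my.df$b < 5] <- 5

# Feed the helper log10 values, as the log scale would, and back-transform
from_helper <- 10^mybxp(log10(my.df$b))

# Stats computed directly on the original scale
direct <- boxplot.stats(my.df$b)$stats

# Should agree up to floating-point error
all.equal(unname(from_helper), direct)
```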

**A note on using coord_trans as an alternative approach:**

Using `coord_trans(y = "log10")` instead of `scale_y_log10` causes the stats to be calculated (correctly) on the untransformed data. *However*, `coord_trans` cannot be used in combination with `coord_flip`, so this does not solve the issue of creating horizontal boxplots with a log axis. The suggestion here to use `ggdraw(switch_axis_position())` from the cowplot package to flip the axes after using `coord_trans` did not work, and instead threw an error (cowplot v0.4.0 with ggplot2 v2.1.0):

```
Error in Ops.unit(gyl$x, grid::unit(0.5, "npc")) : both operands must be units
In addition: Warning message:
`axis.ticks.margin` is deprecated. Please set `margin` property of `axis.text` instead
```