Peer Wünsche - 1 year ago

R Question

Does anyone know how to create a graph like the one in the screenshot? I've tried to get a similar effect by adjusting alpha, but this makes the outliers almost invisible. I only know this type of graph from a program called FlowJo, where it is referred to as a "pseudocolored dot plot". I'm not sure whether this is an official term.

I'd like to do it specifically in ggplot2, as I need the faceting option. I attached another screenshot of one of my graphs. The vertical lines depict clusters of mutations at certain genomic regions. Some of these clusters are much denser than others. I'd like to illustrate this using the density colors.

The data is quite big and hard to simulate, but here's a try. It doesn't look like the actual data, but the data format is the same.

```
library(ggplot2)

chr <- c(rep(1:10,1000))
position <- runif(10000, min=0, max=5e8)
distance <- runif(10000, min=1, max=1e5)
log10dist <- log10(distance)
df1 <- data.frame(chr, position, distance, log10dist)

ggplot(df1, aes(position, log10dist)) +
  geom_point(shape=16, size=0.25, alpha=0.5, show.legend = FALSE) +
  facet_wrap(~chr, ncol = 5, nrow = 2, scales = "free_x")
```

Any help is highly appreciated.

Answer

```
library(ggplot2)
library(ggalt)    # provides stat_bkde2d()
library(viridis)  # provides scale_fill_viridis()

# Simulated data in the same format as in the question
chr <- c(rep(1:10,1000))
position <- runif(10000, min=0, max=5e8)
distance <- runif(10000, min=1, max=1e5)
log10dist <- log10(distance)
df1 <- data.frame(chr, position, distance, log10dist)

ggplot(df1, aes(position, log10dist)) +
  geom_point(shape=16, size=0.25, show.legend = FALSE) +
  # Binned 2D kernel density estimate, drawn as filled contour polygons
  stat_bkde2d(aes(fill=..level..), geom="polygon") +
  scale_fill_viridis() +
  facet_wrap(~chr, ncol = 5, nrow = 2, scales = "free_x")
```

In practice, I'd take the initial bandwidth guess and then work out an optimal bandwidth from there. Apart from taking the lazy approach of plotting all the points without filtering (`smoothScatter()` filters out everything but the outliers, based on `nrpoints`), this generates a "smoothed scatterplot" like the example you posted.

`smoothScatter()` uses different defaults, so it comes out a bit differently:

```
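par(mfrow=c(2, 5))  # 2 rows x 5 columns of base-graphics panels
for (chr_id in unique(df1$chr)) {
  # Subset one chromosome; inside dplyr::filter(), `chr==chr` would compare
  # the column with itself and keep every row
  plt_df <- df1[df1$chr == chr_id, ]
  smoothScatter(plt_df$position, plt_df$log10dist, colramp=viridis)
}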
```
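
As a sketch of where an initial bandwidth guess could come from: when no `bandwidth` is given, `smoothScatter()` appears to use the spread between the 5% and 95% quantiles of each variable, divided by 25. That is my reading of the base R source rather than documented behavior, and the helper name `smooth_scatter_bw` is made up for illustration.

```r
# Rule-of-thumb bandwidth: distance between the 5% and 95% quantiles of each
# variable, divided by 25 (assumed to mirror smoothScatter()'s default).
smooth_scatter_bw <- function(x, y) {
  xy <- cbind(x, y)
  bw <- diff(apply(xy, 2, stats::quantile,
                   probs = c(0.05, 0.95), na.rm = TRUE, names = FALSE)) / 25
  bw[bw == 0] <- 1  # guard against constant columns
  as.numeric(bw)    # c(bandwidth for x, bandwidth for y)
}

# Same simulated format as in the question
set.seed(42)
position  <- runif(10000, min = 0, max = 5e8)
log10dist <- log10(runif(10000, min = 1, max = 1e5))

bw <- smooth_scatter_bw(position, log10dist)
print(bw)
```

For data simulated like this, the result should be on the order of 1.8e7 for `position` and 0.05 for `log10dist`. The vector can be passed as `bandwidth=` to `stat_bkde2d()` and then tuned from there.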

`geom_hex()` is going to show the outliers, but not as distinct points:

```
ggplot(df1, aes(position, log10dist)) +
  geom_point(shape=16, size=0.25, show.legend = FALSE, color="red") +
  geom_hex() +  # hexagonal binning; needs the hexbin package installed
  scale_fill_viridis() +
  facet_wrap(~chr, ncol = 5, nrow = 2, scales = "free_x")
```

This:

```
ggplot(df1, aes(position, log10dist)) +
  geom_point(shape=16, size=0.25) +
  stat_bkde2d(bandwidth=c(18036446, 0.05014539),
              grid_size=c(128, 128), geom="polygon", aes(fill=..level..)) +
  scale_y_continuous(limits=c(3.5, 5.1)) +
  scale_fill_viridis() +
  facet_wrap(~chr, ncol = 5, nrow = 2, scales = "free_x") +
  theme_bw() +
  theme(panel.grid=element_blank())
```

gets you very close to the defaults `smoothScatter()` uses, but hackishly accomplishes most of what the `nrpoints` filtering code in `smoothScatter()` does solely by restricting the y-axis limits.
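
If restricting the axis feels too hackish, the filtering itself can be approximated directly. The following is a minimal sketch, assuming that `nrpoints` keeps the points lying in the lowest-density cells of the binned density estimate; that is my understanding of the base R source, not a documented guarantee. `sparse_points` is a hypothetical helper name, and the clustered test data is made up for illustration.

```r
library(KernSmooth)  # recommended package that ships with R; provides bkde2D()

# Return the rows of `xy` (two-column numeric matrix) with the lowest estimated
# density, looked up at the grid cell each point falls into.
# `sparse_points` is a hypothetical helper, not part of any package.
sparse_points <- function(xy, bandwidth, nrpoints = 100, gridsize = c(128L, 128L)) {
  est <- bkde2D(xy, bandwidth = bandwidth, gridsize = gridsize)
  nx <- length(est$x1)
  ny <- length(est$x2)
  # Map every observation to the index of its grid cell
  ix <- 1L + as.integer((nx - 1) * (xy[, 1] - est$x1[1]) / (est$x1[nx] - est$x1[1]))
  iy <- 1L + as.integer((ny - 1) * (xy[, 2] - est$x2[1]) / (est$x2[ny] - est$x2[1]))
  dens <- est$fhat[cbind(ix, iy)]
  # Keep the `nrpoints` observations with the smallest density values
  keep <- order(dens)[seq_len(min(nrow(xy), nrpoints))]
  xy[keep, , drop = FALSE]
}

# Made-up example: one dense cluster, from which only the sparse fringe is kept
set.seed(1)
xy <- cbind(rnorm(2000), rnorm(2000))
fringe <- sparse_points(xy, bandwidth = c(0.3, 0.3), nrpoints = 10)
print(dim(fringe))
```

The returned rows could then be converted with `as.data.frame()` and passed as `data=` to `geom_point()`, so that only the sparse points are drawn on top of the density polygons. For a faceted plot, the helper would have to be applied per chromosome.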