Jon Nagra - 2 months ago
R Question

Display maximum frequency point of each bin in ggplot2 stat_binhex

I have a data set in which a coordinate can be repeated several times.
I want to make a hexbin plot in which each bin displays the maximum number of times any single coordinate is repeated within that bin. I am using R and I would prefer to make it with ggplot2 so the graph is consistent with other graphs in the same report.

Minimal working example (the bins display the total count, not the maximum):

library(ggplot2)
library(data.table)
set.seed(41)
dat <- data.table(x = sample(seq(-10, 10, 1), 1000, replace = TRUE),
                  y = sample(seq(-10, 10, 1), 1000, replace = TRUE))
dat[, .N, by = c("x", "y")][, max(N)]
# No bin should be over 9

p1 <- ggplot(dat,aes(x=x,y=y))+stat_binhex(bins=10)
p1


I believe the approach should be related to the question "calculating percentages for bins in ggplot2 stat_binhex", but I am not sure how to adapt it to my case.
I am also concerned about the issue "ggplot2: ..count.. not working with stat_bin_hex anymore", as it may make my objective harder than I initially thought.

Is it possible to make the bins display the maximum number of times a point is repeated?

Answer

I think, after playing with the data a bit more, I now understand. Each bin in the plot represents multiple points, e.g., (9,9); (9,10); (10,9); (10,10) are all in a single bin in the plot. Note that this is the expected behavior: each bin shows the total count of all the points that fall inside it. It is unclear to me why you do not want it this way. Instead, you seem to want each bin to display the value for just one of those points (e.g. (9,9)).
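As a quick check of the pooling described above, here is a minimal sketch that rebuilds the data from the question and compares the per-hexagon totals with the per-coordinate counts. It assumes the hexbin package is installed (stat_binhex needs it); the name binned is only illustrative.

```r
library(ggplot2)
library(data.table)

# Same data as in the question
set.seed(41)
dat <- data.table(x = sample(seq(-10, 10, 1), 1000, replace = TRUE),
                  y = sample(seq(-10, 10, 1), 1000, replace = TRUE))

p1 <- ggplot(dat, aes(x = x, y = y)) + stat_binhex(bins = 10)

# layer_data() returns the data frame the stat computed for the layer;
# its count column is the number of rows that fell into each hexagon
binned <- layer_data(p1, 1)

# Largest hexagon total versus the largest count of any single coordinate
max(binned$count)
dat[, .N, by = .(x, y)][, max(N)]
```

If the first number is larger than the second, the hexagons are summing several distinct coordinates, which is what the default plot shows.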

I don't think you will be able to do this directly in a call to geom_hex or stat_binhex, as those functions are trying to faithfully represent all of the data. In fact, they do not necessarily expect discrete coordinates like yours at all; they work equally well on continuous data.

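For your purpose, if you want finer control, you may want to use geom_tile instead and count the values yourself, e.g. (using magrittr for the %$% and %>% pipes):

library(magrittr)  # provides %$% (exposition pipe) and %>%

# Count how often each (x, y) coordinate occurs; Freq holds the count
countedData <-
  dat %$%
  table(x,y) %>%
  as.data.frame()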

ggplot(countedData
       , aes(x = x
             , y = y
             , fill = Freq)) +
  geom_tile()


and you might play with the representation a bit from there, but it would at least display each of the separate coordinates more faithfully.
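One such variation, sketched here as an assumption rather than as part of the original answer, is to do the counting with data.table (already loaded in the question). Unlike table(), this keeps x and y numeric, so the axes stay continuous, and coordinates that never occur are simply left out instead of appearing with a count of 0. The name tileCounts is hypothetical.

```r
library(ggplot2)
library(data.table)

# Same data as in the question
set.seed(41)
dat <- data.table(x = sample(seq(-10, 10, 1), 1000, replace = TRUE),
                  y = sample(seq(-10, 10, 1), 1000, replace = TRUE))

# One row per distinct coordinate; N is the number of repeats
tileCounts <- dat[, .N, by = .(x, y)]

# One tile per coordinate, filled by how often it occurs
ggplot(tileCounts, aes(x = x, y = y, fill = N)) +
  geom_tile()
```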

Alternatively, you could filter your raw data to only include the points that are the maximum within a bin. That would require you to match the binning, but could at least be an option.
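A related route that avoids matching the binning by hand is sketched below. It is not the approach described above but a swapped-in technique: count each coordinate first, then let ggplot2's stat_summary_hex take the maximum of those counts within each hexagon. It assumes the hexbin package is installed; counts and p2 are illustrative names. Because the distinct coordinates span the same x and y range as the raw data, bins = 10 should give the same hexagons as the stat_binhex call in the question.

```r
library(ggplot2)
library(data.table)

# Same data as in the question
set.seed(41)
dat <- data.table(x = sample(seq(-10, 10, 1), 1000, replace = TRUE),
                  y = sample(seq(-10, 10, 1), 1000, replace = TRUE))

# One row per distinct coordinate; N is the number of repeats
counts <- dat[, .N, by = .(x, y)]

# stat_summary_hex() bins x and y into hexagons and applies `fun` to the
# z values inside each one; here that is the largest N in the hexagon
p2 <- ggplot(counts, aes(x = x, y = y, z = N)) +
  stat_summary_hex(fun = max, bins = 10) +
  labs(fill = "max count")
p2
```

With this, the fill of a hexagon is the count of its most frequent coordinate, so no hexagon can exceed the overall maximum computed with dat[,.N,by=c("x","y")][,max(N)].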