Jon Nagra - 2 months ago 32

R Question

I have a data set in which a coordinate can be repeated several times.

I want to make a hexbin plot in which each bin displays the maximum number of times any single coordinate is repeated within that bin. I am using R, and I would prefer to make it with ggplot2 so the graph is consistent with the other graphs in the same report.

Minimum working example (the bins display the count not the max):

```
library(ggplot2)
library(data.table)

set.seed(41)
dat <- data.table(x = sample(seq(-10, 10, 1), 1000, replace = TRUE),
                  y = sample(seq(-10, 10, 1), 1000, replace = TRUE))
dat[, .N, by = c("x", "y")][, max(N)]
# No bin should be over 9
p1 <- ggplot(dat, aes(x = x, y = y)) + stat_binhex(bins = 10)
p1
```

I believe the approach should be related to the question "calculating percentages for bins in ggplot2 stat_binhex", but I am not sure how to adapt it to my case.

I am also concerned about the issue "ggplot2: ..count.. not working with stat_bin_hex anymore", as it could make my objective harder than I initially thought.

Is it possible to make the bins display the maximum number of times a point is repeated?

Answer

After playing with the data a bit more, I think I now understand. Each bin in the plot represents multiple points, e.g., (9,9), (9,10), (10,9), and (10,10) all fall in a single bin. I must caution that this is the *expected* behavior, and it is unclear to me why you do not want it that way. Instead, you seem to want each bin to display the value of just one of those points (e.g., the count for (9,9)).
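To see this pooling in numbers, here is a minimal sketch that compares the largest per-coordinate count with the largest per-hexagon count from the question's plot. It reuses the question's data and assumes the `hexbin` package is installed (ggplot2 needs it for hex binning); `maxPerCoordinate` and `binCounts` are names made up for illustration.

```r
library(ggplot2)
library(data.table)

# Same simulated data as in the question
set.seed(41)
dat <- data.table(x = sample(seq(-10, 10, 1), 1000, replace = TRUE),
                  y = sample(seq(-10, 10, 1), 1000, replace = TRUE))

# Largest number of repeats of any single (x, y) coordinate
maxPerCoordinate <- dat[, .N, by = c("x", "y")][, max(N)]

# Counts that stat_binhex computes for each hexagon
p1 <- ggplot(dat, aes(x = x, y = y)) + stat_binhex(bins = 10)
binCounts <- layer_data(p1)$count

# A hexagon pools several coordinates, so its count is at least
# as large as the count of any single coordinate inside it
maxPerCoordinate
max(binCounts)
```

The second number should never be smaller than the first, because every coordinate falls entirely inside one hexagon.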

I don't think you will be able to do this directly in a call to `geom_hex` or `stat_binhex`, as those functions try to faithfully represent all of the data. In fact, they do not expect discrete coordinates like yours at all; they work equally well on continuous data.

For your purpose, if you want finer control, you may want to use `geom_tile` instead and count the values yourself, e.g. (using `dplyr` and `magrittr`):

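```
library(magrittr)  # provides %$% and %>%
library(ggplot2)

# Count how often each (x, y) coordinate occurs
countedData <-
  dat %$%
  table(x, y) %>%
  as.data.frame()

# One tile per coordinate, filled by its count
ggplot(countedData,
       aes(x = x, y = y, fill = Freq)) +
  geom_tile()
```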

and you might play with the representation a bit from there, but it would at least display each of the separate coordinates more faithfully.

Alternatively, you could filter your raw data to only include the points that *are* the maximum within a bin. That would require you to match the binning, but could at least be an option.
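One way to sketch that last idea without matching the binning by hand is to count each coordinate first and then let ggplot2 take the maximum inside each hexagon with `stat_summary_hex`. This is an illustration, not a tested solution for your report: it assumes the `hexbin` package is installed, and `counts` and `p2` are names invented here. The hexagons may also not line up exactly with those of `p1`, since the binning is computed from the counted data.

```r
library(ggplot2)
library(data.table)

# Same simulated data as in the question
set.seed(41)
dat <- data.table(x = sample(seq(-10, 10, 1), 1000, replace = TRUE),
                  y = sample(seq(-10, 10, 1), 1000, replace = TRUE))

# One row per distinct coordinate, with its number of repeats in N
counts <- dat[, .N, by = c("x", "y")]

# Map N to z and summarise each hexagon with max instead of the default mean
p2 <- ggplot(counts, aes(x = x, y = y, z = N)) +
  stat_summary_hex(fun = max, bins = 10)
p2
```

With this approach the fill of each hexagon is the largest per-coordinate count inside it, so no bin should exceed the value returned by `dat[, .N, by = c("x", "y")][, max(N)]`.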