stats134711 - 1 year ago 62
R Question

# Obtain endpoints from interval that is factor variable

Setup
I sample `1,000,000` observations from the following normal mixture model and bin them so that each of the `10,000` bins has an equal number of observations (i.e. `100`). This creates a factor with one level per bin, of the form `(a,b]`, where `a` and `b` are numbers.

``````
#Random sample
set.seed(1234)
X = ks::rnorm.mixt(n=1000000,mus=c(0.2,0.8),sigmas=c(0.04,0.01),props=c(0.95,0.05))

#Bins based on random sample with ~100 observations in each bins
bins = ggplot2::cut_number(X,10000)

dat = data.frame(X,bins)
``````

Question
I would like to extract the numbers `a` and `b` from the factor level `(a,b]`. Here is what the bins look like:

``````
> head(table(bins))
bins
[0.00501617,0.0518875]  (0.0518875,0.0594831]  (0.0594831,0.0640679]
                   100                    100                    100
 (0.0640679,0.0670062]  (0.0670062,0.0694194]  (0.0694194,0.0717924]
                   100                    100                    100
> tail(table(bins),20)
bins
(0.817766,0.818032]   (0.818032,0.8183]   (0.8183,0.818544] (0.818544,0.818879]
                100                 100                 100                 100
(0.818879,0.819112] (0.819112,0.819394] (0.819394,0.819664] (0.819664,0.819979]
                100                 100                 100                 100
(0.819979,0.820328] (0.820328,0.820727] (0.820727,0.821118]  (0.821118,0.82158]
                100                 100                 100                 100
 (0.82158,0.822109] (0.822109,0.822646] (0.822646,0.823253]  (0.823253,0.82408]
                100                 100                 100                 100
 (0.82408,0.825026] (0.825026,0.826417] (0.826417,0.828651]  (0.828651,0.84424]
                100                 100                 100                 100
``````

As you can see, the numbers in the factor levels don't always have the same number of digits, and they may have leading zeros after the decimal point (e.g. `(0.0518875,0.0594831]`).

I initially tried to extract just the numeric portion using

``````
endpts = na.omit(as.numeric(unlist(strsplit(as.character(unlist(bins)), "[^0-9]+"))))
``````

For the above bin (`(0.0518875,0.0594831]`), the fractional parts come out as `518875` and `594831`, but because the leading zeros are gone, they could be mapped to several values (e.g. `0.518875 0.594831`). Furthermore, there are bins in which the two numbers have different numbers of digits (e.g. `(0.818032,0.8183]`). This lack of uniformity in the output is giving me problems when trying to recover the endpoints. Ultimately, I'd like to get the left and right endpoints. Any suggestions?
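To make the problem concrete, here is a small sketch on a single hard-coded label (the names `lab`, `parts`, and `nums` are illustrative and not part of the code above). Because `.` is not a digit, the pattern `[^0-9]+` splits on the decimal point as well as on the brackets and the comma:

```r
# One bin label, hard-coded for illustration
lab <- "(0.0518875,0.0594831]"

# Split on every run of non-digit characters; the decimal point counts as one
parts <- strsplit(lab, "[^0-9]+")[[1]]
print(parts)

# as.numeric() turns the empty leading piece into NA and drops leading zeros,
# so the integer and fractional parts come back as separate numbers
nums <- as.numeric(parts)
print(nums)
```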

EDIT I also looked into the code for `ggplot2::cut_number`, which uses the `cut` function. The default value of the digits argument in `cut` is `dig.lab=3`, but this doesn't seem to be reflected in the above output.
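A possible explanation, sketched here with base R's `cut` on invented breaks: as far as I can tell, `dig.lab` acts as a minimum, and `cut` uses more digits when that is needed to keep adjacent break labels distinct. With 10,000 closely spaced breaks, 3 significant digits would not be enough. The values and breaks below are made up for illustration and are not taken from the data above.

```r
# Made-up values and breaks, for illustration only
x <- c(0.05, 0.12345, 0.5)

# Well-separated breaks: labels use the default 3 significant digits
far <- cut(x, breaks = c(0, 0.123456, 1))
print(levels(far))

# Two breaks that agree to 3 significant digits: labels get more digits
near <- cut(x, breaks = c(0, 0.1234, 0.1235, 1))
print(levels(near))
```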

Something along the lines of this lightly tested approach:

``````
unique( as.numeric(  unlist(
    strsplit( gsub( "[][(]" , "", levels(bins)[1:5] ) , ","))))
``````

I have learned to "read nested R code from the inside out". This (1) removes the flanking "(", "[" and "]" using a character class pattern, then (2) splits on commas, (3) "vectorizes" the list structure with `unlist`, (4) converts to numeric, and finally (5) removes duplicates. Here it is again with line breaks for formatting:

``````
unique(                                          # (5)
  as.numeric(                                    # (4)
    unlist(                                      # (3)
      strsplit(                                  # (2)
        gsub("[][(]", "", levels(bins)[1:5]),    # (1)
        ","
      )
    )
  )
)
``````

This was tested on your example. Using just the first 5 levels, it produces:

``````
unique( as.numeric(  unlist( strsplit( gsub( "[][(]" , "", levels(bins)[1:5] ) , ","))))
[1] 0.00501617 0.05188750 0.05948310 0.06406790 0.06700620 0.06941940
``````
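If one pair of endpoints per bin is wanted rather than the unique break values, the same idea can be kept in matrix form. This is a minimal sketch, assuming every label contains exactly one comma. The names `labs`, `pieces`, and `ends` are illustrative, and the labels are hard-coded from the question's output as a stand-in for `levels(bins)`.

```r
# A few labels copied from the question; in practice use levels(bins)
labs <- c("[0.00501617,0.0518875]", "(0.0518875,0.0594831]", "(0.818032,0.8183]")

# Strip the brackets, then split each label at the comma
pieces <- strsplit(gsub("[][(]", "", labs), ",")

# One row per bin: column 1 is the left endpoint, column 2 the right
ends <- matrix(as.numeric(unlist(pieces)), ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("left", "right")))
print(ends)
```

With `labs` set to `levels(bins)`, indexing `ends[as.integer(bins), ]` should then give the endpoints for every observation, since a factor's integer codes index its levels.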

I put the word "vectorizes" in quotes because that is not really what the word means in R terminology, where it refers to operations that return a vector of the same length as their input.

Here are the results of my suggestion to keep the decimal point (period) among the characters not used as splitting criteria, and a comparison with what my code would have delivered. You were not clear about whether you wanted just the unique values or the values for each item:

``````
endpts = na.omit( as.numeric( unlist( strsplit( as.character( unlist(bins)),"[^0-9.]+"))))