stats134711 stats134711 - 23 days ago 5
R Question

Obtain endpoints from interval that is factor variable

Setup
I sample 1,000,000 observations from the following normal mixture model and bin them so that each of the 10,000 bins holds an equal number of observations (i.e. 100). This creates a factor whose levels have the form (a,b], where a and b are numbers.

#Random sample
set.seed(1234)
X = ks::rnorm.mixt(n=1000000,mus=c(0.2,0.8),sigmas=c(0.04,0.01),props=c(0.95,0.05))

#Bins based on random sample with ~100 observations in each bin
bins = ggplot2::cut_number(X,10000)

dat = data.frame(X,bins)


Question
I would like to extract the numbers a and b from the factor levels (a,b]. Here is what the bins look like:

> head(table(bins))
bins
[0.00501617,0.0518875] (0.0518875,0.0594831] (0.0594831,0.0640679]
100 100 100
(0.0640679,0.0670062] (0.0670062,0.0694194] (0.0694194,0.0717924]
100 100 100
> tail(table(bins),20)
bins
(0.817766,0.818032] (0.818032,0.8183] (0.8183,0.818544] (0.818544,0.818879]
100 100 100 100
(0.818879,0.819112] (0.819112,0.819394] (0.819394,0.819664] (0.819664,0.819979]
100 100 100 100
(0.819979,0.820328] (0.820328,0.820727] (0.820727,0.821118] (0.821118,0.82158]
100 100 100 100
(0.82158,0.822109] (0.822109,0.822646] (0.822646,0.823253] (0.823253,0.82408]
100 100 100 100
(0.82408,0.825026] (0.825026,0.826417] (0.826417,0.828651] (0.828651,0.84424]
100 100 100 100


As you can see, the numbers in the factor levels don't always have the same number of digits, and they may have leading zeros after the decimal point (e.g. (0.0518875,0.0594831]).

I initially tried to extract just the numeric portion using

endpts=na.omit(as.numeric(unlist(strsplit(as.character(unlist(bins)),"[^0-9]+"))))


For the above bin, (0.0518875,0.0594831], this procedure would output 518875 594831, but because the decimal point and the leading zeros are gone, that result could map back to several values (e.g. 0.518875 0.594831). Furthermore, there are bins in which the two numbers have different numbers of digits (e.g. (0.818032,0.8183]). This lack of uniformity in the output is giving me problems when trying to get the endpoints. Ultimately, I'd like to get the left and right endpoints. Any suggestions?

EDIT: I also looked into the code for ggplot2::cut_number, which uses the cut function. The default in cut for the number of digits is dig.lab=3, but this doesn't seem to be reflected in the above output.
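A note on the dig.lab observation: as far as I can tell from the source of base R's cut.default, dig.lab acts as a starting value rather than a fixed width. When labels are generated automatically, the number of significant digits is increased (up to 12) until adjacent break labels differ, which would explain the longer labels when 10,000 breaks sit close together. A small sketch with made-up values and breaks (the names wide and tight are illustrative only):

```r
# Well-separated breaks: three significant digits already give distinct labels
wide <- cut(c(1.5, 2.5), breaks = c(1.23456, 2.34567, 3.45678))
levels(wide)

# Closely spaced breaks: three digits would produce identical labels,
# so cut() keeps adding digits until neighbouring labels differ
tight <- cut(c(0.1234515, 0.1234525),
             breaks = c(0.123451, 0.123452, 0.123453))
levels(tight)
```

Comparing the two sets of levels shows the label width growing with how close the breaks are, even though dig.lab was left at its default in both calls.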

42- 42-
Answer

Something along the lines of this lightly tested approach:

unique( as.numeric(  unlist( 
                 strsplit( gsub( "[][(]" , "", levels(bins)[1:5] ) , ","))))

I have learned to "read nested R code from the inside out". This (1) removes the flanking "(", "[" and "]" using a character-class pattern, then (2) splits on commas, (3) "vectorizes" the list structure with unlist, (4) converts to numeric and finally (5) removes duplicates. Here it is again with line breaks for formatting:

unique(                                         # (5)
  as.numeric(                                   # (4)
    unlist(                                     # (3)
      strsplit(                                 # (2)
        gsub("[][(]", "", levels(bins)[1:5]),   # (1)
        ",")
    )))

This was tested on your example; for brevity, here is the output using only the first 5 levels:

unique( as.numeric(  unlist( strsplit( gsub( "[][(]" , "", levels(bins)[1:5] ) , ","))))
[1] 0.00501617 0.05188750 0.05948310 0.06406790 0.06700620 0.06941940

I put the word "vectorizes" in quotes because that is not really what the word means in R terminology, where it refers to operations that return a vector of the same length as their input.
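If what you want is one left and one right endpoint per level rather than the pooled unique values, the same strip-and-split steps can be kept and the two pieces read off separately. A minimal sketch on three of the labels shown in the question; the names lev, parts, left, right and endpoints are mine and not part of the code above:

```r
# Three labels copied from the question's head(table(bins)) output
lev <- c("[0.00501617,0.0518875]",
         "(0.0518875,0.0594831]",
         "(0.0594831,0.0640679]")

# (1) strip the brackets, (2) split each label on the comma
parts <- strsplit(gsub("[][(]", "", lev), ",", fixed = TRUE)

# Read the first and second piece of every label separately
left  <- as.numeric(vapply(parts, `[`, character(1), 1))
right <- as.numeric(vapply(parts, `[`, character(1), 2))

# One row per level, with numeric left and right endpoints
endpoints <- data.frame(level = lev, left = left, right = right)
endpoints
```

With the real factor, levels(bins) would take the place of lev, and endpoints[as.integer(bins), ] should then give the endpoints for every observation, since a factor's integer codes index its levels.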

Here are the results of my suggestion to keep the decimal point (period) among the characters not used as splitting criteria, along with a comparison against what my code would have delivered. You were not clear about whether you wanted just the unique values or the values for each item:

endpts= na.omit( as.numeric( unlist( strsplit( as.character( unlist(bins)),"[^0-9.]+"))))

 head(endpts)
#[1] 0.216698 0.216709 0.243665 0.243682 0.201100 0.201114
 end2 <- unique( as.numeric(  unlist( strsplit( gsub( "[][(]" , "", levels(bins) ) , ","))))
head(end2)
#[1] 0.00501617 0.05188750 0.05948310 0.06406790 0.06700620 0.06941940
 length(endpts)
#[1] 2000000
 length(end2)
#[1] 10001
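The two lengths are consistent with each other: endpts holds two values per observation (2 x 1,000,000), while end2 holds one value per distinct break (10,000 bins share 10,001 breaks). If the rounding in the labels is a concern, another option is to skip the parsing and compute the breaks yourself. The sketch below assumes, without my having confirmed it here, that cut_number places its breaks at equally spaced sample quantiles. It uses only base R on a small stand-in sample, and the names x, n_bins, brks, b, idx, left and right are illustrative:

```r
set.seed(1234)
x      <- rnorm(1000)   # stand-in sample, not the mixture from the question
n_bins <- 10

# Assumed break rule: equally spaced sample quantiles (n_bins + 1 breaks)
brks <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1), names = FALSE)

# Bin with base cut(); include.lowest keeps the minimum in the first bin
b <- cut(x, breaks = brks, include.lowest = TRUE)

# Level i spans brks[i] to brks[i + 1], so the integer codes index the breaks
idx   <- as.integer(b)
left  <- brks[idx]
right <- brks[idx + 1]

head(data.frame(x, left, right))
```

The endpoints obtained this way are at full numeric precision, because they never pass through the formatted labels.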