qed qed - 1 month ago 13
R Question

Inconsistency in the binning of the cut function in RStudio

Here are some experiments in RStudio with an RMarkdown file:

---
title: "test"
author: "qed"
date: "10/10/2016"
output: html_document
---


```{r}
library(ISLR)
set.seed(3)
Wage$age = jitter(Wage$age)
get_breaks = function(cutted) {
labels = levels(cutted)
lower = as.numeric(sub("\\((.+),.*", "\\1", labels))
upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", labels[length(labels)]))
c(lower, upper)
}
age_groups = cut(Wage$age, 4)
age_groups1 = cut(Wage$age, get_breaks(age_groups))
all(levels(age_groups) == levels(age_groups1))
idx = which(age_groups != age_groups1)
idx # not empty!
```


If you knit it, you will see that idx is not empty.

RStudio version 0.99.903

R version 3.3.1

Essentially, I tried to extract the breaks from the output of the cut function and apply them explicitly. I expected the new output to be exactly the same as the old one, but it is not.

Is this a bug? How can I fix it?




Edit



Actually, after repeatedly trying this in the R console, the same problem turns out to exist there, too, so it's not an RStudio bug. The even more troubling thing is that the behavior doesn't seem deterministic in spite of set.seed.

Answer

You assume the two ways of cutting the vector are equivalent, but they are not. This issue is unrelated to RStudio or knitr. It is easy to reproduce the problem in a normal R session:

problem = function() {
  library(ISLR)
  set.seed(NULL)  # reinitialize random seed
  Wage$age.jittered = jitter(Wage$age)
  get_breaks = function(cutted) {
    labels = levels(cutted)
    lower = as.numeric(sub("\\((.+),.*", "\\1", labels))
    upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", labels[length(labels)]))
    c(lower, upper)
  }
  age_groups = cut(Wage$age.jittered, 4)
  age_groups1 = cut(Wage$age.jittered, get_breaks(age_groups))
  all(levels(age_groups) == levels(age_groups1))
  idx = which(age_groups != age_groups1)
  length(idx)
}

res = replicate(1000, problem())
barplot(table(res))

[Figure: bar plot of the frequency of length(idx)]

You'd expect the bar plot to have a nonzero frequency only at 0, but length(idx) is nonzero in quite a few of the runs.

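Back to your question: the labels that you see are not necessarily the exact endpoints. They are formatted to a limited number of significant digits (3 by default), so breaks parsed back out of the labels can differ slightly from the breaks that cut actually used, and a jittered value that falls between a true break and its rounded version lands in a different bin. See the argument dig.lab in the help page ?cut.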
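If you need the same bins again, one option is to recompute the unrounded breaks instead of parsing the labels. The following is a minimal sketch, not the only way to do it. It assumes the behavior described in ?cut for a single number of breaks: the range is split into equal-length pieces and the outer limits are moved outward by 0.1% of the range. The helper exact_breaks is a hypothetical name, and x is a synthetic stand-in for the jittered ages so that the example does not depend on ISLR.

```r
# Synthetic stand-in for the jittered ages (assumption: not the ISLR Wage data)
set.seed(3)
x = jitter(sample(18:80, 3000, replace = TRUE))

# The labels only show the breaks to dig.lab (default 3) significant digits
groups = cut(x, 4)
levels(groups)

# Hypothetical helper: recompute the unrounded breaks as described in ?cut,
# i.e. n equal-length pieces with the outer limits moved out by 0.1% of the range
exact_breaks = function(x, n) {
  rx = range(x, na.rm = TRUE)
  dx = diff(rx)
  nb = as.integer(n + 1)
  b = seq.int(rx[1], rx[2], length.out = nb)
  b[c(1, nb)] = c(rx[1] - dx / 1000, rx[2] + dx / 1000)
  b
}

b = exact_breaks(x, 4)
print(b, digits = 15)  # compare with the rounded values shown in the labels

# Cutting with the unrounded breaks should reproduce the original grouping;
# integer codes are compared so the check does not depend on the label text
groups_exact = cut(x, b)
n_mismatch = sum(as.integer(groups) != as.integer(groups_exact))
n_mismatch
```

Raising dig.lab, as in cut(x, 4, dig.lab = 15), makes the labels more precise, but parsing them still goes through text, so keeping the numeric breaks around is the more robust choice.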
