agenis - 4 months ago 10

R Question

I want to find the index of the outlier spotted by the `grubbs.test` function from the `outliers` package. I wrote this function:

`where = function(x) which(x==as.numeric(strsplit(grubbs.test(x)$alternative," ")[[1]][3]))`

It works by retrieving the number in the text displayed by the grubbs result. It's kind of a hack but it works well, let's say, for round numbers:

`df=c(0, 3, rnorm(10))`

`where(df) #[1] 2`

With decimal numbers, the digits in the text don't always match the actual number:

`df=c(0, sqrt(10), rnorm(10))`

`where(df) # integer(0)`

Does anyone have an idea how to fix this problem, or know another way to find the index of the biggest outlier from the Grubbs test? I'm trying to use this in a loop.

Answer

The problem is that `strsplit` returns strings instead of numbers. In your second example I get:

```
[1] "highest" "value" "3.16227766016838" "is" "an" "outlier"
```

but the third element is only the printed (character) version of the number, rounded to 15 significant digits. The actual value that `grubbs.test` works with has more decimal places than the text shows, which is why the `==` operator does not 'catch' it as an equality. This can be seen clearly here:

```
> a <- sqrt(10)
> a == as.numeric(as.character(a))
[1] FALSE
```
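To make the mismatch visible, here is a small sketch in base R (no `outliers` package needed) that prints both values with 17 significant digits. The variable names `s` and `b` are just illustrative.

```r
a <- sqrt(10)
s <- as.character(a)   # the text form, as it ends up in the test's message
b <- as.numeric(s)     # what you get back after parsing that text

print(sprintf("%.17g", a))   # the original double, printed in full
print(sprintf("%.17g", b))   # the value recovered from the string
print(a == b)                # exact comparison fails
print(abs(a - b) < sqrt(.Machine$double.eps))   # but the gap is tiny
```

The two numbers differ only in the last few digits, so an exact `==` fails while a comparison with a small tolerance succeeds.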

Is there a solution to this?

**YES** there is.

To tackle this problem, just use the `almost.equal` function, which I took the liberty of copying from this R-help post:

```
almost.equal <- function (x, y, tolerance=.Machine$double.eps^0.5,
                          na.value=TRUE)
{
    answer <- rep(na.value, length(x))
    test <- !is.na(x)
    answer[test] <- abs(x[test] - y) < tolerance
    answer
}
```

The above function works like an element-wise version of `all.equal`: it checks for 'approximate' equality (here, within a small absolute tolerance), so it captures cases like yours.
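As a rough illustration of the difference, the sketch below compares an exact match, an element-wise tolerance check, and `all.equal` on a single pair. The helper `is_close` is a hypothetical one-line stand-in for the tolerance check, not a function from any package.

```r
# Hypothetical helper: element-wise "equal within a tolerance"
is_close <- function(x, y, tol = sqrt(.Machine$double.eps)) abs(x - y) < tol

x <- c(0, sqrt(10), 1.5)
y <- as.numeric(as.character(sqrt(10)))   # value parsed back from text

exact  <- which(x == y)                   # exact match finds nothing
approx <- which(is_close(x, y))           # tolerance match finds position 2
single <- isTRUE(all.equal(x[2], y))      # all.equal gives one answer per call

print(exact)
print(approx)
print(single)
```

`all.equal` summarises a whole comparison in a single result, which is why an element-wise check is needed before `which` can return an index.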

Let's convert your function to:

```
where = function(x) {
    which(almost.equal(x, as.numeric(strsplit(grubbs.test(x)$alternative," ")[[1]][3])))
}
```

And let's check it now:

```
> df=c(0, 3, rnorm(10))
> where(df)
[1] 2
```

And:

```
> df=c(0, sqrt(10), rnorm(10))
> where(df)
[1] 2
```

And you have a solution that works well with decimal numbers too!!
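As for another way to find the index: a possible alternative is to skip the text parsing altogether. This sketch assumes that `grubbs.test`, with its default arguments (`type = 10`, `opposite = FALSE`), flags the single value furthest from the sample mean; that is my understanding of the default behaviour, so please check it against `grubbs.test` on your own data. The name `where_extreme` is hypothetical.

```r
# Assumption: the default Grubbs test flags the value furthest from the mean.
# Returns the index of that value (the first one if there are ties).
where_extreme <- function(x) {
  which.max(abs(x - mean(x)))
}

x <- c(0, sqrt(10), 0.4, -0.7, 1.1, -0.3)
idx <- where_extreme(x)
print(idx)
```

Since this never converts the number to text and back, it is not affected by rounding, and it may be simpler to use inside a loop.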