aplavin - 7 months ago 24

R Question

I have a fairly large data frame, about 10 million rows. It has columns `x` and `y`, and I want to compute

`hypot <- function(x) {sqrt(x[1]^2 + x[2]^2)}`

for each row. Using `apply` for this takes too long for me, so I've tried different things:

- compiling the `hypot` function reduces the time by about 10%
- using functions from `plyr` greatly increases the running time.

What's the fastest way to do this thing?

Answer

What about `with(my_data,sqrt(x^2+y^2))`?

```
set.seed(101)
d <- data.frame(x=runif(1e5),y=runif(1e5))
library(rbenchmark)
```

Two different per-row functions, the second taking advantage of vectorization within each row:

```
hypot <- function(x) sqrt(x[1]^2+x[2]^2)
hypot2 <- function(x) sqrt(sum(x^2))
```
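As a quick sanity check (a minimal sketch, not part of the original answer), both functions can be applied to a tiny data frame with known results. The data frame `small` is made up for illustration:

```r
# Definitions repeated from the answer so this snippet runs on its own.
hypot  <- function(x) sqrt(x[1]^2 + x[2]^2)
hypot2 <- function(x) sqrt(sum(x^2))

# Hypothetical two-row data frame: the 3-4-5 and 5-12-13 right triangles.
small <- data.frame(x = c(3, 5), y = c(4, 12))

# apply() with MARGIN = 1 passes each row to the function as a numeric vector.
res1 <- apply(small, 1, hypot)
res2 <- apply(small, 1, hypot2)

# Both per-row functions should give the same hypotenuse lengths.
stopifnot(isTRUE(all.equal(as.numeric(res1), as.numeric(res2))))
```
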

Try compiling these too, then benchmark all the variants:

```
library(compiler)
chypot <- cmpfun(hypot)
chypot2 <- cmpfun(hypot2)
benchmark(sqrt(d[,1]^2+d[,2]^2),
          with(d,sqrt(x^2+y^2)),
          apply(d,1,hypot),
          apply(d,1,hypot2),
          apply(d,1,chypot),
          apply(d,1,chypot2),
          replications=50)
```

Results:

```
                       test replications elapsed relative user.self sys.self
5       apply(d, 1, chypot)           50  61.147  244.588    60.480    0.172
6      apply(d, 1, chypot2)           50  33.971  135.884    33.658    0.172
3        apply(d, 1, hypot)           50  63.920  255.680    63.308    0.364
4       apply(d, 1, hypot2)           50  36.657  146.628    36.218    0.260
1 sqrt(d[, 1]^2 + d[, 2]^2)           50   0.265    1.060     0.124    0.144
2  with(d, sqrt(x^2 + y^2))           50   0.250    1.000     0.100    0.144
```

As expected, the `with()` solution and the column-indexing solution à la Tyler Rinker are essentially identical; `hypot2` is nearly twice as fast as the original `hypot` (but still about 150 times slower than the vectorized solutions). As already pointed out by the OP, compilation doesn't help very much.
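To confirm that the vectorized form returns the same values as the row-wise `apply()` call, a small check along these lines should work. This is only a sketch: `d_small` is a made-up data frame, and the `rowSums()` variant is an additional base-R option that was not benchmarked above, so no speed claim is made for it.

```r
set.seed(101)
# Hypothetical small data frame, so the slow row-wise version finishes quickly.
d_small <- data.frame(x = runif(1000), y = runif(1000))

# Row-wise: one R function call per row.
by_row <- apply(d_small, 1, function(r) sqrt(sum(r^2)))

# Vectorized: whole columns are squared, added and square-rooted at once.
vectorized <- with(d_small, sqrt(x^2 + y^2))

# Assumption (not in the answer): square every column, then sum across each row.
via_rowsums <- sqrt(rowSums(d_small^2))

# All three approaches should agree up to floating-point tolerance.
stopifnot(isTRUE(all.equal(as.numeric(by_row), vectorized)))
stopifnot(isTRUE(all.equal(as.numeric(via_rowsums), vectorized)))
```

The `rowSums()` form may be convenient when there are more than two columns, since the expression does not need to name each column.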