dbarbosa - 7 months ago 28

R Question

First of all, I am new to R (I started yesterday).

I have two groups of points,

`data`

`centers`

`n`

`K`

`n = 3823`

`K = 10`

`i`

`j`

My idea is simple: for each

`i`

`dist[j]`

`i`

`j`

`which.min(dist)`

Each point is an array of

`64`

`> dim(data)`

[1] 3823 64

> dim(centers)

[1] 10 64

I have tried with

`for (i in 1:n) {`

for (j in 1:K) {

d[j] <- sqrt(sum((centers[j,] - data[i,])^2))

}

S[i] <- which.min(d)

}

which is extremely slow (with

`n = 200`

`distance <- function(point, group) {`

return(dist(t(array(c(point, t(group)), dim=c(ncol(group), 1+nrow(group)))))[1:nrow(group)])

}

for (i in 1:n) {

d <- distance(data[i,], centers)

which.min(d)

}

Even if it does a lot of computation that I don't use (because

`dist(m)`

`m`

`distance`

`distance <- function(point, group) {`

return (dist(rbind(point,group))[1:nrow(group)])

}

but this seems to be twice slower. I also tried to use

`dist`

I don't know what to do now. It seems like I am doing something very wrong. Any idea on how to do this more efficiently?

ps: I need this to implement k-means by hand (and I need to do it, it is part of an assignment). I believe I will only need Euclidian distance, but I am not yet sure, so I will prefer to have some code where the distance computation can be replaced easily.

`stats::kmeans`

Answer

Rather than iterating across data points, you can just condense that to a matrix operation, meaning you only have to iterate across `K`

.

```
# Generate some fake data.
n <- 3823
K <- 10
d <- 64
x <- matrix(rnorm(n * d), ncol = n)
centers <- matrix(rnorm(K * d), ncol = K)
system.time(
dists <- apply(centers, 2, function(center) {
colSums((x - center)^2)
})
)
```

Runs in:

```
utilisateur système écoulé
0.100 0.008 0.108
```

on my laptop.