dbarbosa - 1 year ago 71
R Question

# Calculating all distances between one point and a group of points efficiently in R

First of all, I am new to R (I started yesterday).

I have two groups of points,

`data`
and
`centers`
, the first one of size
`n`
and the second of size
`K`
(for instance,
`n = 3823`
and
`K = 10`
), and for each
`i`
in the first set, I need to find
`j`
in the second with the minimum distance.

My idea is simple: for each
`i`
, let
`dist[j]`
be the distance between
`i`
and
`j`
, I only need to use
`which.min(dist)`
to find what I am looking for.

Each point is an array of
`64`
doubles, so

``````> dim(data)
[1] 3823   64
> dim(centers)
[1] 10 64
``````

I have tried with

``````for (i in 1:n) {
for (j in 1:K) {
d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
}
S[i] <- which.min(d)
}
``````

which is extremely slow (with
`n = 200`
, it takes more than 40s!!). The fastest solution that I wrote is

``````distance <- function(point, group) {
return(dist(t(array(c(point, t(group)), dim=c(ncol(group), 1+nrow(group)))))[1:nrow(group)])
}

for (i in 1:n) {
d <- distance(data[i,], centers)
which.min(d)
}
``````

Even if it does a lot of computation that I don't use (because
`dist(m)`
computes the distance between all rows of
`m`
), it is way more faster than the other one (can anyone explain why?), but it is not fast enough for what I need, because it will not be used only once. And also, the
`distance`
code is very ugly. I tried to replace it with

``````distance <- function(point, group) {
return (dist(rbind(point,group))[1:nrow(group)])
}
``````

but this seems to be twice slower. I also tried to use
`dist`
for each pair, but it is also slower.

I don't know what to do now. It seems like I am doing something very wrong. Any idea on how to do this more efficiently?

ps: I need this to implement k-means by hand (and I need to do it, it is part of an assignment). I believe I will only need Euclidian distance, but I am not yet sure, so I will prefer to have some code where the distance computation can be replaced easily.
`stats::kmeans`
do all computation in less than one second.

Rather than iterating across data points, you can just condense that to a matrix operation, meaning you only have to iterate across `K`.

``````# Generate some fake data.
n <- 3823
K <- 10
d <- 64
x <- matrix(rnorm(n * d), ncol = n)
centers <- matrix(rnorm(K * d), ncol = K)

system.time(
dists <- apply(centers, 2, function(center) {
colSums((x - center)^2)
})
)
``````

Runs in:

``````utilisateur     système      écoulé
0.100       0.008       0.108
``````

on my laptop.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download