
Calculating all distances between one point and a group of points efficiently in R

First of all, I am new to R (I started yesterday).

I have two groups of points, data and centers, the first of size n and the second of size K (for instance, n = 3823 and K = 10), and for each point i in the first set I need to find the point j in the second set at minimum distance.

My idea is simple: for each i, let dist[j] be the distance between i and j; then I only need which.min(dist) to find what I am looking for.

Each point is an array of 64 doubles, so

> dim(data)
[1] 3823 64
> dim(centers)
[1] 10 64


I have tried with

for (i in 1:n) {
  for (j in 1:K) {
    d[j] <- sqrt(sum((centers[j, ] - data[i, ])^2))
  }
  S[i] <- which.min(d)
}


which is extremely slow (with n = 200, it takes more than 40 seconds!). The fastest solution that I wrote is

distance <- function(point, group) {
  return(dist(t(array(c(point, t(group)), dim = c(ncol(group), 1 + nrow(group)))))[1:nrow(group)])
}

for (i in 1:n) {
  d <- distance(data[i, ], centers)
  which.min(d)
}


Even though it does a lot of computation that I don't use (because dist(m) computes the distances between all rows of m), it is much faster than the other version (can anyone explain why?), but it is still not fast enough for what I need, because it will not be run only once. Also, the distance code is very ugly. I tried to replace it with

distance <- function(point, group) {
  return(dist(rbind(point, group))[1:nrow(group)])
}


but this seems to be twice as slow. I also tried calling dist on each pair individually, but that is even slower.
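
For reference, such a helper can also be written with base R alone (a minimal sketch, assuming plain Euclidean distance): sweep subtracts the point from every row of the group, and rowSums adds up the squared differences, so no full distance matrix is built.

# Sketch: squared differences per row of group, summed, then square-rooted.
# Assumes group is K x 64 and point is a length-64 vector.
distance <- function(point, group) {
  sqrt(rowSums(sweep(group, 2, point)^2))
}

# Example use, mirroring the loop above:
# S[i] <- which.min(distance(data[i, ], centers))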

I don't know what to do now. It seems like I am doing something very wrong. Any idea on how to do this more efficiently?

PS: I need this to implement k-means by hand (and I do have to write it myself; it is part of an assignment). I believe I will only need the Euclidean distance, but I am not yet sure, so I would prefer code where the distance computation can be replaced easily. For comparison, stats::kmeans does all the computation in less than one second.
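
A minimal sketch of a setup with a swappable distance (the helper name and default metric below are illustrative, not from the original post):

# Hypothetical helper: metric can be replaced by any function taking two
# numeric vectors and returning a scalar distance.
nearest_center <- function(point, centers,
                           metric = function(a, b) sqrt(sum((a - b)^2))) {
  which.min(apply(centers, 1, function(center) metric(point, center)))
}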

Answer

Rather than iterating over the data points, you can condense that into a matrix operation, meaning you only have to iterate over the K centers.

# Generate some fake data. Note that points are stored as columns here
# (the matrices are d x n and d x K, the transpose of the question's layout).
n <- 3823
K <- 10
d <- 64
x <- matrix(rnorm(n * d), ncol = n)
centers <- matrix(rnorm(K * d), ncol = K)

system.time(
  # For each center: subtract it from every point (column-wise recycling),
  # square, and sum down each column. Each call yields the n squared
  # distances to that center, so dists ends up n x K.
  dists <- apply(centers, 2, function(center) {
    colSums((x - center)^2)
  })
)

Runs in:

   user  system elapsed 
  0.100   0.008   0.108 

on my laptop.
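
To recover each point's nearest center from dists (an n x K matrix, one column per center), something like the following should work; this final step is a sketch on top of the answer above, not part of it:

# Each row of dists holds one point's squared distances to the K centers;
# the assignment is the column index of the row minimum.
S <- apply(dists, 1, which.min)
# Equivalent, and usually faster for large n:
S <- max.col(-dists, ties.method = "first")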
