Joachim Schork - 3 months ago 13

R Question

I want to calculate the distances between all rows of a large matrix. For each row, I need to find another row, which has the lowest distance. The final output should be a list, containing of IDs of the rows with the lowest distances (see low_dis_ids in the example below).

I was able to find a solution for small samplesizes (example below). However, I am not able to perform these steps with larger samplesizes, because the matrix with the distances gets loo big. Is there a way to omit such a big matrix? I do only need the list with the IDs (like low_dis_ids).

Reproducible Example:

`set.seed(123)`

# Calculation of distances with small samplesize is working well

N <- 100

data_100 <- data.frame(x1 = rnorm(N, 5, 10),

x2 = rnorm(N, 5, 10),

x3 = rnorm(N, 5, 10),

x4 = rnorm(N, 5, 10),

x5 = rnorm(N, 5, 10))

# Matrix with all distances (no problem for the smaller samplesize)

dist_100 <- as.matrix(dist(data_100))

# Find the row with the smallest distance

for(i in 1:nrow(dist_100)) {

dist_100[i, i] <- Inf

}

low_dis <- numeric()

for(i in 1:nrow(dist_100)) {

low_dis[i] <- as.numeric(sort(dist_100[ , i]))[1]

}

low_dis_ids <- list()

for(i in 1:length(low_dis)) {

low_dis_ids[[i]] <- as.numeric(names(dist_100[ , i][dist_100[ , i] == low_dis[i]]))

}

# low_dis_ids is the desired output and stores the rows with the smallest distances

# The same procedure is not working for larger samplesizes

N <- 100000

data_100000 <- data.frame(x1 = rnorm(N, 5, 10),

x2 = rnorm(N, 5, 10),

x3 = rnorm(N, 5, 10),

x4 = rnorm(N, 5, 10),

x5 = rnorm(N, 5, 10))

dist_100000 <- dist(data_100000)

# Error: cannot allocate vector of size 37.3 Gb

Answer

You can definitely avoid the creation of the large matrix that comes as a result of using `dist`

. One such solution is to calculate the distances one row at a time, we could write a function that given the whole data frame and a row id finds which row corresponds to the smallest distance. For example:

```
f = function(rowid, whole){
d = colSums((whole[rowid,] - t(whole))^2) # calculate distance
d[rowid] = Inf # replace the zero
which.min(d)
}
```

The `colSums`

function is fairly well optimized so this is relatively quick. I suspect `which.min`

is also a slightly faster and possibly neater approach to looping through the vectors of distances.

To make a solution which then applies to any such data frame I wrote another short function which applies this to every row and gives you a vector of row ids

```
mindists = function(dat) do.call(c,lapply(1:nrow(dat),f,whole = as.matrix(dat)))
```

If you want the list instead of a vector, just omit the `do.call`

function. I had this to make it easier to check that the output gave what you expected.

```
all(do.call(c,low_dis_ids) == mindists(data_100))
[1] TRUE
```

This also runs for the larger example on my laptop. It isn't fast because you are making `nrow(data)`

calls to `f`

but it does avoid the creation of one large object. There may be better solutions out there but this was the first one that sprung to mind that works. Hope that helps.

EDIT:

Edited since there is an additional C++ answer by Roland
I did a quick benchmark on the smaller data set. The C++ answer is definitely quicker in this case.
Some extra sales pitch for this answer is it is I think easier to understand if you are purely an R programmer (no need to learn C++ and RCpp). The R version is trivial to parallelise using a parallel replacement of `lapply`

. I will note this is not to take away from Rolands answer, personally I like Rcpp, just to give extra bits of info for any future readers of this question.