CMUEngineer - 6 months ago 17

Python Question

In order to make clustering a more feasible task, I want to remove items from an array if they have another item which is within some threshold in n-dimensional euclidean space. The input data into this truncation is an array of pixel-wise feature vectors. My first thought was to compute the pairwise euclidean distance matrix between all the items and then operate on them as such:

`indices = list(range(len(X)))`

dist_matrix = euclidean_distances(X,X)

index = 0

while True:

deletion = np.where(dist_matrix[index]<=threshold)[0]

indices = [i for i in indices if i==index or i not in deletion]

try:

index = indices[indices.index(index) + 1]

except IndexError:

break

dictionary = []

for index in indices:

dictionary.append(X[index])

However, this leads to a Memory Error for my large dataset when creating the distance matrix with sklearn.metrics.pairwise.euclidean_distances. What is an effective, memory-conservative manner to perform this operation? I've realized that the computation of this distance matrix is what is causing problems in the clustering algorithm, so I would like to be able to avoid the computation of such a large distance matrix by truncating the input array.

Answer

Depending on the number of dimensions n, number of points N, the size of the problem in each dimension L, and your acceptable separation distance d, one option would be to grid your space into boxes of dimension d and retain at most one point within each grid box. Memory requirement would change from O(N^2) to O((L/d)^n), and running time would change from O(N^2) to O(N + (L/d)^n), so it might be more efficient if L/d and n are not too large.

Alternately, it might be practical to use the following algorithm

```
for each point p in points
for each point q in points
if p <> q and p.dist(q) < Dmin
q.delete
```

This should be O(N^2) running time and O(0) extra memory.