Mark Morrisson - 5 months ago
Python Question

How to specify a distance function for clustering?

I'd like to cluster points according to a custom distance, and strangely, it seems that neither the scipy nor the sklearn clustering methods allow a distance function to be specified.

For instance, in sklearn.cluster.AgglomerativeClustering, the only thing I can do is pass an affinity matrix (which will be very memory-heavy). To build this matrix, it is recommended to use sklearn.neighbors.kneighbors_graph, but I don't understand how to specify a distance function between two points there either. Could someone enlighten me?

Answer

The scipy hierarchical clustering routines that operate on raw observations (linkage and fclusterdata) accept a custom distance function that takes two 1D vectors specifying a pair of points and returns a scalar. For example, using fclusterdata:

import numpy as np
from scipy.cluster.hierarchy import fclusterdata

# a custom function that just computes Euclidean distance
def mydist(p1, p2):
    diff = p1 - p2
    return np.vdot(diff, diff) ** 0.5

X = np.random.randn(100, 2)

fclust1 = fclusterdata(X, 1.0, metric=mydist)
fclust2 = fclusterdata(X, 1.0, metric='euclidean')

print(np.allclose(fclust1, fclust2))
# True

Valid inputs for the metric= kwarg are the same as for scipy.spatial.distance.pdist.
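
Because the metric= argument follows pdist, the same kind of callable can also be passed to pdist directly to build a condensed distance vector, which linkage and fcluster then consume. The following is only a minimal sketch of that route: the manhattan function, the random data, the 'average' linkage method, and the choice of three clusters are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# hypothetical custom metric: Manhattan (city-block) distance
def manhattan(p1, p2):
    return np.abs(p1 - p2).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))

# condensed distance vector with n * (n - 1) / 2 entries instead of n * n
d_custom = pdist(X, metric=manhattan)
d_builtin = pdist(X, metric='cityblock')

# linkage accepts the condensed vector directly
Z = linkage(d_custom, method='average')

# cut the tree into at most 3 flat clusters (arbitrary choice)
labels = fcluster(Z, t=3, criterion='maxclust')

print(d_custom.shape)
print(np.allclose(d_custom, d_builtin))
```

The condensed form stores each pair only once, which roughly halves the memory of a full square matrix. A Python callable is invoked once per pair of points, so it is much slower than the built-in string metrics; where a built-in metric matches the custom distance, passing its name is preferable.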