UPDATED: In the end, the solution I opted to use for clustering my large dataset was one suggested by Anony-Mousse below. That is, using ELKI's DBSCAN implimentation to do my clustering rather than scikit-learn's. It can be run from the command line and with proper indexing, performs this task within a few hours. Use the GUI and small sample datasets to work out the options you want to use and then go to town. Worth looking into. Anywho, read on for a description of my original problem and some interesting discussion.
I have a dataset with ~2.5 million samples, each with 35 features (floating point values) that I'm trying to cluster. I've been trying to do this with scikit-learn's implementation of DBSCAN, using the Manhattan distance metric and a value of epsilon estimated from some small random samples drawn from the data. So far, so good. (here is the snippet, for reference)
db = DBSCAN(eps=40, min_samples=10, metric='cityblock').fit(mydata)
Traceback (most recent call last):
File "tDBSCAN.py", line 34, in <module>
db = DBSCAN(eps=float(sys.argv), min_samples=10, metric='cityblock').fit(mydata)
File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/base.py", line 329, in fit_predict
File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/cluster/dbscan_.py", line 186, in fit
File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/cluster/dbscan_.py", line 69, in dbscan
D = pairwise_distances(X, metric=metric)
File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/metrics/pairwise.py", line 651, in pairwise_distances
return func(X, Y, **kwds)
File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/metrics/pairwise.py", line 237, in manhattan_distances
D = np.abs(X[:, np.newaxis, :] - Y[np.newaxis, :, :])
The problem apparently is a low-quality DBSCAN implementation in
DBSCAN does not need a distance matrix. The algorithm was designed around using a database that can accelerate a
regionQuery function, and return the neighbors within the query radius efficiently (a spatial index should support such queries in
The implementation in
scikit however, apparently, computes the full
O(n^2) distance matrix, which comes at a cost both memory-wise and runtime-wise.
So I see two choices:
You may want to try the DBSCAN implementation in ELKI instead, which when used with an R*-tree index usually is substantially faster than a naive implementation.
Otherwise, you may want to reimplement DBSCAN, as the implementation in
scikit apparently isn't too good. Don't be scared of that: DBSCAN is really simple to implement yourself. The trickiest part of a good DBSCAN implementation is actually the
regionQuery function. If you can get this query fast, DBSCAN will be fast. And you can actually reuse this function for other algorithms, too.