farrellmr - 6 months ago 52

Python Question

Ive been looking at Geoff Boeing's excellent blog posts on DBSCAN. The page I'm most interested in is -

http://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/

How can I amend this approach to return the center of the largest cluster(cluster center with the surrounded by the most lat/lng points)? Is there a density rating associated with the center point of each cluster?

The core dbscan -

`db = DBSCAN(eps=.01, min_samples=1).fit(coordinates)`

labels = db.labels_

num_clusters = len(set(labels)) - (1 if -1 in labels else 0)

clusters = pd.Series([coordinates[labels == i] for i in xrange(num_clusters)])

print('Number of clusters: %d' % num_clusters)

Answer

Unfortunately, that blog post is wrong in a number of key points.

*Never*use DBSCAN with`min_samples=1`

. That is single-linkage clustering.**If you want single-linkage, use single-linkage, not DBSCAN**. Here, Leader clustering may be a good choice, too.Choose

`eps`

wisely. In his example, he chose eps so small, he mostly removed (near-) duplicates...DBSCAN clusters do not have a meaningful

*center*. Because they can be non-convex.In particular, the center needs to take haversine distance into account, which he doesn't. The first version used the mean, the new version uses the point closest to the mean (but which may still be distorted, because the mean didn't take earth into account).On latitude, longitude you should be using great-circle distance

*during*clustering already, not only afterwards. (Fixed in the blog by now).

Point 3 above also answers your question: DBSCAN clusters **may not have a meaningful center**. The center can be *outside* the cluster.

Since the original post, some points (in particular #4) have been improved. DBSCAN now actually uses haversine, and a ball tree index.