Auxiliary Auxiliary - 3 years ago 98
Python Question

What are noisy samples in Scikit's DBSCAN clustering algorithm?

If I apply Scikit's DBSCAN (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) on a similarity matrix, I get a series of labels back. Some of these labels are -1. The documentation calls them noisy samples.

What are these? Do they all belong to a single cluster, or do they each belong to their own cluster since they're noisy?

Thank you

Answer Source

These are not exactly part of a cluster. They are simply points that do not belong to any clusters and can be "ignored" to some extent.

Remember, DBSCAN stands for "Density-Based Spatial Clustering of Applications with Noise." DBSCAN checks to make sure a point has enough neighbors within a specified range to classify the points into the clusters.

But what happens to the points that do not meet the criteria for falling into any of the main clusters? What if a point does not have enough neighbors within the specified radius to be considered part of a cluster? These are the points that are given the cluster label of -1 and are considered noise.

So what?

Well, if you are analyzing data points and you are only interested in the general clusters, you lower the size of the data and cut out the noise. Or, if you are using cluster analysis to classify data, in some cases it is possible to discard the noise as outliers.

In anomaly detection, points that do not fit into any category are also significant, as they can represent a problem or rare event.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download