Aaron Santos - 3 months ago
Scala Question

Why does Spark's GaussianMixture return identical clusters?

I'm using spark-1.5.2 to cluster a dataset with GaussianMixture. No errors occur, but the Gaussians in the resulting GaussianMixtureModel and their weights are all identical. The algorithm reaches the specified tolerance after only about 2 iterations, which seems far too low.

What parameters can I adjust so that clusters form with different values?

import org.apache.spark.SparkContext
import org.apache.spark.rdd._
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.{Vector, Vectors}

def sparkContext: SparkContext = {
  import org.apache.spark.SparkConf
  new SparkContext(new SparkConf().setMaster("local[*]").setAppName("console"))
}

implicit val sc = sparkContext

def observationsRdd(implicit sc: SparkContext): RDD[Vector] = {
  sc.textFile("observations.csv")
    .map { line => Vectors.dense(line.split(",").map { _.toDouble }) }
}

val gmm = {
  new GaussianMixture()
    .setK(6)
    .setMaxIterations(1000)
    .setConvergenceTol(0.001)
    .setSeed(1)
    .run(observationsRdd)
}

for (i <- 0 until gmm.k) {
  println("weight=%f\nmu=%s\nsigma=\n%s\n" format
    (gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma))
}


Truncated output:

weight=0.166667
mu=[4730.358845338535,4391.695550847029,4072.3224046605947,4253.183898304653,4454.124682202946,4775.553442796136,4980.3952860164545,4812.717637711368,5120.44449152493,2820.1827330505857,180.10291313557565,4189.185858050445,3690.793644067457]
sigma=
422700.24745093845 382225.3248240414 398121.9356855869 ... (13 total)
382225.3248240414 471186.33178427175 455777.0565262309 ...
398121.9356855869 455777.0565262309 461210.0532084378 ...
469361.3787142044 497432.39963363775 515341.1303306988 ...
474369.6318494179 482754.83801426284 500047.5114985542 ...
453832.62301188655 443147.58931290614 461017.7038258409 ...
458641.51202210854 433511.1974652861 452015.6655154465 ...
387980.29836054996 459673.3283909025 455118.78272128507 ...
461724.87201332086 423688.91832506843 442649.18455604656 ...
291940.48273324646 257309.1054220978 269116.23674394307 ...
16289.3063964479 14790.06803739929 15387.484828872432 ...
334045.5231910066 338403.3492767321 350531.7768916226 ...
280036.0894114749 267624.69326772855 279651.401859903 ...

weight=0.166667
mu=[4730.358845338535,4391.695550847029,4072.3224046605947,4253.183898304653,4454.124682202946,4775.553442796136,4980.3952860164545,4812.717637711368,5120.44449152493,2820.1827330505857,180.10291313557565,4189.185858050445,3690.793644067457]
sigma=
422700.24745093845 382225.3248240414 398121.9356855869 ... (13 total)
382225.3248240414 471186.33178427175 455777.0565262309 ...
398121.9356855869 455777.0565262309 461210.0532084378 ...
469361.3787142044 497432.39963363775 515341.1303306988 ...
474369.6318494179 482754.83801426284 500047.5114985542 ...
453832.62301188655 443147.58931290614 461017.7038258409 ...
458641.51202210854 433511.1974652861 452015.6655154465 ...
387980.29836054996 459673.3283909025 455118.78272128507 ...
461724.87201332086 423688.91832506843 442649.18455604656 ...
291940.48273324646 257309.1054220978 269116.23674394307 ...
16289.3063964479 14790.06803739929 15387.484828872432 ...
334045.5231910066 338403.3492767321 350531.7768916226 ...
280036.0894114749 267624.69326772855 279651.401859903 ...

weight=0.166667
mu=[4730.358845338535,4391.695550847029,4072.3224046605947,4253.183898304653,4454.124682202946,4775.553442796136,4980.3952860164545,4812.717637711368,5120.44449152493,2820.1827330505857,180.10291313557565,4189.185858050445,3690.793644067457]
sigma=
422700.24745093845 382225.3248240414 398121.9356855869 ... (13 total)
382225.3248240414 471186.33178427175 455777.0565262309 ...
398121.9356855869 455777.0565262309 461210.0532084378 ...
469361.3787142044 497432.39963363775 515341.1303306988 ...
474369.6318494179 482754.83801426284 500047.5114985542 ...
453832.62301188655 443147.58931290614 461017.7038258409 ...
458641.51202210854 433511.1974652861 452015.6655154465 ...
387980.29836054996 459673.3283909025 455118.78272128507 ...
461724.87201332086 423688.91832506843 442649.18455604656 ...
291940.48273324646 257309.1054220978 269116.23674394307 ...
16289.3063964479 14790.06803739929 15387.484828872432 ...
334045.5231910066 338403.3492767321 350531.7768916226 ...
280036.0894114749 267624.69326772855 279651.401859903 ...

weight=0.166667
mu=[4730.358845338535,4391.695550847029,4072.3224046605947,4253.183898304653,4454.124682202946,4775.553442796136,4980.3952860164545,4812.717637711368,5120.44449152493,2820.1827330505857,180.10291313557565,4189.185858050445,3690.793644067457]
sigma=
422700.24745093845 382225.3248240414 398121.9356855869 ... (13 total)
382225.3248240414 471186.33178427175 455777.0565262309 ...
398121.9356855869 455777.0565262309 461210.0532084378 ...
469361.3787142044 497432.39963363775 515341.1303306988 ...
474369.6318494179 482754.83801426284 500047.5114985542 ...
453832.62301188655 443147.58931290614 461017.7038258409 ...
458641.51202210854 433511.1974652861 452015.6655154465 ...
387980.29836054996 459673.3283909025 455118.78272128507 ...
461724.87201332086 423688.91832506843 442649.18455604656 ...
291940.48273324646 257309.1054220978 269116.23674394307 ...
16289.3063964479 14790.06803739929 15387.484828872432 ...
334045.5231910066 338403.3492767321 350531.7768916226 ...
280036.0894114749 267624.69326772855 279651.401859903 ...

weight=0.166667
mu=[4730.358845338535,4391.695550847029,4072.3224046605947,4253.183898304653,4454.124682202946,4775.553442796136,4980.3952860164545,4812.717637711368,5120.44449152493,2820.1827330505857,180.10291313557565,4189.185858050445,3690.793644067457]
sigma=
422700.24745093845 382225.3248240414 398121.9356855869 ... (13 total)
382225.3248240414 471186.33178427175 455777.0565262309 ...
398121.9356855869 455777.0565262309 461210.0532084378 ...
469361.3787142044 497432.39963363775 515341.1303306988 ...
474369.6318494179 482754.83801426284 500047.5114985542 ...
453832.62301188655 443147.58931290614 461017.7038258409 ...
458641.51202210854 433511.1974652861 452015.6655154465 ...
387980.29836054996 459673.3283909025 455118.78272128507 ...
461724.87201332086 423688.91832506843 442649.18455604656 ...
291940.48273324646 257309.1054220978 269116.23674394307 ...
16289.3063964479 14790.06803739929 15387.484828872432 ...
334045.5231910066 338403.3492767321 350531.7768916226 ...
280036.0894114749 267624.69326772855 279651.401859903 ...

weight=0.166667
mu=[4730.358845338535,4391.695550847029,4072.3224046605947,4253.183898304653,4454.124682202946,4775.553442796136,4980.3952860164545,4812.717637711368,5120.44449152493,2820.1827330505857,180.10291313557565,4189.185858050445,3690.793644067457]
sigma=
422700.24745093845 382225.3248240414 398121.9356855869 ... (13 total)
382225.3248240414 471186.33178427175 455777.0565262309 ...
398121.9356855869 455777.0565262309 461210.0532084378 ...
469361.3787142044 497432.39963363775 515341.1303306988 ...
474369.6318494179 482754.83801426284 500047.5114985542 ...
453832.62301188655 443147.58931290614 461017.7038258409 ...
458641.51202210854 433511.1974652861 452015.6655154465 ...
387980.29836054996 459673.3283909025 455118.78272128507 ...
461724.87201332086 423688.91832506843 442649.18455604656 ...
291940.48273324646 257309.1054220978 269116.23674394307 ...
16289.3063964479 14790.06803739929 15387.484828872432 ...
334045.5231910066 338403.3492767321 350531.7768916226 ...
280036.0894114749 267624.69326772855 279651.401859903 ...


...

Additionally, the code, input data, and output data are available as a gist: https://gist.github.com/aaron-santos/91b4931a446c460e082b2b3055b9950f

Thank you

Answer

I ran your data through ELKI (I had to remove the last line, which is incomplete). At first it did not work either, which I assume is due to the scale of the attributes combined with the default initialization. The same problem is probably present in Spark.

After scaling the data, I could get some reasonable clusters with ELKI (visualizing the first three of 13 dimensions):

[image: EM clustering of the scaled data]
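In Spark, the same kind of per-column rescaling can be sketched with MLlib's StandardScaler before fitting the mixture. This is only a minimal sketch, not a verified fix for this dataset: it assumes the spark-mllib 1.5 API and dense input vectors, the helper name fitScaledGmm is hypothetical, and the iteration, tolerance, and seed values are simply copied from the question.

```scala
import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper: standardize every column to zero mean and unit
// variance, then fit the mixture on the scaled data.
def fitScaledGmm(observations: RDD[Vector], k: Int): GaussianMixtureModel = {
  // withMean = true requires dense vectors, which Vectors.dense produces.
  val scaler = new StandardScaler(withMean = true, withStd = true).fit(observations)
  val scaled = scaler.transform(observations).cache()

  new GaussianMixture()
    .setK(k)
    .setMaxIterations(1000)   // values below are taken from the question
    .setConvergenceTol(0.001)
    .setSeed(1)
    .run(scaled)
}
```

With the question's code this would be called as fitScaledGmm(observationsRdd, 6). Note that the resulting mu and sigma are expressed in scaled units, not in the original attribute units.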

But judging from the distribution of the data points I do not think Gaussian Mixture Modeling is appropriate for this data. The points appear to be grid-sampled from some hypersurface or some trajectories; not from Gaussian (!) distributions.

Here are the ELKI parameters I used:

-dbc.in /tmp/observations.csv
-dbc.filter normalization.columnwise.AttributeWiseVarianceNormalization
-algorithm clustering.em.EM -em.k 6
-em.centers RandomlyChosenInitialMeans -kmeans.seed 0
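A rough Spark counterpart to the randomly chosen initial means above might look like the following. It is a sketch under several assumptions: the input has already been standardized, an identity covariance is an acceptable starting point for every component, and the spark-mllib 1.5 API is available. The helper name fitWithRandomMeans is hypothetical, and nothing here is confirmed to resolve the asker's problem.

```scala
import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
import org.apache.spark.mllib.linalg.{Matrices, Vector}
import org.apache.spark.mllib.stat.distribution.MultivariateGaussian
import org.apache.spark.rdd.RDD

// Hypothetical helper: start EM from k randomly sampled observations.
// Assumes `scaled` is already standardized, so a unit covariance is a
// plausible initial spread for each component.
def fitWithRandomMeans(scaled: RDD[Vector], k: Int, seed: Long): GaussianMixtureModel = {
  val means = scaled.takeSample(withReplacement = false, k, seed)
  require(means.length == k, "fewer than k observations available")

  val dim = means.head.size
  val gaussians = means.map(mu => new MultivariateGaussian(mu, Matrices.eye(dim)))
  val weights = Array.fill(k)(1.0 / k) // equal starting weights

  new GaussianMixture()
    .setK(k) // set before setInitialModel, which checks the component count
    .setInitialModel(new GaussianMixtureModel(weights, gaussians))
    .setMaxIterations(1000)
    .setConvergenceTol(0.001)
    .run(scaled)
}
```

Trying a few different seeds and comparing the fitted components is a cheap way to see how sensitive the result is to the starting point.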

It may be worth experimenting with other clustering algorithms such as HDBSCAN, which can identify density-based clusters:

[image: HDBSCAN clustering of the scaled data]

Parameters:

-dbc.in /tmp/observations.csv
-dbc.filter normalization.columnwise.AttributeWiseVarianceNormalization
-algorithm clustering.hierarchical.extraction.HDBSCANHierarchyExtraction
-algorithm SLINKHDBSCANLinearMemory
-hdbscan.minPts 50 -hdbscan.minclsize 100

I would also try OPTICS, as I find that HDBSCAN often captures only the core of a cluster (by design). From the OPTICS plot, I would not say the clusters are very clearly defined.

Apart from trying other clustering algorithms, I think you also need to work a lot on preprocessing and projecting your data, because it has very strong correlations. Try to put as much prior knowledge about the data as you can into your preprocessing to improve results.
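As one possible starting point for inspecting the correlations and projecting the data in Spark, here is a minimal sketch using MLlib's Statistics.corr and PCA. It assumes the spark-mllib 1.5 API; the helper names correlationMatrix and projectWithPca are hypothetical, and PCA is just one generic projection, not a substitute for domain-specific preprocessing.

```scala
import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// Pearson correlation between every pair of columns; entries close to
// +1 or -1 flag attributes that carry nearly the same information.
def correlationMatrix(observations: RDD[Vector]): Matrix =
  Statistics.corr(observations, "pearson")

// Hypothetical helper: project the data onto its first `dims` principal
// components. The choice of `dims` is left to the caller.
def projectWithPca(observations: RDD[Vector], dims: Int): RDD[Vector] = {
  val pca = new PCA(dims).fit(observations)
  pca.transform(observations)
}
```

Because PCA is sensitive to attribute scale, it is usually applied to standardized data rather than to the raw columns.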