EB2127 - 1 month ago
Python Question

How to properly remove redundant components for Scikit-Learn's DPGMM?

I am using scikit-learn to implement the Dirichlet Process Gaussian Mixture Model:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/mixture/dpgmm.py
http://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html

That is, it is sklearn.mixture.BayesianGaussianMixture() with weight_concentration_prior_type = 'dirichlet_process', which is the default. As opposed to k-means, where users set the number of clusters "k" a priori, DPGMM is an infinite mixture model with the Dirichlet Process as a prior distribution on the number of clusters.

My DPGMM model consistently outputs exactly n_components clusters. As discussed here, the correct way to deal with this is to "reduce redundant components" with predict(X):

Scikit-Learn's DPGMM fitting: number of components?

However, the example linked to does not actually remove redundant components and show the "correct" number of clusters in the data. Rather, it simply plots the correct number of clusters.

http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html

How do users actually remove the redundant components, and output an array which shows these components? Is this the "official"/only way to remove redundant clusters?

Here is my code:

>>> import pandas as pd
>>> import numpy as np
>>> import random
>>> from sklearn import mixture
>>> X = pd.read_csv(....) # my matrix
>>> X.shape
(20000, 48)
>>> dpgmm3 = mixture.BayesianGaussianMixture(n_components = 20, weight_concentration_prior_type='dirichlet_process', max_iter = 1000, verbose = 2)
>>> dpgmm3.fit(X) # Fitting the DPGMM model
>>> labels = dpgmm3.predict(X) # Generating labels after model is fitted
>>> max(labels)
>>> np.unique(labels) # Number of labels == n_components specified above
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

#Trying with a different n_components

>>> dpgmm3_1 = mixture.BayesianGaussianMixture(weight_concentration_prior_type='dirichlet_process', max_iter = 1000) # not specifying n_components (defaults to 1)
>>> dpgmm3_1.fit(X)
>>> labels_1 = dpgmm3_1.predict(X)
>>> labels_1
array([0, 0, 0, ..., 0, 0, 0]) #All were classified under the same label

#Trying with n_components = 7

>>> dpgmm3_2 = mixture.BayesianGaussianMixture(n_components = 7, weight_concentration_prior_type='dirichlet_process', max_iter = 1000)
>>> dpgmm3_2.fit(X)

>>> labels_2 = dpgmm3_2.predict(X)
>>> np.unique(labels_2)
array([0, 1, 2, 3, 4, 5, 6]) #number of labels == n_components

Answer

There is no automated method to do so yet but you can have a look at the estimated weights_ attribute and prune components that have a small value (e.g. below 0.01).

Edit: to count the number of components effectively used by the model, you can do:

import numpy as np
from sklearn.mixture import BayesianGaussianMixture
model = BayesianGaussianMixture(n_components=30).fit(X)
print("active components: %d" % np.sum(model.weights_ > 0.01))

This should print a number of active components lower than the provided upper bound (30 in this example).
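To go from a count to an actual pruned labelling, something along these lines may work. It is only a sketch under assumptions: synthetic data from make_blobs stands in for X, the 0.01 cutoff is the rule of thumb mentioned above rather than a library default, and prune_labels is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

# Synthetic stand-in for X (assumption: not the asker's data).
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.5, random_state=0)

model = BayesianGaussianMixture(
    n_components=10,  # upper bound on the number of components
    weight_concentration_prior_type="dirichlet_process",
    max_iter=1000,
    random_state=0,
).fit(X)


def prune_labels(model, X, threshold=0.01):
    """Hypothetical helper: label points using only components above the weight threshold."""
    # Indices of components whose estimated mixing weight exceeds the cutoff.
    active = np.flatnonzero(model.weights_ > threshold)
    # Posterior responsibilities, restricted to the active components.
    resp = model.predict_proba(X)[:, active]
    # argmax over the kept columns yields labels in 0 .. len(active) - 1.
    labels = resp.argmax(axis=1)
    return active, labels


active, labels = prune_labels(model, X)
print("active component indices:", active)
print("labels in use:", np.unique(labels))
```

Here active[labels] maps the compact labels back to the original component indices, and model.means_[active] and model.covariances_[active] give the parameters of the components that were kept.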

Edit 2: the n_components parameter specifies the maximum number of components the model can use. The effective number of components actually used by the model can be retrieved by inspecting the weights_ attribute at the end of the fit. It will mostly depend on the structure of the data and on the value of weight_concentration_prior (especially if the number of samples is small).
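A quick way to see the influence of weight_concentration_prior is to refit with a few values and compare the number of components above the cutoff. The snippet below is a rough sketch: the data is synthetic, and the three prior values and the 0.01 threshold are arbitrary choices for illustration. Lower values tend to concentrate the weight on fewer components, but the exact counts depend on the data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

# Small synthetic dataset (assumption: used for illustration only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

active_counts = {}
for prior in (1e-3, 1.0, 1e3):  # arbitrary values chosen for the comparison
    m = BayesianGaussianMixture(
        n_components=10,  # upper bound on the number of components
        weight_concentration_prior_type="dirichlet_process",
        weight_concentration_prior=prior,
        max_iter=1000,
        random_state=0,
    ).fit(X)
    # Count the components whose estimated weight exceeds the cutoff.
    active_counts[prior] = int(np.sum(m.weights_ > 0.01))
    print("prior=%g -> active components: %d" % (prior, active_counts[prior]))
```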