EB2127 - 6 months ago 48

Python Question

I am using scikit-learn to implement the Dirichlet Process Gaussian Mixture Model:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/mixture/dpgmm.py

http://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html

That is, it is `sklearn.mixture.BayesianGaussianMixture()` with the parameter `weight_concentration_prior_type = 'dirichlet_process'`.

My DPGMM model consistently outputs exactly as many clusters as `n_components` when I call `predict(X)`. This is discussed in a related question: "Scikit-Learn's DPGMM fitting: number of components?"

However, the example linked to does not actually remove redundant components and show the "correct" number of clusters in the data. Rather, it simply plots the correct number of clusters.

http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html

How do users actually remove the redundant components, and output an array which shows only the remaining components? Is this the "official"/only way to remove redundant clusters?

Here is my code:

>>> import pandas as pd

>>> import numpy as np

>>> import random

>>> from sklearn import mixture

>>> X = pd.read_csv(....) # my matrix

>>> X.shape

(20000, 48)

>>> dpgmm3 = mixture.BayesianGaussianMixture(n_components = 20, weight_concentration_prior_type='dirichlet_process', max_iter = 1000, verbose = 2)

>>> dpgmm3.fit(X) # Fitting the DPGMM model

>>> labels = dpgmm3.predict(X) # Generating labels after model is fitted

>>> max(labels)

>>> np.unique(labels) # Number of labels == n_components specified above

array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

#Trying with a different n_components

>>> dpgmm3_1 = mixture.BayesianGaussianMixture( weight_concentration_prior_type='dirichlet_process', max_iter = 1000) #not specifying n_components

>>> dpgmm3_1.fit(X)

>>> labels_1 = dpgmm3_1.predict(X)

>>> labels_1

array([0, 0, 0, ..., 0, 0, 0]) #All were classified under the same label

#Trying with n_components = 7

>>> dpgmm3_2 = mixture.BayesianGaussianMixture(n_components = 7, weight_concentration_prior_type='dirichlet_process', max_iter = 1000)

>>> dpgmm3_2.fit(X)

>>> labels_2 = dpgmm3_2.predict(X)

>>> np.unique(labels_2)

array([0, 1, 2, 3, 4, 5, 6]) #number of labels == n_components

Answer

There is no automated method to do so yet, but you can have a look at the estimated `weights_` attribute and prune components that have a small value (e.g. below 0.01).
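As a rough illustration of that pruning idea, here is a minimal sketch. The helper name `prune_and_relabel`, the synthetic `make_blobs` data and the 0.01 threshold are assumptions for the example, not part of scikit-learn's API. It keeps the components whose weight exceeds the threshold and assigns each sample to the most probable of those, using `predict_proba`:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture


def prune_and_relabel(model, X, threshold=0.01):
    """Hypothetical helper: relabel samples using only components
    whose mixture weight exceeds `threshold`."""
    # Indices of the components considered "active".
    keep = np.flatnonzero(model.weights_ > threshold)
    # Posterior responsibilities, restricted to the active components.
    resp = model.predict_proba(X)[:, keep]
    # Each label is a position in `keep`, so labels run from 0 to len(keep) - 1.
    return keep, resp.argmax(axis=1)


# Synthetic data with three well-separated blobs (illustrative only).
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.5, random_state=0)

model = BayesianGaussianMixture(
    n_components=10,  # upper bound on the number of components
    weight_concentration_prior_type="dirichlet_process",
    max_iter=1000,
    random_state=0,
).fit(X)

keep, labels = prune_and_relabel(model, X)
print("active component indices:", keep)
print("labels in use:", np.unique(labels))
```

The returned labels are compact indices into `keep`, so `keep[labels]` recovers the original component index for each sample. Whether a hard threshold is appropriate depends on your data; treat the value as something to tune.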

**Edit**: to count the number of components effectively used by the model you can do:

```
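import numpy as np
from sklearn.mixture import BayesianGaussianMixture
model = BayesianGaussianMixture(n_components=30).fit(X)
print("active components: %d" % np.sum(model.weights_ > 0.01))  # components above the threshold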
```

This should print a number of active components lower than the provided upper bound (30 in this example).

**Edit 2**: the `n_components` parameter specifies the maximum number of components the model can use. The effective number of components actually used by the model can be retrieved by introspecting the `weights_` attribute at the end of the fit. It will mostly depend on the structure of the data and on the value of `weight_concentration_prior` (especially if the number of samples is small).
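To see how `weight_concentration_prior` can influence the effective number of components, one option is to refit with a few values and count the weights above a threshold each time. This is only a sketch: the synthetic data, the three prior values and the 0.01 cutoff are arbitrary choices for illustration, and the counts you get will depend on your own data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

# Small synthetic dataset (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

active_counts = {}
for prior in (1e-3, 1.0, 1e3):  # arbitrary values spanning several orders of magnitude
    model = BayesianGaussianMixture(
        n_components=10,  # upper bound on the number of components
        weight_concentration_prior_type="dirichlet_process",
        weight_concentration_prior=prior,
        max_iter=1000,
        random_state=0,
    ).fit(X)
    # Count components whose weight exceeds the (assumed) 0.01 threshold.
    active_counts[prior] = int(np.sum(model.weights_ > 0.01))

print(active_counts)
```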