HalfPintBoy - 2 months ago 36

Python Question

I'm running EM clustering with 3 components on a dataset (x), which is just a DataFrame with 15 features.

```
from sklearn import mixture
import pandas as pd

x = pd.read_csv('tr.csv', sep=';')
em = mixture.GMM(n_components=3)
em.fit(x)
```

Then I want to add a column to my dataframe that holds the cluster label for each row (for example, like using `labels_` in the k-means approach). But the best I have found is the weights, and that does not seem right:

`x['CLUSTER'] = pd.Series(em.weights_, index=x.index).astype(str)`

It gives me an error (something like: there are 100000 rows in your data but you are trying to append only 3).

So how can I get the cluster labels from the EM algorithm, and how can I insert them as a column, one label per row, in the original df?

Thanks!

Answer

To get "labels" you need to call `.predict(x)`, not `.weights_`. The `.weights_` are (one of many!) parameters of the fitted distribution, not point-wise labels.

```
x['CLUSTER'] = em.predict(x)
```
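
For a fuller picture, here is a minimal sketch of the same idea on synthetic data. It assumes a recent scikit-learn, where `mixture.GMM` has been removed and `GaussianMixture` is the replacement. The random data, the column names `f0`..`f14` and the `features` list are made up for illustration and stand in for `tr.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for tr.csv (assumption): 300 rows, 15 features,
# drawn from three well-separated Gaussian blobs.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=center, scale=1.0, size=(100, 15))
    for center in (-5.0, 0.0, 5.0)
])
x = pd.DataFrame(data, columns=[f"f{i}" for i in range(15)])

# Remember the feature columns so the label column added below
# is never passed back into the model by accident.
features = x.columns.tolist()

em = GaussianMixture(n_components=3, random_state=0)
em.fit(x[features])

# weights_ holds one mixing proportion per component (length 3),
# which is why it cannot be assigned to a column with one value per row.
print(em.weights_.shape)

# predict returns one hard label per row.
x["CLUSTER"] = em.predict(x[features])

# predict_proba returns the soft assignment: one probability per component per row.
proba = em.predict_proba(x[features])
print(x["CLUSTER"].value_counts())
```

Selecting `x[features]` rather than `x` matters once the `CLUSTER` column exists: calling `em.predict(x)` again would then pass 16 columns to a model fitted on 15. The label numbers themselves are arbitrary and can differ between runs unless `random_state` is fixed.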