HalfPintBoy - 1 month ago
Python Question

Labels appending in EM clustering algorithm

I'm running EM clustering with 3 components on a dataset (x), which is just a dataframe with 15 features.

from sklearn import mixture
import pandas as pd

x=pd.read_csv('tr.csv', sep=';')
em = mixture.GMM(n_components=3)
em.fit(x)


Then I want to add a column to my dataframe that holds the cluster label of each row (for example, like labels_ in the k-means approach). But the best I have found are the weights, and that does not seem correct:

x['CLUSTER'] = pd.Series(em.weights_, index=x.index).astype(str)


It gives me an error (something like: there are 100000 rows in your data, but you are trying to append only 3 values).

So how can I get the cluster labels from the EM algorithm, and how can I insert them as a column, one label per row, in the original df?

Thanks!

Answer

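To get "labels" you need to call .predict(x), not .weights_. The weights_ attribute holds (one of many!) parameters of the fitted distribution: one mixing weight per component, so 3 values here, which is why it cannot fill a column of 100000 rows. It does not contain point-wise labels.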

x['CLUSTER'] = em.predict(x)
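
For a fuller picture, here is a minimal sketch of the same idea. It makes two assumptions: it uses GaussianMixture, which replaced mixture.GMM in newer scikit-learn releases, and it builds a small synthetic dataframe (300 rows, 15 made-up feature names) as a stand-in for tr.csv, since the original file is not available.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for tr.csv (assumption): 300 rows, 15 numeric features.
rng = np.random.RandomState(0)
x = pd.DataFrame(rng.normal(size=(300, 15)),
                 columns=["f%d" % i for i in range(15)])

em = GaussianMixture(n_components=3, random_state=0)
em.fit(x)

# weights_ has one entry per component, so it cannot fill a per-row column.
print(em.weights_.shape)  # (3,)

# predict returns one component index per row.
labels = em.predict(x)

# predict_proba returns the soft assignment: one probability per component per row.
probs = em.predict_proba(x)

# Add the column only after predicting, so the model still sees 15 features.
x["CLUSTER"] = labels
print(x["CLUSTER"].value_counts())
```

Note that once the CLUSTER column is in x, calling em.predict(x) again would pass 16 columns to a model fitted on 15, so either predict before adding the column or keep the feature columns in a separate dataframe.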