mazieres - 1 year ago 246
Python Question

# Recovering features names of explained_variance_ratio_ in PCA with sklearn

I ran a PCA with scikit-learn and I'm trying to find out which features it selected as relevant.

Here is a classic example with the Iris dataset.

``````import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

# load the Iris data into a DataFrame
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
df_norm = (df - df.mean()) / df.std()

# PCA
pca = PCA(n_components=2)
pca.fit_transform(df_norm.values)
print(pca.explained_variance_ratio_)
``````

This returns

``````In [42]: pca.explained_variance_ratio_
Out[42]: array([ 0.72770452,  0.23030523])
``````

How can I recover which two features account for these two explained variance ratios?
Said differently, how can I get the indices of these features in `iris.feature_names`?

``````In [47]: print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
``````

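Each principal component is a linear combination of the original variables rather than a subset of them:

$$PC^{j} = \beta_1^{j} X_1 + \beta_2^{j} X_2 + \cdots + \beta_N^{j} X_N$$

where the `X_i` are the original variables and the `Beta_i` (written $\beta_i^{j}$ above for component $j$) are the corresponding weights, the so-called coefficients.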

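To obtain the weights, you can simply pass an identity matrix to the `transform` method. This works here because the data were centered, so the mean that `transform` subtracts is numerically zero; the same weights are also stored in `pca.components_`, one component per row: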

``````>>> i = np.identity(df.shape[1])  # identity matrix
>>> i
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

>>> coef = pca.transform(i)
>>> coef
array([[ 0.5224, -0.3723],
       [-0.2634, -0.9256],
       [ 0.5813, -0.0211],
       [ 0.5656, -0.0654]])
``````

Each column of the `coef` matrix above holds the weights of the linear combination that yields the corresponding principal component:

``````>>> pd.DataFrame(coef, columns=['PC-1', 'PC-2'], index=df.columns)
                    PC-1   PC-2
sepal length (cm)  0.522 -0.372
sepal width (cm)  -0.263 -0.926
petal length (cm)  0.581 -0.021
petal width (cm)   0.566 -0.065

[4 rows x 2 columns]
``````

For example, the table above shows that the second principal component (`PC-2`) is mostly aligned with `sepal width`, which has the largest weight in absolute value, `0.926`.
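To get the feature indices the question asks for, the same reading of the table can be done programmatically. The following is a minimal sketch, assuming the weights are taken from `pca.components_` (transposed so that features are rows); the names `loadings`, `dominant` and `dominant_idx` are illustrative only and not part of scikit-learn.

```python
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df_norm = (df - df.mean()) / df.std()

pca = PCA(n_components=2).fit(df_norm.values)

# components_ has shape (n_components, n_features); transpose so rows are features
loadings = pd.DataFrame(pca.components_.T,
                        columns=['PC-1', 'PC-2'],
                        index=df.columns)

# Name of the feature with the largest absolute weight in each component
dominant = loadings.abs().idxmax()

# Position of those features in iris.feature_names
dominant_idx = [iris.feature_names.index(name) for name in dominant]

print(loadings.round(3))
print(dominant)
print(dominant_idx)
```

Taking absolute values matters because the sign of a principal component is arbitrary. Keep in mind that every component still mixes all four features; the "dominant" feature is only the one with the largest weight.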

Because the principal axes returned by PCA are unit vectors, you can confirm that each coefficient vector has norm `1.0` (this is independent of the variance of the components themselves, which is reported in `pca.explained_variance_`):

``````>>> np.linalg.norm(coef,axis=0)
array([ 1.,  1.])
``````
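To connect the coefficients back to `explained_variance_ratio_`, it may help to check where those ratios come from: they are the variances of the projected data divided by the total variance, one value per component rather than per feature. A small sketch under the same setup as the question (the variable names `scores`, `score_var` and `total_var` are assumptions made for this illustration):

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df_norm = (df - df.mean()) / df.std()  # pandas std() uses ddof=1

pca = PCA(n_components=2)
scores = pca.fit_transform(df_norm.values)

# Sample variance (ddof=1) of each projected component
score_var = scores.var(axis=0, ddof=1)

# Total variance of the normalized data: one unit per feature
total_var = df_norm.values.var(axis=0, ddof=1).sum()

# Both comparisons are expected to hold up to floating-point error
print(np.allclose(score_var, pca.explained_variance_))
print(np.allclose(score_var / total_var, pca.explained_variance_ratio_))

# The coefficient vectors themselves have unit length
print(np.linalg.norm(pca.components_, axis=1))
```

So the two numbers in `explained_variance_ratio_` describe the two components as a whole; the per-feature information lives in the weights.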

You can also confirm that the principal components are the dot product of the normalized variables with the above coefficients:

``````>>> np.allclose(df_norm.values.dot(coef), pca.fit_transform(df_norm.values))
True
``````

Note that we need to use `numpy.allclose` instead of the regular equality operator because of floating-point rounding error.
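The identity-matrix trick relies on the data having zero mean, because `transform` subtracts the fitted mean before projecting. As a hedged sketch of what happens otherwise, the example below fits PCA on the raw (uncentered) Iris data and reproduces `transform` by hand from `pca.mean_` and `pca.components_`; the names `manual`, `coef_identity` and `offset` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data  # raw data, not centered
pca = PCA(n_components=2).fit(X)

# transform() centers with the fitted mean, then projects on the components
manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(manual, pca.transform(X)))

# On uncentered data the identity trick is shifted by the projected mean
coef_identity = pca.transform(np.identity(X.shape[1]))
offset = pca.mean_ @ pca.components_.T
print(np.allclose(coef_identity, pca.components_.T))           # shifted, so no match
print(np.allclose(coef_identity + offset, pca.components_.T))  # match after undoing the shift
```

Reading the weights from `pca.components_.T` avoids this dependency on centering altogether.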
