mazieres mazieres - 1 year ago 246
Python Question

Recovering feature names of explained_variance_ratio_ in PCA with sklearn

After running a PCA with scikit-learn, I'm trying to recover which features are selected as relevant.

Here is a classic example with the iris dataset.

import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
df_norm = (df - df.mean()) / df.std()

pca = PCA(n_components=2)
# fit the model; explained_variance_ratio_ only exists after fitting
pca.fit(df_norm.values)
print(pca.explained_variance_ratio_)

This returns

In [42]: pca.explained_variance_ratio_
Out[42]: array([ 0.72770452, 0.23030523])

How can I recover which two features account for these two explained variance ratios?
Said differently, how can I get the indices of these features in iris.feature_names?

In [47]: print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Thanks in advance for your help.

Answer Source

Each principal component is a linear combination of the original variables:

PC = Beta_1 * X_1 + Beta_2 * X_2 + ... + Beta_p * X_p

where the X_i are the original variables, and the Beta_i are the corresponding weights, or so-called coefficients.

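To obtain the weights, you may simply pass an identity matrix to the transform method (this works here because the data were centered, so the fitted mean that transform subtracts is zero):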

>>> import numpy as np
>>> i = np.identity(df.shape[1])  # identity matrix
>>> i
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

>>> coef = pca.transform(i)
>>> coef
array([[ 0.5224, -0.3723],
       [-0.2634, -0.9256],
       [ 0.5813, -0.0211],
       [ 0.5656, -0.0654]])

Each column of the coef matrix above holds the weights of the linear combination that yields the corresponding principal component:

>>> pd.DataFrame(coef, columns=['PC-1', 'PC-2'], index=df.columns)
                    PC-1   PC-2
sepal length (cm)  0.522 -0.372
sepal width (cm)  -0.263 -0.926
petal length (cm)  0.581 -0.021
petal width (cm)   0.566 -0.065

[4 rows x 2 columns]
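
The same weights can also be read directly from the fitted estimator: scikit-learn stores the principal axes in pca.components_, with one row per component and one column per original feature. The following is a minimal sketch, assuming the iris data is standardized as in the question; the names X_norm, coef_direct and coef_identity are only illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

# Standardize the iris features (zero mean, unit sample standard deviation).
X = datasets.load_iris().data
X_norm = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

pca = PCA(n_components=2).fit(X_norm)

# components_ has shape (n_components, n_features); transpose it to get
# one column of weights per principal component.
coef_direct = pca.components_.T

# transform() subtracts pca.mean_ first, which is ~0 for centered data,
# so feeding it the identity matrix returns the same weights.
coef_identity = pca.transform(np.identity(X_norm.shape[1]))

print(coef_direct.shape)
print(np.allclose(coef_direct, coef_identity))
```

Reading pca.components_ avoids relying on the data being centered, which the identity-matrix approach needs in order to return the raw weights.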

For example, the table above shows that the second principal component (PC-2) is mostly aligned with sepal width, which has the highest weight in absolute value (0.926).
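
To get the index of the most heavily weighted feature for each component, one option is to take the argmax of the absolute weights. This is only a rough summary, since every component mixes all four features, and the variable name dominant is just illustrative:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = iris.data
X_norm = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

pca = PCA(n_components=2).fit(X_norm)

# For each component (one row of components_), find the feature whose
# weight is largest in absolute value.
dominant = np.abs(pca.components_).argmax(axis=1)

for k, idx in enumerate(dominant):
    print("PC-%d: index %d, %s" % (k + 1, idx, iris.feature_names[idx]))
```

Taking the absolute value matters because the sign of a principal axis is arbitrary; only the magnitude of a weight says how strongly a feature contributes.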

You can also confirm that each coefficient vector has norm 1.0, because the principal axes returned by PCA are orthonormal:

>>> np.linalg.norm(coef,axis=0)
array([ 1.,  1.])
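
To look at the norms and the variances separately, here is a small sketch (variable names are illustrative). It checks that the rows of pca.components_ are orthonormal, and compares the sample variance of each projected component with pca.explained_variance_:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data
X_norm = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_norm)

# Rows of components_ are orthonormal: unit length and mutually
# perpendicular, so their Gram matrix is the identity.
gram = pca.components_.dot(pca.components_.T)
print(np.allclose(gram, np.identity(2)))

# The sample variance (ddof=1) of each projected component matches
# explained_variance_.
score_var = scores.var(axis=0, ddof=1)
print(np.allclose(score_var, pca.explained_variance_))
```

The unit norm is a property of the axes themselves, whereas the variance carried by each component is what explained_variance_ and explained_variance_ratio_ report.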

One may also confirm that the principal components can be calculated as the dot product of the above coefficients and the original variables:

>>> np.allclose(df_norm.values.dot(coef), pca.fit_transform(df_norm.values))

Note that we need to use numpy.allclose instead of the regular equality operator because of floating-point rounding error.
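
As a small illustration of why exact equality is fragile, unrelated to PCA itself, the sketch below compares two values that are mathematically equal but differ in their floating-point representation:

```python
import numpy as np

# 0.1 + 0.2 is not exactly representable, so it differs from 0.3
# in the last bits.
a = np.array([0.1 + 0.2])
b = np.array([0.3])

exact = bool((a == b).all())   # element-wise exact comparison
close = np.allclose(a, b)      # comparison within a small tolerance

print(exact, close)
```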
