mazieres - 5 months ago 50

Python Question

I'm trying to recover, from a PCA done with scikit-learn, **which** features are selected as *relevant*.

A classic example with the iris dataset:

```
import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
df_norm = (df - df.mean()) / df.std()

# PCA
pca = PCA(n_components=2)
pca.fit_transform(df_norm.values)
print(pca.explained_variance_ratio_)
```

This returns

```
In [42]: pca.explained_variance_ratio_
Out[42]: array([ 0.72770452, 0.23030523])
```

Said differently, how can I get the indices of these features in `iris.feature_names`?

```
In [47]: print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
```

Thanks in advance for your help.

Answer

Each principal component is a linear combination of the original variables:

$$PC = \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

where the `X_i` are the original variables and the `Beta_i` (written $\beta_i$ in the formula) are the corresponding weights, also called coefficients.

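To obtain the weights, you can simply pass an identity matrix to the `transform` method. This works here because the data were centered before fitting, so `pca.mean_` is effectively zero; in general, the same weights are available as `pca.components_.T`: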

```
>>> import numpy as np
>>> i = np.identity(df.shape[1])  # identity matrix
>>> i
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])
>>> coef = pca.transform(i)
>>> coef
array([[ 0.5224, -0.3723],
       [-0.2634, -0.9256],
       [ 0.5813, -0.0211],
       [ 0.5656, -0.0654]])
```
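The identity-matrix trick relies on the fitted mean being zero, because `transform` subtracts `pca.mean_` before projecting. The sketch below is a self-contained illustration (it rebuilds the normalized iris data with NumPy rather than pandas, and the variable names are my own). It compares the trick against reading the weights from `pca.components_`, and shows the two no longer agree when the PCA is fitted on the raw, uncentered data.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

# Standardize iris as in the question (sample std, ddof=1, like pandas).
X = datasets.load_iris().data
X_norm = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

pca = PCA(n_components=2).fit(X_norm)

# Rows of components_ are the principal axes; transpose for one column per PC.
coef_direct = pca.components_.T
coef_identity = pca.transform(np.identity(X.shape[1]))

mean_is_zero = np.allclose(pca.mean_, 0.0)
same_on_centered = np.allclose(coef_direct, coef_identity)

# On uncentered data, transform() subtracts a nonzero mean first,
# so the identity trick no longer returns the weights.
pca_raw = PCA(n_components=2).fit(X)
same_on_raw = np.allclose(pca_raw.components_.T,
                          pca_raw.transform(np.identity(X.shape[1])))

print(mean_is_zero, same_on_centered, same_on_raw)
```

Reading `pca.components_.T` directly is the safer choice when it is not certain that the input was centered.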

Each column of the `coef` matrix above holds the weights of the linear combination that yields the corresponding principal component:

```
>>> pd.DataFrame(coef, columns=['PC-1', 'PC-2'], index=df.columns)
                    PC-1   PC-2
sepal length (cm)  0.522 -0.372
sepal width (cm)  -0.263 -0.926
petal length (cm)  0.581 -0.021
petal width (cm)   0.566 -0.065

[4 rows x 2 columns]
```

For example, the table above shows that the second principal component (`PC-2`) is mostly aligned with `sepal width`, which has the largest weight in absolute value, `0.926`.
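To get back to the question of indices into `iris.feature_names`, one possible approach is to take, for each component, the position of the largest absolute weight. The sketch below is only one way of reading "most relevant" and is self-contained (it refits the PCA; `top_idx` is a name chosen here for illustration).

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = iris.data
X_norm = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

pca = PCA(n_components=2).fit(X_norm)

# For each component (a row of components_), find the feature
# whose weight is largest in absolute value.
top_idx = np.abs(pca.components_).argmax(axis=1)

for k, idx in enumerate(top_idx, start=1):
    print(f"PC-{k}: index {idx} -> {iris.feature_names[idx]}")
```

Keep in mind that a component is a blend of all features: for `PC-1` three of the four weights are of similar size, so a single "top" index is a simplification there.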

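You can also confirm that each coefficient vector has norm `1.0`. This holds because the principal axes are orthonormal by construction, whether or not the data were normalized; the variance of each principal component is given by `pca.explained_variance_` and is not `1.0` in general: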

```
>>> np.linalg.norm(coef,axis=0)
array([ 1., 1.])
```
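As a quick check of what the unit norm does and does not imply, the following sketch (self-contained, refitting the same PCA on the standardized iris data) verifies that the principal axes are orthonormal and compares the variance of the projected data with `pca.explained_variance_`.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data
X_norm = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_norm)

# Orthonormal axes: the Gram matrix of the components is the identity.
gram = pca.components_ @ pca.components_.T
print(np.allclose(gram, np.identity(2)))

# Variance of each projected column (ddof=1) versus explained_variance_.
score_var = scores.var(axis=0, ddof=1)
print(score_var)
print(pca.explained_variance_)
```

The two variance arrays agree with each other, and their values reflect how much of the total variance each component captures rather than being fixed at one.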

One may also confirm that the principal components can be calculated as the dot product of the above coefficients and the original variables:

```
>>> np.allclose(df_norm.values.dot(coef), pca.fit_transform(df_norm.values))
True
```

Note that we need to use `numpy.allclose` instead of the regular equality operator because of floating-point rounding error.
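A small illustration of why exact equality is fragile for floating-point results, using values unrelated to the PCA itself (chosen here purely as an example):

```python
import numpy as np

# Two computations that are mathematically equal to 0.3 and 2.0
# but pick up rounding error in binary floating point.
a = np.array([0.1 + 0.2, np.sqrt(2.0) ** 2])
b = np.array([0.3, 2.0])

exact = a == b                # element-wise exact comparison
close = np.allclose(a, b)     # comparison within a tolerance

print(exact)
print(close)
```

`numpy.allclose` compares within a relative and an absolute tolerance (`rtol=1e-05` and `atol=1e-08` by default), which absorbs differences of this size.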