Ajay H - 1 year ago 131
Python Question

# How do I correctly use PCA followed by Logistic Regression?

In the program, I am scanning a number of brain samples taken as a time series of 40 x 64 x 64 images, one every 2.5 seconds. The number of 'voxels' (3D pixels) in each image is thus 40 * 64 * 64 = 163,840, each of which is a 'feature' for an image sample.

I thought of using Principal Component Analysis (PCA) to perform dimensionality reduction, because of the ridiculously high number of features. I am aware PCA does not actually reduce the number of features. Correct me if I'm wrong, but PCA will produce a new set of features from the original ones. The new features, however, must follow certain conditions.

I defined a method to get the number of components:

```
import numpy as np
from numpy.linalg import svd

def get_optimal_number_of_components():
    # The Gram matrix X X^T / m shares its nonzero eigenvalues with the
    # feature covariance matrix, so it can be used to count variance
    # even when there are far more features than samples.
    cov = np.dot(X, X.transpose()) / float(X.shape[0])
    U, s, v = svd(cov)

    S_nn = sum(s)

    # Find the smallest k whose leading singular values retain >= 99% variance.
    for num_components in range(1, s.shape[0] + 1):
        S_ii = sum(s[0:num_components])
        if (1 - S_ii / float(S_nn)) <= 0.01:
            return num_components

    return s.shape[0]
```

This function will return the number of components such that 99% of the variance from the original data is retained. Now, we can create these components :
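As an aside, scikit-learn (0.18 and later) can do this threshold selection internally: passing a float between 0 and 1 as `n_components` keeps just enough components to explain that fraction of the variance. A minimal sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X_demo = rng.randn(100, 50)  # synthetic stand-in for the scaled voxel matrix

# A float n_components asks PCA to keep just enough components
# to explain that fraction of the total variance.
pca = PCA(n_components=0.99)
X_reduced = pca.fit_transform(X_demo)

print(X_reduced.shape[1])                   # chosen number of components
print(pca.explained_variance_ratio_.sum())  # at least 0.99 by construction
```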

```
#Scaling the values
X = scale(X)

n_comp = get_optimal_number_of_components()
print 'optimal number of components = ', n_comp

pca = PCA(n_components = n_comp)
X_new = pca.fit_transform(X)
```

I get the optimal number of components = 1001 on running the program on this dataset. This number agrees with the plot I obtained on executing :

```
#Cumulative variance explained
var1 = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)

plt.plot(var1)
plt.title('Principal Component Analysis for Feature Selection')
plt.ylabel('Percentage of variance')
plt.xlabel('Number of components considered')

plt.show()
```

After this PCA stage is complete, I used the newly created `X_new` in place of `X` for the next stage: Logistic Regression.

```
#After PCA
# Note: in scikit-learn 0.18+ train_test_split lives in sklearn.model_selection
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.10, random_state=42)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
```

When I test for the accuracy, I get around 77.57%

But this is less than when I just analysed the middle voxels (around 9K voxels in the middle of the brain image). I was wondering if I pipelined PCA and logistic regression correctly.

I even tried this in another way, using `sklearn.pipeline`:

```
pipe = Pipeline(steps=[('pca', pca), ('logistic', classifier)])

pipe.set_params(pca__n_components = n_comp).fit(X_train, y_train)

print 'score = ', pipe.score(X_test, y_test)
```

But I got the exact same accuracy of 77.57%. Am I implementing PCA + Logistic Regression correctly? There must be something wrong, I just can't figure out what it is.
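One thing worth checking: if `X_train` in the pipeline snippet above was split from the already-transformed `X_new`, PCA is effectively applied twice, and since re-running PCA on its own output is essentially a no-op rotation, that could explain the identical scores. The usual pattern is to split the raw `X` and let the pipeline do all its fitting on the training fold only, so the test fold never influences the PCA fit. A sketch on synthetic stand-in data (names and sizes are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
X_raw = rng.randn(200, 30)            # stand-in for the raw voxel matrix
y_demo = rng.randint(0, 2, size=200)  # stand-in labels

# Split the *raw* data; scaling and PCA are fitted inside the pipeline,
# on the training fold only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_demo, test_size=0.10, random_state=42)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.99)),
    ('logistic', LogisticRegression()),
])
pipe.fit(X_tr, y_tr)
score = pipe.score(X_te, y_te)
```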

While I can't immediately find any error, you should test how the error behaves as you increase the number of components. Maybe there is some low-variance information missing that would give the logistic regression the edge it needs?

The 99% threshold in PCA is more a guideline than a hard rule.
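One way to run that sweep is to cross-validate a PCA + logistic regression pipeline at several component counts (the counts and synthetic data below are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X_demo = rng.randn(120, 40)           # stand-in data
y_demo = rng.randint(0, 2, size=120)  # stand-in labels

# Evaluate the same pipeline at several component counts and compare.
scores = {}
for k in (5, 10, 20, 40):
    pipe = Pipeline([('pca', PCA(n_components=k)),
                     ('logistic', LogisticRegression())])
    scores[k] = cross_val_score(pipe, X_demo, y_demo, cv=3).mean()

for k, s in sorted(scores.items()):
    print(k, round(s, 3))
```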

Other things you could try: instead of PCA, just remove every feature with zero (or very low) variance. DTI data often has features that never change and are therefore completely unnecessary for classification.
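This filter maps directly onto scikit-learn's `VarianceThreshold`; a minimal sketch with a deliberately constant column:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.RandomState(0)
X_demo = rng.randn(50, 5)
X_demo[:, 2] = 1.0  # a feature that never changes, like a voxel outside the brain

# Drop features whose variance does not exceed the threshold (default 0.0).
selector = VarianceThreshold(threshold=0.0)
X_filtered = selector.fit_transform(X_demo)

print(X_filtered.shape)  # the constant column is gone
```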

Try finding features that correlate strongly with your labels, and try using only those to classify.
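One way to do this in scikit-learn is univariate selection with `SelectKBest` (here scored with `f_classif`); a sketch on synthetic data where one feature is made strongly class-dependent:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
y_demo = rng.randint(0, 2, size=100)
X_demo = rng.randn(100, 10)
X_demo[:, 0] += 3 * y_demo  # make feature 0 strongly class-dependent

# Keep the k features that score highest against the labels.
selector = SelectKBest(score_func=f_classif, k=3)
X_top = selector.fit_transform(X_demo, y_demo)

print(selector.get_support(indices=True))  # feature 0 should be among them
```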

Always be careful not to overfit!

> Correct me if I'm wrong, but PCA will produce a new set of features from the original ones.

I will try to describe it as untechnically as possible.

Yes. PCA is basically a fancy axis transformation. You do get a new set of features, but they are linear combinations of the original features, ordered so that the first feature explains as much of the variance in the data as possible.

The idea is that, if your data lie near a lower-dimensional hyperplane, PCA will align its axes with that hyperplane, concentrating the signal on the first axes and leaving the last ones nearly empty.
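This can be checked numerically: for 2-D points scattered tightly around a line, nearly all the variance lands on the first principal axis. An illustrative sketch:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Points scattered tightly around the line y = 2x: essentially 1-D data in 2-D.
t = rng.randn(500)
X_demo = np.column_stack([t, 2 * t + 0.05 * rng.randn(500)])

pca = PCA(n_components=2)
pca.fit(X_demo)

# The first component carries almost all the variance;
# the second axis is left nearly empty.
print(pca.explained_variance_ratio_)
```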

PCA is a linear dimensionality-reduction method, so if the true data distribution is not linear, it can give worse results.

Also, a friend of mine worked with brain data similar to yours (many features, very few examples), and PCA almost never helped. It may be that the significant pieces of information get drowned out by the surrounding "noise".

EDIT: Typo
