Ajay H - 5 months ago 29

Python Question

In the program, I am scanning a number of brain samples taken in a time series of 40 x 64 x 64 images every 2.5 seconds. The number of 'voxels' (3D pixels) in each image is thus ~ 168,000 ish (40 * 64 * 64), each of which is a 'feature' for an image sample.

I thought of using Principle Component Analysis (PCA) because of the rediculously high

`n`

I definined a method to get the number of components :

`def get_optimal_number_of_components():`

cov = np.dot(X,X.transpose())/float(X.shape[0])

U,s,v = svd(cov)

S_nn = sum(s)

for num_components in range(0,s.shape[0]):

temp_s = s[0:num_components]

S_ii = sum(temp_s)

if (1 - S_ii/float(S_nn)) <= 0.01:

return num_components

return s.shape[0]

This function will return the number of components such that 99% of the variance from the original data is retained. Now, we can create these components :

`#Scaling the values`

X = scale(X)

n_comp = get_optimal_number_of_components()

print 'optimal number of components = ', n_comp

pca = PCA(n_components = n_comp)

X_new = pca.fit_transform(X)

I get the optimal number of components = 1001 on running the program on this dataset. This number agrees with the plot I obtained on executing :

`#Cumulative Variance explains`

var1 = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)

plt.plot(var1)

plt.title('Principle Component Analysis for Feature Selection')

plt.ylabel('Percentage of variance')

plt.xlabel('Number of voxels considered')

plt.show()

After this PCA stage is complete, I used the newly created 'X_new' in place of X for the next stage : Logistic Regression

`#After PCA`

from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.10, random_state=42)

classifier = LogisticRegression()

classifier.fit(X_train,y_train)

When I test for the accuracy, I get around

But this is less than when I just analysed the middle voxel samples (like 9K samples in the middle of brain image). I was wondering if I pipelined PCA and logistic regression correctly.

I even tried this in another method using

`sklearn.pipeline`

`pipe = Pipeline(steps=[('pca', pca), ('logistic', classifier)])`

pipe.set_params(pca__n_components = n_comp).fit(X_train,y_train)

print 'score = ', pipe.score(X_test,y_test)

But I got the exact same accuracy of 77.57%. Am I implementing PCA + Logistic Regression Correctly? There must be something wrong, I just can't figure out what it is.

Answer

While I can't immediately find any error, you should try and test how the error behaves if you increase the `number of components`

. Maybe there is some low variance information missing to give the logistic regression the edge it needs?

The `99%`

in PCA are more a guideline than fact.

Other things you could try:
Instead of PCA just remove every feature with *zero (or very low) variance*. DTI data often has features that never change, and are therefore completely unnecessary for classification.

Try finding features that correlate strongly with your result and try only using these to classify.

Always be careful **not to overfit!**

Correct me if I'm wrong, but PCA will produce a new set of features from the original ones.

I will try to describe it as untechnical as possible

Yes. PCA is basically a fancy axis transformation. Yes you get a new set of features, but they are linear combinations of the previous features in a ordered way such that the first feature describes as much as possible of the data.

The idea is that, if you have a hyperplane, PCA will actually *project the hyperplane* to the first axes and leave the last ones nearly empty.

PCA is **linear** dimensionality reduction, so if the true data distribution is not linear, it gives worse results.

Also a friend of mine worked with Brain data similar to yours (lots of features, very little examples) and PCA almost never helped. It might be that the significant pieces of information are not found because too much "noise" is around.

EDIT: Typo