Moveton - 1 month ago
Python Question

Python - machine learning

Currently I am trying to understand how machine learning algorithms work, and one thing I don't really get is the obvious difference between the calculated accuracy of the predicted labels and the visual confusion matrix. I will try to explain this as clearly as possible.

Here is a snippet of the dataset (9 samples shown, out of about 4k in the real dataset; 6 features and 9 labels, where the labels are categorical identifiers, not numbers, and cannot be ordered like 7 > 4 > 1):

f1      f2     f3      f4      f5   f6   label
89.18   0.412   9.1    24.17   2.4  1    1
90.1    0.519  14.3    16.555  3.2  1    2
83.42   0.537  13.3    14.93   3.4  1    3
64.82   0.68    9.1     8.97   4.5  2    4
34.53   0.703   4.9     8.22   3.5  2    5
87.19   1.045   4.7     5.32   5.4  2    6
43.23   0.699  14.9    12.375  4.0  2    7
43.29   0.702   7.3     6.705  4.0  2    8
20.498  1.505   1.321   6.4785 3.8  2    9


Out of curiosity I tried a number of algorithms (linear, Gaussian, SVM (SVC, SVR), Bayesian, etc.). As far as I understood the documentation, in my case it is better to work with classifiers (discrete labels) rather than regression (continuous output). Using the common pattern:

model.fit(X_train, y_train)
model.score(X_test, y_test)


I got:

Lin_Reg: 0.855793988736
Log_Reg: 0.463251670379
DTC: 0.400890868597
KNC: 0.41425389755
LDA: 0.550111358575
Gaus_NB: 0.391982182628
Bay_Rid: 0.855698151574
SVC: 0.483296213808
SVR: 0.647914795849


The continuous (regression) algorithms gave better scores. When I built a confusion matrix for Bayesian Ridge (I had to round the float predictions to integers) to verify its result, I got the following:

Pred  l1  l2  l3  l4  l5  l6  l7  l8  l9
True
l1    23  66   0   0   0   0   0   0   0
l2    31  57   1   0   0   0   0   0   0
l3    13  85  19   0   0   0   0   0   0
l4     0   0   0   0   1   6   0   0   0
l5     0   0   0   4   8   7   0   0   0
l6     0   0   0   1  27  36   7   0   0
l7     0   0   0   0   2  15   0   0   0
l8     0   0   0   1   1  30   8   0   0
l9     0   0   0   1   0   9   1   0   0


This made it clear to me that the 85% "accuracy" must be wrong.
How can this be explained? Is it because of the float/int conversion?

I would be thankful for any direct answer, link, etc.

Answer

You are mixing two very distinct concepts of machine learning here: regression and classification. Regression typically deals with continuous values, e.g. temperature or a stock market price. Classification, on the other hand, can tell you which bird species is in a recording; that is exactly where you would use a confusion matrix, which tells you how many times the algorithm predicted each label correctly and where it made mistakes. scikit-learn, which you appear to be using (the fit/score API), has separate sections for both.
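As a minimal sketch of the classification side (using scikit-learn with made-up labels, just for illustration), a confusion matrix and accuracy go together like this:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true and predicted class labels (categorical, unordered)
y_true = [1, 1, 2, 2, 3, 3, 3, 1]
y_pred = [1, 2, 2, 2, 3, 1, 3, 1]

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Accuracy is the fraction of exact matches: the diagonal over the total
acc = accuracy_score(y_true, y_pred)  # same as cm.trace() / cm.sum()
print(acc)
```

Off-diagonal cells show exactly which labels get confused with which, which a single score cannot tell you.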

You can use different metrics for scoring classification and regression problems, so never assume the scores are comparable. As @javad pointed out, the 'coefficient of determination' (R², which score() returns for regressors) is very different from accuracy (which it returns for classifiers). I would also recommend reading up on precision and recall.
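To see this concretely, here is a small sketch (synthetic random data with assumed shapes: 6 features, labels 1–9, mirroring your dataset) showing that the same score() call means R² for a regressor and accuracy for a classifier:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 samples, 6 features, labels 1..9
X = rng.normal(size=(200, 6))
y = rng.integers(1, 10, size=200)

# For a regressor, .score() returns R^2 (coefficient of determination)
r2 = BayesianRidge().fit(X, y).score(X, y)

# For a classifier, .score() returns accuracy (fraction of exact label matches)
acc = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)

# The two numbers measure different things and are not comparable
print(f"R^2 = {r2:.3f}, accuracy = {acc:.3f}")
```

So your 0.855 for Bayesian Ridge is an R² value, not a percentage of correctly classified samples.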

In your case you clearly have a classification problem, and it should be treated as such. Also note that f6 looks like it takes a discrete set of values.
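In fact, the confusion matrix you posted already shows why 0.855 cannot be an accuracy: plain accuracy is the diagonal of the matrix divided by the total count, which for your numbers comes out around 31%, right in line with your classifiers' scores. A quick check:

```python
import numpy as np

# The confusion matrix from the question (rows = true labels, cols = predicted)
cm = np.array([
    [23, 66,  0, 0,  0,  0, 0, 0, 0],
    [31, 57,  1, 0,  0,  0, 0, 0, 0],
    [13, 85, 19, 0,  0,  0, 0, 0, 0],
    [ 0,  0,  0, 0,  1,  6, 0, 0, 0],
    [ 0,  0,  0, 4,  8,  7, 0, 0, 0],
    [ 0,  0,  0, 1, 27, 36, 7, 0, 0],
    [ 0,  0,  0, 0,  2, 15, 0, 0, 0],
    [ 0,  0,  0, 1,  1, 30, 8, 0, 0],
    [ 0,  0,  0, 1,  0,  9, 1, 0, 0],
])

# Accuracy = correct predictions (the diagonal) / all predictions
accuracy = cm.trace() / cm.sum()
print(f"{accuracy:.3f}")  # ~0.311, nowhere near 0.855
```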

If you'd like to quickly experiment with different approaches, I can recommend e.g. H2O, which, in addition to a nice API, has a great user interface and allows for massively parallel processing. XGBoost is also excellent.
