
Python Question

I am a newbie in **Machine learning**. Recently, I have learnt how to calculate a `confusion_matrix` for the `Test set` of a `KNN Classification`, but I do not know how to calculate a `confusion_matrix` for the `Training set` of a `KNN Classification`.

How can I compute a `confusion_matrix` for the `Training set` of a `KNN Classification`?

The following code is for computing the `confusion_matrix` for the `Test set`:

```
# Split test and train data
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array(dataset.iloc[:, 1:10])  # .ix is removed in newer pandas; use .iloc
y = np.array(dataset['benign_malignant'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Define Classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(X_train, y_train)

# Predicting the Test set results
y_pred = knn.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)  # Calculate Confusion matrix for test set.
```

I am also trying to find the `confusion_matrix` for the `Training set` using `k-fold cross-validation`.

I am confused by this line: `knn.fit(X_train, y_train)`. Will I have to change this line? Where should I change the following code to compute the `confusion_matrix` for the `training set`?

```
# Applying k-fold Method
# Note: sklearn.cross_validation has been merged into sklearn.model_selection in newer scikit-learn versions
from sklearn.model_selection import StratifiedKFold

kfold = 10  # no. of folds (better to have this at the start of the code)
# Stratified KFold: this first divides the data into k folds, and it also makes sure that
# the class distribution in each fold follows the original input distribution
skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=0)

skfind = [None]*kfold  # (train indices, test indices) for each fold
cnt = 0
for train_index, test_index in skf.split(X, y):
    skfind[cnt] = (train_index, test_index)
    cnt = cnt + 1
# skfind[i][0] -> train indices, skfind[i][1] -> test indices

# Supervised Classification with k-fold Cross Validation
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

conf_mat = np.zeros((2, 2))  # Initializing the Confusion Matrix
n_neighbors = 5  # better to have this at the start of the code

# 10-fold Cross Validation
for i in range(kfold):
    train_indices = skfind[i][0]
    test_indices = skfind[i][1]

    clf = KNeighborsClassifier(n_neighbors=n_neighbors, metric='minkowski', p=2)

    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]

    # fit Training set
    clf.fit(X_train, y_train)

    # predict Test data
    y_predict_test = clf.predict(X_test)  # output is labels and not indices

    # Compute confusion matrix
    cm = confusion_matrix(y_test, y_predict_test)
    print(cm)
    # conf_mat = conf_mat + cm
```

Answer Source

You don't have to make many changes:

```
# Predicting the train set results
y_train_pred = knn.predict(X_train)
cm_train = confusion_matrix(y_train, y_train_pred)
```

Here, instead of using `X_test`, we use `X_train` for classification, and then we produce a classification matrix using the predicted classes for the training dataset and the actual classes.

The idea behind a classification matrix is essentially to find out the number of classifications falling into four categories (if `y` is binary):

- predicted True but actually False
- predicted True and actually True
- predicted False but actually True
- predicted False and actually False

So as long as you have two sets, predicted and actual, you can create the confusion matrix. All you have to do is predict the classes and use the actual classes to get the confusion matrix.
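For instance, here is a minimal sketch showing that `confusion_matrix` only needs those two arrays (the binary labels below are made up, so the numbers are purely illustrative):

```
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical binary labels: 1 = malignant, 0 = benign (example values only)
y_actual    = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_predicted = np.array([0, 1, 1, 1, 0, 0, 1, 0])

cm = confusion_matrix(y_actual, y_predicted)
print(cm)
# [[3 1]   row 0: actually False -> 3 predicted False (TN), 1 predicted True (FP)
#  [1 3]]  row 1: actually True  -> 1 predicted False (FN), 3 predicted True (TP)
```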

**EDIT**

In the cross-validation part, you can add a line `y_predict_train = clf.predict(X_train)` to calculate the confusion matrix for each iteration. You can do this because, in the loop, you initialize `clf` every time, which basically means resetting your model.

Also, in your code you are finding the confusion matrix each time, but you are not storing it anywhere; at the end you'll be left with a `cm` for just the last test set.
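Putting both points together, here is a sketch of how the cross-validation loop could look. It reuses `X`, `y`, `skfind`, and `kfold` from your code above (so it is not standalone), and the accumulator names `conf_mat_train` / `conf_mat_test` are just illustrative:

```
conf_mat_train = np.zeros((2, 2))  # running total of training-set confusion matrices
conf_mat_test = np.zeros((2, 2))   # running total of test-set confusion matrices

for i in range(kfold):
    train_indices, test_indices = skfind[i]
    clf = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
    clf.fit(X[train_indices], y[train_indices])

    # Training-set confusion matrix for this fold
    y_predict_train = clf.predict(X[train_indices])
    conf_mat_train += confusion_matrix(y[train_indices], y_predict_train)

    # Test-set confusion matrix for this fold
    y_predict_test = clf.predict(X[test_indices])
    conf_mat_test += confusion_matrix(y[test_indices], y_predict_test)

print(conf_mat_train)  # summed over all folds
print(conf_mat_test)
```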