gsamaras - 7 months ago 173

Python Question

I have 9164 points, where 4303 are labeled as the class I want to predict and 4861 are labeled as not that class. There are no duplicate points.

Following How to split into train, test and evaluation sets in sklearn?, and since my

`dataset`

    df = pd.DataFrame(dataset)
    # Shuffle, then split 60% / 20% / 20% into train / validate / test.
    train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

    train_labels = construct_labels(train)
    train_data = construct_data(train)
    test_labels = construct_labels(test)
    test_data = construct_data(test)

    # Helper functions used above.
    def predict_labels(test_data, classifier):
        labels = []
        for test_d in test_data:
            labels.append(classifier.predict([test_d]))
        return np.array(labels)

    def construct_labels(df):
        labels = []
        for index, row in df.iterrows():
            if row[2] == 'Trump':
                labels.append('Trump')
            else:
                labels.append('Not Trump')
        return np.array(labels)

    def construct_data(df):
        first_row = df.iloc[0]
        data = np.array([first_row[1]])
        for index, row in df.iterrows():
            if first_row[0] != row[0]:
                data = np.concatenate((data, np.array([row[1]])), axis=0)
        return data

and then:

    >>> classifier = SVC(verbose=True)
    >>> classifier.fit(train_data, train_labels)
    [LibSVM].......*..*
    optimization finished, #iter = 9565
    obj = -2718.376533, rho = 0.132062
    nSV = 5497, nBSV = 2550
    Total nSV = 5497
    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
      max_iter=-1, probability=False, random_state=None, shrinking=True,
      tol=0.001, verbose=True)
    >>> predicted_labels = predict_labels(test_data, classifier)
    >>> correct = 0
    >>> for p, t in zip(predicted_labels, test_labels):
    ...     if p == t:
    ...         correct = correct + 1

and I get only 943 correct labels out of 1833 (= len(test_labels)), so (1833-943)/1833*100 = 48.5% of the test labels are wrong.
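
As a quick check of that arithmetic, here is a small sketch that uses only the two counts quoted above (the variable names are illustrative):

```python
correct = 943   # correctly predicted test labels, as reported above
total = 1833    # len(test_labels)

accuracy = correct / total * 100
error_rate = (total - correct) / total * 100
print("accuracy: %.2f%%, error rate: %.2f%%" % (accuracy, error_rate))
# prints "accuracy: 51.45%, error rate: 48.55%"
```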

I suspect I am missing something big here. Maybe I should set a parameter on the classifier to make it do more refined work?

Note: This is my first time using SVMs, so anything you might take for granted, I might not have even imagined...

Attempt:

I went ahead and decreased the number of negative examples to 4303 (the same number as the positive examples). This slightly improved accuracy to 49.9%, but...

Edit after the answer:

    >>> print(clf.best_estimator_)
    SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
      decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
      max_iter=-1, probability=False, random_state=None, shrinking=True,
      tol=0.001, verbose=False)
    >>> classifier = SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
    ...     decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
    ...     max_iter=-1, probability=False, random_state=None, shrinking=True,
    ...     tol=0.001, verbose=False)
    >>> classifier.fit(train_data, train_labels)
    SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
      decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
      max_iter=-1, probability=False, random_state=None, shrinking=True,
      tol=0.001, verbose=False)

Also I tried

`clf.fit(train_data, train_labels)`

Edit with data (the data are not random):

    >>> train_data[0]
    array([ 20.21062112, 27.924016 , 137.13815308, 130.97432804,
    ... # there are 256 coordinates in total
    67.76352596, 56.67798138, 104.89566517, 10.02616417])
    >>> train_labels[0]
    'Not Trump'
    >>> train_labels[1]
    'Trump'

Answer

Most estimators in scikit-learn, such as SVC, are initialized with a number of input parameters, also known as hyperparameters. Depending on your data, you will have to figure out which values to pass to the estimator during initialization. If you look at the SVC documentation in scikit-learn, you will see that it accepts several different input parameters.

For simplicity, let's consider kernel, which can be 'rbf' or 'linear' (among a few other choices), and C, which is a penalty parameter. Say you want to try the values 0.01, 0.1, 1, 10, 100 for C. With 2 kernels and 5 values of C, that gives 10 different possible models to create and evaluate.

One simple solution is to write two nested for-loops, one for kernel and the other for C, create the 10 possible models, and see which one performs best. However, if you have several hyperparameters to tune, you have to write several nested for-loops, which can be tedious.
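
The nested-loop approach can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the `sklearn.model_selection` module, synthetic data from `make_classification` in place of `train_data` / `train_labels`, and 3-fold cross-validated accuracy as the score, which is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; swap in train_data / train_labels).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

results = {}
for kernel in ['rbf', 'linear']:
    for C in [0.01, 0.1, 1, 10, 100]:
        model = SVC(kernel=kernel, C=C, class_weight='balanced')
        # Mean accuracy over 3 cross-validation folds for this combination.
        results[(kernel, C)] = cross_val_score(model, X, y, cv=3).mean()

# Pick the (kernel, C) pair with the highest mean score.
best_kernel, best_C = max(results, key=results.get)
print(len(results), "models tried; best:", best_kernel, best_C)
```

Every additional hyperparameter adds another loop level, which is the tedium this paragraph refers to.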

Luckily, scikit-learn has a better way to create different models based on different combinations of values for your hyperparameters and choose the best one. For that, you use GridSearchCV. GridSearchCV is initialized using two things: an instance of an estimator, and a dictionary of hyperparameters and the values to examine for each. It will then create all possible models given the choices of hyperparameters, evaluate each with cross-validation, and find the best one, so you do not need to write any nested for-loops. Here is an example:

```
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in versions before 0.18
from sklearn.svm import SVC

print("Fitting the classifier to the training set")
# Every combination of C and kernel is tried: 5 x 2 = 10 candidate models.
param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'kernel': ['rbf', 'linear']}
clf = GridSearchCV(SVC(class_weight='balanced'), param_grid)
clf = clf.fit(train_data, train_labels)
print("Best estimator found by grid search:")
print(clf.best_estimator_)
```

You will need to use something similar to this example and experiment with different hyperparameters. If you cover a good variety of values for your hyperparameters, there is a good chance you will find a much better model this way.
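
As one way of experimenting further, the sketch below widens the grid with `gamma` and then checks the chosen model on held-out data. The grid values, the synthetic `make_classification` data, and the 80/20 split are all assumptions made for illustration, not recommendations for the dataset in the question.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the asker's dataset).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Hypothetical grid: gamma only affects the 'rbf' kernel, so it is
# listed in a separate sub-grid from the linear kernel.
param_grid = [
    {'kernel': ['rbf'], 'C': [1, 10, 100, 1000], 'gamma': [1e-4, 1e-3, 1e-2]},
    {'kernel': ['linear'], 'C': [0.01, 0.1, 1]},
]
search = GridSearchCV(SVC(class_weight='balanced'), param_grid, cv=3)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy: %.3f" % search.best_score_)
# score() uses the best estimator, refit on the whole training set.
print("Held-out test accuracy: %.3f" % search.score(X_test, y_test))
```

Comparing `best_score_` with the held-out score gives a rough idea of whether the selected hyperparameters generalize beyond the data used to choose them.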

It is, however, possible for GridSearchCV to take a very long time to create all these models and find the best one. A more practical approach is to use RandomizedSearchCV instead, which evaluates only a random subset of all possible hyperparameter combinations. It should run much faster if you have a lot of hyperparameters, and its best model is usually pretty good.
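
A minimal sketch of the randomized variant is shown here, again on synthetic data. The candidate values and `n_iter=8` are arbitrary assumptions; when every entry in `param_distributions` is a list, combinations are drawn without replacement from the implied grid (40 combinations in this case).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; swap in train_data / train_labels).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Hypothetical candidate values: 5 x 4 x 2 = 40 possible combinations.
param_distributions = {
    'C': [0.01, 0.1, 1, 10, 100],
    'gamma': [1e-4, 1e-3, 1e-2, 1e-1],
    'kernel': ['rbf', 'linear'],
}
# Only n_iter randomly chosen combinations are fitted and scored.
search = RandomizedSearchCV(
    SVC(class_weight='balanced'), param_distributions,
    n_iter=8, cv=3, random_state=0)
search.fit(X, y)

print("Combinations tried:", len(search.cv_results_['params']))
print("Best parameters:", search.best_params_)
```

Raising `n_iter` trades run time for a better chance of hitting a good combination.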