Hannah Montanna - 3 years ago
Python Question

ValueError: could not convert string to float: med

I am writing a very simple script. All I have to do is read data using pandas and then train a decision tree on it. The data I am using is:

https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data


The following is my script:

import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn import preprocessing
import pandas as pd
balance_data=pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
sep= ',', header= None)
#print "Dataset:: "

#df1.head()

X = balance_data.values[:, 0:5]
Y = balance_data.values[:,6]
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.2, random_state = 100)
clf_gini = DecisionTreeClassifier(criterion = "gini", random_state = 100,
max_depth=3, min_samples_leaf=5)

clf_gini.fit(X_train, y_train)


From the error, I am guessing that it couldn't convert the attribute value "med" to a float. Looking at the data, my guess is that "low" has a space before it and "med" doesn't, and that this is what confuses it, but I am not sure. Please tell me what could be wrong.
PS: The error occurs at the last line. Here is the traceback:

ValueError Traceback (most recent call last)
<ipython-input-26-b495e5f26174> in <module>()
18 max_depth=3, min_samples_leaf=5)
19 X_train[X_train != '']
---> 20 clf_gini.fit(X_train, y_train)

/home/fatima/anaconda2/lib/python2.7/site-packages/sklearn/tree/tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
788 sample_weight=sample_weight,
789 check_input=check_input,
--> 790 X_idx_sorted=X_idx_sorted)
791 return self
792

/home/fatima/anaconda2/lib/python2.7/site-packages/sklearn/tree/tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
114 random_state = check_random_state(self.random_state)
115 if check_input:
--> 116 X = check_array(X, dtype=DTYPE, accept_sparse="csc")
117 y = check_array(y, ensure_2d=False, dtype=None)
118 if issparse(X):

/home/fatima/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
400 force_all_finite)
401 else:
--> 402 array = np.array(array, dtype=dtype, order=order, copy=copy)
403
404 if ensure_2d:

ValueError: could not convert string to float: med

Answer

The dataset looks like this:

       0      1  2  3      4     5      6
0  vhigh  vhigh  2  2  small   low  unacc
1  vhigh  vhigh  2  2  small   med  unacc
2  vhigh  vhigh  2  2  small  high  unacc
3  vhigh  vhigh  2  2    med   low  unacc
4  vhigh  vhigh  2  2    med   med  unacc

All columns have the dtype object (strings). However, scikit-learn estimators such as DecisionTreeClassifier only accept numeric input (int, float), so you need to encode your data before you use it for training.
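
As a quick check, the dtypes can be inspected directly. The sketch below is only an illustration: `sample` is a hand-typed copy of the five rows shown above (an assumption, so that nothing has to be downloaded), and `non_numeric` is a hypothetical helper name.

```python
import pandas as pd

# Hand-typed sample: the five rows shown above, not the full dataset.
sample = pd.DataFrame([
    ["vhigh", "vhigh", "2", "2", "small", "low", "unacc"],
    ["vhigh", "vhigh", "2", "2", "small", "med", "unacc"],
    ["vhigh", "vhigh", "2", "2", "small", "high", "unacc"],
    ["vhigh", "vhigh", "2", "2", "med", "low", "unacc"],
    ["vhigh", "vhigh", "2", "2", "med", "med", "unacc"],
])

# Collect the columns that do not hold numbers.
non_numeric = [col for col in sample.columns
               if not pd.api.types.is_numeric_dtype(sample[col])]
print(sample.dtypes)
print("non-numeric columns:", non_numeric)

# Converting a category label to float fails, with or without spaces around it.
try:
    float("med")
except ValueError as exc:
    print("ValueError:", exc)
```

So the problem is not a stray space before "low": none of the category labels can be converted to a float.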

There are several ways to encode your data. One is label encoding; to use it, add the following lines to your code just after loading the dataset:

le = preprocessing.LabelEncoder()
balance_data = balance_data.apply(le.fit_transform)

Now the data in balance_data looks like this:

   0  1  2  3  4  5  6
0  3  3  0  0  2  1  2
1  3  3  0  0  2  2  2
2  3  3  0  0  2  0  2
3  3  3  0  0  1  1  2
4  3  3  0  0  1  2  2

where all data types are int.
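
Note that `balance_data.apply(le.fit_transform)` refits the same encoder on every column, so afterwards `le` only remembers the classes of the last column. If you need to map the integers back to the original labels, one option is to keep a separate encoder per column. This is a minimal sketch, not the only way to do it; `sample` is a hand-typed copy of the five rows above and `encoders` is a hypothetical name. Because the sample contains fewer categories than the full file, some of its codes differ from the table above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hand-typed sample: the five rows shown earlier, not the full dataset.
sample = pd.DataFrame([
    ["vhigh", "vhigh", "2", "2", "small", "low", "unacc"],
    ["vhigh", "vhigh", "2", "2", "small", "med", "unacc"],
    ["vhigh", "vhigh", "2", "2", "small", "high", "unacc"],
    ["vhigh", "vhigh", "2", "2", "med", "low", "unacc"],
    ["vhigh", "vhigh", "2", "2", "med", "med", "unacc"],
])

encoders = {}  # hypothetical name: column -> its own fitted LabelEncoder
encoded = sample.copy()
for col in sample.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(sample[col].to_numpy())

# Each encoder keeps its own sorted class list, so the mapping is recoverable.
print(list(encoders[5].classes_))
restored = encoders[5].inverse_transform(encoded[5].to_numpy())
print(list(restored))
```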

In general, you need to perform some data preprocessing before training/fitting your model. I recommend that you go through a tutorial to understand the process.
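
As one example of another preprocessing choice: label encoding assigns integers in alphabetical order (high=0, low=1, med=2), which is an arbitrary ordering. One-hot encoding avoids implying any order. The sketch below uses `pandas.get_dummies` on a hand-typed copy of the five rows above; the column names `f0` to `f5` are made up for illustration, since the file has no header.

```python
import pandas as pd

# Hand-typed sample: the five rows shown earlier, not the full dataset.
sample = pd.DataFrame([
    ["vhigh", "vhigh", "2", "2", "small", "low", "unacc"],
    ["vhigh", "vhigh", "2", "2", "small", "med", "unacc"],
    ["vhigh", "vhigh", "2", "2", "small", "high", "unacc"],
    ["vhigh", "vhigh", "2", "2", "med", "low", "unacc"],
    ["vhigh", "vhigh", "2", "2", "med", "med", "unacc"],
])

# Columns 0-5 are the features; column 6 is the class label.
features = sample.iloc[:, :6].copy()
features.columns = [f"f{i}" for i in range(6)]  # hypothetical names

# One indicator column per (feature, category) pair seen in the data.
one_hot = pd.get_dummies(features)
print(one_hot.columns.tolist())
```

The resulting frame is numeric and can be passed to `fit` in place of the label-encoded features; the class column can still be label-encoded or left as strings, since scikit-learn classifiers accept string targets.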


Your overall code will look like this:

import pandas as pd
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

balance_data = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
    sep=",", header=None)

# Encode every column's string categories as integers.
le = preprocessing.LabelEncoder()
balance_data = balance_data.apply(le.fit_transform)

# Columns 0-5 are the six features (0:5 would drop column 5); column 6 is the class.
X = balance_data.values[:, 0:6]
Y = balance_data.values[:, 6]
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=100)

clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                  max_depth=3, min_samples_leaf=5)
clf_gini.fit(X_train, y_train)
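
The script imports `accuracy_score` but stops at `fit`. A natural next step is to predict on the held-out split and score it. The sketch below does this on a synthetic stand-in so that it runs offline: the feature grid and the labelling rule (`unacc` when safety is low or persons is 2) are assumptions made up for illustration, not the real labels from car.data.

```python
from itertools import product

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: every combination of six categorical features.
levels = [
    ["vhigh", "high", "med", "low"],   # column 0
    ["vhigh", "high", "med", "low"],   # column 1
    ["2", "3", "4", "5more"],          # column 2
    ["2", "4", "more"],                # column 3
    ["small", "med", "big"],           # column 4
    ["low", "med", "high"],            # column 5
]
data = pd.DataFrame(list(product(*levels)))
# Made-up rule, NOT the real labels.
data[6] = ["unacc" if (row[5] == "low" or row[3] == "2") else "acc"
           for row in data.itertuples(index=False)]

# Label-encode each column with its own encoder.
encoded = data.apply(lambda col: LabelEncoder().fit_transform(col))

X = encoded.iloc[:, :6].to_numpy()
y = encoded.iloc[:, 6].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100)

clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                  max_depth=3, min_samples_leaf=5)
clf_gini.fit(X_train, y_train)

# Score the model on rows it has not seen during training.
y_pred = clf_gini.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("test accuracy:", acc)
```

With the real data, the last three lines can be appended to the script above unchanged.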