dukebody - 1 year ago 349

Python Question

I'm trying to convert a Pandas dataframe to a NumPy array to create a model with Sklearn. I'll simplify the problem here.

    >>> mydf.head(10)
    IdVisita
    445                             latam
    446                               NaN
    447                            grados
    448                            grados
    449                           eventos
    450                           eventos
    451    Reescribe-medios-clases-online
    454                        postgrados
    455                        postgrados
    456                        postgrados
    Name: cat1, dtype: object
    >>> from sklearn import preprocessing
    >>> enc = preprocessing.OneHotEncoder()
    >>> enc.fit(mydf)

Traceback:

    ValueError                                Traceback (most recent call last)
    <ipython-input-74-f581ab15cbed> in <module>()
          2 mydf.head(10)
          3 enc = preprocessing.OneHotEncoder()
    ----> 4 enc.fit(mydf)

    /home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit(self, X, y)
        996 self
        997 """
    --> 998 self.fit_transform(X)
        999 return self
       1000

    /home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit_transform(self, X, y)
       1052 """
       1053 return _transform_selected(X, self._fit_transform,
    -> 1054 self.categorical_features, copy=True)
       1055
       1056 def _transform(self, X):

    /home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _transform_selected(X, transform, selected, copy)
        870 """
        871 if selected == "all":
    --> 872 return transform(X)
        873
        874 X = atleast2d_or_csc(X, copy=copy)

    /home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _fit_transform(self, X)
       1001 def _fit_transform(self, X):
       1002 """Assumes X contains only categorical features."""
    -> 1003 X = check_arrays(X, sparse_format='dense', dtype=np.int)[0]
       1004 if np.any(X < 0):
       1005 raise ValueError("X needs to contain only non-negative integers.")

    /home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_arrays(*arrays, **options)
        279 array = np.ascontiguousarray(array, dtype=dtype)
        280 else:
    --> 281 array = np.asarray(array, dtype=dtype)
        282 if not allow_nans:
        283 _assert_all_finite(array)

    /home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
        460
        461 """
    --> 462 return array(a, dtype, copy=False, order=order)
        463
        464 def asanyarray(a, dtype=None, order=None):

    ValueError: invalid literal for long() with base 10: 'postgrados'

Notice that `IdVisita` is the index.

Any clues?


Answer Source

Your error is that you are calling `OneHotEncoder`, which, according to the docs, has this requirement: "The input to this transformer should be a matrix of integers". Your df, however, has a single column `cat1` of dtype `object`, which in fact holds strings.
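As a rough illustration of why the traceback ends inside `np.asarray`, the sketch below forces an object array of strings to an integer dtype, which is essentially what the old encoder did internally. The array `values` is a made-up stand-in for the question's column, not the original data.

```python
import numpy as np

# Hypothetical stand-in for the 'cat1' column: strings stored with dtype object.
values = np.array(["latam", "postgrados"], dtype=object)

try:
    # Asking for an integer array makes NumPy call int() on every element.
    np.asarray(values, dtype=int)
    message = None
except ValueError as exc:
    # A string such as 'latam' is not a valid integer literal.
    message = str(exc)

print(message)
```

The exact wording of the message depends on the Python version (`long()` on Python 2, `int()` on Python 3), but the cause is the same.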

You should use `LabelEncoder` instead:

```
In [13]:
le = preprocessing.LabelEncoder()
le.fit(df.dropna().values)
le.classes_
C:\WinPython-64bit-3.3.3.2\python-3.3.3.amd64\lib\site-packages\sklearn\preprocessing\label.py:108: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Out[13]:
array(['Reescribe-medios-clases-online', 'eventos', 'grados', 'latam',
'postgrados'], dtype=object)
```

Note that I had to drop the `NaN` row, because it introduces a mixed dtype that cannot be ordered: `NaN` is a float, and a comparison such as `float > str` will not work.
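Putting the pieces together, a minimal sketch of the approach might look like this. The Series `cat1` is a shortened, made-up version of the question's data, and the `pd.get_dummies` step is my own addition for getting indicator columns rather than something the answer above uses. Passing the 1-D Series (instead of a column vector) should also avoid the `DataConversionWarning` shown in the output.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical, shortened version of the question's data (one NaN included).
cat1 = pd.Series(
    ["latam", np.nan, "grados", "grados", "eventos", "postgrados"],
    index=pd.Index([445, 446, 447, 448, 449, 454], name="IdVisita"),
    name="cat1",
)

# Drop missing values first so every remaining entry is a str.
clean = cat1.dropna()

# Map each distinct string to an integer code (classes are sorted).
le = LabelEncoder()
codes = le.fit_transform(clean)

# Assumption: if indicator columns are the goal, pandas can build them directly.
dummies = pd.get_dummies(clean)

print(le.classes_)
print(codes)
print(dummies.shape)
```

Keep in mind that the integer codes carry an arbitrary order, so many models are better served by the indicator columns than by the raw codes.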
