LeoCella - 6 months ago 102

Python Question

I'm fitting a scikit-learn model (an `ExtraTreesRegressor`) with the aim of doing supervised feature selection.

I've made a toy example to be as clear as possible. Here is the toy code:

```
import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from itertools import chain

# Original DataFrame: column A holds one list per row
df = pd.DataFrame({"A": [[10,15,12,14],[20,30,10,43]], "R":[2,2] ,"C":[2,2] , "CLASS":[1,0]})

# Feature array and target
X = np.array([np.array(df.A), df.C , df.R])
Y = np.array(df.CLASS)

# prints
print("X",X)
print("Y",Y)
print(df)

df['A'].apply(lambda x: print("ORIGINAL SHAPE",np.array(x).shape,"field:",x))
df['A'] = df['A'].apply(lambda x: np.array(x).reshape(4,1))
df['A'].apply(lambda x: print("RESHAPED SHAPE",np.array(x).shape,"field:",x))

model = ExtraTreesRegressor()
model.fit(X,Y)  # the ValueError is raised here
model.feature_importances_
```

```
X [[[10, 15, 12, 14] [20, 30, 10, 43]]
 [2 2]
 [2 2]]
Y [1 0]
                  A  C  CLASS  R
0  [10, 15, 12, 14]  2      1  2
1  [20, 30, 10, 43]  2      0  2
ORIGINAL SHAPE (4,) field: [10, 15, 12, 14]
ORIGINAL SHAPE (4,) field: [20, 30, 10, 43]
```

This is the exception that is raised:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-37-5a36c4c17ea0> in <module>()
      7 print(df)
      8 model = ExtraTreesRegressor()
----> 9 model.fit(X,Y)

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/ensemble/forest.py in fit(self, X, y, sample_weight)
    210         """
    211         # Validate or convert input data
--> 212         X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    213         if issparse(X):
    214             # Pre-sort indices to avoid that each individual tree of the

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    371                                       force_all_finite)
    372     else:
--> 373         array = np.array(array, dtype=dtype, order=order, copy=copy)
    374
    375         if ensure_2d:

ValueError: setting an array element with a sequence.
```

I've noticed that the error involves NumPy arrays. So I tried fitting another toy dataframe, the most basic one, containing only scalars, and no error was raised. Then I kept the same code and only modified the toy dataframe by adding a field that contains one-dimensional arrays, and the same exception was raised.

I've looked around, but so far I haven't found a solution, even after trying reshapes and conversions into lists, np.array, etc., and matrices in my real problem. I'm still working in this direction.

I've also seen that this kind of problem usually arises when the arrays have different lengths between samples, but that is not the case in the toy example.

Does anyone know how to deal with these structures and this exception?

Thanks in advance for any help.

Answer

Have a closer look at your X:

```
>>> X
array([[[10, 15, 12, 14], [20, 30, 10, 43]],
       [2, 2],
       [2, 2]], dtype=object)
>>> type(X[0,0])
<class 'list'>
```

Notice that it's `dtype=object`, and one of these objects is a `list`, hence "setting an array element with a sequence". Part of the problem is that `np.array(df.A)` does not correctly create a 2D array:

```
>>> np.array(df.A)
array([[10, 15, 12, 14], [20, 30, 10, 43]], dtype=object)
>>> _.shape
(2,) # oops!
```

But using `np.stack(df.A)` fixes the problem.
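To make the failure mode concrete, here is a minimal NumPy-only sketch of what appears to happen inside `check_array`, which (according to the traceback in the question) calls `np.array(array, dtype=dtype, ...)`. The names `col_a` and `X_bad` are illustrative, and the object array is built by hand to mimic the layout of the `X` in the question rather than taken from pandas.

```python
import numpy as np

# A column of Python lists, stored as a 1-D object array
# (the same layout np.array(df.A) produces in the question).
col_a = np.empty(2, dtype=object)
col_a[0] = [10, 15, 12, 14]
col_a[1] = [20, 30, 10, 43]

# Mimic the X from the question: a (3, 2) object array whose first row holds lists.
X_bad = np.empty((3, 2), dtype=object)
X_bad[0] = col_a
X_bad[1] = [2, 2]
X_bad[2] = [2, 2]

# Converting to a numeric dtype fails because a list cannot become a single float.
try:
    np.array(X_bad, dtype=np.float32)
    converted = True
except ValueError as exc:
    converted = False
    print("ValueError:", exc)

# np.stack turns the column of equal-length lists into a proper 2-D array.
stacked = np.stack(col_a)
print(stacked.shape)  # (2, 4)
```

The conversion only works once every cell holds a scalar, which is why the lists have to be spread across their own columns first.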

Are you looking for:

```
>>> X = np.concatenate([
...     np.stack(df.A),                 # condense A to (N, 4)
...     np.expand_dims(df.C, axis=-1),  # expand C to (N, 1)
...     np.expand_dims(df.R, axis=-1),  # expand R to (N, 1)
... ], axis=-1)
>>> X
array([[10, 15, 12, 14,  2,  2],
       [20, 30, 10, 43,  2,  2]], dtype=int64)
```
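Since the goal is feature selection, it may also help to keep track of which flattened column came from which original field. The following is a rough sketch of one way to do that. `flatten_features` is a hypothetical helper (not part of pandas, NumPy, or scikit-learn), and it assumes every list in the list column has the same length.

```python
import numpy as np
import pandas as pd


def flatten_features(df, list_col, scalar_cols):
    """Return (X, names): a 2-D float array and one name per column.

    Hypothetical helper; assumes all lists in `list_col` have equal length.
    """
    wide = np.stack(df[list_col].to_numpy())   # shape (N, k)
    scalars = df[scalar_cols].to_numpy()       # shape (N, len(scalar_cols))
    X = np.column_stack([wide, scalars]).astype(float)
    names = [f"{list_col}_{i}" for i in range(wide.shape[1])] + list(scalar_cols)
    return X, names


df = pd.DataFrame({
    "A": [[10, 15, 12, 14], [20, 30, 10, 43]],
    "R": [2, 2],
    "C": [2, 2],
    "CLASS": [1, 0],
})

X, names = flatten_features(df, "A", ["C", "R"])
Y = df["CLASS"].to_numpy()

print(X.shape)  # (2, 6)
print(names)    # ['A_0', 'A_1', 'A_2', 'A_3', 'C', 'R']
```

An array of this shape, with one row per sample and only scalars in each cell, is the layout `model.fit(X, Y)` expects, and `names` can then be paired with `model.feature_importances_` to see which original field each importance belongs to.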