jbrown - 4 months ago 21

Python Question

Forgive my terminology, I'm not an ML pro. I might use the wrong terms below.

I'm trying to perform multivariable linear regression. Let's say I'm trying to work out user gender by analysing page views on a web site.

For each user whose gender I know, I have a feature matrix in which each row represents a web site section: the first element is the section number and the second is whether the user visited it, e.g.:

```
male1 = [
    [1, 1],  # visited section 1
    [2, 0],  # didn't visit section 2
    [3, 1],  # visited section 3, etc
    [4, 0]
]
```

So in scikit, I am building `xs` and `ys`.

The above would be represented as:

```
features = male1
gender = 1
```

Now, I'm obviously not just training a model for a single user, but instead I have tens of thousands of users whose data I'm using for training.

I would have thought I should create my `xs` and `ys` like this:

```
xs = [
    [  # user1
        [1, 1],
        [2, 0],
        [3, 1],
        [4, 0]
    ],
    [  # user2
        [1, 0],
        [2, 1],
        [3, 1],
        [4, 0]
    ],
    ...
]
ys = [1, 0, ...]
```

scikit doesn't like this:

```
from sklearn import linear_model

clf = linear_model.LinearRegression()
clf.fit(xs, ys)
```

It complains:

`ValueError: Found array with dim 3. Estimator expected <= 2.`

How am I supposed to supply a feature matrix to the linear regression algorithm in scikit-learn?

Answer

You need to create `xs` in a different way. According to the docs:

```
fit(X, y, sample_weight=None)

Parameters:
    X : numpy array or sparse matrix of shape [n_samples, n_features]
        Training data
    y : numpy array of shape [n_samples, n_targets]
        Target values
    sample_weight : numpy array of shape [n_samples]
        Individual weights for each sample
```

Hence `xs` should be a 2D array with as many rows as users and as many columns as web site sections. Your `xs` is currently a 3D array, with shape (users, sections, 2). To reduce the number of dimensions by one, you could drop the section numbers with a list comprehension:

```
xs = [[visit for section, visit in user] for user in xs]
```

If you do so, the data you provided as an example gets transformed into:

```
xs = [[1, 0, 1, 0],  # user1
      [0, 1, 1, 0],  # user2
      ...
     ]
```

and `clf.fit(xs, ys)` should work as expected.
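For reference, here is a minimal end-to-end sketch of the list-comprehension approach. The four users and their targets are made-up toy data, and `raw_xs` is just a name chosen here for the original 3D list. It also assumes every user lists the same sections in the same order, since after flattening the column position is the only thing that identifies a section.

```python
from sklearn import linear_model

# Toy data (made up for illustration): one entry per user,
# each entry a list of [section_number, visited] pairs.
raw_xs = [
    [[1, 1], [2, 0], [3, 1], [4, 0]],  # user1
    [[1, 0], [2, 1], [3, 1], [4, 0]],  # user2
    [[1, 1], [2, 1], [3, 0], [4, 0]],  # user3
    [[1, 0], [2, 0], [3, 0], [4, 1]],  # user4
]
ys = [1, 0, 1, 0]  # one target value per user

# Keep only the visit flag; the column position now encodes the section.
xs = [[visit for section, visit in user] for user in raw_xs]

clf = linear_model.LinearRegression()
clf.fit(xs, ys)  # xs is now 2D: n_samples rows, n_features columns

n_coefficients = clf.coef_.shape[0]  # one coefficient per section
predictions = clf.predict(xs)        # one prediction per user
```

With a 2D `xs`, the model ends up with one coefficient per web site section, which is what the `[n_samples, n_features]` shape in the docs implies.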

A more efficient way to drop the extra dimension is to slice a NumPy array:

```
import numpy as np
xs = np.asarray(xs)[:,:,1]
```