mlo - 3 months ago 35

Python Question

I have a dataset where the classes are unbalanced. The classes are either '1' or '0', and the ratio of class '1' to class '0' is 5:1. How do you calculate the prediction error for each class and then rebalance the weights accordingly in sklearn with Random Forest, similar to what is described at the following link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance

Answer

You can pass a sample weights argument to the Random Forest `fit` method:

```
sample_weight : array-like, shape = [n_samples] or None
```

Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. In the case of classification, splits are also ignored if they would result in any single class carrying a negative weight in either child node.
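As a rough illustration of the parameter described above, here is a minimal sketch of passing `sample_weight` to `fit`. The toy data, the 5:1 class ratio, and the hyperparameters are assumptions made up for this example, not part of the question's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced toy data (an assumption for illustration):
# 500 samples of class 1 and 100 samples of class 0, i.e. a 5:1 ratio.
rng = np.random.RandomState(0)
y = np.array([1] * 500 + [0] * 100)
X = rng.normal(size=(600, 4))
X[y == 0] += 1.0  # shift the minority class so there is something to learn

# One weight per training sample; the minority class 0 gets weight 5.
sample_weight = np.where(y == 0, 5.0, 1.0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)
predictions = clf.predict(X)
```

The only requirement on the weights is that there is exactly one per row of `X`.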

In older versions there was a `preprocessing.balance_weights` method that generated balancing weights for given samples, such that classes become uniformly distributed. It is still there, in the internal but still usable `preprocessing._weights` module, but it is deprecated and will be removed in future versions. I don't know the exact reasons for this.
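If that function is not available in your version, the same idea (weights that make every class contribute equally) can be sketched in plain NumPy. The helper name `inverse_frequency_weights` is hypothetical, and this is not the deprecated function's actual implementation, only an inverse-class-frequency scheme under that assumption.

```python
import numpy as np

def inverse_frequency_weights(y):
    """Hypothetical helper: weight each sample inversely to its class frequency."""
    y = np.asarray(y)
    classes, inverse, counts = np.unique(
        y, return_inverse=True, return_counts=True
    )
    # Each class ends up with the same total weight: n_samples / n_classes.
    per_class = len(y) / (len(classes) * counts)
    # Map the per-class weight back onto every individual sample.
    return per_class[inverse]

# Example labels with a 5:1 ratio of class 1 to class 0 (assumed for illustration).
y = np.array([1] * 500 + [0] * 100)
weights = inverse_frequency_weights(y)
```

With these weights the total weight of class `0` equals the total weight of class `1`, which is the sense in which the classes become uniformly distributed.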

**Update**

Some clarification, as you seem to be confused. `sample_weight` usage is straightforward once you remember that its purpose is to balance the target classes in the training dataset. That is, if you have `X` as observations and `y` as classes (labels), then `len(X) == len(y) == len(sample_weight)`, and each element of the 1-d `sample_weight` array represents the weight for the corresponding `(observation, label)` pair. For your case, if class `1` is represented 5 times as often as class `0`, and you want to balance the class distributions, you could simply use

```
import numpy as np
sample_weight = np.array([5 if i == 0 else 1 for i in y])  # weight 5 for class 0, weight 1 for class 1
```

This assigns a weight of `5` to all `0` instances and a weight of `1` to all `1` instances. See the link above for a slightly more sophisticated `balance_weights` weight evaluation function.
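The question also asks about the prediction error for each class. One way to get it, sketched below under assumptions, is to read it off a confusion matrix computed on held-out data. The synthetic data, the train/test split, and the use of `compute_sample_weight("balanced", ...)` from `sklearn.utils.class_weight` (available in more recent scikit-learn releases) are choices made for this example rather than something the answer above prescribes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic 5:1 imbalanced data (an assumption for illustration).
rng = np.random.RandomState(0)
y = np.array([1] * 500 + [0] * 100)
X = rng.normal(size=(600, 4))
X[y == 0] += 1.0  # make the minority class partially separable

# Stratify so both classes appear in the held-out set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Inverse-frequency weights, one per training sample.
weights = compute_sample_weight("balanced", y_train)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train, sample_weight=weights)

# Rows of the confusion matrix are true classes, columns are predictions.
cm = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1])
# Per-class error = 1 - (correct predictions / true samples of that class).
per_class_error = 1.0 - np.diag(cm) / cm.sum(axis=1)
```

Here `per_class_error[0]` is the error rate on class `0` and `per_class_error[1]` the error rate on class `1`. If one class is still predicted much worse than the other, its weights can be adjusted and the model refit. `RandomForestClassifier` also accepts a `class_weight="balanced"` argument, which may be a simpler alternative to building the array by hand.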