Gabriel Gabriel - 13 days ago 9
Python Question

Random Forest with bootstrap = False in scikit-learn python

What does RandomForestClassifier() do if we choose bootstrap = False?

According to the definition in this link

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier


bootstrap : boolean, optional (default=True) Whether bootstrap samples
are used when building trees.


I'm asking because I want to apply a Random Forest to a time series: train on a rolling window covering dates (t-n) to t, then predict date (t+k). I'd like to know whether the following is what happens when we choose True or False:

1) If bootstrap = True, the training samples can come from any day and use any number of features. For example, a tree could get samples from day (t-15), day (t-19) and day (t-35), each with randomly chosen features, and then predict the output for date (t+1).

2) If bootstrap = False, it uses all the samples and all the features from date (t-n) to t for training, so it actually respects the date order (meaning it uses t-35, t-34, t-33, ... up to t-1), and then predicts the output for date (t+1).

If this is how bootstrap works, I would be inclined to use bootstrap = False. Otherwise it would be a bit strange (think of financial series) to ignore the consecutive days' returns and jump from day t-39 to t-19 and then to day t-15 to predict day t+1. We would be missing all the information between those days.

So... is this how Bootstrap works?

Answer

It seems like you're conflating the bootstrap of your observations with the sampling of your features. An Introduction to Statistical Learning provides a really good introduction to Random Forests.

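The benefit of random forests comes from creating a large variety of trees by sampling both observations and features. The bootstrap parameter only controls the sampling of observations. With bootstrap = True, each tree is trained on a sample of the rows drawn with replacement. With bootstrap = False, scikit-learn does not resample at all: every tree is built on the whole training set, and the variety between trees comes only from the feature sampling. Neither setting makes the forest aware of date order, because each row is treated as an independent sample.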
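As a rough check of what the bootstrap flag does to the rows each tree sees, here is a minimal sketch on synthetic data (the data, the helper name per_tree_train_accuracy and the settings are made up for illustration). A fully grown tree that saw every training row reproduces the training labels exactly, while a tree trained on a bootstrap sample misses some of the rows it never saw.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic, noisy data (an assumption for illustration, not the asker's series).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)


def per_tree_train_accuracy(bootstrap):
    """Fit a forest and return each tree's accuracy on the full training set."""
    forest = RandomForestClassifier(
        n_estimators=25, bootstrap=bootstrap, random_state=0
    )
    forest.fit(X, y)
    # The sub-trees predict encoded class indices; with labels 0/1 these
    # coincide with the original labels, so a direct comparison is valid.
    return np.array([np.mean(tree.predict(X) == y) for tree in forest.estimators_])


acc_whole = per_tree_train_accuracy(bootstrap=False)  # every tree sees all rows
acc_boot = per_tree_train_accuracy(bootstrap=True)    # rows drawn with replacement

print("bootstrap=False, min per-tree train accuracy:", acc_whole.min())
print("bootstrap=True,  max per-tree train accuracy:", acc_boot.max())
```

With bootstrap=False every tree scores perfectly on the training rows, which is consistent with each tree having been built on the whole dataset.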

You control how many features are sampled with max_features, given either as a fraction of the features (a float) or as an integer count. The features are sampled again at each split, and this is a parameter you would typically tune to find the best value.
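To see how the different forms of max_features translate into a number of candidate features per split, one could inspect the fitted trees. This is only a sketch: the dataset and the helper name features_per_split are invented for the example, and it relies on the max_features_ attribute that fitted scikit-learn decision trees expose.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy dataset with 10 features (assumed for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)


def features_per_split(max_features):
    """Return how many features a tree in the forest considers at each split."""
    forest = RandomForestClassifier(
        n_estimators=10, max_features=max_features, random_state=0
    ).fit(X, y)
    # max_features_ is the resolved integer value stored on each fitted tree.
    return forest.estimators_[0].max_features_


# An integer count, a fraction, the "sqrt" rule, and None (all features).
per_split = {setting: features_per_split(setting) for setting in (3, 0.5, "sqrt", None)}
print(per_split)  # with 10 features: 3 -> 3, 0.5 -> 5, "sqrt" -> 3, None -> 10
```

In practice the value would usually be chosen by cross-validation rather than fixed by hand.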

With bootstrap sampling, it's fine that each tree doesn't see every day when it is built. That's where the value of RF comes from. Each individual tree will be a pretty bad predictor, but when you average together the predictions from hundreds or thousands of trees you'll (probably) end up with a good model.
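The averaging step can be made concrete with a small sketch. The data below is synthetic and the split into "earlier" training rows and "later" test rows is just a stand-in for a rolling window; none of it comes from the question. The sketch averages the per-tree class probabilities by hand and compares the result with what the forest itself returns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; rows 0..299 play the role of the past, 300..399 the future.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=400) > 0).astype(int)

X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

forest = RandomForestClassifier(n_estimators=300, bootstrap=True, random_state=0)
forest.fit(X_train, y_train)

# Class probabilities from each individual tree: shape (n_trees, n_test, n_classes).
tree_probas = np.array([tree.predict_proba(X_test) for tree in forest.estimators_])

# Averaging over the tree axis gives the forest's predicted probabilities.
averaged = tree_probas.mean(axis=0)

# Accuracy of each tree alone versus the averaged forest (labels are 0/1,
# so the argmax column index is the predicted label).
tree_accuracies = [(p.argmax(axis=1) == y_test).mean() for p in tree_probas]
forest_accuracy = forest.score(X_test, y_test)

print("mean single-tree accuracy:", np.mean(tree_accuracies))
print("forest accuracy:          ", forest_accuracy)
```

On data like this the forest typically scores higher than the average single tree, though the size of the gap depends on the data and the settings.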
