lspinheiro lspinheiro - 1 month ago 20
Python Question

Validation in online/streaming learning

I have to train a classification model on data that is too big to fit in memory, and I'm using scikit-learn and pandas for the analysis. My problem: how do I use validation for hyperparameter tuning in an online learning pipeline?

I'm streaming data from a SQL database using pandas read_sql_query with chunksize, and training with sklearn's SGDClassifier via partial_fit. Here is an example:

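import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
for chunk in pd.read_sql_query("""
        select *
        from table;
        """,
        con=conn,
        chunksize=n):

    # preprocess chunk
    # .
    # .
    # .
    # X, y: features and labels extracted from the preprocessed chunk;
    # all_classes: every class label, required on the first partial_fit call
    clf.partial_fit(X, y, classes=all_classes)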


My question is: What is the best approach to do validation in a setting like this?

Answer

Validation (for tuning or otherwise) is actually quite natural for streams.

Say this is the logical representation of the stream

|-------------------------------------------------------------------------->

It starts at the left, and elements are added as you move right. Since this is a streaming setting, assume the whole stream cannot fit in memory.

At step i you have this chunk in memory

|--------------------(cccccccccccc)---------------------------------------->

If you decide in advance on the sizes of the train (r) and test (t) parts, the chunk looks something like this:

|--------------------(rrrrrrrrrrrtt)---------------------------------------->

At this point, you may learn only from the r elements, and you evaluate the model on the t elements.

At step i + 1, some of the t elements become r elements, and you must discard some of the old r elements (storing aggregates of the data is perfectly fine, though).
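Here is a minimal sketch of that scheme, not a definitive implementation. It assumes synthetic data from make_classification stands in for the SQL stream, and the names CHUNK, N_TRAIN and scores are made up for illustration. Each chunk is split into r rows and t rows; the model is scored on the t rows before it has seen them, and only afterwards are they folded into training, which is one way to let the t elements become r elements.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for the stream (assumption: real chunks would come
# from pd.read_sql_query(..., chunksize=n) and be preprocessed first).
X_all, y_all = make_classification(n_samples=5000, n_features=20, random_state=0)
classes = np.unique(y_all)  # partial_fit needs all class labels up front

CHUNK = 500    # rows held in memory per step (hypothetical value)
N_TRAIN = 400  # r rows per chunk; the remaining 100 rows are t rows

clf = SGDClassifier(random_state=0)
scores = []

for start in range(0, len(X_all), CHUNK):
    X_chunk = X_all[start:start + CHUNK]
    y_chunk = y_all[start:start + CHUNK]

    # Split the in-memory chunk into train (r) and test (t) parts.
    X_r, y_r = X_chunk[:N_TRAIN], y_chunk[:N_TRAIN]
    X_t, y_t = X_chunk[N_TRAIN:], y_chunk[N_TRAIN:]

    # Learn only from the r rows ...
    clf.partial_fit(X_r, y_r, classes=classes)
    # ... and check the model on t rows it has not seen yet.
    scores.append(clf.score(X_t, y_t))

    # Before moving on, the t rows become training data. The chunk itself
    # is then dropped; only the model weights (an aggregate) are kept.
    clf.partial_fit(X_t, y_t)

print("validation accuracy per step:", np.round(scores, 3))
```

The per-step scores form a running validation curve, so a drop in later steps can hint at drift in the stream rather than at a bad hyperparameter.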

Don't forget to set aside some data that is never used for training or tuning, so that the final test is clean.
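For the hyperparameter tuning part of the question, one possible approach (an assumption, not something the answer prescribes) is to keep several candidate models alive on the same stream, compare them on the t rows, and touch the clean test data only once at the end. The sketch below uses synthetic data, treats the last N_FINAL rows as the clean test set, and tunes only alpha; all names and values are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for the stream (assumption, as is every constant below).
X_all, y_all = make_classification(n_samples=5000, n_features=20, random_state=0)
classes = np.unique(y_all)

# Reserve the tail of the stream as a clean test set that is never used
# for training or for choosing hyperparameters.
N_FINAL = 1000
X_stream, y_stream = X_all[:-N_FINAL], y_all[:-N_FINAL]
X_final, y_final = X_all[-N_FINAL:], y_all[-N_FINAL:]

CHUNK = 500
N_TRAIN = 400

# One incremental model per candidate value of the regularization strength.
candidates = {
    alpha: SGDClassifier(alpha=alpha, random_state=0)
    for alpha in (1e-5, 1e-4, 1e-3, 1e-2)
}
val_scores = {alpha: [] for alpha in candidates}

for start in range(0, len(X_stream), CHUNK):
    X_chunk = X_stream[start:start + CHUNK]
    y_chunk = y_stream[start:start + CHUNK]
    X_r, y_r = X_chunk[:N_TRAIN], y_chunk[:N_TRAIN]
    X_t, y_t = X_chunk[N_TRAIN:], y_chunk[N_TRAIN:]

    for alpha, clf in candidates.items():
        clf.partial_fit(X_r, y_r, classes=classes)   # learn from the r rows
        val_scores[alpha].append(clf.score(X_t, y_t))  # validate on the t rows
        clf.partial_fit(X_t, y_t)                    # t rows then become r rows

# Pick the candidate with the best mean validation accuracy ...
best_alpha = max(val_scores, key=lambda a: np.mean(val_scores[a]))
# ... and only now evaluate it once on the untouched test rows.
final_score = candidates[best_alpha].score(X_final, y_final)
print("best alpha:", best_alpha, "clean test accuracy:", round(final_score, 3))
```

Every candidate sees each chunk exactly once, so the cost is one pass over the stream times the number of candidates, and memory use stays at one chunk plus the model weights.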