Arslán Arslán - 6 months ago 41
Python Question

Mining massive data sets in Python

I have a dataset of over 5GBs. Is there a way I can train my model with this data chunk by chunk in a Stochastic Gradient Descent kind of way? In other words, break the set in 5 chunks of 1 GB each, and then train parameters.

I want to do this in a Python environment.


Yes, you can. SGD in scikit learn has partial fit ; use it with your chunks

partial_fit(X, y[, classes, sample_weight]) Fit linear model with Stochastic Gradient Descent.