Arslán Arslán - 4 months ago
Python Question

Mining massive data sets in Python

I have a dataset of over 5 GB. Is there a way I can train my model with this data chunk by chunk, in a Stochastic Gradient Descent kind of way? In other words, break the set into five chunks of 1 GB each, and then train the parameters incrementally.

I want to do this in a Python environment.

Answer

Yes, you can. The SGD estimators in scikit-learn have a partial_fit method; call it once per chunk to train incrementally:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier

partial_fit(X, y[, classes, sample_weight]): Fit linear model with Stochastic Gradient Descent.
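Here is a minimal sketch of what that chunked training loop could look like. It assumes (these details are not from the question) that the data is in a CSV file called "data.csv", that the label column is named "label", and that all features are numeric and the labels are binary.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()

# partial_fit needs the full set of classes on the first call,
# because no single chunk is guaranteed to contain every label.
all_classes = np.array([0, 1])  # assumed binary labels

# Stream the file in ~100k-row chunks instead of loading 5 GB at once.
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    X = chunk.drop(columns=["label"]).to_numpy()
    y = chunk["label"].to_numpy()
    clf.partial_fit(X, y, classes=all_classes)
```

The chunk size is a trade-off: larger chunks mean fewer Python-level iterations, smaller chunks keep memory use low. Anything that fits comfortably in RAM works, since partial_fit only needs to see one chunk at a time.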