
sklearn and large datasets

I have a dataset of 22 GB that I would like to process on my laptop. Of course I can't load it all into memory.

I use sklearn a lot, but only for much smaller datasets.

In this situation the classical approach would be something like:

read part of the data -> partially train the estimator -> delete the data -> read the next part -> continue training the estimator.

I have seen that some sklearn estimators have a partial_fit method that should allow training on successive subsamples of the data.

Now I am wondering: is there an easy way to do that in sklearn?
I am looking for something like:

r = read_part_of_data('data.csv')
m = sk.my_model
for i in range(n):
    x = r.read_next_chunk(20)
    m.partial_fit(x)

Maybe sklearn is not the right tool for this kind of thing?
Let me know.

Answer Source

I think sklearn is fine for larger data. If your chosen algorithms support partial_fit or an online learning approach, then you're on track. One thing to be aware of is that the chunk size may influence your results.
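A minimal sketch of what chunked training with partial_fit can look like, assuming a purely numeric CSV with the label in the last column; the file name, model choice (SGDClassifier), and chunk size here are illustrative, not prescriptive:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Create a small synthetic CSV to stand in for the real 22 GB file
# (assumption: features first, integer label in the last column).
rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
pd.DataFrame(np.column_stack([X, y])).to_csv('data.csv', index=False)

# SGDClassifier is one of the estimators that supports partial_fit,
# so we can stream the file in chunks and never hold it all in memory.
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit needs the full class list up front

for chunk in pd.read_csv('data.csv', chunksize=200):
    X_chunk = chunk.iloc[:, :-1].values
    y_chunk = chunk.iloc[:, -1].values.astype(int)
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

print(clf.score(X, y))
```

Note that `classes` must be passed on the first call, since any single chunk may not contain every label.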

This link may be useful... Working with big data in python and numpy, not enough ram, how to save partial results on disc?

I agree that h5py is useful but you may wish to use tools that are already in your quiver.
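For example, if numpy is already in your quiver, np.memmap gives you disk-backed arrays without adding a dependency; a sketch, assuming a raw binary file of known dtype and shape:

```python
import numpy as np

# Write an array to disk once, then memory-map it so that only the
# slices you actually touch are loaded into RAM.
arr = np.arange(10_000, dtype=np.float32).reshape(1000, 10)
arr.tofile('data.bin')

mm = np.memmap('data.bin', dtype=np.float32, mode='r', shape=(1000, 10))
chunk = np.array(mm[0:200])  # materialize just the first 200 rows
print(chunk.shape)
```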

Another thing you can do is randomly pick whether or not to keep each row of your csv file, and save the result to a .npy file so it loads quicker. That way you get a sample of your data that lets you start playing with all the algorithms, and deal with the bigger-data issue along the way (or not at all! Sometimes a sample with a good approach is good enough, depending on what you want).
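The sampling idea can be sketched like this, assuming a purely numeric CSV with a header row; the file names and keep-probability are illustrative:

```python
import numpy as np

# Keep each row with probability p, then save the sample as .npy so it
# reloads far faster than re-parsing the text file.
p = 0.1
rng = np.random.RandomState(42)

# Toy CSV standing in for the real large file.
with open('big.csv', 'w') as f:
    f.write('a,b\n')
    for i in range(1000):
        f.write(f'{i},{i * 2}\n')

rows = []
with open('big.csv') as f:
    next(f)  # skip the header
    for line in f:
        if rng.rand() < p:  # keep roughly p of the rows
            rows.append([float(v) for v in line.split(',')])

sample = np.array(rows)
np.save('sample.npy', sample)
reloaded = np.load('sample.npy')
print(reloaded.shape)
```

Reading line by line keeps memory flat no matter how large the source file is; only the retained sample ever lives in RAM.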