CowNorris CowNorris - 2 months ago 15
Python Question

Design pattern for handling large datasets for machine learning

I'm currently attempting to scrape data from websites and building a large (and potentially growing with time) dataset from it. I'm wondering if there's any good practices to adopt when processing, saving and loading large datasets.

More concretely, what should I do when the dataset I want to save is too large to store in RAM, then writing to disk in one go; and writing it one data-point at a time is too inefficient? Is there an approach smarter than writing to file a moderately-sized-batch at a time?

Thank you for your time!

Answer Source

Sure, use a database.

You should probably take a look at MongoDB or elasticSearch since what you store seems to be documents and not relational data.