bluesummers - 1 month ago
Python Question

Caching CSV-read data with pandas for multiple runs

I'm trying to apply machine learning (Python with scikit-learn) to a large dataset stored in a CSV file of about 2.2 gigabytes.

As this is a partially empirical process, I need to run the script numerous times, which means the

pandas.read_csv()

function gets called over and over again, and each call takes a long time.

Obviously, this is very time consuming, so I assume there must be a way to make reading the data faster - for example, storing it in a different format or caching it in some way.

A code example in the solution would be great!

Answer

I would store the already-parsed DataFrame in one of the following binary formats:

- Pickle (to_pickle / read_pickle)
- HDF5 (to_hdf / read_hdf)
- Feather (to_feather / read_feather)

All of them are very fast compared to re-parsing the CSV on every run.
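A minimal caching sketch using the pickle format (the file paths here are illustrative, not from the question): parse the CSV once, save the DataFrame to a binary cache, and reuse the cache on every subsequent run.

```python
import os
import pandas as pd

CSV_PATH = "data.csv"     # hypothetical path to the large CSV
CACHE_PATH = "data.pkl"   # hypothetical cache file

def load_data():
    # Reuse the fast binary cache if it exists;
    # otherwise parse the CSV once and write the cache.
    if os.path.exists(CACHE_PATH):
        return pd.read_pickle(CACHE_PATH)
    df = pd.read_csv(CSV_PATH)
    df.to_pickle(CACHE_PATH)
    return df
```

The same pattern works with to_hdf/read_hdf or to_feather/read_feather; only the save and load calls change. Delete the cache file whenever the underlying CSV changes.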

PS: it's important to know what kind of data (what dtypes) you are going to store, because the choice of dtypes can affect both speed and memory usage dramatically.
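For example, declaring dtypes up front skips pandas' type inference and can shrink memory use substantially (the column names and values below are made up for illustration):

```python
import io
import pandas as pd

# A tiny in-memory CSV stand-in; in practice this would be the 2.2 GB file.
csv = io.StringIO("user_id,price,country\n1,9.99,US\n2,4.50,DE\n")

# Passing dtype= to read_csv avoids inference; 'category' suits
# string columns with few distinct values, and 32-bit numeric
# types halve the memory of the default 64-bit ones.
dtypes = {"user_id": "int32", "price": "float32", "country": "category"}
df = pd.read_csv(csv, dtype=dtypes)
```

The same dtype choices carry over to whichever binary format you cache in, so the savings apply on every subsequent load as well.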