bluesummers - 1 year ago
Python Question

Caching CSV-read data with pandas for multiple runs

I'm trying to apply machine learning (Python with scikit-learn) to a large dataset stored in a CSV file of about 2.2 gigabytes.

As this is a partially empirical process, I need to run the script numerous times, which means `pd.read_csv` gets called over and over again on the same file.

Obviously, this is very time-consuming, so I guess there must be a way to make reading the data faster, such as storing it in a different format or caching it in some way.

Code example in the solution would be great!

Answer Source

I would store the already-parsed DataFrames in one of the following binary formats:

- Pickle (`df.to_pickle()` / `pd.read_pickle()`)
- HDF5 (`df.to_hdf()` / `pd.read_hdf()`)
- Feather (`df.to_feather()` / `pd.read_feather()`)

All of them are very fast.
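A minimal sketch of the cache-on-first-run pattern, using pickle as the binary format (the file names `data.csv` and `data.pkl` are hypothetical placeholders for your own paths):

```python
import os

import pandas as pd

CSV_PATH = "data.csv"    # hypothetical: the original 2.2 GB CSV
CACHE_PATH = "data.pkl"  # hypothetical: fast binary cache next to it


def load_data():
    # On later runs, reuse the binary cache instead of re-parsing the CSV.
    if os.path.exists(CACHE_PATH):
        return pd.read_pickle(CACHE_PATH)
    # First run: parse the CSV once, then write the cache for next time.
    df = pd.read_csv(CSV_PATH)
    df.to_pickle(CACHE_PATH)
    return df
```

Swapping pickle for HDF5 or Feather only changes the `to_*`/`read_*` pair; the cache-check logic stays the same.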

PS: it's important to know what kind of data (which dtypes) you are going to store, because it can affect the speed dramatically.
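For example, declaring dtypes up front skips pandas' type inference and shrinks the frame, so the cached file is smaller and reloads faster. A small hedged sketch (the column names and the tiny inline CSV are made up for illustration; in practice this would be the 2.2 GB file):

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV file.
csv_text = "user_id,country,score\n1,US,0.5\n2,DE,0.7\n"

# Compact dtypes: 32-bit numbers instead of the default 64-bit,
# and a categorical for a low-cardinality string column.
dtypes = {"user_id": "int32", "country": "category", "score": "float32"}

df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(df.memory_usage(deep=True).sum())  # noticeably smaller than with inferred dtypes
```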
