I have a 1.5 GB list of pandas DataFrames.
Which is the better approach for loading this data:
pickle (via cPickle), HDF5, or something else in Python?
First, it is fine if "dumping" the data takes a long time; I only do this once.
I am also not concerned with file size on disk.
What I am concerned about is loading the data into memory as quickly as possible.
I would consider only two storage formats: HDF5 (PyTables) and Feather (currently available only for Linux and Mac).
Note this caveat from the Feather documentation: "Feather is not designed for long-term data storage. At this time, we do not guarantee that the file format will be stable between versions. Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis."
Here are the results of my read and write comparison for a DataFrame (shape: 4000000 x 6, size in memory: 183.1 MB, size of uncompressed CSV: 492 MB).
Comparison for the following storage formats (CSV, CSV.gzip, Pickle, HDF5 [various compression]):
    storage           read_s  write_s  size_ratio_to_CSV
    CSV               17.900    69.00              1.000
    CSV.gzip          18.900   186.00              0.047
    Pickle             0.173     1.77              0.374
    HDF_fixed          0.196     2.03              0.435
    HDF_tab            0.230     2.60              0.437
    HDF_tab_zlib_c5    0.845     5.44              0.035
    HDF_tab_zlib_c9    0.860     5.95              0.035
    HDF_tab_bzip2_c5   2.500    36.50              0.011
    HDF_tab_bzip2_c9   2.500    36.50              0.011
But it might be different for you, because all my data was of the
datetime dtype, so it's always better to run such a comparison on your real data, or at least on similar data.
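
If you want to repeat this kind of comparison on your own data, here is a minimal sketch of how it could be set up. It is not the script that produced the numbers above. The helper names (make_frames, time_call, benchmark) and the small synthetic datetime frames are assumptions made for illustration; replace make_frames with your real list of DataFrames. The HDF5 step is skipped if PyTables is not installed.

```python
import os
import pickle
import tempfile
import time

import numpy as np
import pandas as pd


def make_frames(n_frames=3, n_rows=20_000, n_cols=6, seed=0):
    """Build a list of datetime-only DataFrames as stand-in data (hypothetical)."""
    rng = np.random.default_rng(seed)
    frames = []
    for _ in range(n_frames):
        # Random epoch seconds, converted to datetimes column by column.
        secs = rng.integers(0, 10**9, size=(n_rows, n_cols))
        data = {f"c{i}": pd.to_datetime(secs[:, i], unit="s") for i in range(n_cols)}
        frames.append(pd.DataFrame(data))
    return frames


def time_call(func, repeats=3):
    """Return (best wall-clock seconds over `repeats` runs, last result of func)."""
    best, result = float("inf"), None
    for _ in range(repeats):
        t0 = time.perf_counter()
        result = func()
        best = min(best, time.perf_counter() - t0)
    return best, result


def load_pickle(path):
    with open(path, "rb") as fh:
        return pickle.load(fh)


def benchmark(frames, workdir):
    """Write the frames once per format, then time only the read step."""
    n = len(frames)
    columns = list(frames[0].columns)
    results = {}

    # Pickle: the whole list goes into a single file.
    pkl = os.path.join(workdir, "frames.pkl")
    with open(pkl, "wb") as fh:
        pickle.dump(frames, fh, protocol=pickle.HIGHEST_PROTOCOL)
    results["pickle"], loaded = time_call(lambda: load_pickle(pkl))

    # CSV: one file per frame, datetimes parsed again on read.
    for i, df in enumerate(frames):
        df.to_csv(os.path.join(workdir, f"frame{i}.csv"), index=False)
    results["csv"], _ = time_call(
        lambda: [
            pd.read_csv(os.path.join(workdir, f"frame{i}.csv"), parse_dates=columns)
            for i in range(n)
        ]
    )

    # HDF5 (fixed format): one key per frame; needs the optional PyTables package.
    try:
        h5 = os.path.join(workdir, "frames.h5")
        for i, df in enumerate(frames):
            df.to_hdf(h5, key=f"df{i}", mode="a", format="fixed")
        results["hdf_fixed"], _ = time_call(
            lambda: [pd.read_hdf(h5, key=f"df{i}") for i in range(n)]
        )
    except ImportError:
        pass  # PyTables is not available, so HDF5 is left out of the results

    return results, loaded


frames = make_frames()
with tempfile.TemporaryDirectory() as workdir:
    results, loaded = benchmark(frames, workdir)

# Print the formats from fastest to slowest read time.
for name, secs in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {secs:.4f} s")
```

The absolute timings from such a small sample will not match the table above; what matters is the relative ordering on data that has the same dtypes and size as yours.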