denvar - 1 month ago
Python Question

Which is faster to load in Python: pickle or HDF5?

I have a 1.5 GB list of pandas DataFrames.

I am wondering which approach is better for loading this data:
pickle (via cPickle), HDF5, or something else in Python?

First, it is fine if "dumping" the data takes a long time; I only do this once.

I am also not concerned with file size on disk.

Question:
What I care about is loading the data into memory as quickly as possible.

Answer

I would consider only two storage formats: HDF5 (PyTables) and Feather (currently available only for Linux and Mac).

NOTE: What should you not use Feather for?

Feather is not designed for long-term data storage. At this time, we do not guarantee that the file format will be stable between versions. Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis.

Here are the results of my read and write comparison for a DataFrame (shape: 4,000,000 x 6; size in memory: 183.1 MB; size of uncompressed CSV: 492 MB).

The comparison covers the following storage formats: CSV, CSV.gzip, Pickle, and HDF5 (with various compression settings). The read_s and write_s columns are times in seconds:

                  read_s  write_s  size_ratio_to_CSV
storage
CSV               17.900    69.00              1.000
CSV.gzip          18.900   186.00              0.047
Pickle             0.173     1.77              0.374
HDF_fixed          0.196     2.03              0.435
HDF_tab            0.230     2.60              0.437
HDF_tab_zlib_c5    0.845     5.44              0.035
HDF_tab_zlib_c9    0.860     5.95              0.035
HDF_tab_bzip2_c5   2.500    36.50              0.011
HDF_tab_bzip2_c9   2.500    36.50              0.011

Your results may differ, because all of my data was of the datetime dtype. It is always better to run such a comparison on your real data, or at least on similar data.
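If you want to repeat this kind of comparison on your own data, here is a minimal sketch of how it could be set up. It is not the exact script behind the table above. The helpers `make_frame`, `timed`, and `benchmark`, and the `include_hdf` flag, are hypothetical names made up for illustration, and the random datetime frame is only a stand-in for your real DataFrame. The HDF5 case is opt-in because it assumes PyTables is installed.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def make_frame(n_rows=100_000, n_cols=6, seed=0):
    """Build a stand-in frame of datetime columns (assumption: swap in real data)."""
    rng = np.random.default_rng(seed)
    # Random offsets in seconds from an arbitrary base date.
    secs = rng.integers(0, 10 * 365 * 24 * 3600, size=(n_rows, n_cols))
    base = pd.Timestamp("2000-01-01")
    return pd.DataFrame(
        {f"c{i}": base + pd.to_timedelta(secs[:, i], unit="s") for i in range(n_cols)}
    )


def timed(func):
    """Call func() once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


def benchmark(df, workdir, include_hdf=False):
    """Time one write and one read per format; report size relative to CSV."""
    formats = {
        "CSV": (
            lambda p: df.to_csv(p, index=False),
            lambda p: pd.read_csv(p, parse_dates=list(df.columns)),
        ),
        "Pickle": (lambda p: df.to_pickle(p), lambda p: pd.read_pickle(p)),
    }
    if include_hdf:
        # Requires PyTables; "fixed" is the uncompressed HDF5 layout in pandas.
        formats["HDF_fixed"] = (
            lambda p: df.to_hdf(p, key="df", mode="w", format="fixed"),
            lambda p: pd.read_hdf(p, key="df"),
        )

    rows = {}
    for name, (write, read) in formats.items():
        path = os.path.join(workdir, name)
        _, write_s = timed(lambda: write(path))
        _, read_s = timed(lambda: read(path))
        rows[name] = {
            "read_s": read_s,
            "write_s": write_s,
            "size_bytes": os.path.getsize(path),
        }

    out = pd.DataFrame.from_dict(rows, orient="index")
    out["size_ratio_to_CSV"] = out["size_bytes"] / out.loc["CSV", "size_bytes"]
    return out.drop(columns="size_bytes")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # Pass include_hdf=True if PyTables is available.
        print(benchmark(make_frame(), workdir))
```

A single timed run per format is noisy, and the operating system's file cache can make the read look faster than a cold read from disk, so treat the numbers as a rough guide and repeat the measurement a few times.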