Joseph Roxas - 1 month ago
Python Question

Python Pandas to_pickle cannot pickle large dataframes

I have a dataframe "DF" with 500,000 rows. Here are the data types per column:

ID int64
time datetime64[ns]
data object


Each entry in the "data" column is an array of shape [5, 500].
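For reference, a small stand-in with the same structure can be built like this (the row count, timestamps, and zero-filled arrays are just placeholders, not my real data):

import numpy as np
import pandas as pd

# Small stand-in for DF (the real one has 500,000 rows).
n_rows = 1000
DF = pd.DataFrame({
    'ID': np.arange(n_rows, dtype=np.int64),
    'time': pd.Timestamp('2015-01-01') + pd.to_timedelta(np.arange(n_rows), unit='s'),
    'data': [np.zeros((5, 500)) for _ in range(n_rows)],
})
print(DF.dtypes)  # ID int64, time datetime64[ns], data object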

When I try to save this dataframe using

DF.to_pickle("my_filename.pkl")


it returns the following error:

12 """
13 with open(path, 'wb') as f:
---> 14 pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)

OSError: [Errno 22] Invalid argument


I also tried this method, but I get the same error:

import pickle

with open('my_filename.pkl', 'wb') as f:
    pickle.dump(DF, f)


I then tried to save just 10 rows of this dataframe:

DF.head(10).to_pickle('test_save.pkl')


and there was no error at all. So it can save a small DataFrame but not a large one.

I am using Python 3 and IPython Notebook 3 on a Mac.

Please help me solve this problem. I really need to save this DataFrame to a pickle file, and I cannot find a solution on the internet.

Answer

Probably not the answer you were hoping for, but this is what I did...

Split the dataframe into smaller chunks using np.array_split (NumPy functions are not guaranteed to work on DataFrames, but array_split does today, although there used to be a bug with it); see the sketch after these steps.

Then pickle the smaller dataframes.

When you unpickle them, use DataFrame.append or pd.concat to glue everything back together.
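Here is a minimal sketch of the whole round trip (the chunk count and file names are ones I made up for illustration; pick n_chunks so each piece pickles without the error):

import numpy as np
import pandas as pd

# Split DF into chunks small enough for pickle to handle,
# and pickle each chunk to its own file.
n_chunks = 10
for i, chunk in enumerate(np.array_split(DF, n_chunks)):
    chunk.to_pickle('my_filename_{}.pkl'.format(i))

# Later: read the chunks back and glue them together.
pieces = [pd.read_pickle('my_filename_{}.pkl'.format(i)) for i in range(n_chunks)]
DF = pd.concat(pieces)

Since array_split preserves the original index, the concatenated result has the same rows and index as the original DataFrame.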

I agree it is a fudge and suboptimal. If anyone can suggest a "proper" answer I'd be interested in seeing it, but I think it is as simple as dataframes not being meant to grow above a certain size.

Split a large pandas dataframe