Alex Lenail - 9 months ago
Python Question

Does pickle randomly fail with OSError on large files?

Problem Statement



I'm using Python 3 and trying to pickle a dictionary of IntervalTrees which is something like 2 to 3 GB in size. This is my console output:

10:39:25 - project: INFO - Checking if motifs file was generated by pickle...
10:39:25 - project: INFO - - Motifs file does not seem to have been generated by pickle, proceeding to parse...
10:39:38 - project: INFO - - Parse complete, constructing IntervalTrees...
11:04:05 - project: INFO - - IntervalTree construction complete, saving pickle file for next time.
Traceback (most recent call last):
  File "/Users/alex/Documents/project/src/project.py", line 522, in dict_of_IntervalTree_from_motifs_file
    save_as_pickled_object(motifs, output_dir + 'motifs_IntervalTree_dictionary.pickle')
  File "/Users/alex/Documents/project/src/project.py", line 269, in save_as_pickled_object
    def save_as_pickled_object(object, filepath): return pickle.dump(object, open(filepath, "wb"))
OSError: [Errno 22] Invalid argument


The line in which I attempt the save is

def save_as_pickled_object(object, filepath): return pickle.dump(object, open(filepath, "wb"))


The error comes maybe 15 minutes after save_as_pickled_object is invoked (at 11:20).

I tried this with a much smaller subsection of the motifs file and it worked fine, with all of the exact same code, so it must be an issue of scale. Are there any known bugs in pickle in Python 3.6 related to the size of the object being pickled? Are there known bugs with pickling large files in general? Are there any known ways around this?

Thanks!

Update: This question might be a duplicate of Python 3 - Can pickle handle byte objects larger than 4GB?



Solution



This is the code I used instead.

import os
import pickle
import sys


def save_as_pickled_object(obj, filepath):
    """
    This is a defensive way to write pickle.dump, allowing for very large files on all platforms.
    """
    max_bytes = 2**31 - 1
    bytes_out = pickle.dumps(obj)
    n_bytes = sys.getsizeof(bytes_out)
    with open(filepath, 'wb') as f_out:
        # Write the serialized bytes in chunks smaller than 2**31 bytes,
        # avoiding the single oversized write that triggers the OSError.
        for idx in range(0, n_bytes, max_bytes):
            f_out.write(bytes_out[idx:idx+max_bytes])


def try_to_load_as_pickled_object_or_None(filepath):
    """
    This is a defensive way to write pickle.load, allowing for very large files on all platforms.
    """
    max_bytes = 2**31 - 1
    try:
        input_size = os.path.getsize(filepath)
        bytes_in = bytearray(0)
        with open(filepath, 'rb') as f_in:
            # Read the file back in chunks of the same maximum size.
            for _ in range(0, input_size, max_bytes):
                bytes_in += f_in.read(max_bytes)
        obj = pickle.loads(bytes_in)
    except Exception:
        # If the file is missing, unreadable, or not a valid pickle, fall back to None.
        return None
    return obj
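
A short usage sketch of the two helpers above; build_motifs_interval_trees and pickle_path are hypothetical stand-ins for the parsing step and output path from the question:

pickle_path = output_dir + 'motifs_IntervalTree_dictionary.pickle'

motifs = try_to_load_as_pickled_object_or_None(pickle_path)
if motifs is None:
    # Cache miss (or unreadable pickle): rebuild the IntervalTrees and save them for next time.
    motifs = build_motifs_interval_trees()  # hypothetical placeholder for the parse/construction step
    save_as_pickled_object(motifs, pickle_path)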

Answer Source

Alex, if I am not mistaken, this bug report describes your issue perfectly.

http://bugs.python.org/issue24658

As a workaround, I think you can use pickle.dumps instead of pickle.dump and then write the result to your file in chunks smaller than 2**31 bytes.
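
A minimal sketch of that workaround, assuming data is the object to save and filepath is the destination path (both names are illustrative):

import pickle

def dump_in_chunks(data, filepath, max_bytes=2**31 - 1):
    # Serialize to an in-memory bytes object first...
    buf = pickle.dumps(data)
    # ...then write it out in chunks smaller than 2**31 bytes,
    # so no single write() call exceeds the limit that raises the OSError.
    with open(filepath, 'wb') as f:
        for offset in range(0, len(buf), max_bytes):
            f.write(buf[offset:offset + max_bytes])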