
Interaction between pathos.ProcessingPool and pickle

I have a list of calculations I need to run. I'm parallelizing them using

from pathos.multiprocessing import ProcessingPool
pool = ProcessingPool(nodes=7)
values = pool.map(helperFunction, someArgs)


helperFunction creates a Parameters object; the Parameters class is defined in the same file as helperFunction:

import otherModule

class Parameters(otherModule.Parameters):
    ...


So far, so good. helperFunction does some calculations based on the Parameters object, changes some of its attributes, and finally stores them using pickle. Here is the relevant excerpt of the caching helper (from a different module) that does the saving:

import pickle
import hashlib
import os

class cacheHelper():

    def __init__(self, fileName, attr=[], folder='../cache/'):
        self.folder = folder

        if len(attr) > 0:
            # attrToName() (defined elsewhere in the class) turns the
            # attribute list into a filename suffix
            attr = self.attrToName(attr)
        else:
            attr = ''
        self.fileNameNaked = fileName
        self.fileName = fileName + attr

    def write(self, objects):
        # getFile() (defined elsewhere) returns the full path of the cache file
        with open(self.getFile(), 'wb') as output:
            for obj in objects:
                pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)


When it gets to pickle.dump(), it raises an exception that is hard to debug because the debugger won't step into the worker that actually hit it. So I set a breakpoint right before the dump happens and entered that command manually. Here is the output:

>>> pickle.dump(objects[0], output, pickle.HIGHEST_PROTOCOL)
Traceback (most recent call last):
  File "/usr/local/anaconda2/envs/myenv2/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2885, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-1-4d2cbb7c63d1>", line 1, in <module>
    pickle.dump(objects[0], output, pickle.HIGHEST_PROTOCOL)
  File "/usr/local/anaconda2/envs/myenv2/lib/python2.7/pickle.py", line 1376, in dump
    Pickler(file, protocol).dump(obj)
  File "/usr/local/anaconda2/envs/myenv2/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/local/anaconda2/envs/myenv2/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/local/anaconda2/envs/myenv2/lib/python2.7/pickle.py", line 396, in save_reduce
    save(cls)
  File "/usr/local/anaconda2/envs/myenv2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/anaconda2/envs/myenv2/lib/python2.7/site-packages/dill/dill.py", line 1203, in save_type
    StockPickler.save_global(pickler, obj)
  File "/usr/local/anaconda2/envs/myenv2/lib/python2.7/pickle.py", line 754, in save_global
    (obj, module, name))
PicklingError: Can't pickle <class '__main__.Parameters'>: it's not found as __main__.Parameters


The odd thing is that this doesn't happen when I don't parallelize, i.e. when I loop through helperFunction manually. I'm pretty sure that I'm opening the right Parameters (and not the parent class).

I know it is tough to debug things without a reproducible example, so I don't expect a solution to that part. Perhaps the more general question is:

What does one have to pay attention to when parallelizing code that uses pickle.dump() via another module?


Answer

Straight from the Python docs:

12.1.4. What can be pickled and unpickled? The following types can be pickled:

  • None, True, and False
  • integers, floating point numbers, complex numbers
  • strings, bytes, bytearrays
  • tuples, lists, sets, and dictionaries containing only picklable objects
  • functions defined at the top level of a module (using def, not lambda)
  • built-in functions defined at the top level of a module
  • classes that are defined at the top level of a module
  • instances of such classes whose __dict__ or the result of calling __getstate__() is picklable (see section Pickling Class Instances for details).

Everything else can't be pickled. In your case, though it's hard to say for sure given only an excerpt of your code, I believe the problem is that the class Parameters is not defined at the top level of the module, and hence its instances can't be pickled.
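To see the rule in action, here is a minimal, self-contained sketch (the names TopLevel, make_instance, and Nested are made up for illustration) that reproduces the same kind of PicklingError with a class that is not defined at the top level of a module:

import pickle

class TopLevel(object):
    pass

def make_instance():
    # Nested only exists inside make_instance, so pickle cannot look it
    # up as "<module>.Nested" when serializing its instances.
    class Nested(object):
        pass
    return Nested()

pickle.dumps(TopLevel())  # works: the class is importable by name

try:
    pickle.dumps(make_instance())
except (pickle.PicklingError, AttributeError) as e:
    # Python 2.7 raises PicklingError ("... it's not found as ...");
    # newer versions raise a similar error for local classes.
    print(e)

pickle stores classes by reference (module path plus class name) rather than by value, which is why the lookup fails for anything it can't re-import.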

The whole point of using pathos.multiprocessing (or its actively developed fork multiprocess) instead of the built-in multiprocessing is to avoid pickle, because there are far too many things the latter can't dump. pathos.multiprocessing and multiprocess use dill instead of pickle. And if you want to debug a worker, you can use trace.
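As a minimal sketch of the difference (square is a made-up example name): plain pickle refuses a lambda because it can only store functions by reference, while dill serializes the function body itself:

import pickle
import dill

square = lambda x: x * x

try:
    pickle.dumps(square)  # pickle stores functions by name and cannot
                          # find a lambda at the top level of a module
except pickle.PicklingError as e:
    print(e)

restored = dill.loads(dill.dumps(square))  # dill ships the code itself
print(restored(3))  # prints 9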

NOTE: As Mike McKerns (the main contributor to multiprocess) rightly pointed out, there are cases that even dill can't handle, though it would be hard to formulate universal rules on that matter.
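For instance, generators are among the types the dill documentation lists as not yet supported; a minimal sketch:

import dill

def counter():
    yield 1

try:
    dill.dumps(counter())  # generator objects are one of the cases
                           # dill cannot yet serialize
except Exception as e:
    print(e)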