Chonghao Huang - 4 months ago
Python Question

python multiprocessing.pool.map, passing arguments to spawned processes

import multiprocessing
import pickle

def content_generator(applications, dict):
    # Yield one (application name, application info) pair at a time.
    for app in applications:
        yield (app, dict[app])

with open('abc.pickle', 'rb') as f:
    very_large_dict = pickle.load(f)
all_applications = set(very_large_dict.keys())

pool = multiprocessing.Pool()
for result in pool.imap_unordered(func_process_application, content_generator(all_applications, very_large_dict)):
    pass  # do some aggregation on result


I have a very large dictionary whose keys are strings (application names) and whose values hold information about each application. Since the applications are independent, I want to use multiprocessing to process them in parallel. Parallelization works when the dictionary is not that big, but all the Python processes were killed when the dictionary was too big. I used dmesg to check what went wrong and found that they were killed because the machine ran out of memory. I ran top while the pool processes were running and found that they all occupy the same amount of resident memory (RES), 3.4G each. This confuses me, since it looks as if the whole dictionary was copied into each spawned process. I thought I had broken up the dictionary and was passing only what is relevant to each spawned process by yielding only dict[app] instead of dict. Any thoughts on what I did wrong?

Answer

The comments are becoming impossible to follow, so I'm pasting in my important comment here:

On a Linux-y system, new processes are created by fork(), so they get a copy of the entire parent-process address space at the time they're created. It's "copy on write", so it is more of a "virtual" copy than a "real" copy, but still ... ;-) For a start, try creating your Pool before creating giant data structures. Then the child processes will inherit a much smaller address space.

Then some answers to questions:

So in Python 2.7, there is no way to spawn a new process?

On Linux-y systems, no. The ability to use "spawn" on those was first added in Python 3.4. On Windows systems, "spawn" has always been the only choice (no fork() on Windows).
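
For readers on Python 3.4 or later, here is a minimal sketch of what asking for "spawn" can look like. The worker describe() and the sample data are hypothetical stand-ins, not part of the original question, and none of this applies to Python 2.7.

```python
import multiprocessing


def describe(item):
    """Hypothetical worker: receives one (name, info) pair and nothing else."""
    name, info = item
    return name, len(info)


def run_with_spawn(data):
    # "spawn" starts a fresh interpreter for each worker instead of fork()ing
    # the parent, so workers do not inherit the parent's address space.
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return dict(pool.imap_unordered(describe, data.items()))


if __name__ == "__main__":
    # The guard matters with "spawn": each worker re-imports this module.
    print(run_with_spawn({"app_a": [1, 2, 3], "app_b": [4]}))
```

The trade-off is that nothing is inherited for free: every argument has to be pickled and sent to the worker, and worker startup is slower than with fork().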

The big dictionary is passed in to a function as an argument, and I could only create the pool inside this function. How would I be able to create the pool before the big dictionary?

As simple as this: make these two lines the first two lines in your program:

import multiprocessing
pool = multiprocessing.Pool()

You can create the pool any time you like (just so long as it exists sometime before you actually use it), and worker processes will inherit the entire address space at the time the Pool constructor is invoked.
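
To make the timing concrete, here is a small sketch (Python 3, POSIX only, since it asks for the "fork" start method explicitly). The names BEFORE, AFTER, report() and demo() are made up for illustration. The worker sees whatever existed in the parent when the Pool was constructed, and nothing assigned afterwards.

```python
import multiprocessing

BEFORE = None
AFTER = None


def report(_):
    # Runs in the worker: tells us which globals the worker inherited.
    return (BEFORE is not None, AFTER is not None)


def demo():
    global BEFORE, AFTER
    BEFORE, AFTER = "assigned before the pool exists", None
    ctx = multiprocessing.get_context("fork")  # assumption: fork is available
    pool = ctx.Pool(processes=1)               # the worker is forked here
    AFTER = "assigned after the pool exists"   # the worker never sees this
    try:
        return pool.map(report, [0])[0]
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    print(demo())  # (True, False)
```

The same reasoning applies to the giant dict: if it is loaded after the Pool is constructed, it plays the role of AFTER and is not part of what the workers inherit.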

ANOTHER SUGGESTION

If you're not mutating the dict after it's created, try using this instead:

def content_generator(dict):
    for app in dict:
        yield app, dict[app]

That way you don't have to materialize a giant set of the keys either. Or, even better (if possible), skip all that and iterate directly over the items:

for result in pool.imap_unordered(func_process_application, very_large_dict.iteritems()):
    pass  # do some aggregation on result
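
Putting the two suggestions together, a rough Python 3 sketch might look like the following. In Python 3, iteritems() no longer exists and items() returns a lazy view, so it is used here instead. process_application() and load_big_dict() are hypothetical placeholders for func_process_application and for unpickling abc.pickle.

```python
import multiprocessing


def process_application(item):
    """Hypothetical stand-in for func_process_application."""
    app, info = item
    return app, len(info)


def load_big_dict():
    """Hypothetical stand-in for loading the pickled dictionary."""
    return {"app%d" % i: list(range(i)) for i in range(100)}


def main():
    # Create the pool first, while the parent process is still small.
    pool = multiprocessing.Pool(processes=2)
    try:
        very_large_dict = load_big_dict()  # exists only in the parent
        sizes = {}
        # Each task carries a single (app, info) pair to a worker.
        for app, size in pool.imap_unordered(process_application,
                                             very_large_dict.items()):
            sizes[app] = size  # aggregate results in the parent
    finally:
        pool.close()
        pool.join()
    return sizes


if __name__ == "__main__":
    print(len(main()))
```

This assumes the dict is not mutated while the pool is consuming it, as noted above.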