def content_generator(applications, dict):
for app in applications:
with open('abc.pickle', 'r') as f:
very_large_dict = pickle.load(f)
all_applications = set(very_large_dict.keys())
pool = multiprocessing.Pool()
for result in pool.imap_unordered(func_process_application, content_generator(all_applications, very_large_dict)):
do some aggregation on result
The comments are becoming impossible to follow, so I'm pasting in my important comment here:
On a Linux-y system, new processes are created by
fork(), so get a copy of the entire parent-process address space at the time they're created. It's "copy on write", so is more of a "virtual" copy than a "real" copy, but still ... ;-) For a start, try creating your
Pool before creating giant data structures. Then the child processes will inherit a much smaller address space.
Then some answers to questions:
so in python 2.7, there is no way to spawn a new process?
On Linux-y systems, no. The ability to use "spawn" on those was first added in Python 3.4. On Windows systems, "spawn" has always been the only choice (no
fork() on Windows).
The big dictionary is passed in to a function as an argument and I could only create the pool inside this function. How would I be able to create the pool before the big dictionary
As simple as this: make these two lines the first two lines in your program:
import multiprocessing pool = multiprocessing.Pool()
You can create the pool any time you like (just so long as it exists sometime before you actually use it), and worker processes will inherit the entire address space at the time the
Pool constructor is invoked.
If you're not mutating the dict after it's created, try using this instead:
def content_generator(dict): for app in dict: yield app, dict[app]
That way you don't have to materialize a giant set of the keys either. Or, even better (if possible), skip all that and iterate directly over the items:
for result in pool.imap_unordered(func_process_application, very_large_dict.iteritems()):