jorgehumberto - 5 months ago
Python Question

Parallelizing a dictionary comprehension

I have the following function and dictionary comprehension:

def function(name, params):
    results = fits.open(name)
    <do something more to results>
    return results

dictionary = {name: function(name, params) for name in nameList}


and would like to parallelize it. Is there a simple way to do this?

I have seen that the `multiprocessing` module can be used here, but could not understand how to get the results into my dictionary.

NOTE: If possible, please give an answer that can be applied to any function that returns a result.

NOTE 2: the function mainly manipulates the fits file and assigns the results to a class

UPDATE

So here's what worked for me in the end (from @code_onkel's answer):

def function(name, params):
    results = fits.open(name)
    <do something more to results>
    return results

def function_wrapper(args):
    return function(*args)

params = [...,...,..., etc]

p = multiprocessing.Pool(processes=max(2, multiprocessing.cpu_count() // 10))
args_generator = ((name, params) for name in names)

dictionary = dict(zip(names, p.map(function_wrapper, args_generator)))


Using tqdm only worked partially, since I could not use my custom bar: tqdm reverts to a default bar that shows only the iteration count.

Answer

The dictionary comprehension itself cannot be parallelized. Here is an example of how to use the multiprocessing module with Python 2.7.

from __future__ import print_function
import time
import multiprocessing

params = [0.5]

def function(name, params):
    print('sleeping for', name)
    time.sleep(params[0])
    return time.time()

def function_wrapper(args):
    return function(*args)

names = list('onecharNAmEs')

p = multiprocessing.Pool(3)
args_generator = ((name, params) for name in names)
dictionary = dict(zip(names, p.map(function_wrapper, args_generator)))
print(dictionary)
p.close()

This works with any function, though the restrictions of the multiprocessing module apply. Most importantly, the function to be parallelized, as well as any classes passed as arguments or returned as values, have to be defined at the module level; otherwise the (de)serializer will not find them. The wrapper function is necessary because function() takes two arguments, but Pool.map() can only handle functions that take a single argument (like the built-in map() function).

With Python 3.3 or later, this can be simplified by using the Pool as a context manager and the starmap() method.

import time
import multiprocessing

params = [0.5]

def function(name, params):
    print('sleeping for', name)
    time.sleep(params[0])
    return time.time()

names = list('onecharNAmEs')

with multiprocessing.Pool(3) as p:
    args_generator = ((name, params) for name in names)
    dictionary = dict(zip(names, p.starmap(function, args_generator)))

print(dictionary)

This is a more readable version of the with block:

with multiprocessing.Pool(3) as p:
    args_generator = ((name, params) for name in names)
    results = p.starmap(function, args_generator)
    name_result_tuples = zip(names, results)
    dictionary = dict(name_result_tuples)

The Pool.map() function only handles functions with a single argument; that is why Pool.starmap() was added in Python 3.3.
