I want to use Python's multiprocessing module for the following:
Map an input line to a list of integers and calculate the sum of this list.
The input line is initially a string where the items to be summed are separated by spaces.
What I have tried is this:
from itertools import imap
my_input = '1000000000 ' * int(1e6)
print sum(imap(int, my_input.split()))
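(For reference: itertools.imap exists only in Python 2; in Python 3 the built-in map is already lazy, so an equivalent single-process version, shown here in Python 3 syntax as a sketch, would be:)

```python
# Python 3: the built-in map is lazy, so no itertools.imap is needed
my_input = '1000000000 ' * int(1e6)
print(sum(map(int, my_input.split())))
```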
>>> int_list = [int(1e9)] * int(1e6)
>>> %time sum(int_list)
CPU times: user 7.38 ms, sys: 5 µs, total: 7.38 ms
Wall time: 7.4 ms
So, this seems to roughly boil down to three steps: split the string, map each item to an int in parallel, and sum the results:
if __name__ == '__main__':
    import multiprocessing

    my_input = '1000000000 ' * int(1e6)
    string_list = my_input.split()
    # Pool over all CPUs
    int_list = multiprocessing.Pool().map(int, string_list)
    print sum(int_list)
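Pool.map already splits its input into chunks internally, but the chunk size can also be set explicitly; a larger chunksize sends fewer, bigger batches over the inter-process pipe. A sketch in Python 3 syntax (the chunksize value here is an illustrative guess, not tuned):

```python
import multiprocessing

if __name__ == '__main__':
    my_input = '1000000000 ' * int(1e6)
    string_list = my_input.split()
    # Larger chunks mean fewer task submissions and less IPC overhead.
    with multiprocessing.Pool() as pool:
        int_list = pool.map(int, string_list, chunksize=10000)
    print(sum(int_list))
```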
It may be more time-efficient to use generators where possible:
if __name__ == '__main__':
    import multiprocessing
    import re

    my_input = '1000000000 ' * int(1e6)
    # lazily yield tokens: runs of non-whitespace plus the trailing space
    string_list = (x.group(0) for x in re.finditer(r'[^\s]+\s', my_input))
    # Pool over all CPUs
    int_list = multiprocessing.Pool().imap(int, string_list)
    print sum(int_list)
The regex will likely be slower than split, but using re.finditer should allow the Pool to start mapping as fast as individual tokens are produced, and using imap rather than map should do the same for sum, allowing it to start adding numbers as they become available. Credit to this answer for the regex generator idea.
It may or may not be more efficient to multiprocess this than to do it in a single process: you might lose more time spawning new processes and passing the results back from them than you gain from doing things in parallel. The same goes if you try putting the adding into the pool as well.
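One concrete source of that per-item overhead (an observation about the multiprocessing API, not something measured in the original answer): Pool.imap defaults to chunksize=1, so every single string becomes its own task, whereas Pool.map computes a larger chunk size automatically. Passing an explicit chunksize to imap keeps the lazy behaviour while cutting the per-task cost, sketched here in Python 3 syntax:

```python
import multiprocessing

if __name__ == '__main__':
    strings = ('1000000000' for _ in range(100000))
    with multiprocessing.Pool() as pool:
        # chunksize=1000 batches the lazy stream into tasks of 1000 items,
        # instead of imap's default of one task per item.
        total = sum(pool.imap(int, strings, chunksize=1000))
    print(total)
```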
On the system I'm testing this on, which has two CPUs, the one-process solution runs in about half a second, the non-generator multiprocess solution in about 1 second, and the generator solution in 12-13 seconds.
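Those numbers came from ad-hoc timing; a minimal harness for comparing the variants yourself could look like this (a sketch using time.perf_counter in Python 3 syntax; the time_it helper is made up for illustration):

```python
import time

def time_it(label, fn):
    # Wall-clock timing of a zero-argument callable (illustrative only).
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print('%s: %.3f s' % (label, elapsed))
    return result

my_input = '1000000000 ' * int(1e6)
time_it('one process, split + map', lambda: sum(map(int, my_input.split())))
```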