starkbot - 3 months ago
Python Question

Using Python's multiprocessing to calculate the sum of integers from one long input line

I want to use Python's multiprocessing module for the following:
Map an input line to a list of integers and calculate the sum of this list.

The input line is initially a string where the items to be summed are separated by spaces.

What I have tried is this:

from itertools import imap

my_input = '1000000000 ' * int(1e6)
print sum(imap(int, my_input.split()))


This takes about 600ms on my machine, but I would like to make it faster with multiprocessing.

It seems that the bottleneck is the mapping step, since sum is quite fast when applied to a ready-made list of integers:

>>> int_list = [int(1e9)] * int(1e6)
>>> %time sum(int_list)
CPU times: user 7.38 ms, sys: 5 µs, total: 7.38 ms
Wall time: 7.4 ms
1000000000000000
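
To double-check that the conversion, not the summation, dominates, the two phases can be timed separately. This is a rough sketch using timeit (the input here is smaller than in the question so the one-shot timing stays quick, and absolute numbers will vary by machine):

```python
import timeit

# smaller input than the question's so a single timing pass is quick
text = '1000000000 ' * 100000
parts = text.split()
ints = [int(s) for s in parts]

# time string-to-int conversion and plain summation separately
convert_t = timeit.timeit(lambda: [int(s) for s in parts], number=1)
sum_t = timeit.timeit(lambda: sum(ints), number=1)
print('convert %.4fs  sum %.4fs' % (convert_t, sum_t))
```

On typical hardware the conversion takes several times longer than the summation, which matches the measurements above.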


I tried to follow the instructions from this question, but as I'm quite new to multiprocessing, I couldn't adapt them to this problem.

Answer

So, this seems to roughly boil down to three steps:

  1. Make a pool
  2. Map int() across the list within that pool
  3. Sum the results.

So:

if __name__ == '__main__':
    import multiprocessing
    my_input = '1000000000 ' * int(1e6)
    string_list = my_input.split()
    # Pool over all CPUs
    int_list = multiprocessing.Pool().map(int, string_list)
    print sum(int_list)

It may be more time-efficient to use generators where possible:

if __name__ == '__main__':
    import multiprocessing
    import re
    my_input = '1000000000 ' * int(1e6)
    # use a regex iterator matching whitespace
    string_list = (x.group(0) for x in re.finditer(r'[^\s]+\s', my_input))
    # Pool over all CPUs
    int_list = multiprocessing.Pool().imap(int, string_list)
    print sum(int_list)

The regex will likely be slower than split, but using re.finditer should let the Pool start mapping as soon as individual tokens are produced, and using imap rather than map does the same for sum (allowing it to start adding numbers as they become available). Credit to this answer for the re.finditer idea.
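
As a small illustration of that tokenising pattern (note that int() happily ignores the trailing whitespace each match keeps):

```python
import re

# lazily yield one space-terminated token at a time from the input
tokens = (m.group(0) for m in re.finditer(r'[^\s]+\s', '12 34 56 '))
print([int(t) for t in tokens])  # [12, 34, 56]
```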

It may or may not be more efficient to multiprocess than doing it in a single process. You might end up losing more time making new processes and passing the results back from them than you gain in doing things all at once. The same goes for if you were to try putting the adding into the pool as well.
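
One way to cut down that overhead is to hand the pool whole chunks and compute each partial sum inside the worker, so only one integer travels back per task. This is a sketch rather than a tested improvement, and split_chunks and sum_chunk are made-up helper names:

```python
import multiprocessing

def split_chunks(items, n):
    # slice a list into at most n roughly equal pieces
    size = (len(items) + n - 1) // n
    return [items[i:i + size] for i in range(0, len(items), size)]

def sum_chunk(chunk):
    # convert one chunk of strings and return its partial sum,
    # so only a single int is pickled back to the parent
    return sum(int(s) for s in chunk)

if __name__ == '__main__':
    my_input = '1000000000 ' * int(1e6)
    chunks = split_chunks(my_input.split(), multiprocessing.cpu_count())
    pool = multiprocessing.Pool()
    print(sum(pool.map(sum_chunk, chunks)))
    pool.close()
    pool.join()
```

Pool.map also accepts a chunksize argument that batches the items it sends to workers, but batching the summation itself additionally avoids shipping a million small results back.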

On the system I'm testing this on, which has two CPUs, I get the one-process solution to run in about half a second, the non-generator multiprocess solution in about 1 second, and the generator solution in 12-13 seconds.