starkbot - 1 year ago 93
Python Question

Using Python's multiprocessing to calculate the sum of integers from one long input line

I want to use Python's multiprocessing module for the following:
Map an input line to a list of integers and calculate the sum of this list.

The input line is initially a string where the items to be summed are separated by spaces.

What I have tried is this:

``````from itertools import imap

my_input = '1000000000 ' * int(1e6)
print sum(imap(int, my_input.split()))
``````

This takes about 600ms on my machine, but I would like to make it faster with multiprocessing.

It seems that the bottleneck is in the mapping-part since the sum-method is pretty fast when applied to a ready list of integers:

``````>>> int_list = [int(1e9)] * int(1e6)
>>> %time sum(int_list)
CPU times: user 7.38 ms, sys: 5 µs, total: 7.38 ms
Wall time: 7.4 ms
>>> 1000000000000000
``````

I tried to apply the instructions from this question but as I'm quite new to using multiprocessing, I couldn't fit the instructions to this problem.

So, this seems to roughly boil down to three steps:

1. Make a pool
2. Map int() across the list within that pool
3. Sum the results.

So:

``````if __name__ == '__main__':
import multiprocessing
my_input = '1000000000 ' * int(1e6)
string_list = my_input.split()
# Pool over all CPUs
int_list = multiprocessing.Pool().map(int, string_list)
print sum(int_list)
``````

It may be more efficient for time to use generators where possible:

``````if __name__ == '__main__':
import multiprocessing
import re
my_input = '1000000000 ' * int(1e6)
# use a regex iterator matching whitespace
string_list = (x.group(0) for x in re.finditer(r'[^\s]+\s', my_input))
# Pool over all CPUs
int_list = multiprocessing.Pool().imap(int, string_list)
print sum(int_list)
``````

The regex will likely be slower than `split`, but using `re.finditer` should allow the `Pool` to start mapping as fast as individual splits are made, and using `imap` rather than `map` should do similarly for `sum` (allowing it to start adding numbers as they become available). Credit to this answer for the `re.finditer` idea.

It may or may not be more efficient to multiprocess than doing it in a single process. You might end up losing more time making new processes and passing the results back from them than you gain in doing things all at once. The same goes for if you were to try putting the adding into the pool as well.

On the system I'm testing this on, which has two CPUs, I get the one-process solution to run in about half a second, the non-generator multiprocess solution in about 1 second, and the generator solution in 12-13 seconds.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download