MartyMacGyver MartyMacGyver - 2 months ago 13
Python Question

How to efficiently fan out large chunks of data into multiple concurrent sub-processes in Python?

[I'm using Python 3.5.2 (x64) in Windows.]

I'm reading binary data in large blocks (on the order of megabytes) and would like to efficiently share that data into 'n' concurrent Python sub-processes (each process will deal with the data in a unique and computationally expensive way).

The data is read-only, and each sequential block will not be considered to be "processed" until all the sub-processes are done.

I've focused on shared memory (Array (locked / unlocked) and RawArray): Reading the data block from the file into a buffer was quite quick, but copying that block to the shared memory was noticeably slower.

With queues, there will be a lot of redundant data copying going on there relative to shared memory. I chose shared memory because it involved one copy versus 'n' copies of the data).

Architecturally, how would one handle this problem efficiently in Python 3.5?

Edit: I've gathered two things so far: memory mapping in Windows is cumbersome because of the pickling involved to make it happen, and

(more specifically,
) is faster though not (yet) optimal.

Suggestions - preferably ones that use
components - are still very welcome!


In Unix this might be tractable because fork() is used for multiprocessing, but in Windows the fact that spawn() is the only way it works really limits the options. However, this is meant to be a multi-platform solution (which I'll use mainly in Windows) so I am working within that constraint.

I could open the data source in each subprocess, but depending on the data source that can be expensive in terms of bandwidth or prohibitive if it's a stream. That's why I've gone with the read-once approach.

Shared memory via mmap and an anonymous memory allocation seemed ideal, but to pass the object to the subprocesses would require pickling it - but you can't pickle mmap objects. So much for that.

Shared memory via a cython module might be impossible or it might not but it's almost certainly prohibitive - and begs the question of using a more appropriate language to the task.

Shared memory via the shared Array and RawArray functionality was costly in terms of performance.

Queues worked the best - but the internal I/O due to what I think is pickling in the background is prodigious. However, the performance hit for a small number of parallel processes wasn't too noticeable (this may be a limiting factor on faster systems though).

I will probably re-factor this in another language for a) the experience! and b) to see if I can avoid the I/O demands the Python Queues are causing. Fast memory caching between processes (which I hoped to implement here) would avoid a lot of redundant I/O.

While Python is widely applicable, no tool is ideal for every job and this is just one of those cases. I learned a lot about Python's multiprocessing module in the course of this!

At this point it looks like I've gone as far as I can go with standard CPython, but suggestions are still welcome!