loganecolss loganecolss - 4 years ago 206
Python Question

cython prange not so fast than single thread

I just wrote a trivial program to test how

cython
's
prange
performs, and here is the code:

from cython.parallel import prange
import numpy as np

def func(int r, int c):
cdef:
double[:,:] a = np.arange(r*c, dtype=np.double).reshape(r,c)
double total = 0
int i, j

for i in prange(r, nogil=True, schedule='static', chunksize=1):
for j in range(c):
total += a[i,j]

return total


On Mac Book pro, with
OMP_NUM_THREADS=3
, the above code takes almost 18 sec for
(r,c) = (10000, 100000)
, and with single thread, it takes about 21 sec.

Why there is so little performance boost? Am I using this
prange
correctly?

Answer Source

Have you timed how long it takes just to allocate a? A 10000 x 100000 float64 array takes up 8GB of memory.

a = np.ones((10000, 100000), np.double)

takes over six seconds on my laptop with 16GB of RAM. If you don't have 8GB free then you'll hit the swap and it will take a lot longer. Since func spends almost all of its time just allocating a, parallelizing your outer for loop can therefore only gain you a small fractional improvement on the total runtime.

To demonstrate this, I have modified your function to accept a as an input. In tmp.pyx:

#cython: boundscheck=False, wraparound=False, initializedcheck=False

from cython.parallel cimport prange

def serial(double[:, :] a):
    cdef:
        double total = 0
        int i, j
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            total += a[i, j]
    return total

def parallel(double[:, :] a):
    cdef:
        double total = 0
        int i, j
    for i in prange(a.shape[0], nogil=True, schedule='static', chunksize=1):
        for j in range(a.shape[1]):
            total += a[i, j]
    return total

For example:

In [1]: import tmp

In [2]: r, c = 10000, 100000

In [3]: a = np.random.randn(r, c)   # this takes ~6.75 sec

In [4]: %timeit tmp.serial(a)
1 loops, best of 3: 1.25 s per loop

In [5]: %timeit tmp.parallel(a)
1 loops, best of 3: 450 ms per loop

Parallelizing the function gave about a 2.8x speed-up* on my laptop with 4 cores, but this is only a small fraction of the time taken to allocate a.

The lesson here is to always profile your code to understand where it spends its most of its time before you dive into optimizations.


* You could do a little better by passing larger chunks of a to each worker process, e.g. by increasing chunksize or using schedule='guided'

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download