Charles Menguy - 2 months ago
C Question

Why is matrix multiplication faster with numpy than with ctypes in Python?

I was trying to figure out the fastest way to do matrix multiplication and tried 3 different ways:


  • Pure Python implementation: no surprises here.

  • Numpy implementation using numpy.dot(a, b)

  • Interfacing with C using the ctypes module in Python.



This is the C code that is compiled into a shared library:

#include <stdio.h>
#include <stdlib.h>

/* Multiply two n x n row-major matrices: c = a * b. */
void matmult(float* a, float* b, float* c, int n) {
    int i = 0;
    int j = 0;
    int k = 0;

    /*float* c = malloc(nay * sizeof(float));*/

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            /* float, not int: an int accumulator truncates every partial sum */
            float sub = 0;
            for (k = 0; k < n; k++) {
                sub = sub + a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sub;
        }
    }
    return;
}


And the Python code that calls it:

import ctypes
import numpy

def C_mat_mult(a, b):
    libmatmult = ctypes.CDLL("./matmult.so")

    dima = len(a) * len(a)
    dimb = len(b) * len(b)

    array_a = ctypes.c_float * dima
    array_b = ctypes.c_float * dimb
    array_c = ctypes.c_float * dima

    suma = array_a()
    sumb = array_b()
    sumc = array_c()

    # Flatten both inputs into row-major C arrays.
    inda = 0
    for i in range(0, len(a)):
        for j in range(0, len(a[i])):
            suma[inda] = a[i][j]
            inda = inda + 1
    indb = 0
    for i in range(0, len(b)):
        for j in range(0, len(b[i])):
            sumb[indb] = b[i][j]
            indb = indb + 1

    # Pass the real dimension; the original call hard-coded 2 here.
    libmatmult.matmult(ctypes.byref(suma), ctypes.byref(sumb), ctypes.byref(sumc), len(a))

    # Copy the flat result back into a 2-D numpy array.
    res = numpy.zeros([len(a), len(a)])
    indc = 0
    for i in range(0, len(sumc)):
        res[indc][i % len(a)] = sumc[i]
        if i % len(a) == len(a) - 1:
            indc = indc + 1

    return res


I would have bet that the version using C would have been faster... and I'd have lost! Below is my benchmark, which seems to show that either I did it incorrectly or numpy is stupidly fast:

[image: benchmark]

I'd like to understand why the numpy version is faster than the ctypes version. I'm not even talking about the pure Python implementation, since that one is kind of obvious.

Answer

I'm not too familiar with Numpy, but the source is on GitHub. Part of the dot-product code is implemented in https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/arraytypes.c.src, which I'm assuming is a template that gets expanded into a specific C implementation for each datatype. For example:

/**begin repeat
 *
 * #name = BYTE, UBYTE, SHORT, USHORT, INT, UINT,
 * LONG, ULONG, LONGLONG, ULONGLONG,
 * FLOAT, DOUBLE, LONGDOUBLE,
 * DATETIME, TIMEDELTA#
 * #type = npy_byte, npy_ubyte, npy_short, npy_ushort, npy_int, npy_uint,
 * npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 * npy_float, npy_double, npy_longdouble,
 * npy_datetime, npy_timedelta#
 * #out = npy_long, npy_ulong, npy_long, npy_ulong, npy_long, npy_ulong,
 * npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 * npy_float, npy_double, npy_longdouble,
 * npy_datetime, npy_timedelta#
 */
static void
@name@_dot(char *ip1, npy_intp is1, char *ip2, npy_intp is2, char *op, npy_intp n,
           void *NPY_UNUSED(ignore))
{
    @out@ tmp = (@out@)0;
    npy_intp i;

    for (i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
        tmp += (@out@)(*((@type@ *)ip1)) *
               (@out@)(*((@type@ *)ip2));
    }
    *((@type@ *)op) = (@type@) tmp;
}
/**end repeat**/

This appears to compute one-dimensional dot products, i.e. on vectors. In my few minutes of Github browsing I was unable to find the source for matrices, but it's possible that it uses one call to FLOAT_dot for each element in the result matrix. That means the loop in this function corresponds to your inner-most loop.

One difference between them is that the "stride" -- the distance between successive elements in the inputs -- is explicitly computed once before calling the function. In your code there is no stride, and the offset into each input is recomputed on every iteration, e.g. a[i * n + k]. I would have expected a good compiler to optimise that away to something similar to the Numpy stride, but perhaps it can't prove that the step is a constant (or the code is not being compiled with optimisation turned on).
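To make the two points above concrete, here is a minimal pure-Python sketch of a matrix product built from one strided dot product per result element. The names dot_strided and matmult_strided are made up for illustration, and the strides are counted in elements rather than in bytes as in the Numpy template. It only shows the access pattern; it is not Numpy's actual code path.

```python
def dot_strided(buf1, start1, stride1, buf2, start2, stride2, n):
    """Dot product of two strided views over flat buffers.

    Same shape as the @name@_dot loop: both positions advance by a fixed
    stride, so no index arithmetic is redone inside the loop.
    """
    tmp = 0.0
    p1, p2 = start1, start2
    for _ in range(n):
        tmp += buf1[p1] * buf2[p2]
        p1 += stride1
        p2 += stride2
    return tmp


def matmult_strided(a, b, n):
    """Multiply two n x n row-major matrices stored as flat lists."""
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            # Row i of a starts at i*n with stride 1;
            # column j of b starts at j with stride n.
            c[i * n + j] = dot_strided(a, i * n, 1, b, j, n, n)
    return c
```

The start offsets and strides are worked out once per result element, outside the inner loop, which is the difference from writing a[i * n + k] and b[k * n + j] inside it.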

Numpy may also be doing something smart with cache effects in the higher-level code that calls this function. A common trick is to think about whether each row is contiguous, or each column, and to iterate over the contiguous part first. It seems difficult to be perfectly optimal: for each dot product, one input matrix must be traversed by rows and the other by columns (unless they happen to be stored in different major order). But it can at least do that for the result elements.
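One common way to get both inputs traversed contiguously is to transpose the second matrix once up front. The sketch below illustrates that idea only; whether Numpy does anything like this is not confirmed here, and transpose and matmult_transposed are hypothetical helper names. Pure Python will not show the cache benefit itself, so treat this as a picture of the access pattern one would write in C.

```python
def transpose(m, n):
    """Return the transpose of an n x n row-major matrix stored as a flat list."""
    return [m[j * n + i] for i in range(n) for j in range(n)]


def matmult_transposed(a, b, n):
    """Multiply two n x n row-major matrices, reading both inputs row by row."""
    bt = transpose(b, n)  # one extra pass: columns of b become rows of bt
    c = [0.0] * (n * n)
    for i in range(n):
        row_a = a[i * n:(i + 1) * n]
        for j in range(n):
            # Column j of b is now the contiguous row j of bt.
            row_bt = bt[j * n:(j + 1) * n]
            c[i * n + j] = sum(x * y for x, y in zip(row_a, row_bt))
    return c
```

The transpose costs one extra pass over b, which is paid back when n is large because each of the n * n inner loops then walks two contiguous rows.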

Numpy also contains code to choose the implementation of certain operations, including "dot", from different basic implementations. For instance it can use a BLAS library. From discussion above it sounds like CBLAS is used. This was translated from Fortran into C. I think the implementation used in your test would be the one found in here: http://www.netlib.org/clapack/cblas/sdot.c.

Note that this program was written by a machine for another machine to read. But you can see at the bottom that it's using an unrolled loop to process 5 elements at a time:

for (i = mp1; i <= *n; i += 5) {
    stemp = stemp + SX(i) * SY(i) + SX(i + 1) * SY(i + 1) + SX(i + 2) *
        SY(i + 2) + SX(i + 3) * SY(i + 3) + SX(i + 4) * SY(i + 4);
}

This unrolling factor is likely to have been picked after profiling several candidates. But one theoretical advantage of it is that more arithmetic operations are done between each branch point, so the compiler and CPU have more choice about how to schedule them to get as much instruction pipelining as possible.
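The structure of that unrolled loop can be sketched in Python as follows. The function name dot_unrolled5 is made up for illustration, and the clean-up loop for the leftover n % 5 elements is an assumption inferred from the mp1 start index in the snippet above. In Python the unrolling buys nothing; the sketch only shows how the work is split.

```python
def dot_unrolled5(x, y):
    """Dot product with the main loop unrolled five elements at a time."""
    n = len(x)
    total = 0.0
    m = n % 5
    # Clean-up loop (assumed): handle the n % 5 leftover elements first.
    for i in range(m):
        total += x[i] * y[i]
    # Main loop: five multiply-adds per iteration, so the loop condition
    # is tested once per five elements instead of once per element.
    for i in range(m, n, 5):
        total += (x[i] * y[i] + x[i + 1] * y[i + 1] + x[i + 2] * y[i + 2]
                  + x[i + 3] * y[i + 3] + x[i + 4] * y[i + 4])
    return total
```

For example, with seven elements the clean-up loop handles the first two and the main loop runs once over the remaining five.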