user2133814 - 2 months ago 43

Python Question

Can a single numpy einsum statement replicate gemm functionality? Scalar and matrix multiplication seem straightforward, but I haven't found how to get the "+" working. In case its simpler, D = alpha * A * B + beta * C would be acceptable (preferable actually)

`alpha = 2`

beta = 3

A = np.arange(9).reshape(3, 3)

B = A + 1

C = B + 1

left_part = alpha*np.dot(A, B)

print(left_part)

left_part = np.einsum(',ij,jk->ik', alpha, A, B)

print(left_part)

Answer

There seems to be some confusion here: `np.einsum`

handles operations that can be cast in the following form: broadcast–multiply–reduce. Element-wise summation is not part of its scope.

The reason why you need this sort of thing for the multiplication is that writing these operations out "naively" may exceed memory or computing resources quickly. Consider, for example, matrix multiplication:

```
import numpy as np
x, y = np.ones((2, 2000, 2000))
# explicit loop - ridiculously slow
a = sum(x[:,j,np.newaxis] * y[j,:] for j in range(2000))
# explicit broadcast-multiply-reduce: throws MemoryError
a = (x[:,:,None] * y[:,None,:]).sum(1)
# einsum or dot: fast and memory-saving
a = np.einsum('ij,jk->ik', x, y)
```

The Einstein convention however *factorizes* for addition, so you
can write your BLAS-like problem simply as:

```
d = np.einsum(',ij,jk->ik', alpha, a, b) + np.einsum(',ik', beta, c)
```

with minimal memory overhead (you can rewrite most of it as in-place operations if you are really concerned about memory) and constant runtime overhead (the cost of two python-to-C calls).

So regarding performance, this seems, respectfully, like a case of premature optimization to me: have you actually verified that the split of GEMM-like operations into two separate numpy calls is a bottleneck in your code? If it indeed is, then I suggest the following (in order of increasing involvedness):

Try, carefully!,

`scipy.linalg.blas.dgemm`

. I would be surprised if you get significantly better performance, since`dgemm`

s are usually only building block themselves.Try a expression compiler (essentially you are proposing such a thing) like Theano.

Write your own generalised ufunc using Cython or C.