ericmjl - 6 months ago 12

# Code optimization - number of function calls in Python

I'd like to know how I might transform this problem to reduce the overhead of the `np.sum()` function calls in my code.

I have an `input` matrix, say of `shape=(1000, 36)`. Each row represents a node in a graph. My operation iterates over each row and does an element-wise addition with a variable number of other rows. Those "other" rows are defined in a dictionary `nodes_nbrs` that records, for each row, a list of rows that must be summed together. For example:

``````
nodes_nbrs = {0: [0, 1],
              1: [1, 0, 2],
              2: [2, 1],
              ...}
``````

Here, node `0` would be transformed into the sum of nodes `0` and `1`. Node `1` would be transformed into the sum of nodes `1`, `0`, and `2`. And so on for the rest of the nodes.

My current (naive) implementation is as follows. I first instantiate a zero array of the final shape that I want, and then iterate over each key-value pair in the `nodes_nbrs` dictionary:

``````
output = np.zeros(shape=input.shape)
for k, v in nodes_nbrs.items():
    output[k] = np.sum(input[v], axis=0)
``````

This code is all cool and fine in small tests (`shape=(1000, 36)`), but on larger tests (`shape=(~1E(5-6), 36)`), it takes ~2-3 seconds to complete. I end up having to do this operation thousands of times, so I'm trying to see if there's a more optimized way of doing this.

After doing line profiling, I noticed that the key killer here is calling the `np.sum` function over and over, which takes about 50% of the total time. Is there a way I can eliminate this overhead? Or is there another way I can optimize this?

Apart from that, here is a list of things I have done, and (very briefly) their results:

• A `cython` version: eliminates the `for` loop type-checking overhead, for a 30% reduction in time taken. With the `cython` version, `np.sum` takes about 80% of the total wall clock time, rather than 50%.

• Pre-declaring `np.sum` as a variable `npsum`, and then calling `npsum` inside the `for` loop: no difference from the original.

• Replacing `np.sum` with `np.add.reduce`, assigning that to the variable `npsum`, and then calling `npsum` inside the `for` loop: ~10% reduction in wall clock time, but then incompatible with `autograd` (explanation below in the sparse matrices bullet point).

• `numba` JIT-ing: did not attempt more than adding the decorator. No improvement, but I didn't try harder.

• Converting the `nodes_nbrs` dictionary into a dense `numpy` binary array (1s and 0s), and then doing a single `np.dot` operation: good in theory, bad in practice because it would require a square matrix of `shape=(10^n, 10^n)`, which is quadratic in memory usage.
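To make the memory concern in the last bullet concrete, here is a minimal sketch of the dense formulation on the three-node example above, followed by a back-of-the-envelope size estimate. The helper name `dense_adjacency`, the toy array `X`, and the float64 dtype are illustrative assumptions, and the helper assumes no neighbor is listed twice for the same node.

```python
import numpy as np

def dense_adjacency(nodes_nbrs, n):
    """Build an (n, n) 0/1 matrix A with A[k, j] = 1 if row j is listed for node k."""
    A = np.zeros((n, n))
    for k, v in nodes_nbrs.items():
        A[k, v] = 1  # assumes v has no repeated entries
    return A

nodes_nbrs = {0: [0, 1], 1: [1, 0, 2], 2: [2, 1]}
X = np.arange(6, dtype=float).reshape(3, 2)  # toy stand-in for `input`

A = dense_adjacency(nodes_nbrs, 3)
out = A @ X  # row k of `out` is the sum of the rows of X listed for node k

# Size of the dense matrix alone, assuming float64 (8 bytes per entry)
for n in (10**3, 10**5, 10**6):
    print(n, "nodes:", n * n * 8 / 1e9, "GB")
```

At `n = 10**5` the matrix alone would need about 80 GB, which is why the single `np.dot` is impractical at that scale.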

Things I have not tried and am hesitant to try:

• `scipy` sparse matrices: I am using `autograd`, which does not support automatic differentiation of the `dot` operation for `scipy` sparse matrices.

For those who are curious, this is essentially a convolution operation on graph-structured data. Kinda fun developing this for grad school, but also somewhat frustrating being at the cutting edge of knowledge.

If scipy.sparse is not an option, one way you might approach this would be to massage your data so that you can use vectorized functions to do everything in the compiled layer. If you change your neighbors dictionary into a two-dimensional array with appropriate flags for missing values, you can use `np.take` to extract the data you want and then do a single `sum()` call.

Here's an example of what I have in mind:

``````
import numpy as np

def make_data(N=100):
    """Random (N, 36) integer data plus 2 to 4 random neighbors per node."""
    X = np.random.randint(1, 20, (N, 36))
    connections = np.random.randint(2, 5, N)
    nbrs = {i: list(np.random.choice(N, c))
            for i, c in enumerate(connections)}
    return X, nbrs

def original_solution(X, nbrs):
    output = np.zeros(shape=X.shape)
    for k, v in nbrs.items():
        output[k] = np.sum(X[v], axis=0)
    return output

def vectorized_solution(X, nbrs):
    # Make neighbors all the same length, filling with -1
    new_nbrs = np.full((X.shape[0], max(map(len, nbrs.values()))), -1, dtype=int)
    for i, v in nbrs.items():
        new_nbrs[i, :len(v)] = v

    # add a row of zeros to X, so that index -1 picks up zeros
    new_X = np.vstack([X, 0 * X[0]])

    # gather to shape (N, max_len, 36), then sum over the neighbor axis
    return new_X.take(new_nbrs, 0).sum(1)
``````
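To see what the padding step does, the same idea can be traced by hand on the three-node example from the question. This is only a small illustration; the toy array `X` and the names `padded`, `X_padded`, and `gathered` are made up for it.

```python
import numpy as np

nodes_nbrs = {0: [0, 1], 1: [1, 0, 2], 2: [2, 1]}
X = np.arange(6).reshape(3, 2)  # toy data: rows [0, 1], [2, 3], [4, 5]

# Pad every neighbor list to the longest length, using -1 as the "missing" flag
width = max(len(v) for v in nodes_nbrs.values())
padded = np.full((len(nodes_nbrs), width), -1, dtype=int)
for k, v in nodes_nbrs.items():
    padded[k, :len(v)] = v
# padded is now [[0, 1, -1], [1, 0, 2], [2, 1, -1]]

# Index -1 selects the last row, so an extra all-zero row makes padding a no-op
X_padded = np.vstack([X, np.zeros_like(X[:1])])
gathered = X_padded.take(padded, axis=0)  # shape (3, 3, 2)
result = gathered.sum(axis=1)             # shape (3, 2)
print(result)
```

The intermediate `gathered` array has one slot per node per padded neighbor, so its size grows with the longest neighbor list rather than the average one.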

Now we can confirm that the results match:

``````
>>> X, nbrs = make_data(100)
>>> np.allclose(original_solution(X, nbrs),
...             vectorized_solution(X, nbrs))
True
``````

And we can time things to see the speedup:

``````
X, nbrs = make_data(1000)
%timeit original_solution(X, nbrs)
%timeit vectorized_solution(X, nbrs)
# 100 loops, best of 3: 13.7 ms per loop
# 100 loops, best of 3: 1.89 ms per loop
``````

Going up to larger sizes:

``````
X, nbrs = make_data(100000)
%timeit original_solution(X, nbrs)
%timeit vectorized_solution(X, nbrs)
# 1 loop, best of 3: 1.42 s per loop
# 1 loop, best of 3: 249 ms per loop
``````

It's about a factor of 5-10 faster, which may be good enough for your purposes (though this will heavily depend on the exact characteristics of your `nbrs` dictionary).
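One characteristic that matters is how uneven the neighbor lists are, since every list is padded to the longest one. A rough way to check this before committing to the padded approach might look like the sketch below; `padding_stats` and the two toy dictionaries are hypothetical names invented for this illustration.

```python
import numpy as np

def padding_stats(nbrs):
    """Return (max_len, fill_fraction) for the padded neighbor array."""
    lengths = np.array([len(v) for v in nbrs.values()])
    max_len = int(lengths.max())
    # Fraction of slots in the (N, max_len) array that hold real neighbors
    fill = lengths.sum() / (len(nbrs) * max_len)
    return max_len, float(fill)

# Evenly sized lists: no padding is wasted
even = {i: [i, (i + 1) % 100] for i in range(100)}

# One hub node with 100 neighbors: almost the whole padded array is filler
skewed = dict(even)
skewed[0] = list(range(100))

print(padding_stats(even))
print(padding_stats(skewed))
```

A low fill fraction means most of the gathered array is zero rows, in which case the `reduceat` or sparse variants may be a better fit.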

Edit: Just for fun, I tried a couple of other approaches: one using `numpy.add.reduceat`, one using `pandas.groupby`, and one using `scipy.sparse`. It seems that the vectorized approach I originally proposed above is probably the best bet. Here they are for reference:

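``````
from itertools import chain

def reduceat_solution(X, nbrs):
    # ind: node ids in dict order; counts: number of neighbors per node
    ind, counts = np.transpose([[k, len(v)] for k, v in nbrs.items()])
    # flat list of neighbor rows, grouped by node in the same order
    flat = list(chain(*(nbrs[k] for k in ind)))
    # start offset of each node's group within the flat list
    starts = np.concatenate([[0], np.cumsum(counts)[:-1]])
    # segment sums come out in dict order; argsort(ind) puts them in node order
    # (assumes every node has at least one neighbor)
    return np.add.reduceat(X[flat], starts)[np.argsort(ind)]

np.allclose(original_solution(X, nbrs),
            reduceat_solution(X, nbrs))
# True
``````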
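For readers who have not used `np.add.reduceat`, here is a small standalone illustration of how it sums consecutive segments given their start offsets. The arrays are toy values chosen for this sketch.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6])
starts = np.array([0, 2, 5])  # segments: values[0:2], values[2:5], values[5:]

segment_sums = np.add.reduceat(values, starts)
# segment_sums is [1+2, 3+4+5, 6] = [3, 12, 6]

# With a 2-D array, reduceat works along axis 0 by default, so whole rows
# are summed per segment, as in the neighbor-summing use case
rows = np.arange(8).reshape(4, 2)         # [[0, 1], [2, 3], [4, 5], [6, 7]]
row_sums = np.add.reduceat(rows, [0, 1])  # [rows[0], rows[1] + rows[2] + rows[3]]
print(segment_sums, row_sums)
```

Note that the start offsets need to be strictly increasing for this segment interpretation to hold: when a start is not smaller than the next one, `reduceat` returns the single element at that offset instead of an empty sum, so a node with no neighbors would need special handling.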

-

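``````
import pandas as pd

def groupby_solution(X, nbrs):
    # i: target node for each (node, neighbor) pair; j: the neighbor row
    i, j = np.transpose([[k, vi] for k, v in nbrs.items() for vi in v])
    # group the gathered neighbor rows by target node and sum each group;
    # DataFrame.groupby replaces the top-level pd.groupby, removed from pandas
    return pd.DataFrame(X[j]).groupby(i).sum().values

np.allclose(original_solution(X, nbrs),
            groupby_solution(X, nbrs))
# True
``````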

-

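``````
from scipy.sparse import csr_matrix
from itertools import chain

def sparse_solution(X, nbrs):
    # one (row, col, 1) triple per (node, neighbor) pair
    items = (([i]*len(col), col, [1]*len(col)) for i, col in nbrs.items())
    rows, cols, data = (np.array(list(chain(*a))) for a in zip(*items))
    # pass the shape explicitly: otherwise it is inferred from the largest
    # index present, which can be too small if the last node is nobody's neighbor
    n = X.shape[0]
    M = csr_matrix((data, (rows, cols)), shape=(n, n))
    return M.dot(X)

np.allclose(original_solution(X, nbrs),
            sparse_solution(X, nbrs))
# True
``````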

And all the timings together:

``````
X, nbrs = make_data(100000)
%timeit original_solution(X, nbrs)
%timeit vectorized_solution(X, nbrs)
%timeit reduceat_solution(X, nbrs)
%timeit groupby_solution(X, nbrs)
%timeit sparse_solution(X, nbrs)
# 1 loop, best of 3: 1.46 s per loop
# 1 loop, best of 3: 268 ms per loop
# 1 loop, best of 3: 416 ms per loop
# 1 loop, best of 3: 657 ms per loop
# 1 loop, best of 3: 282 ms per loop
``````
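The `%timeit` lines above only work inside IPython or Jupyter. If you want to reproduce the comparison from a plain Python script, a minimal sketch using the standard-library `timeit` module could look like this; `best_time` is a hypothetical helper, and `np.sum` on random data is only a stand-in workload.

```python
import timeit
import numpy as np

def best_time(func, *args, repeat=3, number=1):
    """Best wall-clock time in seconds per call, over `repeat` runs of `number` calls."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number

# Stand-in workload; swap in original_solution, vectorized_solution, etc.
X = np.random.rand(1000, 36)
t = best_time(np.sum, X)
print(f"np.sum on a (1000, 36) array: {t * 1e6:.1f} us")
```

Taking the minimum over repeats mirrors the "best of 3" figure that `%timeit` reported in the output above.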