Sam - 1 year ago 66
Python Question

# Functional assignment in numpy

Suppose I have two arrays

``````A = [ 6, 4, 5, 7, 9 ]
ind = [ 0, 0, 2, 1, 2 ]
``````

and a function f.

I want to build a new array B of size the number of distinct elements in ind with B[i] the result of f with parameter the subarray of A indexed by i.

For this example, if I take f = sum, then

`B = [10, 7, 14]`

or f = max

`B = [6, 7, 9]`

Is there a more efficient way than a for loop in numpy ?

Thanks

For the special case of `f = sum`:

``````In [32]: np.bincount(ind,A)
Out[32]: array([ 10.,   7.,  14.])
``````

Assuming:

• `f` is a ufunc
• You have enough memory to make a 2D array of shape `len(A) x len(A)`

You could make a 2D array `B`:

``````B=np.zeros((len(A),max(ind)+1))
``````

and fill in various locations in `B` with values from `A`, such that the first column of `B` only gets values from `A` when `ind == 0`, and the second column of `B` only gets values from `A` when `ind == 1`, etc:

``````B[zip(*enumerate(ind))]=A
``````

you'd end up with an array like

``````[[ 6.  0.  0.]
[ 4.  0.  0.]
[ 0.  0.  5.]
[ 0.  7.  0.]
[ 0.  0.  9.]]
``````

You could then apply `f` along axis=0 to obtain your desired result. There is a third assumption used here:

• The extra zeros in `B` do not affect the desired result.

If you can stomach these assumptions then:

``````import numpy as np

A = np.array([ 6, 4, 5, 7, 9 ])
ind = np.array([ 0, 0, 2, 1, 2 ])

N=100
M=10
A2 = np.array([np.random.randint(M) for i in range(N)])
ind2 = np.array([np.random.randint(M) for i in range(N)])

def use_extra_axis(A,ind,f):
B=np.zeros((len(A),max(ind)+1))
B[zip(*enumerate(ind))]=A
return f(B)

def use_loop(A,ind,f):
n=max(ind)+1
B=np.empty(n)
for i in range(n):
B[i]=f(A[ind==i])
return B

def fmax(arr):
return np.max(arr,axis=0)

if __name__=='__main__':
print(use_extra_axis(A,ind,fmax))
print(use_loop(A,ind,fmax))
``````

For certain values of `M` and `N` (e.g. M=10, N=100), using an extra axis may be faster than using a loop:

``````% python -mtimeit -s'import test,numpy' 'test.use_extra_axis(test.A2,test.ind2,test.fmax)'
10000 loops, best of 3: 162 usec per loop

% python -mtimeit -s'import test,numpy' 'test.use_loop(test.A2,test.ind2,test.fmax)'
1000 loops, best of 3: 222 usec per loop
``````

However, as N grows larger (say M=10, N=10000), using a loop may be faster:

``````% python -mtimeit -s'import test,numpy' 'test.use_extra_axis(test.A2,test.ind2,test.fmax)'
100 loops, best of 3: 13.9 msec per loop
% python -mtimeit -s'import test,numpy' 'test.use_loop(test.A2,test.ind2,test.fmax)'
100 loops, best of 3: 4.4 msec per loop
``````

Incorporating thouis's excellent idea of using a sparse matrix:

``````def use_sparse_extra_axis(A,ind,f):
B=scipy.sparse.coo_matrix((A, (range(len(A)), ind))).toarray()
return f(B)

def use_sparse(A,ind,f):
return [f(v) for v in scipy.sparse.coo_matrix((A, (ind, range(len(A))))).tolil().data]
``````

Which implementation is best depends on the parameters `N` and `M`:

``````N=1000, M=100
·───────────────────────·────────────────────·
│ use_sparse_extra_axis │ 1.15 msec per loop │
│        use_extra_axis │ 2.79 msec per loop │
│              use_loop │ 3.47 msec per loop │
│            use_sparse │ 5.25 msec per loop │
·───────────────────────·────────────────────·

N=100000, M=10
·───────────────────────·────────────────────·
│ use_sparse_extra_axis │ 35.6 msec per loop │
│              use_loop │ 43.3 msec per loop │
│            use_sparse │ 91.5 msec per loop │
│        use_extra_axis │  150 msec per loop │
·───────────────────────·────────────────────·

N=100000, M=50
·───────────────────────·────────────────────·
│            use_sparse │ 94.1 msec per loop │
│              use_loop │  107 msec per loop │
│ use_sparse_extra_axis │  170 msec per loop │
│        use_extra_axis │  272 msec per loop │
·───────────────────────·────────────────────·

N=10000, M=50
·───────────────────────·────────────────────·
│              use_loop │ 10.9 msec per loop │
│            use_sparse │ 11.7 msec per loop │
│ use_sparse_extra_axis │ 15.1 msec per loop │
│        use_extra_axis │ 25.4 msec per loop │
·───────────────────────·────────────────────·
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download