Python Question

How does numpy implement multi-dimensional broadcasting?

Memory (row-major order):

[[A(0,0), A(0,1)],
 [A(1,0), A(1,1)]]

has this memory layout:

[A(0,0), A(0,1), A(1,0), A(1,1)]
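
As a quick check of that layout (a minimal sketch; the byte counts assume a typical 64-bit build where the default integer is int64), you can inspect .strides and .ravel():

import numpy as np

A = np.array([[0, 1],
              [2, 3]])

print(A.strides)  # (16, 8): step 16 bytes to the next row, 8 to the next column
print(A.ravel())  # [0 1 2 3]: the row-major memory order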


I guess the algorithm works like this in the following cases.

Broadcasting dimension is the last dimension:

    [[0, 1, 2, 3]       [[ 1]
     [4, 5, 6, 7]]   x   [10]]

    A (2 by 4)           B (2 by 1)

Iterate 0th dimensions of A and B simultaneously {
    Iterate last dimension of A {
        multiply;
    }
}
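
To make that guess concrete, here is a plain-Python sketch of the same loop order (purely illustrative; numpy's real inner loops run in C):

import numpy as np

A = np.array([[0, 1, 2, 3],
              [4, 5, 6, 7]])
B = np.array([[ 1],
              [10]])

out = np.empty_like(A)
for i in range(A.shape[0]):            # 0th dimensions of A and B together
    for j in range(A.shape[1]):        # last dimension of A
        out[i, j] = A[i, j] * B[i, 0]  # B's single column is reused

print(np.array_equal(out, A * B))      # True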


Broadcasting dimension is the 0th dimension:

    [[0, 1, 2, 3]
     [4, 5, 6, 7]]   x   [[1, 10, 100, 1000]]

    A (2 by 4)           B (1 by 4)

Iterate 0th dimension of A {
    Iterate 1st dimensions of A and B simultaneously {
        multiply;
    }
}
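
And the corresponding sketch for this case, again just to pin down the loop order:

import numpy as np

A = np.array([[0, 1, 2, 3],
              [4, 5, 6, 7]])
B = np.array([[1, 10, 100, 1000]])

out = np.empty_like(A)
for i in range(A.shape[0]):            # 0th dimension of A
    for j in range(A.shape[1]):        # 1st dimensions of A and B together
        out[i, j] = A[i, j] * B[0, j]  # B's single row is reused

print(np.array_equal(out, A * B))      # True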


Question:


  1. How does numpy know which order of multiplication is best?
     (Reading memory in order is better than reading memory all over the place, but how does numpy figure that out?)

  2. What would numpy do if the arrays have more than two dimensions?

  3. What would numpy do if the broadcasting dimension is not the last dimension?


Answer

To really get into the details of broadcasting you need to understand array shape and strides. But a lot of the work is now implemented in C code using nditer. You can read about it at http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html. np.nditer gives you access to the tool at the Python level, but its real value comes when it is used with cython or your own C code.

np.lib.stride_tricks has functions that let you play with strides. One of its functions helps visualize how arrays are broadcast together. In practice the work is done with nditer, but this function may help you understand the action:

In [629]: np.lib.stride_tricks.broadcast_arrays(np.arange(6).reshape(2,3), 
                                  np.array([[1],[2]]))
Out[629]: 
[array([[0, 1, 2],
        [3, 4, 5]]), 
 array([[1, 1, 1],
        [2, 2, 2]])]

Note that, effectively, the 2nd array has been replicated to match the 1st's shape. But the replication is done with stride tricks, not with actual copies.

In [631]: A,B=np.lib.stride_tricks.broadcast_arrays(np.arange(6).reshape(2,3), 
                                      np.array([[1],[2]]))
In [632]: A.shape
Out[632]: (2, 3)
In [633]: A.strides
Out[633]: (12, 4)
In [634]: B.shape
Out[634]: (2, 3)
In [635]: B.strides
Out[635]: (4, 0)         

It's this (4, 0) stride pattern that does the replication without copying.
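
You can reproduce that zero-stride replication by hand with as_strided (a sketch for illustration only; as_strided is easy to misuse, and the np.shares_memory check assumes a reasonably recent numpy):

import numpy as np

b = np.array([[1], [2]])                         # shape (2, 1)
B = np.lib.stride_tricks.as_strided(
    b, shape=(2, 3), strides=(b.strides[0], 0))  # 0-byte step along columns

print(B)                       # [[1 1 1]
                               #  [2 2 2]]
print(np.shares_memory(b, B))  # True: no data was copied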

=================

Using the Python-level nditer, here's what it does during broadcasting.

In [1]: A=np.arange(6).reshape(2,3)
In [2]: B=np.array([[1],[2]])

The plain nditer feeds elements one set at a time; see http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html#using-an-external-loop

In [5]: it =np.nditer((A,B))
In [6]: for a,b in it:
   ...:     print(a,b)

0 1
1 1
2 1
3 2
4 2
5 2

But when I turn external_loop on, it iterates in chunks, here the respective rows of the broadcasted arrays:

In [7]: it =np.nditer((A,B), flags=['external_loop'])
In [8]: for a,b in it:
   ...:     print(a,b)

[0 1 2] [1 1 1]
[3 4 5] [2 2 2]

With more complex broadcasting the external_loop still produces 1d arrays that allow simple C-style iteration:

In [13]: A1=np.arange(24).reshape(3,2,4)
In [18]: it =np.nditer((A1,np.arange(3)[:,None,None]), flags=['external_loop'])
In [19]: while not it.finished:
    ...:     print(it[:])
    ...:     it.iternext()
    ...:     
(array([0, 1, 2, 3, 4, 5, 6, 7]), array([0, 0, 0, 0, 0, 0, 0, 0]))
(array([ 8,  9, 10, 11, 12, 13, 14, 15]), array([1, 1, 1, 1, 1, 1, 1, 1]))
(array([16, 17, 18, 19, 20, 21, 22, 23]), array([2, 2, 2, 2, 2, 2, 2, 2]))

Note that while A1 is (3,2,4), the nditer loop takes 3 steps (along the 1st axis), each yielding a chunk of 2*4 = 8 elements.
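
One way to see why nditer can merge the last two axes into a single flat run (a sketch; the stride values assume a 64-bit build with int64 elements): A1 is contiguous across those axes, and the broadcast operand has zero strides there.

import numpy as np

A1 = np.arange(24).reshape(3, 2, 4)
B1 = np.arange(3)[:, None, None]

a, b = np.broadcast_arrays(A1, B1)
print(a.strides)   # (64, 32, 8): the last two axes form one contiguous 8-element run
print(b.strides)   # (8, 0, 0): zero strides on the axes being broadcast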

I found in another cython/nditer SO question that the first approach (element by element) did not produce much of a speed improvement, but the second helped a lot. In C or cython the external_loop case would do simple low-level iteration.
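
As a rough Python stand-in for that low-level loop (illustrative only; in real C or cython each chunk would be a plain pointer walk), the whole multiplication can be written against the external_loop chunks:

import numpy as np

A1 = np.arange(24).reshape(3, 2, 4)
B1 = np.arange(3)[:, None, None]

out = np.empty(np.broadcast(A1, B1).shape, dtype=A1.dtype)
it = np.nditer((A1, B1, out),
               flags=['external_loop'],
               op_flags=[['readonly'], ['readonly'], ['writeonly']])
for a, b, o in it:
    o[...] = a * b                     # one simple 1d pass per chunk

print(np.array_equal(out, A1 * B1))    # True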
