Phil Glau - 9 months ago
Python Question

Get unique counts and unique values on a per-row basis using numpy

I'm trying to get the equivalent of np.unique, but with an 'axis=1' option.

a = np.array([[8, 8, 8, 5, 8],
              [8, 2, 0, 8, 8],
              [4, 5, 4, 2, 4],
              [4, 6, 5, 2, 6]])

I'm looking to get the value with the highest count in each row and save it to a 1D vector. Basically "which value is most seen in each row."

Correct answer: [8,8,4,6] in this example.

Right now I'm doing something like:

y = np.zeros(len(a))

for i in range(len(a)):  # use xrange on Python 2
    u, cnt = np.unique(a[i, :], return_counts=True)
    # pick the value from 'u' that is seen the most.
    y[i] = u[np.argmax(cnt)]

This gives the desired result, but it is very slow in Python when looping over thousands of rows. I'm looking for a fully vectorized approach.

I found the "unique row elements" post, but it doesn't quite do what I want (either I'm not clever enough to munge it into the desired form, or it's not directly applicable).

Thank you in advance for any help you can provide.


One option is to use scipy.stats.mode:

In [36]: from scipy.stats import mode

In [37]: a
Out[37]:
array([[8, 8, 8, 5, 8],
       [8, 2, 0, 8, 8],
       [4, 5, 4, 2, 4],
       [4, 6, 5, 2, 6]])

In [38]: vals, counts = mode(a, axis=1)

In [39]: vals
Out[39]:
array([[8],
       [8],
       [4],
       [6]])

In [40]: counts
Out[40]:
array([[4],
       [3],
       [3],
       [2]])

However, `mode` is written in Python using numpy, and depending on the distribution of the values in the input, it might not be any faster than your solution. The implementation can be found in the scipy source.
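
To illustrate why the distribution of values matters, here is a minimal numpy-only sketch of one way a per-row mode can be computed: loop over the distinct values in the whole array and count each one per row. It is not scipy's actual code, and the name `row_mode_by_value` is made up for this illustration.

```python
import numpy as np

def row_mode_by_value(a):
    """Per-row mode via a loop over the distinct values of the whole array.

    Hypothetical helper for illustration only; not taken from scipy.
    """
    a = np.asarray(a)
    n_rows = a.shape[0]
    best_val = np.zeros(n_rows, dtype=a.dtype)
    best_cnt = np.zeros(n_rows, dtype=int)
    for v in np.unique(a):              # one pass per distinct value
        cnt = (a == v).sum(axis=1)      # occurrences of v in each row
        better = cnt > best_cnt         # strict ">" keeps the smaller value on ties
        best_val[better] = v
        best_cnt[better] = cnt[better]
    return best_val, best_cnt

a = np.array([[8, 8, 8, 5, 8],
              [8, 2, 0, 8, 8],
              [4, 5, 4, 2, 4],
              [4, 6, 5, 2, 6]])

vals, counts = row_mode_by_value(a)
print(vals)    # [8 8 4 6]
print(counts)  # [4 3 3 2]
```

The loop body runs once per distinct value, so this is cheap when the input has only a few distinct values and slow when it has many.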

The essential part of `mode` depends only on numpy, so if it works well enough for you but you don't want the dependency on scipy, you could copy the function into your own project. Just be sure to follow the terms of the BSD license that scipy uses.
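
If the values are small non-negative integers, as in the example, a different technique avoids the Python loop entirely: shift each row into its own block of bins and call `np.bincount` once. This is a sketch under that assumption, not something taken from scipy, and `row_mode_bincount` is a hypothetical name.

```python
import numpy as np

def row_mode_bincount(a):
    """Per-row mode with no Python loop.

    Assumes `a` is a 2-D array of small non-negative integers.
    Hypothetical helper for illustration only.
    """
    a = np.asarray(a)
    n_rows = a.shape[0]
    k = int(a.max()) + 1                # number of bins needed per row
    # Shift row i by i*k so that a single bincount covers all rows.
    shifted = a + k * np.arange(n_rows)[:, None]
    counts = np.bincount(shifted.ravel(), minlength=n_rows * k)
    counts = counts.reshape(n_rows, k)  # counts[i, v] = occurrences of v in row i
    modes = counts.argmax(axis=1)       # first maximum, i.e. smallest value on ties
    return modes, counts[np.arange(n_rows), modes]

a = np.array([[8, 8, 8, 5, 8],
              [8, 2, 0, 8, 8],
              [4, 5, 4, 2, 4],
              [4, 6, 5, 2, 6]])

modes, mode_counts = row_mode_bincount(a)
print(modes)        # [8 8 4 6]
print(mode_counts)  # [4 3 3 2]
```

The count table has `n_rows * (a.max() + 1)` entries, so this only makes sense when the largest value is modest. Negative values would need to be offset first, since `np.bincount` rejects them.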