Phil Glau - 1 year ago 122

Python Question

I'm trying to get the equivalent of np.unique, but with an 'axis=1' option.

```
a = np.array([[8, 8, 8, 5, 8],
              [8, 2, 0, 8, 8],
              [4, 5, 4, 2, 4],
              [4, 6, 5, 2, 6]])
```

I'm looking to get the value with the highest count in each row and save it to a 1D vector. Basically "which value is most seen in each row."

Correct answer: [8,8,4,6] in this example.

Right now I'm doing something like:

```
# one result per row, same integer dtype as the input
y = np.zeros(len(a), dtype=a.dtype)
for i in range(len(a)):  # xrange on Python 2
    u, cnt = np.unique(a[i, :], return_counts=True)
    # pick the value from 'u' that is seen the most.
    y[i] = u[np.argmax(cnt)]
```

This gives the desired result, but it is very slow in Python when looping over thousands of rows. I'm looking for a fully vectorized approach.

I found a post on finding unique row elements, but it doesn't quite do what I want (and either I'm not quite clever enough to munge it into the desired form or it's not directly applicable).

Thank you in advance for any help you can provide.

Answer Source

One option is to use `scipy.stats.mode`:

```
In [36]: from scipy.stats import mode

In [37]: a
Out[37]:
array([[8, 8, 8, 5, 8],
       [8, 2, 0, 8, 8],
       [4, 5, 4, 2, 4],
       [4, 6, 5, 2, 6]])

In [38]: vals, counts = mode(a, axis=1)

In [39]: vals
Out[39]:
array([[8],
       [8],
       [4],
       [6]])

In [40]: counts
Out[40]:
array([[4],
       [3],
       [3],
       [2]])
```
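The question asks for a 1D vector, while the output above is a column of shape `(4, 1)`. A minimal sketch of flattening it, assuming the result has the shape shown here (the array is typed in by hand so the snippet runs without scipy; some SciPy releases may already return a 1D array, in which case `ravel` changes nothing):

```python
import numpy as np

# Stand-in for the `vals` array printed above, shape (4, 1).
vals = np.array([[8],
                 [8],
                 [4],
                 [6]])

# Flatten the column into a 1D vector with one entry per row of `a`.
y = np.asarray(vals).ravel()
print(y.shape)  # (4,)
```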

However, `scipy.stats.mode` is, as I write this, implemented in Python on top of numpy, so depending on the distribution of the values in the input, it might not be any faster than your solution. You can find the implementation in https://github.com/scipy/scipy/blob/master/scipy/stats/stats.py (and as I write this, it is here: https://github.com/scipy/scipy/blob/master/scipy/stats/stats.py#L372).

The essential part of the function depends only on numpy, so if it works well enough for you but you don't want the dependency on scipy, you could copy the function into your own project. Just be sure to follow the terms of the BSD license that scipy uses.
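If you want to avoid both the Python-level loop and the scipy dependency, one possible numpy-only approach is to sort each row and measure run lengths of equal values. This is a sketch, not how `scipy.stats.mode` is implemented; the helper name `row_mode` is made up for illustration, and it assumes a non-empty 2D array without NaNs. On ties it returns the smallest value, which matches the `np.unique` + `argmax` loop in the question.

```python
import numpy as np

def row_mode(a):
    """Most frequent value in each row of a 2D array (hypothetical helper)."""
    a = np.asarray(a)
    n_rows, n_cols = a.shape

    # Sort each row so that equal values sit next to each other.
    s = np.sort(a, axis=1)

    # True wherever a new run of equal values begins.
    is_start = np.ones(s.shape, dtype=bool)
    is_start[:, 1:] = s[:, 1:] != s[:, :-1]

    # For every position, the column index where its run started.
    col = np.arange(n_cols)
    run_start = np.maximum.accumulate(np.where(is_start, col, 0), axis=1)

    # Length of the run up to and including each position.
    run_len = col - run_start + 1

    # The first maximum marks the end of the longest run in each row.
    best = run_len.argmax(axis=1)
    rows = np.arange(n_rows)
    return s[rows, best], run_len[rows, best]

a = np.array([[8, 8, 8, 5, 8],
              [8, 2, 0, 8, 8],
              [4, 5, 4, 2, 4],
              [4, 6, 5, 2, 6]])

values, counts = row_mode(a)
print(values)  # [8 8 4 6]
print(counts)  # [4 3 3 2]
```

The cost is dominated by the row-wise sort, so it scales with the number of rows times `n_cols * log(n_cols)`; whether it beats the loop or `mode` on your data is worth timing rather than assuming.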