jhegedus jhegedus - 11 months ago 52

map with failure in Numpy

Inspired by Haskell:

How can I implement the following with a numpy array in Python?

In [13]: [(x if x>3 else None) for x in range(10)]
Out[13]: [None, None, None, None, 4, 5, 6, 7, 8, 9]

In other words, I am looking for a NumPy function that would have the following signature in Haskell:
f :: [a] -> (a -> Maybe a) -> [Maybe a]
where [a] would be a NumPy array.

I was trying this:

np.apply_along_axis(lambda x:x if x>3 else None,0,np.arange(10))

but it does not work:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()


NumPy's where() will do the trick:

In [429]: import numpy as np

In [430]: arr = np.arange(10, dtype=object)  # np.object was removed in NumPy 1.24; use the builtin object

In [431]: np.where(arr > 3, arr, None)
Out[431]: array([None, None, None, None, 4, 5, 6, 7, 8, 9], dtype=object)
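
As an aside on why the attempt in the question fails: apply_along_axis() passes each whole 1-D slice to the function, so the lambda receives the entire array, and x > 3 produces a boolean array that `if` cannot reduce to a single truth value. The where() approach can be wrapped in a small helper to get something close to the requested signature. This is only a minimal sketch; the name map_with_failure is hypothetical and not part of NumPy, and it assumes the predicate is vectorized (it takes the whole array and returns a boolean array).

```python
import numpy as np

def map_with_failure(values, predicate):
    """Hypothetical helper: keep elements where predicate holds, else None."""
    arr = np.asarray(values)
    # The predicate is applied to the whole array at once, so it must
    # return a boolean array with the same shape as arr.
    mask = np.asarray(predicate(arr), dtype=bool)
    # An object array is required because None is not a valid int or float.
    return np.where(mask, arr.astype(object), None)

result = map_with_failure(np.arange(10), lambda a: a > 3)
print(result)
```

The result has dtype=object, which is what makes it possible to store None, but it also means later arithmetic on it will be slower than on a numeric array.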

The code above creates a new array. To modify arr in place instead, you can use boolean indexing, arr[arr < 4] = None (as pointed out by @Chris Mueller), or putmask():

In [432]: np.putmask(arr, arr < 4, None)

In [433]: arr
Out[433]: array([None, None, None, None, 4, 5, 6, 7, 8, 9], dtype=object)
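
The two in-place variants can be compared side by side. The sketch below is only an illustration: it works on copies so that both results can be checked against each other, and the variable names are assumptions made for this example.

```python
import numpy as np

original = np.arange(10, dtype=object)

# Variant 1: boolean indexing assigns None wherever the mask is True.
by_index = original.copy()
by_index[by_index < 4] = None

# Variant 2: putmask() does the same through a function call.
# The mask is computed before putmask() modifies the array.
by_putmask = original.copy()
np.putmask(by_putmask, by_putmask < 4, None)

print(by_index)
print(by_putmask)
```

Both variants change only the array they are applied to, so `original` is left untouched here because each variant operates on its own copy.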

Unless you are required to use None as the "flag" value, I would suggest following @ev-br's recommendation and using np.nan instead. I will take that approach to assess performance:

In [434]: arr = np.arange(1000000, dtype=float)  # np.float was removed in NumPy 1.24; use the builtin float

In [435]: timeit np.where(arr > 3, arr, np.nan)
100 loops, best of 3: 3.61 ms per loop

In [436]: timeit arr[arr < 4] = np.nan
1000 loops, best of 3: 564 µs per loop

In [437]: timeit np.putmask(arr, arr < 4, np.nan)
1000 loops, best of 3: 1.08 ms per loop

Notice that I used a much larger array to highlight the efficiency differences. And the winner is... boolean indexing: in these timings it is roughly twice as fast as putmask() and about six times as fast as where().
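
To make the np.nan variant concrete, here is a minimal sketch, assuming a float array is acceptable for your data. Because NaN never compares equal to itself, np.isnan() is the way to locate the flagged entries afterwards. The variable names are assumptions made for this example.

```python
import numpy as np

arr = np.arange(10, dtype=float)

# New array: keep values greater than 3, flag the rest with NaN.
flagged = np.where(arr > 3, arr, np.nan)

# In-place alternative using boolean indexing.
arr[arr < 4] = np.nan

# NaN != NaN, so use np.isnan() rather than == to find flagged entries.
missing = np.isnan(flagged)
valid = flagged[~missing]

print(flagged)
print(valid)
```

Unlike the None version, the array keeps a numeric dtype, so vectorized arithmetic still works on it, and NaN simply propagates through most operations.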