michel michel - 2 months ago 29
Python Question

masking a series with a boolean array

This has given me a lot of trouble, and I am perplexed by the incompatibility of numpy arrays with pandas series. When I create a boolean array using a series, for instance

x = np.array([1,2,3,4,5,6,7])
y = pd.Series([1,2,3,4,5,6,7])
delta = np.percentile(x, 50)
deltamask = x- y > delta

delta mask creates a boolean pandas series.

However, if you do


You find that the array ignores completely the mask. No error is raised, but you end up with two objects of different length. This means that an operation like


results in an error:

print type(x-y)
print type(x[deltamask]), len(x[deltamask])
print type(y[deltamask]), len(y[deltamask])

Even more perplexing, I noticed that the operator < is treated differently. For instance

print type(2*x < x*y)
print type(2 < x*y)

will give you a pd.series and np.array respectively.


5 < x - y

results in a series, so it seems that the series takes precedence, whereas the boolean elements of a series mask are promoted to integers when passed to a numpy array and result in a sliced array.

What is the reason for this?


Fancy Indexing

As numpy currently stands, fancy indexing in numpy works as follows:

  1. If the thing between brackets is a tuple (whether with explicit parens or not), the elements of the tuple are indices for different dimensions of x. For example, both x[(True, True)] and x[True, True] will raise IndexError: too many indices for array in this case because x is 1D. However, before the exception happens, a telling warning will be raised too: VisibleDeprecationWarning: using a boolean instead of an integer will result in an error in the future.

  2. If the thing between brackets is exactly an ndarray, not a subclass or other array-like, and has a boolean type, it will be applied as a mask. This is why x[deltamask.values] gives the expected result (empty array since deltamask is all False.

  3. If the thing between brackets is any array-like, whether a subclass like Series or just a list, or something else, it is converted to an np.intp array (if possible) and used as an integer index. So x[deltamask] yeilds something equivalent to x[[False] * 7] or just x[[0] * 7]. In this case, len(deltamask)==7 and x[0]==1 so the result is [1, 1, 1, 1, 1, 1, 1].

This behavior is counterintuitive, and the FutureWarning: in the future, boolean array-likes will be handled as a boolean array index it generates indicates that a fix is in the works. I will update this answer as I find out about/make any changes to numpy.

This information can be found in Sebastian Berg's response to my initial query on Numpy discussion here.

Relational Operators

Now let's address the second part of your question about how the comparison works. Relational operators (<, >, <=, >=) work by calling the corresponding method on one of the objects being compared. For < this is __lt__. However, instead of just calling x.__lt__(y) for the expression x < y, Python actually checks the types of the objects being compared. If y is a subtype of x that implements the comparison, then Python prefers to call y.__gt__(x) instead, regardless of how you wrote the original comparison. The only way that x.__lt__(y) will get called if y is a subclass of x is if y.__gt__(x) returns NotImplemented to indicate that the comparison is not supported in that direction.

A similar thing happens when you do 5 < x - y. While ndarray is not a subclass of int, the comparison int.__lt__(ndarray) returns NotImplemented, so Python actually ends up calling (x - y).__gt__(5), which is of course defined and works just fine.

A much more succinct explanation of all this can be found in the Python docs.