 michel -4 years ago 403
Python Question

# masking a series with a boolean array

This has given me a lot of trouble, and I am perplexed by the incompatibility of numpy arrays with pandas series. When I create a boolean array using a series, for instance

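```python
import numpy as np
import pandas as pd

x = np.array([1, 2, 3, 4, 5, 6, 7])
y = pd.Series([1, 2, 3, 4, 5, 6, 7])
delta = np.percentile(x, 50)  # median of x
deltamask = x - y > delta     # x - y is all zeros, so every entry is False
```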

`deltamask` is a boolean pandas Series.

However, if you do

```python
x[deltamask]
y[deltamask]
```

You find that the array ignores the mask completely. No error is raised, but you end up with two objects of different lengths. This means that an operation like

```python
x[deltamask] * y[deltamask]
```

results in an error:

```python
print(type(x - y))
print(type(x[deltamask]), len(x[deltamask]))
print(type(y[deltamask]), len(y[deltamask]))
```

Even more perplexing, I noticed that the operator < is treated differently. For instance

```python
print(type(2 * x < x * y))
print(type(2 < x * y))
```

will give you a `pd.Series` and an `np.ndarray` respectively.

Also,

```python
5 < x - y
```

results in a Series, so it seems that the Series takes precedence. On the other hand, when a boolean Series is used to index a NumPy array, its elements are promoted to integers and the result is an integer-indexed array rather than a masked one.

What is the reason for this?

Mad Physicist
Answer Source

## Fancy Indexing

As numpy currently stands, fancy indexing in numpy works as follows:

1. If the thing between brackets is a `tuple` (whether with explicit parens or not), the elements of the tuple are indices for different dimensions of `x`. For example, both `x[(True, True)]` and `x[True, True]` will raise `IndexError: too many indices for array` in this case because `x` is 1D. However, before the exception happens, a telling warning will be raised too: `VisibleDeprecationWarning: using a boolean instead of an integer will result in an error in the future`.

2. If the thing between brackets is exactly an `ndarray`, not a subclass or other array-like, and has a boolean type, it will be applied as a mask. This is why `x[deltamask.values]` gives the expected result (an empty array, since `deltamask` is all `False`).

3. If the thing between brackets is any array-like, whether a subclass like `Series` or just a `list`, or something else, it is converted to an `np.intp` array (if possible) and used as an integer index. So `x[deltamask]` yields something equivalent to `x[[False] * 7]` or just `x[[0] * 7]`. In this case, `len(deltamask)==7` and `x[0]==1`, so the result is `[1, 1, 1, 1, 1, 1, 1]`.

This behavior is counterintuitive, and the `FutureWarning: in the future, boolean array-likes will be handled as a boolean array index` it generates indicates that a fix is in the works. I will update this answer as I find out about/make any changes to numpy.
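The two interpretations in the list above can be made explicit without relying on how any particular NumPy release treats a `Series` used as an index. The following is a minimal sketch, assuming `x` and `deltamask` are built as in the question; the names `as_mask` and `as_index` are illustrative only and are not part of either library.

```python
import numpy as np
import pandas as pd

x = np.array([1, 2, 3, 4, 5, 6, 7])
y = pd.Series([1, 2, 3, 4, 5, 6, 7])
deltamask = x - y > np.percentile(x, 50)  # boolean Series, all False

# Case 2: a plain boolean ndarray is applied as a mask.
# Every entry is False, so nothing is selected.
as_mask = x[deltamask.values]

# Case 3: the booleans are cast to integers (False -> 0),
# so element x[0] is picked once per mask entry.
as_index = x[deltamask.values.astype(np.intp)]

print(len(as_mask))
print(as_index)
```

Passing `deltamask.values` explicitly, as in the first indexing expression, avoids depending on how the array-like is interpreted.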

This information can be found in Sebastian Berg's response to my initial query on the NumPy discussion list.

## Relational Operators

Now let's address the second part of your question about how the comparison works. Relational operators (`<`, `>`, `<=`, `>=`) work by calling the corresponding method on one of the objects being compared. For `<` this is `__lt__`. However, instead of just calling `x.__lt__(y)` for the expression `x < y`, Python first checks the types of the objects being compared. If the type of `y` is a subclass of the type of `x` and implements the comparison, then Python prefers to call the reflected method `y.__gt__(x)` instead, regardless of how you wrote the original comparison. In that situation, `x.__lt__(y)` only gets called if `y.__gt__(x)` returns `NotImplemented` to indicate that the comparison is not supported in that direction.
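The subclass rule can be observed with two toy classes. This is a small sketch; `Base`, `Sub`, and the `calls` list are hypothetical names introduced only to record which method Python invokes.

```python
calls = []  # records which comparison method actually ran


class Base:
    def __lt__(self, other):
        calls.append("Base.__lt__")
        return True


class Sub(Base):
    def __gt__(self, other):
        calls.append("Sub.__gt__")
        return True


# Written as "Base < Sub", but the right operand is an instance of a
# subclass, so Python tries the reflected method Sub.__gt__ first.
result = Base() < Sub()

print(calls)
```

Because `Sub.__gt__` returns a real value rather than `NotImplemented`, `Base.__lt__` is never consulted.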

A similar thing happens when you do `5 < x - y`. While `ndarray` is not a subclass of `int`, the comparison `int.__lt__(ndarray)` returns `NotImplemented`, so Python actually ends up calling `(x - y).__gt__(5)`, which is of course defined and works just fine.
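The fallback for `5 < x - y` can be checked directly. The sketch below uses a small `Series` named `s` as a stand-in for `x - y`; the values are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 6, 10])  # stand-in for the Series produced by x - y

# int does not know how to compare itself with a Series,
# so the forward method declines.
forward = int.__lt__(5, s)
print(forward is NotImplemented)

# Python then falls back to the reflected method on the right operand,
# which is why the result is a Series.
result = 5 < s
reflected = s.__gt__(5)

print(type(result).__name__)
print(result.equals(reflected))
```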

A much more succinct explanation of all this can be found in the Python docs.
