This has given me a lot of trouble, and I am perplexed by the incompatibility of numpy arrays with pandas series. When I create a boolean array using a series, for instance
import numpy as np
import pandas as pd

x = np.array([1, 2, 3, 4, 5, 6, 7])
y = pd.Series([1, 2, 3, 4, 5, 6, 7])
delta = np.percentile(x, 50)
deltamask = x - y > delta  # boolean pandas Series, not an ndarray
print(type(x[deltamask]), len(x[deltamask]))
print(type(y[deltamask]), len(y[deltamask]))
print(type(2*x < x*y))
print(type(2 < x*y))
5 < x - y
As numpy currently stands, fancy indexing works as follows:
If the thing between brackets is a tuple (whether with explicit parens or not), the elements of the tuple are indices for different dimensions of x. For example, both x[(True, True)] and x[True, True] will raise "IndexError: too many indices for array" in this case because x is 1D. However, before the exception happens, a telling warning will be raised too: "VisibleDeprecationWarning: using a boolean instead of an integer will result in an error in the future."
If the thing between brackets is exactly an ndarray, not a subclass or other array-like, and has a boolean type, it will be applied as a mask. This is why x[deltamask.values] gives the expected result (an empty array, since deltamask is all False).
If the thing between brackets is any other array-like, whether a subclass like Series, just a list, or something else, it is converted to an np.intp array (if possible) and used as an integer index. So x[deltamask] yields something equivalent to x[[False] * 7] or just x[[0] * 7]. In this case, x[0] == 1, so the result is [1, 1, 1, 1, 1, 1, 1].
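The difference between the mask case and the integer-index case can be reproduced explicitly. This is only a sketch using the x, y, and deltamask from the question; the cast to np.intp is written out by hand to imitate the conversion described above, since newer NumPy releases may treat x[deltamask] as a mask directly. The names masked, int_index, and as_integers are illustrative, not part of any API.

```python
import numpy as np
import pandas as pd

x = np.array([1, 2, 3, 4, 5, 6, 7])
y = pd.Series([1, 2, 3, 4, 5, 6, 7])
delta = np.percentile(x, 50)   # median of x
deltamask = x - y > delta      # pandas Series of seven False values

# A genuine boolean ndarray is applied as a mask: nothing is selected.
masked = x[deltamask.values]

# Imitating the conversion to np.intp: every False becomes index 0,
# so element 0 of x is picked seven times.
int_index = np.asarray(deltamask, dtype=np.intp)
as_integers = x[int_index]

print(masked)       # empty array
print(as_integers)  # seven copies of x[0]
```

Passing deltamask.values (or an explicit boolean ndarray) keeps the result independent of how a given NumPy version interprets array-likes.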
This behavior is counterintuitive, and the warning it generates ("FutureWarning: in the future, boolean array-likes will be handled as a boolean array index") indicates that a fix is in the works. I will update this answer as I find out about or make any changes to numpy.
This information can be found in Sebastian Berg's response to my initial query on Numpy discussion here.
Now let's address the second part of your question about how the comparison works. Relational operators (<, >, <=, >=) work by calling the corresponding method on one of the objects being compared. For < this is __lt__. However, instead of just calling x.__lt__(y) for the expression x < y, Python actually checks the types of the objects being compared. If the type of y is a subclass of the type of x and implements the comparison, then Python prefers to call y.__gt__(x) instead, regardless of how you wrote the original comparison. The only way that x.__lt__(y) will get called when y is an instance of a subclass of x's type is if y.__gt__(x) returns NotImplemented to indicate that the comparison is not supported in that direction.
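The dispatch order described above can be observed with a few toy classes. This is a minimal sketch; Base, Eager, Shy, and the calls list are hypothetical names made up for the illustration and have nothing to do with numpy or pandas internals.

```python
calls = []  # records which comparison methods Python actually invokes


class Base:
    def __lt__(self, other):
        calls.append("Base.__lt__")
        return True


class Eager(Base):
    # Subclass that supports the reflected comparison.
    def __gt__(self, other):
        calls.append("Eager.__gt__")
        return True


class Shy(Base):
    # Subclass that declines the reflected comparison.
    def __gt__(self, other):
        calls.append("Shy.__gt__")
        return NotImplemented


Base() < Eager()          # right operand is a subclass instance
eager_calls = list(calls)

calls.clear()
Base() < Shy()            # reflected method declines, so Python falls back
shy_calls = list(calls)

print(eager_calls)  # ['Eager.__gt__']
print(shy_calls)    # ['Shy.__gt__', 'Base.__lt__']
```

In the first comparison Base.__lt__ is never called, even though the expression was written with the Base instance on the left.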
A similar thing happens when you do 5 < x - y. While ndarray is not a subclass of int, the comparison (5).__lt__(x - y) returns NotImplemented, so Python actually ends up calling (x - y).__gt__(5), which is of course defined and works just fine.
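The fallback can be checked by calling the methods by hand. In this sketch a small plain ndarray named diff stands in for x - y (an assumption made so that the comparison has a mix of True and False results); forward, reflected, and result are likewise illustrative names.

```python
import numpy as np

diff = np.array([3, 6, 9])  # stand-in for x - y

# int does not know how to compare itself with an ndarray.
forward = (5).__lt__(diff)

# The reflected method on the array does the elementwise work.
reflected = diff.__gt__(5)

# The operator form ends up with the same answer as the reflected call.
result = 5 < diff

print(forward is NotImplemented)
print(result)
```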
A much more succinct explanation of all this can be found in the Python docs.