michel - 1 year ago 178

Python Question

This has given me a lot of trouble, and I am perplexed by the incompatibility of numpy arrays with pandas series. When I create a boolean array using a series, for instance

`import numpy as np`

`import pandas as pd`

`x = np.array([1,2,3,4,5,6,7])`

`y = pd.Series([1,2,3,4,5,6,7])`

`delta = np.percentile(x, 50)`

`deltamask = x - y > delta`

`deltamask` is a boolean pandas Series.

However, if you do

`x[deltamask]`

`y[deltamask]`

you find that the array completely ignores the mask. No error is raised, but you end up with two objects of different lengths. This means that an operation like

`x[deltamask]*y[deltamask]`

results in an error:

`print(type(x - y))`

`print(type(x[deltamask]), len(x[deltamask]))`

`print(type(y[deltamask]), len(y[deltamask]))`

Even more perplexing, I noticed that the operator < is treated differently. For instance

`print(type(2*x < x*y))`

`print(type(2 < x*y))`

will give you a `pd.Series` and an `np.ndarray` respectively.

Also,

`5 < x - y`

results in a Series, so it seems that the Series takes precedence. On the other hand, the boolean elements of a Series mask are promoted to integers when passed to a numpy array, and the result is a sliced array.

What is the reason for this?

Answer Source

**Fancy Indexing**

As numpy currently stands, fancy indexing works as follows:

- If the thing between brackets is a `tuple` (whether with explicit parens or not), the elements of the tuple are indices for different dimensions of `x`. For example, both `x[(True, True)]` and `x[True, True]` will raise `IndexError: too many indices for array` in this case because `x` is 1D. However, before the exception happens, a telling warning is raised too: `VisibleDeprecationWarning: using a boolean instead of an integer will result in an error in the future`.
- If the thing between brackets is **exactly** an `ndarray`, not a subclass or other array-like, and has a boolean type, it is applied as a mask. This is why `x[deltamask.values]` gives the expected result (an empty array, since `deltamask` is all `False`).
- If the thing between brackets is any other array-like, whether a subclass like `Series`, a `list`, or something else, it is converted to an `np.intp` array (if possible) and used as an integer index. So `x[deltamask]` yields something equivalent to `x[[False] * 7]`, or just `x[[0] * 7]`. In this case, `len(deltamask) == 7` and `x[0] == 1`, so the result is `[1, 1, 1, 1, 1, 1, 1]`.
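A minimal sketch of the workaround these rules imply, assuming numpy and pandas are installed. Converting the mask to a plain boolean `ndarray` before indexing should make numpy apply it as a mask rather than as integer indices. The names `mask`, `x_sel`, `y_sel` and `above` are illustrative only, and `np.asarray(..., dtype=bool)` is used here as an alternative spelling of `deltamask.values`.

```python
import numpy as np
import pandas as pd

x = np.array([1, 2, 3, 4, 5, 6, 7])
y = pd.Series([1, 2, 3, 4, 5, 6, 7])

delta = np.percentile(x, 50)  # median of x
deltamask = x - y > delta     # boolean Series; all False because x - y is all zeros

# Hand numpy a plain boolean ndarray instead of the Series itself.
mask = np.asarray(deltamask, dtype=bool)

x_sel = x[mask]  # ndarray filtered by the mask
y_sel = y[mask]  # Series filtered by the same mask

# A mask with some True entries, built the same way (illustrative).
above = x[np.asarray(y > delta, dtype=bool)]

print(type(x_sel), len(x_sel))
print(type(y_sel), len(y_sel))
print(above)
```

With the mask converted this way, `x_sel` and `y_sel` have the same length, so elementwise operations between them line up.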

This behavior is counterintuitive, and the `FutureWarning: in the future, boolean array-likes will be handled as a boolean array index` it generates indicates that a fix is in the works. I will update this answer as I find out about/make any changes to numpy.

This information comes from Sebastian Berg's response to my initial query on Numpy discussion.

**Relational Operators**

Now let's address the second part of your question, about how the comparison works. Relational operators (`<`, `>`, `<=`, `>=`) work by calling the corresponding method on one of the objects being compared. For `<` this is `__lt__`. However, instead of just calling `x.__lt__(y)` for the expression `x < y`, Python first checks the types of the objects being compared. If the type of `y` is a subclass of the type of `x` and implements the comparison, then Python prefers to call the reflected method `y.__gt__(x)` instead, regardless of how you wrote the original comparison. In that case, `x.__lt__(y)` gets called only if `y.__gt__(x)` returns `NotImplemented` to indicate that the comparison is not supported in that direction.
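The dispatch order described above can be observed with a few toy classes. This is only a sketch in plain Python 3. `Base`, `Child` and `Deferring` are hypothetical names invented for the illustration and have nothing to do with numpy or pandas internals.

```python
class Base:
    def __lt__(self, other):
        return "Base.__lt__"

    def __gt__(self, other):
        return "Base.__gt__"


class Child(Base):
    # The subclass supplies its own reflected method for `<`.
    def __gt__(self, other):
        return "Child.__gt__"


class Deferring(Base):
    # Signals that the comparison is not supported in this direction.
    def __gt__(self, other):
        return NotImplemented


left = Base()

# Right operand is an instance of a subclass: its __gt__ is tried first.
result_child = left < Child()

# The subclass declines, so Python falls back to Base.__lt__.
result_deferring = left < Deferring()

print(result_child)
print(result_deferring)
```

The methods return strings purely to show which one ran. Real comparison methods would return booleans or boolean arrays.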

A similar thing happens when you do `5 < x - y`. While `ndarray` is not a subclass of `int`, the comparison `int.__lt__(ndarray)` returns `NotImplemented`, so Python actually ends up calling `(x - y).__gt__(5)`, which is of course defined and works just fine.
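The same fallback can be seen without numpy. In this small sketch, `Tracker` is a hypothetical stand-in for any type that `int` does not know how to compare against.

```python
class Tracker:
    def __gt__(self, other):
        # Reached only after int.__lt__ has declined the comparison.
        return "Tracker.__gt__ called with {!r}".format(other)


t = Tracker()

# Calling the method directly shows that int declines the comparison.
direct = (5).__lt__(t)

# The operator form falls through to the reflected method on the right operand.
result = 5 < t

print(direct is NotImplemented)
print(result)
```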

A much more succinct explanation of all this can be found in the Python docs.