Nikita Sivukhin - 5 months ago 337

Python Question

I recently start working with

`pandas`

`.corrwith()`

`Series`

`DataFrame`

Suppose i have one

`DataFrame`

`frame = pd.DataFrame(data={'a':[1,2,3], 'b':[-1,-2,-3], 'c':[10, -10, 10]})`

And i want calculate correlation between features 'a' and all other features.

I can do it in the following way:

`frame.drop(labels='a', axis=1).corrwith(frame['a'])`

And result will be:

`b -1.0`

c 0.0

But very similar code:

`frame.drop(labels='a', axis=1).corrwith(frame[['a']])`

Generate absolutely different and unacceptable table:

`a NaN`

b NaN

c NaN

So, my question is: why in case of

`DataFrame`

Answer

Let's say your frame is:

```
frame = pd.DataFrame(np.random.rand(10, 6), columns=['cost', 'amount', 'day', 'month', 'is_sale', 'hour'])
```

You want the `'cost'`

and `'amount'`

columns to be correlated with all other columns in every combination.

```
focus_cols = ['cost', 'amount']
frame.corr().filter(focus_cols).drop(focus_cols)
```

Compute pairwise correlation between rows or columns of two DataFrame objects.

Parameters:

other: DataFrameaxis : {0 or ‘index’, 1 or ‘columns’},

default 0 0 or ‘index’ to compute column-wise, 1 or ‘columns’ for row-wise drop : boolean, default False Drop missing indices from result, default returns union of all Returns: correls : Series

`corrwith`

is behaving similarly to `add`

, `sub`

, `mul`

, `div`

in that it expects to find a `DataFrame`

or a `Series`

being passed in `other`

despite the documentation saying just `DataFrame`

.

When `other`

is a `Series`

it broadcast that series and matches along the axis specified by `axis`

, default is 0. This is why the following worked:

```
frame.drop(labels='a', axis=1).corrwith(frame.a)
b -1.0
c 0.0
dtype: float64
```

When `other`

is a `DataFrame`

it will match the axis specified by `axis`

and correlate each pair identified by the other axis. If we did:

```
frame.drop('a', axis=1).corrwith(frame.drop('b', axis=1))
a NaN
b NaN
c 1.0
dtype: float64
```

Only `c`

was in common and only `c`

had its correlation calculated.

In the case you specified:

```
frame.drop(labels='a', axis=1).corrwith(frame[['a']])
```

`frame[['a']]`

is a `DataFrame`

because of the `[['a']]`

and now plays by the `DataFrame`

rules in which its columns must match up with what its being correlated with. But you explicitly drop `a`

from the first frame then correlate with a `DataFrame`

with nothing but `a`

. The result is `NaN`

for every column.