Heinz - 4 months ago 26

Python Question

I have a NumPy array like this:

`[[3.4, 87], [5.5, 11], [22, 3], [4, 9.8], [41, 11.22], [32, 7.6]]`

and I want to:

- compare the elements in column 2, **3 rows at a time**
- delete the row with the biggest value in column 2 within each group of 3 rows

For example, in the first 3 rows, the 3 values in column 2 are `87`, `11` and `3`, so the row containing `87` is deleted and the rows with `11` and `3` remain.

The output NumPy array I expect would be:

`[[5.5, 11], [22, 3], [4, 9.8], [32, 7.6]]`

I am new to NumPy arrays, so any advice on how to achieve this would be appreciated.

Answer

```
import numpy as np
x = np.array([[3.4, 87],
              [5.5, 11],
              [22, 3],
              [4, 9.8],
              [41, 11.22],
              [32, 7.6]])
y = x.reshape(-1,3,2)
idx = y[..., 1].argmax(axis=1)
mask = np.arange(3)[None, :] != idx[:, None]
y = y[mask]
print(y)
# This might be helpful for the deleted part of your question
# y = y.reshape(-1,2,2)
# z = y[...,1]/y[...,1].sum(axis=1)
# result = np.dstack([y, z[...,None]])
```

yields

```
[[  5.5  11. ]
 [ 22.    3. ]
 [  4.    9.8]
 [ 32.    7.6]]
```

"Grouping by three" with NumPy can be done by reshaping the array to create a new axis of length 3 -- provided the original number of rows is divisible by 3:

```
In [92]: y = x.reshape(-1,3,2); y
Out[92]:
array([[[  3.4 ,  87.  ],
        [  5.5 ,  11.  ],
        [ 22.  ,   3.  ]],

       [[  4.  ,   9.8 ],
        [ 41.  ,  11.22],
        [ 32.  ,   7.6 ]]])
In [93]: y.shape
Out[93]: (2, 3, 2)
          |  |  |
          |  |  o--- 2 columns in each group
          |  o------ 3 rows in each group
          o--------- 2 groups
```
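If the number of rows is not divisible by 3, `reshape` raises an error. One possible workaround, sketched below, is to reshape only the rows that fill complete groups and set the leftover rows aside. Keeping the leftover rows untouched is an assumption here, not something the question specifies, and the extra seventh row is made up for illustration.

```python
import numpy as np

x = np.array([[3.4, 87],
              [5.5, 11],
              [22, 3],
              [4, 9.8],
              [41, 11.22],
              [32, 7.6],
              [1, 2]])   # hypothetical 7th row: 7 is not divisible by 3

group_size = 3
# Number of rows that fit into complete groups of 3
n_full = (x.shape[0] // group_size) * group_size

head = x[:n_full]   # rows that can be grouped
tail = x[n_full:]   # leftover rows (assumption: handled separately)

y = head.reshape(-1, group_size, x.shape[1])
print(y.shape)      # (2, 3, 2)
print(tail.shape)   # (1, 2)
```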

For each group, we can select the second column and find the row with the maximum value:

```
In [94]: idx = y[..., 1].argmax(axis=1); idx
Out[94]: array([0, 1])
```

`array([0, 1])` indicates that in the first group, the row at index 0 contains the maximum (i.e. 87), and in the second group, the row at index 1 contains the maximum (i.e. 11.22).
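As a quick sanity check, the positions returned by `argmax` can be used to look up the values they point at. This is only a sketch; the names `second_col` and `picked` are introduced here for illustration.

```python
import numpy as np

x = np.array([[3.4, 87],
              [5.5, 11],
              [22, 3],
              [4, 9.8],
              [41, 11.22],
              [32, 7.6]])

y = x.reshape(-1, 3, 2)
second_col = y[..., 1]             # shape (2, 3): column 2 of each group
idx = second_col.argmax(axis=1)    # position of the maximum within each group

# Pick, for each group, the value sitting at its argmax position
picked = second_col[np.arange(len(idx)), idx]

print(idx.tolist())      # [0, 1]
print(picked.tolist())   # [87.0, 11.22]
```

Note that `argmax` returns the first occurrence when a group contains tied maxima, so exactly one row per group is selected.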

Next, we can generate a 2D boolean selection mask which is True where the rows do not contain the maximum value:

```
In [95]: mask = np.arange(3)[None, :] != idx[:, None]; mask
Out[95]:
array([[False,  True,  True],
       [ True, False,  True]], dtype=bool)
In [96]: mask.shape
Out[96]: (2, 3)
```

`mask` has shape (2,3). `y` has shape (2,3,2). If `mask` is used to index `y`, as in `y[mask]`, then the mask is aligned with the first two axes of `y`, and all values where `mask` is `True` are returned:

```
In [98]: y[mask]
Out[98]:
array([[  5.5,  11. ],
       [ 22. ,   3. ],
       [  4. ,   9.8],
       [ 32. ,   7.6]])
In [99]: y[mask].shape
Out[99]: (4, 2)
```
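The steps above can be collected into one small function. The sketch below uses a hypothetical helper name, `drop_group_max`, with hypothetical parameters `group_size` and `col`; it assumes the number of rows is divisible by the group size and raises an error otherwise.

```python
import numpy as np

def drop_group_max(x, group_size=3, col=1):
    """Hypothetical helper: from each consecutive block of `group_size` rows,
    drop the row holding the largest value in column `col`."""
    n_rows, n_cols = x.shape
    if n_rows % group_size != 0:
        raise ValueError("number of rows must be divisible by group_size")
    y = x.reshape(-1, group_size, n_cols)        # (groups, rows per group, columns)
    idx = y[..., col].argmax(axis=1)             # row position of the max in each group
    mask = np.arange(group_size)[None, :] != idx[:, None]  # True for rows to keep
    return y[mask]                               # shape (groups * (group_size - 1), columns)

x = np.array([[3.4, 87],
              [5.5, 11],
              [22, 3],
              [4, 9.8],
              [41, 11.22],
              [32, 7.6]])

result = drop_group_max(x)
print(result.tolist())
# [[5.5, 11.0], [22.0, 3.0], [4.0, 9.8], [32.0, 7.6]]
```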

By the way, the same calculation could be done using Pandas like this:

```
import numpy as np
import pandas as pd
x = np.array([[3.4, 87],
              [5.5, 11],
              [22, 3],
              [4, 9.8],
              [41, 11.22],
              [32, 7.6]])
df = pd.DataFrame(x)
idx = df.groupby(df.index // 3)[1].idxmax()
# drop the row with the maximum value in each group
df = df.drop(idx.values, axis=0)
```

which yields the DataFrame:

```
      0     1
1   5.5  11.0
2  22.0   3.0
3   4.0   9.8
5  32.0   7.6
```

You might find Pandas syntax easier to use, but for the above calculation NumPy is faster.
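To check the speed difference on your own machine, something like the following rough sketch could be used. The synthetic data, the array size, the function names `with_numpy` and `with_pandas`, and the repeat count are all arbitrary choices made for illustration; actual timings depend on the data and on the library versions installed.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.random((30_000, 2))   # synthetic data; row count divisible by 3

def with_numpy(x):
    # Reshape into groups of 3 and mask out the max row of each group
    y = x.reshape(-1, 3, 2)
    idx = y[..., 1].argmax(axis=1)
    mask = np.arange(3)[None, :] != idx[:, None]
    return y[mask]

def with_pandas(x):
    # Group by integer-divided index and drop the idxmax row of each group
    df = pd.DataFrame(x)
    idx = df.groupby(df.index // 3)[1].idxmax()
    return df.drop(idx.values, axis=0).to_numpy()

# Both approaches should keep the same rows
assert np.array_equal(with_numpy(x), with_pandas(x))

print("numpy :", timeit.timeit(lambda: with_numpy(x), number=20))
print("pandas:", timeit.timeit(lambda: with_pandas(x), number=20))
```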