SiLiKhon - 3 years ago
Python Question

Working with structured object arrays in NumPy

Say, I have an array of (x, y) points of the following structure:

arr = np.array([([1.     ], [2.     ]),
                ([1., 93.], [5., 46.]),
                ([4.     ], [3.     ])],
               dtype=[('x','O'), ('y', 'O')])


i.e. these points are grouped into such innermost arrays. The size of the innermost array may be arbitrary, but it is always the same for x and y.

I want to be able to perform two things:

a) Expand the innermost arrays by concatenating their content, so for the above example the result looks like:

np.array([( 1.,  2.),
          ( 1.,  5.),
          (93., 46.),
          ( 4.,  3.)],
         dtype=[('x','f8'), ('y','f8')])


b) For each (outermost) entry, select the element with, say, the largest y:

np.array([( 1.,  2.),
          (93., 46.),
          ( 4.,  3.)],
         dtype=[('x','f8'), ('y','f8')])


I believe there should be a way of doing this efficiently without using ugly for loops. I would appreciate any help.

UPD (a and b using ugly loops):

(arr is the array defined in the beginning of the post)

a)

np.array([(x_, y_) for x, y in arr for x_, y_ in zip(x, y)], dtype=[('x','f8'), ('y','f8')])


b)

np.array([(x[np.argmax(np.array(y))], y[np.argmax(np.array(y))]) for x, y in arr],dtype=[('x','f8'), ('y','f8')])


Another problem is that in reality I have not just two fields (x and y) but 77 fields of various types (floats, integers, booleans), so these expressions would grow to many lines.

Answer Source

Using Pandas, you could store your data in a flat DataFrame, with a group column indicating which row of the original array each point came from:

import numpy as np
import pandas as pd
df = pd.DataFrame([
    (0, 1, 2),
    (1, 1, 5),
    (1, 93, 46),
    (2, 4, 3)], dtype='f8', columns=['group', 'x', 'y'])
print(df)
#    group     x     y
# 0    0.0   1.0   2.0
# 1    1.0   1.0   5.0
# 2    1.0  93.0  46.0
# 3    2.0   4.0   3.0

Then the first operation is merely a slice of the x and y columns:

print(df[['x','y']])
#       x     y
# 0   1.0   2.0
# 1   1.0   5.0
# 2  93.0  46.0
# 3   4.0   3.0

and the second operation can be done using groupby/idxmax:

print(df.loc[df.groupby('group')['y'].idxmax(), ['x', 'y']])
#       x     y
# 0   1.0   2.0
# 2  93.0  46.0
# 3   4.0   3.0
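
For comparison, the same per-group selection can be sketched in plain NumPy. This is only an illustration under stated assumptions: the flat arrays `group`, `x` and `y` below are hypothetical stand-ins holding the same values as the DataFrame above, and sorting with `np.lexsort` is one possible technique, not the method used in this answer.

```python
import numpy as np

# Hypothetical flat arrays with the same values as the DataFrame above.
group = np.array([0, 1, 1, 2])
x = np.array([1., 1., 93., 4.])
y = np.array([2., 5., 46., 3.])

# Sort by group first, then by y within each group
# (np.lexsort treats the LAST key as the primary one).
order = np.lexsort((y, group))
sorted_group = group[order]

# The last position of every run of equal group labels holds that group's largest y.
last_in_group = np.append(np.flatnonzero(np.diff(sorted_group)), len(order) - 1)
selected = order[last_in_group]

print(x[selected], y[selected])
```

Note that on ties this picks the last of the tied rows in a group, whereas `idxmax` returns the first.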

Given the structured NumPy array, arr, you're going to have to loop through the lists at least once to perform any of these operations. So you might as well pay the price once to organize the data in a better data structure, such as a Pandas DataFrame.
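
If you would rather stay in NumPy, the single pass over the lists could look roughly like the sketch below. It loops over the field names instead of over the rows, so it should not grow with the number of fields. The names `lengths`, `flat` and `group` are made up for this illustration, and casting every field to 'f8' is an assumption that would need adjusting for integer or boolean fields.

```python
import numpy as np

arr = np.array([([1.     ], [2.     ]),
                ([1., 93.], [5., 46.]),
                ([4.     ], [3.     ])],
               dtype=[('x','O'), ('y', 'O')])

# Number of points in each outer entry, read from the first field
# (every field is assumed to have the same length within an entry).
first = arr.dtype.names[0]
lengths = np.array([len(v) for v in arr[first]])

# Assumption: every field is stored as float64 in the flat array.
flat = np.empty(lengths.sum(), dtype=[(name, 'f8') for name in arr.dtype.names])
for name in arr.dtype.names:
    # Join the inner lists of this field end to end.
    flat[name] = np.concatenate(arr[name].tolist())

# Index of the outer entry that each flattened row came from.
group = np.repeat(np.arange(len(arr)), lengths)

print(flat)
print(group)
```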

Here is one way you could convert arr to df:

import numpy as np
import pandas as pd

arr = np.array([([1.     ], [2.     ]),
                ([1., 93.], [5., 46.]),
                ([4.     ], [3.     ])],
               dtype=[('x','O'), ('y', 'O')])

df = pd.DataFrame(arr)
df = (pd.concat({col: df[col].apply(pd.Series).stack() for col in df}, axis=1)
      .reset_index(drop=True))
print(df)

yields

      x     y
0   1.0   2.0
1   1.0   5.0
2  93.0  46.0
3   4.0   3.0
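
The conversion above discards the information about which outer entry each row came from, and the groupby/idxmax step needs that information. One possible way to keep it is `DataFrame.explode`, sketched below. This assumes a pandas version that accepts a list of columns in `explode` (1.3 or later); the `astype('f8')` cast is an assumption that fits this two-field example and would have to be done per column for mixed types.

```python
import numpy as np
import pandas as pd

arr = np.array([([1.     ], [2.     ]),
                ([1., 93.], [5., 46.]),
                ([4.     ], [3.     ])],
               dtype=[('x','O'), ('y', 'O')])

# explode repeats the original row index for every element of the inner lists,
# so the index can be turned into the group column.
df = pd.DataFrame(arr).explode(['x', 'y']).astype('f8')
df = df.rename_axis('group').reset_index()

# Operation (b): the row with the largest y within each group.
best = df.loc[df.groupby('group')['y'].idxmax(), ['x', 'y']]
print(df)
print(best)
```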