mortysporty - 1 year ago
Python Question

Cast dataframe of tuples to numpy matrix in Python

I have a dataframe

import pandas as pd
df = pd.DataFrame({'age': [(1, 2), (1, 3), (1, 1)],
                   'year': [(20, 30), (30, 40), (30, 40)]})
df
Out[58]:
      age      year
0  (1, 2)  (20, 30)
1  (1, 3)  (30, 40)
2  (1, 1)  (30, 40)


I want to convert this to a NumPy array like this:

array([[ 1,  2, 20, 30],
       [ 1,  3, 30, 40],
       [ 1,  1, 30, 40]])


i.e. each row in the dataframe becomes a row in the matrix, and each tuple column in the dataframe becomes two columns in the matrix. There could conceivably be more tuple columns in the dataframe (resulting in more columns in the array).

So, if col_names is a list of the column names (here col_names = ['age', 'year']), I want something like

numpy_array = some_clever_expression(col_names)

Answer

Stack with np.concatenate to get a 1D flattened array and then reshape -

np.concatenate(np.concatenate(df.values)).reshape(df.shape[0],-1)

Sample output -

In [460]: np.concatenate(np.concatenate(df.values)).reshape(df.shape[0],-1)
Out[460]: 
array([[ 1,  2, 20, 30],
       [ 1,  3, 30, 40],
       [ 1,  1, 30, 40]])
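
To see why two concatenations are needed, here is a small sketch that inspects each intermediate step. It assumes df is built exactly as in the question; the names values, tuples_1d, flat and result are only illustrative.

```python
import numpy as np
import pandas as pd

# Same frame as in the question
df = pd.DataFrame({'age': [(1, 2), (1, 3), (1, 1)],
                   'year': [(20, 30), (30, 40), (30, 40)]})

# 2D object array: one tuple per cell
values = df.values

# First concatenate: joins the rows into a 1D object array of tuples
tuples_1d = np.concatenate(values)

# Second concatenate: joins the tuples themselves into a flat numeric array
flat = np.concatenate(tuples_1d)

# Reshape so that each dataframe row becomes one matrix row
result = flat.reshape(df.shape[0], -1)

print(values.shape, tuples_1d.shape, flat.shape, result.shape)
```

The final reshape relies on every row contributing the same total number of values; if the tuples had different lengths from row to row, the reshape would fail or misalign the data.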

Alternatively, we could use np.hstack to get the flattened version, which then needs the same reshape -

np.hstack(np.hstack(df.values))
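
As a quick check, assuming df is the frame from the question, the following sketch compares the np.hstack variant with the np.concatenate one after reshaping both. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [(1, 2), (1, 3), (1, 1)],
                   'year': [(20, 30), (30, 40), (30, 40)]})

# Flatten with hstack, then reshape to one matrix row per dataframe row
via_hstack = np.hstack(np.hstack(df.values)).reshape(df.shape[0], -1)

# Reference result from the concatenate-based expression
via_concat = np.concatenate(np.concatenate(df.values)).reshape(df.shape[0], -1)

# True if both approaches produce the same matrix
same = np.array_equal(via_hstack, via_concat)
print(same)
```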

To select specific columns, simply index into those columns, get the array data and proceed as before. Thus, for a list of column names in col_names, use df[col_names].values instead of df.values.
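
Putting this together, one possible shape for the some_clever_expression asked for in the question is sketched below. The helper name tuples_to_matrix is hypothetical, and it takes the dataframe as an explicit argument rather than relying on a global df.

```python
import numpy as np
import pandas as pd

def tuples_to_matrix(df, col_names):
    """Flatten the tuple columns named in col_names into a 2D NumPy array.

    Hypothetical helper; assumes every row yields the same number of values.
    """
    values = df[col_names].values  # 2D object array of tuples
    return np.concatenate(np.concatenate(values)).reshape(len(df), -1)

df = pd.DataFrame({'age': [(1, 2), (1, 3), (1, 1)],
                   'year': [(20, 30), (30, 40), (30, 40)]})

col_names = ['age', 'year']
numpy_array = tuples_to_matrix(df, col_names)

# Selecting a subset of columns works the same way
age_only = tuples_to_matrix(df, ['age'])
print(numpy_array)
print(age_only)
```

The columns of the result follow the order of col_names, since df[col_names] returns the columns in the order requested.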
