Igor Raush - 3 months ago
Python Question

Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array

Question



Is there a good way to transform a DataFrame with an n-level index into an n-D Numpy array (a.k.a. an n-tensor)?




Example



Suppose I set up a DataFrame like

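from pandas import DataFrame, MultiIndex

# One range per index level: a 2 x 3 grid of labels.
index = range(2), range(3)
value = range(2 * 3)
# Drop one row so the index no longer covers the full product.
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0))
print(frame)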


which outputs

     value
0 0      0
  1      1
  2      2
1 1      4
  2      5


The index is a 2-level hierarchical index. I can extract a 2-D Numpy array from the data using

print(frame.unstack().values)


which outputs

[[  0.   1.   2.]
 [ nan   4.   5.]]


How does this generalize to an n-level index?

Playing with unstack(), it seems that it can only be used to massage the 2-D shape of the DataFrame, but not to add an axis.

I cannot use e.g. frame.values.reshape(x, y, z), since this would require that the frame contains exactly x * y * z rows, which cannot be guaranteed. This is what I tried to demonstrate by drop()ing a row in the above example.
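To make the reshape() problem concrete, here is a minimal sketch. It assumes a recent pandas and builds the same five-row frame directly from tuples instead of using drop(); the name reshape_failed is only for illustration.

```python
import pandas as pd

# The five rows that remain once (1, 0) is missing from the 2 x 3 grid.
index = pd.MultiIndex.from_tuples([(0, 0), (0, 1), (0, 2), (1, 1), (1, 2)])
frame = pd.DataFrame({"value": [0, 1, 2, 4, 5]}, index=index)

try:
    # 5 values cannot fill a 2 x 3 array, so NumPy refuses to reshape.
    frame["value"].to_numpy().reshape(2, 3)
    reshape_failed = False
except ValueError:
    reshape_failed = True

print(reshape_failed)
```

Any missing index combination leaves the row count short of the product of the level lengths, so a plain reshape() cannot work without filling the gaps first.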

Any suggestions are highly appreciated.

Answer

I came up with a possible solution and would be glad to hear about any improvements (or holes in the logic).


Given a setup similar to the one above, but in 3-D,

from pandas import DataFrame, MultiIndex
from itertools import product

index = range(2), range(2), range(2)
value = range(2 * 2 * 2)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0, 1))
print(frame)

we have

       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
  1 0      6
    1      7

Now, we proceed using the reshape() route, but with some preprocessing to ensure that the length along each dimension will be consistent.

First, reindex the data frame with the full Cartesian product of all index levels. NaN values will be inserted for every missing combination. The reindexed frame has one row per element of the product, so this operation can be slow and memory-hungry when there are many levels or the levels are long.

levels = map(tuple, frame.index.levels)
index = list(product(*levels))
frame = frame.reindex(index)
print(frame)

which outputs

       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
    1    NaN
  1 0      6
    1      7
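As a possible variation on the reindexing step, the full index can probably be built with MultiIndex.from_product() rather than itertools.product(). The sketch below assumes a recent pandas and constructs the seven-row frame directly from tuples; full_index and dense are illustrative names.

```python
import numpy as np
import pandas as pd

# The 2 x 2 x 2 grid with the (1, 0, 1) row missing.
index = pd.MultiIndex.from_tuples(
    [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1),
     (1, 0, 0), (1, 1, 0), (1, 1, 1)]
)
frame = pd.DataFrame({"value": [0, 1, 2, 3, 4, 6, 7]}, index=index)

# Build the full Cartesian product from the index's own levels.
full_index = pd.MultiIndex.from_product(frame.index.levels)

# Reindexing inserts NaN for every combination absent from the frame.
dense = frame.reindex(full_index)
print(dense)
```

This keeps the index as a MultiIndex throughout instead of going through a Python list of tuples.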

Now, reshape() will work as intended.

# On Python 3, map() returns an iterator, so materialize it for reshape().
shape = list(map(len, frame.index.levels))
print(frame.values.reshape(shape))

which outputs

[[[  0.   1.]
  [  2.   3.]]

 [[  4.  nan]
  [  6.   7.]]]

The (rather ugly) one-liner is

frame.reindex(list(product(*map(tuple, frame.index.levels)))).values\
     .reshape(list(map(len, frame.index.levels)))
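If the reindexing step turns out to be too expensive, one possible alternative is to write the values straight into a preallocated array using the integer codes of the MultiIndex. This is only a sketch, not part of the solution above: to_ndarray is a hypothetical helper, and it assumes pandas 0.24 or later (for MultiIndex.codes), a unique index, and no missing labels in the index.

```python
import numpy as np
import pandas as pd

def to_ndarray(series, fill_value=np.nan):
    """Scatter a MultiIndex-ed Series into an n-D float array (hypothetical helper)."""
    index = series.index
    # One axis per index level, sized by the number of labels in that level.
    shape = tuple(len(level) for level in index.levels)
    out = np.full(shape, fill_value, dtype=float)
    # codes holds, per level, the integer position of each row's label.
    out[tuple(index.codes)] = series.to_numpy()
    return out

index = pd.MultiIndex.from_tuples(
    [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1),
     (1, 0, 0), (1, 1, 0), (1, 1, 1)]
)
frame = pd.DataFrame({"value": [0, 1, 2, 3, 4, 6, 7]}, index=index)

array = to_ndarray(frame["value"])
print(array)
```

The design trade-off is that the array is allocated once at its final size and no intermediate reindexed frame is built, at the cost of relying on the index's internal codes.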