O.rka - 7 months ago
Python Question

Create DataArray from Dict of 2D DataFrames/Arrays

I'm trying to transition from Pandas into Xarray for N-dimensional DataArrays to expand my repertoire.

Realistically, I'm going to have a bunch of different pd.DataFrames (in this case row=month, col=attribute) along a particular axis (patients in the mock example below) that I would like to merge (without using Panels or a MultiIndex :), thank you). I want to convert them to xr.DataArrays so I can build dimensions upon them. I made a mock dataset to give the gist of what I'm talking about.

For this dataset I made up, imagine 100 patients, 12 months, 10000 attributes, 3 replicates (per attribute), which would be a typical 4D dataset. Basically, I'm condensing the 3 replicates per attribute by taking the mean, so I end up with a 2D pd.DataFrame (row=months, col=attributes). This DataFrame is the value in my dictionary and the patient it came from is the key (i.e. (patient_x : DataFrame_X)).

I'm also going to include the roundabout way I did it with an np.ndarray placeholder, but it would be really convenient if I could generate an N-dimensional DataArray from a dictionary whose key is patient_x and whose value is DataFrame_X.

How can I create an N-dimensional DataArray using Xarray from a dictionary of Pandas DataFrames?


import xarray as xr
import numpy as np
import pandas as pd

np.random.seed(1618033)

#Set dimensions
a,b,c,d = 100,12,10000,3 #100 patients, 12 months, 10000 attributes, 3 replicates

#Create labels
patients = ["patient_%d" % i for i in range(a)]
months = [j for j in range(b)]
attributes = ["attr_%d" % k for k in range(c)]
replicates = [l for l in range(d)]

coords = [patients,months,attributes]
dims = ["Patients","Months","Attributes"]

#Dict of DataFrames
D_patient_DF = dict()

for i, patient in enumerate(patients):
    A_placeholder = np.zeros((b,c))
    for j, month in enumerate(months):
        #Attribute x Replicates
        A_attrReplicates = np.random.random((c,d))
        #Collapse into 1D Vector
        V_attrExp = A_attrReplicates.mean(axis=1)
        #Fill array with row
        A_placeholder[j,:] = V_attrExp
    #Assign dataframe for every patient
    DF_data = pd.DataFrame(A_placeholder, index = months, columns = attributes)
    D_patient_DF[patient] = DF_data

xr.DataArray(D_patient_DF).dims
#() it's empty

D_patient_DF
#{'patient_0': attr_0 attr_1 attr_2 attr_3 attr_4 attr_5 attr_6 \
# 0 0.445446 0.422018 0.343454 0.140700 0.567435 0.362194 0.563799
# 1 0.440010 0.548535 0.810903 0.482867 0.469542 0.591939 0.579344
# 2 0.645719 0.450773 0.386939 0.418496 0.508290 0.431033 0.622270
# 3 0.555855 0.633393 0.555197 0.556342 0.489865 0.204200 0.823043
# 4 0.916768 0.590534 0.597989 0.592359 0.484624 0.478347 0.507789
# 5 0.847069 0.634923 0.591008 0.249107 0.655182 0.394640 0.579700
# 6 0.700385 0.505331 0.377745 0.651936 0.334216 0.489728 0.282544
# 7 0.777810 0.423889 0.414316 0.389318 0.565144 0.394320 0.511034
# 8 0.440633 0.069643 0.675037 0.365963 0.647660 0.520047 0.539253
# 9 0.333213 0.328315 0.662203 0.594030 0.790758 0.754032 0.602375
# 10 0.470330 0.419496 0.171292 0.677439 0.683759 0.646363 0.465788
# 11 0.758556 0.674664 0.801860 0.612087 0.567770 0.801514 0.179939

Answer

From a dictionary of DataFrames, you can convert each value into a DataArray (adding dimension labels), load the results into a Dataset, and then convert that into a single DataArray:

# Map each patient key to a 2D DataArray with named dimensions
variables = {k: xr.DataArray(v, dims=['month', 'attribute'])
             for k, v in D_patient_DF.items()}
combined = xr.Dataset(variables).to_array(dim='patient')
print(combined)

However, beware that the result will not necessarily be in sorted order; it follows the iteration order of the dictionary, which is arbitrary before Python 3.7. If you want a specific order, use an OrderedDict instead (insert after setting variables above):

import collections
variables = collections.OrderedDict((k, variables[k]) for k in patients)

This outputs:

<xarray.DataArray (patient: 100, month: 12, attribute: 10000)>
array([[[ 0.61176399,  0.26172557,  0.74657302, ...,  0.43742111,
          0.47503291,  0.37263983],
        [ 0.34970732,  0.81527751,  0.53612895, ...,  0.68971198,
          0.68962168,  0.75103198],
        [ 0.71282751,  0.23143891,  0.28481889, ...,  0.52612376,
          0.56992843,  0.3483683 ],
        ...,
        [ 0.84627257,  0.5033482 ,  0.44116194, ...,  0.55020168,
          0.48151353,  0.36374339],
        [ 0.53336826,  0.59566147,  0.45269417, ...,  0.41951078,
          0.46815364,  0.44630235],
        [ 0.25720899,  0.18738289,  0.66639783, ...,  0.36149276,
          0.58865823,  0.33918553]],

       ...,

       [[ 0.42933273,  0.58642504,  0.38716496, ...,  0.45667285,
          0.72684589,  0.52335464],
        [ 0.34946576,  0.35821339,  0.33097093, ...,  0.59037927,
          0.30233665,  0.6515749 ],
        [ 0.63673498,  0.31022272,  0.65788374, ...,  0.47881873,
          0.67825066,  0.58704331],
        ...,
        [ 0.44822441,  0.502429  ,  0.50677081, ...,  0.4843405 ,
          0.84396521,  0.45460029],
        [ 0.61336348,  0.46338301,  0.60715273, ...,  0.48322379,
          0.66530209,  0.52204897],
        [ 0.47520639,  0.43490559,  0.27309414, ...,  0.35280585,
          0.30280485,  0.77537204]]])
Coordinates:
  * month      (month) int64 0 1 2 3 4 5 6 7 8 9 10 11
  * patient    (patient) <U10 'patient_80' 'patient_73' 'patient_79' ...
  * attribute  (attribute) object 'attr_0' 'attr_1' 'attr_2' 'attr_3' ...
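
As a smaller, self-contained sketch of the same approach, something like the following should work. The toy sizes and the name `frames` (a stand-in for D_patient_DF) are made up for illustration; here the dict is built in patient order up front, so no reordering step is needed on Python 3.7+.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy labels (assumed sizes: 3 patients, 4 months, 5 attributes)
patients = ["patient_%d" % i for i in range(3)]
months = list(range(4))
attributes = ["attr_%d" % k for k in range(5)]

# Hypothetical stand-in for D_patient_DF: one 2D DataFrame per patient
rng = np.random.RandomState(0)
frames = {
    p: pd.DataFrame(rng.random_sample((len(months), len(attributes))),
                    index=months, columns=attributes)
    for p in patients
}

# Wrap each DataFrame as a 2D DataArray with named dimensions,
# iterating over `patients` so the dict follows the order we want
variables = {p: xr.DataArray(frames[p], dims=["month", "attribute"])
             for p in patients}

# Each patient becomes a Dataset variable; to_array stacks the
# variables along a new leading dimension named "patient"
combined = xr.Dataset(variables).to_array(dim="patient")

print(combined.dims)
print(combined.sel(patient="patient_1", month=2).values)
```

Label-based selection with `.sel` is what you gain over a bare np.ndarray placeholder: each patient, month and attribute can be looked up by name.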

Alternatively, you could create a list of 2D DataArrays and then use concat:

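patient_list = []
for i, patient in enumerate(patients):
    df = ...
    # Label the DataFrame's rows and columns as dimensions
    array = xr.DataArray(df, dims=['month', 'attribute'])
    patient_list.append(array)
# Stack the 2D arrays along a new, labeled 'patient' dimension
combined = xr.concat(patient_list, dim=pd.Index(patients, name='patient'))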

This would give the same result, and is probably the cleanest code.
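
For a quick end-to-end check of the concat route, here is a minimal sketch on toy data. The sizes and the randomly filled `df` are assumptions standing in for your real per-patient DataFrames.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy labels (assumed sizes: 3 patients, 4 months, 5 attributes)
patients = ["patient_%d" % i for i in range(3)]
months = list(range(4))
attributes = ["attr_%d" % k for k in range(5)]

rng = np.random.RandomState(1)
patient_list = []
for patient in patients:
    # Hypothetical per-patient DataFrame (row=month, col=attribute)
    df = pd.DataFrame(rng.random_sample((len(months), len(attributes))),
                      index=months, columns=attributes)
    patient_list.append(xr.DataArray(df, dims=["month", "attribute"]))

# Passing a named pd.Index as `dim` both names the new dimension
# and supplies its coordinate labels, in list order
combined = xr.concat(patient_list, dim=pd.Index(patients, name="patient"))

print(combined.dims)
print(combined.sel(patient="patient_2", attribute="attr_3").values)
```

Because the list is built in the order of `patients`, the patient coordinate comes out in that order without any extra sorting step.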