Kartik Kartik - 1 year ago 118
Python Question

Storing Pandas DataFrame to HDF5 with both axes MultiIndexed

I would think that storing a DataFrame with both MultiIndexed axes should be possible. However, I am getting the following error:

In [1]: index = pd.MultiIndex.from_product([['Foo', 'Bar'],['One','Two','Three']])
column = pd.MultiIndex.from_product([['foo', 'bar'],['one','two','three']])
df = pd.DataFrame(np.random.rand(6,6), index=index, columns=column)
Out[1]: foo bar
one two three one two three
Foo One 0.605352 0.882382 0.472946 0.615619 0.108022 0.389674
Two 0.746384 0.594509 0.556881 0.457000 0.529793 0.929574
Three 0.270978 0.956778 0.515201 0.626850 0.852708 0.861962
Bar One 0.219994 0.648191 0.677824 0.408439 0.079326 0.414059
Two 0.186167 0.767103 0.880667 0.205253 0.647471 0.449379
Three 0.353171 0.249900 0.723791 0.458349 0.977604 0.691188

In [2]: with pd.HDFStore('test.h5', 'w') as store:
store.append('output', df)
Out[2]: ---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-55-87e96c141a7f> in <module>()
1 with pd.HDFStore('test.h5', 'w') as store:
----> 2 store.append('output', df)

/home/kartik/miniconda3/lib/python3.5/site-packages/pandas/io/pytables.py in append(self, key, value, format, append, columns, dropna, **kwargs)
917 kwargs = self._validate_format(format, kwargs)
918 self._write_to_group(key, value, append=append, dropna=dropna,
--> 919 **kwargs)
921 def append_to_multiple(self, d, value, selector, data_columns=None,

/home/kartik/miniconda3/lib/python3.5/site-packages/pandas/io/pytables.py in _write_to_group(self, key, value, format, index, append, complib, encoding, **kwargs)
1263 # write the object
-> 1264 s.write(obj=value, append=append, complib=complib, **kwargs)
1266 if s.is_table and index:

/home/kartik/miniconda3/lib/python3.5/site-packages/pandas/io/pytables.py in write(self, obj, data_columns, **kwargs)
4195 data_columns.insert(0, n)
4196 return super(AppendableMultiFrameTable, self).write(
-> 4197 obj=obj, data_columns=data_columns, **kwargs)
4199 def read(self, **kwargs):

/home/kartik/miniconda3/lib/python3.5/site-packages/pandas/io/pytables.py in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, **kwargs)
3785 self.create_axes(axes=axes, obj=obj, validate=append,
3786 min_itemsize=min_itemsize,
-> 3787 **kwargs)
3789 for a in self.axes:

/home/kartik/miniconda3/lib/python3.5/site-packages/pandas/io/pytables.py in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
3383 axis, axis_labels = self.non_index_axes[0]
3384 data_columns = self.validate_data_columns(
-> 3385 data_columns, min_itemsize)
3386 if len(data_columns):
3387 mgr = block_obj.reindex_axis(

/home/kartik/miniconda3/lib/python3.5/site-packages/pandas/io/pytables.py in validate_data_columns(self, data_columns, min_itemsize)
3246 if info.get('type') == 'MultiIndex' and data_columns:
3247 raise ValueError("cannot use a multi-index on axis [{0}] with "
-> 3248 "data_columns {1}".format(axis, data_columns))
3250 # evaluate the passed data_columns, True == use all columns

ValueError: cannot use a multi-index on axis [1] with data_columns ['level_1', 'level_0']

It makes most sense for me to store the data like this. Mainly because my needs will vary hugely. For some applications I will need all rows and all columns. For many others, I will only need all rows, and one parent column: say I would need all rows under
. I might also need only one parent row and one parent column:
Foo, foo

I will certainly need all secondary rows and columns.

In my case, primary row indexes are States, secondary row indexes are sensor names, primary column indexes are the different things sensed, and secondary column indexes are statistics of the sensor outputs. Therefore, it is easy to see that I might need data for just one type of sensing for all states, or for just one state, or I might need all things sensed from one state, or all states.

I am looking for either a fix to the error, or a better way to store the data.

Answer Source

I ended up stacking the DataFrame once, to get rid of the MultiIndex in the columns before storing:

with pd.HDFStore('test.h5', 'w') as store:
    store.append('output', df.stack().astype(float))

It is not ideal, but the querying capabilities of the HDF5 tables compensates for it.