Suraj - 1 month ago
Python Question

On disk indexing of Pandas multiindexed HDFStore

To improve performance and reduce the memory footprint, I am trying to query a multi-indexed HDFStore created with pandas directly on disk instead of loading it in full. The original store is quite large, but the problem can be reproduced with a similar, smaller example.

import pandas as pd

df = pd.DataFrame([0.25, 0.5, 0.75, 1.0],
                  index=['Item0', 'Item1', 'Item2', 'Item3'],
                  columns=['Values'])

# Stack two copies under an outer index level named 'Item'; the inner level is 'N'
df = pd.concat((df, df), axis=0, names=['Item', 'N'],
               keys=['Items0', 'Items1'])

df.to_hdf('hdfs.h5', key='df', format='table', mode='w',
          complevel=9, complib='blosc', data_columns=True)

store = pd.HDFStore('hdfs.h5', mode='r')

# Select on the outer index level
store.select('df', where='Item="Items0"')


This is expected to return the rows under the 'Items0' sub-index; instead it raises an error:

> ValueError: The passed where expression: Item="Items0"
> contains an invalid variable reference
> all of the variable refrences must be a reference to
> an axis (e.g. 'index' or 'columns'), or a data_column
> The currently defined references are: index,iron,columns


The indices are:

store['df'].index

> MultiIndex(levels=[['Items0', 'Items1'], ['Item0', 'Item1', 'Item2',
> 'Item3']],
> labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]],
> names=['Item', 'N'])


Could anyone explain what the cause may be, or how this should be done properly?

Answer

It works for me if I remove data_columns=True:

df.to_hdf('hdfs3.h5', key='df', format='table', mode='w', complevel=9, complib='blosc')
store = pd.HDFStore('hdfs3.h5', mode='r')
print(store.select('df', 'Item="Items0"'))
              Values
Item   N            
Items0 Item0    0.25
       Item1    0.50
       Item2    0.75
       Item3    1.00
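
The approach above can be sketched end to end as a self-contained script. This is only a minimal illustration, assuming PyTables is installed: the frame is rebuilt from the question, compression is left out for brevity, and the scratch file name hdfs_demo.h5 and the variable names are hypothetical. Behaviour with data_columns=True may differ between pandas versions, so the sketch only exercises the variant reported to work here.

```python
import os
import tempfile

import pandas as pd

# Rebuild the small example frame from the question.
base = pd.DataFrame(
    [0.25, 0.5, 0.75, 1.0],
    index=["Item0", "Item1", "Item2", "Item3"],
    columns=["Values"],
)
df = pd.concat([base, base], keys=["Items0", "Items1"], names=["Item", "N"])

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "hdfs_demo.h5")

# Write in table format WITHOUT data_columns=True, as the answer suggests.
df.to_hdf(path, key="df", format="table", mode="w")

# Filter on the outer index level by name; the selection is applied
# while reading from the file rather than after loading everything.
with pd.HDFStore(path, mode="r") as store:
    on_disk = store.select("df", where='Item="Items0"')

print(on_disk)
```

The context manager closes the store even if the query raises, which avoids leaving the file locked when experimenting with different where expressions.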