leka0024 leka0024 - 18 days ago 5
Python Question

Saving Pandas DataFrames as a HDF5 store, various errors

Just want to archive some Pandas DataFrames in a HDF5 store (.h5 file). Below is the code I'm using.

# Fake data over N runs
Data_N = []
for n in range(5):
Data_N.append(np.random.randn(5000,15,125))

# Create HDFStore object
store = pd.HDFStore('test.h5')

# For each run:
for n in range(len(Data_N)):
Data = Data_N[n]

# Pandas DataFrame for "flattened" fake data
Data_subDFs = []
nanbuff = np.nan*np.zeros((1,len(Data[0,0])))

for i in range(len(Data)):
Data_i = np.vstack((nanbuff,Data[i,:,:]))
Data_subDFs.append(pd.DataFrame(data = Data_i))

Data_DF = pd.concat(Data_subDFs)

# Row and column labels for the DataFrame
Data_rows = []
for i in range(len(Data)):
Data_rows.append(['Layer %d:' % (i+1)] + range(1,len(Data[0])+1))

Data_DF.index = sum(Data_rows,[])
Data_DF.columns = range(1,len(Data[0,0])+1)

# Put Pandas DataFrame into store
store.put('Data_DF_%d' % (n+1), Data_DF)
#store.put('Data_DF_%d' % (n+1), Data_DF, format='table')
#store.put('Data_DF_%d' % (n+1), Data_DF, format='table', data_columns=True)

# Save the HDF5 file
store.close()


This gives the following output:

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->axis1] [items->None]


If I use the second version of put, it gives:

TypeError: Passing an incorrect value to a table column. Expected a Col (or subc
lass) instance and got: "ObjectAtom()". Please make use of the Col(), or descend
ant, constructor to properly initialize columns.


And if I use the third version of put, it gives:

ValueError: cannot have non-object label DataIndexableCol


Can someone please explain the different versions, and why I can't save what I think is a valid Pandas DataFrame in HDF5 without the pickling?

If it helps, I don't think I need to be able to append the DataFrame/store. I just want the best performing way to save the DF using Pandas HDF5 interface.

Thanks!




EDIT 1:

I updated the code after "For each run:" to this

# For each run:
for run in range(len(Data_N)):
Data = Data_N[run]
l = len(Data)
m = len(Data[0])
n = len(Data[0,0])

# Pandas DataFrame for "flattened" fake data
Data_subDFs = []

for i in range(len(Data)):
Data_i = Data[i,:,:]
Data_subDFs.append(pd.DataFrame(data = Data_i))

Data_DF = pd.concat(Data_subDFs)

# Row and column labels for the DataFrame
L1 = np.zeros((l*m,1), dtype=object) # Layer number
L2 = np.zeros((l*m,1), dtype=object) # Row number

for i in range(l):
for j in range(m):
L1[i*m + j,0] = 'Layer %d' % (i+1)
L2[i*m + j,0] = '%d' % (j+1)

Data_DF.index = np.hstack((L1,L2))
Data_DF.columns = range(1,n+1)

# Put Pandas DataFrame into store
store.put('Data_DF_%d' % (run+1), Data_DF)
#store.put('Data_DF_%d' % (run+1), Data_DF, format='table')
#store.put('Data_DF_%d' % (run+1), Data_DF, format='table', data_columns=True)


But that gives the same warning or errors, for each put line.




EDIT 2 (this worked!):

# For each run:
for run in range(len(Data_N)):
Data = Data_N[run]
l = len(Data)
m = len(Data[0])
n = len(Data[0,0])

# Pandas DataFrame for "flattened" fake data
Data_DF = pd.DataFrame(Data.reshape(l*m,n))

# Layer and row labels
layers = np.arange(1,l+1)
rows = np.arange(1,m+1)

# Pandas multi-index
mindex = pd.MultiIndex.from_product([layers,rows], names=['Layer','Row'])

# DataFrame multi-index and column labels
Data_DF.index = mindex
Data_DF.columns = range(1,n+1)

# Put Pandas DataFrame into store
store.put('Data_DF_%d' % (run+1), Data_DF)
#store.put('Data_DF_%d' % (run+1), Data_DF, format='table')
#store.put('Data_DF_%d' % (run+1), Data_DF, format='table', data_columns=True)


The third put line still gives the same error, but since the second line works I'll assume that the third line is just an invalid command in this case.

Second line was quite a bit faster than first line, as well, and both considerably faster than the pickling route. Thanks!

Answer

UPDATE:

Here is a small demo:

Setup:

data = np.random.randn(5,10,5)
index = pd.MultiIndex.from_product([np.arange(1, len(data)+1),
                                  np.arange(1,len(data[0])+1)], names=['Layer','No'])
df = pd.DataFrame(data.reshape(data.shape[0] * data.shape[1], data.shape[2]),
                  index=index)

Data:

In [82]: df
Out[82]:
                 0         1         2         3         4
Layer No
1     1   1.167144  0.640303  0.059197 -1.637180  0.667196
      2   2.150872 -0.825325 -0.332458 -1.307043  1.361330
      3  -0.931299 -0.931882  0.153943 -0.446289  0.651594
      4  -0.131500 -0.489745  1.264029  0.889779  1.081613
      5  -0.479022 -1.516204  0.616170  0.126860  0.125559
      6   1.114287 -0.939504  0.058869  0.321159  0.340881
      7  -0.527516 -0.362337 -0.590430 -0.609017  1.835716
      8   0.063372  0.000792  0.855485 -0.113592  0.890687
      9  -0.160041  1.978954  0.778428  1.988354  2.095665
      10  0.687911  0.115918 -0.653885  0.486365 -0.775659
2     1  -0.123350  0.674359 -0.120634 -1.350044 -0.176252
      2  -1.986077 -0.846584  0.895982  0.236790  0.240023
      3   0.878597  0.241594  0.405382  1.785109  1.228188
      4  -1.510238 -0.303274  0.247082  1.841996 -0.864595
      5  -1.424249 -0.183216 -0.044330  0.324894 -0.271179
      6  -0.345720 -0.942421  0.538227 -0.558793 -1.075346
      7   1.327952 -2.335520 -0.164645  1.489798 -0.876896
      8   1.043723  0.770489 -1.052739 -0.830190  1.005406
      9   0.789100 -0.706633 -1.014431 -1.164513 -0.266424
      10  2.061175  0.933526 -1.601836 -1.542535 -1.220943
3     1  -0.061520 -0.932599  0.103480 -0.318529 -0.311965
      2  -0.401409 -0.308739 -1.399233 -1.172032 -0.550774
      3   0.670272  1.215724  0.711328  2.332297 -1.326704
      4   0.377469  0.752313 -1.223832  0.431555 -0.901796
      5  -2.386383  0.053921 -1.175427 -0.794099 -0.469374
      6   0.951571 -2.220609  0.208136 -2.141828  0.010316
      7   1.047133  0.924568  0.282091  1.367981 -0.617389
      8   1.083008 -1.519416  0.535690  0.196885 -0.022692
      9   1.307252  1.099716  0.766976 -0.466699  1.113605
      10 -0.614214  0.702395 -0.131248  1.773092  0.241553
4     1  -1.280026  0.278248 -0.518560 -0.395394  0.434473
      2   1.498882 -1.359542  0.012312 -0.231728 -2.643232
      3  -0.539773 -0.755483 -1.002526  0.198792 -0.120656
      4   0.056788  1.289477 -0.440122 -1.454418 -0.043193
      5  -0.777678  1.734322 -1.270129  0.160094  0.355290
      6  -1.037775 -0.542944 -0.913428  0.885965 -0.155220
      7  -0.855498 -0.330268 -1.903738  0.098101 -0.670830
      8   0.786258  0.599100 -0.426781  0.425572  0.132932
      9  -0.430497 -1.414292 -0.997637  0.696176 -0.480886
      10  1.211665 -1.233842  0.137176  1.520013 -1.052884
5     1  -0.267698 -1.013917 -1.324896 -1.189835 -0.192396
      2   1.047264 -0.454829  1.051039  1.565423  0.749844
      3   0.159177  0.481088  0.711499 -1.217079  0.444402
      4   0.254420 -0.114102  0.620231  1.890822  1.269808
      5   0.673696 -0.321638 -0.887355  0.426549 -0.935591
      6  -1.836808  0.450332  1.187512 -0.215318 -1.142346
      7  -1.496568  0.633886  0.625143  0.295385  1.445084
      8  -0.473427 -0.608318 -0.602080  0.134105  0.704027
      9   2.319899  0.763272  0.861798  1.464612 -0.708869
      10 -0.199555  0.721122  0.099777 -0.466488  0.923112

In [84]: df.index.levels
Out[84]: FrozenList([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])

Now you can slice it like as follows:

In [85]: idx = pd.IndexSlice

In [86]: df.loc[idx[[2,4], 2:5], :]
Out[86]:
                 0         1         2         3         4
Layer No
2     2  -1.986077 -0.846584  0.895982  0.236790  0.240023
      3   0.878597  0.241594  0.405382  1.785109  1.228188
      4  -1.510238 -0.303274  0.247082  1.841996 -0.864595
      5  -1.424249 -0.183216 -0.044330  0.324894 -0.271179
4     2   1.498882 -1.359542  0.012312 -0.231728 -2.643232
      3  -0.539773 -0.755483 -1.002526  0.198792 -0.120656
      4   0.056788  1.289477 -0.440122 -1.454418 -0.043193
      5  -0.777678  1.734322 -1.270129  0.160094  0.355290

Saving to and selecting from HDF store:

In [88]: store = pd.HDFStore('d:/temp/test.h5')

In [89]: store.append('test', df, complib='blosc', complevel=5)

In [90]: store.close()

In [91]: store = pd.HDFStore('d:/temp/test.h5')

In [92]: store.select('test', where="Layer in [2,4] and No in [2,4,6]")
Out[92]:
                 0         1         2         3         4
Layer No
2     2  -1.986077 -0.846584  0.895982  0.236790  0.240023
      4  -1.510238 -0.303274  0.247082  1.841996 -0.864595
      6  -0.345720 -0.942421  0.538227 -0.558793 -1.075346
4     2   1.498882 -1.359542  0.012312 -0.231728 -2.643232
      4   0.056788  1.289477 -0.440122 -1.454418 -0.043193
      6  -1.037775 -0.542944 -0.913428  0.885965 -0.155220

MultiIndex documentation (with two levels: Layer, No) instead.

Comments