ultra909 ultra909 - 11 months ago 98
Python Question

HDFStore.append(string, DataFrame) fails when string column contents are longer than those already there

I have a Pandas DataFrame stored via an HDFStore that essentially stores summary rows about test runs I am doing.

Several of the fields in each row contain descriptive strings of variable length.

When I do a test run, I create a new DataFrame with a single row in it:

def export_as_df(self):
return pd.DataFrame(data=[self._to_dict()], index=[datetime.datetime.now()])

And then call
HDFStore.append(string, DataFrame)
to add the new row to the existing DataFrame.

This works fine, apart from where one of the string columns contents is larger than the longest instance already existing, whereupon I get the following error:

File "<ipython-input-302-a33c7955df4a>", line 516, in save_pytables
store.append('tests', test.export_as_df())
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 532, in append
self._write_to_group(key, value, table=True, append=True, **kwargs)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 788, in _write_to_group
s.write(obj = value, append=append, complib=complib, **kwargs)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 2491, in write
min_itemsize=min_itemsize, **kwargs)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 2254, in create_axes
raise Exception("cannot find the correct atom type -> [dtype->%s,items->%s] %s" % (b.dtype.name, b.items, str(detail)))
Exception: cannot find the correct atom type -> [dtype->object,items->Index([bp, id, inst, per, sp, st, title], dtype=object)] [values_block_3] column has a min_itemsize of [51] but itemsize [46] is required!

I can't find any documentation about how to specify string length when creating a DataFrame. What is the solution here?


Code that is failing:

store = pd.HDFStore(pytables_store)
for test in self.backtests:
min_itemsizes = { 'buy_pattern' : 60, 'sell_pattern': 60, 'strategy': 60, 'title': 60 }
store.append('tests', test.export_as_df(), min_itemsize = min_itemsizes)

Here's the error under 0.11rc1:

File "<ipython-input-110-492b7b6603d7>", line 522, in save_pytables
store.append('tests', test.export_as_df(), min_itemsize = min_itemsizes)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 610, in append
self._write_to_group(key, value, table=True, append=True, **kwargs)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 871, in _write_to_group
s.write(obj = value, append=append, complib=complib, **kwargs)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 2707, in write
min_itemsize=min_itemsize, **kwargs)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 2447, in create_axes
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 2184, in validate_min_itemsize
raise ValueError("min_itemsize has [%s] which is not an axis or data_column" % k)
ValueError: min_itemsize has [buy_pattern] which is not an axis or data_column

Data sample:

all_day buy_pattern \
2013-04-14 12:11:44.377695 False Hammer() and LowerLow()

id instrument \
2013-04-14 12:11:44.377695 tafdcc96ba4eb11e2a86d14109fcecd49 EURUSD

open_margin periodicity sell_pattern strategy \
2013-04-14 12:11:44.377695 0.0001 1:00:00 Tsl()

title top_bottom wick_body
2013-04-14 12:11:44.377695 tsl 0.5 2


print prob_test.export_as_df().get_dtype_counts()

bool 1
float64 2
int64 1
object 7
dtype: int64

I am deleting the h5 file each time as I want clean results. Wondering if there is something as silly as it is failing because the df does not exist in the h5 (and hence neither do any columns) at the time of the first append?

Answer Source

Here is the link to the new docs section about this: http://pandas.pydata.org/pandas-docs/dev/io.html#string-columns

This issue is that you are specifiying a column in min_itemsize that is not a data_column. Simple workaround is to add data_columns=True to your append statement, but I have also updated the code to automatically create the data_columns if you pass a valid column name. I think this makes sense, you want to have a minimum column size, so let it happen.

Also created a new doc section String Columns to show a more complete example with explanation (docs will be updated soon).

# this is the new behavior (after code updates)
n [340]: dfs = DataFrame(dict(A = 'foo', B = 'bar'),index=range(5))

In [341]: dfs
     A    B
0  foo  bar
1  foo  bar
2  foo  bar
3  foo  bar
4  foo  bar

# A and B have a size of 30
In [342]: store.append('dfs', dfs, min_itemsize = 30)

In [343]: store.get_storer('dfs').table
/dfs/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": StringCol(itemsize=30, shape=(2,), dflt='', pos=1)}
  byteorder := 'little'
  chunkshape := (963,)
  autoIndex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

# A is created as a data_column with a size of 30
# B is size is calculated
In [344]: store.append('dfs2', dfs, min_itemsize = { 'A' : 30 })

In [345]: store.get_storer('dfs2').table
/dfs2/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": StringCol(itemsize=3, shape=(1,), dflt='', pos=1),
  "A": StringCol(itemsize=30, shape=(), dflt='', pos=2)}
  byteorder := 'little'
  chunkshape := (1598,)
  autoIndex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}