ShanZhengYang - 1 year ago
Python Question

Using PyTables to index a 500 GB HDF5 file

I would like to dump a keyed 500 GB-800 GB table into HDF5, and then retrieve rows matching specific keys.

For an HDF5 file, data access uses an integer "row" number, so it seems I would have to implement a "key to row number" map outside of HDF5.

Would this work? Would I need to load the entire HDF5 file into memory (RAM)?

Can anyone give me some sense how poorly HDF5 would perform in this situation? If there's decent indexing, this is just a huge dictionary, right?

Should I be using something else?

Answer Source

Assume that you have defined this record type in PyTables

class Record(tables.IsDescription):
    row = tables.Int32Col()
    col1 = tables.Int32Col()
    col2 = tables.Float64Col()
    col3 = tables.Float64Col()

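As a concrete sketch, a table with this description could be created and populated like so (the file name `data.h5` and table name `mytable` are just illustrative):

```python
import tables

# Record description as defined above
class Record(tables.IsDescription):
    row = tables.Int32Col()
    col1 = tables.Int32Col()
    col2 = tables.Float64Col()
    col3 = tables.Float64Col()

# Hypothetical file/table names; rows are appended via the Row accessor
with tables.open_file("data.h5", mode="w") as h5:
    table = h5.create_table("/", "mytable", Record)
    r = table.row
    for i in range(1000):
        r['row'] = i
        r['col1'] = 2 * i
        r['col2'] = 0.5 * i
        r['col3'] = float(i)
        r.append()
    table.flush()  # make sure buffered rows reach the file
```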
A regular range query might look like this:

result = [rec for rec in table if (rec['row'] > 100 and rec['row'] < 200)]

This works fine if your table is not too big. However, for large tables it will be relatively slow, because each row has to be brought into Python space to be evaluated against the range condition.

To speed up this query, one can rely on so-called in-kernel queries, which check the condition using a PyTables kernel written in C with the help of the numexpr library:

result = [rec for rec in table.where(
            '(row > 100) & (row < 200)')]

You can also mix and match a regular query with an in-kernel query:

result = [rec for rec in table.where(
            '(row > 100) & (row < 200)') if your_function(rec['col2'])]

If you have large tables that don't fit into memory, the speedup is around 2x. Using compression (e.g. Blosc, LZF) can give a further slight speed bump, since the CPU overhead of decompression is smaller than the I/O overhead it saves (so use compression for huge tables that don't fit into memory).
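In PyTables, compression is enabled per table (or per file) through a `Filters` object; a minimal sketch, with illustrative file/table names:

```python
import tables

class Record(tables.IsDescription):
    row = tables.Int32Col()
    col2 = tables.Float64Col()

# Blosc compression at level 5; "compressed.h5" is a hypothetical name
filters = tables.Filters(complevel=5, complib="blosc")
with tables.open_file("compressed.h5", mode="w") as h5:
    table = h5.create_table("/", "mytable", Record, filters=filters)
    r = table.row
    for i in range(1000):
        r['row'] = i
        r['col2'] = 0.5 * i
        r.append()
    table.flush()
```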

When you use compression, the dataset is split into chunks, and the chunks are compressed separately. That means that if you query a specific range (e.g. rows 100-200), only the corresponding compressed chunks are loaded from disk into memory and then decompressed by the CPU. This speeds things up compared to using no compression or storing the dataset contiguously. Blosc is a meta-compressor, and LZF is the default compressor of h5py. For the differences between Blosc and LZF, see this thread.

If the in-kernel queries are not fast enough, you can also create an index on one or more columns. This way the query will use a binary search instead of a sequential scan. To create an index on an existing table for the row column just run:

indexrows = table.cols.row.create_index()

But beware that the index won't be used for all conditions (see the reference below). To check whether your query is using the index properly, you can use the Table.will_query_use_indexing() method.
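Putting the last two steps together, a small self-contained check might look like this (file and table names are hypothetical; the assertions about which conditions use the index reflect my understanding of PyTables' behavior):

```python
import tables

class Record(tables.IsDescription):
    row = tables.Int32Col()
    col2 = tables.Float64Col()

with tables.open_file("indexed.h5", mode="w") as h5:
    table = h5.create_table("/", "mytable", Record)
    r = table.row
    for i in range(1000):
        r['row'] = i
        r['col2'] = 0.5 * i
        r.append()
    table.flush()

    # Index only the 'row' column
    table.cols.row.create_index()

    # Condition on the indexed column: the index can be used
    used = table.will_query_use_indexing('(row > 100) & (row < 200)')
    # Condition on the unindexed 'col2': falls back to a sequential scan
    unused = table.will_query_use_indexing('col2 > 10.0')
```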