Eric Kaschalk - 1 month ago
Python Question

Dask/hdf5: Read by group?

I must read in and operate independently over many chunks of a large dataframe/numpy array. However, these chunks are chosen in a specific, non-uniform manner and are broken naturally into groups within an HDF5 file. Each group is small enough to fit into memory (though even without that restriction, I suppose the standard chunking procedure would suffice).

Specifically, instead of

import h5py
import dask.array as da

f = h5py.File('myfile.hdf5')
x = da.from_array(f['/data'], chunks=(1000, 1000))


I want something closer to (pseudocode):

f = h5py.File('myfile.hdf5')
x = da.from_array(f, chunks=(f['/data1'], f['/data2'], ...,))


I believe http://dask.pydata.org/en/latest/delayed-collections.html hints that this is possible, but I am still reading up on and learning dask/hdf5.

My previous implementation uses a number of CSV files and reads them in as needed with its own multiprocessing logic. I would like to collapse all of this functionality into dask with HDF5.

Is chunking by HDF5 group possible, and is my line of thought reasonable?

Answer

I would read many dask.arrays from many groups as single-chunk dask.arrays and then concatenate or stack those groups.

Read many dask.arrays

import h5py
import dask.array as da

f = h5py.File(...)
dsets = [f[dset] for dset in datasets]  # datasets: your list of group/dataset names
arrays = [da.from_array(dset, chunks=dset.shape) for dset in dsets]  # one chunk per group

Stack or Concatenate arrays together

array = da.concatenate(arrays, axis=0)

See http://dask.pydata.org/en/latest/array-stack.html
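If every group has the same shape, stacking along a new axis is an alternative to concatenation. A minimal sketch, reusing the `arrays` list built above:

import dask.array as da

# Stack along a new leading axis; this requires all groups to share a shape,
# whereas concatenate only requires matching shapes along the other axes.
array = da.stack(arrays, axis=0)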

Or use dask.delayed

You could also, as you suggest, use dask.delayed to do the first step, building the single-chunk dask.arrays yourself.
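A minimal sketch of that approach, assuming the filename and the group names /data1 and /data2 from the question, and using numpy to materialize each group (the exact read function is up to you):

import h5py
import numpy as np
import dask
import dask.array as da

f = h5py.File('myfile.hdf5')
names = ['/data1', '/data2']  # your group/dataset names

dsets = [f[name] for name in names]
# One delayed read per group; shape and dtype come from the HDF5 dataset.
reads = [dask.delayed(np.asarray)(dset) for dset in dsets]
arrays = [da.from_delayed(read, shape=dset.shape, dtype=dset.dtype)
          for read, dset in zip(reads, dsets)]
x = da.concatenate(arrays, axis=0)

Each delayed call runs only when the graph executes, so the groups are read lazily, one chunk per group.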
