NirIzr - 11 months ago
Python Question

Collecting attributes from dask dataframe providers

TL;DR: How can I collect metadata (errors encountered during parsing) from distributed reads into a dask dataframe collection?

I currently have a proprietary file format that I'm using to feed into a dask.DataFrame.
I have a function that accepts a file path and returns a pandas.DataFrame; dask.DataFrame uses it internally to successfully load multiple files into the same dask.DataFrame.

Until recently, I was using my own code to merge several pandas.DataFrames into one, and now I'm working on using dask instead. When parsing the file format, I may encounter errors and certain conditions that I want to log and associate with the dask.DataFrame object as metadata (logs, origin of the data, etc.).

It's important to note that, where reasonable, I use MultiIndexes quite heavily (13 index levels, 3 column levels). For metadata that describes the entire dataframe rather than specific rows, I use attributes.

Using a custom function, I could pass the metadata in a tuple along with the actual DataFrame. Using pandas, I could add it to the _metadata field and as attributes on the DataFrame objects.
How can I collect metadata from separate pandas.DataFrame objects when using the dask framework?
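For context, the pre-dask workflow described above might look like the following sketch. Here parse_file, the file paths, and the returned columns are all hypothetical stand-ins for the proprietary-format parser, not real code:

```python
import pandas as pd

def parse_file(path):
    # stand-in for the proprietary-format parser; a real parser would
    # read `path` and record any parse errors in the metadata dict
    df = pd.DataFrame({"value": [1.0, 2.0]})
    metadata = {"origin": path, "errors": []}
    return df, metadata

paths = ["a.dat", "b.dat"]
frames, logs = [], []
for p in paths:
    df, meta = parse_file(p)
    frames.append(df)
    logs.append(meta)

# merge the per-file frames into one, keeping the metadata alongside
merged = pd.concat(frames, ignore_index=True)
```

The open question is how to keep the `logs` side of this pairing once the merge is handled by dask rather than by hand.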


Answer Source

There are a few potential questions here:

  • Q: How do I load data from many files in a custom format into a single dask dataframe
  • A: You might check out dask.delayed to load the data and dask.dataframe.from_delayed to convert several dask Delayed objects into a single dask dataframe. Or, as you're probably doing now, you can use dask.dataframe.from_pandas and dask.dataframe.concat. See this example notebook on using dask.delayed with custom objects/functions.

  • Q: How do I store arbitrary metadata onto a dask.dataframe?

  • A: This is not supported. Generally, I recommend using a different data structure to store your metadata if possible. If there are a number of use cases for this, then we should consider adding it to dask dataframe; if that is the case, please raise an issue. Generally, though, it would be good to see better support for this in Pandas before dask.dataframe considers supporting it.

  • Q: I use multi-indexes heavily in Pandas, how can I integrate this workflow into dask.dataframe?

  • A: Unfortunately dask.dataframe does not currently support multi-indexes. These would clearly be helpful.