I have a calculation that expects a pandas DataFrame as input. I'd like to run this calculation on data stored in a netCDF file that expands to 51GB; currently I've been opening the file with xarray.
This is a good question. It should be doable, but I'm not quite sure what the right approach is.
Ideally, we could simply implement an xarray.Dataset.to_dask_dataframe() method. But there are several challenges here, the biggest one being that dask currently does not support dataframes with a MultiIndex.
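For a sense of what such a method would enable, here is a hypothetical usage sketch (the method does not exist at the time of writing; the file name and chunk sizes are placeholders):

```python
import xarray as xr

# Open the netCDF lazily with dask-backed chunks, then flatten the
# whole Dataset into a dask DataFrame (hypothetical method).
ds = xr.open_dataset("data.nc", chunks={"time": 1000})
ddf = ds.to_dask_dataframe()  # one row per point of the flattened index
```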
Alternatively, you might want to construct a list of dask.Delayed objects holding pandas.DataFrames, one for each chunk of the xarray.Dataset. Toward this end, it would be nice if xarray had something like dask.array's to_delayed method for converting a Dataset into an array of delayed datasets, which you could then lazily convert into DataFrame objects and do your computation on.
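In the meantime, you can wire this up by hand with dask.delayed. A minimal sketch, assuming the file has a time dimension to slice along; the file name, chunk size, and my_calculation are placeholders standing in for your actual inputs:

```python
import dask
import pandas as pd
import xarray as xr

def my_calculation(df: pd.DataFrame) -> float:
    # Placeholder for the calculation that expects a pandas DataFrame.
    return df.mean().mean()

# Open the dataset lazily; no data is read into memory yet.
ds = xr.open_dataset("data.nc", chunks={"time": 1000})

@dask.delayed
def chunk_to_frame(chunk: xr.Dataset) -> pd.DataFrame:
    # Materialize one slice of the Dataset as a pandas DataFrame.
    return chunk.to_dataframe()

# Build one delayed pandas.DataFrame per slice along the time dimension.
step = 1000
frames = [chunk_to_frame(ds.isel(time=slice(i, i + step)))
          for i in range(0, ds.sizes["time"], step)]

# Wrap the calculation itself in dask.delayed and compute all chunks.
results = dask.compute(*[dask.delayed(my_calculation)(df) for df in frames])
```

Each to_dataframe call runs inside a delayed task, so only the slices currently being processed are held in memory, which is what makes this approach workable for a 51GB file.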
I encourage you to open an issue on either the dask or xarray GitHub repositories to discuss this, especially if you might be interested in contributing code.

EDIT: You can find that issue here.