user3766692 - 29 days ago
Python Question

How to convert an xarray dataset to pandas dataframes inside a dask dataframe

I have a calculation that expects a pandas dataframe as input. I'd like to run this calculation on data stored in a netCDF file that expands to 51 GB. Currently I've been opening the file with xarray.open_dataset and passing a chunks argument (my understanding is that the opened dataset is then backed by dask arrays, so only chunks of the data are loaded into memory at a time). However, I don't seem to be able to take advantage of this lazy loading, since I have to convert the xarray data into a pandas dataframe in order to run my calculation - and my understanding is that at that point all the data is loaded into memory (which is bad).
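For reference, the lazy-open step described above looks like this (the file name, variable, and chunk sizes are illustrative; the snippet writes a tiny file first so it is self-contained):

```python
import numpy as np
import xarray as xr

# Write a small example file so the snippet runs on its own.
xr.Dataset({"temp": (("time",), np.arange(10.0))}).to_netcdf("example.nc")

# Passing `chunks` makes xarray back each variable with a dask array,
# so nothing is read from disk until a computation is triggered.
ds = xr.open_dataset("example.nc", chunks={"time": 5})
print(type(ds["temp"].data))  # a dask array, not a numpy array
```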

So I guess long story short, my question is: how can I get from an xarray dataset to a pandas dataframe without any intermediate step that loads my entire data into memory? I've seen dask provide a chunked version of pandas.read_csv, and I've seen it work with xarray, but I'm not sure how to convert an already opened netCDF xarray dataset to a pandas dataframe in chunks.

Thanks and sorry for the vague question!

Answer

This is a good question. This should be doable, but I'm not quite sure what the right approach is.

Ideally, we could simply implement an xarray.Dataset.to_dask_dataframe() method. But there are several challenges here -- the biggest one being that dask currently does not support dataframes with a MultiIndex.
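To see why the MultiIndex is the sticking point, note what Dataset.to_dataframe produces for multi-dimensional data (dataset contents below are illustrative):

```python
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    {"temp": (("time", "x"), np.zeros((2, 3)))},
    coords={"time": [0, 1], "x": [10, 20, 30]},
)

# All dimensions are stacked into a pandas MultiIndex -- exactly the
# structure a dask dataframe cannot currently represent.
df = ds.to_dataframe()
print(df.index.names)  # ('time', 'x')
```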

Alternatively, you might construct a list of dask.Delayed objects, each holding a pandas.DataFrame for one chunk of the xarray.Dataset. Toward this end, it would be nice if xarray had something like dask.array's to_delayed method for converting a Dataset into an array of delayed datasets, which you could then lazily convert into DataFrame objects and run your computation on.

I encourage you to open an issue on either the dask or xarray GitHub pages to discuss, especially if you might be interested in contributing code. EDIT: You can find that issue here.