James Warner - 11 days ago

NetCDF Big data

I need to read large (15 GB+) NetCDF files into a program. Each file holds a 3D variable (with Time as the record dimension, and the data being latitude by longitude).

I'm processing the data in a three-level nested loop (checking whether each block of the NetCDF passes a certain criterion). For example:

from netCDF4 import Dataset
import numpy as np

File = Dataset('Somebigfile.nc', 'r')
Data = File.variables['Wind'][:]

Getdimensions = np.shape(Data)
Time = Getdimensions[0]
Latdim = Getdimensions[1]
Longdim = Getdimensions[2]

for t in range(0, Time):
    for i in range(0, Latdim):
        for j in range(0, Longdim):
            if Data[t, i, j] > Somethreshold:
                pass  # Do something


Is there any way I can read the NetCDF file one time record at a time? That would hugely reduce the memory usage. Any help is greatly appreciated.

I know of the NCO operators, but would prefer not to use them to break up the files before running the script.

Answer

It sounds like you've already settled on a solution, but I'll throw out a much more elegant and vectorized (and likely faster) solution that uses xarray and dask. Your nested for loop is going to be very inefficient. Combining xarray and dask, you can work on the data in your file incrementally in a semi-vectorized manner.

Since your "Do something" step isn't all that specific, you'll have to extrapolate from my example.

import xarray as xr

# xarray will open your file but doesn't load in any data until you ask for it
# dask handles the chunking and memory management for you
# chunk size can be optimized for your specific dataset.
ds = xr.open_dataset('Somebigfile.nc', chunks={'time': 100})

# mask out values below the threshold
da_thresh = ds['Wind'].where(ds['Wind'] > Somethreshold)

# Now just operate on the values greater than your threshold
do_something(da_thresh)
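
Note that dask evaluates lazily: the lines above only build a task graph, and nothing is actually read from disk until you ask for a concrete result. As a sketch, assuming "Do something" reduces the data somehow (the time-mean here is just a stand-in):

# Nothing has been loaded yet. If, say, "Do something" were a mean over
# time of the masked values, .compute() triggers the chunked read and
# the actual calculation.
mean_over_time = da_thresh.mean(dim='time').compute()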

Xarray/Dask docs: http://xarray.pydata.org/en/stable/dask.html
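
For completeness, if you'd rather stay with plain netCDF4, the literal answer to your question is yes: slicing a Variable object (instead of loading everything with [:]) reads only the requested slab from disk. A minimal sketch, reusing your Wind variable and Somethreshold placeholder:

from netCDF4 import Dataset

File = Dataset('Somebigfile.nc', 'r')
Wind = File.variables['Wind']      # a Variable object; nothing read yet

for t in range(Wind.shape[0]):
    Slab = Wind[t, :, :]           # reads a single 2D lat/lon record
    if (Slab > Somethreshold).any():
        pass  # Do something with this time step

File.close()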