ChristineB - 2 months ago 6x

Python Question

**The issue**

So I have 50 netCDF4 data files that contain decades of monthly temperature predictions on a global grid. I'm using np.mean() to make an ensemble average of all 50 data files together while preserving time length & spatial scale, but np.mean() gives me two different answers. The first time I run its block of code, it gives me a number that, when averaged over latitude & longitude & plotted against the individual runs, is slightly lower than what the ensemble mean should be. If I re-run the block, it gives me a different mean which looks correct.

**The code**

I can't copy every line here since it's long, but here's what I do for each run.

`#Historical (1950-2020) data`

ncin_1 = Dataset("/project/wca/AR5/CanESM2/monthly/histr1/tas_Amon_CanESM2_historical-r1_r1i1p1_195001-202012.nc") #Import data file

tash1 = ncin_1.variables['tas'][:] #extract tas (temperature) variable

ncin_1.close() #close to save memory

#Repeat for future (2021-2100) data

ncin_1 = Dataset("/project/wca/AR5/CanESM2/monthly/histr1/tas_Amon_CanESM2_historical-r1_r1i1p1_202101-210012.nc")

tasr1 = ncin_1.variables['tas'][:]

ncin_1.close()

#Concatenate historical & future files together to make one time series array

tas11 = np.concatenate((tash1,tasr1),axis=0)

#Subtract the 1950-1979 mean to obtain anomalies

tas11 = tas11 - np.mean(tas11[0:359],axis=0,dtype=np.float64)

And I repeat that 49 times more for other datasets. Each tas11, tas12, etc file has the shape (1812, 64, 128) corresponding to time length in months, latitude, and longitude.

To get the ensemble mean, I do the following.

`#Move all tas data to one array`

alltas = np.zeros((1812,64,128,51)) #years, lat, lon, members (no ensemble mean value yet)

alltas[:,:,:,0] = tas11

(...)

alltas[:,:,:,49] = tas50

#Calculate ensemble mean & fill into 51st slot in axis 3

alltas[:,:,:,50] = np.mean(alltas,axis=3,dtype=np.float64)

When I check a coordinate & month, the ensemble mean is off from what it should be. Here's what a plot of globally averaged temperatures from 1950-2100 looks like with the first mean (with monhly values averaged into annual values. Black line is ensemble mean & colored lines are individual runs.

Obviously that deviated below the real ensemble mean. Here's what the plot looks like when I run alltas[:,:,:,50]=np.mean(alltas,axis=3,dtype=np.float64) a second time & keep everything else the same.

Much better.

Why does np.mean() calculate the wrong value the first time? I tried specifying the data type as a float when using np.mean() like in this question- Wrong numpy mean value?

But it didn't work. Any way I can fix it so it works correctly the first time? I don't want this problem to occur on a calculation where it's not so easy to notice a math error.

Answer

In the line

```
alltas[:,:,:,50] = np.mean(alltas,axis=3,dtype=np.float64)
```

the argument to `mean`

should be `alltas[:,:,:,:50]`

:

```
alltas[:,:,:,50] = np.mean(alltas[:,:,:,:50], axis=3, dtype=np.float64)
```

Otherwise you are including those final zeros in the calculation of the ensemble means.

Source (Stackoverflow)

Comments