Louise Stevens Louise Stevens - 26 days ago 8
Python Question

Grouping in pandas, calculation of mean

I am working with pandas.

This is my data: http://www.esapubs.org/archive/ecol/E084/094/MOMv3.3.txt
Where each column is Continent (SA, NA, EA, insular, oceanic, AUS, AF), Status (extinct, historical, introduction, or extant), Order, Family, Genus, Species, Log Mass, Combined Mass, and Reference, respectively.

I am trying to compare masses of extinct and extant species for each continent. Therefore I need to group this dataframe by 'continent' and 'status' and then to work out the average mass of each group - extinct and extant species - for each species. (For example, I need to work out the average combined mass for all extinct species in the continent AF, then the average combined mass for all extant species in the continent AF..etc..)

I want to create a csv file where the first entry on each line is the continent, the second entry is the average mass of the extant species on that continent, the third entry is the average mass of the extinct species on that continent, and the forth entry is the difference between the average extant and average extinct masses.

This is my code so far:

import pandas as pd
url="http://www.esapubs.org/archive/ecol/E084/094/MOMv3.3.txt"
mammalian_data = pd.read_csv(url, delimiter="\t", header=None,
names= ['continent', 'status', 'order', 'family', 'genus', 'species', 'log mass', 'combined mass', 'reference'])

AF_extant = mammalian_data.groupby(["continent", "status"]).get_group(("AF","extant"))
print(AF_extant["combined mass"].mean())
AF_extinct = mammalian_data.groupby(["continent", "status"]).get_group(("AF","extinct"))
print(AF_extinct["combined mass"].mean())


How do I repeat this code for each of the 7 continents? How also do I create a new dataframe with this information to then export to a CSV file with all the necessary entries - as specified above.

Please can anyone help with this!

Answer
import pandas as pd
url="http://www.esapubs.org/archive/ecol/E084/094/MOMv3.3.txt"
mammalian_data = pd.read_csv(
    url, delimiter="\t", header=None, 
    names=['continent', 'status', 'order', 'family', 'genus',
           'species', 'log mass', 'combined mass',
           'reference'])

result = mammalian_data.groupby(["continent", "status"])['combined mass'].mean()
result = result.unstack('status')
result['diff'] = result['extinct']-result['extant']
result = result[['extant', 'extinct', 'diff']]

yields

status           extant        extinct           diff
continent                                            
AF         2.220823e+04  970038.461538  947830.232208
AUS        1.691833e+04  188355.555556  171437.228736
Af        -9.990000e+02            NaN            NaN
EA         2.114960e+04            NaN            NaN
Insular    7.757171e+03   81912.131034   74154.960145
Oceanic    8.238885e+06            NaN            NaN
SA         3.864696e+03  973072.402597  969207.706264

You could then use result.to_csv('out.csv') to write result to a CSV.


mammalian_data.groupby(["continent", "status"])['combined mass'].mean() computes all the means for all the groups:

In [35]: mammalian_data.groupby(["continent", "status"])['combined mass'].mean()
Out[35]: 
continent  status      
AF         extant          2.220823e+04
           extinct         9.700385e+05
           historical      1.475138e+05
AUS        extant          1.691833e+04
           extinct         1.883556e+05
           historical      2.653043e+03
           introduction    1.797469e+05
Af         extant         -9.990000e+02
EA         extant          2.114960e+04
           historical      2.014170e+05
Insular    extant          7.757171e+03
           extinct         8.191213e+04
           historical      8.433233e+04
Oceanic    extant          8.238885e+06
SA         extant          3.864696e+03
           extinct         9.730724e+05
Name: combined mass, dtype: float64

This Series has a 2-level MultiIndex. The levels are continent and status. To move the status index level to columns, use the unstack method:

In [36]: mammalian_data.groupby(["continent", "status"])['combined mass'].mean().unstack()
Out[36]: 
status           extant        extinct     historical   introduction
continent                                                           
AF         2.220823e+04  970038.461538  147513.750000            NaN
AUS        1.691833e+04  188355.555556    2653.043478  179746.852941
Af        -9.990000e+02            NaN            NaN            NaN
EA         2.114960e+04            NaN  201417.000000            NaN
Insular    7.757171e+03   81912.131034   84332.326000            NaN
Oceanic    8.238885e+06            NaN            NaN            NaN
SA         3.864696e+03  973072.402597            NaN            NaN
Comments