Georg Heiler Georg Heiler - 1 month ago 4
Python Question

Pandas per group imputation of missing values

How can I achieve such a per-country imputation for each indicator in pandas?

I want to impute the missing values per group

  • no-A-state should get
    per indicatorKPI

  • no-ISO-state should get the
    per indicatorKPI

  • for states with missing values, I want to impute with the per
    mean. Here, this would mean to impute the missing values for Serbia

    mydf = pd.DataFrame({'Country':['no-A-state','no-ISO-state','germany','serbia', 'austria', 'germany','serbia', 'austria',], 'indicatorKPI':[np.nan,np.nan,'SP.DYN.LE00.IN','NY.GDP.MKTP.CD','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN'], 'value':[np.nan,np.nan,0.9,np.nan,0.7, 0.2, 0.3, 0.6]})
    enter image description here


The desired output should be similar to

mydf = pd.DataFrame({'Country':['no-A-state','no-ISO-state', 'no-A-state','no-ISO-state',
'germany','serbia','serbia', 'austria',
'germany','serbia', 'austria',],
'indicatorKPI':['SP.DYN.LE00.IN','NY.GDP.MKTP.CD', 'SP.DYN.LE00.IN',
'value':['MIN of all for this indicator', 'MEAN of all for this indicator','MIN of all for this indicator','MEAN of all for this indicator', 0.9,'MEAN of all for SP.DYN.LE00.IN indicator',0.7, 'MEAN of all for NY.GDP.MKTP.CD indicator',0.2, 0.3, 0.6]

enter image description here


Based on your new example df the following works for me:

In [185]:
mydf.loc[mydf['Country'] == 'no-A-state', 'value'] = mydf['value'].min()
mydf.loc[mydf['Country'] == 'no-ISO-state', 'value'] = mydf['value'].mean()
mydf.loc[mydf['value'].isnull(), 'value'] = mydf['indicatorKPI'].map(mydf.groupby('indicatorKPI')['value'].mean())

         Country    indicatorKPI     value
0     no-A-state  SP.DYN.LE00.IN  0.200000
1   no-ISO-state  NY.GDP.MKTP.CD  0.442857
2     no-A-state  SP.DYN.LE00.IN  0.200000
3   no-ISO-state  SP.DYN.LE00.IN  0.442857
4        germany  NY.GDP.MKTP.CD  0.900000
5         serbia  SP.DYN.LE00.IN  0.328571
6         serbia  NY.GDP.MKTP.CD  0.700000
7        austria  NY.GDP.MKTP.CD  0.585714
8        germany  SP.DYN.LE00.IN  0.200000
9         serbia  NY.GDP.MKTP.CD  0.300000
10       austria  SP.DYN.LE00.IN  0.600000

Basically what this does is to fill the missing values for each condition, so we set the min for the 'no-A-state' countries, then mean for 'no-ISO-state' countries. We then groupby on 'indicatorKPI' and calc the mean for each group and assign again to the null value rows, the respective countries' mean using map which performs a lookup

Here are the steps broken down:

In [187]:


NY.GDP.MKTP.CD    0.633333
SP.DYN.LE00.IN    0.400000
Name: value, dtype: float64

In [188]:

0     0.400000
1     0.633333
2     0.400000
3     0.400000
4     0.633333
5     0.400000
6     0.633333
7     0.633333
8     0.400000
9     0.633333
10    0.400000
Name: indicatorKPI, dtype: float64