TomTom101 TomTom101 - 5 months ago 22
Python Question

Zero fill data frame forward fill

I am trying to fill a dataframe with zeros, however I do not want to touch leading NaNs:

rng = pd.date_range('2016-06-01', periods=9, freq='D')
df = pd.DataFrame({'data': pd.Series([np.nan]*3 + [20, 30, 40] + [np.nan]*3, rng)})

2016-06-01 NaN
2016-06-02 NaN
2016-06-03 NaN
2016-06-04 20.0
2016-06-05 30.0
2016-06-06 40.0
2016-06-07 NaN
2016-06-08 NaN
2016-06-09 NaN


The df I want after filling/replacing is this:

pd.DataFrame({'data': pd.Series([np.nan]*3 + [20, 30, 40] + [0.]*3, rng)})

2016-06-01 NaN
2016-06-02 NaN
2016-06-03 NaN
2016-06-04 20.0
2016-06-05 30.0
2016-06-06 40.0
2016-06-07 0.0
2016-06-08 0.0
2016-06-09 0.0


Since
fillna()
only allows value or method and
fillna(0)
replaces all NaNs, including leading, I was hoping replace could jump in here, but

df.replace([np.nan], 0, method='ffill')


also replaces all NaNs.

How can I zero fill values only after the first non-NaN value, also with multiple data columns?

Answer

I think you can find first group of NaN by isnull with cumsum and then fillna all other values:

print (df.data.notnull().cumsum())
2016-06-01    0
2016-06-02    0
2016-06-03    0
2016-06-04    1
2016-06-05    2
2016-06-06    3
2016-06-07    3
2016-06-08    3
2016-06-09    3
Freq: D, Name: data, dtype: int32

print (df.data.mask(df.data.notnull().cumsum() != 0, df.data.fillna(0)))
2016-06-01     NaN
2016-06-02     NaN
2016-06-03     NaN
2016-06-04    20.0
2016-06-05    30.0
2016-06-06    40.0
2016-06-07     0.0
2016-06-08     0.0
2016-06-09     0.0
Freq: D, Name: data, dtype: float64

EDIT:

With multiple columns it works nice too:

df = pd.DataFrame({'data': pd.Series([np.nan]*3 + [20, 30, 40] + [np.nan]*3, rng), 
                   'data1': pd.Series([np.nan]*2 + [20, 30, 40,30] + [np.nan]*3, rng),
                   'data2': pd.Series([np.nan]*1 + [10,20, 20, 30, 40] + [np.nan]*3, rng)})

print (df.mask(df.notnull().cumsum() != 0, df.fillna(0)))
            data  data1  data2
2016-06-01   NaN    NaN    NaN
2016-06-02   NaN    NaN   10.0
2016-06-03   NaN   20.0   20.0
2016-06-04  20.0   30.0   20.0
2016-06-05  30.0   40.0   30.0
2016-06-06  40.0   30.0   40.0
2016-06-07   0.0    0.0    0.0
2016-06-08   0.0    0.0    0.0
2016-06-09   0.0    0.0    0.0

EDIT2 by DSM comment - nicer is use cummax:

print (df.mask(df.notnull().cummax(), df.fillna(0)))
            data  data1  data2
2016-06-01   NaN    NaN    NaN
2016-06-02   NaN    NaN   10.0
2016-06-03   NaN   20.0   20.0
2016-06-04  20.0   30.0   20.0
2016-06-05  30.0   40.0   30.0
2016-06-06  40.0   30.0   40.0
2016-06-07   0.0    0.0    0.0
2016-06-08   0.0    0.0    0.0
2016-06-09   0.0    0.0    0.0
Comments