splinter splinter - 1 month ago 26
Python Question

Using fillna() selectively in pandas

I would like to fill N/A values in a DataFrame in a selective manner. In particular, if there is a sequence of consequetive nans within a column, I want them to be filled by the preceeding non-nan value, but only if the length of the nan sequence is below a specified threshold. For example, if the threshold is 3 then a within-column sequence of 3 or less will be filled with the preceeding non-nan value, whereas a sequence of 4 or more nans will be left as is.

That is, if the input DataFrame is

2 5 4
nan nan nan
nan nan nan
5 nan nan
9 3 nan
7 9 1


I want the output to be:

2 5 4
2 5 nan
2 5 nan
5 5 nan
9 3 nan
7 9 1


The
fillna
function, when applied to a DataFrame, has the method and limit options. But these are unfortunately not sufficient to acheive the task. I tried to specify
method='ffill'
and
limit=3
, but that fills in the first 3 nans of any sequence, not selectively as described above.

I suppose this can be coded by going column by column with some conditional statements, but I suspect there must be something more Pythonic. Any suggestinos on an efficient way to acheive this?

DSM DSM
Answer

Working with contiguous groups is still a little awkward in pandas.. or at least I don't know of a slick way to do this, which isn't at all the same thing. :-)

One way to get what you want would be to use the compare-cumsum-groupby pattern:

In [68]: nulls = df.isnull()
    ...: groups = (nulls != nulls.shift()).cumsum()
    ...: to_fill = groups.apply(lambda x: x.groupby(x).transform(len) <= 3)
    ...: df.where(~to_fill, df.ffill())
    ...: 
Out[68]: 
     0    1    2
0  2.0  5.0  4.0
1  2.0  5.0  NaN
2  2.0  5.0  NaN
3  5.0  5.0  NaN
4  9.0  3.0  NaN
5  7.0  9.0  1.0

Okay, another alternative which I don't like because it's too tricky:

def method_2(df):
    nulls = df.isnull()
    filled = df.ffill(limit=3)
    unfilled = nulls & (~filled.notnull())
    nf = nulls.replace({False: 2.0, True: np.nan})
    do_not_fill = nf.combine_first(unfilled.replace(False, np.nan)).bfill() == 1
    return df.where(do_not_fill, df.ffill())

This doesn't use any groupby tools and so should be faster. Note that a different approach would be to manually (using shifts) determine which elements are to be filled because they're a group of length 1, 2, or 3.