splinter - 3 months ago 68

Python Question

I would like to fill N/A values in a DataFrame in a selective manner. In particular, if there is a sequence of consequetive nans within a column, I want them to be filled by the preceeding non-nan value, but only if the length of the nan sequence is below a specified threshold. For example, if the threshold is 3 then a within-column sequence of 3 or less will be filled with the preceeding non-nan value, whereas a sequence of 4 or more nans will be left as is.

That is, if the input DataFrame is

`2 5 4`

nan nan nan

nan nan nan

5 nan nan

9 3 nan

7 9 1

I want the output to be:

`2 5 4`

2 5 nan

2 5 nan

5 5 nan

9 3 nan

7 9 1

The

`fillna`

`method='ffill'`

`limit=3`

I suppose this can be coded by going column by column with some conditional statements, but I suspect there must be something more Pythonic. Any suggestinos on an efficient way to acheive this?

Answer

Working with contiguous groups is still a little awkward in pandas.. or at least I don't know of a slick way to do this, which isn't at all the same thing. :-)

One way to get what you want would be to use the compare-cumsum-groupby pattern:

```
In [68]: nulls = df.isnull()
...: groups = (nulls != nulls.shift()).cumsum()
...: to_fill = groups.apply(lambda x: x.groupby(x).transform(len) <= 3)
...: df.where(~to_fill, df.ffill())
...:
Out[68]:
0 1 2
0 2.0 5.0 4.0
1 2.0 5.0 NaN
2 2.0 5.0 NaN
3 5.0 5.0 NaN
4 9.0 3.0 NaN
5 7.0 9.0 1.0
```

Okay, another alternative which I don't like because it's too tricky:

```
def method_2(df):
nulls = df.isnull()
filled = df.ffill(limit=3)
unfilled = nulls & (~filled.notnull())
nf = nulls.replace({False: 2.0, True: np.nan})
do_not_fill = nf.combine_first(unfilled.replace(False, np.nan)).bfill() == 1
return df.where(do_not_fill, df.ffill())
```

This doesn't use any `groupby`

tools and so should be faster. Note that a different approach would be to manually (using shifts) determine which elements are to be filled because they're a group of length 1, 2, or 3.