duckertito duckertito - 17 days ago 14
Python Question

Conditional replace for pandas DataFrame

I have the following pandas DataFrame:


123 1 ABC
123 1 CCC
123 NaN AVV
345 2 FGG
345 NaN FRG
345 NaN FGT

I need to replace all NaN values in the second column (COL1) with the non-missing value from the rows that share the same first column (ID), in order to get this result:

123 1 ABC
123 1 CCC
123 1 AVV
345 2 FGG
345 2 FRG
345 2 FGT

I can write a loop, but it would take a long time to run on my dataset. Is there a conditional replace function?


Starting with an example as follows:

import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': list(range(10)), 'COL1': [np.random.choice([1, np.nan]) for _ in range(10)]})
df = pd.concat([df]*100000).reset_index(drop=True)


#  COL1 ID
#0  NaN  0
#1  1.0  1
#2  1.0  2
#3  NaN  3
#4  1.0  4

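You can forward fill and then backward fill to fill the missing values. Note that in this one-liner only ffill() runs within each ID group; the trailing bfill() is applied to the whole result, so a group that starts with NaN would be filled from the following group: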

%timeit df.groupby('ID').ffill().bfill()
1 loop, best of 3: 212 ms per loop
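A minimal sketch of a variant that keeps both fills inside each group, using the data from the question. The column name COL2 is an assumption made for illustration, since the question does not name its third column, and this version has not been timed against the one above.

```python
import numpy as np
import pandas as pd

# Data from the question; "COL2" is a hypothetical name for the third column.
df = pd.DataFrame({
    "ID": [123, 123, 123, 345, 345, 345],
    "COL1": [1, 1, np.nan, 2, np.nan, np.nan],
    "COL2": ["ABC", "CCC", "AVV", "FGG", "FRG", "FGT"],
})

filled = df.copy()
# transform() runs the lambda once per ID group, so neither fill
# can pull a value across a group boundary.
filled["COL1"] = df.groupby("ID")["COL1"].transform(lambda s: s.ffill().bfill())

# COL1 is now 1, 1, 1, 2, 2, 2
print(filled)
```

A group in which every COL1 value is missing stays NaN with this approach.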

Alternatively, sort the values by ID and COL1. This sorts by ID first and then by COL1 within each ID, which pushes the missing values to the end of each ID group, so a single ffill() fills them. It assumes every ID has at least one non-missing value; otherwise ffill() carries over the value from the previous ID. For this example it is faster than the ffill() and bfill() approach above:

%timeit df.sort_values(['ID', 'COL1']).ffill()
10 loops, best of 3: 71.6 ms per loop
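The sorted result comes back in sorted order rather than the original row order. As a small sketch, assuming the original index should be restored, sort_index() can be chained at the end. The rows below are the question's data, deliberately shuffled for illustration, and COL2 is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Question's rows in a shuffled order; "COL2" is a hypothetical column name.
df = pd.DataFrame({
    "ID": [123, 345, 123, 345, 123, 345],
    "COL1": [np.nan, 2, 1, np.nan, 1, np.nan],
    "COL2": ["AVV", "FGG", "ABC", "FRG", "CCC", "FGT"],
})

filled = (
    df.sort_values(["ID", "COL1"])  # NaN sorts last within each ID by default
      .ffill()                      # each NaN takes the value just above it
      .sort_index()                 # restore the original row order
)

# COL1 is now 1, 2, 1, 2, 1, 2, matching the ID of each row
print(filled)
```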

If the data contains other placeholder strings that should count as missing, call the replace method to turn them into NaN first. For instance, if the data frame contains empty strings that you want to fill, you can do df.replace('', np.nan).sort_values(['ID', 'COL1']).ffill()
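A short sketch of the empty-string case, assuming COL1 was read in as text so that missing entries appear as '' rather than NaN. The sample values are made up for illustration, and the trailing sort_index() is an optional addition that restores the original row order.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: COL1 holds strings, with '' marking missing entries.
df = pd.DataFrame({
    "ID": [123, 123, 123, 345, 345, 345],
    "COL1": ["1", "1", "", "2", "", ""],
})

filled = (
    df.replace("", np.nan)          # turn empty strings into real NaN
      .sort_values(["ID", "COL1"])  # NaN moves to the end of each ID
      .ffill()
      .sort_index()
)

# COL1 is now "1", "1", "1", "2", "2", "2"
print(filled)
```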