duckertito duckertito - 17 days ago 14
Python Question

Conditional replace for pandas DataFrame

I have the following pandas DataFrame:

ID COL1 COL2
123 1 ABC
123 1 CCC
123 NaN AVV
345 2 FGG
345 NaN FRG
345 NaN FGT


I need to replace every NaN value in COL1 with the non-missing value from the rows that share the same ID, to get this result:

ID COL1 COL2
123 1 ABC
123 1 CCC
123 1 AVV
345 2 FGG
345 2 FRG
345 2 FGT


I can write a for loop, but it would take a long time to run on my dataset. Is there a conditional replace function?

Answer

Starting with an example as follows:

import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': list(range(10)), 'COL1': [np.random.choice([1, np.nan]) for _ in range(10)]})
df = pd.concat([df] * 100000).reset_index(drop=True)  # repeat the 10 rows to get 1,000,000 rows

df.head()

#  COL1 ID
#0  NaN  0
#1  1.0  1
#2  1.0  2
#3  NaN  3
#4  1.0  4

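You can use forward fill and backward fill within each group to fill the missing values. Note that in the chained call below only ffill() runs per group; the trailing bfill() is applied to the whole result, so a group that starts with NaN can pick up a value from the following rows of another ID: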

%timeit df.groupby('ID').ffill().bfill()
1 loop, best of 3: 212 ms per loop
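
Applied to the data from the question, a fill that keeps both directions inside each ID group might look like the sketch below. It is not the code timed above: it swaps in transform with a lambda so that ffill() and bfill() both run per group, and it has not been benchmarked here. The name filled is only illustrative.

```python
import numpy as np
import pandas as pd

# Data from the question
df = pd.DataFrame({
    'ID': [123, 123, 123, 345, 345, 345],
    'COL1': [1, 1, np.nan, 2, np.nan, np.nan],
    'COL2': ['ABC', 'CCC', 'AVV', 'FGG', 'FRG', 'FGT'],
})

# Forward fill, then backward fill, separately within each ID group.
# transform returns a Series aligned with the original index.
filled = df.copy()  # 'filled' is a hypothetical name for the result
filled['COL1'] = df.groupby('ID')['COL1'].transform(lambda s: s.ffill().bfill())

print(filled)
```

Because transform preserves the original index, the result can be assigned straight back to the column without reordering rows.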

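Alternatively, sort the values by ID and then by COL1. Sorting by COL1 within each ID pushes the missing values to the end of that ID, so a single ffill() fills them from the non-missing value above. In this example it is faster than the ffill()/bfill() approach above. It assumes every ID has at least one non-missing value (otherwise ffill() carries a value over from the previous ID), and it returns the rows in sorted order: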

%timeit df.sort_values(['ID', 'COL1']).ffill()
10 loops, best of 3: 71.6 ms per loop
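
A minimal sketch of the sort-then-fill idea on the question's data follows. The final sort_index() call, which restores the original row order, and the names ordered and result are additions for illustration and are not part of the timed one-liner.

```python
import numpy as np
import pandas as pd

# Data from the question
df = pd.DataFrame({
    'ID': [123, 123, 123, 345, 345, 345],
    'COL1': [1, 1, np.nan, 2, np.nan, np.nan],
    'COL2': ['ABC', 'CCC', 'AVV', 'FGG', 'FRG', 'FGT'],
})

# Sort so that NaN lands at the end of each ID (na_position defaults to 'last')
ordered = df.sort_values(['ID', 'COL1'])

# Forward fill only the column that has gaps
ordered['COL1'] = ordered['COL1'].ffill()

# Restore the original row order (assumes the index reflects that order)
result = ordered.sort_index()

print(result)
```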

If the column holds other placeholder values that should be filled, such as empty strings, first convert them to NaN with the replace method and then fill as before, for example: df.replace('', np.nan).sort_values(['ID', 'COL1']).ffill()
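
As a hedged illustration of the empty-string case, the sketch below uses a small made-up frame (the values are assumptions, not data from the question). The pd.to_numeric step is also an addition: it converts the column back to a numeric dtype after the strings are removed, which the one-liner above does not do explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical data: empty strings stand in for missing values
df = pd.DataFrame({
    'ID': [123, 123, 345, 345],
    'COL1': [1, '', 2, ''],
})

# Turn the placeholder strings into NaN, then make the column numeric
cleaned = df.replace('', np.nan)
cleaned['COL1'] = pd.to_numeric(cleaned['COL1'])

# Sort so NaN comes last within each ID, forward fill, restore row order
ordered = cleaned.sort_values(['ID', 'COL1'])
ordered['COL1'] = ordered['COL1'].ffill()
result = ordered.sort_index()

print(result)
```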