Pablo A Pablo A - 1 month ago 10
Python Question

efficiently replace values from a column to another column Pandas DataFrame

I have a Pandas DataFrame like the following one:

col1 col2 col3
1 0.2 0.3 0.3
2 0.2 0.3 0.3
3 0 0.4 0.4
4 0 0 0.3
5 0 0 0
6 0.1 0.4 0.4


I wanna replace the
col1
values with the values in the second column (
col2
) only if
col1
values are equal to 0, and after (for the zero values remaining), do it again but with the third column (
col3
). The Desired Result is the next one:

col1 col2 col3
1 0.2 0.3 0.3
2 0.2 0.3 0.3
3 0.4 0.4 0.4
4 0.3 0 0.3
5 0 0 0
6 0.1 0.4 0.4


I did it using the
pd.replace
function, but it seems too slow.. I think must be a faster way to accomplish that.

df.col1.replace(0,df.col2,inplace=True)
df.col1.replace(0,df.col3,inplace=True)


is there a faster way to do that?, using some other function instead of the
pd.replace
function?, maybe slicing the
df.col1
?

PS1. Additional info: the original goal of the problem is to replace the zero values (in
col1
) with the statistic mode of some groups of ids. That is why I have
col2
(best mode, first option of replacement) and
col3
(last mode, still functional but not so desirable as
col2
).

Answer

Using np.where is faster. Using a similar pattern as you used with replace:

df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])

However, using a nested np.where is slightly faster:

df['col1'] = np.where(df['col1'] == 0, 
                      np.where(df['col2'] == 0, df['col3'], df['col2']),
                      df['col1'])

Timings

Using the following setup to produce a larger sample DataFrame and timing functions:

df = pd.concat([df]*10**4, ignore_index=True)

def root_nested(df):
    df['col1'] = np.where(df['col1'] == 0, np.where(df['col2'] == 0, df['col3'], df['col2']), df['col1'])
    return df

def root_split(df):
    df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
    df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
    return df

def pir2(df):
    df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
    return df

def pir2_2(df):
    slc = (df.values != 0).argmax(axis=1)
    return df.values[np.arange(slc.shape[0]), slc]

def andrew(df):
    df.col1[df.col1 == 0] = df.col2
    df.col1[df.col1 == 0] = df.col3
    return df

def pablo(df):
    df['col1'] = df['col1'].replace(0,df['col2'])
    df['col1'] = df['col1'].replace(0,df['col3'])
    return df

I get the following timings:

%timeit root_nested(df.copy())
100 loops, best of 3: 2.25 ms per loop

%timeit root_split(df.copy())
100 loops, best of 3: 2.62 ms per loop

%timeit pir2(df.copy())
100 loops, best of 3: 6.25 ms per loop

%timeit pir2_2(df.copy())
1 loop, best of 3: 2.4 ms per loop

%timeit andrew(df.copy())
100 loops, best of 3: 8.55 ms per loop

I tried timing your method, but it's been running for multiple minutes without completing. As a comparison, timing your method on just the 6 row example DataFrame (not the much larger one tested above) took 12.8 ms.

Comments