
Most efficient way to fill missing elements of dataframe with a function of column and row indices

I have a dataframe with missing values.

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice((0, np.nan), (5, 5)))
print(df)

     0    1    2    3    4
0  0.0  NaN  0.0  NaN  0.0
1  0.0  NaN  0.0  NaN  NaN
2  NaN  NaN  0.0  NaN  NaN
3  0.0  NaN  0.0  0.0  0.0
4  0.0  0.0  0.0  0.0  0.0


Question



How do I efficiently fill each missing value with what a function returns when passed that cell's row and column index values?

Suppose my function f is defined as:

f = lambda i, j: i ** 2 - np.sqrt(abs(j))


I expect to get:

     0    1    2         3    4
0  0.0 -1.0  0.0 -1.732051  0.0
1  0.0  0.0  0.0 -0.732051 -1.0
2  4.0  3.0  0.0  2.267949  2.0
3  0.0  8.0  0.0  0.000000  0.0
4  0.0  0.0  0.0  0.000000  0.0
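
As a quick spot check of this expected output: the missing cell at row 2, column 3 should become f(2, 3) = 2 ** 2 - sqrt(3) ≈ 2.267949, which is the value shown above.

print(f(2, 3))  # ≈ 2.267949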


I've created two functions so far that generate this output:

def pir1(df, f):
    # Build a frame whose cells are their own (row, column) labels,
    # evaluate f everywhere, then keep the original values where present.
    dfi = df.stack(dropna=False).index.to_series().unstack()
    return df.combine_first(dfi.applymap(lambda x: f(*x)))

def pir2(df, f):
    # Loop cell by cell and call f only for the missing entries.
    # (get_value is deprecated in newer pandas; df.at[i, j] reads the same cell.)
    dfc = df.copy()
    for i in dfc.index:
        for j in dfc.columns:
            dfv = df.get_value(i, j)
            dfc.at[i, j] = dfv if pd.notnull(dfv) else f(i, j)
    return dfc
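
For intuition, the intermediate dfi in pir1 is a frame whose cells hold their own (row, column) labels, so applymap(lambda x: f(*x)) evaluates f at every position and combine_first then keeps the original data wherever it is not NaN. A minimal illustration, using the df defined above:

dfi = df.stack(dropna=False).index.to_series().unstack()
print(dfi.iloc[0, 1])      # (0, 1) -- the cell stores its own index pair
print(f(*dfi.iloc[0, 1]))  # -1.0, the fill value for row 0, column 1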


Timing

%%timeit
pir1(df, f)

100 loops, best of 3: 3.74 ms per loop





%%timeit
pir2(df, f)

1000 loops, best of 3: 714 µs per loop


Can anyone improve on these?

Answer

Both the applymap in pir1 and the double for-loop in pir2 call f once for each pair of index and column values. If f can be vectorized -- i.e. defined to accept NumPy arrays as input instead of scalars -- then for large inputs the calculation can be sped up considerably by passing whole arrays to f.

For the f you posted, no change is necessary; f is already vectorized -- it can accept arrays as input just as easily as scalars.
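
For example, a column of row labels broadcast against a row of column labels gives the full grid of candidate fill values in one call (a quick sketch using the f above):

i = np.arange(5).reshape(-1, 1)  # shape (5, 1): row labels as a column
j = np.arange(5).reshape(1, -1)  # shape (1, 5): column labels as a row
print(f(i, j).shape)             # (5, 5) -- one call instead of 25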

import numpy as np
import pandas as pd

np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice((0, np.nan), (5, 5)))
def f(i, j): return i ** 2 - np.sqrt(abs(j))

def using_meshgrid(df, f):
    # Sparse, broadcastable grids of the row and column labels; f expands
    # them to a full (n, m) array of fill values in a single call.
    I, J = np.meshgrid(df.index, df.columns, sparse=True, indexing='ij')
    return df.combine_first(pd.DataFrame(f(I, J), index=df.index, columns=df.columns))

def pir1(df, f):
    dfi = df.stack(dropna=False).index.to_series().unstack()
    return df.combine_first(dfi.applymap(lambda x: f(*x)))

def pir2(df, f):
    dfc = df.copy()
    for i in dfc.index:
        for j in dfc.columns:
            dfv = df.get_value(i, j)
            dfc.at[i, j] = dfv if pd.notnull(dfv) else f(i, j)
    return dfc
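
A note on the design choice: sparse=True keeps meshgrid from materializing two full (n, m) grids; it returns an (n, 1) and a (1, m) array that broadcast inside f. A rough sketch of the shapes on the small frame above:

I, J = np.meshgrid(df.index, df.columns, sparse=True, indexing='ij')
print(I.shape, J.shape)  # (5, 1) (1, 5)
print(f(I, J).shape)     # (5, 5)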

For small inputs, pir2 is still the fastest:

In [290]: %timeit using_meshgrid(df, f)
100 loops, best of 3: 2.01 ms per loop

In [291]: %timeit pir1(df, f)
100 loops, best of 3: 4.61 ms per loop

In [292]: %timeit pir2(df, f)
1000 loops, best of 3: 825 µs per loop

But for large inputs, using_meshgrid is the fastest:

In [293]: df = pd.DataFrame(np.random.choice((0, np.nan), (500, 500)))

In [294]: %timeit using_meshgrid(df, f)
10 loops, best of 3: 160 ms per loop

In [295]: %timeit pir1(df, f)
1 loop, best of 3: 1.15 s per loop

In [296]: %timeit pir2(df, f)
1 loop, best of 3: 4.79 s per loop
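
As an optional sanity check (a sketch, assuming the functions defined above are in scope), the vectorized version reproduces pir1's output on a fresh small frame:

small = pd.DataFrame(np.random.choice((0, np.nan), (5, 5)))
assert using_meshgrid(small, f).equals(pir1(small, f))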