kip6000 kip6000 - 5 months ago 4
Python Question

Pandas/numpy help vectorizing sequence of lookup, calculation and reordering across dataframes

I'm trying to process data from three (csv) files say p, c, f:


  • In p, each row has labels

  • In c, each row has scores for labels in corresponding row in p (p is matched to c)

  • In f, each row is a label and another score



For e.g., loaded into df_p, df_c and df_f respectively:

>>> df_p
p1 p2 p3 p4 p5
2614 104 104 102 102 102
3735 100 103 101 100 104
1450 100 102 100 102 102
>>> df_c
c1 c2 c3 c4 c5
2614 0.338295 0.190882 0.157231 0.135776 0.177816
3735 0.097800 0.124296 0.268475 0.265111 0.244319
1450 0.160922 0.403703 0.122390 0.130612 0.182373
>>> df_f
c
100 0.183946
101 0.290311
102 0.192049
103 0.725704
104 0.143359


Algo

For each row in df_p, df_c:
1. update each score in df_c row with df_c * df_f[label] where label is from p
2. reorder elements of df_c in descending scores
3. reorder elements in df_p with order from df_c


For eg, the first calculated cell in
df_c
will be
0.338295*0.143359


this is the code I have that's working albeit very very slowly:

np_p = []
np_c = []
for i in range(len(df_p)):

## determine revised scores
# Step 1. Revise scores
r_conf = df_c.iloc[[i]].values[0] # scores for row
r_place_id = df_p.iloc[[i]].values[0] # labels for row
p_c = df_f.ix[r_place_id].c.values # class conf for labels
t_conf = r_conf*p_c # total score

# Reorder labels
# Step 2. reorder by revised score
c = np.sort(t_conf)[::-1]
c_sort = np.argsort(t_conf)[::-1]
# Step 3. reorder labels with revised score order
p_sort = df_p.iloc[[i]][df_p.columns[c_sort]].values
np_c.append(c)
np_p.append(p_sort)


Ideally I'd like to create a dataframe like
df_p
and
df_c
but with the reordered and revised values (in
np_p
and
np_c
).

Any ideas on how I can make this go faster.

Thanks!!!

Answer

You could use the DateFrame.replace method to replace the values in df_p with values from df_f:

In [124]: df_pf = df_p.replace(df_f['c']); df_pf
Out[124]: 
            p1        p2        p3        p4        p5
2614  0.143359  0.143359  0.192049  0.192049  0.192049
3735  0.183946  0.725704  0.290311  0.183946  0.143359
1450  0.183946  0.192049  0.183946  0.192049  0.192049

Since Pandas aligns indices before multiplying two DataFrames, if we strip the ps and cs off the column labels, then we could obtain the desired products using df_pf.mul(df_c):

df_pf.columns = df_pf.columns.str.extract(r'(\d+)', expand=False)
df_c.columns = df_c.columns.str.extract(r'(\d+)', expand=False)
df_c = df_pf.mul(df_c)

The correct order for the columns for each row can be obtained using np.argsort with axis=1 specified. The order array returned by np.argsort can then be used to reorder df_c and df_p:

order = np.argsort(-df_c.values, axis=1)
nrows, ncols = df_c.shape
np_c = df_c.values[np.arange(nrows)[:,None], order]
np_p = df_p.values[np.arange(nrows)[:,None], order]

The above uses NumPy's advanced integer indexing to reorder the values in each row separately.


import numpy as np
import pandas as pd

df_p = pd.DataFrame({'p1': [104, 100, 100],
 'p2': [104, 103, 102],
 'p3': [102, 101, 100],
 'p4': [102, 100, 102],
 'p5': [102, 104, 102]}, index=[2614,3735,1450])

df_c = pd.DataFrame({'c1': [0.33829499999999996, 0.097799999999999998, 0.16092200000000001],
 'c2': [0.190882, 0.124296, 0.40370300000000003],
 'c3': [0.15723099999999998, 0.26847500000000002, 0.12239000000000001],
 'c4': [0.13577600000000001, 0.26511099999999999, 0.13061199999999998],
 'c5': [0.177816, 0.24431900000000001, 0.18237300000000001]}, index=[2614,3735,1450])

df_f = pd.DataFrame({'c': [0.183946,
  0.29031099999999999,
  0.192049,
  0.72570400000000002,
  0.14335899999999999]}, index=list(range(100,105)))

def using_pandas(df_p, df_c, df_f):
    # this works no matter the order of the columns and rows of `df_p` and `df_c`.
    # aligns `df_p` and `df_c` based on the numeric part of their column names
    df_pf = df_p.replace(df_f['c'])
    # change the column names to match since Pandas will align the indices before multiplying
    df_pf.columns = df_pf.columns.str.extract(r'(\d+)', expand=False)
    df_c.columns = df_c.columns.str.extract(r'(\d+)', expand=False)
    df_c = df_pf.mul(df_c)
    order = np.argsort(-df_c.values, axis=1)
    nrows, ncols = df_c.shape
    np_c = df_c.values[np.arange(nrows)[:,None], order]
    np_p = df_p.values[np.arange(nrows)[:,None], order]
    return np_c, np_p

np_c, np_p = using_pandas(df_p, df_c, df_f)
print(np_c)
print(np_p)

yields

[[ 0.04849763  0.03414938  0.03019606  0.02736465  0.02607565]
 [ 0.0902021   0.07794125  0.04876611  0.03502533  0.01798992]
 [ 0.07753076  0.03502455  0.02960096  0.0250839   0.02251315]]
[[104 102 102 104 102]
 [103 101 100 104 100]
 [102 102 100 102 100]]

Alternatively, if the columns and rows of df_p and df_c are already aligned, then you can gain a bit more speed by doing the multiplication in NumPy instead of Pandas:

def using_numpy(df_p, df_c, df_f):
    # faster than using_pandas, but assumes `df_p` and `df_c` are already aligned
    df_pf = df_p.replace(df_f['c'])
    df_pf = df_pf.values
    df_c = df_c.values
    df_p = df_p.values
    df_c = df_pf * df_c
    order = np.argsort(-df_c, axis=1)
    nrows, ncols = df_c.shape
    np_c = df_c[np.arange(nrows)[:,None], order]
    np_p = df_p[np.arange(nrows)[:,None], order]
    return np_c, np_p

For these small DataFrames, using_numpy is slightly faster than using_pandas. The difference in speed would be more pronounced if the DataFrames were larger. But again, note that using_numpy relies on the indices already being aligned.

In [138]: %timeit using_numpy(df_p, df_c, df_f)
1000 loops, best of 3: 1.15 ms per loop

In [139]: %timeit using_pandas(df_p, df_c, df_f)
1000 loops, best of 3: 1.62 ms per loop
Comments