kip6000 - 2 years ago 126
Python Question

Pandas/numpy help vectorizing sequence of lookup, calculation and reordering across dataframes

I'm trying to process data from three (csv) files say p, c, f:

• In p, each row has labels

• In c, each row has scores for labels in corresponding row in p (p is matched to c)

• In f, each row is a label and another score

For e.g., loaded into df_p, df_c and df_f respectively:

```
>>> df_p
       p1   p2   p3   p4   p5
2614  104  104  102  102  102
3735  100  103  101  100  104
1450  100  102  100  102  102
>>> df_c
            c1        c2        c3        c4        c5
2614  0.338295  0.190882  0.157231  0.135776  0.177816
3735  0.097800  0.124296  0.268475  0.265111  0.244319
1450  0.160922  0.403703  0.122390  0.130612  0.182373
>>> df_f
            c
100  0.183946
101  0.290311
102  0.192049
103  0.725704
104  0.143359
```

Algorithm

```
For each row in df_p, df_c:
1. multiply each score in the df_c row by df_f[label], where label comes from the corresponding cell in df_p
2. reorder the df_c row in descending score order
3. reorder the df_p row using the order from step 2
```

For example, the first calculated cell in `df_c` will be `0.338295 * 0.143359`, since the first label in `df_p` is 104 and `df_f` maps 104 to 0.143359.
This is the code I have; it works, but very slowly:

```python
np_p = []
np_c = []
for i in range(len(df_p)):
    # Step 1: revise scores
    r_conf = df_c.iloc[i].values             # scores for this row
    r_place_id = df_p.iloc[i].values         # labels for this row
    p_c = df_f.loc[r_place_id, 'c'].values   # df_f score for each label
    t_conf = r_conf * p_c                    # revised scores

    # Step 2: reorder by revised score, descending
    c = np.sort(t_conf)[::-1]
    c_sort = np.argsort(t_conf)[::-1]

    # Step 3: reorder labels with the revised score order
    p_sort = df_p.iloc[[i]][df_p.columns[c_sort]].values
    np_c.append(c)
    np_p.append(p_sort)
```

Ideally I'd like to end up with dataframes shaped like `df_p` and `df_c`, but holding the revised, reordered values (currently in `np_p` and `np_c`).

Any ideas on how I can make this go faster?

Thanks!

You could use the `DataFrame.replace` method to replace the labels in `df_p` with the corresponding values from `df_f`:

```
In [124]: df_pf = df_p.replace(df_f['c']); df_pf
Out[124]:
            p1        p2        p3        p4        p5
2614  0.143359  0.143359  0.192049  0.192049  0.192049
3735  0.183946  0.725704  0.290311  0.183946  0.143359
1450  0.183946  0.192049  0.183946  0.192049  0.192049
```
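`replace` works here because a Series passed as `to_replace` is treated like a dict: its index (the labels 100–104) is mapped to its values. A minimal illustration with made-up numbers:

```python
import pandas as pd

# A Series whose index holds the labels and whose values hold the scores,
# mirroring df_f['c'] from the question.
mapping = pd.Series([0.5, 0.7], index=[100, 101])

df = pd.DataFrame({'p1': [100, 101],
                   'p2': [101, 100]})

# Every occurrence of 100 becomes 0.5, every 101 becomes 0.7,
# regardless of which column it appears in.
df_mapped = df.replace(mapping)
print(df_mapped)
```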

Since Pandas aligns indices before multiplying two DataFrames, if we strip the `p`s and `c`s off the column labels, we can obtain the desired products using `df_pf.mul(df_c)`:

```python
df_pf.columns = df_pf.columns.str.extract(r'(\d+)', expand=False)
df_c.columns = df_c.columns.str.extract(r'(\d+)', expand=False)
df_c = df_pf.mul(df_c)
```

The correct order for the columns for each row can be obtained using `np.argsort` with `axis=1` specified. The `order` array returned by `np.argsort` can then be used to reorder `df_c` and `df_p`:

```python
order = np.argsort(-df_c.values, axis=1)
nrows, ncols = df_c.shape
np_c = df_c.values[np.arange(nrows)[:, None], order]
np_p = df_p.values[np.arange(nrows)[:, None], order]
```

The above uses NumPy's advanced integer indexing to reorder the values in each row separately.
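To see the indexing pattern in isolation, here is a tiny stand-alone example: negating the array makes `argsort` produce a descending order per row, and pairing a column vector of row indices with the `order` array picks out each row's elements in that order.

```python
import numpy as np

a = np.array([[10, 30, 20],
              [ 5, 15, 25]])

order = np.argsort(-a, axis=1)           # per-row descending order of indices
rows = np.arange(a.shape[0])[:, None]    # column vector [[0], [1]] broadcasts against order

sorted_rows = a[rows, order]
print(sorted_rows)
# [[30 20 10]
#  [25 15  5]]
```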

```python
import numpy as np
import pandas as pd

df_p = pd.DataFrame({'p1': [104, 100, 100],
                     'p2': [104, 103, 102],
                     'p3': [102, 101, 100],
                     'p4': [102, 100, 102],
                     'p5': [102, 104, 102]}, index=[2614, 3735, 1450])

df_c = pd.DataFrame({'c1': [0.338295, 0.097800, 0.160922],
                     'c2': [0.190882, 0.124296, 0.403703],
                     'c3': [0.157231, 0.268475, 0.122390],
                     'c4': [0.135776, 0.265111, 0.130612],
                     'c5': [0.177816, 0.244319, 0.182373]}, index=[2614, 3735, 1450])

df_f = pd.DataFrame({'c': [0.183946, 0.290311, 0.192049, 0.725704, 0.143359]},
                    index=list(range(100, 105)))

def using_pandas(df_p, df_c, df_f):
    # Works no matter the order of the columns and rows of df_p and df_c:
    # aligns df_p and df_c on the numeric part of their column names.
    df_pf = df_p.replace(df_f['c'])
    # Rename the columns so Pandas aligns them before multiplying.
    df_pf.columns = df_pf.columns.str.extract(r'(\d+)', expand=False)
    df_c.columns = df_c.columns.str.extract(r'(\d+)', expand=False)
    df_c = df_pf.mul(df_c)
    order = np.argsort(-df_c.values, axis=1)
    nrows, ncols = df_c.shape
    np_c = df_c.values[np.arange(nrows)[:, None], order]
    np_p = df_p.values[np.arange(nrows)[:, None], order]
    return np_c, np_p

np_c, np_p = using_pandas(df_p, df_c, df_f)
print(np_c)
print(np_p)
```

yields

```
[[ 0.04849763  0.03414938  0.03019606  0.02736465  0.02607565]
 [ 0.0902021   0.07794125  0.04876611  0.03502533  0.01798992]
 [ 0.07753076  0.03502455  0.02960096  0.0250839   0.02251315]]
[[104 102 102 104 102]
 [103 101 100 104 100]
 [102 102 100 102 100]]
```

Alternatively, if the columns and rows of `df_p` and `df_c` are already aligned, then you can gain a bit more speed by doing the multiplication in NumPy instead of Pandas:

```python
def using_numpy(df_p, df_c, df_f):
    # Faster than using_pandas, but assumes df_p and df_c are already aligned.
    df_pf = df_p.replace(df_f['c'])
    df_pf = df_pf.values
    df_c = df_c.values
    df_p = df_p.values
    df_c = df_pf * df_c
    order = np.argsort(-df_c, axis=1)
    nrows, ncols = df_c.shape
    np_c = df_c[np.arange(nrows)[:, None], order]
    np_p = df_p[np.arange(nrows)[:, None], order]
    return np_c, np_p
```

For these small DataFrames, `using_numpy` is slightly faster than `using_pandas`. The difference in speed would be more pronounced if the DataFrames were larger. But again, note that `using_numpy` relies on the indices already being aligned.

```
In [138]: %timeit using_numpy(df_p, df_c, df_f)
1000 loops, best of 3: 1.15 ms per loop

In [139]: %timeit using_pandas(df_p, df_c, df_f)
1000 loops, best of 3: 1.62 ms per loop
```
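Going further: if the labels happen to form a contiguous integer range (100–104 in the example), the `df_f` lookup itself can also be done with plain NumPy indexing, skipping `replace` entirely. This is only a sketch under that assumption, using small stand-in arrays in place of the real `df_p`/`df_c`/`df_f` values:

```python
import numpy as np

# Scores for labels 100..104, in label order (stand-in for df_f['c'].values).
f_vals = np.array([0.183946, 0.290311, 0.192049, 0.725704, 0.143359])

labels = np.array([[104, 102],
                   [100, 103]])      # stand-in for df_p.values
conf = np.array([[0.3, 0.2],
                 [0.1, 0.4]])        # stand-in for df_c.values

# labels - 100 converts each label to its position in f_vals,
# so the lookup and the multiply happen in one vectorized step.
scores = conf * f_vals[labels - 100]
print(scores)
```

The same `np.argsort`/fancy-indexing reordering from above would then be applied to `scores`.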