kip6000 - 1 year ago 105

Python Question

I'm trying to process data from three (csv) files say p, c, f:

- In p, each row has labels
- In c, each row has scores for labels in corresponding row in p (p is matched to c)
- In f, each row is a label and another score

For e.g., loaded into df_p, df_c and df_f respectively:

`>>> df_p`

p1 p2 p3 p4 p5

2614 104 104 102 102 102

3735 100 103 101 100 104

1450 100 102 100 102 102

>>> df_c

c1 c2 c3 c4 c5

2614 0.338295 0.190882 0.157231 0.135776 0.177816

3735 0.097800 0.124296 0.268475 0.265111 0.244319

1450 0.160922 0.403703 0.122390 0.130612 0.182373

>>> df_f

c

100 0.183946

101 0.290311

102 0.192049

103 0.725704

104 0.143359

Algo

`For each row in df_p, df_c:`

1. update each score in df_c row with df_c * df_f[label] where label is from p

2. reorder elements of df_c in descending scores

3. reorder elements in df_p with order from df_c

For eg, the first calculated cell in

`df_c`

`0.338295*0.143359`

this is the code I have that's working albeit very very slowly:

`np_p = []`

np_c = []

for i in range(len(df_p)):

## determine revised scores

# Step 1. Revise scores

r_conf = df_c.iloc[[i]].values[0] # scores for row

r_place_id = df_p.iloc[[i]].values[0] # labels for row

p_c = df_f.ix[r_place_id].c.values # class conf for labels

t_conf = r_conf*p_c # total score

# Reorder labels

# Step 2. reorder by revised score

c = np.sort(t_conf)[::-1]

c_sort = np.argsort(t_conf)[::-1]

# Step 3. reorder labels with revised score order

p_sort = df_p.iloc[[i]][df_p.columns[c_sort]].values

np_c.append(c)

np_p.append(p_sort)

Ideally I'd like to create a dataframe like

`df_p`

`df_c`

`np_p`

`np_c`

Any ideas on how I can make this go faster.

Thanks!!!

Recommended for you: Get network issues from **WhatsUp Gold**. **Not end users.**

Answer Source

You could use the `DateFrame.replace`

method to replace the values in `df_p`

with values from `df_f`

:

```
In [124]: df_pf = df_p.replace(df_f['c']); df_pf
Out[124]:
p1 p2 p3 p4 p5
2614 0.143359 0.143359 0.192049 0.192049 0.192049
3735 0.183946 0.725704 0.290311 0.183946 0.143359
1450 0.183946 0.192049 0.183946 0.192049 0.192049
```

Since Pandas aligns indices before multiplying two DataFrames, if we strip the
`p`

s and `c`

s off the column labels, then we could obtain the desired products
using `df_pf.mul(df_c)`

:

```
df_pf.columns = df_pf.columns.str.extract(r'(\d+)', expand=False)
df_c.columns = df_c.columns.str.extract(r'(\d+)', expand=False)
df_c = df_pf.mul(df_c)
```

The correct order for the columns for each row can be obtained using `np.argsort`

with `axis=1`

specified. The `order`

array returned by `np.argsort`

can then be used to reorder `df_c`

and `df_p`

:

```
order = np.argsort(-df_c.values, axis=1)
nrows, ncols = df_c.shape
np_c = df_c.values[np.arange(nrows)[:,None], order]
np_p = df_p.values[np.arange(nrows)[:,None], order]
```

The above uses NumPy's advanced integer indexing to reorder the values in each row separately.

```
import numpy as np
import pandas as pd
df_p = pd.DataFrame({'p1': [104, 100, 100],
'p2': [104, 103, 102],
'p3': [102, 101, 100],
'p4': [102, 100, 102],
'p5': [102, 104, 102]}, index=[2614,3735,1450])
df_c = pd.DataFrame({'c1': [0.33829499999999996, 0.097799999999999998, 0.16092200000000001],
'c2': [0.190882, 0.124296, 0.40370300000000003],
'c3': [0.15723099999999998, 0.26847500000000002, 0.12239000000000001],
'c4': [0.13577600000000001, 0.26511099999999999, 0.13061199999999998],
'c5': [0.177816, 0.24431900000000001, 0.18237300000000001]}, index=[2614,3735,1450])
df_f = pd.DataFrame({'c': [0.183946,
0.29031099999999999,
0.192049,
0.72570400000000002,
0.14335899999999999]}, index=list(range(100,105)))
def using_pandas(df_p, df_c, df_f):
# this works no matter the order of the columns and rows of `df_p` and `df_c`.
# aligns `df_p` and `df_c` based on the numeric part of their column names
df_pf = df_p.replace(df_f['c'])
# change the column names to match since Pandas will align the indices before multiplying
df_pf.columns = df_pf.columns.str.extract(r'(\d+)', expand=False)
df_c.columns = df_c.columns.str.extract(r'(\d+)', expand=False)
df_c = df_pf.mul(df_c)
order = np.argsort(-df_c.values, axis=1)
nrows, ncols = df_c.shape
np_c = df_c.values[np.arange(nrows)[:,None], order]
np_p = df_p.values[np.arange(nrows)[:,None], order]
return np_c, np_p
np_c, np_p = using_pandas(df_p, df_c, df_f)
print(np_c)
print(np_p)
```

yields

```
[[ 0.04849763 0.03414938 0.03019606 0.02736465 0.02607565]
[ 0.0902021 0.07794125 0.04876611 0.03502533 0.01798992]
[ 0.07753076 0.03502455 0.02960096 0.0250839 0.02251315]]
[[104 102 102 104 102]
[103 101 100 104 100]
[102 102 100 102 100]]
```

Alternatively, if the columns and rows of `df_p`

and `df_c`

are already aligned,
then you can gain a bit more speed by doing the multiplication in NumPy instead of Pandas:

```
def using_numpy(df_p, df_c, df_f):
# faster than using_pandas, but assumes `df_p` and `df_c` are already aligned
df_pf = df_p.replace(df_f['c'])
df_pf = df_pf.values
df_c = df_c.values
df_p = df_p.values
df_c = df_pf * df_c
order = np.argsort(-df_c, axis=1)
nrows, ncols = df_c.shape
np_c = df_c[np.arange(nrows)[:,None], order]
np_p = df_p[np.arange(nrows)[:,None], order]
return np_c, np_p
```

For these small DataFrames, `using_numpy`

is slightly faster than `using_pandas`

.
The difference in speed would be more pronounced if the DataFrames were larger.
But again, note that `using_numpy`

relies on the indices already being aligned.

```
In [138]: %timeit using_numpy(df_p, df_c, df_f)
1000 loops, best of 3: 1.15 ms per loop
In [139]: %timeit using_pandas(df_p, df_c, df_f)
1000 loops, best of 3: 1.62 ms per loop
```

Recommended from our users: **Dynamic Network Monitoring from WhatsUp Gold from IPSwitch**. ** Free Download**