neha neha - 4 months ago 7
Python Question

calculating top 10 products for a given query based on its priority in python

suppose we are given a dataframe like:

Query Productid priority
index
0 3ds 2125233 0.018946
1 rca 2009324 0.027599
2 nook 1517163 0.009443
3 rca 2877125 0.012054
4 rca 2877134 0.005557
5 flatscreentvs 2416092 0.011961
6 macbook 3108172 0.010459
7 3ds 2264036 0.165948
8 rca 8280834 0.004006
9 memorycard 2740208 0.013744
10 acpowercord 2584273 0.006865
11 zaggiphone 1230537 0.136073
12 watchthethrone 3168067 0.104679
13 remotecontrolextender 7997055 0.113058
14 camcorder 2009041 0.017809
15 3ds 1988047 0.031711
16 3ds 1686079 0.043783
17 wirelessheadphones 3770439 0.014714
18 wirelessheadphones 2602403 0.008525
19 samsung40 2126065 0.018066


i want to find top 2
product_ids
on the basis of priority with respect to a given query.

for eg. if we have
query=3ds
then top 2 products should be:

1. 1988047
2. 1686079

Answer

This would be an equivalent for Oracle's row_number() analytic function:

In [172]: df.assign(rn=df.sort_values('priority', ascending=0).groupby('Query').cumcount() + 1).query('rn < 3').sort_values(['Query','rn'])
Out[172]:
                       Query  Productid  priority  rn
index
7                        3ds    2264036  0.165948   1
16                       3ds    1686079  0.043783   2
10               acpowercord    2584273  0.006865   1
14                 camcorder    2009041  0.017809   1
5              flatscreentvs    2416092  0.011961   1
6                    macbook    3108172  0.010459   1
9                 memorycard    2740208  0.013744   1
2                       nook    1517163  0.009443   1
1                        rca    2009324  0.027599   1
3                        rca    2877125  0.012054   2
13     remotecontrolextender    7997055  0.113058   1
19                 samsung40    2126065  0.018066   1
12            watchthethrone    3168067  0.104679   1
17        wirelessheadphones    3770439  0.014714   1
18        wirelessheadphones    2602403  0.008525   2
11                zaggiphone    1230537  0.136073   1

show Productid for selected Query:

In [180]: (df.assign(rn=df.sort_values('priority', ascending=0).groupby('Query').cumcount() + 1)
   .....:    .query('Query=="3ds" and rn < 3')['Productid']
   .....: )
Out[180]:
index
7     2264036
16    1686079
Name: Productid, dtype: int64