William III William III - 7 months ago 130
Python Question

Pandas: top N rows, top N rows per group, equivalent for ROW_NUMBER OVER(PARTITION BY ... ORDER BY ...)

What is the python equivalent to the TOP function in T-SQL? I'm looking to filter my dataframe to the top 50K rows. I've looked online and I can't find a simple example.

Answer

UPDATE: - shows different approaches, including top N rows per group, equivalent for SQL aggregate functions: ROW_NUMBER() / RANK() OVER(PARTITION BY ... ORDER BY ...)", etc.:

sample DF:

df = pd.DataFrame({
  'dep': np.random.choice(list('ABC'), 20),
  'manager_id': np.random.randint(0, 10, 20),
  'salary': np.random.randint(5000, 5006, 20)
})

----------------------- Original DF ------------------------

In [136]: df
Out[136]:
   dep  manager_id  salary
0    C           4    5000
1    C           7    5004
2    B           9    5000
3    B           6    5003
4    B           0    5001
5    B           8    5004
6    A           8    5002
7    A           7    5002
8    B           5    5004
9    C           1    5003
10   C           3    5003
11   A           9    5005
12   C           6    5005
13   B           0    5004
14   A           3    5001
15   A           0    5001
16   B           6    5004
17   C           2    5002
18   A           5    5000
19   A           0    5003

------------------ top 5 rows (sorted by original index) -------------------

In [137]: df.head(5)
Out[137]:
  dep  manager_id  salary
0   C           4    5000
1   C           7    5004
2   B           9    5000
3   B           6    5003
4   B           0    5001

--- top 5 rows (sorted by manager_id DESC, dep ASC) ----

In [138]: df.sort_values(by=['manager_id', 'dep'], ascending=[False, True]).head(5)
Out[138]:
   dep  manager_id  salary
11   A           9    5005
2    B           9    5000
6    A           8    5002
5    B           8    5004
7    A           7    5002

---- top 2 salaries in each department (no duplicates) -----

In [139]: df.sort_values(by=['dep','salary'], ascending=[True, False]).groupby('dep').head(2)
Out[139]:
   dep  manager_id  salary
11   A           9    5005
19   A           0    5003
5    B           8    5004
8    B           5    5004
12   C           6    5005
1    C           7    5004

--- top 2 salaries in each department (using "nlargest") ----

In [140]: df.ix[df.groupby(by='dep')['salary'].nlargest(2).reset_index()['level_1']]
Out[140]:
   dep  manager_id  salary
11   A           9    5005
19   A           0    5003
5    B           8    5004
8    B           5    5004
12   C           6    5005
1    C           7    5004

--- equivalent for SQL: row_number() over(partition by DEP order by SALARY desc) ---

In [141]: df['RN'] = df.sort_values(by=['dep','salary'], ascending=[True, False]).groupby('dep').cumcount() + 1

In [142]: df.sort_values(by=['dep', 'RN'])
Out[142]:
   dep  manager_id  salary  RN
11   A           9    5005   1
19   A           0    5003   2
6    A           8    5002   3
7    A           7    5002   4
14   A           3    5001   5
15   A           0    5001   6
18   A           5    5000   7
5    B           8    5004   1
8    B           5    5004   2
13   B           0    5004   3
16   B           6    5004   4
3    B           6    5003   5
4    B           0    5001   6
2    B           9    5000   7
12   C           6    5005   1
1    C           7    5004   2
9    C           1    5003   3
10   C           3    5003   4
17   C           2    5002   5
0    C           4    5000   6

--- second and third highest salaries in each department ---

In [143]: df.loc[df.RN.between(2, 3)].sort_values(by=['dep', 'RN'])
Out[143]:
   dep  manager_id  salary  RN
19   A           0    5003   2
6    A           8    5002   3
8    B           5    5004   2
13   B           0    5004   3
1    C           7    5004   2
9    C           1    5003   3

--- equivalent for SQL: rank() over(partition by DEP order by SALARY desc) ---

In [144]: df['rank_first'] = df.groupby('dep')['salary'].rank(method='first', ascending=False).astype(int)

In [145]: df['rank_min'] = df.groupby('dep')['salary'].rank(method='min', ascending=False).astype(int)

In [146]: df.sort_values(by=['dep', 'RN'])
Out[146]:
   dep  manager_id  salary  RN  rank_first  rank_min
11   A           9    5005   1           1         1
19   A           0    5003   2           2         2
6    A           8    5002   3           3         3
7    A           7    5002   4           4         3
14   A           3    5001   5           5         5
15   A           0    5001   6           6         5
18   A           5    5000   7           7         7
5    B           8    5004   1           1         1
8    B           5    5004   2           2         1
13   B           0    5004   3           3         1
16   B           6    5004   4           4         1
3    B           6    5003   5           5         5
4    B           0    5001   6           6         6
2    B           9    5000   7           7         7
12   C           6    5005   1           1         1
1    C           7    5004   2           2         2
9    C           1    5003   3           3         3
10   C           3    5003   4           4         3
17   C           2    5002   5           5         5
0    C           4    5000   6           6         6

--- top 2 salaries in each department (with duplicates) ----

In [147]: df.loc[df.rank_min <= 2].sort_values(by=['dep', 'rank_min'])
Out[147]:
   dep  manager_id  salary  RN  rank_first  rank_min
11   A           9    5005   1           1         1
19   A           0    5003   2           2         2
5    B           8    5004   1           1         1
8    B           5    5004   2           2         1
13   B           0    5004   3           3         1
16   B           6    5004   4           4         1
12   C           6    5005   1           1         1
1    C           7    5004   2           2         2

OLD answer:

first 50 rows that satisfy the following condition: df['col_A'] == 100

df.loc[df['col_A'] == 100].head(50)

simply first 50 rows:

df.head(50)

if you look for pandas equivalent for select * from table order by col_A limit 50:

df.sort_values(by=['col_A']).head(50)