RJH2 RJH2 - 1 year ago 66
Python Question

Do an inner join in python pandas using criteria

I’m trying to replicate in python/pandas what would be fairly straightforward in SQL, but am stuck.

I want to take a data frame with three columns:

dataframe1

Org Des Score
0 A B 10
1 A B 11
2 A B 15
3 A C 4
4 A C 4.5
5 A C 6
6 A D 100
7 A D 110
8 A D 130


And filter out all score values that are greater than the minimum * 1.2 for each Org-Des combination.

So the output table would be:

output_dataframe

Org Des Score
0 A B 10
1 A B 11
3 A C 4
4 A C 4.5
6 A D 100
7 A D 110


For the first Org-Des combo, A-B, the min Score is 10 and (1.2 * min) = 12. So rows 0 and 1 would be preserved because Scores 10 and 11 are < 12. Row 3 would be eliminated because it is > 12.

For A-C, the min Score is 4 and (1.2 * min) = 5. So rows 3 and 4 are preserved because they are < 5. And so on...

My approach

I thought I'd use the following approach:


  1. Use a groupby function to create a dataframe with the mins by Org-Des pair:

    dataframe2 = pd.DataFrame(dataframe1.groupby(['Org','Des'])['Score'].min())

  2. Then do an inner join (or a merge?) between dataframe1 and dataframe2 with the criteria that the Score < 1.2 * min for each Org-Des pair type.



But I haven't been able to get this to work for two reasons, 1) dataframe2 ends up being a funky shape, which I would need to figure out how to join or merge with dataframe1, or transform then join/merge and 2) I don't know how to set criteria as part of a join/merge.

Is this the right approach or is there a more pythonic way to achieve the same goal?

Edit to reflect @Psidom answer:

I tried the code you suggested and it gave me an error, here's the full code and output:

In: import pandas as pd
import numpy as np

In: df1 = pd.DataFrame({'Org': ['A','A','A','A','A','A','A','A','A'],
'Des': ['B','B','B','C','C','C','D','D','D'],
'Score': ['10','11','15','4','4.5','6','100','110','130'], })

Out: Org Des Score
0 A B 10
1 A B 11
2 A B 15
3 A C 4
4 A C 4.5
5 A C 6
6 A D 100
7 A D 110
8 A D 130

In: df2 = pd.DataFrame(df1.groupby(['Org','Des'])['Score'].min())
df2

Out: Score
Org Des
A B 10
C 4
D 100

In: df1 = pd.merge(df1, df2.groupby(['Org', 'Des']).min()*1.2, left_on = ['Org', 'Des'], right_index=True)
df.loc[df1.Score_x < df1.Score_y, :]

Out: KeyError: 'Org' #It's a big error but this seems to be the relevant part. Let me know if it would be useful to past the whole error.


I suspect I may have the df1, df2 and df's mixed up? I changed from the original answer post to match the code.

Answer Source

You can set up the join criteria as this. For the original data frame, set the join columns as ['Org', 'Des'], and for the aggregated data frame the grouped columns become index so you will need to set right_index to be true, then it should work as expected:

import pandas as pd
df1 = pd.DataFrame({'Org': ['A','A','A','A','A','A','A','A','A'],
                    'Des': ['B','B','B','C','C','C','D','D','D'],
                    'Score': [10,11,15,4,4.5,6,100,110,130]})
df2 = pd.DataFrame(df1.groupby(['Org','Des'])['Score'].min())

df3 = pd.merge(df1, df2, left_on = ['Org', 'Des'], right_index=True)
df1.loc[df3.Score_x < df3.Score_y * 1.2, ]

#  Org  Des Score
#0  A   B   10.0
#1  A   B   11.0
#3  A   C   4.0
#4  A   C   4.5
#6  A   D   100.0
#7  A   D   110.0