RJH2 RJH2 - 4 months ago 21
Python Question

Do an inner join in python pandas using criteria

I’m trying to replicate in python/pandas what would be fairly straightforward in SQL, but am stuck.

I want to take a data frame with three columns:

dataframe1

Org Des Score
0 A B 10
1 A B 11
2 A B 15
3 A C 4
4 A C 4.5
5 A C 6
6 A D 100
7 A D 110
8 A D 130


And filter out all score values that are greater than the minimum * 1.2 for each Org-Des combination.

So the output table would be:

output_dataframe

Org Des Score
0 A B 10
1 A B 11
3 A C 4
4 A C 4.5
6 A D 100
7 A D 110


For the first Org-Des combo, A-B, the min Score is 10 and (1.2 * min) = 12. So rows 0 and 1 would be preserved because Scores 10 and 11 are < 12. Row 3 would be eliminated because it is > 12.

For A-C, the min Score is 4 and (1.2 * min) = 5. So rows 3 and 4 are preserved because they are < 5. And so on...

My approach

I thought I'd use the following approach:


  1. Use a groupby function to create a dataframe with the mins by Org-Des pair:

    dataframe2 = pd.DataFrame(dataframe1.groupby(['Org','Des'])['Score'].min())

  2. Then do an inner join (or a merge?) between dataframe1 and dataframe2 with the criteria that the Score < 1.2 * min for each Org-Des pair type.



But I haven't been able to get this to work for two reasons, 1) dataframe2 ends up being a funky shape, which I would need to figure out how to join or merge with dataframe1, or transform then join/merge and 2) I don't know how to set criteria as part of a join/merge.

Is this the right approach or is there a more pythonic way to achieve the same goal?

Edit to reflect @Psidom answer:

I tried the code you suggested and it gave me an error, here's the full code and output:

In: import pandas as pd
import numpy as np

In: df1 = pd.DataFrame({'Org': ['A','A','A','A','A','A','A','A','A'],
'Des': ['B','B','B','C','C','C','D','D','D'],
'Score': ['10','11','15','4','4.5','6','100','110','130'], })

Out: Org Des Score
0 A B 10
1 A B 11
2 A B 15
3 A C 4
4 A C 4.5
5 A C 6
6 A D 100
7 A D 110
8 A D 130

In: df2 = pd.DataFrame(df1.groupby(['Org','Des'])['Score'].min())
df2

Out: Score
Org Des
A B 10
C 4
D 100

In: df1 = pd.merge(df1, df2.groupby(['Org', 'Des']).min()*1.2, left_on = ['Org', 'Des'], right_index=True)
df.loc[df1.Score_x < df1.Score_y, :]

Out: KeyError: 'Org' #It's a big error but this seems to be the relevant part. Let me know if it would be useful to past the whole error.


I suspect I may have the df1, df2 and df's mixed up? I changed from the original answer post to match the code.

Answer

You can set up the join criteria as this. For the original data frame, set the join columns as ['Org', 'Des'], and for the aggregated data frame the grouped columns become index so you will need to set right_index to be true, then it should work as expected:

import pandas as pd
df1 = pd.DataFrame({'Org': ['A','A','A','A','A','A','A','A','A'],
                    'Des': ['B','B','B','C','C','C','D','D','D'],
                    'Score': [10,11,15,4,4.5,6,100,110,130]})
df2 = pd.DataFrame(df1.groupby(['Org','Des'])['Score'].min())

df3 = pd.merge(df1, df2, left_on = ['Org', 'Des'], right_index=True)
df1.loc[df3.Score_x < df3.Score_y * 1.2, ]

#  Org  Des Score
#0  A   B   10.0
#1  A   B   11.0
#3  A   C   4.0
#4  A   C   4.5
#6  A   D   100.0
#7  A   D   110.0