RJH2 - 1 year ago 66

Python Question

I’m trying to replicate in python/pandas what would be fairly straightforward in SQL, but am stuck.

I want to take a data frame with three columns:

`dataframe1`

Org Des Score

0 A B 10

1 A B 11

2 A B 15

3 A C 4

4 A C 4.5

5 A C 6

6 A D 100

7 A D 110

8 A D 130

And filter out all score values that are greater than the minimum * 1.2 for each Org-Des combination.

So the output table would be:

`output_dataframe`

Org Des Score

0 A B 10

1 A B 11

3 A C 4

4 A C 4.5

6 A D 100

7 A D 110

For the first Org-Des combo, A-B, the min Score is 10 and (1.2 * min) = 12. So rows 0 and 1 would be preserved because Scores 10 and 11 are < 12. Row 3 would be eliminated because it is > 12.

For A-C, the min Score is 4 and (1.2 * min) = 5. So rows 3 and 4 are preserved because they are < 5. And so on...

I thought I'd use the following approach:

- Use a groupby function to create a dataframe with the mins by Org-Des pair:

`dataframe2 = pd.DataFrame(dataframe1.groupby(['Org','Des'])['Score'].min())`

- Then do an inner join (or a merge?) between dataframe1 and dataframe2 with the criteria that the Score < 1.2 * min for each Org-Des pair type.

But I haven't been able to get this to work for two reasons, 1) dataframe2 ends up being a funky shape, which I would need to figure out how to join or merge with dataframe1, or transform then join/merge and 2) I don't know how to set criteria as part of a join/merge.

Is this the right approach or is there a more pythonic way to achieve the same goal?

I tried the code you suggested and it gave me an error, here's the full code and output:

`In: import pandas as pd`

import numpy as np

In: df1 = pd.DataFrame({'Org': ['A','A','A','A','A','A','A','A','A'],

'Des': ['B','B','B','C','C','C','D','D','D'],

'Score': ['10','11','15','4','4.5','6','100','110','130'], })

Out: Org Des Score

0 A B 10

1 A B 11

2 A B 15

3 A C 4

4 A C 4.5

5 A C 6

6 A D 100

7 A D 110

8 A D 130

In: df2 = pd.DataFrame(df1.groupby(['Org','Des'])['Score'].min())

df2

Out: Score

Org Des

A B 10

C 4

D 100

In: df1 = pd.merge(df1, df2.groupby(['Org', 'Des']).min()*1.2, left_on = ['Org', 'Des'], right_index=True)

df.loc[df1.Score_x < df1.Score_y, :]

Out: KeyError: 'Org' #It's a big error but this seems to be the relevant part. Let me know if it would be useful to past the whole error.

I suspect I may have the df1, df2 and df's mixed up? I changed from the original answer post to match the code.

Answer Source

You can set up the join criteria as this. For the original data frame, set the join columns as `['Org', 'Des']`

, and for the aggregated data frame the grouped columns become index so you will need to set `right_index`

to be true, then it should work as expected:

```
import pandas as pd
df1 = pd.DataFrame({'Org': ['A','A','A','A','A','A','A','A','A'],
'Des': ['B','B','B','C','C','C','D','D','D'],
'Score': [10,11,15,4,4.5,6,100,110,130]})
df2 = pd.DataFrame(df1.groupby(['Org','Des'])['Score'].min())
df3 = pd.merge(df1, df2, left_on = ['Org', 'Des'], right_index=True)
df1.loc[df3.Score_x < df3.Score_y * 1.2, ]
# Org Des Score
#0 A B 10.0
#1 A B 11.0
#3 A C 4.0
#4 A C 4.5
#6 A D 100.0
#7 A D 110.0
```