unpairestgood unpairestgood - 3 months ago 11
Python Question

Pandas GroupBy apply all

I've got an involved situation. Let's say I have the following example dataframe of loans:

test_df = pd.DataFrame({'name': ['Jack','Jill','John','Jack','Jill'],
'date': ['2016-08-08','2016-08-08','2016-08-07','2016-08-08','2016-08-08'],
'amount': [1000.0,1500.0,2000.0,2000.0,3000.0],
'return_amount': [5000.0,2000.0,3000.0,0.0,0.0],
'return_date': ['2017-08-08','2017-08-08','2017-08-07','','2017-08-08']})

test_df.head()

amount date name return_amount return_date
0 1000.0 2016-08-08 Jack 5000.0 2017-08-08
1 1500.0 2016-08-08 Jill 2000.0 2017-08-08
2 2000.0 2016-08-07 John 3000.0 2017-08-07
3 2500.0 2016-08-08 Jack 0.0
4 2500.0 2016-08-08 Jill 0.0 2017-08-08


There are a few operations I need to perform after grouping this dataframe by name (grouping loans by person):

1)
return amount
needs to allocated proportionally by the sum of
amount
.

2) If
return date
is missing for ANY loan for a given person, then all return_dates should be converted to empty strings ''.

I already have a function that I use to allocate the proportional return amount:

def allocate_return_amount(group):
loan_amount = group['amount']
return_amount = group['return_amount']
sum_amount = loan_amount.sum()
sum_return_amount = return_amount.sum()
group['allocated_return_amount'] = (loan_amount/sum_amount) * sum_return_amount
return group


And I use
grouped_test_df = grouped_test_df.apply(allocate_return_amount)
to apply it.

What I am struggling with is the second operation I need to perform, checking if any of the loans to a person are missing a
return_date
, and if so, changing all
return_dates
for that person to ''.

I've found GroupBy.all in the pandas documentation, but I haven't figured out how to use it yet, anyone with experience with this?

Since this example might be a bit hard to follow, here's my ideal output for this example:

ideal_test_df.head()

amount date name return_amount return_date
0 1000.0 2016-08-08 Jack 0.0 ''
1 1500.0 2016-08-08 Jill 666.66 2017-08-08
2 2000.0 2016-08-07 John 3000.0 2017-08-07
3 2500.0 2016-08-08 Jack 0.0 ''
4 2500.0 2016-08-08 Jill 1333.33 2017-08-08


Hopefully this makes sense, and thank you in advance to any pandas expert who takes the time to help me out!

Answer

You can do it by iterating through the groups, testing the condition using any, then setting back to the original dataframe using loc:

test_df = pd.DataFrame({'name': ['Jack','Jill','John','Jack','Jill'],
                   'date': ['2016-08-08','2016-08-08','2016-08-07','2016-08-08','2016-08-08'],
                   'amount': [1000.0,1500.0,2000.0,2000.0,3000.0],
                   'return_amount': [5000.0,2000.0,3000.0,0.0,0.0],
                   'return_date': ['2017-08-08','2017-08-08','2017-08-07','','2017-08-08']})

grouped = test_df.groupby('name')

for name, group in grouped:
    if any(group['return_date'] == ''):
        test_df.loc[group.index,'return_date'] = ''

And if you want to reset return_amount also, and don't mind the additional overhead, just add this line right after:

test_df.loc[group.index, 'return_amount'] = 0