Niels Alebregtse Niels Alebregtse - 26 days ago 6
Python Question

Remove cancelling rows from Pandas Dataframe

I have a list of invoices sent out to customers. However, sometimes a bad invoice is sent, which is later cancelled. My Pandas Dataframe looks something like this, except much larger (~3 million rows)

index | customer | invoice_nr | amount | date
---------------------------------------------------
0 | 1 | 1 | 10 | 01-01-2016
1 | 1 | 1 | -10 | 01-01-2016
2 | 1 | 1 | 11 | 01-01-2016
3 | 1 | 2 | 10 | 02-01-2016
4 | 2 | 3 | 7 | 01-01-2016
5 | 2 | 4 | 12 | 02-01-2016
6 | 2 | 4 | 8 | 02-01-2016
7 | 2 | 4 | -12 | 02-01-2016
8 | 2 | 4 | 4 | 02-01-2016
... | ... | ... | ... | ...
... | ... | ... | ... | ...


Now, I want to drop all rows for which the
customer
,
invoice_nr
and
date
are identical, but the
amount
has opposite values.

Corrections of invoices always take place on the same day with identical invoice number. The invoice number is uniquely bound to the customer and always corresponds to one transaction (which can consist of multiple components, for example for
customer = 2
,
invoice_nr = 4
). Corrections of invoices only occur either to change
amount
charged, or to split
amount
in smaller components. Hence, the cancelled value is not repeated on the same
invoice_nr
.

Any help how to program this would be much appreciated.

Answer
def remove_cancelled_transactions(df):
    trans_neg = df.amount < 0
    return df.loc[~(trans_neg | trans_neg.shift(-1))]

groups = [df.customer, df.invoice_nr, df.date, df.amount.abs()]
df.groupby(groups, as_index=False, group_keys=False) \
  .apply(remove_cancelled_transactions)

enter image description here

Comments