Shatnerz Shatnerz - 26 days ago 8
Python Question

pandas - binning data and getting 2 columns

I have a very simple dataframe. There are 2 columns, day_created (int, could change to datetime) and suspended (int, could change to boolean). I can change the data if it makes it easier to work with.

Day created Suspended
0 12 0
1 6 1
2 24 0
3 8 0
4 100 1
5 30 0
6 1 1
7 6 0


The day_created column is the integer of the day the account was created (from a start date), starting at 1 and increasing. The suspended column is a 1 for suspension and a 0 for no suspension.

What I would like to do is bin these accounts into groups of 30 days or months, but from each bin get a total number of accounts for that month and the number of accounts suspended that were created in that month. I then plan on creating a bar graph with 2 bars for each month.

How should I go about this? I don't use pandas often. I assume I need to do some tricks with resample and count.

Answer

Use

df.index = start_date + pd.to_timedelta(df['Day created'], unit='D')

to give the DataFrame an index of Timestamps representing when the accounts were created.

Then you can use

result = df.groupby(pd.TimeGrouper(freq='M')).agg(['count', 'sum'])

to group the rows of the DataFrame (by months) according to the Timestamps in the index. .agg(['count', 'sum']) computes the number of accounts (the count) and the number of suspended accounts for each group.

Then result.plot(kind='bar', ax=ax) plots the bar graph:

import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(
    {'Day created': [12, 6, 24, 8, 100, 30, 1, 6],
     'Suspended': [0, 1, 0, 0, 1, 0, 1, 0]})
start_date = pd.Timestamp('2016-01-01')
df.index = start_date + pd.to_timedelta(df['Day created'], unit='D')

result = df.groupby(pd.TimeGrouper(freq='M'))['Suspended'].agg(['count', 'sum'])
result = result.rename(columns={'sum':'suspended'})
fig, ax = plt.subplots()
result.plot(kind='bar', ax=ax)
fig.autofmt_xdate()
plt.show()

yields

enter image description here