Christopher Christopher - 2 months ago 6
Python Question

Create groups/classes based on conditions within columns

I need help transforming my data so I can read through transaction data.

Business Case

I'm trying to group together some related transactions to create some groups or classes of events. This data set represents workers going out on various leaves of absence events. I want to create one class of leaves based on any transaction falling within 365 days of the leave event class. For charting trends, I want to number the classes so I get a sequence/pattern.

My code allows me to see when the very first event occurred, and it can identify when a new class starts, but it doesn't bucket each transaction into a class.

Requirements:


  • Tag all rows based on what leave class they fall into.

  • Number each Unique Leave Event. Using this example index 0 would be Unique Leave Event 2, index 1 would be Unique Leave Event 2, index 3 would be Unique Leave Event 2, AND index 4 would be Unique Leave Event 1, etc.



I added in a column for the desired output, labeled as "Desired Output". Note, there can be many more rows/events per person; and there can be many more people.

Some Data

import pandas as pd

data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"],
'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01"],
'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]}
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output'])


Some Code I've Tried

df['Effective Date'] = df['Effective Date'].astype('datetime64[ns]')
df['EmplidShift'] = df['Employee ID'].shift(-1)
df['Effdt-Shift'] = df['Effective Date'].shift(-1)
df['Prior Row in Same Emplid Class'] = "No"
df['Effdt Diff'] = df['Effdt-Shift'] - df['Effective Date']
df['Effdt Diff'] = (pd.to_timedelta(df['Effdt Diff'], unit='d') + pd.to_timedelta(1,unit='s')).astype('timedelta64[D]')
df['Cumul. Count'] = df.groupby('Employee ID').cumcount()


df['Groupby'] = df.groupby('Employee ID')['Cumul. Count'].transform('max')
df['First Row Appears?'] = ""
df['First Row Appears?'][df['Cumul. Count'] == df['Groupby']] = "First Row"
df['Prior Row in Same Emplid Class'][ df['Employee ID'] == df['EmplidShift']] = "Yes"

df['Prior Row in Same Emplid Class'][ df['Employee ID'] == df['EmplidShift']] = "Yes"

df['Effdt > 1 Yr?'] = ""
df['Effdt > 1 Yr?'][ ((df['Prior Row in Same Emplid Class'] == "Yes" ) & (df['Effdt Diff'] < -365)) ] = "Yes"

df['Unique Leave Event'] = ""
df['Unique Leave Event'][ (df['Effdt > 1 Yr?'] == "Yes") | (df['First Row Appears?'] == "First Row") ] = "Unique Leave Event"

df

Answer

This is a bit clunky but it yields the right output at least for your small example:

import pandas as pd

data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"],
        'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01-01"],
        'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]}
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output'])

df["Effective Date"] = pd.to_datetime(df["Effective Date"])
df = df.sort_values(["Employee ID","Effective Date"]).reset_index(drop=True)

for i,_ in df.iterrows():
  df.ix[0,"Result"] = "Unique Leave Event 1"
  if i < len(df)-1:
    if df.ix[i+1,"Employee ID"] == df.ix[i,"Employee ID"]:
      if df.ix[i+1,"Effective Date"] - df.ix[i,"Effective Date"] > pd.Timedelta('365 days'):
        df.ix[i+1,"Result"] = "Unique Leave Event " + str(int(df.ix[i,"Result"].split()[-1])+1)
      else:
        df.ix[i+1,"Result"] = df.ix[i,"Result"]
    else:
      df.ix[i+1,"Result"] = "Unique Leave Event 1"

Note that this code assumes that the first row always contains the string Unique Leave Event 1.

EDIT: Some explanation.

First I convert the dates to datetime format and then reorder the dataframe such that the dates for every Employee ID are ascending.

Then I iterate over the rows of the frame using the built-int iterator iterrows. The _ in for i,_ is merely a placeholder for the second variable I do not use because the iterator gives back both row numbers and row names.

In the iterator I'm doing row-wise comparisons so by default I fill in the first row by hand and then assign to the i+1-th row. I do it like this because I know the value of the first row but not the value of the last row. Then I compare the i+1-th row with the i-th row within an if-safeguard because i+1 would give an index-error on the last iteration.

In the loop I first check if the Employee ID has changed between the two rows. If it has not then I compare the dates of the two rows and see if they are apart more than 365 days. If this is the case I read the string "Unique Leave Event X" from the i-th row, increase the number by one and write it in the i+1-row. If the dates are closer I just copy the string from the previous row.

If the Employee ID does change on the other hand I just write "Unique Leave Event 1" to start over.

Note 1: iterrows() has no options to set so I can't iterate only over a subset.

Note 2: Always iterate using one of the built-in iterators and only iterate if you can't solve the problem otherwise.

Note 3: When assigning values in an iteration always use ix, loc, or iloc.