boothtp - 4 months ago
Python Question

Python Data Wrangling: Loop through values in DataFrame and check if current iterator matches the former

I am stuck on a data wrangling problem. Below is my data:

import pandas as pd

Year = ['2010','2011','2012','2013','2014','2015','2010','2011','2014','2015','2016','2010','2011','2012','2015']
Type = ['WAS','WAS','BOS','BOS','WAS','BOS','BOS','BOS','WAS','WAS','BOS','BOS','BOS','BOS','BOS']
ID = ['a','a','a','a','a','a','b','b','b','b','b','c','c','c','c']
df = pd.DataFrame({'ID': ID,'Type': Type,'Year': Year})

df
a WAS 2010
a WAS 2011
a BOS 2012
a BOS 2013
a WAS 2014
and so on...............


I am trying to accomplish two things.
First: I want to loop through the dataframe and, for every row, check whether the ID is the same as in the previous row and whether the previous row's Type matches the current row's Type. Then I want to create two new binary variables, 'WAStoBOS' and 'BOStoWAS'. Each should be 1 if the Type changed in the direction given by the variable name, and 0 if there was no change at all or the change was in the other direction.

For example, the output would be:

df
ID Type Year WAStoBOS BOStoWAS
a WAS 2010 0 0
a WAS 2011 0 0
a BOS 2012 1 0
a BOS 2013 0 0
a WAS 2014 0 1
a BOS 2015 1 0


Second:
Within the same construct, by ID, find the difference between the current row's year and the previous row's year (0 for the first row of each ID).

The final result dataframe would be:

df
ID Type Year WAStoBOS BOStoWAS YearDiff
a WAS 2010 0 0 0
a WAS 2011 0 0 1
a BOS 2012 1 0 1
a BOS 2013 0 0 1
a WAS 2014 0 1 1
a BOS 2015 1 0 1
b BOS 2010 0 0 0
b BOS 2011 0 0 1
b WAS 2014 0 1 3
b WAS 2015 0 0 1
b BOS 2016 1 0 1
c BOS 2010 0 0 0
c BOS 2011 0 0 1
c BOS 2012 0 0 1
c BOS 2015 0 0 3


Any help would be appreciated.




This edit follows Scott's suggestion.

Your code incorrectly assigns a 1 on rows where both the ID and the Type change. When the ID changes, we don't care what the previous Type was. I've changed the data frame below a little to include a change in both ID and Type, and to show what the desired output should be:

df
ID Type Year WAStoBOS BOStoWAS YearDiff
a WAS 2010 0 0 0
a WAS 2011 0 0 1
a BOS 2012 1 0 1
a BOS 2013 0 0 1
a WAS 2014 0 1 1
**a BOS 2015** 1 0 1
**b WAS 2010** 0 0 0
b BOS 2011 1 0 1
b WAS 2014 0 1 3
b WAS 2015 0 0 1
**b WAS 2016** 0 0 1
**c BOS 2010** 0 0 0
c BOS 2011 0 0 1
c BOS 2012 0 0 1
c BOS 2015 0 0 3


I put asterisks around the rows where there is a change in ID and Type, for your reference. Thank you for your help; I had never thought about using assign.

Answer Source

EDIT: assigning the binary flags with 'ID' taken into consideration:

df.assign(WAStoBOS=df.groupby('ID')['Type'].transform(lambda x: ((x == 'BOS') & (x.shift(1) == 'WAS')).astype(int)),
          BOStoWAS=df.groupby('ID')['Type'].transform(lambda x: ((x == 'WAS') & (x.shift(1) == 'BOS')).astype(int)),
          YearDiff=df.groupby('ID')['Year'].transform(lambda x: x.astype(int).diff().fillna(0)))
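
For reference, here is a self-contained sketch of the same idea that you can paste and run. It swaps the per-group lambdas for groupby(...).shift() and groupby(...).diff(), which should give the same result. The names prev_type and out are my own, and it assumes the rows are already sorted by Year within each ID, as in the question's data.

```python
import pandas as pd

Year = ['2010','2011','2012','2013','2014','2015','2010','2011','2014','2015','2016','2010','2011','2012','2015']
Type = ['WAS','WAS','BOS','BOS','WAS','BOS','BOS','BOS','WAS','WAS','BOS','BOS','BOS','BOS','BOS']
ID = ['a','a','a','a','a','a','b','b','b','b','b','c','c','c','c']
df = pd.DataFrame({'ID': ID, 'Type': Type, 'Year': Year})

# Previous row's Type within the same ID. The first row of each ID has no
# predecessor, so it gets a missing value and can never count as a transition.
prev_type = df.groupby('ID')['Type'].shift(1)

out = df.assign(
    WAStoBOS=((prev_type == 'WAS') & (df['Type'] == 'BOS')).astype(int),
    BOStoWAS=((prev_type == 'BOS') & (df['Type'] == 'WAS')).astype(int),
    # Year is stored as strings, so convert to int before differencing;
    # the first row of each ID has no previous year and is filled with 0.
    YearDiff=df['Year'].astype(int).groupby(df['ID']).diff().fillna(0).astype(int),
)
print(out)
```

The final astype(int) on YearDiff is only there to get whole numbers like in the desired output; drop it if floats are fine.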

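Original answer, in one statement (note that the two transition flags here are computed without regard to 'ID'; only YearDiff is grouped):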

df.assign(WAStoBOS=((df.Type == 'BOS') & (df.shift(1).Type == 'WAS')).astype(int),
          BOStoWAS=((df.Type == 'WAS') & (df.shift(1).Type == 'BOS')).astype(int),
          YearDiff=df.groupby('ID')['Year'].transform(lambda x: x.astype(int).diff().fillna(0)))

Output:

   ID Type  Year  BOStoWAS  WAStoBOS  YearDiff
0   a  WAS  2010         0         0       0.0
1   a  WAS  2011         0         0       1.0
2   a  BOS  2012         0         1       1.0
3   a  BOS  2013         0         0       1.0
4   a  WAS  2014         1         0       1.0
5   a  BOS  2015         0         1       1.0
6   b  BOS  2010         0         0       0.0
7   b  BOS  2011         0         0       1.0
8   b  WAS  2014         1         0       3.0
9   b  WAS  2015         0         0       1.0
10  b  BOS  2016         0         1       1.0
11  c  BOS  2010         0         0       0.0
12  c  BOS  2011         0         0       1.0
13  c  BOS  2012         0         0       1.0
14  c  BOS  2015         0         0       3.0
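
To see why grouping by 'ID' matters, here is a small sketch that uses the modified ID/Type data from the question's edit and compares a plain shift against a per-ID shift. The names df2, naive_prev and grouped_prev are just for illustration.

```python
import pandas as pd

# ID and Type columns from the edited example in the question
ID = list('aaaaaabbbbbcccc')
Type = ['WAS','WAS','BOS','BOS','WAS','BOS',
        'WAS','BOS','WAS','WAS','WAS',
        'BOS','BOS','BOS','BOS']
df2 = pd.DataFrame({'ID': ID, 'Type': Type})

# Plain shift looks at the previous row even when it belongs to another ID.
naive_prev = df2['Type'].shift(1)
# Grouped shift only looks back within the same ID.
grouped_prev = df2.groupby('ID')['Type'].shift(1)

def flags(prev):
    """Return (WAStoBOS, BOStoWAS) as 0/1 Series for a given 'previous Type' Series."""
    was_to_bos = ((prev == 'WAS') & (df2['Type'] == 'BOS')).astype(int)
    bos_to_was = ((prev == 'BOS') & (df2['Type'] == 'WAS')).astype(int)
    return was_to_bos, bos_to_was

naive_w2b, naive_b2w = flags(naive_prev)
grouped_w2b, grouped_b2w = flags(grouped_prev)

# Rows where the two approaches disagree: the first row of IDs 'b' and 'c'
disagree = df2.index[(naive_w2b != grouped_w2b) | (naive_b2w != grouped_b2w)].tolist()
print(disagree)  # [6, 11]
```

Rows 6 and 11 are the first rows of 'b' and 'c'. The plain shift compares them with the last row of the previous ID and flags a transition; the grouped shift leaves them at 0, which is what the question asks for.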