jv22 - 1 year ago
Python Question

Extract minimum and maximum year from string in Pandas DataFrame

I have a CSV file that I read into a Pandas DataFrame that contains a column with multiple year values separated by a semicolon.

I need to extract the minimum and maximum value from the string and save each in a new column.

I am able to print the minimum and maximum, but I can't seem to get the correct values from each row saved into a new column.

Any help is much appreciated.

sample DataFrame

import pandas as pd
import numpy as np

raw_data = {'id': ['1473-2262', '2327-9214', '1949-8349', '2375-6314',
                   '0095-6562'],
            'years': ['2000; 2001; 2002; 2003; 2004; 2004; 2004; 2005',
                      '2003; 2004; 2005', '2015', np.nan, '2012; 2014']}
df = pd.DataFrame(raw_data, columns=['id', 'years'])


This is the DataFrame that I need

          id                                           years  minyear  maxyear
0  1473-2262  2000; 2001; 2002; 2003; 2004; 2004; 2004; 2005   2000.0   2005.0
1  2327-9214                                2003; 2004; 2005   2003.0   2005.0
2  1949-8349                                            2015   2015.0   2015.0
3  2375-6314                                             NaN      NaN      NaN
4  0095-6562                                      2012; 2014   2012.0   2014.0


I can print the minimum and maximum

x = df['years'].notnull()

for row in df['years'][x].str.split(pat=';'):
    lst = list()
    for item in row:
        lst.append(int(item))
    print('Min=',min(lst),'Max=',max(lst))

Min= 2000 Max= 2005
Min= 2003 Max= 2005
Min= 2015 Max= 2015
Min= 2012 Max= 2014


Here's how I've tried to capture the values to new columns

x = df['years'].notnull()

for row in df['years'][x].str.split(pat=';'):
    lst = list()
    for item in row:
        lst.append(int(item))
    df['minyear']=min(lst)
    df['maxyear']=max(lst)


Only the values from the last row are saved to the new columns.

          id                                           years  minyear  maxyear
0  1473-2262  2000; 2001; 2002; 2003; 2004; 2004; 2004; 2005     2012     2014
1  2327-9214                                2003; 2004; 2005     2012     2014
2  1949-8349                                            2015     2012     2014
3  2375-6314                                             NaN     2012     2014
4  0095-6562                                      2012; 2014     2012     2014

Answer Source

I think you need str.split with expand=True, which returns a new DataFrame with one column per split value, and then a cast to float.
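To make the intermediate step concrete, here is a small sketch on a hypothetical three-row Series `s` (not the asker's full data). It only illustrates what the split-and-cast produces, assuming the values are separated by `'; '` as in the sample.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample, shaped like the 'years' column in the question.
s = pd.Series(['2003; 2004; 2005', np.nan, '2012; 2014'])

# expand=True gives one column per split piece; shorter rows are padded
# with missing values, and a missing input row stays missing throughout.
parts = s.str.split('; ', expand=True)
print(parts.shape)

# Casting to float turns the year strings into numbers and every
# missing value into NaN, so row-wise min/max can skip them.
parts = parts.astype(float)
print(parts.min(axis=1).tolist())
print(parts.max(axis=1).tolist())
```

The frame keeps the index of the original Series, which is what makes the later column assignment line up row by row.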

The index values are the same as in df, so the row-wise minimum and maximum can be assigned directly as new columns:

df1 = df['years'].str.split('; ', expand=True).astype(float)
df = df.assign(maxyear=df1.max(axis=1),minyear=df1.min(axis=1))
# same as
# df['maxyear'], df['minyear'] = df1.max(axis=1), df1.min(axis=1)
print(df)
          id                                           years  maxyear  minyear
0  1473-2262  2000; 2001; 2002; 2003; 2004; 2004; 2004; 2005   2005.0   2000.0
1  2327-9214                                2003; 2004; 2005   2005.0   2003.0
2  1949-8349                                            2015   2015.0   2015.0
3  2375-6314                                             NaN      NaN      NaN
4  0095-6562                                      2012; 2014   2014.0   2012.0
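As for why the loop in the question fails: assigning a scalar with df['minyear']=min(lst) broadcasts that one value to every row, so each iteration overwrites the whole column and only the last row's result survives. If a row-by-row version is preferred, a minimal sketch could look like the following. The helper name `year_range` and the shortened sample are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Shortened, hypothetical version of the sample data from the question.
df = pd.DataFrame({
    'id': ['1473-2262', '2375-6314', '0095-6562'],
    'years': ['2000; 2001; 2005', np.nan, '2012; 2014'],
})

def year_range(value):
    """Return the min and max year of a ';'-separated string (NaN if missing)."""
    if pd.isna(value):
        return pd.Series({'minyear': np.nan, 'maxyear': np.nan})
    # int() tolerates the surrounding spaces left over after the split.
    years = [int(part) for part in value.split(';')]
    return pd.Series({'minyear': min(years), 'maxyear': max(years)})

# apply() returns one row per input row with the same index,
# so join() attaches the two new columns to the matching rows.
df = df.join(df['years'].apply(year_range))
print(df)
```

This is usually slower than the vectorised str.split approach on large frames, but it keeps the per-row logic explicit.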