user3471881 - 2 months ago 7
Python Question

Add column to dataframe based on date column range

I have a df containing n columns. One of these is a column named date, which contains values formatted as mm-dd-yy. Now I want to add a column named interval to my df. This column should return the year contained in date, but also whether it's H1 or H2. H1 is half year one and should be all date values between 01-dd-yy and 06-dd-yy, and thus H2 should be all date values between 07-dd-yy and 12-dd-yy.

This is an example of the data in df['date']:

0 01-27-16
1 02-27-16
2 03-27-16
3 04-27-16
4 05-27-16
5 06-27-16
6 07-27-16
7 08-24-16
8 09-24-16
9 10-16-15
...etc...


In df I want to add another column named interval containing:

0 16H1
1 16H1
2 16H1
3 16H1
4 16H1
5 16H1
6 16H2
7 16H2
8 16H2
9 15H2
...etc...


So I thought I'd create a function and then use map.

def is_in_range(x):
    if x['date'] >= '01-01-16' and x['date'] <= '06-31-16':
        print '16H1'
    elif x['date'] >= '07-01-16' and x['date'] <= '12-31-16':
        print '16H2'
    elif x['date'] >= '01-01-15' and x['date'] <= '06-31-15':
        print '15H1'
    elif x['date'] >= '07-01-15' and x['date'] <= '12-31-15':
        print '15H2'
    ...etc...


I call the function like this:

df.groupby(df['date'].map(is_in_range))


Now this gives me:

TypeError: 'Timestamp' object has no attribute '__getitem__'

to begin with. I'm not sure why, but either way, there surely must be a better way?

Answer

You can use dt.quarter or dt.month.

First convert the datetime column to str with astype, then select characters 2 to 4 of each string, which hold the two-digit year. Finally, append 'H1' or 'H2' using numpy.where with a condition on the month:

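import numpy as np
import pandas as pd

# first convert to datetime if not datetime (format matches mm-dd-yy)
df.date = pd.to_datetime(df.date, format='%m-%d-%y')

# two-digit year plus half-year label chosen by month
df['interval'] = df.date.astype(str).str[2:4] + np.where(df.date.dt.month < 7, 'H1', 'H2')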
print (df)
        date interval
0 2016-01-27     16H1
1 2016-02-27     16H1
2 2016-03-27     16H1
3 2016-04-27     16H1
4 2016-05-27     16H1
5 2016-06-27     16H1
6 2016-07-27     16H2
7 2016-08-24     16H2
8 2016-09-24     16H2
9 2015-10-16     15H2

Or:

df['interval'] = df.date.astype(str).str[2:4] + np.where(df.date.dt.quarter < 3,'H1','H2')
print (df)
        date interval
0 2016-01-27     16H1
1 2016-02-27     16H1
2 2016-03-27     16H1
3 2016-04-27     16H1
4 2016-05-27     16H1
5 2016-06-27     16H1
6 2016-07-27     16H2
7 2016-08-24     16H2
8 2016-09-24     16H2
9 2015-10-16     15H2
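
For anyone who wants to run this end to end, here is a minimal self-contained sketch of both variants on the sample data from the question. It is not exactly the code above: it assumes the dates start out as mm-dd-yy strings, and it swaps in dt.strftime('%y') for the string slice and wraps the numpy.where result in a Series so that the concatenation is Series to Series. The names year, by_month and by_quarter are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (mm-dd-yy strings).
df = pd.DataFrame({'date': ['01-27-16', '02-27-16', '03-27-16', '04-27-16',
                            '05-27-16', '06-27-16', '07-27-16', '08-24-16',
                            '09-24-16', '10-16-15']})
df['date'] = pd.to_datetime(df['date'], format='%m-%d-%y')

# Two-digit year as a string Series.
year = df['date'].dt.strftime('%y')

# Half-year label, once from the month and once from the quarter.
by_month = year + pd.Series(np.where(df['date'].dt.month < 7, 'H1', 'H2'),
                            index=df.index)
by_quarter = year + pd.Series(np.where(df['date'].dt.quarter < 3, 'H1', 'H2'),
                              index=df.index)

df['interval'] = by_month
print(df)
```

Months 1 to 6 are exactly quarters 1 and 2, so the two conditions should always agree; which one to use is a matter of taste.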

String solution (if the date column still holds mm-dd-yy strings):

df['interval'] = df.date.str[6:] + np.where(df.date.str[:2].astype(int) < 7, 'H1','H2')
print (df)
       date interval
0  01-27-16     16H1
1  02-27-16     16H1
2  03-27-16     16H1
3  04-27-16     16H1
4  05-27-16     16H1
5  06-27-16     16H1
6  07-27-16     16H2
7  08-24-16     16H2
8  09-24-16     16H2
9  10-16-15     15H2
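
The solutions so far assume every row holds a valid date. If the column may contain missing or malformed values, one possible approach (a sketch, assuming such rows should simply get a missing interval) is to parse with errors='coerce' and blank out the label wherever parsing failed. The sample values and the names parsed, year and half are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one missing and one malformed value.
df = pd.DataFrame({'date': ['01-27-16', None, 'n/a', '10-16-15']})

# errors='coerce' turns anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(df['date'], format='%m-%d-%y', errors='coerce')

year = parsed.dt.strftime('%y')
half = pd.Series(np.where(parsed.dt.month < 7, 'H1', 'H2'), index=df.index)

# Keep the label only where a date was actually parsed.
df['interval'] = (year + half).where(parsed.notna())
print(df)
```

Without the final where, a row with no date would still pick up 'H2' from numpy.where, because the comparison against a missing month is False.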

List comprehension solutions also work, provided the column contains no NaN values:

string column:

df['interval'] = [x[6:] + 'H1' if int(x[:2])< 7 else x[6:] + 'H2' for x in df['date']]

datetime column:

#first convert to datetime if not datetime
df.date = pd.to_datetime(df.date)

df['interval'] = [x[2:4] + 'H1' if int(x[5:7])< 7 else x[2:4] + 'H2' for x in df['date'].astype(str)]

print (df)
        date interval
0 2016-01-27     16H1
1 2016-02-27     16H1
2 2016-03-27     16H1
3 2016-04-27     16H1
4 2016-05-27     16H1
5 2016-06-27     16H1
6 2016-07-27     16H2
7 2016-08-24     16H2
8 2016-09-24     16H2
9 2015-10-16     15H2
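
As for the TypeError in the question: Series.map calls the function once per value, so with a datetime column x is a single Timestamp rather than a row, and x['date'] fails. The function also prints instead of returning, so map would get None back. If you prefer the function-plus-map route, a minimal sketch could look like the following; half_year_label is a hypothetical name, and it works on the Timestamp directly instead of comparing strings.

```python
import pandas as pd

def half_year_label(ts):
    """Return a label such as '16H1' for a single pandas Timestamp."""
    half = 'H1' if ts.month <= 6 else 'H2'
    return ts.strftime('%y') + half

# A few dates from the question, parsed from mm-dd-yy strings.
dates = pd.Series(pd.to_datetime(['01-27-16', '07-27-16', '10-16-15'],
                                 format='%m-%d-%y'))

# map passes each Timestamp to the function and collects the return values.
labels = dates.map(half_year_label)
print(labels)
```

On larger frames the vectorized versions above are likely to be faster, since map runs a Python function for every row.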