Kevin - 2 months ago
Python Question

Replacing a certain part of a string in a Pandas Data Frame with Regex

My data frame has a date column whose values are currently strings. I am trying to fix a problem with this column.

df[:15]

Date Customer ID
0 01/25/2016 104064596300
1 02/28/2015 102077474472
2 11/17/2016 106430081724
3 02/24/2016 107770391692
4 10/05/2016 106523680888
5 02/24/2016 107057691592
6 11/24/2015 102472820188
7 10/12/2016 107195498128
8 01/05/2016 104796266660
9 09/30/2016 107812562924
10 10/13/2015 102809057000
11 11/21/2016 107379017712
12 11/08/2015 106642145040
13 02/26/2015 107862343816
14 10/16/2016 107383084928


My data is supposed to fall within the date range September 2015 to February 2016.

Some of the rows have their years mixed up (see row 2 above, for example: it's November 17, 2016!)

What I am trying to do is change the years for the observations with incorrect dates.

I have played around with the replace() command in Pandas but cannot come up with a command that works:

df.Date.str.replace(('^(09|10|11|12)\/\d\d\/2016$'), '2015')

0 01/25/2016
1 02/28/2015
2 2015
3 02/24/2016
4 2015
5 02/24/2016
6 11/24/2015
7 2015
8 01/05/2016
9 2015
10 10/13/2015
11 2015
12 11/08/2015
13 02/26/2015
14 2015
15 12/17/2015
16 01/05/2015
17 01/21/2015
18 2015
19 2015
20 02/06/2016
21 10/06/2015
22 02/18/2016


To be specific, I am simply trying to change the last 4 digits (the year) of each row depending on some conditions:


  1. If the month is within September to December (09 to 12) and has year 2016, change the year for this observation to 2015

  2. If the month is January or February (01 or 02) and has year 2015, change the year for this observation to 2016



The command I wrote above identifies the correct observations for scenario 1), but I am having trouble replacing only the last 4 digits and writing the results back into the original data frame.

One final note: you might wonder why I don't simply convert the column to a datetime type and then add or subtract a year as needed. If I attempt that, I get an error because some observations have a date of 2/29/2015, and there was no Feb. 29 in 2015!

Answer

Do not treat dates as strings. You can first convert the date strings to timestamps, then slice.

import pandas as pd
df.loc[:, 'Date'] = pd.DatetimeIndex(df['Date'], name='Date')
df = df.set_index('Date')
df['2015-09': '2016-02']
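
Note that the conversion above raises an error on strings such as 02/29/2015, as the question points out. The following is a minimal sketch, assuming the MM/DD/YYYY format shown in the question, of how such rows could be located before converting. The sample values and the names `dates`, `parsed` and `invalid` are illustrative only.

```python
import pandas as pd

# Illustrative sample in the question's MM/DD/YYYY string format.
dates = pd.Series(['01/25/2016', '02/29/2015', '11/17/2016'])

# errors='coerce' turns unparseable dates into NaT instead of raising.
parsed = pd.to_datetime(dates, format='%m/%d/%Y', errors='coerce')

# The original strings that could not be parsed as real calendar dates.
invalid = dates[parsed.isna()]
print(invalid)
```

Rows flagged this way still need their year corrected as strings before a datetime conversion can succeed.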

Update:

df.loc[:, 'year_month'] = df.Date.map(lambda s: int(s[-4:]+s[:2]))
df.query('201509<=year_month<=201602').drop('year_month', axis=1)

Sorry, I misunderstood your question.

def transform(date_string):
    year = date_string[-4:]
    month = date_string[:2]
    day = date_string[3:5]
    if year == '2016' and month in ['09', '10', '11', '12']:
        return month + '/' + day + '/' + str(int(year)-1)
    elif year == '2015' and month in ['01', '02']:
        return month + '/' + day + '/' + str(int(year)+1)
    else:
        return date_string

df.loc[:, 'Date'] = df.Date.map(transform)
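
As an alternative closer to the regex attempt in the question, the month/day prefix can be captured in a group and written back with the corrected year, so only the last 4 digits change. This is a sketch under the assumption that every value is a zero-padded MM/DD/YYYY string; the sample rows are illustrative, and `\g<1>` is used instead of `\1` so the group reference is not read together with the year digits that follow it.

```python
import pandas as pd

# Illustrative rows in the question's MM/DD/YYYY string format.
df = pd.DataFrame({'Date': ['01/25/2016', '02/28/2015', '11/17/2016',
                            '11/24/2015', '02/29/2015']})

# Scenario 1: months 09-12 with year 2016 become 2015.
# Group 1 holds "MM/DD/" and is kept; only the year is swapped.
fixed = df['Date'].str.replace(
    r'^((?:09|10|11|12)/\d{2}/)2016$', r'\g<1>2015', regex=True)

# Scenario 2: months 01-02 with year 2015 become 2016.
fixed = fixed.str.replace(
    r'^((?:01|02)/\d{2}/)2015$', r'\g<1>2016', regex=True)

# Assign back so the change is stored in the original data frame.
df['Date'] = fixed
print(df)
```

str.replace returns a new Series rather than modifying the column in place, which is why the result has to be assigned back to df['Date'].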