Kevin Kevin - 2 months ago 12
Python Question

Running a Regex loop in a Pandas Dataframe

I currently have a date column that has some issues. I have attempted to fix the problem but cannot come to a conclusion.

Here is the data:

# Import data
df_views = pd.read_excel('PageViews.xlsx')

# Check data types
df_views.dtypes
Out[57]:
Date object
Customer ID int64
dtype: object


The date column is not in a 'datetime' data format as expected. Further inspection yields:

df_views.ix[:5]
Date Customer ID
0 01/25/2016 104064596300
1 02/28/2015 102077474472
2 11/17/2016 106430081724
3 02/24/2016 107770391692
4 10/05/2016 106523680888
5 02/24/2016 107057691592


I quickly check which rows does not follow the proper format xx/xx/xxxx

print (df_views[df_views["Date"].str.len() != 10])
Date Customer ID
189513 12/14/ 106285770688
189514 10/28/ 107520462840
189515 11/01/ 102969804360
189516 11/10/ 102106417100
189517 02/16/ 107810168068
189518 10/25/ 102096164504
189519 02/08/ 107391760644
189520 02/29/ 107353558928
189521 10/24/ 107209142140
189522 12/20/ 107875461336
189523 12/23/ 107736375428
189524 11/12/ 106561080372
189525 01/27/ 102676548120
189526 11/19/ 107733043896
189527 12/31/ 107774452412
189528 01/21/ 102610956040
189529 01/09/ 108052836888
189530 02/21/ 106380330112
189531 02/02/ 107844459772
189532 12/12/ 102006641640
189533 12/16/ 106604647688
189534 11/14/ 102383102504


I have attempted to create a for loop but cannot figure out how to approach my loop.

Important note: I know that the time period for all observations is between September 2015 through February 2016.

So if the month is 09/10/11/12 - then I can add "2015" to the date,
otherwise if the month is 01/02, I can add "2016".

for row in df_views["Date"]:
if len(row) != 10:
if row.str.contains("^09|10|11|12\/"):
row.str.cat("2015")
elif row.str.contains("^01|02\/"):
row.str.cat("2016")
else:
continue
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-87-684e121dd62d> in <module>()
5 for row in df_views["Date"]:
6 if len(row) != 10:
----> 7 if row.str.contains("^09|10|11|12\/"):
8 row.str.cat("2015")
9 elif row.str.contains("^01|02\/"):

AttributeError: 'str' object has no attribute 'str'

Answer

As @BrenBam has already written in the comment - try to avoid using loops. Pandas gives us tons of vectorized (read fast and efficient) methods:

In [67]: df
Out[67]:
          Date   Customer ID
0   12/14/2001  106285770688
1   10/28/2000  107520462840
2       11/01/  102969804360
3       11/10/  102106417100
4       02/16/  107810168068
5       10/25/  102096164504
6       02/08/  107391760644
7       02/29/  107353558928
8       10/24/  107209142140
9       12/20/  107875461336
10      12/23/  107736375428
11      11/12/  106561080372
12      01/27/  102676548120
13      11/19/  107733043896
14      12/31/  107774452412
15      01/21/  102610956040
16      01/09/  108052836888
17      02/21/  106380330112
18      02/02/  107844459772
19      12/12/  102006641640
20      12/16/  106604647688
21      11/14/  102383102504

In [68]: df.ix[df.Date.str.match(r'^(?:09|10|11|12)\/\d{2}\/$', as_indexer=True), 'Date'] += '2015'

In [69]: df.ix[df.Date.str.match(r'^(?:01|02)\/\d{2}\/$', as_indexer=True), 'Date'] += '2016'

In [70]: df
Out[70]:
          Date   Customer ID
0   12/14/2001  106285770688
1   10/28/2000  107520462840
2   11/01/2015  102969804360
3   11/10/2015  102106417100
4   02/16/2016  107810168068
5   10/25/2015  102096164504
6   02/08/2016  107391760644
7   02/29/2016  107353558928
8   10/24/2015  107209142140
9   12/20/2015  107875461336
10  12/23/2015  107736375428
11  11/12/2015  106561080372
12  01/27/2016  102676548120
13  11/19/2015  107733043896
14  12/31/2015  107774452412
15  01/21/2016  102610956040
16  01/09/2016  108052836888
17  02/21/2016  106380330112
18  02/02/2016  107844459772
19  12/12/2015  102006641640
20  12/16/2015  106604647688
21  11/14/2015  102383102504