Irfanullah Irfanullah - 2 months ago 14
Python Question

Apply regular expressions on a specific column in Pandas

I have a dataset with the columns tweetID, tweet-text, RegExp1, RegExp2, RegExp3, and RegExp4, and a list of 4 regular expressions.
I want to apply the regular expressions one by one to the tweet-text column: if tweet-text matches a regular expression, I want to set the corresponding RegExp column to 1, and if it does not match, I want to set it to 0.

For example, if tweet-text matches regular expression 1, the RegExp1 column should be set to 1; if it does not match regular expression 2, the RegExp2 column should be set to 0, and so on. I tried the code given at the end, but it didn't work for me.

My dataset looks like this:

tweetID | tweet-text | RegExp1 | RegExp2 | RegExp3 | RegExp4
---------------------------------------------------------------------
10001 | to get it or? | | | |
10333 | I just wonder :) | | | |
10933 | is it possible dude| | | |
14633 | he is good at | | | |


code:

`import re

regexes = [
    re.compile('i asked .* said'),
    re.compile('you asked me what .*'),
    re.compile('(to get|to see|to look|is it true|is it possible) .*'),
    re.compile('I .* wonder .*')
]
for regex, i in zip(regexes, range(4)):
    columnName = "RegExp" + str(i + 1)
    for row in df['tweet-text']:
        if regex.search(row) != None:
            df[columnName] = 1
        else:
            df[columnName] = 0`


(A pandas-based solution is preferred.) Thanks.

Answer Source

You can use str.contains inside a loop. You'll need to pass the regex pattern (not a compiled regex object).
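
If the patterns already exist as compiled objects, as in the question, one way to recover the pattern strings is the `pattern` attribute of each compiled regex. This is a minimal sketch; the names `compiled` and `patterns` are illustrative and not part of the original code:

```python
import re

# Compiled regexes, as in the question (two shown for brevity).
compiled = [
    re.compile('i asked .* said'),
    re.compile('I .* wonder .*'),
]

# Each compiled regex keeps its source string in the .pattern attribute.
patterns = [c.pattern for c in compiled]
print(patterns)
```

The resulting strings can then be passed to str.contains.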

This is what I'm starting with:

In [1062]: df.head()
Out[1062]: 
   tweetID            tweet-text    RegExp1    RegExp2    RegExp3 RegExp4
0    10001   to get it or?                                               
1    10333   I just wonder :)                                            
2    10933   is it possible dude                                         
3    14633   he is good at 

In [1063]: regexes = [
      ...:     'i asked .* said',
      ...:     'you asked me what .*',
      ...:     '(?:to get|to see|to look|is it true|is it possible) .*',
      ...:     'I .* wonder .*'
      ...: ]

Note that the third pattern uses a non-capturing group, (?:...), instead of the capturing group in the question; str.contains emits a warning when a pattern contains capturing groups.

Next, loop over the regex patterns. For each one, call str.contains on the tweet-text column, convert the boolean result to 0/1 with astype(int), and assign it to the corresponding RegExp column:

In [1090]: for i, r in enumerate(regexes):
      ...:     df['RegExp%d' %(i + 1)] = df['tweet-text'].str.contains(r).astype(int)
      ...:     

In [1091]: df.head()
Out[1091]: 
   tweetID            tweet-text  RegExp1  RegExp2  RegExp3  RegExp4
0    10001   to get it or?              0        0        1        0
1    10333   I just wonder :)           0        0        0        1
2    10933   is it possible dude        0        0        1        0
3    14633   he is good at              0        0        0        0
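
For reference, the steps above can be put together as a self-contained script. This is a sketch that rebuilds the sample data from the question by hand; the `na=False` argument is an addition not in the original answer, included on the assumption that some tweets may be missing, in which case they are treated as non-matches:

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "tweetID": [10001, 10333, 10933, 14633],
    "tweet-text": [
        "to get it or?",
        "I just wonder :)",
        "is it possible dude",
        "he is good at",
    ],
})

# Pattern strings; the third uses a non-capturing group.
regexes = [
    r"i asked .* said",
    r"you asked me what .*",
    r"(?:to get|to see|to look|is it true|is it possible) .*",
    r"I .* wonder .*",
]

# One 0/1 column per pattern: RegExp1, RegExp2, ...
for i, pattern in enumerate(regexes, start=1):
    matches = df["tweet-text"].str.contains(pattern, na=False)
    df[f"RegExp{i}"] = matches.astype(int)

print(df)
```

Because str.contains works on the whole column at once, each RegExp column gets a per-row result, rather than a single value assigned to every row.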