Sitz Blogz - 4 months ago
Python Question

str.extractall('#') Pandas gives an error

I am trying to filter all the # keywords from the tweet text. I am using str.extractall() to extract all the keywords that start with #.
This is the first time I am filtering keywords from the tweetText column using pandas. The input, code, expected output and error are given below.

Input:

userID,tweetText
01, home #sweet home
01, #happy #life
02, #world peace
03, #all are one
04, world tour


and so on... The full data file is gigabytes in size and consists of scraped tweets with several other columns, but I am interested in only these two.

Code:

import re
import pandas as pd

data = pd.read_csv('Text.csv', index_col=0, header=None, names=['userID', 'tweetText'])

fout = data['tweetText'].str.extractall('#')

print fout


Expected Output:

userID,tweetText
01,#sweet
01,#happy
01,#life
02,#world
03,#all


Error:

Traceback (most recent call last):
File "keyword_split.py", line 7, in <module>
fout = data['tweetText'].str.extractall('#')
File "/usr/local/lib/python2.7/dist-packages/pandas/core/strings.py", line 1621, in extractall
return str_extractall(self._orig, pat, flags=flags)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/strings.py", line 694, in str_extractall
raise ValueError("pattern contains no capture groups")
ValueError: pattern contains no capture groups


Thanks in advance for the help. What would be the simplest way to filter the keywords for each userID?

Answer

If you are not too tied to using extractall, you can try the following to get your final output:

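from io import StringIO
import re

import pandas as pd

# Sample data from the question, including the row without a hashtag.
data_text = """userID,tweetText
01, home #sweet home
01, #happy #life
02, #world peace
03, #all are one
04, world tour
"""

data = pd.read_csv(StringIO(data_text), header=0)

# Collect every hashtag in each tweet as a list.
data['tweetText'] = data.tweetText.apply(lambda x: re.findall(r'#\w+', x))
# Expand each list into one row per hashtag, keeping the original row index.
# dropna() is a no-op on pandas versions whose stack() already drops NaN.
s = data.apply(lambda x: pd.Series(x['tweetText']), axis=1).stack().dropna().reset_index(level=1, drop=True)
s.name = "tweetText"
data = data.drop('tweetText', axis=1).join(s)
print(data)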

     userID tweetText
0       1    #sweet
1       1    #happy
1       1     #life
2       2    #world
3       3      #all
4       4       NaN

You can drop the rows where the tweetText column is NaN by doing the following:

data = data[~data['tweetText'].isnull()]

This should return:

   userID tweetText
0       1    #sweet
1       1    #happy
1       1     #life
2       2    #world
3       3      #all
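
If you would rather keep extractall, the traceback in the question points at the cause: the pattern '#' has no capture group, and extractall needs at least one. Below is a minimal sketch of that variant, assuming the same sample data as above. The named group tweetText and the variable names tags, rows and result are my own choices for illustration, not anything from the original code.

```python
from io import StringIO

import pandas as pd

# Sample data copied from the question.
data_text = """userID,tweetText
01, home #sweet home
01, #happy #life
02, #world peace
03, #all are one
04, world tour
"""

data = pd.read_csv(StringIO(data_text), header=0)

# The parentheses form the capture group that extractall requires;
# naming the group also names the resulting column.
tags = data["tweetText"].str.extractall(r"(?P<tweetText>#\w+)")

# extractall returns one row per match, indexed by (original row, match number).
# Use the first index level to look up the userID of each match.
rows = tags.index.get_level_values(0)
result = pd.DataFrame({
    "userID": data.loc[rows, "userID"].to_numpy(),
    "tweetText": tags["tweetText"].to_numpy(),
})
print(result)
```

Rows without any hashtag (userID 04 here) produce no matches, so they never appear in result and there is no NaN filtering step. Note that read_csv parses userID as an integer, so 01 shows up as 1, just as in the output above.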

I hope this helps.
