Sitz Blogz - 4 months ago
Python Question

Pandas gives an error from str.extractall('#')

I am trying to extract all the # keywords (hashtags) from the tweet text. I am using str.extractall() to pull out every keyword that starts with #.
This is the first time I have filtered keywords from tweetText using pandas. The input, code, expected output and error are given below.

Input:

userID,tweetText
01, home #sweet home
01, #happy #life
02, #world peace
03, #all are one
04, world tour


and so on. The full data file is gigabytes of scraped tweets with several other columns, but I am interested in only these two columns.

Code:

import re
import pandas as pd

data = pd.read_csv('Text.csv', index_col=0, header=None, names=['userID', 'tweetText'])

fout = data['tweetText'].str.extractall('#')

print fout


Expected Output:

userID,tweetText
01,#sweet
01,#happy
01,#life
02,#world
03,#all


Error:

Traceback (most recent call last):
File "keyword_split.py", line 7, in <module>
fout = data['tweetText'].str.extractall('#')
File "/usr/local/lib/python2.7/dist-packages/pandas/core/strings.py", line 1621, in extractall
return str_extractall(self._orig, pat, flags=flags)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/strings.py", line 694, in str_extractall
raise ValueError("pattern contains no capture groups")
ValueError: pattern contains no capture groups


Thanks in advance for the help. What is the simplest way to extract the keywords for each userID?

Output Update:

When I use only this, the output is like the one above:
s.name = "tweetText"
data_1 = data[~data['tweetText'].isnull()]


When I use only this, the output is what I need, but with NaN rows:

s.name = "tweetText"
data_2 = data_1.drop('tweetText', axis=1).join(s)



Answer

If you are not too tied to using extractall, you can try the following to get your final output:

from io import StringIO
import pandas as pd
import re


data_text = """userID,tweetText
01, home #sweet home
01, #happy #life
02, #world peace
03, #all are one
04, world tour
"""

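data = pd.read_csv(StringIO(data_text), header=0)

# Collect every hashtag in each tweet as a list (empty if there are none).
data['tweetText'] = data.tweetText.apply(lambda x: re.findall(r'#\w+', x))
# Spread each list over its own rows, keeping the original row index.
s = data.apply(lambda x: pd.Series(x['tweetText']), axis=1).stack().reset_index(level=1, drop=True)
s.name = "tweetText"
# Join back on the index; tweets without hashtags come back as NaN.
data = data.drop('tweetText', axis=1).join(s)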

This gives:

     userID tweetText
0       1    #sweet
1       1    #happy
1       1     #life
2       2    #world
3       3      #all
4       4       NaN
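
On pandas 0.25 or later, the apply/stack/join steps above can probably be shortened with Series.str.findall and DataFrame.explode. This is only a minimal sketch under that version assumption, not part of the original approach; the name exploded is illustrative.

```python
from io import StringIO
import pandas as pd

data_text = """userID,tweetText
01, home #sweet home
01, #happy #life
02, #world peace
03, #all are one
04, world tour
"""

data = pd.read_csv(StringIO(data_text), header=0)

# Each row gets a list of its hashtags (an empty list if there are none).
data['tweetText'] = data['tweetText'].str.findall(r'#\w+')

# explode() gives every list element its own row and repeats the index;
# an empty list becomes a single NaN row.
exploded = data.explode('tweetText')
print(exploded)
```

Tweets without any hashtag still end up as NaN rows here, just as in the output above.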

You can drop the rows where the tweetText column is NaN by doing the following:

data = data[~data['tweetText'].isnull()]

This should return:

   userID tweetText
0       1    #sweet
1       1    #happy
1       1     #life
2       2    #world
3       3      #all
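
As for the error in the question: str.extractall() raises "pattern contains no capture groups" because the pattern '#' contains no parentheses. If you would rather stay with extractall, wrapping the pattern in a capture group should be enough. The sketch below assumes the same sample data; the pattern (#\w+) and the names matches, tags and fout are illustrative choices, not the only way to do it.

```python
from io import StringIO
import pandas as pd

data_text = """userID,tweetText
01, home #sweet home
01, #happy #life
02, #world peace
03, #all are one
04, world tour
"""

data = pd.read_csv(StringIO(data_text), header=0)

# The parentheses form the capture group that extractall() requires.
# The result has a MultiIndex (original row, 'match') and one column, 0.
matches = data['tweetText'].str.extractall(r'(#\w+)')

# Drop the match counter so the hashtags line up with the original rows.
tags = matches[0].reset_index(level='match', drop=True).rename('tweetText')

# Attach userID; rows without any hashtag never appear in matches,
# so the inner join leaves no NaN rows to remove.
fout = data[['userID']].join(tags, how='inner')
print(fout)
```

Because extractall only returns rows that matched, this variant needs no separate step to drop NaN values.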

I hope this helps.
