Question:
I need to match and replace whole words in the pandas DataFrame column 'message' with dictionary values. Can I do this with the df["column"].replace command, or do I need to find another way to replace whole words?
Background:
In my pandas DataFrame I have a column of text messages that contain English first names, which I'm trying to replace with the dictionary value "FirstName". The column looks like this, where you can see "tommy" as a single name.
tester.df["message"]
message
0 what do i need to do
1 what do i need to do
2 hi tommy thank you for contacting app ...
3 hi tommy thank you for contacting app ...
4 hi we are just following up to see if you read...
import re
import requests
# Download the list of US Census first names
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')
# The name is the first whitespace-delimited token on each line
list1 = re.findall(r'\n(.*?)\s', r.text, re.DOTALL)
# Join the list into a string and force lower case
str1 = ','.join(list1)
str1 = str1.lower()
# Split back into a list of names
str1 = [x.strip() for x in str1.split(',')]
# Turn into a dictionary with "FirstName" as the value
str1 = dict((el, 'FirstName') for el in str1)
In [254]: tester["message"].replace(str1, regex = True)
Out[254]:
0 wFirstNamet do i neFirstName to do
1 wFirstNamet do i neFirstName to do
2 hi FirstNameFirstName tFirstName you for conFi...
3 hi FirstNameFirstName tFirstName you for conFi...
4 hi we are just followFirstNameg up to FirstNam...
Name: message, dtype: object
#import the total name
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')
l = requests.get('https://deron.meranda.us/data/popular-last.txt')
#US Census first names
list1= re.findall(r'\n(.*?)\s', r.text, re.DOTALL)
#add regex before
string = 'r"\\'
endstring = '\\b'
list1 = [ string + x + endstring for x in list1]
#turn list to string, force lower case
str1 = ', '.join('"{0}"'.format(w) for w in list1)
str1 = ','.join(list1)
str1 = (str1.lower())
## if we do print(str1) it shows one backslash
## turn into a list, but print() no longer shows a single backslash
str1 = [x.strip() for x in str1.split(',')]
#turn to dictionary with "firstname"
str1 = dict((el, 'FirstName') for el in str1)
tester["message"].replace(str1, regex = True)
First you need to prepare the list of names such that each pattern matches the name preceded by either the beginning of the string (^) or a whitespace (\s) and followed by either a whitespace or the end of the string ($). Then you need to make sure to preserve the preceding and following element (via backreferences). Assuming you have a list first_names which contains all first names that should be replaced:
replacement_dict = {
r'(^|\s){}($|\s)'.format(name): r'\1FirstName\2'
for name in first_names
}
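As a quick check, here is a minimal sketch that applies this dictionary to a few made-up messages. The sample Series and the short first_names list are assumptions for illustration only; in the real case the names would come from the census file and the Series would be tester["message"].

```python
import pandas as pd

# Hypothetical sample data standing in for tester["message"].
messages = pd.Series([
    "what do i need to do",
    "hi tommy thank you for contacting app",
    "tommy",
])

# Hypothetical short name list. "ha" and "ed" are included because they
# occur inside "what" and "need" and were replaced in the original attempt.
first_names = ["tommy", "ha", "ed"]

# Same construction as above: one regex per name, keeping the
# surrounding whitespace via the two capture groups.
replacement_dict = {
    r'(^|\s){}($|\s)'.format(name): r'\1FirstName\2'
    for name in first_names
}

result = messages.replace(replacement_dict, regex=True)
print(result.tolist())
```

With this sample, "what" and "need" should stay intact, while the standalone "tommy" is replaced in the second and third messages.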
Let's take a look at the regex:
( # Start group.
^|\s # Match either beginning of string or whitespace.
) # Close group.
{} # This is where the actual name will be inserted.
(
$|\s # Match either end of string or whitespace.
)
And the replacement string:
\1 # Backreference; whatever was matched by the first group.
FirstName
\2 # Backreference; whatever was matched by the second group.
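One thing to be aware of: because the groups consume the surrounding whitespace, two names separated by a single space will not both be matched, and a name followed by punctuation is skipped. The sketch below is an alternative, not the approach described above. It builds a single alternation wrapped in \b word boundaries using only the standard re module. The names mask_names, pattern and spaced, and the sample strings, are hypothetical.

```python
import re

# Hypothetical sample names; the real list would come from the census file.
first_names = ["tommy", "ha", "ed"]

# One alternation wrapped in word boundaries. re.escape guards against
# names that contain regex metacharacters.
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in first_names) + r")\b"
)


def mask_names(text):
    """Replace every whole-word name in text with 'FirstName'."""
    return pattern.sub("FirstName", text)


# The whitespace-delimited pattern for a single name, for comparison.
spaced = re.compile(r"(^|\s)tommy($|\s)")

adjacent = "tommy tommy"
# The first match consumes the separating space, so the second name is missed.
print(spaced.sub(r"\1FirstName\2", adjacent))
# Word boundaries are zero-width, so both names are replaced.
print(mask_names(adjacent))
# Substrings such as "ha" in "what" are left alone.
print(mask_names("what do i need to do"))
# Punctuation after the name still counts as a word boundary.
print(mask_names("hi tommy, thank you"))
```

In pandas, a compiled pattern like this could be passed to Series.str.replace with regex=True. With thousands of names, a single pattern may also be faster than a dictionary holding one regex per name, though that is worth measuring on the real data.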