Peachazoid - 3 years ago
Python Question

Python: replace whole word dictionary values in pandas df with dictionary key

Question:
I need to match whole words in the pandas df column 'message' and replace them with the dictionary values. Is there any way to do this within the df["column"].replace command, or do I need to find another way to replace whole words?

Background:
In my pandas data frame I have a column of text messages containing English first names. These names are the dictionary keys, and I'm trying to replace each of them with the dictionary value "FirstName". The column looks like this, where you can see "tommy" as a single name.

tester.df["message"]
message
0 what do i need to do
1 what do i need to do
2 hi tommy thank you for contacting app ...
3 hi tommy thank you for contacting app ...
4 hi we are just following up to see if you read...


The dictionary is built from a list of first names I extracted from the 2000 census database. It contains many names that can also match text inside other words, such as 'al' or 'tom', so if I'm not careful the value "FirstName" ends up scattered across the message column:

import re
import requests

#import the total name
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')

#US Census first names (the first token on each line)
list1 = re.findall(r'\n(.*?)\s', r.text, re.DOTALL)


#turn list to string, force lower case
str1 = ','.join(list1)
str1 = str1.lower()

#split back into names and turn into dictionary with "FirstName" as value
#(iterating over the string itself would yield single characters, not names)
str1 = dict((el, 'FirstName') for el in str1.split(','))


Now I want to replace whole words in the DF column "message" that match the dictionary keys with the 'FirstName' value. Unfortunately, when I do the following, it replaces text wherever a key matches, even inside other words, which happens a lot with short names like 'al' or 'tom'.

In [254]: tester["message"].replace(str1, regex = True)
Out[254]:
0 wFirstNamet do i neFirstName to do
1 wFirstNamet do i neFirstName to do
2 hi FirstNameFirstName tFirstName you for conFi...
3 hi FirstNameFirstName tFirstName you for conFi...
4 hi we are just followFirstNameg up to FirstNam...
Name: message, dtype: object


Any help matching and replacing the whole key with value is appreciated!
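The output above can be reproduced without pandas. This is a minimal sketch that assumes "ha" and "ed" are among the census names (an inference from the output shown, not something the question states); the two-entry dictionary is a hypothetical stand-in for the full one.

```python
import re

# Hypothetical two-entry subset of the name dictionary.
# "ha" and "ed" are assumed to be present in the census list.
names = {"ha": "FirstName", "ed": "FirstName"}

text = "what do i need to do"

# With regex=True every key is treated as a pattern and matched anywhere
# in the string, including inside longer words.
for pattern, value in names.items():
    text = re.sub(pattern, value, text)

print(text)
```

Because the patterns carry no anchors or boundaries, "what" and "need" are both hit even though neither is a name.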

Update / attempt to fix 1: Tried adding some regular expression features to match whole words only

I tried adding a word-boundary marker to each name in the extracted list from which the dictionary is constructed. Unfortunately the single backslashes appear to get turned into double backslashes, and the resulting keys fail in the dictionary key -> value replace.

#import the total name
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')
l = requests.get('https://deron.meranda.us/data/popular-last.txt')
#US Census first names
list1= re.findall(r'\n(.*?)\s', r.text, re.DOTALL)

#add regex before

string = 'r"\\'
endstring = '\\b'

list1 = [ string + x + endstring for x in list1]

#turn list to string, force lower case
str1 = ', '.join('"{0}"'.format(w) for w in list1)

str1 = ','.join(list1)
str1 = (str1.lower())


##if we do print(str1) it shows one backslash
##turn to list ..but print() doesn't let us have one backlash anymore

str1 = [x.strip() for x in str1.split(',')]



#turn to dictionary with "firstname"
str1 = dict((el, 'FirstName') for el in str1)


Then, when I try to match and replace using the updated dictionary keys with the word-boundary regular expressions, I get a "bad escape" error:

tester["message"].replace(str1, regex = True)


" Traceback (most recent call last):
error: bad escape \j "

This might be the right direction, but the backslash to double backslash conversion seems to be tricky...
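One plausible reading of the error: the prefix 'r"\\' is not a raw-string marker but three literal characters (r, ", and a backslash) that become part of each pattern, so a name such as "john" yields a pattern containing \j, which is not a valid escape. The doubled backslashes seen when displaying the list are only how Python shows a single backslash in a string repr. A minimal sketch of the word-boundary idea, using a hypothetical three-name sample and plain re instead of the data frame:

```python
import re

# Hypothetical sample standing in for the census list.
names = ["john", "al", "tommy"]

# What the attempt built: the characters r, " and \ end up inside the pattern.
broken = 'r"\\' + names[0] + '\\b'
try:
    re.compile(broken)
    broken_error = None
except re.error as exc:
    broken_error = str(exc)

# Word-boundary keys: \b on both sides, with the name escaped in case it
# contains regex metacharacters.
keys = {r'\b' + re.escape(name) + r'\b': 'FirstName' for name in names}

text = "hi tommy thank you for contacting al"
for pattern, value in keys.items():
    text = re.sub(pattern, value, text)

print(broken_error)
print(text)
```

A dictionary built like keys could then be passed to replace with regex=True in the same way as before.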

Answer Source

First, you need to turn each name into a pattern that matches the name only when it is preceded by either the beginning of the string (^) or whitespace (\s) and followed by either whitespace or the end of the string ($). Then you need to preserve the preceding and following characters (via backreferences), because they are part of the match. Assuming you have a list first_names which contains all first names that should be replaced:

replacement_dict = {
    r'(^|\s){}($|\s)'.format(name): r'\1FirstName\2'
    for name in first_names
}
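As a usage sketch, the dictionary can be passed to replace with regex=True. The first_names list and the messages Series below are hypothetical samples modelled on the question's data, not the real census list or data frame.

```python
import pandas as pd

# Hypothetical sample of names; the real list comes from the census file.
first_names = ["tommy", "al", "tom"]

replacement_dict = {
    r'(^|\s){}($|\s)'.format(name): r'\1FirstName\2'
    for name in first_names
}

# Hypothetical stand-in for the "message" column.
messages = pd.Series([
    "what do i need to do",
    "hi tommy thank you for contacting app",
])

# Each key is applied as a regular expression; the backreferences in the
# value put the surrounding whitespace back.
result = messages.replace(replacement_dict, regex=True)
print(result.tolist())
```

If names could contain regex metacharacters, wrapping each one in re.escape before formatting would be a reasonable precaution.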

Let's take a look at the regex:

(         # Start group.
  ^|\s    # Match either beginning of string or whitespace.
)         # Close group.
{}        # This is where the actual name will be inserted.
(
  $|\s    # Match either end of string or whitespace.
)

And the replacement regex:

\1     # Backreference; whatever was matched by the first group.
FirstName
\2     # Backreference; whatever was matched by the second group.
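To see the groups and backreferences at work outside pandas, here is a small sketch with re.sub for the single name "tommy". The sample sentences are made up for illustration.

```python
import re

pattern = r'(^|\s)tommy($|\s)'
replacement = r'\1FirstName\2'

# Name in the middle: both groups capture a space, which is put back.
middle = re.sub(pattern, replacement, "hi tommy thank you")

# Name is the whole string: both groups match empty text.
alone = re.sub(pattern, replacement, "tommy")

# Name inside a longer word: nothing follows that satisfies ($|\s).
embedded = re.sub(pattern, replacement, "tommys here")

# Caveat: the groups consume the whitespace they match, so a name that
# directly follows another matched name is left untouched.
repeated = re.sub(pattern, replacement, "tommy tommy")

print(middle, alone, embedded, repeated, sep=" | ")
```

If adjacent names matter for the data at hand, word boundaries (\b) or lookarounds are possible alternatives, since they do not consume the surrounding characters.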