medev21 medev21 - 2 months ago 5
Python Question

replacing html syntax/ascii code in pandas columns

Hi Stackoverflow community,

I'm trying to find a way to replace the html syntax within a string. I get a lot of csv files, so I found pandas to be an awesome tool to deal with csv.

Sometimes I receive data that has html syntax embedded in a string, for example in the address column, i see

125&#45128 downing st
, which is
125-128 downing st
. It's not only
&#49
, I also get
&#39
and
&amp
, which are
'
and
&
.

I tried to do this code, but I know it's not giving me the right format

df = df.replace(r'[&#45]','-', regex=True)


I get several columns such as business name, address, city, state - so I would like to target all columns just to make sure all the html syntax are removed/replaced

Dataframe format

Address 1 Company
0 1st&#452st Avenue N johnson &amp johnson
1 243&#454800 Kingsway Ave cold &amp brew
2 300 Hwy 7 coder&#39s club


Desired format

Address 1 Company
0 1st-2st Avenue N johnson and johnson
1 243-4800 Kingsway Ave cold and brew
2 300 Hwy 7 coder's club


I'm a newbie to pandas, but so I far I love this tool. Thanks for your help.

UPDATE:

Ok, I found my mistake on the code I use, this
df = df.replace(r'[&#45]','-', regex=True)
should be
df = df.replace(r'&#45','-', regex=True)
; this will replace that html syntax. However I still would like to find a pythonic solution to this problem.

I am tempted to do this code in order to remove the other html syntax.

df = df.replace(r'&#45','-', regex=True).replace(r'&#39','', regex=True).replace(r'&amp','and', regex=True)


is there a clean solution to this line of code?

Answer

Just to leave an answer to new people coming here, a possible solution, as discussed on the comments, is:

df.replace([r'&#45', r'&#39', r'&amp'], ['-', '', 'and'], regex=True)

or

df.replace({'&#45':'-', r'&#39':'', '&amp':'and'}, regex=True)
Comments