user3386406 - 3 months ago
Python Question

Python nested lists replace unicode characters in strings

I'm trying to replace or strip the Unicode characters from the strings in this nested list so that I can insert them into a database that does not allow those characters:

info=[[u'\xa0Buffalo\u2019s League of legends ...', '2012-09-05'], [u' \xa0RCKIN 0 - 1 WITHACK.nq\xa0 ', u'\xa0Buffalo\u2019s League of legends ...', '2012-09-05']]


I used this code:

info = [[x.replace(u'\xa0', u'') for x in l] for l in info]
info = [[y.replace('\u2019s', '') for y in o] for o in info]


The first line worked but the second one did not. Any suggestions?

Answer

Drop the second line and do:

info = [[x.encode('ascii', 'ignore')  for x in l] for l in info]

and see if the results are acceptable. This encodes each Unicode string as ASCII and silently drops any character that has no ASCII equivalent. Just make sure that losing one of those characters is not a problem for your data.
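As for why the second line in the question had no effect: assuming Python 2 (which the `u'...'` literals suggest), `'\u2019s'` without the `u` prefix is a plain byte string in which `\u` is not an escape. It holds a literal backslash followed by `u2019s`, so it never matches the real character. A minimal sketch of the difference, using a raw string to stand in for what Python 2 sees (the variable names are only for illustration):

```python
text = u'Buffalo\u2019s League'

# In Python 2, the un-prefixed literal from the question is seven ASCII
# characters: backslash, u, 2, 0, 1, 9, s. A raw string spells that out.
as_seen_by_python2 = r'\u2019s'

# No match, so nothing is replaced.
unchanged = text.replace(as_seen_by_python2, u'')

# With the u prefix the escape becomes the real character and matches.
fixed = text.replace(u'\u2019s', u'')
```

Note that the fixed call removes the whole `’s`, turning `Buffalo’s` into `Buffalo`, which may or may not be what you want.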

>>> info=[[u'\xa0Buffalo\u2019s League of legends ...', '2012-09-05'], [u' \xa0RCKIN 0 - 1 WITHACK.nq\xa0  ', u'\xa0Buffalo\u2019s League of legends ...', '2012-09-05']]
>>> info = [[x.encode('ascii', 'ignore')  for x in l] for l in info]
>>> info
[['Buffalos League of legends ...', '2012-09-05'], [' RCKIN 0 - 1 WITHACK.nq  ', 'Buffalos League of legends ...', '2012-09-05']]
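The output above loses the apostrophe (`Buffalos`). If that matters, one option is to map the characters you care about to ASCII stand-ins before encoding. This is only a sketch: the `REPLACEMENTS` table and the `to_ascii` helper are hypothetical names, and the two mappings are assumptions about what you would want to keep.

```python
# Hypothetical mapping; extend it with whatever characters matter to you.
REPLACEMENTS = {
    u'\u2019': u"'",   # right single quotation mark -> ASCII apostrophe
    u'\xa0': u' ',     # no-break space -> ordinary space
}

def to_ascii(text):
    """Swap known characters for ASCII stand-ins, then drop the rest."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    # Anything still outside ASCII is dropped, as with encode(..., 'ignore').
    return text.encode('ascii', 'ignore').decode('ascii')

info = [[u'\xa0Buffalo\u2019s League of legends ...', '2012-09-05']]

# strip() removes the leading space left behind by the no-break space.
cleaned = [[to_ascii(x).strip() for x in row] for row in info]
```

Here `cleaned[0][0]` keeps the apostrophe as `Buffalo's League of legends ...`.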

What's going on:

You have data in your Python program that's Unicode (and that's good).

>>> u = u'\u2019'

Best practice, for interoperability, is to encode Unicode strings as UTF-8 when you write them out. These are the bytes you should be storing in your database:

>>> u.encode('utf-8')
'\xe2\x80\x99'
>>> utf8 = u.encode('utf-8')
>>> print utf8
’

When you read those bytes back into your program, decode them:

>>> utf8.decode('utf8')
u'\u2019'
>>> print utf8.decode('utf8')
’
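Applied to the nested list from the question, the same encode-on-write, decode-on-read round trip might look like the sketch below. The names `encoded` and `decoded` are only for illustration, and the actual database write is left out.

```python
info = [[u'\xa0Buffalo\u2019s League of legends ...', u'2012-09-05']]

# Encode every string to UTF-8 bytes before handing it to the database.
encoded = [[x.encode('utf-8') for x in row] for row in info]

# Decode the bytes back to Unicode after reading them from the database.
decoded = [[b.decode('utf-8') for b in row] for row in encoded]
```

Nothing is lost along the way: `decoded` compares equal to the original `info`.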

If your database can't handle UTF-8, I would consider switching to a different database.
