mgruber - 1 month ago
Python Question

Splitting strings of a list on a separator if it appears

I fetched the HTML of a webpage (as part of a codecademy.com project).
The result was a block of text, which I split into a list.

The problem: some results contain non-ASCII Unicode characters that I want to strip from the strings in which they appear.

['Normal String', 'Company\xc2\xae', 'againnormal', '\xc2\xb7']


The result should look like this:

['Normal String', 'Company', 'againnormal', '']


Or, ideally, like this:

['Normal String', 'Company', 'againnormal']

Answer

How about:

>>> stuff = ['Normal String', 'Company\xc2\xae', 'againnormal', '\xc2\xb7']
>>> filter(None, [x.decode('utf8').encode('ascii', 'ignore') for x in stuff])
['Normal String', 'Company', 'againnormal']

Alternatively, with a regex:

>>> import re
>>> filter(None, [re.sub(r'[^\x00-\x7F]+', '', x) for x in stuff])
['Normal String', 'Company', 'againnormal']

Without list comprehensions:

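keep = []
for item in stuff:
    # Decode the UTF-8 bytes, then drop every character that has no ASCII encoding
    item = item.decode('utf8').encode('ascii', 'ignore')
    # Skip items that are empty after stripping
    if item:
        keep.append(item)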
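
The snippets above are Python 2, where `str` holds raw bytes and `filter` returns a list. A rough Python 3 equivalent might look like the sketch below. It assumes the list items are UTF-8 encoded `bytes` objects; the helper name `strip_non_ascii` is made up for illustration.

```python
# Python 3 sketch. Assumption: each item is a UTF-8 encoded bytes object.
stuff = [b'Normal String', b'Company\xc2\xae', b'againnormal', b'\xc2\xb7']


def strip_non_ascii(items):
    """Return the items as str, with non-ASCII characters removed
    and items that end up empty dropped."""
    cleaned = (
        # bytes -> str, drop characters that cannot be encoded as ASCII, back to str
        item.decode('utf8').encode('ascii', 'ignore').decode('ascii')
        for item in items
    )
    # An empty string is falsy, so this filter removes entries like '\xc2\xb7'
    return [text for text in cleaned if text]


result = strip_non_ascii(stuff)
print(result)  # ['Normal String', 'Company', 'againnormal']
```

If the items are already `str` in Python 3, the initial `.decode('utf8')` step would not apply and should be left out.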