A.collin - 1 month ago
Python Question

Remove non-alphanumeric characters from a string but keep encoded non-ASCII characters (åäö)

How can I keep åäö but remove all other non-alphanumeric characters from a string?
(I found similar questions but none seem to have a proper answer.)

I tried things like extending the regex so that the sub would skip åäö, but that seems to make the regex stop working altogether, letting whitespace and such stay as well.
I don't usually program in Python, I'm just trying to help a friend out, so there might be a better way to clean a string than using re.

From googling, I think it has to do with Unicode, but I found no good solutions.

def ordnaText(text):
    text = text.lower()
    text = re.sub('\W', '', text)
    if text.isalnum() == True:
        return text

Answer

You are trying to match against encoded input; raw_input() in Python 2 always returns a byte string. This means that the terminal, console or IDE you are using determines what encoding is used for the input.

Matching non-ASCII characters with a regular expression on byte strings requires you to match the encoded bytes exactly, which usually means that any change in the terminal environment or in your source code editor settings will make the match fail.
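To illustrate why byte-level matching is fragile, here is a small sketch. It is written for Python 3, where the split between bytes and text is explicit, whereas the answer itself targets Python 2. The two codecs are only examples of what a terminal might be configured to use.

```python
# The same character, encoded the way two differently configured
# terminals might deliver it.
text = "å"

utf8_bytes = text.encode("utf-8")      # two bytes: b'\xc3\xa5'
latin1_bytes = text.encode("latin-1")  # one byte:  b'\xe5'

# A byte pattern written for one encoding does not match the other.
print(utf8_bytes == latin1_bytes)  # False

# Decoding with the matching codec recovers the same text either way.
print(utf8_bytes.decode("utf-8") == latin1_bytes.decode("latin-1"))  # True
```

Once the input is decoded, the pattern no longer depends on which encoding the bytes arrived in.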

You want to explicitly decode the raw_input() here, and use Unicode matching:

import sys
import re

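def ordnaText(text):
    # text is expected to be a unicode object (already decoded)
    text = text.lower()
    # With re.UNICODE, \W matches anything that is not a Unicode letter, digit or underscore
    text = re.sub(u'\W', '', text, flags=re.UNICODE)
    # Returns None implicitly otherwise, for example if the result is empty or an underscore remains
    if text.isalnum():
        return text

userinput = raw_input('....')
# raw_input() returns a byte string; decode it using the terminal's encoding
userinput = userinput.decode(sys.stdin.encoding)
something = ordnaText(userinput)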

sys.stdin.encoding tells you what Python thinks the input codec is. Using flags=re.UNICODE explicitly switches on Unicode support in the regular expression engine, and u'\W' gives the engine a Unicode string literal; the latter is optional, but it is better to be explicit.
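For comparison, here is a rough sketch of the same idea in Python 3, where str is already Unicode text and \W is Unicode-aware by default, so neither the decoding step nor the re.UNICODE flag is needed. The function names strip_non_word and keep_swedish_alnum are hypothetical and not part of the original code. Note that \w also matches the underscore and every Unicode letter, so the second function assumes a stricter reading of the question and uses an explicit whitelist of a-z, 0-9 and åäö.

```python
import re

def strip_non_word(text):
    """Lowercase the text and drop all non-word characters."""
    # In Python 3, \W on a str pattern is Unicode-aware by default.
    return re.sub(r"\W", "", text.lower())

def keep_swedish_alnum(text):
    """Lowercase the text and keep only a-z, 0-9 and the letters åäö."""
    return re.sub(r"[^a-z0-9åäö]", "", text.lower())

print(strip_non_word("Hallå, världen_1!"))      # hallåvärlden_1
print(keep_swedish_alnum("Hallå, världen_1!"))  # hallåvärlden1
```

The whitelist version is the safer choice when the output must pass an isalnum() check, because it also removes the underscore that \W leaves in place.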

If you want to learn more about Unicode, encoded byte strings and how it relates to Python, I recommend you read:
