pir pir - 6 months ago 14
Python Question

Cleaning away symbols/whitespace efficiently

I have strings such as "- memphis , tn! ", "~~~memphis,tn", ":) memphis , tn (:", ". - memphis,tn - ." and "memphis tn?". I want to clean each of these strings so that it becomes "memphis,tn". Currently I use the code below, but is there a more efficient way of doing this, perhaps using regex?

Note that the order of the special characters currently affects the end result. For instance, ". - memphis,tn - ." gives the right result, whereas "- . memphis,tn . -" does not. This is not intended. If it could be fixed as a side effect, that would be great!

The strings are pure ASCII and I may be tempted to remove more special characters than the ones below.

Edit:
Sorry, I should note that not all strings have the "x,y" format. Also strings such as "-- New York City --" or "* Texas *" should be cleaned up.

# remove emoticons
smileys = [":)", ":\\", ":(", ";)",
           "(:", "\\:", "):", "(;"]
for s in smileys:
    loc = loc.replace(s, '')

# clean up whitespace
loc = ' '.join(loc.split())
loc = loc.strip()
loc = loc.replace(' ,', ',')
loc = loc.replace(', ', ',')
loc = loc.replace(' .', '.')
loc = loc.replace('. ', '.')

# clean special symbols off the sides
# (each symbol is stripped in turn, so the result depends on their order)
symbols = ['.', ',', '!', '-', '#', '~',
           '*', '^', '?', '@', '"', "'"]
for s in symbols:
    loc = loc.strip(s)  # str.strip() takes a single chars argument

loc = loc.strip()

Answer

You can use

','.join(y for y in re.split("[- ,!~?]", x) if len(y) > 0)
                                ^^
                                ||
                    List all the symbols here
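To see why the empty pieces have to be filtered out, here is a small sketch of what re.split returns for the first example string. The variable names `raw` and `pieces` are only for illustration.

```python
import re

raw = "- memphis , tn! "

# Every single separator character produces a split point, so adjacent
# separators (and separators at either end) yield empty strings.
pieces = re.split("[- ,!~?]", raw)
print(pieces)
# ['', '', 'memphis', '', '', 'tn', '', '']

# Dropping the empty pieces and joining with a comma gives the cleaned value.
print(','.join(p for p in pieces if len(p) > 0))
# memphis,tn
```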

Python Code

import re

locations = ["- memphis , tn! ", "~~~memphis,tn", ":) memphis , tn (:", ". - memphis,tn - .", "memphis tn?", ". - memphis,tn - .", "- . memphis,tn . -"]

for x in locations:
    print(','.join(part for part in re.split("[- ,!~?:;)(.]", x) if len(part) > 0))
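One caveat: splitting and re-joining turns every separator into a comma, so "-- New York City --" from the question's edit would come out as "New,York,City". If inner spaces should survive, a strip-based variant may be closer to what is wanted. The following is only a sketch: the name `clean_location` and the exact contents of `EDGE_CHARS` are assumptions, not something given in the question. Because str.strip() treats its argument as a set of characters, the order of the symbols no longer matters.

```python
import re

# Characters to trim from both ends. This set is an assumption; extend as needed.
EDGE_CHARS = " .,!-#~*^?@\"':;()\\"

def clean_location(loc):
    # Strip any mix of edge characters in one call, independent of their order.
    loc = loc.strip(EDGE_CHARS)
    # Collapse runs of whitespace into a single space.
    loc = re.sub(r"\s+", " ", loc)
    # Remove whitespace around commas.
    loc = re.sub(r"\s*,\s*", ",", loc)
    return loc

for raw in ["- . memphis,tn . -", ":) memphis , tn (:", "-- New York City --", "* Texas *"]:
    print(clean_location(raw))
```

Unlike the split-based version, this leaves "memphis tn?" as "memphis tn", since it never inserts a comma that was not there.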


If you want to remove any symbol other than alphanumeric, you can use

print(','.join(y for y in re.split(r"_|[^\w]", x) if len(y) > 0))
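As a self-contained sketch of the same idea, the pattern can also be written as the character class [\W_] (anything that is not a letter or digit), with a + so that a run of symbols counts as one separator. The helper name `join_words` is made up for this example.

```python
import re

def join_words(text):
    # [\W_]+ matches one or more characters that are not letters or digits
    # (\W excludes the underscore, so it is listed explicitly).
    parts = re.split(r"[\W_]+", text)
    # Separators at the ends still leave empty strings, so filter them out.
    return ','.join(p for p in parts if p)

print(join_words(":) memphis , tn (:"))   # memphis,tn
print(join_words("- . memphis,tn . -"))   # memphis,tn
```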