MlleStrife - 8 months ago
Python Question

Python: how to separate punctuation from text

I want to separate groups of punctuation from the surrounding text with spaces.

my_text = "!where??and!!or$$then:)"

I want to get
! where ?? and !! or $$ then :)
as a result.

I wanted something like in JavaScript, where you can use
to get your matching string. What I have tried so far:

my_matches = re.findall('[!"\$%&\'()*+,\-.\/:;=#@?\[\\\]^_`{|}~]*', my_text)

Here my_matches is empty so I had to delete
from the expression:

my_matches = re.findall('[!"\$%&\'()*+,\-.\/:;=#@?\^_`{|}~]*', my_text)

I have this result:

['!', '', '', '', '', '', '??', '', '', '', '!!', '', '', '$$', '', '', '', '',
':)', '']

So I delete all the redundant entries like this:

my_matches_distinct = list(set(my_matches))

And I have a better result:

['', '??', ':)', '$$', '!', '!!']

Then I replace every match with itself surrounded by spaces:

for match in my_matches:
    if match != '':
        my_text = re.sub(match, ' ' + match + ' ', my_text)

And of course it's not working! I tried casting the match to a string, but that doesn't work either. When I put the string to replace directly in the call, it works though.

But I think I'm not doing it right, because I will have problems with '!' and '!!', right?

Thanks :)

Answer Source

It is recommended to use raw string literals when defining a regex pattern. Also, do not escape arbitrary symbols inside a character class: only \ must always be escaped, and the others can be positioned so that they need no escaping. Your regex also matches the empty string because of the * quantifier, which is where the '' entries come from; replace it with the + quantifier. Finally, if you want to surround these symbols with spaces, use re.sub directly instead of collecting the matches first.

import re
my_text = "!where??and!!or$$then:)"
print(re.sub(r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]+', r' \g<0> ', my_text).strip())

See the Python demo
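
The point about * versus + can be checked with re.findall. This is a minimal sketch that assumes a shortened character class (the name punct is made up for the illustration and only covers the symbols in the sample string):

```python
import re

my_text = "!where??and!!or$$then:)"

# Hypothetical shortened class: just the punctuation present in my_text.
punct = r'[!?$:)]'

# '*' allows zero repetitions, so it also matches the empty string
# at every position that holds no punctuation.
star_matches = re.findall(punct + '*', my_text)

# '+' requires at least one character, so only real runs are returned.
plus_matches = re.findall(punct + '+', my_text)

print(star_matches)
print(plus_matches)  # ['!', '??', '!!', '$$', ':)']
```

With + there is nothing left to filter out, so the set() step from the question is not needed.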

Details: the pattern []!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]+ matches one or more consecutive symbols from the set. Inside the class only \ has to be escaped, since - is placed at the end and ] at the start; the \' is there only because the pattern sits in a single-quoted Python string. The replacement inserts a space, the whole match (\g<0> is the backreference to the whole match), and another space.
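
As for why the loop in the question fails: each match is passed to re.sub as a pattern, so a string such as '??' is read as regex syntax rather than as literal text. Here is a minimal sketch of that effect, using re.escape to force a literal reading (the variable names are made up for the illustration, and the single-pass re.sub above remains the simpler route):

```python
import re

my_text = "!where??and!!or$$then:)"

# '??' is not a valid pattern on its own: '?' is a quantifier
# and there is nothing in front of it to repeat.
try:
    re.sub('??', ' ?? ', my_text)
    raw_pattern_failed = False
except re.error:
    raw_pattern_failed = True
print(raw_pattern_failed)  # True

# re.escape builds a pattern that matches the string literally.
escaped_result = re.sub(re.escape('??'), ' ?? ', my_text)
print(escaped_result)  # !where ?? and!!or$$then:)
```

Even with escaping, replacing '!' and '!!' in separate passes would pad the same characters more than once, which is the problem anticipated in the question. Matching whole runs with + in one pass avoids it.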