N_B N_B - 1 month ago 27
Python Question

Alternative approach to strip symbols in a string

I am working on a function that keeps symbols that appear inside a word (a word can consist of a-z, A-Z, 0-9 and _) but removes every other symbol outside the word.

For example:
Input String - hell_o ? my name _ i's <hel'lo/>
Output - ['hell_o', 'my', 'name', '_', "i's", "hel'lo"]


The function I am using:

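import re
from string import punctuation

# Strip leading/trailing punctuation (except "_") from each whitespace-separated word and drop empty results
l = ' '.join(filter(None, (word.strip(punctuation.replace("_", "")) for word in input_String.split())))
l = re.sub(r'\s+', " ", l)
t = l.lower().split()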


I know this is not the best or most efficient way. Can anyone recommend alternatives I could try? Perhaps a regex would do this?


  • I tried using a negative lookahead and a negative lookbehind:
    \W+(?!\S*[a-z])|(?<!\S)\W+

  • s.strip(punctuation)

  • re.sub('[^\w]', ' ', doc.strip(' ').lower())
    - This removes punctuation inside the word too
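
To make the problem with the last attempt concrete, here is a minimal sketch run on the example input. The variable name doc comes from the attempt above; assigning the example string to it is an assumption made for illustration.

```python
import re

# Assumption: doc holds the example input from the question.
doc = "hell_o ? my name _ i's <hel'lo/>"

# Every non-word character becomes a space, including apostrophes inside words.
cleaned = re.sub(r'[^\w]', ' ', doc.strip(' ').lower())
tokens = cleaned.split()

print(tokens)
# ['hell_o', 'my', 'name', '_', 'i', 's', 'hel', 'lo']
```

The words i's and hel'lo are split in two, which is exactly the behaviour the question wants to avoid.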


Answer

You can match any character other than a-z, A-Z, 0-9 and _ (as you mention) that sits between two letters with (?<=[a-z])\W(?=[a-z]), used with the case-insensitive flag, and replace it with nothing to remove it.

Be aware that this is a risky approach. For instance, in the sentence I'm fine.And you? (no space after the dot), the dot sits between two letters, so it is removed and the two words fuse into fineAnd, which may not be what you want.
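
A quick check in Python shows how aggressive this substitution is. This is only a sketch, assuming the case-insensitive flag is set. Note that \W also matches spaces and apostrophes, so those are removed between letters as well, not just the dot.

```python
import re

sentence = "I'm fine.And you?"

# Remove any non-word character that sits directly between two letters.
pattern = re.compile(r"(?<=[a-z])\W(?=[a-z])", re.IGNORECASE)
result = pattern.sub("", sentence)

print(result)
# ImfineAndyou?
```

Only the final question mark survives, because it is not followed by a letter.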


[EDIT] after your comments.

OK, I misunderstood your question.

Here is a single regex that selects 'hell_o', 'my', 'name', "i's" and "hel'lo":

(?<![a-z])[a-z][^\s]*[a-z](?![a-z])

You can see it working here: https://regex101.com/r/EAEelq/3 (don't forget the i and g flags).
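
In Python, the same regex could be applied roughly as follows. This is a sketch: re.findall stands in for the g flag and re.IGNORECASE for the i flag, and the variable names are chosen for illustration.

```python
import re

text = "hell_o ? my name _ i's <hel'lo/>"

# A match starts and ends on a letter and may contain any non-space characters in between.
pattern = re.compile(r"(?<![a-z])[a-z][^\s]*[a-z](?![a-z])", re.IGNORECASE)
words = pattern.findall(text)

print(words)
# ['hell_o', 'my', 'name', "i's", "hel'lo"]
```

Because the pattern needs a letter at both ends, it only matches tokens containing at least two letters. A one-letter word such as a is skipped, and so are tokens made only of digits.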


[EDIT] As you also want to match the _ outside a word

OK, if you also want the underscores to be matched, update the regex as follows:

(?<![a-z_])[a-z_][^\s]*[a-z_](?![a-z_])|(?<= )[a-z_](?= )

See it working here: https://regex101.com/r/EAEelq/4
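
For completeness, here is a sketch of the updated regex in Python, applied to the example from the question. The second pattern, named variant, is not part of the answer above. It is an assumption-based alternative that treats digits as word characters too, as the question's definition of a word suggests.

```python
import re

text = "hell_o ? my name _ i's <hel'lo/>"

# Regex from the answer: letters/underscores at both ends, or a lone
# letter/underscore with a literal space on each side.
answer = re.compile(
    r"(?<![a-z_])[a-z_][^\s]*[a-z_](?![a-z_])|(?<= )[a-z_](?= )",
    re.IGNORECASE,
)
tokens = answer.findall(text.lower())
print(tokens)
# ['hell_o', 'my', 'name', '_', "i's", "hel'lo"]

# Hypothetical variant (not from the answer): start on a word character and
# optionally extend over non-space characters up to the last word character.
variant = re.compile(r"\w(?:\S*\w)?", re.ASCII)
tokens_variant = variant.findall(text.lower())
print(tokens_variant)
# ['hell_o', 'my', 'name', '_', "i's", "hel'lo"]
```

Both give the requested output on this input. They differ on edge cases: the answer's regex needs a space on each side of a single character, so it misses one at the very start or end of the string, while the variant does not depend on the surrounding spaces.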