I am working on a function that keeps symbols that are inside a word (a word can consist of a-zA-Z, 0-9 and _) but removes every other symbol outside the word:
Input String - hell_o ? my name _ i's <hel'lo/>
Output - ['hell_o', 'my', 'name', '_', "i's", "hel'lo"]
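import re
from string import punctuation
# Strip punctuation (except "_") from both ends of each whitespace-separated word
l = ' '.join(filter(None, (word.strip(punctuation.replace("_", "")) for word in input_String.split())))
l = re.sub(r'\s+', " ", l)
t = l.lower().split()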
re.sub(r'[^\w]', ' ', doc.strip(' ').lower())
You can match any character other than
a-zA-Z, 0-9 and _ (as you mention) that sits between two letters with
(?<=[a-z])\W(?=[a-z]) and replace it with nothing to remove it.
Be aware that this makes for a very dangerous algorithm. For instance, in the sentence
I'm fine.And you? there is no space after the dot, so the dot is removed and the two
words are fused into fineAnd, which may not be what you want.
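To make the risk concrete, here is a minimal sketch of that substitution in Python. The `re.IGNORECASE` flag and the second pattern, which swaps `\W` for `[^\w\s]`, are assumptions added for illustration and are not part of the answer above. Note that `\W` also matches whitespace, so the pattern as written removes the spaces between words as well.

```python
import re

# Pattern from the answer; IGNORECASE is an assumption so uppercase letters count too.
BETWEEN_LETTERS = re.compile(r"(?<=[a-z])\W(?=[a-z])", re.IGNORECASE)

# Hypothetical variant: [^\w\s] leaves whitespace alone and only removes symbols.
BETWEEN_LETTERS_KEEP_SPACES = re.compile(r"(?<=[a-z])[^\w\s](?=[a-z])", re.IGNORECASE)

sentence = "I'm fine.And you?"

fused = BETWEEN_LETTERS.sub("", sentence)
fused_keep_spaces = BETWEEN_LETTERS_KEEP_SPACES.sub("", sentence)

print(fused)              # ImfineAndyou?
print(fused_keep_spaces)  # Im fineAnd you?
```

In both cases the dot disappears and `fine` and `And` are glued together, which is the pitfall described above.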
[EDIT] after your comments.
OK, I misunderstood your question.
I have now come up with a single regex that selects
'hell_o', 'my', 'name', "i's" and "hel'lo".
You can see it working here: https://regex101.com/r/EAEelq/3. (don't forget the
[EDIT] Since you also want to match a _ that stands alone outside a word,
update the regex as follows:
(?<![a-z_])[a-z_][^\s]*[a-z_](?![a-z_])|(?<= )[a-z_](?= )
See it working here: https://regex101.com/r/EAEelq/4
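For reference, a small sketch of how this last regex could be used from Python with `re.findall`. The helper name `extract_words` and the use of `re.IGNORECASE` are assumptions for illustration; only the pattern itself comes from the answer.

```python
import re

# Final pattern from the answer: a token that starts and ends with a letter or "_",
# or a single letter/"_" surrounded by spaces.
WORD_PATTERN = re.compile(
    r"(?<![a-z_])[a-z_][^\s]*[a-z_](?![a-z_])|(?<= )[a-z_](?= )",
    re.IGNORECASE,  # assumption: treat uppercase letters like lowercase ones
)


def extract_words(text):
    """Hypothetical helper: return every token matched by WORD_PATTERN."""
    return WORD_PATTERN.findall(text)


words = extract_words("hell_o ? my name _ i's <hel'lo/>")
print(words)  # ['hell_o', 'my', 'name', '_', "i's", "hel'lo"]
```

One caveat with the pattern as written: the character classes at the edges only contain letters and `_`, so a token such as `abc1` would be returned as `abc`. If digits at the start or end of a word matter, `0-9` would have to be added to those classes.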