I am trying to clean my sentences by removing the tags attached to each word. Each tag is an underscore followed by a word, e.g. "_UH".
Basically, I want to remove the underscore and the string that follows it.
text:
['hanks_NNS sir_VBP',
'Oh_UH thanks_NNS to_TO remember_VB']
Expected output:
['hanks sir',
'Oh thanks to remember']
My attempt:
for i in text:
    k = i.split(" ")
    print(k)
    for z in k:
        if "_" in z:
            j = z.replace("_", '')
            print(j)
Output:
ThanksNNS
sirVBP
OhUH
thanksNNS
toTO
rememberVB
RemindVB
You can do it with re.sub(). Match the desired substring in a string and replace the substring with an empty string:
import re
text = ['hanks_NNS sir_VBP', 'Oh_UH thanks_NNS to_TO remember_VB']
curated_text = [re.sub(r'_\S*', r'', a) for a in text]
print(curated_text)
Output:
['hanks sir', 'Oh thanks to remember']
Regex:
_\S* - an underscore followed by zero or more non-whitespace characters
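To see what the pattern actually matches, here is a small sketch on one of the sample strings. The variable names `sample`, `matches` and `cleaned` are just for illustration.

```python
import re

sample = 'Oh_UH thanks_NNS to_TO remember_VB'

# Each match starts at an underscore and runs up to the next whitespace.
matches = re.findall(r'_\S*', sample)
print(matches)  # ['_UH', '_NNS', '_TO', '_VB']

# Replacing every match with '' leaves only the words.
cleaned = re.sub(r'_\S*', '', sample)
print(cleaned)  # Oh thanks to remember
```

Note that the pattern removes everything from the first underscore to the end of the word, so a word that legitimately contains an underscore would be cut short as well.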
Alternatively, without regex:
text = ['hanks_NNS sir_VBP', 'Oh_UH thanks_NNS to_TO remember_VB']
curated_text = []  # Outer container for the cleaned strings.
for i in text:
    d = []  # Inner container for the words of one string.
    for b in i.split():
        c = b.split('_')[0]  # Keep the word; discard the tag after the underscore.
        d.append(c)  # Append the word to the inner container.
    # Join the words and append the cleaned string to the outer container.
    curated_text.append(' '.join(d))
print(curated_text)
Output:
['hanks sir', 'Oh thanks to remember']
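The same split-based idea can be written more compactly. This is only a sketch; the helper name `strip_tags` is made up for this example and is not part of any library.

```python
def strip_tags(sentence):
    # Keep only the part of each word before the first underscore.
    return ' '.join(word.split('_')[0] for word in sentence.split())

text = ['hanks_NNS sir_VBP', 'Oh_UH thanks_NNS to_TO remember_VB']
curated_text = [strip_tags(s) for s in text]
print(curated_text)  # ['hanks sir', 'Oh thanks to remember']
```

Words without an underscore pass through unchanged, because `split('_')` then returns a one-element list.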
You are just replacing '_' with an empty string, when in fact you want to replace '_' and the characters after it with an empty string.
for i in text:
    k = i.split(" ")
    print(k)
    for z in k:
        if "_" in z:
            j = z.replace("_", '')  # <--- 'hanks_NNS' becomes 'hanksNNS'
            print(j)
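A quick side-by-side on a single token shows the difference. The variable names here are illustrative, and `split` / `partition` are just two possible ways to keep the part before the underscore.

```python
token = 'hanks_NNS'

# replace() only deletes the underscore character itself.
underscore_removed = token.replace('_', '')
print(underscore_removed)  # hanksNNS

# split() or partition() keeps the word and drops the tag.
word_via_split = token.split('_')[0]
word_via_partition = token.partition('_')[0]
print(word_via_split)      # hanks
print(word_via_partition)  # hanks
```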