Suhairi Suhaimin - 2 months ago
Python Question

How to get only the words for selected tags in NLTK part-of-speech (POS) tagging?

Sorry, I am new to Pandas and NLTK. I'm trying to build a customized set of the returned POS-tagged words. My data looks like this:

comment
0 [(have, VERB), (you, PRON), (pahae, VERB)]
1 [(radio, NOUN), (television, NOUN), (lid, NOUN)]
2 [(yes, ADV), (you're, ADJ)]
3 [(ooi, ADJ), (work, NOUN), (barisan, ADJ)]
4 [(national, ADJ), (debt, NOUN), (increased, VERB)]


Any idea how I can get only the words that match the selected tags (VERB or NOUN), like below, and return NaN if none match?

comment
0 [(have), (pahae)]
1 [(radio), (television), (lid)]
2 [NaN]
3 [(work)]
4 [(debt), (increased)]

Answer

You can use a list comprehension and then replace each empty list with [NaN]:

import numpy as np
import pandas as pd

df = pd.DataFrame({'comment': [
        [('have', 'VERB'), ('you', 'PRON'), ('pahae', 'VERB')],
        [('radio', 'NOUN'), ('television', 'NOUN'), ('lid', 'NOUN')],
        [('yes', 'ADV'), ("you're", 'ADJ')],
        [('ooi', 'ADJ'), ('work', 'NOUN'), ('barisan', 'ADJ')],
        [('national', 'ADJ'), ('debt', 'NOUN'), ('increased', 'VERB')]
    ]})

print (df)    
                                             comment
0         [(have, VERB), (you, PRON), (pahae, VERB)]
1   [(radio, NOUN), (television, NOUN), (lid, NOUN)]
2                        [(yes, ADV), (you're, ADJ)]
3         [(ooi, ADJ), (work, NOUN), (barisan, ADJ)]
4  [(national, ADJ), (debt, NOUN), (increased, VE...
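
# Keep only the words tagged VERB or NOUN, each wrapped in a 1-tuple
df.comment = df.comment.apply(lambda x: [(t[0],) for t in x if t[1] in ('VERB', 'NOUN')])
# df.ix was removed in pandas 1.0, so swap empty lists for [nan] with a second apply
df.comment = df.comment.apply(lambda x: x if x else [np.nan])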
print (df)
                             comment
0                [(have,), (pahae,)]
1  [(radio,), (television,), (lid,)]
2                              [nan]
3                          [(work,)]
4            [(debt,), (increased,)]
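
If the 1-tuples in the output are not needed, the same filtering could be wrapped in a small helper that returns the bare words. This is only a sketch: select_words and its keep parameter are hypothetical names made up for illustration (they are not part of pandas or NLTK), and it assumes each row is a list of (word, tag) pairs as in the question.

```python
import math

def select_words(tagged, keep=("VERB", "NOUN")):
    """Return the words whose tag is in `keep`, or [nan] if none match.

    `tagged` is assumed to be a list of (word, tag) pairs.
    """
    words = [word for word, tag in tagged if tag in keep]
    # An empty list is falsy, so fall back to a single NaN placeholder
    return words if words else [math.nan]

# Two rows taken from the question's data
rows = [
    [("have", "VERB"), ("you", "PRON"), ("pahae", "VERB")],
    [("yes", "ADV"), ("you're", "ADJ")],
]
result = [select_words(row) for row in rows]
print(result)
```

On a DataFrame like the one above, the helper would be applied with df.comment.apply(select_words), and a different tag set can be passed through, e.g. df.comment.apply(select_words, keep=("ADJ",)).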