James Allen-Robertson James Allen-Robertson - 2 months ago 14x
Python Question

NLTK Lemmatizing with list comprehension

How can I verify whether I am correctly using the NLTK lemmatizer in this list comprehension, specifically whether it is taking account of the POS tags?

clean_article_string = (article_db.loc[0,'clean_text']) # pandas dataframe cell containing string.
tokens = word_tokenize(clean_article_string)
treebank_tagged_tokens = tagger.tag(tokens)
wordnet_tagged_tokens = [(w,get_wordnet_pos(t)) for (w, t) in treebank_tagged_tokens]
lemmatized_tokens = [(lemmatizer.lemmatize(w).lower(),t) for (w,t) in wordnet_tagged_tokens]
423 384

I'm using a converter I found on Stackoverflow to switch from treebank to Wordnet tokens, and it works fine. My issue is whether for
the lemmatizer is actually taking both the word and the tag of my
tuple into account, or if it is just looking at the
and lemmatizing based on that (presuming everything to be a noun). I tried...

lemmatized_tokens = [(lemmatizer.lemmatize(w,t)) for (w,t) in wordnet_tagged_tokens]


lemmatized_tokens = [(lemmatizer.lemmatize(w, pos=t)) for (w,t) in wordnet_tagged_tokens]

which produces a
KeyError: ''
in the Wordnet lemmatize function. So the initial code actually functions, but I don't know if it is using the POS tag or not. Does anyone know whether the lemmmatizer will be taking it into account in the working code, and/or if I can verify it is?


Answer by ewcz in comments. Labelled as community wiki. This helped me, might help others.

You use lemmatizer.lemmatize(w), then it will use the default POS tag n, the error suggests that some of the tags are empty - in this case, perhaps one could use a fallback to nouns, i.e., to use lemmatizer.lemmatize(w, pos=t if t else 'n')