Stereo - 1 month ago
Python Question

NLTK tag Dutch sentence

I am beginning with NLTK and want to tag a Dutch sentence but I am having trouble specifying the corpus.

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import alpino

pos_tag(word_tokenize("Python is een goede data science taal."), tagset = 'alpino')


gives,

[('Python', 'UNK'),
('is', 'UNK'),
('een', 'UNK'),
('goede', 'UNK'),
('data', 'UNK'),
('science', 'UNK'),
('taal', 'UNK'),
('.', 'UNK')]


So clearly I am not specifying the corpus correctly. I downloaded the alpino corpus. Can anyone help me to figure out how to specify the corpus correctly?

Answer

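pos_tag uses a tagger pretrained on English, and its tagset argument only maps the output tags to another tagset; it does not select a training corpus. To tag Dutch, train a tagger on Dutch data such as the alpino corpus. The resulting model will only be as good as: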

  • what data it is trained on
  • which algorithm it is trained with

Based on the UnigramTagger and BigramTagger example:

>>> from nltk.corpus import alpino as alp
>>> from nltk.tag import UnigramTagger, BigramTagger
>>> training_corpus = alp.tagged_sents()
>>> unitagger = UnigramTagger(training_corpus)
>>> bitagger = BigramTagger(training_corpus, backoff=unitagger)
>>> pos_tag = bitagger.tag
>>> sent = 'NLTK is een goeda taal voor NLP'.split()
>>> pos_tag(sent)
[('NLTK', None), ('is', u'verb'), ('een', u'det'), ('goeda', None), ('taal', u'noun'), ('voor', u'prep'), ('NLP', None)]
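
The None tags in the output above mark words that neither tagger saw during training. One common way to avoid them is to end the backoff chain with a DefaultTagger. The following is a minimal sketch, not part of the original answer: it uses a tiny hand-made training set (an assumption for illustration, not Alpino data) so that it runs without downloading a corpus, and it assumes 'noun' is a reasonable fallback tag.

```python
from nltk.tag import BigramTagger, DefaultTagger, UnigramTagger

# Tiny hand-made training set, for illustration only; in practice you
# would pass alpino.tagged_sents() as in the example above.
toy_training = [
    [('dit', 'det'), ('is', 'verb'), ('een', 'det'), ('taal', 'noun')],
    [('een', 'det'), ('goede', 'adj'), ('taal', 'noun')],
]

# Backoff chain: bigram -> unigram -> fixed default tag.
# 'noun' as the fallback is an assumption, not a recommendation.
default = DefaultTagger('noun')
unitagger = UnigramTagger(toy_training, backoff=default)
bitagger = BigramTagger(toy_training, backoff=unitagger)

tagged = bitagger.tag('NLTK is een goede taal'.split())
print(tagged)
```

With this chain, an unseen word such as 'NLTK' receives the default tag instead of None. Whether a blanket fallback tag is acceptable depends on what you do with the tags afterwards.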

Using PerceptronTagger:

>>> from nltk.corpus import alpino as alp
>>> from nltk.tag import PerceptronTagger
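>>> training_corpus = list(alp.tagged_sents())  # plain list, since train() shuffles the sentences
>>> tagger = PerceptronTagger(load=False)  # do not load the pretrained English model
>>> tagger.train(training_corpus)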
>>> pos_tag = tagger.tag
>>> sent = 'NLTK is een goeda taal voor het leren over NLP'.split()
>>> pos_tag(sent)
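
Since the quality depends on both the training data and the algorithm, it can help to measure it on sentences the tagger has not seen. The helper below is a rough sketch, not part of the original answer: `tagging_accuracy`, `toy_tag` and the toy sentences are hypothetical names and data made up for illustration, and the code assumes Python 3 division.

```python
def tagging_accuracy(tag, gold_sents):
    """Return the fraction of tokens whose predicted tag matches the gold tag.

    `tag` is any callable that maps a list of words to (word, tag) pairs,
    such as the `tag` method of an NLTK tagger.
    """
    correct = total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        predicted = tag(words)
        for (_, guess), (_, truth) in zip(predicted, gold):
            correct += guess == truth
            total += 1
    return correct / total if total else 0.0


# Toy stand-in for a trained tagger (hypothetical, for illustration only).
lookup = {'een': 'det', 'taal': 'noun', 'is': 'verb'}


def toy_tag(words):
    # Unknown words get None, like an NLTK tagger without backoff.
    return [(word, lookup.get(word)) for word in words]


# Hand-made "held-out" sentences, not taken from Alpino.
held_out = [
    [('een', 'det'), ('goede', 'adj'), ('taal', 'noun')],
    [('dit', 'det'), ('is', 'verb')],
]

score = tagging_accuracy(toy_tag, held_out)
print(score)  # 3 of 5 tokens match
```

With a real tagger you would train on one slice of alp.tagged_sents(), keep the rest aside, and pass the tagger's tag method together with the held-out sentences. That makes it possible to compare the bigram and perceptron taggers on the same data.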

As @WasiAhmed noted, https://github.com/evanmiltenburg/Dutch-tagger is another good example. As @evanmiltenburg states on GitHub, try to use a faster tagger in production.