joe wong - 1 month ago
Python Question

How can I POS tag German texts?

I've been doing some natural language processing work.

For English POS tagging, it's rather simple because I only need to use built-in nltk functions. I want to process German texts similarly.

Since nltk doesn't have a built-in function for German, I've tried using Stanford POSTagger:

from nltk.tag.stanford import StanfordPOSTagger
import os
import nltk
java_path = "C:/Program Files/Java/jdk1.8.0_71/bin/java.exe"
os.environ['JAVAHOME'] = java_path
sentence = "Man könnte Klöckner vorhalten, sich an ihre eigenen Appelle nicht zu halten. Doch niemand in der Union wagte das. Nicht einmal die von ihr attackierten Briefschreiber. Klöckner genießt im Moment Narrenfreiheit."
tokens = nltk.word_tokenize(sentence, 'german')
german_postagger1 = StanfordPOSTagger(r'E:/python/nlptest/models/german-hgc.tagger', r'E:/python/nlptest/stanford-postagger.jar')
gp1 = german_postagger1.tag(tokens)


It takes almost 7 seconds to finish processing, which is unbearable for me.

I also tried the module Pattern, but it doesn't support Python 3 and I'm using Python 3.4.

Is there an alternative and faster way to POS tag German sentences?

Answer

TreeTagger is a fast, easy-to-install, well-documented, decision-tree-based tagger with support for many languages (and yeah, it's built by a German), and it has a Python wrapper.
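For what it's worth, here is a rough sketch of how that could look from Python 3. It assumes the third-party treetaggerwrapper package plus a local TreeTagger install with the German parameter file; the answer doesn't name a specific wrapper, so treat that choice as an assumption. `TaggedToken`, `parse_treetagger_lines` and `tag_german` are hypothetical helper names made up for this example. TreeTagger emits one tab-separated `word / POS / lemma` line per token, and the parser below simply skips any line that doesn't have exactly three fields.

```python
from collections import namedtuple

# Hypothetical container for one tagged token (not part of any library).
TaggedToken = namedtuple("TaggedToken", ["word", "pos", "lemma"])


def parse_treetagger_lines(lines):
    """Turn 'word<TAB>pos<TAB>lemma' lines into TaggedToken tuples."""
    tokens = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:  # skip anything that is not a plain token line
            tokens.append(TaggedToken(*fields))
    return tokens


def tag_german(text, tagger=None):
    """Tag German text with TreeTagger.

    Pass in an existing tagger to reuse it across calls. If none is given,
    one is created here, which assumes treetaggerwrapper and TreeTagger
    (with German parameter files) are installed.
    """
    if tagger is None:
        import treetaggerwrapper  # third-party, imported lazily on purpose
        tagger = treetaggerwrapper.TreeTagger(TAGLANG="de")
    return parse_treetagger_lines(tagger.tag_text(text))
```

If you tag many sentences, creating the tagger once and passing it to `tag_german` each time should avoid paying the startup cost on every call.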
