KillBill KillBill - 12 days ago 6
Python Question

Issue recognizing NEs with StanfordNER in Python NLTK

This happens when a potential NE is followed by a comma. For example, if my strings are something like


"These names Praveen Kumar,, David Harrison, Paul Harrison, blah "


or


"California, United States"


my output is as follows, respectively:


[[(u'These', u'O'), (u'names', u'O'), (u'Praveen', u'O'), (u'Kumar,,', u'O'), (u'David', u'PERSON'), (u'Harrison,', u'O'), (u'Paul', u'PERSON'), (u'Harrison,', u'O'), (u'blah', u'O')]]


or


[[(u'California,', u'O'), (u'United', u'LOCATION'), (u'States', u'LOCATION')]]


Why doesn't it recognize potential NEs such as "Praveen Kumar", "Harrison" and "California"?

Here is how I use it in the code:

from nltk.tag.stanford import NERTagger
st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')

tags = st.tag("California, United States".split())


Is it because I tokenize the input string with split()? How can I resolve this, given that it works fine when I try it in Java?

Answer

Since you are doing this through the nltk, use its tokenizers to split your input. Unlike split(), they separate punctuation such as commas from the neighbouring words:

import nltk

alltext = myfile.read()
tokenized_text = nltk.word_tokenize(alltext)
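To see why split() is the likely culprit, here is a small sketch that needs neither NLTK nor the Stanford tools. The simple_tokenize helper is a hypothetical stand-in written for illustration only; it is not how nltk.word_tokenize or the Stanford tokenizer work internally, it just shows the effect of separating punctuation from words.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for a real tokenizer (an assumption for this demo):
    # runs of word characters and single punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

text = "California, United States"

# Whitespace splitting leaves the comma glued to the preceding word,
# so the tagger is handed the token 'California,' rather than 'California'.
whitespace_tokens = text.split()

# A tokenizer that separates punctuation yields 'California' and ',' as two tokens.
separated_tokens = simple_tokenize(text)

print(whitespace_tokens)
print(separated_tokens)
```

The first list contains 'California,' with the comma attached, which matches the untagged tokens in the question's output; the second list has the comma as its own token.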

Edit: You're probably better off with the Stanford toolkit's own tokenizer, as recommended by the other answer. If you'll be feeding the tokens to one of the Stanford tools, tokenize your text like this to get exactly the tokenization that the tools expect:

from nltk.tokenize.stanford import StanfordTokenizer
tokenize = StanfordTokenizer().tokenize

alltext = myfile.read()
tokenized_text = tokenize(alltext)

To use this method you'll need to have the Stanford tools installed, and the nltk must be able to find them. I assume you have already taken care of this, since you're using the Stanford NER tool.
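Putting the two pieces together might look like the sketch below. The tag_text helper is a hypothetical name introduced here, not part of NLTK; it only assumes that the tokenizer is a callable returning a list of tokens and that the tagger has a tag() method taking such a list, as in the snippets above.

```python
def tag_text(text, tokenize, tagger):
    """Tokenize the raw text first, then hand the token list to the tagger."""
    tokens = tokenize(text)
    return tagger.tag(tokens)

# Intended use with the objects from this thread (requires the Stanford tools
# to be installed, so it is left commented out here):
# tags = tag_text("California, United States", StanfordTokenizer().tokenize, st)
```

Passing the tokenizer in as an argument makes it easy to switch between nltk.word_tokenize and the Stanford tokenizer and compare the resulting tags.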
