This happens when a potential named entity is followed by a comma. For example, if my strings are something like:
"These names Praveen Kumar,, David Harrison, Paul Harrison, blah "
"California, United States"
[[(u'These', u'O'), (u'names', u'O'), (u'Praveen', u'O'), (u'Kumar,,', u'O'), (u'David', u'PERSON'), (u'Harrison,', u'O'), (u'Paul', u'PERSON'), (u'Harrison,', u'O'), (u'blah', u'O')]]
[[(u'California,', u'O'), (u'United', u'LOCATION'), (u'States', u'LOCATION')]]
from nltk.tag.stanford import NERTagger
st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
tags = st.tag("California, United States".split())
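The root of the problem is visible before the tagger is even involved: `str.split()` breaks only on whitespace, so trailing punctuation stays glued to the word, and the tagger never sees a clean token. A quick check:

```python
# str.split() splits on whitespace only, so the comma
# remains attached to the preceding token.
tokens = "California, United States".split()
print(tokens)  # ['California,', 'United', 'States']
```

The tagger is then asked to classify the token `'California,'` (comma included), which it does not recognize as a location.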
Since you are doing this through the nltk, use its tokenizers to split your input:
import nltk

alltext = myfile.read()
tokenized_text = nltk.word_tokenize(alltext)
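To see the effect a proper tokenizer has, here is a rough stand-in built with a regex (an illustration only, not the actual algorithm `nltk.word_tokenize` uses): it separates punctuation into its own tokens, so the tagger receives a bare word.

```python
import re

def simple_tokenize(text):
    # Match either a run of word characters or a single
    # non-word, non-space character (e.g. a comma).
    # A rough stand-in for a real tokenizer, for illustration.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("California, United States"))
# ['California', ',', 'United', 'States']
```

With the comma split off as its own token, `'California'` can be matched by the NER model.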
Edit: You're probably better off with the Stanford toolkit's own tokenizer, as recommended by the other answer. So if you'll be feeding the tokens to one of the Stanford tools, tokenize your text like this to get exactly the tokenization that the tools expect:
from nltk.tokenize.stanford import StanfordTokenizer

tokenize = StanfordTokenizer().tokenize
alltext = myfile.read()
tokenized_text = tokenize(alltext)
To use this method you'll need to have the Stanford tools installed, and the nltk must be able to find them. I assume you have already taken care of this, since you're using the Stanford NER tool.