eliavs - 4 months ago
Python Question

Using the polyglot package for Named Entity Recognition in Hebrew

I am trying to use the polyglot package for Named Entity Recognition in Hebrew.

This is my code:

# -*- coding: utf8 -*-
import polyglot
from polyglot.text import Text, Word
from polyglot.downloader import downloader
downloader.download("embeddings2.iw")
text = Text(u"in france and in germany")
print(type(text))
text2 = Text(u"נסעתי מירושלים לתל אביב")
print(type(text2))
print(text.entities)
print(text2.entities)


This is the output:

<class 'polyglot.text.Text'>
<class 'polyglot.text.Text'>
[I-LOC([u'france']), I-LOC([u'germany'])]
Traceback (most recent call last):
  File "C:/Python27/Lib/site-packages/IPython/core/pyglot.py", line 15, in <module>
    print(text2.entities)
  File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 20, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "C:\Python27\lib\site-packages\polyglot\text.py", line 132, in entities
    for i, (w, tag) in enumerate(self.ne_chunker.annotate(self.words)):
  File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 20, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "C:\Python27\lib\site-packages\polyglot\text.py", line 100, in ne_chunker
    return get_ner_tagger(lang=self.language.code)
  File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 30, in memoizer
    cache[key] = obj(*args, **kwargs)
  File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 191, in get_ner_tagger
    return NEChunker(lang=lang)
  File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 104, in __init__
    super(NEChunker, self).__init__(lang=lang)
  File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 40, in __init__
    self.predictor = self._load_network()
  File "C:\Python27\lib\site-packages\polyglot\tag\base.py", line 109, in _load_network
    self.embeddings = load_embeddings(self.lang, type='cw', normalize=True)
  File "C:\Python27\lib\site-packages\polyglot\decorators.py", line 30, in memoizer
    cache[key] = obj(*args, **kwargs)
  File "C:\Python27\lib\site-packages\polyglot\load.py", line 61, in load_embeddings
    p = locate_resource(src_dir, lang)
  File "C:\Python27\lib\site-packages\polyglot\load.py", line 43, in locate_resource
    if downloader.status(package_id) != downloader.INSTALLED:
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 738, in status
    info = self._info_or_id(info_or_id)
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 508, in _info_or_id
    return self.info(info_or_id)
  File "C:\Python27\lib\site-packages\polyglot\downloader.py", line 934, in info
    raise ValueError('Package %r not found in index' % id)
ValueError: Package u'embeddings2.iw' not found in index


The English text worked, but the Hebrew text did not.

Whether or not I try to download the package u'embeddings2.iw', I get:


ValueError: Package u'embeddings2.iw' not found in index

Answer

I got it!
It seems like a bug to me.
The language detection identified the language as 'iw', which is the former ISO 639 language code for Hebrew; the code was later changed to 'he'. The entities lookup did not recognize the 'iw' code, so I changed it like so:

text2.hint_language_code = 'he'
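
The same idea can be generalized a little. The sketch below is not part of polyglot: the helper name normalize_language_code and the mapping table are my own illustration, and the table only lists three deprecated ISO 639-1 codes that I am sure about. It maps a retired code to its current replacement and leaves every other code untouched.

```python
# Hypothetical helper, not part of polyglot: map deprecated ISO 639-1
# codes to their current replacements.
DEPRECATED_ISO_639_1 = {
    "iw": "he",  # Hebrew
    "in": "id",  # Indonesian
    "ji": "yi",  # Yiddish
}


def normalize_language_code(code):
    """Return the current ISO 639-1 code for a possibly deprecated one."""
    code = code.strip().lower()
    # Codes that are not in the table are returned unchanged.
    return DEPRECATED_ISO_639_1.get(code, code)


print(normalize_language_code("iw"))
print(normalize_language_code("en"))
```

Assuming the workaround above behaves as described, the result for 'iw' could be assigned to text2.hint_language_code. Judging by the traceback, entities and ne_chunker are cached on the object after the first successful access, so it is probably safest to set the hint before reading text2.entities.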