Mahmood Kohansal - 3 years ago
Python Question

Text Processing - Word2Vec training after phrase detection (bigram model)

I want to train a word2vec model whose vocabulary includes more n-grams than usual. As far as I can tell, the Phrases class in gensim.models.phrases can detect the phrases I want, and its output on a corpus can be passed to Word2Vec for training.

So first I wrote the code below, modeled on the sample code in the gensim documentation.

import os

import gensim
from nltk.tokenize import word_tokenize  # assumed source of word_tokenize (NLTK)

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # Stream one tokenized line at a time from every file in the directory.
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield word_tokenize(line)

sentences = MySentences('sentences_directory')

bigram = gensim.models.Phrases(sentences)

model = gensim.models.Word2Vec(bigram['sentences'], size=300, window=5, workers=8)


The model is created, but it evaluates poorly and training emits this warning:

WARNING : train() called with an empty iterator (if not intended, be sure to provide a corpus that offers restartable iteration = an iterable)


I searched for it and I found https://groups.google.com/forum/#!topic/gensim/XWQ8fPMFSi0 and changed my code:

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield word_tokenize(line)

class PhraseItertor(object):
    def __init__(self, my_phraser, data):
        self.my_phraser, self.data = my_phraser, data

    def __iter__(self):
        yield self.my_phraser[self.data]


sentences = MySentences('sentences_directory')

bigram_transformer = gensim.models.Phrases(sentences)

bigram = gensim.models.phrases.Phraser(bigram_transformer)

corpus = PhraseItertor(bigram, sentences)

model = gensim.models.Word2Vec(corpus, size=300, window=5, workers=8)


Now I get this error:

Traceback (most recent call last):
  File "/home/fatemeh/Desktop/Thesis/bigramModeler.py", line 36, in <module>
    model = gensim.models.Word2Vec(corpus, size=300, window=5, workers=8)
  File "/home/fatemeh/.local/lib/python3.4/site-packages/gensim/models/word2vec.py", line 478, in __init__
    self.build_vocab(sentences, trim_rule=trim_rule)
  File "/home/fatemeh/.local/lib/python3.4/site-packages/gensim/models/word2vec.py", line 553, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey
  File "/home/fatemeh/.local/lib/python3.4/site-packages/gensim/models/word2vec.py", line 575, in scan_vocab
    vocab[word] += 1
TypeError: unhashable type: 'list'


What is wrong with my code?

Answer

I asked this question in the gensim Google Group, and Gordon Mohr answered:

You typically wouldn't want an __iter__() method to do a single yield. It should return an iterator object (ready to return multiple objects via next(), or to raise a StopIteration exception). One way to create an iterator is to use yield to have the method treated as a 'generator', but that would typically require the yield to be inside a loop.

But I now see that my example code in the thread you reference does the wrong thing with its __iter__() return line: it should not be returning the raw phrasifier, but one that has already been started-as-an-iterator, by use of the iter() built-in function. That is, the example there should have read:

class PhrasingIterable(object):
    def __init__(self, phrasifier, texts):
        self.phrasifier, self.texts = phrasifier, texts
    def __iter__(self):
        # Return a fresh iterator each time so the corpus can be re-read.
        return iter(self.phrasifier[self.texts])

Making a similar change in your variation may resolve the TypeError: iter() returned non-iterator of type 'TransformedCorpus' error.
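
To see why the single yield fails, here is a minimal sketch that runs without gensim. ToyPhraser, SingleYield and count_words are hypothetical stand-ins written for this illustration only: ToyPhraser imitates a phraser by joining 'new' and 'york', and count_words imitates the nested loop shown in the traceback. It assumes that indexing a phraser with a whole corpus gives back a stream of phrased sentences.

```python
from collections import Counter


class ToyPhraser(object):
    """Hypothetical stand-in for a gensim phraser: joins 'new' + 'york'."""

    def _phrase(self, tokens):
        out, i = [], 0
        while i < len(tokens):
            if tokens[i:i + 2] == ['new', 'york']:
                out.append('new_york')
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    def __getitem__(self, corpus):
        # Given a whole corpus, return a one-shot stream of phrased sentences.
        return (self._phrase(tokens) for tokens in corpus)


class SingleYield(object):
    """Mirrors the question's class: __iter__ yields the whole stream once."""

    def __init__(self, phraser, data):
        self.phraser, self.data = phraser, data

    def __iter__(self):
        yield self.phraser[self.data]


class PhrasingIterable(object):
    """Mirrors the corrected class: __iter__ returns a fresh iterator."""

    def __init__(self, phraser, data):
        self.phraser, self.data = phraser, data

    def __iter__(self):
        return iter(self.phraser[self.data])


def count_words(corpus):
    # Same shape as a vocabulary scan: each item must be a list of strings.
    vocab = Counter()
    for sentence in corpus:
        for word in sentence:
            vocab[word] += 1
    return vocab


sentences = [['i', 'love', 'new', 'york'], ['new', 'york', 'is', 'big']]
phraser = ToyPhraser()

try:
    count_words(SingleYield(phraser, sentences))
    single_yield_error = None
except TypeError as exc:
    # The only "sentence" is the whole stream, so each "word" is a list.
    single_yield_error = str(exc)

corpus = PhrasingIterable(phraser, sentences)
first_pass = count_words(corpus)
second_pass = count_words(corpus)  # a second pass sees the same data
```

With the single yield, the consumer receives exactly one item (the entire phrased corpus), so the inner loop walks over sentences instead of words and tries to use a list as a dictionary key. Returning iter(...) hands out one sentence at a time, and because __iter__ builds a new iterator on every call, the corpus can be read once for the vocabulary scan and again for training.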
