pir · 28 days ago
Python Question

Gensim word2vec on predefined dictionary and word-indices data

I need to train a word2vec representation on tweets using gensim. Unlike most tutorials and code I've seen on gensim, my data is not raw but has already been preprocessed. I have a dictionary in a text document containing 65k words (including an "unknown" token and an EOL token), and the tweets are saved as a numpy matrix of indices into this dictionary. A simple example of the data format can be seen below:

dict.txt

you
love
this
code


tweets (5 is unknown and 6 is EOL)

[[0, 1, 2, 3, 6],
[3, 5, 5, 1, 6],
[0, 1, 3, 6, 6]]


I'm unsure how I should handle the index representation. An easy way is simply to convert the list of indices to a list of strings (i.e. [0, 1, 2, 3, 6] -> ['0', '1', '2', '3', '6']) as I read it into the word2vec model. However, this seems inefficient, as gensim will then try to look up its own internal index for e.g. '2'.

How do I load this data and create the word2vec representation in an efficient manner using gensim?

Mai
Answer

The normal way to initialize a Word2Vec model in gensim is [1]

model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

The question is, what is sentences? sentences is supposed to be an iterable of iterables of words/tokens. It is just like the numpy matrix you have, except that each row can have a different length.

If you look at the documentation for gensim.models.word2vec.LineSentence, it gives you a way of loading a text file as sentences directly. As a hint, according to the documentation, it takes

one sentence = one line; words already preprocessed and separated by whitespace.
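Since your rows are already tokenized, you could write them out in exactly that format and let LineSentence do the loading. A minimal sketch (the file name tweets.txt is just an illustration, and the 5/6 filtering is explained below):

```python
# Dump the index matrix to a LineSentence-compatible text file:
# one sentence per line, tokens separated by whitespace.
# The special tokens 5 (unknown) and 6 (EOL) are dropped here.
tweets = [[0, 1, 2, 3, 6],
          [3, 5, 5, 1, 6],
          [0, 1, 3, 6, 6]]

with open("tweets.txt", "w") as f:
    for row in tweets:
        f.write(" ".join(str(i) for i in row if i not in (5, 6)) + "\n")

# With gensim installed, this file could then be consumed directly:
# from gensim.models.word2vec import LineSentence
# sentences = LineSentence("tweets.txt")
```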

When it says words already preprocessed, it is referring to lower-casing, stemming, stopword filtering and all other text cleansing processes. In your case you wouldn't want 5 and 6 to be in your list of sentences, so you do need to filter them out.

Given that you already have the numpy matrix, assuming each row is a sentence, it is better to convert it into a list of lists and filter out all 5s and 6s. The resulting list of lists can be passed directly as the sentences argument to initialize the model. The only catch is that when you want to query the model after training, you need to input the indices instead of the tokens.
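As a concrete sketch of that conversion (the min_count value is only adjusted here because the toy corpus is tiny):

```python
tweets = [[0, 1, 2, 3, 6],
          [3, 5, 5, 1, 6],
          [0, 1, 3, 6, 6]]

# Drop the special tokens (5 = unknown, 6 = EOL) and stringify the
# indices so both the pure-Python and the C training paths accept them.
sentences = [[str(i) for i in row if i not in (5, 6)] for row in tweets]
# sentences is now [['0', '1', '2', '3'], ['3', '1'], ['0', '1', '3']]

# With gensim installed, training would then look as in [1]:
# model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=4)
```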

Now one question you have is whether the model takes integers directly. The pure-Python version doesn't check for type; it just passes the unique tokens around, so your unique indices would work fine there. But most of the time you would want to use the C-extended routine to train your model, which is a big deal because it can give a 70x speed-up. [2] I imagine the C code may check for string type, which means a string-to-index mapping is stored.

Is this inefficient? I think not, because the strings you have are numbers, which are generally much shorter than the real tokens they represent (assuming they are compact indices from 0). The model will therefore be smaller, which saves some effort when serializing and deserializing it at the end. You have essentially encoded the input tokens in a shorter string format and separated that encoding from the word2vec training; the word2vec model does not, and need not, know this encoding happened beforehand.
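Querying the trained model then just means translating between words and index strings via your dict.txt. A sketch, assuming dict.txt lists one word per line in index order (the hard-coded vocab list below stands in for reading that file):

```python
# Word <-> index translation needed to query the model afterwards.
# In practice: vocab = open("dict.txt").read().split()
vocab = ["you", "love", "this", "code"]
word2idx = {w: i for i, w in enumerate(vocab)}

# To look up the vector for "love" in a trained model:
# vec = model[str(word2idx["love"])]     # query by index string
# ...and to recover a word from one of the model's keys:
# word = vocab[int(key)]
```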

My philosophy is to try the simplest way first. I would just throw a sample test input of integers at the model and see what goes wrong. Hope it helps.

[1] https://radimrehurek.com/gensim/models/word2vec.html

[2] http://rare-technologies.com/word2vec-in-python-part-two-optimizing/