I need to train a word2vec representation on tweets using gensim. Unlike most tutorials and code I've seen for gensim, my data is not raw but has already been preprocessed. I have a dictionary in a text document containing 65k words (incl. an "unknown" token and an EOL token), and the tweets are saved as a numpy matrix with indices into this dictionary. A simple example of the data format can be seen below:
[[0, 1, 2, 3, 6],
[3, 5, 5, 1, 6],
[0, 1, 3, 6, 6]]
The normal way to initialize a Word2Vec model in gensim is

model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

The question is: what is sentences supposed to be in my case?
sentences is supposed to be an iterator over iterables of words/tokens. It is just like the numpy matrix you have, except that each row can be of a different length.

If you look at the documentation for gensim.models.word2vec.LineSentence, it gives you a way of loading text files as sentences directly. As a hint, according to the documentation, it takes

one sentence = one line; words already preprocessed and separated by whitespace.
When it says words already preprocessed, it is referring to lower-casing, stemming, stopword filtering, and all the other text-cleansing steps. In your case you wouldn't want the EOL token (index 6 in your example) to be in your list of sentences, so you do need to filter it out.
Given that you already have the numpy matrix, and assuming each row is a sentence, it is better to cast it to a list of lists and filter out every 6; after filtering, the rows have different lengths, so a 2d array no longer fits anyway. The resulting list of lists can be used directly as the sentences argument to initialize the model. The only catch is that when you want to query the model after training, you need to input the indices instead of the tokens.
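A minimal sketch of that conversion, using the example matrix from the question and assuming index 6 is the EOL token:

```python
import numpy as np

EOL = 6  # assumed EOL index, as in the example matrix above

tweets = np.array([[0, 1, 2, 3, 6],
                   [3, 5, 5, 1, 6],
                   [0, 1, 3, 6, 6]])

# Drop the EOL token from each row. Rows may end up with different
# lengths, so the result is a list of lists, not a 2d array.
sentences = [[idx for idx in row if idx != EOL] for row in tweets.tolist()]

# sentences == [[0, 1, 2, 3], [3, 5, 5, 1], [0, 1, 3]]
```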
Now, one question you have is whether the model takes integers directly. The pure-Python version doesn't check for type and just passes the unique tokens around, so your unique indices will work fine there. But most of the time you will want to use the C-extended routines to train your model, which is a big deal because they can give about 70x the performance. I imagine in that case the C code may check for string type, which would mean a string-to-index mapping is stored and you would pass your indices as strings.
Is this inefficient? I think not, because the strings you have are numbers, which are generally much shorter than the real tokens they represent (assuming they are compact indices from 0). The model will therefore be smaller in size, which saves some effort in serializing and deserializing it at the end. You have essentially encoded the input tokens in a shorter string format and separated that encoding from the word2vec training; the word2vec model does not, and need not, know this encoding happened before training.
My philosophy is: try the simplest way first. I would just throw a sample input of integers at the model and see what can go wrong. Hope it helps.