zhfkt zhfkt - 21 days ago 10
Python Question

How to calculate the sentence similarity using word2vec model of gensim with python

According to the Gensim Word2Vec, I can use the word2vec model in gensim package to calculate the similarity between 2 words.

e.g.

trained_model.similarity('woman', 'man')
0.73723527


However, the word2vec model fails to predict the sentence similarity. I find out the LSI model with sentence similarity in gensim, but, which doesn't seem that can be combined with word2vec model. The length of corpus of each sentence I have is not very long (shorter than 10 words). So, are there any simple ways to achieve the goal?

Answer

This is actually a pretty challenging problem that you are asking. Computing sentence similarity requires building a grammatical model of the sentence, understanding equivalent structures (e.g. "he walked to the store yesterday" and "yesterday, he walked to the store"), finding similarity not just in the pronouns and verbs but also in the proper nouns, finding statistical co-occurences / relationships in lots of real textual examples, etc.

The simplest thing you could try -- though I don't know how well this would perform and it would certainly not give you the optimal results -- would be to first remove all "stop" words (words like "the", "an", etc. that don't add much meaning to the sentence) and then run word2vec on the words in both sentences, sum up the vectors in the one sentence, sum up the vectors in the other sentence, and then find the difference between the sums. By summing them up instead of doing a word-wise difference, you'll at least not be subject to word order. That being said, this will fail in lots of ways and isn't a good solution by any means (though good solutions to this problem almost always involve some amount of NLP, machine learning, and other cleverness).

So, short answer is, no, there's no easy way to do this (at least not to do it well).