Chelsea_cole - 4 months ago 50x
Python Question

Compute Pairwise Cosine Similarity using scikit-learn

I am new to this, so it would be helpful if someone could point me in the right direction or to a tutorial.
Given a sentence and a list of other sentences (English):

s = "This concept of distance is not restricted to two dimensions."
list_s = ["It is not difficult to imagine the figure above translated into three dimensions.", "We can persuade ourselves that the measure of distance extends to an arbitrary number of dimensions;"]

I want to compute pairwise cosine similarity between each sentence in the list and sentence s, then find the max value.

What I've got so far:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(norm='l2', min_df=0, use_idf=True, smooth_idf=False, sublinear_tf=True, tokenizer=tokenize)
bow_matrix = tfidf.fit_transform([s, ' '.join(list_s)])

1. What's next?

2. Should we transform the whole corpus or just 2 sentences when computing pairwise cosine similarity?

3. How do I apply stopword removal and stemming here?



First, you might want to transform your documents as follows, so that each sentence gets its own row:

X = tfidf.fit_transform([s] + list_s) # now X will have 3 rows
  1. What's next?: Compute the cosine similarity between the rows of the tf-idf matrix. See this post on how to do that. For intuition, you can compute the cosine distance between s and each sentence in list_s; note that scipy's cosine returns a distance, so the similarity is 1 minus that value.

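    from scipy.spatial.distance import cosine
    # scipy's cosine() expects 1-D arrays and returns the cosine distance (1 - similarity)
    1 - cosine(X[0].toarray().ravel(), X[1].toarray().ravel()) # cosine similarity between s and 1st sentence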
  2. I would suggest transforming the whole corpus into the tf-idf matrix, since fitting the model also builds the vocabulary, i.e. the columns of each vector correspond to that dictionary. You shouldn't fit on only 2 sentences.

  3. You can remove stopwords by passing stop_words='english' when you create the tf-idf model (i.e. tfidf = TfidfVectorizer(..., stop_words='english')).
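
To compare s against every sentence at once and pick the maximum, something like the following should work. This is a minimal sketch, assuming scikit-learn's cosine_similarity from sklearn.metrics.pairwise and default vectorizer settings plus English stop words; the custom tokenizer from the question is left out, and the names sims and best are only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

s = "This concept of distance is not restricted to two dimensions."
list_s = [
    "It is not difficult to imagine the figure above translated into three dimensions.",
    "We can persuade ourselves that the measure of distance extends to an arbitrary number of dimensions;",
]

# Assumption: default settings plus English stop words (no custom tokenizer).
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform([s] + list_s)  # row 0 is s, rows 1.. are list_s

# Similarity of s (row 0, kept 2-D via slicing) against every row of list_s.
sims = cosine_similarity(X[0:1], X[1:]).ravel()

best = sims.argmax()  # index into list_s of the most similar sentence
print(sims)
print("max similarity:", sims[best], "->", list_s[best])
```

Because the rows are already l2-normalized by TfidfVectorizer (norm='l2' is the default), the cosine similarity here is just the dot product of the rows, which is why this stays cheap even for a larger corpus.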

For stemming, you might consider nltk to create a stemmer. Here is a simple way to stem your texts (note that you might also want to remove punctuation before stemming).

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

def stem(text):
    text_stem = [stemmer.stem(token) for token in text.split(' ')]
    text_stem_join = ' '.join(text_stem)
    return text_stem_join

list_s_stem = list(map(stem, list_s)) # map stem function to list of documents

Now you can pass list_s_stem to TfidfVectorizer instead of list_s (apply stem to s as well, so that the query and the list share the same stemmed vocabulary).
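
An alternative is to do the stemming inside the vectorizer, so that s and list_s are always processed the same way. The sketch below is one possible way to do it, not the only one: stem_tokens is a hypothetical helper, the regex that drops punctuation is an assumption, and stop words are filtered inside the helper (using scikit-learn's ENGLISH_STOP_WORDS) before stemming.

```python
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stemmer = PorterStemmer()

def stem_tokens(text):
    # Hypothetical helper: lowercase, keep alphabetic runs only (drops punctuation),
    # discard English stop words, then stem what is left.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(tok) for tok in tokens if tok not in ENGLISH_STOP_WORDS]

s = "This concept of distance is not restricted to two dimensions."
list_s = [
    "It is not difficult to imagine the figure above translated into three dimensions.",
    "We can persuade ourselves that the measure of distance extends to an arbitrary number of dimensions;",
]

# token_pattern=None because the custom tokenizer replaces the default pattern.
tfidf = TfidfVectorizer(tokenizer=stem_tokens, token_pattern=None)
X = tfidf.fit_transform([s] + list_s)  # s and list_s go through the same tokenizer

sims = cosine_similarity(X[0:1], X[1:]).ravel()  # s against each sentence in list_s
print(sims, "max:", sims.max())
```

Passing the helper as tokenizer keeps the preprocessing in one place, so there is no separate stemmed copy of the sentences to keep in sync.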