Chelsea_cole - 4 months ago
Python Question

# Compute Pairwise Cosine Similarity using scikit-learn

I am new to this, so it would be helpful if someone could point me in right direction/help me with some tutorial.
Given a sentence and a list of other sentences (English):

```python
s = "This concept of distance is not restricted to two dimensions."
list_s = ["It is not difficult to imagine the figure above translated into three dimensions.", "We can persuade ourselves that the measure of distance extends to an arbitrary number of dimensions;"]
```

I want to compute pairwise cosine similarity between each sentence in the list and sentence s, then find the max value.

What I've got so far:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(norm='l2', min_df=0, use_idf=True, smooth_idf=False, sublinear_tf=True, tokenizer=tokenize)
bow_matrix = tfidf.fit_transform([s, ' '.join(list_s)])
```

Thanks!

First, you might want to transform your documents as follows:

```python
X = tfidf.fit_transform([s] + list_s)  # now X will have 3 rows
```
1. What's next? You have to find the cosine similarity between each row of the tf-idf matrix. See this post on how to do that. For intuition, you can calculate the distance between `s` and each sentence in `list_s` using `cosine` distance.

```python
from scipy.spatial.distance import cosine
# scipy's cosine() is a *distance* (1 - similarity) and expects 1-D arrays,
# so flatten the sparse rows with ravel()
cosine(X[0].toarray().ravel(), X[1].toarray().ravel())  # cosine distance between s and the 1st sentence
```
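Putting this together, here is a self-contained sketch (using the sentences from the question) that computes the similarity of `s` against every sentence in the list and picks the maximum. Remember that scipy's `cosine` is a distance, so similarity is `1 - distance`:

```python
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.feature_extraction.text import TfidfVectorizer

s = "This concept of distance is not restricted to two dimensions."
list_s = ["It is not difficult to imagine the figure above translated into three dimensions.",
          "We can persuade ourselves that the measure of distance extends to an arbitrary number of dimensions;"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform([s] + list_s)  # row 0 is s, rows 1.. are list_s

# similarity = 1 - cosine distance; flatten each sparse row to 1-D first
sims = [1 - cosine(X[0].toarray().ravel(), X[i].toarray().ravel())
        for i in range(1, X.shape[0])]

best = int(np.argmax(sims))
print(best, sims[best])  # index of the most similar sentence and its similarity
```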
2. I would suggest transforming the whole corpus into a tf-idf matrix, since the model also generates the vocabulary, i.e. your vectors will correspond to this dictionary. You shouldn't transform only 2 documents.

3. You can remove stopwords by adding `stop_words='english'` when you create tf-idf model (i.e. `tfidf = TfidfVectorizer(..., stop_words='english')`).
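To see the effect of `stop_words='english'`, you can compare the learned vocabularies with and without it (a small illustration, using the sentence from the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["This concept of distance is not restricted to two dimensions."]

plain = TfidfVectorizer().fit(docs)
no_stop = TfidfVectorizer(stop_words='english').fit(docs)

print(sorted(plain.vocabulary_))    # includes function words like 'is', 'of', 'to'
print(sorted(no_stop.vocabulary_))  # stopwords dropped, content words kept
```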

For stemming, you might consider `nltk` to create a stemmer. Here is a simple way to stem your texts (note that you might also want to remove punctuation before stemming):

```python
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

def stem(text):
    text_stem = [stemmer.stem(token) for token in text.split(' ')]
    text_stem_join = ' '.join(text_stem)
    return text_stem_join

list_s_stem = list(map(stem, list_s))  # map the stem function over the list of documents
```

Now, you can pass this `list_s_stem` to `TfidfVectorizer` instead of `list_s`.
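As a final end-to-end sketch combining the steps above: stem everything, vectorize, and find the most similar sentence. Here I use `cosine_similarity` from `sklearn.metrics.pairwise`, which accepts sparse matrices directly (no `toarray()` needed) and returns similarities rather than distances:

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stemmer = PorterStemmer()

def stem(text):
    return ' '.join(stemmer.stem(token) for token in text.split(' '))

s = "This concept of distance is not restricted to two dimensions."
list_s = ["It is not difficult to imagine the figure above translated into three dimensions.",
          "We can persuade ourselves that the measure of distance extends to an arbitrary number of dimensions;"]

# stem the query and the corpus, then fit tf-idf on everything together
X = TfidfVectorizer().fit_transform([stem(s)] + [stem(t) for t in list_s])

# row 0 (s) against all remaining rows; result is a (1, n) array of similarities
sims = cosine_similarity(X[0], X[1:]).ravel()
print(sims.argmax(), sims.max())  # index and value of the best match
```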