David David - 1 month ago

Cosine similarity using TFIDF

There are several questions on SO and the web describing how to take the cosine similarity between two strings, and even between two strings with TFIDF as weights. But the output of a function like scikit's linear_kernel confuses me a little.

Consider the following code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

a = ['hello world', 'my name is', 'what is your name?']
b = ['my name is', 'hello world', 'my name is what?']

df = pd.DataFrame(data={'a':a, 'b':b})
df['ab'] = df.apply(lambda x : x['a'] + ' ' + x['b'], axis=1)
print(df.head())

                    a                 b                                    ab
0         hello world        my name is                hello world my name is
1          my name is       hello world                my name is hello world
2  what is your name?  my name is what?  what is your name? my name is what?


Question:
I'd like to have a column that is the cosine similarity between the strings in a and the strings in b.

What I tried:

I fit a TFIDF vectorizer on ab, so as to include all the words:

clf = TfidfVectorizer(ngram_range=(1, 1), stop_words='english')
clf.fit(df['ab'])


I then got the sparse TFIDF matrices of both the a and b columns:

tfidf_a = clf.transform(df['a'])
tfidf_b = clf.transform(df['b'])


Now, if I use scikit's linear_kernel, which is what others recommended, I get back a Gram matrix of shape (n_samples_a, n_samples_b), as mentioned in the docs.

from sklearn.metrics.pairwise import linear_kernel
linear_kernel(tfidf_a,tfidf_b)

array([[ 0.,  1.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])


But what I need is a simple vector, where the first element is the cosine similarity between the first row of a and the first row of b, the second element is cos_sim(a[1], b[1]), and so forth.

Using python3, scikit-learn 0.17.

Answer

I think your example is falling down a bit because your TfidfVectorizer is filtering out the majority of your words: with stop_words='english' you have removed almost every word in the example. I've dropped that parameter and made your matrices dense so we can see what's happening. What if you did something like this?

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import spatial

a = ['hello world', 'my name is', 'what is your name?']
b = ['my name is', 'hello world', 'my name is what?']

df = pd.DataFrame(data={'a':a, 'b':b})
df['ab'] = df.apply(lambda x : x['a'] + ' ' + x['b'], axis=1)

clf = TfidfVectorizer(ngram_range=(1, 1))
clf.fit(df['ab'])

tfidf_a = clf.transform(df['a']).toarray()  # dense ndarray, so each row is 1-D
tfidf_b = clf.transform(df['b']).toarray()

# scipy's cosine is a distance, so similarity = 1 - distance
row_similarities = [1 - spatial.distance.cosine(tfidf_a[x], tfidf_b[x])
                    for x in range(len(tfidf_a))]
row_similarities

[0.0, 0.0, 0.72252389079716417]

This shows the similarity between each pair of rows. I'm not fully on board with how you're building the full corpus, but the example isn't optimized at all, so I'll leave that for now. Hope this helps.
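As an aside: since TfidfVectorizer L2-normalizes its rows by default, the Gram matrix from linear_kernel already is the cosine similarity matrix, and the row-by-row values you want are just its diagonal. scikit-learn also provides paired_cosine_distances, which compares row i with row i directly without ever building the full n-by-n matrix. A sketch of both on the same toy data (staying sparse throughout):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, paired_cosine_distances

a = ['hello world', 'my name is', 'what is your name?']
b = ['my name is', 'hello world', 'my name is what?']

df = pd.DataFrame(data={'a': a, 'b': b})
df['ab'] = df['a'] + ' ' + df['b']

clf = TfidfVectorizer(ngram_range=(1, 1))
clf.fit(df['ab'])

tfidf_a = clf.transform(df['a'])  # stays sparse
tfidf_b = clf.transform(df['b'])

# Option 1: rows are unit-length by default, so the Gram matrix equals the
# cosine similarity matrix; its diagonal pairs row i of a with row i of b.
diag_sims = linear_kernel(tfidf_a, tfidf_b).diagonal()

# Option 2: compare row i with row i directly, never materializing the
# full (n_samples, n_samples) matrix.
paired_sims = 1 - paired_cosine_distances(tfidf_a, tfidf_b)

df['cosine_sim'] = paired_sims
print(df[['a', 'b', 'cosine_sim']])
```

Both reproduce the row-wise similarities from the loop above, and the paired version scales better for large frames because it avoids the quadratic intermediate.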