I am new to this, so it would be helpful if someone could point me in the right direction or help me with a tutorial.
Given a sentence and a list of other sentences (English):
s = "This concept of distance is not restricted to two dimensions."
list_s = ["It is not difficult to imagine the figure above translated into three dimensions.", "We can persuade ourselves that the measure of distance extends to an arbitrary number of dimensions;"]
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(norm='l2', min_df=0, use_idf=True, smooth_idf=False, sublinear_tf=True, tokenizer=tokenize)  # tokenize: a custom tokenizer defined elsewhere
bow_matrix = tfidf.fit_transform([s, ' '.join(list_s)])
First, you might want to transform your documents as follows:
X = tfidf.fit_transform([s] + list_s) # now X will have 3 rows
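As a minimal sketch of that step (using scikit-learn's default tokenizer instead of a custom tokenize function, since that function isn't shown here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

s = "This concept of distance is not restricted to two dimensions."
list_s = [
    "It is not difficult to imagine the figure above translated into three dimensions.",
    "We can persuade ourselves that the measure of distance extends to an arbitrary number of dimensions;",
]

# One row per document: s plus each candidate sentence
tfidf = TfidfVectorizer(norm='l2', sublinear_tf=True)
X = tfidf.fit_transform([s] + list_s)
print(X.shape[0])  # 3 rows, one per document
```

Each row of X is the l2-normalized tf-idf vector of one document over the shared vocabulary.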
What's next? You have to find the cosine similarity between each pair of rows of the tf-idf matrix. See this post on how to do that. For intuition, you can calculate the distance between the first two rows:

from scipy.spatial.distance import cosine
cosine(X.toarray()[0], X.toarray()[1]) # cosine distance between s and the 1st sentence

Note that scipy's cosine returns a distance, not a similarity; the similarity is 1 minus this value.
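Put concretely, a self-contained sketch of the distance computation (remember scipy's cosine is a distance, so the similarity is one minus it):

```python
from scipy.spatial.distance import cosine
from sklearn.feature_extraction.text import TfidfVectorizer

s = "This concept of distance is not restricted to two dimensions."
list_s = [
    "It is not difficult to imagine the figure above translated into three dimensions.",
    "We can persuade ourselves that the measure of distance extends to an arbitrary number of dimensions;",
]

X = TfidfVectorizer().fit_transform([s] + list_s)
dense = X.toarray()

# scipy's cosine() is a *distance*: 1 - cosine similarity
dist_to_first = cosine(dense[0], dense[1])
similarity = 1.0 - dist_to_first
print(similarity)
```

Since tf-idf vectors are non-negative, the similarity falls between 0 and 1: higher means the sentences share more (weighted) vocabulary.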
I would suggest transforming the whole corpus to a tf-idf matrix, since the model will also generate the vocabulary, i.e. your vectors will correspond to this dictionary. You shouldn't transform only 2 sentences.
You can remove stopwords by adding stop_words='english' when you create the tf-idf model (i.e. tfidf = TfidfVectorizer(..., stop_words='english')).
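To see the effect, you can inspect the fitted model's vocabulary_ attribute, which maps each retained term to its column index:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "This concept of distance is not restricted to two dimensions.",
    "It is not difficult to imagine the figure above translated into three dimensions.",
]

tfidf = TfidfVectorizer(stop_words='english')
tfidf.fit(docs)

# English stopwords such as 'of' and 'the' never enter the vocabulary
print('of' in tfidf.vocabulary_)        # False
print('distance' in tfidf.vocabulary_)  # True
```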
For stemming, you might consider nltk to create a stemmer. Here is a simple way to stem your texts (note that you might also want to remove punctuation before stemming):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem(text):
    text_stem = [stemmer.stem(token) for token in text.split(' ')]
    text_stem_join = ' '.join(text_stem)
    return text_stem_join

list_s_stem = list(map(stem, list_s)) # map the stem function over the list of documents
Now, you can use this list_s_stem in the TfidfVectorizer instead of list_s.
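Putting the pieces together, here is a sketch of the full pipeline: stem, vectorize, then score s against every other sentence. It uses sklearn's cosine_similarity, which works directly on the sparse matrix (no toarray() needed); nltk must be installed for the stemmer.

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

s = "This concept of distance is not restricted to two dimensions."
list_s = [
    "It is not difficult to imagine the figure above translated into three dimensions.",
    "We can persuade ourselves that the measure of distance extends to an arbitrary number of dimensions;",
]

stemmer = PorterStemmer()

def stem(text):
    # Stem each whitespace-separated token and rejoin
    return ' '.join(stemmer.stem(tok) for tok in text.split())

docs = [stem(doc) for doc in [s] + list_s]

tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(docs)

# Similarity of s (row 0) to every other sentence, one score per sentence
sims = cosine_similarity(X[0], X[1:])[0]
print(sims)
```

The sentence in list_s with the highest score is the most similar to s under this model.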