gmazlami - 5 months ago 40

Java Question

My goals is to find a similarity value between two documents (collections of words). I have already found several answers like this SO post or this SO post which provide Python libraries that achieve this, but I have trouble understanding the approach and making it work for my use case.

If I understand correctly, TF-IDF of a document is computed with respect to a given term, right? That's how I interpret it from the Wikipedia article on this: "tf-idf...is a numerical statistic that is intended to reflect how important a word is to a document".

In my case, I don't have a specific search term which I want to compare to the document, but I have two different documents. I assume I need to first compute vectors for the documents, and then take the cosine between these vectors. But all the answers I found with respect to constructing these vectors always assume a search term, which I don't have in my case.

Can't wrap my head around this, any conceptual help or links to Java libraries that achieve this would be highly appreciated.

Answer

I suggest running *terminology extraction* first, together with their frequencies. Note that stemming can also be applied to the extracted terms to avoid noise in during the subsequent cosine similarity calculation. See Java library for keywords extraction from input text SO thread for more help and ideas on that.

Then, as you yourself mention, for each of those terms, you will have to compute the TF-IDF values, get the vectors and compute the cosine similarity.

When calculating TF-IDF, mind that `1 + log(N/n)`

(*N* standing for the total number of corpora and `n`

standing for the number of corpora that include the term) formula is better since it avoids the issue when TF is not 0 and IDF turns out equal to 0.