Giora Simchoni - 2 months ago
R Question

How do I represent a new document with a TF-IDF document-term matrix, and how do I implement this in production with a large matrix?

Suppose I have a document-term matrix built on a bag-of-words representation of some documents, with TF-IDF weighting. E.g., in R:

library(tm)
x <- c("a cat sat on a mat", "cat and dog are friends", "friends are sitting on a mat")
corpus <- Corpus(VectorSource(x))
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
inspect(dtm[1:3,])
<<DocumentTermMatrix (documents: 3, terms: 8)>>
Non-/sparse entries: 12/12
Sparsity : 50%
Maximal term length: 7
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)

    Terms
Docs       and       are       cat       dog   friends       mat       sat   sitting
   1 0.0000000 0.0000000 0.1949875 0.0000000 0.0000000 0.1949875 0.5283208 0.0000000
   2 0.3169925 0.1169925 0.1169925 0.3169925 0.1169925 0.0000000 0.0000000 0.0000000
   3 0.0000000 0.1462406 0.0000000 0.0000000 0.1462406 0.1462406 0.0000000 0.3962406


Question1:

How do I get the vector representation of a new document?

a) Supposing all of the document's tokens have columns in the matrix (e.g. "cat and dogs are friends on mat" in the above example), how do I calculate the IDF? That is, if IDF_i = log(N / n_i), where N is the total number of documents and n_i is the number of documents containing token i, how is the IDF calculated for a new document?
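To make the question concrete, here is a minimal base-R sketch of the quantities involved, using the three example documents above as the training corpus (variable names are illustrative, not from any package). The key point the answer below confirms is that the IDF weights come from the training corpus and are reused unchanged for a new document:

```r
# Training corpus: the same three documents as in the example above
train <- list(
  c("a", "cat", "sat", "on", "a", "mat"),
  c("cat", "and", "dog", "are", "friends"),
  c("friends", "are", "sitting", "on", "a", "mat")
)
vocab <- sort(unique(unlist(train)))

# IDF_i = log(N / n_i), where n_i = number of training docs containing token i
N   <- length(train)
n_i <- sapply(vocab, function(w) sum(sapply(train, function(d) w %in% d)))
idf <- log(N / n_i)

# For a new document we reuse the *training* IDF as-is; only TF is recomputed
new_doc <- c("cat", "and", "dog", "are", "friends", "on", "mat")
tf      <- sapply(vocab, function(w) sum(new_doc == w) / length(new_doc))
tfidf_new <- tf * idf  # vector representation of the new document
```

Note that nothing about the new document changes `idf`; it is a fixed per-term weight learned once.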

b) When the new document contains tokens never encountered before (e.g. "cat and mouse are friends" in the above example), how is their TF-IDF calculated?
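With vocabulary-based vectorization, the usual convention (and the one the answer below describes) is that out-of-vocabulary tokens get no column and simply drop out. A small self-contained sketch of that filtering step, using the example vocabulary:

```r
# Vocabulary fixed at training time (illustrative)
vocab <- c("a", "and", "are", "cat", "dog", "friends", "mat", "on", "sat", "sitting")

# "mouse" has no column in the training vocabulary, so it is simply ignored
new_doc <- c("cat", "and", "mouse", "are", "friends")
kept    <- new_doc[new_doc %in% vocab]   # tokens that have a column
dropped <- setdiff(new_doc, vocab)       # out-of-vocabulary tokens
```

Here `kept` contains "cat", "and", "are", "friends", and `dropped` contains only "mouse"; the new document's vector is built from `kept` alone.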

Question2:

Now suppose the DTM is huge, albeit sparse, say 100K documents x 200K words. A fast application requires getting the vector representation of each incoming document quickly (I don't have an exact definition; I'm talking less than 500 ms), e.g. for calculating cosine distance between documents.

This is a production application, not necessarily in R. Is there a common way to store such big DTM matrices and project new documents against them to get vectors? Do I have to store the huge matrix on a server and load it every time I want to query a document, or is there some approximation or heuristic for real-world Big Data applications?

Answer

In this answer I will use the text2vec package (>= 0.4) instead of tm. I personally don't recommend using tm, for many reasons - see the huge number of similar questions on SO. But I'm biased, since I'm the author of text2vec.

For a full article that covers all your questions, check this tutorial.

Here are the answers in plain English:

1. IDF is just a per-word scaling learned from the training data. You apply exactly the same transformation to unseen data; there is nothing to recompute for a new document. With vocabulary-based vectorization, tokens never encountered before are simply not considered when vectorizing new documents.
2. You only have to keep the IDF as a sparse diagonal matrix (or you can think of it as a vector of weights). You can easily achieve a response time of a few milliseconds with either vocabulary-based vectorization or feature hashing. See the tutorial linked above.
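To illustrate the "IDF as a sparse diagonal matrix" point: once the IDF weights are stored, projecting a new document reduces to one sparse matrix product, which stays fast even when the training DTM was 100K x 200K (you never need the training DTM itself at query time, only the vocabulary and the IDF vector). A minimal sketch with the Matrix package, which ships with R; the vocabulary and IDF values below are illustrative, taken from the small example corpus:

```r
library(Matrix)

# Vocabulary and per-term IDF learned at training time (illustrative values,
# computed as log(N / n_i) over the 3-document example corpus)
vocab <- c("and", "are", "cat", "dog", "friends", "mat", "sat", "sitting")
idf   <- log(3 / c(1, 2, 2, 1, 2, 2, 1, 1))
IDF   <- Diagonal(x = idf)  # sparse diagonal matrix of IDF weights

# Term frequencies of one incoming document as a 1 x |vocab| sparse row
new_doc <- c("cat", "and", "dog", "are", "friends")
counts  <- sapply(vocab, function(w) sum(new_doc == w))
tf_row  <- Matrix(counts / sum(counts), nrow = 1, sparse = TRUE)

# One sparse matrix product gives the TF-IDF row, ready for cosine distance
tfidf_row <- tf_row %*% IDF
```

For 200K terms, `IDF` and each document row stay sparse, so this product is a few arithmetic operations per non-zero term, well within a millisecond budget.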
