Giora Simchoni - 1 year ago 109
R Question

# How to represent a new document with a TF-IDF document-term-matrix, and how to implement in production with a large matrix?

Suppose I have a document-term-matrix on bag-of-words representation of some documents, with TF-IDF weighting. E.g. in R:

```r
library(tm)
x <- c("a cat sat on a mat", "cat and dog are friends", "friends are sitting on a mat")
corpus <- Corpus(VectorSource(x))
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
inspect(dtm[1:3, ])
<<DocumentTermMatrix (documents: 3, terms: 8)>>
Non-/sparse entries: 12/12
Sparsity           : 50%
Maximal term length: 7
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

      Terms
Docs         and       are       cat       dog   friends       mat       sat   sitting
   1 0.0000000 0.0000000 0.1949875 0.0000000 0.0000000 0.1949875 0.5283208 0.0000000
   2 0.3169925 0.1169925 0.1169925 0.3169925 0.1169925 0.0000000 0.0000000 0.0000000
   3 0.0000000 0.1462406 0.0000000 0.0000000 0.1462406 0.1462406 0.0000000 0.3962406
```

Question 1:

How do I get the vector representation of a new document?

a) supposing all the document's tokens have columns in the matrix (e.g. "cat and dogs are friends on mat" in the above example), how do I calculate the IDF? That is, if `IDF_i = log(N / n_i)`, where `N` is the total number of documents and `n_i` is the number of documents containing token `i`, how is the IDF calculated for a new document?

b) when the new document contains tokens never encountered before (e.g. "cat and mouse are friends" in the above example), how is their TF-IDF calculated?
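For what it's worth, the usual convention answers both (a) and (b): the IDF weights are fitted once on the training corpus and simply reused, and unseen tokens are dropped because they have no column in the training term space. A minimal base-R sketch (the `train`/`vocab` objects are illustrative, mimicking the `tm` example above, where tokens shorter than 3 characters are dropped and IDF uses log base 2):

```r
# Training corpus, tokenized as tm would (tokens < 3 chars already dropped)
train <- list(c("cat", "sat", "mat"),
              c("cat", "and", "dog", "are", "friends"),
              c("friends", "are", "sitting", "mat"))

vocab <- sort(unique(unlist(train)))
N  <- length(train)
ni <- sapply(vocab, function(w) sum(sapply(train, function(d) w %in% d)))
idf <- log2(N / ni)          # fitted ONCE on the training corpus, then reused

# New document: "cat and mouse are friends"
new_doc <- c("cat", "and", "mouse", "are", "friends")
new_doc <- new_doc[new_doc %in% vocab]   # "mouse" is unseen -> dropped
tf <- table(factor(new_doc, levels = vocab)) / length(new_doc)
tfidf_vec <- as.numeric(tf) * idf        # vector in the training term space
```

One judgment call here is normalizing TF by the number of *kept* tokens; normalizing by the full document length is equally defensible.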

Question 2:

Now suppose the DTM is huge but sparse, say 100K documents × 200K terms, and the application needs the vector representation of each incoming document fast (I don't have an exact definition; say under 500 ms), e.g. for computing cosine distance between documents.

This is a production application, not necessarily in R. Is there a common way to store such large DTMs and project new documents into the same vector space? Do I have to keep the huge matrix on a server and query it for every incoming document, or is there some approximation or heuristic used in real-world big-data applications?
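To make the storage question concrete: projecting a new document does not actually require the 100K × 200K DTM at all, only the term-to-column map and the IDF weight vector, which are tiny by comparison. A sketch using the `Matrix` package (the `vocab_index` and `idf` values are illustrative, taken from the toy example above; `vectorize` is a hypothetical helper, not a library API):

```r
library(Matrix)

# All that production needs: term -> column index, plus one idf weight per term
vocab_index <- c(and = 1, are = 2, cat = 3, dog = 4,
                 friends = 5, mat = 6, sat = 7, sitting = 8)
idf <- c(1.585, 0.585, 0.585, 1.585, 0.585, 0.585, 1.585, 1.585)

vectorize <- function(tokens) {
  tokens <- tokens[tokens %in% names(vocab_index)]  # drop unseen terms
  counts <- table(tokens)
  j <- as.integer(vocab_index[names(counts)])
  x <- (as.numeric(counts) / length(tokens)) * idf[j]
  sparseVector(x = x, i = j, length = length(vocab_index))
}

v <- vectorize(c("cat", "and", "dog", "are", "friends"))
```

Cosine similarity against stored document vectors is then a sparse dot product, which is easily sub-millisecond at this vocabulary size.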

In this answer I will use the text2vec package (>= 0.4) instead of `tm`. I personally don't recommend `tm`, for many reasons: see the huge number of similar questions on SO. But I'm biased, since I'm the author of text2vec.
- `idf` is just a per-word scaling which we learn from the training data. You can apply exactly the same transformation to unseen data.
- You only have to keep the `idf` sparse diagonal matrix (or you can think of it as a vector of weights). You can easily achieve a response time of a few ms with either vocabulary-based vectorization or feature hashing. See the tutorial linked above.
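The fit-on-train, transform-on-new workflow described above looks roughly like this in text2vec (API as of text2vec 0.4; `x` is the character vector from the question):

```r
library(text2vec)

# Fit vocabulary and idf on the training documents
it_train   <- itoken(x, preprocessor = tolower, tokenizer = word_tokenizer)
vocab      <- create_vocabulary(it_train)
vectorizer <- vocab_vectorizer(vocab)
dtm_train  <- create_dtm(it_train, vectorizer)

tfidf <- TfIdf$new()
dtm_train_tfidf <- fit_transform(dtm_train, tfidf)  # learns idf here

# New documents reuse the SAME vectorizer and the SAME fitted idf weights;
# terms not in the training vocabulary simply get no column
it_new  <- itoken(c("cat and mouse are friends"),
                  preprocessor = tolower, tokenizer = word_tokenizer)
dtm_new <- create_dtm(it_new, vectorizer)
dtm_new_tfidf <- transform(dtm_new, tfidf)
```

Since `vectorizer` and `tfidf` are small objects, they are cheap to serialize and load in a production service; the training DTM itself never needs to be kept around for vectorizing new documents.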