Giora Simchoni - 10 months ago 51

R Question

Suppose I have a document-term matrix built from a bag-of-words representation of some documents, with TF-IDF weighting. E.g. in R:

`library(tm)`

x <- c("a cat sat on a mat", "cat and dog are friends", "friends are sitting on a mat")

corpus <- Corpus(VectorSource(x))

dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

inspect(dtm[1:3,])

<<DocumentTermMatrix (documents: 3, terms: 8)>>

Non-/sparse entries: 12/12

Sparsity : 50%

Maximal term length: 7

Weighting : term frequency - inverse document frequency (normalized) (tf-idf)

Terms

Docs and are cat dog friends mat sat sitting

1 0.0000000 0.0000000 0.1949875 0.0000000 0.0000000 0.1949875 0.5283208 0.0000000

2 0.3169925 0.1169925 0.1169925 0.3169925 0.1169925 0.0000000 0.0000000 0.0000000

3 0.0000000 0.1462406 0.0000000 0.0000000 0.1462406 0.1462406 0.0000000 0.3962406

Question1:

How do I get the vector representation of a new document?

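a) Supposing all the document's tokens have columns in the matrix (e.g. "cat and dogs are friends on mat" in the above example), how do I calculate the IDF? That is, if `IDFi = log(N/ni)`, where `N` is the number of documents and `ni` is the number of documents containing token `i`, how is the IDF calculated for a new document?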

b) When the new document contains tokens never encountered before (e.g. "cat and mouse are friends" in the above example), how is their TF-IDF calculated?

Question2:

Now suppose the DTM is huge albeit sparse, like 100K documents x 200K words, and a fast application needs the vector representation of each incoming document "fast" (I don't have an exact definition, but I'm talking less than 500 ms), e.g. for calculating cosine distance between documents.

This is a production application, not necessarily in R. Is there a common way to store such big DTM matrices and project documents to get vectors? Do I have to store the huge matrix somewhere on a server and extract it every time I want to query a document or is there some approximation, a heuristic, for Big Data real world applications?

Answer Source

In this answer I will use the text2vec package (**>= 0.4**) instead of `tm`. I personally don't recommend using `tm` for many reasons; see the huge number of similar questions on SO. But I'm biased, because I'm the author of text2vec.

For the full article, which covers all your questions, check this tutorial.

Here are the answers in plain English:

For Question 1, `idf` is just a per-word scaling which we got from the training data. You can apply exactly the same transformation to unseen data. In the case of vocabulary-based vectorization, for new documents we simply won't consider tokens never encountered before.
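To make the "same transformation" concrete, here is a minimal base-R sketch that learns the IDF weights once from the three example documents and then reuses them for a new document. It does not use `tm` or text2vec; the helpers `tokenize` and `vectorize` are hypothetical names introduced for illustration. It assumes `log2`, tokens of at least three characters, and term frequencies normalized by the number of in-vocabulary tokens, which reproduces the `tm` output shown in the question for the training documents.

```r
# Training documents from the question.
x <- c("a cat sat on a mat",
       "cat and dog are friends",
       "friends are sitting on a mat")

# Hypothetical helper: lowercase, split on whitespace, and keep tokens of
# at least 3 characters (an assumption that matches the terms shown above).
tokenize <- function(doc) {
  tokens <- strsplit(tolower(doc), "\\s+")[[1]]
  tokens[nchar(tokens) >= 3]
}

train_tokens <- lapply(x, tokenize)
vocab <- sort(unique(unlist(train_tokens)))

# Document frequency n_i and IDF_i = log2(N / n_i), computed once on training data.
n_docs   <- length(x)
doc_freq <- table(factor(unlist(lapply(train_tokens, unique)), levels = vocab))
idf      <- setNames(log2(n_docs / as.numeric(doc_freq)), vocab)

# Hypothetical helper: project any document onto the training vocabulary.
vectorize <- function(doc) {
  tokens <- tokenize(doc)
  # Tokens outside the vocabulary become NA and are dropped by table().
  counts <- as.numeric(table(factor(tokens, levels = vocab)))
  # Assumption: normalize by the number of known tokens (guarding against 0).
  tf <- counts / max(sum(counts), 1)
  setNames(tf * idf, vocab)
}

v_train <- vectorize(x[1])                        # row 1 of the training matrix
v_new   <- vectorize("cat and mouse are friends") # "mouse" is ignored
```

Note that `N` and `ni` are never updated by the new document: the stored `idf` vector is all that is needed at query time.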

For Question 2, you have to keep the `idf` as a sparse diagonal matrix (or you can think of it as a vector of weights). You can probably easily achieve a response time of a few `ms` with either vocabulary-based vectorization or feature hashing. See the tutorial linked above.
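As a rough sketch of what "keeping the `idf` as a sparse diagonal matrix" can look like, the following uses the `Matrix` package (a recommended package shipped with standard R installations). The vocabulary and weights are the toy values from the question; the helpers `project` and `cosine_to_all` are hypothetical names, and the code is not the text2vec implementation. It makes no claim about timing on a 100K x 200K matrix.

```r
library(Matrix)

# Illustrative values only: in practice vocab and idf come from the training corpus.
vocab    <- c("and", "are", "cat", "dog", "friends", "mat", "sat", "sitting")
idf      <- log2(3 / c(1, 2, 2, 1, 2, 2, 1, 1))
idf_diag <- Diagonal(x = idf)   # sparse diagonal matrix of per-term weights

# Hypothetical helper: tokenized document -> 1 x |vocab| sparse TF-IDF row.
project <- function(tokens) {
  idx <- match(tokens, vocab)
  idx <- idx[!is.na(idx)]       # drop tokens outside the vocabulary
  counts <- tabulate(idx, nbins = length(vocab))
  nz <- which(counts > 0)
  tf <- sparseMatrix(i = rep(1L, length(nz)), j = nz,
                     x = counts[nz] / max(sum(counts), 1),
                     dims = c(1L, length(vocab)))
  tf %*% idf_diag               # scale each column by its idf weight
}

# Hypothetical helper: cosine similarity of one projected row against stored rows.
# Returns NaN for a query with no in-vocabulary tokens (zero norm).
cosine_to_all <- function(query, stored) {
  dots  <- as.numeric(as.matrix(stored %*% t(query)))
  norms <- sqrt(rowSums(stored^2)) * sqrt(sum(query^2))
  dots / norms
}

# Stored corpus: the three training documents, already tokenized.
stored <- rbind(project(c("cat", "sat", "mat")),
                project(c("cat", "and", "dog", "are", "friends")),
                project(c("friends", "are", "sitting", "mat")))

sims <- cosine_to_all(project(c("cat", "and", "mouse", "are", "friends")), stored)
```

At query time only `vocab` and `idf` are needed to build the new document's vector; the stored matrix is touched only by the similarity step, which is a single sparse matrix-vector product.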