Suppose I have a document-term matrix built on a bag-of-words representation of some documents, with TF-IDF weighting. E.g. in R:
library(tm)

x <- c("a cat sat on a mat", "cat and dog are friends", "friends are sitting on a mat")
corpus <- Corpus(VectorSource(x))
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
inspect(dtm)
<<DocumentTermMatrix (documents: 3, terms: 8)>>
Non-/sparse entries: 12/12
Sparsity           : 50%
Maximal term length: 7
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

    Terms
Docs       and       are       cat       dog   friends       mat       sat   sitting
   1 0.0000000 0.0000000 0.1949875 0.0000000 0.0000000 0.1949875 0.5283208 0.0000000
   2 0.3169925 0.1169925 0.1169925 0.3169925 0.1169925 0.0000000 0.0000000 0.0000000
   3 0.0000000 0.1462406 0.0000000 0.0000000 0.1462406 0.1462406 0.0000000 0.3962406
IDF_i = log(N / n_i), where N is the number of documents and n_i is the number of documents that contain term i.
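Note that tm's weightTfIdf actually uses a base-2 logarithm and normalizes the term frequency by document length; for example, the entry for "sat" in document 1 can be reproduced by hand like this:

N  <- 3      # total number of documents
ni <- 1      # number of documents containing "sat"
tf <- 1 / 3  # "sat" occurs once among the 3 indexed terms of doc 1
tf * log2(N / ni)
# [1] 0.5283208  -- matches the "sat" entry for doc 1 above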
In this answer I will use the text2vec package (>= 0.4) instead of tm. I personally don't recommend using tm, for many reasons - see the huge number of similar questions on SO. But I'm biased, because I'm the author of text2vec.
For a full article that covers all your questions, check this tutorial.
Here are the answers in plain English:
idf is just a per-word scaling which we learn from the training data. You can apply exactly the same transformation to unseen data; see the sketch after this list.
You have to keep the idf sparse diagonal matrix (or you can think of it as a vector of weights). You can probably easily achieve a response time of a few ms with either vocabulary-based vectorization or feature hashing. See the tutorial at the link above.
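Here is a minimal sketch of that workflow (the texts and variable names train_txt / new_txt are placeholders; the tutorial covers the details):

library(text2vec)

train_txt <- c("a cat sat on a mat", "cat and dog are friends")
new_txt   <- c("friends are sitting on a mat")

# Fit the vocabulary and the tf-idf model on the training data only
it_train   <- itoken(train_txt, tolower, word_tokenizer)
vocab      <- create_vocabulary(it_train)
vectorizer <- vocab_vectorizer(vocab)   # or hash_vectorizer() for feature hashing
dtm_train  <- create_dtm(it_train, vectorizer)

tfidf           <- TfIdf$new()
dtm_train_tfidf <- fit_transform(dtm_train, tfidf)  # learns and stores the idf weights

# Reuse the stored idf weights on unseen documents
it_new        <- itoken(new_txt, tolower, word_tokenizer)
dtm_new_tfidf <- transform(create_dtm(it_new, vectorizer), tfidf)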