csmontt - 1 month ago
R Question

How to use the maxCount scheme in the quanteda package

My question is simple. The quanteda package in R has a function, tf(), for calculating the term frequency (tf) of a document-feature matrix (dfm). The description of the function under ?tf says it takes four arguments. My question is about the 'scheme' argument: I don't understand how to use the maxCount option, that is, how to use the maximum feature count per document as the divisor when normalizing the tf. Under 'Usage', the only options listed for the scheme argument are "count", "prop", "propmax", "boolean", "log", "augmented" and "logave". So how can I use the maxCount option?

Answer

The short answer is that this is a "bug" in the documentation (for quanteda 0.9.8.0-0.9.8.2): the maxCount option was removed from the function but not from the documentation. Its replacement is the "propmax" value of the scheme argument. To illustrate, start with a small dfm:

txt <- c(doc1 = "This is a simple, simple, simple document.",
         doc2 = "This document is a second document.")
(myDfm <- dfm(txt, verbose = FALSE))
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
##       features
## docs   this is a simple document second
##   doc1    1  1 1      3        1      0
##   doc2    1  1 1      0        2      1

Applying the weights:

tf(myDfm, scheme = "prop")
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
##       features
## docs        this        is         a    simple  document    second
##   doc1 0.1428571 0.1428571 0.1428571 0.4285714 0.1428571 0        
##   doc2 0.1666667 0.1666667 0.1666667 0         0.3333333 0.1666667

propmax is supposed to compute the proportion of each count relative to the most frequent count within each document. For doc1, for instance, the maximum feature count is 3 (for "simple"), so each count in that document should be divided by 3. However, in quanteda <= 0.9.8.2 there was a bug that caused it to compute this incorrectly:

tf(myDfm, scheme = "propmax")
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
##       features
## docs        this        is         a simple  document    second
##   doc1 1.0000000 1.0000000 1.0000000      3 1.0000000 0        
##   doc2 0.3333333 0.3333333 0.3333333      0 0.6666667 0.3333333

In quanteda v0.9.8.3, this is fixed:

tf(myDfm, scheme = "propmax")
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
##       features
## docs        this        is         a simple  document second
##   doc1 0.3333333 0.3333333 0.3333333      1 0.3333333    0  
##   doc2 0.5000000 0.5000000 0.5000000      0 1.0000000    0.5

Note: Fixed in 0.9.8.3.
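
As a cross-check on the corrected propmax output above, the same normalization can be sketched in base R without quanteda. This is only an illustration, not how tf() is implemented: the counts are copied by hand from the dfm printed earlier, and the names counts, row_max and propmax_manual are made up for this example.

```r
# Raw counts copied by hand from the dfm shown above (plain base R matrix)
counts <- matrix(c(1, 1, 1, 3, 1, 0,
                   1, 1, 1, 0, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(docs = c("doc1", "doc2"),
                                 features = c("this", "is", "a",
                                              "simple", "document", "second")))

# Maximum feature count within each document (row)
row_max <- apply(counts, 1, max)

# Divide every row by its own maximum; sweep() applies "/" along margin 1 (rows)
propmax_manual <- sweep(counts, 1, row_max, "/")
propmax_manual
```

Each row's most frequent feature ends up with weight 1 and every other feature is scaled relative to it, which is what the fixed tf(myDfm, scheme = "propmax") reports for doc1 and doc2.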
