csmontt - 8 months ago 41

R Question

My question is simple: the quanteda package in R has a function for calculating the term frequency (tf) of a document-feature matrix (dfm). The help page for the tf function (?tf) says it has four arguments. My question is about the 'scheme' argument. I don't understand how to use the maxCount option, that is, how to use the maximum feature count per document as the divisor when normalizing the tf. Under 'Usage', the only options listed for the scheme argument are "count", "prop", "propmax", "boolean", "log", "augmented" and "logave". So how can I use the maxCount option?

Answer

The short answer is that this is a "bug" in the documentation (for quanteda 0.9.8.0-0.9.8.2): that option was removed from the function but not from the documentation. The new syntax is the `"propmax"` value of the `scheme` argument. To illustrate, take this small dfm:

```
library(quanteda)

# Two short documents; "simple" occurs three times in doc1
txt <- c(doc1 = "This is a simple, simple, simple document.",
         doc2 = "This document is a second document.")
(myDfm <- dfm(txt, verbose = FALSE))
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
##       features
## docs   this is a simple document second
##   doc1    1  1 1      3        1      0
##   doc2    1  1 1      0        2      1
```

Applying the weights:

```
tf(myDfm, scheme = "prop")
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
##       features
## docs        this        is         a    simple  document    second
##   doc1 0.1428571 0.1428571 0.1428571 0.4285714 0.1428571         0
##   doc2 0.1666667 0.1666667 0.1666667         0 0.3333333 0.1666667
```

`propmax` is supposed to express each count as a proportion of the most frequent feature's count within the same document. For doc1, for instance, the maximum feature count is 3, so each count in that document should be divided by 3. However, in quanteda <= 0.9.8.2 there was a **bug** that caused it to compute this **wrongly**:

```
tf(myDfm, scheme = "propmax")
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
##       features
## docs        this        is         a    simple  document    second
##   doc1 1.0000000 1.0000000 1.0000000         3 1.0000000         0
##   doc2 0.3333333 0.3333333 0.3333333         0 0.6666667 0.3333333
```

In quanteda v0.9.8.3, this is fixed:

```
tf(myDfm, scheme = "propmax")
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
##       features
## docs        this        is         a    simple  document    second
##   doc1 0.3333333 0.3333333 0.3333333         1 0.3333333         0
##   doc2 0.5000000 0.5000000 0.5000000         0 1.0000000       0.5
```

**Note**: Fixed in 0.9.8.3.
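
If you are stuck on an affected version, or just want to check the numbers, the two normalizations can be reproduced by hand in base R. This is only a sketch of the arithmetic and not how quanteda implements it. The `counts` matrix is typed in from the dfm printed earlier, and the names `counts`, `prop` and `propmax` are just illustrative.

```r
# Raw counts copied from the example dfm (rows = documents, columns = features)
counts <- matrix(
  c(1, 1, 1, 3, 1, 0,
    1, 1, 1, 0, 2, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(docs = c("doc1", "doc2"),
                  features = c("this", "is", "a", "simple", "document", "second"))
)

# "prop": divide every count by the total count of its document.
# The divisor has one entry per row, so R recycles it down the rows.
prop <- counts / rowSums(counts)

# "propmax": divide every count by the largest count in its document
# (3 for doc1, 2 for doc2)
propmax <- counts / apply(counts, 1, max)

round(prop, 7)
round(propmax, 7)
```

The `propmax` values should match the corrected 0.9.8.3 output: the most frequent feature in each document gets 1 and every other feature is scaled relative to it.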

Source (Stackoverflow)