Jakob - 1 month ago 12

R Question

Can anybody explain?

My understanding:

    tf >= 0 (absolute frequency value)
    tfidf >= 0 (for negative idf, tf=0)
    sparse entry = 0
    nonsparse entry > 0

So the exact sparse/nonsparse proportion should be the same in the two DTMs created with the code below.

    library(tm)
    data(crude)
    dtm <- DocumentTermMatrix(crude, control=list(weighting=weightTf))
    dtm2 <- DocumentTermMatrix(crude, control=list(weighting=weightTfIdf))
    dtm
    dtm2

But the non-/sparse counts differ (2255/23065 versus 2215/23105):

    > dtm
    <<DocumentTermMatrix (documents: 20, terms: 1266)>>
    Non-/sparse entries: 2255/23065
    Sparsity           : 91%
    Maximal term length: 17
    Weighting          : term frequency (tf)

    > dtm2
    <<DocumentTermMatrix (documents: 20, terms: 1266)>>
    Non-/sparse entries: 2215/23105
    Sparsity           : 91%
    Maximal term length: 17
    Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

Answer

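The sparsity can differ. The TF-IDF value will be zero if TF is zero or if IDF is zero, and IDF is zero if a term occurs in every document, since IDF = log2(N/N) = 0 when all N documents contain the term. Such a term has non-zero TF entries that become zero after weighting. Consider the following example: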

```
library(tm)

# Three tiny documents; "world" appears in all of them
txts <- c("super World", "Hello World", "Hello super top world")
corp <- Corpus(VectorSource(txts))

# Same corpus, two weightings
tf <- TermDocumentMatrix(corp, control = list(weighting = weightTf))
tfidf <- TermDocumentMatrix(corp, control = list(weighting = weightTfIdf))
inspect(tf)
# <<TermDocumentMatrix (terms: 4, documents: 3)>>
# Non-/sparse entries: 8/4
# Sparsity           : 33%
# Maximal term length: 5
# Weighting          : term frequency (tf)
#
#        Docs
# Terms   1 2 3
#   hello 0 1 1
#   super 1 0 1
#   top   0 0 1
#   world 1 1 1
inspect(tfidf)
# <<TermDocumentMatrix (terms: 4, documents: 3)>>
# Non-/sparse entries: 5/7
# Sparsity           : 58%
# Maximal term length: 5
# Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
#
#        Docs
# Terms           1         2         3
#   hello 0.0000000 0.2924813 0.1462406
#   super 0.2924813 0.0000000 0.1462406
#   top   0.0000000 0.0000000 0.3962406
#   world 0.0000000 0.0000000 0.0000000
```
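To see which non-zero entries disappear under TF-IDF, one can count them directly. The following is a minimal base-R sketch and not tm code: the matrix `tf_mat` is typed in by hand from the three example strings (lower-cased), and all variable names are made up for illustration.

```r
# Term counts for the three example strings (terms x documents),
# entered by hand so the sketch does not depend on the tm package.
tf_mat <- matrix(
  c(0, 1, 1,   # hello
    1, 0, 1,   # super
    0, 0, 1,   # top
    1, 1, 1),  # world
  nrow = 4, byrow = TRUE,
  dimnames = list(Terms = c("hello", "super", "top", "world"),
                  Docs  = c("1", "2", "3"))
)

n_docs       <- ncol(tf_mat)
doc_freq     <- rowSums(tf_mat > 0)   # number of documents containing each term
in_every_doc <- doc_freq == n_docs    # these terms have idf = log2(1) = 0

nonzero_tf    <- sum(tf_mat > 0)                               # non-zero TF entries
lost          <- sum(tf_mat[in_every_doc, , drop = FALSE] > 0) # entries zeroed by idf = 0
nonzero_tfidf <- nonzero_tf - lost

names(doc_freq)[in_every_doc]
# [1] "world"
c(tf = nonzero_tf, lost = lost, tfidf = nonzero_tfidf)
# tf = 8, lost = 3, tfidf = 5
```

The 5 remaining non-zero entries agree with the count reported by `inspect(tfidf)`. By the same reasoning, the gap in the question (2255 - 2215 = 40 entries) is what two terms present in all 20 documents would produce. That figure is inferred from the printed counts only and has not been checked against the `crude` data.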

The term *super* occurs 1 time in document 1, which has 2 terms, and it occurs in 2 out of 3 documents:

```
1/2 * log2(3/2)
# [1] 0.2924813
```

The term *world* occurs 1 time in document 3, which has 4 terms, and it occurs in all 3 documents:

```
1/4 * log2(3/3) # 1/4 * 0
# [1] 0
```
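The two hand calculations generalise to the whole matrix. Below is a rough base-R sketch under stated assumptions: `tfidf_by_hand` is a hypothetical helper, not a tm function, and the whitespace tokeniser is a simplified stand-in for tm's own term extraction. It applies the same rule as above (term count divided by document length, times log2 of documents over document frequency) to every cell.

```r
# Hypothetical helper (not part of tm): normalized tf-idf for a
# terms x documents count matrix. Assumes no empty documents and
# that every term occurs in at least one document.
tfidf_by_hand <- function(tf_mat) {
  doc_len <- colSums(tf_mat)                        # number of terms per document
  idf     <- log2(ncol(tf_mat) / rowSums(tf_mat > 0))  # log2(N / document frequency)
  rel_tf  <- sweep(tf_mat, 2, doc_len, "/")         # divide each column by its document length
  rel_tf * idf                                      # idf is recycled down the rows: one value per term
}

# Simplified tokenisation (an assumption): lower-case, split on single spaces.
txts   <- c("super World", "Hello World", "Hello super top world")
tokens <- strsplit(tolower(txts), " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# Count each term per document; result is a terms x documents matrix.
tf_mat <- sapply(tokens, function(tok) tabulate(match(tok, terms), nbins = length(terms)))
dimnames(tf_mat) <- list(Terms = terms, Docs = as.character(seq_along(txts)))

w <- tfidf_by_hand(tf_mat)
round(w, 7)
sum(w != 0)   # number of non-zero weighted entries
# [1] 5
```

The whole *world* row comes out as zero because its IDF is `log2(3/3)`, so every one of its non-zero counts turns into a sparse entry.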