user3554004 - 1 month ago
R Question

Compute unweighted bag-of-words based TCM using text2vec in R?

I am trying to compute a term-term co-occurrence matrix (TCM) from a corpus using the text2vec package in R (since it has a nice parallel backend). I followed this tutorial, but while inspecting some toy examples, I noticed that the create_tcm function applies some sort of scaling or weighting to the term-term co-occurrence values. I know it uses skip-grams internally, but the documentation does not mention how it scales them. Clearly, more distant terms/unigrams are weighted lower.

Here is an example:

library(text2vec)

tcmtest <- function(sentences) {
  tokens <- space_tokenizer(sentences)
  it <- itoken(tokens, progressbar = FALSE)
  # Unigram-only vocabulary
  vocab <- create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L))
  # Co-occurrence window of 5 tokens
  vectorizer <- vocab_vectorizer(vocab, grow_dtm = FALSE, skip_grams_window = 5L)
  return(create_tcm(it, vectorizer))
}

> tcmtest(c("a b", "a b c"))
3 x 3 sparse Matrix of class "dgTMatrix"
  b c   a
b . 1 2.0
c . . 0.5
a . . .

> tcmtest(c("a b", "c a b"))
3 x 3 sparse Matrix of class "dgTMatrix"
  b   c a
b . 0.5 2
c . .   1
a . .   .

> tcmtest(c("a b", "c a a a b"))
3 x 3 sparse Matrix of class "dgTMatrix"
  b    c        a
b . 0.25 2.833333
c . .    1.833333
a . .    .


Question: Is there any way to disable this behaviour, so that every term/unigram in the skip-gram window is treated equally? I.e., if a term occurs inside the context window of another term twice in a corpus, the TCM should say "2".

Bonus question: How does the default scaling work anyway? If you add more "a"s to the last example, the b-c value seems to decrease linearly, while the b-a value actually increases, even though the additional occurrences of "a" appear further away from "b".

Answer

The weighting function is defined here. If you need equal weight for every term within the window, you need to change the weighting function so that it always returns 1 (just clone the repo, change the function definition, and build the package from source with devtools or R CMD build):

inline float weighting_fun(uint32_t offset) {
  return 1.0;
}

However, several people have already asked for this feature, and I will probably include such an option in the next release.
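
As for the bonus question, the values printed in the question are consistent with each co-occurring pair contributing 1 / offset, where offset is the distance in tokens between the two terms. This is inferred from the example output above, not quoted from the package source. The base-R sketch below is not text2vec code: toy_tcm and its weight_fun argument are hypothetical names used only for illustration, and skipping same-term pairs is an assumption made to match the empty diagonal in the printed matrices.

```r
# Illustrative base-R sketch of a windowed term co-occurrence matrix.
# NOT text2vec's implementation; toy_tcm and weight_fun are hypothetical names.
# weight_fun maps the token distance (offset) to the weight added for a pair.
toy_tcm <- function(sentences, window = 5L,
                    weight_fun = function(offset) 1 / offset) {
  tokens <- strsplit(sentences, " ", fixed = TRUE)
  terms <- sort(unique(unlist(tokens)))
  tcm <- matrix(0, nrow = length(terms), ncol = length(terms),
                dimnames = list(terms, terms))
  for (sent in tokens) {
    n <- length(sent)
    for (i in seq_len(n)) {
      j <- i + 1L
      # Pair token i with every later token that lies inside the window
      while (j <= n && (j - i) <= window) {
        # Assumption: same-term pairs are skipped (empty diagonal)
        if (sent[i] != sent[j]) {
          w <- weight_fun(j - i)
          # Stored symmetrically here; the output above shows one triangle only
          tcm[sent[i], sent[j]] <- tcm[sent[i], sent[j]] + w
          tcm[sent[j], sent[i]] <- tcm[sent[j], sent[i]] + w
        }
        j <- j + 1L
      }
    }
  }
  tcm
}

weighted <- toy_tcm(c("a b", "c a a a b"))
weighted["b", "a"]  # 1 + 1 + 1/2 + 1/3 = 2.833333
weighted["b", "c"]  # 1/4 = 0.25

# Constant weight: plain counts of pairs inside the window
unweighted <- toy_tcm(c("a b", "c a a a b"), weight_fun = function(offset) 1)
unweighted["b", "a"]  # 4
```

With the 1 / offset weight, every extra "a" adds another positive term to the b-a entry, so the total grows even though each new term is smaller. The b-c entry is a single term, 1 / distance, so it shrinks as the two tokens move apart. Passing a weight_fun that always returns 1 gives the plain counts asked for in the question.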
