B_Miner - 6 months ago
R Question

text2vec in R: transform new data?

There is documentation on creating a DTM (document-term matrix) with the text2vec package, for example the following, where a TF-IDF weighting is applied after building the matrix:

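library(text2vec)
data("movie_review")

N <- 1000
# iterator over the first N reviews, used to build the vocabulary
it <- itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer)
v <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(v)
# fresh iterator over the same reviews, used to build the DTM
it <- itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer)
dtm <- create_dtm(it, vectorizer)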
# get tf-idf matrix from bag-of-words matrix
dtm_tfidf <- transformer_tfidf(dtm)

It is common practice to build a DTM from a training dataset and use it as input to a model. When new data arrives (a test set), one needs to build a DTM for it with the same columns, meaning exactly the terms that were used in the training set. Is there any way in the package to transform a new dataset in this manner? (scikit-learn has a transform method for just this situation.)
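
To make the requirement concrete, here is a minimal base-R sketch of the fit/transform idea, independent of text2vec. The helper names `tokenize`, `fit_vocabulary` and `transform_dtm` are hypothetical and exist only for this illustration, and the IDF formula is the plain textbook one, which is not necessarily the variant text2vec uses.

```r
# Illustration only: hypothetical helpers, not part of text2vec.
tokenize <- function(docs) strsplit(tolower(docs), "[^a-z0-9]+")

# "fit": learn the vocabulary from the training documents only
fit_vocabulary <- function(docs) {
  tokens <- unlist(tokenize(docs))
  sort(unique(tokens[nzchar(tokens)]))
}

# "transform": count terms against a fixed vocabulary.
# Terms that are not in the vocabulary are silently dropped.
transform_dtm <- function(docs, vocab) {
  counts <- lapply(tokenize(docs), function(tok) {
    as.integer(table(factor(tok, levels = vocab)))
  })
  dtm <- do.call(rbind, counts)
  colnames(dtm) <- vocab
  dtm
}

train_docs <- c("A good movie", "A bad movie")
test_docs  <- c("Good plot, terrible acting")

vocab     <- fit_vocabulary(train_docs)
dtm_train <- transform_dtm(train_docs, vocab)
dtm_test  <- transform_dtm(test_docs, vocab)   # same columns as dtm_train

# Any weighting is learned on the training data and reused on the test data.
# Assumed textbook IDF: log(number of documents / document frequency).
idf            <- log(nrow(dtm_train) / colSums(dtm_train > 0))
dtm_test_tfidf <- sweep(dtm_test, 2, idf, `*`)
```

The test matrix ends up with the training columns in the training order, so a model fitted on `dtm_train` can score it directly. Words that only occur in the test set ("plot", "terrible", "acting") contribute nothing.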


Actually, when I started text2vec I had exactly that pipeline in mind from the beginning. We are now preparing a new release with updated documentation.

For v0.3 the following should work:

library(text2vec)
library(magrittr)  # provides %>%
data("movie_review")

train_rows = 1:1000
prepr = tolower
tok = word_tokenizer

it <- itoken(movie_review$review[train_rows], prepr, tok, ids = movie_review$id[train_rows])
v <- create_vocabulary(it) %>% 
  prune_vocabulary(term_count_min = 5)

vectorizer <- vocab_vectorizer(v)
it <- itoken(movie_review$review[train_rows], prepr, tok)
dtm_train <- create_dtm(it, vectorizer)
# get idf scaling from train data
idf = get_idf(dtm_train)
# create tf-idf
dtm_train_tfidf <- transform_tfidf(dtm_train, idf)

test_rows = 1001:2000
# create iterator
it <- itoken(movie_review$review[test_rows], prepr, tok, ids = movie_review$id[test_rows])
# create dtm using same vectorizer, but new iterator
dtm_test_tfidf <- create_dtm(it, vectorizer) %>% 
  # transform  tf-idf using idf from train data