I came across the text2vec package today and it's exactly what I need for a particular problem. However, I haven't been able to figure out how to export a dtm created with text2vec to some kind of output file. My ultimate goal is to generate features in R using text2vec and import the resulting matrices into H2O for further modeling. H2O can read either CSV or SVMLight formats.
The first DTM I've created is:
987753 x 8806 sparse Matrix of class "dgCMatrix", with 3625049 entries
Several packages can do that. Take a look at https://github.com/Laurae2/sparsity, which IMHO is the most promising:
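library(text2vec)
library(sparsity)
library(magrittr)  # provides %>% in case text2vec does not re-export it

data("movie_review")
N = 5000

# lowercase and tokenize the first N reviews
tokens = movie_review$review[1:N] %>% tolower %>% word_tokenizer
it = itoken(tokens, progressbar = TRUE)

# build the document-term matrix with feature hashing (a dgCMatrix)
dtm = create_dtm(it, hash_vectorizer())

# write the matrix and its labels in SVMLight format; labels must match the N rows
write.svmlight(dtm, labelVector = movie_review$sentiment[1:N], file = "dtm.svmlight")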
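If installing sparsity from GitHub is not an option, the same file layout can be produced with the Matrix package alone. The following is only a rough sketch: write_svmlight_sketch is a hypothetical helper name (not part of text2vec, sparsity, or Matrix), it assumes a numeric sparse matrix such as a dgCMatrix, it assumes 1-based feature indices, and it is not tuned for speed on very large matrices.

```r
library(Matrix)

# Hypothetical helper: writes one "label index:value ..." line per row.
# Assumes x is a numeric sparse matrix (e.g. dgCMatrix) and that the
# consumer expects 1-based feature indices.
write_svmlight_sketch <- function(x, labels, file) {
  stopifnot(nrow(x) == length(labels))

  # Triplet form exposes 0-based row (i) and column (j) indices plus values (x)
  trip <- as(x, "TsparseMatrix")

  # Sort by row, then by column, so features are ascending within each line
  ord   <- order(trip@i, trip@j)
  rows  <- trip@i[ord] + 1L
  feats <- paste0(trip@j[ord] + 1L, ":", trip@x[ord])

  # Rows without any non-zero entry keep an empty feature string
  per_row <- character(nrow(x))
  grouped <- tapply(feats, rows, paste, collapse = " ")
  per_row[as.integer(names(grouped))] <- as.vector(grouped)

  # Prefix each line with its label; trim the trailing space on empty rows
  writeLines(trimws(paste(labels, per_row)), file)
}

# Usage on a toy 3 x 4 matrix with one empty row
toy <- sparseMatrix(i = c(1, 1, 3), j = c(2, 4, 1),
                    x = c(1, 2.5, 3), dims = c(3, 4))
out_file <- tempfile(fileext = ".svmlight")
write_svmlight_sketch(toy, labels = c(1, 0, 1), file = out_file)
```

For a DTM the size of the one in the question, a dedicated writer such as write.svmlight is likely to be considerably faster; the sketch is mainly useful for seeing what the format looks like.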