snowneji - 4 months ago
R Question

R: find ngram using dfm when there are multiple sentences in one document

I have a big dataset (>1 million rows) where each row is a multi-sentence text. For example, the following is a sample of two rows:

mydat <- data.frame(text=c('I like apple. Me too','One two. Thank you'),stringsAsFactors = F)


What I am trying to do is extract the bigram terms in each row (the "." should separate ngram terms). If I simply use the dfm function:

library(quanteda)

# build a document-feature matrix of bigrams
mydfm <- dfm(mydat$text, toLower = TRUE, removePunct = FALSE, ngrams = 2)
dtm <- as.DocumentTermMatrix(mydfm)
txt_data <- as.data.frame(as.matrix(dtm))


These are the terms I got:

"i_like" "like_apple" "apple_." "._me" "me_too" "one_two" "two_." "._thank" "thank_you"


This is what I expect: basically, "." is skipped and used to separate the terms:

"i_like" "like_apple" "me_too" "one_two" "thank_you"


I believe writing slow loops could solve this as well, but given that it is a huge dataset I would prefer an efficient approach similar to dfm() in quanteda. Any suggestions would be appreciated!

Answer

If your goal is just to extract those bigrams, you can call tokenize() twice: once to tokenize into sentences, then again to make the ngrams within each sentence.

library(quanteda)
tokenize(
  tokenize(mydat$text, what = "sentence", simplify = TRUE),
  ngrams = 2,
  removePunct = TRUE,
  simplify = TRUE)
#[1] "I_like"     "like_apple" "Me_too"     "One_two"    "Thank_you"

Wrap the whole thing in toLower() if you like.
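For example, here is a sketch of the full pipeline with the toLower() wrapper applied, assuming the same (older) quanteda API used above; in current quanteda releases, tokenize() has been superseded by tokens() and toLower() by char_tolower():

```r
library(quanteda)

mydat <- data.frame(text = c('I like apple. Me too', 'One two. Thank you'),
                    stringsAsFactors = FALSE)

# split each document into sentences, then build bigrams per sentence,
# so no bigram spans a sentence boundary; lowercase the result
bigrams <- toLower(
  tokenize(
    tokenize(mydat$text, what = "sentence", simplify = TRUE),
    ngrams = 2,
    removePunct = TRUE,
    simplify = TRUE))
bigrams
# [1] "i_like"     "like_apple" "me_too"     "one_two"    "thank_you"
```

Because simplify = TRUE flattens the result into a single character vector, the row grouping is lost; drop simplify = TRUE on the outer call if you need to keep bigrams grouped per sentence.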
