Feng Chen Feng Chen - 23 days ago 17
R Question

Does tm automatically ignore the very short strings?

Here is my code:
example 1:

a <- c("ab cd de","ENERGIZER A23 12V ALKALINE BATTERi")
a1 <- VCorpus(VectorSource(a))
a2 <- TermDocumentMatrix(a1,control = list(stemming=T))
inspect(a2)


The result is:

Docs
Terms 1 2
12v 0 1
a23 0 1
alkalin 0 1
batteri 0 1
energ 0 1


Looks like the first string in a is ignored.

example 2

a <- c("abcd cde de","ENERGIZER A23 12V ALKALINE BATTERi")
a1 <- VCorpus(VectorSource(a))
a2 <- TermDocumentMatrix(a1,control = list(stemming=T))
inspect(a2)


The result is:

Docs
Terms 1 2
12v 0 1
a23 0 1
abcd 1 0
alkalin 0 1
batteri 0 1
cde 1 0
energ 0 1


We can see two sub-strings (abcd, cde) are kept while the shorest one (de) is still missing. The situation is the same if I do not use control = list(stemming=T). So, I am curious if this is a sort of definition in tm? The strings will be ignored if it is less than 3 letters? I do not think this is a good idea. It is very possible that a string is useful even it is short such as abbreviation.

If so, is there a parameter or something that can change this? Thanks a lot.

Answer

See ?termFreq. The option you have to set is wordLengths. From the doc:

An integer vector of length 2. Words shorter than the minimum word length ‘wordLengths[1]’ or longer than the maximum word length ‘wordLengths[2]’ are discarded. Defaults to ‘c(3, Inf)’, i.e., a minimum word length of 3 characters.

So, if you don't want to exclude short words you can:

a2 <- TermDocumentMatrix(a1,control = list(stemming=T,wordLengths=c(1,Inf)))
inspect(a2)
         Docs
Terms     1 2
  12v     0 1
  a23     0 1
  ab      1 0
  alkalin 0 1
  batteri 0 1
  cd      1 0
  de      1 0
  energ   0 1
Comments