000andy8484 - 3 months ago
R Question

Create dfm step by step with quanteda

I want to analyze a big (n=500,000) corpus of documents. I am using quanteda in the expectation that it will be faster than tm_map() from tm. I want to proceed step by step instead of using the automated way with dfm(). I have two reasons for this: in one case, I don't want to tokenize before removing stopwords, as this would result in many useless bigrams; in another, I have to preprocess the text with language-specific procedures.

I would like this sequence to be implemented:

1) remove the punctuation and numbers

2) remove stopwords (i.e. before the tokenization to avoid useless tokens)

3) tokenize using unigrams and bigrams

4) create the dfm

My attempt:

> library(quanteda)
> packageVersion("quanteda")
[1] ‘0.9.8’
> text <- ie2010Corpus$documents$texts
> text.corpus <- quanteda:::corpus(text, docnames=rownames(ie2010Corpus$documents))

> class(text.corpus)
[1] "corpus" "list"

> stopw <- c("a","the", "all", "some")
> TextNoStop <- removeFeatures(text.corpus, features = stopw)
# Error in UseMethod("selectFeatures") :
# no applicable method for 'selectFeatures' applied to an object of class "c('corpus', 'list')"

# This is how I would theoretically continue:
> token <- tokenize(TextNoStop, removePunct=TRUE, removeNumbers=TRUE)
> token2 <- ngrams(token,c(1,2))


Bonus question
How do I remove sparse tokens in quanteda (i.e. the equivalent of removeSparseTerms() in tm)?




UPDATE
In light of @Ken's answer, here is the code to proceed step by step with quanteda:

library(quanteda)
packageVersion("quanteda")
[1] ‘0.9.8’


1) Remove custom punctuation and numbers. E.g. notice the "\n" in the ie2010 corpus:

text.corpus <- ie2010Corpus
texts(text.corpus)[1] # Use texts() to extract the text
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery.\nIt is

texts(text.corpus)[1] <- gsub("\\s", " ", texts(text.corpus)[1]) # replace each whitespace character (incl. \n, \t, \r) with a space
texts(text.corpus)[1]
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery. It is of e


A further note on why one may prefer to preprocess. My present corpus is in Italian, a language in which articles are attached to the following word with an apostrophe. Thus, calling dfm() directly can lead to inexact tokenization, e.g.:

broken.tokens <- dfm(corpus(c("L'abile presidente Renzi. Un'abile mossa di Berlusconi")), removePunct = TRUE)


This produces two separate tokens for the same word ("un'abile" and "l'abile"), hence the need for an additional step with gsub() here.
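
A minimal base R sketch of such a gsub() step, assuming that only the elided articles "l'" and "un'" need to be stripped. The pattern and the object names are illustrative and are not part of quanteda:

```r
# Illustrative only: strip the elided articles "l'" and "un'" before tokenizing.
italian <- "L'abile presidente Renzi. Un'abile mossa di Berlusconi"

# \\b anchors the article at a word boundary, so forms such as "dell'" are left alone;
# ignore.case = TRUE also catches the capitalised "L'" and "Un'".
cleaned <- gsub("\\b(l|un)'", "", italian, ignore.case = TRUE, perl = TRUE)
cleaned
```

The cleaned string can then be assigned back with texts(text.corpus) <- as in step 1. A real corpus would probably need a longer list of articles and would have to handle typographic apostrophes as well.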

2) In quanteda it is not possible to remove stopwords directly from the text before tokenization. In my previous example "l" and "un" have to be removed so as not to produce misleading bigrams. This can be handled in tm with tm_map(..., removeWords).

3) Tokenization

token <- tokenize(text.corpus[1], removePunct=TRUE, removeNumbers=TRUE, ngrams = 1:2)


4) Create the dfm:

text.dfm <- dfm(token)


5) Remove sparse features

text.dfm <- trim(text.dfm, minCount = 5)

Answer

We designed dfm() not as a "black box" but more as a Swiss army knife that combines many of the options that typical users want to apply when converting their texts to a matrix of documents and features. However all of these options are also available through lower-level processing commands, should you wish to exert a finer level of control.

However one of the design principles of quanteda is that text only becomes "features" through the process of tokenisation. If you have a set of tokenised features that you wish to exclude, you must first tokenise your text, or you cannot exclude them. Unlike other text packages for R (e.g. tm), these steps are applied "downstream" from a corpus, so that the corpus remains an unprocessed set of texts to which manipulations will be applied (but will not itself be a transformed set of texts). The purpose of this is to preserve generality but also to promote reproducibility and transparency in text analysis.

In response to your questions:

  1. You can, however, override our encouraged behaviour using the texts(myCorpus) <- replacement function: whatever is assigned replaces the existing texts. So you could use regular expressions to remove punctuation and numbers, for example with the stringi commands and the Unicode classes for punctuation and numerals to identify the patterns.

  2. I would recommend that you tokenise before removing stopwords. Stop "words" are tokens, so there is no way to remove them from the text before you tokenise it. Even applying regular expressions to replace them with "" involves specifying some form of word boundary in the regex, and that, again, is tokenisation.

  3. To tokenise into unigrams and bigrams:

    tokenize(myCorpus, ngrams = 1:2)

  4. To create the dfm, simply call dfm(myTokens). (You could also have applied step 3, for ngrams, at this stage.)
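
The order recommended above (clean the text, tokenise, drop stopwords, then form n-grams) can be sketched in base R. This is only an illustration of the logic, not quanteda's implementation: the POSIX classes [:punct:] and [:digit:] stand in for the stringi Unicode classes mentioned in point 1, and the sample sentence and all object names are made up for the example.

```r
txt   <- "The budget of 2010: all the cuts, and some of the taxes!"
stopw <- c("a", "the", "all", "some")

# 1. Strip punctuation and digits with a regular expression.
clean <- gsub("[[:punct:][:digit:]]+", " ", tolower(txt))

# 2. Tokenise on whitespace, then drop stopwords (by now they are tokens).
toks <- strsplit(trimws(clean), "\\s+")[[1]]
toks <- toks[!toks %in% stopw]

# 3. Build bigrams from the remaining tokens and combine them with the unigrams.
bigrams  <- paste(head(toks, -1), tail(toks, -1), sep = "_")
features <- c(toks, bigrams)

# 4. A one-document stand-in for a dfm: counts per feature.
table(features)
```

Note that bigrams built after stopword removal pair words that were not adjacent in the original text (here "of_cuts"), which may or may not be what you want.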

Bonus 1: n=2 collocations produces the same list as bigrams, except in a different format. Did you intend something else? (Separate SO question perhaps?)

Bonus 2: See trim(x, sparsity = ). The removeSparseTerms() options are quite confusing to most people, but this is included for migrants from tm. See this post for a full explanation.
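
As a rough illustration of what a sparsity threshold means, the sketch below applies one to a small hand-made count matrix in base R. It mimics the idea of dropping features that are absent from too large a share of documents; it is not the code behind trim() or removeSparseTerms(), whose handling of the boundary case may differ. The matrix and the 0.5 threshold are invented for the example.

```r
# Toy document-feature counts: 4 documents, 3 features.
m <- matrix(c(2, 0, 1,
              1, 0, 1,
              3, 1, 0,
              1, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4),
                            c("budget", "recovery", "april")))

# Share of documents in which each feature does NOT occur.
sparsity <- colMeans(m == 0)

# Keep the features whose sparsity does not exceed the (assumed) threshold.
kept <- m[, sparsity <= 0.5, drop = FALSE]
colnames(kept)
```

Here "recovery" occurs in only one of four documents (sparsity 0.75) and is dropped, while "budget" and "april" are kept.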

BTW: Use texts() instead of ie2010Corpus$documents$texts -- we will rewrite the object structure of a corpus soon, so you should not access its internals this way when there is an extractor function. (Also, this step is unnecessary - here you have simply recreated the corpus.)