Since there is no ready implementation of stopwords for Polish in quanteda, I would like to use my own list. I have it in a text file as a list separated by spaces. If need be, I can also prepare a list separated by new lines.
How can I remove the custom long list of stopwords from my corpus?
How can I do that after stemming?
I have tried various formats, for example converting the list to character vectors, like this:
stopwordsPL <- as.character(readtext("polish.stopwords.txt",encoding = "UTF-8"))
stopwordsPL <- read.txt("polish.stopwords.txt",encoding = "UTF-8",stringsAsFactors = F))
stopwordsPL <- dictionary(stopwordsPL)
remove = as.vector(stopwordsPL),
stem = FALSE,
remove_punct = TRUE,
dfm_trim(myStemMat, sparsity = stopwordsPL)
myStemMat <- dfm_remove(myStemMat,features = as.data.frame(stopwordsPL))
If your polish.stopwords.txt has one stopword per line, then you should be able to remove them from your corpus easily this way:
stopwordsPL <- readLines("polish.stopwords.txt", encoding = "UTF-8")
dfm(mycorpus, remove = stopwordsPL, stem = FALSE, remove_punct = TRUE, ngrams = c(1, 3))
The solution using readtext is not working because it reads in the entire file as one document. To get the individual words, you would need to tokenise it and coerce the tokens to character. readLines() is probably easier.
There is no need to create a dictionary from stopwordsPL either, since remove should take a character vector. Also, I am afraid there is no Polish stemmer implemented yet.
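Since the question mentions that the list is currently separated by spaces, here is a minimal sketch of reading such a file into a character vector with base R's scan(). The temporary file and its four sample words are made up for illustration; with the default sep = "", scan() splits on any whitespace, so the same call should also work for a list separated by new lines.

```r
# Hypothetical sample file standing in for polish.stopwords.txt
# (the words below are placeholders for illustration only)
tmp <- tempfile(fileext = ".txt")
writeLines("a aby ach acz", tmp)

# sep = "" splits on any whitespace (spaces, tabs, newlines);
# quote = "" stops apostrophes or quotes from being treated as delimiters
stopwordsPL <- scan(tmp, what = character(), sep = "", quote = "",
                    fileEncoding = "UTF-8", quiet = TRUE)

# stopwordsPL is now a plain character vector, one element per word
print(stopwordsPL)
```

The resulting vector can be passed wherever a character vector of stopwords is expected, in place of the readLines() result.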
Currently (v0.9.9-65) the feature removal in dfm() does not get rid of stop words that form bigrams. To override this, try:
# form the tokens, removing punctuation
mytoks <- tokens(mycorpus, remove_punct = TRUE)
# remove the Polish stopwords, leave pads
mytoks <- tokens_remove(mytoks, stopwordsPL, padding = TRUE)
## can't do this next one since no Polish stemmer in
## SnowballC::getStemLanguages()
# mytoks <- tokens_wordstem(mytoks, language = "polish")
# form the ngrams
mytoks <- tokens_ngrams(mytoks, n = c(1, 3))
# construct the dfm
dfm(mytoks)