araspion - 3 months ago
Python Question

CountVectorizer in sklearn with only words above some minimum number of occurrences

I am using sklearn to train a logistic regression on some text data, using CountVectorizer to tokenize the data into unigrams and bigrams. I use a line of code like the one below:

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(1, 2), binary=True)


However, I'd like to keep only those bigrams in the resulting sparse matrix that occur more than some threshold number of times (e.g., 50) across all of my data. Is there a way to specify this?

Answer

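It looks like this can be solved by using CountVectorizer's min_df argument. Note that min_df is a document-frequency threshold: an integer value drops every term that appears in fewer than that many documents (a float is read as a proportion of documents). It therefore counts documents rather than total occurrences, and it applies to unigrams as well as bigrams. For the threshold of 50 from the question:

vect = CountVectorizer(ngram_range=(1, 2), binary=True, min_df=50)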
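
To make the distinction between document frequency (what min_df thresholds on) and total occurrence count concrete, here is a minimal standard-library sketch. It is not sklearn's implementation: the helper names ngrams and frequency_tables, the toy corpus, and the threshold of 2 are all made up for illustration, and tokens are produced by a plain whitespace split rather than CountVectorizer's default token pattern.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Yield every word n-gram with n_min <= n <= n_max as a space-joined string."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def frequency_tables(docs):
    """Return (document_frequency, total_count) Counters over unigrams and bigrams."""
    doc_freq, total = Counter(), Counter()
    for doc in docs:
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        grams = list(ngrams(doc.lower().split()))
        total.update(grams)          # every occurrence counts
        doc_freq.update(set(grams))  # each document counts at most once per term
    return doc_freq, total


# Hypothetical toy corpus: "car red" occurs twice, but only inside one document.
docs = [
    "red car red car red car",
    "red car blue car",
    "blue bike",
]
doc_freq, total = frequency_tables(docs)

threshold = 2  # illustrative value, playing the role of min_df
kept_by_doc_freq = sorted(t for t, c in doc_freq.items() if c >= threshold)
kept_by_total = sorted(t for t, c in total.items() if c >= threshold)

print(kept_by_doc_freq)  # ['blue', 'car', 'red', 'red car']
print(kept_by_total)     # ['blue', 'car', 'car red', 'red', 'red car']
```

The bigram "car red" passes the total-count threshold but not the document-frequency one, because both of its occurrences are in the same document. If the goal is a threshold on total occurrences rather than on the number of documents, min_df alone is only an approximation, and the columns would need to be filtered by their summed counts instead.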