Howard Zoopaloopa - 3 months ago
Python Question

Adding New Text to Sklearn TFIDF Vectorizer (Python)

Is there a function to add to the existing corpus? I've already generated my matrix, and I'd like to periodically add to the table without re-crunching the whole shebang.

e.g.:

from sklearn.feature_extraction.text import TfidfVectorizer

# prep_text and tokenize_text are user-defined functions (not shown)
articleList = ['here is some text blah blah', 'another text object', 'more foo for your bar right now']
tfidf_vectorizer = TfidfVectorizer(
    max_df=.8,
    max_features=2000,
    min_df=.05,
    preprocessor=prep_text,
    use_idf=True,
    tokenizer=tokenize_text
)
tfidf_matrix = tfidf_vectorizer.fit_transform(articleList)

#### ADDING A NEW ARTICLE TO EXISTING SET?
bigger_tfidf_matrix = tfidf_vectorizer.fit_transform(['the last article I wanted to add'])

Answer

No. In the scikit-learn version used here, the idf_ property of TfidfVectorizer doesn't have a setter, so it's not possible to change it once the vectorizer has been fitted:

In [1]: vec.idf_ = np.append(vec.idf_, 1)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-234-8d96c65efb59> in <module>()
----> 1 vec.idf_ = np.append(vec.idf_, 1)

AttributeError: can't set attribute
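
What should work without refitting is calling transform() instead of fit_transform() on the new text. The sketch below assumes a default TfidfVectorizer (without the custom preprocessor and tokenizer from the question); article_list and the other variable names are just illustrative. It reuses the already-fitted vocabulary and IDF weights and appends the new row to the existing matrix:

```python
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer

article_list = [
    'here is some text blah blah',
    'another text object',
    'more foo for your bar right now',
]

# Default settings; the question's prep_text/tokenize_text are not used here.
vec = TfidfVectorizer()
tfidf_matrix = vec.fit_transform(article_list)

# transform() reuses the fitted vocabulary_ and idf_; nothing is re-learned.
new_rows = vec.transform(['the last article I wanted to add'])

# Same number of columns, so the new row can be appended to the old matrix.
bigger_tfidf_matrix = vstack([tfidf_matrix, new_rows]).tocsr()
print(bigger_tfidf_matrix.shape)
```

The catch is that the vocabulary and IDF weights stay frozen: words the vectorizer has never seen are silently dropped, so a new article made up entirely of new words comes out as an all-zero row.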

On the other hand, if you're just using a CountVectorizer, you can access the vocabulary_ attribute of your vectoriser directly, so it would be possible to monkey-patch something like this:

import re
from sklearn.feature_extraction.text import CountVectorizer

def partial_fit(self, X):
    # Add any unseen tokens to the fitted vocabulary, giving each the next free column index
    max_idx = max(self.vocabulary_.values())
    for a in X:
        if self.lowercase:
            a = a.lower()
        for w in re.findall(self.token_pattern, a):
            if w not in self.vocabulary_:
                max_idx += 1
                self.vocabulary_[w] = max_idx

# Monkey-patch the new method onto CountVectorizer
CountVectorizer.partial_fit = partial_fit
vec = CountVectorizer()
vec.fit(articleList)
vec.partial_fit(['the last article I wanted to add'])
vec.transform(['the last article I wanted to add']).toarray()

# array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])
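
One thing to watch with this approach: count matrices produced before the vocabulary grew have fewer columns than ones produced afterwards. A rough sketch of how the two could be lined up is below. It pads the old rows with zero columns, stacks the new rows underneath, and then recomputes the TF-IDF weights from the stored counts with TfidfTransformer, so no old document has to be re-tokenized. The helper name grow_vocabulary and the variable names are made up for illustration, and this assumes default CountVectorizer settings (max_df, min_df and max_features are not re-applied).

```python
import re

from scipy.sparse import csr_matrix, hstack, vstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def grow_vocabulary(vec, docs):
    """Hypothetical helper: append unseen tokens to a fitted vocabulary_."""
    next_idx = max(vec.vocabulary_.values()) + 1
    for doc in docs:
        if vec.lowercase:
            doc = doc.lower()
        for token in re.findall(vec.token_pattern, doc):
            if token not in vec.vocabulary_:
                vec.vocabulary_[token] = next_idx
                next_idx += 1


article_list = [
    'here is some text blah blah',
    'another text object',
    'more foo for your bar right now',
]
new_articles = ['the last article I wanted to add']

vec = CountVectorizer()
old_counts = vec.fit_transform(article_list)

# Extend the vocabulary, then count the new documents against it.
grow_vocabulary(vec, new_articles)
new_counts = vec.transform(new_articles)

# Old rows never contained the new tokens, so pad them with zero columns.
extra_cols = new_counts.shape[1] - old_counts.shape[1]
padding = csr_matrix((old_counts.shape[0], extra_cols), dtype=old_counts.dtype)
all_counts = vstack([hstack([old_counts, padding]), new_counts]).tocsr()

# Recompute the IDF weights from the stored counts only.
tfidf_matrix = TfidfTransformer().fit_transform(all_counts)
print(tfidf_matrix.shape)
```

The IDF step still runs over the whole count matrix, but that is cheap compared with preprocessing and tokenizing every article again.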