soham.m17 - 1 month ago
Python Question

How do I do prediction after fitting TfidfVectorizer and KMeans in Scikit learn?

I have a training data set in a pandas DataFrame. I ran TF-IDF vectorization to get features and then ran KMeans on them. Here is the relevant code:

vectorizer = TfidfVectorizer(max_df=0.8, max_features=max_feat, norm="l1", analyzer="word",
                             min_df=0.1, ngram_range=(1, 2))

X = vectorizer.fit_transform(df['reviews'])
km = KMeans(n_clusters=number, init='k-means++', max_iter=100, n_init=3,
            verbose=1, n_jobs=-2)
km.fit(X)


I can get the centroids through this:

order_centroids = km.cluster_centers_.argsort()[:, ::-1]


Now, when I try to run the test data, I get an error. Here is the code I'm running for the test data. I'm basically taking each row from the test pandas DataFrame and passing it to the same vectorizer as above. Am I doing it wrong?

sample = df.tail(int(totalTestRows * lineLimit))

for row in sample.itertuples():
    test_data = np.array([row[6]])
    testVectorizerArray = vectorizer.transform(test_data).toarray()
    rowX = vectorizer.fit(testVectorizerArray)
    print(km.predict(rowX))


On the rowX = vectorizer.fit(testVectorizerArray) line, I'm getting the following error:

AttributeError: 'numpy.ndarray' object has no attribute 'lower'


I searched through StackOverflow and it seems that I need to format the test_data array as a one-dimensional array. I've checked, and test_data is of the form (n,). However, I'm still getting the error. Is there anything wrong with my approach?

Answer

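You shouldn't refit the vectorizer at the test stage. vectorizer.fit expects raw text documents, so passing it the numeric array returned by transform(...).toarray() makes it try to lowercase a NumPy row, which is where the AttributeError about 'lower' comes from. Your code would be cleaner if you combined the vectorizer and the clustering model in a pipeline: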

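import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

vectorizer = TfidfVectorizer(max_df=0.8, max_features=max_feat, norm="l1", analyzer="word",
                             min_df=0.1, ngram_range=(1, 2))
# n_jobs is omitted here: KMeans no longer accepts it as of scikit-learn 1.0
km = KMeans(n_clusters=number, init='k-means++', max_iter=100, n_init=3,
            verbose=1)
clf = make_pipeline(vectorizer, km)
# Fit on the raw text column, not the already-vectorized X;
# the pipeline runs the vectorizer itself
clf.fit(df['reviews'])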


sample = df.tail(int(totalTestRows * lineLimit))

for row in sample.itertuples():
    test_data = np.array([row[6]])
    print(clf.predict(test_data))
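
If you would rather keep the vectorizer and KMeans as separate objects, the smallest change to the code in the question is to call only transform at the test stage and pass the result straight to predict. Here is a rough, self-contained sketch of that. The toy train_docs and test_docs lists are made up for illustration and stand in for df['reviews'] and the test rows; the vectorizer settings are simplified so the tiny corpus keeps a non-empty vocabulary, and random_state is added only to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for df['reviews']
train_docs = [
    "great phone with a great battery",
    "battery life is great on this phone",
    "the phone battery lasts all day",
    "terrible pizza and cold pasta",
    "the pasta was cold and the pizza was burnt",
    "pizza and pasta were both terrible",
]
# Hypothetical unseen documents standing in for the test rows
test_docs = ["cold pizza again", "this phone has a great battery"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
# Fit exactly once, on the raw training text
X_train = vectorizer.fit_transform(train_docs)

km = KMeans(n_clusters=2, init="k-means++", max_iter=100, n_init=3, random_state=0)
km.fit(X_train)

# Test stage: transform only, which reuses the learned vocabulary and IDF weights.
# No .toarray() and no second fit are needed; predict accepts the sparse matrix.
X_test = vectorizer.transform(test_docs)
labels = km.predict(X_test)
print(labels)  # one cluster index per test document
```

Passing all test documents to transform at once also avoids the row-by-row loop, although looping over single-element arrays as above works too.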