Shivam Agrawal - 1 year ago
Python Question

How to perform text classification with naive bayes using sklearn library?

I am trying to do text classification with a Naive Bayes classifier.
My data is in the format below, and based on the question and excerpt I have to decide the topic of the question. The training data has more than 20K records. I know an SVM would be a better option here, but I want to use Naive Bayes from the sklearn library.

{[{"topic":"electronics","question":"What is the effective differencial effective of this circuit","excerpt":"I'm trying to work out, in general terms, the effective capacitance of this circuit (see diagram: \n\nWhat is the effective capacitance of this circuit and will the ...\r\n "},
{"topic":"electronics","question":"Outlet Installation--more wires than my new outlet can use [on hold]","excerpt":"I am replacing a wall outlet with a Cooper Wiring USB outlet (TR7745). The new outlet has 3 wires coming out of it--a black, a white, and a green. Each one needs to be attached with a wire nut to ...\r\n "}]}

This is what I have tried so far:

import numpy as np
import json
from sklearn.naive_bayes import *

topic = []
question = []
excerpt = []

with open('training.json') as f:
    for line in f:
        data = json.loads(line)

unique_topics = list(set(topic))
new_topic = [x.encode('UTF8') for x in topic]
numeric_topics = [name.replace('gis', '1').replace('security', '2').replace('photo', '3').replace('mathematica', '4').replace('unix', '5').replace('wordpress', '6').replace('scifi', '7').replace('electronics', '8').replace('android', '9').replace('apple', '10') for name in new_topic]
numeric_topics = [float(i) for i in numeric_topics]

x1 = np.array(question)
x2 = np.array(excerpt)
X = zip(*[x1,x2])
Y = np.array(numeric_topics)
print X[0]
clf = BernoulliNB().fit(X, Y)
print "Prediction:", clf.predict( ['hello'] )

But, as expected, I am getting ValueError: could not convert string to float. How can I create a simple classifier that assigns the question and excerpt to the related topic?

Answer Source

All classifiers in sklearn require the input to be represented as numeric vectors of a fixed dimensionality. For text, sklearn.feature_extraction.text provides CountVectorizer, HashingVectorizer and TfidfVectorizer, which transform your strings into numeric feature vectors.

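from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
X = vect.fit_transform(X)  # X must be an iterable of strings, one string per sample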
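
To make the vectorization step concrete, here is a minimal sketch (Python 3) of what the vectorizer produces. The three toy documents and the variable names are made up for illustration and are not the asker's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents: one plain string per sample.
docs = [
    "effective capacitance of this circuit",
    "replacing a wall outlet with a USB outlet",
    "mount a disk on unix",
]

vect = TfidfVectorizer()
# fit_transform learns the vocabulary from docs and returns a sparse
# matrix with one row per document and one column per vocabulary term.
X_vec = vect.fit_transform(docs)
n_samples, n_features = X_vec.shape

# transform reuses the learned vocabulary, so new text lands in the same
# columns; words that were never seen during fitting are ignored.
X_new = vect.transform(["capacitance of a wall outlet"])
print(X_vec.shape, X_new.shape)
```

The point is that both matrices have the same number of columns, which is what the classifier needs.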

You'll need to vectorize your test set in the same way, using the already fitted vectorizer (transform, not fit_transform):

clf.predict( vect.transform(['hello']) )
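
Putting the pieces together, a rough end-to-end sketch (Python 3) could look like the following. Several things here are assumptions rather than part of the original post: the four records are invented, question and excerpt are simply concatenated into one string, and MultinomialNB is swapped in for the BernoulliNB used in the question because it is a common pairing with TF-IDF features. Note that sklearn classifiers accept string labels directly, so the manual topic-to-number mapping is not needed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical records shaped like the training data in the question.
records = [
    {"topic": "electronics",
     "question": "What is the effective capacitance of this circuit",
     "excerpt": "I'm trying to work out the effective capacitance of this circuit"},
    {"topic": "electronics",
     "question": "Outlet installation, more wires than my new outlet can use",
     "excerpt": "I am replacing a wall outlet with a USB outlet"},
    {"topic": "unix",
     "question": "How do I list files in a directory with bash",
     "excerpt": "I want a shell command that lists hidden files in a directory"},
    {"topic": "unix",
     "question": "Change file permissions from the shell",
     "excerpt": "Which bash command changes permissions on files in a directory"},
]

def to_text(rec):
    # Assumption: joining both fields into one document per record.
    return rec["question"] + " " + rec["excerpt"]

texts = [to_text(r) for r in records]
labels = [r["topic"] for r in records]  # string labels are fine as-is

# The pipeline fits the vectorizer and the classifier together and
# applies the same fitted vectorizer automatically at predict time.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

prediction = model.predict(["wiring a new outlet into the circuit"])
print("Prediction:", prediction[0])
```

Using a pipeline is a design choice, not a requirement: it just keeps the fitted vectorizer and the classifier in one object, so the test data cannot accidentally be vectorized differently from the training data.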

See a tutorial on using sklearn with textual data.