toy toy - 7 months ago 46
Python Question

How to do multiclass classification properly with NLTK?

So, I'm trying to do text multiclass classification. I have been reading a lot of old questions and blog posts, but I still can't fully understand the concept of that.

I tried some example from this blog post as well.

But when it comes to multiclass classification I don't quite understand that. Let's say I want to classify text into multi languages, French, English, Italian and German. And I want to use NaviesBayes which I think it would be the easiest to start with. From what I have read in the old questions, the simplest solution would be to use one vs all. So, each language will have its own model. So, I would have 3 models for French, English and Italian. Then I would run a text against every model and check if which one has the highest probability. Am I correct?

But when it comes to coding, in the example above he has tweets like this which will be classified either positive or negative.

pos_tweets = [('I love this car', 'positive'),
('This view is amazing', 'positive'),
('I feel great this morning', 'positive'),
('I am so excited about tonight\'s concert', 'positive'),
('He is my best friend', 'positive')]

neg_tweets = [('I do not like this car', 'negative'),
('This view is horrible', 'negative'),
('I feel tired this morning', 'negative'),
('I am not looking forward to tonight\'s concert', 'negative'),
('He is my enemy', 'negative')]

Which it's positive or negative. So, when it comes to train one model for French how should I tag the text? Would it be like this? So this would be the positive?

[('Bon jour', 'French'),
'je m'appelle', 'French']

And the negative would be

[('Hello', 'English'),
('My name', 'English')]

But would this mean I could just add Italian and German and have just one model for 4 languages? Or I don't really need the negative?

So, the question would be what's the right approach to do multi class classification with ntlk?


There's no need for a one-vs-all scheme with Naive Bayes -- it's a multiclass model out of the box. Just feed a list of (sample, label) pairs to the classifier learner where label denotes the language.