Cody - 9 months ago
Python Question

Classifying text documents using nltk

I'm currently working on a project where I take emails, strip out the message bodies using the email package, and then want to categorize them with labels like sports, politics, technology, etc.

I've successfully stripped the message bodies out of my emails, and now I'm looking to start classifying. I've done the classic sentiment-analysis example using the movie_reviews corpus, separating documents into positive and negative reviews.

I'm just wondering how I could apply this approach to my project. Can I create multiple classes like sports, technology, politics, entertainment, etc.? I've hit a roadblock here and am looking for a push in the right direction.

If this isn't an appropriate question for SO I'll happily delete it.

Answer Source

To create a classifier, you need a training data set with the classes you are looking for. In your case, you may need to either:

  1. create your own data set
  2. use a pre-existing dataset

The Brown corpus is a seminal, manually categorized text collection with many of the categories you are talking about. It could be a starting point for classifying your emails, together with a package like gensim to find semantically similar texts.

Once you have a labelled set of emails, you can train a classifier on it and use it to predict a label for each unseen email.
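The multi-class case works exactly like the two-class movie_reviews example: nltk's NaiveBayesClassifier accepts any number of labels. Below is a minimal sketch using a tiny hand-made training set as a stand-in for real labelled emails (the texts and labels are illustrative placeholders, not real data):

```python
from nltk import NaiveBayesClassifier

def features(text):
    # Bag-of-words features, as in the movie_reviews example.
    return {word.lower(): True for word in text.split()}

# Placeholder training data standing in for labelled email bodies.
train = [
    ("the team won the match last night", "sports"),
    ("the senate passed the new bill", "politics"),
    ("the startup released a new smartphone", "technology"),
    ("the striker scored two goals", "sports"),
    ("voters head to the polls tomorrow", "politics"),
    ("the update patches a security flaw", "technology"),
]
train_set = [(features(text), label) for text, label in train]

classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify(features("the team scored a late goal")))  # sports
```

In practice you would replace the toy training list with your stripped email bodies and their labels, and hold some of them out to measure accuracy with `nltk.classify.accuracy`.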