I am attempting to write a machine learning algorithm with the following feature extraction:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count_vect = CountVectorizer()
>>> X_train_counts = count_vect.fit_transform(twenty_train.data)
What about word2vec embeddings? word2vec is a neural-network-based embedding of words into vectors that takes each word's context into account. This could provide a more sophisticated set of features for your classifier.
One powerful Python library for natural language processing with a good word2vec implementation is gensim. Gensim is built to be scalable and fast, and has advanced text processing capabilities. Here is a quick outline of how to get started.
Install it with:

pip install --upgrade gensim

or

easy_install -U gensim
A simple word2vec example:
import gensim

documents = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]

# min_count=1 keeps every word, even ones that appear only once
model = gensim.models.Word2Vec(documents, min_count=1)

# in recent gensim versions, word vectors are accessed through model.wv
print(model.wv['survey'])
This will output the vector that "survey" maps to, which you could use as a feature input to your classifier.
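One common way to turn per-word vectors into per-document features is to average the vectors of the words in each document. A minimal sketch of that idea, using a small hand-made word-to-vector dict as a stand-in for a trained model's model.wv (the dict, its 2-dimensional vectors, and the helper name document_vector are all illustrative, not part of gensim):

```python
import numpy as np

# Hypothetical stand-in for trained word vectors (model.wv in gensim);
# in practice these would come from Word2Vec training.
word_vectors = {
    'human': np.array([0.1, 0.3]),
    'interface': np.array([0.2, 0.1]),
    'computer': np.array([0.4, 0.2]),
}

def document_vector(tokens, wv, dim=2):
    """Average the vectors of the tokens that have an embedding."""
    vecs = [wv[t] for t in tokens if t in wv]
    if not vecs:
        return np.zeros(dim)  # document with no known words
    return np.mean(vecs, axis=0)

docs = [['human', 'interface'], ['computer']]
X = np.vstack([document_vector(doc, word_vectors) for doc in docs])
print(X.shape)  # one row of features per document: (2, 2)
```

The resulting matrix X has one fixed-length row per document, so it can be fed to any scikit-learn classifier in place of the CountVectorizer output.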
Gensim has a lot of other capabilities, and it is worth getting to know it better if you're interested in Natural Language Processing.