2Cubed - 4 months ago
Python Question

Retain ordering of text data when vectorizing

I am attempting to write a machine learning algorithm with scikit-learn that parses text and classifies it based on training data.

The example for using text data, taken directly from the documentation, uses a CountVectorizer to generate a sparse array of how many times each word appears.

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count_vect = CountVectorizer()
>>> X_train_counts = count_vect.fit_transform(twenty_train.data)

Unfortunately, this does not take into account any ordering of the phrases. It is possible to use larger n-grams (CountVectorizer(ngram_range=(min, max))) to look at specific phrases, but this rapidly increases the number of features and still doesn't work all that well.
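To make the trade-off concrete, here is a minimal pure-Python sketch of the idea. It is not scikit-learn's implementation: the helper ngram_counts and the two sentences are made up for illustration, and tokenization is a plain whitespace split.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the word n-grams of length n in a whitespace-tokenized text."""
    tokens = text.lower().split()
    # Zip n shifted copies of the token list to form consecutive n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)


# Two hypothetical sentences with the same words in a different order.
a = "the dog bit the man"
b = "the man bit the dog"

# Unigram counts are identical, so word order is lost.
same_unigrams = ngram_counts(a, 1) == ngram_counts(b, 1)

# Bigram counts differ, so some local ordering is preserved.
same_bigrams = ngram_counts(a, 2) == ngram_counts(b, 2)

# Number of distinct features across both sentences, for n = 1 and n = 2.
unigram_features = len(ngram_counts(a, 1) + ngram_counts(b, 1))
bigram_features = len(ngram_counts(a, 2) + ngram_counts(b, 2))

print(same_unigrams, same_bigrams, unigram_features, bigram_features)
```

Even on this tiny input the bigram vocabulary is already larger than the unigram one, and bigrams only capture ordering between adjacent words.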

Is there a good way of dealing with ordered text in another way? I'm definitely open to using a natural language parser along with scikit-learn.

What about word2vec embeddings? Word2vec is a neural-network-based embedding of words into vectors that takes each word's context into account. This could provide a more sophisticated set of features for your classifier.

One powerful Python library for natural language processing with a good word2vec implementation is gensim. Gensim is built to be very scalable and fast, and has advanced text processing capabilities. Here is a quick outline of how to get started:


Just run pip install --upgrade gensim (or, on older setups, easy_install -U gensim).

A simple word2vec example

import gensim

documents = [['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

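# min_count=1 keeps every word, even those that appear only once
model = gensim.models.Word2Vec(documents, min_count=1)
# Word vectors are looked up through model.wv in current gensim releases
print(model.wv["survey"])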

This will output the vector that "survey" maps to, which you could use as a feature input to your classifier.
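A classifier usually needs one fixed-length vector per document rather than one per word. One simple option, sketched here as an assumption rather than as the recommended approach, is to average the vectors of the words in a document. The helper average_vector and the 3-dimensional toy_vectors are hypothetical stand-ins; with a trained gensim model the lookups would come from the model's word vectors instead.

```python
def average_vector(tokens, vectors):
    """Average the vectors of the known tokens into one fixed-length feature vector."""
    # Skip tokens that have no vector (out-of-vocabulary words).
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Hypothetical 3-dimensional vectors standing in for trained word2vec output.
toy_vectors = {
    "user": [1.0, 0.0, 2.0],
    "response": [0.0, 2.0, 2.0],
    "time": [2.0, 1.0, 2.0],
}

# One feature vector for the whole document, regardless of its length.
features = average_vector(["user", "response", "time"], toy_vectors)
print(features)
```

Note that plain averaging again ignores the order of the words in the document; the ordering information it carries is only what word2vec absorbed from each word's context during training.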

Gensim has a lot of other capabilities, and it is worth getting to know it better if you're interested in Natural Language Processing.