mumpy mumpy - 4 months ago 17
Python Question

python symmetric word matrix using nltk

I'm trying to create a symmetric word matrix from a text document.

For example:
text = "Barbara is good. Barbara is friends with Benny. Benny is bad."

I have tokenized the text document using nltk. Now I want to count how many times other words appear in the same sentence. From the text above, I want to create the matrix below:

         Barbara  good  friends  Benny  bad
Barbara        2     1        1      1    0
good           1     1        0      0    0
friends        1     0        1      1    0
Benny          1     0        1      2    1
bad            0     0        0      1    1


Note that the diagonal entries are the frequency of each word, since Barbara appears with Barbara in a sentence as often as there are Barbaras. I would prefer not to overcount, but this is not a big issue if avoiding it makes the code too complicated.

Answer

First we tokenize the text, iterate through each sentence, iterate through all pairwise combinations of the words in each sentence, and store the counts in a nested dict:

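from collections import defaultdict

import numpy as np
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Barbara is good. Barbara is friends with Benny. Benny is bad."

# Nested dict of counts: sparse_matrix[word1][word2] -> co-occurrence count.
sparse_matrix = defaultdict(lambda: defaultdict(lambda: 0))

for sent in sent_tokenize(text):
    words = word_tokenize(sent)
    # Count every ordered pair of tokens in the sentence, including each
    # token paired with itself (this fills the diagonal).
    for word1 in words:
        for word2 in words:
            sparse_matrix[word1][word2] += 1

print(sparse_matrix)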
>> defaultdict(<function <lambda> at 0x7f46bc3587d0>, {
'good': defaultdict(<function <lambda> at 0x3504320>, 
    {'is': 1, 'good': 1, 'Barbara': 1, '.': 1}), 
'friends': defaultdict(<function <lambda> at 0x3504410>, 
    {'friends': 1, 'is': 1, 'Benny': 1, '.': 1, 'Barbara': 1, 'with': 1}), etc..

This behaves much like a matrix: sparse_matrix['good']['Barbara'] returns 1 and sparse_matrix['bad']['Barbara'] returns 0. However, no counts are stored for word pairs that never co-occurred; the defaultdict generates the 0 only when you ask for it (and stores it at that point, so use sparse_matrix['bad'].get('Barbara', 0) if you want to read a count without adding an entry). For a large vocabulary, where most word pairs never share a sentence, this saves a lot of memory. If we need a dense matrix for linear algebra or some other computational reason, we can get it like this:

# Give each word its own row/column index. Hashing words modulo the lexicon
# size can send two different words to the same index and merge their counts.
vocab = sorted(sparse_matrix)
index = {word: i for i, word in enumerate(vocab)}
dense_matrix = np.zeros((len(vocab), len(vocab)), dtype=int)

for word1, row in sparse_matrix.items():
    for word2, count in row.items():
        dense_matrix[index[word1], index[word2]] = count

print(vocab)
print(dense_matrix)
>>
['.', 'Barbara', 'Benny', 'bad', 'friends', 'good', 'is', 'with']
[[3 2 2 1 1 1 3 1]
 [2 2 1 0 1 1 2 1]
 [2 1 2 1 1 0 2 1]
 [1 0 1 1 0 0 1 0]
 [1 1 1 0 1 0 1 1]
 [1 1 0 0 0 1 1 0]
 [3 2 2 1 1 1 3 1]
 [1 1 1 0 1 0 1 1]]
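
The matrix in the question leaves out "is", "with" and the period, and the asker would prefer not to overcount. A minimal sketch of one way to get there follows. It makes some assumptions: the STOP_WORDS set and the cooccurrence helper are illustrative names, not part of nltk, and a simple regex tokenizer stands in for sent_tokenize and word_tokenize so the snippet runs without any nltk data.

```python
import re
from collections import defaultdict
from itertools import product

# Illustrative stop list (an assumption, not nltk's stopword corpus).
STOP_WORDS = {"is", "with"}


def cooccurrence(text, stop_words=STOP_WORDS):
    """Count, for each pair of words, the sentences in which both occur."""
    counts = defaultdict(lambda: defaultdict(int))
    # Naive sentence split on ., ! and ? (a stand-in for sent_tokenize).
    for sent in re.split(r"[.!?]+", text):
        # Keep alphabetic tokens only, which drops punctuation. Building a
        # set means a word repeated within a sentence is counted once.
        words = {w for w in re.findall(r"[A-Za-z]+", sent) if w not in stop_words}
        for word1, word2 in product(words, repeat=2):
            counts[word1][word2] += 1
    return counts


text = "Barbara is good. Barbara is friends with Benny. Benny is bad."
counts = cooccurrence(text)

# Column order taken from the question.
vocab = ["Barbara", "good", "friends", "Benny", "bad"]
matrix = [[counts[w1].get(w2, 0) for w2 in vocab] for w1 in vocab]

for word, row in zip(vocab, matrix):
    print(word, row)
# Barbara [2, 1, 1, 1, 0]
# good [1, 1, 0, 0, 0]
# friends [1, 0, 1, 1, 0]
# Benny [1, 0, 1, 2, 1]
# bad [0, 0, 0, 1, 1]
```

Because each sentence is reduced to a set of words, the diagonal here counts the sentences containing a word rather than its raw frequency; the two agree for this text since no word repeats within a sentence.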

I would recommend looking at http://docs.scipy.org/doc/scipy/reference/sparse.html for other ways of dealing with matrix sparsity.
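
As one example of that route, here is a rough sketch that packs a nested dict of counts into a scipy.sparse matrix. It assumes the counts have the same nested-dict shape as sparse_matrix; the small counts dict below is hand-written sample data so the snippet is self-contained, and the variable names are illustrative.

```python
from scipy.sparse import coo_matrix

# Hand-written sample counts in the same nested-dict shape as sparse_matrix.
counts = {
    "Barbara": {"Barbara": 2, "good": 1},
    "good": {"Barbara": 1, "good": 1},
    "bad": {"bad": 1},
}

# Give every word a unique row/column index.
vocab = sorted(counts)
index = {word: i for i, word in enumerate(vocab)}

# Collect (row, column, value) triples for the non-zero entries only.
rows, cols, data = [], [], []
for word1, row in counts.items():
    for word2, count in row.items():
        rows.append(index[word1])
        cols.append(index[word2])
        data.append(count)

n = len(vocab)
# Build in COO format, then convert to CSR for arithmetic and row slicing.
cooc = coo_matrix((data, (rows, cols)), shape=(n, n)).tocsr()

print(vocab)
print(cooc.nnz)        # number of stored entries
print(cooc.toarray())  # dense view, only sensible for small vocabularies
```

Only the non-zero counts are stored, so memory grows with the number of co-occurring pairs rather than with the square of the vocabulary size.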
