I'm trying to create a symmetric word matrix from a text document.
For example:
text = "Barbara is good. Barbara is friends with Benny. Benny is bad."
I have tokenized the text document using nltk. Now I want to count how many times other words appear in the same sentence. From the text above, I want to create the matrix below:
         Barbara  good  friends  Benny  bad
Barbara     2      1       1       1     0
good        1      1       0       0     0
friends     1      0       1       1     0
Benny       1      0       1       2     1
bad         0      0       1       1     1
First we tokenize the text, iterate through each sentence, and iterate over all pairwise combinations of the words in each sentence, storing the counts in a nested dict:
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import defaultdict
import numpy as np

text = "Barbara is good. Barbara is friends with Benny. Benny is bad."

sparse_matrix = defaultdict(lambda: defaultdict(lambda: 0))

for sent in sent_tokenize(text):
    words = word_tokenize(sent)
    for word1 in words:
        for word2 in words:
            sparse_matrix[word1][word2] += 1

print(sparse_matrix)
>> defaultdict(<function <lambda> at 0x7f46bc3587d0>, {
    'good': defaultdict(<function <lambda> at 0x3504320>,
        {'is': 1, 'good': 1, 'Barbara': 1, '.': 1}),
    'friends': defaultdict(<function <lambda> at 0x3504410>,
        {'friends': 1, 'is': 1, 'Benny': 1, '.': 1, 'Barbara': 1, 'with': 1}),
    ...
This is essentially like a matrix, in that we can index sparse_matrix['good']['Barbara'] and get the number 1, and index sparse_matrix['bad']['Barbara'] and get 0. But we aren't actually storing a count for any pair of words that never co-occurred; the 0 is generated by the defaultdict only when you ask for it. This can save a lot of memory. If we need a dense matrix for linear algebra or some other computational reason, we can get one like this:
lexicon_size = len(sparse_matrix)

def mod_hash(x, m):
    return hash(x) % m

# Note: in Python 3 string hashes are randomized per run, so the row/column
# layout below will change between runs, and two words can collide on the
# same index.
dense_matrix = np.zeros((lexicon_size, lexicon_size))
for k in sparse_matrix:
    for k2 in sparse_matrix[k]:
        dense_matrix[mod_hash(k, lexicon_size)][mod_hash(k2, lexicon_size)] = \
            sparse_matrix[k][k2]

print(dense_matrix)
>>
[[ 0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  1.  1.  1.  0.  1.]
 [ 0.  0.  1.  1.  1.  0.  0.  1.]
 [ 0.  0.  1.  1.  1.  1.  0.  1.]
 [ 0.  0.  1.  0.  1.  2.  0.  2.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  1.  1.  2.  0.  3.]]
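One caveat with the mod_hash approach: two different words can hash to the same index and silently overwrite each other's counts. If you want reproducible, collision-free rows and columns, an explicit word-to-index mapping is safer. A self-contained sketch (a plain str.split stands in for nltk's tokenizers here so the example runs without downloads):

```python
from collections import defaultdict
import numpy as np

text = "Barbara is good. Barbara is friends with Benny. Benny is bad."

# Count co-occurrences the same way as above.
counts = defaultdict(lambda: defaultdict(int))
for sent in filter(None, (s.strip() for s in text.split('.'))):
    words = sent.split()
    for w1 in words:
        for w2 in words:
            counts[w1][w2] += 1

# A stable, collision-free word -> row/column index mapping.
vocab = sorted(counts)
index = {w: i for i, w in enumerate(vocab)}

dense = np.zeros((len(vocab), len(vocab)), dtype=int)
for w1, row in counts.items():
    for w2, c in row.items():
        dense[index[w1], index[w2]] = c
```

Because every row and column now corresponds to exactly one word, the matrix is symmetric and you can look counts up by name, e.g. dense[index['Barbara'], index['good']].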
I would recommend looking at http://docs.scipy.org/doc/scipy/reference/sparse.html for other ways of dealing with matrix sparsity.
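For example, scipy's dok_matrix (dictionary-of-keys) format stores only the nonzero entries, much like the nested defaultdict above, but plugs directly into scipy/numpy linear algebra. A minimal sketch with a hypothetical small vocabulary and a few hand-filled counts standing in for the nested dict built earlier:

```python
import numpy as np
from scipy.sparse import dok_matrix

# Hypothetical vocabulary and index mapping for illustration.
vocab = ['Barbara', 'Benny', 'bad', 'friends', 'good']
index = {w: i for i, w in enumerate(vocab)}

# Only nonzero entries are stored; everything else is an implicit 0.
cooc = dok_matrix((len(vocab), len(vocab)), dtype=np.int64)
cooc[index['Barbara'], index['Barbara']] = 2
cooc[index['Barbara'], index['good']] = 1
cooc[index['good'], index['Barbara']] = 1

# Convert to CSR for fast arithmetic, or densify only when needed.
csr = cooc.tocsr()
dense = cooc.toarray()
```

The dok format is convenient for incremental construction; converting to CSR (or CSC) afterwards is the usual pattern when you then need matrix products or slicing at scale.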