Magic8ball - 1 year ago 249
Python Question

# Computing Jaccard Similarity in Python

I have 20,000 documents that I want to compute the true Jaccard similarity for, so that I can later check how accurately MinWise hashing approximates it.

Each document is represented as a column in a numpy matrix, where each row is a word that either appears in document (entry=1) or does not (entry = 0). There are ~600 words (rows).

So for example column 1 would be [1 0 0 0 0 0 1 0 0 0 1 0], which means words 1,7,11 appeared in it and no others.

Is there a more efficient way to compute the similarity besides my element-wise comparison approach? I don't see how I could use sets to improve the speed since the sets just become (0,1), but as it stands the code is impossibly slow.

``````import numpy as np

#Convert the documents from rows to columns
rawdata = np.transpose(rawdata)
#compute true jacard similarity
ndocs = rawdata.shape[1]
nwords = rawdata.shape[0]
tru_sim = np.zeros((ndocs,ndocs))

#computes jaccard similarity of 2 documents
def jaccard(c1, c2):
n11 = sum((c1==1)&(c2==1))
n00 = sum((c1==0)&(c2==0))
jac = n11 / (nfeats-n00)
return (jac)

for i in range(0,ndocs):
tru_sim[i,i]=1
for j in range(i+1,ndocs):
tru_sim[i,j] = jaccard(rawdata[:,i],rawdata[:,j])
``````

Here's a vectorized approach -

``````# Get the row, col indices that are to be set in output array
r,c = np.tril_indices(ndocs,-1)

# Use those indicees to slice out respective columns
p1 = rawdata[:,c]
p2 = rawdata[:,r]

# Perform n11 and n00 vectorized computations across all indexed columns
n11v = ((p1==1) & (p2==1)).sum(0)
n00v = ((p1==0) & (p2==0)).sum(0)

# Finally, setup output array and set final division computations
out = np.eye(ndocs)
out[c,r] = n11v / (nfeats-n00v)
``````

Alternative way to compute `n11v` and `n00v` with `np.einsum` -

``````n11v = np.einsum('ij,ij->j',(p1==1),(p2==1).astype(int))
n00v = np.einsum('ij,ij->j',(p1==0),(p2==0).astype(int))
``````

If `rawdata` consists of `0s` and `1s` only, a simpler way to get them would be -

``````n11v = np.einsum('ij,ij->j',p1,p2)
n00v = np.einsum('ij,ij->j',1-p1,1-p2)
``````

Benchmarking

Function definitions -

``````def original_app(rawdata, ndocs, nfeats):
tru_sim = np.zeros((ndocs,ndocs))
for i in range(0,ndocs):
tru_sim[i,i]=1
for j in range(i+1,ndocs):
tru_sim[i,j] = jaccard(rawdata[:,i],rawdata[:,j])
return tru_sim

def vectorized_app(rawdata, ndocs, nfeats):
r,c = np.tril_indices(ndocs,-1)
p1 = rawdata[:,c]
p2 = rawdata[:,r]
n11v = ((p1==1) & (p2==1)).sum(0)
n00v = ((p1==0) & (p2==0)).sum(0)
out = np.eye(ndocs)
out[c,r] = n11v / (nfeats-n00v)
return out
``````

Verification and timings -

``````In [6]: # Setup inputs
...: rawdata = (np.random.rand(20,10000)>0.2).astype(int)
...: rawdata = np.transpose(rawdata)
...: ndocs = rawdata.shape[1]
...: nwords = rawdata.shape[0]
...: nfeats = 5
...:

In [7]: # Verify results
...: out1 = original_app(rawdata, ndocs, nfeats)
...: out2 = vectorized_app(rawdata, ndocs, nfeats)
...: print np.allclose(out1,out2)
...:
True

In [8]: %timeit original_app(rawdata, ndocs, nfeats)
1 loops, best of 3: 8.72 s per loop

In [9]: %timeit vectorized_app(rawdata, ndocs, nfeats)
10 loops, best of 3: 27.6 ms per loop
``````

Some magical `300x+` speedup there!

So, why is it so fast? Well, there are lots of factors involved, most important one being the fact NumPy arrays are built for performance and optimized for vectorized computations. With the proposed approach we are harnessing it quite well and thus seeing such speedups.

Here's one `related Q&A` that talk about these performance criteria in detail.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download