Magic8ball - 24 days ago
Python Question

Computing Jaccard Similarity in Python

I have 20,000 documents that I want to compute the true Jaccard similarity for, so that I can later check how accurately MinWise hashing approximates it.

Each document is represented as a column in a NumPy matrix, where each row is a word that either appears in the document (entry = 1) or does not (entry = 0). There are ~600 words (rows).

So for example column 1 would be [1 0 0 0 0 0 1 0 0 0 1 0], which means words 1, 7, and 11 appear in it and no others.

Is there a more efficient way to compute the similarity than my element-wise comparison approach? I don't see how sets would help, since converting a column to a set just yields {0, 1}, and as it stands the code is impossibly slow.
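For reference, the Jaccard similarity of two binary columns is n11 / (n11 + n10 + n01), i.e. the number of words shared by both documents over the number of words appearing in either. A minimal sketch with two toy vectors (values made up for illustration):

```python
import numpy as np

# Two toy document columns (made-up data for illustration)
c1 = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0])
c2 = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0])

n11 = np.sum((c1 == 1) & (c2 == 1))   # words present in both documents
n00 = np.sum((c1 == 0) & (c2 == 0))   # words present in neither document
union = len(c1) - n00                 # n11 + n10 + n01

print(n11 / union)  # 2 shared words out of 4 distinct words -> 0.5
```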

import numpy as np

#load file into python
rawdata = np.loadtxt("myfile.csv",delimiter="\t")
#Convert the documents from rows to columns
rawdata = np.transpose(rawdata)
#compute the true Jaccard similarity
ndocs = rawdata.shape[1]
nwords = rawdata.shape[0]
tru_sim = np.zeros((ndocs,ndocs))

#computes the Jaccard similarity of 2 document columns
def jaccard(c1, c2):
    n11 = sum((c1==1)&(c2==1))
    n00 = sum((c1==0)&(c2==0))
    jac = n11 / (nwords-n00)
    return jac

for i in range(0,ndocs):
    tru_sim[i,i]=1
    for j in range(i+1,ndocs):
        tru_sim[i,j] = jaccard(rawdata[:,i],rawdata[:,j])

Answer

Here's a vectorized approach -

# Get the row, col indices that are to be set in output array        
r,c = np.tril_indices(ndocs,-1)

# Use those indices to slice out the respective columns
p1 = rawdata[:,c]
p2 = rawdata[:,r]

# Perform n11 and n00 vectorized computations across all indexed columns
n11v = ((p1==1) & (p2==1)).sum(0)
n00v = ((p1==0) & (p2==0)).sum(0)

# Finally, set up the output array and perform the final divisions
out = np.eye(ndocs)
out[c,r] = n11v / (nfeats-n00v)
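To see what np.tril_indices produces, here is a small check with ndocs = 4 (a toy size for illustration): it returns the row/column indices of the strictly lower triangle, i.e. every unordered document pair exactly once.

```python
import numpy as np

# Indices strictly below the diagonal of a 4x4 matrix
r, c = np.tril_indices(4, -1)

# Each pair (i, j) with i > j appears exactly once
print(list(zip(r, c)))
```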

Alternative way to compute n11v and n00v with np.einsum -

n11v = np.einsum('ij,ij->j',(p1==1),(p2==1).astype(int))
n00v = np.einsum('ij,ij->j',(p1==0),(p2==0).astype(int))

If rawdata consists of 0s and 1s only, a simpler way to get them would be -

n11v = np.einsum('ij,ij->j',p1,p2)
n00v = np.einsum('ij,ij->j',1-p1,1-p2)
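A quick sanity check on random toy 0/1 data that the einsum route reproduces the comparison-based counts:

```python
import numpy as np

rng = np.random.RandomState(0)
p1 = rng.randint(0, 2, (6, 5))
p2 = rng.randint(0, 2, (6, 5))

# Comparison-based counts
n11_cmp = ((p1 == 1) & (p2 == 1)).sum(0)
n00_cmp = ((p1 == 0) & (p2 == 0)).sum(0)

# einsum-based counts: the product 1*1 contributes iff both entries are 1
n11_es = np.einsum('ij,ij->j', p1, p2)
# Flip 0s and 1s so that "both zero" becomes "both one"
n00_es = np.einsum('ij,ij->j', 1 - p1, 1 - p2)

print(np.array_equal(n11_cmp, n11_es) and np.array_equal(n00_cmp, n00_es))
```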

Benchmarking

Function definitions -

def original_app(rawdata, ndocs, nfeats):
    tru_sim = np.zeros((ndocs,ndocs))
    for i in range(0,ndocs):
        tru_sim[i,i]=1
        for j in range(i+1,ndocs):
            tru_sim[i,j] = jaccard(rawdata[:,i],rawdata[:,j])
    return tru_sim

def vectorized_app(rawdata, ndocs, nfeats):
    r,c = np.tril_indices(ndocs,-1)
    p1 = rawdata[:,c]
    p2 = rawdata[:,r]
    n11v = ((p1==1) & (p2==1)).sum(0)
    n00v = ((p1==0) & (p2==0)).sum(0)
    out = np.eye(ndocs)
    out[c,r] = n11v / (nfeats-n00v)
    return out

Verification and timings -

In [6]: # Setup inputs
   ...: rawdata = (np.random.rand(20,10000)>0.2).astype(int)
   ...: rawdata = np.transpose(rawdata)
   ...: ndocs = rawdata.shape[1]
   ...: nwords = rawdata.shape[0]
   ...: nfeats = nwords
   ...: 

In [7]: # Verify results
   ...: out1 = original_app(rawdata, ndocs, nfeats)
   ...: out2 = vectorized_app(rawdata, ndocs, nfeats)
   ...: print np.allclose(out1,out2)
   ...: 
True

In [8]: %timeit original_app(rawdata, ndocs, nfeats)
1 loops, best of 3: 8.72 s per loop

In [9]: %timeit vectorized_app(rawdata, ndocs, nfeats)
10 loops, best of 3: 27.6 ms per loop

Some magical 300x+ speedup there!

So, why is it so fast? Many factors are involved, the most important being that NumPy arrays are built for performance and optimized for vectorized computations. The proposed approach harnesses that quite well, hence the speedup.

Here's one related Q&A that talks about these performance criteria in detail.
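One last note: both the original loop and the vectorized version fill only one triangle of the matrix plus the unit diagonal. If a full symmetric similarity matrix is needed, it can be mirrored afterwards; a minimal sketch with made-up values:

```python
import numpy as np

# Toy upper-triangular similarity matrix with unit diagonal (made-up values)
out = np.eye(3)
out[0, 1], out[0, 2], out[1, 2] = 0.5, 0.25, 0.75

# Mirror across the diagonal; subtract eye so the diagonal is counted once
full = out + out.T - np.eye(3)
print(np.allclose(full, full.T))  # True
```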