for a project, I need an efficient function in python that solves to following task:
Given a very large List X of long sparse Vectors (=> big sparse Matrix) and another Matrix Y that contains a single Vector y, I want a List of "distances", that y has to every Element of X. Hereby the "distance" is defined like this:
Compare each Element of the two Vectors, always take the lower one and sum them up.
X = [[0,0,2],
Y = [[1,0,2]]
def get_distances(X, Y):
rows, cols = X.shape
for i in range(0,rows):
dist = 0
sample = X.getrow(i).todense()
test = Y.getrow(0).todense()
rows_s, cols_s = sample.shape
rows_t, cols_t = test.shape
for s,t in zip(range(0, cols_s), range(0, cols_t)):
dist += min(sample[0,s], test[0,t])
I revised your function and ran it in
import numpy as np from scipy import sparse def get_distances(X, Y): ret= for row in X: sample = row.A test = Y.getrow(0).A dist = np.minimum(sample[0,:], test[0,:]).sum() ret.append(dist) return ret X = [[0,0,2], [1,0,0], [3,1,0]] Y = [[1,0,2]] XM = sparse.csr_matrix(X) YM = sparse.csr_matrix(Y) print( get_distances(XM,YM)) print (np.minimum(XM.A, YM.A).sum(axis=1))
1255:~/mypy$ python3 stack37056258.py [2, 1, 1] [2 1 1]
np.minimum takes element wise minimum of two arrays (may be 2d), so I don't need to iterate on columns. I also don't need to use indexing.
minimum is also implemented for sparse matrices, but I get a segmenation error when I try to apply it to your
X (with 3 rows) and
Y (with 1). If they are the same size this works:
Ys = sparse.vstack((YM,YM,YM)) print(Ys.shape) print (XM.minimum(Ys).sum(axis=1))
Converting the single row matrix to an array also gets around the error - because it ends up using the dense version,
When I try other element by element operations on these 2 matrices I get
ValueError: inconsistent shapes, e.g.
XM<YM. Looks like sparse does not implement broadcasting as
numpy arrays does.
There's an overlap in issues with Scipy sparse matrix alternative for getrow()