I have a large file where each line has a pair of 8 character strings.

I think this is a regular task in sklearn, so there must be some tool in the package that does this, or an answer in other SO questions. We need to add the correct tag.

But just working from my knowledge of sparse, here's what I'd do:
Make a sample 2d array - N rows, 2 columns with character values:
    In : import numpy as np
    In : from scipy import sparse
    In : A = np.array([('a','b'),('b','d'),('a','d'),('b','c'),('d','e')])
    In : A
    Out:
    array([['a', 'b'],
           ['b', 'd'],
           ['a', 'd'],
           ['b', 'c'],
           ['d', 'e']], dtype='<U1')
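With the real file, an array like A could be built straight from the text. A minimal sketch, assuming the two strings on each line are whitespace separated and the file is called pairs.txt (both assumptions, not from the question):

    import numpy as np

    # each line -> one row of two fixed-width strings
    with open('pairs.txt') as f:
        A = np.array([line.split() for line in f], dtype='<U8')
    # A.shape is (N, 2), one row per line of the file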
Use np.unique to identify the unique strings, and, as a bonus, to get a map from the original array back into those strings. This is the workhorse of the task.
    In : k1, k2, k3 = np.unique(A, return_inverse=True, return_index=True)
    In : k1
    Out: array(['a', 'b', 'c', 'd', 'e'], dtype='<U1')
    In : k2
    Out: array([0, 1, 7, 3, 9], dtype=int32)
    In : k3
    Out: array([0, 1, 1, 3, 0, 3, 1, 2, 3, 4], dtype=int32)
I can reshape that inverse array (k3) to identify the row and col for each entry of the matrix:
    In : rows, cols = k3.reshape(A.shape).T
    In : rows
    Out: array([0, 1, 0, 1, 3], dtype=int32)
    In : cols
    Out: array([1, 3, 3, 2, 4], dtype=int32)
With those it is trivial to construct a sparse matrix that has a 1 at each 'intersection'.
    In : M = sparse.coo_matrix((np.ones(rows.shape, int), (rows, cols)))
    In : M
    Out:
    <4x5 sparse matrix of type '<class 'numpy.int32'>'
        with 5 stored elements in COOrdinate format>
    In : M.A
    Out:
    array([[0, 1, 0, 1, 0],
           [0, 0, 1, 1, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 1]])
The first row, a, has values in the 2nd and 4th columns, b and d. And so on.
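To check that reading, the coordinate arrays of the COO matrix can be mapped back through k1 (a quick sketch using the variables above):

    # M.row and M.col hold the COO coordinates; k1 holds the unique strings
    for r, c in zip(M.row, M.col):
        print(k1[r], k1[c])
    # prints the original pairs: a b, b d, a d, b c, d e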
Originally I had:
    In : M = sparse.coo_matrix((np.ones(k1.shape, int), (rows, cols)))
This is wrong. The data array should match rows and cols in shape. Here it didn't raise an error because k1 happens to have the same size. But with a different mix of unique values it could raise an error.
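For example (a made-up case, not from the original run), six unique strings but still five pairs would give a data array that is too long, and coo_matrix should reject it:

    # hypothetical: np.ones(k1.shape, int) with 6 uniques, but only 5 (row, col) pairs
    data = np.ones(6, int)
    try:
        sparse.coo_matrix((data, (rows, cols)))
    except ValueError as e:
        print(e)    # complains that row, col and data lengths differ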
This approach assumes the whole database, A, can be loaded into memory. unique probably requires a similar amount of memory. Initially a coo matrix might not increase the memory usage, since it will use the arrays provided as parameters. But any calculations and/or conversion to csr or another format will make further copies.
I can imagine getting around memory issues by loading the database in chunks and using some other structure to get the unique values and mapping. You might even be able to construct a coo matrix from chunks. But sooner or later you'll hit memory issues. The scikit code will be making one or more copies of that sparse matrix.
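Here is a rough sketch of that chunked idea, with a plain dict standing in for np.unique; the file name, chunk size, and function name are all made up for illustration:

    import numpy as np
    from scipy import sparse
    from itertools import islice

    def build_matrix(fname, chunksize=100000):
        index = {}              # string -> row/col index, in order of first appearance
        rows, cols = [], []
        with open(fname) as f:
            while True:
                chunk = list(islice(f, chunksize))   # read the file in chunks
                if not chunk:
                    break
                for line in chunk:
                    s1, s2 = line.split()
                    rows.append(index.setdefault(s1, len(index)))
                    cols.append(index.setdefault(s2, len(index)))
        n = len(index)
        data = np.ones(len(rows), int)
        # duplicate (row, col) pairs get summed when converting to csr
        return sparse.coo_matrix((data, (rows, cols)), shape=(n, n)).tocsr(), index

Unlike np.unique, the dict indexes the strings in order of first appearance rather than sorted order, and the rows/cols lists still grow with the file, so this only postpones the memory problem.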