eleanora - 4 months ago
Python Question

How to read in an edge list to make a scipy sparse matrix

I have a large file where each line has a pair of 8-character strings. Something like:

ab1234gh iu9240gh


on each line.

This file really represents a graph, and each string is a node id. I would like to read in the file and directly make a scipy sparse adjacency matrix. I will then run PCA on this matrix using one of the many tools available in Python.

Is there a neat way to do this or do I need to first make a graph in RAM and then convert that into a sparse matrix? As the file is large I would like to avoid intermediate steps if possible.

Ultimately I will feed the sparse adjacency matrix into http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD .

Answer

I think this is a common task in sklearn, so there may already be a tool in the package that does this, or an answer in other SO questions. We need to add the correct tag.

But just working from my knowledge of numpy and sparse, here's what I'd do:

Make a sample 2d array with N rows and 2 columns of character values (assuming import numpy as np and from scipy import sparse):

In [638]: A=np.array([('a','b'),('b','d'),('a','d'),('b','c'),('d','e')])
In [639]: A
Out[639]: 
array([['a', 'b'],
       ['b', 'd'],
       ['a', 'd'],
       ['b', 'c'],
       ['d', 'e']], 
      dtype='<U1')

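In practice A would come from the file described in the question. A minimal sketch of that step, assuming whitespace-separated pairs on each line (a StringIO stands in for the real file object here):

```python
import numpy as np
from io import StringIO

# stand-in for open('edges.txt'); any iterable of lines works
f = StringIO("ab1234gh iu9240gh\nab1234gh xy5678zz\n")

# each line: two whitespace-separated node ids -> one row of A
A = np.array([line.split() for line in f])
print(A.dtype, A.shape)   # <U8 (2, 2)
```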
Use np.unique to identify the unique strings and, as a bonus, a map from those strings back to the original array. This is the workhorse of the task. (Note that np.unique returns the optional results in a fixed order regardless of keyword order: values, then index, then inverse; so k2 is the index array and k3 the inverse.)

In [640]: k1,k2,k3=np.unique(A,return_inverse=True,return_index=True)
In [641]: k1
Out[641]: 
array(['a', 'b', 'c', 'd', 'e'], 
      dtype='<U1')
In [642]: k2
Out[642]: array([0, 1, 7, 3, 9], dtype=int32)
In [643]: k3
Out[643]: array([0, 1, 1, 3, 0, 3, 1, 2, 3, 4], dtype=int32)

I can reshape that inverse array to identify the row and col for each entry in A.

In [644]: rows,cols=k3.reshape(A.shape).T
In [645]: rows
Out[645]: array([0, 1, 0, 1, 3], dtype=int32)
In [646]: cols
Out[646]: array([1, 3, 3, 2, 4], dtype=int32)

With those it is trivial to construct a sparse matrix that has a 1 at each 'intersection'.

In [648]: M=sparse.coo_matrix((np.ones(rows.shape,int),(rows,cols)))
In [649]: M
Out[649]: 
<4x5 sparse matrix of type '<class 'numpy.int32'>'
    with 5 stored elements in COOrdinate format>
In [650]: M.A
Out[650]: 
array([[0, 1, 0, 1, 0],
       [0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1]])

The first row (a) has values in the 2nd and 4th columns (b and d), and so on.
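Note that without an explicit shape, coo_matrix infers 4x5 from the largest row and column indices. For an adjacency matrix you normally want it square, n x n over all nodes; a consolidated sketch of the whole recipe with that fix:

```python
import numpy as np
from scipy import sparse

A = np.array([('a','b'),('b','d'),('a','d'),('b','c'),('d','e')])

# unique node labels and the inverse map back into A
nodes, inv = np.unique(A, return_inverse=True)
rows, cols = inv.reshape(A.shape).T

n = len(nodes)   # 5 nodes, so a 5x5 adjacency matrix
M = sparse.coo_matrix((np.ones(len(rows), int), (rows, cols)),
                      shape=(n, n))
print(M.toarray())
```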

============================

Originally I had:

In [648]: M=sparse.coo_matrix((np.ones(k1.shape,int),(rows,cols)))

This is wrong. The data array should match rows and cols in shape. Here it didn't raise an error because k1 happens to have the same size, but with a different mix of unique values it could raise an error.

====================

This approach assumes the whole dataset, A, can be loaded into memory. unique probably requires similar memory usage. Initially the coo matrix might not increase memory usage, since it reuses the arrays provided as parameters. But any calculations and/or conversion to csr or another format will make further copies.

I can imagine getting around memory issues by loading the dataset in chunks and using some other structure to collect the unique values and the mapping. You might even be able to construct a coo matrix from chunks. But sooner or later you'll hit memory issues; the scikit code will be making one or more copies of that sparse matrix.
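A sketch of that streaming idea, assuming a plain dict assigns indices as new ids appear (so node order is insertion order, rather than the sorted order unique produces):

```python
import numpy as np
from scipy import sparse

# stand-in for a large file read lazily, line by line
lines = ["a b", "b d", "a d", "b c", "d e"]

index = {}                 # node id -> row/col index, built on the fly
rows, cols = [], []
for line in lines:
    u, v = line.split()
    rows.append(index.setdefault(u, len(index)))
    cols.append(index.setdefault(v, len(index)))

n = len(index)
M = sparse.coo_matrix((np.ones(len(rows), int), (rows, cols)),
                      shape=(n, n))
```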