eleanora - 7 months ago 43

Python Question

I have a large file where each line has a pair of 8 character strings. Something like:

`ab1234gh iu9240gh`

on each line.

This file really represents a graph and each string is a node id. I would like to read in the file and directly make a scipy sparse adjacency matrix. I will then run PCA on this matrix using one of the many tools available in python

Is there a neat way to do this or do I need to first make a graph in RAM and then convert that into a sparse matrix? As the file is large I would like to avoid intermediate steps if possible.

Ultimately I will feed the sparse adjacency matrix into http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD .

Answer

I think this is a regular task in `sklearn`

, so there must be some tool in the package that does this, or an answer in other SO questions. We need to add the correct tag.

But just working from my knowledge of `numpy`

and `sparse`

, where what I'd do:

Make a sample 2d array - N rows, 2 columns with character values:

```
In [638]: A=np.array([('a','b'),('b','d'),('a','d'),('b','c'),('d','e')])
In [639]: A
Out[639]:
array([['a', 'b'],
['b', 'd'],
['a', 'd'],
['b', 'c'],
['d', 'e']],
dtype='<U1')
```

Use `np.unique`

to identify the unique strings, and as a bonus a map from those strings to the original array. This is the workhorse of the task.

```
In [640]: k1,k2,k3=np.unique(A,return_inverse=True,return_index=True)
In [641]: k1
Out[641]:
array(['a', 'b', 'c', 'd', 'e'],
dtype='<U1')
In [642]: k2
Out[642]: array([0, 1, 7, 3, 9], dtype=int32)
In [643]: k3
Out[643]: array([0, 1, 1, 3, 0, 3, 1, 2, 3, 4], dtype=int32)
```

I can reshape that `inverse`

array to identify the row and col for each entry in `A`

.

```
In [644]: rows,cols=k3.reshape(A.shape).T
In [645]: rows
Out[645]: array([0, 1, 0, 1, 3], dtype=int32)
In [646]: cols
Out[646]: array([1, 3, 3, 2, 4], dtype=int32)
```

with those it is trivial to construct a sparse matrix that has `1`

at each 'intersection`.

```
In [648]: M=sparse.coo_matrix((np.ones(rows.shape,int),(rows,cols)))
In [649]: M
Out[649]:
<4x5 sparse matrix of type '<class 'numpy.int32'>'
with 5 stored elements in COOrdinate format>
In [650]: M.A
Out[650]:
array([[0, 1, 0, 1, 0],
[0, 0, 1, 1, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1]])
```

the first row, `a`

has values in the 2nd and 4th col, `b`

and `d`

. and so on.

============================

Originally I had:

```
In [648]: M=sparse.coo_matrix((np.ones(k1.shape,int),(rows,cols)))
```

This is wrong. The `data`

array should match `rows`

and `cols`

in shape. Here it didn't raise an error because `k1`

happens to have the same size. But with a different mix unique values could raise an error.

====================

This approach assumes the whole data base, `A`

can be loaded into memory. `unique`

probably requires similar memory usage. Initially a `coo`

matrix might not increase the memory usage, since it will use the arrays provided as parameters. But any calculations and/or conversion to `csr`

or other format will make further copies.

I can imagine getting around memory issues by loading the data base in chunks and using some other structure to get the unique values and mapping. You might even be able to construct a `coo`

matrix from chunks. But sooner or later you'll hit memory issues. The scikit code will be making one or more copies of that sparse matrix.