Luca Massaron - 22 days ago

Python Question

I am working in Python and I have a matrix stored in a text file. The text file is arranged in such a format:

```
row_id, col_id
row_id, col_id
...
row_id, col_id
```

row_id and col_id are integers and they take values from 0 to n (in order to know n for row_id and col_id I have to scan the entire file first).

There's no header. row_ids and col_ids appear multiple times in the file, but each combination row_id, col_id appears only once. There is no explicit value for each combination: each cell value is implicitly 1. The file is almost 1 gigabyte in size.

Unfortunately the file is difficult to handle in memory: there are 2,257,205 row_ids and 122,905 col_ids, for 26,622,704 elements. So I was looking for better ways to handle it. Matrix Market format could be a way to deal with it.

Is there a fast and memory-efficient way to convert this file into a file in Matrix Market format (http://math.nist.gov/MatrixMarket/formats.html#mtx) using Python?

Answer

There is a fast and memory-efficient way of handling such matrices: using the sparse matrices offered by SciPy (which is the de facto standard in Python for this kind of thing).

For a matrix of size `N` by `N`:

```
from scipy.sparse import lil_matrix

result = lil_matrix((N, N))  # To save memory, one may add: dtype=bool, or dtype=numpy.int8
with open('matrix.csv') as input_file:
    for line in input_file:
        x, y = map(int, line.split(',', 1))  # The "1" is only here to speed the splitting up
        result[x, y] = 1
```

(or, in one line instead of two: `result[tuple(map(int, line.split(',', 1)))] = 1`; the `tuple()` call is needed in Python 3, where `map()` returns an iterator rather than a list).

The argument `1` given to `split()` is just there to speed things up when parsing the coordinates: it instructs Python to stop splitting the line once the first (and only) comma is found. This can matter, since you are reading a 1 GB file.
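If element-by-element assignment turns out to be slow, the same matrix can also be built in one shot with `coo_matrix`, which is designed for exactly this kind of bulk construction from coordinate lists. A sketch (the helper name `read_coo` is mine, not from the question):

```python
import numpy as np
from scipy.sparse import coo_matrix

def read_coo(path):
    """Build a sparse matrix from a file of 'row_id, col_id' lines."""
    rows, cols = [], []
    with open(path) as input_file:
        for line in input_file:
            x, y = map(int, line.split(',', 1))
            rows.append(x)
            cols.append(y)
    # Every listed cell holds the value 1, so the data array is just ones;
    # int8 keeps the memory footprint small.
    data = np.ones(len(rows), dtype=np.int8)
    return coo_matrix((data, (rows, cols)))
```

The shape is inferred from the largest indices seen, which also spares you the initial scan of the file to determine n.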

Depending on your needs, you might find one of the other six sparse matrix representations offered by SciPy to be better suited.
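As for the Matrix Market output the question asks about, SciPy can write it directly with `scipy.io.mmwrite`, which accepts any of its sparse matrix types. A minimal sketch (the small `result` matrix and the output name `matrix.mtx` are illustrative):

```python
from scipy.io import mmwrite
from scipy.sparse import lil_matrix

result = lil_matrix((3, 3))
result[0, 1] = 1
result[2, 2] = 1

# Writes the coordinate-format .mtx file described on the
# Matrix Market page linked in the question.
mmwrite('matrix.mtx', result)
```

The resulting file can be read back with `scipy.io.mmread`.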

If you want a faster but also more memory-consuming array, you can use `result = numpy.array(…)` (with NumPy) instead.

Source (Stack Overflow)
