Paul - 7 months ago

Python Question

I am using **Python**, **numpy** and **scikit-learn**. I have data of *keys* and *values* that are stored in an SQL table. I retrieve this as a list of tuples of the form `[(id, value),...]`, i.e. `key: value` pairs. I do this for several samples and collect the resulting lists:

```
dataset = []
for sample in samples:
    listOfTuplePairs = getDataFromSQL(sample)  # get a [(id, value),...] list
    dataset.append(listOfTuplePairs)
```

Keys may be duplicated across different samples, and each row may be of a different length. An example `dataset`:

```
dataset = [[(1, 0.13), (2, 2.05)],
           [(2, 0.23), (4, 7.35), (5, 5.60)],
           [(2, 0.61), (3, 4.45)]]
```

It can be seen that each row is a sample, and that some ids (in this case 2) appear in multiple samples.

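What I want is a single matrix in which the values for each id are aligned in the same column:

```
    ids =    1     2     3     4     5
          ------------------------------
dataset = [(0.13, 2.05, null, null, null),
           (null, 0.23, null, 7.35, 5.60),
           (null, 0.61, 4.45, null, null)]
```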

As you can see, I also wish to strip the ids from the matrix (though I will need to retain a list of them so I know what the values in the matrix relate to). Each initial list of `key: value`

Many, many thanks in advance for any help.

Answer

Here's a NumPy-based approach that builds a SciPy sparse `coo_matrix`, with memory efficiency in focus -

```
import numpy as np
from scipy.sparse import coo_matrix

# Construct row IDs: mark the first element of every sample after the first,
# then cumsum to turn the marks into row numbers (assumes no empty samples)
lens = np.array([len(item) for item in dataset])
shifts_arr = np.zeros(lens.sum(), dtype=int)
shifts_arr[lens[:-1].cumsum()] = 1
row = shifts_arr.cumsum()

# Extract values from dataset into a NumPy array of shape (total_pairs, 2)
arr = np.concatenate(dataset)

# Get the unique column IDs to be used for col-indexing into output array
col = np.unique(arr[:,0], return_inverse=True)[1]

# Determine the output shape
out_shp = (row.max()+1, col.max()+1)

# Finally create a sparse matrix with the row,col indices and col-2 of arr
sp_out = coo_matrix((arr[:,1], (row,col)), shape=out_shp)
```
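The question also asks to keep a list of the ids so the columns can be interpreted, and the first return value of `np.unique` provides exactly that. Below is a minimal sketch of one way to package it. The helper name `build_sparse` is hypothetical, `np.repeat` is swapped in for the cumsum-based row construction (it also copes with empty samples), and the ids are assumed to be integers.

```python
import numpy as np
from scipy.sparse import coo_matrix

def build_sparse(dataset):
    """Return (ids, matrix): sorted unique ids and a samples-by-ids COO matrix."""
    lens = np.array([len(item) for item in dataset])
    # Row index of every (id, value) pair: sample i repeated len(sample i) times
    row = np.repeat(np.arange(len(dataset)), lens)
    # Flatten all pairs into a (total_pairs, 2) float array;
    # the reshape keeps empty samples as (0, 2) so concatenate still works
    arr = np.concatenate(
        [np.asarray(item, dtype=float).reshape(-1, 2) for item in dataset]
    )
    # ids: sorted unique ids; col: position of each pair's id within ids
    ids, col = np.unique(arr[:, 0], return_inverse=True)
    out = coo_matrix((arr[:, 1], (row, col)), shape=(len(dataset), len(ids)))
    # Assumption: ids are integers, so casting back from float is lossless
    return ids.astype(int), out

dataset = [[(1, 0.13), (2, 2.05)],
           [(2, 0.23), (4, 7.35), (5, 5.60)],
           [(2, 0.61), (3, 4.45)]]
ids, sp_out = build_sparse(dataset)
```

Here `ids[j]` is the id whose values sit in column `j` of `sp_out`.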

Please note that if the `IDs` are already the (1-based) column numbers of the output array, you could replace the `np.unique` call, which computes those column indices, with something like this -

```
col = (arr[:,0]-1).astype(int)
```

This skips the sorting that `np.unique` performs, so it should give us a good performance boost!
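For illustration, here is a sketch of the direct-indexing variant run end to end on the question's sample data. It assumes the ids are 1-based integers, and it uses `np.repeat` as an equivalent way to build the row indices. Note that the number of columns is then the largest id, not the number of distinct ids, so unused ids leave empty columns.

```python
import numpy as np
from scipy.sparse import coo_matrix

dataset = [[(1, 0.13), (2, 2.05)],
           [(2, 0.23), (4, 7.35), (5, 5.60)],
           [(2, 0.61), (3, 4.45)]]

lens = np.array([len(item) for item in dataset])
# Row index for each (id, value) pair
row = np.repeat(np.arange(len(dataset)), lens)
# All pairs stacked into a (total_pairs, 2) float array
arr = np.concatenate(dataset)
# Assumption: ids are 1-based integers, so id k maps to column k-1
col = (arr[:, 0] - 1).astype(int)
sp_direct = coo_matrix((arr[:, 1], (row, col)),
                       shape=(len(dataset), col.max() + 1))
```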

Sample run -

```
In [264]: dataset = [[(1, 0.13), (2, 2.05)],
     ...:            [(2, 0.23), (4, 7.35), (5, 5.60)],
     ...:            [(2, 0.61), (3, 4.45)]]

In [265]: sp_out.todense() # Using .todense() to show output
Out[265]:
matrix([[ 0.13,  2.05,  0.  ,  0.  ,  0.  ],
        [ 0.  ,  0.23,  0.  ,  7.35,  5.6 ],
        [ 0.  ,  0.61,  4.45,  0.  ,  0.  ]])
```
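The sparse matrix reports missing entries as `0`, whereas the layout in the question marks them as null. If the data is small enough to hold densely, one possible sketch is to fill a NaN array with the same row and column indices. This is an illustration under that assumption rather than part of the approach above, and it gives up the memory savings of the sparse format.

```python
import numpy as np

dataset = [[(1, 0.13), (2, 2.05)],
           [(2, 0.23), (4, 7.35), (5, 5.60)],
           [(2, 0.61), (3, 4.45)]]

lens = np.array([len(item) for item in dataset])
# Row index for each (id, value) pair
row = np.repeat(np.arange(len(dataset)), lens)
arr = np.concatenate(dataset)
# ids: sorted unique ids; col: column index of each pair
ids, col = np.unique(arr[:, 0], return_inverse=True)

# Start from an all-NaN array so absent (sample, id) pairs stay marked as missing
dense = np.full((len(dataset), len(ids)), np.nan)
dense[row, col] = arr[:, 1]
```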