Paul - 3 months ago
Python Question

How to construct a numpy array from multiple vectors with data aligned by id

I am using Python, numpy and scikit-learn. I have data of keys and values that are stored in an SQL table. I retrieve this as a list of tuples of the form [(id, value), ...]. Each id appears only once in the list, and the tuples are sorted in ascending order of id. This process is repeated several times, so that I end up with multiple lists of key: value pairs:

dataset = []
for sample in samples:
    listOfTuplePairs = getDataFromSQL(sample)  # get a [(id, value), ...] list
    dataset.append(listOfTuplePairs)

Keys may be duplicated across different samples, and each row may be of a different length. An example
might be:

dataset = [[(1, 0.13), (2, 2.05)],
           [(2, 0.23), (4, 7.35), (5, 5.60)],
           [(2, 0.61), (3, 4.45)]]

It can be seen that each row is a sample, and that some ids (in this case 2) appear in multiple samples.

Problem: I wish to construct a single (possibly sparse) numpy array suitable for processing with scikit-learn. The values relating to a specific key (id) for each sample should be aligned in the same 'column' (if that is the correct terminology) such that the matrix of the above example would look as follows:

ids     =     1     2     3     4     5
dataset = [(0.13, 2.05, null, null, null),
           (null, 0.23, null, 7.35, 5.60),
           (null, 0.61, 4.45, null, null)]

As you can see, I also wish to strip the ids from the matrix (though I will need to retain a list of them so I know what the values in the matrix relate to). Each initial list of key: value pairs may contain several thousand rows, and there may be several thousand samples, so the resulting matrix may be very large. Please provide answers that consider speed (within the limits of Python), memory efficiency and code clarity.

Many, many thanks in advance for any help.


Here's a NumPy-based approach that creates a sparse matrix (scipy.sparse.coo_matrix) with memory efficiency in focus -

import numpy as np
from scipy.sparse import coo_matrix

# Construct row IDs
lens = np.array([len(item) for item in dataset])
shifts_arr = np.zeros(lens.sum(),dtype=int)
shifts_arr[lens[:-1].cumsum()] = 1
row = shifts_arr.cumsum()

# Extract values from dataset into a NumPy array
arr = np.concatenate(dataset)

# Get the unique column IDs to be used for col-indexing into output array
col = np.unique(arr[:,0],return_inverse=True)[1]

# Determine the output shape
out_shp = (row.max()+1,col.max()+1)

# Finally create a sparse matrix from the row,col indices and the second column of arr
sp_out = coo_matrix((arr[:,1],(row,col)), shape=out_shp)
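
For the example dataset, the intermediate index arrays can be checked step by step; this is just a sanity-check sketch of the code above:

```python
import numpy as np

dataset = [[(1, 0.13), (2, 2.05)],
           [(2, 0.23), (4, 7.35), (5, 5.60)],
           [(2, 0.61), (3, 4.45)]]

# Row lengths are [2, 3, 2]; marking each row's start position and
# cumulatively summing yields one row index per (id, value) pair
lens = np.array([len(item) for item in dataset])
shifts_arr = np.zeros(lens.sum(), dtype=int)
shifts_arr[lens[:-1].cumsum()] = 1
row = shifts_arr.cumsum()
print(row)            # [0 0 1 1 1 2 2]

# Flattened (id, value) pairs; the first column holds the ids
arr = np.concatenate(dataset)
print(arr[:, 0])      # [1. 2. 2. 4. 5. 2. 3.]

# return_inverse maps each id to its position among the sorted unique ids
col = np.unique(arr[:, 0], return_inverse=True)[1]
print(col)            # [0 1 1 3 4 1 2]
```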

Please note that if the IDs are themselves supposed to be the column numbers in the output array (i.e. they are contiguous integers starting at 1), you could replace the use of np.unique that gives us such unique IDs with something like this -

col = (arr[:,0]-1).astype(int)

This should give us a good performance boost!
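
On the example data the ids happen to be exactly 1..5, so this direct computation matches what np.unique gives; a quick check (the variable names mirror the snippet above):

```python
import numpy as np

dataset = [[(1, 0.13), (2, 2.05)],
           [(2, 0.23), (4, 7.35), (5, 5.60)],
           [(2, 0.61), (3, 4.45)]]
arr = np.concatenate(dataset)

col_unique = np.unique(arr[:, 0], return_inverse=True)[1]
col_direct = (arr[:, 0] - 1).astype(int)

# Identical here because the ids are contiguous and start at 1;
# with gaps (e.g. ids 1, 2, 100) the direct version would allocate
# empty columns, while np.unique compacts them away.
print(np.array_equal(col_unique, col_direct))  # True
```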

Sample run -

In [264]: dataset = [[(1, 0.13), (2, 2.05)],
     ...:            [(2, 0.23), (4, 7.35), (5, 5.60)],
     ...:            [(2, 0.61), (3, 4.45)]]

In [265]: sp_out.todense() # Using .todense() to show output
matrix([[ 0.13,  2.05,  0.  ,  0.  ,  0.  ],
        [ 0.  ,  0.23,  0.  ,  7.35,  5.6 ],
        [ 0.  ,  0.61,  4.45,  0.  ,  0.  ]])
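
Since the question also asks to retain the list of ids, note that the first return value of np.unique is exactly that list, one entry per output column. Most scikit-learn estimators also handle sparse input more efficiently in CSR format, so a conversion at the end may help; a minimal sketch putting it together:

```python
import numpy as np
from scipy.sparse import coo_matrix

dataset = [[(1, 0.13), (2, 2.05)],
           [(2, 0.23), (4, 7.35), (5, 5.60)],
           [(2, 0.61), (3, 4.45)]]

# Same row-index construction as above
lens = np.array([len(item) for item in dataset])
shifts_arr = np.zeros(lens.sum(), dtype=int)
shifts_arr[lens[:-1].cumsum()] = 1
row = shifts_arr.cumsum()

arr = np.concatenate(dataset)

# Keep both outputs of np.unique: the sorted unique ids record
# which original id each column of the matrix corresponds to
col_ids, col = np.unique(arr[:, 0], return_inverse=True)

sp_out = coo_matrix((arr[:, 1], (row, col)),
                    shape=(row.max() + 1, col.max() + 1))

print(col_ids)        # [1. 2. 3. 4. 5.]

# COO is convenient for construction; CSR supports the row slicing
# and arithmetic that scikit-learn estimators typically perform
X = sp_out.tocsr()
print(X.shape)        # (3, 5)
```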