Wilmar van Ommeren - 4 months ago
Python Question

Vectorized numpy 1-d reclassification

I have a large 1-d numpy array containing about 700,000 classes. In addition, I have another, similarly sized array that contains the new values of the classes.

Example arrays

original_classes = np.array([0,1,2,3,4,5,6,7,8,9,10,10])
new_classes = np.array([1,0,1,2,2,10,1,6,6,9,5,12])

Desired output

>>> reclassify_function(original_classes, new_classes)
array([ 1, 1, 1, 1, 1, 12, 1, 1, 1, 9, 12, 12])

The difficulty is that there are multiple class relations.

Original class 1 should get a new value of 0, which means that 0 and 1 are equal classes and all occurrences of these values should be assigned the same new class number. Original class 2 should be classified as 1, which means that class 2 is equal to classes 0 and 1. Original classes 0-2 should thus all be assigned the same new class number, and so on.

As I am working with huge arrays I would like the reclassify function to be vectorized.


You could use scipy.sparse.csgraph.connected_components to relabel your classes. For your example data:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

A = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10])
B = np.array([1, 0, 1, 2, 2, 10, 1, 6, 6, 9, 5, 12])

# Treat each (original, new) pair as an edge between two class nodes.
N = max(A.max(), B.max()) + 1
weights = np.ones(len(A), int)
graph = csr_matrix((weights, (A, B)), shape=(N, N))

# Every connected component is one group of equivalent classes.
n_remaining, mapping = connected_components(graph, directed=False)
print(mapping[A])


[0 0 0 0 0 1 0 0 0 2 1 1]

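These are the relabeled classes: each distinct value marks one group of equivalent classes. The component ids themselves are arbitrary, so you still need to map them back to whichever class number you want to use for each group. Note that for best performance the "original" and "new" classes should together form a single range of consecutive integers without gaps, because the graph gets one node for every integer from 0 up to the largest label.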
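As a rough sketch of both remaining steps, the helper below first compresses the labels to a gap-free range with np.unique and then maps each component back to a label taken from the input data. The function name reclassify is hypothetical, and the rule "use the smallest label in each group" is an assumption made for illustration: the question only requires that linked classes share one number, and its desired output picks different representatives (1, 12 and 9).

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def reclassify(original_classes, new_classes):
    """Give every group of linked classes one shared label (hypothetical helper)."""
    original_classes = np.asarray(original_classes)
    new_classes = np.asarray(new_classes)

    # Compress all labels that occur in either array to 0..n-1, so the
    # graph has no unused nodes even when the raw labels have gaps.
    labels, inverse = np.unique(
        np.concatenate([original_classes, new_classes]), return_inverse=True
    )
    src = inverse[: len(original_classes)]
    dst = inverse[len(original_classes):]

    # One edge per (original, new) pair; duplicate edges are summed,
    # which does not change the connectivity.
    n = len(labels)
    graph = csr_matrix((np.ones(len(src), dtype=int), (src, dst)), shape=(n, n))
    _, component = connected_components(graph, directed=False)

    # Assumed rule: represent each component by its smallest member label.
    # `labels` is sorted, so the first node carrying a given component id
    # is that component's smallest label.
    _, first = np.unique(component, return_index=True)
    representative = labels[first]
    return representative[component[src]]


original_classes = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10])
new_classes = np.array([1, 0, 1, 2, 2, 10, 1, 6, 6, 9, 5, 12])
print(reclassify(original_classes, new_classes))
# → [0 0 0 0 0 5 0 0 0 9 5 5]
```

The grouping is the same as in the output above; only the number chosen for each group differs. If another representative is preferred (for example the largest label per group), only the last three lines of the function would need to change.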