Wilmar van Ommeren - 1 year ago

Python Question

I have a large 1-D NumPy array containing about 700,000 class labels. In addition, I have another similarly sized array which contains the new values of the classes.

**Example arrays**

`original_classes = np.array([0,1,2,3,4,5,6,7,8,9,10,10])`

`new_classes = np.array([1,0,1,2,2,10,1,6,6,9,5,12])`

`>>> reclassify_function(original_classes, new_classes)`

`array([ 1, 1, 1, 1, 1, 12, 1, 1, 1, 9, 12, 12])`

The difficulty is that the class relations chain together (for example, 3 maps to 2 and 2 maps to 1), so related classes have to be merged transitively.

As I am working with huge arrays, I would like the reclassify function to be vectorized.

Answer Source

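You could use `scipy.sparse.csgraph.connected_components` to relabel your classes: treat every (original, new) pair as an edge in a graph, and each connected component becomes one merged class. For your example data: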

```
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

A = np.array([0,1,2,3,4,5,6,7,8,9,10,10])   # original classes
B = np.array([1,0,1,2,2,10,1,6,6,9,5,12])   # new classes
N = max(A.max(), B.max()) + 1               # number of graph nodes
# One edge per (original, new) pair
weights = np.ones(len(A), int)
graph = csr_matrix((weights, (A, B)), shape=(N, N))
# mapping[v] is the component label of class value v
n_remaining, mapping = connected_components(graph, directed=False)
print(mapping[A])
```

Gives:

```
[0 0 0 0 0 1 0 0 0 2 1 1]
```

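These are the relabeled classes: elements that share a label belong to the same merged class. I'm sure you can figure out how to express these in terms of the input data. Note that for best performance, the "original" and "new" classes should together form a single range of consecutive integers without gaps, since the graph gets one node for every integer up to the largest class value.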
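If it helps, here is one possible way to turn the component labels back into class values. Treat it as a sketch rather than the definitive answer: the function name `reclassify` is made up, and the choice of representative (the largest class value in each group) is purely an assumption, so it will not reproduce the exact labels in the question's expected output. It also uses `np.unique` to compact the class values into consecutive integers first, which should cover the gap issue mentioned above.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def reclassify(original, new):
    """Hypothetical helper: merge linked classes and label each group
    with its largest class value (an assumed convention)."""
    # Compact all class values into 0..k-1 so the graph has no unused nodes.
    values, compact = np.unique(np.concatenate([original, new]),
                                return_inverse=True)
    src = compact[:len(original)]   # compact ids of the original classes
    dst = compact[len(original):]   # compact ids of the new classes
    k = len(values)

    # One edge per (original, new) pair; the edge weights are irrelevant.
    weights = np.ones(len(src), dtype=int)
    graph = csr_matrix((weights, (src, dst)), shape=(k, k))
    n_groups, labels = connected_components(graph, directed=False)

    # Representative per group: the largest class value it contains.
    rep = np.full(n_groups, values.min(), dtype=values.dtype)
    np.maximum.at(rep, labels, values)
    return rep[labels[src]]

A = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10])
B = np.array([1, 0, 1, 2, 2, 10, 1, 6, 6, 9, 5, 12])
print(reclassify(A, B))
```

For the example arrays the three groups are {0, 1, 2, 3, 4, 6, 7, 8}, {5, 10, 12} and {9}, so the elements come back labeled 8, 12 and 9 respectively. If you need a different representative (the smallest value, say), swapping `np.maximum.at` for `np.minimum.at` and starting `rep` from `values.max()` should do it.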