rwolst - 7 months ago
Python Question

Map a NumPy array of strings to integers

Problem:

Given an array of string data

dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')


I would like a function that returns the indexed dataset

indexed_dataSet = np.array([0, 1, 2, 0], dtype='int')


and a lookup table

lookupTable = np.array(['kevin', 'greg', 'george'], dtype='U21')


such that

(lookupTable[indexed_dataSet] == dataSet).all()


is true. Note that indexed_dataSet and lookupTable can both be permuted such that the above holds, and that is fine (i.e. the order of lookupTable does not need to match the order of first appearance in dataSet).

Slow Solution:

I currently have the following slow solution

import numpy as np

def indexDataSet(dataSet):
    """Returns the indexed dataSet and a lookup table
    Input:
        dataSet : A length n numpy array to be indexed
    Output:
        indexed_dataSet : A length n numpy array containing values in {0, ..., len(set(dataSet))-1}
        lookupTable : A lookup table such that lookupTable[indexed_dataSet] == dataSet"""
    labels = set(dataSet)
    lookupTable = np.empty(len(labels), dtype='U21')
    indexed_dataSet = np.zeros(dataSet.size, dtype='int')
    # One full pass over dataSet per distinct label, which is what makes this slow
    for count, label in enumerate(labels):
        indexed_dataSet[np.where(dataSet == label)] = count
        lookupTable[count] = label

    return indexed_dataSet, lookupTable


Is there a quicker way to do this? I feel like I am not using numpy to its full potential here.

Answer

You can use np.unique with the return_inverse argument:

>>> lookupTable, indexed_dataSet = np.unique(dataSet, return_inverse=True)
>>> lookupTable
array(['george', 'greg', 'kevin'], 
      dtype='<U21')
>>> indexed_dataSet
array([2, 1, 0, 2])

If you like, you can reconstruct your original array from these two arrays:

>>> lookupTable[indexed_dataSet]
array(['kevin', 'greg', 'george', 'kevin'], 
      dtype='<U21')
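
As a drop-in replacement for the function in the question, the np.unique call can be wrapped as below. This is only a minimal sketch: the name index_dataset is made up for illustration, and it assumes a one-dimensional input array.

```python
import numpy as np


def index_dataset(data):
    """Return (indexed_data, lookup_table) with lookup_table[indexed_data] == data."""
    # np.unique returns the distinct values in sorted order; return_inverse
    # additionally gives, for each element of data, its position in that array.
    lookup_table, indexed_data = np.unique(data, return_inverse=True)
    return indexed_data, lookup_table


dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')
indexed_dataSet, lookupTable = index_dataset(dataSet)

# The round trip recovers the original array.
assert (lookupTable[indexed_dataSet] == dataSet).all()
```

Because np.unique sorts its output, lookupTable comes back in alphabetical order rather than in order of first appearance, which the question explicitly allows.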

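If you use pandas, indexed_dataSet, lookupTable = pd.factorize(dataSet) will achieve the same thing (and potentially be more efficient for large arrays). Note that pd.factorize returns the integer codes first and the unique values second, and that it orders the unique values by first appearance unless you pass sort=True.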
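
A short sketch of the pandas route, assuming pandas is installed. One thing to watch is the return order: pd.factorize gives the integer codes first and the unique values second, the reverse of np.unique with return_inverse=True.

```python
import numpy as np
import pandas as pd

dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')

# factorize returns (codes, uniques): the integer codes come first.
indexed_dataSet, lookupTable = pd.factorize(dataSet)

# Codes follow the order of first appearance unless sort=True is passed.
reconstructed = np.asarray(lookupTable)[indexed_dataSet]
assert reconstructed.tolist() == dataSet.tolist()
```

Here the lookup table follows the order of first appearance in dataSet instead of being sorted, so the codes differ from the np.unique result even though both satisfy the round-trip condition.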