rwolst - 2 years ago 226
Python Question

# Map a NumPy array of strings to integers

Problem:

Given an array of string data

``````
dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')
``````

I would like a function that returns the indexed dataset

``````
indexed_dataSet = np.array([0, 1, 2, 0], dtype='int')
``````

and a lookup table

``````
lookupTable = np.array(['kevin', 'greg', 'george'], dtype='U21')
``````

such that

``````
(lookupTable[indexed_dataSet] == dataSet).all()
``````

is true. Note that `indexed_dataSet` and `lookupTable` can both be permuted such that the above still holds, and that is fine (i.e. the order of `lookupTable` does not need to match the order of first appearance in `dataSet`).

Slow Solution:

I currently have the following slow solution

``````
import numpy as np

def indexDataSet(dataSet):
    """Returns the indexed dataSet and a lookup table
    Input:
        dataSet         : A length n numpy array to be indexed
    Output:
        indexed_dataSet : A length n numpy array containing values in {0, ..., len(set(dataSet))-1}
        lookupTable     : A lookup table such that lookupTable[indexed_dataSet] == dataSet"""
    labels = set(dataSet)
    lookupTable = np.empty(len(labels), dtype='U21')
    indexed_dataSet = np.zeros(dataSet.size, dtype='int')
    # One full pass over dataSet per distinct label, which is what makes this slow
    for count, label in enumerate(labels):
        indexed_dataSet[np.where(dataSet == label)] = count
        lookupTable[count] = label

    return indexed_dataSet, lookupTable
``````

Is there a quicker way to do this? I feel like I am not using numpy to its full potential here.

You can use `np.unique` with the `return_inverse` argument:

``````
>>> lookupTable, indexed_dataSet = np.unique(dataSet, return_inverse=True)
>>> lookupTable
array(['george', 'greg', 'kevin'],
      dtype='<U21')
>>> indexed_dataSet
array([2, 1, 0, 2])
``````

If you like, you can reconstruct your original array from these two arrays:

``````
>>> lookupTable[indexed_dataSet]
array(['kevin', 'greg', 'george', 'kevin'],
      dtype='<U21')
``````
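For reference, here is a minimal sketch of how the `np.unique` call could be wrapped to match the return order of the function in the question. The name `index_dataset` and the snake_case variable names are illustrative and not from the original post.

```python
import numpy as np


def index_dataset(data_set):
    """Return (indexed_data_set, lookup_table) for a 1-D array of labels.

    lookup_table holds the sorted unique labels; indexed_data_set holds,
    for each element of data_set, its position in lookup_table.
    """
    lookup_table, indexed_data_set = np.unique(data_set, return_inverse=True)
    return indexed_data_set, lookup_table


data_set = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')
indexed_data_set, lookup_table = index_dataset(data_set)

# The round trip reproduces the input, which is the property the question asks for.
assert (lookup_table[indexed_data_set] == data_set).all()
```

Because `np.unique` sorts the unique values, the lookup table comes back in sorted order rather than in order of first appearance, which the question explicitly allows.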

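If you use pandas, `indexed_dataSet, lookupTable = pd.factorize(dataSet)` will achieve the same thing (and potentially be more efficient for large arrays). Note that `pd.factorize` returns the integer codes first and the unique values second, and by default it orders the unique values by first appearance rather than sorting them.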
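A short sketch of the pandas route on the same data, assuming pandas is installed; the snake_case variable names are illustrative. `pd.factorize` returns a `(codes, uniques)` pair, so the indexed array comes first.

```python
import numpy as np
import pandas as pd

data_set = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')

# pd.factorize returns (codes, uniques), in that order.
indexed_data_set, lookup_table = pd.factorize(data_set)

print(indexed_data_set)  # integer codes, numbered by order of first appearance
print(lookup_table)      # unique labels, in order of first appearance

# Same round-trip property as in the question, compared as plain Python lists.
assert lookup_table[indexed_data_set].tolist() == data_set.tolist()
```

Passing `sort=True` to `pd.factorize` should give a sorted lookup table instead, matching the ordering that `np.unique` produces.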
