rwolst - 1 year ago 123

Python Question

**Problem:**

Given an array of string data

`dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')`,

I would like a function that returns the indexed dataset

`indexed_dataSet = np.array([0, 1, 2, 0], dtype='int')`

and a lookup table

`lookupTable = np.array(['kevin', 'greg', 'george'], dtype='U21')`

such that

`(lookupTable[indexed_dataSet] == dataSet).all()`

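is true. Note that `indexed_dataSet` and `lookupTable` may both be permuted as long as the condition above still holds; the order of `lookupTable` does not need to match the order of first appearance in `dataSet`.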

I currently have the following slow solution

```
import numpy as np

def indexDataSet(dataSet):
    """Returns the indexed dataSet and a lookup table

    Input:
        dataSet : A length n numpy array to be indexed
    Output:
        indexed_dataSet : A length n numpy array containing values in {0, ..., len(set(dataSet)) - 1}
        lookupTable : A lookup table such that lookupTable[indexed_dataSet] == dataSet
    """
    labels = set(dataSet)
    lookupTable = np.empty(len(labels), dtype='U21')
    indexed_dataSet = np.zeros(dataSet.size, dtype='int')
    count = -1
    for label in labels:
        count += 1
        # One full pass over dataSet per distinct label, which is what makes this slow
        indexed_dataSet[np.where(dataSet == label)] = count
        lookupTable[count] = label
    return indexed_dataSet, lookupTable
```

Is there a quicker way to do this? I feel like I am not using numpy to its full potential here.

Answer Source

You can use `np.unique` with the `return_inverse` argument:

```
>>> lookupTable, indexed_dataSet = np.unique(dataSet, return_inverse=True)
>>> lookupTable
array(['george', 'greg', 'kevin'],
      dtype='<U21')
>>> indexed_dataSet
array([2, 1, 0, 2])
```

If you like, you can reconstruct your original array from these two arrays:

```
>>> lookupTable[indexed_dataSet]
array(['kevin', 'greg', 'george', 'kevin'],
      dtype='<U21')
```
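The lookup table returned by `np.unique` is sorted, so the codes differ from the `[0, 1, 2, 0]` shown in the question (which the question says is fine). If first-appearance order is wanted anyway, one possible sketch is to also request `return_index` and relabel the codes. The function name `index_dataset_first_seen` and its local variable names are made up for this example.

```python
import numpy as np

def index_dataset_first_seen(data):
    """Return (codes, lookup) with lookup in order of first appearance."""
    # np.unique gives the sorted unique values, the index of each value's
    # first occurrence, and the inverse mapping into the sorted values.
    uniques, first_idx, inverse = np.unique(
        data, return_index=True, return_inverse=True
    )
    # Order the sorted unique values by where they first appear in `data`.
    order = np.argsort(first_idx)
    lookup = uniques[order]
    # remap[i] is the new code for the i-th sorted unique value.
    remap = np.empty_like(order)
    remap[order] = np.arange(order.size)
    codes = remap[inverse]
    return codes, lookup

dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')
codes, lookup = index_dataset_first_seen(dataSet)
print(codes)   # [0 1 2 0]
print(lookup)  # ['kevin' 'greg' 'george']
```

This still sorts internally, so it costs about the same as the plain `np.unique` call plus a small relabelling step over the unique values.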

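If you use pandas, `indexed_dataSet, lookupTable = pd.factorize(dataSet)` will achieve the same thing (and potentially be more efficient for large arrays, since it does not need to sort the data). Note that `pd.factorize` returns the codes first and the unique values second, and by default the unique values are in order of first appearance rather than sorted.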
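One detail worth checking with `pd.factorize` is the order of its return values: the integer codes come first and the unique values second, the opposite of `np.unique(..., return_inverse=True)`. A short sketch, assuming pandas is installed, on the same data as in the question:

```python
import numpy as np
import pandas as pd

dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')

# factorize returns (codes, uniques); uniques are in order of first appearance.
indexed_dataSet, lookupTable = pd.factorize(dataSet)
print(indexed_dataSet)  # [0 1 2 0]
print(lookupTable)      # ['kevin' 'greg' 'george']

# sort=True gives a sorted lookup table, matching the np.unique result.
sorted_codes, sorted_lookup = pd.factorize(dataSet, sort=True)
print(sorted_codes)     # [2 1 0 2]

# The round-trip condition from the question holds in both cases.
assert (lookupTable[indexed_dataSet] == dataSet).all()
assert (sorted_lookup[sorted_codes] == dataSet).all()
```

Whether this is actually faster than `np.unique` depends on the array size and dtype, so it is worth timing on your own data.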