shenglih - 10 months ago

Python Question

Let's say I have a huge pandas DataFrame/NumPy array where each element is a list of ordered values:

```
sequences = np.array([[12431253, 123412531, 12341234, 12431253, 145345],
                      [5463456, 1244562, 23452],
                      [243524, 141234, 12431253, 456367],
                      [456345, 253451],
                      [75635, 14145, 12346, 12431253]], dtype=object)
```

or,

```
sequences = pd.DataFrame({'sequence': [[12431253, 123412531, 12341234, 12431253, 145345],
                                       [5463456, 1244562, 23452],
                                       [243524, 141234, 456367, 12431253],
                                       [456345, 253451],
                                       [75635, 14145, 12346, 12431253]]})
```

and I want to replace them with another set of identifiers that start from 0, so I design a mapping like this:

```
# compiler.ast.flatten and sets.Set are Python 2 only; use itertools.chain
# and the built-in set instead
from itertools import chain

unique_vals = set(chain.from_iterable(sequences['sequence']))
mapping = pd.DataFrame({'v0': list(unique_vals), 'v1': range(len(unique_vals))})
```

......

So the result I was looking for is:

```
sequences = np.array([[1, 2, 3, 1, 4], [5, 6, 7], [8, 9, 10, 1],
                      [11, 12], [13, 14, 15, 1]], dtype=object)
```

How can I scale this up to a huge DataFrame/NumPy array of sequences?

Thanks so much for any guidance! Greatly appreciated!

Answer Source

Here's an approach that flattens into a `1D` array, uses `np.unique` to assign unique IDs to each element, and then splits back into a list of arrays -

```
lens = np.array([len(s) for s in sequences])  # map(len, ...) returns an iterator in Python 3
seq_arr = np.concatenate(sequences)
ids = np.unique(seq_arr, return_inverse=True)[1]
out = np.split(ids, lens[:-1].cumsum())
```
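If the order of the IDs matters (the question numbers values by first appearance rather than by sorted order), `pandas.factorize` is a possible drop-in for the `np.unique` step; it assigns IDs starting from 0 in order of first appearance. A sketch of that variant, using the question's sample data as a plain list of lists:

```python
import numpy as np
import pandas as pd

sequences = [
    [12431253, 123412531, 12341234, 12431253, 145345],
    [5463456, 1244562, 23452],
    [243524, 141234, 12431253, 456367],
    [456345, 253451],
    [75635, 14145, 12346, 12431253],
]

# Flatten, factorize (IDs assigned in order of first appearance, starting at 0),
# then split back at the original list boundaries.
flat = np.concatenate(sequences)
ids, uniques = pd.factorize(flat)
lens = np.fromiter((len(s) for s in sequences), dtype=int)
out = np.split(ids, lens[:-1].cumsum())
```

Here `out[0]` comes back as `[0, 1, 2, 0, 3]` since the repeated `12431253` keeps the ID it got at its first appearance.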

Sample run -

```
In [391]: sequences = np.array([[12431253, 123412531, 12341234, 12431253, 145345],
     ...:                       [5463456, 1244562, 23452],
     ...:                       [243524, 141234, 12431253, 456367],
     ...:                       [456345, 12431253],
     ...:                       [75635, 14145, 12346, 12431253]])

In [392]: out
Out[392]:
[array([12, 13, 11, 12,  5]),
 array([10,  9,  2]),
 array([ 6,  4, 12,  8]),
 array([ 7, 12]),
 array([ 3,  1,  0, 12])]

In [393]: np.array([list(o) for o in out])  # if you need a NumPy array as final output
Out[393]:
array([[12, 13, 11, 12, 5], [10, 9, 2], [6, 4, 12, 8], [7, 12],
       [3, 1, 0, 12]], dtype=object)
```
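The same IDs can also be written back into the DataFrame version from the question. A minimal sketch, assuming the column is named `'sequence'` as in the question and that sorted-order IDs (as produced by `np.unique` above) are acceptable; the `sequence_id` column name is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'sequence': [
    [12431253, 123412531, 12341234, 12431253, 145345],
    [5463456, 1244562, 23452],
    [243524, 141234, 456367, 12431253],
    [456345, 253451],
    [75635, 14145, 12346, 12431253],
]})

# Build a value -> id lookup from the sorted unique values once,
# then map every list through it.
flat = np.concatenate(df['sequence'].to_list())
lookup = {v: i for i, v in enumerate(np.unique(flat))}
df['sequence_id'] = df['sequence'].apply(lambda seq: [lookup[v] for v in seq])
```

Building the dict once keeps the per-row work to a plain O(1) lookup per element, which scales linearly with the total number of values.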