KrunalParmar - 2 months ago 6
Python Question

Take unique rows of a NumPy array according to two column values

I have a NumPy array in Python with two columns, as follows:

time,id
1,a
2,b
3,a
1,a
5,c
6,b
3,a


I want to keep each (time, id) pair only once, i.e. the unique times of each user.
For the data above, I want the output below.

time,id
1,a
2,b
3,a
5,c
6,b


That is, I want to keep only the unique rows, so 1,a and 3,a should not be repeated in the result.
Both columns have a string datatype, and the 2-D array is very large.
One solution would be to iterate over all the rows and build a set, but that will be very slow. Please suggest an efficient way to implement this.

Answer

Given:

>>> b
[['1' 'a']
 ['2' 'b']
 ['3' 'a']
 ['1' 'a']
 ['5' 'c']
 ['6' 'b']
 ['3' 'a']]

You can do:

>>> np.vstack(list({tuple(e) for e in b}))  # recent NumPy versions require a sequence, not a set
[['3' 'a']
 ['1' 'a']
 ['2' 'b']
 ['6' 'b']
 ['5' 'c']]

Since that is a set comprehension, you lose the order of the original.
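If the set-based approach is acceptable but the original order matters, a dict can stand in for the set. This is only a sketch, assuming Python 3.7+ (where dicts keep insertion order); the name `unique_rows` is mine, not from the question:

```python
import numpy as np

b = np.array([['1', 'a'], ['2', 'b'], ['3', 'a'], ['1', 'a'],
              ['5', 'c'], ['6', 'b'], ['3', 'a']])

# dict keys are unique and keep the order in which they were first inserted,
# so duplicates are dropped while the first occurrence of each row stays in place
unique_rows = np.array(list(dict.fromkeys(map(tuple, b))))
print(unique_rows)
```

This still loops over the rows in Python, so it will likely be slower than the vectorized options on a very large array.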

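Or, to maintain order, view each row as a single item, take the index of each row's first occurrence, and sort those indices (without the sort, the result comes back ordered by row contents rather than by original position):

>>> c = np.ascontiguousarray(b).view(np.dtype((np.void, b.dtype.itemsize * b.shape[1])))
>>> b[np.sort(np.unique(c, return_index=True)[1])]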
[['1' 'a']
 ['2' 'b']
 ['3' 'a']
 ['5' 'c']
 ['6' 'b']]
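On NumPy 1.13 and later, `np.unique` accepts an `axis` argument, which should make the manual void view unnecessary. A minimal sketch, assuming that version is available (the names `first_idx` and `unique_rows` are illustrative):

```python
import numpy as np

b = np.array([['1', 'a'], ['2', 'b'], ['3', 'a'], ['1', 'a'],
              ['5', 'c'], ['6', 'b'], ['3', 'a']])

# axis=0 treats each row as one item; return_index gives the position
# of the first occurrence of every unique row
_, first_idx = np.unique(b, axis=0, return_index=True)

# np.unique returns rows sorted by content, so sort the indices
# to get the rows back in their original order
unique_rows = b[np.sort(first_idx)]
print(unique_rows)
```

If the original order does not matter, `np.unique(b, axis=0)` alone is enough.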

Or, if you can use Pandas, this is really easy. Given the following DataFrame:

>>> df
  id  time
0  a     1
1  b     2
2  a     3
3  a     1
4  c     5
5  b     6
6  a     3

Just use drop_duplicates():

>>> df.drop_duplicates()
  id  time
0  a     1
1  b     2
2  a     3
4  c     5
5  b     6
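To go from the NumPy array to pandas and back, something like the following should work. It is a sketch that assumes pandas 0.24+ (for `to_numpy()`); the column names are taken from the question, and the columns appear in the array's order (time, id) rather than the order printed above:

```python
import numpy as np
import pandas as pd

b = np.array([['1', 'a'], ['2', 'b'], ['3', 'a'], ['1', 'a'],
              ['5', 'c'], ['6', 'b'], ['3', 'a']])

# Wrap the array in a DataFrame, naming the columns as in the question
df = pd.DataFrame(b, columns=['time', 'id'])

# drop_duplicates() keeps the first occurrence of each row by default
# and preserves the original row order
unique_rows = df.drop_duplicates().to_numpy()
print(unique_rows)
```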