zephyr zephyr - 7 months ago 107
Python Question

Get non-duplicate rows from numpy array

Let's say I have a numpy array of the form

x = np.array([[2, 5],
[3, 4],
[1, 3],
[2, 5],
[4, 5],
[1, 3],
[1, 4],
[3, 4]])


What I would like to get from this is an array which contains only the rows which are NOT duplicates, i.e., I expect from this example

array([[4, 5],
[1, 4]])


I'm looking for a method which is reasonably fast and scales well. The only way that I can think to do this is


  1. First find the set of unique rows in
    x
    , as a new array
    y
    .

  2. Create a new array
    z
    which has those individual elements of
    y
    removed from
    x
    , thus
    z
    is a list of the duplicated rows in
    x
    .

  3. Do a set difference between
    x
    and
    z
    .



This seems horribly inefficient though. Anyone have a better way?

If it is important, I'm guaranteed that each of my rows will be sorted smallest to largest so that you'll never have a row be
[5, 2]
or
[3, 1]
.

Answer

Approach #1

Here's an approach based on np.unique and considering each row as an indexing tuple for efficiency (assuming that the input array has integers) -

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Get counts and unique indices
_,idx,count = np.unique(lid,return_index=True,return_counts=True)

# See which counts are exactly 1 and select the corresponding unique indices 
# and thus the correspnding rows from input as the final output
out = x[idx[count==1]]

Note: If there is a huge number of columns in the input array, you might want to get the linear indices lid manually, for which you can use np.cumprod, like so -

lid = x.dot(np.append(1,(x.max(0)+1)[::-1][:-1].cumprod())[::-1])

Approach #2

Here's an alternative one that offloads the counting task to np.bincount, which might be more efficient for such purposes -

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Get unique indices and tagged indices for all elements
_,unq_idx,tag_idx = np.unique(lid,return_index=True,return_inverse=True)

# Use the tagged indices to count and look for count==1 and repeat like before
out = x[unq_idx[np.bincount(tag_idx)==1]]

Approach #3

Here's a different approach using convolution to catch such a pattern. Let the inlined comments help out to understand the underlying idea. Here goes -

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Store sorted indices for lid
sidx = lid.argsort()

# Append 1s at either ends of sorted and differentiated version of lid
mask = np.hstack((True,np.diff(lid[sidx])!=0,True))

# Perform convolution on it. Thus non duplicate elements would have
# consecutive two True elements, which could be caught with convolution
# kernel of [1,1]. Get the corresponding mask. 
# Index into sorted indices with it for final output
out = x[sidx[(np.convolve(mask,[1,1])>1)[1:-1]]]
Comments