zephyr - 2 years ago
Python Question

# Get non-duplicate rows from numpy array

Let's say I have a numpy array of the form

``````x = np.array([[2, 5],
              [3, 4],
              [1, 3],
              [2, 5],
              [4, 5],
              [1, 3],
              [1, 4],
              [3, 4]])
``````

What I would like to get from this is an array which contains only the rows which are NOT duplicates, i.e., I expect from this example

``````array([[4, 5],
       [1, 4]])
``````

I'm looking for a method which is reasonably fast and scales well. The only way I can think of to do this is:

1. First find the set of unique rows in `x`, as a new array `y`.

2. Create a new array `z` which has those individual elements of `y` removed from `x`, so that `z` is a list of the duplicated rows in `x`.

3. Do a set difference between `x` and `z`.

This seems horribly inefficient though. Anyone have a better way?
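(For reference, the expected behaviour can be pinned down with a simple, non-vectorized baseline that counts rows as tuples with `collections.Counter`; this is just to make the desired output unambiguous, not a fast solution.)

```python
import numpy as np
from collections import Counter

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

# Count each row (converted to a hashable tuple), then keep the rows
# whose count is exactly 1, preserving their original order
counts = Counter(map(tuple, x))
out = np.array([row for row in x if counts[tuple(row)] == 1])
print(out)  # [[4 5]
            #  [1 4]]
```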

If it is important, I'm guaranteed that each of my rows will be sorted smallest to largest so that you'll never have a row be
`[5, 2]`
or
`[3, 1]`
.

Approach #1

Here's an approach based on `np.unique` that treats each row as an indexing tuple for efficiency (assuming the input array contains non-negative integers) -

``````# Consider each row as indexing tuple & get linear indexing value
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Get counts and unique indices
_,idx,count = np.unique(lid,return_index=True,return_counts=True)

# See which counts are exactly 1 and select the corresponding unique indices
# and thus the corresponding rows from input as the final output
out = x[idx[count==1]]
``````
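Running Approach #1 end-to-end on the question's sample input (note the surviving rows come out in the order of their sorted linear indices, not their original order):

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

# Map each row to a single linear index into an (x.max(0)+1)-shaped grid
lid = np.ravel_multi_index(x.T, x.max(0) + 1)

# idx holds the first occurrence of each unique linear index;
# count holds how many times each one appears
_, idx, count = np.unique(lid, return_index=True, return_counts=True)

out = x[idx[count == 1]]
print(out)  # [[1 4]
            #  [4 5]]
```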

Note: If there is a huge number of columns in the input array, you might want to get the linear indices `lid` manually, for which you can use `np.cumprod`, like so -

``````lid = x.dot(np.append(1,(x.max(0)+1)[::-1][:-1].cumprod())[::-1])
``````
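As a quick sanity check, the dot-product version reproduces `np.ravel_multi_index` exactly: it builds the row-major strides, where each column's stride is the product of the extents of the columns to its right (the sample array here is my own, just for illustration):

```python
import numpy as np

x = np.array([[2, 5, 1], [3, 4, 0], [1, 3, 2], [2, 5, 1]])
dims = x.max(0) + 1

# Row-major strides: [prod(dims[1:]), ..., dims[-1], 1]
strides = np.append(1, dims[::-1][:-1].cumprod())[::-1]
lid_manual = x.dot(strides)
lid_ref = np.ravel_multi_index(x.T, dims)
print(np.array_equal(lid_manual, lid_ref))  # True
```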

Approach #2

Here's an alternative one that offloads the counting task to `np.bincount`, which might be more efficient for such purposes -

``````# Consider each row as indexing tuple & get linear indexing value
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Get unique indices and tagged indices for all elements
_,unq_idx,tag_idx = np.unique(lid,return_index=True,return_inverse=True)

# Use the tagged indices to count and look for count==1 and repeat like before
out = x[unq_idx[np.bincount(tag_idx)==1]]
``````
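The same sample input through Approach #2 gives the same result; `np.bincount` on the inverse mapping counts how many input rows collapse onto each unique linear index:

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

lid = np.ravel_multi_index(x.T, x.max(0) + 1)

# tag_idx maps every element of lid back to its unique value's position
_, unq_idx, tag_idx = np.unique(lid, return_index=True, return_inverse=True)

# Occurrence count per unique value; keep only the singletons
out = x[unq_idx[np.bincount(tag_idx) == 1]]
print(out)  # [[1 4]
            #  [4 5]]
```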

Approach #3

Here's a different approach that uses convolution to catch such a pattern. The inlined comments should help explain the underlying idea. Here goes -

``````# Consider each row as indexing tuple & get linear indexing value
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Store sorted indices for lid
sidx = lid.argsort()

# Append 1s at either ends of sorted and differentiated version of lid
df = np.concatenate(([1], np.diff(lid[sidx]), [1]))

# An element that differs from both of its sorted neighbours occurs exactly
# once; convolving the nonzero-mask with [1,1] finds where both flags are set
out = x[sidx[np.convolve((df != 0).astype(int), [1, 1], 'valid') == 2]]
``````