zephyr - 9 months ago 139

Python Question

Let's say I have a numpy array of the form

`x = np.array([[2, 5],`

[3, 4],

[1, 3],

[2, 5],

[4, 5],

[1, 3],

[1, 4],

[3, 4]])

What I would like to get from this is an array which contains only the rows which are NOT duplicates, i.e., I expect from this example

`array([[4, 5],`

[1, 4]])

I'm looking for a method which is reasonably fast and scales well. The only way that I can think to do this is

- First find the set of unique rows in , as a new array
`x`

.`y`

- Create a new array which has those individual elements of
`z`

removed from`y`

, thus`x`

is a list of the duplicated rows in`z`

.`x`

- Do a set difference between and
`x`

.`z`

This seems horribly inefficient though. Anyone have a better way?

`[5, 2]`

`[3, 1]`

Answer

**Approach #1**

Here's an approach based on `np.unique`

and considering each row as an indexing tuple for efficiency (assuming that the input array has integers) -

```
# Consider each row as indexing tuple & get linear indexing value
lid = np.ravel_multi_index(x.T,x.max(0)+1)
# Get counts and unique indices
_,idx,count = np.unique(lid,return_index=True,return_counts=True)
# See which counts are exactly 1 and select the corresponding unique indices
# and thus the correspnding rows from input as the final output
out = x[idx[count==1]]
```

**Note:** If there is a huge number of columns in the input array, you might want to get the linear indices `lid`

manually, for which you can use `np.cumprod`

, like so -

```
lid = x.dot(np.append(1,(x.max(0)+1)[::-1][:-1].cumprod())[::-1])
```

**Approach #2**

Here's an alternative one that offloads the counting task to `np.bincount`

, which might be more efficient for such purposes -

```
# Consider each row as indexing tuple & get linear indexing value
lid = np.ravel_multi_index(x.T,x.max(0)+1)
# Get unique indices and tagged indices for all elements
_,unq_idx,tag_idx = np.unique(lid,return_index=True,return_inverse=True)
# Use the tagged indices to count and look for count==1 and repeat like before
out = x[unq_idx[np.bincount(tag_idx)==1]]
```

**Approach #3**

Here's a different approach using `convolution`

to catch such a pattern. Let the inlined comments help out to understand the underlying idea. Here goes -

```
# Consider each row as indexing tuple & get linear indexing value
lid = np.ravel_multi_index(x.T,x.max(0)+1)
# Store sorted indices for lid
sidx = lid.argsort()
# Append 1s at either ends of sorted and differentiated version of lid
mask = np.hstack((True,np.diff(lid[sidx])!=0,True))
# Perform convolution on it. Thus non duplicate elements would have
# consecutive two True elements, which could be caught with convolution
# kernel of [1,1]. Get the corresponding mask.
# Index into sorted indices with it for final output
out = x[sidx[(np.convolve(mask,[1,1])>1)[1:-1]]]
```