Jee Seok Yoon - 10 months ago 51

Python Question

Say I have these 2D arrays A and B.

How can I remove the elements of A that are also in B? (In set-theory terms, the complement A-B.)

`A=np.asarray([[1,1,1], [1,1,2], [1,1,3], [1,1,4]])`

`B=np.asarray([[0,0,0], [1,0,2], [1,0,3], [1,0,4], [1,1,0], [1,1,1], [1,1,4]])`

`#output = [[1,1,2], [1,1,3]]`

To be more precise, I would like to do something like this.

`data = some numpy array`

`label = some numpy array`

`A = np.argwhere(label==0) #[[1 1 1], [1 1 2], [1 1 3], [1 1 4]]`

`B = np.argwhere(data>1.5) #[[0 0 0], [1 0 2], [1 0 3], [1 0 4], [1 1 0], [1 1 1], [1 1 4]]`

`out = np.argwhere(label==0 and data>1.5) #[[1 1 2], [1 1 3]]`

Answer Source

Based on this solution to "Find the row indexes of several values in a numpy array", here's a NumPy-based solution with a smaller memory footprint, which could be beneficial when working with large arrays -

```
dims = np.maximum(B.max(0),A.max(0))+1
out = A[~np.in1d(np.ravel_multi_index(A.T,dims),np.ravel_multi_index(B.T,dims))]
```
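To see why this works, it may help to look at the intermediate values. The sketch below runs the same steps on the sample arrays from the question; `ids_A` and `ids_B` are names introduced here for illustration, and `np.isin` is used as the newer spelling of `np.in1d`. `np.ravel_multi_index` maps each row to the flat index it would have in a grid of shape `dims`, so two rows are equal exactly when their flat indices are equal.

```python
import numpy as np

A = np.asarray([[1, 1, 1], [1, 1, 2], [1, 1, 3], [1, 1, 4]])
B = np.asarray([[0, 0, 0], [1, 0, 2], [1, 0, 3], [1, 0, 4],
                [1, 1, 0], [1, 1, 1], [1, 1, 4]])

# Smallest grid shape that can hold every row of A and B as an index.
dims = np.maximum(B.max(0), A.max(0)) + 1      # [2, 2, 5]

# One flat integer per row: i*10 + j*5 + k for this grid shape.
ids_A = np.ravel_multi_index(A.T, dims)        # [16, 17, 18, 19]
ids_B = np.ravel_multi_index(B.T, dims)        # [0, 12, 13, 14, 15, 16, 19]

# Keep the rows of A whose flat index does not occur among B's.
out = A[~np.isin(ids_A, ids_B)]
print(out)                                     # [[1 1 2], [1 1 3]]
```

Note that this relies on the entries being non-negative integers, which holds for `np.argwhere` output.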

Sample run -

```
In [38]: A
Out[38]:
array([[1, 1, 1],
       [1, 1, 2],
       [1, 1, 3],
       [1, 1, 4]])

In [39]: B
Out[39]:
array([[0, 0, 0],
       [1, 0, 2],
       [1, 0, 3],
       [1, 0, 4],
       [1, 1, 0],
       [1, 1, 1],
       [1, 1, 4]])

In [40]: out
Out[40]:
array([[1, 1, 2],
       [1, 1, 3]])
```

Runtime test on large arrays -

```
In [107]: def in1d_approach(A,B):
     ...:     dims = np.maximum(B.max(0),A.max(0))+1
     ...:     return A[~np.in1d(np.ravel_multi_index(A.T,dims),\
     ...:                     np.ravel_multi_index(B.T,dims))]
     ...:

In [108]: # Setup arrays with B as large array and A contains some of B's rows
     ...: B = np.random.randint(0,9,(1000,3))
     ...: A = np.random.randint(0,9,(100,3))
     ...: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)
     ...: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)
     ...: A[A_idx] = B[B_idx]
     ...:
```

Timings with broadcasting-based solutions -

```
In [109]: %timeit A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
100 loops, best of 3: 4.64 ms per loop # @Kasramvd's soln
In [110]: %timeit A[~((A[:,None,:] == B).all(-1)).any(1)]
100 loops, best of 3: 3.66 ms per loop
```
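The broadcasting versions compare every row of A against every row of B, so they build an intermediate array with `len(A) * len(B) * ncols` entries, which is where the extra memory goes. A small sketch of the second expression on the sample arrays from the question, with `eq` and `row_in_B` as illustrative names:

```python
import numpy as np

A = np.asarray([[1, 1, 1], [1, 1, 2], [1, 1, 3], [1, 1, 4]])
B = np.asarray([[0, 0, 0], [1, 0, 2], [1, 0, 3], [1, 0, 4],
                [1, 1, 0], [1, 1, 1], [1, 1, 4]])

# Elementwise comparison of every row of A with every row of B.
eq = A[:, None, :] == B
print(eq.shape)                      # (4, 7, 3)

# A row of A is "in B" if all of its columns match some row of B.
row_in_B = eq.all(-1).any(1)         # [True, False, False, True]

out = A[~row_in_B]
print(out)                           # [[1 1 2], [1 1 3]]
```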

Timing with the lower-memory-footprint solution -

```
In [111]: %timeit in1d_approach(A,B)
1000 loops, best of 3: 231 µs per loop
```

**Further performance boost**

`in1d_approach` reduces each row to a single number by treating the row as an indexing tuple. We can do the same a bit more efficiently with matrix multiplication via `np.dot`, like so -

```
def in1d_dot_approach(A,B):
    cumdims = (np.maximum(A.max(),B.max())+1)**np.arange(B.shape[1])
    return A[~np.in1d(A.dot(cumdims),B.dot(cumdims))]
```
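As a rough illustration of what the dot product computes, here are the same steps on the sample arrays from the question. The names `base`, `keys_A` and `keys_B` are introduced here only for readability, and `np.isin` stands in for `np.in1d`. Each row is read as a number written in base `max + 1`, so distinct rows of non-negative integers get distinct keys.

```python
import numpy as np

A = np.asarray([[1, 1, 1], [1, 1, 2], [1, 1, 3], [1, 1, 4]])
B = np.asarray([[0, 0, 0], [1, 0, 2], [1, 0, 3], [1, 0, 4],
                [1, 1, 0], [1, 1, 1], [1, 1, 4]])

# One more than the largest entry, so every entry is a valid "digit".
base = np.maximum(A.max(), B.max()) + 1        # 5
cumdims = base ** np.arange(B.shape[1])        # [1, 5, 25]

# Each row becomes a single integer key: a*1 + b*5 + c*25.
keys_A = A.dot(cumdims)                        # [31, 56, 81, 106]
keys_B = B.dot(cumdims)                        # [0, 51, 76, 101, 6, 31, 106]

out = A[~np.isin(keys_A, keys_B)]
print(out)                                     # [[1 1 2], [1 1 3]]
```

One caveat worth keeping in mind: with many columns or large values, `base ** ncols` can exceed the integer range and wrap around, in which case the keys would no longer be unique.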

Let's test it against `in1d_approach` on much larger arrays -

```
In [251]: # Setup arrays with B as large array and A contains some of B's rows
     ...: B = np.random.randint(0,9,(10000,3))
     ...: A = np.random.randint(0,9,(1000,3))
     ...: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)
     ...: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)
     ...: A[A_idx] = B[B_idx]
     ...:

In [252]: %timeit in1d_approach(A,B)
1000 loops, best of 3: 1.28 ms per loop

In [253]: %timeit in1d_dot_approach(A, B)
1000 loops, best of 3: 1.2 ms per loop
```
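For the more specific case described in the question, where A and B both come from `np.argwhere` on arrays of the same shape, it may be possible to skip the row matching entirely and combine the boolean masks before calling `np.argwhere`. A minimal sketch of that idea; the `label` and `data` arrays below are hypothetical, built only so that they reproduce the sample A and B:

```python
import numpy as np

# Hypothetical inputs, constructed to reproduce the sample A and B.
label = np.ones((2, 2, 5), dtype=int)
label[1, 1, 1:] = 0                  # zeros at [1,1,1] .. [1,1,4]

data = np.zeros((2, 2, 5))
data[0, 0, 0] = 2.0
data[1, 0, 2:] = 2.0
data[1, 1, [0, 1, 4]] = 2.0          # values above 1.5 at the rows of B

A = np.argwhere(label == 0)
B = np.argwhere(data > 1.5)

# Positions where label is 0 and data is NOT above 1.5, i.e. A minus B.
out = np.argwhere((label == 0) & ~(data > 1.5))
print(out)                           # [[1 1 2], [1 1 3]]
```

Note the elementwise `&` and `~` operators: Python's `and` does not work on arrays with more than one element.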