Thomas - 2 months ago
Python Question

# Delete repeated columns of array keeping the order

Is there a relatively simple way of removing repeated columns of a (NumPy) array while keeping the order of the remaining columns?

As an example, consider this array:

``````
a = np.array([[2, 1, 1, 3],
              [2, 1, 1, 3]])
``````

where I would like column three to be removed such that:

``````
a = np.array([[2, 1, 3],
              [2, 1, 3]])
``````

Approach #1 Here's an approach using `broadcasting` -

``````
a[:,~np.triu((a[:,None,:] == a[...,None]).all(0),1).any(0)]
``````

Sample run -

``````
In [115]: a
Out[115]:
array([[2, 1, 3, 5, 1, 3, 7],
       [6, 5, 4, 6, 5, 4, 8]])

In [116]: a[:,~np.triu((a[:,None,:] == a[...,None]).all(0),1).any(0)]
Out[116]:
array([[2, 1, 3, 5, 7],
       [6, 5, 4, 6, 8]])
``````
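As a quick sanity check, the same expression can be applied to the array from the question. This is a minimal sketch; the names `mask` and `result` are only illustrative and do not appear in the answer.

```python
import numpy as np

# The array from the question: columns 1 and 2 (0-based) are identical.
a = np.array([[2, 1, 1, 3],
              [2, 1, 1, 3]])

# mask[j] is True when column j equals some earlier column.
mask = np.triu((a[:, None, :] == a[..., None]).all(0), 1).any(0)

# Keep only the columns that are not repeats of an earlier one.
result = a[:, ~mask]
print(result)
```

Only the second of the two identical columns is dropped, so the surviving columns keep their original order.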

Explanation

1) Input array -

``````
In [156]: a
Out[156]:
array([[2, 1, 3, 5, 1, 3, 7],
       [6, 5, 4, 6, 5, 4, 8]])
``````

2) Use broadcasting to perform an elementwise equality comparison that keeps the first axis (the rows of the original 2D array) aligned, so that within each row every column is compared against every other column -

``````
In [157]: a[:,None,:] == a[...,None]
Out[157]:
array([[[ True, False, False, False, False, False, False],
        [False,  True, False, False,  True, False, False],
        [False, False,  True, False, False,  True, False],
        [False, False, False,  True, False, False, False],
        [False,  True, False, False,  True, False, False],
        [False, False,  True, False, False,  True, False],
        [False, False, False, False, False, False,  True]],

       [[ True, False, False,  True, False, False, False],
        [False,  True, False, False,  True, False, False],
        [False, False,  True, False, False,  True, False],
        [ True, False, False,  True, False, False, False],
        [False,  True, False, False,  True, False, False],
        [False, False,  True, False, False,  True, False],
        [False, False, False, False, False, False,  True]]], dtype=bool)
``````

3) Since we are looking for duplicate columns, let's look for ALL matches along the first axis, i.e. pairs of columns that agree in every row -

``````
In [158]: (a[:,None,:] == a[...,None]).all(0)
Out[158]:
array([[ True, False, False, False, False, False, False],
       [False,  True, False, False,  True, False, False],
       [False, False,  True, False, False,  True, False],
       [False, False, False,  True, False, False, False],
       [False,  True, False, False,  True, False, False],
       [False, False,  True, False, False,  True, False],
       [False, False, False, False, False, False,  True]], dtype=bool)
``````

4) We are looking to keep the first occurrence only, so we can use an upper triangular matrix to set all diagonal and lower triangular elements to `False` -

``````
In [163]: np.triu((a[:,None,:] == a[...,None]).all(0),1)
Out[163]:
array([[False, False, False, False, False, False, False],
       [False, False, False, False,  True, False, False],
       [False, False, False, False, False,  True, False],
       [False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False]], dtype=bool)
``````

5) Next up, we look for ANY match along the first axis of this 2D mask; a `True` at position j means that column j repeats an earlier column -

``````
In [164]: (np.triu((a[:,None,:] == a[...,None]).all(0),1)).any(0)
Out[164]: array([False, False, False, False,  True,  True, False], dtype=bool)
``````

6) We are looking to remove those duplicates, so invert the mask -

``````
In [165]: ~(np.triu((a[:,None,:] == a[...,None]).all(0),1)).any(0)
Out[165]: array([ True,  True,  True,  True, False, False,  True], dtype=bool)
``````

7) Finally, we index into the columns of the input array with the mask for the final output -

``````
In [166]: a[:,~(np.triu((a[:,None,:] == a[...,None]).all(0),1)).any(0)]
Out[166]:
array([[2, 1, 3, 5, 7],
       [6, 5, 4, 6, 8]])
``````
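The seven steps above can be collected into one function with a named intermediate per step. This is a sketch rather than part of the answer: `drop_repeated_columns` and the variable names are hypothetical, chosen only to mirror the explanation.

```python
import numpy as np

def drop_repeated_columns(a):
    """Return `a` without columns that repeat an earlier column (hypothetical helper)."""
    a = np.asarray(a)
    # Step 2: pairwise[r, i, j] is True where a[r, i] == a[r, j].
    # Its shape is (rows, cols, cols).
    pairwise = a[:, None, :] == a[..., None]
    # Step 3: same_col[i, j] is True where columns i and j agree in every row.
    same_col = pairwise.all(axis=0)
    # Step 4: keep only pairs with i < j, so a column is never matched
    # against itself or against a later column.
    earlier_match = np.triu(same_col, k=1)
    # Step 5: is_repeat[j] is True if some earlier column i equals column j.
    is_repeat = earlier_match.any(axis=0)
    # Steps 6 and 7: invert the mask and select the surviving columns.
    return a[:, ~is_repeat]

a = np.array([[2, 1, 3, 5, 1, 3, 7],
              [6, 5, 4, 6, 5, 4, 8]])
out = drop_repeated_columns(a)
print(out)
```

Note that the intermediate `pairwise` array holds rows x cols x cols booleans, so its memory use grows quadratically with the number of columns.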

Approach #2 With a focus on memory efficiency, and possibly better speed as well, here's an approach that treats each column as an indexing tuple -

``````
lidx = np.ravel_multi_index(a,a.max(1)+1)
out = a[:,np.sort(np.unique(lidx,return_index=1)[1])]
``````

Explanation

1) Input array -

``````
In [203]: a
Out[203]:
array([[2, 1, 3, 5, 1, 3, 7],
       [6, 5, 4, 6, 5, 4, 8]])
``````

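2) Calculate the linear index equivalent of each column, treating the column as a coordinate tuple into a grid of shape `a.max(1)+1`, so that identical columns map to the same linear index -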

``````
In [207]: lidx = np.ravel_multi_index(a,a.max(1)+1)

In [208]: lidx
Out[208]: array([24, 14, 31, 51, 14, 31, 71])
``````

3) Get the first occurrence of each unique linear index -

``````
In [209]: np.unique(lidx,return_index=1)[1]
Out[209]: array([1, 0, 2, 3, 6])
``````

4) Sort those indices to restore the original column order, then index into the columns of the input array for the final output -

``````
In [210]: np.sort(np.unique(lidx,return_index=1)[1])
Out[210]: array([0, 1, 2, 3, 6])

In [211]: a[:,np.sort(np.unique(lidx,return_index=1)[1])]
Out[211]:
array([[2, 1, 3, 5, 7],
       [6, 5, 4, 6, 8]])
``````
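Approach #2 could be wrapped up along the following lines. This is a minimal sketch: the function name `drop_repeated_columns_lidx` and the explicit input check are additions for illustration, not part of the answer. The check reflects the assumption that, because each column is used as an index tuple, the array has to hold non-negative integers.

```python
import numpy as np

def drop_repeated_columns_lidx(a):
    """Approach #2 as a function (hypothetical helper name)."""
    a = np.asarray(a)
    # Assumption: entries must be non-negative integers to act as indices.
    if not np.issubdtype(a.dtype, np.integer) or a.min() < 0:
        raise ValueError("expects a non-negative integer array")
    # Each column becomes one linear index into a grid of shape a.max(1) + 1.
    lidx = np.ravel_multi_index(a, a.max(axis=1) + 1)
    # Position of the first occurrence of each distinct linear index.
    first = np.unique(lidx, return_index=True)[1]
    # Sorting those positions restores the original column order.
    return a[:, np.sort(first)]

a = np.array([[2, 1, 3, 5, 1, 3, 7],
              [6, 5, 4, 6, 5, 4, 8]])
print(drop_repeated_columns_lidx(a))
```

Compared with Approach #1, the largest intermediate here is a 1D array with one entry per column.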

For detailed information on the considerations related to converting to indexing tuples, please refer to `this post`.
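As a possible alternative that is not part of either approach above: on NumPy 1.13 or newer, `np.unique` accepts an `axis` argument, so the first-occurrence indices can be taken from the columns directly. A sketch under that version assumption, which also works for float or negative entries:

```python
import numpy as np

a = np.array([[2, 1, 3, 5, 1, 3, 7],
              [6, 5, 4, 6, 5, 4, 8]])

# With axis=1 the columns are treated as the items to de-duplicate;
# return_index gives the position of each distinct column's first occurrence.
_, first = np.unique(a, axis=1, return_index=True)

# Sort the positions so the surviving columns keep their original order.
out = a[:, np.sort(first)]
print(out)
```

The sort is needed because `np.unique` returns the distinct columns in sorted order rather than in order of appearance.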