ToneDaBass - 1 year ago 54

Python Question

I have a problem where I want to identify and remove columns in a logic matrix that are subsets of other columns. i.e. [1, 0, 1] is a subset of [1, 1, 1]; but neither of [1, 1, 0] and [0, 1, 1] are subsets of each other. I wrote out a quick piece of code that identifies the columns that are subsets, which does (n^2-n)/2 checks using a couple nested for loops.

`import numpy as np`

A = np.array([[1, 0, 0, 0, 0, 1],

[0, 1, 1, 1, 1, 0],

[1, 0, 1, 0, 1, 1],

[1, 1, 0, 1, 0, 1],

[1, 1, 0, 1, 0, 0],

[1, 0, 0, 0, 0, 0],

[0, 0, 1, 1, 1, 0],

[0, 0, 1, 0, 1, 0]])

rows,cols = A.shape

columns = [True]*cols

for i in range(cols):

for j in range(i+1,cols):

diff = A[:,i]-A[:,j]

if all(diff >= 0):

print "%d is a subset of %d" % (j, i)

columns[j] = False

elif all(diff <= 0):

print "%d is a subset of %d" % (i, j)

columns[i] = False

B = A[:,columns]

The solution should be

`>>> print B`

[[1 0 0]

[0 1 1]

[1 1 0]

[1 0 1]

[1 0 1]

[1 0 0]

[0 1 1]

[0 1 0]]

For massive matrices though, I'm sure there's a way that I could do this faster. One thought is to eliminate subset columns as I go so I'm not checking columns already known to be a subset. Another thought is to vectorize this so don't have O(n^2) operations. Thank you.

Answer

Since the `A`

matrices I'm actually dealing with are 5000x5000 and sparse with about 4% density, I decided to try a sparse matrix approach combined with Python's "set" objects. Overall it's much faster than my original solution, but I feel like my process of going from matrix `A`

to list of sets `D`

is not as fast it could be. Any ideas on how to do this better are appreciated.

```
import numpy as np
A = np.array([[1, 0, 0, 0, 0, 1],
[0, 1, 1, 1, 1, 0],
[1, 0, 1, 0, 1, 1],
[1, 1, 0, 1, 0, 1],
[1, 1, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 0],
[0, 0, 1, 0, 1, 0]])
rows,cols = A.shape
drops = np.zeros(cols).astype(bool)
# sparse nonzero elements
C = np.nonzero(A)
# create a list of sets containing the indices of non-zero elements of each column
D = [set() for j in range(cols)]
for i in range(len(C[0])):
D[C[1][i]].add(C[0][i])
# find subsets, ignoring columns that are known to already be subsets
for i in range(cols):
if drops[i]==True:
continue
col1 = D[i]
for j in range(i+1,cols):
col2 = D[j]
if col2.issubset(col1):
# I tried `if drops[j]==True: continue` here, but that was slower
print "%d is a subset of %d" % (j, i)
drops[j] = True
elif col1.issubset(col2):
print "%d is a subset of %d" % (i, j)
drops[i] = True
break
B = A[:, ~drops]
print B
```

Source (Stackoverflow)