Mike S. - 7 months ago 37

Python Question

I'm trying to build a binary decision tree classifier in Python from scratch based on a data set that has only nominal attributes.

One step I'm stuck on is finding all possible ways to compute a binary split of a nominal attribute. For example, for an attribute with possible values [a, b, c, d], I am looking for a way to split these in two arrays such that we obtain:

`left right`

---- -----

a bcd

b acd

c abd

d abc

ab cd

ac bd

ad bc

without duplicate splits (e.g. we don't need "bc" in

`left`

`right`

`left`

`right`

The exact terminology is escaping me, but I think this is some form of combination/permutation problem. I know its not quite a powerset I'm after. The closest question I could find similar to mine is linked here.

So far I've started an iterative procedure:

`for i in range(1, array.length / 2):`

for j in range(1, i):

# stuck here

The reason for looping only through the floor of half the length of the attribute's possible values (

`array`

`array.length / 2`

`left`

`1 - (array.length / 2)`

Also, I've heard of the

`itertools`

Answer

I would use `itertools.product`

to write a function that splits a sequence into all possible divisions of two halves. I'd iterate through that and remove duplicates using a set.

```
import itertools
def binary_splits(seq):
for result_indices in itertools.product((0,1), repeat=len(seq)):
result = ([], [])
for seq_index, result_index in enumerate(result_indices):
result[result_index].append(seq[seq_index])
#skip results where one of the sides is empty
if not result[0] or not result[1]: continue
#convert from list to tuple so we can hash it later
yield map(tuple, result)
def binary_splits_no_dupes(seq):
seen = set()
for item in binary_splits(seq):
key = tuple(sorted(item))
if key in seen: continue
yield key
seen.add(key)
for left, right in binary_splits_no_dupes("abcd"):
print left, right
```

Result:

```
('a', 'b', 'c') ('d',)
('a', 'b', 'd') ('c',)
('a', 'b') ('c', 'd')
('a', 'c', 'd') ('b',)
('a', 'c') ('b', 'd')
('a', 'd') ('b', 'c')
('a',) ('b', 'c', 'd')
```