Mike S. Mike S. - 1 year ago 109
Python Question

Find All Binary Splits of a Nominal Attribute


I'm trying to build a binary decision tree classifier in Python from scratch based on a data set that has only nominal attributes.

One step I'm stuck on is finding all possible ways to compute a binary split of a nominal attribute. For example, for an attribute with possible values [a, b, c, d], I am looking for a way to split these in two arrays such that we obtain:

left right
---- -----
a bcd
b acd
c abd
d abc
ab cd
ac bd
ad bc

without duplicate splits (e.g. we don't need "bc" in
and "ad" in
since this would yield the same binary split as "ad" in
and "bc" in
). Order within each split is also irrelevant (e.g. "ad" is the same as "da" in one side of a split).

Current Attempt

The exact terminology is escaping me, but I think this is some form of combination/permutation problem. I know its not quite a powerset I'm after. The closest question I could find similar to mine is linked here.

So far I've started an iterative procedure:

for i in range(1, array.length / 2):
for j in range(1, i):
# stuck here

The reason for looping only through the floor of half the length of the attribute's possible values (
) is because if we store up to
array.length / 2
values in
, right has
1 - (array.length / 2)
values, covering all possible splits.

Also, I've heard of the
library .. so perhaps there's a way to achieve what I'm after with that?

Answer Source

I would use itertools.product to write a function that splits a sequence into all possible divisions of two halves. I'd iterate through that and remove duplicates using a set.

import itertools

def binary_splits(seq):
    for result_indices in itertools.product((0,1), repeat=len(seq)):
        result = ([], [])
        for seq_index, result_index in enumerate(result_indices):
        #skip results where one of the sides is empty
        if not result[0] or not result[1]: continue
        #convert from list to tuple so we can hash it later
        yield map(tuple, result)

def binary_splits_no_dupes(seq):
    seen = set()
    for item in binary_splits(seq):
        key = tuple(sorted(item))
        if key in seen: continue
        yield key

for left, right in binary_splits_no_dupes("abcd"):
    print left, right


('a', 'b', 'c') ('d',)
('a', 'b', 'd') ('c',)
('a', 'b') ('c', 'd')
('a', 'c', 'd') ('b',)
('a', 'c') ('b', 'd')
('a', 'd') ('b', 'c')
('a',) ('b', 'c', 'd')
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download