RussW RussW - 1 month ago 10
Python Question

N random, contiguous and non-overlapping subsequences each of length

I'm trying to get n random and non-overlapping slices of a sequence where each subsequence is of length l, preferably in the order they appear.

This is the code I have so far and it's gotten more and more messy with each attempt to make it work, needless to say it doesn't work.

def rand_parts(seq, n, l):
"""
return n random non-overlapping partitions each of length l.
If n * l > len(seq) raise error.
"""
if n * l > len(seq):
raise Exception('length of seq too short for given n, l arguments')
if not isinstance(seq, list):
seq = list(seq)
gaps = [0] * (n + 1)
for g in xrange(len(seq) - (n * l)):
gaps[random.randint(0, len(gaps) - 1)] += 1
result = []
for i, g in enumerate(gaps):
x = g + (i * l)
result.append(seq[x:x+l])
if i < len(gaps) - 1:
gaps[i] += x
return result


For example if we say
rand_parts([1, 2, 3, 4, 5, 6], 2, 2)
there are 6 possible results that it could return from the following diagram:

[1, 2, 3, 4, 5, 6]
____ ____

[1, 2, 3, 4, 5, 6]
____ ____

[1, 2, 3, 4, 5, 6]
____ ____

[1, 2, 3, 4, 5, 6]
____ ____

[1, 2, 3, 4, 5, 6]
____ ____

[1, 2, 3, 4, 5, 6]
____ ____


So
[[3, 4], [5, 6]]
would be acceptable but
[[3, 4], [4, 5]]
wouldn't because it's overlapping and
[[2, 4], [5, 6]]
also wouldn't because
[2, 4]
isn't contiguous.

I encountered this problem while doing a little code golfing so for interests sake it would also be nice to see both a simple solution and/or an efficient one, not so much interested in my existing code.

Answer Source
def rand_parts(seq, n, l):
    indices = xrange(len(seq) - (l - 1) * n)
    result = []
    offset = 0
    for i in sorted(random.sample(indices, n)):
        i += offset
        result.append(seq[i:i+l])
        offset += l - 1
    return result

To understand this, first consider the case l == 1. Then it's basically just returning a random.sample() of the input data in sorted order; in this case the offset variable is always 0.

The case where l > 1 is an extension of the previous case. We use random.sample() to pick up positions, but maintain an offset to shift successive results: in this way, we make sure that they are non-overlapping ranges --- i.e. they start at a distance of at least l of each other, rather than 1.