izaak_pyzaak - 5 months ago
Python Question

all combinations of DNA characters in a string of length 4

I am trying to generate a list of all possible DNA sequences of length four using the four characters A, T, C, G. There are 4^4 (256) different combinations in total. Repeats are included, so AAAA is allowed.
I have looked at itertools.combinations_with_replacement(iterable, r); however, the output changes depending on the order of the input string, i.e.

itertools.combinations_with_replacement('ATCG', 4)  # gives different results from...
itertools.combinations_with_replacement('ATGC', 4)


Because of this, I tried combining itertools.combinations_with_replacement(iterable, r) with itertools.permutations(), passing the output of itertools.permutations() to itertools.combinations_with_replacement(), as defined below:

import itertools

def allCombinations(s, strings):
    perms = list(itertools.permutations(s, 4))
    allCombos = []
    for perm in perms:
        combo = list(itertools.combinations_with_replacement(perm, 4))
        allCombos.append(combo)
    for combos in allCombos:
        for tup in combos:
            strings.append("".join(str(x) for x in tup))


However, running allCombinations('ATCG', li), where li = [], and then taking list(set(li)) still only produces 136 unique sequences, rather than 256.

There must be an easy way to do this, maybe generating a power set and then filtering for length 4?

Answer

You can achieve this by using itertools.product. It gives the Cartesian product of the passed iterables:

import itertools

a = 'ACTG'

print(len(list(itertools.product(a, a, a, a))))
# Equivalent and more concise, as @ayhan commented:
# print(len(list(itertools.product(a, repeat=4))))
>> 256
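
To illustrate why the approach in the question falls short, here is a minimal sketch comparing the two functions. combinations_with_replacement treats the order of the chosen letters as irrelevant, so it only emits tuples that follow the order of the input string, while product lets every position take any letter independently. The variable names (alphabet, cwr, prod) are illustrative and not from the original post.

```python
import itertools

alphabet = 'ATCG'

# Order-insensitive: each multiset of four letters appears exactly once,
# always arranged in the order the letters occur in `alphabet`.
cwr = {''.join(t) for t in itertools.combinations_with_replacement(alphabet, 4)}

# Order-sensitive: each of the four positions independently takes any letter.
prod = {''.join(t) for t in itertools.product(alphabet, repeat=4)}

print(len(cwr))   # 35
print(len(prod))  # 256
print('ATAT' in cwr, 'ATAT' in prod)  # False True
```

A sequence such as ATAT, where the same letter appears in two separate runs, can never come out of combinations_with_replacement regardless of how the input is permuted, which is consistent with the question's attempt stopping at 136 sequences.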

But itertools.product returns tuples, so if you are looking for strings:

for output in itertools.product(a, repeat=4):
    print(''.join(output))

>> AAAA
   AAAC
   .
   .
   GGGG
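
If a list of strings is needed rather than printed output, the same idea can be wrapped in a small helper. This is only a sketch: the function name all_kmers and its default arguments are hypothetical and not part of the original answer.

```python
import itertools

def all_kmers(alphabet='ACGT', k=4):
    """Return every length-k string over `alphabet` (hypothetical helper)."""
    # product yields tuples of characters; join each tuple into one string.
    return [''.join(chars) for chars in itertools.product(alphabet, repeat=k)]

seqs = all_kmers()
print(len(seqs), len(set(seqs)))  # 256 256
print(seqs[:3])                   # ['AAAA', 'AAAC', 'AAAG']
```

Because the alphabet has no repeated letters, every generated string is unique, so no set() de-duplication step is required.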