izaak_pyzaak - 1 year ago 90
Python Question

# all combinations of DNA characters in a string of length 4

I am trying to generate a list of all possible DNA sequences of length four using the four characters `A`, `T`, `C`, `G`. There are a total of 4^4 (256) different combinations. I include repeats, such that `AAAA` is allowed.
I have looked at `itertools.combinations_with_replacement(iterable, r)`; however, the output changes depending on the order of the input string, i.e.:

```python
import itertools

itertools.combinations_with_replacement('ATCG', 4)  # gives different results from...
itertools.combinations_with_replacement('ATGC', 4)
```

Because of this, I tried combining `itertools.combinations_with_replacement(iterable, r)` with `itertools.permutations()`, passing each output of `itertools.permutations()` to `itertools.combinations_with_replacement()`, as defined below:

```python
import itertools

def allCombinations(s, strings):
    # every ordering of the input characters
    perms = list(itertools.permutations(s, 4))
    allCombos = []
    for perm in perms:
        combo = list(itertools.combinations_with_replacement(perm, 4))
        allCombos.append(combo)
    # join each tuple into a string and collect it in the caller's list
    for combos in allCombos:
        for tup in combos:
            strings.append("".join(str(x) for x in tup))
```

However, running `allCombinations('ATCG', li)` where `li = []`, and then taking `list(set(li))`, still only produces 136 unique sequences rather than 256.

There must be an easy way to do this, maybe generating a power set and then filtering for length 4?

You can achieve this by using `itertools.product`. It gives the Cartesian product of the passed iterables:

```python
import itertools

a = 'ACTG'

print(len(list(itertools.product(a, a, a, a))))
# or even better, as @ayhan commented:
# print(len(list(itertools.product(a, repeat=4))))
# prints 256
```
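
If "Cartesian product" is unfamiliar, it may help to think of `product(a, repeat=4)` as four nested loops, one per position in the sequence. Here is a minimal sketch of that equivalence; the names `nested` and `via_product` are just for illustration:

```python
import itertools

a = 'ACTG'

# Four nested loops, one per position in the sequence.
nested = [(w, x, y, z) for w in a for x in a for y in a for z in a]

# The same tuples, in the same order, from itertools.product.
via_product = list(itertools.product(a, repeat=4))

print(nested == via_product)  # True
print(len(via_product))       # 256
```

Because every position is chosen independently, the order of characters in the input only changes the order of the results, not which sequences appear.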

But it returns tuples, so if you are looking for strings:

```python
# continues from the snippet above (itertools imported, a = 'ACTG')
for output in itertools.product(a, repeat=4):
    print(''.join(output))

# prints:
# AAAA
# AAAC
# ...
# GGGG
```
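
Since the question asks for a list, and to show where the 136 comes from, here is a rough sketch that builds the full list and compares it with the approach from the question. The names `bases`, `all_seqs`, `from_question` and `missing` are my own, not anything from `itertools`:

```python
import itertools

bases = 'ATCG'

# All 4**4 ordered sequences, joined into strings.
all_seqs = [''.join(p) for p in itertools.product(bases, repeat=4)]

# The approach from the question: combinations_with_replacement applied
# to every permutation of the alphabet, de-duplicated with a set.
from_question = {
    ''.join(c)
    for perm in itertools.permutations(bases, 4)
    for c in itertools.combinations_with_replacement(perm, 4)
}

missing = set(all_seqs) - from_question

print(len(all_seqs), len(from_question), len(missing))  # 256 136 120
print('ATAT' in missing)                                # True
```

`combinations_with_replacement` never moves back to an earlier element of its input, so each result keeps identical letters next to each other. A sequence such as `ATAT`, where a letter reappears after a different one, cannot be produced from any permutation, and that appears to account for the 120 missing sequences.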
