krlk89 - 1 month ago 9
Python Question

# Python bitarray reverse complement

I am using Python's

`bitarray module`
to convert a DNA sequence, that is written in a binary file, to its reverse complement. Each nucleotide is represented by two bits in the following format:

`A - 00, C - 01, G - 10, T - 11`
.

For example, the reverse complement of
`AGCTACGG (00 10 01 11 00 01 10 10)`
would be
`CCGTAGCT (01 01 10 11 00 10 01 11)`
.

This sequence takes up exactly 16 bits (2 bytes), but a sequence of length 9 would take 18 bits and it is padded to take up 24 bits (3 bytes).

At the moment I use a for cycle for the conversion, but this solution is dreadfully slow.

``````def reverse_complement( my_bitarray, seq_length ):

for i in range(0, 2 * seq_length - 1, 2):

if my_bitarray[i] == my_bitarray[i + 1]:

if my_bitarray[i] == 0:
my_bitarray[i], my_bitarray[i + 1] = 1, 1

else:
my_bitarray[i], my_bitarray[i + 1] = 0, 0

#padding if the bitarray is not a multiple of 8 bits in length
if seq_length / 4 != int():
my_bitarray.reverse()
my_bitarray.fill()
my_bitarray.reverse()

return my_bitarray

a = bitarray()
a.frombytes(seq[::-1])
b = a[int(seq_start)::] # seq without padding
b.reverse()

reverse_complement(b, seq_length)
``````

Any tips on how to make this process faster?

The code you provided doesn't give the answer you indicated.

Here is code that gives the correct answer. Perhaps it will also be fast enough:

``````def reverse_complement(my_bitarray):
# First reverse by twos
my_bitarray = zip(my_bitarray[0::2], my_bitarray[1::2])
my_bitarray = reversed(list(my_bitarray))
my_bitarray = (i for t in my_bitarray for i in t)
my_bitarray = bitarray(my_bitarray)

# Then complement
my_bitarray.invert()
return my_bitarray
``````

Note that you don't have to worry about the padding. `bitarray.bitarray()` manages all of that for you.

Source (Stackoverflow)