krlk89 krlk89 - 2 months ago 23
Python Question

Python bitarray reverse complement

I am using Python's

bitarray module
to convert a DNA sequence, that is written in a binary file, to its reverse complement. Each nucleotide is represented by two bits in the following format:

A - 00, C - 01, G - 10, T - 11
.

For example, the reverse complement of
AGCTACGG (00 10 01 11 00 01 10 10)
would be
CCGTAGCT (01 01 10 11 00 10 01 11)
.


This sequence takes up exactly 16 bits (2 bytes), but a sequence of length 9 would take 18 bits and it is padded to take up 24 bits (3 bytes).

At the moment I use a for cycle for the conversion, but this solution is dreadfully slow.


def reverse_complement( my_bitarray, seq_length ):

for i in range(0, 2 * seq_length - 1, 2):

if my_bitarray[i] == my_bitarray[i + 1]:

if my_bitarray[i] == 0:
my_bitarray[i], my_bitarray[i + 1] = 1, 1

else:
my_bitarray[i], my_bitarray[i + 1] = 0, 0

#padding if the bitarray is not a multiple of 8 bits in length
if seq_length / 4 != int():
my_bitarray.reverse()
my_bitarray.fill()
my_bitarray.reverse()

return my_bitarray

a = bitarray()
a.frombytes(seq[::-1])
b = a[int(seq_start)::] # seq without padding
b.reverse()

reverse_complement(b, seq_length)


Any tips on how to make this process faster?

Answer

The code you provided doesn't give the answer you indicated.

Here is code that gives the correct answer. Perhaps it will also be fast enough:

def reverse_complement(my_bitarray):
    # First reverse by twos
    my_bitarray = zip(my_bitarray[0::2], my_bitarray[1::2])
    my_bitarray = reversed(list(my_bitarray))
    my_bitarray = (i for t in my_bitarray for i in t)
    my_bitarray = bitarray(my_bitarray)

    # Then complement
    my_bitarray.invert()
    return my_bitarray

Note that you don't have to worry about the padding. bitarray.bitarray() manages all of that for you.