Venemo - 6 months ago 36

C++ Question

I have binary matrices in C++ that I repesent with a vector of 8-bit values.

For example, the following matrix:

`1 0 1 0 1 0 1`

0 1 1 0 0 1 1

0 0 0 1 1 1 1

is represented as:

`const uint8_t matrix[] = {`

0b01010101,

0b00110011,

0b00001111,

};

The reason why I'm doing it this way is because then computing the product of such a matrix and a 8-bit vector becomes really simple and efficient (just one bitwise AND and a parity computation, per row), which is much better than calculating each bit individually.

I'm now looking for an efficient way to transpose such a matrix, but I haven't been able to figure out how to do it without having to manually calculate each bit.

Just to clarify, for the above example, I'd like to get the following result from the transposition:

`const uint8_t transposed[] = {`

0b00000000,

0b00000100,

0b00000010,

0b00000110,

0b00000001,

0b00000101,

0b00000011,

0b00000111,

};

Answer

I've spent more time looking for a solution, and I've found some good ones.

On a modern x86 CPU, transposing a binary matrix can be done very efficiently with SSE2 instructions. Using such instructions it is possible to process a 16×8 matrix.

This solution is inspired by this blog post by mischasan and is vastly superior to every suggestion I've got so far to this question.

The idea is simple:

`#include <emmintrin.h>`

- Pack 16
`uint8_t`

variables into an`__m128i`

- Use
`_mm_movemask_epi8`

to get the MSBs of each byte, producing an`uint16_t`

- Use
`_mm_slli_epi64`

to shift the 128-bit register by one - Repeat until you've got all 8
`uint16_t`

s

Unfortunately, I also need to make this work on ARM. After implementing the SSE2 version, it would be easy to just just find the NEON equivalents, but the *Cortex-M* CPU, (contrary to the *Cortex-A*) does not have SIMD capabilities, so NEON isn't too useful for me at the moment.

**NOTE**: Because the *Cortex-M* **doesn't have native 64-bit arithmetics**, I could not use the ideas in any answers that suggest to do it by treating a 8x8 block as an `uint64_t`

. Most microcontrollers that have a *Cortex-M* CPU also don't have too much memory so I prefer to do all this without a lookup table.

After some thinking, the same algorithm can be implemented using plain 32-bit arithmetics and some clever coding. This way, I can work with 4×8 blocks at a time. It was suggested by a collegaue and the magic lies in the way 32-bit multiplication works: you can find a 32-bit number with which you can multiply and then the MSB of each byte gets next to each other in the upper 32 bits of the result.

- Pack 4
`uint8_t`

s in a 32-bit variable - Mask the 1st bit of each byte (using
`0x80808080`

) - Multiply it with
`0x02040810`

- Take the 4 LSBs of the upper 32 bits of the multiplication
- Generally, you can mask the Nth bit in each byte (shift the mask right by N bits) and multiply with the magic number, shifted left by N bits. The advantage here is that if your compiler is smart enough to unroll the loop, both the mask and the 'magic number' become compile-time constants so shifting them does not incur any performance penalty whatsoever. There's some trouble with the last series of 4 bits, because then one LSB is lost, so in that case I needed to shift the input left by 8 bits and use the same method as the first series of 4-bits.

If you do this with two 4×8 blocks, then you can get an 8x8 block done and arrange the resulting bits so that everything goes into the right place.