Jennifer Q - 2 months ago 33

C Question

I try to use SSE to transpose my matrix. But it can only fit matrix whose N is divisible by 4. So I want to pad matrix to reformat it.

For example, if a 3 * 3 matrix, it should pad into 4 * 4 matrix:

`1 2 3 1 2 3 0`

4 5 6 => 4 5 6 0

7 8 9 7 8 9 0

0 0 0 0

Any efficient way to do this? And I am not sure if cost time padding it, would the SSE transpose be even slower than just loop every index...

Answer

You don't actually need to pad, do you? You're just suggesting that as a way to use a 4x4 SSE transpose routine you already have, right?

A matrix transpose doesn't move the diagonal elements (including the first and last). A 3x3 transpose has much less data movement: only 7 elements need to be loaded/stored.

```
1 2 3 1 4 7
4 5 6 => 2 5 8
7 8 9 3 6 9
```

If your elements are 4B (`int`

or `float`

, not `double`

), then the first 8 elements fit in a single AVX vector. AVX2 has a full lane-crossing shuffle, `vpermps`

. So the whole transpose can be done with a single load / `_mm256_permutevar8x32_ps`

/ store. It has one-per-clock throughput and three cycle latency on Intel Haswell.

Since the last element doesn't need to move, you don't need to touch it at all, except to copy it if you aren't transposing in-place.

With just SSE, you might load two vectors containing the first eight elements and shuffle them with each other using `shufps`

or something to combine elements from each vector.

Or maybe shuffle to create a `{ 1 4 3 2 }`

vector and a `{ 5 8 7 6 }`

vector, and then blend element 7 into the first, and blend element 3 into the 2nd.

**Anyway, a 3x3 is easier to transpose than a 4x4, so don't pad out to 4x4 if you don't need to use SSE on whole rows later.**