Erik Sj&#246;lund - 1 year ago 77
C Question

# How to load a sliding diagonal vector from data stored column-wise with SSE

The sliding diagonal vector contains 16 elements, each one an 8-bit unsigned integer.

Without SSE and a bit simplified it would have looked like this in C:

``````int width=1000000; // a big number
uint8_t matrix[width][16];
fill_matrix_with_interesting_values(&matrix);

for (int i=0; i < width - 16; ++i) {
uint8_t diagonal[16];
for (int j=0; j<16; ++j) {
diagonal[j] = matrix[i+j][j];
}
do_something(&diagonal);
}
``````

but in my case I can only load column-wise (vertically) from the matrix with the
`_mm_load_si128`
intrinsics function. The sliding diagonal vector is moving horizontally so I need to load 16 column vectors in advance and use one element from each of those column vectors to create the diagonal vector.

Is it possible to make a fast low-memory implementation for this with SSE?

The implementation I will present only needs one column load per iteration. First we initialize some variables

``````const __m128i mask1=_mm_set_epi8(0,0,0,0,0,0,0,0,255,255,255,255,255,255,255,255);
__m128i v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15;
``````

Then for each step the variable `v_column_load` is loaded with the next column.

``````v15 = v_column_load;
v_diagonal = v0;
``````

In the next step the variable name numbers in `v0`, `v1`, `v3`, `v7`, `v15` are incremented by 1 and adjusted to be in the range 0 to 15. In other words: newnumber = ( oldnumber + 1 ) modulo 16.

``````v0 = v_column_load;
v_diagonal = v1;
``````

After 16 iterations the `v_diagonal` will start to contain the correct diagonal values.

Looking at `mask1`,`mask2`, `mask3`, `mask4`, we see a pattern that can be used to generalize this algorithm for other vector lengths (2^n).

For instance, for vector length 8, we would only need 3 masks and the iteration steps would look like this:

``````v7 = a a a a a a a a
v6 =
v5 =
v4 =
v3 =         a a a a
v2 =
v1 =             a a
v0 =               a

v0 = b b b b b b b b
v7 = a a a a a a a a
v6 =
v5 =
v4 =         b b b b
v3 =         a a a a
v2 =             b b
v1 =             a b

v1 = c c c c c c c c
v0 = b b b b b b b b
v7 = a a a a a a a a
v6 =
v5 =         c c c c
v4 =         b b b b
v3 =         a a c c
v2 =           a b c

v2 = d d d d d d d d
v1 = c c c c c c c c
v0 = b b b b b b b b
v7 = a a a a a a a a
v6 =         d d d d
v5 =         c c c c
v4 =         b b d d
v3 =         a a c d

v3 = e e e e e e e e
v2 = d d d d d d d d
v1 = c c c c c c c c
v0 = b b b b b b b b
v7 = a a a a e e e e
v6 =         d d d d
v5 =     a a c c e e
v4 =       a b b d a

v4 = f f f f f f f f
v3 = e e e e e e e e
v2 = d d d d d d d d
v1 = c c c c c c c c
v0 = b b b b f f f f
v7 = a a a a e e e e
v6 =     b b d d f f
v5 =     a b c d e f

v5 = g g g g g g g g
v4 = f f f f f f f f
v3 = e e e e e e e e
v2 = d d d d d d d d
v1 = c c c c g g g g
v0 = b b b b f f f f
v7 = a a c c e e g g
v6 =   a b c d e f g

v6 = h h h h h h h h
v5 = g g g g g g g g
v4 = f f f f f f f f
v3 = e e e e e e e e
v2 = d d d d h h h h
v1 = c c c c g g g g
v0 = b b d d f f h h
v7 = a b c d e f g h  <-- this vector now contains the diagonal

v7 = i i i i i i i i
v6 = h h h h h h h h
v5 = g g g g g g g g
v4 = f f f f f f f f
v3 = e e e e i i i i
v2 = d d d d h h h h
v1 = c c e e g g i i
v0 = b c d e f g h i  <-- this vector now contains the diagonal

v0 = j j j j j j j j
v7 = i i i i i i i i
v6 = h h h h h h h h
v5 = g g g g g g g g
v4 = f f f f j j j j
v3 = e e e e i i i i
v2 = d d f f h h j j
v1 = c d e f g h i j  <-- this vector now contains the diagonal
``````

Sidenote: I discovered this way of loading a diagonal vector when I was working on an implementation of the Smith-Waterman algorithm. Some more information can be found on the old SourceForge project web page.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download