zhangbaochong - 5 months ago 34

C++ Question

I am writing a software rasterizer, but it is very slow. Profiling shows that the float lerp function is the bottleneck. How can I make this function faster? Would SIMD help? Any ideas?

```
inline float MathUtil::Lerp(float x1, float x2, float t)
{
    return x1 + (x2 - x1)*t;
}

//lerp vector
ZCVector MathUtil::Lerp(const ZCVector& v1, const ZCVector& v2, float t)
{
    return ZCVector(
        Lerp(v1.x, v2.x, t),
        Lerp(v1.y, v2.y, t),
        Lerp(v1.z, v2.z, t),
        v1.w
    );
}

//lerp ZCFLOAT2
ZCFLOAT2 MathUtil::Lerp(const ZCFLOAT2& v1, const ZCFLOAT2& v2, float t)
{
    return ZCFLOAT2(
        Lerp(v1.u, v2.u, t),
        Lerp(v1.v, v2.v, t)
    );
}

//lerp ZCFLOAT3
ZCFLOAT3 MathUtil::Lerp(const ZCFLOAT3& v1, const ZCFLOAT3& v2, float t)
{
    return ZCFLOAT3(
        Lerp(v1.x, v2.x, t),
        Lerp(v1.y, v2.y, t),
        Lerp(v1.z, v2.z, t)
    );
}

//lerp VertexOut
VertexOut MathUtil::Lerp(const VertexOut& v1, const VertexOut& v2, float t)
{
    return VertexOut(
        Lerp(v1.posTrans, v2.posTrans, t),
        Lerp(v1.posH, v2.posH, t),
        Lerp(v1.tex, v2.tex, t),
        Lerp(v1.normal, v2.normal, t),
        Lerp(v1.color, v2.color, t),
        Lerp(v1.oneDivZ, v2.oneDivZ, t)
    );
}
```

The structure of VertexOut:

```
class VertexOut
{
public:
    ZCVector posTrans;
    ZCVector posH;
    ZCFLOAT2 tex;
    ZCVector normal;
    ZCFLOAT3 color;
    float oneDivZ;
};
```

The `ScanlineFill` function:

```
void Tiny3DDeviceContext::ScanlineFill(const VertexOut& left, const VertexOut& right, int yIndex)
{
    float dx = right.posH.x - left.posH.x;

    for (float x = left.posH.x; x <= right.posH.x; x += 0.5f)
    {
        int xIndex = static_cast<int>(x + .5f);

        if(xIndex >= 0 && xIndex < m_pDevice->GetClientWidth())
        {
            float lerpFactor = 0;
            if (dx != 0)
            {
                lerpFactor = (x - left.posH.x) / dx;
            }

            float oneDivZ = MathUtil::Lerp(left.oneDivZ, right.oneDivZ, lerpFactor);
            if (oneDivZ >= m_pDevice->GetZ(xIndex,yIndex))
            {
                m_pDevice->SetZ(xIndex, yIndex, oneDivZ);

                //lerp get vertex
                VertexOut out = MathUtil::Lerp(left, right, lerpFactor);
                out.posH.y = yIndex;

                m_pDevice->DrawPixel(xIndex, yIndex, m_pShader->PS(out));
            }
        }
    }
}
```

Answer

This loop structure potentially runs `lerp` twice as many times as needed:

```
for (float x = left.posH.x; x <= right.posH.x; x += 0.5f) {
int xIndex = static_cast<int>(x + .5f);
...
}
```

Instead (and more accurately), loop by incrementing an integer `xIndex`, and calculate the right `float x` for each `xIndex`.
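As a rough sketch of that integer loop, here is a free-standing helper that yields one lerp factor per pixel column. The name `SpanLerpFactors` and its parameters are made up for illustration, and it assumes pixel centers sit at integer x coordinates and that only pixels inside the span are filled, which may not match the rounding in your current code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper (not from the question): returns one lerp factor per
// integer pixel column inside the span [leftX, rightX], clipped to [0, width).
// Assumption: a pixel is covered when its integer x lies within the span.
std::vector<float> SpanLerpFactors(float leftX, float rightX, int width)
{
    assert(width > 0);
    std::vector<float> factors;
    const float dx = rightX - leftX;

    // Clip once, outside the loop, instead of testing every pixel.
    const int xStart = std::max(0, static_cast<int>(std::ceil(leftX)));
    const int xEnd = std::min(width - 1, static_cast<int>(std::floor(rightX)));

    for (int xIndex = xStart; xIndex <= xEnd; ++xIndex)
    {
        const float x = static_cast<float>(xIndex);
        factors.push_back(dx != 0.0f ? (x - leftX) / dx : 0.0f);
    }
    return factors;
}
```

For a span from 2.0 to 6.0 this visits five pixels, where stepping `x` by 0.5 would run the loop body nine times. In the real `ScanlineFill` you would do the depth test and the `VertexOut` lerp inside the loop rather than collecting factors in a vector.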

This might auto-vectorize, but you'd have to check your compiler output to see what happened. Hopefully the Lerp that you overwrite with `out.posH.y = yIndex;` gets optimized away since you discard the result. If not, you might get a speedup from making a wrapper function that doesn't do that Lerp.
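Such a wrapper could look roughly like this. `Vec4` is only a stand-in for `ZCVector` (I'm assuming it is four floats), and `LerpWithFixedY` is a hypothetical name, not something from your code base.

```cpp
#include <cassert>

// Stand-in for the question's ZCVector (assumed layout: four floats).
struct Vec4 { float x, y, z, w; };

inline float Lerp(float a, float b, float t) { return a + (b - a) * t; }

// Hypothetical wrapper: interpolates x and z, keeps v1.w as the question's
// vector Lerp does, and stores the known scanline y instead of lerping it.
inline Vec4 LerpWithFixedY(const Vec4& v1, const Vec4& v2, float t, float y)
{
    assert(t >= 0.0f && t <= 1.0f);
    return Vec4{ Lerp(v1.x, v2.x, t), y, Lerp(v1.z, v2.z, t), v1.w };
}
```

This only saves one scalar Lerp per pixel, so measure before and after; a good optimizer may already have removed the dead computation.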

You could make it more SIMD-friendly by using a Struct-of-Arrays approach instead of your AoS approach that keeps everything for a struct contiguous. However, you're Lerping multiple elements the same way, so it might auto-vectorize with two scalar and one vector Lerp.
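To illustrate what an SoA layout might look like, here is a minimal sketch. `SpanSoA` and `LerpAttribute` are hypothetical names, and which attributes you keep per pixel is up to you. The point is that each attribute lives in its own contiguous array, so the lerp becomes a simple loop over floats that a compiler has a good chance of vectorizing (check the generated asm to be sure).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical Struct-of-Arrays span: one contiguous array per attribute,
// with one entry per pixel of the scanline.
struct SpanSoA
{
    std::vector<float> oneDivZ;
    std::vector<float> u;
    std::vector<float> v;
};

// Lerp a single attribute for every pixel of the span in one tight loop.
// t holds the per-pixel lerp factors; out must already have the same size.
void LerpAttribute(float left, float right,
                   const std::vector<float>& t, std::vector<float>& out)
{
    assert(out.size() == t.size());
    const float delta = right - left;  // hoisted out of the loop
    for (std::size_t i = 0; i < t.size(); ++i)
        out[i] = left + delta * t[i];
}
```

You would call `LerpAttribute` once per attribute (`oneDivZ`, `u`, `v`, and so on) for the whole scanline, then run the depth test and pixel shader over the results.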

See the sse tag wiki for some guides to SIMD stuff, including a link to this very nice beginner / intermediate set of slides.

There are probably other things you could change, too, esp. **bigger restructuring of your code to do less overall work**. This kind of optimization can more often give you even bigger speedups than using SIMD to efficiently apply the brute force of modern CPUs.

Doing both at once to multiply the speedups is what really makes things fast.
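One possible example of doing less work, which is my suggestion rather than anything in your code: since each attribute changes linearly across the span, you can compute its per-pixel step once and then just add it, instead of dividing and lerping for every pixel. `InterpolateSpan` below is a hypothetical helper showing the idea for a single float attribute.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: interpolate one attribute across `count` pixels by
// adding a constant step, rather than computing a lerp factor per pixel.
std::vector<float> InterpolateSpan(float left, float right, int count)
{
    assert(count >= 1);
    std::vector<float> values(static_cast<std::size_t>(count));

    // One division per span instead of one per pixel.
    const float step =
        (count > 1) ? (right - left) / static_cast<float>(count - 1) : 0.0f;

    float value = left;
    for (std::size_t i = 0; i < values.size(); ++i)
    {
        values[i] = value;
        value += step;  // rounding error accumulates slowly across the span
    }
    return values;
}
```

The trade-off is that repeated addition drifts slightly from the exact lerp on long spans, so whether this is acceptable depends on how much precision you need.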

Cache misses and memory-bandwidth bottlenecks are often a huge factor, so optimizing your access patterns can make a big difference.

See Agner Fog's optimization guide if you want to learn about more low-level details. He has a C++ optimization guide, but most of the good stuff is about x86 asm. (See also the x86 tag wiki). But remember, this low-level optimization stuff is only a good idea *after* looking for high-level optimizations.