zhangbaochong zhangbaochong - 4 months ago 28
C++ Question

How to improve the speed of the float lerp function?

I am writing a software rasterizer, but it is very slow. Profiling shows that the float lerp function is the bottleneck. How can I improve the speed of this function? Would SIMD help? Any ideas?

inline float MathUtil::Lerp(float x1, float x2, float t)
{
    return x1 + (x2 - x1)*t;
}

//lerp vector
ZCVector MathUtil::Lerp(const ZCVector& v1, const ZCVector& v2, float t)
{
    return ZCVector(
        Lerp(v1.x, v2.x, t),
        Lerp(v1.y, v2.y, t),
        Lerp(v1.z, v2.z, t),
        v1.w
    );
}

//lerp ZCFLOAT2
ZCFLOAT2 MathUtil::Lerp(const ZCFLOAT2& v1, const ZCFLOAT2& v2, float t)
{
    return ZCFLOAT2(
        Lerp(v1.u, v2.u, t),
        Lerp(v1.v, v2.v, t)
    );
}

//lerp ZCFLOAT3
ZCFLOAT3 MathUtil::Lerp(const ZCFLOAT3& v1, const ZCFLOAT3& v2, float t)
{
    return ZCFLOAT3(
        Lerp(v1.x, v2.x, t),
        Lerp(v1.y, v2.y, t),
        Lerp(v1.z, v2.z, t)
    );
}

//lerp VertexOut
VertexOut MathUtil::Lerp(const VertexOut& v1, const VertexOut& v2, float t)
{
    return VertexOut(
        Lerp(v1.posTrans, v2.posTrans, t),
        Lerp(v1.posH, v2.posH, t),
        Lerp(v1.tex, v2.tex, t),
        Lerp(v1.normal, v2.normal, t),
        Lerp(v1.color, v2.color, t),
        Lerp(v1.oneDivZ, v2.oneDivZ, t)
    );
}


The structure of VertexOut:

class VertexOut
{
public:
    ZCVector posTrans;
    ZCVector posH;
    ZCFLOAT2 tex;
    ZCVector normal;
    ZCFLOAT3 color;
    float oneDivZ;
};


The ScanlineFill function fills the triangle. Every interpolated vertex goes through the lerp function, so it is called a very large number of times.

void Tiny3DDeviceContext::ScanlineFill(const VertexOut& left, const VertexOut& right, int yIndex)
{
    float dx = right.posH.x - left.posH.x;

    for (float x = left.posH.x; x <= right.posH.x; x += 0.5f)
    {
        int xIndex = static_cast<int>(x + .5f);
        if(xIndex >= 0 && xIndex < m_pDevice->GetClientWidth())
        {
            float lerpFactor = 0;
            if (dx != 0)
            {
                lerpFactor = (x - left.posH.x) / dx;
            }

            float oneDivZ = MathUtil::Lerp(left.oneDivZ, right.oneDivZ, lerpFactor);
            if (oneDivZ >= m_pDevice->GetZ(xIndex,yIndex))
            {
                m_pDevice->SetZ(xIndex, yIndex, oneDivZ);
                //lerp get vertex
                VertexOut out = MathUtil::Lerp(left, right, lerpFactor);
                out.posH.y = yIndex;

                m_pDevice->DrawPixel(xIndex, yIndex, m_pShader->PS(out));
            }
        }
    }
}

Answer

This loop structure potentially runs lerp twice as many times as needed:

for (float x = left.posH.x; x <= right.posH.x; x += 0.5f) {
      int xIndex = static_cast<int>(x + .5f);
      ...
}

Instead (and more accurately), loop over the integer xIndex and compute the matching float x for each xIndex, so that every pixel is visited exactly once.
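A minimal sketch of such a loop, not a drop-in replacement. The function name `SpanLerpFactors` and the `width` parameter (standing in for `m_pDevice->GetClientWidth()`) are hypothetical, and the sketch assumes a pixel is covered when its integer x coordinate lies inside `[leftX, rightX]`, which is a slightly different rounding rule from the original `static_cast<int>(x + .5f)`:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Returns the lerp factor for each pixel column covered by [leftX, rightX].
// Each column is visited exactly once. `width` is a hypothetical stand-in
// for the client width queried from the device in the question.
std::vector<float> SpanLerpFactors(float leftX, float rightX, int width)
{
    std::vector<float> factors;
    const float dx = rightX - leftX;
    // One division per span instead of one per pixel.
    const float invDx = (dx != 0.0f) ? 1.0f / dx : 0.0f;

    // Clamp the integer range once instead of testing bounds per pixel.
    const int first = std::max(0, static_cast<int>(std::ceil(leftX)));
    const int last = std::min(width - 1, static_cast<int>(std::floor(rightX)));

    for (int xIndex = first; xIndex <= last; ++xIndex)
    {
        const float x = static_cast<float>(xIndex);
        factors.push_back((x - leftX) * invDx);
    }
    return factors;
}
```

In the real ScanlineFill the loop body would do the depth test and shading directly rather than collect factors; the vector is only there to make the sketch easy to inspect.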


This might auto-vectorize, but you'd have to check your compiler output to see what happened. Hopefully the Lerp that you overwrite with out.posH.y = yIndex; gets optimized away since you discard the result. If not, you might get a speedup from making a wrapper function that doesn't do that Lerp.
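One way such a wrapper might look, as a sketch only. `Vec4` is a hypothetical stand-in for `ZCVector`, and `LerpWithFixedY` is an invented name; the idea is simply to take y from the caller rather than interpolate it and overwrite it afterwards:

```cpp
// Hypothetical stand-in for the question's ZCVector.
struct Vec4 { float x, y, z, w; };

inline float Lerp(float x1, float x2, float t)
{
    return x1 + (x2 - x1) * t;
}

// Same as the question's vector Lerp, except that y is supplied by the
// caller, so no work is spent on a value that would be discarded.
inline Vec4 LerpWithFixedY(const Vec4& v1, const Vec4& v2, float t, float y)
{
    return Vec4{ Lerp(v1.x, v2.x, t), y, Lerp(v1.z, v2.z, t), v1.w };
}
```

In ScanlineFill this would take the place of lerping posH and then assigning `out.posH.y = yIndex`. Whether it is any faster than what the optimizer already produces has to be checked in the compiler output.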


You could make it more SIMD-friendly by using a Struct-of-Arrays approach instead of your AoS approach that keeps everything for a struct contiguous. However, you're Lerping multiple elements the same way, so it might auto-vectorize with two scalar and one vector Lerp.
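To illustrate the Struct-of-Arrays idea, here is a rough sketch that interpolates a whole scanline per attribute. `SpanSoA`, `LerpSpan` and `LerpAttributes` are hypothetical names, only three attributes are shown, and the lerp factors `t` are assumed to have been computed per pixel beforehand:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical SoA storage for one scanline: each attribute has its own
// contiguous array, indexed by pixel.
struct SpanSoA
{
    std::vector<float> u, v, oneDivZ;
};

// Writes out[i] = a + (b - a) * t[i]. Consecutive iterations touch
// consecutive floats, which is the access pattern SIMD loads and stores suit.
inline void LerpSpan(float a, float b, const std::vector<float>& t,
                     std::vector<float>& out)
{
    out.resize(t.size());
    const float delta = b - a;
    for (std::size_t i = 0; i < t.size(); ++i)
        out[i] = a + delta * t[i];
}

// Interpolates every attribute of the span with the same factors.
inline SpanSoA LerpAttributes(float u0, float u1, float v0, float v1,
                              float z0, float z1, const std::vector<float>& t)
{
    SpanSoA span;
    LerpSpan(u0, u1, t, span.u);
    LerpSpan(v0, v1, t, span.v);
    LerpSpan(z0, z1, t, span.oneDivZ);
    return span;
}
```

A simple loop like the one in `LerpSpan` is a good candidate for auto-vectorization, but that is not guaranteed; the compiler output is the only way to confirm it.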

See the tag wiki for some guides to SIMD stuff, including a link to this very nice beginner / intermediate set of slides.


There are probably other things you could change, too, especially bigger restructuring of your code to do less work overall. That kind of optimization often gives bigger speedups than using SIMD to efficiently apply the brute force of modern CPUs.

Doing both at once to multiply the speedups is what really makes things fast.

Cache misses and memory-bandwidth bottlenecks are often a huge factor, so optimizing your access patterns can make a big difference.

See Agner Fog's optimization guide if you want to learn about more low-level details. He has a C++ optimization guide, but most of the good stuff is about x86 asm. (See also the tag wiki). But remember, this low-level optimization stuff is only a good idea after looking for high-level optimizations.