C Question

What is the instruction that gives branchless FP min and max on x86?

To quote (thanks to the author for developing and sharing the algorithm!):

https://tavianator.com/fast-branchless-raybounding-box-intersections/


"Since modern floating-point instruction sets can compute min and max without branches"


Corresponding code by the author is just

double dmnsn_min(double a, double b)
{
    return a < b ? a : b;
}


I'm familiar with e.g. _mm_max_ps, but that's a vector instruction. The code above is obviously meant to be used in scalar form.

Questions:


  • What is the scalar branchless min/max instruction on x86? Or is it a sequence of instructions?

  • Is it safe to assume the compiler will apply it, or how do I invoke it explicitly?

  • Does it make sense to worry about the branchlessness of min/max at all? From what I understand, for a raytracer and/or other visualization software, given a ray-box intersection routine, there is no reliable pattern for the branch predictor to pick up, so it does make sense to eliminate the branch. Am I right about this?

  • Most importantly, the algorithm discussed is built around comparing against +/-INFINITY. Is this reliable w.r.t. the (unknown) instruction we're discussing and the floating-point standard?



Just in case: I'm familiar with Use of min and max functions in C++; I believe it's related, but it's not quite my question.

Answer

Most vector FP instructions have scalar equivalents. MINSS / MAXSS / MINSD / MAXSD are what you want. They handle +/-Infinity the way you'd expect.
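To illustrate the +/-Infinity point from the question, a minimal sketch using the same formulation as the quoted code (the comments state the IEEE comparison results this relies on):

#include <math.h>

/* IEEE comparisons involving infinities are well defined, and MINSD
 * implements exactly this compare-and-select, so clamping against
 * INFINITY is reliable. */
static double dmnsn_min(double a, double b) { return a < b ? a : b; }

/* For any finite x:
 *   dmnsn_min(x,  INFINITY) == x
 *   dmnsn_min(x, -INFINITY) == -INFINITY              */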

Don't try to use _mm_min_ss on scalar floats; the intrinsic is only available with __m128 operands, and Intel's intrinsics don't provide any way to get a scalar float into the low element of a __m128 without zeroing the high elements or somehow doing extra work. Most compilers will actually emit the useless instructions to do that even if the final result doesn't depend on anything in the upper elements. There's nothing like __m256 _mm256_castps128_ps256 (__m128 a) to just cast a float to a __m128 with garbage in the upper elements. I consider this a design flaw. :/
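To illustrate (the helper name here is just for illustration, not from the answer above), the intrinsics-only route for scalars looks something like this, and the _mm_set_ss calls are where the wasted zeroing work comes from:

#include <xmmintrin.h>

/* A minimal sketch of the clunky intrinsics route for scalars: _mm_set_ss
 * zeroes the upper three lanes, so compilers typically emit extra
 * instructions even though the scalar result never depends on them. */
static float min_via_intrinsic(float a, float b)
{
    __m128 va = _mm_set_ss(a);   /* a in lane 0, lanes 1-3 zeroed */
    __m128 vb = _mm_set_ss(b);
    return _mm_cvtss_f32(_mm_min_ss(va, vb));
}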


Note their asymmetric behaviour with NaN: if the operands are unordered, dest=src (i.e. it takes the second operand if either operand is NaN). The corresponding _mm_min_ss / _mm_min_ps intrinsics may or may not have this behaviour, depending on the compiler.

(a and b are unordered if either of them is NaN. That means a<b, a==b, and a>b are all false. See Bruce Dawson's series of articles on floating point for lots of FP gotchas.)
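A tiny demo of that asymmetry with the scalar (a<b) ? a : b formulation, which compiles to MINSS:

#include <math.h>
#include <stdio.h>

/* The unordered compare is false, so the second operand is returned
 * whether or not it is the NaN, matching MINSS's dest=src rule. */
static float minfloat(float a, float b) { return (a < b) ? a : b; }

int main(void)
{
    printf("%g\n", minfloat(NAN, 1.0f));  /* prints 1   (NaN in the 1st operand is dropped) */
    printf("%g\n", minfloat(1.0f, NAN));  /* prints nan (NaN in the 2nd operand wins)       */
    return 0;
}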

I think the intrinsics are supposed to have the same operand-order semantics as the asm instructions, but gcc has treated the operands to _mm_min_ps as commutative even without -ffast-math for a long time, going back to at least gcc4.4. gcc7 finally changed it to match icc and clang. Still, Intel's online intrinsics finder doesn't mention that behaviour for the function; OTOH, the asm insn ref manual doesn't say the intrinsic lacks it either, it just lists _mm_min_ss as the intrinsic for MINSS.

When I googled on "_mm_min_ps" NaN, I found this real code and some other discussion of using the intrinsic to handle NaNs, so clearly many people expect the intrinsic to behave like the asm instruction. (This came up for some code I was writing yesterday, and I was already thinking of writing this up as a self-answered Q&A.)

Given the existence of this longstanding gcc bug, portable code that wants to take advantage of _mm_min_ps's NaN handling needs to take precautions. The standard gcc version on many existing Linux distros will mis-compile your code if it depends on the order of operands to _mm_min_ps. So you probably need an #ifdef to detect actual gcc (not clang etc), and an alternative. Or just do it differently in the first place :/ Perhaps with a _mm_cmplt_ps and boolean AND/ANDNOT/OR.
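For reference, a minimal sketch of that compare/AND/ANDNOT/OR alternative (the helper name is just for illustration). The all-zero mask for unordered elements means it selects b whenever either input is NaN, matching the MINPS rule without relying on the intrinsic's operand order:

#include <xmmintrin.h>

/* Select a where a < b, else b.  cmplt produces an all-zero mask for
 * unordered (NaN) elements, so NaN inputs fall through to b. */
static __m128 min_ps_explicit(__m128 a, __m128 b)
{
    __m128 mask = _mm_cmplt_ps(a, b);            /* all-ones where a < b */
    return _mm_or_ps(_mm_and_ps(mask, a),        /* take a where a < b   */
                     _mm_andnot_ps(mask, b));    /* take b elsewhere     */
}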

Enabling -ffast-math also makes _mm_min_ps commutative on all compilers.


As usual, compilers know how to use the instruction set. MINSS and MAXSS are faster than anything you could do with a branch anyway, so just write code that can compile to one of those.

The commutative-_mm_min_ps issue only applies to the intrinsic: gcc knows exactly how MINSS/MINPS work, and uses them to correctly implement strict FP semantics (when you don't use -ffast-math).

You don't usually need to do anything special to get decent scalar code out of a compiler. If you're going to spend time caring about what instructions the compiler uses, you should probably start by manually vectorizing your code if the compiler isn't doing that.

(There may be rare cases where a branch is best, if the condition almost always goes one way and latency is more important than throughput. MINPS latency is ~3 cycles, but a perfectly predicted branch adds 0 cycles to the dependency chain of the critical path.)


In C++, use std::min and std::max, which are defined in terms of <, and don't have the same requirements on NaN behaviour that fmin and fmax do. Avoid fmin and fmax unless you need their NaN behaviour.

In C, I think you should just write your own min and max functions (or macros, if you do it safely).
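If you do take the macro route, the "safely" part mostly means avoiding double evaluation; a quick sketch (the macro names are just for illustration):

/* The naive macro evaluates its arguments twice, so MIN_NAIVE(x++, y)
 * can increment x twice: */
#define MIN_NAIVE(a, b)  ((a) < (b) ? (a) : (b))

/* One common fix (GNU C statement expressions, not portable ISO C)
 * evaluates each argument exactly once: */
#define MIN_SAFE(a, b)  ({ __typeof__(a) a_ = (a); \
                           __typeof__(b) b_ = (b); \
                           a_ < b_ ? a_ : b_; })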


C & asm on the Godbolt compiler explorer

float minfloat(float a, float b) {
  return (a<b) ? a : b;
}
# any decent compiler (gcc, clang, icc), without any -ffast-math or anything:
    minss   xmm0, xmm1
    ret

// C++
float minfloat_std(float a, float b) { return std::min(a,b); }
  # This implementation of std::min uses (b<a) ? b : a;
  # So it can only produce the result in the register that b was in
  # This isn't worse (when inlined), just opposite
    minss   xmm1, xmm0
    movaps  xmm0, xmm1
    ret


float minfloat_fmin(float a, float b) { return fminf(a, b); }

# clang inlines fmin; other compilers just tailcall it.
minfloat_fmin(float, float):
    movaps  xmm2, xmm0
    cmpunordss      xmm2, xmm2
    movaps  xmm3, xmm2
    andps   xmm3, xmm1
    minss   xmm1, xmm0
    andnps  xmm2, xmm1
    orps    xmm2, xmm3
    movaps  xmm0, xmm2
    ret
   # Obviously you don't want this if you don't need it.

If you want to use _mm_min_ss / _mm_min_ps yourself, write code that lets the compiler make good asm even without -ffast-math.

If you don't expect NaNs, or want to handle them specially, write stuff like

lowest = _mm_min_ps(lowest, some_loop_variable);

so the register holding lowest can be updated in-place (even without AVX).
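Putting that together, a minimal sketch (hypothetical function, assuming no NaNs and a length that's a multiple of 4) of a running minimum where the accumulator is both source and destination, so MINPS needs no extra register copies even without AVX:

#include <stddef.h>
#include <float.h>
#include <xmmintrin.h>

static float min_of_array(const float *x, size_t n)   /* assumes n % 4 == 0 */
{
    __m128 lowest = _mm_set1_ps(FLT_MAX);
    for (size_t i = 0; i < n; i += 4)
        lowest = _mm_min_ps(lowest, _mm_loadu_ps(x + i));  /* updated in place */

    /* horizontal min of the 4 lanes: swap within pairs, min, swap halves, min */
    __m128 shuf = _mm_shuffle_ps(lowest, lowest, _MM_SHUFFLE(2, 3, 0, 1));
    lowest = _mm_min_ps(lowest, shuf);
    shuf = _mm_shuffle_ps(lowest, lowest, _MM_SHUFFLE(1, 0, 3, 2));
    lowest = _mm_min_ps(lowest, shuf);
    return _mm_cvtss_f32(lowest);
}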