
# What is the instruction that gives branchless FP min and max on x86?

To quote the article (thanks to the author for developing and sharing the algorithm!):

https://tavianator.com/fast-branchless-raybounding-box-intersections/

> Since modern floating-point instruction sets can compute min and max without branches

Corresponding code by the author is just

```c
double dmnsn_min(double a, double b)
{
    return a < b ? a : b;
}
```

I'm familiar with e.g. `_mm_max_ps`, but that's a vector instruction. The code above is obviously meant to be used in scalar form.

Question:

• What is the scalar branchless min/max instruction on x86? Is it a single instruction or a sequence of instructions?

• Can I rely on the compiler emitting it, or do I need to invoke it explicitly?

• Does it make sense to worry about making min/max branchless at all? From what I understand, in a raytracer or other visualization software, given a ray-box intersection routine, there is no reliable pattern for the branch predictor to pick up, so it does make sense to eliminate the branch. Am I right about this?

• Most importantly, the algorithm discussed is built around comparing against (+/-)INFINITY. Is this reliable with respect to the (unknown) instruction we're discussing and the floating-point standard?

Just in case: I'm familiar with Use of min and max functions in C++, and believe it's related but not quite my question.

---

Most vector FP instructions have scalar equivalents. MINSS / MAXSS / MINSD / MAXSD are what you want. They handle +/-Infinity the way you'd expect.

Don't try to use `_mm_min_ss` on scalar floats; the intrinsic is only available with `__m128` operands, and Intel's intrinsics don't provide any way to get a scalar float into the low element of a `__m128` without zeroing the high elements or somehow doing extra work. Most compilers will actually emit the useless instructions to do that even if the final result doesn't depend on anything in the upper elements. There's nothing like `__m256 _mm256_castps128_ps256 (__m128 a)` to just cast a float to a `__m128` with garbage in the upper elements. I consider this a design flaw. :/

Note their asymmetric behaviour with NaN: if the operands are unordered, dest=src (i.e. it takes the second operand if either operand is NaN). The corresponding `_mm_min_ss` / `_mm_min_ps` intrinsics may or may not have this behaviour, depending on the compiler.

(`a` and `b` are unordered if either of them is NaN. That means `a<b`, `a==b`, and `a>b` are all false. See Bruce Dawson's series of articles on floating point for lots of FP gotchas.)

I think the intrinsics are supposed to have the same operand-order semantics as the asm instructions, but gcc treated the operands to `_mm_min_ps` as commutative, even without `-ffast-math`, for a long time, going back to at least gcc4.4. gcc7 finally changed it to match icc and clang. Still, Intel's online intrinsics finder doesn't mention this behaviour for the intrinsic, but on the other hand the asm instruction reference manual, which lists `_mm_min_ss` as the intrinsic for MINSS, doesn't mention that the intrinsic lacks the behaviour either.

When I googled on `"_mm_min_ps" NaN`, I found this real code and some other discussion of using the intrinsic to handle NaNs, so clearly many people expect the intrinsic to behave like the asm instruction. (This came up for some code I was writing yesterday, and I was already thinking of writing this up as a self-answered Q&A.)

Given the existence of this longstanding gcc bug, portable code that wants to take advantage of `_mm_min_ps`'s NaN handling needs to take precautions. The standard gcc version on many existing Linux distros will mis-compile your code if it depends on the order of operands to `_mm_min_ps`. So you probably need an `#ifdef` to detect actual gcc (not clang etc), and an alternative. Or just do it differently in the first place :/ Perhaps with a `_mm_cmplt_ps` and boolean AND/ANDNOT/OR.

Enabling `-ffast-math` also makes `_mm_min_ps` commutative on all compilers.

As usual, compilers know how to use the instruction set. MINSS and MAXSS are faster than anything you could do with a branch anyway, so just write code that can compile to one of those.

The commutative-`_mm_min_ps` issue only applies to the intrinsic: gcc knows exactly how MINSS/MINPS work, and uses them to correctly implement strict FP semantics (when you don't use -ffast-math).

You don't usually need to do anything special to get decent scalar code out of a compiler. If you're going to spend time caring about what instructions the compiler uses, you should probably start by manually vectorizing your code if the compiler isn't doing that.

(There may be rare cases where a branch is best, if the condition almost always goes one way and latency is more important than throughput. MINPS latency is ~3 cycles, but a perfectly predicted branch adds 0 cycles to the dependency chain of the critical path.)

In C++, use `std::min` and `std::max`, which are defined in terms of `<` and don't have the same requirements on NaN behaviour that `fmin` and `fmax` do. Avoid `fmin` and `fmax` unless you need their NaN behaviour.

In C, just write your own `min` and `max` functions (or macros, if you write them carefully enough to avoid double evaluation).

C & asm on the Godbolt compiler explorer

```
float minfloat(float a, float b) {
    return (a<b) ? a : b;
}
# any decent compiler (gcc, clang, icc), without any -ffast-math or anything:
minss   xmm0, xmm1
ret

// C++
float minfloat_std(float a, float b) { return std::min(a,b); }
# This implementation of std::min uses (b<a) ? b : a;
# So it can only produce the result in the register that b was in.
# This isn't worse (when inlined), just opposite.
minss   xmm1, xmm0
movaps  xmm0, xmm1
ret

float minfloat_fmin(float a, float b) { return fminf(a, b); }

# clang inlines fmin; other compilers just tailcall it.
minfloat_fmin(float, float):
movaps  xmm2, xmm0
cmpunordss      xmm2, xmm2
movaps  xmm3, xmm2
andps   xmm3, xmm1
minss   xmm1, xmm0
andnps  xmm2, xmm1
orps    xmm2, xmm3
movaps  xmm0, xmm2
ret
# Obviously you don't want this if you don't need it.
```

If you want to use `_mm_min_ss` / `_mm_min_ps` yourself, write code that lets the compiler make good asm even without -ffast-math.

If you don't expect NaNs, or want to handle them specially, write stuff like

```c
lowest = _mm_min_ps(lowest, some_loop_variable);
```

so the register holding `lowest` can be updated in-place (even without AVX).
