BeeOnRope - 6 months ago 28

C Question

I want to implement *unsigned integer division* by an arbitrary power of two, rounding *up*, efficiently. So what I, mathematically, is

`ceiling(p/q)`

`q`

`/** q must be a power of 2 */`

uint64_t divide(uint64_t p, uint64_t q) {

uint64_t res = p / q;

return p % q == 0 ? res : res + 1;

}

... of course, I don't actually want to use division or mod at the machine level, since that takes

`q`

You can assume we have an efficient

`lg(unsigned int x)`

`x`

`x`

Undefined behavior is fine if

`q`

Please note that the simple solution:

`(p+q-1) >> lg(q)`

`p == 2^64-100 and q == 256`

I'm interested in solutions in C, that are likely to perform well across a variety of platforms, but for the sake of concreteness, awarding the bounty and because any definitive discussion of performance needs to include a target architecture, I'll be specific about how I'll test them:

- Skylake CPU
- with compile flags
`gcc 5.4.0`

`-O3 -march=haswell`

Using gcc builtins (such as bitscan/leading zero builtins) is fine, and in particular I've implemented the

`lg()`

`inline uint64_t lg(uint64_t x) {`

return 63U - (uint64_t)__builtin_clzl(x);

}

inline uint32_t lg32(uint32_t x) {

return 31U - (uint32_t)__builtin_clz(x);

}

I verified that these compile down to a single

`bsr`

`-march=haswell`

I wrote a benchmark for the existing answers, and will share and update the results as changes are made.

Writing a good benchmark for a small, potentially inlined operation is quite tough. When code is inlined into a call site, a lot of the work of the function may disappear, especially when it's in a loop

You

`bench`

`return 0`

`call + ret`

So the two benchmarks I'll focus the most on repeatedly call the method under test in a loop, allowing inlining, cross-function optmization, loop hoisting and even vectorization.

There are two overall benchmark types: latency and throughput. The key difference is that in the latency benchmark, each call to

`divide`

`uint32_t bench_divide_latency(uint32_t p, uint32_t q) {`

uint32_t total = p;

for (unsigned i=0; i < ITERS; i++) {

total += divide_algo(total, q);

q = rotl1(q);

}

return total;

}

Note that the running

`total`

`divide`

The throughput variant, on the other hand, doesn't feed the output of one divide into the subsequent one. This allows work from one call to be overlapped with a subsequent one (both by the compiler, but especially the CPU), and even allows vectorization:

`uint32_t bench_divide_throughput(uint32_t p, uint32_t q) {`

uint32_t total = p;

for (unsigned i=0; i < ITERS; i++) {

total += fname(i, q);

q = rotl1(q);

}

return total;

}

Note that here we feed in the loop counter as the the dividend - this is

Furthermore, each benchmark has three flavors of behavior for the divisor,

`q`

- Compile-time constant divisor. For example, a call to . This is common in practice, and the code can be much simpler when the divisor is known at compile time.
`divide(p, 8)`

- Invariant divisor. Here the divisor is not know at compile time, but is constant for the whole benchmarking loop. This allows a subset of the optimizations that the compile-time constant does.
- Variable divisor. The divisor changes on each iteration of the loop. The benchmark functions above show this variant, using a "rotate left 1" instruction to vary the divisor.

Combining everything you get a total of 6 distinct benchmarks.

The results below test each algorithm over 1e9 iterations. Cycles are calculated simply by multiplying the time/call by the clock frequency. You can generally assume that something like

`4.01`

`4.00`

`5.11`

The results for

`divide_plusq_32`

`(p + q - 1) >> lg(q)`

`p + q`

`dummy`

`return p + q`

`==============================`

Bench: Compile-time constant Q

==============================

Function ns/call cycles

stoke32_32 1.93 5.00

divide_chux_32 1.55 4.01

divide_chux_64 1.55 4.01

divide_user23_32 1.97 5.11

divide_user23_64 1.93 5.00

divide_user23_variant_32 1.55 4.01

divide_user23_variant_64 1.55 4.01

divide_chrisdodd_32 1.95 5.05

divide_chrisdodd_64 1.93 4.99

divide_chris_32 4.63 12.00

divide_chris_64 4.52 11.71

divide_weather_32 2.72 7.04

divide_weather_64 2.78 7.20

divide_plusq_32 1.16 3.00

divide_plusq_64 1.16 3.00

divide_dummy_32 1.16 3.00

divide_dummy_64 1.16 3.00

==============================

Bench: Invariant Q

==============================

Function ns/call cycles

stoke32_32 1.93 5.00

divide_chux_32 1.56 4.03

divide_chux_64 1.55 4.01

divide_user23_32 1.95 5.04

divide_user23_64 1.99 5.17

divide_user23_variant_32 1.56 4.04

divide_user23_variant_64 1.55 4.01

divide_chrisdodd_32 1.95 5.04

divide_chrisdodd_64 1.93 4.99

divide_chris_32 4.60 11.90

divide_chris_64 4.57 11.85

divide_weather_32 12.54 32.48

divide_weather_64 19.18 49.69

divide_plusq_32 1.16 3.00

divide_plusq_64 1.16 3.00

divide_dummy_32 0.39 1.00

divide_dummy_64 0.39 1.00

==============================

Bench: Variable Q

==============================

Function ns/call cycles

stoke32_32 2.06 5.33

divide_chux_32 2.04 5.28

divide_chux_64 2.04 5.30

divide_user23_32 2.04 5.28

divide_user23_64 2.06 5.33

divide_user23_variant_32 2.04 5.27

divide_user23_variant_64 2.15 5.58

divide_chrisdodd_32 2.04 5.29

divide_chrisdodd_64 2.05 5.31

divide_chris_32 4.64 12.03

divide_chris_64 4.64 12.01

divide_weather_32 12.54 32.49

divide_weather_64 17.40 45.07

divide_plusq_32 1.93 4.99

divide_plusq_64 1.99 5.16

divide_dummy_32 0.39 1.02

divide_dummy_64 0.40 1.05

Here are the results for the throughput tests. Note that many of the algorithms here were auto-vectorized, so the performance is relatively

`==============================`

Bench: Compile-time constant Q

==============================

Function ns/call cycles

stoke32_32 0.39 1.00

divide_chux_32 0.15 0.39

divide_chux_64 0.53 1.37

divide_user23_32 0.14 0.36

divide_user23_64 0.53 1.37

divide_user23_variant_32 0.15 0.39

divide_user23_variant_64 0.53 1.37

divide_chrisdodd_32 1.16 3.00

divide_chrisdodd_64 1.16 3.00

divide_chris_32 4.34 11.23

divide_chris_64 4.34 11.24

divide_weather_32 1.35 3.50

divide_weather_64 1.35 3.50

divide_plusq_32 0.10 0.26

divide_plusq_64 0.39 1.00

divide_dummy_32 0.08 0.20

divide_dummy_64 0.39 1.00

==============================

Bench: Invariant Q

==============================

Function ns/call cycles

stoke32_32 0.48 1.25

divide_chux_32 0.15 0.39

divide_chux_64 0.48 1.25

divide_user23_32 0.17 0.43

divide_user23_64 0.58 1.50

divide_user23_variant_32 0.15 0.38

divide_user23_variant_64 0.48 1.25

divide_chrisdodd_32 1.16 3.00

divide_chrisdodd_64 1.16 3.00

divide_chris_32 4.35 11.26

divide_chris_64 4.36 11.28

divide_weather_32 5.79 14.99

divide_weather_64 17.00 44.02

divide_plusq_32 0.12 0.31

divide_plusq_64 0.48 1.25

divide_dummy_32 0.09 0.23

divide_dummy_64 0.09 0.23

==============================

Bench: Variable Q

==============================

Function ns/call cycles

stoke32_32 1.16 3.00

divide_chux_32 1.36 3.51

divide_chux_64 1.35 3.50

divide_user23_32 1.54 4.00

divide_user23_64 1.54 4.00

divide_user23_variant_32 1.36 3.51

divide_user23_variant_64 1.55 4.01

divide_chrisdodd_32 1.16 3.00

divide_chrisdodd_64 1.16 3.00

divide_chris_32 4.02 10.41

divide_chris_64 3.84 9.95

divide_weather_32 5.40 13.98

divide_weather_64 19.04 49.30

divide_plusq_32 1.03 2.66

divide_plusq_64 1.03 2.68

divide_dummy_32 0.63 1.63

divide_dummy_64 0.66 1.71

`/`

`ceiling`

`uint32_t`

`uint64_t`

`p`

`q`

`p + q >= 2^64`

`p`

`q`

`f`

`f`

`5 - 4 == 1`

Answer

This answer is about what's ideal in asm; what we'd like to convince the compiler to emit for us. (I'm not suggesting actually using inline asm, except as a point of comparison when benchmarking compiler output. https://gcc.gnu.org/wiki/DontUseInlineAsm).

**I did manage to get pretty good asm output from pure C for ceil_div_andmask, see below.**

It looks like Chris Dodd's general idea of `return ((p-1) >> lg(q)) + 1`

with special-case handling for d=0 is one of the best options. I.e. the optimal implementation of it in asm is hard to beat with an optimal implementation of anything else. Chux's `(p >> lg(q)) + (bool)(p & (q-1))`

also has advantages (like lower latency from p->result), and more CSE when the same q is used for multiple divisions. See below for a latency/throughput analysis of what gcc does with it.

If the same `e = lg(q)`

is reused for multiple dividends, or the same dividend is reused for multiple divisors, different implementations can CSE more of the expression. **They can also effectively vectorize with AVX2**.

Branches are cheap and very efficient if they predict very well, so branching on `d==0`

will be best if it's almost never taken. If `d==0`

is not rare, branchless asm will perform better on average. Ideally we can write something in C that will let gcc make the right choice during profile-guided optimization, and compiles to good asm for either case.

Since the best branchless asm implementations don't add much latency I can see are (e.g. happens more than 1% or 5% of the time (or less than 95% of the time))

It's hard to guide gcc5.4 into emitting anything optimal. This is my work-in-progress on Godbolt).

I think the optimal sequences for Skylake for this algorithm are as follows. (Shown as stand-alone functions for the AMD64 SysV ABI, but talking about throughput/latency on the assumption that the compiler will emit something similar inlined into a loop, with no RET attached).

**Branch on carry from calculating d-1 to detect d==0**, instead of a separate test & branch. Reduces the uop count nicely, esp on SnB-family where JC can macro-fuse with SUB.

```
ceil_div_pjc_branch:
xor eax,eax ; can take this uop off the fast path by adding a separate xor-and-return block, but in reality we want to inline something like this.
sub rdi, 1
jc .d_was_zero ; fuses with the sub on SnB-family
tzcnt rax, rsi ; tzcnt rsi,rsi also avoids any false-dep problems, but this illustrates that the q input can be read-only.
shrx rax, rdi, rax
inc rax
.d_was_zero:
ret
```

- Fused-domain uops: 5 (not counting ret), and one of them is an xor-zero (no execution unit)
- HSW/SKL latency with successful branch prediction:
- (d==0): No data dependency on d or q,
**breaks the dep chain**. (control dependency on d to detect mispredicts and retire the branch). - (d!=0): q->result: tzcnt+shrx+inc = 5c
- (d!=0):
**d->result: sub+shrx+inc = 3c**

- (d==0): No data dependency on d or q,
- Throughput: probably just bottlenecked on uop throughput

I've tried but failed to get gcc to branch on CF from the subtract, but it always wants to do a separate comparison. I know gcc can be coaxed into branching on CF after subtracting two variables, but maybe this fails if one is a compile-time constant. (IIRC, this typically compiles to a CF test with unsigned vars: `foo -= bar; if(foo>bar) carry_detected = 1;`

)

**Branchless with ADC / SBB to handle the d==0 case**. Zero-handling adds only one instruction to the critical path (vs. a version with no special handling for d==0), but also converts one other from an INC to a

`sbb rax, -1`

to make CF undo the `-= -1`

. Using a CMOV is cheaper on pre-Broadwell, but takes extra instructions to set it up.```
ceil_div_pjc_asm_adc:
tzcnt rsi, rsi
sub rdi, 1
adc rdi, 0 ; d? d-1 : d. Sets CF=CF
shrx rax, rdi, rsi
sbb rax, -1 ; result++ if d was non-zero
ret
```

- Fused-domain uops: 5 (not counting ret) on SKL. 7 on HSW
- SKL latency:
- q->result: tzcnt+shrx+sbb = 5c
**d->result: sub+adc+shrx(dep on q begins here)+sbb = 4c**

- Throughput: TZCNT runs on p1. SBB, ADC, and SHRX only run on p06. So I think we bottleneck on 3 uops for p06 per iteration, making this run at best
**one iteration per 1.5c**.

If q and d become ready at the same time, note that this version can run SUB/ADC in parallel with the 3c latency of TZCNT. If both are coming from the same cache-miss cache line, it's certainly possible. In any case, introducing the dep on q as late as possible in the d->result dependency chain is an advantage.

Getting this from C seems unlikely with gcc5.4. There is an intrinsic for add-with-carry, but gcc makes a total mess of it. It doesn't use immediate operands for ADC or SBB, and stores the carry into an integer reg between every operation. gcc7, clang3.9, and icc17 all make terrible code from this.

```
#include <x86intrin.h>
// compiles to completely horrible code, putting the flags into integer regs between ops.
T ceil_div_adc(T d, T q) {
T e = lg(q);
unsigned long long dm1; // unsigned __int64
unsigned char CF = _addcarry_u64(0, d, -1, &dm1);
CF = _addcarry_u64(CF, 0, dm1, &dm1);
T shifted = dm1 >> e;
_subborrow_u64(CF, shifted, -1, &dm1);
return dm1;
}
# gcc5.4 -O3 -march=haswell
mov rax, -1
tzcnt rsi, rsi
add rdi, rax
setc cl
xor edx, edx
add cl, -1
adc rdi, rdx
setc dl
shrx rdi, rdi, rsi
add dl, -1
sbb rax, rdi
ret
```

**CMOV to fix the whole result**: worse latency from q->result, since it's used sooner in the d->result dep chain.

```
ceil_div_pjc_asm_cmov:
tzcnt rsi, rsi
sub rdi, 1
shrx rax, rdi, rsi
lea rax, [rax+1] ; inc preserving flags
cmovc rax, zeroed_register
ret
```

- Fused-domain uops: 5 (not counting ret) on SKL. 6 on HSW
- SKL latency:
- q->result: tzcnt+shrx+lea+cmov = 6c (worse than ADC/SBB by 1c)
**d->result: sub+shrx(dep on q begins here)+lea+cmov = 4c**

- Throughput: TZCNT runs on p1. LEA is p15. CMOV and SHRX are p06. SUB is p0156. In theory only bottlenecked on fused-domain uop throughput, so
**one iteration per 1.25c**. With lots of independent operations, resource conflicts from SUB or LEA stealing p1 or p06 shouldn't be a throughput problem because at 1 iter per 1.25c, no port is saturated with uops that can only run on that port.

**CMOV to get an operand for SUB**: I was hoping I could find a way to create an operand for a later instruction that would produce a zero when needed, without an input dependency on q, e, or the SHRX result. This would help if `d`

is ready before `q`

, or at the same time.

This doesn't achieve that goal, and needs an extra 7-byte `mov rdx,-1`

in the loop.

```
ceil_div_pjc_asm_cmov:
tzcnt rsi, rsi
mov rdx, -1
sub rdi, 1
shrx rax, rdi, rsi
cmovnc rdx, rax
sub rax, rdx ; res += d ? 1 : -res
ret
```

**Lower-latency version for pre-BDW CPUs with expensive CMOV**, using SETCC to create a mask for AND.

```
ceil_div_pjc_asm_setcc:
xor edx, edx ; needed every iteration
tzcnt rsi, rsi
sub rdi, 1
setc dl ; d!=0 ? 0 : 1
dec rdx ; d!=0 ? -1 : 0 // AND-mask
shrx rax, rdi, rsi
inc rax
and rax, rdx ; zero the bogus result if d was initially 0
ret
```

Still **4c latency from d->result** (and 6 from q->result), because the SETC/DEC happen in parallel with the SHRX/INC. Total uop count: 8. Most of these insns can run on any port, so it should be 1 iter per 2 clocks.

Of course, for pre-HSW, you also need to replace SHRX.

**We can get gcc5.4 to emit something nearly as good:** (still uses a separate TEST instead of setting mask based on `sub rdi, 1`

, but otherwise the same instructions as above). See it on Godbolt.

```
T ceil_div_andmask(T p, T q) {
T mask = -(T)(p!=0); // TEST+SETCC+NEG
T e = lg(q);
T nonzero_result = ((p-1) >> e) + 1;
return nonzero_result & mask;
}
```

When the compiler knows that `p`

is non-zero, it takes advantage and makes nice code:

```
// http://stackoverflow.com/questions/40447195/can-i-hint-the-optimizer-by-giving-the-range-of-an-integer
#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
#define assume(x) do{if(!(x)) __builtin_unreachable();}while(0)
#else
#define assume(x) (void)(x) // still evaluate it once, for side effects in case anyone is insane enough to put any inside an assume()
#endif
T ceil_div_andmask_nonzerop(T p, T q) {
assume(p!=0);
return ceil_div_andmask(p, q);
}
# gcc5.4 -O3 -march=haswell
xor eax, eax # gcc7 does tzcnt in-place instead of wasting an insn on this
sub rdi, 1
tzcnt rax, rsi
shrx rax, rdi, rax
add rax, 1
ret
```

only 3c latency from p->result, and constant q can CSE a lot.

```
T divide_A_chux(T p, T q) {
bool round_up = p & (q-1); // compiles differently from user23_variant with clang: AND instead of
return (p >> lg(q)) + round_up;
}
xor eax, eax # in-place tzcnt would save this
xor edx, edx # target for setcc
tzcnt rax, rsi
sub rsi, 1
test rsi, rdi
shrx rdi, rdi, rax
setne dl
lea rax, [rdx+rdi]
ret
```

Doing the SETCC before TZCNT would allow an in-place TZCNT, saving the `xor eax,eax`

. I haven't looked at how this inlines in a loop.

- Fused-domain uops: 8 (not counting ret) on HSW/SKL
- HSW/SKL latency:
- q->result: (tzcnt+shrx(p) | sub+test(p)+setne) + lea(or add) = 5c
**d->result: test(dep on q begins here)+setne+lea = 3c**. (the shrx->lea chain is shorter, and thus not the critical path)

- Throughput: Probably just
**bottlenecked on the frontend, at one iter per 2c**. Saving the`xor eax,eax`

should speed this up to one per 1.75c (but of course any loop overhead will be part of the bottleneck, because frontend bottlenecks are like that).