
Why is this C++ code faster than assembly?

I wrote these two solutions for Project Euler Q14, in assembly and in C++. They use the same brute-force approach. The assembly solution was assembled with

nasm -felf64 p14.asm && gcc p14.o -o p14


The C++ was compiled with

g++ p14.cpp -o p14


Assembly, p14.asm


section .data
fmt db "%d", 10, 0

global main
extern printf

section .text

main:
mov rcx, 1000000
xor rdi, rdi ; max i
xor rsi, rsi ; i

l1:
dec rcx
xor r10, r10 ; count
mov rax, rcx

l2:
test rax, 1
jpe even

mov rbx, 3
mul rbx
inc rax
jmp c1

even:
mov rbx, 2
xor rdx, rdx
div rbx

c1:
inc r10
cmp rax, 1
jne l2

cmp rdi, r10
cmovl rdi, r10
cmovl rsi, rcx

cmp rcx, 2
jne l1

mov rdi, fmt
xor rax, rax
call printf
ret


C++, p14.cpp

#include <iostream>

using namespace std;

int sequence(long n) {
int count = 1;
while (n != 1) {
if (n % 2 == 0)
n /= 2;
else
n = n*3 + 1;

++count;
}

return count;
}

int main() {
int max = 0, maxi;
for (int i = 999999; i > 0; --i) {
int s = sequence(i);
if (s > max) {
max = s;
maxi = i;
}
}

cout << maxi << endl;
}


I know compiler optimizations can improve speed and all that, but I don't see many ways to optimize my assembly solution further (speaking programmatically, not mathematically).

The C++ code computes a modulus every term and a division every even term, whereas the assembly does only one division per even term.

But the assembly solution is taking on average 1 second longer than the C++ solution. Why is this? I am asking mainly out of curiosity.

Answer

If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0 (compile fast, don't spend extra time optimizing).

See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide with specific details for specific CPUs. See also the x86 tag wiki for more performance links.


even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

On Intel Haswell for example, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but those don't matter). Microcoded instructions (like DIV) can also cause front-end bottlenecks.

Compare this with shr rax, 1 to do the same unsigned division: it's 1 uop, with 1c latency, and can run at one per 0.5 cycles (i.e. 2 per clock). BTW, n has a signed type, so to help the compiler make better code you should really have used unsigned long instead; you're already using unsigned division in your asm.

For comparison, idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell. So it's still totally horrible compared to a shift (even including extra instructions to account for signed division by 2, and the different rounding for negative numbers). The compiler would have used idiv r64, though, because n is a long. (At least if you're compiling for Linux/OS X. In the Windows ABI, long is 32-bit even in 64-bit code.)
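
To make that concrete, here's a tiny C++ sketch (the helper names are mine, purely for illustration) of the two divisions the compiler has to implement, depending on the signedness of n:

unsigned long half_u(unsigned long n) { return n / 2; }  // unsigned: a single shr
long          half_s(long n)          { return n / 2; }  // signed: must round toward zero, so gcc
                                                          // adds (n < 0) as a bias, then uses sar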


As you can see from looking at gcc's asm output (with -O0 on the Godbolt compiler explorer), it doesn't use any divide instructions in its implementation of sequence(), just shr and sar.

(-O0 doesn't mean perverse de-optimized code, it just means don't do any extra work optimizing beyond what's necessary to compile. Since gcc always recognizes division-by-constant and doesn't have a mode where it emits an actual div instruction for that, you get not-horrible code even at -O0.)


BTW, gcc's optimized asm output looks pretty good (using unsigned long n): the inner loop it inlines into main() does this:

 # from gcc5.4 `-O3 -fverbose-asm`, plus my comments
 # on C++ source that uses  unsigned long   instead of  long

 # edx=1   (count)
 # rax=n

.L9:                                  # do{
    lea     rcx, [rax+1+rax*2]        # rcx = 3*n + 1
    mov     rdi, rax  # n, n
    shr     rdi       # n             # rdi = n>>1;
    test    al, 1   # n,              # check n%2
    mov     rax, rcx  # n, n
    cmove   rax, rdi      # n,, n     # n= (n%2) ? 3*n+1 : n/2;
    add     edx, 1    # max,          # ++count;
    cmp     rax, 1    # n,
    jne     .L9       #,              # }while(n!=1)

  cmp/branch to update max and maxi, and then do the next n

So the inner loop is branchless, and the critical path of the loop-carried dependency chain is:

  • 3-component LEA (3 cycles)
  • cmov (2 cycles on Haswell, 1c on Broadwell or later).

Total: a 5-cycle-per-iteration latency bottleneck. Out-of-order execution takes care of everything else in parallel with this (ideally; I haven't tested with perf counters to see if it really runs at 5c/iter).

The FLAGS input of cmov (produced by TEST) is faster to produce than the RAX input (from LEA->MOV), so it's not on the critical path.

Similarly, the MOV->SHR that produces CMOV's RDI input is off the critical path, because it's also faster than the LEA. MOV on IvyBridge and later has zero latency (handled at register-rename time). It still takes a uop, and a slot in the pipeline, so it's not free in general. But we're only bottlenecked on latency, so it doesn't matter here. The extra MOV in the LEA dep chain is part of the bottleneck on other CPUs.

The cmp/jne is also not part of the critical path: it's not loop-carried, because branch prediction takes control dependencies out of the critical path.


Beating the compiler

gcc did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false dependencies for partial-flag-modifying instructions.

It could also save all the MOV instructions, and the TEST: SHR sets CF= the bit shifted out, so we can use cmovc instead of test / cmovz.

 ### Hand-optimized version of what gcc does
.L9:                                  # do{
    lea     rcx, [rax+1+rax*2]        # rcx = 3*n + 1
    shr     rax                       # n>>=1;  // in-place, setting CF = n&1 = n%2
    cmovc   rax, rcx                  # n= (n&1) ? 3*n+1 : n/2;
    inc     edx                       # ++count;
    cmp     rax, 1
    jne     .L9                       # }while(n!=1)

    # see @johnfound's answer for another clever trick, saving the CMP because SHR's result is only zero if it was 1 (or 0) to start with.

This doesn't help with the latency at all on Haswell. But it helps significantly on CPUs like AMD Bulldozer-family, where MOV is not zero-latency, and complex-LEA and CMOV are both lower latency (2c and 1c respectively). So the compiler's wasted MOV instructions do affect the critical path. Also, throughput bottlenecks become an issue, because it only has two integer ALU pipes. See @johnfound's answer, where he has timing results from an AMD CPU. On Intel CPUs, MOV only became zero-latency with IvyBridge.

Even on Haswell, this version may help a bit, and it avoids occasional resource conflicts where a non-critical uop steals an execution port from one on the critical path, delaying execution by 1 cycle. It also saves a register, which may help when doing multiple n values in parallel in an interleaved loop (see below).

We could actually shorten the dependency chain by doing the 3n+1 in two 1-cycle latency instructions: lea rcx, [rax + rax*2] / inc rcx. A 2-component (base+index) LEA only requires one add, so Intel SnB-family CPUs can do it in one cycle. (Intel SnB-family standardizes latencies so there are no 2c uops, which is why the 3-component LEA's two addition operations give it 3c latency instead of 2 like AMD.) Normally an extra uop to save 1c latency wouldn't be worth it, but it might be here.

Neither gcc, icc, nor clang (on godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things, because that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't have full models of the pipeline in the target microarchitecture, just some heuristics.)


Simple loop unrolling won't help; this loop bottlenecks on the latency of a loop-carried dependency chain, not on loop overhead / throughput. This means it would do well with hyperthreading, since the CPU has lots of time to interleave instructions from two threads. This would mean parallelizing the loop in main, but that's fine because each thread can just check a range of n values and produce a pair of integers as a result.
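
For instance, a minimal sketch of that threading idea (assuming the question's sequence() function; the two-way split and names are arbitrary choices of mine):

#include <iostream>
#include <thread>
#include <utility>

int sequence(long n);   // the question's function

// Each worker scans part of the range and reports its best {count, i} pair.
static std::pair<int, long> best_in_range(long lo, long hi) {
    int max = 0; long maxi = lo;
    for (long i = lo; i < hi; ++i) {
        int s = sequence(i);
        if (s > max) { max = s; maxi = i; }
    }
    return {max, maxi};
}

int main() {
    std::pair<int, long> a, b;
    std::thread t([&] { a = best_in_range(1, 500000); });   // one half on another thread
    b = best_in_range(500000, 1000000);                     // the other half on this thread
    t.join();
    std::cout << (a.first > b.first ? a.second : b.second) << '\n';
}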

Interleaving by hand within a single thread might be viable, too. Maybe compute the sequence for a pair of numbers in parallel, since each one only takes a couple registers, and they can all update the same max / maxi. This creates more instruction-level parallelism.

The trick is deciding whether to wait until all the n values have reached 1 before getting another pair of starting n values, or whether to break out and get a new start point for just one that reached the end condition, without touching the registers for the other sequence. Probably it's best to keep each chain working on useful data, otherwise you'd have to conditionally increment its counter.
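
As a rough scalar sketch of the simple option (keep stepping both chains, but freeze a chain's count once it reaches 1); the names and structure here are mine and untested:

// Two independent dependency chains per loop iteration, so their latencies can overlap.
static void sequence_pair(unsigned long a, unsigned long b, int &count_a, int &count_b) {
    count_a = count_b = 1;
    int run_a = (a != 1), run_b = (b != 1);
    while (run_a | run_b) {
        unsigned long na = (a & 1) ? 3*a + 1 : a / 2;   // hopefully compiled to cmov, like above
        unsigned long nb = (b & 1) ? 3*b + 1 : b / 2;
        if (run_a) { a = na; ++count_a; run_a = (a != 1); }
        if (run_b) { b = nb; ++count_b; run_b = (b != 1); }
    }
}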


You could maybe even do this with SSE packed-compare stuff to conditionally increment the counter for vector lanes where n hadn't reached 1 yet. And then to hide the even longer latency of a SIMD conditional-increment implementation, you'd need to keep more vectors of n values up in the air. Maybe not worth it with just two 64-bit integers per 128b XMM register, but maybe worth it with AVX2 for 256b integer SIMD.

I think detecting when a vector element has reached 1 and avoiding further updates of the counter for that element position will suck, because the sequence doesn't stay at 1 after reaching it. So you'd need an extra instruction or two to force elements to not count. Hmm, maybe just one by ORing the compare mask into a vector, so you get a "sticky" vector that records if an element has ever been 1.

Or, instead of keeping a fixed vector of all-1 elements for the increment, permanently zero out its elements as they finish, until it's all 0s. Yeah, that should work without using up too many registers per vector accumulator.

Untested idea for manual vectorization

# starting with YMM0 = [ n_d, n_c, n_b, n_a ]  (64-bit elements)
# ymm4 = _mm256_set1_epi64x(1):  increment vector
# ymm5 = all-zeros:  count vector

.inner_loop:
    vpaddq    ymm1, ymm0, ymm0
    vpaddq    ymm1, ymm1, ymm0
    vpaddq    ymm1, ymm1, set1_epi64(1)     # ymm1 = 3*n + 1.  Maybe could do this more efficiently?

    vpsllq    ymm3, ymm0, 63                # shift bit 0 (n&1) up into the sign bit

    vpsrlq    ymm0, ymm0, 1                 # n /= 2

    # There may be a better way to do this blend, avoiding the bypass delay for an FP blend between integer insns, not sure.  Probably worth it
    vblendvpd ymm0, ymm0, ymm1, ymm3        # variable blend controlled by the sign bit of each 64-bit element.  I might have the source operands backwards, I always have to look this up.

    # ymm0 = updated n  in each element.

    vpcmpeqq ymm1, ymm0, set1_epi64(1)
    vpandn   ymm4, ymm1, ymm4         # zero out elements of ymm4 where the compare was true

    vpaddq   ymm5, ymm5, ymm4         # count++ in elements where n has never been == 1

    vptest   ymm4, ymm4
    jnz  .inner_loop
    # Fall through when all the n values have reached 1 at some point, and our increment vector is all-zero

    vextracti128 xmm0, ymm5, 1              # high half of the count vector
    vpmaxq .... crap this doesn't exist
    # Actually just delay doing a horizontal max until the very very end.  But you need some way to record max and maxi.

You can and should implement this with intrinsics, instead of hand-written asm.
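
For example, a rough, untested AVX2 intrinsics translation of the sketch above might look like this (compile with -mavx2; the function and variable names are mine, and this mirrors the asm idea rather than any known-good implementation):

#include <immintrin.h>

// v_n: four 64-bit n values.  v_inc: starts as set1(1); a lane is zeroed forever once its n reaches 1.
// v_count: per-lane step counts.
static inline void collatz_step_avx2(__m256i &v_n, __m256i &v_inc, __m256i &v_count) {
    const __m256i ones = _mm256_set1_epi64x(1);
    __m256i n3p1 = _mm256_add_epi64(_mm256_add_epi64(v_n, v_n), v_n);   // 3*n
    n3p1 = _mm256_add_epi64(n3p1, ones);                                // 3*n + 1
    __m256i odd_mask = _mm256_slli_epi64(v_n, 63);    // move n&1 into each element's sign bit
    __m256i half     = _mm256_srli_epi64(v_n, 1);     // n / 2
    // blendv takes the second source where the mask's sign bit is set (odd n); FP domain, as discussed
    v_n = _mm256_castpd_si256(_mm256_blendv_pd(_mm256_castsi256_pd(half),
                                               _mm256_castsi256_pd(n3p1),
                                               _mm256_castsi256_pd(odd_mask)));
    __m256i done = _mm256_cmpeq_epi64(v_n, ones);      // lanes that just hit 1
    v_inc = _mm256_andnot_si256(done, v_inc);          // make "done" sticky: zero their increment
    v_count = _mm256_add_epi64(v_count, v_inc);        // count++ only in still-running lanes
}

// caller:  do { collatz_step_avx2(n, inc, cnt); } while (!_mm256_testz_si256(inc, inc));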


Algorithmic / implementation improvement:

Besides just implementing the same logic with more efficient asm, look for ways to simplify the logic, or avoid redundant work. e.g. memoize to detect common endings to sequences.
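
For instance, a minimal (untested) sketch of the memoization idea, caching step counts for n below an arbitrary limit; the limit and names are mine:

#include <vector>

// step_cache[i] = sequence length starting at i, 0 = not computed yet; step_cache[1] = 1.
static std::vector<int> step_cache = []{ std::vector<int> c(1000000, 0); c[1] = 1; return c; }();

int sequence_memo(unsigned long n) {
    if (n < step_cache.size() && step_cache[n]) return step_cache[n];
    unsigned long next = (n % 2 == 0) ? n / 2 : 3*n + 1;
    int steps = 1 + sequence_memo(next);
    if (n < step_cache.size()) step_cache[n] = steps;
    return steps;
}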

@EOF points out that tzcnt (or bsf) could be used to do multiple n/=2 iterations in one step. That's probably better than SIMD vectorizing, because no SSE or AVX instruction can do that. It's still compatible with doing multiple scalar ns in parallel in different integer registers, though.

So the loop might look like this (in C structured the way the asm would look):

// handle any initialization necessary; n should be an unsigned 64-bit type
// _tzcnt_u64 is from <immintrin.h> (BMI1); __builtin_ctzll(n) is the portable GNU equivalent

goto loop_entry;  // Writing it this way to show how the asm would work.  Don't use a goto for this in real C++
do {
   n = n*3 + 1;
   ++count;                  // count the 3n+1 step itself
  loop_entry:
   shift = _tzcnt_u64(n);    // trailing-zero count = how many n/=2 steps we can do at once
   n >>= shift;
   count += shift;
} while(n != 1);