Hedy Hedy - 3 months ago 8
Linux Question

Big difference in overhead caused by instructions in straight-line code

I am trying to understand the overhead in [blk_account_io_completion][1] in the Linux block layer. Using perf annotate, I get the following snippet (abridged). Can someone shed some light on why the add and test instructions show such high overhead compared to the neighboring instructions that execute alongside them?

: part_stat_add(cpu, part, sectors[rw], bytes >> 9);
0.13 : ffffffff813336eb: movsxd r8,r8d
0.00 : ffffffff813336ee: lea rdx,[rax*8+0x0]
0.00 : ffffffff813336f6: mov rcx,QWORD PTR [rdi+0x210]
72.04 : ffffffff813336fd: add rcx,QWORD PTR [r8*8-0x7e2df6a0]
0.22 : ffffffff81333705: add QWORD PTR [rcx+rdx*1],rsi
0.61 : ffffffff81333709: mov eax,DWORD PTR [rdi+0x1f4]
26.52 : ffffffff8133370f: test eax,eax
0.00 : ffffffff81333711: je ffffffff81333733 <blk_account_io_completion+0x83>

Answer

One possible reason is that these instructions simply happen to be the ones the instruction pointer points to when a sample is taken. A typical x86 CPU can retire up to 4 instructions per cycle, but when a sample is taken during such a cycle, the program counter points to just one of those instructions, not all four, so that one instruction is charged for the whole group.
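
To make the attribution effect concrete, here is a toy model in C. It is only a sketch under simplifying assumptions of my own, not a description of how any real CPU or perf works: the loop body retires exactly `width` instructions per cycle, and each sample is charged to the first instruction of the group retiring in that cycle. The function name `attribute_samples` and all of its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of sample attribution (an assumption for illustration only).
 * A loop body of n_instr instructions retires `width` instructions per
 * cycle. Every `period` cycles a sample is taken and charged to the first
 * instruction of the group retiring in that cycle.
 * hits[] must hold n_instr counters; it is zeroed here.
 */
void attribute_samples(size_t n_instr, size_t width,
                       unsigned long n_cycles, unsigned long period,
                       unsigned long *hits)
{
    assert(hits != NULL);
    assert(n_instr > 0 && width > 0 && period > 0);

    for (size_t i = 0; i < n_instr; i++)
        hits[i] = 0;

    size_t pc = 0;  /* index of the first instruction in the current group */
    for (unsigned long cycle = 0; cycle < n_cycles; cycle++) {
        if (cycle % period == 0)
            hits[pc]++;               /* only one instruction gets the sample */
        pc = (pc + width) % n_instr;  /* the whole group retires this cycle */
    }
}
```

With 20 instructions and a width of 4, only instructions 0, 4, 8, 12 and 16 ever receive samples in this model, and the three instructions between each pair receive none, even though every instruction costs the same.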

Here is a measured example, shown in the profile at the end of this answer: a plain loop with a run of nop instructions. Note how the clock ticks pile up on every fourth nop, with exactly three instructions in each gap, although every nop costs the same. This may be similar to the effect you are seeing.

Alternatively, it could be that mov rcx,QWORD PTR [rdi+0x210] and mov eax,DWORD PTR [rdi+0x1f4] often miss the cache, with the cycles spent waiting on them attributed to the next instruction. In your listing, the add and the test are exactly the instructions that follow those two loads.

       │    Disassembly of section .text:
       │
       │    00000000004004ed :
       │      push   %rbp
       │      mov    %rsp,%rbp
       │      movl   $0x0,-0x4(%rbp)
       │    ↓ jmp    25
 14.59 │ d:   nop
       │      nop
       │      nop
  0.03 │      nop
 14.58 │      nop
       │      nop
       │      nop
  0.08 │      nop
 13.89 │      nop
       │      nop
  0.01 │      nop
  0.08 │      nop
 13.99 │      nop
       │      nop
  0.01 │      nop
  0.05 │      nop
 13.92 │      nop
       │      nop
  0.01 │      nop
  0.07 │      nop
 14.44 │      addl   $0x1,-0x4(%rbp)
  0.33 │25:   cmpl   $0x3fffffff,-0x4(%rbp)
 13.90 │    ↑ jbe    d
       │      pop    %rbp
       │    ← retq
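
For anyone who wants to reproduce a profile like the one above, here is a sketch of C code that should compile to a similar loop. The original source is not shown, so this is a reconstruction from the disassembly and rests on assumptions: GCC or Clang inline assembly, a counter kept in memory, and a hypothetical function name `run_nop_loop`. The exact instructions emitted will depend on compiler and flags.

```c
#include <assert.h>
#include <stdint.h>

/* Four nops in one asm statement; used five times for 20 nops per iteration. */
#define NOP4 __asm__ volatile("nop\n\tnop\n\tnop\n\tnop")

/*
 * Runs the nop loop for counter values 0..last inclusive and returns the
 * number of iterations executed. The profiled loop compares the counter
 * against 0x3fffffff, which corresponds to last = 0x3fffffff.
 */
uint32_t run_nop_loop(uint32_t last)
{
    assert(last < UINT32_MAX);  /* otherwise i <= last could never be false */

    volatile uint32_t i;        /* volatile keeps the counter in memory */
    for (i = 0; i <= last; i++) {
        NOP4; NOP4; NOP4; NOP4; NOP4;
    }
    return i;                   /* equals last + 1 after the loop */
}
```

Calling run_nop_loop(0x3fffffff) from main, recording the run with perf record and then viewing it with perf annotate should give a comparable picture, though the positions of the hot instructions may differ between CPUs.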