Anuj Kalia - 2 months ago
C++ Question

Busy loop slows down latency-critical computation

My code does the following:


  1. Do some long-running intense computation (called useless below)

  2. Do a small latency-critical task



I find that the time it takes to execute the latency-critical task is higher with the long-running computation than without it.

Here is some stand-alone C++ code to reproduce this effect:

#include <stdio.h>
#include <stdint.h>

#define LEN 128
#define USELESS 1000000000
//#define USELESS 0

// Read timestamp counter
static inline long long get_cycles()
{
    unsigned low, high;
    unsigned long long val;
    asm volatile ("rdtsc" : "=a" (low), "=d" (high));
    val = high;
    val = (val << 32) | low;
    return val;
}

// Compute a simple hash
static inline uint32_t hash(uint32_t *arr, int n)
{
    uint32_t ret = 0;
    for(int i = 0; i < n; i++) {
        ret = (ret + (324723947 + arr[i])) ^ 93485734985;
    }
    return ret;
}

int main()
{
    uint32_t sum = 0;  // For adding dependencies
    uint32_t arr[LEN]; // We'll compute the hash of this array

    for(int iter = 0; iter < 3; iter++) {
        // Create a new array to hash for this iteration
        for(int i = 0; i < LEN; i++) {
            arr[i] = (iter + i);
        }

        // Do intense computation
        for(int useless = 0; useless < USELESS; useless++) {
            sum += (sum + useless) * (sum + useless);
        }

        // Do the latency-critical task
        long long start_cycles = get_cycles() + (sum & 1);
        sum += hash(arr, LEN);
        long long end_cycles = get_cycles() + (sum & 1);

        printf("Iteration %d cycles: %lld\n", iter, end_cycles - start_cycles);
    }
}


When compiled with -O3 and USELESS set to 1 billion, the three iterations took 588, 4184, and 536 cycles, respectively. When compiled with USELESS set to 0, the iterations took 394, 358, and 362 cycles, respectively.

Why could this (particularly the 4184 cycles) be happening? I suspected cache misses or branch mispredictions induced by the intense computation. However, without the intense computation, even the zeroth iteration of the latency-critical task is fast, so a cold cache or cold branch predictor does not seem to be the cause.

Answer

Moving my speculative comment to an answer:

It is possible that while your busy loop is running, other tasks on the server push the cached arr data out of the L1 cache, so that the first memory accesses in hash have to reload it from a lower-level cache (or from main memory). Without the compute loop this would not happen. You could try moving the arr initialization to after the computation loop, just to see what the effect is.