My CPU is Corei7 ( 4 physical cores/ 8 logical). I am testing my implementation of free-locking queue. What is the test? I just create a lot of threads ( "pusher" and "popper") and they push/pop elements. I noticed that it works much faster when... CPU is loaded. So, when the CPU isn't relatively loaded it works slower ( relatively much). And, when it is loaded it works faster.
Bouncing cache lines between cores is expensive. It sounds reasonable that one core could push/pop more than 4x faster than 4 cores contending with each other while they try to modify the same memory.
So it sounds like the problem is in deciding what changes in the total wall-clock time or total CPU time of all the threads tell you about whether your code is good or not.
To put it another way: You're testing the maximum-contention case, where your threads are spending all their time pushing and popping, and not doing any other work. In real code that used this queue, the other work done by a thread would throttle the access rate to your queue, probably a lot, so threads would step on each other's toes a lot less. (Contention probably causes a significant performance hit with cmpxchg loops, because only one CPU will succeed each time, and all the rest will retry every time.)
Related: Locks Aren't Slow; Lock Contention Is (Jeff Preshing) makes the same points for testing parallel algorithms that use locks in high vs. low contention cases.
If anything, it might be interesting to add a
pause instruction (
_mm_pause()) before a load somewhere in your benchmark, to avoid memory-ordering mis-speculation. (i.e. where the CPU speculatively uses a value loaded before the load is allowed to become globally visible, and then has to roll back when it turns out the value was changed by another core by the time the load was supposed to become globally visible). Keep in mind that
pause sleeps for ~5 cycles on Haswell, but ~100 cycles on Skylake, so even if you see a speedup from it in a non-synthetic benchmark on Haswell, it's probably a bad idea to leave it in for real use on future CPUs.
pause isn't useful before
locked read-modify-write instructions; they're already expecting writes from other cores.