Dan - 1 month ago
C++ Question

OpenMP multithreading suggestions

I'm writing a 2D Newtonian gravity simulation in OpenGL with a lot of particles that follow the mouse. The velocities are updated in a single loop that iterates over all particles.

It works fine, but the performance isn't that good: I get only 60 fps on average with 2 million particles (I have an i7 6700K and a GTX 970).
So I thought multi-threading was the best way to improve that.
To do this, I used OpenMP 2.0 (I'm on Visual Studio).
The update loop then becomes:

#pragma omp parallel for
for (int i = 0; i < count; i++)
{
    float vertX = WIDTH / 2 * (vertices[i * 2] + 1);
    float vertY = -HEIGHT / 2 * (vertices[i * 2 + 1] + 1) + HEIGHT;

    float fact = (mouseX - vertX) * (mouseX - vertX) + (mouseY - vertY) * (mouseY - vertY) + 120;
    glm::vec2 acc = 3.f / fact * (glm::vec2(mouseX, mouseY) - glm::vec2(vertX, vertY)) * (float)bPressed;
    acc.y *= -1;

    speed[i * 2] += acc.x - speed[i * 2] / 200;
    speed[i * 2 + 1] += acc.y - speed[i * 2 + 1] / 200;

    vertices[i * 2] += speed[i * 2] * dt;
    vertices[i * 2 + 1] += speed[i * 2 + 1] * dt;
}


The performance increased a lot (now I get 130 fps), but not as much as expected. With 8 threads (4 cores with Intel Hyper-Threading), I would expect it to be 8 times better than before, but it is just 3 times better.
Am I doing something wrong with OpenMP, or can't I get better performance at all?

Answer

Your code looks good and there is nothing obvious to improve, but your expectations are too high.

  1. For many codes, Hyper-Threading does not provide a benefit. If this code is compute-bound, the expected performance gain is 4x, not 8x. Hyper-Threading only helps if you are latency-bound (i.e. your processor is waiting for memory, but the memory bandwidth is not saturated). Even then, you often get only slightly above a 4x speedup.
  2. Your speedup may be limited by the non-parallelized portion of your overall code. You have a whole bunch of other code outside of the parallel loop that also influences FPS. This is explained by Amdahl's law.
  3. Your processor uses Turbo frequencies: it runs at a higher clock speed when only one core is active, so the single-threaded baseline is faster per core than the multi-threaded run.
  4. You may be partially limited by memory or shared caches, although your speedup suggests the code is not entirely memory-bandwidth bound.

Any additional optimizations would heavily depend on count and the rest of the code. If you want specific suggestions you would have to provide the code as a distilled [mcve].
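
As a starting point for such a distilled example, the update loop can be timed on its own, without OpenGL or glm, so that its scaling is measured separately from the rest of the frame. The sketch below is an assumption about how that might look, not the asker's actual program: `update_particles`, `time_update_ms` and all parameter values are hypothetical, and the glm vector math is written out as plain floats. The OpenMP pragma takes effect only when OpenMP is enabled in the compiler (`/openmp` on Visual Studio, `-fopenmp` on GCC/Clang); otherwise the loop simply runs serially.

```cpp
#include <chrono>
#include <vector>

// Stand-in for the question's update loop. vertices and speed hold
// interleaved x/y pairs, as in the original code.
void update_particles(std::vector<float>& vertices, std::vector<float>& speed,
                      float mouseX, float mouseY, float width, float height,
                      bool pressed, float dt) {
    const int count = static_cast<int>(vertices.size() / 2);
    #pragma omp parallel for
    for (int i = 0; i < count; i++) {
        // Map normalized device coordinates to window coordinates.
        float vertX = width / 2 * (vertices[i * 2] + 1);
        float vertY = -height / 2 * (vertices[i * 2 + 1] + 1) + height;

        float dx = mouseX - vertX;
        float dy = mouseY - vertY;
        float fact = dx * dx + dy * dy + 120;
        float k = pressed ? 3.f / fact : 0.f;  // no attraction unless pressed
        float accX = k * dx;
        float accY = -k * dy;                  // window y axis points down

        speed[i * 2] += accX - speed[i * 2] / 200;
        speed[i * 2 + 1] += accY - speed[i * 2 + 1] / 200;

        vertices[i * 2] += speed[i * 2] * dt;
        vertices[i * 2 + 1] += speed[i * 2 + 1] * dt;
    }
}

// Average wall-clock time of one update in milliseconds.
double time_update_ms(int count, int repeats) {
    std::vector<float> vertices(2 * count, 0.f), speed(2 * count, 0.f);
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; r++)
        update_particles(vertices, speed, 400.f, 300.f, 800.f, 600.f, true, 0.016f);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count() / repeats;
}
```

Running `time_update_ms(2000000, 100)` with different values of the `OMP_NUM_THREADS` environment variable (1, 2, 4, 8) would show how the loop alone scales. Comparing that time with the full frame time indicates how much of each frame is spent outside the parallel region.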