Dusted - 2 months ago
C++ Question

OpenCL RK4 Integration on GPU

When parallelising an integrator using OpenCL - is it bad practice to have the whole loop in the kernel?

I'm attempting to move an RK4 integrator I've written in C++ into OpenCL so I can run the operations on a GPU - currently it uses OpenMP.

I need to run 10 million+ independent integration runs, with about 700 loop iterations for each run. I currently have the loop written into the kernel with a stop condition, but it's not performing as well as I expected.

Current CL Kernel snippet:

`
while (inPos.z > -1.0f){
    cnt++;
    //Eval 1

    //Euler Velocity
    vel1 = inVel + (inAcc * 0.0f);
    //Euler Position
    pos1 = inPos + (vel1 * 0.0f) + ((inAcc * 0.0f)*0.5f);

    //Drag and accels
    combVel = sqrt(pow(vel1.x, 2)+pow(vel1.y, 2)+pow(vel1.z, 2));
    //motionUtils::drag(netForce, combVel, mortSigma, outPos.z);
    dragForce = mortSigma*1.225f*pow(combVel, 2);
    //Normalise vector
    normVel = vel1 / combVel;
    //Drag Components
    drag = (normVel * dragForce)*-1.0f;
    //Add Gravity force
    drag.z+=((mortMass*9.801f)*-1.0f);
    //Acceleration components
    acc1 = drag/mortMass;

    ...

    //Taylor Expansion
    tayVel = (vel1+((vel2+vel3)*2.0f)+vel4) * (1.0f/6.0f);
    inAcc = (acc1+((acc2+acc3)*2.0f)+acc4) * (1.0f/6.0f);
    tayPos = (pos1+((pos2+pos3)*2.0f)+pos4) * (1.0f/6.0f);

    //Swap ready for next iteration
    inPos = inPos + (tayVel * timeStep);
    inVel = inVel + (inAcc * timeStep);
}


`
Any thoughts / suggestions, much appreciated.

Answer

Try faster (and less precise) versions of the slow functions:

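combVel = sqrt(pow(vel1.x, 2)+pow(vel1.y, 2)+pow(vel1.z, 2));

to

speedSq = vel1.x*vel1.x+vel1.y*vel1.y+vel1.z*vel1.z;
combVel = native_rsqrt(speedSq);

Note that combVel now holds the inverse of the speed (1/|v|), so the division in the normalisation becomes a multiplication:

 normVel = vel1 / combVel;

to

 normVel = vel1 * combVel;

For the same reason the drag force should use the squared speed directly (speedSq is a new float introduced above) rather than combVel*combVel, which would now be 1/|v|^2. No pow() is needed:

 dragForce = mortSigma*1.225f*pow(combVel, 2);

to

 dragForce = mortSigma*1.225f*speedSq;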
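Because native_rsqrt returns the reciprocal of the speed, the normalisation and the drag force have to agree on what each variable holds. The following is a minimal host-side C++ sketch (not kernel code) that checks a reworked drag evaluation against the original formulation. Vec3, speedSq, invSpeed and both function names are assumptions made for illustration, and 1.0f / std::sqrt stands in for native_rsqrt.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the kernel's vector type.
struct Vec3 { float x, y, z; };

// Original formulation: sqrt, pow and a per-component division.
Vec3 dragAccelOriginal(Vec3 v, float mortSigma, float mortMass) {
    float combVel = std::sqrt(std::pow(v.x, 2.0f) + std::pow(v.y, 2.0f) + std::pow(v.z, 2.0f));
    float dragForce = mortSigma * 1.225f * std::pow(combVel, 2.0f);
    Vec3 drag = { -(v.x / combVel) * dragForce,
                  -(v.y / combVel) * dragForce,
                  -(v.z / combVel) * dragForce };
    drag.z -= mortMass * 9.801f;  // gravity force
    return { drag.x / mortMass, drag.y / mortMass, drag.z / mortMass };
}

// Reworked formulation: one squared speed and one reciprocal square root.
Vec3 dragAccelFast(Vec3 v, float mortSigma, float mortMass) {
    float speedSq = v.x * v.x + v.y * v.y + v.z * v.z;
    assert(speedSq > 0.0f);                      // zero speed would divide by zero
    float invSpeed = 1.0f / std::sqrt(speedSq);  // native_rsqrt(speedSq) in the kernel
    float dragForce = mortSigma * 1.225f * speedSq;  // uses |v|^2, not invSpeed^2
    float scale = -dragForce * invSpeed;         // -dragForce / |v|
    Vec3 drag = { v.x * scale, v.y * scale, v.z * scale };
    drag.z -= mortMass * 9.801f;                 // gravity force
    float invMass = 1.0f / mortMass;
    return { drag.x * invMass, drag.y * invMass, drag.z * invMass };
}
```

On the GPU the native_ variants trade accuracy for speed, so it is worth comparing a few trajectories against the OpenMP version before trusting the results.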

    drag = (normVel * dragForce)*-1.0f;
    //Add Gravity force
    drag.z+=((mortMass*9.801f)*-1.0f);

to

    drag = -normVel * dragForce;
    //Add Gravity force
    drag.z-=mortMass*9.801f;

    tayVel = (vel1+((vel2+vel3)*2.0f)+vel4) * (1.0f/6.0f);
    inAcc = (acc1+((acc2+acc3)*2.0f)+acc4) * (1.0f/6.0f);
    tayPos = (pos1+((pos2+pos3)*2.0f)+pos4) * (1.0f/6.0f);

to

    tayVel = (vel1+((vel2+vel3)*2.0f)+vel4) * (0.16666667f);
    inAcc = (acc1+((acc2+acc3)*2.0f)+acc4) * (0.16666667f);
    tayPos = (pos1+((pos2+pos3)*2.0f)+pos4) * (0.16666667f);
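A hard-coded literal is only equivalent to 1.0f/6.0f if it carries enough digits. A truncated value such as 0.166666f is a different float, and the error accumulates over roughly 700 steps per run. Compilers typically fold 1.0f/6.0f into a constant anyway, so the gain from hard-coding is likely small. The helper below is a hypothetical host-side C++ check, not part of the kernel, that measures how far a literal is from the folded constant.

```cpp
#include <cmath>

// Relative error of a hand-written literal against the constant 1.0f/6.0f.
// relErrorVsSixth is an illustrative name, not an existing API.
float relErrorVsSixth(float literal) {
    const float sixth = 1.0f / 6.0f;  // typically evaluated at compile time
    return std::fabs(literal - sixth) / sixth;
}
```

With this check, 0.166666f comes out a few parts per million away from 1.0f/6.0f, while 0.16666667f rounds to the same single-precision value.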

If the kernel uses too many private variables, try decreasing the local workgroup size from 256 to 128 or 64. If the variables are not used outside the loop, declare them inside the loop so that more threads can be issued at the same time.
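When experimenting with different local workgroup sizes, keep in mind that OpenCL 1.x requires the global size passed to clEnqueueNDRangeKernel to be a multiple of the local size. A small host-side helper can pad the number of runs accordingly. This is a sketch with an assumed name (roundUpToMultiple), not code from the question.

```cpp
#include <cassert>
#include <cstddef>

// Smallest multiple of localSize that is greater than or equal to workItems.
// roundUpToMultiple is a hypothetical helper name used for illustration.
std::size_t roundUpToMultiple(std::size_t workItems, std::size_t localSize) {
    assert(localSize > 0);
    return ((workItems + localSize - 1) / localSize) * localSize;
}
```

For example, 10,000,000 runs with a local size of 256 would be padded to 10,000,128 work-items, while local sizes of 128 or 64 divide 10,000,000 evenly. If padding is used, the kernel needs an early exit for work-items whose global ID is beyond the real run count, and that count would have to be passed in as an extra kernel argument.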