Dusted - 1 month ago 12x

C++ Question

When parallelising an integrator using OpenCL - is it bad practice to have the whole loop in the kernel?

I'm attempting to move an RK4 integrator I've written in C++ into OpenCL so I can run the operations on a GPU - currently it uses OpenMP.

I need to run 10 million+ independent integration runs, with about 700 loop iterations for each run. I currently have the loop written into the kernel with a stop condition, but it's not performing as well as I'd have expected.

Current CL Kernel snippet:

```
while (inPos.z > -1.0f){
    cnt++;

    //Eval 1
    //Euler Velocity
    vel1 = inVel + (inAcc * 0.0f);
    //Euler Position
    pos1 = inPos + (vel1 * 0.0f) + ((inAcc * 0.0f)*0.5f);
    //Drag and accels
    combVel = sqrt(pow(vel1.x, 2)+pow(vel1.y, 2)+pow(vel1.z, 2));
    //motionUtils::drag(netForce, combVel, mortSigma, outPos.z);
    dragForce = mortSigma*1.225f*pow(combVel, 2);
    //Normalise vector
    normVel = vel1 / combVel;
    //Drag Components
    drag = (normVel * dragForce)*-1.0f;
    //Add Gravity force
    drag.z+=((mortMass*9.801f)*-1.0f);
    //Acceleration components
    acc1 = drag/mortMass;

    ...

    //Taylor Expansion
    tayVel = (vel1+((vel2+vel3)*2.0f)+vel4) * (1.0f/6.0f);
    inAcc = (acc1+((acc2+acc3)*2.0f)+acc4) * (1.0f/6.0f);
    tayPos = (pos1+((pos2+pos3)*2.0f)+pos4) * (1.0f/6.0f);
    //Swap ready for next iteration
    inPos = inPos + (tayVel * timeStep);
    inVel = inVel + (inAcc * timeStep);
```

Any thoughts / suggestions, much appreciated.

Answer

Try faster (but less precise) versions of the slow functions:

```
sqrt(pow(vel1.x, 2)+pow(vel1.y, 2)+pow(vel1.z, 2))
```

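to the following, which returns the reciprocal of the speed (so `combVel` now holds 1/|vel1| and the division below becomes a multiplication):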

```
native_rsqrt(vel1.x*vel1.x+vel1.y*vel1.y+vel1.z*vel1.z)
```

```
normVel = vel1 / combVel;
```

to

```
normVel = vel1 * combVel;
```

```
dragForce = mortSigma*1.225f*pow(combVel, 2);
```

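to the following (the squared speed is taken straight from the components, because after the `native_rsqrt` change `combVel` holds the reciprocal of the speed and `combVel*combVel` would give 1/|vel1|² instead):

```
dragForce = mortSigma*1.225f*(vel1.x*vel1.x+vel1.y*vel1.y+vel1.z*vel1.z);
```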

```
drag = (normVel * dragForce)*-1.0f;
//Add Gravity force
drag.z+=((mortMass*9.801f)*-1.0f);
```

to

```
drag = -normVel * dragForce;
//Add Gravity force
drag.z-=mortMass*9.801f;
```
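One way to check that rewrites like these still produce the same numbers is a small CPU-side reference. The sketch below is plain C++ rather than OpenCL C, and it is only an illustration: `Vec3`, `dragAccelOriginal` and `dragAccelRewritten` are hypothetical names, `1.0f / std::sqrt(x)` stands in for `native_rsqrt`, and `mortSigma` and `mortMass` are assumed to be scalar floats, as the snippet suggests. The rewritten form computes the squared speed once and reuses it for both the drag force and the reciprocal square root.

```cpp
#include <cmath>

// Hypothetical 3-component vector standing in for the kernel's vector type.
struct Vec3 { float x, y, z; };

// Drag plus gravity acceleration, written as in the original kernel:
// sqrt and pow for the speed, then a division to normalise.
Vec3 dragAccelOriginal(Vec3 v, float mortSigma, float mortMass) {
    float combVel = std::sqrt(std::pow(v.x, 2.0f) + std::pow(v.y, 2.0f) + std::pow(v.z, 2.0f));
    float dragForce = mortSigma * 1.225f * std::pow(combVel, 2.0f);
    Vec3 drag = { -(v.x / combVel) * dragForce,
                  -(v.y / combVel) * dragForce,
                  -(v.z / combVel) * dragForce };
    drag.z -= mortMass * 9.801f;  // add gravity force
    return { drag.x / mortMass, drag.y / mortMass, drag.z / mortMass };
}

// Same quantity with one reciprocal square root and no pow calls.
Vec3 dragAccelRewritten(Vec3 v, float mortSigma, float mortMass) {
    float speedSq = v.x * v.x + v.y * v.y + v.z * v.z;
    float invSpeed = 1.0f / std::sqrt(speedSq);  // stand-in for native_rsqrt(speedSq)
    float dragForce = mortSigma * 1.225f * speedSq;
    Vec3 drag = { -(v.x * invSpeed) * dragForce,
                  -(v.y * invSpeed) * dragForce,
                  -(v.z * invSpeed) * dragForce };
    drag.z -= mortMass * 9.801f;  // add gravity force
    return { drag.x / mortMass, drag.y / mortMass, drag.z / mortMass };
}
```

Comparing the two functions over a range of velocities gives a tolerance to hold the GPU results against. Both forms divide by the speed in one way or another, so a zero velocity still needs separate handling.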

```
tayVel = (vel1+((vel2+vel3)*2.0f)+vel4) * (1.0f/6.0f);
inAcc = (acc1+((acc2+acc3)*2.0f)+acc4) * (1.0f/6.0f);
tayPos = (pos1+((pos2+pos3)*2.0f)+pos4) * (1.0f/6.0f);
```

to

```
tayVel = (vel1+((vel2+vel3)*2.0f)+vel4) * (0.16666667f);
inAcc = (acc1+((acc2+acc3)*2.0f)+acc4) * (0.16666667f);
tayPos = (pos1+((pos2+pos3)*2.0f)+pos4) * (0.16666667f);
```
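A compiler will normally fold `1.0f/6.0f` into a constant on its own, so what a hand-written literal mainly changes is precision. The following C++ sketch (constant and function names are hypothetical) compares a six-digit literal and a nine-digit literal against the folded value.

```cpp
// Hypothetical names; the comparison itself is ordinary single-precision arithmetic.
constexpr float kSixthFolded = 1.0f / 6.0f;   // evaluated at compile time
constexpr float kSixthShort  = 0.166666f;     // six significant digits
constexpr float kSixthFull   = 0.16666667f;   // enough digits to round to the same float

// Relative error of a candidate constant against 1.0f / 6.0f.
constexpr float relativeError(float candidate) {
    return (candidate > kSixthFolded ? candidate - kSixthFolded
                                     : kSixthFolded - candidate) / kSixthFolded;
}
```

With 700 steps per run, a relative error of a few parts per million in the weighting factor may or may not matter; it is worth checking against the tolerance the application needs.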

If the kernel uses too many variables, try decreasing the local work-group size from 256 to 128 or 64. If those variables are not used outside the loop, declare them inside the loop so that more threads can be issued at the same time.
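Declaring temporaries inside the loop body can be sketched as follows. This is a plain C++ illustration of the loop structure, not the author's kernel: `State`, `simulateRun` and the `maxSteps` guard are hypothetical, a single Euler update stands in for the RK4 stages (which are elided in the question), and drag is left out to keep the sketch short. Only the state carried between iterations lives outside the loop.

```cpp
// Hypothetical per-run state: vertical position and velocity only.
struct State { float z; float vz; };

// Integrates one independent run until z is no longer above -1 (the question's
// stop condition) or maxSteps is reached, and returns the number of steps taken.
int simulateRun(State s, float timeStep, int maxSteps) {
    int cnt = 0;
    while (s.z > -1.0f && cnt < maxSteps) {
        // Temporaries are declared here, so their lifetime ends with each iteration.
        const float acc   = -9.801f;                 // gravity only; drag omitted in this sketch
        const float newVz = s.vz + acc * timeStep;   // Euler velocity update
        const float newZ  = s.z + s.vz * timeStep;   // Euler position update
        s.vz = newVz;
        s.z  = newZ;
        ++cnt;
    }
    return cnt;
}
```

Whether narrower scopes actually lower register use depends on the OpenCL compiler, so the effect is best confirmed by profiling the kernel before and after the change.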

Source (Stackoverflow)
