danglingPointer danglingPointer - 25 days ago 16
C++ Question

OpenCL: Why does my program run significantly faster on my CPU device than on my GPU device?

I'm pretty new to OpenCL. I learned a little bit in college, and by a little bit I mean my graphics professor taught us about GPGPU and OpenCL for a single day (whereas the rest of the class was focused on shaders and OpenGL and so on).

I took an example program and changed it to work with the computations that I want it to run. However my program runs significantly faster on the CPU than my GPU, and I am trying to understand why.

My program takes in one input float array and has two output arrays. Under single threaded circumstances it has three arguments . The input array's size is: samplesPerTracetracesInsizeof(float), and the output array's size are: samplesPerTracetracesOutsizeof(float).

My test cases have been using the parameters 25000 2500 250, because that is on average the size of the arrays that I will be using (perhaps a little above average). The values are filled in randomly.

Here is the source code that OpenCL builds and runs on the kernel;

"__kernel void sumAllCL(__global const float prestackTraces[],\n"
" __global float stackTracesOut[],\n"
" __global float powerTracesOut[], const unsigned int nTracesOut, const unsigned int nTracesIn,\n"
" const unsigned int samplesPerTrace) {\n"
" unsigned int k = get_global_id(0);\n" // Thread ID
" unsigned int kTimesIn = k * nTracesIn;\n" // Store repeat ints
" unsigned int kTimesSamples = k * samplesPerTrace;\n"
" for (int j = 0; j < ? ; j++) {\n" // ? position to be replaced (nTracesOut)"
" int jTimesSamplesPT = j * samplesPerTrace;\n"
" for (int i = 0; i < # ; i++) {\n" // # position to be replaced ()
" int valueIndex = i + jTimesSamplesPT;\n"
" float value = prestackTraces[valueIndex];\n"
" stackTracesOut[i + kTimesSamples] += value;\n"
" powerTracesOut[i + kTimesSamples] += (value * value);\n"
" }\n"
" }\n"

Note that the ? and the # are replaced at run time with fixed numbers, I do this because I thought it would help the compiler unroll the rl

With the above parameters stated (25000 2500 250 ~10 <1 or 2>) it takes my CPU about 0.6 seconds to complete the program and my GPU about 40 seconds to complete. That's a bigger difference. Fyi, I have been messing around with the 4th parameters to see which value runs faster which is what is meant by the ~10.

My graphics card is a MSI Radeon R9 390X 8GB, given the name Hawaii. When I have OpenCL print out information about both of my devices this is what I get:

OpenCL Platform 0: AMD Accelerated Parallel Processing
----- OpenCL Device # 0: Hawaii-----
Gflops: 47.520000
Max Clock Frequency: 1080
Max Compute Units: 44
Max Work Group Size: 256
Total Memory of Device: 8.000G (CL_DEVICE_GLOBAL_MEM_SIZE)
Local Memory of Device: 32.000K (CL_DEVICE_LOCAL_MEM_SIZE)
Max Memory Object Allocation: 3.999G (CL_DEVICE_MAX_MEM_ALLOC_SIZE)
Device Vendor: Advanced Micro Devices, Inc.
Device Version: OpenCL 2.0 AMD-APP (2117.13)
Driver Version: 2117.13 (VM)
Device OpenCL Version: OpenCL C 2.0
----- OpenCL Device # 1: Intel(R) Core(TM) i7-6700K CPU ? 4.00GHz-----
Gflops: 32.064000
Max Clock Frequency: 4008
Max Compute Units: 8
Max Work Group Size: 1024
Total Memory of Device: 15.967G (CL_DEVICE_GLOBAL_MEM_SIZE)
Local Memory of Device: 32.000K (CL_DEVICE_LOCAL_MEM_SIZE)
Max Memory Object Allocation: 3.1028G (CL_DEVICE_MAX_MEM_ALLOC_SIZE)
Device Vendor: GenuineIntel
Device Version: OpenCL 1.2 AMD-APP (2117.13)
Driver Version: 2117.13 (sse2,avx)
Device OpenCL Version: OpenCL C 1.2

Here is the code relevant to OpenCL. I would have posted a entire minimal verifiable complete example however it puts me over the character limit.

* Prints the given int (numToInsert) at location inside chars.
void PrintIntInStr(char* chars, int location, int numToInsert) {

std::stringstream strs;
strs << numToInsert;
std::string temp_str = strs.str();
char const* numToChars = temp_str.c_str();

int numberLength = strlen(numToChars);

int w;
for (w = 0; w < numberLength; w++) {
chars[location + w] = numToChars[w];

* Initialize fastest OpenCL device.
int InitOpenCL(int verbose, cl_int deviceType) {

cl_uint Nplat;
cl_int err;
char name[1024];
int MaxGflops = -1;

cl_platform_id winnerPlatform = 0;

// Reset (TODO)
_deviceID = NULL;
_context = NULL;
_queue = NULL;

// Get platforms
cl_platform_id platforms[4];
if (clGetPlatformIDs(4, platforms, &Nplat)) Fatal("Cannot get number of OpenCL platforms\n");
else if (Nplat<1) Fatal("No OpenCL platforms found\n");

// Loop over platforms
for (unsigned int platform = 0; platform < Nplat; platform++) {

if (clGetPlatformInfo(platforms[platform], CL_PLATFORM_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL platform name\n");
if (verbose) printf("OpenCL Platform %d: %s\n", platform, name);

// Get GPU device IDs
cl_uint Ndev;
cl_device_id id[4];
if (clGetDeviceIDs(platforms[platform], deviceType, 4, id, &Ndev))
Fatal("Cannot get number of OpenCL devices: %d\n", platform);
else if (Ndev < 1) Fatal("No OpenCL devices found.\n");

// Find the fastest device
for (unsigned int devId = 0; devId < Ndev; devId++) {

// Print informatio about the device
cl_uint compUnits, freq, cacheLineSize;
cl_ulong memSize, maxAlloc, localMemSize, globalCacheSize;
size_t maxWorkGrps;
char deviceVendor[50];
char deviceVersion[50];
char driverVersion[50];
char deviceOpenCLVersion[50];

// Computing Power...
if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compUnits), &compUnits, NULL)) Fatal("Cannot get OpenCL device units\n");
if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq), &freq, NULL)) Fatal("Cannot get OpenCL device frequency\n");
if (clGetDeviceInfo(id[devId], CL_DEVICE_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL device name\n");
if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxWorkGrps), &maxWorkGrps, NULL)) Fatal("Cannot get OpenCL max work group size\n");
// Memory...
if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(memSize), &memSize, NULL)) Fatal("Cannot get OpenCL memory size.\n");
if (clGetDeviceInfo(id[devId], CL_DEVICE_LOCAL_MEM_SIZE, sizeof(localMemSize), &localMemSize, NULL)) localMemSize = 0;
if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(maxAlloc), &maxAlloc, NULL)) Fatal("Cannot get OpenCL memory size.\n");
if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, sizeof(globalCacheSize), &globalCacheSize, NULL)) globalCacheSize = 0;
if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE, sizeof(cacheLineSize), &cacheLineSize, NULL)) cacheLineSize = 0;
// Versions...
clGetDeviceInfo(id[devId], CL_DEVICE_VENDOR, sizeof(deviceVendor), deviceVendor, NULL);
clGetDeviceInfo(id[devId], CL_DEVICE_VERSION, sizeof(deviceVersion), deviceVersion, NULL);
clGetDeviceInfo(id[devId], CL_DRIVER_VERSION, sizeof(driverVersion), driverVersion, NULL);
clGetDeviceInfo(id[devId], CL_DEVICE_OPENCL_C_VERSION, sizeof(deviceOpenCLVersion), deviceOpenCLVersion, NULL);

int Gflops = compUnits * freq;

if (verbose) printf(" ----- OpenCL Device # %d: %s-----\n"
"Gflops: %f\n"
"Max Clock Frequency: %d\n"
"Max Compute Units: %d\n"
"Max Work Group Size: %zu\n"
" MEMORY...\n"
"Total Memory of Device: %s (CL_DEVICE_GLOBAL_MEM_SIZE)\n"
"Local Memory of Device: %s (CL_DEVICE_LOCAL_MEM_SIZE)\n"
"Max Memory Object Allocation: %s (CL_DEVICE_MAX_MEM_ALLOC_SIZE)\n"
" VERSIONS...\n"
"Device Vendor: %s\n"
"Device Version: %s\n"
"Driver Version: %s\n"
"Device OpenCL Version: %s\n",
(1e-3 * Gflops),
byteConverter((unsigned long)memSize),
byteConverter((unsigned long)localMemSize),
byteConverter((unsigned long)maxAlloc),
byteConverter((unsigned long)globalCacheSize),
byteConverter((unsigned long)cacheLineSize),

if(Gflops > MaxGflops)
_deviceID = id[devId];
MaxGflops = Gflops;

winnerPlatform = platforms[platform];

// Print fastest device info (TODO: don't get name twice)
if (clGetDeviceInfo(_deviceID, CL_DEVICE_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL device name\n");
printf("\n Selected Fastest Open CL Device: %s (#%lu)\n", name, (unsigned long)_deviceID);

// Check thread count
size_t mwgs;
if (clGetDeviceInfo(_deviceID, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(mwgs), &mwgs, NULL))
Fatal("Cannot get OpenCL max work group size\n");

// Create OpenCL context for fastest device
cl_context_properties cps[3] =
_context = clCreateContextFromType(cps, deviceType, NULL, NULL, &err);
if (!_context || err) Fatal("Cannot create OpenCL Context\n");

// Properties for create command queue; currently nothing
// cl_command_queue_properties *propers;
cl_command_queue_properties prop = 0;
// propers = &prop;

_queue = clCreateCommandQueueWithProperties(_context, _deviceID, &prop, &err); // Create OpenCL command queue for fastest device
// _queue = clCreateCommandQueue(_context, _deviceID, &prop, &err);
if (!_queue || err) {
if (err == CL_INVALID_CONTEXT) Fatal("Cannot create OpenCL command cue: CL_INVALID_CONTEXT\n");
else if (err == CL_INVALID_DEVICE) Fatal("Cannot create OpenCL command cue: CL_INVALID_DEVICE\n");
else if (err == CL_INVALID_VALUE) Fatal("Cannot create OpenCL command cue: CL_INVALID_VALUE\n");
else if (err == CL_INVALID_QUEUE_PROPERTIES) Fatal("Cannot create OpenCL command cue: CL_INVALID_QUEUE_PROPERTIES\n");
else if (err == CL_OUT_OF_RESOURCES) Fatal("Cannot create OpenCL command cue: CL_OUT_OF_RESOURCES\n");
else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create OpenCL command cue: CL_OUT_OF_HOST_MEMORY\n");
else if (!_queue) Fatal("Cannot create OpenCL command cue: !queue\n");
else Fatal("Cannot create OpenCL command cue: ?????\n");

if (_VERBOSE) printf("Init complete.\n");

return mwgs;

* Modify the source text to fit this run.
char* ModifySourceText(unsigned int nTracesIn, unsigned int samplesPerT) {

size_t sourceSize = strlen(M_AND_S_OPENCL_SOURCE_TEXT) + 1;
char* moveStackSourceCode = new char[sourceSize];
strncpy(moveStackSourceCode, M_AND_S_OPENCL_SOURCE_TEXT, sourceSize);
moveStackSourceCode[sourceSize] = '\0';

// Print out the locations of the characters where we should insert other text if asked to do so
size_t z;
for (z = 0; z < sourceSize; z++) {
if (moveStackSourceCode[z] == '@') {
printf("Found @ at position %zu\n", z);
for (z = 0; z < sourceSize; z++) {
if (moveStackSourceCode[z] == '#') {
printf("Found # at position %zu\n", z);

// Insert the digit that for loops go to inside of the source
PrintIntInStr(moveStackSourceCode, INSERT_LOCATION_1, nTracesIn);
PrintIntInStr(moveStackSourceCode, INSERT_LOCATION_2, samplesPerT);

// Print the modified source code if verbose
printf("\n GPU Source Code: \n");
printf("%s\n", moveStackSourceCode);

return moveStackSourceCode;

* Wait for event and then release it.
static void WaitForEventAndRelease(cl_event *event) {


cl_int status = CL_SUCCESS;

status = clWaitForEvents(1, event);
if (status) Fatal("clWaitForEvents Failed with Error Code");

printf("About to release event...\n");

status = clReleaseEvent(*event);
if (status) Fatal("clReleaseEvent Failed with Error Code");

// Runs the program via open CL
static double RunOpenCL(float prestackTracesArray[], float stackTracesOut1DArray[], float powerTracesOut1DArray[],
unsigned int nTracesOut, unsigned int nTracesIn, unsigned int samplesPerT,
size_t inXsamples, size_t outXsamples,
unsigned int localThreadCount)

cl_int err;

// Get the source code
char* modifiedGpuSource = ModifySourceText(nTracesIn, samplesPerT);

// Allocate device memory
// Input...
cl_mem prestackTracesCL = clCreateBuffer(_context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
inXsamples * sizeof(cl_float), prestackTracesArray, &err);
if (err) FatalBufferCreation("Prestack traces", err);
// Output... TODO: How do we know that the output is zeroed out?
cl_mem stackTracesOutCL = clCreateBuffer(_context, CL_MEM_WRITE_ONLY,
outXsamples * sizeof(cl_float), NULL, &err);
if (err) FatalBufferCreation("Stack traces", err);
cl_mem powerTracesOutCL = clCreateBuffer(_context, CL_MEM_WRITE_ONLY,
outXsamples * sizeof(cl_float), NULL, &err);
if (err) FatalBufferCreation("Power traces", err);

// Compile the source code
char* gpuSourceText[1];
gpuSourceText[0] = modifiedGpuSource;
size_t sourceLength[1];
sourceLength[0] = strlen(modifiedGpuSource);
cl_program moveoutAndStackCLProgram = clCreateProgramWithSource(_context, 1, (const char**)gpuSourceText,
(const size_t*)sourceLength, &err);
if (err != CL_SUCCESS) {
if (err == CL_INVALID_CONTEXT) Fatal("Cannot create program: CL_INVALID_CONTEXT\n");
else if (err == CL_INVALID_VALUE) Fatal("Cannot create program: CL_INVALID_VALUE\n");
else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create program: CL_OUT_OF_HOST_MEMORY\n");
else Fatal("Cannot create program_S %d\n", err);

// Build the program
cl_int buildCode = clBuildProgram(moveoutAndStackCLProgram, 0, NULL, NULL, NULL, NULL);
if (buildCode != CL_SUCCESS) {
// Attempt to get compile errors
char log[1048576];
if (clGetProgramBuildInfo(moveoutAndStackCLProgram, _deviceID, CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL)) {
log[0] = '\0'; // Failed to get the log file

if (buildCode == CL_INVALID_PROGRAM) Fatal("Cannot build program: CL_INVALID_PROGRAM\n%s", log);
else if (buildCode == CL_INVALID_VALUE) Fatal("Cannot build program: CL_INVALID_VALUE\n%s", log);
else if (buildCode == CL_INVALID_DEVICE) Fatal("Cannot build program: CL_INVALID_DEVICE\n%s", log);
else if (buildCode == CL_INVALID_BINARY) Fatal("Cannot build program: CL_INVALID_BINARY\n%s", log);
else if (buildCode == CL_INVALID_BUILD_OPTIONS) Fatal("Cannot build program: CL_INVALID_BUILD\n_OPTIONS\n%s", log);
else if (buildCode == CL_INVALID_OPERATION) Fatal("Cannot build program: CL_INVALID_OPERATION\n%s", log);
else if (buildCode == CL_COMPILER_NOT_AVAILABLE) Fatal("Cannot build program: CL_COMPILER_NOT_AVAILABLE\n%s", log);
else if (buildCode == CL_BUILD_PROGRAM_FAILURE) Fatal("Cannot build program: CL_BUILD_PROGRAM_FAILURE\n%s", log);
else if (buildCode == CL_INVALID_OPERATION) Fatal("Cannot build program: CL_INVALID_OPERATION\n%s", log);
else if (buildCode == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot build program: CL_OUT_OF_HOST_MEMORY\n%s", log);
else Fatal("Cannot build program: %d\n%s", buildCode, log);

// Compile the source code & build the kernel
cl_kernel kernel = clCreateKernel(moveoutAndStackCLProgram, "sumAllCL", &err);
if (err) {
if (err == CL_INVALID_PROGRAM) Fatal("Cannot create kernel: CL_INVALID_PROGRAM\n");
else if (err == CL_INVALID_PROGRAM_EXECUTABLE) Fatal("Cannot create kernel: CL_INVALID_PROGRAM_EXECUTABLE\n");
else if (err == CL_INVALID_KERNEL_NAME) Fatal("Cannot create kernel: CL_INVALID_KERNEL_NAME\n");
else if (err == CL_INVALID_KERNEL_DEFINITION) Fatal("Cannot create kernel: CL_INVALID_KERNEL_DEFINITION\n");
else if (err == CL_INVALID_VALUE) Fatal("Cannot create kernel: CL_INVALID_VALUE\n");
else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create kernel: CL_OUT_OF_HOST_MEMOR\n");
else Fatal("Cannot create kernel: %d\n", err);

// Set program parameters
cl_int returnValArgSet;
returnValArgSet = clSetKernelArg(kernel, 0, sizeof(cl_mem), &prestackTracesCL);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("prestackTracesCL", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 1, sizeof(cl_mem), &stackTracesOutCL);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("stackTracesOutCL", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 2, sizeof(cl_mem), &powerTracesOutCL);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("powerTracesOutCL", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 3, sizeof(unsigned int), &nTracesOut);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("nTracesOut", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 4, sizeof(unsigned int), &nTracesIn);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("nTracesIn", returnValArgSet);
returnValArgSet = clSetKernelArg(kernel, 5, sizeof(unsigned int), &samplesPerT);
if (returnValArgSet != CL_SUCCESS) FatalSetArgs("samplesPerT", returnValArgSet);

// TODO: verbose
printf("About to run Kernel...\n");

// Start timer TODO: move?
double runTime = GetTime();

// Run the kernel (& also set the number of threads)
cl_event runEvent;
size_t Global[1] = { nTracesOut };
size_t Local[1] = { localThreadCount };
if (localThreadCount > 0) err = clEnqueueNDRangeKernel(_queue, kernel, 1, NULL, Global, Local, 0, NULL, &runEvent);
else err = clEnqueueNDRangeKernel(_queue, kernel, 1, NULL, Global, NULL, 0, NULL, &runEvent);
if (err) {
Fatal("Cannot run Kernel: No successfully built program executable available.\n");
} else if (err == CL_INVALID_COMMAND_QUEUE) {
Fatal("Cannot run Kernel: Command_queue is not a valid command-queue.\n");
} else if (err == CL_INVALID_KERNEL) {
Fatal("Cannot run Kernel: Kernel is not a valid kernel object.\n");
} else if (err == CL_INVALID_CONTEXT) {
Fatal("Cannot run Kernel: Context associated with command_queue and kernel is not the same or if "
"the context associated with command_queue and events in event_wait_list are not the same.\n");
} else if (err == CL_INVALID_KERNEL_ARGS) {
Fatal("Cannot run Kernel: Kernel argument values have not been specified.\n");
} else if (err == CL_INVALID_WORK_DIMENSION) {
Fatal("Cannot run Kernel: work_dim is not a valid value (must be between 1 and 3).\n");
} else if (err == CL_INVALID_WORK_GROUP_SIZE) {
Fatal("Cannot run Kernel: local_work_size is specified and number of work-items specified by global_work_size "
"is not evenly divisable by size of work-group given by local_work_size or does not match the "
"work-group size specified for kernel using the __attribute__((reqd_work_group_size(X, Y, Z))) "
"qualifier in program source.\n");
} else if (err == CL_INVALID_WORK_ITEM_SIZE) {
Fatal("Cannot run Kernel: If the number of work-items specified in any of local_work_size[0], ... "
"local_work_size[work_dim - 1] is greater than the corresponding values specified "
} else if (err == CL_INVALID_GLOBAL_OFFSET) {
Fatal("Cannot run Kernel: Global_work_offset is not NULL.\n");
} else if (err == CL_OUT_OF_RESOURCES) {
Fatal("Cannot run Kernel: CL_OUT_OF_RESOURCES.\n");
Fatal("Cannot run Kernel: Failure to allocate memory for data store associated with image or buffer "
"objects specified as arguments to kernel.\n");
} else if (err == CL_INVALID_EVENT_WAIT_LIST) {
Fatal("Cannot run Kernel: event_wait_list is NULL and num_events_in_wait_list > 0, or event_wait_list "
"is not NULL and num_events_in_wait_list is 0, or if event objects in event_wait_list "
"are not valid events..\n");
} else if (err == CL_OUT_OF_HOST_MEMORY) {
Fatal("Cannot run Kernel: Failure to allocate resources required by the OpenCL implementation on the host.\n");
} else {
Fatal("Cannot run Kernel: Unknown Error. (clEnqueueNDRangeKernel)");

// Flush the program & wait for the program to finish executing
if (clFlush(_queue)) printf("Flush Fail (Run)");

// Copy the end result back to CPU memory side
if (clEnqueueReadBuffer(_queue, stackTracesOutCL, CL_TRUE, 0, outXsamples * sizeof(cl_float), stackTracesOut1DArray, 0, NULL, NULL))
Fatal("Cannot copy stackTracesOutCL from device to host\n");
if (clEnqueueReadBuffer(_queue, powerTracesOutCL, CL_TRUE, 0, outXsamples * sizeof(cl_float), powerTracesOut1DArray, 0, NULL, NULL))
Fatal("Cannot copy powerTracesOutCL from device to host\n");

// Release kernel and program
if (clReleaseKernel(kernel)) Fatal("Cannot release kernel\n");
if (clReleaseProgram(moveoutAndStackCLProgram)) Fatal("Cannot release program\n");

// Free device memory

// Release the context and queue

// Return the time it took to run this program
return runTime;

double RunProg(unsigned int samplesPerTrace, unsigned int nTracesIn, unsigned int nTracesOut,
unsigned int localThreadCount, unsigned int deviceType) {

// Stores sizes of the various arrays
size_t tracesInxSample = nTracesIn * samplesPerTrace;
size_t tracesOutxSample = nTracesOut * samplesPerTrace;

// Allocate arrays
float* prestackTraces1D = (float*)malloc(tracesInxSample * sizeof(float));
float* stackTracesOut1Dgpu = (float*)calloc(tracesOutxSample, sizeof(float)); // output; zero-out
float* powerTracesOut1Dgpu = (float*)calloc(tracesOutxSample, sizeof(float)); // output; zero-out

// Count how much memory all of this is
// Make sure it is consistent with above allocation
unsigned long allocatedMemory = 0;
allocatedMemory += tracesInxSample * sizeof(float);
allocatedMemory += tracesOutxSample * sizeof(float);
allocatedMemory += tracesOutxSample * sizeof(float);

printf("TOTAL MEMORY ALLOCATED: %s\n", byteConverter(allocatedMemory));
printf("Input Array Sizes: %s\n", byteConverter((unsigned int)(tracesInxSample * sizeof(float))));
printf("Output Array Sizes: %s\n", byteConverter((unsigned int)(tracesOutxSample * sizeof(float))));

// Fill in array with randoms
RandomFillArray(prestackTraces1D, (unsigned int)tracesInxSample);

// Init OpenCL using the desired device type
double preInitTime = GetTime();
int maxWorkGroupSize;
if (deviceType == 0) maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_ALL);
else if (deviceType == 1) maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_GPU);
else maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_CPU);
printf("Max work size for the device is: %d\n", maxWorkGroupSize);

// --- ACTUAL TEST ---
// Run OpenCL
double startTime = GetTime();
double runTime = RunOpenCL(prestackTraces1D, stackTracesOut1Dgpu, powerTracesOut1Dgpu, // arrays
nTracesOut, nTracesIn, samplesPerTrace, // ints
tracesInxSample, tracesOutxSample,
localThreadCount); // samples

// Display run time
double endTime = GetTime();
printf("Elapsed Time: %fsecs\n", (endTime - runTime));
printf(" %fsecs (Before Function Call)\n", (endTime - startTime));
printf(" %fsecs (Including Init)\n\n", (endTime - preInitTime));

// Free the 1D arrays

return (endTime - startTime);

My first thought as to why it's running so much slower on my GPU than my CPU is that maybe it's because I am busing so much data over the graphics card before anything runs. Perhaps a better implementation would involve splitting the workload in multiple runs, so that code can be executing while more data is being bused over (I presume that's a thing). However now that I think about it this is almost certainly false, because as I said I wrote this program based on an example, and that example did matrix multiplication, and that example runs much much faster on the GPU than my CPU. I don't really know what the difference is.


The problem was caching; I was read and writing from array a lot. So I made a version that writes to local variables as much as possible before writing to an array and now it runs much faster on the GPU.