I'm running a data-mining algorithm in VS 2013.
I've implemented a CPU-based version (a .cpp file) and a GPU-based version (a CUDA 7.5 .cu file).
Both versions run as expected: the CPU-based version takes about 1500 seconds and the GPU version about 500 seconds.
I then combined both files into a single .cu file, controlling which version runs with a flag, and found that the CPU version becomes faster in the .cu file. With all other parameters and code unchanged, it takes only about 600 seconds.
I then ran the same piece of C++ code (without any CUDA) in an empty C++ project and in a CUDA project separately, and the results were consistent: the .cu version takes 600 seconds while the .cpp one takes 1500 seconds.
Why does this happen? Does it come from a different compiler, or from different initial settings of the VS projects?
Host code that nvcc passes to the host compiler is usually not a verbatim copy of the host portion of the .cu file as written by the programmer. Instead, nvcc parses and pre-processes the code and sends semantically identical code to the host compiler (a look at the intermediate files generated as part of the nvcc compilation trajectory will reveal the details). Due to artifacts in the host compiler's code generation, this can result in host code that runs faster or slower when incorporated into a .cu file compared to the stand-alone version in a .cpp file.
Usually, the resulting performance differences are quite small, up to about 10% in my experience. So the very significant performance difference reported here is either an extreme outlier of the scenario outlined above, or (more likely, in my opinion) there are other differences in the compilation.
For example, different compiler options, e.g. different optimization levels, could have been passed to the host compiler as part of the CUDA compilation versus the stand-alone compilation. If you enable a verbose log of the compilation process in MSVS that shows the details of the host compiler invocation, it should become apparent whether that is the case.
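Outside the IDE, nvcc itself can show what it forwards to the host compiler. A sketch of the relevant command lines (file names are placeholders, and the flags assume the MSVC toolchain):

```shell
# --verbose prints every tool invocation, and --keep retains the
# intermediate files, showing exactly what nvcc hands to cl.exe.
nvcc --verbose --keep -c miner.cu

# To rule out option mismatches, force identical host-compiler settings
# in both builds: everything after -Xcompiler is passed to cl.exe verbatim.
nvcc -O2 -Xcompiler "/O2 /fp:precise" -c miner.cu
cl /O2 /fp:precise /c miner_cpu.cpp
```

If the two timings converge once the host-compiler flags match, the discrepancy was a build-configuration difference rather than a genuine code-generation artifact.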