I am implementing a conjugate gradient (CG) solver using the cuSPARSE_v2/cuBLAS_v2 libraries to handle a large sparse matrix in my research. The odd thing I observed is that cublasCreate() takes a very long time, about 10 seconds. I am aware that library initialization is usually expensive, but from searching forums, the usual cost of cublasCreate() is on the order of 100 ms, not 10 s. By comparison, the whole CG iteration part takes only 0.6 to 1 second. I also implemented CG solvers using the CUSP library, which performed quite well, with a total run time of about 0.5 seconds.
So how can I reduce the time taken by cublasCreate()? Also, if a cost as large as 10 s is unavoidable for CUDA library initialization, why does the CUSP library perform so much better, with a nearly negligible initialization cost?
I am using CUDA 7.5 on a GTX 980 Ti. Here is my code snippet with timing:
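// Timing begin (gettimeofday() is declared in <sys/time.h>)
struct timeval begin, end;
gettimeofday(&begin, NULL);
cublasStatus = cublasCreate(&cublasHandle);
// Timing end
gettimeofday(&end, NULL);
// Elapsed wall-clock time in milliseconds
float cgtime = (end.tv_sec - begin.tv_sec) * 1000.0 + (end.tv_usec - begin.tv_usec) / 1000.0;
printf("\nTime elapsed: %f ms.\n", cgtime);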
I finally found the cause: our main server node was malfunctioning and could not communicate properly with the GPU nodes, which somehow hindered the dynamic linking of the cuBLAS library. A reboot fixed everything.
So there is no problem with cublasCreate() itself. I am posting this as an answer in case anyone encounters a similar situation (however unlikely).