By default, all gpu module functions are synchronous, that is, the current CPU thread is blocked until the operation finishes.
gpu::Stream is a wrapper around
cudaStream_t and allows asynchronous, non-blocking calls. See the "CUDA C Programming Guide" for detailed information about CUDA asynchronous concurrent execution.
Most gpu module functions have an additional
gpu::Stream parameter. If you pass a non-default stream, the function call is asynchronous: the call is only added to the stream's command queue and control returns to the CPU thread immediately.
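For example, here is a minimal sketch of the same operation called synchronously and asynchronously (assuming the 2.x gpu module API; example() and its src argument are hypothetical, and gpu::blur is used only as an example of a function with an optional stream parameter):

    #include <opencv2/core/core.hpp>
    #include <opencv2/gpu/gpu.hpp>

    using namespace cv;

    void example(const gpu::GpuMat& src)   // hypothetical input image already on the GPU
    {
        gpu::GpuMat dst;

        // Synchronous call: the default Stream::Null() is used and the
        // CPU thread blocks until the filter finishes on the GPU.
        gpu::blur(src, dst, Size(3, 3));

        // Asynchronous call: the work is only enqueued on the stream and the
        // function returns immediately; synchronize when the result is needed.
        gpu::Stream stream;
        gpu::blur(src, dst, Size(3, 3), Point(-1, -1), stream);
        stream.waitForCompletion();
    }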
gpu::Stream provides methods for asynchronous memory transfers between the CPU and the GPU. Such
CPU<->GPU asynchronous memory transfers work only with page-locked host memory; the
gpu::CudaMem class encapsulates such memory.
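As a rough illustration (assuming the CudaMem API of the 2.x gpu module), page-locked memory is allocated once and can then be viewed as a regular Mat without any copy:

    #include <opencv2/gpu/gpu.hpp>
    using namespace cv;

    // allocate page-locked (pinned) host memory
    gpu::CudaMem pinned(480, 640, CV_8UC1, gpu::CudaMem::ALLOC_PAGE_LOCKED);

    // view it as an ordinary Mat; the header shares the pinned buffer, no data copy
    Mat header = pinned.createMatHeader();

    // 'header' can now be filled by regular CPU code, and because the underlying
    // memory is page-locked, Stream::enqueueUpload()/enqueueDownload() on 'pinned'
    // can run truly asynchronously.

The sample at the end of this section uses the implicit conversion from CudaMem to Mat for the same purpose.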
Currently, you may face problems if the same operation is enqueued twice with different data to different streams: some functions use constant or texture GPU memory, and the next call may update that memory before the previous call has finished. Calling different operations asynchronously is safe because each operation has its own constant buffer. Memory copy/upload/download/set operations on the buffers you hold are also safe.
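A sketch of the hazardous pattern versus a safe one (2.x gpu API assumed; sketch(), a, and b are hypothetical inputs, and blur/threshold only stand in for whatever operations you actually use — whether a particular function is affected depends on its implementation):

    #include <opencv2/gpu/gpu.hpp>
    #include <opencv2/imgproc/imgproc.hpp>
    using namespace cv;

    void sketch(const gpu::GpuMat& a, const gpu::GpuMat& b)   // hypothetical inputs
    {
        gpu::GpuMat dstA, dstB;
        gpu::Stream s1, s2;

        // Potentially unsafe today: the SAME operation enqueued on two different
        // streams with different data. If the implementation stages its arguments
        // in constant or texture memory, the second call may overwrite them before
        // the first call has actually executed.
        gpu::blur(a, dstA, Size(5, 5), Point(-1, -1), s1);
        gpu::blur(b, dstB, Size(5, 5), Point(-1, -1), s2);   // avoid this pattern

        // Safe: DIFFERENT operations on different streams (each operation has its
        // own constant buffer), or both calls enqueued on the same stream.
        gpu::blur(a, dstA, Size(5, 5), Point(-1, -1), s1);
        gpu::threshold(b, dstB, 128, 255, THRESH_BINARY, s2);

        s1.waitForCompletion();
        s2.waitForCompletion();
    }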
Here is a small sample:
    // allocate page-locked memory
    CudaMem host_src_pl(768, 1024, CV_8UC1, CudaMem::ALLOC_PAGE_LOCKED);
    CudaMem host_dst_pl;

    // get Mat header for CudaMem (no data copy)
    Mat host_src = host_src_pl;

    // fill mat on CPU
    someCPUFunc(host_src);

    GpuMat gpu_src, gpu_dst;

    // create Stream object
    Stream stream;

    // next calls are non-blocking

    // first upload data from host
    stream.enqueueUpload(host_src_pl, gpu_src);

    // perform blur
    blur(gpu_src, gpu_dst, Size(5,5), Point(-1,-1), stream);

    // download result back to host
    stream.enqueueDownload(gpu_dst, host_dst_pl);

    // call another CPU function in parallel with GPU
    anotherCPUFunc();

    // wait GPU for finish
    stream.waitForCompletion();

    // now you can use GPU results
    Mat host_dst = host_dst_pl;
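If you do not want the CPU thread to block in waitForCompletion(), the stream can also be polled. A minimal sketch continuing the sample above (doSomeMoreCPUWork() is a hypothetical placeholder for useful CPU-side work):

    // instead of stream.waitForCompletion(), poll the stream and keep the CPU busy
    while (!stream.queryIfComplete())
    {
        doSomeMoreCPUWork();
    }

    // all operations enqueued on 'stream' have finished here
    Mat host_dst = host_dst_pl;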