camelCase - 24 days ago
C++ Question

Vulkan: vkCmdPipelineBarrier for data coherence

My question has 2 parts:

  1. What is the difference between memory being available / being visible?

  2. I'm learning Vulkan from this tutorial and am currently looking for a different approach to upload uniform data (simple model/view/projection matrices) to device-local memory. The matrices are used in the vertex shader.

    In the tutorial the matrices are updated and copied to a staging buffer, and are afterwards copied to the final device-local buffer by creating a command buffer, recording the copy, submitting it and destroying the buffer. I am trying to do that last step within the command buffers that are needed for drawing anyway.

    While the tutorial's approach gives a smooth animation, my experiment does not. I tried inserting 2 buffer memory barriers to ensure that the copies are done (which seems to be the problem), but that didn't help. The resources are properly created and bound; that part works fine.

    //update uniform buffer and copy it to the staging buffer
    //(called every frame)
    Tools::UniformBufferObject ubo;
    //set the matrices
    void* data;
    data = device.mapMemory( uniformStagingMemory, 0, sizeof( ubo ), (vk::MemoryMapFlagBits) 0 );
    memcpy( data, &ubo, sizeof( ubo ));
    device.unmapMemory( uniformStagingMemory );

    //once: create a command buffer for each framebuffer of the swapchain
    //queueFamily struct members set properly
    //1st barrier: make transfer from host memory to staging buffer available / visible
    vk::BufferMemoryBarrier bufMemBarrierStaging;
    bufMemBarrierStaging.srcAccessMask = vk::AccessFlagBits::eHostWrite;
    bufMemBarrierStaging.dstAccessMask = vk::AccessFlagBits::eTransferRead;
    bufMemBarrierStaging.buffer = uniformStagingBuffer;
    bufMemBarrierStaging.offset = 0;
    bufMemBarrierStaging.size = sizeof( Tools::UniformBufferObject );

    //2nd barrier: make transfer from staging buffer to device local buffer available / visible
    vk::BufferMemoryBarrier bufMemBarrier;
    bufMemBarrier.srcAccessMask = vk::AccessFlagBits::eTransferWrite;
    bufMemBarrier.dstAccessMask = vk::AccessFlagBits::eUniformRead | vk::AccessFlagBits::eShaderRead;
    bufMemBarrier.buffer = dataBuffer;
    bufMemBarrier.offset = dataBufferOffsets[2];
    bufMemBarrier.size = sizeof( Tools::UniformBufferObject );

    for( size_t i = 0; i < cmdBuffers.size(); i++ ) {
        //begin command buffer

        cmdBuffers[i].pipelineBarrier(
            vk::PipelineStageFlagBits::eHost, //srcPipelineStage
            vk::PipelineStageFlagBits::eTransfer, //dstPipelineStage
            (vk::DependencyFlagBits) 0,
            nullptr, //memBarrier
            bufMemBarrierStaging, //bufferMemBarrier
            nullptr //imgBarrier
        );
        vk::BufferCopy copyRegion; //filled appropriately
        cmdBuffers[i].copyBuffer( uniformStagingBuffer, dataBuffer, copyRegion );

        cmdBuffers[i].pipelineBarrier(
            vk::PipelineStageFlagBits::eTransfer, //srcPipelineStage
            vk::PipelineStageFlagBits::eVertexShader, //dstPipelineStage
            (vk::DependencyFlagBits) 0,
            nullptr, //memBarrier
            bufMemBarrier, //bufferMemBarrier
            nullptr //imgBarrier
        );
        //renderpass stuff and drawing etc.
    }


    namespace Tools {
        struct UniformBufferObject {
            glm::mat4 model;
            glm::mat4 view;
            glm::mat4 proj;
        };
    }
    vk::Buffer uniformStagingBuffer;
    vk::DeviceMemory uniformStagingMemory;
    //dataBuffer also contains the vertex and index data, is device local
    vk::Buffer dataBuffer;
    vk::DeviceMemory dataBufferMemory;
    std::vector<vk::DeviceSize> dataBufferOffsets;

    std::vector<vk::CommandBuffer> cmdBuffers;

    I'm using

    Is missing data coherence the reason for the choppy animation, and did I make a mistake in trying to achieve it?

Thanks in advance!

Edit: The problem in part 2 was missing synchronisation: the staging buffer was (partially) overwritten before the previous frame's rendering had read it. (Thanks for making clear the difference between memory being available / visible.)


If the staging-buffer memory is not host-coherent, you additionally need to call vkFlushMappedMemoryRanges after the memcpy (the memory can remain mapped). If you don't, there is no guarantee that the data is actually visible to the GPU.
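
One detail that is easy to miss when flushing: the offset of a flushed range must be a multiple of VkPhysicalDeviceLimits::nonCoherentAtomSize, and its size must either be a multiple of it too or reach the end of the allocation. A minimal sketch of the arithmetic, assuming you track the allocation size yourself (FlushRange and alignedFlushRange are hypothetical names, not part of Vulkan):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: a byte range inside a mapped allocation.
struct FlushRange {
    std::uint64_t offset;
    std::uint64_t size;
};

// Expands [offset, offset + size) so that it starts on a multiple of atomSize
// and ends either on a multiple of atomSize or at the end of the allocation.
// atomSize is assumed to come from VkPhysicalDeviceLimits::nonCoherentAtomSize.
inline FlushRange alignedFlushRange(std::uint64_t offset, std::uint64_t size,
                                    std::uint64_t atomSize,
                                    std::uint64_t allocationSize) {
    assert(atomSize > 0);
    assert(offset + size <= allocationSize);
    // Round the start down and the end up to the next atom boundary.
    const std::uint64_t begin = (offset / atomSize) * atomSize;
    std::uint64_t end = ((offset + size + atomSize - 1) / atomSize) * atomSize;
    // Clamp to the allocation; a range ending exactly there is allowed.
    if (end > allocationSize) {
        end = allocationSize;
    }
    return FlushRange{begin, end - begin};
}
```

The resulting offset and size would go into the VkMappedMemoryRange that is passed to vkFlushMappedMemoryRanges. For the 192-byte UniformBufferObject at offset 0 with an atom size of 64, the range is already aligned and comes back unchanged.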

The first barrier (host to transfer) is not actually needed: vkQueueSubmit acts as an implicit barrier that makes host writes performed before the submission visible to the device.

Another issue I see is that you have a single staging buffer which means that you need to wait for the previous frame to finish before you can upload new data.

If the mention of "destroying" means that you allocate per frame ... First, you have to delay destruction until all submitted command buffers using the resource are done; second, don't do that. GPU-side allocation is expensive; prefer allocating once and using a ring buffer.
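
To make the ring-buffer suggestion concrete, here is a rough sketch of the bookkeeping only, with no Vulkan calls. It assumes one staging allocation split into one slot per frame in flight; StagingRing, roundUp and the choice of alignment are my own assumptions (nonCoherentAtomSize would be a reasonable alignment if each slot is flushed separately):

```cpp
#include <cassert>
#include <cstdint>

// Rounds value up to the next multiple of alignment.
inline std::uint64_t roundUp(std::uint64_t value, std::uint64_t alignment) {
    assert(alignment > 0);
    return ((value + alignment - 1) / alignment) * alignment;
}

// Hypothetical helper: splits one staging allocation into per-frame slots.
struct StagingRing {
    std::uint64_t slotSize;   // bytes per slot, rounded up to the alignment
    std::uint32_t slotCount;  // number of frames that may be in flight

    StagingRing(std::uint64_t payloadSize, std::uint64_t alignment,
                std::uint32_t framesInFlight)
        : slotSize(roundUp(payloadSize, alignment)), slotCount(framesInFlight) {
        assert(framesInFlight > 0);
    }

    // Size of the single allocation that is created once at startup.
    std::uint64_t totalSize() const { return slotSize * slotCount; }

    // Byte offset of the slot that a given frame writes to and copies from.
    std::uint64_t offsetForFrame(std::uint64_t frameNumber) const {
        return (frameNumber % slotCount) * slotSize;
    }
};
```

With a 192-byte payload, an alignment of 256 and 3 frames in flight, the slots start at offsets 0, 256 and 512, and frame 3 reuses offset 0. The ring alone does not synchronise anything: a slot may only be overwritten once the fence of the frame that last used it has signalled. With pre-recorded command buffers, each one would also need its copy region's srcOffset set to the matching slot.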