Ext3h Ext3h - 3 years ago 120
C++ Question

How to eager commit allocated memory in C++?

The general situation:

An application that is extremely bandwidth, CPU and GPU intensive, and which needs to transfer about 10-15GB per second from one GPU to another. It uses the DX11 API to access the GPU, so uploads to the GPU can only happen through buffers which must be mapped for every single upload. The upload happens in chunks of 25MB at a time, and 16 threads are writing to mapped buffers concurrently; not much can be done about that. The actual concurrency level of the writes would be lower, if it wasn't for the following bug.

Beefy workstation with 3 Pascal GPUs, a high-end Haswell processor and quad-channel RAM. Not much can be improved on that end. It runs Windows 10, albeit a desktop edition.

The actual problem:

Once I pass ~50% CPU load, something in
(inside the Windows kernel, called when accessing memory which has been mapped into your address space but has not been committed by the OS yet) breaks horribly, and the remaining 50% of the CPU load is wasted on a spin-lock inside
. The CPU becomes 100% utilized, and the application performance degrades completely.

I must assume that this is due to the immense amount of memory which needs to be allocated to the process each second, and which is also completely unmapped from the process every time the DX11 buffer is unmapped. Correspondingly, it's actually thousands of calls to
per second, happening sequentially as
is writing sequentially to the buffer, one for each uncommitted page encountered.
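To get a feel for the fault rate described above, here is a rough back-of-the-envelope sketch. It assumes 4 KiB pages, one soft fault per page on first touch, and reads "25MB" and "10GB" as binary units; none of these are stated in the question, and `pages_for` is a name made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 4 KiB pages, one soft fault per page on first touch.
constexpr std::uint64_t kPageSize = 4096;
constexpr std::uint64_t kMiB = 1024ull * 1024ull;
constexpr std::uint64_t kGiB = 1024ull * kMiB;

// Number of pages (and thus first-touch faults) needed to cover `bytes`.
constexpr std::uint64_t pages_for(std::uint64_t bytes) {
    return (bytes + kPageSize - 1) / kPageSize;
}

// One 25 MiB upload chunk spans 6400 pages.
static_assert(pages_for(25 * kMiB) == 6400, "25 MiB / 4 KiB");

// 10 GiB of freshly mapped memory per second spans 2,621,440 pages.
static_assert(pages_for(10 * kGiB) == 2621440, "10 GiB / 4 KiB");
```

Under these assumptions, every freshly mapped chunk costs several thousand faults, and if the full 10-15GB per second passes through freshly mapped buffers, the per-second total could reach into the millions.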

Once the CPU load goes beyond 50%, the optimistic spin-lock in the Windows kernel protecting the page management degrades completely in terms of performance.


The buffer is allocated by the DX11 driver. So nothing can be tweaked about the allocation strategy. Especially re-use or use of a different memory API is not possible.

Calls to the DX11 API (mapping / unmapping the buffers) all happen from a single thread. The actual copy operations potentially happen multi-threaded, across more threads than there are virtual processors in the system.

Reducing the memory bandwidth requirements is not possible. It's a real-time application. In fact, the hard limit is currently the PCIe 3.0 16x bandwidth of the primary GPU; if I could, I would already push further.

Avoiding multi threaded copies is not possible, as there are independent producer-consumer queues which can't be merged trivially.

The spin-lock performance degradation appears to be so rare (or rather, the use case is pushing things that far) that on Google you won't find even a single result for the name of the spin-lock function.

Upgrading to an API which gives more control over the mappings (Vulkan) is in progress, but it's not suitable as a short term fix. Switching to a better OS kernel is currently not an option for the same reason.

Reducing the CPU load doesn't work either; there is too much work to be done besides the (usually trivial and inexpensive) buffer copy.

The question:

What to do?

I need to reduce the number of individual page faults significantly. I know the address and size of the buffer which has been mapped into my process, and I also know that the memory has not been committed yet.

How can I ensure that the memory is committed with the least amount of transactions possible?

Exotic flags for DX11 which would prevent de-allocation of the buffers after unmapping, Windows APIs to force commit in a single transaction, pretty much anything is welcome.

Answer Source

Current workaround, simplified pseudo code:

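// During startup: raise the minimum working set so VirtualLock() has room.
// The size is computed as SIZE_T, because 2*1024*1024*1024 overflows a 32-bit int.
    SetProcessWorkingSetSize(GetCurrentProcess(), SIZE_T(2) * 1024 * 1024 * 1024, SIZE_T(-1));
// In the DX11 render loop thread
    DX11context->Map(..., &resource);
    // resource.size is pseudo code: pass the size in bytes of the mapped buffer
    VirtualLock(resource.pData, resource.size);
// In the processing threads
    std::memcpy(buffer, source, size);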

VirtualLock() forces the kernel to back the specified address range with RAM immediately. The call to the complementing VirtualUnlock() function is optional; it happens implicitly (and at no extra cost) when the address range is unmapped from the process. (If called explicitly, it costs about a third of the locking cost.)

In order for VirtualLock() to work at all, SetProcessWorkingSetSize() needs to be called first, as the sum of all memory regions locked by VirtualLock() cannot exceed the minimum working set size configured for the process. Setting the "minimum" working set size to something higher than the baseline memory footprint of your process has no side effects unless your system is actually at risk of swapping; your process will still not consume more RAM than its actual working set size.
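As a rough sketch of that sizing rule, the minimum working set can be derived from the baseline footprint plus all buffers that may be locked at once. The helper names, the baseline figure and the 64 MiB headroom below are assumptions made for illustration, not part of the workaround itself; the Windows calls are guarded so the arithmetic also compiles elsewhere, and a 64-bit process is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: minimum working set needed so that `buffers_in_flight`
// mapped buffers of `buffer_bytes` each can be locked on top of the baseline.
// All arithmetic is 64-bit to avoid the int overflow of 2*1024*1024*1024.
constexpr std::uint64_t min_working_set(std::uint64_t baseline_bytes,
                                        std::uint64_t buffer_bytes,
                                        std::uint64_t buffers_in_flight) {
    return baseline_bytes + buffer_bytes * buffers_in_flight;
}

// Raises the minimum working set to at least `min_bytes`. Returns true on
// success. On non-Windows builds this does nothing and reports success.
inline bool reserve_lockable_memory(std::uint64_t min_bytes) {
#ifdef _WIN32
    HANDLE self = GetCurrentProcess();
    SIZE_T cur_min = 0, cur_max = 0;
    if (!GetProcessWorkingSetSize(self, &cur_min, &cur_max)) return false;
    SIZE_T new_min = static_cast<SIZE_T>(min_bytes);
    if (new_min < cur_min) new_min = cur_min;  // never shrink the minimum
    SIZE_T new_max = cur_max;
    // Keep the maximum above the minimum; 64 MiB headroom is an arbitrary choice.
    if (new_max <= new_min) new_max = new_min + (SIZE_T(64) << 20);
    return SetProcessWorkingSetSize(self, new_min, new_max) != 0;
#else
    (void)min_bytes;
    return true;
#endif
}
```

For example, `reserve_lockable_memory(min_working_set(512ull << 20, 25ull << 20, 16))` would ask for room for sixteen 25 MiB buffers on top of an assumed 512 MiB baseline.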

Just the use of VirtualLock(), albeit in individual threads and using deferred DX11 contexts for Map / Unmap calls, instantly decreased the performance penalty from 40-50% to a slightly more acceptable 15%.

Discarding the use of a deferred context, and triggering all soft faults, as well as the corresponding de-allocation when unmapping, exclusively on a single thread, gave the necessary performance boost. The total cost of that spin-lock is now down to <1% of the total CPU usage.


When you expect soft faults on Windows, do what you can to keep them all on the same thread. Performing a parallel memcpy itself is unproblematic, and in some situations even necessary to fully utilize the memory bandwidth. However, that only holds if the memory is already committed to RAM. VirtualLock() is the most efficient way to ensure that.
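The pattern can be sketched roughly as follows: one thread makes the whole range resident, then several threads copy disjoint slices. This is a simplified illustration, not the actual application code. The function names are made up, the per-page touch loop is only a portable stand-in for builds where VirtualLock() is unavailable or fails, 4 KiB pages are assumed, and a single copy is sliced evenly here instead of being fed by independent producer-consumer queues.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Step 1, on ONE thread: make every page of [dst, dst + size) resident.
inline void prefault_single_threaded(void* dst, std::size_t size) {
#ifdef _WIN32
    if (VirtualLock(dst, size)) return;  // one call covers the whole range
#endif
    // Stand-in: write one byte per page (assumed 4 KiB), which faults the pages
    // in sequentially on this thread. The data is overwritten by the copy anyway.
    const std::size_t kPage = 4096;
    volatile unsigned char* p = static_cast<volatile unsigned char*>(dst);
    for (std::size_t off = 0; off < size; off += kPage) p[off] = 0;
}

// Step 2, on MANY threads: copy disjoint, contiguous slices in parallel.
inline void parallel_copy(void* dst, const void* src, std::size_t size,
                          unsigned workers) {
    workers = std::max(1u, workers);
    const std::size_t slice = (size + workers - 1) / workers;
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(size, w * slice);
        const std::size_t end = std::min(size, begin + slice);
        if (begin == end) break;  // nothing left to copy
        pool.emplace_back([=] {
            std::memcpy(static_cast<char*>(dst) + begin,
                        static_cast<const char*>(src) + begin, end - begin);
        });
    }
    for (auto& t : pool) t.join();
}
```

Calling `prefault_single_threaded(dst, size)` from the thread that maps the buffer, and only then handing the buffer to `parallel_copy` or to the worker threads, keeps all first-touch faults on one thread.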

(Unless you are working with an API like DirectX which maps memory into your process, you are unlikely to encounter uncommitted memory frequently. If you are just working with standard C++ new or malloc your memory is pooled and recycled inside your process anyway, so soft faults are rare.)

Just make sure to avoid any form of concurrent page faults when working with Windows.
