I need a way to copy pages from one virtual address range to another without actually copying the data. The ranges are massive and latency is important. mremap can do this, but the problem is that it also deletes the old mapping. Since I need to do this in a multithreaded environment, I need the old mapping to remain usable at the same time; I will free it later, when I'm certain no other threads can be using it. Is this possible, however hacky, without modifying the kernel? The solution only needs to work with recent Linux kernels.
It is possible, although there are architecture-specific cache consistency issues you may need to consider. Some architectures simply do not allow the same page to be accessed from multiple virtual addresses simultaneously without losing coherency. So, some architectures will manage this fine, and others will not.
Edited to add: AMD64 Architecture Programmer's Manual vol. 2, System Programming, section 7.8.7 Changing Memory Type, states:
A physical page should not have differing cacheability types assigned to it through different virtual mappings; they should be either all of a cacheable type (WB, WT, WP) or all of a non-cacheable type (UC, WC, CD). Otherwise, this may result in a loss of cache coherency, leading to stale data and unpredictable behavior.
Thus, on AMD64, it should be safe to mmap() the same file or shared memory region again, as long as the same flags are used; that should cause the kernel to use the same cacheability type for each of the mappings.
The first step is to always use a file backing for the memory maps. Use mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0) so that the mappings do not reserve swap. (If you forget this, you'll run into swap limits much sooner than you hit actual real-life limits for many workloads.) The extra overhead caused by having a file backing is absolutely negligible.
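As a minimal sketch of the file-backed approach, the following maps one temporary file twice with the flags above and checks that a write through one mapping is visible through the other. The helper names (make_backing_file, map_file, aliases_share_data) and the /tmp path are my own assumptions for illustration, not part of any API.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create an unlinked temporary file of `length` bytes; returns its
 * descriptor, or -1 on error. The /tmp location is an assumption; a
 * tmpfs mount would keep the backing entirely in memory. */
int make_backing_file(size_t length)
{
    assert(length > 0);
    char path[] = "/tmp/alias-XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);                 /* the file lives until fd is closed */
    if (ftruncate(fd, (off_t)length) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map the whole file, shared and without reserving swap. */
void *map_file(int fd, size_t length)
{
    return mmap(NULL, length, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_NORESERVE, fd, 0);
}

/* Returns 1 if data written through one mapping can be read through a
 * second mapping of the same file, 0 otherwise. */
int aliases_share_data(void)
{
    const size_t length = (size_t)1 << 20;
    int fd = make_backing_file(length);
    if (fd == -1)
        return 0;

    char *a = map_file(fd, length);
    char *b = map_file(fd, length);
    int ok = 0;
    if (a != MAP_FAILED && b != MAP_FAILED && a != b) {
        strcpy(a, "hello");                  /* write via the first view */
        ok = (strcmp(b, "hello") == 0);      /* read via the second view */
    }
    if (a != MAP_FAILED)
        munmap(a, length);
    if (b != MAP_FAILED)
        munmap(b, length);
    close(fd);
    return ok;
}
```

Both views refer to the same page-cache pages, so no data is copied when the second mapping is created.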
Edited to add: User strcmp pointed out that current kernels do not apply address space randomization to the addresses. Fortunately, this is easy to fix by supplying randomly generated addresses to mmap() instead of NULL. On x86-64, the user address space is 47-bit, and the address should be page aligned; you could use e.g. Xorshift* to generate the addresses, then mask out the unwanted bits: & 0x00007FFFFFE00000 would give 2097152-byte-aligned 47-bit addresses, for example.
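A sketch of such an address generator might look like the following. It uses the commonly published xorshift64* step (shifts 12, 25, 27 and the usual multiplier); random_hint is a hypothetical helper name, and the mask assumes 47-bit user addresses with 2 MiB alignment, that is, bits 21 through 46 kept.

```c
#include <assert.h>
#include <stdint.h>

/* One xorshift64* step. The state must be seeded with a nonzero value. */
uint64_t xorshift64star(uint64_t *state)
{
    assert(*state != 0);
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    return x * UINT64_C(2685821657736338717);
}

/* Produce a nonzero, 2 MiB-aligned candidate address below 2^47.
 * Assumption: x86-64 with a 47-bit user address space. */
void *random_hint(uint64_t *state)
{
    uint64_t addr;
    do {
        /* Keep bits 21..46: clears the low 21 bits (alignment) and
         * everything from bit 47 upwards (non-user addresses). */
        addr = xorshift64star(state) & UINT64_C(0x00007FFFFFE00000);
    } while (addr == 0);
    return (void *)(uintptr_t)addr;
}
```

When such a value is passed as the first argument to mmap() without MAP_FIXED, it is only a hint: if the range is not free, the kernel picks another address, so the returned pointer still has to be checked.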
Because the mapping is backed by a file, you can create a second mapping to the same file after enlarging the backing file using ftruncate(). Only after a suitable grace period, when you know no thread is using the original mapping anymore (perhaps use an atomic counter to keep track of that?), do you unmap the original mapping.
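One way the atomic counter could look, as a rough sketch using C11 atomics. The struct and function names are hypothetical. It also assumes that new readers have already been redirected to the new mapping before retirement is attempted, and it ignores the window between a thread fetching the old pointer and incrementing the counter; real code would need something like RCU or hazard pointers to close that gap.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical bookkeeping for one mapping generation. */
struct mapping {
    void       *addr;
    size_t      length;
    atomic_long users;    /* threads currently working inside the mapping */
};

/* Call before touching the mapping. */
void mapping_enter(struct mapping *m)
{
    atomic_fetch_add(&m->users, 1);
}

/* Call when done with the mapping. */
void mapping_leave(struct mapping *m)
{
    long prev = atomic_fetch_sub(&m->users, 1);
    assert(prev > 0);     /* leave without a matching enter is a bug */
    (void)prev;
}

/* Unmap the old mapping only if nobody is inside it.
 * Returns 1 if it was unmapped, 0 if it is still in use. */
int mapping_try_retire(struct mapping *m)
{
    if (atomic_load(&m->users) != 0)
        return 0;
    munmap(m->addr, m->length);
    m->addr = NULL;
    return 1;
}
```

The thread that switched to the new mapping would call mapping_try_retire() on the old one periodically until it returns 1.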
In practice, when a mapping needs to be enlarged, you first enlarge the backing file, then try mremap(mapping, oldsize, newsize, 0) to see if the mapping can be grown without moving it. Only if the in-place remapping fails do you need to switch to a new mapping.
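The sequence just described could be sketched roughly as follows. grow_mapping is a hypothetical helper, and it assumes the mapping was created with MAP_SHARED | MAP_NORESERVE at offset 0 of the file referred to by fd.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Grow a shared file mapping from old_len to new_len bytes.
 * Returns the address to use from now on, or MAP_FAILED on error.
 * *moved is set to 1 if a second mapping had to be created; in that
 * case the mapping at `old` is still valid, and the caller unmaps it
 * only after the grace period. */
void *grow_mapping(int fd, void *old, size_t old_len, size_t new_len,
                   int *moved)
{
    assert(new_len > old_len);
    *moved = 0;

    /* 1. Enlarge the backing file first. */
    if (ftruncate(fd, (off_t)new_len) == -1)
        return MAP_FAILED;

    /* 2. Try to grow in place. With flags == 0 the kernel never moves
     *    the mapping; it fails instead. */
    void *p = mremap(old, old_len, new_len, 0);
    if (p != MAP_FAILED)
        return p;

    /* 3. Fall back to a second, larger mapping of the same file. The
     *    old mapping stays usable; both views share the same pages. */
    p = mmap(NULL, new_len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_NORESERVE, fd, 0);
    if (p != MAP_FAILED)
        *moved = 1;
    return p;
}
```

If *moved comes back as 1, threads can keep using the old address range while new work is directed at the returned one.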
Edited to add: You definitely do want to use mremap() instead of just using MAP_FIXED to create a larger mapping, because mmap() with MAP_FIXED unmaps (atomically) any existing mappings in the target range, including those belonging to other files or shared memory regions. With mremap(), you get an error if the enlarged mapping would overlap with existing mappings; with MAP_FIXED, any existing mappings that the new mapping overlaps are silently replaced (unmapped).
Unfortunately, I must admit I haven't verified whether the kernel detects collisions between existing mappings, or whether it just assumes the programmer knows about such collisions; after all, the programmer must know the address and length of every mapping, and therefore should know if the mapping would collide with another existing one. Edited to add: The 3.8 series kernels do detect them, returning errno==ENOMEM if the enlarged mapping would collide with existing maps. I expect all Linux kernels to behave the same way, but have no proof, aside from testing on 3.8.0-30-generic on x86_64.
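To check this behaviour on a given kernel, a small probe along these lines might help. It is only a sketch: errno_of_blocked_growth is a hypothetical name, and it forces a collision by splitting a two-page anonymous mapping into two separate mappings with mprotect(), then trying to grow the first one in place.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Try to grow a one-page mapping in place while the page right after
 * it belongs to another mapping. Returns the errno from mremap(),
 * 0 if mremap() unexpectedly succeeded, or -1 if the setup failed. */
int errno_of_blocked_growth(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    size_t page = (size_t)ps;

    /* Two adjacent pages in one mapping. */
    char *base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Different protection on the second page makes the kernel track
     * it as a separate mapping, directly after the first. */
    if (mprotect(base + page, page, PROT_READ) == -1) {
        munmap(base, 2 * page);
        return -1;
    }

    /* flags == 0: grow in place or fail, never move. */
    errno = 0;
    void *p = mremap(base, page, 2 * page, 0);
    int err = (p == MAP_FAILED) ? errno : 0;

    munmap(base, 2 * page);
    return err;
}
```

On the kernels I would expect this to return ENOMEM, matching the observation above, but treat that as something to verify rather than a guarantee.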
Also note that in Linux, POSIX shared memory is implemented using a special filesystem, typically a tmpfs mounted at /dev/shm (or mounted elsewhere, with /dev/shm being a symlink). The shm_open() et al. are implemented by the C library. Instead of having a large POSIX shared memory capability, I'd personally use a specially mounted tmpfs for use in a custom application. If for nothing else, the security controls (users and groups able to create new "files" in there) are much easier and clearer to manage.
If the mapping is, and has to be, anonymous, you can still use mremap(mapping, oldsize, newsize, 0) to try and resize it; it just may fail.
Even with hundreds of thousands of mappings, the 64-bit address space is vast, and the failure case rare. So, although you must handle the failure case too, it does not necessarily have to be fast.
Edited to modify: On x86-64, the address space is 47-bit, and mappings must start at a page boundary (12 bits for normal pages, 21 bits for 2M hugepages, and 30 bits for 1G hugepages), so there are only 35, 26, or 17 bits available in the address space for the mappings. So, collisions are more frequent than one might expect, even if random addresses are suggested. (For 2M mappings, 1024 maps had an occasional collision, but at 65536 maps, the probability of a collision (resize failure) was about 2.3%.)
Edited to add: User strcmp pointed out in a comment that by default Linux mmap() will return consecutive addresses, in which case growing the mapping will always fail unless it's the last one, or a map was unmapped just there.
The approach I know works in Linux is complicated and very architecture-specific. You can remap the original mapping read-only, create a new anonymous map, and copy the old contents there. You need a SIGSEGV handler (the SIGSEGV signal is raised for the particular thread that tries to write to the now read-only mapping, this being one of the few recoverable SIGSEGV situations in Linux even if POSIX disagrees) that examines the instruction that caused the problem, simulates it (modifying the contents of the new mapping instead), and then skips the problematic instruction. After a grace period, when there are no more threads accessing the old, now read-only mapping, you can tear down the mapping.
All of the nastiness is in the SIGSEGV handler, of course. Not only must it be able to decode all machine instructions and simulate them (or at least those that write to memory), but it must also busy-wait if the new mapping has not been completely copied yet. It is complicated, absolutely unportable, and very architecture-specific, but possible.
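To give a feel for the recoverable-SIGSEGV part only, here is a heavily simplified sketch. It is not the instruction-simulating handler described above: instead of decoding and emulating the faulting write, this handler just makes the region writable again, so the instruction is retried when the handler returns. All names are hypothetical, it handles a single region, and calling mprotect() from a signal handler is not formally async-signal-safe, although it is commonly done on Linux.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char  *guarded_base;                 /* start of the trapped region */
static size_t guarded_len;                  /* its length in bytes */
static volatile sig_atomic_t write_faults;  /* faults caught so far */

/* Runs on the thread that performed the faulting access. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t lo = (uintptr_t)guarded_base;
    if (addr < lo || addr >= lo + guarded_len)
        _exit(127);              /* a genuine crash elsewhere: give up */
    write_faults++;
    /* Simplification: re-enable writes rather than simulating the
     * instruction. Returning from the handler retries the write. */
    if (mprotect(guarded_base, guarded_len, PROT_READ | PROT_WRITE) == -1)
        _exit(126);
}

/* Make [base, base + len) read-only and trap writes to it.
 * Returns 0 on success, -1 on error. */
int install_write_trap(void *base, size_t len)
{
    assert(base != NULL && len > 0);
    guarded_base = base;
    guarded_len = len;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;    /* needed to receive si_addr */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) == -1)
        return -1;
    return mprotect(base, len, PROT_READ);
}

/* Number of write faults the handler has recovered from. */
int write_fault_count(void)
{
    return (int)write_faults;
}
```

A full implementation would replace the mprotect() call in the handler with the instruction decoding, the redirected write to the new mapping, and the adjustment of the saved instruction pointer in the ucontext, which is where the architecture-specific work lies.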