I have a network application which allocates predictable 65k chunks as part of the IO subsystem. The memory usage is tracked atomically within the system, so I know how much memory I'm actually using. This number can also be checked against malloc_stats():
system bytes = 1617920
in use bytes = 1007840
system bytes = 2391826432
in use bytes = 247265696
system bytes = 2696175616
in use bytes = 279997648
system bytes = 6180864
in use bytes = 6113920
system bytes = 16199680
in use bytes = 699552
system bytes = 22151168
in use bytes = 899440
system bytes = 8765440
in use bytes = 910736
system bytes = 16445440
in use bytes = 11785872
Total (incl. mmap):
system bytes = 935473152
in use bytes = 619758592
max mmap regions = 32
max mmap bytes = 72957952
system bytes = 2548473856
in use bytes = 3088112
system bytes = 3288600576
in use bytes = 6706544
system bytes = 16183296
in use bytes = 914672
system bytes = 24027136
in use bytes = 911760
system bytes = 15110144
in use bytes = 643168
system bytes = 16621568
in use bytes = 11968016
Total (incl. mmap):
system bytes = 1688858624
in use bytes = 98154448
max mmap regions = 32
max mmap bytes = 73338880
arena (total amount of memory allocated other than mmap) = 1617780736
ordblks (number of ordinary non-fastbin free blocks) = 1854
smblks (number of fastbin free blocks) = 21
hblks (number of blocks currently allocated using mmap) = 31
hblkhd (number of bytes in blocks currently allocated using mmap) = 71077888
usmblks (highwater mark for allocated space) = 0
fsmblks (total number of bytes in fastbin free blocks) = 1280
uordblks (total number of bytes used by in-use allocations) = 27076560
fordblks (total number of bytes in free blocks) = 1590704176
keepcost (total amount of releaseable free space at the top of the heap) = 439216
The full details can be a bit complex, so I'll try to simplify things as much as I can. Also, this is a rough outline and may be slightly inaccurate in places.
Requesting memory from the kernel
malloc uses either sbrk or anonymous mmap to request a contiguous memory area from the kernel. Each area will be a multiple of the machine's page size, typically 4096 bytes. Such a memory area is called an arena in malloc terminology. More on that below.
Any pages so mapped become part of the process's virtual address space. However, even though they have been mapped in, they may not be backed by a physical RAM page [yet]. They are mapped [many-to-one] to the single "zero" page in R/O mode.
When the process tries to write to such a page, it incurs a protection fault, the kernel breaks the mapping to the zero page, allocates a real physical page, remaps to it, and the process is restarted at the fault point. This time the write succeeds. This is similar to demand paging to/from the paging disk.
In other words, page mapping in a process's virtual address space is different than page residency in a physical RAM page/slot. More on this later.
RSS (resident set size)
RSS is not really a measure of how much memory a process allocates or frees, but how many pages in its virtual address space have a physical page in RAM at the present time.
If the system has a paging disk of 128GB but only (e.g.) 4GB of RAM, a process's RSS can never exceed 4GB. The process's RSS goes up or down based upon pages in its virtual address space being paged in or paged out.
So, because of the zero-page mapping at start, a process's RSS may be much lower than the amount of virtual memory it has requested from the system. Also, if another process B "steals" a page slot from a given process A, the RSS for A goes down and the RSS for B goes up.
The process "working set" is the minimum number of pages the kernel must keep resident for the process to prevent the process from excessively page faulting to get a physical memory page, based on some measure of "excessively". Each OS has its own ideas about this and it's usually a tunable parameter on a system-wide or per-process basis.
If a process allocates a 3GB array, but only accesses the first 10MB of it, it will have a lower working set than if it randomly/scattershot accessed all parts of the array.
That is, if the RSS is higher [or can be higher] than the working set, the process will run well. If the RSS is below the working set, the process will page fault excessively. This can be either because it has poor "locality of reference" or because other events in the system conspire to "steal" the process's page slots.
malloc and arenas
To cut down on fragmentation, malloc uses multiple arenas. Each arena has a "preferred" allocation size (aka "chunk" size). That is, smaller requests like malloc(32) come from (e.g.) arena A, but larger requests like malloc(1024 * 1024) come from a different arena, (e.g.) arena B. This prevents a small allocation from "burning" the first 32 bytes of the last available chunk in arena B, making it too short to satisfy the next large request.
Of course, we can't have a separate arena for each requested size, so the "preferred" chunk sizes are typically some power of 2.
When creating a new arena for a given chunk size, malloc doesn't just request an area of the chunk size, but some multiple of it. It does this so it can quickly satisfy subsequent requests of the same size without having to do an mmap for each one. Since the minimum area size is 4096 bytes, arena A will have 4096/32 or 128 chunks available.
free and munmap
When an application does a free(ptr) [where ptr represents a chunk], the chunk is marked as available. free could choose to combine contiguous chunks that are free/available at that time, or not.
If the chunk is small enough, free does nothing more (i.e.) the chunk is available for reallocation, but free does not try to release the chunk back to the kernel. For larger allocations, free will [try to] do a munmap of the chunk.
munmap can unmap a single page [or even a small number of bytes], even if it comes in the middle of an area that was multiple pages long. If so, the application now has a "hole" in the mapping.
malloc_trim and madvise
As noted above, when free is called on a large chunk, it probably calls munmap. If an entire page has been unmapped, the RSS of the process (e.g. A) goes down.
But, consider chunks that are still allocated, or chunks that were marked as free/available but were not unmapped.
They are still part of the process A's RSS. If another process (e.g. B) starts doing lots of allocations, the system may have to page out some of process A's slots to the paging disk [reducing A's RSS] to make room for B [whose RSS goes up].
But, if there is no process B to steal A's page slots, process A's RSS can remain high. Say process A allocated 100MB and used it a while back, but is only actively using 1MB now; its RSS will remain at 100MB.
That's because without the "interference" from process B, the kernel had no reason to steal any page slots from A, so they "remain on the books" in the RSS.
To tell the kernel that a memory area is not likely to be used soon, we need the madvise syscall. glibc uses MADV_DONTNEED [note: there is no MADV_WONTNEED flag]. This tells the kernel that the memory area is low priority and it can reclaim the physical pages backing it right away, thereby reducing the process's RSS.
The pages remain mapped in the process's virtual address space, but their physical backing is given up. Remember, page mapping is different than page residency.
If the process accesses the page again, it incurs a page fault and the kernel maps in a physical RAM slot again [for anonymous memory, a fresh zero-filled page; for file-backed memory, the data is re-read from the file]. The RSS goes back up. Classical demand paging.
This madvise is what malloc_trim uses to reduce the RSS of the process.