Johnny V Johnny V - 1 month ago 19
C Question

Malloc is using 10x the amount of memory necessary

I have a network application which allocates predicable 65k chunks as part of the IO subsystem. The memory usage is tracked atomically within the system so I know how much memory I'm actually using. This number can also be checked against malloc_stats()

Result of

malloc_stats()


Arena 0:
system bytes = 1617920
in use bytes = 1007840
Arena 1:
system bytes = 2391826432
in use bytes = 247265696
Arena 2:
system bytes = 2696175616
in use bytes = 279997648
Arena 3:
system bytes = 6180864
in use bytes = 6113920
Arena 4:
system bytes = 16199680
in use bytes = 699552
Arena 5:
system bytes = 22151168
in use bytes = 899440
Arena 6:
system bytes = 8765440
in use bytes = 910736
Arena 7:
system bytes = 16445440
in use bytes = 11785872
Total (incl. mmap):
system bytes = 935473152
in use bytes = 619758592
max mmap regions = 32
max mmap bytes = 72957952


Items to note:


  • The
    total in use bytes
    is completely correct number according to my internal counter. However, the application has a RES (from top/htop) of 5.2GB. The allocations are almost always 65k; I don't understand the huge amount of fragmentation/waste I am seeing even more so when mmap comes into play.

  • total system bytes
    does not equal to the sum of
    system bytes
    in each Arena.

  • I'm on Ubuntu 16.04 using glibc 2.23-0ubuntu3

  • Arena 1 and 2 account for the large RES value the kernel is reporting.

  • Arena 1 and 2 are holding on to 10x the amount of memory that is used.

  • The mass majority of allocations are ALWAYS 65k (explicit multiple of the page size)



How do I keep malloc for allocating an absurd amount of memory?

I think this version of malloc has a huge bug. Eventually (after an hour) a little more than half of the memory will be released. This isn't a fatal bug but it is definitely a problem.

UPDATE - I added
mallinfo
and re-ran the test - the app is no longer processing anything at the time this was captured. No network connections are attached. It is idle.


Arena 2:
system bytes = 2548473856
in use bytes = 3088112
Arena 3:
system bytes = 3288600576
in use bytes = 6706544
Arena 4:
system bytes = 16183296
in use bytes = 914672
Arena 5:
system bytes = 24027136
in use bytes = 911760
Arena 6:
system bytes = 15110144
in use bytes = 643168
Arena 7:
system bytes = 16621568
in use bytes = 11968016
Total (incl. mmap):
system bytes = 1688858624
in use bytes = 98154448
max mmap regions = 32
max mmap bytes = 73338880
arena (total amount of memory allocated other than mmap) = 1617780736
ordblks (number of ordinary non-fastbin free blocks) = 1854
smblks (number of fastbin free blocks) = 21
hblks (number of blocks currently allocated using mmap) = 31
hblkhd (number of bytes in blocks currently allocated using mmap) = 71077888
usmblks (highwater mark for allocated space) = 0
fsmblks (total number of bytes in fastbin free blocks) = 1280
uordblks (total number of bytes used by in-use allocations) = 27076560
fordblks (total number of bytes in free blocks) = 1590704176
keepcost (total amount of releaseable free space at the top of the heap) = 439216


My hypothesis is as follows: The difference between the
total system bytes
reported by
malloc
is much less than the amount reported in each
arena
. (1.6Gb vs 6.1GB) This could mean that (A)
malloc
is actually releasing blocks but the arena doesn't or (B) that
malloc
is not compacting memory allocations at all and it is creating huge amount of fragmentation.

UPDATE Ubuntu released a kernel update which basically fixed everything as described in this post. That said, there is a lot of good information in here on how malloc works with the kernel.

Answer

The full details can be a bit complex, so I'll try to simplify things as much as I can. Also, this is a rough outline and may be slightly inaccurate in places.


Requesting memory from the kernel

malloc uses either sbrk or anonymous mmap to request a contiguous memory area from the kernel. Each area will be a multiple of the machine's page size, typically 4096 bytes. Such a memory area is called an arena in malloc terminology. More on that below.

Any pages so mapped become part of the process's virtual address space. However, even though they have been mapped in, they may not be backed up by a physical RAM page [yet]. They are mapped [many-to-one] to the single "zero" page in R/O mode.

When the process tries to write to such a page, it incurs a protection fault, the kernel breaks the mapping to the zero page, allocates a real physical page, remaps to it, and the process is restarted at the fault point. This time the write succeeds. This is similar to demand paging to/from the paging disk.

In other words, page mapping in a process's virtual address space is different than page residency in a physical RAM page/slot. More on this later.


RSS (resident set size)

RSS is not really a measure of how much memory a process allocates or frees, but how many pages in its virtual address space have a physical page in RAM at the present time.

If the system has a paging disk of 128GB, but only had (e.g.) 4GB of RAM, a process RSS could never exceed 4GB. The process's RSS goes up/down based upon paging in or paging out pages in its virtual address space.

So, because of the zero page mapping at start, a process RSS may be much lower than the amount of virtual memory it has requested from the system. Also, if another process B "steals" a page slot from a given process A, the RSS for A goes down and goes up for B.

The process "working set" is the minimum number of pages the kernel must keep resident for the process to prevent the process from excessively page faulting to get a physical memory page, based on some measure of "excessively". Each OS has its own ideas about this and it's usually a tunable parameter on a system-wide or per-process basis.

If a process allocates a 3GB array, but only accesses the first 10MB of it, it will have a lower working set than if it randomly/scattershot accessed all parts of the array.

That is, if the RSS is higher [or can be higher] than the working set, the process will run well. If the RSS is below the working set, the process will page fault excessively. This can be either because it has poor "locality of reference" or because other events in the system conspire to "steal" the process's page slots.


malloc and arenas

To cut down on fragmentation, malloc uses multiple arenas. Each arena has a "preferred" allocation size (aka "chunk" size). That is, smaller requests like malloc(32) come from (e.g.) arena A, but larger requests like malloc(1024 * 1024) come from a different arena (e.g.) arena B.

This prevents a small allocation from "burning" the first 32 bytes of the last available chunk in arena B, making it too short to satisfy the next malloc(1M)

Of course, we can't have a separate arena for each requested size, so the "preferred" chunk sizes are typically some power of 2.

When creating a new arena for a given chunk size, malloc doesn't just request an area of the chunk size, but some multiple of it. It does this so it can quickly satisfy subsequent requests of the same size without having to do an mmap for each one. Since the minimum size is 4096, arena A will have 4096/32 chunks or 128 chunks available.


free and munmap

When an application does a free(ptr) [ptr represents a chunk], the chunk is marked as available. free could choose to combine contiguous chunks that are free/available at that time or not.

If the chunk is small enough, it does nothing more (i.e.) the chunk is available for reallocation, but, free does not try to release the chunk back to the kernel. For larger allocations, free will [try to] do munmap immediately.

munmap can unmap a single page [or even a small number of bytes], even if comes in the middle of an area that was multiple pages long. If so, the application now has a "hole" in the mapping.


malloc_trim and madvise

If free is called, it probably calls munmap. If an entire page has been unmapped, the RSS of the process (e.g. A) goes down.

But, consider chunks that are still allocated, or chunks that were marked as free/available but were not unmapped.

They are still part of the process A's RSS. If another process (e.g. B) starts doing lots of allocations, the system may have to page out some of process A's slots to the paging disk [reducing A's RSS] to make room for B [whose RSS goes up].

But, if there is no process B to steal A's page slots, process A's RSS can remain high. Say process A allocated 100MB, used it a while back, but is only actively using 1MB now, the RSS will remain at 100MB.

That's because without the "interference" from process B, the kernel had no reason to steal any page slots from A, so they "remain on the books" in the RSS.

To tell the kernel that a memory area is not likely to be used soon, we need the madvise syscall with MADV_WONTNEED. This tells the kernel that the memory area is low priority and it should [more] aggressively page it out to the paging disk, thereby reducing the process's RSS.

The pages remain mapped in the process's virtual address space, but get farmed out to the paging disk. Remember, page mapping is different than page residency.

If the process accesses the page again, it incurs a page fault and the kernel will pull in the data from paging disk to a physical RAM slot and remap. The RSS goes back up. Classical demand paging.

madvise is what malloc_trim uses to reduce the RSS of the process.

Comments