
Weird profiling results with zlib

I am playing with zlib compression and have (simplified) code like this:

#include <cstring> // memset, strlen
#include <string>

#include <zlib.h>

#include <cstdio>

const int compressionLevel_ = 9;
const size_t BUFFER_SIZE = 1024 * 8;
char buffer_[BUFFER_SIZE];

std::string compress(const char *data, size_t const size){
    z_stream zs;
    memset(&zs, 0, sizeof(zs));

    if (deflateInit(&zs, compressionLevel_) != Z_OK)
        return {};

    zs.next_in = reinterpret_cast<Bytef *>( const_cast<char *>( data ) );
    zs.avail_in = static_cast<uInt>( size );

    std::string out;

    int result;
    do {
        zs.next_out = reinterpret_cast<Bytef *>(buffer_);
        zs.avail_out = BUFFER_SIZE;

        result = deflate(&zs, Z_FINISH);

        if (out.size() < zs.total_out){
            out.append(buffer_, zs.total_out - out.size() );
        }
    } while (result == Z_OK);

    deflateEnd(&zs); // release zlib's internal deflate state

    if (result != Z_STREAM_END)
        return {};

    return out;
}

int main(){
    const char *original = "Hello World";
    size_t const original_size = strlen(original);

    for(size_t i = 0; i < 1000000; ++i){
        const auto cdata = compress(original, original_size);
    }
}

The real code is a bit bigger, because it also decompresses the result and checks the decompressed string against the original.
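A minimal sketch of such a round trip (illustrative only; this inflate-based decompress() is my reconstruction, not the author's actual code):

std::string decompress(const char *data, size_t const size){
    z_stream zs;
    memset(&zs, 0, sizeof(zs));

    if (inflateInit(&zs) != Z_OK)
        return {};

    zs.next_in = reinterpret_cast<Bytef *>( const_cast<char *>( data ) );
    zs.avail_in = static_cast<uInt>( size );

    std::string out;

    int result;
    do {
        zs.next_out = reinterpret_cast<Bytef *>(buffer_);
        zs.avail_out = BUFFER_SIZE;

        result = inflate(&zs, Z_NO_FLUSH);

        if (out.size() < zs.total_out)
            out.append(buffer_, zs.total_out - out.size());
    } while (result == Z_OK);

    inflateEnd(&zs);

    return result == Z_STREAM_END ? out : std::string{};
}

// and inside the benchmark loop:
//     const auto cdata = compress(original, original_size);
//     if (decompress(cdata.data(), cdata.size()) != original)
//         return 1;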

If I compile it with clang or gcc, everything runs in about 5 seconds.

However... If I comment out this line:

if (out.size() < zs.total_out){
    // out.append(buffer_, zs.total_out - out.size() );
}

Execution takes 30 seconds!

If I try it on a fast Linux server, the times are 3 sec vs 18 sec respectively.

I tried the same on macOS with clang and there was no big difference in timing.

I tried different optimization levels and the result persists: if you comment out the line, you get roughly 10x the execution time.

What could be the reason?


I tried it on Cygwin with gcc - no difference in timing there.

I tried it on a Linux ARM CPU with gcc - 25 sec vs 2 min 20 sec.


Answer

If you run it with time, you can notice an interesting thing. With the line uncommented you get something like this:

$ time ./main 

real    0m5.309s
user    0m5.304s
sys     0m0.004s

Nothing special, just some 5 seconds spent in userspace, as expected. But with the line commented out you suddenly get this:

$ time ./main 

real    0m29.061s
user    0m7.424s
sys     0m21.660s

So it's not that your code suddenly got six times slower; it's that it spends more than 5000 times longer in the kernel (21.660 s of sys time instead of 0.004 s), which is unusual for a program that by its nature just computes, with no I/O involved.

And oprofile proves this. The good binary shows this:

samples  %        image name               symbol name
62451    45.2376            /lib64/
61905    44.8421             memset
2752      1.9935             _int_free
2549      1.8464             _int_malloc
2474      1.7921             malloc_consolidate
2263      1.6392  no-vmlinux               /no-vmlinux
1365      0.9888             malloc
723       0.5237             __memcpy_sse2_unaligned
711       0.5150      /usr/lib64/
478       0.3462             free
366       0.2651  main                     main
6         0.0043               _dl_lookup_symbol_x
2         0.0014               _dl_relocate_object
2         0.0014               do_lookup_x
1        7.2e-04               _dl_name_match_p
1        7.2e-04               check_match.9478
1        7.2e-04               strcmp
1        7.2e-04             _dl_addr

While the bad one is like this:

samples  %        image name               symbol name
594605   74.6032  no-vmlinux               /no-vmlinux
102981   12.9207             memset
72822     9.1368            /lib64/
9093      1.1409             _int_malloc
3987      0.5002             sysmalloc
3365      0.4222             _int_free
2119      0.2659             brk
1958      0.2457             systrim.isra.1
1597      0.2004             free
1217      0.1527             malloc
1123      0.1409             sbrk
786       0.0986             __memcpy_sse2_unaligned
688       0.0863             __default_morecore
669       0.0839  main                     main
5        6.3e-04               _dl_relocate_object
4        5.0e-04               do_lookup_x
2        2.5e-04               strcmp
1        1.3e-04               _dl_lookup_symbol_x
1        1.3e-04             _dl_addr

If you also look at the strace output for both binaries, the good one makes just 64 syscalls (at least on my system), while the bad one makes 4000063 syscalls. That is almost exactly four extra brk() calls per compress() invocation, and most of them look like this:

brk(0x6c0000)                           = 0x6c0000
brk(0x6e2000)                           = 0x6e2000
brk(0x6d2000)                           = 0x6d2000
brk(0x6c2000)                           = 0x6c2000
brk(0x6c0000)                           = 0x6c0000
brk(0x6e2000)                           = 0x6e2000
brk(0x6d2000)                           = 0x6d2000
brk(0x6c2000)                           = 0x6c2000
brk(0x6c0000)                           = 0x6c0000
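
(A quick way to get such counts is strace's summary mode, strace -c ./main, which prints a per-syscall tally on exit.)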

So we have a memory allocation/deallocation loop here. The only thing here that uses dynamic memory allocation is the output string, and indeed, if you leave the magic line commented out but initialize the string like std::string out = "1";, you get the same "good" result as with the append.
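That is, the "bad" binary with just this one-line change behaves like the good one:

    std::string out = "1"; // instead of: std::string out;
                           // (the append below stays commented out)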

This must be some corner case in the allocation logic, where your usage/unusage pattern triggers glibc to first allocate some memory (a small chunk, thus via brk()) and then release it back to the system. With an initialized (or appended-to) string you get a different usage pattern, and glibc can't return the memory to the system. If you force glibc not to return any memory (see man mallopt), the results are the same even with the "bad" (commented) binary:

$ time MALLOC_TRIM_THRESHOLD_=-1 ./main 

real    0m5.094s
user    0m5.096s
sys     0m0.000s
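
You can get the same effect from inside the program with mallopt(3); a minimal sketch, assuming glibc (per man mallopt, setting M_TRIM_THRESHOLD to -1 disables trimming completely):

#include <malloc.h> // mallopt, M_TRIM_THRESHOLD (glibc-specific)

int main(){
    // Never return freed memory to the kernel, no matter how big the
    // free block at the top of the heap grows:
    mallopt(M_TRIM_THRESHOLD, -1);

    // ... run the benchmark loop from above ...
}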

Most probably you won't hit this corner case in real life. Also, if you're using a different C library (or even a different version of glibc), the results can differ.
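
For what it's worth, the pattern can be reproduced without zlib at all. The following is a hypothetical sketch (the chunk count and sizes are my guesses at what deflateInit()/deflateEnd() do per call, and it assumes glibc's default tunables, where M_TRIM_THRESHOLD and M_MMAP_THRESHOLD are both 128 KiB):

#include <cstdlib>

int main(){
    for (int i = 0; i < 1000000; ++i){
        // Several chunks below the mmap threshold, so they are carved
        // out of the brk() heap rather than mmap()ed:
        void *p[4];
        for (auto &q : p) q = std::malloc(64 * 1024);
        for (auto &q : p) std::free(q);
        // Once the freed space at the top of the heap exceeds
        // M_TRIM_THRESHOLD, glibc returns it with brk(), and the next
        // iteration has to grow the heap all over again.
    }
}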