Weird profiling results with zlib

I am playing with zlib and have (simplified) code like this:

#include <cstring>  // memset
#include <string>

#include <zlib.h>

#include <cstdio>

const int compressionLevel_ = 9;
const size_t BUFFER_SIZE = 1024 * 8;
char buffer_[BUFFER_SIZE];

std::string compress(const char *data, size_t const size){
    z_stream zs;
    memset(&zs, 0, sizeof(zs));

    if (deflateInit(&zs, compressionLevel_) != Z_OK)
        return {};

    zs.next_in  = reinterpret_cast<Bytef *>( const_cast<char *>( data ) );
    zs.avail_in = static_cast<uInt>( size );

    std::string out;

    int result;
    do {
        zs.next_out  = reinterpret_cast<Bytef *>(buffer_);
        zs.avail_out = BUFFER_SIZE;

        result = deflate(&zs, Z_FINISH);

        if (out.size() < zs.total_out){
            // COMMENT / UNCOMMENT HERE !!!
            out.append(buffer_, zs.total_out - out.size() );
        }
    } while (result == Z_OK);

    deflateEnd(&zs);

    if (result != Z_STREAM_END)
        return {};

    return out;
}


int main(){
    const char *original = "Hello World";
    size_t const original_size = strlen(original);

    for(size_t i = 0; i < 1000000; ++i){
        const auto cdata = compress(original, original_size);
    }
}


The real code is a bit bigger, because it also decompresses the data and checks the decompressed string against the original.
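
For illustration, that extra round-trip check looks roughly like this (a simplified sketch only, not the exact code, using an inflate()-based decompress() counterpart):

std::string decompress(const char *data, size_t const size){
    z_stream zs;
    memset(&zs, 0, sizeof(zs));

    if (inflateInit(&zs) != Z_OK)
        return {};

    zs.next_in  = reinterpret_cast<Bytef *>( const_cast<char *>( data ) );
    zs.avail_in = static_cast<uInt>( size );

    std::string out;

    int result;
    do {
        zs.next_out  = reinterpret_cast<Bytef *>(buffer_);
        zs.avail_out = BUFFER_SIZE;

        result = inflate(&zs, Z_NO_FLUSH);

        if (out.size() < zs.total_out)
            out.append(buffer_, zs.total_out - out.size());
    } while (result == Z_OK);

    inflateEnd(&zs);

    if (result != Z_STREAM_END)
        return {};

    return out;
}

// inside the loop in main():
// const auto cdata = compress(original, original_size);
// if (decompress(cdata.data(), cdata.size()) != original) return 1; // mismatch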

If I compile with clang or gcc, everything runs in about 5 seconds.

However, if I comment out this line:

if (out.size() < zs.total_out){
    // COMMENT / UNCOMMENT HERE !!!
    // out.append(buffer_, zs.total_out - out.size() );
}


Execution takes 30 sec!!!

If I try it on a fast Linux server, the times are 3 sec vs 18 sec respectively.

I tried the same on macOS with clang and there was no big difference in timing.

I tried different optimization levels and the result persists: if you comment out the line, you get roughly 10x the execution time.

What could be the reason?

Update

I tried it on Cygwin with gcc - no difference in timing there.

I tried it on a Linux ARM CPU with gcc - 25 sec vs 2:20 min.

Answer

If you run it under time, you can notice an interesting thing: with the line uncommented you get something like this:

$ time ./main 

real    0m5.309s
user    0m5.304s
sys     0m0.004s

Nothing special, just about 5 seconds spent in userspace, as expected. But with the line commented out you suddenly get this:

$ time ./main 

real    0m29.061s
user    0m7.424s
sys     0m21.660s

So it's not that your code suddenly got six times slower; it's spending more than 5000 times longer in the kernel, which is unusual given that the program is purely computational by nature and does no I/O or anything of the kind.
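
If you want to confirm that split from inside the program rather than with time, getrusage() reports user and kernel CPU time separately; a minimal sketch (not part of the original program) would be:

#include <sys/resource.h> // getrusage
#include <cstdio>

void print_cpu_times(){
    rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0){
        // ru_utime: CPU time spent in userspace, ru_stime: CPU time spent in the kernel
        std::printf("user %ld.%06lds, sys %ld.%06lds\n",
                    (long) ru.ru_utime.tv_sec, (long) ru.ru_utime.tv_usec,
                    (long) ru.ru_stime.tv_sec, (long) ru.ru_stime.tv_usec);
    }
}

// call print_cpu_times() at the end of main()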

And oprofile confirms this. The good binary shows this:

samples  %        image name               symbol name
62451    45.2376  libz.so.1.2.8            /lib64/libz.so.1.2.8
61905    44.8421  libc-2.19.so             memset
2752      1.9935  libc-2.19.so             _int_free
2549      1.8464  libc-2.19.so             _int_malloc
2474      1.7921  libc-2.19.so             malloc_consolidate
2263      1.6392  no-vmlinux               /no-vmlinux
1365      0.9888  libc-2.19.so             malloc
723       0.5237  libc-2.19.so             __memcpy_sse2_unaligned
711       0.5150  libstdc++.so.6.0.21      /usr/lib64/libstdc++.so.6.0.21
478       0.3462  libc-2.19.so             free
366       0.2651  main                     main
6         0.0043  ld-2.19.so               _dl_lookup_symbol_x
2         0.0014  ld-2.19.so               _dl_relocate_object
2         0.0014  ld-2.19.so               do_lookup_x
1        7.2e-04  ld-2.19.so               _dl_name_match_p
1        7.2e-04  ld-2.19.so               check_match.9478
1        7.2e-04  ld-2.19.so               strcmp
1        7.2e-04  libc-2.19.so             _dl_addr

While the bad one looks like this:

samples  %        image name               symbol name
594605   74.6032  no-vmlinux               /no-vmlinux
102981   12.9207  libc-2.19.so             memset
72822     9.1368  libz.so.1.2.8            /lib64/libz.so.1.2.8
9093      1.1409  libc-2.19.so             _int_malloc
3987      0.5002  libc-2.19.so             sysmalloc
3365      0.4222  libc-2.19.so             _int_free
2119      0.2659  libc-2.19.so             brk
1958      0.2457  libc-2.19.so             systrim.isra.1
1597      0.2004  libc-2.19.so             free
1217      0.1527  libc-2.19.so             malloc
1123      0.1409  libc-2.19.so             sbrk
786       0.0986  libc-2.19.so             __memcpy_sse2_unaligned
688       0.0863  libc-2.19.so             __default_morecore
669       0.0839  main                     main
5        6.3e-04  ld-2.19.so               _dl_relocate_object
4        5.0e-04  ld-2.19.so               do_lookup_x
2        2.5e-04  ld-2.19.so               strcmp
1        1.3e-04  ld-2.19.so               _dl_lookup_symbol_x
1        1.3e-04  libc-2.19.so             _dl_addr

If you also look at the strace output for both binaries, the good one makes just 64 syscalls (at least on my system), while the bad one makes 4,000,063 syscalls, most of which look like this:

brk(0x6c0000)                           = 0x6c0000
brk(0x6e2000)                           = 0x6e2000
brk(0x6d2000)                           = 0x6d2000
brk(0x6c2000)                           = 0x6c2000
brk(0x6c0000)                           = 0x6c0000
brk(0x6e2000)                           = 0x6e2000
brk(0x6d2000)                           = 0x6d2000
brk(0x6c2000)                           = 0x6c2000
brk(0x6c0000)                           = 0x6c0000

So we have a memory allocation-deallocation loop here. The only thing here that uses dynamic memory allocation/deallocation is the output string, and indeed, if you leave the magic line commented out but initialize the string like std::string out = "1";, you get the same "good" result as when appending to the string.

This must be some corner case in the allocation logic, where your use/unuse pattern triggers glibc to first allocate some memory (a small chunk, thus via brk()) and then release it back to the system. With an initialized (or appended-to) string you have a different usage pattern and glibc can't return the memory to the system. If you force glibc not to return any memory (see man mallopt), then the results are the same even for the "bad" (commented-out) binary:

$ time MALLOC_TRIM_THRESHOLD_=-1 ./main 

real    0m5.094s
user    0m5.096s
sys     0m0.000s
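
The same can be requested from inside the program with mallopt() (glibc-specific; just a sketch, not something the original code does):

#include <malloc.h> // mallopt, M_TRIM_THRESHOLD (glibc)

int main(){
    // Equivalent of MALLOC_TRIM_THRESHOLD_=-1: disable trimming, so glibc
    // never returns heap memory to the kernel via brk().
    mallopt(M_TRIM_THRESHOLD, -1);

    // ... run the compression loop as before ...
}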

Most probably you won't hit this corner case in real life. Also, if you're using a different C library (or even a different version of glibc), the results may differ.
