We have a mid-sized program for a simulation task, that we need to optimize. We have already done our best optimizing the source to the limit of our programming skills, including profiling with Gprof and Valgrind.
When finally finished, we want to run the program on several systems probably for some months. Therefore we are really interested in pushing the optimization to the limits.
All systems will run Debian/Linux on relatively new hardware (Intel i5 or i7).
What are possible optimization options using a recent version of g++, that go beyond -O3/-Ofast?
We are also interested in costly minor optimization, that will payout in the long run.
What we use right now
Right now we use the following g++ optimization options:
Most of the answers suggest alternative solutions, such as different compilers or external libraries, which would most likely bring a lot of rewriting or integration work. I will try to stick to what the question is asking, and focus on what can be done with GCC alone, by activating compiler flags or doing minimal changes to the code, as requested by the OP. This is not a "you must do this" answer, but more a collection of GCC tweaks that have worked out well for me and that you can give a try if they are relevant in your specific context.
Warnings regarding original question
Before going into the details, a few warning regarding the question, typically for people who will come along, read the question and say "the OP is optimising beyond O3, I should use the same flags than he does!".
-march=nativeenables usage of instructions specific to a given CPU architecture, and that are not necessarily available on a different architecture. The program may not work at all if run on a system with a different CPU, or be significantly slower (as this also enables
mtune=native), so be aware of this if you decide to use it. More information here.
-fwhole-program"should not be used in combination with
-flto" according to the GCC documentation, unless the linker plugin is disabled or if the program does not require any symbols to be exported. I've managed to produce broken code in the past by using these two flags in combination with the linker plugin, so these options should be used with caution. More information here.
-Ofast, as you stated, enables some non standard compliant optimisations, so it should used with caution as well.
Other GCC flags to try out
The details for the different flags are listed here.
-ffast-math, which in turn enables
-fcx-limited-range. You can go even further on floating point calculation optimisations by selectively adding some extra flags such as
-fno-trapping-math. These are not included in
-Ofastand can give some additional performance increases on calculations, but you must check whether they actually benefit you and don't break any calculations.
-frename-registers, this option has never produced unwanted results for me and tends to give a noticeable performance increase (ie. can be measured when benchmarking). This is the type of flag that is very dependant on your processor though.
-funroll-loopsalso sometimes gives good results, but it is dependent on your actual code.
GCC has Profile-Guided Optimisations features. I haven't found a lot of precise GCC documentation about it, but nevertheless getting it to run is quite straightforward.
-fprofile-use. If your application is multi-threaded also add the
PGO with GCC can give amazing results and really significantly boost performance (I've seen a 15-20% speed increase on one of the projects I was recently working on). Obviously the issue here is to have some data that is sufficiently representative of your application's execution, which is not always available or easy to obtain.
GCC's Parallel Mode
GCC features a Parallel Mode, which was first released around the time where the GCC 4.2 compiler was out.
Basically, it provides you with parallel implementations of many of the algorithms in the C++ Standard Library. To enable them globally, you just have to add the
-fopenmp and the
-D_GLIBCXX_PARALLEL flags to the compiler. You can also selectively enable each algorithm when needed, but this will require some minor code changes.
All the information about this parallel mode can be found here.
If you frequently use these algorithms on large data structures, and have many hardware thread contexts available, these parallel implementations can give a huge performance boost. I have only made use of the parallel implementation of
sort so far, but to give a rough idea I managed to reduce the time for sorting from 14 to 4 seconds in one of my applications (testing environment: vector of 100 millions objects with custom comparator function and 8 cores machine).
Unlike the previous points sections, this part does require some small changes in the code. They are also GCC specific (some of them work on Clang as well), so compile time macros should be used to keep the code portable on other compilers. This section contains some more advanced techniques, and should not be used if you don't have some assembly level understanding of what's going on.
__builtin_expectcan help the compiler do better optimisations by providing it with branch prediction information. Other constructs such as
__builtin_prefetchbrings data into a cache before it is accessed and can help reducing cache misses.
coldattributes; the former will indicate to the compiler that the function is a hotspot of the program and optimise the function more aggressively and place it in a special subsection of the text section, for better locality; the later will optimise the function for size and place it in another special subsection of the text section.
I hope this answer will prove useful for some developers, and I will be glad to consider any edits or suggestions.