karo96 · 4 years ago · 59
C++ Question

C++ code is much slower on linux than on windows

I'm writing a simple program and I want to measure its execution time on Windows and Linux (both 64-bit). I have a problem, because for 1,000,000 elements in the table it takes about 35 seconds on Windows, while on Linux it takes about 30 seconds for just 10 elements. Why is the difference so huge?
What am I doing wrong? Is there something in my code that is not proper on Linux?

Here is my code:

#include <cmath>
#include <cstdlib>
#include <ctime>
#include <iostream>

using namespace std;

void fillTable(int s, int t[])
{
    for (int i = 0; i < s; i++)
        t[i] = rand();
}

void checkIfIsPrimeNotParalleled(int size, int table[])
{
    for (int i = 0; i < size; i++)
    {
        int tmp = table[i];

        if (tmp < 2)
            continue;

        for (int i = 2; i < tmp; i++)
            if (tmp % i == 0)
                break;
    }
}

void mesureTime(int size, int table[], int numberOfRepetitions)
{
    double sum = 0;
    clock_t start_time, end_time;
    fillTable(size, table);

    for (int i = 0; i < numberOfRepetitions; i++)
    {
        start_time = clock();
        checkIfIsPrimeNotParalleled(size, table);
        end_time = clock();
        double duration = double(end_time - start_time) / CLOCKS_PER_SEC;
        sum += duration;
    }
    cout << "Avg: " << round(sum / numberOfRepetitions) << " s" << endl;
}

int main()
{
    static constexpr int size = 1000000;
    int *table = new int[size];
    int numberOfRepetitions = 1;
    mesureTime(size, table, numberOfRepetitions);
    delete[] table;
    return 0;
}

and the Makefile for Linux. On Windows I'm using Visual Studio 2015.

.PHONY: Project1

CXX = g++
EXEC = tablut
LDFLAGS = -fopenmp
CXXFLAGS = -std=c++11 -Wall -Wextra -fopenmp -m64
SRC = Project1.cpp
OBJ = $(SRC:.cpp=.o)

all: $(EXEC)

tablut: $(OBJ)
	$(CXX) -o tablut $(OBJ) $(LDFLAGS)

%.o: %.cpp
	$(CXX) -o $@ -c $< $(CXXFLAGS)

clean:
	rm -rf *.o

mrproper: clean
	rm -rf tablut

The main goal is to measure time.

Answer Source

Your code has a for loop set to 1,000,000 iterations. As noted by others, the compiler can optimize such a loop away entirely, so you learn nothing.

A technique I use to work around this good-compiler issue is to replace the fixed-count loop with a low-cost time check.

In the following code snippet, I use std::chrono for the duration measurement and time(0) to check for end-of-test. std::chrono is not the lowest-cost time check I have found, but I think it is good enough for how I am using it. A std::time(0) call measures at about 5 ns on my system, about the fastest I have measured.

// Note 7 - semaphore function performance
// measure duration when no thread 'collision' and no context switch
void measure_LockUnlock()
{
   PPLSem_t*  sem1 = new PPLSem_t;
   assert(nullptr != sem1);
   size_t     count1 = 0;
   size_t     count2 = 0;
   std::cout << dashLine << "  3 second measure of lock()/unlock()"
             << " (no collision) " << std::endl;
   time_t t0 = time(0) + 3;        // end-of-test, 3 seconds from now

   Time_t start_us = HRClk_t::now();
   do {
      assert(0 == sem1->lock());   count1 += 1;
      assert(0 == sem1->unlock()); count2 += 1;
      if (time(0) > t0) break;     // low-cost end-of-test check
   } while (true);
   auto duration_us = std::chrono::duration_cast<US_t>(HRClk_t::now() - start_us);

   assert(count1 == count2);
   std::cout << report(" 'sem lock()+unlock()' ", count1, duration_us.count());

   delete sem1;
   std::cout << "\n";
} // void measure_LockUnlock()

FYI - "class PPLSem_t" is 4 single-line methods wrapping a POSIX process semaphore set to local mode (unnamed, unshared).

The test above measures only the cost of the method invocations; no context switches (notoriously slow) occur in this experiment.

But wait, you say ... doesn't one or the other of lock() and unlock() have side effects? Agreed. But does the compiler know that? It has to assume that they do.

So how do you make this useful?

Two steps: 1) Measure your lock/unlock performance. 2) Add the code from inside your for loop (not the for loop itself) into this lock/unlock loop, then measure the performance again.

The difference between these two measurements is the information you seek, and I think the compiler cannot optimize it away.

The result of the duration measurement on my older Dell, with Ubuntu 15.10, g++ v5.2.1.23, and -O3, is

  3 second measure of lock()/unlock() (no collision) 
  133.5317660 M 'sem lock()+unlock()'  events in 3,341,520 us
  39.96138464 M 'sem lock()+unlock()'  events per second
  25.02415792 n sec per  'sem lock()+unlock()'  event 

So this is about 12.5 ns for each individual method call, and the loop achieved about 133 × 10^6 iterations in about 3 seconds.

You can attempt to adjust the time budget to reach 1,000,000 iterations, or simply use the iteration count to jump out of the loop (i.e. an "if (count1 == 1000000) break;" kind of idea).

Your assignment, should you choose to accept it, is to find a suitable, simple, and fast method (or two) with a side effect (which you know will not actually occur), add your code into that loop, and run until the loop count reaches 1,000,000.

Hope this helps.
