VioPanda · 4 years ago · 79
C++ Question

C++ data access benchmark

I'm writing a simple benchmark in C++ to compare the execution time of data access on different platforms, and I've got strange results. I measure the time of sequential-order access and indirection-order access. For this I simply copy the data of one array to another in two different ways. The code and results are below.
The times I got are ambiguous. For the int data type, sequential access is faster (as expected), but for float and double it is the opposite (see the time results below). Am I benchmarking incorrectly, or are there pitfalls I did not take into account? Or could you suggest benchmark tools to compare data-access or simple-operation performance for different data types?
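For context, the results below show "iterations: 1", and timing a single pass is a common pitfall: one run is easily dominated by noise (page faults, CPU frequency scaling, interrupts). A minimal sketch of the usual remedy, repeating the measurement and keeping the minimum; `minTimeNs` is a hypothetical helper, not part of the benchmark shown here:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <limits>

// Repeat a measurement and report the best (minimum) time, so that
// one-off noise does not dominate the result.
template <typename F>
std::int64_t minTimeNs(F&& func, int repetitions)
{
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < repetitions; ++r)
    {
        const auto start = std::chrono::steady_clock::now();
        func();
        const auto end = std::chrono::steady_clock::now();
        const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
        best = std::min<std::int64_t>(best, ns);
    }
    return best;
}
```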

template <typename T>
std::chrono::nanoseconds::rep PerformanceMeter<T>::testDataAccessArr()
{
    T* arrDataIn = new T[k_SIZE];
    T* arrDataOut = new T[k_SIZE];

    std::generate_n(arrDataIn, k_SIZE, DataProcess<T>::valueGenerator);
    DataProcess<T>::clearCache();

    std::chrono::nanoseconds::rep timeSequential =
        measure::ns(copySequentialArr, arrDataIn, arrDataOut, k_SIZE);

    std::cout << "Sequential order access:\t" << timePrint(timeSequential) << "\t";
    std::cout.flush();

    T** pointers = new T*[k_SIZE];
    T** pointersOut = new T*[k_SIZE];
    for (size_t i = 0; i < k_SIZE; ++i)
    {
        pointers[i] = &arrDataIn[i];
        pointersOut[i] = &arrDataOut[i];
    }

    std::generate_n(arrDataIn, k_SIZE, DataProcess<T>::valueGenerator);
    std::generate_n(arrDataOut, k_SIZE, DataProcess<T>::valueGenerator);

    DataProcess<T>::clearCache();

    std::chrono::nanoseconds::rep totalIndirection =
        measure::ns(copyIndirectionArr, pointers, pointersOut, k_SIZE);

    std::cout << std::endl << "Indirection order access:\t" << timePrint(totalIndirection) << std::endl;
    std::cout.flush();

    delete[] arrDataIn;
    delete[] arrDataOut;
    delete[] pointers;
    delete[] pointersOut;

    return timeSequential;
}

template <typename T>
void PerformanceMeter<T>::copySequentialArr(const T* dataIn, T* dataOut, size_t dataSize)
{
    for (size_t i = 0; i < dataSize; i++)
        dataOut[i] = dataIn[i];
}

template <typename T>
void PerformanceMeter<T>::copyIndirectionArr(T** dataIn, T** dataOut, size_t dataSize)
{
    for (size_t i = 0; i < dataSize; i++)
        *dataOut[i] = *dataIn[i];
}
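The measure::ns helper is called above but not shown. A minimal sketch of what such a helper might look like (this is an assumption about its implementation, reconstructed only from the call sites), using std::chrono::steady_clock, which is the appropriate monotonic clock for interval timing:

```cpp
#include <chrono>
#include <utility>

namespace measure
{
    // Hypothetical reconstruction: invoke the callable with the given
    // arguments and return the elapsed wall-clock time in nanoseconds.
    template <typename F, typename... Args>
    std::chrono::nanoseconds::rep ns(F&& func, Args&&... args)
    {
        const auto start = std::chrono::steady_clock::now();
        std::forward<F>(func)(std::forward<Args>(args)...);
        const auto end = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    }
}
```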


Results:

-------------------Measure int---------------

data: 10 MB ; iterations: 1

Sequential order access: 8.50454ms

Indirection order access: 11.6925ms

-------------------Measure float------------

data: 10 MB ; iterations: 1

Sequential order access: 8.84023ms

Indirection order access: 8.53148ms

-------------------Measure double-----------

data: 10 MB ; iterations: 1

Sequential order access: 5.57747ms

Indirection order access: 3.72843ms

Answer

Here is an example (using T = int) of the assembly output from GCC 6.3 with -O2: copySequentialArr and copyIndirectionArr.

From the assembly it is clear that the two are very similar, but copyIndirectionArr requires two more mov instructions than copySequentialArr. From this we can tentatively conclude that copySequentialArr is the faster of the two.

The same is true when using T = double: copySequentialArr and copyIndirectionArr.

Vectorization

It gets interesting when we switch to -O3: copySequentialArr and copyIndirectionArr. There is no change to copyIndirectionArr, but copySequentialArr is now vectorized by the compiler, which should make it even faster under normal conditions.
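The asymmetry can be reproduced outside the benchmark harness. A standalone sketch (the compiler invocation is an assumption, e.g. `g++ -O3 -fopt-info-vec repro.cpp` with GCC, which reports which loops were vectorized): the contiguous loop can be turned into SIMD loads and stores, while the indirection loop typically stays scalar because each element is reached through a pointer that could point anywhere.

```cpp
#include <cstddef>

// Contiguous access: the compiler can prove the elements are adjacent
// and emit vector (SIMD) loads/stores.
void copySequential(const double* in, double* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i];
}

// Scattered access: every element goes through its own pointer, so the
// loop generally cannot be vectorized.
void copyIndirection(double** in, double** out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        *out[i] = *in[i];
}
```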

Disclaimer

These examinations of the resulting assembly are "out of context", in the sense that the compiler could optimize further if it had knowledge of the calling context.
