jsguy - 2 months ago
C++ Question

Why does this recursive C++ function have such bad cache behavior?

Let T be a rooted binary tree in which every internal node has exactly two children. The nodes of the tree are stored in an array, call it TreeArray, following the preorder layout.

So for example if this is the tree that we have:
[image: the example tree]

Then TreeArray will contain the following node objects:

7, 3, 1, 0, 2, 6, 12, 9, 8, 11, 13


A node in this tree is a struct of this kind:

struct tree_node{

    int id;          //id of the node, randomly generated
    int numChildren; //number of children; 2 for internal nodes, 0 for leaves

    int pos;  //position in TreeArray where the node is stored
    int lpos; //position of the left child
    int rpos; //position of the right child

    tree_node(){
        id = -1;
        pos = lpos = rpos = -1;
        numChildren = 0;
    }

};


Now suppose that we want a function that returns the sum of all the ids in the tree. This sounds trivial: all you have to do is use a for loop that iterates over TreeArray and accumulates all the ids it finds. However, I am interested in understanding the cache behavior of the following implementation:

void testCache1(int cur){

    //find the positions of the left and right children
    int lpos = TreeArray[cur].lpos;
    int rpos = TreeArray[cur].rpos;

    //if there are no children we are at a leaf, so update r and return
    if(TreeArray[cur].numChildren == 0){
        r += TreeArray[cur].id;
        return;
    }

    //otherwise we are at an internal node, so update r and recurse,
    //first into the left subtree and then into the right subtree
    r += TreeArray[cur].id;

    testCache1(lpos);
    testCache1(rpos);

}


To test the cache behavior I have the following experiment:

int r = 0; //r is a global variable

int main(int argc, char* argv[]){

    for(int i=0;i<100;i++) {
        r = 0;
        testCache1(0);
    }

    cout<<r<<endl;
    return 0;
}


For a random tree with 5 million leaves, perf stat -B -e cache-misses,cache-references,instructions ./run_tests 111.txt prints the following:

Performance counter stats for './run_tests 111.txt':

469,511,047 cache-misses # 89.379 % of all cache refs
525,301,814 cache-references
20,715,360,185 instructions

11.214075268 seconds time elapsed


In the beginning I thought maybe it was because of the way I generate the tree (which I have left out of the question), but when I ran sudo perf record -e cache-misses ./run_tests 111.txt I received the following output:

[image: perf report, with most cache-misses attributed to testCache1]

As we can see, most of the cache misses come from this function. However, I cannot understand why this is the case. The values of cur will be sequential: I will first access position 0 of TreeArray, then position 1, 2, 3, etc.

To add more doubt to my understanding of what is happening, here is a function that computes the same sum:

void testCache4(int index){

    if(index == T->size) return; //T points to the root; T->size is the total number of nodes

    r += TreeArray[index].id;

    testCache4(index+1);

}


testCache4 accesses the elements of TreeArray in the same way, but the cache behavior is much better.

Output from perf stat -B -e cache-misses,cache-references,instructions ./run_tests 111.txt:

Performance counter stats for './run_tests 111.txt':

396,941,872 cache-misses # 54.293 % of all cache refs
731,109,661 cache-references
11,547,097,924 instructions

4.306576556 seconds time elapsed


In the output from sudo perf record -e cache-misses ./run_tests 111.txt the function does not even appear:

[image: perf report, testCache4 absent from the top cache-miss contributors]

I apologize for the long post but I feel completely lost. Thank you in advance.

EDIT:

Here is the entire test file, together with the parser and everything that is required. It is assumed that the tree is available inside a text file that is given as an argument. Compile with g++ -O3 -std=c++11 file.cpp and run with ./executable tree.txt. The tree I am using can be found here (don't open it in the browser, use "save as").

#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib> //for atoi
#include <ctime>   //for timespec and clock_gettime
#define BILLION 1000000000LL

using namespace std;

/*
*
* Timing functions
*
*/

timespec startT, endT;

void startTimer(){
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &startT);
}

double endTimer(){
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endT);
    return endT.tv_sec * BILLION + endT.tv_nsec - (startT.tv_sec * BILLION + startT.tv_nsec);
}

/*
*
* tree node
*
*/

//this struct is used for creating the first tree after reading it from the external file, for this we need left and child pointers

struct tree_node_temp{

    int id;          //id of the node, randomly generated
    int numChildren; //number of children; 2 for internal nodes, 0 for leaves
    int size;        //size of the subtree rooted at the current node
    tree_node_temp *leftChild;
    tree_node_temp *rightChild;

    tree_node_temp(){
        id = -1;
        size = 1;
        leftChild = nullptr;
        rightChild = nullptr;
        numChildren = 0;
    }

};

struct tree_node{

    int id;          //id of the node, randomly generated
    int numChildren; //number of children; 2 for internal nodes, 0 for leaves
    int size;        //size of the subtree rooted at the current node

    int pos;  //position in TreeArray where the node is stored
    int lpos; //position of the left child
    int rpos; //position of the right child

    tree_node(){
        id = -1;
        pos = lpos = rpos = -1;
        numChildren = 0;
    }

};

/*
*
* Tree parser. The input is a file containing the tree in the newick format.
*
*/

string treeNewickStr; //string storing the newick format of a tree that we read from a file
int treeCurSTRindex; //index to the current position we are in while reading the newick string
int treeNumLeafs; //number of leafs in current tree
tree_node ** treeArrayReferences; //stack of references to free node objects
tree_node *treeArray; //array of node objects
int treeStackReferencesTop; //the top index to the references stack
int curpos; //used to find pos,lpos and rpos when creating the pre order layout tree


//helper function for readNewick
tree_node_temp* readNewickHelper() {

    int i;
    if(treeCurSTRindex == treeNewickStr.size())
        return nullptr;

    tree_node_temp * leftChild = nullptr;  //initialized so we never read
    tree_node_temp * rightChild = nullptr; //an uninitialized pointer below

    if(treeNewickStr[treeCurSTRindex] == '('){
        //create a left child
        treeCurSTRindex++;
        leftChild = readNewickHelper();
    }

    if(treeNewickStr[treeCurSTRindex] == ','){
        //create a right child
        treeCurSTRindex++;
        rightChild = readNewickHelper();
    }

    if(treeNewickStr[treeCurSTRindex] == ')' || treeNewickStr[treeCurSTRindex] == ';'){
        treeCurSTRindex++;
        tree_node_temp * cur = new tree_node_temp();
        cur->numChildren = 2;
        cur->leftChild = leftChild;
        cur->rightChild = rightChild;
        cur->size = 1 + leftChild->size + rightChild->size;
        return cur;
    }

    //we are about to read a label; keep reading until we see a "," ")" or "("
    //(we assume that the newick string has the right format)
    i = 0;
    char treeLabel[20]; //buffer used for the label
    while(treeNewickStr[treeCurSTRindex]!=',' && treeNewickStr[treeCurSTRindex]!='(' && treeNewickStr[treeCurSTRindex]!=')'){
        treeLabel[i] = treeNewickStr[treeCurSTRindex];
        treeCurSTRindex++;
        i++;
    }

    treeLabel[i] = '\0';
    tree_node_temp * cur = new tree_node_temp();
    cur->numChildren = 0;
    cur->id = atoi(treeLabel)-1;
    treeNumLeafs++;

    return cur;
}

//create the pre order tree, curRoot in the first call points to the root of the first tree that was given to us by the parser
void treeInit(tree_node_temp * curRoot){

    tree_node * curFinalRoot = treeArrayReferences[curpos];

    curFinalRoot->pos = curpos;

    if(curRoot->numChildren == 0) {
        curFinalRoot->id = curRoot->id;
        return;
    }

    //add left child
    tree_node * cnode = treeArrayReferences[treeStackReferencesTop];
    curFinalRoot->lpos = curpos + 1;
    curpos = curpos + 1;
    treeStackReferencesTop++;
    cnode->id = curRoot->leftChild->id;
    treeInit(curRoot->leftChild);

    //add right child
    curFinalRoot->rpos = curpos + 1;
    curpos = curpos + 1;
    cnode = treeArrayReferences[treeStackReferencesTop];
    treeStackReferencesTop++;
    cnode->id = curRoot->rightChild->id;
    treeInit(curRoot->rightChild);

    curFinalRoot->id = curRoot->id;
    curFinalRoot->numChildren = 2;
    curFinalRoot->size = curRoot->size;

}

//the ids of the leaves are determined by the newick file; for the internal nodes we just incrementally assign the id determined by the dfs traversal
void updateInternalNodeIDs(int cur){

    tree_node* curNode = treeArrayReferences[cur];

    if(curNode->numChildren == 0){
        return;
    }
    curNode->id = treeNumLeafs++;
    updateInternalNodeIDs(curNode->lpos);
    updateInternalNodeIDs(curNode->rpos);

}

//frees the memory of the first tree generated by the parser
void treeFreeMemory(tree_node_temp* cur){

    if(cur->numChildren == 0){
        delete cur;
        return;
    }
    treeFreeMemory(cur->leftChild);
    treeFreeMemory(cur->rightChild);

    delete cur;

}

//reads the tree stored in "file" under the newick format and creates it in the main memory. The output (what the function returns) is a pointer to the root of the tree.
//this tree is scattered anywhere in the memory.

tree_node* readNewick(string& file){

    treeCurSTRindex = -1;
    treeNewickStr = "";
    treeNumLeafs = 0;

    ifstream treeFin;

    treeFin.open(file, ios_base::in);
    //read the newick format of the tree and store it in a string
    treeFin>>treeNewickStr;
    //initialize index for reading the string
    treeCurSTRindex = 0;
    //create the tree in main memory
    tree_node_temp* root = readNewickHelper();

    //store the tree in an array following the pre order layout
    treeArray = new tree_node[root->size];
    treeArrayReferences = new tree_node*[root->size];
    int i;
    for(i=0;i<root->size;i++)
        treeArrayReferences[i] = &treeArray[i];
    treeStackReferencesTop = 0;

    tree_node* finalRoot = treeArrayReferences[treeStackReferencesTop];
    curpos = treeStackReferencesTop;
    treeStackReferencesTop++;
    finalRoot->id = root->id;
    treeInit(root);

    //update the internal node ids (the leaf ids are defined by the ids stored in the newick string)
    updateInternalNodeIDs(0);
    //close the file
    treeFin.close();

    //free the memory of the initial tree
    treeFreeMemory(root);
    //return the pre order tree
    return finalRoot;

}

/*
* experiments
*
*/

int r;
tree_node* T;

void testCache1(int cur){

    int lpos = treeArray[cur].lpos;
    int rpos = treeArray[cur].rpos;

    if(treeArray[cur].numChildren == 0){
        r += treeArray[cur].id;
        return;
    }

    r += treeArray[cur].id;

    testCache1(lpos);
    testCache1(rpos);

}


void testCache4(int index){

    if(index == T->size) return;

    r += treeArray[index].id;

    testCache4(index+1);

}


int main(int argc, char* argv[]){

    string Tnewick = argv[1];
    T = readNewick(Tnewick);
    double tt;

    startTimer();
    for(int i=0;i<100;i++) {
        r = 0;
        testCache4(0);
    }
    tt = endTimer();
    cout<<r<<endl;
    cout<<tt/BILLION<<endl;

    startTimer();
    for(int i=0;i<100;i++) {
        r = 0;
        testCache1(0);
    }
    tt = endTimer();
    cout<<r<<endl;
    cout<<tt/BILLION<<endl;

    delete[] treeArray;
    delete[] treeArrayReferences;

    return 0;
}


EDIT2:

I ran some profiling tests with valgrind. The instruction count might actually be the overhead here, but I don't understand why. For example, even in the experiments above with perf, one version executes around 20 billion instructions and the other 11 billion; that's a 9 billion difference.

With -O3 enabled I get the following:

[image: profiler output with -O3]

So the function calls are expensive in testCache1 and cost nothing in testCache4? The number of function calls should be the same in both cases...

Answer

I guess the problem is a misunderstanding of what cache-references actually count.

As explained in this answer, cache-references on Intel CPUs actually count the number of references to the last-level cache. Therefore, memory references that are served by the L1 cache are not counted. The Intel 64 and IA-32 Architectures Software Developer's Manual states, however, that loads from the L1 prefetcher are counted.

If you compare the absolute number of cache-misses, you will see that they are roughly equal for both functions. I used a fully balanced tree for the test and removed pos to get a node size of 16 bytes, which fits nicely into cache lines, and got the following numbers:

testCache4:

843.628.131      L1-dcache-loads                                               (56,83%)
193.006.858      L1-dcache-load-misses     #   22,73% of all L1-dcache hits    (57,31%)
326.698.621      cache-references                                              (57,07%)
188.435.203      cache-misses              #   57,679 % of all cache refs      (56,76%)

testCache1:

3.519.968.253    L1-dcache-loads                                               (57,17%)
193.664.806      L1-dcache-load-misses     #    5,50% of all L1-dcache hits    (57,24%)
256.638.490      cache-references                                              (57,12%)
188.007.927      cache-misses              #   73,258 % of all cache refs      (57,23%)

And if I manually disable all hardware prefetchers:

testCache4:

846.124.474      L1-dcache-loads                                               (57,22%)
192.495.450      L1-dcache-load-misses     #   22,75% of all L1-dcache hits    (57,31%)
193.699.811      cache-references                                              (57,03%)
185.445.753      cache-misses              #   95,739 % of all cache refs      (57,17%)

testCache1:

3.534.308.118    L1-dcache-loads                                               (57,16%)
193.595.962      L1-dcache-load-misses     #    5,48% of all L1-dcache hits    (57,18%)
193.639.498      cache-references                                              (57,12%)
185.120.733      cache-misses              #   95,601 % of all cache refs      (57,15%)

As you can see, the differences are gone now. There simply were additional cache-references events caused by the prefetcher, with the actual references being counted twice.

Actually, if you count all memory references, testCache1 has a lower total cache miss rate, because each tree_node is referenced 4 times instead of once; but all data members of a tree_node lie on the same cache line, so only one out of those 4 references misses.

For testCache4 you can see that the L1d load miss rate is actually close to 25%, which is what you would expect if sizeof(tree_node) == 16 and cache lines are 64 bytes.

Also, gcc (at least with -O2) applies tail-call optimization to both functions, eliminating the recursion of testCache4 entirely while leaving testCache1 recursive on one side only. Therefore testCache1 makes many additional cache references to its stack frames which testCache4 does not have.

You can also get the no-prefetcher result by using valgrind, which is probably more reliable in its output as well. It does not simulate all properties of CPU caches, though, as seen with this example.

Regarding your edits: as I mentioned, gcc applies tail recursion optimization, so there are no calls left in testCache4, and of course the recursion and additional memory loads in testCache1 have significant instruction overhead compared to the simple load/add loop that remains in testCache4.
