jsguy - 1 year ago 214
C++ Question

# Why does this recursive C++ function have such a bad cache behavior?

Let `T` be a rooted binary tree in which every internal node has exactly two children. The nodes of the tree are stored in an array, let us call it `TreeArray`, following the preorder layout.

So, for example, for the tree below:

[figure: example binary tree]

`TreeArray` will contain the following node objects:

`7, 3, 1, 0, 2, 6, 12, 9, 8, 11, 13`

A node in this tree is a struct of this kind:

``````struct tree_node{

    int id;          //id of the node, randomly generated
    int numChildren; //number of children; it is 2, but 0 for the leaves

    int pos;  //position in TreeArray where the node is stored
    int lpos; //position of the left child
    int rpos; //position of the right child

    tree_node(){
        id = -1;
        pos = lpos = rpos = -1;
        numChildren = 0;
    }

};
``````

Now suppose that we want a function that returns the sum of all the ids in the tree. This sounds trivial: all you have to do is loop over `TreeArray` and accumulate the ids. However, I am interested in understanding the cache behavior of the following implementation:

``````void testCache1(int cur){

    //find the positions of the left and right children
    int lpos = TreeArray[cur].lpos;
    int rpos = TreeArray[cur].rpos;

    //if there are no children we are at a leaf, so update r and return
    if(TreeArray[cur].numChildren == 0){
        r += TreeArray[cur].id;
        return;
    }

    //otherwise we are at an internal node, so update r and recurse,
    //first into the left subtree and then into the right subtree
    r += TreeArray[cur].id;

    testCache1(lpos);
    testCache1(rpos);
}
``````

To test the cache behavior I have the following experiment:

``````int r = 0; //r is a global variable

int main(int argc, char* argv[]){

    for(int i=0;i<100;i++) {
        r = 0;
        testCache1(0);
    }

    cout<<r<<endl;
    return 0;
}
``````

For a random tree with 5 million leaves,
`perf stat -B -e cache-misses,cache-references,instructions ./run_tests 111.txt`
prints the following:

`````` Performance counter stats for './run_tests 111.txt':

469,511,047      cache-misses              #   89.379 % of all cache refs
525,301,814      cache-references
20,715,360,185      instructions

11.214075268 seconds time elapsed
``````

In the beginning I thought maybe it is because of the way I generate the tree (which I have excluded from my question), but the output of `sudo perf record -e cache-misses ./run_tests 111.txt` shows that most of the cache misses come from this function. However, I cannot understand why this is the case. The values of `cur` will be sequential: I will first access position `0` of `TreeArray`, then position `1`, `2`, `3`, etc.

To add more doubt to my understanding of what is happening, I have the following function that finds the same summation:

``````void testCache4(int index){

    if(index == TreeArray.size) return;

    r += TreeArray[index].id;

    testCache4(index+1);
}
``````

`testCache4`
accesses the elements of
`TreeArray`
in the same way, but the cache behavior is so much better.

Output from `perf stat -B -e cache-misses,cache-references,instructions ./run_tests 111.txt`:

`````` Performance counter stats for './run_tests 111.txt':

396,941,872      cache-misses              #   54.293 % of all cache refs
731,109,661      cache-references
11,547,097,924      instructions

4.306576556 seconds time elapsed
``````

In the output from `sudo perf record -e cache-misses ./run_tests 111.txt`, the function does not even appear.

I apologize for the long post but I feel completely lost. Thank you in advance.

EDIT:

Here is the entire test file, together with the parsers and everything that is required. It is assumed that the tree is available inside a text file that is given as an argument. Compile by typing
`g++ -O3 -std=c++11 file.cpp`
and run by typing
`./executable tree.txt`
. The tree I am using can be found here (don't open it in the browser, use "save as").

``````#include <iostream>
#include <fstream>
#define BILLION  1000000000LL

using namespace std;

/*
*
* Timing functions
*
*/

timespec startT, endT;

void startTimer(){
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &startT);
}

double endTimer(){
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endT);
return endT.tv_sec * BILLION + endT.tv_nsec - (startT.tv_sec * BILLION + startT.tv_nsec);
}

/*
*
* tree node
*
*/

//this struct is used for creating the first tree after reading it from the external file; for this we need left and right child pointers

struct tree_node_temp{

int id; //id of the node, randomly generated
int numChildren; //number of children, it is 2 but for the leafs it's 0
int size; //size of the subtree rooted at the current node
tree_node_temp *leftChild;
tree_node_temp *rightChild;

tree_node_temp(){
id = -1;
size = 1;
leftChild = nullptr;
rightChild = nullptr;
numChildren = 0;
}

};

struct tree_node{

int id; //id of the node, randomly generated
int numChildren; //number of children, it is 2 but for the leafs it's 0
int size; //size of the subtree rooted at the current node

int pos; //position in TreeArray where the node is stored
int lpos; //position of the left child
int rpos; //position of the right child

tree_node(){
id = -1;
pos = lpos = rpos = -1;
numChildren = 0;
}

};

/*
*
* Tree parser. The input is a file containing the tree in the newick format.
*
*/

string treeNewickStr; //string storing the newick format of a tree that we read from a file
int treeCurSTRindex; //index to the current position we are in while reading the newick string
int treeNumLeafs; //number of leafs in current tree
tree_node ** treeArrayReferences; //stack of references to free node objects
tree_node *treeArray; //array of node objects
int treeStackReferencesTop; //the top index to the references stack
int curpos; //used to find pos,lpos and rpos when creating the pre order layout tree

//recursively parse the newick string and build a temporary pointer-based tree
//(function name assumed)
tree_node_temp* treeParseNewickString(){

int i;
if(treeCurSTRindex == treeNewickStr.size())
return nullptr;

tree_node_temp * leftChild = nullptr;
tree_node_temp * rightChild = nullptr;

if(treeNewickStr[treeCurSTRindex] == '('){
//parse the left child
treeCurSTRindex++;
leftChild = treeParseNewickString();
}

if(treeNewickStr[treeCurSTRindex] == ','){
//parse the right child
treeCurSTRindex++;
rightChild = treeParseNewickString();
}

if(treeNewickStr[treeCurSTRindex] == ')' || treeNewickStr[treeCurSTRindex] == ';'){
treeCurSTRindex++;
tree_node_temp * cur = new tree_node_temp();
cur->numChildren = 2;
cur->leftChild = leftChild;
cur->rightChild = rightChild;
cur->size = 1 + leftChild->size + rightChild->size;
return cur;
}

//we are about to read a label, keep reading until we read a "," ")" or "(" (we assume that the newick string has the right format)
i = 0;
char treeLabel[20]; //buffer used for the label
while(treeNewickStr[treeCurSTRindex]!=',' && treeNewickStr[treeCurSTRindex]!='(' && treeNewickStr[treeCurSTRindex]!=')'){
treeLabel[i] = treeNewickStr[treeCurSTRindex];
treeCurSTRindex++;
i++;
}

treeLabel[i] = '\0';
tree_node_temp * cur = new tree_node_temp();
cur->numChildren = 0;
cur->id = atoi(treeLabel)-1;
treeNumLeafs++;

return cur;
}

//create the pre order tree, curRoot in the first call points to the root of the first tree that was given to us by the parser
void treeInit(tree_node_temp * curRoot){

tree_node * curFinalRoot = treeArrayReferences[curpos];

curFinalRoot->pos = curpos;

if(curRoot->numChildren == 0) {
curFinalRoot->id = curRoot->id;
return;
}

tree_node * cnode = treeArrayReferences[treeStackReferencesTop];
curFinalRoot->lpos = curpos + 1;
curpos = curpos + 1;
treeStackReferencesTop++;
cnode->id = curRoot->leftChild->id;
treeInit(curRoot->leftChild);

curFinalRoot->rpos = curpos + 1;
curpos = curpos + 1;
cnode = treeArrayReferences[treeStackReferencesTop];
treeStackReferencesTop++;
cnode->id = curRoot->rightChild->id;
treeInit(curRoot->rightChild);

curFinalRoot->id = curRoot->id;
curFinalRoot->numChildren = 2;
curFinalRoot->size = curRoot->size;

}

//the ids of the leaves are determined by the newick file; for the internal nodes we just incrementally assign the id determined by the dfs traversal
void updateInternalNodeIDs(int cur){

tree_node* curNode = treeArrayReferences[cur];

if(curNode->numChildren == 0){
return;
}
curNode->id = treeNumLeafs++;
updateInternalNodeIDs(curNode->lpos);
updateInternalNodeIDs(curNode->rpos);

}

//frees the memory of the first tree generated by the parser
void treeFreeMemory(tree_node_temp* cur){

if(cur->numChildren == 0){
delete cur;
return;
}
treeFreeMemory(cur->leftChild);
treeFreeMemory(cur->rightChild);

delete cur;

}

//reads the tree stored in "file" under the newick format and creates it in main memory. The function returns a pointer to the root of the preorder tree.
//the initial tree is scattered anywhere in the memory.
//(function name assumed)
tree_node* treeParseNewickFile(string file){

treeCurSTRindex = -1;
treeNewickStr = "";
treeNumLeafs = 0;

ifstream treeFin;

treeFin.open(file, ios_base::in);
//read the newick format of the tree and store it in a string
treeFin>>treeNewickStr;
//initialize index for reading the string
treeCurSTRindex = 0;
//create the tree in main memory
tree_node_temp* root = treeParseNewickString();

//store the tree in an array following the pre order layout
treeArray = new tree_node[root->size];
treeArrayReferences = new tree_node*[root->size];
int i;
for(i=0;i<root->size;i++)
treeArrayReferences[i] = &treeArray[i];
treeStackReferencesTop = 0;

tree_node* finalRoot = treeArrayReferences[treeStackReferencesTop];
curpos = treeStackReferencesTop;
treeStackReferencesTop++;
finalRoot->id = root->id;
treeInit(root);

//update the internal node ids (the leaf ids are defined by the ids stored in the newick string)
updateInternalNodeIDs(0);
//close the file
treeFin.close();

//free the memory of initial tree
treeFreeMemory(root);
//return the pre order tree
return finalRoot;

}

/*
* experiments
*
*/

int r;
tree_node* T;

void testCache1(int cur){

int lpos = treeArray[cur].lpos;
int rpos = treeArray[cur].rpos;

if(treeArray[cur].numChildren == 0){
r += treeArray[cur].id;
return;
}

r += treeArray[cur].id;

testCache1(lpos);
testCache1(rpos);

}

void testCache4(int index){

if(index == T->size) return;

r += treeArray[index].id;

testCache4(index+1);

}

int main(int argc, char* argv[]){

string Tnewick = argv[1];
double tt;

//build the preorder tree from the input file (function name assumed)
T = treeParseNewickFile(Tnewick);

startTimer();
for(int i=0;i<100;i++) {
r = 0;
testCache4(0);
}
tt = endTimer();
cout<<r<<endl;
cout<<tt/BILLION<<endl;

startTimer();
for(int i=0;i<100;i++) {
r = 0;
testCache1(0);
}
tt = endTimer();
cout<<r<<endl;
cout<<tt/BILLION<<endl;

delete[] treeArray;
delete[] treeArrayReferences;

return 0;
}
``````

EDIT2:

I ran some profiling tests with valgrind. The instruction count might actually be the overhead here, but I don't understand why. For example, even in the experiments above with perf, one version executes around 20 billion instructions and the other 11 billion: that's a difference of 9 billion.

With `-O3` enabled, the profile shows that the function calls are expensive in `testCache1` and cost nothing in `testCache4`? The number of function calls in both cases should be the same...

---

I guess the problem is a misunderstanding of what cache-references actually count.

As explained in this answer, `cache-references` on Intel CPUs counts references to the last-level cache. Therefore memory references that were served by the L1 cache are not counted. However, the Intel 64 and IA-32 Architectures Developer's Manual states that loads issued by the L1 prefetcher are counted.

If you actually compare the absolute number of cache-misses, you will see that they are roughly equal for both functions. I used a fully balanced tree for the test, removed `pos` to get a node size of 16 bytes (which divides the cache line size evenly), and got the following numbers:

`testCache4`:

``````843.628.131      L1-dcache-loads                                               (56,83%)
193.006.858      L1-dcache-load-misses     #   22,73% of all L1-dcache hits    (57,31%)
326.698.621      cache-references                                              (57,07%)
188.435.203      cache-misses              #   57,679 % of all cache refs      (56,76%)
``````

`testCache1`:

``````3.519.968.253    L1-dcache-loads                                               (57,17%)
193.664.806      L1-dcache-load-misses     #    5,50% of all L1-dcache hits    (57,24%)
256.638.490      cache-references                                              (57,12%)
188.007.927      cache-misses              #   73,258 % of all cache refs      (57,23%)
``````

And if I manually disable all hardware prefetchers:

`testCache4`:

``````846.124.474      L1-dcache-loads                                               (57,22%)
192.495.450      L1-dcache-load-misses     #   22,75% of all L1-dcache hits    (57,31%)
193.699.811      cache-references                                              (57,03%)
185.445.753      cache-misses              #   95,739 % of all cache refs      (57,17%)
``````

`testCache1`:

``````3.534.308.118    L1-dcache-loads                                               (57,16%)
193.595.962      L1-dcache-load-misses     #    5,48% of all L1-dcache hits    (57,18%)
193.639.498      cache-references                                              (57,12%)
185.120.733      cache-misses              #   95,601 % of all cache refs      (57,15%)
``````

As you can see, the differences are gone now: the extra cache-references events were due to the prefetcher, with the actual reference being counted twice.

Actually, if you count all memory references, `testCache1` has a lower total cache miss rate, because each `tree_node` is referenced 4 times instead of once; but all data members of a `tree_node` lie on the same cache line, so those four references produce at most one miss.

For `testCache4` you can see that the L1d load miss rate is actually close to 25%, which is what you would expect if `sizeof(tree_node) == 16` and cache lines are 64 bytes.

Also, gcc (at least with `-O2`) applies tail-call optimization to both functions, eliminating the recursion in `testCache4` entirely and leaving `testCache1` with only one real recursive call. Therefore `testCache1` makes many additional cache references to its stack frames which `testCache4` does not.

You can also get the no-prefetcher result by using valgrind, whose output is probably more reliable. It does not simulate all properties of CPU caches, though, as this example shows.

Regarding your edits: as mentioned above, gcc applies tail-call optimization, so there are no calls left in `testCache4`, and of course the recursion and the additional memory loads in `testCache1` carry significant instruction overhead compared to the simple load/add loop that remains in `testCache4`.
