jsguy - 2 months ago 7

C++ Question

Let

`T`

`v`

`T`

`v`

Input

Desired output

With red I indicate the numbers that we want to compute. The nodes of the tree will be stored in an array, let us call it

`TreeArray`

For the example above,

`TreeArray`

`10, 11, 0, 12, 13, 2, 7, 3, 14, 1, 15, 16, 4, 8, 17, 18, 5, 9, 6`

A node of the tree is described by the following struct:

`struct tree_node{`

long long int id; //id of the node, randomly generated

int numChildren; //number of children, it is 2 but for the leafs it's 0

int size; //size of the subtree rooted at the current node,

// what we want to compute

int pos; //position in TreeArray where the node is stored

int lpos; //position of the left child

int rpos; //position of the right child

tree_node(){

id = -1;

size = 1;

pos = lpos = rpos = -1;

numChildren = 0;

}

};

The function to compute all the

`size`

`void testCache(int cur){`

if(treeArray[cur].numChildren == 0){

treeArray[cur].size = 1;

return;

}

testCache(treeArray[cur].lpos);

testCache(treeArray[cur].rpos);

treeArray[cur].size = treeArray[treeArray[cur].lpos].size +

treeArray[treeArray[cur].rpos].size + 1;

}

I would like to understand why this function is

`T`

and slower when

`T`

The following experiments were run on Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz with 8 GB of RAM, L1 cache 256 KB, L2 cache 1 MB, L3 cache 6 MB.

Each dot in the graphs is the result of the following for loop (the parameters are defined by the axis):

`for (int i = 0; i < 100; i++) {`

testCache(0);

}

`n`

`n`

Now let us try to find where the bottleneck is. I used the PAPI library to count interesting hardware counters.

The first counter is the instructions, how many instructions do we actually spend? Is there a difference when the trees look different?

The difference is not significant. It looks like for large inputs the left going chain requires fewer instructions, but the difference is so small, so I think it is safe to assume that they both require the same number of instructions.

Seeing that we have stored the tree in a nice pre order layout inside

`treeArray`

Let's look at the accesses to L2 cache. The accesses to L2 cache happen when we get a miss in L1 cache, so that is an indirect counter for L1 misses as well.

As we can see the right going tree requires fewer L1 misses, so it seems that it uses the cache efficiently.

Same for L2 misses, the right going tree seems to be more efficient. Still nothing to indicate why the right going trees are so slower. Let's look at L3.

In L3 things explode for the right going trees. So the problem seems to be in L3 cache. Unfortunately I could not explain the reason behind this behavior. Why do things get messed up in L3 cache for the right going trees?

Here is the entire code together with the experiment:

`#include <iostream>`

#include <fstream>

#define BILLION 1000000000LL

using namespace std;

/*

*

* Timing functions

*

*/

timespec startT, endT;

void startTimer(){

clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &startT);

}

double endTimer(){

clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endT);

return endT.tv_sec * BILLION + endT.tv_nsec - (startT.tv_sec * BILLION + startT.tv_nsec);

}

/*

*

* tree node

*

*/

//this struct is used for creating the first tree after reading it from the external file, for this we need left and child pointers

struct tree_node_temp{

long long int id; //id of the node, randomly generated

int numChildren; //number of children, it is 2 but for the leafs it's 0

int size; //size of the subtree rooted at the current node

tree_node_temp *leftChild;

tree_node_temp *rightChild;

tree_node_temp(){

id = -1;

size = 1;

leftChild = nullptr;

rightChild = nullptr;

numChildren = 0;

}

};

struct tree_node{

long long int id; //id of the node, randomly generated

int numChildren; //number of children, it is 2 but for the leafs it's 0

int size; //size of the subtree rooted at the current node

int pos; //position in TreeArray where the node is stored

int lpos; //position of the left child

int rpos; //position of the right child

tree_node(){

id = -1;

pos = lpos = rpos = -1;

numChildren = 0;

}

};

/*

*

* Tree parser. The input is a file containing the tree in the newick format.

*

*/

string treeNewickStr; //string storing the newick format of a tree that we read from a file

int treeCurSTRindex; //index to the current position we are in while reading the newick string

int treeNumLeafs; //number of leafs in current tree

tree_node ** treeArrayReferences; //stack of references to free node objects

tree_node *treeArray; //array of node objects

int treeStackReferencesTop; //the top index to the references stack

int curpos; //used to find pos,lpos and rpos when creating the pre order layout tree

//helper function for readNewick

tree_node_temp* readNewickHelper() {

int i;

if(treeCurSTRindex == treeNewickStr.size())

return nullptr;

tree_node_temp * leftChild;

tree_node_temp * rightChild;

if(treeNewickStr[treeCurSTRindex] == '('){

//create a left child

treeCurSTRindex++;

leftChild = readNewickHelper();

}

if(treeNewickStr[treeCurSTRindex] == ','){

//create a right child

treeCurSTRindex++;

rightChild = readNewickHelper();

}

if(treeNewickStr[treeCurSTRindex] == ')' || treeNewickStr[treeCurSTRindex] == ';'){

treeCurSTRindex++;

tree_node_temp * cur = new tree_node_temp();

cur->numChildren = 2;

cur->leftChild = leftChild;

cur->rightChild = rightChild;

cur->size = 1 + leftChild->size + rightChild->size;

return cur;

}

//we are about to read a label, keep reading until we read a "," ")" or "(" (we assume that the newick string has the right format)

i = 0;

char treeLabel[20]; //buffer used for the label

while(treeNewickStr[treeCurSTRindex]!=',' && treeNewickStr[treeCurSTRindex]!='(' && treeNewickStr[treeCurSTRindex]!=')'){

treeLabel[i] = treeNewickStr[treeCurSTRindex];

treeCurSTRindex++;

i++;

}

treeLabel[i] = '\0';

tree_node_temp * cur = new tree_node_temp();

cur->numChildren = 0;

cur->id = atoi(treeLabel)-1;

treeNumLeafs++;

return cur;

}

//create the pre order tree, curRoot in the first call points to the root of the first tree that was given to us by the parser

void treeInit(tree_node_temp * curRoot){

tree_node * curFinalRoot = treeArrayReferences[curpos];

curFinalRoot->pos = curpos;

if(curRoot->numChildren == 0) {

curFinalRoot->id = curRoot->id;

return;

}

//add left child

tree_node * cnode = treeArrayReferences[treeStackReferencesTop];

curFinalRoot->lpos = curpos + 1;

curpos = curpos + 1;

treeStackReferencesTop++;

cnode->id = curRoot->leftChild->id;

treeInit(curRoot->leftChild);

//add right child

curFinalRoot->rpos = curpos + 1;

curpos = curpos + 1;

cnode = treeArrayReferences[treeStackReferencesTop];

treeStackReferencesTop++;

cnode->id = curRoot->rightChild->id;

treeInit(curRoot->rightChild);

curFinalRoot->id = curRoot->id;

curFinalRoot->numChildren = 2;

curFinalRoot->size = curRoot->size;

}

//the ids of the leafs are deteremined by the newick file, for the internal nodes we just incrementally give the id determined by the dfs traversal

void updateInternalNodeIDs(int cur){

tree_node* curNode = treeArrayReferences[cur];

if(curNode->numChildren == 0){

return;

}

curNode->id = treeNumLeafs++;

updateInternalNodeIDs(curNode->lpos);

updateInternalNodeIDs(curNode->rpos);

}

//frees the memory of the first tree generated by the parser

void treeFreeMemory(tree_node_temp* cur){

if(cur->numChildren == 0){

delete cur;

return;

}

treeFreeMemory(cur->leftChild);

treeFreeMemory(cur->rightChild);

delete cur;

}

//reads the tree stored in "file" under the newick format and creates it in the main memory. The output (what the function returns) is a pointer to the root of the tree.

//this tree is scattered anywhere in the memory.

tree_node* readNewick(string& file){

treeCurSTRindex = -1;

treeNewickStr = "";

treeNumLeafs = 0;

ifstream treeFin;

treeFin.open(file, ios_base::in);

//read the newick format of the tree and store it in a string

treeFin>>treeNewickStr;

//initialize index for reading the string

treeCurSTRindex = 0;

//create the tree in main memory

tree_node_temp* root = readNewickHelper();

//store the tree in an array following the pre order layout

treeArray = new tree_node[root->size];

treeArrayReferences = new tree_node*[root->size];

int i;

for(i=0;i<root->size;i++)

treeArrayReferences[i] = &treeArray[i];

treeStackReferencesTop = 0;

tree_node* finalRoot = treeArrayReferences[treeStackReferencesTop];

curpos = treeStackReferencesTop;

treeStackReferencesTop++;

finalRoot->id = root->id;

treeInit(root);

//update the internal node ids (the leaf ids are defined by the ids stored in the newick string)

updateInternalNodeIDs(0);

//close the file

treeFin.close();

//free the memory of initial tree

treeFreeMemory(root);

//return the pre order tree

return finalRoot;

}

/*

*

*

* DOT FORMAT OUTPUT --- BEGIN

*

*

*/

void treeBstPrintDotAux(tree_node* node, ofstream& treeFout) {

if(node->numChildren == 0) return;

treeFout<<" "<<node->id<<" -> "<<treeArrayReferences[node->lpos]->id<<";\n";

treeBstPrintDotAux(treeArrayReferences[node->lpos], treeFout);

treeFout<<" "<<node->id<<" -> "<<treeArrayReferences[node->rpos]->id<<";\n";

treeBstPrintDotAux(treeArrayReferences[node->rpos], treeFout);

}

void treePrintDotHelper(tree_node* cur, ofstream& treeFout){

treeFout<<"digraph BST {\n";

treeFout<<" node [fontname=\"Arial\"];\n";

if(cur == nullptr){

treeFout<<"\n";

}

else if(cur->numChildren == 0){

treeFout<<" "<<cur->id<<";\n";

}

else{

treeBstPrintDotAux(cur, treeFout);

}

treeFout<<"}\n";

}

void treePrintDot(string& file, tree_node* root){

ofstream treeFout;

treeFout.open(file, ios_base::out);

treePrintDotHelper(root, treeFout);

treeFout.close();

}

/*

*

*

* DOT FORMAT OUTPUT --- END

*

*

*/

/*

* experiments

*

*/

tree_node* T;

int n;

void testCache(int cur){

if(treeArray[cur].numChildren == 0){

treeArray[cur].size = 1;

return;

}

testCache(treeArray[cur].lpos);

testCache(treeArray[cur].rpos);

treeArray[cur].size = treeArray[treeArray[cur].lpos].size + treeArray[treeArray[cur].rpos].size + 1;

}

int main(int argc, char* argv[]){

string Tnewick = argv[1];

T = readNewick(Tnewick);

n = T->size;

double tt;

startTimer();

for (int i = 0; i < 100; i++) {

testCache(0);

}

tt = endTimer();

cout << tt / BILLION << '\t' << T->size;

cout<<endl;

return 0;

}

Compile by typing

`g++ -O3 -std=c++11 file.cpp`

Run by typing

`./executable tree.txt`

`tree.txt`

Here is a left going tree with 10^5 leafs

Here is a right going tree with 10^5 leafs

The running times I get:

~0.07 seconds for the left going trees

~0.12 seconds for the right going trees

I apologize for the long post but given how narrow the problem seems to be, I couldn't find a better way to describe it.

Thank you in advance!

This is a follow up edit after MrSmith42's answer. I understand that locality plays a very big role, but I am not sure I understand that this is the case here.

For the two example trees above let us see how we access the memory over time.

For the left going tree:

For the right going tree:

To me it seems like in both cases we have locally access patterns.

Here is a plot about the number of conditional branches:

Here is a plot about the number of branch mispredictions:

Here is a left going tree with 10^6 leafs

Here is a right going tree with 10^6 leafs

Answer

UPDATE:

I plot the number of accessed element in array in time

```
void testCache(int cur, FILE *f) {
if(treeArray[cur].numChildren == 0){
fprintf (f, "%d\n", cur);
treeArray[cur].size = 1;
return;
}
fprintf (f, "%d\n", cur);
testCache(treeArray[cur].lpos, f);
fprintf (f, "%d\n", cur);
testCache(treeArray[cur].rpos, f);
fprintf (f, "%d\n", treeArray[cur].lpos);
fprintf (f, "%d\n", treeArray[cur].rpos);
treeArray[cur].size = treeArray[treeArray[cur].lpos].size + treeArray[treeArray[cur].rpos].size + 1;
}
```

as a result I plot 999990 element of resulted text file:

You can see that for the left tree all elements are locally accessed, but for the right one there exist non-uniformity in accessing.

OLD:

I tried to calculate number of memory reads using valgrind. for right one

```
valgrind --tool=callgrind --cache-sim ./a.out right
==11493== I refs: 427,444,674
==11493== I1 misses: 2,288
==11493== LLi misses: 2,068
==11493== I1 miss rate: 0.00%
==11493== LLi miss rate: 0.00%
==11493==
==11493== D refs: 213,159,341 (144,095,416 rd + 69,063,925 wr)
==11493== D1 misses: 15,401,346 ( 12,737,497 rd + 2,663,849 wr)
==11493== LLd misses: 329,337 ( 7,935 rd + 321,402 wr)
==11493== D1 miss rate: 7.2% ( 8.8% + 3.9% )
==11493== LLd miss rate: 0.2% ( 0.0% + 0.5% )
==11493==
==11493== LL refs: 15,403,634 ( 12,739,785 rd + 2,663,849 wr)
==11493== LL misses: 331,405 ( 10,003 rd + 321,402 wr)
==11493== LL miss rate: 0.1% ( 0.0% + 0.5% )
```

and for left one

```
valgrind --tool=callgrind --cache-sim=yes ./a.out left
==11496== I refs: 418,204,722
==11496== I1 misses: 2,327
==11496== LLi misses: 2,099
==11496== I1 miss rate: 0.00%
==11496== LLi miss rate: 0.00%
==11496==
==11496== D refs: 204,114,971 (135,076,947 rd + 69,038,024 wr)
==11496== D1 misses: 19,470,268 ( 12,661,123 rd + 6,809,145 wr)
==11496== LLd misses: 306,948 ( 7,935 rd + 299,013 wr)
==11496== D1 miss rate: 9.5% ( 9.4% + 9.9% )
==11496== LLd miss rate: 0.2% ( 0.0% + 0.4% )
==11496==
==11496== LL refs: 19,472,595 ( 12,663,450 rd + 6,809,145 wr)
==11496== LL misses: 309,047 ( 10,034 rd + 299,013 wr)
==11496== LL miss rate: 0.0% ( 0.0% + 0.4% )
```

As you can see number of memory read 'rd' in 'right' case bigger that in left