user3331388 - 8 months ago 42

C Question

`double compute_inner_contraction_0( double* X, double* T, int n, int f, int m, int e, int No, int Nv )`

{

double contraction_value = 0.0;

int l_begin = 0;

int l_end = m;

for ( int l = l_begin; l < l_end; l++ )

{

int k_begin = 0;

int k_end = l;

for ( int k = k_begin; k < k_end; k++ )

{

contraction_value += (-1)*X[n * Nv * No * No * No * No + e * No * No * No * No + m * No * No * No + l * No * No + l * No + k]*T[k * Nv + f];

}

}

return contraction_value;

}

void compute_contraction_0( double* X, double* T, int No, int Nv, double* ExEx_block , int nthd )

{

omp_set_dynamic(0);

omp_set_num_threads(nthd);

#pragma omp parallel for

for ( int n = 0; n < No; n++ )

{

int f_begin = 0;

int f_end = Nv;

for ( int f = f_begin; f < f_end; f++ )

{

int m_begin = 0;

int m_end = n;

for ( int m = m_begin; m < m_end; m++ )

{

int e_begin = 0;

int e_end = f;

for ( int e = e_begin; e < e_end; e++ )

{

ExEx_block[n * Nv * No * Nv + f * No * Nv + m * Nv + e] += compute_inner_contraction_0( X, T, n, f, m, e, No, Nv);

}

}

}

}

}

Hi Guys, sorry for the long functions. Please don't worry about the indexing for memory leaks as they've been extensively tested. Basically it is taking numbers in array X and T, and multiplies them, then adds the result to another array ExEx_block. This code would run on multi threads but will give very wrong computation results compared to the result of running on one thread(which is correct). Could you please help? It's built as a shared library (with many other similar functions ) and being loaded from Python wrappers. The OS is CentOS 6 and the compiler is gcc-4.4.7

It's a quad Xeon E5-4620 (Sandy-EP) (in total 32 cores) server with 480 GB of DDR3. No, Nv are just integers needed in the calculation. nthd (number of thread) is usually just 10 or 20. The size of these arrays can go up to 17 GB. Thank you very much for your time.

Answer

I've analyzed your code and you have a race condition when updating `ExEx_block`

. There is no other possible way you could get erroneous results [as everything else is just *reading* data].

Just so we're on the same page, I've produced an explanation of what *I* mean by race condition.

I've also produced a cleaned up and simplified version of your program without any fixes. But, it lays the groundwork for fixed version.

And, a second version that has some sample `omp`

directives I think you'll need (e.g. `omp private`

and `omp atomic`

)

Note that I didn't try compiling these examples, so consider them a reference only.

In your program, the race condition occurs if two different threads try to update `ExEx_block[i]`

at the same time. To simplify, let's use an `int`

scalar instead (e.g.) `target`

. The principle is the same.

Assume `target`

has a value of 5. And we have two threads `Ta`

and `Tb`

. `Ta`

wants to do `target += 3`

and `Tb`

wants to do `target += 7`

. Also assume that each thread has a private variable `ta`

and `tb`

respectively.

The `Ta`

action is actually three operations:

```
ta = target;
ta += 3
target = ta;
```

Likewise for `Tb`

:

```
tb = target;
tb += 7;
target = tb;
```

As long as these are disjoint, things are fine and we'll [eventually] get a `target`

value of 15.

But, if any of the operations are interspersed, we might get *either* 8 or 12 instead of the correct value of 15:

```
ta = target;
tb = target;
ta += 3;
target = ta;
tb += 7;
target = tb;
```

Simplified version of your program.

Note that I grouped things with parentheses and used things like `int No4 = No * No * No * No`

to allow the equations to be simplified and more readable. Not strictly required, but it helped me spot the problem and the solution. I tried to be careful, but I could easily have messed up a grouping.

```
double
compute_inner_contraction_0(double *X,double *T,int n,int f,int m,int e,
int No,int Nv)
{
double contraction_value = 0.0;
int l_begin = 0;
int l_end = m;
int No2 = No * No;
int No3 = No2 * No;
int No4 = No3 * No;
for (int l = l_begin; l < l_end; l++) {
int k_begin = 0;
int k_end = l;
for (int k = k_begin; k < k_end; k++) {
contraction_value += (-1) *
X[(n * Nv * No4) + (e * No4) + (m * No3) + (l * No2) +
(l * No) + k] * T[(k * Nv) + f];
}
}
return contraction_value;
}
void
compute_contraction_0(double *X,double *T,int No,int Nv,double *ExEx_block,
int nthd)
{
omp_set_dynamic(0);
omp_set_num_threads(nthd);
int NoNv = Nv * No;
int NoNv2 = NoNv * Nv;
#pragma omp parallel for
for (int n = 0; n < No; n++) {
int f_begin = 0;
int f_end = Nv;
int nx = n * NoNv2;
for (int f = f_begin; f < f_end; f++) {
int m_begin = 0;
int m_end = n;
int fx = f * NoNv;
for (int m = m_begin; m < m_end; m++) {
int e_begin = 0;
int e_end = f;
int mx = m * Nv;
int ax = nx + fx + mx;
for (int e = e_begin; e < e_end; e++) {
ExEx_block[ax + e] +=
compute_inner_contraction_0(X,T,n,f,m,e,No,Nv);
}
}
}
}
}
```

An example of what you'll need to do to prevent the race condition.

For efficiency, I split the `e`

loop into two loops. One that does the heavy computation and one that updates the global array. This helps minimize threads stalling, waiting on one another during the heavy computation phase. The "locked" update phase is a simple and quick loop.

You'll need a per-thread local array using omp's `private`

directive and need to add `omp atomic`

to the loop that updates `ExEx_block`

.

*Caveat:* Although I've done a lot of multithread programming using `pthreads`

and know how to fix race conditions, etc., I'm not very familiar with the specific equivalent `omp`

directives.

So, the following is my best guess as to what you'll need, based upon my reading of the omp docs. So, please ... Check the docs thoroughly about the directives I'm using here.

```
double
compute_inner_contraction_0(double *X,double *T,int n,int f,int m,int e,
int No,int Nv)
{
double contraction_value = 0.0;
int l_begin = 0;
int l_end = m;
int No2 = No * No;
int No3 = No2 * No;
int No4 = No3 * No;
for (int l = l_begin; l < l_end; l++) {
int k_begin = 0;
int k_end = l;
for (int k = k_begin; k < k_end; k++) {
contraction_value += (-1) *
X[(n * Nv * No4) + (e * No4) + (m * No3) + (l * No2) +
(l * No) + k] * T[(k * Nv) + f];
}
}
return contraction_value;
}
void
compute_contraction_0(double *X,double *T,int No,int Nv,double *ExEx_block,
int nthd)
{
omp_set_dynamic(0);
omp_set_num_threads(nthd);
int NoNv = Nv * No;
int NoNv2 = NoNv * Nv;
// ExEx_local must be at least Nv:
// NOTE: placement here may be wrong
double ExEx_local[Nv];
#pragma omp parallel for private(ExEx_local)
for (int n = 0; n < No; n++) {
int f_begin = 0;
int f_end = Nv;
int nx = n * NoNv2;
for (int f = f_begin; f < f_end; f++) {
int m_begin = 0;
int m_end = n;
int fx = f * NoNv;
for (int m = m_begin; m < m_end; m++) {
int e_begin = 0;
int e_end = f;
int mx = m * Nv;
int ax = nx + fx + mx;
// separate computation from global update
for (int e = e_begin; e < e_end; e++) {
ExEx_local[e] =
compute_inner_contraction_0(X,T,n,f,m,e,No,Nv);
}
// now do global update
for (int e = e_begin; e < e_end; e++) {
#pragma omp atomic update
ExEx_block[ax + e] += ExEx_local[e];
}
}
}
}
}
```