user3331388 - 3 months ago
C Question

OpenMP for loop gives wrong results for more than one thread

double compute_inner_contraction_0( double* X, double* T, int n, int f, int m, int e, int No, int Nv )
{
    double contraction_value = 0.0;
    int l_begin = 0;
    int l_end = m;
    for ( int l = l_begin; l < l_end; l++ )
    {
        int k_begin = 0;
        int k_end = l;
        for ( int k = k_begin; k < k_end; k++ )
        {
            contraction_value += (-1)*X[n * Nv * No * No * No * No + e * No * No * No * No + m * No * No * No + l * No * No + l * No + k]*T[k * Nv + f];
        }
    }
    return contraction_value;
}

void compute_contraction_0( double* X, double* T, int No, int Nv, double* ExEx_block, int nthd )
{
    omp_set_dynamic(0);
    omp_set_num_threads(nthd);
    #pragma omp parallel for
    for ( int n = 0; n < No; n++ )
    {
        int f_begin = 0;
        int f_end = Nv;
        for ( int f = f_begin; f < f_end; f++ )
        {
            int m_begin = 0;
            int m_end = n;
            for ( int m = m_begin; m < m_end; m++ )
            {
                int e_begin = 0;
                int e_end = f;
                for ( int e = e_begin; e < e_end; e++ )
                {
                    ExEx_block[n * Nv * No * Nv + f * No * Nv + m * Nv + e] += compute_inner_contraction_0( X, T, n, f, m, e, No, Nv );
                }
            }
        }
    }
}


Hi guys, sorry for the long functions. Please don't worry about the indexing causing memory errors; it has been extensively tested. Basically the code takes numbers from the arrays X and T, multiplies them, and adds the result to another array, ExEx_block. It runs on multiple threads, but gives very wrong computation results compared to running on one thread (which is correct). Could you please help? It's built as a shared library (with many other similar functions) and loaded from Python wrappers. The OS is CentOS 6 and the compiler is gcc-4.4.7.
It's a quad Xeon E5-4620 (Sandy Bridge-EP) server (32 cores in total) with 480 GB of DDR3. No and Nv are just integers needed in the calculation. nthd (the number of threads) is usually 10 or 20. The size of these arrays can go up to 17 GB. Thank you very much for your time.

Answer

I've analyzed your code, and you have a race condition when updating ExEx_block. There is no other way you could be getting erroneous results, since everything else is just reading data.

Just so we're on the same page, I've written up an explanation of what I mean by a race condition.

I've also produced a cleaned-up and simplified version of your program, without any fixes. But it lays the groundwork for the fixed version.

And, a second version that adds the sample omp directives I think you'll need (e.g. omp private and omp atomic).

Note that I didn't try compiling these examples, so consider them a reference only.


In your program, the race condition occurs if two different threads try to update ExEx_block[i] at the same time. To simplify, let's use an int scalar (e.g. target) instead. The principle is the same.

Assume target has a value of 5, and that we have two threads, Ta and Tb. Ta wants to do target += 3 and Tb wants to do target += 7. Also assume each thread has a private variable, ta and tb respectively.

The Ta action is actually three operations:

ta = target;
ta += 3;
target = ta;

Likewise for Tb:

tb = target;
tb += 7;
target = tb;

As long as these are disjoint, things are fine and we'll [eventually] get a target value of 15.

But, if any of the operations are interspersed, we might get either 8 or 12 instead of the correct value of 15:

ta = target;    // Ta reads 5
tb = target;    // Tb also reads 5, before Ta has stored its result

ta += 3;
target = ta;    // target is now 8

tb += 7;
target = tb;    // Tb's stale value overwrites it: target is 12, Ta's +3 is lost
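
If you want to see the effect in isolation, here is a small stand-alone program (not taken from your code, just an illustration of lost updates) that races on a shared counter the same way. Compile it with something like gcc -std=gnu99 -fopenmp and run it a few times; the total usually comes out below 1000000:

#include <stdio.h>

int main(void)
{
    long target = 0;

    // every iteration does the same unprotected read-modify-write on the
    // shared variable, so concurrent threads lose each other's updates
    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++)
        target += 1;    // adding "#pragma omp atomic" right above this line fixes it

    printf("target = %ld (expected 1000000)\n", target);
    return 0;
}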

Simplified version of your program.

Note that I grouped things with parentheses and used helpers like int No4 = No * No * No * No so the index expressions are simpler and more readable. That's not strictly required, but it helped me spot the problem and the solution. I tried to be careful, but I could easily have messed up a grouping.

double
compute_inner_contraction_0(double *X,double *T,int n,int f,int m,int e,
    int No,int Nv)
{
    double contraction_value = 0.0;
    int l_begin = 0;
    int l_end = m;
    int No2 = No * No;
    int No3 = No2 * No;
    int No4 = No3 * No;

    for (int l = l_begin; l < l_end; l++) {
        int k_begin = 0;
        int k_end = l;

        for (int k = k_begin; k < k_end; k++) {
            contraction_value += (-1) *
                X[(n * Nv * No4) + (e * No4) + (m * No3) + (l * No2) +
                (l * No) + k] * T[(k * Nv) + f];
        }
    }

    return contraction_value;
}

void
compute_contraction_0(double *X,double *T,int No,int Nv,double *ExEx_block,
    int nthd)
{
    omp_set_dynamic(0);
    omp_set_num_threads(nthd);
    int NoNv = Nv * No;
    int NoNv2 = NoNv * Nv;

    #pragma omp parallel for
    for (int n = 0; n < No; n++) {
        int f_begin = 0;
        int f_end = Nv;
        int nx = n * NoNv2;

        for (int f = f_begin; f < f_end; f++) {
            int m_begin = 0;
            int m_end = n;
            int fx = f * NoNv;

            for (int m = m_begin; m < m_end; m++) {
                int e_begin = 0;
                int e_end = f;
                int mx = m * Nv;
                int ax = nx + fx + mx;

                for (int e = e_begin; e < e_end; e++) {
                    ExEx_block[ax + e] +=
                        compute_inner_contraction_0(X,T,n,f,m,e,No,Nv);
                }
            }
        }
    }
}

An example of what you'll need to do to prevent the race condition.

For efficiency, I split the e loop into two loops: one that does the heavy computation and one that updates the global array. This helps minimize threads stalling and waiting on one another during the heavy computation phase. The "locked" update phase is a simple, quick loop.

You'll need a per-thread local array using omp's private directive and need to add omp atomic to the loop that updates ExEx_block.

Caveat: Although I've done a lot of multithread programming using pthreads and know how to fix race conditions, etc., I'm not very familiar with the specific equivalent omp directives.

So the following is my best guess as to what you'll need, based on my reading of the omp docs. Please check the docs thoroughly for the directives I'm using here.

double
compute_inner_contraction_0(double *X,double *T,int n,int f,int m,int e,
    int No,int Nv)
{
    double contraction_value = 0.0;
    int l_begin = 0;
    int l_end = m;
    int No2 = No * No;
    int No3 = No2 * No;
    int No4 = No3 * No;

    for (int l = l_begin; l < l_end; l++) {
        int k_begin = 0;
        int k_end = l;

        for (int k = k_begin; k < k_end; k++) {
            contraction_value += (-1) *
                X[(n * Nv * No4) + (e * No4) + (m * No3) + (l * No2) +
                (l * No) + k] * T[(k * Nv) + f];
        }
    }

    return contraction_value;
}

void
compute_contraction_0(double *X,double *T,int No,int Nv,double *ExEx_block,
    int nthd)
{
    omp_set_dynamic(0);
    omp_set_num_threads(nthd);
    int NoNv = Nv * No;
    int NoNv2 = NoNv * Nv;

    // ExEx_local must be at least Nv:
    // NOTE: placement here may be wrong
    double ExEx_local[Nv];

    #pragma omp parallel for private(ExEx_local)
    for (int n = 0; n < No; n++) {
        int f_begin = 0;
        int f_end = Nv;
        int nx = n * NoNv2;

        for (int f = f_begin; f < f_end; f++) {
            int m_begin = 0;
            int m_end = n;
            int fx = f * NoNv;

            for (int m = m_begin; m < m_end; m++) {
                int e_begin = 0;
                int e_end = f;
                int mx = m * Nv;
                int ax = nx + fx + mx;

                // separate computation from global update
                for (int e = e_begin; e < e_end; e++) {
                    ExEx_local[e] =
                        compute_inner_contraction_0(X,T,n,f,m,e,No,Nv);
                }

                // now do the global update
                for (int e = e_begin; e < e_end; e++) {
                    // plain "omp atomic" here: the "update" clause was only
                    // added in OpenMP 3.1, which gcc-4.4.7 doesn't support
                    #pragma omp atomic
                    ExEx_block[ax + e] += ExEx_local[e];
                }
                }
            }
        }
    }
}
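
One more note on the private(ExEx_local) clause: variable-length arrays in data-sharing clauses can be iffy on older compilers, which is part of why I flagged its placement above. An alternative I've used (sketched below on a made-up toy problem, not your actual contraction) is to open the parallel region yourself and allocate the scratch buffer inside it, so it is automatically private to each thread; it also shows the same compute-then-update split with omp atomic:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int Nv = 8;
    double result[8] = { 0 };

    #pragma omp parallel
    {
        // declared (and allocated) inside the parallel region,
        // so every thread gets its own scratch buffer
        double *scratch = malloc(Nv * sizeof(double));

        #pragma omp for
        for (int n = 0; n < 100; n++) {
            // phase 1: heavy computation into the private buffer
            for (int e = 0; e < Nv; e++)
                scratch[e] = (double)(n + e);

            // phase 2: quick locked update of the shared array
            for (int e = 0; e < Nv; e++) {
                #pragma omp atomic
                result[e] += scratch[e];
            }
        }

        free(scratch);
    }

    for (int e = 0; e < Nv; e++)
        printf("result[%d] = %g\n", e, result[e]);

    return 0;
}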