Anthony Jack Anthony Jack - 24 days ago 11
C Question

OpenMP parallel for loop issue with shared pointer to pointers

I am just learning OpenMP and am having a problem parallelizing a for loop that I"m pretty sure meets the criteria for parallelization. Below is a SSCCE that exhibits the issue I'm having in my actual code.

The problem is that when I compile and run this with OpenMP the contents of wr[i][j] and wi[i][j] are always 1.00 and 0.00 respectively. If I comment out the pragma and run the code serially then wr and wi contain the values that I expect. The code still exhibits the problem even if I run it with just 1 thread after being compiled with OpenMP. The pragma has to be commented out to run it serially.

I think my problem is related to the fact that wr and wi are pointers to pointers. If that's the case, how do I work around it? If that's not the problem what is? Any insights are greatly appreciated.

// compilation gcc -fopenmp -o test test.c -lm
// usage ./test <Problem Size (Must be a power of 2)> <Num Threads>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main (int argc, char *argv[]){
double **wr, **wi;
double *data;
double wtemp, wpr, wpi, theta;
int i, j, n, nn, mmax, istep, num_mmax_iterations, T;
const int isign = -1;

if(argc < 3){
printf("ERROR, usage: %s <N (must be a power of 2)> <T>\n", argv[0]);
exit(1);
} else {
n = atoi(argv[1]);
T = atoi(argv[2]);
}

num_mmax_iterations = log2(n);
wr = malloc(sizeof(double *) *num_mmax_iterations);
wi = malloc(sizeof(double *) *num_mmax_iterations);

mmax = 2;
for(i = 0; i < num_mmax_iterations; i++) {
wr[i] = malloc(sizeof(double) * (mmax >> 1));
wi[i] = malloc(sizeof(double) * (mmax >> 1));
wr[i][0] = 1.0;
wi[i][0] = 0.0;
mmax <<= 1;
}

mmax = 2;
#pragma omp parallel for private(i, j, theta, wtemp, wpr, wpi) firstprivate(mmax) shared(num_mmax_iterations, wr, wi) num_threads(T)
for(i = 0; i < num_mmax_iterations; i++) {
theta = isign * (6.28318530717959 / mmax);
wtemp = sin(0.5 * theta);
wpr = -2.0 * wtemp * wtemp;
wpi = sin(theta);

for(j = 1; j < (mmax >> 1); j++) {
wr[i][j] = wr[i][j - 1] * wpr - wi[i][j - 1] * wpi + wr[i][j - 1];
wi[i][j] = wi[i][j - 1] * wpr + wr[i][j - 1] * wpi + wi[i][j - 1];
}

mmax <<= 1;
}

for(i = 0; i < num_mmax_iterations; i++) {
for(j = 0; j < (mmax >> 1); j++) {
printf("%.2f | %.2f\n", wr[i][j], wi[i][j]);
}
}

for(i = 0; i < num_mmax_iterations; i++) {
free(wr[i]);
free(wi[i]);
}
free(wr);
free(wi);
return 0;
}


Edit

Expected output with the pragma removed running ./test 8 1

1.00 | 0.00
0.00 | 0.00
0.00 | 1.00
0.00 | -0.00
1.00 | 0.00
-0.00 | -1.00
0.00 | 1.00
-1.00 | 0.71
1.00 | -0.00
0.71 | -0.71
-0.00 | 0.00
-0.71 | -0.71
0.00 | -1.00
-0.71 | -0.71
-1.00 | 1.00
-0.71 | 0.92
1.00 | 0.00
-0.00 | -1.00
0.00 | 1.00
-1.00 | 0.71
1.00 | -0.00
0.71 | -0.71
-0.00 | 0.00
-0.71 | -0.71
0.00 | -1.00
-0.71 | -0.71
-1.00 | 1.00
-0.71 | 0.92
1.00 | 0.71
0.92 | 0.38
0.71 | -0.00
0.38 | -0.38
1.00 | 0.00
0.71 | -0.71
-0.00 | -1.00
-0.71 | -0.71
0.00 | 1.00
-0.71 | 0.92
-1.00 | 0.71
-0.71 | 0.38
1.00 | -0.00
0.92 | -0.38
0.71 | -0.71
0.38 | -0.92
-0.00 | 0.00
-0.38 | -0.38
-0.71 | -0.71
-0.92 | -0.92
1.00 | 0.00
0.92 | -0.38
0.71 | -0.71
0.38 | -0.92
-0.00 | -1.00
-0.38 | -0.92
-0.71 | -0.71
-0.92 | -0.38
0.00 | 0.00
-0.38 | 0.00
-0.71 | 0.00
-0.92 | 0.00
-1.00 | 0.00
-0.92 | 0.00
-0.71 | 0.00
-0.38 | 0.00


Actual output with the pragma left intact running ./test 8 (any number of threads)

1.00 | 0.00
1.00 | 0.00
1.00 | 0.00


Note: The actual problem sizes are much larger and the expected data usually varies more than it does above.

Answer

When the code is run serially, the value mmax is modified at each iteration of the outer loop. Your code needs to be modified so that each thread calculates the correct value of mmax for itself. Additionally, there outer loop variable i should be shared between threads, not private. Here's a suggestion for your openMP loop which gives me the same results with and without the pragma.

#pragma omp parallel for private( j, theta, wtemp, wpr, wpi,mmax) num_threads(T)
for(i = 0; i < num_mmax_iterations; i++) {
    mmax = pow(2,i+1);
    theta = isign * (6.28318530717959 / mmax);
    wtemp = sin(0.5 * theta);
    wpr = -2.0 * wtemp * wtemp;
    wpi = sin(theta);       

    for(j = 1; j < (mmax >> 1); j++) {
        wr[i][j] = wr[i][j - 1] * wpr - wi[i][j - 1] * wpi + wr[i][j - 1];
        wi[i][j] = wi[i][j - 1] * wpr + wr[i][j - 1] * wpi + wi[i][j - 1];
    }

}
// mmax has to be set at the end since there is no guarantee of which thread will finish last
mmax = pow(2,1+num_mmax_iterations);
Comments