bhjghjh bhjghjh - 1 month ago 40

Parallelization for Monte Carlo pi approximation

I am writing a C program to parallelize a Monte Carlo pi approximation with OpenMP. I think my code works fine, and the output looks convincing. I am running it with 4 threads. What I am not sure about is whether this code is vulnerable to a race condition, and if it is, how do I coordinate the threads' actions in this code?

The code looks as follows:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>

double sample_interval(double a, double b) {
    double x = ((double) rand())/((double) RAND_MAX);
    return (b-a)*x + a;
}

int main (int argc, char **argv) {

    int N = atoi( argv[1] ); // convert command-line input to N = number of points
    int i;
    int NumThreads = 4;
    const double pi = 3.141592653589793;
    double x, y, z;
    double counter = 0;

    #pragma omp parallel firstprivate(x, y, z, i) reduction(+:counter) num_threads(NumThreads)
    {
        srand(time(NULL));
        for (int i=0; i < N; ++i)
        {
            x = sample_interval(-1.,1.);
            y = sample_interval(-1.,1.);
            z = ((x*x)+(y*y));

            if (z <= 1)
            {
                counter++;
            }
        }
    }

    double approx_pi = 4.0 * counter / (double)N;

    printf("%i %1.6e %1.6e\n ", N, 4.0 * counter / (double)N, fabs(4.0 * counter / (double)N - pi) / pi);

    return 0;
}


Also, I was wondering whether the seed for the random number generator should be set inside or outside the parallel region. My output looks like this:

10 3.600000e+00 1.459156e-01
100 3.160000e+00 5.859240e-03
1000 3.108000e+00 1.069287e-02
10000 3.142400e+00 2.569863e-04
100000 3.144120e+00 8.044793e-04
1000000 3.142628e+00 3.295610e-04
10000000 3.141379e+00 6.794439e-05
100000000 3.141467e+00 3.994585e-05
1000000000 3.141686e+00 2.971945e-05


This looks OK for now. Your suggestions for the race condition and seed placement are most welcome.

Answer

There are a few problems in your code that I can see. The main one, from my standpoint, is that it isn't parallelized. Or more precisely, you didn't enable the parallelism you introduced with OpenMP when compiling. Here is how one can see that:

The way the code is parallelized, the main for loop should be executed in full by all the threads (there is no worksharing here, no #pragma omp parallel for, only a #pragma omp parallel). Therefore, considering you set the number of threads to be 4, the global number of iterations should be 4*N. Thus, your output should slowly converge towards 4*Pi, not towards Pi.

Indeed, I tried your code on my laptop, compiled it with OpenMP support, and that is pretty much what I get. However, when I don't enable OpenMP, I get an output similar to yours. So in conclusion, you need to:

  1. Enable OpenMP at compilation time for getting a parallel version of your code.
  2. Divide your result by NumThreads to get a "valid" approximation of Pi (or distribute your loop over N with a #pragma omp for, for example).

But that is if / when your code is correct elsewhere, which it isn't yet. As BitTickler already hinted, rand() isn't thread-safe. So you have to go for another random number generator, one that allows you to privatize its state. That could be rand_r() for example. That said, this still has quite a few issues:

  1. rand() / rand_r() is a terrible RNG in terms of randomness and periodicity. As you increase your number of tries, you'll rapidly exceed the period of the RNG and repeat the same sequence over and over again. You need something more robust to do anything remotely serious.
  2. Even with a "good" RNG, the parallelism aspect can be an issue in the sense that you want your parallel sequences to be uncorrelated with each other. And just using a different seed value per thread doesn't guarantee that (although with a wide-enough RNG, you have a bit of headroom there).

Anyway, bottom line is:

  • Use a better thread-safe RNG (I find drand48_r() or random_r() to be OK for toy codes on Linux)
  • Initialize its state per thread, based on the thread id for example, while keeping in mind that this won't ensure proper decorrelation of the random series in some circumstances (and the more often you call the RNG, the more likely you are to eventually get overlapping series).

This done (along with a few minor fixes), your code becomes for example as follows:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>

typedef struct drand48_data RNGstate;

double sample_interval(double a, double b, RNGstate *state) {
    double x;
    drand48_r(state, &x);
    return (b-a)*x + a;
}

int main (int argc, char **argv) {

    int N = atoi( argv[1] ); // convert command-line input to N = number of points
    int NumThreads = 4;
    const double pi = 3.141592653589793;
    double x, y, z;
    double counter = 0;
    time_t seed = time(NULL); // don't call it ctime: that shadows the standard ctime() function

    #pragma omp parallel private(x, y, z) reduction(+:counter) num_threads(NumThreads)
    {
        RNGstate state;
        srand48_r(seed + omp_get_thread_num(), &state);
        for (int i=0; i < N; ++i) {
            x = sample_interval(-1, 1, &state);
            y = sample_interval(-1, 1, &state);
            z = ((x*x)+(y*y));

            if (z<= 1) {
                counter++;
            }
        }

    } 
    double approx_pi = 4.0 * counter / ((double)NumThreads * N); // cast first: NumThreads*N overflows int for large N

    printf("%i %1.6e %1.6e\n ", N, approx_pi, fabs(approx_pi - pi) / pi);

    return 0;
}

Which I compile like this:

gcc -std=gnu99 -fopenmp -O3 -Wall pi.c -o pi_omp