cnandreu - 1 month ago
C Question

Recursive Fib with Threads, Segmentation Fault?

Any ideas why it works fine for values like 0, 1, 2, 3, 4... and seg faults for values like >15?

void *fib(void *fibToFind);

pthread_t mainthread;

long fibToFind = 15;
long finalFib;

pthread_create(&mainthread, NULL, fib, (void *)fibToFind);

printf("The number is: %d\n", finalFib);

void *fib(void *fibToFind){
    long retval;

    long newFibToFind = ((long)fibToFind);

    long returnMinusOne;
    long returnMinustwo;

    pthread_t minusone;
    pthread_t minustwo;

    if(newFibToFind == 0 || newFibToFind == 1)
        return newFibToFind;

    long newFibToFind1 = ((long)fibToFind) - 1;
    long newFibToFind2 = ((long)fibToFind) - 2;

    pthread_create(&minusone, NULL, fib, (void *)newFibToFind1);
    pthread_create(&minustwo, NULL, fib, (void *)newFibToFind2);

    return returnMinusOne + returnMinustwo;
}




It runs out of memory (space for thread stacks), or out of valid thread handles.

You're asking for an awful lot of threads, which require lots of stack/context. Windows (and Linux) have a stupid "big [contiguous] stack" idea.

From the documentation on pthread_create: "On Linux/x86-32, the default stack size for a new thread is 2 megabytes." If you manufacture 10,000 threads, you need 20 GB of RAM. I built a version of OP's program, and it bombed with some 3500 (p)threads on Windows XP64.

See this SO thread for more details on why big stacks are a really bad idea: Why are stack overflows still a problem?

If you give up on big stacks, and implement a parallel language with heap allocation for activation records (our PARLANSE is one of these) the problem goes away.

Here's the first (sequential) program we wrote in PARLANSE:

(define fibonacci_argument 45)

(define fibonacci
   (lambda(function natural natural )function
   `Given n, computes nth fibonacci number'
      (ifthenelse (<= ? 1) ?
         (+ (fibonacci (-- ?))
            (fibonacci (- ? 2))
         )+
      )ifthenelse
   )lambda
)define

Here's an execution run on an i7:

 C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonaccisequential
 Starting Sequential Fibonacci(45)...Runtime: 33.752067 seconds
 Result: 1134903170

Here's the second, which is parallel:

(define coarse_grain_threshold 30) ; technology constant: tune to amortize fork overhead across lots of work

(define parallel_fibonacci
   (lambda (function natural natural )function 
   `Given n, computes nth fibonacci number'
      (ifthenelse (<= ? coarse_grain_threshold)
         (fibonacci ?)
         (let (;; [n natural ] [m natural ] )
            (value (|| (= m (parallel_fibonacci (-- ?)) )=
                       (= n (parallel_fibonacci (- ? 2)) )=
                   )||
                   (+ m n)
            )value
         )let
      )ifthenelse
   )lambda
)define

Making the parallelism explicit makes the programs a lot easier to write, too.

The parallel version we test by calling (parallel_fibonacci 45). Here is the execution run on the same i7 (which nominally has 8 processors, but is really 4 hyperthreaded cores, so it isn't quite 8 equivalent CPUs):

C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonacciparallelcoarse
Parallel Coarse-grain Fibonacci(45) with cutoff 30...Runtime: 5.511126 seconds
Result: 1134903170

A speedup near 6+, not bad for not-quite-8 processors. One of the other answers to this question ran the pthreads version; it took "a few seconds" (to blow up) computing Fib(18), and this is 5.5 seconds for Fib(45). This tells you pthreads is a fundamentally bad way to do lots of fine grain parallelism, because it has really, really high forking overhead. (PARLANSE is designed to minimize that forking overhead).

Here's what happens if you set the technology constant to zero (forks on every call to fib):

C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonacciparallel
Starting Parallel Fibonacci(45)...Runtime: 15.578779 seconds
Result: 1134903170

You can see that amortizing fork overhead is a good idea, even if you have fast forks.

Fib(45) produces a lot of grains. Heap allocation of activation records solves the OP's first-order problem (thousands of pthreads, each with 2 MB of stack, burn gigabytes of RAM).

But there's a second-order problem: the call tree for Fib(45) has billions of nodes (roughly 2*Fib(46), about 3.7 billion), so that many PARLANSE "grains" will burn all your memory too, just keeping track of them, even if each grain control block is tiny. So it helps to have a scheduler that throttles forks once you have "a lot" of grains (for some definition of "a lot" significantly less than billions), to prevent the explosion of parallelism from swamping the machine with grain-tracking data structures. It also has to unthrottle forks when the number of grains falls below a threshold, so there is always plenty of logical parallel work for the physical CPUs to do.