I'm new to Spark. I tried to run a simple application on Amazon EMR (Python pi approximation found here) with 1 worker node and in a second phase with 2 worker nodes (m4.large). Elapsed time to complete the task is approximately 25 seconds each time. Naively, I was expecting something like a 1.5x gain with 2 nodes. Am I naive? Is it normal?
Let's make a simple experiment:
from functools import reduce
from operator import add
from random import random  # missing in the original snippet

# Taken from the linked example.
n = 100000

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

%timeit -n 100 reduce(add, (f(x) for x in range(n)))
The result I get using quite old hardware:
100 loops, best of 3: 132 ms per loop
This is roughly the expected processing time for a single partition, and the value we get is comparable to the task-scheduling overhead alone.
Conclusion? What you measure is cluster and application latency (context initialization, scheduling delays, context teardown), not processing time.
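To see why adding a node barely moves the needle, here is a toy model (the numbers are illustrative assumptions, not measurements from your cluster): if the job's total wall-clock time is a large fixed overhead plus a tiny compute portion, only the compute portion shrinks as nodes are added.

```python
# Illustrative toy model: fixed cluster overhead + parallelizable compute.
# Both constants are assumptions chosen to match the scale of the question.
OVERHEAD_S = 24.0   # context init, scheduling, teardown (does not parallelize)
COMPUTE_S = 0.132   # actual work for the whole job (from the timeit experiment)

def wall_time(nodes: int) -> float:
    """Estimated total run time: fixed overhead plus compute split across nodes."""
    return OVERHEAD_S + COMPUTE_S / nodes

print(f"1 node:  {wall_time(1):.2f} s")
print(f"2 nodes: {wall_time(2):.2f} s")
```

Doubling the nodes saves only ~66 ms out of ~24 s here, which is invisible at the resolution you are measuring. To actually observe scaling, make the compute portion dominate, for example by increasing `n` by several orders of magnitude.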