Patrick · 3 months ago
Python Question

Spark execution time vs number of nodes on AWS EMR

I'm new to Spark. I tried to run a simple application on Amazon EMR (the Python pi approximation found here), first with 1 worker node and then with 2 worker nodes (m4.large). The elapsed time to complete the task is approximately 25 seconds in both cases. Naively, I was expecting something like a 1.5x gain with 2 nodes. Am I being naive? Is this normal?
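For context, the linked example estimates π by Monte Carlo: sample points uniformly in the square [-1, 1]², count the fraction that lands inside the unit circle, and multiply by 4. A minimal non-Spark sketch of that computation (the Spark version merely distributes the same loop across partitions):

```python
from random import random, seed

def approx_pi(n):
    # Count samples falling inside the unit circle.
    inside = 0
    for _ in range(n):
        x = random() * 2 - 1
        y = random() * 2 - 1
        if x ** 2 + y ** 2 < 1:
            inside += 1
    # (circle area) / (square area) = pi / 4, so scale by 4.
    return 4.0 * inside / n

seed(0)  # fixed seed, for reproducibility only
print(approx_pi(100000))  # roughly 3.14
```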


Let's make a simple experiment:

from functools import reduce
from operator import add
from random import random  # missing in the original snippet

# Taken from the linked example.

n = 100000

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

# %timeit is an IPython magic; run this in an IPython shell.
%timeit -n 100 reduce(add, (f(x) for x in range(n)))

The result I get on fairly old hardware:

100 loops, best of 3: 132 ms per loop

This is roughly the processing time you should expect for a single partition, and it is comparable to typical task-scheduling overhead.
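To put numbers on that: even if the partitions ran serially, total compute time for the whole job would be a few hundred milliseconds, a tiny fraction of the observed 25 seconds. A back-of-the-envelope check (the partition count is an assumption; on a small EMR cluster the default parallelism is typically the total core count):

```python
per_partition_s = 0.132   # measured above
partitions = 2            # assumed default parallelism
observed_s = 25.0         # reported in the question

# Upper bound on compute time: all partitions run serially.
compute_s = per_partition_s * partitions
overhead_fraction = 1 - compute_s / observed_s
print(f"compute ~{compute_s:.2f}s, overhead ~{overhead_fraction:.0%}")
```

So roughly 99% of the wall-clock time is overhead, not computation.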

Conclusion? What you are measuring is cluster and application latency (context initialization, scheduling delays, context teardown), not processing time.
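The practical consequence: adding nodes only helps once the actual work dwarfs the fixed startup and scheduling cost. A toy model (fixed overhead plus perfectly parallelizable work; both figures are assumed for illustration, with the overhead chosen to roughly match the question) shows why going from 1 to 2 workers changes nothing at this scale:

```python
def elapsed(work_s, workers, overhead_s=24.7):
    # Fixed cluster latency plus perfectly parallelizable work.
    # overhead_s is an assumed figure, not a measurement.
    return overhead_s + work_s / workers

# Tiny job (like the pi example): worker count barely matters.
small_1 = elapsed(0.3, 1)
small_2 = elapsed(0.3, 2)

# A job with 10 minutes of compute: 2 workers nearly halve the wall time.
big_1 = elapsed(600, 1)
big_2 = elapsed(600, 2)

print(small_1, small_2)  # ~25 s either way
print(big_1 / big_2)     # speedup approaching 2x
```

In other words, rerun the comparison with a much larger `n` (or many more partitions) and the second node will start to pay off.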