Olian04 - 29 days ago
Python Question

Why is a ThreadPoolExecutor with one worker still faster than normal execution?

I'm using this library, Tomorrow, which in turn uses ThreadPoolExecutor from the standard library, to allow for asynchronous function calls.

Decorating a function with @tomorrow.threads(1) spins up a ThreadPoolExecutor with 1 worker.
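For intuition, a decorator like this can be sketched with concurrent.futures directly (a minimal, hypothetical version, not Tomorrow's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import wraps

def threads(n):
    """Sketch of what a decorator like tomorrow.threads might do:
    submit the call to a pool and return a future immediately.
    (Illustrative only; the pool is never explicitly shut down here.)"""
    executor = ThreadPoolExecutor(max_workers=n)
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            return executor.submit(func, *args, **kwargs)
        return wrapper
    return decorator

@threads(1)
def work(x):
    return x * 2

future = work(21)       # returns a Future immediately; work runs in the pool
print(future.result())  # blocks until the job finishes, prints 42
```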

Question

  • Why is it faster to execute a function using 1 thread worker than just calling it as is (i.e. normally)?

  • Why is it slower to execute the same code with 10 thread workers instead of just 1, or even none?



Demo code



import glob
import time
import tomorrow

def openSync(path: str):
    for row in open(path):
        for _ in row:
            pass

@tomorrow.threads(1)
def openAsync1(path: str):
    openSync(path)

@tomorrow.threads(10)
def openAsync10(path: str):
    openSync(path)

def openAll(paths: list):
    def do(func: callable) -> float:
        t = time.time()
        [func(p) for p in paths]
        t = time.time() - t
        return t
    print(do(openSync))
    print(do(openAsync1))
    print(do(openAsync10))

openAll(glob.glob("data/*"))


Note: The data folder contains 18 files, each 700 lines of random text.


Output



0 workers: 0.0120 seconds

1 worker: 0.0009 seconds

10 workers: 0.0535 seconds


What I've tested




  • I've run the code more than a couple dozen times, with different programs running in the background (a bunch yesterday, and a couple today). The numbers change, of course, but the order is always the same (i.e. 1 worker is fastest, then 0, then 10).

  • I've also tried changing the order of execution (e.g. moving the do calls around) to eliminate caching as a factor, but the result stays the same.


    • It turns out that executing in the order 10, 1, None results in a different ranking (1 is fastest, then 10, then 0) compared to every other permutation. This shows that whichever do call is executed last is considerably slower than it would have been had it been executed first or in the middle.




Results (After receiving solution from @Dunes)



0 workers: 0.0122 seconds

1 worker: 0.0214 seconds

10 workers: 0.0296 seconds

Answer

When you call one of your async functions it returns a "future" object (an instance of tomorrow.Tomorrow in this case). This allows you to submit all your jobs without having to wait for them to finish. However, you never actually wait for the jobs to finish. So all do(openAsync1) does is time how long it takes to set up all the jobs (which should be very fast). For a more accurate test you need to do something like:

def openAll(paths: list):
    def do(func: callable)->float:
        t = time.time()
        # do all jobs if openSync, else start all jobs if openAsync
        results = [func(p) for p in paths]
        # if openAsync, the following waits until all jobs are finished
        if func is not openSync:
            for r in results:
                r._wait()
        t = time.time() - t
        return t
    print(do(openSync))
    print(do(openAsync1))
    print(do(openAsync10))

openAll(glob.glob("data/*"))
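The same effect can be reproduced without the Tomorrow library, using concurrent.futures directly. Here a short sleep stands in for reading a file; the job and timings are illustrative:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_job(x):
    time.sleep(0.05)   # stand-in for a blocking file read
    return x

with ThreadPoolExecutor(max_workers=1) as ex:
    # Submitting returns immediately: this loop finishes in microseconds.
    t = time.time()
    futures = [ex.submit(slow_job, i) for i in range(5)]
    submit_time = time.time() - t

    # Waiting on the results takes the full ~0.25 s (5 jobs on 1 worker).
    t = time.time()
    results = [f.result() for f in futures]
    wait_time = time.time() - t

print(f"submit: {submit_time:.4f}s, wait: {wait_time:.4f}s")
```

Timing only the submission loop, as the original do did, would make the threaded version look almost instantaneous.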

Using additional threads in Python generally slows things down. This is because of the global interpreter lock (GIL), which means only one thread can execute Python bytecode at a time, regardless of how many cores the CPU has.
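A quick illustration of the GIL's effect on a CPU-bound, pure-Python task (the loop and iteration count are arbitrary; on a standard CPython build the threaded version will not be faster):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def count(n):
    # Pure-Python CPU-bound loop; holds the GIL while it runs.
    while n > 0:
        n -= 1

N = 2_000_000

# Sequential: two runs back to back.
t = time.time()
count(N)
count(N)
sequential = time.time() - t

# Threaded: two runs in parallel threads. With the GIL, only one thread
# executes bytecode at a time, so this is not faster (often slower,
# due to the cost of switching between threads).
t = time.time()
with ThreadPoolExecutor(max_workers=2) as ex:
    list(ex.map(count, [N, N]))
threaded = time.time() - t

print(f"sequential: {sequential:.3f}s, threaded: {threaded:.3f}s")
```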

However, things are complicated by the fact that your job is IO bound. More worker threads might speed things up, because a single thread might spend more time waiting for the hard drive to respond than is lost to context switching between the various threads in the multi-threaded variant.
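A sketch of the IO-bound case, with time.sleep standing in for a blocking read (sleep releases the GIL just as real IO would, so the waits can overlap):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(_):
    # Releases the GIL while sleeping, like a real blocking read.
    time.sleep(0.05)

# Sequential: 10 waits of 0.05 s happen one after another (~0.5 s).
t = time.time()
for i in range(10):
    fake_io(i)
sequential = time.time() - t

# Threaded: with 10 workers all 10 waits overlap (~0.05 s plus overhead).
t = time.time()
with ThreadPoolExecutor(max_workers=10) as ex:
    list(ex.map(fake_io, range(10)))
threaded = time.time() - t

print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```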

Side note: even though neither openAsync1 nor openAsync10 waits for jobs to complete, do(openAsync10) is probably slower because it requires more synchronisation between threads when submitting a new job.