
Multi-threading vs single thread calculations

import random
import threading
import time

def dowork():
    y = []
    z = []
    ab = 0
    start_time = time.time()
    t = threading.current_thread()

    # build two lists of 1500 random ints
    for x in range(0, 1500):
        y.append(random.randint(0, 100000))
    for x in range(0, 1500):
        z.append(random.randint(0, 1000))
    # CPU-bound work: 100 passes of large-integer exponentiation
    for x in range(0, 100):
        for k in range(0, len(z)):
            ab += y[k] ** z[k]
    print(" %.50s..." % ab)
    print("--- %.6s seconds --- %s" % (time.time() - start_time, t.name))

# do the work!
threads = []
for x in range(0, 4):  # 4 threads
    threads.append(threading.Thread(target=dowork))

for x in threads:
    x.start()  # and they are off


Results:

23949968699026357507152486869104218631097704347109...
--- 11.899 seconds --- Thread-2
10632599432628604090664113776561125984322566079319...
--- 11.924 seconds --- Thread-4
20488842520966388603734530904324501550532057464424...
--- 12.073 seconds --- Thread-1
17247910051860808132548857670360685101748752056479...
--- 12.115 seconds --- Thread-3
[Finished in 12.2s]


And now let's do it in 1 thread:

def dowork():
    y = []
    z = []
    ab = 0
    start_time = time.time()
    t = threading.current_thread()

    for x in range(0, 1500):
        y.append(random.randint(0, 100000))
    for x in range(0, 1500):
        z.append(random.randint(0, 1000))
    for x in range(0, 100):
        for k in range(0, len(z)):
            ab += y[k] ** z[k]
    print(" %.50s..." % ab)
    print("--- %.6s seconds --- %s" % (time.time() - start_time, t.name))

# print(threadtest())
threads = []
for x in range(0, 4):
    threads.append(True)

for x in threads:
    dowork()


Results:

14283744921265630410246013584722456869128720814937...
--- 2.8463 seconds --- MainThread
13487957813644386002497605118558198407322675045349...
--- 2.7690 seconds --- MainThread
15058500261169362071147461573764693796710045625582...
--- 2.7372 seconds --- MainThread
77481355564746169357229771752308217188584725215300...
--- 2.7168 seconds --- MainThread
[Finished in 11.1s]


Why do the single-threaded and multi-threaded scripts have the same processing time?
Shouldn't the multi-threaded version take roughly 1/(number of threads) as long? (I know there are diminishing returns once you exceed the CPU's available threads.)

Did I mess up my implementation?

Answer

Multithreading in Python does not work the way it does in many other languages: CPython's global interpreter lock (GIL) lets only one thread execute Python bytecode at a time, so CPU-bound threads run one after another instead of in parallel. There are a number of workarounds, though; for I/O-bound work you can use gevent's coroutine-based "threads", and I myself prefer dask for work that needs to run concurrently. For example:

import dask.bag as db

start = time.time()
(db.from_sequence(range(4), npartitions=4)
   .map(lambda _: dowork())
   .compute())
print('total time: {} seconds'.format(time.time() - start))

start = time.time()
threads = []
for x in range(0, 4):
    threads.append(True)

for x in threads:
    dowork()
print('total time: {} seconds'.format(time.time() - start))

and the output

 19016975777667561989667836343447216065093401859905...
--- 2.4172 seconds --- MainThread
 32883203981076692018141849036349126447899294175228...
--- 2.4685 seconds --- MainThread
 34450410116136243300565747102093690912732970152596...
--- 2.4901 seconds --- MainThread
 50964938446237359434550325092232546411362261338846...
--- 2.5317 seconds --- MainThread
total time: 2.5557193756103516 seconds
 10380860937556820815021239635380958917582122217407...
--- 2.3711 seconds --- MainThread
 13309313630078624428079401365574221411759423165825...
--- 2.2861 seconds --- MainThread
 27410752090906837219181398184615017013303570495018...
--- 2.2853 seconds --- MainThread
 73007436394172372391733482331910124459395132986470...
--- 2.3136 seconds --- MainThread
total time: 9.256525993347168 seconds

In this case dask uses multiprocessing to do the work, which may or may not be desirable for your case.
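
If you would rather not add a dependency, the standard library's concurrent.futures can give a similar process-based speedup. This is a minimal sketch, assuming the dowork() function defined above is importable at module level; the pool size of 4 just mirrors the example:

# Minimal sketch: run dowork() in separate processes so the GIL does not
# serialize the CPU-bound loops (assumes dowork() is defined at module level).
import time
from concurrent.futures import ProcessPoolExecutor

if __name__ == "__main__":
    start = time.time()
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(dowork) for _ in range(4)]
        for f in futures:
            f.result()  # wait for completion and surface any exceptions
    print('total time: {} seconds'.format(time.time() - start))

On a machine with at least four cores this should finish in roughly the time of a single dowork() call, plus the overhead of starting the worker processes.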

Also, instead of CPython you can try other implementations of Python, for example PyPy or Stackless Python, which are claimed to provide workarounds for or solutions to the problem.
