
Reduce execution time on huge list generation

I'm fairly new to Python, and I'm trying to generate some huge lists (filled with random letters). Currently it takes around 75-80 seconds on my machine for 2,000,000 lines.

import random, string

global_tab = []
global_nb_loop = 2000000

print("Generate %d lines" % global_nb_loop)
for x in range(global_nb_loop):
    # each line is a tuple of (15 random letters, 2 random digits)
    global_tab.append(("".join([random.choice(string.ascii_letters) for i in range(15)]),
                       "".join([random.choice(string.digits) for i in range(2)])))
print("%d lines generated" % len(global_tab))


And the result with the Linux time command:

$ time python3 DEV/PyETL/generateList.py
Generate 2000000 lines
2000000 lines generated

real 1m16.844s
user 1m16.609s
sys 0m0.203s


I was surprised, while monitoring system resources, to see that only one core was at 100%, instead of all four as on a Windows machine where I also tested this.

Of course I tried using threads, but I ran into a problem: it takes more time than running on a single core. Maybe threads are not the solution here, or maybe I'm using them incorrectly.

Here is the new code:

import random, string
import threading

global_tab = []
global_nb_threads = 4
global_nb_loop = 2000000


threadLock = threading.Lock()

class generateList(threading.Thread):
    def __init__(self, name):
        threading.Thread.__init__(self)
        self.name = name

    def run(self):
        global global_tab
        self.tab = []

        print("[%s] Generate %d lines" % (self.name, int(global_nb_loop / global_nb_threads)))
        # divide the desired number of lines by the number of threads
        for x in range(int(global_nb_loop / global_nb_threads)):
            self.tab.append(("".join([random.choice(string.ascii_letters) for i in range(15)]),
                             "".join([random.choice(string.digits) for i in range(2)])))

        threadLock.acquire()
        global_tab += self.tab
        threadLock.release()
        del self.tab
        print("[%s] %d lines in list" % (self.name, len(global_tab)))


threads = []
for i in range(global_nb_threads):
    # create and start the threads
    t = generateList("Thread-" + str(i))
    t.start()
    threads.append(t)

for t in threads:
    # wait for all threads to finish
    t.join()


And the execution:

$ time python3 DEV/PyETL/generateListThreads.py
[Thread-0] Generate 500000 lines
[Thread-1] Generate 500000 lines
[Thread-2] Generate 500000 lines
[Thread-3] Generate 500000 lines
[Thread-3] 500000 lines in list
[Thread-0] 1000000 lines in list
[Thread-2] 1500000 lines in list
[Thread-1] 2000000 lines in list
real 1m40.858s
user 1m41.208s
sys 0m0.916s


That is about 24 seconds slower than the single-core run (which pegged one core at 100%), even though monitoring showed all 8 cores at 20-40% load at the same time.

Since all the threads work at the same time, each generating fewer rows and synchronizing only to update a global variable, shouldn't the execution time be lower than with a single core?

Answer

I am pretty sure your lock is not necessary and is slowing you down. (Edit: actually, I just noticed the lock is only taken after the majority of the work is done, so it isn't really the bottleneck.)

global_tab += self.tab is (I think) atomic thanks to the Python GIL. (Actually, the reference I found only makes this claim for list.extend(), so use that instead. See also: Are lists thread safe?)
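
As a minimal sketch of that point (my own example, not from the question; it assumes CPython, where a single extend() call with a built-in list argument completes without being interleaved with other threads):

import threading

shared = []

def worker(n):
    # build a private list first, then hand it over in one extend() call
    local = [i for i in range(n)]
    shared.extend(local)

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(shared))  # expected: 4000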

Alternatively, I would try multiprocessing.Pool.imap_unordered with a large chunksize. The downside is that the results have to be sent back to the parent process, but your random string generation should overshadow that cost.

import multiprocessing
import random
import string

def randomword(x):
    # the argument is just the index handed in by imap_unordered; it is ignored
    return ''.join(random.choice(string.ascii_letters) for i in range(15))

pool = multiprocessing.Pool(8)
results = pool.imap_unordered(randomword, range(100))
print([r for r in results])

For 2 million strings (I changed it to print the length):
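
A rough sketch of that change, reusing randomword() from the example above (only the range and the final print differ):

pool = multiprocessing.Pool(8)
results = pool.imap_unordered(randomword, range(2000000))
print(len(list(results)))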

$ time python r.py                                                                 
2000000

real    0m38.305s
user    1m31.717s
sys     0m25.853s

I also tried cleaning up your version a bit and got:

$ time python rr.py 
[Thread-0] Generate 250000 lines
[Thread-1] Generate 250000 lines
[Thread-2] Generate 250000 lines
[Thread-3] Generate 250000 lines
[Thread-4] Generate 250000 lines
[Thread-5] Generate 250000 lines
[Thread-6] Generate 250000 lines
[Thread-7] Generate 250000 lines
[Thread-4] 250000 lines in list
[Thread-1] 500000 lines in list
[Thread-7] 750000 lines in list
[Thread-0] 1000000 lines in list
[Thread-6] 1250000 lines in list
[Thread-2] 1500000 lines in list
[Thread-3] 1750000 lines in list
[Thread-5] 2000000 lines in list

real    0m22.113s
user    0m24.969s
sys     0m5.537s

A couple of significant changes:

  • Use xrange() on the large ranges (ah, Python 3's range() is already lazy, so nothing to change there).
  • Remove the thread lock.
  • Use extend() on the global list.

(My results were about the same when just appending to global_tab directly and leaving out the temporary list, by the way; a sketch of that variant is below.)
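
A sketch of that simpler run(), appending each tuple straight onto global_tab (shown only for illustration):

    def run(self):
        print("[%s] Generate %d lines" % (self.name, int(global_nb_loop / global_nb_threads)))
        for x in range(int(global_nb_loop / global_nb_threads)):
            # append directly to the shared list; a single append call on a
            # built-in list completes without being interleaved, so no lock
            global_tab.append(("".join(random.choice(string.ascii_letters) for i in range(15)),
                               "".join(random.choice(string.digits) for i in range(2))))
        print("[%s] %d lines in list" % (self.name, len(global_tab)))

Here is the full cleaned-up version with the temporary list: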

import random, string
import threading

global_tab         = []
global_nb_threads  = 8
global_nb_loop     = 2000000

class generateList(threading.Thread):
    def __init__(self, name):
        threading.Thread.__init__(self)
        self.name = name

    def run(self):
        global global_tab
        self.tab = []

        print("[%s] Generate %d lines" % (self.name, int(global_nb_loop/global_nb_threads)))
        for x in range(int(global_nb_loop/global_nb_threads)):
            self.tab.append(("".join( [random.choice(string.ascii_letters) for i in range(15)] ), "".join( [random.choice(string.digits) for i in range(2)])))

        global_tab.extend(self.tab)
        print("[%s] %d lines in list" % (self.name, len(global_tab)))


threads = []
for i in range(global_nb_threads):
    t = generateList("Thread-" + str(i))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

...but single-threaded is still slightly faster, at about 16 seconds.

If I tune multiprocessing, I can get it down to 6 seconds:

# reuses randomword() and the imports from the first example above
size = 2000000
processes = 8
pool = multiprocessing.Pool(processes)
# a large chunksize hands each worker one big batch instead of paying
# the inter-process communication cost per item
results = [r for r in pool.imap_unordered(randomword, range(size), chunksize=int(size / processes))]
print(len(results))

output:

$ time python r.py                                                                 
2000000

real    0m5.713s
user    0m35.594s
sys     0m0.546s

...so I think that's my final answer: Use multiprocessing.
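
If it helps, here is a minimal self-contained sketch along those lines, adapted to produce the same (15 letters, 2 digits) tuples as the original code; the worker function name and the chunksize are just illustrative choices:

import multiprocessing
import random
import string

def random_row(_):
    # one row: 15 random letters plus 2 random digits, like the original
    letters = "".join(random.choice(string.ascii_letters) for _ in range(15))
    digits = "".join(random.choice(string.digits) for _ in range(2))
    return (letters, digits)

if __name__ == "__main__":
    size = 2000000
    processes = 8
    with multiprocessing.Pool(processes) as pool:
        rows = list(pool.imap_unordered(random_row, range(size),
                                        chunksize=int(size / processes)))
    print(len(rows))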
