max max - 3 months ago 17
Python Question

parallel file parsing, multiple CPU cores

I asked a related but very general question earlier (see especially this response).

This question is very specific. This is all the code I care about:

result = {}
for line in open('input.txt'):
    key, value = parse(line)
    result[key] = value


The function parse is completely self-contained (i.e., it doesn't use any shared resources).

I have an Intel i7-920 CPU (4 cores, 8 threads; I think the threads are more relevant, but I'm not sure).

What can I do to make my program use all the parallel capabilities of this CPU?

I assume I can open this file for reading in 8 different threads without much performance penalty since disk access time is small relative to the total time.

Answer

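CPython does not easily provide the threading model you are looking for: its global interpreter lock (GIL) lets only one thread execute Python bytecode at a time, so threads will not speed up CPU-bound parsing. You can get something similar using the multiprocessing module and a process pool.

Such a solution could look something like this: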

import multiprocessing

def worker(lines):
    """Make a dict out of the parsed, supplied lines"""
    result = {}
    for line in lines:  # lines is a list of strings, one per input line
        k, v = parse(line)  # parse() is the function from the question
        result[k] = v
    return result

if __name__ == '__main__':
    # configurable options.  different values may work better.
    numthreads = 8   # number of worker processes (not threads)
    numlines = 100   # number of lines handed to a worker at a time

    with open('input.txt') as f:
        lines = f.readlines()

    # create the process pool
    pool = multiprocessing.Pool(processes=numthreads)

    # map the list of lines into a list of result dicts
    result_list = pool.map(worker,
        [lines[i:i+numlines] for i in range(0, len(lines), numlines)])
    pool.close()
    pool.join()

    # reduce the result dicts into a single dict, in input order
    result = {}
    for partial in result_list:
        result.update(partial)
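
The code above reads the whole file with readlines() and slices it into chunks by hand. As a variation, the pool can do the batching itself: Pool.imap accepts any iterable plus a chunksize, so the open file can be passed in directly. This is only a sketch under assumptions: the parse shown here is a hypothetical stand-in that splits 'key=value' lines (substitute the real function), parallel_parse is a name made up for illustration, and the default chunksize of 1000 is a guess that would need tuning.

```python
import multiprocessing
import sys


def parse(line):
    """Hypothetical stand-in for the question's parse(): 'key=value' -> (key, value)."""
    key, _, value = line.rstrip("\n").partition("=")
    return key, value


def parallel_parse(path, processes=None, chunksize=1000):
    """Parse every line of `path` in a process pool and return a single dict.

    processes=None lets multiprocessing pick the number of workers
    (the number of CPUs it detects).
    """
    with multiprocessing.Pool(processes=processes) as pool, open(path) as f:
        # imap yields results in input order, so a repeated key ends up with
        # the value from its last line, just like the sequential loop.
        # chunksize groups lines into batches to cut inter-process overhead.
        return dict(pool.imap(parse, f, chunksize=chunksize))


if __name__ == "__main__":
    # Usage: python script.py input.txt
    if len(sys.argv) == 2:
        result = parallel_parse(sys.argv[1])
        print(len(result), "keys parsed")
```

Whether this beats the sequential loop depends on how expensive parse is: every line and every (key, value) pair has to be pickled between processes, so a very cheap parse may not gain anything from the extra cores.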