
MultiProcessing slower with more processes

I have a program written in Python that reads 4 input text files and writes all of them into a list called ListOutput, which is shared memory between the 4 processes used in my program (I used 4 processes so my program runs faster!).

I also have a shared memory variable called processedFiles, which stores the names of the input files that have already been read by any of the processes, so the current process does not read them again (I used a lock so that processes do not check for a file inside processedFiles at the same time).

When I use only one process, my program runs faster (about 7 milliseconds), even though my computer has 8 cores. Why is this?

import glob
from multiprocessing import Process, Manager, Lock
import timeit
import os

# Define a function for the Processes
def print_content(ProcessName, processedFiles, ListOutput, lock):
    for file in glob.glob("*.txt"):
        newfile = 0

        # The check-and-mark must be atomic, so it happens under the lock
        lock.acquire()
        print "\n Current Process:", ProcessName
        if file not in processedFiles:
            print "\n", file, " not in ", processedFiles, " for ", ProcessName
            processedFiles.append(file)
            newfile = 1  # it is a new file
        lock.release()

        # if it is a new file, read it and append its lines to the shared list
        if newfile == 1:
            f = open(file, "r")
            lines = f.readlines()
            f.close()
            ListOutput.append(lines)

            #print "%s: %s" % ( ProcessName, time.ctime(time.time()) )

# Create processes as follows
if __name__ == '__main__':
    manager = Manager()
    processedFiles = manager.list()
    ListOutput = manager.list()
    lock = Lock()
    start = timeit.default_timer()

    p1 = Process(target=print_content, args=("Process-1", processedFiles, ListOutput, lock))
    p2 = Process(target=print_content, args=("Process-2", processedFiles, ListOutput, lock))
    p3 = Process(target=print_content, args=("Process-3", processedFiles, ListOutput, lock))
    p4 = Process(target=print_content, args=("Process-4", processedFiles, ListOutput, lock))

    try:
        p1.start(); p2.start(); p3.start(); p4.start()
        p1.join(); p2.join(); p3.join(); p4.join()

        print "ListOutput", ListOutput
        stop = timeit.default_timer()
        print stop - start
    except:
        print "Error: unable to start process"


The problem is that what looks like multiprocessing often isn't: just using more processes does not mean more work actually gets done in parallel.

The most glaring problem is that you synchronize everything. Selecting files is sequential, because you hold the lock around it, so there is zero gain there. And while you do read the files in parallel, everything you read is appended to a shared data structure, which synchronizes itself internally. So the only gain you can potentially get is from reading in parallel, and depending on your medium, e.g. an HDD instead of an SSD, several readers competing for the disk can actually be slower than a single one. A lock-free variant is sketched below.
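One way to remove most of that synchronization is to decide up front which process reads which file, so that no shared processedFiles list and no lock are needed at all. Here is a minimal sketch of that idea, assuming the same *.txt files in the working directory; multiprocessing.Pool and pool.map are standard, but the read_file helper is just an illustrative name:

import glob
from multiprocessing import Pool

def read_file(path):
    # Each worker only reads the files handed to it; no shared state, no lock
    f = open(path, "r")
    lines = f.readlines()
    f.close()
    return lines

if __name__ == '__main__':
    files = glob.glob("*.txt")                 # decide the work once, in the parent
    pool = Pool(processes=4)
    ListOutput = pool.map(read_file, files)    # one result per file, in order
    pool.close()
    pool.join()
    print ListOutput

pool.map hands each file name to exactly one worker and collects the results, so the "which files have already been read" bookkeeping disappears entirely.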

On top of that comes the overhead of managing all those processes. Each one has to be started, each one has to be passed its input, and each one must communicate its results back, which here happens for practically every action. And don't be fooled: a Manager is nifty, but heavyweight. A manager.list() is not ordinary shared memory; it is a proxy, and every access to it is a message to a separate server process.
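You can make that cost visible by timing appends to a manager.list() against a plain in-process list. The snippet below is only an illustration; the absolute numbers depend on your machine, but the proxy is typically far slower per operation:

import timeit
from multiprocessing import Manager

if __name__ == '__main__':
    manager = Manager()
    shared = manager.list()   # proxy: every append is a round trip to the manager process
    local = []                # ordinary list living in this process

    print "manager.list():", timeit.timeit(lambda: shared.append(1), number=1000)
    print "plain list:    ", timeit.timeit(lambda: local.append(1), number=1000)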

So aside from gaining little, you add a considerable extra cost. And since the single-process run takes only about 7 ms to begin with, the time spent spawning four processes and a Manager can easily exceed the time needed for the actual work.

In general, multiprocessing is only worth it when you are CPU-bound, i.e. when the CPU is busy close to 100% of the time and there is more work waiting than it can get through. That usually happens when you do lots of computation per item of data. A workload that consists mostly of I/O, like reading a few small text files, is a good indicator that you are not CPU-bound.
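For contrast, here is the kind of workload where multiple processes do pay off: a CPU-bound function applied to many independent inputs. The function and the input sizes are made up purely for illustration; time the serial and the pooled version yourself to see the difference on your machine:

import timeit
from multiprocessing import Pool

def cpu_heavy(n):
    # Made-up CPU-bound work: lots of computation, no I/O at all
    total = 0
    for i in xrange(n):
        total += i * i
    return total

if __name__ == '__main__':
    inputs = [10 ** 6] * 16

    start = timeit.default_timer()
    serial = map(cpu_heavy, inputs)
    print "serial:  ", timeit.default_timer() - start

    pool = Pool(processes=4)
    start = timeit.default_timer()
    parallel = pool.map(cpu_heavy, inputs)
    print "parallel:", timeit.default_timer() - start
    pool.close()
    pool.join()

Here each call does enough computation that the fixed cost of starting workers and shipping the inputs stays small compared to the work itself, which is exactly the situation your file-reading job is not in.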