I have a program written in Python that reads 4 input text files and writes all of them into a list called
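from multiprocessing import Process, Manager, Lock
import glob
import timeit
# Define a function for the processes (the def line, the locking, the appends
# and the indentation are missing from the excerpt; reconstructed as assumptions)
def print_content(ProcessName, processedFiles, ListOutput, lock):
    for file in glob.glob("*.txt"):
        print "\n Current Process:", ProcessName
        newfile = 0
        with lock:  # only one process at a time may claim a file
            if file not in processedFiles:
                print "\n", file, " not in ", processedFiles, " for ", ProcessName
                processedFiles.append(file)
                newfile = 1  # it is a new file
        if newfile == 1:  # if it is a new file
            with open(file, "r") as f:
                lines = f.readlines()
            ListOutput.append(lines)
    #print "%s: %s" % ( ProcessName, time.ctime(time.time()) )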
# Create processes as follows (the try/except, the lock and the start/join
# calls are missing from the excerpt; reconstructed as assumptions)
try:
    manager = Manager()
    processedFiles = manager.list()
    ListOutput = manager.list()
    lock = Lock()
    start = timeit.default_timer()
    p1 = Process(target=print_content, args=("Process-1", processedFiles, ListOutput, lock))
    p2 = Process(target=print_content, args=("Process-2", processedFiles, ListOutput, lock))
    p3 = Process(target=print_content, args=("Process-3", processedFiles, ListOutput, lock))
    p4 = Process(target=print_content, args=("Process-4", processedFiles, ListOutput, lock))
    for p in (p1, p2, p3, p4):
        p.start()
    for p in (p1, p2, p3, p4):
        p.join()
    stop = timeit.default_timer()
    print stop - start
except Exception:
    print "Error: unable to start process"
The problem is that what looks like multiprocessing often isn't. Just using more cores doesn't mean getting more work done.
The most glaring problem is that you synchronize everything. Selecting files is sequential because you lock, so there is zero gain there. While you are reading in parallel, every line read is written to a shared data structure, which internally synchronizes itself. So the only gain you can potentially get is from reading in parallel. Depending on your storage medium, e.g. an HDD instead of an SSD, several concurrent readers can actually be slower in total than a single one, because the drive has to seek back and forth between the files.
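One way to avoid both synchronization points is to hand each worker its own file up front and let it return the lines instead of appending to a shared list. The following is a minimal sketch in Python 3 syntax (the question uses Python 2), not a drop-in replacement; the helper names read_lines and read_all and the workers parameter are made up for illustration.

```python
from multiprocessing import Pool


def read_lines(path):
    """Read one file and return its lines; touches no shared state."""
    with open(path, "r") as f:
        return f.readlines()


def read_all(paths, workers=4):
    """Return one list of lines per path, in input order.

    Each path is handed to exactly one worker, so neither a lock nor a
    Manager list is needed; results travel back once per file.
    """
    paths = list(paths)
    if workers <= 1 or len(paths) <= 1:
        # Sequential fallback: for small inputs this is often the faster path.
        return [read_lines(p) for p in paths]
    with Pool(processes=workers) as pool:
        # map() blocks until every file has been read and preserves order.
        return pool.map(read_lines, paths)
```

It would be called as read_all(sorted(glob.glob("*.txt"))) from under an if __name__ == "__main__": guard. Whether the parallel branch beats the sequential fallback still depends on file sizes and the storage medium, so it is worth timing both.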
On top of that comes the overhead of managing all those processes. Each one needs to be started. Each one needs to be passed its input. Each one must communicate with the others, which happens for practically every action. And don't be fooled: a Manager is nifty but heavyweight. It runs a separate server process, and every operation on a managed list is a round trip to that process.
So aside from gaining little, you add an additional cost. Since you start out with a very small runtime of just 7ms, that additional cost can be pretty significant.
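To judge whether that overhead matters, it helps to time the plain sequential version and the bare cost of starting a process separately. A rough sketch in Python 3 syntax follows; best_time, noop and spawn_one are hypothetical helper names, and the numbers it produces will vary by machine and operating system.

```python
import timeit
from multiprocessing import Process


def best_time(fn, repeat=5):
    """Call fn() `repeat` times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeat):
        start = timeit.default_timer()
        fn()
        best = min(best, timeit.default_timer() - start)
    return best


def noop():
    """Worker that does nothing, so only process overhead is measured."""


def spawn_one():
    """Start a single do-nothing process and wait for it to finish."""
    p = Process(target=noop)
    p.start()
    p.join()
```

Under an if __name__ == "__main__": guard, best_time(spawn_one) gives a rough lower bound on the cost per process. If that figure is in the same range as the 7ms total runtime, four processes cannot pay for themselves.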
Multiprocessing is only worth it if you are CPU-bound. That is, your CPU utilization is close to 100%, i.e. there is more work than the CPU can get done. Generally, this happens when you do lots of computation. Doing mostly I/O, as here, is usually a good indicator that you are not CPU-bound.
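A quick way to check which side you are on is to compare the CPU time a piece of code consumes with the wall-clock time it takes. The sketch below (Python 3; cpu_fraction, busy and waiting are made-up names) returns that ratio for the current process: values near 1 suggest CPU-bound work, values near 0 suggest the code mostly waits. Treat it as a rough indicator only.

```python
import time


def cpu_fraction(fn):
    """Run fn() once and return CPU time / wall-clock time for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


def busy():
    """Pure computation: should keep one core busy."""
    return sum(i * i for i in range(2000000))


def waiting():
    """Stand-in for blocking I/O: sleeps instead of computing."""
    time.sleep(0.2)
```

Comparing cpu_fraction(busy) with cpu_fraction(waiting) shows the two extremes. The helper only measures the calling process, so it is best applied to the sequential version of the file-reading code rather than to the one that spawns workers.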