
spawn multiple processes to write different files Python

The idea is to write N files using N processes.

The data for each output file come from multiple input files, which are stored in a dictionary whose values are lists. It looks like this:

dic = {'file1': ['data11.txt', 'data12.txt', ..., 'data1M.txt'],
       'file2': ['data21.txt', 'data22.txt', ..., 'data2M.txt'],
       ...
       'fileN': ['dataN1.txt', 'dataN2.txt', ..., 'dataNM.txt']}


so file1 is data11 + data12 + ... + data1M, and so on.

So my code looks like this:

jobs = []
for d in dic:
    outfile = str(d) + "_merged.txt"
    with open(outfile, 'w') as out:
        p = multiprocessing.Process(target=merger.merger, args=(dic[d], name, out))
        jobs.append(p)
        p.start()
        out.close()


and the merger.py looks like this:

import sys
import time

def merger(files, name, outfile):
    time.sleep(2)
    sys.stdout.write("Merging %s...\n" % name)

    # the reason for this step is that all the different files have a header,
    # but I only need the header from the first file
    with open(files[0], 'r') as infile:
        for line in infile:
            print "writing to outfile: ", name, line
            outfile.write(line)
    for f in files[1:]:
        with open(f, 'r') as infile:
            next(infile)  # skip first line
            for line in infile:
                outfile.write(line)
    sys.stdout.write("Done with: %s\n" % name)


I do see the files created in the folder they should go to, but they are empty: no header, nothing. I put prints in there to check that everything is correct, but nothing works.

Help!

Answer

Since the worker processes run in parallel with the main process that creates them, the files named out get closed before the workers can write to them. This happens even if you remove out.close(), because of the with statement. Instead, pass each process the filename and let the process open and close the file itself.
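
A minimal sketch of that change, assuming the undefined name in the question's loop is meant to be the dictionary key d, and adding join() calls so the main process waits for the workers:

# merger.py -- the worker now receives a path and opens/closes the file itself
import sys
import time

def merger(files, name, outpath):
    time.sleep(2)
    sys.stdout.write("Merging %s...\n" % name)
    with open(outpath, 'w') as outfile:
        # keep the header from the first file only
        with open(files[0], 'r') as infile:
            for line in infile:
                outfile.write(line)
        # skip the header line of every remaining file
        for f in files[1:]:
            with open(f, 'r') as infile:
                next(infile)
                for line in infile:
                    outfile.write(line)
    sys.stdout.write("Done with: %s\n" % name)

# main script -- pass the output *filename*, not an open file object
import multiprocessing
import merger

jobs = []
for d in dic:  # dic as defined above
    outpath = str(d) + "_merged.txt"
    p = multiprocessing.Process(target=merger.merger, args=(dic[d], d, outpath))
    jobs.append(p)
    p.start()

for p in jobs:
    p.join()  # wait for every worker to finish

Calling join() at the end also guarantees the main process does not exit before the workers have flushed and closed their output files.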
