Ian Panzica - 6 months ago
Python Question

Python multiprocessing - AssertionError: can only join a child process

I'm taking my first foray into the Python multiprocessing module and I'm running into some problems. I'm very familiar with the threading module, but I need to make sure the processes I'm executing are running in parallel.

Here's an outline of what I'm trying to do. Please ignore things like undeclared variables/functions because I can't paste my code in full.

import multiprocessing
import time

def wrap_func_to_run(host, args, output):
    output.append(do_something(host, args))
    return

def func_to_run(host, args):
    return do_something(host, args)

def do_work(server, client, server_args, client_args):
    server_output = func_to_run(server, server_args)
    client_output = func_to_run(client, client_args)
    #handle this output and return a result
    return result

def run_server_client(server, client, server_args, client_args, server_output, client_output):
    server_process = multiprocessing.Process(target=wrap_func_to_run, args=(server, server_args, server_output))
    server_process.start()
    client_process = multiprocessing.Process(target=wrap_func_to_run, args=(client, client_args, client_output))
    client_process.start()
    server_process.join()
    client_process.join()
    #handle the output and return some result

def run_in_parallel(server, client):
    #set up commands for first process
    server_output = client_output = []
    server_cmd = "cmd"
    client_cmd = "cmd"
    process_one = multiprocessing.Process(target=run_server_client, args=(server, client, server_cmd, client_cmd, server_output, client_output))
    process_one.start()
    #set up second process to run - but this one can run here
    result = do_work(server, client, "some server args", "some client args")
    process_one.join()
    #use outputs above and the result to determine result
    return final_result

def main():
    #grab client
    client = client()
    #grab server
    server = server()
    return run_in_parallel(server, client)

if __name__ == "__main__":
    main()


Here's the error I'm getting:

Error in sys.exitfunc:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib64/python2.7/multiprocessing/util.py", line 319, in _exit_function
    p.join()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 143, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process


I've tried a lot of different things to fix this but my feeling is that there's something wrong with the way I'm using this module.

EDIT:

So I created a file that reproduces this by simulating the client/server and the work they do. I also missed an important point: I am running this on Unix. Another important bit of information is that do_work in my actual case uses os.fork(). I was unable to reproduce the error without also using os.fork(), so I'm assuming the problem is there. In my real-world case, that part of the code was not mine, so I was treating it like a black box (likely a mistake on my part). Anyway, here's the code to reproduce it:

#!/usr/bin/python

import multiprocessing
import time
import os
import signal
import sys

class Host():
    def __init__(self):
        self.name = "host"

    def work(self):
        #override - use to simulate work
        pass

class Server(Host):
    def __init__(self):
        self.name = "server"

    def work(self):
        x = 0
        for i in range(10000):
            x+=1
        print x
        time.sleep(1)

class Client(Host):
    def __init__(self):
        self.name = "client"

    def work(self):
        x = 0
        for i in range(5000):
            x+=1
        print x
        time.sleep(1)

def func_to_run(host, args):
    print host.name + " is working"
    host.work()
    print host.name + ": " + args
    return "done"

def do_work(server, client, server_args, client_args):
    print "in do_work"
    server_output = client_output = ""
    child_pid = os.fork()
    if child_pid == 0:
        server_output = func_to_run(server, server_args)
        sys.exit(server_output)
    time.sleep(1)

    client_output = func_to_run(client, client_args)
    # kill and wait for server to finish
    os.kill(child_pid, signal.SIGTERM)
    (pid, status) = os.waitpid(child_pid, 0)

    return (server_output == "done" and client_output =="done")

def run_server_client(server, client, server_args, client_args):
    server_process = multiprocessing.Process(target=func_to_run, args=(server, server_args))
    print "Starting server process"
    server_process.start()
    client_process = multiprocessing.Process(target=func_to_run, args=(client, client_args))
    print "Starting client process"
    client_process.start()
    print "joining processes"
    server_process.join()
    client_process.join()
    print "processes joined and done"

def run_in_parallel(server, client):
    #set up commands for first process
    server_cmd = "server command for run_server_client"
    client_cmd = "client command for run_server_client"
    process_one = multiprocessing.Process(target=run_server_client, args=(server, client, server_cmd, client_cmd))
    print "Starting process one"
    process_one.start()
    #set up second process to run - but this one can run here
    print "About to do work"
    result = do_work(server, client, "server args from do work", "client args from do work")
    print "Joining process one"
    process_one.join()
    #use outputs above and the result to determine result
    print "Process one has joined"
    return result

def main():
    #grab client
    client = Client()
    #grab server
    server = Server()
    return run_in_parallel(server, client)

if __name__ == "__main__":
    main()


If I remove the use of os.fork() in do_work, I don't get the error and the code behaves as I would have expected (except for the passing of outputs, which I've accepted as my mistake/misunderstanding). I can change the old code to not use os.fork() but I'd also like to know why this caused this problem and if there's a workable solution.

EDIT 2:

I started working on a solution that omits os.fork() before the accepted answer. Here's what I have with some tweaking to the amount of simulated work that can be done -

#!/usr/bin/python

import multiprocessing
import time
import os
import signal
import sys
from Queue import Empty

class Host():
    def __init__(self):
        self.name = "host"

    def work(self, w):
        #override - use to simulate work
        pass

class Server(Host):
    def __init__(self):
        self.name = "server"

    def work(self, w):
        x = 0
        for i in range(w):
            x+=1
        print x
        time.sleep(1)

class Client(Host):
    def __init__(self):
        self.name = "client"

    def work(self, w):
        x = 0
        for i in range(w):
            x+=1
        print x
        time.sleep(1)

def func_to_run(host, args, w, q):
    print host.name + " is working"
    host.work(w)
    print host.name + ": " + args
    q.put("ZERO")
    return "done"

def handle_queue(queue):
    done = False
    results = []
    return_val = 0
    while not done:
        #try to grab item from Queue
        tr = None
        try:
            tr = queue.get_nowait()
            print "found element in queue"
            print tr
        except Empty:
            done = True
        if tr is not None:
            results.append(tr)
    for el in results:
        if el != "ZERO":
            return_val = 1
    return return_val

def do_work(server, client, server_args, client_args):
    print "in do_work"
    server_output = client_output = ""
    child_pid = os.fork()
    if child_pid == 0:
        server_output = func_to_run(server, server_args)
        sys.exit(server_output)
    time.sleep(1)

    client_output = func_to_run(client, client_args)
    # kill and wait for server to finish
    os.kill(child_pid, signal.SIGTERM)
    (pid, status) = os.waitpid(child_pid, 0)

    return (server_output == "done" and client_output =="done")



def run_server_client(server, client, server_args, client_args, w, mq):
    local_queue = multiprocessing.Queue()
    server_process = multiprocessing.Process(target=func_to_run, args=(server, server_args, w, local_queue))
    print "Starting server process"
    server_process.start()
    client_process = multiprocessing.Process(target=func_to_run, args=(client, client_args, w, local_queue))
    print "Starting client process"
    client_process.start()
    print "joining processes"
    server_process.join()
    client_process.join()
    print "processes joined and done"
    if handle_queue(local_queue) == 0:
        mq.put("ZERO")

def run_in_parallel(server, client):
    #set up commands for first process
    master_queue = multiprocessing.Queue()
    server_cmd = "server command for run_server_client"
    client_cmd = "client command for run_server_client"
    process_one = multiprocessing.Process(target=run_server_client, args=(server, client, server_cmd, client_cmd, 400000000, master_queue))
    print "Starting process one"
    process_one.start()
    #set up second process to run - but this one can run here
    print "About to do work"
    #result = do_work(server, client, "server args from do work", "client args from do work")
    run_server_client(server, client, "server args from do work", "client args from do work", 5000, master_queue)
    print "Joining process one"
    process_one.join()
    #use outputs above and the result to determine result
    print "Process one has joined"
    return_val = handle_queue(master_queue)
    print return_val
    return return_val

def main():
    #grab client
    client = Client()
    #grab server
    server = Server()
    val = run_in_parallel(server, client)
    if val:
        print "failed"
    else:
        print "passed"
    return val

if __name__ == "__main__":
    main()


This code has some tweaked printouts just to see exactly what is happening. I used a multiprocessing.Queue to store and share outputs across the processes and back into my main thread to be handled. I think this solves the Python portion of my problem, but there are still some issues in the code I'm working on. The only other thing I can say is that the equivalent of func_to_run involves sending a command over ssh and grabbing any error output along with the output. For some reason, this works perfectly fine for a command that has a low execution time, but not well for a command that has a much larger execution time/output. I tried simulating this with the drastically different work values in my code here but haven't been able to reproduce similar results.

EDIT 3:

The library code I'm using (again, not mine) uses Popen.wait() for the ssh commands, and I just read this:


Popen.wait()

Wait for child process to terminate. Set and return returncode attribute.

Warning: This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.


I adjusted the code to not buffer and just print as it is received and everything works.
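For reference, the communicate() route that the warning recommends can be sketched roughly as follows. This is not the library code mentioned above (which isn't shown here); run_and_capture is a hypothetical helper name, and the noisy child command is just a stand-in for a long-running ssh command that produces a lot of output.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd and collect its output without risking the pipe deadlock.

    communicate() keeps draining stdout and stderr while the child runs,
    so the child never blocks on a full OS pipe buffer the way it can
    when the parent only calls wait().
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err


if __name__ == "__main__":
    # Stand-in for a command with large output: a child Python process
    # that writes about 1 MB to stdout, more than a typical pipe buffer holds.
    noisy = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 1000000)"]
    code, out, err = run_and_capture(noisy)
    print(code, len(out), len(err))
```

Calling proc.wait() before reading the pipes is the pattern the warning describes: with that much output the child would block on a full pipe and never exit.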

Answer

I can change the old code to not use os.fork() but I'd also like to know why this caused this problem and if there's a workable solution.

The key to understanding the problem is knowing exactly what fork() does. CPython docs state "Fork a child process." but this presumes you understand the C library call fork().

Here's what glibc's manpage says about it:

fork() creates a new process by duplicating the calling process. The new process, referred to as the child, is an exact duplicate of the calling process, referred to as the parent, except for the following points: ...

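It's basically as if you took your program and made a copy of its program state (heap, stack, instruction pointer, etc.) with small differences and let it execute independently of the original. That copy includes the multiprocessing module's bookkeeping: the forked child still holds the Process object for process_one, along with the exit handler that multiprocessing registered with atexit. When this child process exits naturally via sys.exit(), the interpreter runs those atexit handlers, and multiprocessing's handler (_exit_function in your traceback) tries to join() process_one. But process_one was started by the original parent, so the parent pid recorded in the Process object does not match the forked child's os.getpid(), and the assertion fails.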
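To make the pid mismatch concrete, here is a minimal Unix-only sketch. It does not use multiprocessing itself; recorded_pid is a made-up variable that plays the role of the self._parent_pid value checked in the traceback, and compare_pids_after_fork is a hypothetical helper name.

```python
import os


def compare_pids_after_fork():
    """Return (parent_matches, child_matches) for a pid recorded before fork()."""
    # Stand-in for Process._parent_pid: captured once, before forking.
    recorded_pid = os.getpid()

    read_fd, write_fd = os.pipe()
    child_pid = os.fork()
    if child_pid == 0:
        # Child: it inherited recorded_pid as part of the copied state,
        # but os.getpid() now returns the child's own pid.
        os.close(read_fd)
        matches = recorded_pid == os.getpid()
        os.write(write_fd, b"1" if matches else b"0")
        os._exit(0)  # leave without running inherited atexit handlers

    # Parent: read the child's verdict, then reap the child.
    os.close(write_fd)
    child_view = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(child_pid, 0)
    return recorded_pid == os.getpid(), child_view == b"1"


if __name__ == "__main__":
    print(compare_pids_after_fork())
```

The comparison holds in the parent and fails in the forked child, which is the same comparison the 'can only join a child process' assertion makes.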

What can you do to avoid it?

  • omit os.fork(): use multiprocessing instead, like you are exploring now
  • probably effective: import multiprocessing after executing fork(), only in the child or parent as necessary.
  • use os._exit() in the child (CPython docs state, "Note The standard way to exit is sys.exit(n). _exit() should normally only be used in the child process after a fork().")

https://docs.python.org/2/library/os.html#os._exit
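As a rough illustration of the os._exit() option (Unix only), the sketch below forks a child, registers an atexit handler inside it as a stand-in for the handler multiprocessing installs, and then leaves with os._exit(). The helper name fork_child_with_os_exit and the handler message are made up for this example.

```python
import atexit
import os


def fork_child_with_os_exit():
    """Fork a child that registers an atexit handler, then calls os._exit().

    Returns (bytes the handler managed to write, child's exit code),
    as observed by the parent.
    """
    read_fd, write_fd = os.pipe()
    child_pid = os.fork()
    if child_pid == 0:
        # Child: this handler stands in for one inherited from the parent,
        # such as the exit handler registered by multiprocessing.
        os.close(read_fd)
        atexit.register(os.write, write_fd, b"atexit handler ran")
        os._exit(0)  # exits immediately; atexit handlers are not run

    # Parent: close our copy of the write end so reading reaches EOF
    # once the child has exited.
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 1024)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    _, status = os.waitpid(child_pid, 0)
    return b"".join(chunks), os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(fork_child_with_os_exit())
```

Because the child exits through os._exit(), the handler never writes anything, so the parent reads nothing from the pipe. In do_work the equivalent change would be replacing sys.exit(server_output) in the child branch with os._exit() and an integer status. Note that os._exit() also skips flushing stdio buffers, so any output the child needs must be flushed first.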