SpiderRico - 6 months ago
Python Question

Why does allgather() hang in for loops with different iteration counts?

I wrote the following code to experiment with allgather():

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

a = None
if rank == 0:
    a = 2
if rank == 1:
    a = 3
z = 2
for i in range(0, a):
    z = comm.allgather(z)

print(z, rank)
comm.barrier()


I run it as follows:


mpiexec -n 2 python3 allgather.py


I get the following output:


[[2, 2], [2, 2]] 0


The second process is stuck and the program doesn't terminate.

The output should be:


[[2, 2], [2, 2]] 0

[[2, 2], [2, 2], [2, 2]] 1


I fail to see why the second process gets stuck. It runs correctly if I set a = 2
in both processes. What am I doing wrong?

H2O
Answer

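The second process gets stuck because allgather() is a collective operation: every process in the communicator has to call it before any of them can return. Process 0 calls it twice and process 1 calls it three times, so on its third call the second process waits for the first to join it, and it will wait indefinitely.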
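
The mismatch can be made visible without MPI by lining up the calls each process makes. The following is only a plain-Python sketch of that bookkeeping; the helper name unmatched_rounds is made up for illustration and is not part of mpi4py:

```python
from itertools import zip_longest


def unmatched_rounds(iterations_per_rank):
    """Return (round, waiting_ranks) for every round that not all ranks join."""
    # Each rank's loop issues exactly one allgather per iteration.
    calls = [range(n) for n in iterations_per_rank]
    stuck = []
    for round_no, joined in enumerate(zip_longest(*calls, fillvalue=None)):
        # Ranks that still have a call in this round are the ones left waiting.
        waiting = [rank for rank, call in enumerate(joined) if call is not None]
        if len(waiting) < len(iterations_per_rank):
            stuck.append((round_no, waiting))
    return stuck


print(unmatched_rounds([2, 3]))  # rank 0 loops twice, rank 1 three times -> [(2, [1])]
print(unmatched_rounds([2, 2]))  # same count on both ranks -> []
```

With a = 2 on rank 0 and a = 3 on rank 1, the third round (index 2) has only rank 1 in it, which is the call that never returns.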

It's hard to suggest a good fix without knowing the practical application. In general, though, all processes have to take part in a gather operation. The simple fix is to let process 0 take part one more time without using the result:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

a = None
if rank == 0:
    a = 2
if rank == 1:
    a = 3
z = 2
old_z = None
for i in range(0, a):
    if i == a-1:
        old_z = z 
    z = comm.allgather(z)

if rank == 0:
    comm.allgather(old_z)

print(z, rank)
comm.barrier()

This is not a very elegant solution, but it lets both processes finish. Note how I had to store the value of z from the process's previous iteration to get anything close to your desired behavior: process 0 prints [[2, 2], [2, 2]] as before, and process 1 prints [[2, 2], [[2, 2], [2, 2]]].

If we don't store the previous value of z, process 0 passes its final z to the extra call and we get the following result:

[[2, 2], [2, 2]] 0

[[[2, 2], [2, 2]], [[2, 2], [2, 2]]] 1
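
A more general variant, sketched here under assumptions rather than offered as the definitive fix, is to have every process agree on a common number of rounds first and ignore the results it does not need. The helper name lockstep_allgather is hypothetical; it only relies on comm having an mpi4py-style allgather(obj) method, and a process that has finished its own iterations keeps contributing its final z:

```python
def lockstep_allgather(comm, a, z=2):
    """Run `a` meaningful allgather rounds while staying in step with other ranks.

    `comm` is any object with an mpi4py-style allgather(obj) method.
    """
    # One extra collective so every rank learns the largest loop count.
    rounds = max(comm.allgather(a))
    for i in range(rounds):
        gathered = comm.allgather(z)  # every rank makes this call `rounds` times
        if i < a:
            z = gathered  # keep the result only during this rank's own iterations
    return z


def main():
    # Imported here so the helper above can be exercised without MPI installed.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    a = 2 if rank == 0 else 3
    print(lockstep_allgather(comm, a), rank)


# To try it, call main() at the end of the script and launch it with
# mpiexec -n 2 python3 allgather.py
```

Because a finished process passes its final z, two processes should produce the same pair of results as the output shown just above. Which value a finished process ought to contribute depends on the application, so treat that choice as a placeholder.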