Gábor Erdős - 8 days ago
Python Question

SLURM and python, nodes are allocated, but the code only runs on one node

I have a 4*64 CPU cluster. I installed SLURM, and it seems to be working: if I call sbatch I get the proper allocation and queue. However, if I use more than 64 cores (so basically more than one node), it allocates the correct number of nodes, but if I ssh into the allocated nodes I only see actual work on one of them. The rest just sit there doing nothing.

My code is complex, and it uses multiprocessing. I call pools with around 300 workers, so I guess that should not be the problem.
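
For reference, the kind of pool call described above might look roughly like the sketch below; the worker function and the input range are hypothetical, since the actual code is not shown. Note that multiprocessing spawns all of its worker processes on the machine the script starts on, so a pool by itself never reaches the other nodes.

import multiprocessing as mp

def work(item):
    # hypothetical per-item computation, standing in for the real workload
    return item * item

if __name__ == "__main__":
    # a pool with roughly 300 workers, as described above; all of these
    # processes run on the single node the script was started on
    with mp.Pool(processes=300) as pool:
        results = pool.map(work, range(10000))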

What I would like to achieve is to call sbatch myscript.py on, say, 200 cores, and have SLURM distribute my run across those 200 cores instead of just allocating the correct number of nodes while actually using only one.

The header of my python script looks like this:

#!/usr/bin/python3

#SBATCH --output=SLURM_%j.log
#SBATCH --partition=part
#SBATCH -n 200


and I call the script with sbatch myscript.py.

Answer

I think the #SBATCH options should not be inside the Python script. Rather, they should go in a normal bash submission script, with the #SBATCH options followed by the actual program launched via srun, like the following:

#!/usr/bin/bash

#SBATCH --output=SLURM_%j.log
#SBATCH --partition=part
#SBATCH -n 200

srun python3 myscript.py
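
You would then submit the wrapper itself rather than passing the Python file to sbatch directly; assuming you save it as, say, submit.sh, the call becomes sbatch submit.sh.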

I suggest testing this with a simple Python script like this:

import multiprocessing as mp

def main():
    # report how many CPUs are visible on the node this task runs on
    print("cpus =", mp.cpu_count())

if __name__ == "__main__":
    main()
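
If it helps, here is a slightly extended version of that test, my own sketch relying on the standard SLURM_PROCID and SLURM_NTASKS environment variables that srun sets for each task. When launched through the wrapper above, every task prints its own hostname, so you can see directly whether the work is spread over several nodes:

import multiprocessing as mp
import os
import socket

def main():
    # each srun task reports where it landed and how many CPUs that node has
    print("task", os.environ.get("SLURM_PROCID"),
          "of", os.environ.get("SLURM_NTASKS"),
          "on", socket.gethostname(),
          "cpus =", mp.cpu_count())

if __name__ == "__main__":
    main()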