Gábor Erdős - 1 month ago
Python Question

SLURM and python, nodes are allocated, but the code only runs on one node

I have a 4*64 CPU cluster (4 nodes with 64 cores each). I installed SLURM, and it seems to be working: when I submit a job, I get the proper allocation and queue. However, if I use more than 64 cores (so basically more than one node), it allocates the correct number of nodes, but if I SSH into the allocated nodes, I only see actual work on one of them. The rest just sit there doing nothing.

My code is complex, and it uses Python's multiprocessing module. I call pools with around 300 workers, so I don't think that is the problem.

What I would like to achieve is to call
sbatch myscript.py
on, say, 200 cores, and have SLURM distribute my run across those 200 cores, not just allocate the correct number of nodes while actually using only one.

The header of my Python script looks like this:


#!/usr/bin/env python3
#SBATCH --output=SLURM_%j.log
#SBATCH --partition=part
#SBATCH -n 200

and I call the script with
sbatch myscript.py


I think your #SBATCH directives should not be inside the Python script. Rather, they belong in a normal bash script containing the #SBATCH options, followed by an srun call that launches the actual Python script, like the following:


#!/bin/bash
#SBATCH --output=SLURM_%j.log
#SBATCH --partition=part
#SBATCH -n 200

srun python3 myscript.py
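As background (my understanding of why only one node shows activity, not something confirmed by your logs): multiprocessing only forks worker processes on the local host, so a Pool can never reach beyond the node the interpreter was started on. A minimal sketch:

```python
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    # Pool workers are ordinary OS processes on the *local* machine;
    # 300 workers on a 64-core node just oversubscribe that one node,
    # they never spill over to the other nodes of the allocation.
    with mp.Pool(processes=4) as pool:
        print(pool.map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

That is why SLURM can hand you four nodes while your Python process still only loads one of them.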

I suggest testing this first with a simple Python script like the following:

import multiprocessing as mp

def main():
    print("cpus =", mp.cpu_count())

if __name__ == "__main__":
    main()
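To check that tasks really land on all nodes, a variant (the script name and setup here are my assumption) can print each task's rank and hostname; srun sets SLURM_PROCID per task, so every allocated node should appear in the output:

```python
import os
import socket

def report():
    # srun sets SLURM_PROCID for each task; fall back to "0" so the
    # script also runs standalone outside of SLURM.
    rank = os.environ.get("SLURM_PROCID", "0")
    host = socket.gethostname()
    print(f"task {rank} on {host}")
    return rank, host

if __name__ == "__main__":
    report()
```

Launched via the bash wrapper with srun and -n 200, you should see 200 lines spread over all four hostnames; if every line shows the same host, the work is still stuck on one node.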