Slurm job array submission severely underutilizing available resources - arrays

The SLURM job array submission isn't working as I expected. When I run my sbatch script to create the array and run the programs, I expect it to fully utilize all the cores that are available; however, it only allows one job from the array to run on a given node at a time. scontrol shows the job using all 36 cores on the node even though I specified 4 cores for the process. Additionally, I want to restrict the jobs to one specific node; however, if other nodes are unused, Slurm will place jobs on them as well, again using every core available on those nodes.
I've tried changing the parameters for --nodes, --ntasks, --nodelist, --ntasks-per-node, and --cpus-per-task, setting OMP_NUM_THREADS, and specifying the number of cores for mpirun directly. None of these options seemed to change anything at all.
#!/bin/bash
#SBATCH --time=2:00:00 # walltime
#SBATCH --ntasks=1 # number of processor cores (i.e. tasks)
#SBATCH --nodes=1 # number of nodes
#SBATCH --nodelist node001
#SBATCH --ntasks-per-node=9
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=500MB # memory per CPU core
#SBATCH --array=0-23%8
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun -n 4 MYPROGRAM
I expected to be able to run eight instances of MYPROGRAM, each utilizing four cores for a parallel operation. In total, I expected to use 32 cores at a time for MYPROGRAM, plus however many cores are needed to run the job submission program.
Instead, my squeue output looks like this:
JOBID PARTITION NAME USER ST TIME NODES CPUS
num_[1-23%6] any MYPROGRAM user PD 0:00 1 4
num_0 any MYPROGRAM user R 0:14 1 36
It says that I am using all available cores on the node for this process and will not allow additional array jobs to begin. While MYPROGRAM runs exactly as expected, there is only one instance of it running at any given time.
And my SCONTROL output looks like this:
UserId=user(225589) GroupId=domain users(200513) MCS_label=N/A
Priority=4294900562 Nice=0 Account=(null) QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2019-06-21T18:46:25 EligibleTime=2019-06-21T18:46:26
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-06-21T18:46:28
Partition=any AllocNode:Sid=w***:45277
ReqNodeList=node001 ExcNodeList=(null)
NodeList=(null) SchedNodeList=node001
NumNodes=1-1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,mem=2000M,node=1
Socks/Node=* NtasksPerN:B:S:C=9:0:*:* CoreSpec=*
MinCPUsNode=36 MinMemoryCPU=500M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Power=
JobId=1694 ArrayJobId=1693 ArrayTaskId=0 JobName=launch_vasp.sh
UserId=user(225589) GroupId=domain users(200513) MCS_label=N/A
Priority=4294900562 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:10 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2019-06-21T18:46:25 EligibleTime=2019-06-21T18:46:26
StartTime=2019-06-21T18:46:26 EndTime=2019-06-21T20:46:26 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-06-21T18:46:26
Partition=any AllocNode:Sid=w***:45277
ReqNodeList=node001 ExcNodeList=(null)
NodeList=node001
BatchHost=node001
NumNodes=1 NumCPUs=36 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
TRES=cpu=36,mem=18000M,node=1,billing=36
Socks/Node=* NtasksPerN:B:S:C=9:0:*:* CoreSpec=*
MinCPUsNode=36 MinMemoryCPU=500M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Power=
Something is going wrong in how SLURM is assigning cores to tasks, but nothing I've tried changes anything. I'd appreciate any help you can give.

Check whether the slurm.conf file allows consumable resources. The default is to allocate nodes exclusively. I had to add the following lines to allow per-core scheduling:
SelectType=select/cons_res
SelectTypeParameters=CR_Core
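If you cannot edit slurm.conf yourself, you can at least confirm which selection plugin the cluster is currently using before asking an administrator to change it; a quick check on any machine with the Slurm client tools installed:
# select/linear hands out whole nodes; select/cons_res or select/cons_tres enable per-core scheduling
scontrol show config | grep -i SelectType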

Related

slurm - use array and limit the number of jobs running at the same time until they finish

Let's suppose I have the following bash script (bash.sh) to be run on an HPC cluster using Slurm:
#!/bin/bash
#SBATCH --job-name test
#SBATCH --ntasks 4
#SBATCH --time 00-05:00
#SBATCH --output out
#SBATCH --error err
#SBATCH --array=0-24
readarray -t VARS < file.txt
VAR=${VARS[$SLURM_ARRAY_TASK_ID]}
export VAR
bash my_script.sh
This script will run my_script.sh 25 times, each time with a different variable taken from file.txt. In other words, 25 jobs will be launched all at once if I submit bash.sh with the command sbatch bash.sh.
Is there a way to limit the number of jobs running at the same time (e.g. 5) until all 25 have completed?
And if so, how can I do the same with 24 jobs in total (i.e. a number not divisible by 5)?
Thanks
Extract from Slurm's sbatch documentation:
-a, --array=<indexes>
... A maximum number of simultaneously running tasks from the job array may be specified using a "%" separator. For example "--array=0-15%4" will limit the number of simultaneously running tasks from this job array to 4. ...
This should limit the number of running jobs to 5 in your array:
#SBATCH --array=0-24%5
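The throttle works regardless of whether the total number of tasks divides evenly by the limit; Slurm simply runs whatever remains in the final batch. So for the 24-job case in the question it would just be:
#SBATCH --array=0-23%5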

SLURM/sbatch: pass array ID as input argument of an executable

I'm new to SLURM and I'm trying to do something very natural: I have a compiled C program, off.exe, which takes one variable as input, and I want to run it several times in parallel, each with a different value of the input parameter.
I thought I could use the %a array iterator as input:
#!/bin/bash
#SBATCH --partition=regular1,regular2
#SBATCH --time=12:00:00 # walltime
#SBATCH --ntasks=1 # number of processor cores (i.e. tasks)
#SBATCH --mem-per-cpu=512M # memory per CPU core
#SBATCH --job-name="ISM" # job name
#SBATCH --array=1-60 # job array. The item identifier is %a
#SBATCH --output=Polarization_%a_v0.4.txt # output file. %A is the job ID
srun ./off.exe %a
but it's not working (it's as if the input parameter were always zero!).
Can someone help me please?
%a, %j, etc. are replacement symbols for filenames, for example in the names of the output and error files written by Slurm. Inside the job script you need to use one of Slurm's output environment variables instead, most likely $SLURM_ARRAY_TASK_ID. You can find the full list in the manpage for sbatch.
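Applied to the script in the question, the last line would become something along these lines:
# the array index is available inside the job as an environment variable
srun ./off.exe $SLURM_ARRAY_TASK_ID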

Job distribution between nodes on HPC, instead of one CPU core

I am using PBS on an HPC cluster to submit serially written C codes. Suppose I have to run 5 codes in 5 different directories. When I select 1 node and 5 cores (select=1:ncpus=5) and submit with ./submit &, it forks and runs all 5 jobs. But the moment I choose 5 nodes with 1 core each (select=5:ncpus=1) and submit with ./submit &, only 1 core of the first node runs all five jobs while the other 4 stay idle, so the speed drops to 1/5.
My question is: is it possible to fork the jobs across the nodes as well?
I ask because when I request select=1:ncpus=24 the job sits in the queue, whereas select=4:ncpus=6 runs.
Thanks.
You should consider using job arrays (option #PBS -t 1-5) with 1 node and 1 cpu each. Then 5 independent jobs will start, and each job will wait less in the queue.
Within your script you can use the environment variable PBS_ARRAYID to identify the task, and use it to pick the appropriate directory and start the appropriate C program. Something like this:
#!/bin/bash -l
#PBS -N yourjobname
#PBS -q yourqueue
#PBS -l nodes=1:ppn=1
#PBS -t 1-5
./myprog-${PBS_ARRAYID}
This script will run 5 jobs, and each of them will run the compiled program named myprog-*, where * is a number between 1 and 5.
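If each code lives in its own directory, as described in the question, the same variable can select the working directory instead; a rough sketch assuming directories named dir1 through dir5 (hypothetical names):
# change into the directory for this array index, then run the code that lives there
cd dir${PBS_ARRAYID}
./myprog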

SLURM: run jobs in parallel instead of as array?

I have a large file to analyze using "jellyfish query", which is not multithreaded. I have split the big file into 29 manageable fragments to run as an array on SLURM. However, these are sitting in the workload queue for ages, whereas if I could request a whole node (32 cpus) they would go into a separate queue with quicker availability. Is there a way to tell SLURM to run the command on these fragments in parallel across all the cpus in a node, instead of as a serial array?
You could ask for 29 tasks with 1 cpu per task (you may get anything from all 29 cpus on one node to 1 cpu on each of 29 different nodes), and in the Slurm script start your computation with srun, telling srun to allocate one task/cpu per chunk.
.
.
.
#SBATCH --ntasks=29
#SBATCH --cpus-per-task=1
.
.
.
for n in {1..29}
do
srun -n 1 <your_script> $n &
done
wait
I suggest running a Python script to multithread this for you, then submitting a SLURM job to run the Python script.
from multiprocessing import Pool
import subprocess

num_threads = 29

def sample_function(input_file):
    # capture_output=True is required for .stdout to hold the command's output
    return subprocess.run(["cat", input_file], check=True, capture_output=True).stdout

input_file_list = ['one', 'two', 'three']
pool = Pool(processes=num_threads)
[pool.apply_async(sample_function, args=(input_file,)) for input_file in input_file_list]
pool.close()
pool.join()
This assumes you have files "one", "two", and "three". Obviously you need to replace the input file list and the job you want to run with subprocess.
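To run this under Slurm you would wrap the Python script in a single job that requests all the CPUs it will use; a minimal sketch, assuming the script above is saved as run_queries.py (hypothetical name):
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=29
python run_queries.py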
Thanks for the suggestions! I found a much less elegant but still functional way:
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=1
jellyfish query...fragment 1 &
jellyfish query...fragment 2 &
...
jellyfish query...fragment 29
wait

How to run an array job within a pipeline of several held jobs when the number of subjobs in the array depends on the result of a previous job

I am trying to write a bash script that sends several jobs to the cluster (SGE scheduler), making each of them wait for the previous one to end, such as:
HOLD_ID=$(qsub JOB1.sh | cut -c 10-16)
HOLD_ID=$(qsub -hold_jid $HOLD_ID JOB2.sh | cut -c 10-16)
HOLD_ID=$(qsub -hold_jid $HOLD_ID JOB3.sh | cut -c 10-16)
This works perfectly; however, now I want to add a held array job to this pipeline, such as:
qsub -hold_jid $HOLD_ID -t 1-$NB_OF_SUBJOBS JOB4.sh
But here the number of sub-jobs ($NB_OF_SUBJOBS) I will have depends on the result of JOB2.sh.
I want this to be a fast master script that just sends all the jobs. I would not like to use a while + sleep loop or something like that, which was my first attempt. The job that determines the number I need (JOB2.sh) takes a relatively long time to run. Since the last line is evaluated at submission time, any variable or file with the number of sub-jobs created by JOB2.sh will not yet exist. Any ideas?
Many thanks,
David
So, if I understand correctly, the submission of job 4 is predicated on obtaining information from the completion of job 2. If that is the case, you will need to submit job 4 after job 2 completes, which is different from submitting job 4 now and simply holding its execution on the completion of job 2.
Why not use the -sync y option on job 2, so that the qsub call only returns (and hence the rest of the script only continues) once job 2 has completed:
qsub -sync y -hold_jid $HOLD_ID JOB2.sh
Make sure job 2 writes the number of sub-jobs somewhere such as a file (n_subjobs.txt in the example below), or parse its output into a variable as you did for the job IDs. Then read that information when submitting job 4:
qsub -t 1-$(cat n_subjobs.txt) JOB4.sh
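Putting it together, the master script might look roughly like this; a sketch assuming JOB2.sh writes its sub-job count to n_subjobs.txt before it exits:
HOLD_ID=$(qsub JOB1.sh | cut -c 10-16)
# -sync y makes this call block until JOB2.sh finishes, so n_subjobs.txt exists afterwards
qsub -sync y -hold_jid $HOLD_ID JOB2.sh
HOLD_ID=$(qsub JOB3.sh | cut -c 10-16)
qsub -hold_jid $HOLD_ID -t 1-$(cat n_subjobs.txt) JOB4.sh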
