How do I submit a large number of very similar jobs?
A few tricks make it easier to submit a large number of similar jobs, as is common in high-throughput computing (HTC).
Be careful if your program needs a random number seed, e.g. for Monte Carlo simulation. Your program should give each job a different seed; otherwise several jobs will reuse the same seed and produce identical pseudo-random sequences.
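One way to keep seeds distinct is to derive each job's seed from a fixed base value plus the job index. The sketch below is only an illustration: the base seed of 1000 and the helper name make_seed are assumptions, and how the seed reaches your program (a command-line argument, an input file) depends on your own code and is not shown.

```shell
# Assumed base value; any integer works as long as it is the same for all jobs.
BASE_SEED=1000

# Hypothetical helper: print a distinct seed for the given job index.
make_seed() {
    echo $(( BASE_SEED + $1 ))
}

# Show the seed that each of the first three jobs would receive
for x in 0 1 2
do
    echo "Job $x seed: $(make_seed "$x")"
done
```

Whether consecutive seeds are statistically adequate depends on the random number generator your program uses, so check its documentation before relying on this scheme.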
We start with the simple C++ program introduced here. We then create a submission template called example3a-template.slurm:
#!/bin/bash
#
#SBATCH --qos=cu_hpc
#SBATCH --partition=cpu
#SBATCH --job-name=example3a
#SBATCH --output=example3a_INPUT1_log.txt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
module purge
# Remember the directory the job starts in (where the code lives)
export MYCODEDIR=$(pwd)
echo "MYCODEDIR = $MYCODEDIR"
echo "TMPDIR = $TMPDIR"
# Sleep for 3 minutes so there is time to capture the squeue output
sleep 3m
# Run the C++ program in the temporary directory
cd "$TMPDIR"
cp "$MYCODEDIR/example3" .
chmod a+x example3
rm -rf example3.txt
./example3
# Copy the result back under a per-job name
cp -rf example3.txt "$MYCODEDIR/example3_INPUT1.txt"
And here is our submit.sh script:
#!/bin/bash
for x in {0..20..1}
do
    # Prepare the submission script from the template
    rm -rf "example3a_$x.slurm"
    sed "s/INPUT1/$x/g" example3a-template.slurm > "example3a_$x.slurm"
    # Submit the job, then remove the generated submission script
    echo "Job: $x"
    sbatch "example3a_$x.slurm"
    rm -rf "example3a_$x.slurm"
done
The script loops from 0 to 20. In each iteration, it:
1. prepares a submission script from the template, using the sed editor to find the pattern INPUT1 and replace it with the value of $x,
2. submits the job to the Slurm cluster,
3. deletes the submission script.
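The substitution step can be tried on its own, without submitting anything. A minimal sketch, assuming a hypothetical helper named fill_template that applies the same sed expression as submit.sh to text read from standard input:

```shell
# Hypothetical helper: replace every INPUT1 placeholder with the given value.
fill_template() {
    sed "s/INPUT1/$1/g"
}

# One line taken from example3a-template.slurm
line='#SBATCH --output=example3a_INPUT1_log.txt'

# Print the line as it would appear in the first three generated scripts
for x in 0 1 2
do
    printf '%s\n' "$line" | fill_template "$x"
done
```

Running it prints the --output line with INPUT1 replaced by 0, 1 and 2 in turn, which is what each generated submission script contains.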
After your submission is done, you can check your jobs using squeue:
[your_name@frontend-03 example2]$ squeue -u your_name
JOBID  PARTITION  NAME      USER       ST  TIME  NODES  NODELIST(REASON)
81969  cpu        example3  your_name  PD  0:00  1      (QOSMaxJobsPerUserLimit)
81970  cpu        example3  your_name  PD  0:00  1      (QOSMaxJobsPerUserLimit)
81971  cpu        example3  your_name  PD  0:00  1      (QOSMaxJobsPerUserLimit)
81972  cpu        example3  your_name  PD  0:00  1      (QOSMaxJobsPerUserLimit)
81973  cpu        example3  your_name  PD  0:00  1      (QOSMaxJobsPerUserLimit)
81974  cpu        example3  your_name  PD  0:00  1      (QOSMaxJobsPerUserLimit)
81975  cpu        example3  your_name  PD  0:00  1      (QOSMaxJobsPerUserLimit)
81976  cpu        example3  your_name  PD  0:00  1      (QOSMaxJobsPerUserLimit)
81977  cpu        example3  your_name  PD  0:00  1      (QOSMaxJobsPerUserLimit)
81978  cpu        example3  your_name  PD  0:00  1      (QOSMaxJobsPerUserLimit)
81979  cpu        example3  your_name  PD  0:00  1      (QOSMaxJobsPerUserLimit)
81980  cpu        example3  your_name  PD  0:00  1      (QOSMaxJobsPerUserLimit)
81981  cpu        example3  your_name  PD  0:00  1      (QOSMaxJobsPerUserLimit)
81982  cpu        example3  your_name  PD  0:00  1      (QOSMaxJobsPerUserLimit)
81983  cpu        example3  your_name  PD  0:00  1      (QOSMaxJobsPerUserLimit)
81984  cpu        example3  your_name  PD  0:00  1      (QOSMaxJobsPerUserLimit)
81965  cpu        example3  your_name  R   1:04  1      cpu-bladeh-01
81966  cpu        example3  your_name  R   1:04  1      cpu-bladeh-01
81967  cpu        example3  your_name  R   1:04  1      cpu-bladeh-01
81968  cpu        example3  your_name  R   1:04  1      cpu-bladeh-01
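With many jobs in the queue, a per-state count is often easier to read than the full listing. In the ST column, PD means pending and R means running; the reason QOSMaxJobsPerUserLimit indicates that the QOS limit on jobs per user has been reached, so the remaining jobs wait their turn. The sketch below assumes a hypothetical helper named count_states that reads squeue output in the default column layout shown above, with the state in the fifth column.

```shell
# Hypothetical helper: count jobs per state from squeue's default output.
# Skips the header line and assumes the state (ST) is the fifth column.
count_states() {
    awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Typical use on the cluster (replace your_name with your user name):
#   squeue -u your_name | count_states
```

For the listing above, this would report 16 jobs in state PD and 4 in state R.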