How do I submit a large number of very similar jobs?
A few tricks make it easier to submit large numbers of similar jobs, as is typical in high-throughput computing (HTC).
Be careful if your program needs a random number seed, e.g. for a Monte Carlo simulation. Make sure each job gets a different seed, so that the same seed is not used more than once.
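One possible way to keep seeds distinct is to derive each seed from the job index. The sketch below is only an illustration: `BASE_SEED` and `make_seed` are hypothetical names, and it assumes your program can take a seed as a command-line argument, which the example program used on this page may not do.

```shell
#!/bin/bash
# Hypothetical base value; pick any integer you like.
BASE_SEED=12345

# make_seed INDEX: print a seed that is different for every job index.
make_seed() {
    local index=$1
    echo $(( BASE_SEED + index ))
}

for x in 0 1 2; do
    seed=$(make_seed "$x")
    echo "job $x would use seed $seed"
    # Assumption: the program reads the seed from its first argument, e.g.
    # ./my_program "$seed"
done
```

Because the seed depends only on the job index, rerunning job number 7 reproduces the same random sequence, while no two jobs in the same batch share a seed.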
We start with the simple C++ program we introduced here. We then create a submission template, called example3a-template.slurm:
#!/bin/bash
#
#SBATCH --qos=cu_hpc
#SBATCH --partition=cpu
#SBATCH --job-name=example3a
#SBATCH --output=example3a_INPUT1_log.txt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G

module purge

#To handle PATHs
export MYCODEDIR=`pwd`
echo "MYCODEDIR = "$MYCODEDIR
echo "TMPDIR = "$TMPDIR

#Sleep 3m, allow me to capture the `squeue` screen in time
sleep 3m

#To run C++ program
cd $TMPDIR
cp $MYCODEDIR/example3 .
chmod a+x example3
rm -rf example3.txt
./example3
cp -rf example3.txt $MYCODEDIR/example3_INPUT1.txt
And here is our submit.sh script:
#!/bin/bash
for x in {0..20..1}
do
    #prepare configuration
    rm -rf example3a_$x.slurm
    cp example3a-template.slurm example3a_$x.slurm
    sed s/INPUT1/$x/g example3a_$x.slurm >| temp
    mv temp example3a_$x.slurm

    #submit and clean slurm submission file
    echo "Job:" $x
    sbatch example3a_$x.slurm
    rm -rf example3a_$x.slurm
done
The script loops over x from 0 to 20. In each iteration, it
prepares a submission script from the template, using the sed editor to find the pattern INPUT1 and replace it with the value of $x,
submits the job to the Slurm cluster,
deletes the submission script.
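Before submitting many jobs, it can help to check the substitution step on its own. The following is a minimal dry-run sketch that renders a template with sed and prints the result without calling sbatch. The `render` function and the two-line demo template are made up for illustration; only the INPUT1 placeholder and the file-name patterns come from the template above.

```shell
#!/bin/bash
# render TEMPLATE VALUE: print TEMPLATE with every INPUT1 replaced by VALUE.
render() {
    local template=$1 value=$2
    sed "s/INPUT1/${value}/g" "$template"
}

# Build a tiny demo template in a scratch directory (illustrative only).
workdir=$(mktemp -d)
printf '%s\n' \
    '#SBATCH --output=example3a_INPUT1_log.txt' \
    'cp -rf example3.txt $MYCODEDIR/example3_INPUT1.txt' \
    > "$workdir/demo-template.slurm"

# Show what the submission file for x=7 would contain.
render "$workdir/demo-template.slurm" 7
```

Once the rendered text looks right, the same `sed` expression can be used inside the submission loop. Quoting the expression, as done here, keeps it intact if the value ever contains characters the shell would otherwise interpret.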
After your submission is done, you can check your jobs using squeue
[your_name@frontend-03 example2]$ squeue -u your_name
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
81969 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81970 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81971 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81972 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81973 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81974 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81975 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81976 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81977 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81978 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81979 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81980 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81981 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81982 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81983 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81984 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81965 cpu example3 your_name R 1:04 1 cpu-bladeh-01
81966 cpu example3 your_name R 1:04 1 cpu-bladeh-01
81967 cpu example3 your_name R 1:04 1 cpu-bladeh-01
81968 cpu example3 your_name R 1:04 1 cpu-bladeh-01
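In this listing, ST is the job state: R means running and PD means pending. The reason QOSMaxJobsPerUserLimit indicates that the pending jobs are waiting because of the per-user job limit of the QOS. With many jobs, a quick tally per state can be easier to read than the full list. The sketch below counts the ST column of default squeue output; `count_states` is a hypothetical helper name, and the sample input is a shortened copy of the listing above so that the snippet runs without a cluster.

```shell
#!/bin/bash
# count_states: read default `squeue` output on stdin and print
# one line per job state with the number of jobs in that state.
count_states() {
    # Skip the header line, then tally the ST column (5th field).
    awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample input taken from the listing above (shortened).
count_states <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
81969 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81970 cpu example3 your_name PD 0:00 1 (QOSMaxJobsPerUserLimit)
81965 cpu example3 your_name R 1:04 1 cpu-bladeh-01
EOF
```

On the cluster, the same function could be fed directly, for example with `squeue -u your_name | count_states`.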