Specifying the maximum runtime (means quicker job execution) / resource reservation
The problem
The queueing system we use on the VSC can reserve slots for jobs that need many of them. To make sure that these jobs are able to run on the cluster at all (rather than being blocked by the many smaller ones), we automatically turn on this feature for any job that runs with 32 or more slots. Unfortunately, this “wastes” a certain amount of resources: while slots are being collected for a large job, some nodes stay empty and run no calculation.
The solution (quicker scheduling available here)
But fear not! We will tell you how to make use of these idle resources while they are waiting for the large job to be scheduled!
This is called backfilling and is done as follows:
#### add this to the top of your job submission script:
#$ -l h_rt=01:00:00
# in this example, ONE HOUR (01:00:00) is the job's maximum run time
Here, h_rt means “hard runtime”. The job will be permitted to run at most this long. Please add a little buffer of about 10-50% to the runtime you've estimated/measured, so the job doesn't get killed before its result is available! For example, if your last job came in at 03:57:03 and you think that the next one will be pretty similar in nature, try to specify “-l h_rt=06:00:00”.
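The buffer arithmetic above can be sketched in a few lines of shell. This is only an illustration: pad_runtime is a hypothetical helper name and not part of Grid Engine. It takes a measured runtime and a buffer in percent, and rounds up to the next full second.

```bash
#!/bin/bash
# pad_runtime is a hypothetical helper, not a Grid Engine command.
# Usage: pad_runtime HH:MM:SS PERCENT
pad_runtime() {
    local measured=$1 percent=$2
    local h m s rest total padded
    # Split HH:MM:SS into its three fields.
    h=${measured%%:*}
    rest=${measured#*:}
    m=${rest%%:*}
    s=${rest#*:}
    # Drop one leading zero so that 08 and 09 are not read as octal numbers.
    h=${h#0}; m=${m#0}; s=${s#0}
    total=$(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
    # Add the buffer and round up to a whole second.
    padded=$(( (total * (100 + percent) + 99) / 100 ))
    printf '%02d:%02d:%02d\n' $((padded / 3600)) $((padded % 3600 / 60)) $((padded % 60))
}

# The measured runtime from the example above, with a 50% buffer.
pad_runtime 03:57:03 50
```

For 03:57:03 and a 50% buffer this gives 05:55:35, which is in line with the suggested 06:00:00. The result could be passed on the command line, for instance qsub -l h_rt=06:00:00 followed by the job script.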
We hope that you are able and willing to utilize this feature as it would save a lot of electricity and greatly reduce the time your jobs spend in the queue!
Please put our precious computing time to use as best as you can.
Job chains
Job chains are sets of consecutive interdependent jobs.
There are several ways to create job chains; two solutions are discussed below:
Solution I
Create one single job script and list several different jobs inside it. E.g.:
user@l01 $ cat allInOne.sge
#$ -N allInOne
./doJob1
./doJob2
./doJob3
user@l01 $ qsub allInOne.sge
Your job 10411 ("allInOne") has been submitted.
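In a script like allInOne.sge, each step starts even if the previous one failed. If that is not wanted, the steps can be wrapped as in the following sketch. run_chain is a hypothetical helper name; the commands true and false only stand in for ./doJob1, ./doJob2 and ./doJob3.

```bash
#!/bin/bash
# run_chain is a hypothetical helper: it runs each command in order and
# stops at the first one that exits with a non-zero status.
run_chain() {
    local step
    for step in "$@"; do
        echo "starting $step"
        if ! $step; then
            echo "$step failed, skipping the remaining steps" >&2
            return 1
        fi
    done
}

# Stand-in commands; in allInOne.sge this would read
#   run_chain ./doJob1 ./doJob2 ./doJob3
run_chain true true && echo "chain completed"
```

The design choice here is to stop early rather than spend computing time on steps whose input is missing.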
Solution II
Job chains, using -hold_jid <job id|job name>.
user@l01 $ cat holdJob.sge
#$ -N holdJob
./doJob1
user@l01 $ cat Job2.sge
#$ -N Job2
#$ -hold_jid holdJob
./doJob2
…
user@l01 $ qsub holdJob.sge
Your job 10451 ("holdJob") has been submitted.
user@l01 $ qsub Job2.sge
Your job 10452 ("Job2") has been submitted.
user@l01 $ qsub Job3.sge
Your job 10453 ("Job3") has been submitted.
'Job2' waits and starts only after 'holdJob' has finished. If you have another 'Job3', it waits until 'Job2' has finished, and so on.
- Advantage: accumulates “waiting time”
- Note: -hold_jid <job_name> can only be used to reference jobs of the same user (-hold_jid <job_id> can be used to reference any job)
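For longer chains, the qsub calls can be generated instead of typed by hand. The sketch below only prints the commands so that they can be checked first. print_chain and the job names chain_1, chain_2, … are hypothetical, and it is assumed that the job scripts do not contain their own -N or -hold_jid lines.

```bash
#!/bin/bash
# print_chain is a hypothetical helper: it prints one qsub command per job
# script, where each job is held until the previous job has finished.
print_chain() {
    local prev="" n=1 script name
    for script in "$@"; do
        name="chain_$n"
        if [ -z "$prev" ]; then
            # The first job of the chain has nothing to wait for.
            echo "qsub -N $name $script"
        else
            echo "qsub -N $name -hold_jid $prev $script"
        fi
        prev=$name
        n=$((n + 1))
    done
}

print_chain holdJob.sge Job2.sge Job3.sge
```

Once the printed commands look right, they can be run by piping the output into a shell.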
Job arrays
Job arrays are sets of similar but independent jobs. Submit sets of similar and independent “tasks”:
qsub -t 1-500:1 example_3.sge
- submits 500 instances of the same script
- each instance (“task”) is executed independently
- all instances are subsumed under a single job ID
- variable $SGE_TASK_ID discriminates between instances
- task numbering scheme: -t <first>-<last>:<stepsize>
- related: $SGE_TASK_FIRST, $SGE_TASK_LAST, $SGE_TASK_STEPSIZE
Example:
#$ -cwd
#$ -N blastArray
#$ -t 1-500:1

QUERY=query_${SGE_TASK_ID}.fa
OUTPUT=blastout_${SGE_TASK_ID}.txt

echo "processing query $QUERY ..."
blastall -p blastn -d nt -i $QUERY -o $OUTPUT
echo "...done"
user@l01 $ qsub example_3.sge
Your job 10420.1-500:1 ("blastArray") has been submitted.
user@l01 $ qstat
job-ID  prior   name       user  state submit/start at     queue         slots ja-task-ID
-----------------------------------------------------------------------------------------
10420   0.56000 blastArray mjr   r     02/13/2007 15:05:56 all.q@r10n01  1     198
10420   0.56000 blastArray mjr   r     02/13/2007 15:05:56 all.q@r10n01  1     199
10420   0.56000 blastArray mjr   r     02/13/2007 15:07:11 all.q@r10n01  1     202
10420   0.56000 blastArray mjr   r     02/13/2007 15:07:11 all.q@r10n01  1     203
10420   0.56000 blastArray mjr   r     02/13/2007 15:05:41 all.q@r10n01  1     196
10420   0.56000 blastArray mjr   r     02/13/2007 15:05:41 all.q@r10n01  1     197
10420   0.55241 blastArray mjr   r     02/13/2007 15:08:41 all.q@r10n01  1     208
10420   0.55241 blastArray mjr   r     02/13/2007 15:08:41 all.q@r10n01  1     209
10420   0.56000 blastArray mjr   r     02/13/2007 15:08:11 all.q@r10n02  1     204
10420   0.56000 blastArray mjr   r     02/13/2007 15:08:11 all.q@r10n02  1     206
10420   0.56000 blastArray mjr   r     02/13/2007 15:02:11 all.q@r10n02  1     176
10420   0.56000 blastArray mjr   r     02/13/2007 15:02:11 all.q@r10n02  1     177
10420   0.56000 blastArray mjr   r     02/13/2007 15:03:26 all.q@r10n02  1     182
10420   0.56000 blastArray mjr   r     02/13/2007 15:03:26 all.q@r10n02  1     183
10420   0.56000 blastArray mjr   r     02/13/2007 15:07:11 all.q@r10n02  1     200
10420   0.56000 blastArray mjr   r     02/13/2007 15:07:11 all.q@r10n02  1     201
10420   0.56000 blastArray mjr   r     02/13/2007 15:05:11 all.q@r10n03  1     193
10420   0.56000 blastArray mjr   r     02/13/2007 15:05:11 all.q@r10n03  1     194
10420   0.56000 blastArray mjr   r     02/13/2007 15:04:41 all.q@r10n03  1     190
10420   0.56000 blastArray mjr   r     02/13/2007 15:04:41 all.q@r10n03  1     191
10420   0.56000 blastArray mjr   r     02/13/2007 15:03:41 all.q@r10n03  1     184
10420   0.56000 blastArray mjr   r     02/13/2007 15:03:41 all.q@r10n03  1     185
10420   0.56000 blastArray mjr   r     02/13/2007 15:08:11 all.q@r10n03  1     205
10420   0.56000 blastArray mjr   r     02/13/2007 15:08:11 all.q@r10n03  1     207
10420   0.56000 blastArray mjr   r     02/13/2007 15:05:11 all.q@r10n04  1     192
10420   0.56000 blastArray mjr   r     02/13/2007 15:05:26 all.q@r10n04  1     195
10420   0.56000 blastArray mjr   r     02/13/2007 15:04:26 all.q@r10n04  1     188
10420   0.56000 blastArray mjr   r     02/13/2007 15:04:26 all.q@r10n04  1     189
10420   0.56000 blastArray mjr   r     02/13/2007 15:03:56 all.q@r10n04  1     186
10420   0.56000 blastArray mjr   r     02/13/2007 15:03:56 all.q@r10n04  1     187
10420   0.55242 blastArray mjr   qw    02/13/2007 14:28:34               1     210-500:1
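When the input files are not numbered as neatly as query_1.fa, query_2.fa, …, the task ID can instead select a line from a list of file names. The following is a minimal sketch of this idea. line_for_task and the file names are hypothetical, and SGE_TASK_ID falls back to 2 so that the snippet also runs outside of Grid Engine.

```bash
#!/bin/bash
# line_for_task is a hypothetical helper: print line number TASK_ID of FILE.
# Usage: line_for_task FILE TASK_ID
line_for_task() {
    sed -n "${2}p" "$1"
}

# Demo with a temporary list. In a real array job the list would exist
# beforehand and SGE_TASK_ID would be set by Grid Engine.
list=$(mktemp)
printf '%s\n' query_a.fa query_b.fa query_c.fa > "$list"

SGE_TASK_ID=${SGE_TASK_ID:-2}
QUERY=$(line_for_task "$list" "$SGE_TASK_ID")
echo "task $SGE_TASK_ID processes $QUERY"

rm -f "$list"
```

With such a list, the range given to -t would run from 1 to the number of lines in the list.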
Job arrays with multiple single core jobs on one exclusive node
In some cases job arrays with single core tasks require more memory than the per core memory of the compute nodes (3 GB on VSC-1, 2 GB on VSC-2). For such cases the jobscript below can be used. It starts several single core tasks on one node within a job array. Note the definition of the job stepwidth.
#$ -N job_array_with_multiple_single_tasks_on_one_node
###
### request single nodes, on vsc1 all nodes have 24 GB of memory:
### 8-core nodes:
#$ -pe mpich 8
### 12-core nodes:
### #$ -pe mpich 12
###
### set first and last task_id and stepwidth of array tasks. stepwidth should be identical with the
### number of jobs per node
#$ -t 1-18:3

#optimum order for using single cpus
cpus=(0 4 2 6 1 5 3 7)

for i in `seq 0 $( expr ${SGE_TASK_STEPSIZE} - 1 )`
do
  TASK=`expr ${SGE_TASK_ID} + $i`
  CMD="run file_$TASK"
  #start the task in the background, pinned to one cpu
  taskset -c ${cpus[$i]} $CMD &
done
#wait for all tasks to be finished before exiting the script
wait
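With -t 1-18:3 the array steps get the IDs 1, 4, 7, …, 16, and each step handles its own ID plus the next two. If the last task number is not a multiple of the stepwidth, the final step would also start tasks beyond the end of the range. The sketch below shows one way to guard against this. tasks_for_step is a hypothetical helper name.

```bash
#!/bin/bash
# tasks_for_step is a hypothetical helper. It prints the task numbers one
# array step should run: ID, ID+1, ..., but never more than STEP numbers
# and never beyond LAST.
# Usage: tasks_for_step ID STEP LAST
tasks_for_step() {
    local id=$1 step=$2 last=$3
    local i=0 out=""
    while [ "$i" -lt "$step" ] && [ $((id + i)) -le "$last" ]; do
        out="$out$((id + i)) "
        i=$((i + 1))
    done
    # Remove the trailing blank before printing.
    echo "${out% }"
}

tasks_for_step 16 3 18   # last step of -t 1-18:3
tasks_for_step 16 3 17   # last step of -t 1-17:3, task 18 is left out
```

Inside a job script the three arguments would be $SGE_TASK_ID, $SGE_TASK_STEPSIZE and $SGE_TASK_LAST.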
Job arrays with multiple tasks within one SGE task step
In some cases, where a huge number of job tasks need to be started and each task's runtime is very short, the following construction can be used. It starts several tasks, one after another, on the specified nodes. Note the definition of the job stepwidth.
#$ -N job_array_with_multiple_single_tasks_on_one_node
#$ -pe mpich <N>
### set first and last task_id and stepwidth of array tasks.
### stepwidth should be identical with the
### number of jobs per node
#$ -t 1-18:3

for i in `seq 0 $( expr ${SGE_TASK_STEPSIZE} - 1 )`
do
  TASK=`expr ${SGE_TASK_ID} + $i`
  #the tasks run one after another; a "&" stored inside CMD would be
  #passed to the program as a literal argument, so it is left out
  CMD="run file_$TASK"
  #or
  #CMD="mpirun -np $NSLOTS ./a.out $TASK"
  $CMD
done
Specifying runtime limits
SGE jobs have two runtime limits: a soft limit (s_rt) and a hard limit (h_rt).
h_rt specifies the time after which all parts of the job script have to be finished. Processes that are still running are then killed by Grid Engine. Grid Engine sends a SIGUSR2 signal.
s_rt specifies the soft runtime limit, after which a SIGUSR1 signal is sent to the process. If s_rt is n times smaller than h_rt, SIGUSR1 is sent n times:
#!/bin/bash
#$ -N notify_test
#$ -pe mpich 2
#$ -notify
#$ -V
#$ -l h_rt=0:20:00
#$ -l s_rt=0:02:00

echo $TMPDIR

function sigusr1handler()
{
  date
  echo "SIGUSR1 caught by shell script" 1>&2
}

function sigusr2handler()
{
  date
  echo "SIGUSR2 caught by shell script" 1>&2
}

trap sigusr1handler SIGUSR1
trap sigusr2handler SIGUSR2

echo "starting:"
date

# Start
# -q 0: disable "MPI progress Quiescence" error message
#mpirun -q 0 -m $TMPDIR/machines -np $NSLOTS sleep 200

for i in {1..900}
do
  echo "waiting $i"
  sleep 10
done

echo "finished:"
date
The output in the error (*.e*) file of this job example is:
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 2
SIGUSR2 caught by shell script
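The handlers in the example above only print a message. A job can also use the soft limit to stop cleanly, for instance by saving its state. The following sketch shows this pattern and can be tried without Grid Engine, because the script sends SIGUSR1 to itself during step 3. The function name run_steps, the number of steps and the messages are made up for this illustration, and bash is assumed.

```bash
#!/bin/bash
# run_steps is a hypothetical example function. It works through five steps
# and stops early once SIGUSR1 has been received.
run_steps() {
    local step
    stop_requested=0
    trap 'stop_requested=1' USR1
    for step in 1 2 3 4 5; do
        echo "working on step $step"
        if [ "$step" -eq 3 ]; then
            # Stand-in for Grid Engine: signal the current shell process.
            kill -USR1 "${BASHPID:-$$}"
        fi
        if [ "$stop_requested" -eq 1 ]; then
            echo "SIGUSR1 received, saving state after step $step"
            break
        fi
    done
    # Restore the default handling of SIGUSR1.
    trap - USR1
}

run_steps
```

In a real job the kill line would be removed, since the signal then comes from Grid Engine once s_rt is reached.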
Modifying the machines file on VSC-1
In cases when not all CPUs of one node are required, the machines file can be modified to guarantee the right behaviour of mpirun. The $TMPDIR/machines file on VSC-1 consists of a number of machine/node names. Each name stands for one CPU on the given machine/node. For an exclusive job on 2 nodes the machine file looks like:
r10n01
r10n01
r10n01
r10n01
r10n01
r10n01
r10n01
r10n01
r12n10
r12n10
r12n10
r12n10
r12n10
r12n10
r12n10
r12n10
For running a job on fewer than eight cores per node, the $TMPDIR/machines file has to be replaced within the job script:
#$ -N test
#$ -pe mpich 16

NSLOTS_PER_NODE_AVAILABLE=8
NSLOTS_PER_NODE_USED=4
NSLOTS_REDUCED=`echo "$NSLOTS / $NSLOTS_PER_NODE_AVAILABLE * $NSLOTS_PER_NODE_USED" | bc `

echo "starting run with $NSLOTS_REDUCED processes; $NSLOTS_PER_NODE_USED per node"
for i in `seq 1 $NSLOTS_PER_NODE_USED`
do
  uniq $TMPDIR/machines >> $TMPDIR/tmp
done
sort $TMPDIR/tmp > $TMPDIR/myhosts
cat $TMPDIR/myhosts

mpirun -machinefile $TMPDIR/myhosts -np $NSLOTS_REDUCED sleep 2
The reduced form would look like:
r10n01
r10n01
r10n01
r10n01
r12n10
r12n10
r12n10
r12n10
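The same reduction can also be written as a single awk filter that keeps the first few entries of each host and leaves the order of the hosts unchanged. This is an alternative sketch and not the script used above. limit_per_host is a hypothetical helper name, and the demo uses a temporary file instead of $TMPDIR/machines.

```bash
#!/bin/bash
# limit_per_host is a hypothetical helper: keep at most N entries per host
# name and preserve the original order of the file.
# Usage: limit_per_host MACHINES_FILE N
limit_per_host() {
    # seen[] counts how often each host name has been printed so far.
    awk -v n="$2" 'seen[$0]++ < n' "$1"
}

# Demo with a small stand-in for $TMPDIR/machines.
machines=$(mktemp)
printf '%s\n' r10n01 r10n01 r10n01 r12n10 r12n10 r12n10 > "$machines"
limit_per_host "$machines" 2
rm -f "$machines"
```

In a job script the output would be redirected to a new file, which is then passed to mpirun with -machinefile.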
Find out memory usage of running jobs
qstat -F mem_free |grep -B 2 <job_id>
Using the local scratch on VSC-1
On VSC-1 a local scratch directory is available in /tmp. To use this directory in a parallel job in which the processes on each node need to access some data, the following script (not tested extensively, so bugs may remain) can be used to transfer the data to and from the nodes:
#!/bin/bash
#$ -l h_rt=72:00:00
#$ -cwd
#$ -N jobname
#$ -pe mpich 64
#$ -V
#$ -j yes

################################################
#####       start copying process           ####
################################################

#temporary directory on nodes:
tmp_dir=$TMPDIR/data

#naming of the tar files that should be distributed to
#the nodes. Each tar file should contain 8 subdirs.
#file names are completed by appending a number without leading
#zeros + tar.gz
input_tar_file_base=processes_

#naming of the outputfiles; number and tar.gz are appended automatically
output_tar_base=output_

#define which directories and files should be put into the output tar.gz
# not tested yet
PACKING=data\*

start_dir=`pwd`

#save output data to this directory
data_save="${start_dir}/data_${JOB_ID}"

nodes_uniq=$(cat $TMPDIR/machines| uniq)

#copy files per node
#tar.gz files contain 8 subdirectories for each process
j=0
for i in $nodes_uniq
do
  tar_file_name="${input_tar_file_base}${j}.tar.gz"
  echo "creating ${tmp_dir} on $i"
  ssh $i mkdir -p $tmp_dir
  echo "copying $tar_file_name to node $i"
  ssh $i cp ${start_dir}\/${tar_file_name} ${tmp_dir}
  ssh $i cp -r ${start_dir}\/tmp_dictionaries\/* ${tmp_dir}
  echo "extracting file"
  ssh $i "cd ${tmp_dir} ;tar -zxf ${tmp_dir}\/${tar_file_name}"
  j=$(echo "$j+1"|bc -l)
done

#command to run:
cd ${tmp_dir}
time mpirun -mca btl_openib_ib_timeout 20 -machinefile $TMPDIR/machines -np 64 $1 -parallel

#cp files per node back to start directory of job
echo "================================================="
j=0
for i in $nodes_uniq
do
  tar_file_name="${input_tar_file_base}${j}.tar.gz"
  output_tar_file="${output_tar_base}${j}.tar.gz"
  echo "creating ${output_tar_file} on node $i"
  echo "ssh $i \" cd ${tmp_dir} ;tar -zcf ${output_tar_file} $PACKING\""
  ssh $i " cd ${tmp_dir} ;tar -zcf ${output_tar_file} $PACKING"
  echo "copying file back"
  mkdir -p ${data_save}
  ssh $i cp ${tmp_dir}\/${output_tar_file} ${data_save}
  j=$(echo "$j+1"|bc -l)
done

# ----------------------------------------------------------------- end-of-file
Tight integration
Some MPI implementations are tightly integrated with SGE. In this case, a machinefile passed to mpirun may be ignored. To disable this behaviour, one can
- unset the variable PE_HOSTFILE within the job script with:
unset PE_HOSTFILE
- recompile the MPI libraries with disabled tight integration support
- use the mpirun command with the following options; the '-machinefile' option has to be omitted:
mpirun -npernode <number_of_processes_on_one_node> -np <number_of_total_processes> <command>
Using Resources of foreign Projects
Sometimes users from one project are allowed by the project manager of another project to use its resources, i.e. to run jobs within this project by using the '-P <project_name>' flag in the job script. Grid Engine jobs are always executed with the primary group of the user. If the primary group of a user does not match the project group given in the job, the job will be rejected.
To run a job in a project other than your primary group, proceed as follows. Define the '-P' flag and, if necessary, the '-q' flag in your jobscript 'job.sh':
#$ -N foreignProjectJob
### specify queue if necessary:
###$ -q p80000
#$ -P p80000
#$ -pe mpich 6
sleep 60
Submit the job using a wrapper script:
qsub.py job.sh
The wrapper script will change your primary group to that given by the '-P' flag and submit the job.
Changing job parameters of already submitted jobs (qalter)
Jobs that have already been submitted to the queue can be modified using the 'qalter' command. The most common use is changing the queue or hard resources such as runtime and free memory:
Change queue to 'long.q':
qalter -q long.q <job_id>
Change hard resources; first get the current resources with qstat:
qstat -j <job_id> |grep hard
Set new hard resources with modified string from above command:
qalter -l h_rt=60,<rest_of_string_from_above> <job_id>
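Editing the resource string by hand is error-prone, because all other hard resources have to be repeated unchanged. The substitution can be sketched as follows. set_h_rt is a hypothetical helper name, and the resource string in the demo is made up; the real string is the one printed by the qstat command above.

```bash
#!/bin/bash
# set_h_rt is a hypothetical helper: replace the h_rt entry in a
# comma-separated resource list and leave all other entries untouched.
# Usage: set_h_rt RESOURCE_LIST NEW_H_RT
set_h_rt() {
    printf '%s\n' "$1" | sed "s/h_rt=[^,]*/h_rt=$2/"
}

# Made-up resource list for illustration.
set_h_rt "h_rt=3600,mem_free=2G" 7200
```

The printed string is what would follow qalter -l, with the job ID as the last argument.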
Submitting a job from standard input (stdin)
echo "sleep 10" | qsub -N test
Redirecting output of jobs
Within Grid Engine
If nothing else is specified, the output of Grid Engine jobs is written to four files:
- <job_name>.o<job_id>(.<task_id>): STDOUT of the job
- <job_name>.e<job_id>(.<task_id>): STDERR of the job
- <job_name>.po<job_id>(.<task_id>): STDOUT of parallel environment of the job
- <job_name>.pe<job_id>(.<task_id>): STDERR of parallel environment of the job
This behaviour can be modified in the following way:
- '#$ -j y' joins STDOUT and STDERR, i.e. there will be a '.o' and a '.po' file
- '#$ -o /path/filename' sets the STDOUT to a certain filename, i.e. one gets the file specified with '-o' plus a '.e' and a '.pe' file
- '#$ -e /path/filename' sets the STDERR to a certain filename, three files in analogy to the '-o' option
- combination of '-j' with '-o' is possible, '-e' is ignored if '-j' is used.
When using the '-o' and '-e' options, the following pseudo variables can be used for specifying the path/filenames. NOTE: The variables cannot be used as in a bash environment, i.e. using ${HOME} will not work.
- $HOME home directory on execution machine
- $USER user ID of job owner
- $JOB_ID current job ID
- $JOB_NAME current job name
- $HOSTNAME name of the execution host
- $TASK_ID array job task index number
For further details see 'man qsub' on the login nodes of VSC.
Ordinary shell redirection
Since the job is executed in a shell environment, the output of any shell command can be redirected to any file:
echo hello > /path/file
Overcommitting memory of the compute nodes on VSC-2
By default only 32 GB of (the physically available) memory can be allocated by applications on the compute nodes of VSC-2. Some applications, however, have to allocate more memory than the physical limit (although allocated, it is never used). For such cases a kernel parameter of the compute nodes can be changed to allow overcommitment of memory:
#$ -N test
#$ -pe mpich 32
#$ -l overcommit_mem=true