Comp:cluster reservation
Revision as of 21:48, 10 February 2021
Batch jobs are handled by SLURM. However, sometimes you need all cores of one specific cluster node and you want to ask other theoretical chemistry group members not to start new jobs on that node. We first used Slack/Yoink to request nodes, but we now have a script to automate this process. The script will put the cluster node reservation requests in a database and from a bash script you can get a list of cluster nodes that have been reserved.
There can be only one reservation at a time for each cluster node.
Tutorial and Quick Start (for bash users)
Make sure to have /vol/thchem/bin in your bash PATH variable. In your .bashrc add
export PATH=$PATH:/vol/thchem/bin
On one of the cluster nodes and on lilo7 (but not on lilo6 and older), check this:
Cn info
You should get something like:
You are connected to database "batchq" as user "cnuser" on host "cn58" (address "131.174.30.158") at port "16034".
If you get Cn: command not found, ask someone to set up a .profile script.
From the command line you can get the manual by just typing Cn, or
Cn help
Use the spacebar and the b key to page back and forth through the manual; use q or Ctrl-C to quit the help info.
Have a look at current reservations:
Cn show
Reserve, e.g., cluster node cn31 for three and a half days:
Cn reserve cn31 3.5 "My molecule is big"
If the node was still available, you should get something like this:
  cn  |   who   | days |        time         | started |      comment
------+---------+------+---------------------+---------+--------------------
 cn31 | gerritg |  3.5 | 2021-02-10 21:17:18 |         | my molecule is big
(1 row)
At this point, the command
Cn exclude
will return a comma-separated list of reserved cluster nodes that includes cn31.
You can now submit a job to slurm asking for cluster node cn31 and the number of cores and memory that you need. For example, if you want your job to start when 40 cores and 512GB memory are available on cn31, use:
sbatch -n 40 --mem=512GB -w cn31 myjob.sh
Your job myjob.sh should update the table of reservations when it starts, so it should contain the line
Cn start
You can always include this line, even if you did not reserve anything: it only has an effect if you made a reservation for the node on which your job is running.
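A minimal job script with this line could look as follows; the calculation command and resource directives are only illustrative:

```shell
#!/bin/bash
#SBATCH --job-name=myjob        # illustrative SLURM directives
#SBATCH --output=myjob-%j.out

Cn start                        # mark the reservation as started
                                # (a no-op if this node is not reserved by you)

cd "$SLURM_SUBMIT_DIR"          # run from the directory the job was submitted in
./my_calculation input.dat      # placeholder for the actual work
```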
Your reservation expires the number of days you reserved after your job started. You can also cancel the reservation with
Cn cancel cn31
You can add or update the comment of your reservation with:
Cn comment cn31 "Thanks!"
This system only works if everyone uses the --exclude flag in their sbatch commands to honour the reservations:
Add the --exclude flag to your slurm sbatch jobs
To submit a job, use the following in your script (set the memory that you actually need):
EXCL=$(Cn exclude)
sbatch --exclude "$EXCL" --mem=32GB myjob.sh
You can also add nodes that you do not want to use even if they are available, e.g.
sbatch --exclude "$EXCL,cn58,cn86" --mem=32GB myjob.sh