Comp:cluster reservation

Latest revision as of 22:30, 19 February 2021

Batch jobs are handled by SLURM (https://wiki.science.ru.nl/slurm). However, sometimes you need all cores of one specific cluster node and you want to ask other theoretical chemistry group members not to start new jobs on that node. We first used Slack/Yoink to request nodes, but we now have a script to automate this process. The script puts the cluster node reservation requests in a database, and from a bash script you can get a list of the cluster nodes that have been reserved.

There can be only one reservation at a time for each cluster node.

Tutorial and Quick Start (for bash users)

Make sure to have /vol/thchem/bin in your bash PATH variable. In your .bashrc add

  export PATH=$PATH:/vol/thchem/bin

On one of the cluster nodes and on lilo7 (but not lilo6 and older) check this:

  Cn info

You should get something like:

  You are connected to database "batchq" as user "cnuser" on host "cn58" (address "131.174.30.158") at port "16034".

If you get Cn: command not found, create a file with the name .profile in your home directory. The file should contain these four lines:

   . /system.profile
   if [ "$SHELL" == /bin/bash ]; then
     test -s ~/.bashrc && . ~/.bashrc || true
   fi 

From the command line you can get the manual by just typing Cn, or

   Cn help

Use the spacebar and the b key to page back and forth through the manual; use q or Ctrl-C to quit the help.

Have a look at current reservations:

   Cn show

Reserve, e.g., cluster node cn31 for three and a half days:

   Cn reserve cn31 3.5 "My molecule is big"

If the node was still available, you should get something like this:

     cn  |   who   | days |        time         | started |      comment       
   ------+---------+------+---------------------+---------+--------------------
    cn31 | gerritg |  3.5 | 2021-02-10 21:17:18 |         | my molecule is big
   (1 row)

At this point, the command

   Cn exclude

will return a comma-separated list of reserved cluster nodes that includes cn31.

You can now submit a job to slurm asking for cluster node cn31 and the number of cores and memory that you need. For example, if you want your job to start when 40 cores and 512GB memory are available on cn31, use:

   sbatch -n 40 --mem=512GB -w cn31 my_job.sh

Your job my_job.sh should update the table of reservations when it starts, so it should contain the line

   Cn start

You can always include this line, even if you did not reserve anything: it only does something if you made a reservation for the node on which your job is running.
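As a sketch, such a job script could be generated like this; the computation line is a placeholder, and only the Cn start line comes from this page:

```shell
# Write a minimal job script that updates the reservation table when it
# starts. The actual computation is a placeholder -- replace it with your
# own program.
cat > my_job.sh <<'EOF'
#!/bin/bash
Cn start        # record the job start so the reservation clock begins
# ... your actual computation goes here ...
EOF
chmod +x my_job.sh
```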

Note: the job that uses the reserved node should NOT be submitted with the --exclude flag. Instead, make sure it runs on the reserved node using the -w flag:

   Cn reserve cn50 7
   sbatch -w cn50 -n 24 my_job.sh

Your reservation expires the reserved number of days after your job started.
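For example, if the job of a 3.5-day reservation started at the (hypothetical) moment below, the expiry moment is 3.5 days later; a sketch with GNU date, assumed available on the cluster nodes:

```shell
# Sketch: expiry of a 3.5-day reservation whose job started at a
# hypothetical moment. 3.5 days = 3.5 * 86400 = 302400 seconds.
started="2021-02-10 21:17:18"
secs=$(awk 'BEGIN { printf "%d", 3.5 * 86400 }')
date -u -d "$started UTC + $secs seconds" '+%Y-%m-%d %H:%M:%S'
# prints 2021-02-14 09:17:18
```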

To cancel a reservation use:

   Cn cancel cn31

You can add extra days to the reservation with:

   Cn add cn31 3

You can also use a negative number of days to shorten the reservation.

To add or update the comment of your reservation use:

   Cn comment cn31 "Thanks!"

This system will only work if everyone uses the --exclude flag in their sbatch command to honour the reservations:

Add the --exclude flag to your slurm sbatch jobs

To submit a job, use the following in your script (set the memory that you actually need):

   EXCL=$(Cn exclude)
   sbatch --exclude "$EXCL" --mem=32GB my_job.sh

You can also add nodes that you do not want to use even if they are available, e.g.

   sbatch --exclude "$EXCL,cn58,cn86" --mem=32GB my_job.sh
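How the exclude list composes can be sketched with a stub standing in for Cn (the real command queries the database; the node names below are made up):

```shell
# Stub standing in for `Cn exclude`; pretend cn31 and cn50 are reserved.
Cn() { echo "cn31,cn50"; }

EXCL=$(Cn exclude)
# Personal exclusions are simply appended after the reserved nodes:
echo "--exclude=$EXCL,cn58,cn86"   # prints --exclude=cn31,cn50,cn58,cn86
```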

What happens when the database is down?

To check whether the system is working:

   Cn status

If things are fine you will get:

   /vol/thchem/bin/Cn: The cluster node reservation system is up and running.

However, when the database is down, some things may still work. Unless things went horribly wrong, a comma-separated file with the reservations was written to disk, and the list of excluded nodes is also on disk.

In particular, the Cn exclude command will give this last known list of excluded nodes. It will also give an error message on standard-error (/dev/stderr), but that will not interfere with this line in your job, since it only uses the cluster node list that is written to standard-output /dev/stdout:

  EXCL=$(Cn exclude)
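That separation of stdout and stderr can be demonstrated with a stub (made-up node names; the stub mimics Cn exclude printing a warning on stderr and the cached list on stdout):

```shell
# Stub mimicking `Cn exclude` when the database is down: a warning on
# stderr, the last known node list on stdout.
Cn() { echo "Cn: warning: database down, using cached list" >&2
       echo "cn31,cn50"; }

EXCL=$(Cn exclude)   # command substitution captures stdout only
echo "$EXCL"         # prints cn31,cn50 -- no warning text mixed in
```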

The Cn show command will show the last known table with reservations.

The command

   Cn start

however, will not be able to update the table. Your commands will instead be written to a file, and they can later be entered "by hand" by whoever fixes the database problem. The list of recorded commands can be printed with:

   Cn recorded

This will only work well when the system was properly shut down. In case of an unintended crash, it will still show the filenames where the commands were recorded, but they will be labeled as stale, since it is not certain when the database stopped working.

Commands given while the system is down will also automatically generate an e-mail to gerritg@theochem.ru.nl.