How to use the Janus cluster
From: RC system announcements <firstname.lastname@example.org>
Date: Fri, 7 Jun 2013 19:00:31 -0600
We have phased in RedHat Enterprise Linux 6 as the default operating system across the CU Research Computing enterprise. Compared to our previous OS (RHEL5), RHEL6 rebases a wide range of system libraries and utilities to significantly newer versions. More importantly, it offers numerous new features that are important for high-performance computing, including scalable memory management, kernel performance counters, control groups, and the performance application programming interface (PAPI).
For more information on how the upgrade may affect you, please visit our documentation page and our FAQ page for tips and tricks. One major change is that you should set up your shell environment using "modules" rather than "dotkits". If you would like some in-person assistance in migrating your applications and workflows to the new environment, stop by our open office hours Monday and Tuesday (details at https://www.rc.colorado.edu).
The login nodes are now up and available for use as normal. Don't forget to use janus-compile1 - janus-compile4 for building software.
We have upgraded to newer versions of the Torque resource manager and Moab scheduler. Please check the RHEL6 documentation page for details on how to load the new versions. Note that any queued jobs that used job arrays or dependencies were not compatible with the new version of Torque and have been removed from the queue; they can be resubmitted.
Our testing today has revealed some instabilities in the InfiniBand fabric on Janus. As a result we are releasing only about half of the compute nodes while we work with the vendor to diagnose the IB network. We will start by opening just the crc-serial and janus-debug queues and will phase in additional queues as soon as possible.
Some nodes have changed names. login.rc.colorado.edu now rotates between login01.rc.colorado.edu - login04.rc.colorado.edu. The serial nodes are now cnode0101 - cnode0116. The himem nodes are now himem01 - himem03. You should never need to refer to any of these nodes by name, as the queue system will route your jobs to the appropriate nodes.
If you have any questions please contact us at email@example.com
Rc-announce mailing list
Check your allocation
use Crc-allocations check_allocation.py
New additions to the configuration
From this mailing list post:
There are 2 additional himem nodes. Two departments (IBG and Biofrontiers) contributed to purchasing additional large memory nodes for the himem queue. The specifications for the himem queue nodes are:
- 1 node @ 512 GB RAM, 2 TB local disk, 32 cores (64 w/hyperthreading)
- 2 nodes @ 1 TB RAM, 16 TB local disk, 40 cores (80 w/hyperthreading)
Although there is still a lot of work to be done configuring the scheduling for these nodes (contributors will get higher priority on these resources), they are available for use to the general community.
A new Dell M1000e chassis filled with 16 blades has been brought online for the crc-serial queue. Each blade has 96 GB RAM, 12 cores (24 w/hyperthreading) and 2 TB of local disk. Currently the crc-serial queue has no special restrictions other than that jobs there are limited to 1 node and 1 to 24 cores.
Both the crc-serial and himem queues have a maximum walltime of two weeks (336:00:00). Treat the current scheduling policies and maximum limits as a starting point: they will be adjusted in response to how the queues are used, to keep access to them as fair as possible.
The Torque tmpdir feature has been enabled in the scheduler. This means that every job gets a job-specific temporary directory, local to the node where the job's processes are running, and the scheduler sets the variable TMPDIR to point to this location. For example:
[joha8473 at login00 ~]$ qsub -I -q janus-debug
qsub: waiting for job 90934.torque.rc.colorado.edu to start
qsub: job 90934.torque.rc.colorado.edu ready
[joha8473 at node1779 ~]$ echo $TMPDIR
/local/scratch/90934.torque.rc.colorado.edu
[joha8473 at node1779 ~]$ df -h $TMPDIR
Filesystem Size Used Avail Use% Mounted on
tmpfs 12G 220K 12G 1% /local/scratch
[joha8473 at node1779 ~]$
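The same variable can be used from a batch script. The following is only a minimal sketch: the walltime value, the file name intermediate.txt, and the seq workload are placeholders, and the mktemp fallback exists only so the script can also be tried outside a job.

```shell
#!/bin/bash
#PBS -q janus-debug
#PBS -l walltime=00:10:00

# Use the scheduler-provided directory; fall back to a private mktemp
# directory when TMPDIR is unset (assumption: only for trying this off-cluster).
workdir="${TMPDIR:-$(mktemp -d)}"

# Placeholder workload: write an intermediate file, then summarize it.
seq 1 1000 > "$workdir/intermediate.txt"
line_count=$(wc -l < "$workdir/intermediate.txt" | tr -d ' ')
echo "wrote $line_count lines under $workdir"

# Remove only the file this script created.
rm -f "$workdir/intermediate.txt"
```

Because the script refers only to the variable and never hard-codes /local/scratch, it does not need to change if the location moves.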
How this storage is provided depends on the node you are using and is subject to change. However, if you use the TMPDIR variable in your job scripts, future changes we make to the location, or any new resources that are brought online, should be transparent. Each class of nodes has a /local/scratch as follows:
Janus Compute Nodes: /local/scratch is a tmpfs filesystem with 12 GB available. Because these nodes have no local disk for swap, this means all of /local/scratch is maintained in RAM. Files stored here will use memory that will then not be available for processes and so you should consider that when estimating how much memory your application needs to run.
Himem Nodes: /local/scratch is a tmpfs filesystem with 1.4 TB (512 GB node) or 3.1 TB (1 TB nodes) available. See performance notes below.
CRC-Serial Nodes: /local/scratch is a tmpfs filesystem with 1.8 TB available. See performance notes below.
VMware Compute Nodes: /local/scratch is a tmpfs filesystem with 16 GB available. These nodes have a very small amount of swap space available, but should be considered diskless much like the Janus Compute nodes for purposes of estimating memory usage of your applications + TMPDIR use.
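On the diskless node classes, files in TMPDIR count against the memory available to your processes. One rough way to fold that into a memory estimate is sketched below. The 2000 MB application figure, the 5 MB test file, and the directory name memcheck are hypothetical, and the /tmp fallback is only there so the sketch runs outside a job.

```shell
#!/bin/bash
# Report how much space a directory's contents occupy, in MB (rounded up).
used_mb() {
    local kb
    kb=$(du -sk "$1" | awk '{print $1}')
    echo $(( (kb + 1023) / 1024 ))
}

# Private subdirectory under TMPDIR (hypothetical name).
scratch=$(mktemp -d "${TMPDIR:-/tmp}/memcheck.XXXXXX")

# Placeholder workload: a 5 MB temporary file.
dd if=/dev/zero of="$scratch/blob" bs=1024 count=5120 2>/dev/null

app_mb=2000                     # hypothetical estimate of the application's own memory use
tmp_mb=$(used_mb "$scratch")    # memory consumed by files kept in TMPDIR
total_mb=$((app_mb + tmp_mb))
echo "plan for roughly $total_mb MB: $app_mb MB app + $tmp_mb MB in TMPDIR"

rm -rf "$scratch"
```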
Performance notes: Since the himem and crc-serial nodes have swap space on local disk, it's possible to store more in /local/scratch than the available RAM. However, as long as TMPDIR usage is below the amount of available RAM, all I/O to the TMPDIR runs at near-memory speeds. In our testing, even if RAM is exhausted, I/O performance is still quite good relative to a filesystem on disk. This means that even if your application is not technically a "himem" application, but uses a lot of local temp files during execution, it may run a lot faster on a himem or serial node. We'd be happy to work with you to find the optimal location for a specific application or workflow and/or to develop new resources where necessary. We've already seen examples of applications that switched from being I/O bound on other available scratch locations to being CPU bound on the RAM-based TMPDIR.
You can request TMPDIR space in your job to ensure that at least that much is available at job start. However, this does not limit you (or anyone else running on the node) to using that much, so available and used space can vary over the lifetime of an individual job. For example:
- Request that at least 2 GB be available:
[joha8473 at login00 ~]$ qsub -I -q janus-debug -l file=2g
qsub: waiting for job 90947.torque.rc.colorado.edu to start
qsub: job 90947.torque.rc.colorado.edu ready
[joha8473 at node1761 ~]$ exit
- Request too much for a Janus node:
[joha8473 at login00 ~]$ qsub -I -q janus-debug -l file=16g
qsub: waiting for job 90949.torque.rc.colorado.edu to start
- From another terminal:
[joha8473 at login00 ~]$ checkjob 90949 | grep Rejection
Node Rejection Summary: [Disk: 206][State: 1074]
In this case the 206 nodes that the job could otherwise run on are rejected for not having enough disk available in /local/scratch.
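In a batch script, the same request can be written as a #PBS directive. Since the request only guarantees space at job start, a job may also want to check what is actually free before writing. The sketch below assumes a 2 GB need. The free_kb helper is hypothetical, and the fallback to the current directory is only so the check can be tried outside a job.

```shell
#!/bin/bash
#PBS -q janus-debug
#PBS -l file=2g

# Hypothetical helper: free space, in KB, of the filesystem holding a directory.
free_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

scratch="${TMPDIR:-$PWD}"          # fallback is only for trying this off-cluster
needed_kb=$((2 * 1024 * 1024))     # 2 GB, matching the request above
avail_kb=$(free_kb "$scratch")

if [ "$avail_kb" -lt "$needed_kb" ]; then
    echo "only ${avail_kb} KB free in $scratch; wanted ${needed_kb} KB" >&2
else
    echo "${avail_kb} KB free in $scratch"
fi
```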
In recent weeks we've seen increasing use of /tmp for local files and, over time, an increasing amount of space taken up by files that jobs did not successfully clean up. This is particularly problematic on Janus nodes, where leaving several GB of files in /tmp can cause the next job to run the node out of memory. Over the next few weeks we'll be putting a limit in place on /tmp to restrict how much space is available there, so if you are currently using /tmp, you are strongly encouraged to migrate to the TMPDIR location provided by the scheduler. Currently there is a health_check which removes a node from scheduling if /tmp has more than 2 GB of contents; we plan to replace that with the more robust /tmp limits and usage of the TMPDIR space. Writing to /local/scratch outside of the TMPDIR location is also not advised, as anything residing outside of a valid TMPDIR path for a currently running job is considered eligible for immediate deletion (and this deletion may be added to the node health_check very soon).
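One way to avoid leaving files behind is to keep all temporary data in a private subdirectory of TMPDIR and remove it from an exit trap. This is a sketch only: the names myjob and work.dat are hypothetical, and the /tmp fallback exists solely so the script can be tried outside a job.

```shell
#!/bin/bash
# Create a private subdirectory under the scheduler's TMPDIR.
# (Falls back to /tmp only when TMPDIR is unset, i.e. outside a job.)
scratch=$(mktemp -d "${TMPDIR:-/tmp}/myjob.XXXXXX")

# Remove the directory whenever the script exits, including on errors.
cleanup() {
    rm -rf "$scratch"
}
trap cleanup EXIT

# Placeholder workload: everything temporary goes under $scratch.
echo "temporary data" > "$scratch/work.dat"
echo "working in $scratch"
```

The trap runs on normal exit and when the script stops early because of an error, so the directory is removed in both cases.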
If you have any questions please contact us at rc-help at colorado.edu. We hope these new resources and changes are a positive step; if not, we would like to hear your views so they can be adjusted in a positive direction.
Running multiple serial jobs
- You can use the crc-serial queue, which allocates resources per core.
- The scheduler allocates only complete nodes on all the janus-* queues. This means that if you run a single serial job from your PBS script on any janus-* queue, the job will use only 1 core and leave the other 11 cores unused. Instead, bundle several serial jobs into one script so that all 12 cores are used.
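A common way to bundle serial work is to start one background task per core and wait for all of them. The sketch below assumes 12 cores per node, as implied above. The run_task function is a hypothetical stand-in for your serial program, the directory name bundle is hypothetical, and the nodes=1:ppn=12 request is an assumption about how the job would be sized.

```shell
#!/bin/bash
#PBS -q janus-debug
#PBS -l nodes=1:ppn=12

ncores=12   # cores per Janus node, as implied by the text above

# Private output directory (falls back to /tmp only outside a job).
outdir=$(mktemp -d "${TMPDIR:-/tmp}/bundle.XXXXXX")

# Hypothetical stand-in for one serial task; replace with the real program.
run_task() {
    echo "task $1 done" > "$outdir/task_$1.out"
}

# Launch one background task per core, then wait for all of them to finish.
for i in $(seq 1 "$ncores"); do
    run_task "$i" &
done
wait

finished=$(ls "$outdir"/task_*.out | wc -l | tr -d ' ')
echo "$finished of $ncores tasks finished"
```

In a real job, copy any results you need out of the temporary directory before the script ends.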