Gross user's documentation
This is a legacy page and it is no longer maintained. For archival purposes only.
- This page is for actions that can be done as a regular user only. See Gross administrator's documentation for what you can do if you have special powers.
Gross cluster uses the operating system StackIQ a.k.a. Rocks+ version 6.5 (Joe).
Note: The data on the research clusters is not backed up. Please backup your important data as needed.
Using the front end
The front end has 16 cores (32 virtual) and 128GB memory. It is meant to be used for all interactive work.
Using compute nodes
The cluster has 12 compute nodes, compute-0-0 through compute-0-11, each with 12 cores and 24GB memory. Gross uses the Sun Grid Engine job scheduler (henceforth SGE). SGE is an alternative to the Portable Batch System and can be used similarly. SGE allocates submitted jobs to the compute nodes, by default up to 12 processes per node.
NEVER ssh directly to a compute node and run code there. This will interfere with others on the system because the scheduler will still allocate jobs to that node. Please always use the SGE scheduler.
Using special nodes
The following nodes can be used interactively.
compute-0-12 is a new large memory node with 396GB memory and 16 cores (32 virtual)
- there is currently no
compute-0-14 is one of the original disk array nodes; it has the /storage2 filesystem mounted. It has a slower processor than compute nodes 0-11.
compute-0-15 is another original disk array node; it has the /storage filesystem mounted. It has a slower processor than compute nodes 0-11.
compute-0-16 is the old original large memory front end, which was starting to have memory reliability problems and now runs reduced to about 100GB memory.
Compiling and running MPI jobs
- select an MPI environment, e.g. mvapich2-1.9, using mpi-selector-menu, then log out and back in.
- build your executable using the MPI compiler wrappers (e.g. mpicc or mpif90)
- create a batch job script like this:
#!/bin/bash
#$ -pe mpich 120
#$ -cwd
#$ -j y
#$ -S /bin/bash
# to limit run time, you can add a line like this:
#$ -l h_rt=hours:minutes:seconds
# list environment and nodes to help diagnose problems
env
cat $TMPDIR/machines
# run mpi job
mpirun_rsh -np $NSLOTS -hostfile $TMPDIR/machines $PWD/<your executable>
- make the script executable:
chmod +x <your batch job script file>
- submit the job:
qsub <your batch job script file>
- use qstat -f to check where and how the job is running
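Putting the steps above together, the whole batch cycle can be sketched as a short shell session. The executable name my_mpi_prog is a placeholder; the SGE directives are the ones from the example script above.

```shell
# Sketch of the full batch-job cycle; 'my_mpi_prog' is a placeholder name.
# Write the job script (same directives as the example above):
cat > my_job.sh <<'EOF'
#!/bin/bash
#$ -pe mpich 120
#$ -cwd
#$ -j y
#$ -S /bin/bash
env
cat $TMPDIR/machines
mpirun_rsh -np $NSLOTS -hostfile $TMPDIR/machines $PWD/my_mpi_prog
EOF
chmod +x my_job.sh
# On the front end you would then run (not executed here):
#   qsub my_job.sh    # submit to SGE
#   qstat -f          # see where and how the job is running
```

With -pe mpich 120 and the default of up to 12 processes per node, such a job spans all 12 compute nodes.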
PGI and MPI
The MPI build that comes with the PGI compilers is generic: it runs over Ethernet rather than the fast interconnect and does not seem to work for batch jobs. We will need to build a version of MPI that supports Infiniband and SGE.
- To see the status of the nodes, use the command
- To get a more detailed look at which jobs are running on which node in which queue, use
qstat -f -u '*'
- To look at the detailed status of a job, use
qstat -f -j [job id]
- To run a command on all nodes, use
rocks run host compute 'command' (do not forget the quotes around the command if it contains spaces)
- To kill all your processes on all nodes, use
rocks run host compute 'killall -u $USER'
- To check what processes you are running, use
rocks run host compute 'ps -f -u $USER'
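Quoting matters in the rocks run host commands above: with single quotes, a variable like $USER is expanded on each compute node rather than on the front end. A minimal illustration using plain shell variables (no rocks needed):

```shell
# The quoting rule for rocks run host, shown with plain variables:
# single quotes keep $USER verbatim, so it expands on each compute node;
# double quotes expand it on the front end before the command is sent.
single='killall -u $USER'   # the remote node decides who $USER is
double="killall -u $USER"   # already expanded to your login name here
echo "$single"
echo "$double"
```

Since every node sees the same user database, both forms usually behave the same here, but single quotes are the safer habit.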
Please do not use the compute nodes directly by mpirun or run any commands on them other than to look at your processes or to kill your processes that may be left after MPI jobs. Only use the SGE command qsub to submit jobs as described here.
Once the job is submitted, the queue and run status can be inquired with qstat -f. Job IDs of all active jobs can be listed using the qstat command; use qstat -f for a full listing.
A job that has been submitted to the queue can be canceled (or terminated) using the qdel command. This command takes as an argument the job ID to be cancelled.
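The job ID for qdel comes from the first column of qstat output. A small sketch of pulling the IDs out with awk; the sample qstat text below is illustrative, not from a real run:

```shell
# Sketch: extract job IDs from qstat output so they can be fed to qdel.
# The sample output below is illustrative, not from a real run.
sample_qstat='job-ID  prior    name     user   state
---------------------------------------------------
    101 0.55500  mpi_job  alice  r
    102 0.55500  mpi_job  alice  qw'

# The first column after the two header lines is the job ID.
ids=$(echo "$sample_qstat" | awk 'NR > 2 {print $1}')
echo "$ids"
# Each ID could then be cancelled with: qdel <job id>
```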
You may ssh to a node to check on your job with ps -af or top, or to kill any runaway processes (sometimes a failed job leaves processes behind that consume 100% CPU), but again, please do not run any computations on any of the nodes directly.
For complete documentation of SGE job scripts see the SGE man pages.
The gross and colibri clusters work per the documentation provided by StackIQ.
The documentation they provide can be accessed at:
Scheduler info is at http://gross.ucdenver.edu/roll-documentation/ogs/2011.11p1/using.html. See http://gridscheduler.sourceforge.net/howto/GridEngineHowto.html for an overview.
To access these instructions, users need to be connected to the university network directly or via VPN.
Click on “Help” on the lower left of the screen then select the specific Documentation wanted.
Matlab is installed and working on both the gross and colibri systems and on all nodes. To launch, use:
(Remember the 25 concurrent license limit)
The PGI compilers are set up and working on the front end. However, MPI for PGI over Infiniband is not available (yet); the MPI environment below is the generic Ethernet version that came with the compilers, and if you use it, it conflicts with the system MPI that is set up by mpi-selector-menu.
To ensure compiled code will run on all the nodes (hardware varies within the cluster), use the compiler switch -tp=x64. For example, create a binary using:
pgf77 -tp=x64 source.F -o binary
Otherwise the binary may crash with an illegal instruction on nodes other than the front end.
To set up your environment to use PGI, source one of the following, depending on which shell you are using:
/share/apps/pgi/linux86-64/14.9/mpi.csh
/share/apps/pgi/linux86-64/14.9/mpi.sh
/share/apps/pgi/linux86-64/14.9/pgi.csh
/share/apps/pgi/linux86-64/14.9/pgi.sh
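Choosing between the .sh and .csh files can be sketched as follows; the paths are the ones listed above, and my_shell is just an illustrative variable you would set to match your login shell:

```shell
# Sketch: pick the PGI setup file by shell family; paths from this page.
pgi_root=/share/apps/pgi/linux86-64/14.9
my_shell=bash                       # change to csh/tcsh as appropriate
case "$my_shell" in
  csh|tcsh) setup="$pgi_root/pgi.csh" ;;   # C-shell family
  *)        setup="$pgi_root/pgi.sh"  ;;   # bash, sh, ksh
esac
echo "source $setup"                # run this line in your login shell
```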
Once the environment is set, details on using the PGI compilers are available from the man pages, using something like:
man pgf77
Of course, more documentation is available at:
One 100TB disk array is local on gross and it houses the home file system.
The disk arrays are not backed up. Save unique files elsewhere!
Old file location (prior to Rocks installation):
storage2 is mounted on compute-0-14 as /storage2
storage is mounted on compute-0-15 as /storage
These are set up for archival storage for now. Copy files to a location under your home directory if they need to be available from all nodes. The home file system is the local 100TB file system.
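The copy itself is an ordinary cp run on the node where the array is mounted. A sketch using a throwaway demo directory in place of the real mount points (on the cluster you would log in to compute-0-15 and use the real paths):

```shell
# Demo of staging a file from the archival array into the home file system.
# 'demo' stands in for the real mount points; on the cluster, run the cp
# on compute-0-15 (where /storage is mounted) with the real paths, e.g.
#   cp /storage/<your file> ~/
mkdir -p demo/storage demo/home
echo "archived results" > demo/storage/results.dat
cp demo/storage/results.dat demo/home/
ls demo/home
```

Because the home file system is mounted on all nodes, the copied file is then visible to batch jobs everywhere.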