Gross and Colibri administrator's documentation

From CCM
This page covers actions that may require administrator privileges. For regular-user tasks, see Gross user's documentation and Colibri user's documentation.

Gross and Colibri now run ROCKS+ from StackIQ.



How to get rid of a hung job

  • As root, run qdel -p <job id> to remove the job from the queue.
  • If the job does not clear, run qdel -f <job id> to force removal.
  • Run qstat -f to get a full listing that includes the load on each node. A large load on a node with no job running indicates runaway processes.
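
The escalation in the first two bullets can be scripted. The following is a minimal sketch, not a tested site procedure: the function name remove_hung_job and the WAIT_SECS variable are hypothetical, the qdel flags are copied from the bullets above, and the check assumes the job id is the first column of plain qstat output.

```shell
# Sketch: purge a job, then force removal only if it is still listed.
# remove_hung_job and WAIT_SECS are hypothetical names, not scheduler features.
remove_hung_job() {
    job_id="$1"
    if [ -z "$job_id" ]; then
        echo "usage: remove_hung_job <job id>" >&2
        return 2
    fi

    # First attempt, as in the first bullet.
    qdel -p "$job_id"

    # Give the scheduler a moment to update its queue.
    sleep "${WAIT_SECS:-5}"

    # Assumption: the job id (optionally followed by ".server") is the
    # first field of each job line in qstat output.
    if qstat | awk -v id="$job_id" \
        '$1 == id || index($1, id ".") == 1 { found = 1 } END { exit !found }'; then
        # Still listed: force removal, as in the second bullet.
        qdel -f "$job_id"
    fi
}
```

Run as root on the frontend, for example remove_hung_job <job id>. Checking qstat before forcing keeps the forced removal as a fallback rather than the default.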

How to add a user

As root on the frontend:

useradd -m <username>
rocks sync users

Make an account for the user on this wiki and explain that maintaining a project page as described at Projects on the Gross cluster is a condition of continued access.

Add the user to the mailing list using the GUI list management tools. You must be a list owner to have permission to do this.
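
The two frontend commands above (useradd and rocks sync users) can be wrapped so that an existing account is not touched by accident. This is a minimal sketch under that assumption; add_cluster_user is a hypothetical helper name, and the wiki account and mailing list steps remain manual.

```shell
# Sketch: create the account only if it does not already exist,
# then propagate it with rocks sync users.
# add_cluster_user is a hypothetical helper name.
add_cluster_user() {
    user="$1"
    if [ -z "$user" ]; then
        echo "usage: add_cluster_user <username>" >&2
        return 2
    fi
    # id exits non-zero when the account is unknown.
    if id "$user" >/dev/null 2>&1; then
        echo "user $user already exists" >&2
        return 1
    fi
    # Same two commands as above; sync only if useradd succeeded.
    useradd -m "$user" && rocks sync users
}
```

Run as root on the frontend, for example add_cluster_user <username>.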

Adding packages

The easiest way to add packages is with yum. You may need to add the EPEL repository first:

yum install epel-release

Then add packages with yum install. On the front end:


On the compute nodes:


How to tell if a node has a GPU

grep nvidia /proc/devices
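
/proc/devices lists the device drivers currently registered with the kernel, so the grep above matches only when the NVIDIA kernel driver is loaded on the node. The following is a minimal sketch that wraps the check so it can be scripted; the function name node_has_nvidia and its optional file argument are hypothetical, added so the check can be tried against a sample file.

```shell
# Sketch: succeed if an nvidia entry appears in the devices list.
# node_has_nvidia and its optional argument are hypothetical.
node_has_nvidia() {
    devices_file="${1:-/proc/devices}"
    grep -q nvidia "$devices_file"
}

# Usage on a node:
if node_has_nvidia; then
    echo "nvidia driver registered"
else
    echo "no nvidia entry in /proc/devices"
fi
```

To check many nodes at once, the original one-liner can be passed as the quoted command to rocks run host, in the same way as the examples under Running commands on nodes.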

How to build MPI for a new compiler

Coming soon.

How to reinstall compute nodes

rocks set host boot compute action=install
rocks run host compute 'reboot'

Set the boot action before rebooting. When the compute nodes reboot, they will PXE boot, and the PXE response will instruct them to install.

Running commands on nodes

Using rocks

  • reboot all compute nodes: rocks run host compute 'reboot'
  • reboot the listed nodes: rocks run host <node 1> <node 2> ... <node n> 'reboot'
  • run a command on nodes 0-11, as root: rocks run host `cat ~/compute_nodes` 'command'
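
The last bullet relies on a file, ~/compute_nodes, whose contents are not shown here. Assuming it simply holds one node name per line, a sketch like the following could generate it; the helper name node_names is hypothetical, and the compute-0-N naming is taken from the pdsh examples on this page.

```shell
# Sketch: print compute-0-<first> through compute-0-<last>, one per line.
# node_names is a hypothetical helper name.
node_names() {
    i="$1"
    last="$2"
    while [ "$i" -le "$last" ]; do
        echo "compute-0-$i"
        i=$((i + 1))
    done
}

# Example: write nodes 0-11 to the list file used above.
# node_names 0 11 > ~/compute_nodes
```

Keeping the list in a file means the same set of nodes can be reused across several rocks run host invocations.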

Using pdsh

Some example commands to run on the Gross frontend:

pdsh -w compute-0-[0-12,14-16] free | grep Mem | sort -V | gawk '{print $1" total memory "$3}'

This runs free on each listed node, keeps only the Mem line with grep, sorts the lines by node name with sort -V, and uses gawk to print the node name and total memory. The values are in KiB, the default unit of free. Here is the result:

compute-0-0: total memory 24729604
compute-0-1: total memory 24729604
compute-0-2: total memory 24729604
compute-0-3: total memory 24729604
compute-0-4: total memory 24729604
compute-0-5: total memory 24729604
compute-0-6: total memory 24729604
compute-0-7: total memory 24729604
compute-0-8: total memory 24729604
compute-0-9: total memory 24729604
compute-0-10: total memory 24729604
compute-0-11: total memory 24729604
compute-0-12: total memory 397030476
compute-0-14: total memory 24729620
compute-0-15: total memory 24729700
compute-0-16: total memory 99194776

So compute-0-0 through compute-0-11 are the normal compute nodes.
compute-0-12 is the large-memory node.
There is currently no compute-0-13 node.
compute-0-14 is one of the original disk array nodes.
compute-0-15 is the second disk array node.
compute-0-16 was the original frontend, which was starting to have memory reliability problems.
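
The filtering stage of the memory pipeline can be tried without cluster access by feeding it canned input. This is a minimal sketch: mem_report is a hypothetical wrapper name, the sample lines are made up in the shape that pdsh gives free output (a node-name prefix, then the Mem and Swap rows), and awk stands in for gawk since the program uses no gawk-specific features.

```shell
# Sketch: the same grep | sort -V | awk stages, reading from stdin.
# mem_report is a hypothetical wrapper name.
mem_report() {
    grep Mem | sort -V | awk '{print $1" total memory "$3}'
}

# Made-up sample input, deliberately out of order.
sample='compute-0-10: Mem: 24729604 1200000 23529604
compute-0-10: Swap: 1000000 0 1000000
compute-0-2: Mem: 24729604 1200000 23529604
compute-0-2: Swap: 1000000 0 1000000'

printf '%s\n' "$sample" | mem_report
```

With sort -V the numeric parts of the node names are compared as numbers, so compute-0-2 is listed before compute-0-10; a plain sort would put compute-0-10 first.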

Another example command:

  pdsh -w compute-0-[0-12,14-16] uptime

We can see if all nodes can connect directly to the internet:

pdsh -w compute-0-[0-12,14-16] "ping -c 1 <external host>" | sort -V

