NCI

Raijin User Guide


→ Getting Started

Access

  • To gain access to our system, you first need to complete the Registration and Connection forms. The Connection form is required in order to get a username and password.
  • To apply for new projects, please fill in the form here: Application for Resource
  • If you want to update your details, please fill in the form here: Update User Details

Logging In

To log in from your local desktop or another NCI computer, run ssh:

ssh -l userid raijin.nci.org.au

Your ssh connection will be to one of six possible login nodes, raijin[1-6]. If ssh to raijin fails, try specifying one of the nodes directly, e.g. raijin3.nci.org.au. As usual, for security reasons we ask that you avoid setting up passwordless ssh to raijin. Entering your password every time you log in, or using a specialised ssh agent, is more secure.
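
The fallback to an individual login node can be sketched as a small shell loop. This is only an illustration: raijin_hosts and try_raijin_login are hypothetical helper names, and the sketch assumes that an ssh exit status of 255 means the connection itself failed.

```shell
# Print the generic login alias followed by the six individual login nodes.
raijin_hosts() (
    echo "raijin.nci.org.au"
    i=1
    while [ "$i" -le 6 ]; do
        echo "raijin$i.nci.org.au"
        i=$((i + 1))
    done
)

# Try each host in turn until a session is established.
# Usage: try_raijin_login userid   (you are still prompted for your password)
try_raijin_login() {
    for host in $(raijin_hosts); do
        ssh -l "$1" "$host"
        # ssh exits with 255 when the connection failed; any other status
        # means a session ran on that host, so stop trying.
        [ "$?" -ne 255 ] && return 0
    done
    return 1
}
```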

Connecting under Unix/Mac:

  • For ssh – ssh
  • For scp/sftp – scp, sftp
  • For X11 – ssh -Y; make sure you have installed XQuartz if you are on OS X 10.8 or higher.

Connecting under Windows:

  • For ssh – putty, mobaxterm
  • For scp/sftp – putty, Filezilla, winscp, mobaxterm
  • For X11 – Cygwin, XMing, VNC, mobaxterm

If you are connecting for the first time, please change your initial password to one of your own choice via the passwd command, which will prompt you as shown below. (Note that % is the command prompt supplied by the interactive “shell”, as in all examples in this document; it is not something you type in.)

% passwd
Old password:
New password:
Re-enter new  password:

Interactive Use and Basic Unix

The operating system on all systems is Linux. You can read our Unix quick reference guide for basic usage. 

When you log in you will come in under the Resource Accounting SHell (referred to as RASH), which is a local shell used to impose interactive limits and account for the time used in each interactive session.

Your account will be set up with an initial environment via a default .login file, and an equivalent .profile file, as well as a .rashrc file. The .rashrc file can be edited to change the default project (see Project Accounting) and the command interface shell to be started by RASH as you log in. Your initial command interface shell will be bash. You can change this to tcsh by changing the line in .rashrc from

setenv SHELL /bin/bash

to be

setenv SHELL /bin/tcsh

instead. Other shells including ksh are available but may not provide the same support for modules as tcsh and bash. There has been a local modification made for ksh and details of that are here. If you try to use a shell not registered with RASH for the particular machine, you will default to bash.

Each interactive process you run on the login nodes is subject to a time limit (30 minutes) and a memory limit (2GB). If you want to run a longer or more memory-intensive interactive job, please submit an interactive job (qsub -I); see Interactive PBS Jobs in the section below for more details.


Login Environment

At login you will not be asked which project to use. A default project will be chosen by the login shell if one is not already set in ~/.rashrc. You can change your default project by editing .rashrc in your home directory. To switch to a different project for interactive use after you have logged in, use the following command:

switchproj project_name

Note that this is just for interactive sessions. For PBS jobs, use the -P option to specify a project.



→ Accounting

Monitoring Resource Usage

  • nci_account displays the usage of the project in the current quarter, as well as some recent history of the project if available. It also shows /short and massdata storage usage for the projects you are connected to. Use -v to display detailed accounting information per user.
  • lquota displays your disk usage and quota in your home directory and the /short/project/ directories.
  • short_files_report reports on file usage in /short. Use -G project to see the location and usage of files in /short owned by the group, and -P project to see group and user information for files in the /short/ folder.
  • nf_limits -P project -n ncpus -q queue displays the walltime and memory limits for the user. More default resource limits can be found in the section Queue Limits below.



→ Job Submission and Scheduling

Queue Structure

The systems have a simple queue structure with two main levels of priority; the queue names reflect their priority. There is no longer a separate queue for the lowest priority “bonus jobs” as these are to be submitted to the other queues, and PBS lowers their priority within the queues.

express:
  • high priority queue for testing, debugging or quick turnaround
  • charging rate of 3 SUs per processor-hour (walltime)
  • small limits particularly on time and number of cpus
normal:
  • the default queue designed for all production use
  • charging rate of 1 SU per processor-hour (walltime)
  • allows the largest resource requests
copyq:
  • specifically for IO work, in particular mdss commands for copying data to the mass-data system.
  • Note: always use -l other=mdss when using mdss commands in copyq. This ensures that jobs only run when the mdss system is available.
  • runs on nodes with external network interface(s) and so can be used for remote data transfers (you may need to configure passwordless ssh).
  • tar, compression and other manipulation of /short files can be done in copyq.
  • purely compute jobs will be deleted whenever detected.
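
The charging rates above can be turned into a rough estimate of what a job will cost. The sketch below assumes the charge is simply rate x cpus x walltime in whole hours; su_charge is a hypothetical helper name, and no rate is assumed for copyq because none is listed here.

```shell
# Estimate the charge in SUs for a job: rate x ncpus x walltime hours.
# Rates are the ones listed above: express = 3, normal = 1.
# Usage: su_charge queue ncpus hours
su_charge() (
    queue=$1 ncpus=$2 hours=$3
    case "$queue" in
        express) rate=3 ;;
        normal)  rate=1 ;;
        *) echo "no charging rate listed for queue: $queue" >&2; exit 1 ;;
    esac
    echo $((rate * ncpus * hours))
)

su_charge normal 32 2    # prints 64
su_charge express 32 2   # prints 192
```

The same 32-cpu, 2-hour job costs three times as much in express, which is why that queue is best kept for testing and debugging.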

Bonus Time

Most projects can continue to submit jobs when their time allocation is exhausted – such jobs are called “bonus jobs”.

bonus jobs:
  • queue at a lower priority than other jobs and will generally only run if there are no non-bonus jobs
  • are more suspendable than non-bonus jobs
  • make use of otherwise idle cycles while minimally hindering other jobs
  • may be terminated if they are impeding normal jobs or for system management reasons (usually jobs are just suspended)
  • Please note jobs requesting more than 160 cpus will never run when the project is in bonus. You will have to reduce the number of cpus in your job request or wait until next quarter.

PBSPro Scheduling

There are many reasons jobs may be prevented from starting. The first thing to do is to run “qstat -s jobid”; this will print the comments from the job scheduler about your job.

  • If you see a “--” after the job, it means the scheduler has not yet considered your job. Be patient.
  • If you see “Storage resources unavailable”, it means that you have exceeded one of your storage quotas. Run “nci_account” to get more information.
  • If you see “Waiting for software licenses”, it indicates that all the licenses for a software package you have requested are currently in use.
  • If you see “Not Running: Insufficient amount of resource ncpus”, it indicates that all the cpus are busy. Please be patient: PBSPro scheduling is based on the resources available and requested; see our scheduling policy for more details. Furthermore, at the beginning and close to the end of each quarter the number of jobs increases significantly compared to other periods, so waiting times are longer. You can also find out about the current raijin usage at our website:

    http://nci.org.au/nci-systems/current-job-details/


PBSPro Basics

We are using PBSPro for job submission and scheduling. For example, a sample job script looks like this:

   #!/bin/bash
   #PBS -P a99 
   #PBS -q normal 
   #PBS -l walltime=20:00:00
   #PBS -l mem=300MB 
   #PBS -l wd
   ./a.out

You submit this script for execution by PBS using the command:

   % qsub jobscript

More detailed PBSPro usage can be found in How to use PBS.

Note: Please make sure you specify #PBS -lother=gdata1 when submitting jobs accessing files in /g/data1. If /g/data1 filesystem is not available, your job will not start.


Interactive PBS Jobs

The -I option for qsub will result in an interactive shell being started out on the compute nodes once your job starts. A submission script cannot be used in this mode - you must provide all qsub options on the command line. To use X windows in an interactive batch job, include the -X option when submitting your job – this will automatically export the DISPLAY environment variable.

Your job is subject to all the same constraints and management as any other job in the same queue. In particular, it will be charged on the basis of walltime, the same as any other batch job, since you will have dedicated access to the cpus reserved for your request. Don't forget to exit your interactive batch session when finished to avoid both leaving cpus idle on the machine and wasting your grant!

Interactive batch jobs are likely to be used for debugging large or parallel programs etc. Since you want interactive response, it may be necessary to use the express queue to run immediately and avoid your session being suspended. However the express queue attracts a higher charging rate, so again avoid leaving the session idle.
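
Because no submission script can be used, every option has to go on the qsub command line. The sketch below only prints such a command line rather than running it; interactive_qsub_cmd is a hypothetical helper name, and the project code, cpu count, memory and walltime are placeholder values.

```shell
# Print (rather than run) a qsub command line for an interactive session
# with X windows in the express queue.
# Arguments: project, ncpus, mem, walltime.
interactive_qsub_cmd() {
    echo "qsub -I -X -q express -P $1 -l ncpus=$2,mem=$3,walltime=$4"
}

# Example with placeholder values; enter the printed line at the prompt to submit.
interactive_qsub_cmd a99 4 8GB 01:00:00
```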


Queue Limits

The command nf_limits -P project -n ncpus -q queue will show your current limits. 
If you require exemptions to these limits please contact help@nf.nci.org.au.

The current default walltime and cpu limits for the queues are as follows:

Queue   | Sub-queue       | Max jobs queuing (running) per project | Available memory per node | Default cpu limit
express | express (route) | 5 queuing only                         | ---                       | ---
        | express-node    | 5 (3)                                  | 126GB                     | <=16
        | express-def     | 5 (3*)                                 | 32GB                      | <=128
normal  | normal (route)  | 200 queuing only                       | ---                       | ---
        | normal-node     | 200                                    | 126GB                     | <=16
        | normal-def      | 200                                    | 128GB                     | 16<cpu<512
        | normal-hi       | 200                                    | 128GB                     | 512<=cpu<=56064
copyq   | copyq           | 200 (25)                               | 32GB                      | 1

Default walltime limits:

  • express: 24 hours for 1-16 cores; 5 hours for 17-128 cores
  • normal: 48 hours for 1-255 cores; 24 hours for 256-511 cores; 10 hours for 512-1024 cores; 5 hours for 1025-56064 cores
  • copyq: 10 hours

The number of jobs that you can have running at any given time depends on the availability of resources. For express-def, the maximum number of running jobs also depends on the number of cpus requested.
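
As a quick check before submitting, the default walltime limits listed above can be encoded in a small lookup. This is a sketch of the defaults only (default_walltime_hours is a hypothetical helper name); the nf_limits command remains the authority for the limits that actually apply to your project.

```shell
# Print the default walltime limit in hours for a queue and cpu count,
# using the default limits listed above. Returns non-zero when the
# request falls outside the listed ranges.
# Usage: default_walltime_hours queue ncpus
default_walltime_hours() (
    queue=$1 ncpus=$2
    [ "$ncpus" -ge 1 ] || exit 1
    case "$queue" in
        express)
            if   [ "$ncpus" -le 16 ];  then echo 24
            elif [ "$ncpus" -le 128 ]; then echo 5
            else exit 1; fi ;;
        normal)
            if   [ "$ncpus" -le 255 ];   then echo 48
            elif [ "$ncpus" -le 511 ];   then echo 24
            elif [ "$ncpus" -le 1024 ];  then echo 10
            elif [ "$ncpus" -le 56064 ]; then echo 5
            else exit 1; fi ;;
        copyq)
            if [ "$ncpus" -eq 1 ]; then echo 10; else exit 1; fi ;;
        *) exit 1 ;;
    esac
)

default_walltime_hours normal 256   # prints 24
```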

The version of PBS used on NF systems has been modified to include customisable per-user/per-project limits:

  • All limits can be (and are intended to be) varied on a per-user or per-project basis - reasonable variation requests will be granted where possible.
  • Resources on the system are strictly allocated with the intent that if a job does not exceed its resource (time, memory, disk) requests, it should not be unduly affected by other jobs on the system. The converse of this is that if a job does try to exceed its resource requests, it will be terminated.
  • Please note jobs requesting more than 256 cpus will never run when the project is in bonus. You will have to reduce the number of cpus in your job request or wait until next quarter.



→ File Systems

Hardware

As well as 6 login nodes, there are 3592 compute nodes with the following configurations:

All nodes run CentOS 6.5. Note that the Linux OS requires some physical memory to be reserved for system functions, leaving the following memory available to user applications:

Memory Available to User jobs:

32GB: r1..r2395 (~67% of all nodes)
64GB: r2396..r3520 (~31% of all nodes)
128GB: r3521..r3592 (2% of all nodes)

All nodes have 16 cpu cores, meaning that OpenMP shared memory jobs that were previously restricted to 8 cpu cores on vayu can now run on up to 16 cpu cores. The architecture of each node is 2 sockets with 8 CPU cores each. As in the past, please check that your code can scale to this greater number of cores - many codes don't.

In a PBS job script, the memory you specify using the -lmem= option is the total memory across all nodes. However, this value is internally converted into the per-node equivalent, and this is how it is monitored. For example, since raijin has 16 CPUs per node, if you request -lncpus=32,mem=10GB, the actual limit will be 5GB on each of the two nodes. If you exceed this on either of the nodes, your job will be killed.
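
The per-node conversion described above can be reproduced with a little arithmetic. The sketch below assumes 16 cores per node, an even split of memory across nodes, and 1GB = 1024MB; per_node_mem_mb is a hypothetical helper name, not a command on the system.

```shell
# Per-node memory limit in MB for a request of ncpus cores and total_mb
# of memory, assuming 16 cores per node and an even split across nodes.
# Usage: per_node_mem_mb ncpus total_mb
per_node_mem_mb() (
    ncpus=$1 total_mb=$2
    nodes=$(( (ncpus + 15) / 16 ))   # number of nodes, rounded up
    echo $(( total_mb / nodes ))
)

# -lncpus=32,mem=10GB: two nodes, so 5GB (5120MB) on each.
per_node_mem_mb 32 10240   # prints 5120
```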

Please check out our Filesystems page for more details.


MPI Related Issues

Our PBSPro does not currently support cpusets so it is possible for two small (i.e. fewer than 16 cpu) OpenMPI jobs to be scheduled to run on the same cpus. Experience suggests that using

mpirun -np $PBS_NCPUS -bind-to-none run.exe

will avoid this problem. We will be investigating this more and may modify the mpirun wrapper to automate this process but expect that future releases of PBSPro will handle cpusets.

MPI jobs that request more than 16 CPU cores will need to request full nodes, that is, a multiple of 16 in #PBS -l ncpus.
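
A request can be rounded up to whole nodes before it goes into the job script. This is a minimal sketch of the rule above, assuming 16 cores per node; round_up_ncpus is a hypothetical helper name.

```shell
# Round a cpu request up to a multiple of 16 once it exceeds one node.
# Requests of 16 cpus or fewer are left unchanged.
# Usage: round_up_ncpus ncpus
round_up_ncpus() (
    ncpus=$1
    if [ "$ncpus" -le 16 ]; then
        echo "$ncpus"
    else
        echo $(( (ncpus + 15) / 16 * 16 ))
    fi
)

round_up_ncpus 40   # prints 48, i.e. three full nodes
```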

For MPI profiling and performance analysis tools, please see here for more details:
http://nf.nci.org.au/wiki/Profiling/MPI


Summary

Name(1)           | Purpose                                       | Availability                             | Quota(2)       | Timelimit       | Backup
/home/unigrp/user | Irreproducible data eg. source code           | raijin only                              | 2GB (user)     | none            | Yes
/short/projectid  | Large data IO, data maintained beyond one job | raijin only                              | 72GB (project) | 365 days        | No
/g/data/projectid | Processing of large data files                | global                                   |                | none            | No
massdata          | Archiving large data files                    | external - access using the mdss command | 20GB           | none            | 2 copies in two different locations
$PBS_JOBFS        | IO intensive data, job lifetime               | local to each individual raijin nodes    | unlimited(3)   | duration of job | No

  1. Each user belongs to at least two Unix groups:
    unigrp - determined by their host institution, and
    projectid(s) - one for each project they are attached to.
  2. Increases to these quotas will be considered on a case-by-case basis.
  3. Users request allocation of /jobfs as part of their job submission - the actual disk quota for a particular job is given by the jobfs request. Requests larger than 396GB will be automatically redirected to /short (but will still be deleted at the end of the job).
  4. Please make sure you specify #PBS -lother=gdata1 when submitting jobs accessing files in /g/data1. If /g/data1 filesystem is not available, your job will not start. The following command can be used to monitor the status of /g/data1 on raijin and can be incorporated inside your jobscript for checking the status of /g/data1:
    /opt/rash/bin/modstatus -n gdata1_status
  5. Please make sure you specify #PBS -lother=mdss when submitting jobs accessing files in mdss. If mdss filesystem is not available, your job will not start. The following command can be used to monitor the status of mdss on raijin and can be incorporated inside your jobscript for checking the status of mdss:
    /opt/rash/bin/modstatus -n mdss_status
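
Notes 4 and 5 are easy to forget, so a job script can be checked for the matching directive before it is submitted. The sketch below is a rough text check only: check_other_directives is a hypothetical helper name, and it assumes that any non-directive line mentioning /g/data1 or mdss means the job uses that filesystem.

```shell
# Read a job script on stdin and report any missing "other=" directive.
# Returns non-zero if a directive appears to be missing.
check_other_directives() (
    script=$(cat)
    status=0
    for pair in "/g/data1:gdata1" "mdss:mdss"; do
        needle=${pair%%:*}   # text that indicates use of the filesystem
        flag=${pair##*:}     # value expected in "#PBS -l other=..."
        # Does any non-directive line mention this filesystem?
        if printf '%s\n' "$script" | grep -v '^#PBS' | grep -Fq -- "$needle"; then
            # If so, is there a #PBS line requesting other=<flag>?
            if ! printf '%s\n' "$script" | grep -Eq "^#PBS.*other=.*$flag"; then
                echo "missing: #PBS -l other=$flag"
                status=1
            fi
        fi
    done
    exit "$status"
)
```

For example, check_other_directives < jobscript prints nothing and returns zero when the directives are present.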



→ Software

Software Environments

At login users will have modules loaded for pbs, openmpi and the Intel Fortran and C compilers.

The module command syntax is the same no matter which command shell you are using.

module avail will show you a list of the software environments which can be loaded via a module load package command.

module help package should give you a little information about what the module load package will achieve for you. Alternatively module show package will detail the commands in the module file. Please see module manual for more details.


Application Software

Access to a licensed third-party software package is granted by adding the user to the appropriate software Unix group. Before that, the user must fulfil all license requirements as stated in the 'Access prerequisites' on the third-party software package page in the 'Software Available' section.



→ Changes from Old System to Raijin

Major differences are shown here:

vayu              | raijin
-l vmem           | -l mem
-wd               | -l wd
quotasu and quota | lquota and nci_account
/projects         | /g/data
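
The two #PBS option changes listed above could be applied to an old vayu job script with a small sed filter. This is only a sketch: vayu_to_raijin is a hypothetical helper name, it handles just those two option renames, and because vmem and mem do not necessarily measure the same thing the memory value itself should still be reviewed by hand.

```shell
# Rewrite vayu-style #PBS options to their raijin equivalents.
# Reads a job script on stdin and writes the converted script to stdout.
vayu_to_raijin() {
    sed -E \
        -e '/^#PBS/ s/(-l[[:space:]]*|,)vmem=/\1mem=/g' \
        -e 's/^(#PBS[[:space:]]+)-wd[[:space:]]*$/\1-l wd/'
}
```

For example, vayu_to_raijin < old_jobscript > new_jobscript leaves all lines other than the affected #PBS directives untouched.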

More details can be found here.



→ Useful Links

For Beginners

For Developers

