National Computational Infrastructure


Starting at NCI

Contents

  1. Accounting
  2. Connecting
  3. UNIX
  4. Queueing
  5. Filesystems
  6. Troubleshooting

Introduction


What is the NCI?

  • Peak Facility (Raijin), Cloud Services and Data Management
  • Specialised Support
    • Climate system science
    • Astronomy
    • Earth Observation
    • Geophysics
    • Cloud Computing

Allocation Schemes

  • National Computational Merit Allocation Scheme
    • NCMAS includes NCI (raijin), iVEC (magnus, epic, fornax), VLSCI (avoca), SF in Bioinformatics (barrine) and SF in Imaging and Visualisation (MASSIVE).
  • Partner allocations
    • Major Partners: e.g. CSIRO, INTERSECT, GA, QCIF, BoM
    • University Partners: e.g. ANU, Monash, UNSW, UQ, USyd, Uni Adelaide, Deakin
  • Flagship Projects
    • Astronomy/Astrophysics, CoE in Climate Systems Science, CoE Optics
  • Startup allocation
  • Director

Distribution of Allocations: 2014

Approximate distribution of allocations across all compute systems for 2014:

  • NCMAS 15%
  • CSIRO 21.4%
  • BOM 18.9%
  • ANU 17.7%
  • Flagships 5.0% (including CoECSS, TERN, Astro, CoE Optics)
  • INTERSECT 3.8%
  • GA 3.4%
  • Monash, UNSW, UQ, USyd, Uni Adelaide: 1.7% each
  • Director’s share, QCIF, Deakin, MSI: 6.3% in total

NCI HPC System

Integrated Infrastructure and Services

  • Raijin – Fujitsu Primergy cluster (system details below)
  • Lustre Filesystems – raijin (/home and /short) and global (/g/data)
  • Cloud – OpenStack cloud (hosting services, specialised virtual labs, new services, special interactive use)
  • High-end visualisation services and support (Vizlab)
  • Software Packages

Getting Information


New Petascale System

Fujitsu Primergy – raijin

  • 3592 nodes, each with 2× Intel Xeon E5-2670 (Sandy Bridge, 8 cores, 2.6GHz)
  • 57472 cores in total
  • Total memory 158TB
  • Lustre filesystems: (/short, /home, /g/data)
  • $PBS_JOBFS local to each node.
  • Infiniband network
  • See the system being installed.

Cloud

NCI’s Cloud services focus around:

  • Computation using the cloud
  • Data services using the cloud
  • Complementary services to NCI’s HPC that are best provided through cloud

NCI offers a NeCTAR node (National eResearch Collaboration Tools and Resources):

  • Designed to optimise for computation and floating point (Intel CPUs)
  • Designed for high-speed data transfer (56 Gigabit network between nodes)
  • Designed for high-speed IO (all-SSD disk storage in the cloud)

NCI can offer a high speed interconnect between the NCI Lustre based filesystems and NCI Cloud services.


Data Storage

  • global Lustre filesystem /g/data/ – stores persistent data, mounted on raijin and cloud nodes.
  • Mass Data storage – HSM storage with dual copies across two NCI data centres. Effective storage for managing data that can be staged in/out as part of batch processing.
  • RDSI national data collections – to be stored across the NCI data resources listed above.


Accounting


How to Apply for a New Project (for CIs)

  • Project leaders (Chief Investigators) will fill out on-line forms with required details and be given a project ID.
  • Application process:
    • Partner (anytime)
    • Merit scheme (once a year, deadline Nov)
    • Start-up (anytime, max 1000 SU per year)
    • Commercial (anytime)

How to Apply for a New Account (for Users)

  • Register as a New User first. The registration ID is a number such as 12345; it is not a user ID.
  • Connect to a Project: submit a connection form.
  • Accounts are set up when a CI approves a connection request.
  • The new user will receive an email with account details.
  • NCI usernames are of the form abc123: abc for your initials, 123 for affiliation.
  • Passwords are sent by SMS to the mobile number provided when you registered.
  • Passwords can be given over the phone if necessary, but not by email.
  • Use the passwd command to change your password when you first log in.
  • An automated on-line tool for users to set passwords is being developed, expected availability late 2014.

Project accounting

  • All use on the compute systems is accounted against projects. Each project has a single grant of time per 3 month quarter.
  • If your username is connected to more than one project you will have to select which project to run jobs under.
  • A project may span several stakeholders (eg BoM and CSIRO).
  • To change or set the default project, edit your .rashrc file in your home directory, and change the PROJECT variable as desired. A typical .rashrc file looks like
    setenv PROJECT c25
    setenv SHELL /bin/bash

Log in again after editing .rashrc to see the changes take effect.


Default Project

  • The following displays the usage of the project in the current quarter against each of the stakeholders funding the project.
    nci_account 
  • By adding -v you can see who is using the compute time.
    nci_account -v
  • You can also use -P for another project and -p for a different quarter, e.g.:
    nci_account -P c25 -p 2014.q2 -v
  • Further information will be presented under nci_account – most notably storage usage.
  • If you have a project that is externally funded and requires more resources than provided, please contact us. Special funding can be set up and tracked under nci_account.


Connecting


Establish Connection

  • Connection under Unix/Mac:
    • For ssh – ssh (terminal)
    • For scp/sftp – scp/sftp (terminal)
    • For X11 – ssh -Y, make sure to install XQuartz for OSX 10.8 or above. (terminal)
  • Connection under Windows:
    • For ssh – PuTTY, MobaXterm
    • For scp/sftp – PuTTY, FileZilla, WinSCP
    • For X11 – Cygwin, Xming, MobaXterm, Virtual Network Computing

Caution! Be sure to log out of xterm sessions and quit the window manager before leaving the system.


Connecting to raijin

The hostname of the Fujitsu Primergy Cluster is

    raijin.nci.org.au

and can be accessed using the secure shell (ssh) command, for example,

    ssh -Y abc548@raijin.nci.org.au

Your ssh connection will be to one of six possible login nodes, raijin1 to raijin6. (If ssh to raijin fails, try specifying one of these nodes directly, e.g. raijin3.nci.org.au.)


Secure use of ssh

“passphrase-less ssh keys”: allow ssh to log in without a password.

Caution! Day-to-day use is strongly discouraged.
It considerably weakens both NCI and home-institution system security. (Instead, consider a key with a passphrase plus ssh-agent on your workstation.)
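
A minimal sketch of that recommended setup, run on your own workstation (the key filename and username are illustrative; ssh-copy-id may need to be installed separately on some systems):

    ssh-keygen -t rsa -f ~/.ssh/id_rsa_nci      # choose a passphrase when prompted
    ssh-copy-id -i ~/.ssh/id_rsa_nci.pub abc123@raijin.nci.org.au
    eval $(ssh-agent)                           # start an agent for this session
    ssh-add ~/.ssh/id_rsa_nci                   # type the passphrase once per session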

Passphrase-less keys can, however, be useful to support copyq batch jobs:

  • Generate a new key specifically for such transfers
  • Use rrsync to restrict what it can do
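
For illustration, a restricted entry in ~/.ssh/authorized_keys at the receiving end might look like the following (the rrsync path varies by installation, the directory is illustrative, and the key itself is elided):

    command="/usr/share/doc/rsync/scripts/rrsync -wo /short/c25/incoming",no-pty,no-port-forwarding ssh-rsa AAAA... copyq-transfer-key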

More information: Using ssh keys


UNIX


UNIX environment

The working environment under UNIX is controlled by a shell (command-line interpreter). The shell interprets and executes user commands.

  • The default shell is bash (tcsh is also popular; ksh is available)
  • The shell can be changed by modifying .rashrc
  • Shell commands can be grouped together into scripts
  • Unix Quick Reference Guide

Note Unix is case-sensitive!


UNIX environment

The shell provides environment variables that are inherited by all processes started from the original (login) shell.

                                     csh/tcsh    sh/bash/ksh
  exec on login and compute nodes    .cshrc      .bashrc
  exec on login nodes only           .login      .profile
  modules                            .login      .profile

tcsh syntax

    setenv VARIABLE value

bash syntax

    export VARIABLE=value

For an explanation of environment variables see Canonical user environment variables


Environment Modules

Modules provide a great way to easily customize your shell environment for different software packages. The module command syntax is the same no matter which command shell you are using.

Various modules are loaded into your environment at login to provide a workable environment.

module list         # To see the modules loaded
module avail        # To see the list of software for which environments
                    # have been set up via modules
module show name    # To see the list of commands that are carried out
                    # in the module
module load name    # To load the environment settings required by a
                    # software package
module unload name  # To remove extras added to the environment for a
                    # previously loaded software package. This is
                    # extremely useful in situations where different
                    # package settings clash.

Environment Modules

Note To automate environment customisation at login, module load commands can be added to the .login (tcsh) or .profile (bash) files.
Be aware, however, that different applications can have incompatible environment requirements, so loading multiple application modules in your dot files may lead to problems. We therefore recommend loading modules in job scripts as needed at runtime, and discourage module commands in shell configuration (dot) files.
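
For example, near the top of a job script (the module versions shown are those loaded by default at the time of writing):

    # load exactly what this job needs, at runtime
    module load intel-fc/12.1.9.293
    module load openmpi/1.6.3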

More advanced information on modules can be found in the Modules User Guide.


Editors

Several editors are available

  • vi
  • emacs
  • nano

If you are not familiar with any of these you will find that nano has a simple interface. Just type nano.

Caution! Use dos2unix if your input/job script files were edited on a Windows machine.


Exercise 1: Getting started

Logging on to raijin – use the course account.

  ssh -Y aaa777@raijin.nci.org.au

Remember to read the Message of the Day (MOTD) as you login.
Commands to try:

hostname        # to see the node you are logged into
nci_account     # to see the current state of the project
module list     # to check which modules are loaded on login
module avail    # to see which software packages are installed
                # and accessible in this way
module show pbs # to see what environments are set by a module

Note The intel-fc, intel-cc and openmpi modules are loaded by default via .cshrc (tcsh) or .bashrc (bash).


Job Scheduling


Batch Queueing System

  • Most jobs require greater resources than are available to interactive processes and must be scheduled by the batch job system (an interactive mode is also available).
  • Queueing system:
    • distributes work evenly over the system
    • ensures that jobs cannot impact each other (e.g. exhaust memory or other resources)
    • provides equitable access to the system
  • Raijin uses a customised version of PBSPro.
  • nf_limits displays the limits that are set for your projects.
  • Default queue limits apply; check them with nf_limits.
  • Job charging is based on wall-clock time used, number of cpus requested, and queue choice.

Batch queue structure

  • normal
    • Default queue designed for production use
    • Charging rate of 1 SU per processor-hour (walltime) on raijin
    • Requests for more cpus than one node (16 cores) must be in multiples of 16.
    • If your grant is exhausted, jobs run at lower priority (bonus time).
  • express
    • High priority for testing, debugging etc.
    • Charging rate of 3 SUs per processor-hour (walltime)
    • Smaller limits to discourage “production use”
      (ncpus limited to 128 and memory to 32GB; check nf_limits for project-specific detail)
  • copyq
    • Used for file manipulation – e.g. copying files to MDSS
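
The queue is selected at submission time, either on the qsub command line or with a #PBS -q directive in the script, for example (script names are illustrative):

    qsub -q express runjob    # high-priority testing, charged at 3 SUs per processor-hour
    qsub -q copyq archive     # file manipulation, e.g. copying files to MDSS
    qsub runjob               # the normal queue is the default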

Using the Queueing System

  • Read the How to Use PBS
  • Use nf_limits to see your user/project queue limits.
  • Request resources for your job (using qsub).
    • walltime
    • memory (32GB, 64GB, 128GB per node)
    • disk (jobfs)
    • number of cpus
  • PBSPro will then
    • schedule the job when the resources become available
    • prevent other jobs from infringing on the allocated resources
    • display progress of the jobs (qstat, nqstat or nqstat_anu)
    • terminate the job when it exceeds its requested resources
    • return stdout and stderr in batch output files

PBS Resources Example


    #!/bin/bash

    #PBS -l walltime=20:00:00
    #PBS -l mem=2GB
    #PBS -l jobfs=1GB
    #PBS -l ncpus=16
    #PBS -l software=xxx

    # The software resource is only needed for licensed software.
    # Replace the line below with the commands your job should run.
    mpirun ./my_program

Job Scheduling

  • Job priority is based on resource requested, currently running jobs under the user/project, and grant allocation.
  • Jobs start when sufficient resources are available. (Use qstat -s jobid to see a comment on why a job has not yet started.)
  • Resources allocated to a job are unavailable to other jobs.

Only ask for the resources your job really needs!

  • This minimises delays for other users and uses the system efficiently.

Long-running jobs

  • Some workflows need to run for longer than the queue walltime limits allow.
  • Checkpoint/restart functionality is recommended for workflows that require long run times: long runs expose users to system and/or numerical instabilities.
  • Example scripts for self-submitting jobs can be found in the FAQs; a sketch follows below.

Caution! Checkpoint/restart is not a filesystem or PBSPro capability – it must be implemented by the user or software vendor.
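
A minimal sketch of a self-submitting script named runjob (the program and marker file are illustrative; the FAQ examples handle more edge cases):

    #!/bin/bash
    #PBS -l walltime=48:00:00
    #PBS -l ncpus=16
    #PBS -l wd

    # The application must checkpoint itself before walltime expires
    # and create 'finished' when the whole computation is complete.
    ./model
    if [ ! -f finished ]; then
        qsub runjob    # resubmit this same script for the next chunk
    fi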


stdout and stderr files

PBSPro returns the standard output and standard error from each job in .o***** and .e***** files, respectively.

Example script.o123456

  ============================================================
    Resource Usage on 2013-07-20 12:48:04.355160:
    JobId:           123456.r-man2  
    Project:                   c25 
    Exit Status: 0 (Linux Signal 0)
    Service Units: 0.01
    NCPUs Requested: 1             NCPUs Used: 1
                                   CPU Time Used: 00:00:43
    Memory Requested: 50mb         Memory Used: 13mb
                                   Vmem Used: 52mb
    Walltime requested: 00:10:00   Walltime Used: 00:00:49
    jobfs request: 100mb           jobfs used: 1mb
  ============================================================

stdout and stderr files

  • .o***** file contains the output arising from the script (if not redirected in the script) and additional information from PBS.
  • .e***** file contains any error output arising from the script (if not redirected in the script) and additional information from PBS. For a successful job it should be empty.
  • Common errors to look for in the .e***** file:
    • Command not found. (check module list, path)
    • =>> PBS: job terminated: walltime 172818sec exceeded limit 172800sec (Increase runtime request)
    • =>> PBS: job terminated: per node mem 2227620kb exceeded limit 2097152kb (Increase memory per node request)
    • Segmentation fault. (check your program)

Monitoring the progress of jobs

Useful commands

  qstat       # show the status of the PBS queues
  nqstat      # enhanced display of the status of the PBS queues
  nqstat_anu  # enhanced display of the status of the PBS queues
  qstat -s    # display additional comment on the status of the job
  qps jobid   # show the processes of a running job
  qls jobid   # list the files in a job's jobfs directory 
  qcat jobid  # show a running job's stdout, stderr or script
  qcp jobid   # copy a file from a running job's jobfs directory 
  qdel jobid  # kill a running job

Caution! Please use nqstat_anu -a | grep $USER to see the cpu% of your jobs. An efficient parallel job should be close to 100%.


Exercise 2: Submitting jobs to the batch queue

  cd /short/$PROJECT/$USER/
  tar xvf /short/c25/intro_exercises.tar
  cd INTRO_COURSE
  qsub runjob
  watch qstat -u $USER
  ... (wait until job finishes, use Ctrl+C to quit)...
  

View the output in the file runjob.o**** and any error messages in runjob.e**** after the job completes.


Exercise 3: Interactive Batch Jobs

Sometimes the resource requirements (memory, walltime, etc.) of interactive work are larger than what is allowed on the login nodes. You can run an interactive batch job instead:

  qsub -I -l walltime=00:10:00,mem=500Mb -P c25 -q express -X

  qsub: waiting for job 215984.r-man2 to start
  qsub: job 215984.r-man2 ready

  [aaa777@r73 ~]$ hostname
  r73

  [aaa777@r73 ~]$ xeyes &

  [aaa777@r73 ~]$ module list
  Currently Loaded Modulefiles:
    1) pbs                   4) intel-fc/12.1.9.293
    2) dot                   5) openmpi/1.6.3
    3) intel-cc/12.1.9.293

  [aaa777@r73 ~]$ logout
  qsub: job 215984.r-man2 completed


Filesystems


Filesystems

Things to consider –

  • Transferring large data files to and from raijin: scp, rsync, FileZilla (see the sketch below)
  • Use the designated data-mover node, not the interactive login nodes:
    r-dm.nci.org.au
  • How much data do you really need to keep?
  • Do you need metadata or a self-describing file format?
  • Decide on a structure for archived data before you start.
  • Stage archived data from tape (offline) to disk before starting jobs.
  • Archive results automatically at the end of batch jobs.
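
For example, to copy results from raijin to your local machine via the data-mover node (the username and paths are illustrative):

    # run from your local machine; -a preserves permissions/times, -v is verbose
    rsync -av abc123@r-dm.nci.org.au:/short/c25/abc123/results/ ./results/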

RAIJIN Filesystems Overview

The Filesystems section of the user guide covers this table in greater detail:

  /home       – Irreproducible data, e.g. source code. Quota: 2GB per user. Backed up. Available on raijin. No time limit.
  /short      – Input/output data files. Quota: 72GB per project. Not backed up. Available on raijin. Time limit: 365 days.
  /g/data/    – Processing large data. Quota: project dependent. Not backed up. Available globally. No time limit.
  $PBS_JOBFS  – IO-intensive data. Quota: 100MB per node by default. Not backed up. Local to each node. Exists only for the duration of the job.
  MDSS        – Archiving large data files. Quota: 20GB. Two copies kept in two different locations. External – access using mdss commands. No time limit.

Note These limits can be changed on request.


Monitoring disk usage

  • lquota gives Lustre filesystem usage (/home, /short, /g/data).
  • nci_account gives other filesystem usage (/short, /g/data, mdss).
  • short_files_report gives a breakdown of /short usage:
    short_files_report -G <group>      # lists files owned by a group
    short_files_report -P <project>    # lists files in /short/<project>

Caution! /short and /g/data are not backed up so it is the user’s responsibility to make sure that important files are archived to the MDSS or off-site.


Input/Output Warning

  • Lots of small IO to /short (or /home) can be very slow and can severely impact other jobs on the system.
  • Avoid “dribbly” IO, e.g. writing 2 numbers from your inner loop.

    Writing to /short every second is far too often!

  • Avoid frequent opening and closing of files (or other file operations)
  • Use /jobfs instead of /short for jobs that do lots of file manipulation
  • To achieve good IO performance, try to read or write binary files in large chunks (of around 1MB or greater)

Exercise 4: Writing to /short

  • Use the lquota and du commands to find how much disk space you have available in your home, short and gdata directories.
  • Use the short_files_report to see who uses most of the quota. Look at your project’s /short area. Anyone from your project can create their own directories and files here. There will be a directory of your own under your project area.
  • Note the different group ownership in the DATA directory.
    ls -l /short/c25/DATA



Exercise 4: Writing to /short (cont)

Change the permissions on your files and directories to allow/disallow others in your group to access them.

    man chmod
    chmod g+r filename      # allow group read to filename
    chmod g-r filename      # disallow group read to filename
    chmod g+w filename      # allow group write to filename
    chmod g+x filename      # allow group execute to filename

Verify with your neighbour that your file permissions are as expected.
Note

  • To be able to go into a directory requires execute permission (chmod -R +X folder)
  • You may not want to share files by making your /home directory world readable. For members of the same project you can use /short/$PROJECT. Talk to us about alternatives if you need to share source code, data files etc.

ACL Access Control Lists

ACLs are an addition to the standard Unix file permissions (read, write, execute for user, group and other) that allow permissions to be granted or denied for specific users and groups.

ACLs give users and administrators flexibility and direct fine-grained control over who can read, write, and execute files.

Caution! We strongly recommend that you consult with NCI before using ACLs.
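
For reference, ACLs on Linux are typically managed with the standard setfacl and getfacl commands (a sketch only; usernames, group and filenames are illustrative):

    setfacl -m u:abc123:r-- data.nc    # grant user abc123 read access to a file
    setfacl -m g:c25:rwx results/      # grant group c25 full access to a directory
    getfacl data.nc                    # inspect the ACL on a file
    setfacl -x u:abc123 data.nc        # remove the user's entry again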


Using the MDSS

The Mass Data Store was migrated to a new SGI Hierarchical Storage Management System in January 2012.

  • MDSS is used for long term storage of large datasets.
  • If you have numerous small files to archive, bundle them into a tarfile FIRST.
    Watch our tape robot at work
  • Every project has a directory on the MDSS.
    All members of the project group have read and write access to the top project directory.
  • mdss dmls -l shows which files are online (on disk cache) and which are on tape.

Using the MDSS

  • The mdss command can be used to “get” and “put” data between raijin (login and copyq nodes) and the MDSS, and to list files and directories on the MDSS.
  • netcp and netmv can be used from within batch jobs to
    • Generate a batch script for copying/moving files to the MDSS
    • Submit the generated batch script to the special copyq, which runs the copy/move job on an interactive node.
  • netcp and netmv can also be used interactively to save you the work of creating tarfiles and generating mdss commands.
    • -t create a tarfile to transfer
    • -z/-Z gzip/compress the file to be transferred

Caution! Always use -l other=mdss when using mdss commands in copyq jobs, so that the jobs only run when the mdss system is available.
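
A minimal sketch of such a copyq job (the directory and file names are illustrative):

    #!/bin/bash
    #PBS -q copyq
    #PBS -l walltime=1:00:00
    #PBS -l other=mdss
    #PBS -l wd

    tar cf results.tar results/    # bundle numerous small files first
    mdss put results.tar           # copy the tarfile to the project's MDSS area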


Exercise 5: Using the MDSS

To see these commands in action do

     cd /short/$PROJECT/$USER
     mdss get Data/data.tar
     ls -l
     tar xvf data.tar
     ls
     rm data.tar
     mdss mkdir $USER
     netmv -t $USER.tar DATA $USER
     watch qstat -u $USER
     ... (wait until job finishes, use Ctrl+C to quit)...
     less DATA.o*
     mdss ls $USER
     mdss rm $USER/$USER.tar

Using /jobfs

  • Only available through the queueing system:

    Request with: -l jobfs=1GB
    Access via the $PBS_JOBFS environment variable

  • All files are deleted at the end of the job – copy anything you need back to /short or another global filesystem within the job script (see the sketch below).
  • mdss and netcp commands cannot be used on files on /jobfs.
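
A minimal sketch of a job using /jobfs, in the spirit of Exercise 6 below (the program and file names are illustrative):

    #!/bin/bash
    #PBS -l ncpus=1
    #PBS -l jobfs=1GB
    #PBS -l wd

    cp input.dat $PBS_JOBFS/                  # stage input onto node-local disk
    cd $PBS_JOBFS
    ~/bin/process input.dat > output.dat      # IO-intensive work runs on /jobfs
    cp output.dat /short/$PROJECT/$USER/      # stage results out before the job ends
    cd /short/$PROJECT/$USER
    netcp -t output.tar output.dat $USER      # archive to MDSS (submits a copyq job)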

Exercise 6: Managing Files between /short, /jobfs and MDSS

Submit a batch job with a /jobfs request, where the job:

  • Copies an input file from /short to /jobfs
  • Runs a code to use the input file and generate some output
  • Saves the output data back to the /short area
  • Uses the netcp command to archive the data to the MDSS

Exercise 6: Managing Files between /short, /jobfs and MDSS

Read the runjobfs script then submit it to the queueing system, monitor the job with qstat, and examine the job output files:

    cd /short/$PROJECT/$USER/INTRO_COURSE
    qsub runjobfs
    watch qstat -u $USER
    ... (wait until job finishes, use Ctrl+C to quit)...
    cat runjobfs.e*
    cat runjobfs.o*

Check out the output file that this job created on /short and the copy on the MDSS

    cd /short/$PROJECT/$USER
    ls -ltr
    less save_data.o*
    mdss ls $USER
    mdss rm -r $USER


Troubleshooting


Troubleshooting

  • Make sure you use PBSPro keywords (e.g. -l wd and -l mem).
  • Check the .e (stderr) and .o (stdout) files – and check your input!
  • PBS emails, MOTD and Notices and News
  • Read the FAQs
  • Read the /g/data FAQs
