National Computational Infrastructure


Parallel Programming



  1. Compiling, Optimising and Linking
  2. Parallelism
  3. Profiling
  4. Troubleshooting

Compiling, Optimising and Linking

Compiling and Optimising

  • NCI Programming for Developers
  • We recommend that you use the Intel compilers (ifort for Fortran, icc for C, icpc for C++). Note that the Intel C/C++ compiler is compatible with gcc/g++.
  • Intel Fortran and C compilers loaded by default – check available versions with
    module avail intel-fc
    module avail intel-cc
  • gfortran, gcc and g++ are provided with the operating system, or by loading a more recent gcc module.
  • Read the reference pages for the compilers:
    man ifort
    man icc
    man icpc

Using Compiler Options

Optimisation Compiler Options

  • Default optimisation level is -O2
  • No optimisation, -O0, is very, very slow.
  • Debug option, -g, uses -O0
  • Highest optimisation is -O3, use with care.

Compiler Options, C++:

  • Link C++ code with icpc, not icc

The default optimisation level for the Intel compilers is -O2.
The default optimisation level for the GNU compilers is -O0.

Exercise 1: Compiling and Linking

    module load intel-cc
    module load intel-fc

    cd /short/$PROJECT/$USER/
    tar xvf /short/c25/parallel_exercises.tar

    ifort -O3 -o matmulf matmul.f
    time ./matmulf
    ifort -O0 -o matmuls matmul.f
    time ./matmuls
    icc -O3 -o matmulf matmul.c
    time ./matmulf
    icc -O0 -o matmuls matmul.c
    time ./matmuls

Exercise 2: Using Libraries

    module load netcdf
    module show netcdf
    module list
    ifort -o netcdfex netcdfex.f -lnetcdff -lnetcdf





Parallelism

An Accountant’s View of the Parallel Economy


12% serial code (not parallelisable)
88% parallel code

    No. of CPUs  Serial part  Parallel part  Total walltime  Total cost (= walltime × ncpus)
    1            12           88             100             100
    2            12           44             56              112
    4            12           22             34              136
    8            12           11             23              184
    16           12           5.5            17.5            280



Writing parallel code

Popular models of parallel computation:

  • SPMD (Single Program, Multiple Data)
  • OpenMP
  • MPI (Message Passing Interface)


  • OpenMP is an extension to standard Fortran, C and C++ that supports shared-memory parallel execution.
  • Directives are added to the source code to parallelise loops and specify certain properties of variables. (Not to be confused with Open MPI, which is an MPI implementation.)


  • Correct single-thread programs can be parallelised incrementally.
      => a sequence of increasingly efficient parallel programs that are always correct.
  • Only one version of the source code to maintain for both single-thread and multi-thread operation.


  • OpenMP operation is restricted to shared-memory systems.
  • OpenMP directives are usually restricted to the parallelisation of loops or systems of loops. Other forms of OpenMP parallelisation usually break the single source-code advantage above.

Note that shared-memory parallelism is particularly relevant for multicore architectures.

Exercise 3: Running OpenMP code

For Fortran users look at matmul_omp.f

    ifort -O3 -openmp matmul_omp.f -o matmul_omp
    export OMP_NUM_THREADS=2
    time ./matmul_omp
    export OMP_NUM_THREADS=4
    time ./matmul_omp
  • These timings may not scale as expected because of conflicts with other users on the interactive nodes.
  • Do the timings by submitting a batch job.
  • Edit the script runompjob
  • Submit it to the queuing system.
  • The timings will appear in runompjob.e**** and other output in runompjob.o****.

For C programmers, compile with

    icc -O3 -openmp matmul_omp.c -o matmul_omp

then as above.


  • MPI is a parallel program interface for explicitly passing messages between parallel processes.
  • Message passing constructs must be added to the program.
  • An MPI application can be viewed as several copies of the same program running on individual but interconnected computers.
  • Each copy knows its own process number (often referred to as id or rank), so the instruction flow can differ between processes.


  • MPI programs can be run on any computer equipment
  • The programming model is simple and the parallelism is explicit (no “gotchas”)
  • Virtually any application can be parallelised in some way with MPI
  • Performance/scaling is usually better


  • Often requires complete recoding/redesign of programs for data decomposition – no incremental parallelism
  • Low level functionality – can be tedious for some problems (can be alleviated with good programming)
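A minimal sketch of this message-passing style in C (a hypothetical illustration, not one of the course's mpiexample codes): every non-zero rank sends a value to rank 0, which receives and prints them. It requires an MPI library and is run with mpirun as in the exercise below.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    if (rank == 0) {
        /* rank 0 collects one message from each other rank */
        for (int src = 1; src < size; src++) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received %d from rank %d\n", msg, src);
        }
    } else {
        /* every other rank sends a value derived from its own rank */
        int msg = rank * rank;
        MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```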

Exercise 4: Running MPI code

Use the module command to check that the openmpi module is loaded.

    module load openmpi

There are four MPI example codes in the PARALLEL_COURSE directory.
To build and run a simple MPI code:

Fortran example:

    mpif90 mpiexample1.f -o mpiexample.exe
    mpirun -np 4 ./mpiexample.exe

or, for a more complicated example:

    mpif90 mpiexample2.f -o mpiexample.exe
    mpirun -np 4 ./mpiexample.exe

C examples:

    mpicc mpiexample3.c -o mpiexample.exe
    mpirun -np 4 ./mpiexample.exe

    mpicc mpiexample4.c -o mpiexample.exe
    mpirun -np 4 ./mpiexample.exe

mpirun is the usual command for starting an MPI program. See man mpirun for further details.
The job script runmpijob can be used to submit an MPI job to the batch queues.

MPI vs OpenMP – which should I use?


OpenMP:

  • Shared data (good and bad)
  • Supposedly “simpler” and easier to use
  • Incremental parallelism (existing codes can be reused)
  • Scalability?
  • Only runs on shared-memory SMPs or multicore nodes

MPI:

  • Shared nothing!
  • Clearer programming model
  • Can be tedious for some distributed data layouts
  • “All or nothing” (often a rewrite from scratch)
  • Portable – “runs on anything”
  • Download a free library or use a vendor-specific library




Profiling

  • Concerned with performance (tuning is not discussed today; cf. the MPI Applications and Optimisation course)
  • Different levels of monitoring and analysis:
    • PBS-level info: nqstat, qps etc.
    • System-level info: top, iostat, vmstat etc.
    • Profiling tools
    • Tracing tools


  • Performance Profiling
    • Performance analysis based on aggregated summary information of a program.
    • Small amount of data, less information
    • Lightweight
    • IPM and mpiP
  • Performance Tracing
    • Performance analysis based on detailed information about each event along the timeline.
    • Large amount of data, more information
    • Heavyweight
    • VampirTrace/Vampir
    • Intel Trace Collector and Analyzer (only works with Intel MPI – not fully supported at NF)

See the software pages for more information.

General Performance Monitoring

Analysis of the cputime, system time and IO time of an application can provide basic performance information that allows users to understand the performance of their programs.

  • HPCToolKit
  • OpenSpeedShop
  • gprof

Exercise 5: HPCToolkit

  module load hpctoolkit
  ifort -g -O3 jacobi_serial.f -o prog.exe
  hpcrun ./prog.exe < input.1
  hpcstruct ./prog.exe 
  hpcprof -S ./prog.exe.hpcstruct -I ./'*' hpctoolkit-prog.exe-measurements 
  hpcviewer hpctoolkit-prog.exe-database


Exercise 6: gprof

Profiling the executable prog.exe stores profiling data in gmon.out, which can then be interpreted by gprof as follows:

    ifort -p -o prog.exe jacobi_serial.f
    ./prog.exe < input.1
    gprof ./prog.exe gmon.out

For the GNU compilers do

    gfortran -pg -o prog.exe jacobi_serial.f
    ./prog.exe < input.1
    gprof ./prog.exe gmon.out

MPI Performance Profiling with IPM

The IPM profiler aggregates statistics at run time and provides a performance overview of the whole job execution in terms of time, message statistics, load balance etc.

  • Shows the total time each process spends in MPI calls and stats of message sizes
  • Shows load balance/imbalance


  • A very low overhead MPI profiler
  • Provides nice HTML profile results

IPM profile examples

[Figure: example IPM profile of a poorly load-balanced job]

Exercise 7: Householder IPM

Block-decomposition reduction with Householder transformations. Compile the code:

    mpif77 -o householder.block householder.block.f
    mpif77 -o householder.cyclic householder.cyclic.f

Run the code. IPM hooks in at run time, so load its module before submitting:

    module unload hpctoolkit
    module load ipm
    qsub block.job
    qsub cyclic.job
    watch qstat -u $USER
    ... (wait until the jobs finish, use Ctrl+C to quit)...
    cat householder.block.out
    module rm ipm





Troubleshooting

  • ‘padb -x -t jobid’ for hung MPI jobs
    (module load padb)
  • valgrind for memory issues
  • strace for system calls
  • Intel Debugger or gdb for single processor code
    (module load intel-cc)
  • Totalview for single or multiprocessor code
    (module load totalview)

Exercise 8: Intel Debugger

   > icc -debug matmul.c 
   > idbc ./a.out
    Intel(R) Debugger for applications running on Intel(R) 64, Version 11.1, Build [1.2097.2.217]
    object file name: a.out 
    Reading symbols from /short/c25/abc777/PARALLEL_COURSE/a.out...done.
    (idb) break main
    Breakpoint 1 at 0x400f4f: file /short/c25/abc777/PARALLEL_COURSE/matmul.c, line 28.
    (idb) list main
    18  }
    22  int main()
    23  {
    24    int i, j, l;
    25    float a[N][N], b[N][N], c[N][N];
    (idb) run
    Starting program: /short/c25/abc777/PARALLEL_COURSE/a.out

    Breakpoint 1, main () at /short/c25/abc777/PARALLEL_COURSE/matmul.c:28
    28    for(i=0; i<N; i++)
    (idb) print i
    $1 = 0
    (idb) quit

Exercise 8 (continued): Intel Debugger GUI

There is a graphical interface for idb, which can be invoked with:

    module load java
    idb ./a.out


Exercise 9: Totalview

Totalview can be used to debug sequential or parallel programs.

  module load totalview 
  mpif90 -g mpiexample1.f 
  mpirun --debug -np 4 ./a.out 

