Pushing the boundaries to create a world-class supercomputer
Cutting-edge supercomputing has always been at the core of the National Computational Infrastructure’s mission. Our high-performance computing (HPC) infrastructure, coupled with high-performance data services, bridges the gap between science fiction and reality.
For thousands of Australian researchers, the NCI Australia systems have been indispensable in their quest for the next big scientific breakthrough, from investigating improved cancer treatments to developing fusion power.
It is all the more remarkable, then, that these achievements are being sought with technology that is not fundamentally different to the typical desktop computer. Computational chemistry and computer games share the same underlying binary elements in their execution – that is, 0s and 1s. Likewise, the processors found in many of the world’s fastest supercomputers, Intel’s x86 CPUs (central processing units, each made up of distinct processor cores), are the same ones used in today’s personal computers and video game consoles.
What is so super about supercomputing, anyway?
A common simplification is to focus on physical scale alone – that a supercomputer is just many, many thousands of smaller computers all working together, inside a large room, generating noise, heat, bits and bytes. Or that a supercomputer is simply a very, very fast computer, sitting at one end of a spectrum that covers slower workstations, laptop computers and smartphones. While these points ring true, they’re hardly the entire story.
In the case of NCI, it takes a contingent of software engineers and other world-leading experts to elevate a technically impressive cluster of computers to the point where it can be defined as a true supercomputer. The human element is the essential variable – the secret sauce.
“The research user experience is everything,” says Dr Muhammad Atif. “Everything we do makes our supercomputer easier to use and better attuned to your scientific work.” As NCI’s Manager of HPC Systems and Cloud Services, Dr Atif leads the team that has been responsible for maintaining and improving NCI’s supercomputer, Raijin, for the past six years.
Raijin, a Fujitsu cluster, was ranked as the 24th fastest supercomputer in the world following initial benchmarking in 2012. At the time, it consisted of 53,504 Intel Xeon Sandy Bridge processor cores, generating a computational capacity of just over 1.1 petaflops – that is, more than 1,000 trillion operations every second.
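That headline figure can be sanity-checked with simple arithmetic: peak floating-point throughput is roughly cores × clock frequency × floating-point operations per cycle. The clock speed and per-cycle figure below are typical for Sandy Bridge-era Xeons and are illustrative assumptions, not published Raijin specifications:

```python
# Back-of-envelope peak throughput for a Sandy Bridge-era cluster.
# Assumed values (illustrative, not official Raijin specifications):
CORES = 53_504            # processor cores, as reported at launch
CLOCK_HZ = 2.6e9          # assumed 2.6 GHz clock speed
FLOPS_PER_CYCLE = 8       # assumed 8 double-precision operations/cycle with AVX

peak_flops = CORES * CLOCK_HZ * FLOPS_PER_CYCLE
peak_pflops = peak_flops / 1e15  # 1 petaflop = 10^15 operations per second
print(f"Estimated peak: {peak_pflops:.2f} petaflops")  # ≈ 1.11
```

Under these assumptions the estimate lands just over 1.1 petaflops, consistent with the benchmarked figure above.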
These days, it would seem peculiar to hold on to a smartphone for longer than a couple of years. As new technology comes onto the market, supporting the next generation of better and faster applications, modern consumers are typically compelled to upgrade their device every two to three years.
For Raijin, the situation is no different. Just like a smartphone, as a supercomputer ages, its parts wear out and become harder to replace, and its software needs to be maintained to run on the older hardware. Except it’s a little more difficult (and far more expensive) to upgrade a world-class computing cluster than it is a handheld consumer device.
Extensive heterogeneity fosters collaboration and innovation
Since first coming online, NCI’s computational systems have evolved from a relatively straightforward x86-based cluster to a truly heterogeneous one made up of many different systems. This heterogeneity is one of the reasons why NCI can provide cutting-edge computational services to the Australian research community, six years on from Raijin’s debut.
Over the past six years, Raijin has been augmented with a wide variety of x86 and non-x86 based systems. The first upgrades included fourteen servers containing 56 NVIDIA GPUs (graphics processing units), coming online in the first half of 2016. Excelling at certain types of computation, GPUs have earnt their place alongside traditional CPU-based number crunching, both within NCI and in the wider world of high-performance computing.
With GPUs successfully integrated into the system, the floodgates were now open. NCI continued to innovate when we became one of the first facilities in Australia to deploy Intel’s new Knights Landing CPUs after our inclusion in their early shipping program. This series of processors has gone on to see heavy usage by eager researchers looking for a boost to their computational horsepower.
Next, NCI was the first site in Australia to become part of IBM’s OpenPower Foundation and access their brand new – and radically different – Power8 CPUs. NCI then became the first site anywhere in the world to run x86 and Power CPU architectures together under one job scheduling system, a significant engineering feat and major benefit to the user experience.
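The general idea of running mixed architectures under one scheduler can be sketched as a single queue that dispatches each job to a free node of the architecture it requests. The node names and dispatch logic below are a toy illustration of the pattern, not NCI’s actual scheduling system:

```python
# Illustrative sketch: one queue dispatching jobs to heterogeneous nodes.
# The architecture labels are real; node names and logic are hypothetical.
from collections import deque

nodes = {"x86_64": ["r1", "r2"], "ppc64le": ["p1"]}  # hypothetical node pools
queue = deque([("chem_sim", "x86_64"), ("genomics", "ppc64le")])

def dispatch(queue, nodes):
    """Assign each queued job to a free node of its requested architecture."""
    placements = {}
    while queue:
        job, arch = queue.popleft()
        if nodes[arch]:
            placements[job] = nodes[arch].pop(0)
    return placements

print(dispatch(queue, nodes))  # {'chem_sim': 'r1', 'genomics': 'p1'}
```

The benefit to users is that one submission interface covers both architectures; the scheduler, not the researcher, worries about where the job lands.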
After this, NCI designed and road-tested the Agility System, which was installed at NCI in early 2017 by Xenon with hardware provided by Lenovo. This massive compute system added an estimated 948 teraflops of Intel Broadwell CPUs to Raijin, a much-needed boost to our x86-based computational capacity. Throughout the process, NCI assisted the vendors with performance profiling and overall design, with the needs of the Australian research community as the primary consideration.
Most recently, the inclusion of additional Broadwell nodes with 3 TB of RAM each, as well as several nodes of ARM CPUs from Cray, has added greatly to the variety of computational capacity available at NCI.
These various additions, each with their own distinct architectures, are available to NCI’s thousands of users via the same simple method they have always used. Every piece of NCI’s supercomputer runs the same operating system: an optimised, highly customised Linux environment developed by NCI. Having a shared code base across the entire NCI supercomputing cluster makes it simpler for users wishing to explore some of the newer computational avenues. Altogether, the heterogeneous parts of the NCI supercomputer come together to make a varied, highly powerful cluster.
These technological firsts, brought together into a unique heterogeneous structure, position NCI as a provider of cutting-edge infrastructure that is unmatched anywhere else in the region. Despite the challenges, Dr Atif and his team have a consistently robust, thoughtful and scalable approach to their craft, maintaining NCI’s world-class reputation throughout Raijin’s extended working life.
In Dr Atif’s words, “We do not only innovate – we lead.”
Next generation monitoring and management
At any one time, the HPC Team monitors a wide range of individual machines: compute nodes, Lustre servers, login nodes, cloud hypervisors and other systems. Within these, the individual components – CPU, memory, storage and so on – are each monitored separately. Everything from hardware faults and software crashes to temperature warnings is watched by scripts constantly running in the background. Our supercomputer is, in fact, performing a kind of self-healing based on decision-making algorithms we have developed.
Adding intelligence to the monitoring system allows us to easily manage the whole, complex machine. Based on a set of specialist instructions, each individual compute node has the ability to take itself offline or shut itself down when it diagnoses certain hardware or software faults. The automated systems then alert our staff and any relevant vendors so the compute node can be fixed. When significant problems affecting multiple compute nodes are detected, jobs can also be automatically suspended and restarted once the issue has been resolved. These automated actions generate messages via email, text and instant message directly to the HPC Team, so that they can deal with any issues as rapidly as possible, before they can become a bigger problem.
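The self-healing behaviour described above boils down to a check-and-act loop on each node. The health checks, threshold and function names below are hypothetical illustrations of the pattern, not NCI’s actual monitoring code:

```python
# Illustrative self-healing loop for one compute node (hypothetical logic).
TEMP_LIMIT_C = 85  # assumed temperature threshold, for illustration only

def check_health(node):
    """Return a list of detected faults for a node (toy checks)."""
    faults = []
    if node["temp_c"] > TEMP_LIMIT_C:
        faults.append("overheating")
    if not node["filesystem_ok"]:
        faults.append("filesystem fault")
    return faults

def self_heal(node, alert):
    """Take the node offline on any fault and notify staff and vendors."""
    faults = check_health(node)
    if faults:
        node["online"] = False  # the node removes itself from service
        alert(f"{node['name']} offline: {', '.join(faults)}")
    return node["online"]

node = {"name": "r4321", "temp_c": 91, "filesystem_ok": True, "online": True}
self_heal(node, alert=print)  # prints an alert; the node is now offline
```

In practice the `alert` callback would fan out to email, text and instant messaging, as described above.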
“What you cannot measure, you cannot manage,” explains Dr Atif. “So we try to catch an issue before it becomes a problem.”
This kind of continuous monitoring, running in the background of every compute node without impacting on researchers’ compute jobs, is only possible by cleverly leveraging the computers themselves to detect and solve certain problems.
When problems do arise, the HPC Team has perfected a ‘staggered rollout’ for patches and other software upgrades across the cluster, skirting the need for a complete system shutdown. For the users, the system keeps running with no interruptions. Indeed, the goal of every innovation is to make the user experience as seamless as possible. While complex scripts, algorithms and monitoring processes are taking place, for NCI’s users it’s business as usual.
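A staggered rollout like the one described amounts to patching small batches of nodes while the rest of the cluster keeps serving jobs, so capacity never drops to zero. The batch size and function below are a hypothetical sketch of the idea, not NCI’s rollout tooling:

```python
# Illustrative staggered rollout: patch nodes a few at a time (toy model).
def staggered_rollout(nodes, patch, batch_size=2):
    """Patch nodes in batches; at most batch_size nodes are offline at once."""
    max_offline = 0
    for i in range(0, len(nodes), batch_size):
        batch = nodes[i:i + batch_size]
        max_offline = max(max_offline, len(batch))
        for node in batch:
            # In a real cluster: drain jobs, apply the patch, return to service.
            patch(node)
    return max_offline

worst_case = staggered_rollout(["n1", "n2", "n3", "n4", "n5"],
                               patch=lambda n: None)
print(worst_case)  # 2 – the cluster is never more than two nodes down
```

The contrast with a full shutdown is the point: the whole fleet gets patched, but only a small slice is ever out of service.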
In Raijin’s six years of service, this dedication has resulted in minimal disruption to services, and more time for groundbreaking science. As it is, this six-year-old machine operates at near-100% uptime and over 90% utilisation, with compute nodes idle only while waiting for new jobs to load.
High-performance computing is a driving force for innovation in research and industry, and it demands innovation of its own to meet the field’s continuous challenges. HPC vendors develop new technologies, hardware and software to keep pace with advances on the application side. NCI will continue to leverage those solutions, and its own expertise, in deploying and managing our upcoming, next-generation supercomputer, maintaining its reputation as a world-class leader in the area.
Through the continued improvement of our high-performance computing systems over the last six years, NCI has enabled some of Australia’s leading scientific research. The growth in computational power, alongside the increased variety of computational options, is thanks to cutting-edge innovations dreamt up and implemented on site by our world-leading experts.
NCI Australia continues to push the boundaries of what is possible, finding new ways to provide computational solutions to scientific problems. Alongside all of NCI’s data management, storage, cloud computing and data services innovation, our commitment to the steady development of new computational technologies makes NCI the ideal home for nationally significant research.