Thousands of genomes prepared for clinical use
In late 2016, biologists from the Garvan Institute of Medical Research and The Australian National University’s John Curtin School of Medical Research took 1,206 human genomes and, in one night of computation at NCI, realigned them to the human reference genome and identified the genetic variations they contained. Genome alignment is a technique for piecing together the many snippets of genomic sequence produced by sequencers in the lab, by matching each snippet to its position on a reference genome. There are millions of such snippets in any sequenced genome, and altogether they make up a file around 50 gigabytes in size.
The alignment process involves many steps and requires the computer processors involved to be constantly reading genomic data from hard drives and writing new data back to them. As such, the entire process is constrained by the speed of those hard drives and of the filesystem that manages the storage. Typical computational setups might manage around 30 alignments at a time, so the fact that 1,206 could be done at once at NCI was truly groundbreaking.
NCI’s high-performance filesystems, which include the two fastest in the Southern Hemisphere, are used to store all kinds of data, from earth observations through to astronomical modelling. In the case of human genomics, the filesystems make it possible for genome alignments to be done as fast as possible. Genomics is among the computational tasks most reliant on rapid communication between the filesystem and the processors.
Dr Dan Andrews, Program Manager at the NHMRC-funded Australian Genomics Health Alliance (AGHA), says, “The combination of vast computing capacity coupled with the finely tuned fast storage that NCI provides helped us scale up the software to work with more than 1,200 genomes at once. Aligning that many genomes in one night is a clear demonstration of the might of the NCI computational capacity – that couldn’t have been done elsewhere in Australia.”
The genomes came from Garvan’s Medical Genome Reference Bank (MGRB), supported by the AGHA. The MGRB is a groundbreaking database of human genetic information that, once finished, will comprise more than 4,000 complete human genomes from disease-free seniors.
Once all the genomes are in the database, researchers and clinicians will be able to query the fully anonymised information. To get to that stage, though, the genome sequences need to be aligned on a supercomputer. With this many genomes, it is impossible for an individual laboratory to deal with the data on its own. Instead, NCI provides the high-performance data and compute infrastructure that makes the work possible.
As the field of computational biology develops, the improved software and knowledge gained through research such as this will allow hospitals and clinics to incorporate genomics into their daily operations. The use of genomics in medicine brings huge potential for faster diagnosis and treatment of rare diseases. NCI is proud to be a part of the crucial early developments that will make this possible.
This research highlight was originally published in NCI’s 2016-17 Annual Report.