Damage to the genes and proteins inside a cell can result in diseases such as cancers, diabetes and obesity. Understanding exactly how this damage affects DNA is key to developing early and effective treatments, and improves the chances of a treatment succeeding.
Epigenetics is the study of how gene expression is regulated inside cells by information that sits on top of the DNA sequence code. With the advance of next-generation sequencing (NGS), a massively parallel sequencing technology, epigenetic DNA modifications can be profiled on a genome-wide scale. NGS produces large data sets of short DNA sequences that must be mapped to the genome before use. This mapping, called genome alignment, requires high-performance computing systems if researchers are to gain biological insights from the epigenetic sequence data sets. The sequence data sets also require computational pre-processing for quality control before the data can be analysed with confidence.
Dr Phuc Loi Luu, a senior bioinformatician in Professor Susan Clark's Epigenetics Research laboratory at the Garvan Institute of Medical Research, performs the pre-processing and alignment of whole-genome bisulphite sequencing (WGBS) data to study DNA patterns specific to cancer. The computational alignment workflow consists of 64 steps and takes three to four weeks to run on the in-house cluster. Recently, by using NCI's Virtual Desktop Infrastructure (VDI) to submit jobs to the Raijin supercomputer, Dr Luu built a pipeline that connects all the steps and runs them as a single process in a secure and easy-to-use environment.
Dr Luu has optimised his new pipeline for the VDI and obtained a significant reduction in run time, from three weeks to four days, making the workflow faster, more efficient and more cost-effective.
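For readers curious about what connecting separate steps into a single process can look like, the following is a minimal Python sketch and not Dr Luu's actual pipeline. The step names (`quality_filter`, `align`, `summarise`), the in-memory `state` dictionary and the exact-match "alignment" are all hypothetical stand-ins chosen for illustration; real WGBS alignment uses specialised tools and runs as batch jobs on the cluster.

```python
from typing import Callable, Dict, List, Tuple

# A step is a (name, function) pair; each function takes the current
# state and returns a new state for the next step.
Step = Tuple[str, Callable[[Dict], Dict]]


def quality_filter(state: Dict) -> Dict:
    # Hypothetical quality-control step: drop reads that are too short.
    kept = [r for r in state["reads"] if len(r) >= state["min_length"]]
    return {**state, "reads": kept}


def align(state: Dict) -> Dict:
    # Toy stand-in for genome alignment: exact substring search only.
    # Real bisulphite aligners must also handle converted bases.
    reference = state["reference"]
    hits = {}
    for read in state["reads"]:
        position = reference.find(read)
        if position != -1:
            hits[read] = position
    return {**state, "alignments": hits}


def summarise(state: Dict) -> Dict:
    # Record how many reads were placed on the reference.
    return {**state, "mapped": len(state["alignments"])}


def run_pipeline(steps: List[Step], state: Dict) -> Dict:
    # Run every step in order as one process, stopping at the first
    # failure and reporting which step caused it.
    for name, func in steps:
        try:
            state = func(state)
        except Exception as exc:
            raise RuntimeError(f"step '{name}' failed") from exc
        state = {**state, "log": state.get("log", []) + [name]}
    return state


STEPS: List[Step] = [
    ("quality_filter", quality_filter),
    ("align", align),
    ("summarise", summarise),
]

if __name__ == "__main__":
    result = run_pipeline(
        STEPS,
        {"reads": ["ACGT", "GG", "TTTT"], "min_length": 3, "reference": "AAACGTCC"},
    )
    print(result["mapped"], result["log"])
```

The point of the sketch is the design choice, not the biology: once each step has the same input and output shape, the whole sequence can be launched once and monitored as a single run, with a failure traced back to the step that caused it.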
"The DNA sequencing machines are now much faster than they used to be, but until this point, the way of working with large data in the WGBS alignment step has not been keeping pace. With the move to NCI, in a single secure environment, we don't have those problems any more. Now, right after the sequencing is performed, I can give biologically meaningful data to wet-lab biologists in 3 or 4 days instead of 3 to 4 weeks of processing time. The speed of processing allows for immediate insights into the data to accelerate the research and design of the next experiments," Dr Luu said.
The reduction in time spent calculating is not the only benefit that Dr Luu sees from the VDI.
"The easy-to-understand desktop interface of the VDI makes it much simpler for biologists with no programming knowledge to work on the analyses. The convenience of being able to log in from anywhere and monitor progress is a significant improvement," he said.
Dr Luu first learnt about the VDI at the HPC Summer School that NCI organised for users in February 2017. Since then he has made use of the high-performance computing, data storage, data tools and computational expertise available at NCI, a good example of how NCI's integrated environment enables scientific research.