Behind the scenes of big data storage
NCI will soon host more than 10 petabytes of research data. That’s 10,000,000,000 megabytes of files. Hosting these datasets is not as simple as copying and pasting the files onto a hard drive – they need to be thoroughly catalogued so they can be easily searched, downloaded and analysed. There’s no point having data if you can’t use it.
“These datasets are of enormous value to the research community and Australia as a whole,” explains NCI Data Collections Manager Dr Jingbo Wang.
“For example, to regenerate the Australian Geophysical Data Collection, which represents just 300TB of our 10PB database, would cost billions.”
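The storage figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, where one petabyte is 10^15 bytes; the variable names are illustrative only.

```python
# Decimal (SI) storage units, expressed in bytes. This is an assumption:
# binary units (1 PiB = 2**50 bytes) would give slightly different figures.
MEGABYTE = 10**6
TERABYTE = 10**12
PETABYTE = 10**15

# Total holdings quoted in the article: 10 petabytes.
total_bytes = 10 * PETABYTE
total_megabytes = total_bytes // MEGABYTE

# The geophysical collection quoted in the article: 300 terabytes.
geophysical_bytes = 300 * TERABYTE
geophysical_share = geophysical_bytes / total_bytes

print(f"{total_megabytes:,} MB in total")
print(f"Geophysical collection: {geophysical_share:.0%} of the total")
```

On these assumptions, the 300TB geophysical collection accounts for about 3 per cent of the 10PB total.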
By developing comprehensive and flexible data management plans, NCI is ensuring researchers can make the most out of these nationally significant datasets.
NCI is unique in the big data space in that it combines more than 10 petabytes of data storage infrastructure, supported by RDSI, with more than 57,000 high-performance compute cores on our Raijin supercomputer, and the highest performance node of the NeCTAR research cloud, purpose-built for data-intensive research.
“There are many supercomputing centres around the world, but no one else has workable petascale research data storage facilities with a systematic catalogue on site, plus high-quality data services infrastructure configured on cloud,” says Dr Wang.
“That means the majority of data archive facilities are not connected to sufficient computing power to analyse these massive datasets. Each researcher needs to download the desired dataset onto their own computer or transfer it to a supercomputing facility for analysis. This wastes time and creates unnecessary duplication.
“At NCI, researchers can search national collections for the data they want and then analyse it on-site using Raijin – the data is already in the supercomputer’s filesystem.”
Dr Wang and her team are working to establish the National Environmental Research Data Collection (NERDC), the first of its kind in Australia. The collection spans national datasets from deep space to the Earth’s core.
By bringing these important datasets together within the highly integrated data and computational environment of NCI, researchers will be able to find and use the data in ways that wouldn’t otherwise be possible.
“In the past, data was housed by whoever generated it,” says Dr Wang. “Researchers had to go to individual sites – such as Geoscience Australia for geophysical datasets, and the Bureau of Meteorology for weather or climate data. That’s a time-consuming process and it’s often difficult to know who hosts which datasets and what is publicly available.
“By collecting all the research data together at NCI and making it publicly visible and searchable all in one place, RDSI has made a big step towards reducing the barrier to finding the data in the first place.”
NCI’s big data approach includes two areas of innovation. The first is implementing international data management standards into the cataloguing process to streamline the workflow.
“Our goal is to minimise the work for each data collection manager by maximising the automation of the whole workflow,” says Dr Wang.
“By aligning our cataloguing process with international standards, basically we are ensuring that all of the datasets are described in the same language, so they can be compared, shared and harvested by any researcher around the world.”
The other innovative road NCI has taken with its data management is a multi-level catalogue architecture.
“We use the same open source GeoNetwork cataloguing software as many of our major partners, which has made it easy for us to transfer these large datasets across to NCI.
“But we have gone one step further by incorporating many more levels of hierarchy into our database. This allows for increased visibility of data: if a researcher is looking at the catalogue entry for one dataset they can see what larger category it belongs to, and from there find other relevant datasets that they might not know existed.”
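The idea of a multi-level catalogue can be illustrated with a small tree structure. This is only a sketch, not NCI’s actual schema or the GeoNetwork data model: the `CatalogueEntry` class, its methods, and the category and dataset names are hypothetical, chosen to show how a researcher could move from one entry up to its parent category and across to related datasets.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class CatalogueEntry:
    """One node in a hypothetical multi-level catalogue."""
    title: str
    parent: Optional["CatalogueEntry"] = field(default=None, repr=False)
    children: List["CatalogueEntry"] = field(default_factory=list, repr=False)

    def add_child(self, title: str) -> "CatalogueEntry":
        """Create an entry one level below this one and return it."""
        child = CatalogueEntry(title, parent=self)
        self.children.append(child)
        return child

    def ancestors(self) -> List[str]:
        """Titles of the larger categories this entry belongs to, nearest first."""
        titles = []
        node = self.parent
        while node is not None:
            titles.append(node.title)
            node = node.parent
        return titles

    def related(self) -> List[str]:
        """Titles of other entries that share this entry's parent category."""
        if self.parent is None:
            return []
        return [c.title for c in self.parent.children if c is not self]


# Made-up example hierarchy: collection -> category -> datasets.
collection = CatalogueEntry("Environmental research collection")
geophysics = collection.add_child("Geophysics")
gravity = geophysics.add_child("Gravity survey (example)")
magnetics = geophysics.add_child("Magnetic survey (example)")
climate = collection.add_child("Climate and weather")

print(gravity.ancestors())
print(gravity.related())
```

Starting from the entry for one dataset, `ancestors()` shows the larger categories it sits in, and `related()` lists the neighbouring datasets a researcher might not otherwise have known about.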
The end result of the RDSI-funded project will not just be a large data store, but a usable database that stimulates scientific collaboration and endeavour, says Dr Wang.
“Instead of providing disconnected data collections, we are working toward an all-encompassing database that comes with ready-to-use tools and services for researchers.”
Dr Wang presented a talk entitled ‘Large-Scale Data Collection Metadata Management at NCI’ at the American Geophysical Union meeting in December. View it at the NCI website.