American Geophysical Union (AGU) Fall Meeting 2015
Enabling dynamic access to dynamic petascale Earth Systems and Environmental data collections is easy: citing and reproducing the actual data extracts used in research publications is NOT.
Authors: Jingbo Wang, Wei Si, Kelsey Druken, Ben Evans, Claire Trenham, Lesley Wyborn (National Computational Infrastructure, The Australian National University), Jens Klump (CSIRO Earth Science and Resource Engineering Perth), Nicholas Car (CSIRO)
The National Computational Infrastructure (NCI) at the Australian National University (ANU) has collocated over 10 PB of national and international Earth Systems and Environmental data assets within an HPC facility to create the National Environmental Research Data Interoperability Platform (NERDIP). Data are replicated to, or are produced at, NCI; in many cases they are processed to higher-level data products. Individual data sets within these collections can range from multi-petabyte climate models and large volume raster arrays, down to gigabyte size, ultra-high resolution data sets.
All data are quality assured prior to being ‘published’ and made accessible as services. Persistent identifiers are assigned during publishing at both the collection and data set level: the granularity and version control on persistent identifiers depend on the data set.
However, most NERDIP collections are dynamic: either new data are being appended, or models and derivative products are being revised with new data or changed as processing methods are improved. Further, because the data are accessible as services, researchers can log in and dynamically create user-defined subsets for specific research projects: inevitably such extracts underpin traditional ‘publications’. Reproducing these exact data extracts can be difficult, and for the very large data sets preserving a copy of a large data extract is out of the question.
A solution is for the researcher to use provenance workflows that at a minimum capture the version of the data set used, the query and the time of extraction. In parallel, the data provider needs to implement version controls on the data and deploy tracking systems that time stamp when new data are appended, or when modifications are made to existing data and record what these changes are. Where, when and how persistent identifiers are minted on these large and dynamically changing data sets is still open to debate.
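As an illustration only, the minimum provenance record described above (data set version, query, and time of extraction) could be captured along the following lines. This is a sketch under stated assumptions, not NCI's implementation: the class name `ExtractProvenance`, the function `record_extract`, the field names, and the example identifiers and query string are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ExtractProvenance:
    """Hypothetical minimal provenance record for one data extract."""
    dataset_id: str       # persistent identifier of the data set (assumed format)
    dataset_version: str  # version label supplied by the data provider
    query: str            # the exact subsetting query that was issued
    extracted_at: str     # time of extraction, ISO 8601 in UTC

    def fingerprint(self) -> str:
        """Return a SHA-256 digest of the record, usable as a citation key."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def record_extract(dataset_id, dataset_version, query, when=None):
    """Build a provenance record; `when` must be timezone-aware if given."""
    if when is None:
        when = datetime.now(timezone.utc)
    elif when.tzinfo is None:
        raise ValueError("extraction time must be timezone-aware")
    stamp = when.astimezone(timezone.utc).isoformat()
    return ExtractProvenance(dataset_id, dataset_version, query, stamp)


if __name__ == "__main__":
    # Identifier, version and query below are made-up placeholders.
    rec = record_extract("example-collection", "v1.2",
                         "var=tas&time=2000-01/2000-12")
    print(json.dumps(asdict(rec), indent=2))
    print(rec.fingerprint())
```

A record like this is only sufficient to reproduce an extract if the data provider can, as the abstract notes, resolve the stated version and time stamp back to the exact state of the data set at that moment.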
Standardised online data access and publishing for Earth Systems and Climate data in Australia
Authors: Kelsey Druken, Claire Trenham, Jingbo Wang, Ben Evans, Lesley Wyborn, Jon Smillie, Chris Allen, David Porter – all National Computational Infrastructure, The Australian National University
The National Computational Infrastructure (NCI) hosts Australia’s largest repository (10+ PB) of research data collections spanning a wide range of fields from climate, coasts, oceans, and geophysics through to astronomy, bioinformatics, and the social sciences. Spatial scales range from global to local ultra-high resolution, requiring storage volumes from MB to PB. The data have been organised to be highly connected to both the NCI HPC and cloud resources (e.g., interactive visualisation and analysis environments). Researchers can log in to utilise the high performance infrastructure for these data collections, or access the data via standards-based web services. Our aim is to provide a trusted platform to support interdisciplinary research across all the collections as well as services for use of the data within individual communities.
We thus cater to a wide range of researcher needs, whilst needing to maintain a consistent approach to data management and publishing. All research data collections hosted at NCI are governed by a data management plan, prior to being published through a variety of platforms and web services such as OPeNDAP, HTTP, and WMS. The data management plan ensures the use of standard formats (when available) that comply with relevant data conventions (e.g., CF-Convention) and metadata standards (e.g., ISO19115). Digital Object Identifiers (DOIs) can be minted at NCI and assigned to datasets and collections. Large scale data growth and use in a variety of research fields has led to a rise in, and acceptance of, open spatial data formats such as NetCDF4/HDF5, prompting a need to extend these data conventions to fields such as geophysics and satellite Earth observations.
The fusion of DOI-minted data with discovery and access via metadata and web services creates a complete picture of data hosting, discovery, use, and citation. This enables standardised and reproducible data analysis.
The Interoperability Challenge for the Geosciences: Stepping up from Interoperability between Disciplinary Siloes to Creating Transdisciplinary Data Platforms.
Authors: Lesley Wyborn, Ben Evans, Claire Trenham, Kelsey Druken, Jingbo Wang (National Computational Infrastructure, The Australian National University).
The National Computational Infrastructure (NCI) at the Australian National University (ANU) has collocated over 10 PB of national and international data assets within an HPC facility to create the National Environmental Research Data Interoperability Platform (NERDIP). The data span a wide range of fields from the earth systems and environment (climate, coasts, oceans, and geophysics) through to astronomy, bioinformatics, and the social sciences. These diverse data collections are collocated on a major data storage node that is linked to a Petascale HPC and Cloud facility. Users can search across all of the collections and either log in and access the data directly, or they can access the data via standards-based web services.
These collocated petascale data collections are theoretically a massive resource for interdisciplinary science at scales and resolutions never hitherto possible. However, once the collections were collocated, multiple barriers became apparent that make cross-domain data integration very difficult, and often so time consuming that either less ambitious research goals are attempted or the project is abandoned. Incompatible content is only one half of the problem: other showstoppers are differing access models, licences and issues of ownership of derived products.
Brokers can enable interdisciplinary research but in reality are we just delaying the inevitable?
A call to action is required to adopt a transdisciplinary approach at the conception of new multi-disciplinary systems, whereby those across all the scientific domains, the humanities, social sciences and beyond work together to create a unity of informatics platforms. These platforms should interoperate horizontally across the multiple discipline boundaries, and also operate vertically to enable a diversity of people to access data, from high-end researchers to undergraduates, school students and the general public. Once we master such a transdisciplinary approach to our vast global information assets, we will have met THE interoperability challenge for the geosciences and made geoscience data and information accessible to all domains and to all peoples.
Building the Petascale National Environmental Research Interoperability Data Platform (NERDIP): Minimizing the ‘Trough of Disillusionment’ and Accelerating Pathways to the ‘Plateau of Productivity’
Authors: Lesley Wyborn, Ben Evans (National Computational Infrastructure, The Australian National University).
The National Computational Infrastructure (NCI) at the Australian National University (ANU) has evolved to become Australia’s peak computing centre for national computational and Data-intensive Earth system science. More recently NCI collocated 10 Petabytes of data, comprising 34 major national and international environmental, climate, earth system, geophysics and astronomy data collections, to create the National Environmental Research Interoperability Data Platform (NERDIP). Spatial scales of the collections range from global to local ultra-high resolution, whilst sizes range from 3 PB down to a few GB. The data are highly connected to both NCI HPC and cloud resources via low latency internal networks with massive bandwidth.
Now that the collections are collocated on a single data platform, the ‘Hype’ and expectations around potential use cases for NERDIP are high. Not unexpectedly, issues are emerging around access, licensing, ownership, and incompatible data standards. Many communities are standardised within their domain, but achieving true interdisciplinary science will require all communities to move towards open interoperable data formats such as NetCDF4/HDF5. This transition will affect software that uses proprietary or non-open standards.
But before we reach the ‘Plateau of Productivity’, there needs to be greater ‘Enlightenment’ of users. They need to be encouraged to realise that this unprecedented Earth system science platform provides a rich mine of opportunities for discovery and innovation across a diverse range of both domain-specific and interdisciplinary investigations, including climate and weather research, impact analysis, environment, remote sensing and geophysics. Users also need to develop new and innovative interdisciplinary use cases that will guide those architecting the system. This will help minimise the amplitude of the ‘Trough of Disillusionment’ and ensure greater productivity and uptake of the collections that make NERDIP unique in the next generation of Data-intensive Science.
It Takes A ‘Village of Partnerships’ To Raise A ‘Big Data Facility’ In A ‘Big Data World’.
Authors: Ben Evans, Lesley Wyborn (National Computational Infrastructure, The Australian National University).
The National Computational Infrastructure (NCI) at the Australian National University (ANU) has collocated a priority set of national and international data assets that span a wide range of domains, including climate, oceans, geophysics, environment, astronomy, bioinformatics and the social sciences. The data are located on a 10 PB High Performance Data (HPD) Node that is integrated with a High Performance Computing (HPC) facility to enable a new style of Data-intensive in-situ analysis. Investigators can either log in and access the data collections directly, or access them via modern standards-based web services.
The NCI integrated HPD/HPC facility is supported by a ‘village’ of partnerships. NCI itself operates as a formal partnership between the ANU and three major National Scientific Agencies: CSIRO, the Bureau of Meteorology (BoM) and Geoscience Australia (GA). These same agencies are also the custodians of many of the national data collections hosted at NCI, and in partnership with other collaborating national and overseas organisations have agreed to work together to develop a shared data environment and use standards that enable interoperability between the collections, rather than isolating their collections as separate entities that each agency runs independently.
To effectively analyse these complex and large volume data sets, NCI has entered into a series of partnerships with national and international agencies to provide world-class digital analytical environments that allow computational analyses to be conducted and shared.
The ability for government and research to work in partnership at the NCI has been well established over the last decade, mainly with BoM, CSIRO, and GA. New industry linkages are now being encouraged by revised government agendas. These promise to foster a new series of partnerships that will increase uptake of this major government-funded infrastructure and lead to further collaboration and innovation.