National Computational Infrastructure

NCI

European Geosciences Union General Assembly 2018

NCI staff are presenting several orals and posters at the EGU General Assembly, on 8-13 April 2018. This page provides links to the EGU abstract or information for each of the papers. PDF versions of all presentations and posters will be provided when they are finalised.


ORAL PRESENTATIONS

NCI’s GSKY – a scalable Geospatial data server

Authors: Ben Evans, Pablo Larraondo, Joseph Antony, Jon Smilie, Sean Pringle, Rui Yang, Matthew Sanderson, Chris Allen, Jingbo Wang and Clare Richards (NCI)

Abstract

Earth systems, environmental and geophysical datasets are an extremely valuable resource for a wide range of research, government, and industry applications. For researchers analysing, transforming, and integrating these large datasets into their work, the traditional approach has been either to download a relevant part of the data and analyse these subsets in an ad hoc manner, or to invest significant work in batch processing large datasets and then storing and organising the results for further analysis. This is rapidly becoming infeasible because of the storage space and data transformation work it requires, and it is out of reach for most end-users who are unfamiliar with working with data at this scale. Recent developments in significant data repositories with integrated data processing infrastructure open the door to new ways of processing data on demand.
The National Computational Infrastructure (NCI), hosted at the Australian National University (ANU), has developed a highly distributed geospatial data server called GSKY, which provides a new capability for high performance data analysis. GSKY is currently being used in some national and international initiatives, providing fast access to programs and tools over the network and allowing researchers to analyse NCI’s multi-petabyte, nationally significant research data collections, including satellite data products, climate and weather simulations, and rich geophysics data.
GSKY supports on-demand processing of data, enabling interactive data exploration through an OGC standards-compliant interface. Users can readily access the data via Web Map Services (WMS) and Web Processing Services (WPS), or as raw data arrays via Web Coverage Services (WCS). GSKY has functionality for specifying how ingested data should be aggregated, transformed and presented. It dynamically and efficiently distributes the requisite computations among computational nodes and thus provides a scalable analysis framework.
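To make the OGC interface concrete, the following is a minimal sketch of how a client might assemble a standard WMS 1.3.0 GetMap request. The endpoint `https://gsky.example.org/ows` and the layer name `example_layer` are hypothetical placeholders, not actual GSKY addresses or layers; only the query parameters themselves come from the WMS standard.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_getmap_url(endpoint, layer, bbox, width, height, time=None):
    """Assemble a WMS 1.3.0 GetMap URL.

    bbox is (min_lat, min_lon, max_lat, max_lon): WMS 1.3.0 uses
    latitude-first axis order for EPSG:4326.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty string selects the server's default style
        "CRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    if time is not None:
        params["TIME"] = time  # optional time dimension
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical endpoint and layer, with a bounding box roughly over Australia.
url = build_getmap_url(
    "https://gsky.example.org/ows",
    "example_layer",
    (-44.0, 112.0, -10.0, 154.0),
    512,
    512,
    time="2018-01-01T00:00:00Z",
)

# Parse the URL back to inspect what would be sent to the server.
query = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(query["REQUEST"], query["BBOX"])
```

Because the request is an ordinary URL, the same call works from a browser, a desktop GIS or a script, which is what makes a standards-compliant interface convenient for end-users.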
GSKY has required improvements in data management practice, ensuring that the data and service meet a new level of quality assurance to help meet data processing performance and end-user application requirements. In this talk we will be seeking collaborative opportunities to use, improve and further develop GSKY’s capability.

First steps towards internationally integrating data and services in the solid Earth sciences and beyond

PDF Download

Authors: Lesley Wyborn, Ben Evans (NCI), Kerstin Lehnert (Columbia University), Tim Rawling (AuScope), Jens Klump (CSIRO), Kirsten Elger (GFZ German Research Centre for Geosciences), Simon Cox (CSIRO), Helen Glaves (British Geological Survey), Mohan Ramamurthy (EarthCube), Erin Robinson (Earth Science Information Partners), and Shelley Stall (AGU)

Abstract

Globally, solid Earth science data are collected by large numbers of organizations across the academic, government and industry sectors. Spatially, the data collected cover multiple domains extending from the crust, through the lithosphere and mantle to the core. In all, many observed phenomena cross national, if not continental, boundaries, and increasingly require international networks of researchers to address growing global challenges such as scarce non-renewable resources, risk reduction for natural hazards, and fundamental research on the nature of the planet.
The last decade has seen a dramatic growth in the number of online solid Earth science datasets and in online computational power, particularly utilising Cloud or HPC hosted data and compute resources. However, data in many of these online resources have inconsistent and incompatible data descriptions and formats, and as much as 80% of data processing effort is spent on discovering, cleaning and converting pre-existing data. Software is often developed locally around specific applications and data sources, with the side-effect of a multiplicity of software providing similar and overlapping functions.

To be able to address growing global research challenges, more attention needs to be paid to harmonising multiple metadata/data standards to enable globally discoverable and accessible data, and to enhancing standards that make data programmatically actionable from robust data platforms. Knowing that the data and the data access meet agreed standards means that the software community can focus on developing better algorithms, rather than creating a myriad of ways of accessing the same data type in multiple formats. It will also be easier to create workflows for those outside of the more specialised research community.

Currently there are established national efforts creating infrastructures that help connect solid Earth researchers. In the US these include the Earth Science Information Partners (ESIP) and the NSF’s EarthCube program, whilst in Australia there have been rapid advancements in supporting e-Infrastructure with investments such as AuScope and the National Computational Infrastructure (NCI). In Europe equivalent Horizon2020 projects are the Environmental Research Infrastructure Plus (ENVRIplus) and European Plate Observing System (EPOS). All are linking data, cyberinfrastructure and research developments across the academic and government sectors. Furthermore, more generic software from standards bodies now supports many of the core requirements directly (e.g., W3C’s DCAT metadata vocabulary, W3C/OGC’s SOSA/SSN observations and sampling ontology, DataCite identifier and metadata systems).
What is needed now are mechanisms to internationally link these major infrastructures to provide not only efficiencies in funding, but also an environment where the research efforts can create globally interoperable networks of solid Earth science data, information systems, software and researchers. Furthermore, the solid Earth community also needs to understand how to build its data networks to be compatible with those of other communities. Data of similar forms are being collected ‘above Earth’ in the atmosphere, biosphere, cryosphere, hydrosphere, and pedosphere. To prepare for the future transdisciplinary science challenges, these data will be more valuable when linked with equivalent activities in data and services for environmental, atmospheric, climate and marine research.

The AuScope Virtual Research Environment – a data enhanced virtual laboratory for the solid earth sciences

Authors: Tim Rawling (AuScope), Lesley Wyborn (NCI), Ryan Fraser (CSIRO), Ben Evans (NCI), Carsten Friedrich (Data 61)

PDF Download

Abstract

AuScope has been delivering physical, software and data research infrastructure to the Australian Solid Earth research community for over a decade. In that time, many new data products have been developed across the geophysics, geochemistry and geodesy sectors, along with related software tools to enable value-adding data manipulation through simulation and modelling. The data discovery, interoperability and delivery components of the infrastructure system have been provided by traditional portals and grid-based technologies such as the Spatial Information Services Stack (SISS), with Virtual Laboratory based tools developed somewhat independently.
A broad change in usage requirements and the international move towards Findable, Accessible, Interoperable and Reusable (FAIR) data principles have provided AuScope with an opportunity to develop a new Data Enhanced Virtual Laboratory (DEVL) that will provide much closer integration of data products, analytics and simulation tools, as well as mechanisms for delivering FAIR and linked data. The DEVL will form part of the broader AuScope Virtual Research Environment (AVRE), which will be developed over the next five years.
Funding from Australia’s National Collaborative Research Infrastructure Strategy (NCRIS) partners at the Australian National Data Service (ANDS), National eResearch Collaboration Tools and Resources (NECTAR) and Research Data Services (RDS) will be utilised with co-contributions from AuScope to develop this new platform.
In the first instance, the DEVL component of the AuScope Virtual Research Environment will deliver geophysical datasets, passive seismic and magnetotellurics from AuScope’s AusLAMP and AusArray programs to support linked data workflows for laboratory information management systems for the Australian geochemistry and geochronology communities.
Subsequent development of the complete AuScope Virtual Research Environment will provide additional support for new data assimilation to enhance observational control on a priori models, as well as rapid three-dimensional geological model development, for Australia’s simulation, analytics and modelling communities.

POSTER PRESENTATIONS

Production Copernicus Data Dissemination via a consolidated Datahub

Authors: Joseph Antony (NCI), Fang Yuan (GA), Matt Nethery, Andrew Howard, Chris Allen (NCI), Neil Flood (DSITI)

Abstract

A consortium of Australian state and federal government partners required an integrated approach to support governmental decision making, founded on the European Copernicus program’s satellite data products. These data products would be provided for a region covering most of south-east Asia, the Pacific Islands and Australia. A key outcome for the consortium was the setup and operation of a datahub to greatly improve access to Copernicus data in a densely populated region of the planet that is experiencing high rates of economic growth and facing significant challenges in areas where earth observation (EO) can assist, e.g., environmental protection, sustainable natural resource use and risk reduction from natural disasters.
Geoscience Australia and the NCI have been running a production datahub (www.copernicus.org.au, copernicus.nci.org.au) presenting unified source data products from both ESA and EUMETSAT: Sentinel-1 and Sentinel-2 for the region, and a global Sentinel-3 replica.
In this presentation, we will touch upon the following aspects for this virtual collaborative environment (VCE): a) the pipelines which have continually transferred over a petabyte of EO imagery; b) the QA/QC and data publication process for Copernicus data; c) production issues in hosting the VCE in a research computing setting (network, storage and compute).
Australia’s unique geographic location presents a number of challenges for delivering high performance data transfer services. The extended distances and consequent network latency require extensive network tuning to ensure timely and accurate data movement.
NCI utilises a number of strategies in collaboration with our domestic National Research and Education Network (NREN) AARNet (Australia’s Academic and Research Network), our regional network partners TEIN (Trans Eurasian Information Network) and international network partners Internet2 (USA) and GEANT (Europe) to tune the end systems, networks and connecting exchanges to enable a reliable and timely transfer of ESA data products from Europe to the Regional Hub operated by NCI.
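The reason long-distance paths need tuning can be illustrated with the bandwidth-delay product, the amount of data that must be in flight to keep a link full. The sketch below uses assumed figures (a 10 Gbit/s path and a 300 ms round-trip time between Europe and Australia); these are illustrative values, not measurements from the datahub.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s, rtt_seconds):
    """Bytes that must be in flight to keep a path of this capacity full."""
    return bandwidth_bits_per_s * rtt_seconds / 8


def window_limited_throughput_bits_per_s(window_bytes, rtt_seconds):
    """Upper bound on single-stream TCP throughput for a given window size."""
    return window_bytes * 8 / rtt_seconds


# Assumed, illustrative path characteristics (not measured values).
link_bits_per_s = 10e9  # 10 Gbit/s
rtt_seconds = 0.3       # 300 ms round trip

bdp_bytes = bandwidth_delay_product_bytes(link_bits_per_s, rtt_seconds)
untuned_bits_per_s = window_limited_throughput_bits_per_s(64 * 1024, rtt_seconds)

print(f"Window needed to fill the path: {bdp_bytes / 1e6:.0f} MB")
print(f"Throughput with a 64 KiB window: {untuned_bits_per_s / 1e6:.2f} Mbit/s")
```

Under these assumptions a single stream needs a window of about 375 MB to fill the path, while a 64 KiB window would cap it at under 2 Mbit/s, which is why end-system buffers matter as much as raw link capacity over such distances.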
By locating the data closer to the consumers, end-users within the region are able to access the data stored in the datahub at high speed without needing to perform special tuning of the systems or applications used for data access.
Ongoing monitoring of the performance of the Europe to Australia transfers, the connecting networks and storage services is performed to ensure the highest levels of availability are maintained. All end-users are able to request a region of interest, either programmatically or via a web interface, and download the corresponding data. Users with accounts at the National Computational Infrastructure (NCI) are able to access the data in situ from HPC jobs and interactive cloud computing desktop environments.
For future work, we will further open access to the repository using OGC services such as WMS, WPS and WCS: WMS for rapid imagery inspection (true color, false color) in the native file format, as distributed by ESA and EUMETSAT, including basic biophysical parameter retrieval such as NDVI, NDWI and EVI; WPS for simple time series analysis; and WCS for access to the underlying raw data.
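As an illustration of the kind of basic biophysical parameter mentioned above, NDVI is computed per pixel from the near-infrared and red reflectances as (NIR - Red) / (NIR + Red). The sketch below applies that standard formula to small, made-up reflectance grids; it is not the datahub's implementation, and the sample values are invented for the example.

```python
def ndvi(nir, red):
    """Normalised Difference Vegetation Index for one pixel.

    Returns None where both reflectances are zero (index undefined).
    """
    total = nir + red
    if total == 0:
        return None
    return (nir - red) / total


def ndvi_grid(nir_band, red_band):
    """Apply NDVI pixel by pixel to two equally sized 2-D grids."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Made-up surface reflectance values, for illustration only.
nir_band = [[0.5, 0.4], [0.3, 0.0]]
red_band = [[0.1, 0.4], [0.1, 0.0]]

result = ndvi_grid(nir_band, red_band)
for row in result:
    print(row)
```

Values near 1 indicate dense green vegetation, values near 0 indicate bare surfaces, and the `None` entry marks a pixel with no signal in either band.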

IGSN – Status and Future Development

Authors: Jens Klump (CSIRO), Lesley Wyborn (NCI), Kerstin Lehnert (Columbia University)

Abstract

Samples have always been at the heart of the geological sciences. Compared to the infrastructure built in recent years for literature and data, the availability of sample information on the internet still lags behind. Samples are only valuable within their context: without unique identification and documentation, a collection of samples is little more than rocks in a box.
The International Geo Sample Number (IGSN) is designed to provide unambiguous globally unique identifiers for physical samples. In 2011 the IGSN Implementation Organization (IGSN e.V.) was founded to build the infrastructure and the governance framework for the persistent identification of geological samples. Since then the organisation has grown to 23 members on five continents, and more than 6 million samples have been registered. Among the members of IGSN are government geological surveys, research institutions, and universities.
IGSN is more than another label; its power lies in creating an internet representation of a sample that can be linked to the data that were derived from it and to the literature where the sample and the data are interpreted. This is made possible by using the same technological base as is used for Digital Object Identifiers (DOI), thus making the two systems fully compatible. Also, DataCite DOI and IGSN are recognised as related identifiers in both systems, thus enabling machine-actionable cross-linking between samples and data.
Until recently, samples were catalogued locally, if at all, but federated catalogues on a global scale were missing. The IGSN system architecture and catalogue metadata schema allow catalogue information to be harvested and several catalogues to be compiled into one. A proof of concept demonstrator has been implemented by the Australian IGSN Agents. The Australian IGSN Portal Demonstrator is available at http://igsn.org.au.
The recent expansion of the IGSN membership and technical advances in information technology will require significant updates of the IGSN technical architecture to keep pace with the growing demand. The current business model will also need to be reviewed.
The application of IGSN is not limited to geological materials. Earth sciences themselves have become more interdisciplinary over time. It is, therefore, no surprise that IGSNs have been applied not only to geological materials but also to water and plant materials. In addition, IGSNs have been applied to extraterrestrial materials from NASA’s Apollo Mission and other NASA missions. In principle, the IGSN governance model and technology stack can be transferred to any other discipline dealing with physical samples.

GSio: A programmatic interface for delivering Big Earth data-as-a-service

Authors: Pablo Larraondo, Edison (Jian) Guo, Joseph Antony and Ben Evans (NCI)

Abstract

Big Earth Data can greatly benefit from the ubiquitous and scalable characteristics of the Cloud. However, the Cloud’s storage and computing models differ from those used in classic computing. Transferring traditional workflows and file formats into the Cloud normally results in suboptimal performance. There is an opportunity to define a new Cloud-native model for storing geospatial data that enables large-scale access.
We present GSio: a generic interface for delivering Big Earth Data, with a Cloud-native implementation. Benefiting from the unbounded scalability of Cloud object stores, GSio splits geospatial data into small subsets containing multidimensional arrays, which are stored as individual objects. Fragmenting the data allows compute nodes to access objects in parallel, resulting in higher I/O throughput. GSio also makes use of fast compression algorithms to minimise the size of the data and its access latency. We demonstrate how this simple approach outperforms other traditional formats, such as HDF5 or GRIB, in Cloud environments.
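The general idea of chunked, compressed object storage can be sketched in a few lines. The example below is not GSio's format or API: the key layout, the tile header, the in-memory `dict` standing in for a Cloud object store, and the use of `zlib` in place of a faster compressor are all assumptions made for illustration.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 2  # tile edge length in grid cells (illustrative)


def write_chunks(grid, store, name):
    """Split a 2-D grid into CHUNK x CHUNK tiles, one compressed object each."""
    for r0 in range(0, len(grid), CHUNK):
        for c0 in range(0, len(grid[0]), CHUNK):
            tile = [row[c0:c0 + CHUNK] for row in grid[r0:r0 + CHUNK]]
            flat = [value for row in tile for value in row]
            # Tiny header (rows, cols) followed by 32-bit floats.
            header = struct.pack("<II", len(tile), len(tile[0]))
            payload = struct.pack(f"<{len(flat)}f", *flat)
            key = f"{name}/{r0 // CHUNK}.{c0 // CHUNK}"
            store[key] = zlib.compress(header + payload)


def read_chunk(store, key):
    """Fetch and decode a single tile object."""
    raw = zlib.decompress(store[key])
    nrows, ncols = struct.unpack_from("<II", raw, 0)
    flat = struct.unpack_from(f"<{nrows * ncols}f", raw, 8)
    return [list(flat[i * ncols:(i + 1) * ncols]) for i in range(nrows)]


def read_chunks_parallel(store, keys):
    """Fetch several tiles concurrently, as independent object reads."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        tiles = pool.map(lambda key: read_chunk(store, key), keys)
        return dict(zip(keys, tiles))


store = {}  # stand-in for an object store bucket
grid = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
write_chunks(grid, store, "example_var")
tiles = read_chunks_parallel(store, sorted(store))
print(sorted(tiles))
```

Because each tile is an independent object, a reader that needs only part of the grid fetches only the matching keys, and several workers can fetch different keys at the same time.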
This implementation is demonstrated by exposing a subset of ECMWF’s ERA5 reanalysis dataset. Remote access to the data can be performed using a Python client, allowing its interactive use in Jupyter notebooks or deep learning frameworks such as TensorFlow. We will cover the design principles behind GSio, present performance benchmarks and carry out a live demonstration.

PICO

Enabling FAIR and Open Data in Earth and Space Sciences Publications

Authors: Shelley Stall (AGU), Kerstin Lehnert (Columbia University), Lesley Wyborn (NCI), Erin Robinson (Earth Science Information Partners), Helen Glaves (British Geological Survey), Mark Parsons (Rensselaer Polytechnic Institute), Brooks Hanson (AGU), Joel Cutcher-Gershenfeld (Heller School for Social Policy and Management), Brian Nosek (University of Virginia), and Lynn Yarmey (Rensselaer Polytechnic Institute)

Abstract

Our research ecosystem is diverse and dependent on many interacting stakeholders that influence and support the process of science. These include funders, institutions, libraries, publishers, researchers, data managers, repositories, archives and communities. Process improvement in this ecosystem commonly requires the support of most, or all, of these stakeholders.
In October of 2014 a Coalition on Publishing Data in the Earth and Space Sciences (COPDESS) was formed to connect the Earth and space science publishers and data facilities to help translate the aspirations of open, available, and useful data from policy into practice. Initially funded by the National Science Foundation, then the Alfred P. Sloan Foundation, COPDESS provides an organizational framework for Earth and space science publishers and data facilities to jointly implement and promote common policies and procedures for the publication and citation of data across Earth Science journals. The launch of the partnership was announced on 15 January 2015 and included a joint Statement of Commitment signed by key publishers and repositories.
More recently, the value of FAIR (Findable, Accessible, Interoperable and Reusable) and Open Data has encouraged funders to sponsor discussions with tangible agreements, specifically with publishers, that include the steps needed to move the ecosystem towards results. Work by many of these stakeholders over recent years has developed pilot efforts that are ready to be scaled with broader engagement.
Building on the work of COPDESS, and with funding from the Laura and John Arnold Foundation, a partnership of the American Geophysical Union, Earth Science Information Partners (ESIP), Research Data Alliance (RDA), DataCite, Center for Open Science (COS), National Computational Infrastructure (NCI), Australian National Data Service (ANDS), AuScope and key publishers including Science, Nature, Elsevier, PLOS, Wiley, and the Proceedings of the National Academy of Sciences (PNAS) has agreed to work together along with leading repositories to develop integrated processes, leveraging these pilots, to make FAIR and open data the default for Earth and space science publications.
Along with the initial COPDESS effort, this project builds on the work of ESIP, RDA, the scientific journals, and domain repositories to ensure that well-documented data, preserved in a repository with community agreed-upon metadata and supporting persistent identifiers, become part of the expected research products submitted in support of each publication. No longer will data be locked up in hard-to-discover supplements. Data that support research publications will be available in public repositories and capable of being accessed via persistent identifiers. (Protected and sensitive data will still have appropriate metadata that is open and accessible, but the data will have proper access controls.) This is a significant policy shift in how the research products (in particular data) supporting the research results are identified and referenced in publications. This effort takes decisive steps towards enabling FAIR and open data, and supporting research integrity and reproducible science.
