National Computational Infrastructure

European Geosciences Union General Assembly 2017

NCI staff are giving several oral and poster presentations at the EGU General Assembly on 23-28 April 2017. This page provides links to the EGU abstract or information for each of the papers. PDF versions of all presentations will be provided once they are finalised.


ORAL PRESENTATIONS

Improving global data infrastructures for more effective and scalable analysis of Earth and environmental data: the Australian NCI NERDIP Approach (Presentation)

Authors: Ben Evans, Lesley Wyborn, Kelsey Druken, Clare Richards, Claire Trenham, Jingbo Wang, Pablo Rozas Larraondo, Adam Steer and Jon Smillie (NCI, ANU)

Abstract

The National Computational Infrastructure (NCI) facility hosts one of Australia’s largest repositories (10+ PBytes) of research data collections, spanning datasets from climate, coasts, oceans, and geophysics through to astronomy, bioinformatics, and the social sciences. The data are obtained from national and international sources and span a wide range of gridded and ungridded data (e.g. line surveys, point clouds) and raster imagery, as well as diverse coordinate reference projections and resolutions.

Rather than managing these data assets as a digital library, whereby users can discover and download files to personal servers (similar to borrowing ‘books’ from a ‘library’), NCI has built an extensive and well-integrated research data platform, the National Environmental Research Data Interoperability Platform (NERDIP, http://nci.org.au/data-collections/nerdip/). The NERDIP architecture enables programmatic access to data via standards-compliant services for high performance data analysis, and provides a flexible cloud-based environment to facilitate the next generation of transdisciplinary scientific research across all data domains.
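
As a minimal sketch of what such standards-based programmatic access can look like, the snippet below opens a remote dataset over OPeNDAP with xarray and requests only a spatio-temporal subset; the endpoint URL, dataset path and variable name are hypothetical placeholders rather than actual NERDIP service addresses.

```python
# Minimal sketch of standards-based programmatic access (OPeNDAP via xarray).
# The endpoint URL, dataset path and variable name are hypothetical placeholders,
# not actual NERDIP service addresses.
import xarray as xr

OPENDAP_URL = "https://example.nci.org.au/thredds/dodsC/some_collection/tasmax.nc"

# Open the remote dataset lazily; only the subset requested below is transferred.
ds = xr.open_dataset(OPENDAP_URL)

# Select a spatio-temporal subset without downloading the whole file.
subset = ds["tasmax"].sel(
    lat=slice(-44.0, -10.0),      # Australian latitudes
    lon=slice(112.0, 154.0),      # Australian longitudes
    time=slice("2000-01-01", "2000-12-31"),
)

print(subset.mean().values)       # trigger the remote read and compute a mean
```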

To make the most of modern scalable data infrastructures focused on efficient data analysis, the organisation of the data needs to be carefully managed, including performance evaluations of projections and coordinate systems, data encoding standards, and formats. A complication is that multiple domain vocabularies and ontologies are often associated with equivalent datasets. It is not practical for individual dataset managers to determine on their own which standards are best to apply to their dataset, as the choice affects accessibility and interoperability. Instead, they need to work with data custodians across interrelated communities and, in partnership with the data repository and the international scientific community, determine the most useful approach. For the data repository, this approach is essential to enable different disciplines and research communities to invoke new forms of analysis and discovery in an increasingly complex data-rich environment.

Driven by the heterogeneity of Earth and environmental datasets, NCI developed a Data Quality/Data Assurance Strategy to ensure that consistency is maintained within and across all datasets, complemented by functionality testing to ensure smooth interoperability between products, tools, and services. This is particularly important for collections that contain data generated from multiple data acquisition campaigns, often using instruments and models that have evolved over time. By implementing the NCI Data Quality Strategy we have seen progressive improvement in the integration and quality of the datasets across the different subject domains and, through this, in the ease with which users can access data from this major data infrastructure. By both adhering to international standards and contributing to extensions of these standards, data from the NCI NERDIP platform can be federated with data from other globally distributed data repositories and infrastructures.
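
As a simplified illustration of the kind of automated check such a strategy implies (not NCI's actual checklist), the sketch below verifies that a netCDF file declares a CF convention and that its data variables carry units; the file name and the attributes tested are assumptions.

```python
# Illustrative sketch only: a simplified consistency check of the kind a data
# quality strategy might automate. The file name and the exact attributes
# checked are assumptions, not NCI's actual checklist.
from netCDF4 import Dataset

def check_basic_conventions(path):
    """Return a list of problems found in a netCDF file's metadata."""
    problems = []
    with Dataset(path) as nc:
        conventions = getattr(nc, "Conventions", "")
        if "CF" not in conventions:
            problems.append("global 'Conventions' attribute does not declare CF")
        for name, var in nc.variables.items():
            # Skip pure dimension/coordinate variables in this simplified check.
            if name not in nc.dimensions and not hasattr(var, "units"):
                problems.append(f"variable '{name}' has no 'units' attribute")
    return problems

if __name__ == "__main__":
    for issue in check_basic_conventions("example_dataset.nc"):  # hypothetical file
        print("WARNING:", issue)
```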

The NCI approach builds on our experience working with the astronomy and climate science communities, which have been internationally coordinating such interoperability standards within their disciplines for some years. The results of our work so far demonstrate that more could be done in the Earth science, solid earth and environmental communities, particularly through establishing better linkages between international and national community efforts such as EPOS, ENVRIplus, EarthCube, AuScope and the Research Data Alliance.


Building an Internet of Samples: The Australian Contribution

Authors: Lesley Wyborn (NCI, ANU), Jens Klump (CSIRO), Irina Bastrakova (Geoscience Australia), Anusuriya Devaraju (CSIRO), Brent McInnes (Curtin University), Simon Cox (CSIRO), Linda Karssies (CSIRO), Julia Martin (Australian National Data Service), Shawn Ross (Department of Ancient History), John Morrissey (CSIRO) and Ryan Fraser (CSIRO)

Abstract

Physical samples are often the ground truth for research reported in the scientific literature across multiple domains. They are collected by many different entities (individual researchers, laboratories, government agencies, mining companies, citizens, museums, etc.). Samples must be curated over the long term both to ensure that their existence is known and to allow any data derived from them through laboratory and field tests to be linked back to the physical samples. For example, having unique identifiers that link ground-truth data back to the original sample helps calibrate large volumes of remotely sensed data. Access to catalogues of reliably identified samples from several collections promotes collaboration across all Earth Science disciplines. It also increases the cost effectiveness of research by reducing the need to re-collect samples in the field. The assignment of web identifiers to the digital representations of these physical objects allows us to link to data, literature, investigators and institutions, thus creating an “Internet of Samples”.

An Australian implementation of the “Internet of Samples” uses the IGSN (International Geo Sample Number, http://igsn.github.io) to identify samples in a globally unique and persistent way. IGSN was developed in the solid earth science community and is recommended for sample identification by the Coalition for Publishing Data in the Earth and Space Sciences (COPDESS). IGSN is interoperable with other persistent identifier systems such as DataCite. Furthermore, the basic IGSN description metadata schema is compatible with existing schemas such as OGC Observations and Measurements (O&M) and the DataCite Metadata Schema, which makes crosswalks to other metadata schemas easy. IGSN metadata is disseminated through the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), allowing it to be aggregated in other applications such as portals (e.g. the Australian IGSN catalogue, http://igsn2.csiro.au), and is available in more than one format. The software for the IGSN web services is based on components developed for DataCite and adapted to the specific requirements of IGSN. This cooperation in open source development ensures sustainable implementation and faster turnaround times for updates.
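
As a hedged illustration of consuming this OAI-PMH dissemination, the sketch below issues a standard ListRecords request and prints each record's identifier and datestamp; the endpoint URL and metadata prefix are assumptions for illustration rather than the operational IGSN service configuration.

```python
# Sketch of harvesting IGSN metadata over OAI-PMH using only the standard library.
# The endpoint URL and metadataPrefix are illustrative assumptions; consult the
# relevant allocation agent for the operational values.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "https://example.csiro.au/igsn/oai"          # hypothetical endpoint
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}    # standard OAI-PMH namespace

params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
url = OAI_ENDPOINT + "?" + urllib.parse.urlencode(params)

with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

# Print the OAI identifier and datestamp of each harvested record header.
for header in tree.iterfind(".//oai:header", OAI_NS):
    identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
    datestamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
    print(identifier, datestamp)
```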

IGSN, in particular in its Australian implementation, is characterised by a federated approach to system architecture and organisational governance, giving it the necessary flexibility to adapt to particular local practices within multiple domains whilst maintaining an overarching international standard. The three current IGSN allocation agents in Australia (Geoscience Australia, CSIRO and Curtin University) represent different sectors. Through funding from the Australian Research Data Services Program they have combined to develop a common web portal that allows discovery of physical samples and sample collections at a national level. International governance then ensures we can link to an international community while acting locally to ensure the services offered are relevant to the needs of Australian researchers. This flexibility aids the integration of new disciplines into a global physical samples information network.


Moving Virtual Research Environments from high maintenance Stovepipes to Multi-purpose Sustainable Service-oriented Science Platforms

Authors: Jens Klump (CSIRO), Ryan Fraser (CSIRO), Lesley Wyborn (NCI/ANU), Carsten Friedrich (CSIRO), Geoffrey Squire (CSIRO), Michelle Barker (NeCTAR) and Glenn Moloney (NeCTAR)

Abstract

The researcher of today is likely to be part of a team distributed over multiple sites that accesses data from an external repository and then processes it on a public or private cloud, or even on a large centralised supercomputer. They are increasingly likely to use a mixture of their own code, third-party software and libraries, or even global community codes. These components are connected into Virtual Research Environments (VREs) that enable members of a research team who are not co-located to actively work together at various scales, sharing data, models, tools, software, workflows, best practices, infrastructures, etc. Many VREs are built in isolation: designed to meet the needs of a specific research program, with components tightly coupled and not capable of being repurposed for other use cases – they become ‘stovepipes’. The limited number of users of some VREs also means that the cost of maintenance per researcher can be unacceptably high.

The alternative is to develop service-oriented Science Platforms that enable multiple communities to develop specialised solutions for specific research programs. The platforms can offer access to data, software tools and processing infrastructures (cloud, supercomputers) through globally distributed, interconnected modules.

In Australia, the Virtual Geophysics Laboratory (VGL) was initially built to give a specific set of researchers in government agencies access to specific data sets and a limited number of tools; it is now rapidly evolving into a multi-purpose Earth science platform with access to a greater variety of data, a broader range of tools, users from more sectors, and a diversity of computational infrastructures. The expansion has been relatively easy because of an architecture in which data, tools and compute resources are loosely coupled via interfaces that are built on international standards and accessed as services wherever possible. In recent years, investments in the discoverability and accessibility of data via online services in Australia have meant that data resources can be easily added to the virtual environments as and when required.

Another key to increasing the reusability and uptake of a VRE is the capability to capture workflows so that they can be reused and repurposed both within and beyond the community that defined the original use case. Unfortunately, Software-as-a-Service in the research sector is not yet mature. In response, we developed a Scientific Software Solutions Center (SSSC) that enables researchers to discover, deploy and then share computational codes, code snippets or processes in both a human- and machine-readable manner.

Growth has come not only from within the Earth science community but also from the Australian Virtual Laboratory community, which is building VREs for a diversity of communities such as astronomy, genomics, environment, humanities and climate. Components such as access control, provenance, visualisation and accounting are common to all scientific domains, and sharing these across multiple domains reduces costs and, more importantly, increases the ability to undertake interdisciplinary science. These efforts are transitioning VREs to more sustainable Service-oriented Science Platforms that can be delivered in an agile, adaptable manner for broader community interests.


eodataservice.org: how to enable cross-continental interoperability of the European Space Agency and Australian Geoscience Landsat datacubes

Authors: Simone Mantovani (MEEO S.r.l., SISTEMA GmbH), Damiano Barboni (MEEO S.r.l.), Stefano Natali (MEEO S.r.l., SISTEMA GmbH), Ben Evans (NCI, ANU), Adam Steer (NCI, ANU), Patrik Hogan (NASA) and Peter Baumann (JACOBS University)

Abstract

Globally, billions of dollars are invested annually in Earth observations that support public services, commercial activity, and scientific inquiry. The Common Data Framework [1] for Earth Observation data summarises the current standards for the international community to adopt a common approach so that these significant data holdings can be made readily accessible.

Concurrently, the “Copernicus Cooperation Arrangement” between the European Commission and the Australian Government is just one of a number of recent agreements signed to facilitate satellite Earth Observation data sharing among user communities. The typical approach implemented in these initiatives is the establishment of a regional data access hub, managed by the regional entity, to collect data at full scale or over the local region, improve access services and provide a high-performance environment in which all the data can be analysed. Furthermore, a number of datacube-aware platforms and services have emerged that enable a new collaborative approach to analysing the vast quantities of satellite imagery and other Earth Observations, making it quicker and easier to explore a time series of image data.

In this context, the H2020-funded EarthServer2 project brings together multiple organisations in Europe, Australia and the United States to allow federated data holdings to be analysed using web-based access to petabytes of multidimensional geospatial datasets. The aim is to ensure that these large spatial data sources can be accessed based on OGC standards, namely the Web Coverage Service (WCS) and the Web Coverage Processing Service (WCPS), which provide efficient and timely retrieval of large volumes of geospatial data as well as on-the-fly processing.
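
As a hedged sketch of this access pattern, the snippet below sends a WCPS ProcessCoverages query over HTTP and saves the returned subset as GeoTIFF; the service URL, coverage name and axis labels are illustrative assumptions, not the actual eodataservice.org configuration.

```python
# Sketch of an on-the-fly WCPS request against a WCS 2.0 / WCPS endpoint.
# The service URL, coverage name and axis labels are illustrative assumptions.
import urllib.parse
import urllib.request

WCPS_ENDPOINT = "https://example.eodataservice.org/rasdaman/ows"  # hypothetical

# Subset a Landsat-style coverage in space and time and return it as GeoTIFF.
wcps_query = """
for c in (LandsatSurfaceReflectance)
return encode(
    c[Lat(-36.0:-34.0), Long(148.0:150.0), ansi("2016-01-01")],
    "image/tiff")
"""

params = {
    "service": "WCS",
    "version": "2.0.1",
    "request": "ProcessCoverages",
    "query": wcps_query,
}

with urllib.request.urlopen(WCPS_ENDPOINT + "?" + urllib.parse.urlencode(params)) as resp:
    with open("subset.tif", "wb") as out:
        out.write(resp.read())
```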

In this study, we provide an overview of the existing European Space Agency and Australian Geoscience Landsat datacubes, how the regional datacube structures differ, how interoperability is enabled through standards, and finally how the datacubes can be visualized on a virtual globe (NASA – ESA WebWorldWind) based on a WC(P)S query via any standard internet browser.


NetCDF4/HDF5 and Linked Data in the Real World – Enriching Geoscientific Metadata without Bloat

Authors: Alex Ip (Geoscience Australia), Nicholas Car (Geoscience Australia), Kelsey Druken (NCI, ANU), Yvette Poudjom-Djomani (Geoscience Australia), Stirling Butcher (Geoscience Australia), Ben Evans (NCI, ANU), and Lesley Wyborn (NCI, ANU)

Abstract

NetCDF4 has become the dominant generic format for many forms of geoscientific data, leveraging (and constraining) the versatile HDF5 container format, while providing metadata conventions for interoperability. However, the encapsulation of detailed metadata within each file can lead to metadata “bloat”, and difficulty in maintaining consistency where metadata is replicated to multiple locations. Complex conceptual relationships are also difficult to represent in simple key-value netCDF metadata. Linked Data provides a practical mechanism to address these issues by associating the netCDF files and their internal variables with complex metadata stored in Semantic Web vocabularies and ontologies, while complying with and complementing existing metadata conventions.

One of the stated objectives of the netCDF4/HDF5 formats is that they should be self-describing: containing metadata sufficient for cataloging and using the data. However, this objective can be regarded as only partially-met where details of conventions and definitions are maintained externally to the data files. For example, one of the most widely used netCDF community standards, the Climate and Forecasting (CF) Metadata Convention, maintains standard vocabularies for a broad range of disciplines across the geosciences, but this metadata is currently neither readily discoverable nor machine-readable.

We have previously implemented useful Linked Data and netCDF tooling (ncskos) that associates netCDF files, and individual variables within those files, with concepts in vocabularies formulated using the Simple Knowledge Organization System (SKOS) ontology. NetCDF files contain Uniform Resource Identifier (URI) links to terms represented as SKOS Concepts, rather than plain-text representations of those terms, so we can use simple, standardised web queries to collect and use rich metadata for the terms from any Linked Data-presented SKOS vocabulary.
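
To make the mechanism concrete, the sketch below shows the general pattern (not the ncskos API itself): a netCDF variable carries an attribute holding a SKOS Concept URI, and that URI is dereferenced with content negotiation to retrieve the concept's RDF description. The file, variable and attribute names and the vocabulary URI are illustrative assumptions.

```python
# General pattern only (not the ncskos API itself): a netCDF variable attribute
# holds a SKOS Concept URI, which is dereferenced as Linked Data. The file,
# variable and attribute names below are illustrative assumptions.
import urllib.request
from netCDF4 import Dataset

with Dataset("magnetics_grid.nc") as nc:               # hypothetical file
    var = nc.variables["mag_tmi_anomaly"]              # hypothetical variable
    concept_uri = getattr(var, "skos__concept_uri", None)  # hypothetical attribute name

if concept_uri:
    # Ask the vocabulary service for an RDF (Turtle) representation of the concept.
    request = urllib.request.Request(concept_uri, headers={"Accept": "text/turtle"})
    with urllib.request.urlopen(request) as response:
        print(response.read().decode("utf-8"))
```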

Geoscience Australia (GA) manages a large volume of diverse geoscientific data, much of which is being translated from proprietary formats to netCDF at NCI Australia. This data is made available through the NCI National Environmental Research Data Interoperability Platform (NERDIP) for programmatic access and interdisciplinary analysis. The netCDF files contain not only scientific data variables (e.g. gravity, magnetic or radiometric values) but also domain-specific operational values (e.g. specific instrument parameters) that are best described fully in formal vocabularies. Our ncskos codebase provides access to multiple stores of detailed external metadata in a standardised fashion.

Geophysical datasets are generated from a “survey” event, and GA maintains corporate databases of all surveys and their associated metadata. It is impractical to replicate the full source survey metadata into each netCDF dataset so, instead, we link the netCDF files to survey metadata using public Linked Data URIs. These URIs link to Survey class objects which we model as a subclass of Activity objects as defined by the PROV Ontology, and we provide URI resolution for them via a custom Linked Data API which draws current survey metadata from GA’s in-house databases.
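
A minimal sketch of that modelling with rdflib is shown below; the namespace, survey URI and property values are invented for illustration and do not reproduce the output of GA's actual Linked Data API.

```python
# Minimal sketch of modelling a geophysical Survey as a subclass of prov:Activity.
# The namespace, survey URI and literal values are invented for illustration and
# do not reproduce GA's actual Linked Data API output.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/def/survey#")       # hypothetical vocabulary

g = Graph()
g.bind("prov", PROV)
g.bind("ex", EX)

# Declare the Survey class as a specialisation of a PROV Activity.
g.add((EX.Survey, RDFS.subClassOf, PROV.Activity))

# Describe one (hypothetical) airborne magnetic survey.
survey = URIRef("http://example.org/survey/201600123")
g.add((survey, RDF.type, EX.Survey))
g.add((survey, RDFS.label, Literal("Airborne magnetic survey 201600123")))
g.add((survey, PROV.startedAtTime, Literal("2016-03-01", datatype=XSD.date)))
g.add((survey, PROV.endedAtTime, Literal("2016-04-12", datatype=XSD.date)))

print(g.serialize(format="turtle"))
```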

We have demonstrated that Linked Data is a practical way to associate netCDF data with detailed, external metadata. This allows us to ensure that catalogued metadata is kept consistent with metadata points-of-truth, and we can infer complex conceptual relationships not possible with netCDF key-value attributes alone.


PICO PRESENTATION

GSKY: A scalable distributed geospatial data server on the cloud

Authors: Pablo Rozas Larraondo, Sean Pringle, Joseph Antony and Ben Evans (NCI, ANU)

Abstract

Earth systems, environmental and geophysical datasets are extremely valuable sources of information about the state and evolution of the Earth. Being able to combine information coming from different geospatial collections is in increasing demand from the scientific community, and requires managing and manipulating data in different formats and performing operations such as map reprojections, resampling and other transformations. Due to the large data volumes inherent in these collections, storing multiple copies of them is unfeasible, and so such data manipulation must be performed on-the-fly using efficient, high-performance techniques. Ideally this should be done using a trusted data service and common system libraries to ensure wide use and reproducibility. Recent developments in distributed computing based on dynamic access to significant cloud infrastructure open the door to such new ways of processing geospatial data on demand.
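
To make the kind of on-the-fly operation concrete, the sketch below reprojects and resamples a raster with GDAL's Warp API, a common system library; the file names, target CRS and resolution are assumptions, and this is not GSKY's internal implementation.

```python
# Illustrative sketch of the kind of on-the-fly transformation described above,
# using GDAL's Warp API. File names, target CRS and resolution are assumptions;
# this is not GSKY's internal implementation.
from osgeo import gdal

gdal.UseExceptions()

# Reproject and resample a source raster to WGS84 at ~0.001 degree resolution,
# writing the result to an in-memory dataset rather than a file on disk.
warped = gdal.Warp(
    "/vsimem/reprojected.tif",       # in-memory output via GDAL's virtual filesystem
    "input_scene.tif",               # hypothetical source raster
    dstSRS="EPSG:4326",
    xRes=0.001,
    yRes=0.001,
    resampleAlg="bilinear",
)

print(warped.RasterXSize, warped.RasterYSize, warped.GetProjection()[:60])
```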

The National Computational Infrastructure (NCI), hosted at the Australian National University (ANU), has over 10 Petabytes of nationally significant research data collections. Some of these collections, which comprise a variety of observed and modelled geospatial data, are now made available via a highly distributed geospatial data server, called GSKY (pronounced [jee-skee]).

GSKY supports on-demand processing of large geospatial data products such as satellite earth observation data as well as numerical weather products, allowing interactive exploration and analysis of the data. It dynamically and efficiently distributes the required computations among cloud nodes, providing a scalable analysis framework that can adapt to serve a large number of concurrent users.

Typical geospatial workflows handling different file formats and data types, or blending data in different coordinate projections and spatio-temporal resolutions, are handled transparently by GSKY. This is achieved by decoupling the data ingestion and indexing process into an independent service: an indexing service crawls data collections, either locally or remotely, extracting, storing and indexing all spatio-temporal metadata associated with each individual record.

GSKY provides the user with the ability to specify how ingested data should be aggregated, transformed and presented. It exposes an OGC standards-compliant interface, allowing ready access to the data via Web Map Services (WMS), Web Processing Services (WPS), or raw data arrays using Web Coverage Services (WCS). The presentation will show some cases where we have used this new capability to provide a significant improvement over previous approaches.
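
As a hedged sketch of client-side use, the snippet below builds a standard WMS 1.3.0 GetMap request against a hypothetical GSKY-style endpoint; the service URL, layer name, bounding box and time value are illustrative assumptions rather than the operational service configuration.

```python
# Sketch of requesting a map image from an OGC WMS endpoint such as the one GSKY
# exposes. The service URL, layer name, bounding box and time are illustrative
# assumptions.
import urllib.parse
import urllib.request

WMS_ENDPOINT = "https://example.nci.org.au/ows"   # hypothetical GSKY-style endpoint

params = {
    "service": "WMS",
    "version": "1.3.0",
    "request": "GetMap",
    "layers": "landsat8_nbar_true_colour",        # hypothetical layer name
    "styles": "",
    "crs": "EPSG:4326",
    "bbox": "-44.0,112.0,-10.0,154.0",            # lat/lon order for EPSG:4326 in WMS 1.3.0
    "width": "1024",
    "height": "768",
    "format": "image/png",
    "time": "2016-01-01T00:00:00.000Z",
}

with urllib.request.urlopen(WMS_ENDPOINT + "?" + urllib.parse.urlencode(params)) as resp:
    with open("map.png", "wb") as out:
        out.write(resp.read())
```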


POSTER PRESENTATION

An open, interoperable, transdisciplinary approach to a point cloud data service using OGC standards and open source software. (Poster)

Authors: Adam Steer, Claire Trenham, Kelsey Druken, Benjamin Evans, and Lesley Wyborn (NCI, ANU)

Abstract

High resolution point clouds and other topology-free point data sources are widely utilised for research, management and planning activities. A key goal for research and management users is making these data and common derivatives available in a way which is seamlessly interoperable with other observed and modelled data.

The Australian National Computational Infrastructure (NCI) stores point data from a range of disciplines, including terrestrial and airborne LiDAR surveys, 3D photogrammetry, airborne and ground-based geophysical observations, bathymetric observations and 4D marine tracers. These data are stored alongside a significant store of Earth systems data including climate and weather, ecology, hydrology, geoscience and satellite observations, and available from NCI’s National Environmental Research Data Interoperability Platform (NERDIP) [1].

Because of the NERDIP requirement for interoperability with gridded datasets, the data models required to store these data may not conform to the LAS/LAZ format – the widely accepted community standard for point data storage and transfer. The goal for NCI is making point data discoverable, accessible and useable in ways which allow seamless integration with earth observation datasets and model outputs – in turn assisting researchers and decision-makers in the often-convoluted process of handling and analyzing massive point datasets.

With a use case of providing a web data service and supporting a derived product workflow, NCI has implemented and tested a web-based point cloud service using the Open Geospatial Consortium (OGC) Web Processing Service [2] as a transaction handler between a web-based client and server-side computing tools based on a native Linux operating system. Using this model, the underlying toolset for driving a data service is flexible and can take advantage of NCI’s highly scalable research cloud. Present work focusses on the Point Data Abstraction Library (PDAL) [3] as a logical choice for efficiently handling LAS/LAZ-based point workflows, and on native HDF5 libraries for handling point data kept in HDF5-based structures (e.g. NetCDF4, SPDlib [4]). Points stored in database tables (e.g. postgres-pointcloud [5]) will be considered as testing continues.
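
A hedged sketch of the kind of server-side PDAL step that a WPS process might wrap is shown below: cropping a LAZ point cloud to a bounding box and writing the compressed result. The input file, bounds and output path are hypothetical, and this is not the NCI service's actual pipeline.

```python
# Sketch of a server-side PDAL step of the kind a WPS process might wrap:
# crop a LAZ point cloud to a bounding box and write the result back out.
# The input file, bounds and output path are hypothetical.
import json
import pdal

pipeline_definition = {
    "pipeline": [
        "survey_tile.laz",                                      # hypothetical input
        {
            "type": "filters.crop",
            "bounds": "([693000, 694000], [6090000, 6091000])",  # easting/northing window
        },
        {
            "type": "writers.las",
            "filename": "survey_tile_subset.laz",
            "compression": "laszip",
        },
    ]
}

pipeline = pdal.Pipeline(json.dumps(pipeline_definition))
point_count = pipeline.execute()          # runs the pipeline, returns points processed
print(f"{point_count} points written to survey_tile_subset.laz")
```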

Visualising and exploring massive point datasets in a web browser alongside multiple datasets has been demonstrated by the entwine-3D tiles project [6]. This is a powerful interface which enables users to investigate and select appropriate data, and is also being investigated as a potential front-end to a WPS-based point data service.

In this work we show preliminary results for a WPS-based point data access system, in preparation for demonstration at FOSS4G 2017, Boston.

[1] http://nci.org.au/data-collections/nerdip/

[2] http://www.opengeospatial.org/standards/wps

[3] http://www.pdal.io

[4] http://www.spdlib.org/doku.php

[5] https://github.com/pgpointcloud/pointcloud

[6] http://cesium.entwine.io
