National Computational Infrastructure


American Geophysical Union Fall Meeting 2017

NCI staff will be presenting multiple talks and posters at the AGU 2017 Fall Meeting, 11-15 December 2017.

PDF versions of all presentations and posters will be provided upon completion of the Meeting.


Oral Presentations

Large-scale, high-performance and cloud-enabled multi-model analytics experiments in the context of the Earth System Grid Federation

Authors: Sandro Fiore, Marcin Plociennik, Charles Doutriaux, Ignacio Blanquer, Roberto Barbera, Dean Williams, Valentine Anantharaj, Ben Evans, Davide Salomoni, Giovanni Aloisio

Abstract

The increased resolution of models in the development of comprehensive Earth System Models is rapidly leading to very large climate simulation outputs that pose significant scientific data management challenges in terms of data sharing, processing, analysis, visualization, preservation, curation, and archiving.

Large-scale global experiments for the Coupled Model Intercomparison Project (CMIP) have led to the development of the Earth System Grid Federation (ESGF), a federated data infrastructure which has been serving the CMIP5 experiment, providing access to 2PB of data for the IPCC Assessment Reports.

In such a context, running a multi-model data analysis experiment is very challenging, as it requires the availability of a large amount of data related to multiple climate model simulations, together with scientific data management tools for large-scale data analytics. To address these challenges, a case study on climate model intercomparison data analysis has been defined and implemented in the context of the EU H2020 INDIGO-DataCloud project.

The case study has been tested and validated on CMIP5 datasets, in the context of a large-scale, international testbed involving several ESGF sites (LLNL, ORNL and CMCC), one orchestrator site (PSNC) and another hosting INDIGO PaaS services (UPV). Additional ESGF sites, such as NCI (Australia) and a couple more in Europe, are also joining the testbed.

The added value of the proposed solution is summarized in the following: it implements a server-side paradigm which limits data movement; it relies on a High-Performance Data Analytics (HPDA) stack to address performance; it exploits the INDIGO PaaS layer to support flexible, dynamic and automated deployment of software components; it provides user-friendly web access based on the INDIGO Future Gateway; and finally it integrates, complements and extends the support currently available through ESGF. Overall it provides a new “tool” for climate scientists to run multi-model experiments.

At the time this contribution is being written, the proposed testbed represents the first implementation of a distributed large-scale, multi-model experiment in the ESGF/CMIP context, joining together server-side approaches for scientific data analysis, HPDA frameworks, end-to-end workflow management, and cloud computing.


Geospatial Data as a Service: Towards planetary scale real-time analytics

Authors: Pablo Rozas Larraondo, Ben Evans, Joseph Antony

Abstract

The rapid growth of earth systems, environmental and geophysical datasets poses a challenge to both end-users and infrastructure providers. For infrastructure and data providers, tasks like managing, indexing and storing large collections of geospatial data need to take into consideration the various use cases by which consumers will want to access and use the data.

Considerable investment has been made by the Earth Science community to produce suitable real-time analytics platforms for geospatial data, and a number of different interfaces have been defined to provide data services. Unfortunately, there is considerable variation among the standards, protocols and data models, which have typically been designed to target specific communities or working groups.

The Australian National University’s National Computational Infrastructure (NCI) is used for a wide range of activities in the geospatial community. Earth observation, climate and weather forecasting are examples of communities that generate large amounts of geospatial data. NCI has invested significant effort in developing a data and services model that enables the cross-disciplinary use of data.

Recent developments in cloud and distributed computing provide a publicly accessible platform on which new infrastructures can be built. One of the key capabilities these technologies offer is the possibility of having “limitless” compute power next to where the data is stored. This model is rapidly transforming data delivery from centralised, monolithic services towards ubiquitous distributed services that scale up and down in response to fluctuations in demand.

NCI has developed GSKY, a scalable, distributed server that presents a new approach to geospatial data discovery and delivery based on OGC standards. We will present the architecture and the motivating use cases that drove GSKY’s collaborative design, development and production deployment, and show that our approach offers the community valuable exploratory analysis capabilities for dealing with petabyte-scale geospatial data collections.
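
To give a concrete sense of how a client interacts with such a server, the sketch below issues an OGC WMS 1.3.0 GetMap request from Python. The endpoint URL, layer name and time value are hypothetical placeholders rather than the actual GSKY service; only the WMS request parameters themselves follow the standard.

    # Minimal sketch of a WMS 1.3.0 GetMap request to a GSKY-style server.
    # The endpoint URL, layer name and time value are hypothetical placeholders.
    import requests

    params = {
        "service": "WMS",
        "version": "1.3.0",
        "request": "GetMap",
        "layers": "example_fractional_cover",   # hypothetical layer name
        "crs": "EPSG:4326",
        "bbox": "-44,112,-10,154",               # lat/lon axis order for EPSG:4326 in WMS 1.3.0
        "width": "512",
        "height": "512",
        "format": "image/png",
        "time": "2017-06-01T00:00:00.000Z",      # illustrative time slice
    }

    response = requests.get("https://gsky.example.org/ows", params=params, timeout=60)
    response.raise_for_status()

    # Save the rendered map tile returned by the server.
    with open("getmap.png", "wb") as f:
        f.write(response.content)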


Tracking Research Data Footprints via Integration with Research Graph (Invited)

Authors: Jingbo Wang, Amir Aryani, Michael Condon, Ben Evans, Lesley Wyborn, Sayeed Choudhury

Abstract

The researcher of today is likely to be part of a team that uses subsets of data from one or more external repositories, and that same data could be used by multiple researchers for many different purposes. At best, the repositories that host this data will know who is accessing it, but rarely what it is being used for. As a result, the funders of data-collecting programs and the repositories that store the data are unlikely to know: 1) which research funding contributed to the collection and preservation of a dataset, and 2) which data contributed to high-impact research and publications. In times of funding shortages there is a growing need to be able to trace the footprint of a data set from the originator that collected it, to the repository that stores it, and ultimately to any derived publications.

The Research Data Alliance’s Data Description Registry Interoperability Working Group (DDRIWG) has addressed this problem through the development of a distributed graph, called Research Graph, that can map each piece of the research interaction puzzle by building aggregated graphs. It can connect datasets on the basis of co-authorship or other collaboration models, such as joint funding and grants, and can connect research datasets, publications, grants and researcher profiles across research repositories and infrastructures such as DataCite and ORCID.

The National Computational Infrastructure (NCI) in Australia is one of the early adopters of Research Graph. The graphical view and quantitative analysis help NCI track the usage of its national reference data collections, thus quantifying the role that these NCI-hosted data assets play within the funding-researcher-data-publication cycle. The graph can unlock the complex interactions of research projects by tracking the contribution of datasets, the various funding bodies and the downstream data users. The RMap Project is a similar initiative that aims to capture the complex relationships among scholarly publications and their underlying data, including IEEE publications. It is hoped that RMap and Research Graph can be combined in the near future, and that physical samples can also be added to Research Graph.


Building a Generic Virtual Research Environment Framework for Multiple Earth and Space Science Domains and a Diversity of Users.

Authors: Lesley Wyborn, Ryan Fraser, Ben Evans, Carsten Friedrich, Jens Klump, David Lescinsky

Abstract

Virtual Research Environments (VREs) are now part of academic infrastructure. Online research workflows can be orchestrated whereby data is accessed from multiple external repositories, with processing taking place on public or private clouds and centralised supercomputers using a mixture of user codes and well-used community software and libraries. VREs enable distributed members of research teams to actively work together to share data, models, tools, software, workflows, best practices, infrastructures, etc. These environments and their components are increasingly able to support the needs of undergraduate teaching. External to the research sector, they can also be reused by citizen scientists and repurposed for industry users to help accelerate the diffusion, and hence enable the translation, of research innovations.

The Virtual Geophysics Laboratory (VGL) in Australia was started in 2012, built through a collaboration between CSIRO, the National Computational Infrastructure (NCI) and Geoscience Australia, with support funding from the Australian Government Department of Education. VGL comprises three main modules that provide an interface enabling users to first select their required data, then choose a tool to process that data, and finally access compute infrastructure for execution. VGL was initially built to give a specific set of researchers in government agencies access to specific data sets and a limited number of tools. Over the years it has evolved into a multi-purpose Earth science platform with access to an increased variety of data (e.g., natural hazards, geochemistry), a broader range of software packages, and an increasing diversity of compute infrastructures. This expansion has been possible because data, tools and compute resources are loosely coupled via interfaces that are built on international standards and accessed as network-enabled services wherever possible. Although VGL was originally built for researchers who were not fussy about general usability, an increasing emphasis on User Interfaces (UIs) and stability will lead to increased uptake in the education and industry sectors. Simultaneously, improvements are being added to facilitate access to data and tools by experienced researchers who want direct access to both data and flexible workflows.


Smoothing Data Friction through building Service Oriented Data Platforms

Authors: Lesley Wyborn, Clare Richards, Ben Evans, Jingbo Wang, Kelsey Druken

Abstract

Data Friction has been commonly defined as the costs in time, energy and attention required to simply collect, check, store, move, receive, and access data. On average, researchers spend a significant fraction of their time finding the data for their research project and then reformatting it so that it can be used by the software application of their choice. There is an increasing role for both data repositories and software to be modernised in ways that reduce data friction and support better use of the data.

Many generic data repositories simply accept data in the format supplied: the key check is that the data have sufficient metadata to enable discovery and download. Few generic repositories have both the expertise and the infrastructure to support the multiple domain-specific requirements that arise from the increasing need for integration and reusability.

In contrast, major science domain-focused repositories are increasingly able to implement and enforce community-endorsed best practices and guidelines that ensure reusability and harmonization of data for use within the community, by offering semi-automated QC workflows to improve the quality of submitted data.

The most advanced of these science repositories now operate as service-oriented data platforms that extend the use of data across domain silos and increasingly provide server-side, programmatically enabled access to data via network protocols and community-standard APIs. To provide this, more rigorous QA/QC procedures are needed to validate data against standards and against community software and tools. This ensures that the data can be accessed in expected ways and demonstrates that the data works across the different (non-domain-specific) packages, tools and programming languages deployed by the various user communities.
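
As an illustration of what such programmatic, server-side access can look like in practice, the sketch below opens a remote dataset over OPeNDAP (one example of a community-standard protocol) with xarray and subsets it before any bulk data is transferred. The URL and variable name are hypothetical placeholders, not actual NCI endpoints.

    # Illustrative sketch of programmatic access over OPeNDAP; the URL and
    # variable name below are hypothetical placeholders.
    import xarray as xr

    url = "https://dap.example.org/thredds/dodsC/demo_collection/tasmax_daily.nc"

    # Opening the remote dataset reads only metadata; values are fetched lazily.
    ds = xr.open_dataset(url)

    # Subset by coordinate (assuming ascending latitude) so only the requested
    # region is transferred over the network, not the whole file.
    subset = ds["tasmax"].sel(lat=slice(-44, -10), lon=slice(112, 154))
    print(float(subset.mean()))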

In Australia, the National Computational Infrastructure (NCI) has created such a service-oriented data platform, which is demonstrating how this approach can reduce data friction, servicing individual domains as well as facilitating cross-domain collaboration. The approach has required additional effort and expertise from the repository, but it enables a more capable and efficient system that ultimately saves time for the individual researcher.


Challenges in Managing Trustworthy Large-scale Digital Science (Invited)

Authors: Ben Evans

Abstract

The increased use of large-scale international digital science has opened a number of challenges for managing, handling, using and preserving scientific information. The large volumes of information are driven by three main categories: model outputs, including coupled models and ensembles; data products that have been processed to a level of usability; and increasingly heuristically driven data analysis. These data products are increasingly the ones that are usable by the broad communities, and far exceed the raw instrument data outputs in volume. The data, software and workflows are then shared and replicated to allow broad use at an international scale, which places further demands on infrastructure to manage the information reliably across distributed resources. Users necessarily rely on these underlying “black boxes” so that they can remain productive in producing new scientific outcomes.

The software for these systems depends on computational infrastructure, interconnected software systems, and information capture systems. This ranges from the fundamentals of the reliability of the compute hardware, through system software stacks and libraries, to the model software itself.

Due to these complexities and the capacity of the infrastructure, there is an increased emphasis on transparency of the approach and robustness of the methods over full reproducibility. Furthermore, with large-volume data management, it is increasingly difficult to store the historical versions of all model and derived data. Instead, the emphasis is on the ability to access the updated products, and on confidence that previous outcomes remain relevant and can be updated with the new information.

We will discuss these challenges and some of the approaches underway that are being used to address these issues.


Improving Seismic Data Accessibility and Performance Using HDF Containers

Authors: Jingbo Wang, Rui Yang, Ben Evans

Abstract

The performance of computational geophysical data processing and forward modelling relies on both computational and data infrastructure. Significant efforts on developing new data formats and libraries have been made by the community, such as IRIS/PASSCAL and ASDF for data, and programs and utilities such as ObsPy and SPECFEM.

The National Computational Infrastructure hosts a nationally significant geophysical data collection that is co-located with a high-performance computing facility, providing an opportunity to investigate how to improve the data formats from both a data management and a performance point of view. This paper investigates how to enhance data usability from several perspectives: 1) propose a convention for the seismic (both active and passive) community to improve data accessibility and interoperability; 2) recommend the convention to be used in the HDF container when data is made available in PH5 or ASDF formats; 3) provide tools to convert between various seismic data formats; 4) provide performance benchmark cases using the ObsPy library and SPECFEM3D to demonstrate how different data organization, in terms of chunking size and compression, impacts performance, comparing new data formats such as PH5 and ASDF with traditional formats such as SEGY, SEED and SAC.

In this work we apply our knowledge and experience of data standards and conventions from the climate community, such as CF and ACDD, to the seismology community. The generic global attributes widely used in the climate community are combined with the existing conventions in the seismology community, such as CMT, QuakeML, StationXML and the SEGY header convention. We also extend the convention by including provenance and benchmarking records so that the user can learn the footprint of the data together with its baseline performance. In practice, we convert example wide-angle reflection seismic data from SEGY to PH5 or ASDF using the ObsPy and pyasdf libraries, and quantitatively demonstrate how accessibility can be improved when the seismic data are stored in the HDF container.
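
A minimal sketch of the SEGY-to-ASDF conversion step described above is shown below, using the ObsPy and pyasdf libraries. The file names, waveform tag and compression setting are illustrative choices rather than the exact parameters used in our workflow.

    # Minimal sketch: convert SEG-Y traces into an ASDF (HDF5) container using
    # ObsPy and pyasdf. File names and the waveform tag are illustrative only.
    import obspy
    import pyasdf

    # Read the SEG-Y traces into an ObsPy Stream object.
    stream = obspy.read("line_01.segy", format="SEGY")

    # Write the traces into an ASDF container with gzip compression.
    ds = pyasdf.ASDFDataSet("line_01.h5", compression="gzip-3")
    ds.add_waveforms(stream, tag="raw_recording")
    del ds  # flush and close the underlying HDF5 file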


Using the Principles of F.A.I.R Data to Improve the Measure of Value of Big Data and Big Data Repositories

Authors: Clare Richards, Lesley Wyborn, Ben Evans, Jingbo Wang, Kelsey Druken, Jon Smillie, Sean Pringle

Abstract

In a data-intensive world, finding the right data can be time-consuming and, when found, may involve compromises on quality and often considerable extra effort to wrangle it into shape. This is particularly true as users are exploring new and innovative ways of working with data from different sources and scientific domains.

It is recognised that the effort and specialist knowledge required to transform datasets to meet these requirements go beyond the reasonable remit of a single research project or research community. Instead, government investments in national collaborations, like the Australian National University’s National Computational Infrastructure (NCI), provide a sustainable way to bring together and transform disparate data collections from a range of disciplines in ways which enable new and innovative analysis and use.

With these goals in mind, the NCI established a Data Quality Strategy (DQS) for managing 10PB of reference data collections with a particular focus on improving data use and reuse across scientific domains, making the data suitable for use in a high-end computational and data-intensive environment, and supporting programmatic access for a range of applications.

Evaluating how effectively we are achieving these goals and maintaining ongoing funding requires demonstration of the value and impact of these data collections. Standard approaches to measuring data value involve basic measures of ‘data usage’ or attempt to track data to ‘research outcomes’. While useful, these measures fail to capture the value of the curation and quality assurance involved in making the data available.

To fill this gap, NCI has developed a 3-tiered approach to measuring the return on investment which broadens the concept of value to include improvements in access to and use of the data. Key to this approach was integrating the guiding principles of the Force 11 community’s F.A.I.R data into the DQS, because it provides a community-driven, standards-based framework which can be used for metrics. The NCI metrics provide useful information for data users, data custodians and data repositories and, most importantly, can be used to demonstrate the return on investment in both quantitative and qualitative terms.

Poster Presentations

Tracking Research Data Footprints via Integration with Research Graph (Invited)

Authors: Jingbo Wang, Amir Aryani, Michael Condon, Ben Evans, Lesley Wyborn, Sayeed Choudhury

Abstract

The researcher of today is likely to be part of a team that uses subsets of data from one or more external repositories, and that same data could be used by multiple researchers for many different purposes. At best, the repositories that host this data will know who is accessing it, but rarely what it is being used for. As a result, the funders of data-collecting programs and the repositories that store the data are unlikely to know: 1) which research funding contributed to the collection and preservation of a dataset, and 2) which data contributed to high-impact research and publications. In times of funding shortages there is a growing need to be able to trace the footprint of a data set from the originator that collected it, to the repository that stores it, and ultimately to any derived publications.

The Research Data Alliance’s Data Description Registry Interoperability Working Group (DDRIWG) has addressed this problem through the development of a distributed graph, called Research Graph, that can map each piece of the research interaction puzzle by building aggregated graphs. It can connect datasets on the basis of co-authorship or other collaboration models, such as joint funding and grants, and can connect research datasets, publications, grants and researcher profiles across research repositories and infrastructures such as DataCite and ORCID.

The National Computational Infrastructure (NCI) in Australia is one of the early adopters of Research Graph. The graphical view and quantitative analysis help NCI track the usage of its national reference data collections, thus quantifying the role that these NCI-hosted data assets play within the funding-researcher-data-publication cycle. The graph can unlock the complex interactions of research projects by tracking the contribution of datasets, the various funding bodies and the downstream data users. The RMap Project is a similar initiative that aims to capture the complex relationships among scholarly publications and their underlying data, including IEEE publications. It is hoped that RMap and Research Graph can be combined in the near future, and that physical samples can also be added to Research Graph.


Geospatial Data as a Service: The GEOGLAM Rangelands and Pasture Productivity Map Experience

Authors: Joseph Antony, Juan P Guerschman, Pablo Rozas Larraondo, Ben Evans, Clare Richards

Abstract

Empowering end-users such as pastoralists, land management specialists and land policy makers to use earth observation data for both day-to-day and seasonal planning requires the interactive delivery of multiple geospatial datasets and the capability to support on-the-fly dynamic queries, while simultaneously fostering a community around the effort.

The use and wide adoption of large data archives, like those produced by earth observation missions, are often limited by the compute and storage capabilities available to the remote user. We demonstrate that wide-scale use of large data archives can be facilitated by end-users dynamically requesting value-added products using open standards (WCS, WMS, WPS), with compute running in the cloud or in dedicated data centres and outputs visualised on web front ends.

As an example, we will demonstrate how a tool called GSKY can empower a remote end-user by providing the data delivery and analytics capabilities for the GEOGLAM Rangelands and Pasture Productivity (RAPP) Map tool.

The GEOGLAM RAPP initiative from the Group on Earth Observations (GEO) and its Agricultural Monitoring subgroup aims at providing practical tools to end-users, focusing on the important role of rangelands and pasture systems in providing food production security from both agricultural crops and animal protein. Figure 1 is a screen capture from the RAPP Map interface for an important pasture area in the Namibian rangelands. The RAPP Map has been in production for six months and has garnered significant interest from groups and users all over the world.

GSKY, formulated around the theme of Open Geospatial Data-as-a-Service, uses distributed computing and storage to facilitate this. It works behind the scenes, accepting OGC standard requests in WCS, WMS and WPS, and results from these requests are rendered on a web front end. In this way, the complexities of data locality and compute execution are masked from the end user.

On-the-fly computation of products such as NDVI, Leaf Area Index, vegetation cover and others from original source data, including MODIS, is achieved, with Landsat and Sentinel-2 on the horizon.
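
As a simple illustration of the kind of band arithmetic computed on the fly, the sketch below derives NDVI from red and near-infrared reflectance; the small arrays are synthetic stand-ins for real MODIS tiles.

    # Toy sketch of on-the-fly NDVI computation from red and near-infrared
    # reflectance; the arrays below are synthetic stand-ins for MODIS tiles.
    import numpy as np

    red = np.array([[0.05, 0.08], [0.10, 0.12]])   # red band reflectance
    nir = np.array([[0.40, 0.35], [0.30, 0.20]])   # near-infrared band reflectance

    # NDVI = (NIR - Red) / (NIR + Red), bounded in [-1, 1]
    ndvi = (nir - red) / (nir + red)
    print(ndvi)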

Innovative use of cloud computing and storage, along with flexible front ends, allows the democratization of data dissemination and, we hope, better outcomes for the planet.

Panels

Panel Discussion on state of the art and perspectives in network theory and linked data in the earth sciences (Invited)

Authors: Peter Fox, Kerstin Lehnert, Mark Parsons, Jingbo Wang, Rahul Ramachandran
