National Computational Infrastructure


American Geophysical Union Fall Meeting 2018

NCI staff will be presenting multiple talks and posters at the AGU 2018 Fall Meeting, 10-14 December 2018.

PDF versions of all presentations and posters will be provided upon completion of the Meeting.


Oral Presentations

Evolving Data-driven science: the unprecedented coherence of Big Data, HPC, and informatics, and crossing the next chasms – Invited talk: Greg Leptoukh Lecture

Author: Benjamin Evans (NCI)

Abstract

As we approach the AGU Centenary, we celebrate the successes of data-driven science whilst looking anxiously at the future, with consideration of the hardware, software, workflows and interconnectedness that need further attention.

The colocation of scientific datasets with HPC/cloud compute has demonstrably supercharged our research productivity. Over time we have questioned whether to “bring data to the compute” or “compute to the data”, and considered and reconsidered the benefits, weaknesses and challenges, both technical and social. The gap between how large-volume data and long-tail data are managed is steadily closing, and the standards for interoperability and connectivity between scientific fields have been slowly maturing. In many cases transdisciplinary science is now a reality.

However, computing technology is no longer advancing according to Moore’s law (and its equivalents) and is evolving in unexpected ways. For some major computational software codes, these technology changes are forcing us to reconsider the development strategy: how to transition existing code to address the need for scientific improvements in capability while, at the same time, improving the ability to adjust to changes in the underlying technical infrastructure. In doing so, some old assumptions about data precision and reproducibility are being reconsidered. Quantum computing is now on the horizon, which will mean further consideration of software and data access mechanisms.

Currently, for data management, despite the apparent value and opportunity, the demand for high-quality datasets that can be used for new data-driven methods is testing the funding/business case and overall value proposition for celebrated open data and its FAIRness. Powerful new technologies such as AI and deep learning have a voracious appetite for big data and much stronger (and underappreciated) requirements around data quality, information management, connectivity and persistence. These new technologies are evolving at the same time as ubiquitous IoT, fog computing and blockchain pipelines have emerged, creating even more complexity and potential hypercoherence issues.

In this talk I will discuss the journey so far in data-intensive computational science, and consider the chasms we have yet to cross.


The important role of HPC and data-intensive infrastructure facilities in supporting a diversity of Virtual Research Environments (VREs): working with Climate – Invited Talk

Authors: Clare Richards, Benjamin Evans, Kate Snow, Chris Allen, Jingbo Wang, Kelsey A Druken, Sean Pringle, Jon Smillie and Matt Nethery (NCI)

Abstract

As an integrated Tier 1 HPC and petascale Data Repository facility, the Australian National Computational Infrastructure (NCI) plays an essential role in developing and supporting VREs for a range of research communities and international federations across Climate, Weather, Environment and Geoscience. With needs and skills varying across research communities, our role goes beyond provision of infrastructure to include development and maintenance of software and expertise which assist users to fully utilise the HPC resources available. To support this diversity, we have championed a transdisciplinary approach with the aim of meeting the needs for a range of communities while also achieving the discipline-specific goals.

This capability has been developed through a series of collaborative projects which involved engaging with local users and research communities as well as integrating with international federations to adopt best practice solutions based on standards that can be applied across multiple domains.

This talk will focus on how we have addressed the needs of the Climate community, and how this has influenced the development of NCI’s infrastructure to support multiple VREs. This community has a diverse skills base and undertakes a range of activities, including model and code development as well as data analysis and visualisation. The infrastructure to underpin this diversity requires highly integrated research platforms suitable for demanding HPC and data-intensive analysis, as well as the software and services to search the petascale, internationally distributed data collections. The scale of these requirements needs a well-coordinated national and international approach, including the use of standards and conventions in data management and services, cataloguing, file metadata, directory structures and publishing processes.

Uptake of this infrastructure by our diverse user community, and the use of Climate data formats and metadata conventions for the Geophysics and Earth Observations collections, demonstrates that common infrastructure can be designed to meet the needs of multiple domains. Not only does this move us towards transdisciplinary research, it also helps to address the long-term sustainability of VREs for individual communities.


Towards Networks of Trusted Virtual Domain Repositories that are connected to Networks of Persistent Physical Repositories

Authors: Lesley A Wyborn, Benjamin Evans (NCI), Kerstin Lehnert (Columbia University), Andrew Treloar, Adrian Burton (ANDS), Tim Rawling (AuScope) and Shelley Stall (AGU)

Abstract

There are currently three main categories of repositories for research data: institution-focused, domain-specific, and national petascale facilities. Each has challenges in fulfilling the diverse requirements of the research community, and in meeting the new and demanding requirements of FAIR as well as trusted-repository certification.

Single-institution repositories are persistent entities capable of justifying the resources required to meet minimum obligations: sustaining storage infrastructure and providing a basic catalogue to make data findable. However, many struggle to make the data useful and interoperable, or to offer the specialised tools for extraction, visualisation and fusion that make data reusable.

Tightly managed domain repositories for specific types of data are better positioned to implement community-endorsed best practices that ensure interoperability and reusability of data, and make it easier to aggregate data into (inter)national collections. However, many do not have the resources to obtain certification as trusted repositories, or fail because funders see their role only as establishing them, not maintaining them in the long term.

National petascale data facilities have been emerging whose mission is to curate large-volume, specialised FAIR data collections such as climate and Earth observation; many are colocated with HPC resources. However, these facilities are generally reluctant to take on complex, low-volume, highly variable long-tail collections.

All three have issues in implementing FAIR, making it confusing for the researcher.

A potential solution is for institutional and petascale facilities to continue to provide existing infrastructure (in cloud or physical repositories), while domain repositories move towards ‘virtualising’ or federating the services they currently provide. More work can be done to develop national registries (e.g., Research Data Australia) that assist with enhancing metadata aggregated from individual institutions. National, if not international, domain-focused data validation/QA/QC services will improve the reusability and interoperability of data held in institutional repositories. Petascale facilities can then focus on aggregating domain-validated data from institutions to create National Reference Collections suitable for HPC, and so facilitate global transdisciplinary science.


Addressing the massive CMIP6 data science challenge through the ESGF global federation

Authors: Benjamin Evans (NCI), Michael Lautenschlager (DKRZ), Luca Cinquini (JPL), Sébastien Denvil (IPSL), Sasha Ames (LLNL), Robert Ferraro (JPL, CalTech), V. Balaji (Princeton), Philip Kershaw (NCEO, STFC), Tom Landry (CRIM) and Dean Norman Williams (LLNL)

Abstract

The World Climate Research Programme’s (WCRP) Coupled Model Intercomparison Project (CMIP) phase 6 experiment is currently underway and will be the most demanding globally distributed data project for the climate community so far. More climate HPC centres will run more versions of more models, of increasing complexity and at higher resolutions, with the data made available for rigorous analysis so as to release its enormous value to researchers and to society.

To manage this multi-petabyte distributed data archive, CMIP6 relies on the Earth System Grid Federation (ESGF), an international collaboration of major climate centres that provides computational and data analysis capabilities to their national researchers, along with reliable data replication and sharing across international networks. The ESGF provides many hundreds of climate researchers across the planet with the ability to access and analyse the output model data, observational data and reanalysis data, and then compare across the different climate scenarios. For example, in the last CMIP experiment (CMIP5), approximately 45% of climate research papers published in 2016 in the Journal of Climate explicitly cited CMIP data managed by the ESGF. The ESGF and CMIP have agreed to conform to common standards, APIs and processes, and to use open software and data formats, so as to ensure that the data are well managed as a distributed repository, openly accessible, and usable across the diverse set of analysis requirements for the climate community.
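For readers unfamiliar with how the federation is typically accessed, the sketch below queries the ESGF search RESTful API from Python. The node URL, facet names and values are illustrative assumptions, not a prescribed CMIP6 workflow.

```python
# Minimal sketch: querying the ESGF search RESTful API for CMIP6 output.
# The node URL and the facet values below are illustrative assumptions.
import requests

ESGF_SEARCH = "https://esgf-node.llnl.gov/esg-search/search"  # assumed node

params = {
    "project": "CMIP6",            # assumed facet values, for illustration only
    "variable": "tas",
    "experiment_id": "historical",
    "format": "application/solr+json",
    "limit": 5,
}

response = requests.get(ESGF_SEARCH, params=params, timeout=30)
response.raise_for_status()
docs = response.json()["response"]["docs"]

for doc in docs:
    # Each result document carries dataset-level metadata, such as the dataset
    # id and the data node hosting a replica.
    print(doc.get("id"), doc.get("data_node"))
```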


Deeper Search Capabilities to Permit Interdisciplinary Data Access and Research

Authors: Kate Snow, Benjamin Evans, Sean Pringle, Jon Smillie, Paola Petrelli and Scott Wales

Abstract

Earth science data represents a broad array of complex multi-petabyte datasets that constitute a significant portion of the data hosted at the National Computational Infrastructure (NCI), Australia. Maintaining such a large repository and permitting advanced discoverability of the data for interdisciplinary research presents a significant technical challenge. To aid researchers in accessing the interdisciplinary datasets spanning the climate spectrum, we have developed the Metadata Attribute Search (MAS) service. In contrast to a catalogue metadata service, MAS permits a deeper search capability by harvesting the metadata of the self-describing files and placing it in the high-performance MAS database. The MAS database may then be accessed by APIs written by the community to permit deep search capabilities, and thus ease of access to, and efficient research of, the Earth science data.

One such example, the ARCCSSive API, developed and maintained by the ARC Centre of Excellence for Climate System Science (ARCCSS) and the ARC Centre of Excellence for Climate Extremes (CLEX), is a Python API built to take advantage of the information provided by MAS, permitting easy search and access of the Earth science datasets. For example, MAS allows users to query over multiple interdisciplinary data collections to determine which files satisfy particular criteria and to point them to paths on the filesystem.
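The snippet below is a minimal sketch of the general pattern described above: harvesting attributes embedded in self-describing files into a small database that can then be queried to locate files on the filesystem. It is not the actual MAS schema or the ARCCSSive API; the paths and attribute names are assumptions for illustration only.

```python
# Minimal sketch of the general idea behind a metadata attribute search:
# harvest attributes embedded in self-describing (netCDF) files into a
# database that can be queried to locate files on the filesystem.
# NOT the real MAS schema or ARCCSSive API; paths/attributes are assumed.
import glob
import sqlite3

from netCDF4 import Dataset

conn = sqlite3.connect("mas_sketch.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS files "
    "(path TEXT, variable TEXT, experiment TEXT, model TEXT)"
)

# Harvest: walk a (hypothetical) collection and record selected attributes.
for path in glob.glob("/g/data/example/**/*.nc", recursive=True):  # assumed path
    with Dataset(path) as ds:
        attrs = {k: getattr(ds, k, None) for k in ("experiment_id", "source_id")}
        for var in ds.variables:
            conn.execute(
                "INSERT INTO files VALUES (?, ?, ?, ?)",
                (path, var, attrs["experiment_id"], attrs["source_id"]),
            )
conn.commit()

# Query: find all files containing a given variable for a given experiment.
rows = conn.execute(
    "SELECT path FROM files WHERE variable = ? AND experiment = ?",
    ("tas", "historical"),
).fetchall()
print([r[0] for r in rows])
```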

We further describe the broad applications of MAS across domains, other use-case examples and MAS integration with search APIs, exploring the overall benefits the MAS capability has provided in improving access to Earth science datasets.

eLightning

The Rescue of Magnetotellurics (MT) Time Series Datasets

Authors: Nigel Rees, Kelsey A Druken, Benjamin Evans (NCI), Graham S Heinson, Dennis Conway (University of Adelaide), Jingbo Wang and Lesley A Wyborn (NCI)

Abstract

It is very common for a magnetotellurics (MT) geophysicist to have filing cabinets full of unique and unpublished MT datasets on data storage media that are no longer maintained or considered safe, with critical metadata recorded completely independently. Such datasets are potentially valuable to the wider MT community as they contain important information over large geographical areas that would otherwise require expensive re-acquisition. The raw time series are typically not released by the MT scientist who worked on the data and are routinely stored in private spaces on tapes, CDs, hard drives or local network drives. The metadata associated with the raw time series, which are critical for undertaking subsequent analysis, are often stored in separate PDF documents or sometimes on paper in folders or workbooks. By rescuing old MT datasets and processing and securing them, along with new data, in modern archival formats, the MT community can build on the many diverse collection programs of the last thirty years. Vintage MT data is still valuable today, and there is no reason why today’s data will not also be valuable into the future.

As part of the 2017-2018 AuScope-Australian Research Data Commons (ARDC) funded Geoscience Data-enhanced Virtual Laboratory (DeVL) project, the National Computational Infrastructure (NCI) has been working with The University of Adelaide to rescue their high-quality and valuable collection of historic raw MT time series data, transfer functions, model outputs and survey metadata dating back to 1993. To realise their full potential, significant time was expended linking the associated survey metadata to the rescued raw time series, which were then made accessible as file downloads in their pre-existing EDI and text formats. However, in order to make these vintage datasets comply with the modern data repository needs of Findable, Accessible, Interoperable and Reusable (FAIR), NCI has been investigating the value of converting the data to modern, open, scientific self-describing formats, with a view to demonstrating better accessibility through data services such as OPeNDAP. The investigation has shown the value of aggregating data from multiple historical surveys, enabling reuse for continental-scale analysis and/or the use of a much wider range of interoperable scientific software from other domains.
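As an illustration of what conversion to a self-describing format involves, the sketch below writes a hypothetical rescued MT channel and its survey metadata into a single netCDF file. The channel names, units and metadata fields are assumptions, not the MT community's agreed conventions.

```python
# Minimal sketch of converting a rescued MT time series into a self-describing
# netCDF file with its survey metadata embedded alongside the data.
# Channel names, units and metadata fields are illustrative assumptions.
import numpy as np
from netCDF4 import Dataset

n_samples = 86400                   # hypothetical: one day at 1 Hz
ex = np.random.randn(n_samples)     # stand-in for a rescued electric-field channel

with Dataset("mt_station_example.nc", "w", format="NETCDF4") as nc:
    # Dimension and coordinate variable for time.
    nc.createDimension("time", n_samples)
    time = nc.createVariable("time", "f8", ("time",))
    time.units = "seconds since 1993-01-01 00:00:00"
    time[:] = np.arange(n_samples, dtype="f8")

    # Data variable for one channel, with units recorded on the variable.
    ex_var = nc.createVariable("ex", "f4", ("time",), zlib=True)
    ex_var.units = "mV/km"
    ex_var.long_name = "electric field, north component"
    ex_var[:] = ex

    # Survey metadata that previously lived in separate PDFs or workbooks,
    # now embedded as global attributes so the file describes itself.
    nc.title = "Example rescued MT time series"
    nc.station_id = "EXAMPLE01"               # assumed identifier
    nc.survey = "Hypothetical 1993 survey"
    nc.Conventions = "CF-1.6"
```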


Making data from a national catalogue discoverable by Web data search tools

Authors: Adrian Burton (ANU), Lesley A Wyborn (NCI), Joel Benn (ANU) and Mingfang Wu (Monash)

Abstract

Research Data Australia (RDA, https://researchdata.ands.org.au/) is a data catalogue service funded by the Australian Government to operate a national register for the discovery of research datasets, including those in repositories from both academic institutions and government agencies. The mission of RDA is to improve the discoverability, access and reuse of research data, and the catalogue has descriptive pages (metadata landing pages) for about 130k datasets/collections, dynamically updated from over 100 Australian organisations. We have been striving to make datasets of value to the research community more discoverable not only from the catalogue itself, but also through other means such as Scholix, Web search engines and Web data search tools.

We have included the URL of each metadata landing page in a sitemap for easy indexing by Web search engines. Since 2015, we have marked up the 130k metadata landing pages with Schema.org to make them discoverable through Web data search tools. There are two advantages in doing this: 1) As a national data catalogue, syndicating metadata landing pages to a Web data search tool does not require the same syndication activity from each of the contributing 100 disciplinary/institutional repositories (many of whom are currently technically unable to do this). By having relevant, consistent information in a centralised Australia-wide catalogue, there is a greater chance that all the data will be indexed, particularly as smaller sites are often less reliable; greater economies of scale and lower maintenance costs are also achieved. 2) We are taking advantage of the Web architecture to make data more discoverable, particularly as a log analysis of our catalogue’s activities indicates that about 90% of the traffic is from Web search engines.
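As an illustration of the markup involved, the sketch below generates a minimal Schema.org Dataset record in JSON-LD of the kind embedded in a landing page. The dataset details and URLs are hypothetical, and RDA's actual markup is considerably richer.

```python
# Minimal sketch of Schema.org markup for a metadata landing page, generated
# in Python. Dataset details and URLs are hypothetical examples only.
import json

dataset_jsonld = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example research dataset",
    "description": "A hypothetical dataset described on a catalogue landing page.",
    "url": "https://researchdata.ands.org.au/example-landing-page",  # assumed URL
    "publisher": {"@type": "Organization", "name": "Example University"},
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# Embedded in the landing page's HTML as a JSON-LD script block so that
# Web data search tools can extract it.
script_block = (
    '<script type="application/ld+json">\n'
    + json.dumps(dataset_jsonld, indent=2)
    + "\n</script>"
)
print(script_block)
```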

Our presentation will reflect on challenges we are facing in applying Schema.org, for example, the level of granularity of Schema.org in describing research data. We will also reflect on how well schema.org adapts to the needs of scientists searching for data to re-use in research. Can schema.org encode the appraisal information that scientists need in order to decide to re-use data?

Posters

Roles and skills required for communities to successfully build research data standards and data infrastructures

Authors: Lesley A Wyborn and Clare Richards (NCI)

Abstract

The size of the community that can programmatically access scientific data is equivalent to the size of the community that developed, and knows about, the data infrastructures and standards used. Hence, for global integration of data, standards developed by international communities are optimal. But developing complex data infrastructures, particularly those built on standards such as those requiring data to be Findable, Accessible, Interoperable and Reusable (FAIR), takes more than individual researchers working in isolation.

The best outcomes are achieved when teams and even communities are committed to working together. Building this cohesion and commitment, particularly when it involves globally distributed individuals and teams, rarely happens by chance.

As an example, for data to be interoperable with other datasets and reusable by those who did not collect it, it must comply with agreed standards. Transforming datasets to comply with standards involves major change and requires building a team with the right skills, along with highly targeted outreach activities to transition data producers and users to the new standard.

For most countries, it is rare to find more than one or two people with the specialist technical skills needed to develop standards, hence international collaboration is needed. Further, many technical people do not have a science background, so the role of the ‘boundary spanner’, a person who can work in both a technical and a scientific role, is essential to translate needs and concerns between teams.

Once developed, the new standard requires a concerted outreach effort to ensure that it becomes widely known, increasing uptake and testing: technical experts are not usually ideal in this role, and a team with experience in outreach and governance is required.

Given the diversity of the teams involved in developing a standard, a trusted, ‘neutral’ coordinator is essential; they need to be a leader and an enthusiast. They are critical not only in bringing together the diverse communities developing the standard, but also in engaging external communities to increase uptake. The larger the community that uses the standard, the more likely it is to persist and be maintained. This paper will be supported by real-world examples from multiple international standards development efforts.


Improving quality within the FAIRly Big Data repository at NCI Australia

Authors: Kelsey A Druken, Benjamin Evans, Jingbo Wang, Clare Josephine Richards, Nigel Rees, Kate Snow, Sean Pringle, Kashif Gohar, Jon Smillie, Chris Allen and Qurat-ul-Ain Tariq

Abstract

In this rapidly evolving Big Data era, data quality practices are the foundation of a trusted repository and add value to the data it manages. Petabytes of information are effectively useless unless users understand and trust their quality. The Australian National Computational Infrastructure (NCI) is home to one of Australia’s largest Earth and space sciences data repositories (10+ PBytes), co-located with NCI’s high performance computing (HPC) facility and data-intensive environments. To ensure seamless, programmatic access to the repository, NCI applies F.A.I.R. (Findable, Accessible, Interoperable, Reusable) practices, which have the added benefit of improving data use and reuse across a broad range of applications and scientific domains. These practices include data quality control (i.e., compliance with recognised community and domain standards), quality assurance testing (i.e., demonstrated functionality and performance across common platforms, tools and services), and application-level benchmarking. Also key is the fusion of high-level collection and dataset catalogue metadata with the information deeply embedded in self-describing data, down to the variable level. This is vital both for users, who need to easily search and discover data, and for the management of a repository of this scale. The value of this investment in data quality is to enable transdisciplinary use of data at NCI and by the broader community, which in turn lays the groundwork for building integrated networks of repositories.
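The sketch below illustrates one small element of such data quality control: checking that a netCDF file carries the attributes a CF-style convention expects. NCI's actual QC suite is far more extensive, and the required attributes listed here are assumptions for illustration.

```python
# Minimal sketch of one data-quality-control element: checking that a netCDF
# file carries the attributes a CF-style convention expects. The required
# attribute lists below are illustrative assumptions, not NCI's QC rules.
from netCDF4 import Dataset

REQUIRED_GLOBAL_ATTRS = ["title", "Conventions"]      # assumed minimum set
REQUIRED_VARIABLE_ATTRS = ["units", "long_name"]

def check_file(path):
    """Return a list of human-readable compliance problems for one file."""
    problems = []
    with Dataset(path) as ds:
        for attr in REQUIRED_GLOBAL_ATTRS:
            if not hasattr(ds, attr):
                problems.append(f"missing global attribute: {attr}")
        for name, var in ds.variables.items():
            for attr in REQUIRED_VARIABLE_ATTRS:
                if not hasattr(var, attr):
                    problems.append(f"variable {name} missing attribute: {attr}")
    return problems

# Example usage against a hypothetical file path:
# print(check_file("/g/data/example/sample.nc"))
```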


Building a multipurpose Geoscience Virtual Research Environment to cater for multiple use cases, a range of scales and diverse skill sets

Authors: Carsten Friedrich (CSIRO Canberra), Tim Rawling (AuScope), Mingfang Wu (Monash University), Geoffrey Squire (CSIRO Data 61), Lesley A Wyborn (NCI), Jens F Klump (CSIRO Earth Science Resource Engineering), Ryan Fraser (CSIRO)

Abstract

AuScope, the Australian Earth Science research infrastructure capability, has been delivering software and data research infrastructure across the geophysics, geochemistry and geodesy domains for over a decade. Initially, the focus was on providing access to data located in repositories at multiple sites via web services, websites and portals; file downloads were also enabled. Access to downloadable analytic software tools was also provided through those websites and portals. In 2010, an experimental Virtual Geophysics Laboratory (VGL) was created to provide integrated access to both data and software. It enabled researchers to build ‘online’ workflows that facilitated processing, lowered the barriers to entry for new analytical capabilities (including on HPC), and increased uptake of data and software resources.

This triggered a demand for other specific laboratories to be built for geohazards, geochemistry, mineral exploration and more. Even though succeeding virtual laboratories were built using the generic Portal Core and VL Core software platform, the sustainability and maintenance costs soon became significant for each. It was also hard to find the ‘sweet spot’ that resulted in maximum usage for a given amount of effort, as some users wanted more effort to be put into user interfaces, whilst others wanted more bespoke and/or more complex processing workflows to be added.

A different approach is now being taken: the AuScope Virtual Research Environment (AVRE) is being created to enable users with varying skills to target their specific needs and to access a range of online data and software resources available as services, either to create their own workflows in their own environment or to utilise pre-existing workflows on a variety of computational infrastructures. Both data and software are accessible via standardised interfaces and are now used by individual researchers, who commonly use Python Notebooks to mix and match data, software and tools to create their own exploratory workflows.
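The sketch below illustrates this notebook-style "mix and match" workflow: pulling data from a standards-based OGC web service into Python for exploratory analysis. The endpoint and feature type are hypothetical; real AuScope services expose their own endpoints and feature types.

```python
# Minimal sketch of a notebook-style workflow: fetch features from a
# standards-based OGC WFS service for exploratory analysis in Python.
# The endpoint and feature type are hypothetical.
import requests

WFS_ENDPOINT = "https://example.org/geoserver/wfs"    # hypothetical endpoint

params = {
    "service": "WFS",                  # standard OGC WFS GetFeature request
    "version": "1.1.0",
    "request": "GetFeature",
    "typeName": "example:boreholes",   # hypothetical feature type
    "outputFormat": "application/json",
    "maxFeatures": 100,
}

response = requests.get(WFS_ENDPOINT, params=params, timeout=60)
response.raise_for_status()
features = response.json()["features"]

# From here a researcher would typically load the features into pandas or
# GeoPandas and combine them with other services in the same notebook.
print(f"retrieved {len(features)} features")
```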

Funding from the Australian Research Data Commons (ARDC) will be utilised with co-contributions from AuScope to develop this new platform.


Delivering IGSN to Australian Academics from Multiple Laboratories via a Centralised National Service

Authors: Gerry Ryder (ARDC), Lesley Wyborn (NCI), Adrian Burton (ARDC), Jens Klump (CSIRO), Brent McInnes (Curtin University), Julia Martin (ARDC)

Abstract

The International Geo Sample Number (IGSN) provides a globally unique persistent identifier for physical samples. In July 2018, the Australian Research Data Commons (ARDC) released a national IGSN allocation service to support research activity within Australian earth science academia. The objective is to improve sample identification and connection with other scholarly outputs (publications, data, etc.). The scope of this national infrastructure is earth scientists across the Australian university sector.

One difficulty faced in establishing this service was that earth scientists in Australian universities do not necessarily have the resources or systems to support the management of samples or to provide persistent storage of the sample metadata. To address this constraint, ARDC provides an online service and stores the metadata for individual researchers. ARDC also uses the IGSN metadata to generate a human-readable landing page, which can be managed by the researcher.
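Purely as an illustration of the researcher-facing workflow, the sketch below shows how a sample-registration service of this kind might accept metadata and return an identifier plus a landing page. The endpoint, payload fields and response shape are all assumptions and do not describe the actual ARDC API.

```python
# Purely hypothetical sketch of registering a sample with an IGSN allocation
# service. Endpoint, payload fields and response shape are all assumptions;
# this does not describe the actual ARDC API.
import requests

IGSN_SERVICE = "https://example.ardc.edu.au/igsn/register"  # hypothetical

sample_metadata = {
    "name": "Drill core section 12A",        # illustrative fields only
    "sampleType": "core",
    "collector": "Example Researcher",
    "collectionDate": "2018-07-01",
    "location": {"lat": -34.93, "lon": 138.60},
}

response = requests.post(IGSN_SERVICE, json=sample_metadata, timeout=30)
response.raise_for_status()
record = response.json()

# A service of this kind would mint the identifier and host a human-readable
# landing page generated from the submitted metadata.
print(record.get("igsn"), record.get("landing_page"))
```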

Allocating an IGSN to a physical sample supports discovery, access, sharing and citation of samples. Researchers benefit from creating and allocating an IGSN by ensuring its inclusion in data derived from the sample, literature where the sample and data are interpreted, as well as aiding discovery of the curator and/or collector of the sample. Giving a sample a persistent identifier also supports preservation of and access throughout the sample management lifecycle, making discovery and reuse more efficient.

While the scope of the ARDC service is currently limited to earth science samples, there is interest from other academic communities, such as herbaria, archaeology and biology. Extending the ARDC IGSN service to these communities would support transdisciplinary research.
