eResearch Australasia 2016

These are the presentations that NCI staff gave at the 2016 eResearch Australasia Conference in Melbourne.

Implementing a Data Quality Strategy to simplify access to data (PDF)

Authors: Kelsey Druken, Claire Trenham, Lesley Wyborn, Ben Evans (NCI, ANU)

Abstract

To ensure seamless programmatic access, standardisation of both data and services is vital. At the National Computational Infrastructure (NCI) we have developed a Data Quality Strategy (DQS) that currently provides processes for: (1) uniformity of the underlying High Performance Data (HPD) file format, (2) quality control through compliance with recognised community standards, and (3) data assurance through demonstrated functionality across common platforms, tools, and services. NCI hosts one of Australia’s largest repositories (10+ PBytes) of research data collections, spanning datasets from climate, coasts, oceans, and geophysics through to astronomy, bioinformatics, and the social sciences. By implementing our DQS we have seen progressive improvement in the quality of the datasets across the different subject domains and, through this, in the ease with which users can access the data.
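
As an illustration of component (2) of the DQS, quality control through standards compliance, the sketch below shows the kind of check that can be automated over a netCDF file. It is a minimal example only, assuming a hypothetical file name and a reduced attribute set; it is not NCI’s actual DQS tooling.

```python
# Minimal sketch of a standards-compliance check: verify that a netCDF file
# declares the CF conventions and that each variable carries the basic
# attributes downstream tools rely on. Illustrative only, not NCI's DQS
# tooling; the file name and attribute set are hypothetical.
from netCDF4 import Dataset

REQUIRED_VAR_ATTRS = ("units", "long_name")  # reduced set for this sketch

def check_cf_basics(path):
    """Report simple CF-style metadata gaps in a netCDF file."""
    problems = []
    with Dataset(path) as ds:
        conventions = str(getattr(ds, "Conventions", ""))
        if "CF-" not in conventions:
            problems.append(f"global Conventions attribute is {conventions!r}, expected a CF version")
        for name, var in ds.variables.items():
            for attr in REQUIRED_VAR_ATTRS:
                if not hasattr(var, attr):
                    problems.append(f"variable {name!r} is missing {attr!r}")
    return problems

if __name__ == "__main__":
    for issue in check_cf_basics("example_dataset.nc"):  # hypothetical file
        print("FAIL:", issue)
```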


Versioning of Data Sets: Why, How, What and Where? (PDF)

Authors: Lesley Wyborn (NCI, ANU), Jens Klump (CSIRO), Adrian Burton (ANDS)

Abstract

There has been considerable investment in data storage infrastructure in Australia over the last seven years as a result of the NCRIS and Super Science tranches of funding. As data sets become more easily accessible online, there is a growing trend for research activities to build on these existing data sets, which can also be repurposed and/or reprocessed to generate higher-level products using the increasing availability of HPC and cloud resources. At the same time, there is a growing demand for research results to be transparent and reproducible, but because many of the data sets are either being constantly added to or dynamically changed, it is becoming difficult to cite the exact data extract that was used as input to a particular research project. Very few publicly available data sets are versioned, and when they are, there is no consistency. A first step is to begin a dialogue around whether agreed standards are required for data versioning, based around the themes Why?, How?, What?, and Where?. This BoF seeks community agreement on the need for data versioning, to help survey existing practices across all eResearch domains and to determine whether there is demand for a more formal activity to be set up within the Australian eResearch community.


Adopting Outputs from the Research Data Alliance (PDF)

Authors: Stephanie Kethers (ANDS), Lesley Wyborn (NCI, ANU), Malcolm Wolski (Griffith University)

Abstract

The Research Data Alliance (RDA, http://rd-alliance.org), founded in 2013 by the Australian Government’s [then] Department of Innovation, the European Commission, and the US National Science Foundation and National Institute of Standards and Technology, aims to build the social and technical bridges that enable open sharing of data. The RDA vision is researchers and innovators openly sharing data across technologies, disciplines, and countries to address the grand challenges of society. Participation in RDA is open to anyone who agrees to its guiding principles of openness, consensus, balance, and harmonisation, with a community-driven and non-profit approach. RDA has a broad, committed membership of individuals from academia, industry, and government – over 3,500 from 110 countries – and 47 organisational members and affiliates (as of April 2016).


Research Graph: Connecting Researchers, Research Data, Publications and Grants using Graph Technology

Authors: Nathaniel Lewis (University of Sydney), Jingbo Wang (NCI, ANU), Marta Poblet (RMIT University), Amir Aryani (ANDS)

Abstract

In this presentation, we will discuss the challenge of connecting research information, including linking researchers to their publications, datasets, and research grants. The main focus of the presentation is on using the Research Graph and the Neo4j graph database to link scholarly works, and we demonstrate how these technologies have been implemented at the National Computational Infrastructure (NCI), the Australian National Data Service (ANDS), and the University of Sydney using open-source software including the Research Data Switchboard, Neo4j, and the Research Graph schema.
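
As a flavour of the approach, the sketch below uses the official Neo4j Python driver to link the four kinds of nodes the Research Graph schema centres on (researchers, publications, datasets, and grants). The connection details and identifiers are hypothetical placeholders, and the code is illustrative rather than the Research Data Switchboard’s actual implementation.

```python
# Illustrative sketch of linking scholarly works in Neo4j, loosely modelled
# on the Research Graph schema's node types. Connection details, labels,
# relationship names, and identifiers below are all hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # hypothetical

def link_scholarly_work(tx):
    # MERGE keeps the load idempotent: nodes and relationships are created
    # only if they do not already exist, so repeated harvests are safe.
    tx.run("""
        MERGE (r:Researcher {orcid: $orcid, name: $name})
        MERGE (p:Publication {doi: $pub_doi})
        MERGE (d:Dataset {doi: $data_doi})
        MERGE (g:Grant {purl: $grant})
        MERGE (r)-[:AUTHOR_OF]->(p)
        MERGE (p)-[:CITES]->(d)
        MERGE (g)-[:FUNDED]->(p)
    """,
    orcid="0000-0000-0000-0000", name="A. Researcher",      # hypothetical
    pub_doi="10.1234/example.pub", data_doi="10.1234/example.data",
    grant="http://purl.org/au-research/grants/example")

with driver.session() as session:
    session.execute_write(link_scholarly_work)
driver.close()
```

Once loaded, a single Cypher query can traverse from a grant to every dataset its funded publications cite, which is the kind of cross-organisation connection the presentation describes.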


The American Geophysical Union Data Management Maturity Program (PDF)

Authors: Shelley Stall (AGU), Brooks Hanson (AGU), Lesley Wyborn (NCI, ANU)

Abstract

The American Geophysical Union (AGU), with 60,000 members internationally, is the largest global professional society for the geosciences. In response to emerging data management mandates from funders, AGU has developed a program that will help data repositories, large and small, domain-specific to general, use best practices to assess and improve their data management practices. The cornerstone of the program is the Data Management Maturity (DMM)℠ framework, which has been adapted to the specific needs of the Earth and space sciences. A data management assessment using the DMM℠ involves identifying accomplishments and weaknesses in an organization compared with leading practices for data management. Recommendations can help to improve quality and consistency across the community, which will facilitate reuse across the data lifecycle. Through its governance, quality, and architecture process areas, the assessment can measure the ability of repositories to make their data accessible, discoverable, and interoperable.


Exploiting the Long Tail of Scientific Data: Making Small Data BIG

Authors: Lesley Wyborn (NCI, ANU), Kerstin Lehnert (Columbia University)

Abstract

Big Data is no longer on the Gartner hype curve: increasingly, small data is gaining recognition as a highly valuable asset in its own right, whose collective sum has the potential to be of far greater importance than any of its parts. However, for the Earth and environmental sciences, funding for data support is still primarily focused on those areas that generate massive volumes of observational or computed data using large-scale, shared instrumentation such as global sensor networks, satellites, or high-performance computing facilities. In their own right, small data sets concatenated into standardized BIG data sets have the potential to make a valuable contribution to research and can be a breeding ground for new and innovative research ideas. Small data can also be used to calibrate large-volume remotely sensed data collections and can provide clues that uncover unforeseen trends in big data sets. In many Earth and environmental areas of research, especially those where data are primarily acquired by individual investigators or small teams (known as ‘long-tail science communities’), data are poorly shared and integrated, and lack a community-based data infrastructure that ensures persistent access, quality control, and standardization. Because of their heterogeneity and lack of standardization, long-tail collections are not attractive to funders, as Returns On Investment (ROIs) are perceived to be low. Different strategies are required that apply across multiple collections of the same data type. Options include (1) a more modular approach to developing the required standards, (2) developing domain-specialized repositories, and (3) working with the instrument manufacturers that generate a substantial proportion of long-tail data to develop agreements for instrument outputs to be compatible with internationally agreed standards.


Delaying the ‘Peak Data Crisis’ in the Era of Data-intensive Science

Authors: Lesley Wyborn, Ben Evans (NCI, ANU)

Abstract

The ‘Peak Oil Crisis’ refers to the point in time when the maximum rate of extraction of petroleum is reached, after which it is expected to enter terminal decline. Although originally predicted to happen between 1985 and 2000, more efficient use of existing resources, combined with new discoveries, has extended the current estimate to 2020. In parallel, the ‘Peak Data Crisis’ refers to the point in time at which there is insufficient affordable persistent storage available for all the copies of the scientific data sets, and their derivative products, that have been deemed important to researchers in the community. It is well documented that data volumes are growing at a faster-than-exponential rate, and it is also common for raw data to go through a series of processing levels as the data are converted into more useful parameters and products. The growing data volumes have driven a move towards larger, more centralized processing through well-managed facilities with data repositories that are co-located with computational systems. Analysis suggests that an increasing proportion of the demand for storage at these centralized facilities comes from individual researchers or research groups wanting to reformat or reprocess data into formats and specifications that are specific to their particular use case and/or chosen application. At these centralized facilities, maintaining multiple copies of petabyte-scale datasets in different formats is becoming untenable, and there needs to be a shift towards internationally agreed community High Performance Data sets that permit users to interactively invoke different forms of computation. To achieve this, individual researchers or research teams need to join a growing number of global scientific communities to determine agreed formats and standards that make more effective use of existing storage and help delay the ‘Peak Data Crisis’ in the era of data-intensive science.


A learner-centred approach to specialised user training (PDF)

Authors: Claire Trenham, Kelsey Druken, Ben Evans, Rika Kobayashi, Chris Allen (NCI, ANU)

Abstract

NCI has implemented the Moodle Learning Management System on our private cloud infrastructure to enable delivery of training materials via a learner-directed approach. This involves transitioning our materials from an instructor-centred method of structured content delivery to a modular approach where the learner (system user) is able to choose the parts most relevant to them. Our training materials have been broken down into grouped modules and lessons, and a suite of Jupyter Notebooks containing interactive tutorial materials has been created for students to download and use as training exercises. In this way we are moving NCI towards evidence-based instruction mechanisms [1] which permit self-directed learning by our users, allow the “students” to work through material at their own pace, and let the “teacher” act as a tutor who assists with problems rather than taking a central didactic position in the instruction. Advantages of this approach include the ability to embed exercises within content; the ease of tailoring training courses to different user groups’ interests and requirements; and making it easier for users to seek instruction at any time instead of waiting for the next course to be delivered. As the system is more widely adopted by our user base, we hope the use of forums will enable the community to support each other with queries; NCI trainers may then be able to take more of a monitoring role in the process. Face-to-face courses remain an invaluable part of training, as having direct access to “helpers” in the early stages of learning can enable much faster resolution of problems, and we believe it is important to draw on educational theory and best practice in delivering professional-level training to our users, just as we would in school classrooms.
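
The sketch below illustrates the style of self-paced exercise such a notebook might contain: opening a remote dataset and inspecting its metadata before attempting a guided task. The OPeNDAP endpoint and dataset path are hypothetical placeholders, not actual NCI training materials.

```python
# Sketch of a self-paced notebook exercise: open a dataset over OPeNDAP
# (no download required) and inspect its metadata. The URL below is a
# hypothetical placeholder; netCDF4 must be built with DAP support for
# remote access to work.
from netCDF4 import Dataset

URL = "https://dapds00.nci.org.au/thredds/dodsC/example/dataset.nc"  # hypothetical path

ds = Dataset(URL)
print(getattr(ds, "title", "untitled"))            # global metadata, if present
for name, var in ds.variables.items():
    print(name, var.dimensions, getattr(var, "units", "no units"))

# Learner task: subset one variable over a small region, plot it (e.g. with
# matplotlib), and compare the result against the catalogue entry.
ds.close()
```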


Persistent Identifier Practice for Big Data Management at NCI (PDF)

Authors: Jingbo Wang, Wei Si (NCI, ANU), Nicholas Car (Geoscience Australia), Ben Evans (NCI, ANU)

Abstract

The National Computational Infrastructure (NCI) manages over 10 PB of research data, which is co-located with Raijin, a top-100 high-performance computer, for fast processing. NCI’s data platform services include catalogue building, DOI minting, data curation, data publishing, and data delivery. Data indexing and search capabilities are important for users to be able to find datasets easily. To help with this, NCI uses persistent identifier (PID) services to provide robust identification both for items within its massive data collection catalogues and for data service endpoint URLs. We demonstrate NCI’s approach to utilising a PID management tool, known as the PID Service, to manage its persistent identifiers.
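
The value of a PID is that clients resolve the identifier rather than a raw URL, so references survive reorganisation of the underlying storage. The sketch below shows this resolution pattern for a DOI-style identifier; the identifier is a hypothetical placeholder, and the snippet is illustrative rather than part of NCI’s PID Service.

```python
# Sketch of client-side PID resolution: an HTTP GET against the resolver
# follows redirects to the current landing page, so the identifier stays
# stable even when the underlying URL changes. The DOI below is a
# hypothetical placeholder, not a real NCI identifier.
import requests

doi = "10.4225/00/example"  # hypothetical DOI-style identifier
response = requests.get(f"https://doi.org/{doi}", allow_redirects=True)

print("resolved to:", response.url)         # final landing page after redirects
print("status:", response.status_code)
for hop in response.history:                # each redirect hop along the way
    print("via:", hop.status_code, hop.url)
```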


The Dawn of the Exascale Age: Using Integrated HPC and Connected Data (PDF)

Authors: Ben Evans (NCI, ANU)

Abstract

The age of exascale has now arrived, and several major, coordinated Earth Systems research communities have been established to undertake the computational challenges for real-world societal needs. These incorporate the efforts of both government agencies and the research community, and are transitioning to common-good information infrastructures that can energize the innovation system more broadly. While most of the critical challenges are fundamentally being addressed within these communities of interest, some technical challenges are also common to the eResearch community – albeit satisfying different criteria, and so with a different focus. This talk will explore some of the major computational and data challenges being addressed in multi-scale simulation and in access to the richness of data, as well as some of the cross-over issues. Targeting a number of important, finer-detailed areas may remove impediments and thereby assist with the major Earth Systems challenges, as well as with other areas within the broader eResearch-engaged activities.


How do we improve the Measure of Value in eResearch Infrastructure? (PDF)

Authors: Clare Richards, Ben Evans, Lesley Wyborn (NCI, ANU)

Abstract

If you build it, they will come. Of course they will. Won’t they? If they do come, was it what they expected, and was it better than an alternative? How can we tell how valuable work in digital infrastructure is – and how do we gauge it? Even with success there is always something to learn, so how do we make sure that all the parties providing resources and expertise get appropriate feedback so that we can continue to improve – or is just building infrastructure and driving uptake the main achievement? In this talk we explore these issues, then categorise and test a new way of measuring the value of research infrastructure: through its value to researchers.
