American Geophysical Union Fall Meeting 2016
NCI staff are giving several presentations at the AGU 2016 Fall Meeting, held on 12-16 December 2016. This page provides links to the AGU listing for each of the papers. PDF versions of all presentations will be provided when they are finalised.
Authors: Pablo Rozas Larraondo, Ben Evans, Joseph Antony (NCI, ANU)
Research using deep neural networks has significantly matured in recent times, and there is now a surge of interest in applying such methods to Earth systems science and the geosciences. When combined with Big Data, we believe there are opportunities for significantly transforming a number of areas relevant to researchers and policy makers. In particular, by using a combination of data from a range of satellite Earth observations as well as computer simulations from climate models and reanalysis, we can gain new insights into the information that is locked within the data. Global geospatial datasets describe a wide range of physical and chemical parameters, which are mostly available using regular grids covering large spatial and temporal extents. This makes them perfect candidates for deep learning methods. So far, these techniques have been successfully applied to image analysis through the use of convolutional neural networks. However, this is only one field of interest, and there is potential for many more use cases to be explored.
Deep learning algorithms require fast access to large amounts of data in the form of tensors and make intensive use of CPU in order to train their models. The Australian National Computational Infrastructure (NCI) has recently augmented its Raijin 1.2 PFlop supercomputer with hardware accelerators. Together with NCI’s 3000 core high performance OpenStack cloud, these computational systems have direct access to NCI’s 10+ PBytes of datasets and associated Big Data software technologies (see http://geonetwork.nci.org.au/ and http://nci.org.au/systems-services/national-facility/nerdip/).
To effectively use these computing infrastructures requires that both the data and software are organised in a way that readily supports the deep learning software ecosystem. Deep learning software, such as the open source TensorFlow library, has allowed us to demonstrate the possibility of generating geospatial models by combining information from our different data sources. This opens the door to an exciting new way of generating products and extracting features that have previously been labour intensive. In this paper, we will explore some of these geospatial use cases and share some of the lessons learned from this experience.
Authors: Ben Evans, Lesley Wyborn, Kelsey Druken, Clare Richards, Claire Trenham, Jingbo Wang (NCI, ANU)
The Australian National Computational Infrastructure (NCI) manages a large geospatial repository (10+ PBytes) of Earth systems, environmental, water management and geophysics research data, co-located with a petascale supercomputer and an integrated research cloud. NCI has applied the principles of the “Common Framework for Earth-Observation Data” (the Framework) to the organisation of these collections, enabling a diverse range of researchers to explore different aspects of the data and, in particular, supporting seamless programmatic data analysis, both through in-situ access and via data services.
NCI provides access to the collections through the National Environmental Research Data Interoperability Platform (NERDIP), a comprehensive and integrated data platform with both common and emerging services designed to enable data accessibility and citability. Applying the Framework across the range of datasets ensures that programmatic access, by both in-situ and network methods, works as uniformly as possible for any dataset, using both APIs and data services.
NCI has also created a comprehensive quality assurance framework to regularise compliance checks across the data, library APIs and data services, and to establish a comprehensive set of benchmarks to quantify both functionality and performance perspectives for the Framework.
The quality assurance includes organisation of datasets through a data management plan, which anchors the data directory structure, version controls and data information services so that they are kept aligned with operational changes over time. Specific attention has been placed on the way data are packed inside the files. Our experience has shown that complying with standards such as CF and ACDD is still not enough to ensure that all data services or software packages correctly read the data. Further, data may not be optimally organised for the different access patterns, which leads to poor CPU performance and poor bandwidth utilisation. We will also discuss some gaps in the Framework that have emerged and our approach to resolving these.
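The point about data organisation and access patterns can be made concrete with some simple arithmetic. The following is a minimal sketch, assuming a chunked (time, lat, lon) array and chunk-aligned reads; the grid size and chunk shapes are hypothetical and are not taken from any NCI collection.

```python
import math

def chunks_touched(read_shape, chunk_shape):
    """Number of chunks a chunk-aligned read of `read_shape` spans
    when the array is stored in chunks of `chunk_shape`."""
    return math.prod(math.ceil(r / c) for r, c in zip(read_shape, chunk_shape))

# Hypothetical (time, lat, lon) variable: 3650 daily steps on a 720 x 1440 grid.
time_series_read = (3650, 1, 1)   # every time step at a single grid cell
map_read = (1, 720, 1440)         # a single time step over the whole grid

map_friendly = (1, 720, 1440)     # one chunk per time step
balanced = (365, 72, 144)         # a compromise between both access patterns

for name, chunks in [("map_friendly", map_friendly), ("balanced", balanced)]:
    print(name,
          chunks_touched(time_series_read, chunks),
          chunks_touched(map_read, chunks))
# map_friendly: 3650 chunks for the time series, 1 for the map
# balanced:     10 chunks for the time series, 100 for the map
```

A layout that is ideal for one access pattern can force thousands of chunk reads for another, which is one way the same standards-compliant file can perform very differently across tools.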
Authors: Clare Richards, Ben Evans, Lesley Wyborn, Jingbo Wang, Claire Trenham, Kelsey Druken (NCI, ANU)
The Australian National Computational Infrastructure (NCI) has ingested over 10PB of national and international environmental, Earth systems science and geophysics reference data onto a single platform to advance high performance data (HPD) techniques that enable interdisciplinary Data-intensive Science. Improved Data Stewardship is critical to evolve both data and data services that support the increasing need for programmatic usability and that prioritise interoperability rather than just traditional data download or portal access.
A data platform designed for programmatic access requires quality checked collections that better utilise interoperable data formats and standards. Achieving this involves strategies to meet both the technical and ‘social’ challenges. Aggregating datasets used by different communities and organisations requires satisfying multiple use cases for the broader research community, whilst addressing existing business-as-usual (BAU) requirements. For NCI, this requires working with data stewards to manage the process of replicating data to the common platform, with community representatives and developers to confirm their requirements, and with international peers to better enable globally integrated data communities.
It is particularly important to engage with representatives from each community who can work collaboratively towards a common goal, as well as capture their community needs, apply quality assurance, determine any barriers to change and understand priorities. This is critical when managing the aggregation of data collections from multiple producers with different levels of stewardship maturity, technologies and standards, and where organisational barriers can impact the transformation to interoperable and performant data access.
To facilitate the management, development and operation of the HPD platform, NCI coordinates technical and domain committees made up of user representatives, data stewards and informatics experts to provide a forum to discuss, learn and advise NCI’s management. This experience has been a useful collaboration and suggests that in the age of interdisciplinary HPD research, Data Stewardship is evolving from a focus on the needs of a single community to one which helps balance priorities and navigates change for multiple communities.
Authors: Lesley Wyborn, Ben Evans (NCI, ANU)
Reproducibility is a fundamental tenet of the scientific method: it implies that any researcher, or a third party working independently, can duplicate any experiment or investigation and produce the same results. Historically, computationally based research involved an individual using their own data and processing it in their own private area, often using software they wrote or inherited from close collaborators. Today, a researcher is likely to be part of a large team that will use a subset of data from an external repository and then process the data on a public or private cloud or on a large centralised supercomputer, using a mixture of their own code, third party software and libraries, or global community codes.
In ‘Big Geoscience’ research it is common for data inputs to be extracts from externally managed dynamic data collections, where new data is being regularly appended, or existing data is revised when errors are detected and/or as processing methods are improved. New workflows increasingly use services to access data dynamically to create subsets on-the-fly from distributed sources, each of which can have a complex history. At major computational facilities, underlying systems, libraries, software and services are being constantly tuned and optimised, or new or replacement infrastructure is being installed. Likewise, code used from a community repository is continually being refined, re-packaged and ported to the target platform.
To achieve reproducibility, today’s researcher increasingly needs to track their workflow, including querying information on the current or historical state of facilities used. Versioning methods are standard practice for software repositories or packages, but it is not common for either data repositories or data services to provide information about their state, or for systems to provide query-able access to changes in the underlying software. While a researcher can achieve transparency and describe steps in their workflow so that others can repeat them and replicate processes undertaken, they cannot achieve exact reproducibility or even transparency of results generated. In Big Geoscience, full reproducibility will be an elusive dream until data repositories and compute facilities can provide provenance information in a standards compliant, machine query-able way.
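As an illustration of what tracking a workflow step might involve on the researcher's side, the sketch below records a checksum of the input data alongside the software environment. It is only an assumption about what such a record could contain: the function name, the field names and the `mylib` package are hypothetical, and it does not represent a provenance standard or any facility's service.

```python
import hashlib
import json
import platform
import sys

def provenance_record(input_bytes, software_versions):
    """Build a JSON-serialisable record of what one workflow step used."""
    return {
        # Checksum pins the exact bytes of the (possibly dynamic) input subset.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        # Versions of the libraries used, supplied by the caller.
        "software": dict(sorted(software_versions.items())),
    }

record = provenance_record(b"example subset", {"mylib": "1.2.0"})
print(json.dumps(record, indent=2))
```

A record like this only captures what the researcher can observe locally; the state of the remote repository or facility still has to be published by those systems themselves.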
Authors: Kelsey Druken, Claire Trenham, Ben Evans, Clare Richards, Jingbo Wang, Lesley Wyborn (NCI, ANU)
To ensure seamless programmatic access for data analysis (including machine learning), standardization of both data and services is vital. At the Australian National Computational Infrastructure (NCI) we have developed a Data Quality Strategy (DQS) that currently provides processes for: (1) the consistency of data structures in the underlying High Performance Data (HPD) platform; (2) quality control through compliance with recognized community standards; and (3) data quality assurance through demonstrated functionality across common platforms, tools and services. NCI hosts one of Australia’s largest repositories (10+ PBytes) of research data collections spanning datasets from climate, coasts, oceans and geophysics through to astronomy, bioinformatics and the social sciences. A key challenge is the application of community-agreed data standards to the broad set of Earth systems and environmental data that are being used. Within these disciplines, data span a wide range of gridded, ungridded (i.e., line surveys, point clouds), and raster image types, as well as diverse coordinate reference projections and resolutions.
By implementing our DQS we have seen progressive improvement in the quality of the datasets across the different subject domains, and through this, the ease with which users can programmatically access the data, either in situ or via web services. As part of its quality control procedures, NCI has developed a compliance checker based upon existing domain standards. The DQS also includes extensive Functionality Testing, which includes readability by commonly used libraries (e.g., netCDF, HDF, GDAL, etc.); accessibility by data servers (e.g., THREDDS, Hyrax, GeoServer); validation against scientific analysis and programming platforms (e.g., Python, Matlab, QGIS); and visualization tools (e.g., ParaView, NASA Web World Wind). These tests ensure smooth interoperability between products and services as well as exposing unforeseen requirements and dependencies. The results provide an important component of quality control within the DQS as well as clarifying the requirement for any extensions to the relevant standards that help support the uptake of data by broader international communities.
Authors: Brent McInnes (Curtin University), Tim Rawling (AuScope), Warick Brown, Matthias Liffers (Curtin University), Lesley Wyborn (NCI, ANU), Adam Brown, Simon J D Cox (CSIRO).
Technological improvements in laboratory automation and microanalytical methods are producing an unprecedented volume of high-value geochemical data for use by geoscientists in understanding geological and planetary processes. In contrast, the research infrastructure necessary to systematically manage, deliver and archive analytical data has not progressed much beyond the minimum effort necessary to produce a peer-reviewed publication. Anecdotal evidence indicates that the majority of publicly funded data is underreported, and what is published is relatively undiscoverable to experienced researchers, let alone the general public. Government-funded “open data” initiatives have a role to play in the development of networks of data management and delivery ecosystems and practices allowing access to publicly funded data. This paper reports on progress in Australia towards creation of an open data ecosystem involving multiple academic and government research institutions cooperating to create an open data architecture linking researchers, physical samples, sample metadata, laboratory metadata, analytical data and consumers.
Authors: Claire Trenham, Kelsey Druken, Adam Steer, Ben Evans, Clare Richards, Jon Smillie, Chris Allen, Sean Pringle, Jingbo Wang, Lesley Wyborn (NCI, ANU)
The Australian National Computational Infrastructure (NCI) provides access to petascale data in climate, weather, Earth observations, and genomics, and terascale data in astronomy, geophysics, ecology and land use, as well as social sciences. The data is centralized in a closely integrated High Performance Computing (HPC), High Performance Data (HPD) and cloud facility. Despite this, there remain significant barriers for many users to find and access the data: simply hosting a large volume of data is not helpful if researchers are unable to find, access, and use the data for their particular need. Use cases demonstrate we need to support a diverse range of users who are increasingly crossing traditional research discipline boundaries.
To support their varying experience, access needs and research workflows, NCI has implemented an integrated data platform providing a range of services that enable users to interact with our data holdings. These services include:
– A GeoNetwork catalog built on standardized Data Management Plans to search collection metadata, and find relevant datasets;
– Web data services to download or remotely access data via OPeNDAP, WMS, WCS and other protocols;
– Virtual Desktop Infrastructure (VDI) built on a highly integrated on-site cloud with access to both the HPC peak machine and research data collections. The VDI is a fully featured environment allowing visualization, code development and analysis to take place in an interactive desktop environment; and
– A Learning Management System (LMS) containing User Guides, Use Case examples and Jupyter Notebooks structured into courses, so that users can self-teach how to use these facilities with examples from our system across a range of disciplines.
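For the remote-access service in the list above, the sketch below shows how a subset request might be expressed as an OPeNDAP (DAP2) URL with an index-based constraint. The server address, dataset path and variable name are hypothetical placeholders, not NCI endpoints.

```python
def dap2_subset_url(dataset_url, variable, slices, response="ascii"):
    """Build a DAP2 request URL for a hyperslab of one variable.

    `slices` holds one (start, stride, stop) index triple per dimension;
    in DAP2 constraint expressions the stop index is inclusive.
    """
    hyperslab = "".join(
        f"[{start}:{stride}:{stop}]" for start, stride, stop in slices
    )
    return f"{dataset_url}.{response}?{variable}{hyperslab}"

# Hypothetical dataset: first 10 time steps of 'tas' at one grid cell.
url = dap2_subset_url(
    "https://example.org/thredds/dodsC/collection/file.nc",
    "tas",
    [(0, 1, 9), (100, 1, 100), (200, 1, 200)],
)
print(url)
```

In practice most users would let a client library such as netCDF build these requests, but the URL form shows that only the requested subset needs to cross the network.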
We will briefly present these components, and discuss how we engage with data custodians and consumers to develop standardized data structures and services that support the range of needs. We will also highlight some key developments that have improved user experience in utilizing the services, particularly enabling transdisciplinary science.
This work combines with other developments at NCI to increase the confidence of scientists from any field to undertake research and analysis on these important data collections regardless of their preferred work environment or level of skill.
Authors: Dean Williams (DOE), Michael Lautenschlager (DKRZ), Luca Cinquini (NASA/NOAA), Sébastien Denvil (IPSL), Robert Ferraro (NASA), Daniel Duffy (NASA), V Balaji (NOAA), Claire Trenham (NCI, ANU)
The Earth System Grid Federation (ESGF) is primarily funded by the Department of Energy’s (DOE’s) Office of Science (the Office of Biological and Environmental Research [BER] Climate Data Informatics Program and the Office of Advanced Scientific Computing Research Next Generation Network for Science Program), the National Oceanic and Atmospheric Administration (NOAA), the National Aeronautics and Space Administration (NASA), and the National Science Foundation (NSF), the European Infrastructure for the European Network for Earth System Modeling (IS-ENES), and the Australian National University (ANU). Support also comes from other U.S. federal and international agencies.
The federation works across multiple worldwide data centers and spans seven international network organizations to provide users with the ability to access, analyze, and visualize data using a globally federated collection of networks, computers, and software. Its architecture employs a series of geographically distributed peer nodes that are independently administered and united by common federation protocols and application programming interfaces (APIs). The full ESGF infrastructure has now been adopted by multiple Earth science projects and allows access to petabytes of geophysical data, including the Coupled Model Intercomparison Project (CMIP; output used by the Intergovernmental Panel on Climate Change assessment reports), multiple model intercomparison projects (MIPs; endorsed by the World Climate Research Programme [WCRP]), and the Accelerated Climate Modeling for Energy (ACME; ESGF is included in the overarching ACME workflow process to store model output). ESGF is a successful example of integration of disparate open-source technologies into a cohesive functional system that serves the needs of the global climate science community.
Data served by ESGF includes not only model output but also observational data from satellites and instruments, reanalysis, and generated images.
OPeNDAP servers like Hyrax and TDS can easily support common single-sign-on authentication protocols using the Apache httpd and related software; adding support for these protocols to clients can be more challenging
Authors: James Gallagher, Nathan Potter (OPeNDAP, Inc.), Ben Evans (NCI, ANU)
OPeNDAP, in conjunction with the Australian National University, documented the installation process needed to add authentication to OPeNDAP-enabled data servers (Hyrax, TDS, etc.) and examined 13 OPeNDAP clients to determine how best to add authentication using LDAP, Shibboleth and OAuth2 (we used NASA’s URS).
We settled on a server configuration (architecture) that uses the Apache web server and a collection of open-source modules to perform the authentication and authorization actions. This is not the only way to accomplish those goals, but using Apache represents a good balance between functionality and leveraging existing work that has been well vetted and includes support for a wide variety of web services, including those that depend on a servlet engine such as Tomcat (which both Hyrax and TDS do). Our work shows how LDAP, OAuth2 and Shibboleth can all be accommodated using this readily available software stack. Also important is that the Apache software is very widely used and is fairly robust, which is extremely important for security software components.
In order to make use of a server requiring authentication, clients must support the authentication process. Because HTTP has included authentication for well over a decade, and because HTTP/HTTPS can be used by simply linking programs with a library, both the LDAP and OAuth2/URS authentication schemes have almost universal support within the OPeNDAP client base. The clients, i.e. the HTTP client libraries they employ, understand how to submit the credentials to the correct server when confronted by an HTTP/S Unauthorized (401) response. Interestingly, OAuth2 can achieve its SSO objectives while relying entirely on normative HTTP transport. All 13 of the clients examined worked.
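The challenge-response behaviour described above can be sketched in a few lines. This example assumes HTTP Basic credentials, a common front end for LDAP-backed servers; the `send` transport and `fake_server` are hypothetical stand-ins for a real HTTP library and server, and the redirects involved in an OAuth2/URS flow are not shown.

```python
import base64

def basic_auth_header(username, password):
    """Value of the Authorization header for HTTP Basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

def get_with_auth(send, url, username, password):
    """Try anonymously first; resend with credentials after a 401.

    `send(url, headers)` is a hypothetical transport returning (status, body).
    """
    status, body = send(url, {})
    if status == 401:
        headers = {"Authorization": basic_auth_header(username, password)}
        status, body = send(url, headers)
    return status, body

def fake_server(url, headers):
    """Stand-in for a protected data server (illustration only)."""
    if headers.get("Authorization") == basic_auth_header("alice", "secret"):
        return 200, "data"
    return 401, "Unauthorized"

print(get_with_auth(fake_server, "https://example.org/data.nc.dds", "alice", "secret"))
# (200, 'data')
```

Because this logic lives entirely in the HTTP layer, a client gains it simply by linking against an HTTP library that implements it.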
The situation with Shibboleth is different. While Shibboleth does use HTTP, it also requires the client to either scrape a web page or support the SAML2.0 ECP profile, which, for programmatic clients, means using SOAP messages. Since working with SOAP is outside the scope of HTTP, support for Shibboleth must be added explicitly into the client software. Some of the potential burden of enabling OPeNDAP clients to work with Shibboleth may be mitigated by getting both NetCDF-C and NetCDF-Java libraries to use the Shibboleth ECP profile. If done, this would get 9 of the 13 clients we examined working.
Authors: Kelsey A Druken, Claire Trenham, Jingbo Wang (NCI, ANU), Irina Bastrakova (Geoscience Australia), Ben Evans, Lesley Wyborn (NCI, ANU), Alex Ip, Yvette Poudjom Djomani (Geoscience Australia)
The National Computational Infrastructure (NCI) hosts one of Australia’s largest repositories (10+ PBytes) of research data, colocated with a petascale High Performance Computer and a highly integrated research cloud. Key to maximizing benefit of NCI’s collections and computational capabilities is ensuring seamless interoperable access to these datasets. This presents considerable data management challenges across the diverse range of geoscience data; spanning disciplines where netCDF-CF is commonly utilized (e.g., climate, weather, remote-sensing), through to the geophysics and seismology fields that employ more traditional domain- and study-specific data formats. These data are stored in a variety of gridded, irregularly spaced (i.e., trajectories, point clouds, profiles), and raster image structures. They often have diverse coordinate projections and resolutions, thus complicating the task of comparison and inter-discipline analysis. Nevertheless, much can be learned from the netCDF-CF model that has long served the climate community, providing a common data structure for the atmospheric, ocean and cryospheric sciences. We are extending the application of the existing Climate and Forecast (CF) metadata conventions to NCI’s broader geoscience data collections.
We present simple implementations that can significantly improve interoperability of the research collections, particularly in the case of line survey data. NCI has developed a compliance checker to assist with the data quality across all hosted netCDF-CF collections. The tool is an extension to one of the main existing CF Convention checkers, that we have modified to incorporate the Attribute Convention for Data Discovery (ACDD) and ISO19115 standards, and to perform parallelised checks over collections of files, ensuring compliance and consistency across the NCI data collections as a whole. It is complemented by a checker that also verifies functionality against a range of scientific analysis, programming, and data visualisation tools. By design, these tests are not necessarily domain-specific, and demonstrate that verified data is accessible to end-users, thus allowing for seamless interoperability with other datasets across a wide range of fields.
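To give a feel for what a parallelised metadata check over a collection involves, here is a minimal sketch. It is not NCI's checker: it only tests for the four global attributes that ACDD 1.3 lists as highly recommended, and it operates on plain dictionaries of attributes (hypothetical file names and values) rather than reading netCDF files.

```python
from concurrent.futures import ThreadPoolExecutor

# Global attributes listed as "highly recommended" in ACDD 1.3.
REQUIRED_GLOBAL_ATTRS = ("title", "summary", "keywords", "Conventions")

def missing_attributes(global_attrs):
    """Return the required attributes that are absent or blank."""
    return [
        name for name in REQUIRED_GLOBAL_ATTRS
        if not str(global_attrs.get(name, "")).strip()
    ]

def check_collection(files):
    """Check every file's attributes in parallel.

    `files` maps a file name to a dict of its global attributes.
    """
    names = list(files)
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(missing_attributes, (files[n] for n in names)))
    return dict(zip(names, results))

# Hypothetical collection of two files.
collection = {
    "survey_a.nc": {"title": "Line survey A", "summary": "Example",
                    "keywords": "geophysics", "Conventions": "CF-1.6, ACDD-1.3"},
    "survey_b.nc": {"title": "Line survey B", "summary": " "},
}
report = check_collection(collection)
print(report)
```

Reporting per file while running over the whole collection is what makes it possible to judge consistency across a collection rather than one file at a time.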
Authors: Michelle Barker (University of Melbourne), Lesley Wyborn (NCI, ANU), Ryan Fraser (CSIRO), Ben Evans (NCI, ANU), Glenn Moloney (NeCTAR), Roger Proctor (University of Tasmania), Aurel Moise (Bureau of Meteorology), Hamish Holewa (Queensland Cyber Infrastructure Foundation)
Across the globe, Virtual Laboratories (VLs), Science Gateways (SGs), and Virtual Research Environments (VREs) are being developed that enable users who are not co-located to actively work together at various scales to share data, models, tools, software, workflows, best practices, etc. Outcomes range from enabling ‘long tail’ researchers to more easily access specific data collections, to facilitating complex workflows on powerful supercomputers.
In Australia, government funding has facilitated the development of a range of VLs through the National eResearch Collaborative Tools and Resources (NeCTAR) program. The VLs provide highly collaborative, research-domain oriented, integrated software infrastructures that meet user community needs. Twelve VLs have been funded since 2012, including the Virtual Geophysics Laboratory (VGL); Virtual Hazards, Impact and Risk Laboratory (VHIRL); Climate and Weather Science Laboratory (CWSLab); Marine Virtual Laboratory (MarVL); and Biodiversity and Climate Change Virtual Laboratory (BCCVL).
These VLs share similar technical challenges, with common issues emerging on integration of tools, applications and access to data collections via both cloud-based environments and other distributed resources. While each VL began with a focus on a specific research domain, communities of practice have now formed across the VLs around common issues, and facilitate identification of best practice case studies, and new standards. As a result, tools are now being shared where the VLs access data via data services using international standards such as ISO, OGC, W3C. The sharing of these approaches is starting to facilitate re-usability of infrastructure and is a step towards supporting interdisciplinary research.
Whilst the focus of the VLs is Australia-centric, by using standards, these environments are able to be extended to analysis on other international datasets. Many VL datasets are subsets of global datasets and so extension to global is a small (and often requested) step. Similarly, most of the tools, software, and other technologies could be shared across infrastructures globally. Therefore, it is now time to better connect the Australian VLs with similar initiatives elsewhere to create international platforms that can contribute to global research challenges.
Authors: Ben Evans (NCI, ANU), Ilia Bermous, Justin Freeman (Bureau of Meteorology), Dale S Roberts, Marshall Ward, Rui Yang (NCI, ANU)
The Australian National Computational Infrastructure (NCI) has a national focus in the Earth system sciences including climate, weather, ocean, water management, environment and geophysics. NCI leads a Program across its partners from the Australian science agencies and research communities to identify priority computational models to scale-up. Typically, these cases place a large overall demand on the available computer time, need to scale to higher resolutions, make excessive use of scarce resources such as large memory or bandwidth, or, in some cases, need to meet requirements for transition to a separate operational forecasting system with set time-windows.
The model codes include the UK Met Office Unified Model atmospheric model (UM), GFDL’s Modular Ocean Model (MOM), both the UK Met Office’s GC3 and Australian ACCESS coupled-climate systems (including sea ice), 4D-Var data assimilation and satellite processing, the Regional Ocean Model (ROMS), and WaveWatch3 as well as geophysics codes including hazards, magnetotellurics, seismic inversions, and geodesy. Many of these codes use significant compute resources both for research applications as well as within the operational systems. Some of these models are particularly complex, and their behaviour had not been critically analysed for effective use of the NCI supercomputer or how they could be improved.
As part of the Program, we have established a common profiling methodology that uses a suite of open source tools for performing scaling analyses. The most challenging cases are profiling multi-model coupled systems where the component models have their own complex algorithms and performance issues. We have also found issues within the current suite of profiling tools, and no single tool fully exposes the nature of the code performance.
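The basic quantities in a strong-scaling analysis are speedup and parallel efficiency relative to a baseline run. The sketch below computes both from wall-clock timings; the timings are invented for illustration and do not come from any of the models named here, and the profiling tools used in the Program are not represented.

```python
def scaling_table(timings):
    """Speedup and parallel efficiency relative to the smallest core count.

    `timings` maps a core count to the wall-clock time (seconds) of the
    same problem run on that many cores.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        # Ideal speedup equals the ratio of core counts.
        efficiency = speedup / (cores / base_cores)
        rows.append((cores, speedup, efficiency))
    return rows

# Illustrative timings only.
timings = {16: 1000.0, 32: 520.0, 64: 290.0, 128: 200.0}
for cores, speedup, efficiency in scaling_table(timings):
    print(f"{cores:4d} cores  speedup {speedup:5.2f}  efficiency {efficiency:5.2f}")
```

Falling efficiency at higher core counts is the usual signal to look more closely at communication, I/O or load imbalance with a profiler.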
As a result of this work, international collaborations are now in place to ensure that improvements are incorporated within the community models, and our effort can be targeted in a coordinated way. The coordinations have involved user stakeholders, the model developer community, and dependent software libraries. For example, we have spent significant time characterising I/O scalability, and improving the use of libraries such as NetCDF and HDF5.
Authors: Jingbo Wang (NCI, ANU), Amir Aryani (ANU), Ben Evans (NCI, ANU), Melanie Barlow (ANU), Lesley Wyborn (NCI, ANU)
Making research data connected, discoverable and reusable is one of the key enablers of the new data-intensive revolution in research. Using the Research Data Switchboard (RD-Switchboard) (http://www.rd-switchboard.org/) on the Australian National Computational Infrastructure (NCI) data collections metadata catalogue, we show how connectivity graphs can provide a possible solution to machine-actionable literature searches to discover links between researchers, publications and datasets (see http://rd-switchboard.nci.org.au). RD-Switchboard is an open and collaborative software solution initiated by the Data Description Registry Interoperability (DDRI) working group of the Research Data Alliance (RDA, https://rd-alliance.org/groups/data-description-registry-interoperability.html). RD-Switchboard connects datasets on the basis of co-authorship or other collaboration arrangements, such as joint funding and grants.
The connections among researchers, publications and datasets can help answer questions like “How many datasets published at NCI have been referenced in research journal articles and which articles?”; “How many researchers and institutes are connected to a given dataset?”; “What derived data products depend on the source reference data at NCI, who generates those derived data products and who uses them?” Hence, NCI incorporated the RD-Switchboard software to help track and analyze the connectivity.
The RD-Switchboard connection report provides the number of connections a dataset has – the more connections a dataset has, the higher the relevance it has within the research community. Through analyzing the connections to datasets, it is also possible to identify high value datasets to researchers and organisations, and help measure the impact that these datasets have had in the published literature.
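Counting connections per dataset amounts to computing node degree in a graph. The following sketch does this over a small edge list; it is not RD-Switchboard's implementation, and the identifiers are hypothetical.

```python
from collections import Counter

def connection_counts(edges, datasets):
    """Count how many edges touch each dataset node.

    `edges` is an iterable of (node_a, node_b) pairs linking datasets,
    researchers and publications; `datasets` is the set of dataset ids.
    """
    counts = Counter()
    for a, b in edges:
        for node in (a, b):
            if node in datasets:
                counts[node] += 1
    return counts

# Hypothetical graph: two datasets, two researchers, two publications.
datasets = {"dataset:A", "dataset:B"}
edges = [
    ("dataset:A", "researcher:1"),
    ("dataset:A", "researcher:2"),
    ("dataset:A", "publication:X"),
    ("dataset:B", "researcher:2"),
    ("publication:X", "researcher:1"),
]
counts = connection_counts(edges, datasets)
print(counts.most_common())
# [('dataset:A', 3), ('dataset:B', 1)]
```

Ranking datasets by this count gives a first, rough indication of which ones are most connected within the graph.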
Authors: Pablo Rozas Larraondo, Ben Evans, Joseph Antony (NCI, ANU)
Earth systems, environmental and geophysics datasets are an extremely valuable source of information about the state and evolution of the Earth. However, different disciplines and applications require this data to be post-processed in different ways before it can be used. For researchers experimenting with algorithms across large datasets or combining multiple data sets, the traditional approach to batch data processing and storing all the output for later analysis rapidly becomes unfeasible, and often requires additional work to publish for others to use. Recent developments in distributed computing using interactive access to significant cloud infrastructure open the door for new ways of processing data on demand, hence alleviating the need for storage space for each individual copy of each product.
The Australian National Computational Infrastructure (NCI) has developed a highly distributed geospatial data server which supports interactive processing of large geospatial data products, including satellite Earth Observation data and global model data, using flexible user-defined functions. This system dynamically and efficiently distributes the required computations among cloud nodes and thus provides a scalable analysis capability. In many cases this completely alleviates the need to preprocess and store the data as products. This system presents a standards-compliant interface, allowing ready accessibility for users of the data. Typical data wrangling problems such as handling different file formats and data types, or harmonising the coordinate projections or temporal and spatial resolutions, can now be handled automatically by this service. The geospatial data server exposes functionality for specifying how the data should be aggregated and transformed. The resulting products can be served using several standards such as the Open Geospatial Consortium’s (OGC) Web Map Service (WMS) or Web Feature Service (WFS), Open Street Map tiles, or raw binary arrays under different conventions. We will show some cases where we have used this new capability to provide a significant improvement over previous approaches.
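Since the server presents a standards-compliant interface, a client can request a rendered product with an ordinary OGC WMS GetMap call. The sketch below builds such a request for WMS 1.3.0; the endpoint and layer name are hypothetical and do not refer to the NCI service.

```python
from urllib.parse import urlencode

def wms_getmap_url(endpoint, layer, bbox, width, height,
                   crs="EPSG:4326", image_format="image/png"):
    """Build a WMS 1.3.0 GetMap request URL.

    For EPSG:4326 in WMS 1.3.0 the bounding box is given in
    latitude/longitude order: (min_lat, min_lon, max_lat, max_lon).
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty string requests the default style
        "CRS": crs,
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return f"{endpoint}?{urlencode(params)}"

# Hypothetical endpoint and layer covering roughly the Australian region.
url = wms_getmap_url("https://example.org/wms", "surface_reflectance",
                     (-44.0, 112.0, -10.0, 154.0), 512, 512)
print(url)
```

Any aggregation or transformation would be specified on the server side; from the client's point of view the result is just another map layer.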
Authors: Lesley A Wyborn (NCI, ANU), Kerstin Lehnert (Columbia University), Jens F Klump (CSIRO), Robert A Arko (Columbia University), Simon J D Cox (CSIRO), Anusuriya Devaraju (CSIRO), Kirsten Elger (Helmholtz Centre), Fiona Murphy (University of Reading), Dirk Fleischer (Helmholtz Centre)
The process of sampling, observing and analyzing physical samples is not unique to the geosciences. Physical sampling (taking specimens) is a fundamental strategy in many natural sciences, typically to support ex-situ observations in laboratories with the goal of characterizing real-world entities or populations. Observations and measurements are made on individual specimens and their derived samples in various ways, with results reported in research publications. Research on an individual sample is often published in numerous articles, based on multiple, potentially unrelated research programs conducted over many years. Even high-volume Earth observation datasets are proxies of real-world phenomena and require calibration by measurements made on position-located, well-described physical samples.
Unique, persistent web-compatible identifiers for physical objects and related sampling features are required to ensure their unambiguous citation and connection to related datasets through web identifiers. Identifier systems have been established within specific domains (e.g., bio, geo, hydro) or different sectors (e.g., museums, government agencies, universities), including the International Geo Sample Number (IGSN) in the geosciences, which has been used for rock, fossil, mineral, soil, regolith, fluid, plant and synthetic materials.
IGSNs are issued through a governance system that ensures they are globally unique. Each IGSN directs to a digital representation of the physical object via the Handle.net global resolver system, the same system used for resolving DOIs. To enable the unique identification of all samples on Planet Earth and of data derived from them, the next step is to ensure IGSNs can either be integrated with comparable identifier systems in other domains/sectors, or introduced into domains that do not have a viable system. A registry of persistent identifier systems for physical samples would allow users to choose which system best suits their needs. Such a registry may also help unify best practice across these multiple systems, enabling consistent referencing of physical samples and of the methods used to link digital data to its sources. IGSNs could be extended into other domains, but additional methodologies of sample collection, curation and processing may need to be considered.
Authors: James Biard (North Carolina Institute for Climate Studies), Jonathan Yu (CSIRO), Mark Hedley (UK Met Office), Simon Cox (CSIRO), Adam Leadbetter (Irish Marine Institute), Nicholas Car (Geoscience Australia), Kelsey Druken (NCI, ANU), Stefano Nativi (CNR Institute of Atmospheric Pollution Research), Ethan Davis (University Corporation for Atmospheric Research)
Geophysical data communities are publishing large quantities of data across a wide variety of scientific domains that increasingly overlap. Whilst netCDF is a common format for many of these communities, it is only one of a large number of data storage and transfer formats. One of the major challenges ahead is finding ways to leverage these diverse datasets to advance our understanding of complex problems.
We describe a methodology, called netCDF-LD (netCDF Linked Data), for incorporating Resource Description Framework (RDF) triples into netCDF files. NetCDF-LD explicitly connects the contents of netCDF files (both data and metadata) with external web-based resources, including vocabularies, standards definitions, and data collections, and through them, a whole host of related information. This approach also preserves and enhances the self-describing essence of the netCDF format and its metadata, whilst addressing the challenge of integrating various conventions into files.
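The general idea of reading file attributes as RDF triples can be sketched in a few lines of Python. This is an illustrative sketch only and not the netCDF-LD specification: it assumes a `prefix__name` attribute naming style with a double-underscore separator, uses a plain dictionary in place of a real netCDF file, and all URIs and the function names `expand` and `attributes_to_triples` are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def expand(term: str, prefixes: Dict[str, str]) -> str:
    """Expand 'prefix__name' to a full URI if the prefix is declared.

    Terms without a declared prefix are returned unchanged.
    """
    prefix, sep, name = term.partition("__")
    if sep and prefix in prefixes:
        return prefixes[prefix] + name
    return term


def attributes_to_triples(subject: str,
                          attributes: Dict[str, str],
                          prefixes: Dict[str, str]) -> List[Triple]:
    """Turn prefixed attributes of one variable into RDF-style triples.

    Attributes whose names carry no declared prefix are skipped, since
    they cannot be linked to an external resource in this sketch.
    """
    triples = []
    for key, value in attributes.items():
        predicate = expand(key, prefixes)
        if predicate == key:
            continue  # plain attribute, no linked-data meaning assumed
        triples.append((subject, predicate, expand(str(value), prefixes)))
    return triples


# Hypothetical prefix table and variable attributes, for illustration only.
prefixes = {"cf": "http://example.org/cf/"}
attributes = {"cf__standard_name": "cf__air_temperature", "units": "K"}

triples = attributes_to_triples("http://example.org/data.nc/tas",
                                attributes, prefixes)
for triple in triples:
    print(triple)
```

Once attributes are expressed as triples with full URIs, they can be loaded into any RDF store alongside triples from other sources, which is what makes reasoning across files and domains possible.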
We present a case study illustrating how reasoning over RDF graphs can empower researchers to discover datasets across domain boundaries.