Talking Shop: Making the most of the data deluge with GSKY
NCI has created the new GSKY data service and software to make analysis and management of world-leading earth observation datasets easier than ever. Read on to find out the story behind this revolutionary data service.
Problem: Data Wrangling
Much like sustained wet weather, the ongoing ‘data deluge’ – a figurative flood of information – presents the research community with both enormous opportunities and not-so-trivial challenges.
A staggering amount of data has been collected from different sources and continues to grow at an exponential rate. The US Landsat satellites have been imaging the planet on a 16-day repeat cycle since the early 1970s, joined more recently by missions such as MODIS. The Japanese Himawari-8 satellite takes images over our region every 10 minutes, and the European-led Sentinel satellites image at a resolution of 10 metres.
This data is then combined with other data, such as from computational models or recorded observations, and data outputs are created and shared between scientists as part of their analysis.
NCI is working to make such data available for our country’s most advanced research in a form that is analysable over any time period, at any location, and at any available resolution.
To feed NCI’s supercomputer of more than 85,000 CPU cores, data needs to flow quickly to and from these processors. This is achieved through HPC technology that tightly integrates a high-performance, scalable storage system with nearly 40 petabytes of capacity across more than 15,000 enterprise-grade hard disk drives.
To support data sharing with other international supercomputing centres, NCI routinely transfers data across intercontinental networks at speeds of up to 100 gigabits per second (1,000 times faster than the NBN), helping to keep pace with this exponentially growing data.
Most of these technical details remain behind-the-scenes, but users are confronted with the ‘data wrangling’ challenge: how to analyse and transform data between a wide range of file formats and data types during their research process.
Data wrangling is an occupational necessity for most scientists, and can consume much more than half their time. Even the simplest case of comparing two satellite images from the same spatial region over different seasons has historically involved a time-consuming process of retrieval and processing.
However, the complexity scales exponentially when considering comparisons between regions of interest with different data sources altogether – requiring transformation of different coordinate projections as well as stitching together of data to address the whole region in a scientifically correct and seamless manner.
Even a small spatial area of interest needs many such image tiles, but when the analysis is looking for changes over months or years, the number of data files can easily increase to tens of thousands.
Such analysis then requires sifting through the data, applying computational algorithms, and managing the large outputs that result.
Historically, this problem could only be solved by batch-processing: combining and analysing the different data, then storing the results for potential future use cases. However, this is not feasible for most research. Batch-producing derivative data is a drain on resources, spending valuable disk storage on derivatives that may never be required.
Also, many of the users who need this data are not familiar with HPC systems and supercomputers. Instead, they are used to desktop systems running domain-tailored GIS software, and often need to combine it with very precise data they have acquired for their area of interest.
NCI’s Research Engagement and Initiatives Team has approached part of this problem by reconsidering the options for on-demand processing. Users can draw on NCI’s high-performance computing capability, which dynamically applies algorithms and processing, through well-known data service protocols that are now commonplace in environmental GIS packages and in widely used programming environments such as scientific Python.
As a result, NCI created a scalable, distributed geospatial data server known as ‘GSKY’ (pronounced ji-skee). In essence, GSKY combines the perks of the user interfaces found in contemporary mapping frontends (similar to Google Maps) with the higher-dimensional geospatial data stored at NCI.
Even if the requests encompass a large geographical area, NCI’s service can process queries in milliseconds – or virtually instantaneously, as far as the user is concerned.
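To give a flavour of what one of these well-known data service protocols looks like, here is a minimal sketch of a standard OGC WMS GetMap query built in Python. The endpoint path, layer name and parameter values are illustrative assumptions for this sketch, not documented GSKY values:

```python
from urllib.parse import urlencode

# Hypothetical service URL and layer name, for illustration only.
ENDPOINT = "http://gsky.nci.org.au/ows"

params = {
    "service": "WMS",
    "version": "1.3.0",
    "request": "GetMap",
    "layers": "landsat_nbar",       # illustrative layer name
    "crs": "EPSG:4326",             # geographic lat/lon coordinates
    "bbox": "-44,112,-10,154",      # a box roughly covering Australia
    "width": 512,
    "height": 512,
    "format": "image/png",
    "time": "2017-01-15T00:00:00Z", # request a specific point in time
}

# The client's entire job is to build this URL; the server does the rest.
url = ENDPOINT + "?" + urlencode(params)
print(url)
```

Any GIS package or browser-based mapping frontend that speaks WMS builds requests of exactly this shape, which is why GSKY can serve so many existing tools without custom client code.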
Geospatial (mapping) data is only the beginning – GSKY has the potential to improve the way we analyse many more types of data.
Distributed Pipelines, in detail
GSKY relies on several software systems developed by the NCI team, together with better ways of organising the datasets.
For each data source, the underlying data are first organised into versioned timeseries datasets. NCI then uses software called MAS (Metadata Attribute Search) that scans and stores all of the metadata associated with the data files and makes it available to software (such as GSKY) requiring extremely fast and deep search. MAS is kept up to date by ‘crawlers’ that seek out new or modified data across NCI’s petabytes of data collections.
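The core idea behind such a metadata index can be sketched in a few lines of Python. This toy version (all names, fields and file paths are invented for illustration, and bear no relation to the real MAS schema) records each file's time range and bounding box, then answers the kind of "which files cover this place and time?" question the service needs answered quickly:

```python
from dataclasses import dataclass

# Toy stand-in for a metadata index; the real system stores far richer
# attributes and searches them with a proper database, not a Python list.
@dataclass
class FileRecord:
    path: str
    t_start: str  # ISO date the file's data begins
    t_end: str    # ISO date the file's data ends
    bbox: tuple   # (min_lon, min_lat, max_lon, max_lat)

def covers(rec, lon, lat, date):
    """True if the file's metadata says it contains this point and date."""
    min_lon, min_lat, max_lon, max_lat = rec.bbox
    return (min_lon <= lon <= max_lon and
            min_lat <= lat <= max_lat and
            rec.t_start <= date <= rec.t_end)

index = [
    FileRecord("ls8_2017_01.nc", "2017-01-01", "2017-01-31", (110, -45, 155, -10)),
    FileRecord("ls8_2017_02.nc", "2017-02-01", "2017-02-28", (110, -45, 155, -10)),
]

# "Which files cover Canberra in mid-January 2017?"
hits = [r.path for r in index if covers(r, 149.13, -35.28, "2017-01-15")]
print(hits)  # ['ls8_2017_01.nc']
```

The crawlers' job, in these terms, is simply to keep `index` complete and current as files appear or change, so that queries never touch the data files themselves until the hit list is known.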
GSKY’s underlying compute engine implements a workflow that operates as a scalable, distributed processing system, taking advantage of parallel processing across hundreds of CPU cores.
The internal GSKY pipeline is composed of distinct modules that are networked to create a manageable workflow.
The first step of the workflow is to determine the user’s intentions through the given parameters – that is, the location and timeframe of the requested information. The engine examines this request, and uses the indexing system to identify the files that contain the relevant data.
The second step then extracts the required data from each file found, and makes the transformations necessary to fit the user’s request. This module is special in that its workload can be distributed among many compute nodes in the cluster, a design decision the developers made to mitigate the high CPU and I/O usage they observed when running this module.
Remote Procedure Calls (RPCs) split the work across a cluster of dedicated real-time nodes, and then reconstruct the information once the processing is completed.
These individual pieces of data are then merged, scaled and rendered, before being sent back to applications as either images or raw data files via commonly used network query protocols.
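The fan-out-and-merge pattern described above can be sketched in Python, with an in-process thread pool standing in for GSKY's RPC calls to dedicated worker nodes, and a trivial doubling transform standing in for the real data extraction. Everything here is a simplified illustration of the shape of the workflow, not the actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the distributed extraction step: in the real system each
# "tile" job would be an RPC to a worker node that reads one file's data
# and transforms it to match the user's request.
def extract(tile):
    # Placeholder transform: pretend each tile yields processed pixel values.
    return [v * 2 for v in tile["pixels"]]

tiles = [
    {"file": "ls8_2017_01.nc", "pixels": [1, 2, 3]},
    {"file": "ls8_2017_02.nc", "pixels": [4, 5, 6]},
]

# Fan the per-file work out across workers, then gather the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(extract, tiles))

# Merge the per-file results into the single response sent to the client.
merged = [v for part in parts for v in part]
print(merged)  # [2, 4, 6, 8, 10, 12]
```

Because `map` returns results in submission order, the merge step can simply concatenate them; the real pipeline must additionally reproject and stitch tiles so the combined output is spatially seamless.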
What about that name?
The Apollo moon landings were supported by a computer interface known as DSKY. Astronauts could input data and commands into the keypad and see the results returned on an electronic display. In much the same way, GSKY is an interface that allows human manipulation of deeply buried geospatial data. Using GSKY, a user can make complex requests and see the results in their web browser in near real-time.
GSKY cannot navigate its users to the moon – it can, however, help us understand it.
Staff at NCI are already looking to the next evolution of GSKY. Increasing the number of datasets and services that GSKY employs will broaden its usage for researchers needing access to new and different types of geospatial information, and may also extend GSKY’s usefulness across multiple scientific disciplines.
With the buzz surrounding machine learning, deep learning and artificial intelligence, GSKY offers a tantalising glimpse into the future of this flourishing field of computer science. Just like humans, the algorithms that enable machine learning require access to large, pre-prepared data collections – something that GSKY makes short work of.
The GSKY service is available at: http://gsky.nci.org.au/