Teaching computers to understand language is no easy task. Humans pick up language instinctively as children, but building that ability into a computer system is very hard indeed. To work on this major research challenge, scientists turn to Natural Language Processing (NLP): a fusion of linguistics, computer science and machine learning that uses a variety of computational methods, often running on supercomputers, for the automated understanding, analysis and creation of spoken, written and signed language.

We rely on language to communicate, to take notes, to share information, to learn and to remember, be it through text or audio and video recordings. In amongst our daily utterances and copious text records lies valuable information, if only we could extract and understand it.

Associate Professor Hanna Suominen from The Australian National University (ANU) is developing better ways of extracting information and learning from language in all its forms, across domains including health, language education and language description. Hospital records, for example, are a treasure trove of diagnostic and treatment information that currently requires human input, transcription and description. An automated system could collate and summarise patient records to avoid communication or misdiagnosis errors during nurse handovers.
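
To make the idea concrete, the sketch below runs a generic, off-the-shelf summarisation model through the Hugging Face transformers library on an invented handover note. It illustrates the general technique only, not the ANU team's clinical system, which would also require de-identification and models tuned to medical text.

```python
# Minimal sketch: summarising a fictional nursing handover note with a
# general-purpose transformer model. Illustrative only; not a clinical tool.
from transformers import pipeline

summariser = pipeline("summarization")  # downloads a default model on first use

handover_note = (
    "Patient admitted overnight with shortness of breath. Oxygen saturation "
    "improved from 88% to 96% on 2L via nasal prongs. Chest X-ray shows mild "
    "left lower lobe consolidation; IV antibiotics started at 06:00. Patient "
    "remains afebrile, is eating and drinking, and is due for review by the "
    "medical team this afternoon."
)

summary = summariser(handover_note, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```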

She says, “NLP allows us to train machines to take on the laborious, sometimes impossible, task of reading and summarising thousands or millions of pages of text. This way, humans can concentrate on the skilled tasks they are good at, using their unique skills where it counts.”

[Image: books stacked on a bookshelf, spines facing outwards]
Natural Language Processing allows researchers to train computers to read and summarise large volumes of text, helping them learn about languages, document their structures, and build new tools for speeding up text-intensive tasks.

As part of the ANU's Our Health in Our Hands Grand Challenge, Professor Suominen belongs to a team aiming to develop new personalised health technologies in collaboration with patients, clinicians and health care providers. Leading the Big Data program, her group is working to turn the analysis of large-scale patient data into secure diagnostic and monitoring systems and patient-friendly information. For example, PhD student Ms Sandaru Seneviratne is developing a machine learning method for turning dense medical information about Type 1 Diabetes into a simplified form that adolescents can learn from when diagnosed with the disease. The simplified material is also expected to help parents, caretakers and teachers understand the condition, and to help adolescents manage their daily activities and avoid severe medical consequences.
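
One small ingredient of that kind of simplification can be sketched in plain Python: replacing specialist terms with plain-language equivalents from a hand-built glossary. The glossary entries and the rule-based substitution below are invented for illustration and are not Ms Seneviratne's machine learning method.

```python
# Toy sketch: lexical simplification by glossary substitution. The terms and
# their plain-language equivalents below are illustrative examples only.
import re

GLOSSARY = {
    "hypoglycaemia": "low blood sugar",
    "hyperglycaemia": "high blood sugar",
    "subcutaneous injection": "injection under the skin",
}

def simplify(text: str) -> str:
    """Replace glossary terms with plain-language wording, case-insensitively."""
    for term, plain in GLOSSARY.items():
        text = re.sub(rf"\b{re.escape(term)}\b", plain, text, flags=re.IGNORECASE)
    return text

print(simplify("Watch for signs of hypoglycaemia and hyperglycaemia after exercise."))
# -> "Watch for signs of low blood sugar and high blood sugar after exercise."
```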

University of Melbourne Laureate Professor Tim Baldwin says, “Modern NLP is built on big models that rely on Graphics Processing Units (GPUs) for their processing. The extreme parallelism that GPUs offer makes them orders of magnitude faster than the workflows of a decade ago. In particular, researchers use large GPU clusters such as the one at NCI for the training and parameter tuning needed to build the best model for a given use case.”
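
The training-and-tuning loop Professor Baldwin describes can be sketched in a few lines of PyTorch: train the same model several times at different settings and keep the best run. The tiny classifier and synthetic data below are placeholders for illustration; real NLP models are many orders of magnitude larger, which is why large GPU clusters matter.

```python
# Sketch of GPU training plus a simple hyperparameter sweep. The tiny model
# and random data are stand-ins; only the overall workflow is the point.
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Synthetic "dataset": 512 examples, 100 features, 2 classes.
X = torch.randn(512, 100, device=device)
y = torch.randint(0, 2, (512,), device=device)

def train(lr: float, epochs: int = 20) -> float:
    """Train a small classifier at one learning rate; return its final loss."""
    model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 2)).to(device)
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimiser.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimiser.step()
    return loss.item()

# Each setting is an independent training run, so a cluster can try many at once.
results = {lr: train(lr) for lr in (1e-2, 1e-3, 1e-4)}
best_lr = min(results, key=results.get)
print(f"best learning rate: {best_lr} (final loss {results[best_lr]:.3f})")
```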

Another use of NLP, and the language models that underpin it, is in understanding the complexities of how languages work. NLP lets us not only translate and generate text automatically, but also understand and document some of the more intricate processes of language. ANU PhD student Ms Saliha Muradoglu is studying the remarkable complexity of the Nen language of Papua New Guinea, spoken by only around 300 people. Transitive verbs in Nen can take up to 1740 different forms, a consequence of how much of the language's information is encoded in its verbs. Understanding how Nen's rules interact to form grammatical verbs helps document exactly how the language works, and safeguard it for future generations.
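
The size of such a verb paradigm comes from combinatorics: each combination of grammatical features can demand its own form. The feature categories and values in the sketch below are invented for illustration, not a description of Nen's actual morphology; they simply show how a few small feature sets multiply into hundreds or thousands of paradigm cells.

```python
# Back-of-the-envelope sketch of paradigm size: every combination of
# grammatical features is a distinct cell. Features here are invented,
# not Nen's real system.
from itertools import product

FEATURES = {
    "subject": ["1sg", "2sg", "3sg", "1du", "2du", "3du", "1pl", "2pl", "3pl"],
    "object":  ["1", "2", "3"],
    "tense":   ["past", "present", "future"],
    "aspect":  ["perfective", "imperfective"],
}

paradigm = list(product(*FEATURES.values()))
print(f"{len(paradigm)} feature combinations, e.g. {paradigm[0]}")
# 9 * 3 * 3 * 2 = 162 cells in this toy paradigm; richer feature systems,
# like Nen's, reach into the thousands.
```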

Natural Language Processing is a research area that has benefited hugely from the jumps in computational performance, especially of GPUs, over the past ten years. It has the potential to significantly improve automated systems for text analysis and translation used in cross-cultural communication, medical processes, language education and much more. By teaching computers to perform automated analysis and summarisation tasks, we can free humans to do the complex things they are best at, and move data-intensive research forward.