Developed by the NCI Training Team, this in-person course showcases how you can use Natural Language Processing for literature analysis and HPC workflows.

Natural language processing (NLP), the interpretation of human-language text by machines, has changed data analytics across many industries. It is the AI-driven process of making human language input decipherable to software. We will showcase NLP use cases that are relevant to the HPC and STEM communities. A concern common to all researchers is literature analysis: reviewing and summarising papers is traditionally done by hand at the early stage of a research project, and it is time-consuming and labour-intensive. Automatic text mining can speed up this process and can also provide a more comprehensive overview by extracting the relevant papers from a global, connected scholarly database. In this course, we will teach you how to clean text data, analyse the text, and produce a quick overview of a whole literature database.


Basic programming experience with Python is highly recommended. Familiarity with Python text-processing packages such as NLTK is an advantage.

Attendees will ideally know some basic theory of Machine Learning and Deep Learning, and intend to use AI/ML and supercomputers to boost their research.

We will use the NCI ARE service and the Gadi Supercomputer. Attendees are encouraged to review the ARE User Guide for background information.


This course series is designed to help researchers apply NLP to text mining and take advantage of the Gadi supercomputer to boost their research. It aims to help attendees:

  • Understand the basics of NLP

  • Understand the pre-processing steps of text data 

  • Understand how to apply basic NLP techniques

  • Understand how to run NLP applications on Gadi

Learning Outcomes

  • Know how to use a Python machine learning package: scikit-learn
  • Know how to use a Python deep learning platform: TensorFlow

  • Know how to set up a Python environment on Gadi

  • Know how to do text data processing: Lemmatization, Stemming, and Sentiment Analysis.
  • Know popular topic modeling methods: LDA, k-means clustering, t-SNE

  • Know popular text mining methods: Summarization, Topic Modeling, Text Classification and Keyword Extraction.

  • Know popular DL tools: Transformer

  • Know how to distribute model training on Gadi
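
To give a flavour of the text cleaning and keyword extraction outcomes listed above, here is a minimal sketch using only the Python standard library. It is not taken from the course material: the stop-word list, the function names `clean_text` and `top_keywords`, and the sample abstracts are all hypothetical, and a real workflow would more likely rely on packages such as NLTK or scikit-learn.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption for this sketch;
# real projects would use a fuller list, for example the one shipped with NLTK).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is",
              "for", "on", "with", "by", "we"}


def clean_text(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def top_keywords(documents, n=3):
    """Return the n most frequent cleaned tokens across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(clean_text(doc))
    return [word for word, _ in counts.most_common(n)]


# Hypothetical abstracts standing in for a literature database.
abstracts = [
    "Topic modelling of climate literature with LDA.",
    "We apply topic modelling to summarise the climate literature.",
    "Distributed training of climate models on a supercomputer.",
]

keywords = top_keywords(abstracts, n=3)
print(keywords)
```

Raw term frequency is the simplest possible keyword score; weighting schemes such as TF-IDF usually give more informative keywords on larger collections.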