Real-time Twitter mining
ARC Future Fellow Professor Tim Baldwin and his team from The University of Melbourne have trained computer models, run at NCI, to search through millions of tweets for clues to their authors’ whereabouts.
“Basically we are training our models to predict location from text,” explains Professor Baldwin. “We want to know whether we can predict the tweeter’s home town based on what they’ve said in their tweets.”
The first step is to find Twitter users who have enabled automatic geotagging.
“Your smart phone has a chip in it that, if enabled, will tag each post with your location,” says Professor Baldwin.
“We take data from a few thousand geotagging users that we’re confident are based in a particular city and use their tweets to train up our model.”
Once the model knows which words are linked to certain cities, it can predict where users are tweeting from with around 30 per cent accuracy.
“It’s not just people mentioning city names,” explains Professor Baldwin. “There are all sorts of surprising words associated with particular regions; if I mention trams and cafes and complain about cold mornings at particular times of the year, I’m probably in Melbourne.”
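To make the idea of words being linked to cities concrete, here is a minimal sketch of a word-based location classifier. It is an illustration only: the article does not say which model the team uses, so a simple naive Bayes classifier is assumed here, and the function names (tokenize, train, predict) and the four toy tweets are invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a tweet and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labelled_tweets):
    """Count how often each word appears in tweets from each city."""
    word_counts = defaultdict(Counter)  # city -> word -> count
    tweet_counts = Counter()            # city -> number of training tweets
    for city, text in labelled_tweets:
        tweet_counts[city] += 1
        word_counts[city].update(tokenize(text))
    return word_counts, tweet_counts


def predict(text, word_counts, tweet_counts):
    """Return the city with the highest naive Bayes log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_tweets = sum(tweet_counts.values())
    best_city, best_score = None, -math.inf
    for city in sorted(tweet_counts):
        # Prior: share of training tweets that came from this city.
        score = math.log(tweet_counts[city] / total_tweets)
        city_total = sum(word_counts[city].values())
        for word in tokenize(text):
            # Add-one smoothing so an unseen word does not rule a city out.
            score += math.log(
                (word_counts[city][word] + 1) / (city_total + len(vocab))
            )
        if score > best_score:
            best_city, best_score = city, score
    return best_city


# Invented toy data standing in for tweets from geotagged users.
TRAINING = [
    ("Melbourne", "missed my tram again, grabbing a coffee at the cafe"),
    ("Melbourne", "such a cold morning on the tram"),
    ("Sydney", "ferry across the harbour then a swim at the beach"),
    ("Sydney", "sunny morning at the beach"),
]

word_counts, tweet_counts = train(TRAINING)
print(predict("cold tram ride to the cafe", word_counts, tweet_counts))
```

With this toy data, words such as "tram", "cafe" and "cold" pull the prediction towards Melbourne even though the city is never named. A real system would train on far more tweets and cities than this sketch does.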
The models can also detect the language the post is written in – if it’s in Finnish, the chances are the tweeter is in Finland.
The more information the model has access to, the more powerful it becomes.
“Once you start adding the account metadata like the user name and the user-declared location, you can get up to about 50 per cent accuracy at the city level.”
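One plausible way to let a text model use account metadata is to tag every token with the field it came from, so that "melbourne" in a user-declared location counts as a different feature from "melbourne" in a tweet. The sketch below assumes that approach; the article does not describe how the team actually combines these signals, and the function name combined_features and the field prefixes are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into word or number tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def combined_features(tweet_text, user_name="", declared_location=""):
    """Build one feature list from tweet text plus account metadata.

    Each token is prefixed with its source field (an assumed convention),
    so a classifier can weight the same word differently per field.
    """
    features = [f"text:{tok}" for tok in tokenize(tweet_text)]
    features += [f"name:{tok}" for tok in tokenize(user_name)]
    features += [f"loc:{tok}" for tok in tokenize(declared_location)]
    return features


print(combined_features(
    "Cold morning on the tram",
    user_name="melb_foodie",
    declared_location="Melbourne, Australia",
))
```

The resulting feature list can be fed to any bag-of-words classifier in place of plain tweet tokens.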
Professor Baldwin says the results will be of interest to privacy-conscious Twitter users.
“We want to put this information into the hands of Twitter users so they can make decisions about how they use their accounts. If you are concerned about people finding out where you live, you might want to think about what you’re posting publicly,” he says.
The sheer volume of data required for this research was one of the main reasons Professor Baldwin and co-Chief Investigator Assistant Professor Paul Cook, from the University of New Brunswick, applied to use NCI’s facilities.
“Continuous streaming of publicly available Twitter data very quickly fills up terabytes of storage, which NCI has the capacity to host,” he says.
“NCI has also been phenomenally helpful for training lots of models simultaneously. We’ve got access to a lot of computing power so we can train hundreds of different models in parallel.”