ALTA 2014 Shared Task Description

Language Technology Programming Competition 2014

Basic Task Description

The goal of this task is to identify all mentions of locations in the text of tweets. A location is any specific mention of a country, city, suburb, street, or POI (Point of Interest). A POI can be the name of a shopping centre, such as "Macquarie Centre", or the name of a hospital, e.g., "Ryde Hospital". This is an information extraction task that matters for applications that need to know where people are or which locations they are talking about. The task only requires you to identify which words in the text of a tweet refer to a location; you are not expected to find the location on a map. For example, in the tweets:

"France and Germany join the US and UK in advising their nationals in Libya to leave immediately http://bbc.in/1rVmrDJ"
"Dutch investigators not going to MH17 crash site in eastern Ukraine due to security concerns, OSCE monitors say"
"Seeing early signs of potential flash flooding with stationary storms near St. Marys, Tavistock, Cambridge #onstorm pic.twitter.com/BtogIxgQ5G"

locations are:

France, Germany, US, UK, Libya
MH17 crash site, eastern Ukraine
St. Marys, Tavistock, Cambridge

Locations can be in the text itself, or in hashtags (e.g., #australia), URLs, or sometimes even in mentions (e.g., @australia). Location mentions can span multiple words, and all of these words must be identified; however, partial identification of location names will be rewarded too. For example, if your system identifies only "Ukraine" from "eastern Ukraine", it will be half correct.

You will be given a list of tweet-ids and a script to help you download the tweets from Twitter. If a tweet has been deleted by its author, it will not be retrieved. Your system should find the location mentions and list them all in lowercase, as blank-separated words, next to their tweet-id. For example,

Input id
493450763931512832

retrieves

author: BBCBreaking
tweet text: France and Germany join the US and UK in advising their nationals in Libya to leave immediately http://bbc.in/1rVmrDJ

and your output should be:

493450763931512832,france germany us uk libya

All punctuation in the word containing the location must be removed, including the hash symbol (#).

If a location is repeated in a tweet, you need to number the repeats starting from the second occurrence. For example, if there are three mentions of Australia, then you will have

australia australia2 australia3

If a location has multiple words, separate the words with blank spaces so that, in effect, it does not matter whether it is one location expression with two words or two different location expressions. Thus, if a tweet with ID "1234" has two location expressions, "London" and "United States", the following are valid and equivalent descriptions:

1234,london united states
1234,united london states

If a tweet does not have any location mention, then use the marker NONE.
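
The output rules above (lowercasing, punctuation removal, numbering of repeats, and the NONE marker) can be sketched in a few lines of Python. This is only an illustration, not the organisers' script: the function names are hypothetical, punctuation is taken to mean ASCII punctuation, and repeats are numbered per word rather than per location expression, which is an assumption based on the word-level output format.

```python
import string
from collections import Counter


def normalise_word(word):
    """Lowercase a word and strip ASCII punctuation, including '#' and '@'."""
    return "".join(ch for ch in word.lower() if ch not in string.punctuation)


def format_submission_line(tweet_id, location_mentions):
    """Build one 'tweet-id,words' line from mentions given in tweet order.

    Assumption: repeated words are numbered per word (australia, australia2, ...),
    since the output is a flat list of words.
    """
    seen = Counter()
    words = []
    for mention in location_mentions:
        for raw in mention.split():
            word = normalise_word(raw)
            if not word:
                continue  # the token was punctuation only
            seen[word] += 1
            # The first occurrence is left bare; later ones get a counter suffix.
            words.append(word if seen[word] == 1 else f"{word}{seen[word]}")
    return f"{tweet_id}," + (" ".join(words) if words else "NONE")


print(format_submission_line("493450763931512832",
                             ["France", "Germany", "US", "UK", "Libya"]))
```

For the example tweet this prints the line shown earlier, and an empty list of mentions yields the NONE marker.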

Evaluation

We will use Kaggle in Class to evaluate the systems using F-measure on the word level.
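
To check a system locally before submitting, a word-level F-measure for a single tweet can be approximated as below. This is a sketch under assumptions: the exact Kaggle in Class scoring (for example, how scores are averaged across tweets) is not specified here, and the function name is hypothetical.

```python
def word_level_f1(gold_words, predicted_words):
    """F1 between the gold and predicted word lists of a single tweet.

    Words are compared as sets; numbered repeats such as 'australia2'
    keep every word unique within a tweet, so nothing is lost.
    """
    gold = set(gold_words)
    predicted = set(predicted_words)
    if not gold and not predicted:
        return 1.0  # nothing to find and nothing predicted
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Predicting only "ukraine" for the gold mention "eastern ukraine":
print(word_level_f1(["eastern", "ukraine"], ["ukraine"]))
```

In this sketch the partial match has precision 1.0 and recall 0.5, giving an F1 of about 0.667.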

Data Files and Submission

We will use Kaggle in Class for this year's competition (look for the ALTA 2014 Challenge). The data files and submission instructions will be provided on the competition website.

There is a training set and a test set. The training set contains 2000 tweets sorted by time, together with their location mentions. The format of this file is exactly the same as the format of the submission file. The test set contains just over 1000 tweets sorted by time, this time without the location mentions. Your task is to find the location mentions in this test set and submit the results to Kaggle in Class. The timestamps of the tweets in the test set are later than those in the training set, to model a realistic scenario where we train on known tweets and want to predict the location mentions in future tweets.
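
Because the test tweets are later in time than the training tweets, one way to mimic this setup during development is to hold out the most recent part of the training set. The sketch below assumes the training rows are already in time order and follow the "tweet-id,words" format described above; the function names and the 20% hold-out are illustrative choices, not part of the task definition.

```python
def parse_line(line):
    """Split one 'tweet-id,words' line into (tweet_id, list_of_words)."""
    tweet_id, _, words = line.strip().partition(",")
    # The NONE marker means the tweet has no location mention.
    return tweet_id, ([] if words == "NONE" else words.split())


def chronological_split(rows, dev_fraction=0.2):
    """Keep the earliest rows for training and the latest for development."""
    dev_size = int(len(rows) * dev_fraction)
    cut = len(rows) - dev_size
    return rows[:cut], rows[cut:]


lines = ["1234,london united states", "1235,NONE", "1236,sydney"]
rows = [parse_line(line) for line in lines]
train_rows, dev_rows = chronological_split(rows, dev_fraction=0.34)
print(len(train_rows), len(dev_rows))
```

A random split would let the system see tweets from the same period as the held-out ones, which the real test set does not allow.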

Important Dates

Release of training data	On registration
Deadline for submission of results over test data	21 Oct 2014
Notification of results	24 Oct 2014
Deadline for submission of system description poster	7 Nov 2014