ALTA 2013 Shared Task Description

Language Technology Programming Competition 2013

Basic Task Description

July 10, 2013 | Version 1

The goal of this task is to recover missing information about word casing and punctuation in English text. Such low-quality text can result from automated speech transcription, optical character recognition (OCR), or hurried writing such as quick notes in minutes, instant messaging, or web forums.

To make this task easier, we have simplified it from a more ambitious task whose goal is to recover all the capitalisation and punctuation marks. For example, given the following text:

... stored at the ucla television archives the archived episodes were telecast march 8 16 and 24 1971 april 1 and ...

We would hope to restore it to:

... stored at the UCLA Television Archives. The archived episodes were telecast: March 8, 16, and 24, 1971, April 1 and ...

In this task we only ask you to predict whether the word in its original form has any characters in uppercase, and whether the word is followed by one of these punctuation marks:

,.;:?!

You do not need to determine what particular characters of the word are in uppercase, or what punctuation mark follows the word.
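The two binary labels described above can be sketched in Python. This is only an illustration: it assumes the original text is split on whitespace so that a punctuation mark stays attached to the word before it, and the name `label_token` is hypothetical, not part of the task materials.

```python
# Punctuation marks listed in the task description.
PUNCT_MARKS = ",.;:?!"


def label_token(token):
    """Return (has_upper, followed_by_punct) for one whitespace-delimited token.

    Assumption: the token comes from the original (cased, punctuated) text,
    with any trailing punctuation mark still attached to the word.
    """
    # True if at least one character of the token is uppercase.
    has_upper = any(ch.isupper() for ch in token)
    # True if the token is non-empty and ends with one of the listed marks.
    followed_by_punct = bool(token) and token[-1] in PUNCT_MARKS
    return has_upper, followed_by_punct
```

For example, under these assumptions `label_token("Archives.")` yields `(True, True)`, while `label_token("telecast:")` yields `(False, True)`.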

You will be given a file that lists a word per line like this:

ID WORD
255 stored
256 at
257 the
258 ucla
259 television
260 archives
261 the
262 archived
263 episodes
264 were
265 telecast
266 march
267 8
268 16
269 and
270 24
271 1971
272 april
273 1
274 and

The first line contains header information that you can ignore. Each of the following lines contains a word ID and the actual word.
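A minimal sketch of reading this format in Python follows. It assumes the ID and the word are separated by whitespace, as in the listing above; the actual data files on the competition website may use a different delimiter. The function name `read_words` is hypothetical.

```python
def read_words(lines):
    """Parse an iterable of lines into a list of (word_id, word) pairs.

    Assumptions: the first line is a header to skip, and each remaining
    line holds an integer ID, whitespace, then the word.
    """
    rows = []
    it = iter(lines)
    next(it, None)  # skip the header line
    for line in it:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        word_id, word = line.split(None, 1)
        rows.append((int(word_id), word))
    return rows
```

An open file object can be passed directly, for example `read_words(open(path, encoding="utf-8"))`, where `path` is wherever the data file was saved.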

You will need to produce a file that lists the IDs of all words that have at least one capitalised character and the IDs of all words that are followed by a punctuation mark. The correct submission for the above example is:

Id,documents
Case,258 259 260 261 266 272
Punct,260 265 267 268 270 271

This submission says that the word with ID 258 has at least one character in uppercase, word 260 both has uppercase and is followed by a punctuation mark, and so on.
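The submission layout shown above could be produced with a sketch like the following. It only mirrors the three-line example in this description; the function name `format_submission` is hypothetical, and the submission instructions on the competition website take precedence if they differ.

```python
def format_submission(case_ids, punct_ids):
    """Build the submission text from two collections of integer word IDs.

    Assumption: IDs are listed in ascending order, separated by single
    spaces, exactly as in the example in the task description.
    """
    lines = [
        "Id,documents",
        "Case," + " ".join(str(i) for i in sorted(case_ids)),
        "Punct," + " ".join(str(i) for i in sorted(punct_ids)),
    ]
    return "\n".join(lines) + "\n"
```

Calling `format_submission([258, 259, 260, 261, 266, 272], [260, 265, 267, 268, 270, 271])` reproduces the example submission above.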

Data Files and Submission

We will use Kaggle in Class for this year's competition (look for the ALTA 2013 Challenge). The data files and submission instructions will be provided on the competition website.

Important Dates

Release of training data	On registration
Deadline for submission of results over test data	4 Oct 2013
Notification of results	11 Oct 2013
Deadline for submission of system description poster	27 Oct 2013