Language Technology Programming Competition 2011

2011 Shared Task Description

Basic Task Description

June 21, 2011 | Version 1

The basic task is to build an automatic evidence grading system for evidence-based medicine. Evidence-based medicine is a medical practice that requires practitioners to search the medical literature for evidence when making clinical decisions. Practitioners are also required to grade the quality of the extracted evidence on a chosen scale. The goal of the grading system is to automatically determine the grade of a piece of evidence given the article abstract(s) from which it was extracted.

You will be provided with: (1) a set of training documents; (2) a set of development documents; and (3), closer to the submission deadline, a set of test documents. For the training and development sets, you will additionally have access to the evidence grades. For the test documents, the grades will not be provided.

The grading scale used for this task is the Strength of Recommendation Taxonomy (SORT). This taxonomy has three grades: A (strong), B (moderate) and C (weak). The grade of a piece of evidence depends on multiple factors. Information about this grading scale can be found in the paper by Ebell et al. (2004).

The grades used for this task have been generated by medical experts. Your task is to implement a grading system using the training and development datasets, and then run it over the test documents to determine the grade of each piece of evidence.

Data Files and Format

The training and development sets will contain:

  1. A text file containing the evidence IDs, a grade for each evidence and one or more abstract IDs for each evidence. The first few lines of this text file may look like this:

    41711 B 10553790 15265350
    53581 C 12804123 16026213 14627885
    53583 B 15213586
    52401 A 15329425 9058342 11279767

  2. A zip file containing the abstracts. The abstracts will have file names of the form id.xml. For example, the first abstract of evidence 41711 (shown above) will have the name 10553790.xml. All abstracts will be in the XML format used by PubMed.
  3. A Python evaluation script, which takes the following arguments:
    • Required arguments:

      -o FILENAME    FILENAME contains the output of the system over a test dataset
      -g FILENAME    FILENAME contains the gold standard
    • Optional arguments:

      -e             Outputs the IDs of the misclassified instances

    For example, if you run your system over the development set and want to compare its accuracy against the gold standard for the development set, the following command can be used:

    python -o out.txt -g devtestset.txt -e

    where out.txt is the output of your system and devtestset.txt is the text file (that we will provide) containing the evidence IDs and their grades.
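As an illustration of the training and development file format described above, the following is a minimal sketch of a parser for the grade file. The names GradedEvidence, parse_graded_line and parse_graded_text are hypothetical and are not part of the provided materials. The sketch assumes whitespace-separated fields in the order shown in the sample lines.

```python
from typing import NamedTuple


class GradedEvidence(NamedTuple):
    evidence_id: str
    grade: str               # one of "A", "B", "C" (SORT grades)
    abstract_ids: list[str]  # each ID corresponds to a file named <id>.xml


def parse_graded_line(line: str) -> GradedEvidence:
    """Parse one line such as '41711 B 10553790 15265350'."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected ID, grade and at least one abstract ID: {line!r}")
    evidence_id, grade, *abstract_ids = fields
    if grade not in {"A", "B", "C"}:
        raise ValueError(f"unknown grade {grade!r} in line: {line!r}")
    return GradedEvidence(evidence_id, grade, abstract_ids)


def parse_graded_text(text: str) -> list[GradedEvidence]:
    """Parse the full file content, skipping blank lines."""
    return [parse_graded_line(ln) for ln in text.splitlines() if ln.strip()]


# The sample lines from the task description.
sample = """41711 B 10553790 15265350
53581 C 12804123 16026213 14627885
53583 B 15213586
52401 A 15329425 9058342 11279767"""

records = parse_graded_text(sample)
print(records[0].abstract_ids[0] + ".xml")  # prints 10553790.xml
```

Keeping the abstract IDs as strings avoids any reformatting when they are later used to build file names.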

The test set will contain:

  1. A text file containing the evidence IDs and one or more abstract IDs for each evidence. The first few lines may look like this:

    41711 10553790 15265350
    53581 12804123 16026213 14627885
    53583 15213586
    52401 15329425 9058342 11279767

  2. A zip file containing the required abstracts.
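To show how the test file and the zip archive fit together, here is a rough sketch that looks up the abstracts of one evidence inside an archive. It assumes that the abstract text sits in AbstractText elements, as is usual in PubMed XML, and that each file is named id.xml as described above. The function names are hypothetical, and the demo uses an in-memory archive with made-up content rather than the real data.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def parse_test_line(line: str) -> tuple[str, list[str]]:
    """Split a test-set line into the evidence ID and its abstract IDs."""
    evidence_id, *abstract_ids = line.split()
    return evidence_id, abstract_ids


def read_abstract_text(archive: zipfile.ZipFile, abstract_id: str) -> str:
    """Return the joined text of all AbstractText elements in <abstract_id>.xml."""
    wanted = f"{abstract_id}.xml"
    for name in archive.namelist():
        # Compare base names so files stored inside a folder are found too.
        if name.rsplit("/", 1)[-1] == wanted:
            root = ET.fromstring(archive.read(name))
            parts = ["".join(node.itertext()).strip()
                     for node in root.iter("AbstractText")]
            return " ".join(p for p in parts if p)
    raise KeyError(f"{wanted} not found in archive")


# Demo: build a tiny in-memory archive with a made-up, minimal abstract.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "15213586.xml",
        "<PubmedArticle><Abstract><AbstractText>Made-up abstract text."
        "</AbstractText></Abstract></PubmedArticle>",
    )

with zipfile.ZipFile(buffer) as zf:
    evidence_id, abstract_ids = parse_test_line("53583 15213586")
    texts = [read_abstract_text(zf, a) for a in abstract_ids]

print(evidence_id, texts)
```

If the real files use a different element layout, only the root.iter("AbstractText") call would need to change.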


The results of your evidence grading system should be submitted in a single text file with each line containing:

  1. The evidence ID (e.g. 41711)
  2. The predicted grade for that evidence (i.e. A, B or C)

The first few lines of a submission file may look as follows:

41711 B
53581 C
53583 B
52401 A
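The submission format above can be produced with a few lines of code. The sketch below is one possible way to do it. The function name format_submission is hypothetical, and a single space between the ID and the grade is assumed from the sample lines.

```python
def format_submission(predictions: dict[str, str]) -> str:
    """Render predictions as one 'evidence_id grade' line per evidence."""
    lines = []
    for evidence_id, grade in predictions.items():
        if grade not in {"A", "B", "C"}:
            raise ValueError(f"invalid grade {grade!r} for evidence {evidence_id}")
        lines.append(f"{evidence_id} {grade}")
    return "\n".join(lines) + "\n"


predictions = {"41711": "B", "53581": "C", "53583": "B", "52401": "A"}
submission_text = format_submission(predictions)
print(submission_text, end="")

# To save the result, write the text to a file, for example:
# with open("out.txt", "w", encoding="utf-8") as handle:
#     handle.write(submission_text)
```

Validating the grade before writing catches malformed predictions before they reach the submission file.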

Important Dates

Release of training and development data:              On registration
Release of test data (without annotations):            4 October 2011
Deadline for submission of results over test data:     7 October 2011
Notification of results:                               12 October 2011
Deadline for submission of system description poster:  28 October 2011


Here are the results of all participants who submitted a poster. Each participant was allowed to submit up to three runs. The evaluation measure is accuracy (the number of correct classifications divided by the total number of classifications). Since none of the participants obtained results that were statistically significantly better than the baseline, no prizes were awarded.
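The accuracy measure can be sketched as follows. This is not the official evaluation script. The function name is hypothetical, and treating an evidence with no prediction as misclassified is an assumption.

```python
def accuracy(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of gold-standard evidence IDs whose predicted grade matches."""
    if not gold:
        raise ValueError("gold standard is empty")
    # Assumption: an evidence ID missing from the predictions counts as wrong.
    correct = sum(1 for eid, grade in gold.items() if predicted.get(eid) == grade)
    return correct / len(gold)


gold = {"41711": "B", "53581": "C", "53583": "B", "52401": "A"}
predicted = {"41711": "B", "53581": "B", "53583": "B", "52401": "A"}
print(accuracy(predicted, gold))  # 3 of 4 correct, prints 0.75
```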


The baseline is a simple majority baseline: assign grade "B" to every instance. The results with 95% confidence intervals are:

  • Accuracy = 0.486 (0.415-0.558)
Student Category
  • University of Melbourne [poster]
    1. Accuracy = 0.486
    2. Accuracy = 0.426
Open Category
  • UAB_NLP [poster]
    1. Accuracy = 0.437
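For reference, a majority baseline of the kind described above can be sketched in a few lines. The function names and the toy grades are illustrative only. On the real data the most frequent grade is "B", as stated above.

```python
from collections import Counter


def majority_grade(training_grades: list[str]) -> str:
    """Return the most frequent grade in the training data."""
    if not training_grades:
        raise ValueError("no training grades given")
    return Counter(training_grades).most_common(1)[0][0]


def majority_baseline(training_grades: list[str], test_ids: list[str]) -> dict[str, str]:
    """Assign the majority training grade to every test evidence."""
    grade = majority_grade(training_grades)
    return {evidence_id: grade for evidence_id in test_ids}


# Toy example using the grades from the sample lines in the task description.
toy_grades = ["B", "C", "B", "A"]
baseline_predictions = majority_baseline(toy_grades, ["41711", "53581"])
print(baseline_predictions)  # prints {'41711': 'B', '53581': 'B'}
```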



© ALTA 2011. Competition Organisers.