IAPR TC-12 Multimedia and Visual Information Systems
Benchmark for Visual Information Search
By Michael Grubinger, Paul Clough and Clement Leung
For more information on IAPR TC-12 Multimedia and Visual Information Systems, see:
Web sites of interest mentioned in this article:
Corbis Image Database
Getty Images
Dataset provided by the University of Washington www.cs.washington.edu/research/imagedatabase
Amsterdam Library of Objects and Images (ALOI)
LTU Technologies
The Benchathlon Network evaluation resource
Casimage for medical imaging
St. Andrews collection for historic photographs
Cross-language Evaluation Forum
Text/Image retrieval track of CLEF
ImageCLEF 2006
Introduction
The development of visual information retrieval systems has long been hindered by the lack of standardised benchmarks for evaluating and comparing emerging techniques. In recent years, researchers have built numerous systems and continually proposed new techniques. However, although different systems clearly have their particular strengths, there is a tendency to use different datasets and queries in order to highlight the advantages of a particular algorithm. A degree of bias may therefore exist, making a meaningful comparison between new techniques difficult to establish.
Thus, there is an acknowledged need for a standardised benchmark to assess the performance of image retrieval systems. One of the main components of any benchmark is a representative collection of documents (e.g. images, texts or videos). However, finding such resources for general use is difficult, as image collections are expensive and often copyrighted, which restricts both the distribution and future access of the data for evaluation purposes (e.g. consider the Corbis Image Database or Getty Images). The Corel CDs, currently the de facto standard for the evaluation of image retrieval systems and techniques (and used in many publications to demonstrate performance), are one example dataset that falls into this category.
There are a few image collections that are free of charge and copyright-free, such as the dataset provided by the University of Washington, which contains about 1,000 images clustered by the location at which they were taken. The Amsterdam Library of Objects and Images (ALOI) and LTU Technologies offer large databases of colour images of small objects under varied viewing and illumination angles. The Benchathlon Network created an evaluation resource, but without query tasks and ground truths. A few royalty-free databases are available in specialised domains, such as Casimage for medical imaging or the St. Andrews collection for the retrieval of historic (mainly black and white) photographs. Yet there is still a lack of more general image collections to cater for the growing research interest in information access to personal photographic collections.
Benchmark History
In 2000, IAPR TC-12 recognised the need for such a standardised benchmark and began an effort to create a freely available database of annotated images. This was initiated by first developing a set of recommendations and specifications for an image benchmark system [1]. Based on these criteria, a first version of the benchmark (consisting of 1,000 multi-object colour images, 25 queries and a collection of performance measures) was set up in 2002 and published in 2003 [2].
Developing a benchmark is an incremental and ongoing process. The IAPR benchmark was refined, improved and extended to 5,000 images in 2004, using a specially developed benchmark administration system [3]. At the end of that year, an independent travel company provided access to around 10,000 images with raw multilingual annotations in three different languages (English, German, Spanish), increasing the total number of available images in the benchmark to 15,000.
A benchmark is not beneficial unless it is also used by researchers. Discussions began in 2005 about using the IAPR TC-12 Benchmark for an ad-hoc image retrieval task at the Cross-Language Evaluation Forum's (CLEF) text and/or content-based image retrieval track (ImageCLEF, see related article, “The ImageCLEF benchmark on multimodal, multilingual image retrieval”) from 2006 onwards. With 10,000 additional images from the travel company, the total number of available images rose to 25,000 [4], but was soon reduced to 20,000 due to the strict benchmark image selection rules [2].
Benchmark Composition
At present, the image collection of the IAPR TC-12 Benchmark consists of 20,000 images (plus 20,000 corresponding thumbnails) taken at locations around the world and comprising an assorted cross-section of still, natural images. This includes pictures of different sports and actions, photographs of people, animals, cities, landscapes and many other aspects of contemporary life (Fig. 1).
Each photograph is associated with a text caption consisting of seven fields: a unique identifier, a title, a free-text description of the semantic content of the image, notes giving additional information about the photograph, the originator of the photo, and the location and date at which the photo was taken. These annotations exist in three different languages: the English and German versions have been manually checked and corrected to provide a reliable set of annotations, while the Spanish version is currently being processed. Annotations are stored in a database managed by a benchmark administration system that allows the specification of parameters according to which different subsets of the image collection can be generated. More information on the benchmark can be found on the web page of IAPR TC-12.
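To make the caption structure concrete, the following minimal Python sketch models one such record with the seven fields described above; the field names, types and example values are illustrative assumptions, not the benchmark's actual annotation file format.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Caption:
        """One IAPR TC-12-style caption with seven fields per photograph.
        Field names are illustrative; the real annotation files may use
        different tags and layout."""
        identifier: str        # unique identifier of the image
        title: str             # title of the picture
        description: str       # free-text description of the semantic content
        notes: Optional[str]   # additional information about the photograph
        originator: str        # who took or supplied the photo
        location: str          # where the photo was taken
        date: str              # when the photo was taken

    # Hypothetical record; all values invented purely for illustration
    example = Caption(
        identifier="images/00/1000",
        title="Church on a hill",
        description="a white church with a red roof on a grassy hill under a blue sky",
        notes=None,
        originator="unknown photographer",
        location="Cuzco, Peru",
        date="April 2002",
    )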
The IAPR TC-12 Benchmark at ImageCLEF
ImageCLEF has been provided with such a subset for its upcoming evaluation event (ImageCLEF 2006) for a task concerning the ad-hoc retrieval of images from photographic collections (ImageCLEFphoto). Participants are provided with the full collection of 20,000 images; however, they will not receive the complete set of annotations, but rather a range from complete annotations to no annotation at all. Data will be provided in English and German in order to enable the evaluation of multilingual text-based retrieval systems. In addition to the existing text and/or content-based cross-language image retrieval task, ImageCLEF will also use the IAPR TC-12 Benchmark in an extra task for content-based image retrieval. ImageCLEF has also expressed interest in having just one text annotation file, with a randomly selected language for each image, for ImageCLEF 2007, making full use of the benchmark's parametric nature.
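As an illustration of how graded annotation subsets of the kind used at ImageCLEFphoto could be produced, the sketch below builds on the Caption record above; the completeness levels and their field groupings are assumptions made for this example only, not the configuration of the actual benchmark administration system.

    import random
    from dataclasses import replace
    from typing import Dict, List, Set

    # Assumed completeness levels: which caption fields are retained (illustrative).
    LEVELS: Dict[str, Set[str]] = {
        "full":       {"title", "description", "notes", "location", "date"},
        "title_only": {"title"},
        "none":       set(),
    }

    def degrade(caption: Caption, level: str) -> Caption:
        """Return a copy of the caption with every field outside the chosen
        completeness level blanked out (identifier and originator are kept)."""
        keep = LEVELS[level]
        return replace(
            caption,
            title=caption.title if "title" in keep else "",
            description=caption.description if "description" in keep else "",
            notes=caption.notes if "notes" in keep else None,
            location=caption.location if "location" in keep else "",
            date=caption.date if "date" in keep else "",
        )

    def build_subset(captions: List[Caption], seed: int = 0) -> List[Caption]:
        """Toy model of the 'complete annotations to no annotation at all' range:
        each image is randomly assigned one of the completeness levels."""
        rng = random.Random(seed)
        return [degrade(c, rng.choice(list(LEVELS))) for c in captions]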
The IAPR TC-12 Benchmark in the Future
It is recognised that benchmarks are not static, as the field of visual information search might (and will) develop, mature and/or even change. Consequently, benchmarks will have to evolve and be augmented with additional features or characteristics, and the IAPR TC-12 Benchmark will be no exception. Apart from the planned completion of the Spanish annotations, this could comprise the addition of several different annotation formats, following a structured annotation defined in MPEG-7, an ontology-based keyword annotation, or even non-text annotations such as audio annotations.
The method of generating various types of visual information might produce different characteristics in the future, and databases might have to be searched in different ways accordingly. Hence, benchmarks with several different component sets geared to different requirements will be necessary, and the parametric IAPR TC-12 Benchmark has taken a significant step towards that goal.
Fig. 1: Example images from the IAPR TC-12 database
References
[1] Leung, C., Ip, H.: Benchmarking for Content Based Visual Information Search. In: Proceedings of the Fourth International Conference on Visual Information Systems (VISUAL 2000), Lecture Notes in Computer Science, vol. 1929, Springer-Verlag, Lyon, France (2000) 442–456.
[2] Grubinger, M., Leung, C.: A Benchmark for Performance Calibration in Visual Information Search. In: Proceedings of the 2003 International Conference on Visual Information Systems (VIS 2003), Miami, FL, USA (2003) 414–419.
[3] Grubinger, M., Leung, C.: Incremental Benchmark Development and Administration. In: Proceedings of the Tenth International Conference on Distributed Multimedia Systems (DMS 2004), Workshop on Visual Information Systems (VIS 2004), San Francisco, CA, USA (2004) 328–333.
[4] Grubinger, M., Leung, C., Clough, P.: The IAPR Benchmark for Assessing Image Retrieval Performance in Cross Language Evaluation Tasks. In: Proceedings of the MUSCLE/ImageCLEF Workshop on Image and Video Retrieval Evaluation, Vienna, Austria (2005).