New Testbed of One Million Images

by Gregory Grefenstette, Pierre-Alain Moëllic, Patrick Hède, Christophe Millet and Christian Fluhr

The Commissariat à l’Energie Atomique (CEA) in France has produced an image database of one million images that will allow researchers in content-based image retrieval to test their system on a life-size collection.

Content-based image retrieval involves searching a collection of images for those relevant to a given ‘query image’. The user submits an image (or sometimes only the description of an image) and wants to find the images in the collection that most closely match. Currently, researchers have been limited to small collections in developing and testing their content-based image retrieval techniques. The largest test collections currently used (such as the University of Columbia databases, or the Corel databases) contain from hundreds up to 60,000 images. To serve as a test collection, the image database must contain labeled images, where the labels show which image is relevant to which other image. Any image can then serve as a test query, and researchers can measure the performance of their system using values such as precision (the proportion of retrieved images that actually are relevant to the query) and recall (the percentage of relevant images in the database that the system retrieves).

Within the EU-sponsored Network of Excellence MUSCLE (Multimedia Understanding through Semantics, Computation and Learning), a new testbed image collection of one million test images has been created. The database is called CLIC (for CEA List Image Collection) and has been produced by the LIC2M team at the CEA, which is one of the MUSCLE partners. LIC2M stands for Laboratoire d’Ingenérie de la Connaissance Mutlimédia Mutlilingue, and is a laboratory outside of Paris that specializes in image and text processing in many languages.

The CLIC image collection contains labelled images, each of which can be used as a query image. The image collection was created by hand-labelling photographs that were donated to the project by colleagues. Any photographs containing identifiable persons were removed. The remaining 15,200 photos were classified into a shallow hierarchy:

Food: images of food, and meals
Architecture: images of architecture, architectural details, castles, churches, Asian temples
Arts: paintings, sculptures, stained glass, engravings
Botanic: various plants, trees, flowers
Linguistic: images containing text areas
Mathematics: fractals
Music: images of musical instruments
Objects: images representing everyday objects such as coins, scissors etc
Nature&Landscapes: landscapes, valley, hills, deserts etc
Society: images with people
Sports&Games: stadiums, items from games and sports
Symbols: iconic symbols, roadsigns, national flags (real and synthetic images)
Technical: images involving transportation, robotics, computer science
Textures: rock, sky, grass, wall, sand etc
City: buildings, roads, streets etc.

Some images from classes of the kernel of the CLIC content-based image retrieval testbed.

These labelled images form the kernel of the collection (see the figure for examples). Each image in the kernel was then altered in 69 different ways. The transformations applied to each original kernel image included: geometric transformations (such as rotation, translation, projection, and splitting), chromatic transformations (such as negative, saturation, black-and-white, and quantification) and various other transformations (such as low-pass filtering, noise addition, border addition, text incrustation, mosaic, resizing, and edge outlining). The altered images were added to the collection with the same labels, thereby generating the one million labelled images in CLIC. Any image can be used as a query, in the knowledge that at least 69 other images are relevant to the query. The transformations were designed to cover a wide variety that occur in natural image manipulation. For example, it is common to find slightly altered and cropped pictures to avoid detection for copyright infringement.

The database will be distributed to the research community through the MUSCLE Network of Excellence, and will prove useful for testing text-based content-based image retrieval (using the labels), evaluating algorithm behaviour over large databases, testing the invariance of algorithms towards transformations, for automatic classification (using the hierarchy), for object and person recognition, and for the detection of text in images. The database will be distributed for research free of charge: please check the MUSCLE Web site for the latest information on its availability.

Link:
http://www.muscle-noe.org

For further information about the CLIC testbed, please consult the following article:
Pierre-Alain Moëllic, Patrick Hède, Gregory Grefenstette, Christophe Millet, “Evaluating
Content Based Image Retrieval Techniques with the One Million Images CLIC TestBed”,
Proceedings of the Second World Enformatika Congress, WEC’05, February 25-27, 2005, Istanbul, Turkey, pp 171-174.

Please contact:
Gregory Grefenstette, Commissariat à l’Energie Atomique, France
Tel: +33 1 46 54 96 56
E-mail: Gregory.Grefenstettecea.fr