Migrating Thesauri to the Semantic Web

by Michael Wilson and Brian Matthews

If Semantic Web technologies using RDF are going to be adopted and assimilated as HTML and XML have been, a clear migration path from present technologies to new ones is required. Thesauri are used throughout the information retrieval world as a method of providing controlled vocabularies for indexing and querying.

W3C is developing standards for the representation of ontologies to constrain the vocabularies of resource descriptions based on RDF. Such ontologies will allow distributed, authoritative definition of vocabularies that support cross-referencing, and they are intended to fulfil the role currently undertaken by thesauri. If those ontologies are to be adopted and assimilated into the existing information retrieval infrastructure, either a migration path from current thesauri to ontologies or support for their co-existence is therefore required.

The structure of thesauri is controlled by international standards that are among the most influential ever developed for the library and information field. The three main standards define the relations to be used between terms in monolingual thesauri (ISO 2788:1986), the additional relations for multilingual thesauri (ISO 5964:1985), and methods for examining documents, determining their subjects, and selecting index terms (ISO 5963:1985). The general principles in ISO 2788 are considered language- and culture-independent. As a result, ISO 5964:1985 refers to ISO 2788 and uses it as a point of departure for dealing with the specific requirements that emerge when a single thesaurus attempts to express 'conceptual equivalencies' among terms selected from more than one natural language.

The ISO standards for thesauri (ISO 2788:1986 and ISO 5964:1985) are developed and maintained by the International Organization for Standardization's Technical Committee 46, whose remit is Information and Documentation, not IT. ISO 5964:1985 is currently under review by ISO TC46/SC 9, and the changes are expected to include a standard interchange format for thesauri. To facilitate the growth of the Semantic Web, it would be sensible to ensure that such an interchange format is as compatible as possible with Semantic Web ontology representations.

Several proposals have arisen for thesaurus interchange formats based on either RDF or DAML+OIL. The major problem with these is that either they cannot accommodate the multiple inheritance common in many multilingual thesauri, or they demand more precise semantics than the ISO standards give to thesauri. The links in thesaurus hierarchies define the top term in the hierarchy, and the broader or narrower coverage of terms down the hierarchy. There are also links between hierarchies to show equivalence in different languages, or similar meaning in the same language.
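To make the shape of these links concrete, the following is a minimal sketch in Python of terms connected by broader, narrower, related and cross-language equivalence links, where a term may have more than one broader term. The class names, method names and example terms are invented for illustration; they are not taken from the ISO standards or from any existing thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One preferred term (illustrative structure, not an ISO-defined record)."""
    label: str
    lang: str
    broader: set = field(default_factory=set)        # labels of broader terms
    narrower: set = field(default_factory=set)       # labels of narrower terms
    related: set = field(default_factory=set)        # labels of related terms
    equivalents: dict = field(default_factory=dict)  # language code -> label


class Thesaurus:
    def __init__(self):
        self.terms = {}

    def add(self, label, lang="en"):
        """Return the term with this label, creating it if necessary."""
        return self.terms.setdefault(label, Term(label, lang))

    def link_broader(self, narrow, broad):
        """Record a broader/narrower pair; a term may have several broader terms."""
        self.add(narrow).broader.add(broad)
        self.add(broad).narrower.add(narrow)

    def top_terms(self, label):
        """Follow broader links upwards and collect every top term reached.

        Assumes the broader links contain no cycles.
        """
        term = self.terms[label]
        if not term.broader:
            return {label}
        tops = set()
        for parent in term.broader:
            tops |= self.top_terms(parent)
        return tops


# Invented example: one term placed under two different hierarchies.
t = Thesaurus()
t.link_broader("child labour", "labour")
t.link_broader("child labour", "children")
t.link_broader("children", "age groups")
t.add("child labour").equivalents["fr"] = "travail des enfants"
print(sorted(t.top_terms("child labour")))
```

Because a term keeps a set of broader terms rather than a single parent, it can sit in several hierarchies at once, which is the case a strictly tree-shaped interchange format cannot represent.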

However, the hierarchical links in thesauri are semantically overloaded, and the potential exists using Semantic Web ontology representations to develop ontologies with less overloading. The terms 'broader', 'narrower', 'used for', 'related' and 'equivalent' are not defined by precise semantics. Therefore the proposals are either too precise to be compatible with some existing thesauri, or include explicit statements of semantics which are seen to be unacceptable to other thesaurus developers and users. We have developed a proposal for a thesaurus interchange format expressed in RDF to overcome these limitations, which has been applied to one large multilingual thesaurus, ELSST, for evaluation by users. It is planned to represent many more thesauri in this format, and to show both how they can be migrated into Semantic Web ontologies and how such ontologies allow thesauri from different domains to be related to each other.
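As a rough illustration of what expressing thesaurus links in RDF can look like, the sketch below writes a concept's labels and broader links as N-Triples statements using plain Python. The namespace and the property names `preferredTerm` and `broaderConcept` are hypothetical placeholders, not the vocabulary of the interchange format described here. The broader link is emitted as an ordinary property with no subclass semantics attached, which is one way to avoid committing to a more precise meaning than the source thesaurus provides.

```python
NS = "http://example.org/thesaurus#"  # hypothetical namespace, for illustration only


def uri(local_name):
    """Wrap a local name in the placeholder namespace as an N-Triples IRI."""
    return f"<{NS}{local_name}>"


def literal(text, lang):
    """Format a language-tagged N-Triples literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def concept_triples(concept_id, labels, broader=()):
    """Return N-Triples lines for one concept.

    labels  -- mapping of language code to the preferred term in that language
    broader -- identifiers of broader concepts (there may be more than one)
    """
    subject = uri(concept_id)
    lines = []
    for lang, text in sorted(labels.items()):
        lines.append(f"{subject} {uri('preferredTerm')} {literal(text, lang)} .")
    for parent in broader:
        lines.append(f"{subject} {uri('broaderConcept')} {uri(parent)} .")
    return lines


# Invented concept with labels in two languages and two broader concepts.
triples = concept_triples(
    "c42",
    {"en": "child labour", "fr": "travail des enfants"},
    broader=["c7", "c19"],
)
print("\n".join(triples))
```

Keeping one concept identifier with a label per language treats exact translations as labels of the same concept; inexact translations would need a separate link between distinct concepts, which this sketch does not attempt.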

ELSST, a multilingual thesaurus (English, French, Spanish, German) for the social science domain, has been represented in the Thesaurus Interchange Format. The thesaurus contains 49 hierarchies incorporating 1456 preferred terms. Following the initial development, further translations of terms into Finnish, Norwegian, Danish and Greek are planned, as is the inclusion of terms related through inexact translations in addition to the exact translations already included. The CESSDA group of European Data Archives has also agreed to adopt ELSST as the European Controlled Vocabulary for Social Science.

Please contact:
Michael Wilson, Brian Matthews, CLRC
Tel: +44 1235 44 6619