Multilingual Metadata to access Social Science Data

ERCIM News No.41 - April 2000

by Brian Matthews and Michael Wilson

Most countries have a national archive for social science data. This is data about social attitudes, together with financial and environmental details, which describes the current and past state of each nation. The information derives from government statistics and from academic studies, often commissioned by government. These archives are usually a few hundred gigabytes in size and are partially accessible via the Internet and the Web. However, it is becoming increasingly important to allow access across national and linguistic boundaries so that decision makers have a comparative picture of European society.

The European-funded LIMBER project, which began in January 2000, brings together CLRC-RAL, Intrasoft, the UK national archive at Essex University, the Norwegian national archive at NSD, and three other national archives within Europe. The vision behind LIMBER is the interoperability of data: for example, data from WHO health archives with social science datasets on behaviour and genetic datasets from the Human Genome Project. These can be integrated to show the localisation of potential genetically abnormal populations. The result would then be presented through a multilingual interface so that it can be used for policy making and planning.

LIMBER will achieve this by providing a uniform metadata description. Metadata allows the explicit specification of the semantics of data, the relationships between data, and its quality (recency, accuracy, etc.). LIMBER’s key features are described below.

Multilingual Thesaurus

Using a controlled vocabulary to index metadata increases the relevance of retrieved results, and structuring that vocabulary as a thesaurus further helps to refine searches. LIMBER will extend existing thesauri by using an XML representation and adding ‘equivalent terms’ in other languages. This provides the terms needed to catalogue datasets and, by using equivalent terms in a multilingual search, allows relevant data to be discovered across the archives.
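As an illustration only, the idea of a thesaurus with equivalent terms can be sketched in Python. The XML element and attribute names (`concept`, `term`, `lang`, `broader`) and the sample entries are assumptions made for this sketch; they are not the LIMBER thesaurus format.

```python
import xml.etree.ElementTree as ET

# Hypothetical thesaurus fragment. Element names, attribute names and
# entries are illustrative assumptions, not the LIMBER schema.
THESAURUS_XML = """
<thesaurus>
  <concept id="c1">
    <term lang="en">unemployment</term>
    <term lang="fr">chômage</term>
    <term lang="no">arbeidsledighet</term>
    <broader ref="c2"/>
  </concept>
  <concept id="c2">
    <term lang="en">labour market</term>
    <term lang="fr">marché du travail</term>
    <term lang="no">arbeidsmarked</term>
  </concept>
</thesaurus>
"""


def load_thesaurus(xml_text):
    """Return {concept_id: {language_code: term}} for every concept."""
    root = ET.fromstring(xml_text)
    concepts = {}
    for concept in root.findall("concept"):
        concepts[concept.get("id")] = {
            term.get("lang"): term.text for term in concept.findall("term")
        }
    return concepts


def equivalent_terms(concepts, term, lang):
    """Return all language equivalents of `term`, or {} if it is unknown."""
    wanted = term.lower()
    for terms in concepts.values():
        if terms.get(lang, "").lower() == wanted:
            return terms
    return {}


concepts = load_thesaurus(THESAURUS_XML)
# A French query term is expanded to the terms used in the other archives.
print(equivalent_terms(concepts, "chômage", "fr"))
```

A search front end could use the returned terms to query each national archive in its own language.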

Indexing

Creating metadata usually takes considerable effort, and whilst its benefits are evident to users, data contributors have little incentive to spend this effort. To ameliorate this, an automatic tool will index the metadata by scanning for relevant terms and converting these to the controlled vocabulary in the language of the metadata.
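A minimal sketch of this kind of indexing, assuming a simple lookup table from free-text variants to preferred terms, might look as follows. The table contents and the function name `index_metadata` are hypothetical, and plain whole-word matching stands in for whatever linguistic processing the real tool would apply.

```python
import re

# Hypothetical mapping from free-text variants to preferred
# controlled-vocabulary terms, keyed by the language of the metadata.
VOCABULARY = {
    "en": {
        "unemployment": "unemployment",
        "jobless": "unemployment",
        "out of work": "unemployment",
        "income": "income",
        "earnings": "income",
    },
}


def index_metadata(text, lang):
    """Return the sorted preferred terms whose variants occur in `text`."""
    variants = VOCABULARY.get(lang, {})
    lowered = text.lower()
    found = set()
    for variant, preferred in variants.items():
        # Whole-word (or whole-phrase) match against the lower-cased text.
        if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
            found.add(preferred)
    return sorted(found)


print(index_metadata("Survey of jobless households and their earnings.", "en"))
```

Because every variant is mapped to one preferred term, two descriptions that use different wording end up indexed under the same vocabulary entry.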

Multilingual User Interface

The multilingual thesaurus will allow users to perform searches in their own language. To help them refine their search, the thesaurus will be presented in the user’s natural language. The relevance of the data returned can be shown by displaying terms from the controlled vocabulary in the user’s own language.

Metadata using RDF

Metadata is a very active area with many proposed schemes. The problem in standardising metadata definitions is to achieve the correct layering, so that domain-specific elements are defined on a generic base technology. Recognising this, the World Wide Web Consortium (W3C) has defined a recommendation for web-based metadata, the Resource Description Framework (RDF). This defines a model, a syntax, and a schema representation to capture semantics in terms defined by authoritative bodies. RDF is defined in XML, so browsers that support XML can also handle RDF, which provides economical and easy-to-use tools.

LIMBER will develop a metadata model (in XML and RDF) for social science datasets to allow their integration within and across archives, building on the existing work of the Data Documentation Initiative (DDI) from the University of Michigan, the most advanced metadata structure currently proposed for social science. RDF also offers a uniform vehicle to manage multilingual thesauri as full ontologies. Further, a common RDF format allows seamless integration with metadata from related fields, such as geographical, environmental, and health data.
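To make the RDF layering more concrete, the following Python sketch builds a small RDF/XML description of a dataset. It is an illustration under stated assumptions: Dublin Core `title` and `subject` elements are used as a stand-in domain vocabulary (not the DDI or the LIMBER model), and the dataset URI, titles, and the function name `describe_dataset` are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core, used as a stand-in
XML = "http://www.w3.org/XML/1998/namespace"

ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)


def describe_dataset(uri, titles, subjects):
    """Return an RDF/XML string describing one dataset.

    titles   -- {language_code: title}, one dc:title per language
    subjects -- controlled-vocabulary terms, one dc:subject each
    """
    rdf = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF}}}Description", {f"{{{RDF}}}about": uri})
    for lang, title in titles.items():
        # xml:lang marks the language of each literal value.
        el = ET.SubElement(desc, f"{{{DC}}}title", {f"{{{XML}}}lang": lang})
        el.text = title
    for subject in subjects:
        ET.SubElement(desc, f"{{{DC}}}subject").text = subject
    return ET.tostring(rdf, encoding="unicode")


# Hypothetical dataset URI and titles, for illustration only.
print(describe_dataset(
    "http://example.org/datasets/lfs-1999",
    {"en": "Labour Force Survey", "fr": "Enquête sur la population active"},
    ["unemployment"],
))
```

The generic part (the `rdf:RDF` and `rdf:Description` wrapper) stays the same whichever domain vocabulary supplies the properties, which is the layering the section describes.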

Links:
http://www.linglink.lu/hlt/projects/limber/

Please contact:
Michael Wilson - CLRC
Tel: +44 1235 44 6619
E-mail: M.D.Wilson@rl.ac.uk