DELOS Workshop on Cross-Language Information Retrieval

ERCIM News No.29 - April 1997

by Páraic Sheridan

The third workshop of the DELOS working group, on the topic of 'Cross-Language Information Retrieval', was hosted by ETH Zurich, 5-7 March 1997. DELOS is a working group funded by the IT Long Term Research programme of the European Commission to study and investigate existing and emerging technologies and issues relevant to digital libraries.

The DELOS Working Group is just one of a series of ERCIM-sponsored initiatives aimed at promoting research and operational activities in the Digital Library field. The DELOS consortium consists mainly of members of ERCIM institutes.

As many of the workshop presentations showed, research projects on digital information repositories in Europe often have to deal with information in several languages, even when multi-lingual or cross-language information retrieval is not a central theme of the project. We distinguish the two as follows. In multi-lingual information retrieval the collection spans several languages, but a user's search query is always evaluated against only those documents in the query language. In cross-language information retrieval, a user's query may retrieve documents in languages other than the language of the query.
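The distinction can be illustrated with a minimal sketch. The `Document` class, the function names and the three-document collection are hypothetical and were not presented at the workshop; the sketch only shows which documents are eligible for matching in each case, not how they would be ranked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    language: str  # language code, e.g. "en", "de", "fr"
    text: str


def multilingual_candidates(docs, query_language):
    """Multi-lingual retrieval: only documents in the query language are eligible."""
    return [d for d in docs if d.language == query_language]


def cross_language_candidates(docs, query_language):
    """Cross-language retrieval: documents in any language are eligible.

    The query language is ignored here; a real system would use it to
    translate or map the query before matching.
    """
    return list(docs)


# Hypothetical collection spanning three languages.
collection = [
    Document("d1", "en", "digital libraries"),
    Document("d2", "de", "digitale Bibliotheken"),
    Document("d3", "fr", "bibliothèques numériques"),
]

print([d.doc_id for d in multilingual_candidates(collection, "de")])
print([d.doc_id for d in cross_language_candidates(collection, "de")])
```

In the cross-language case the query and the documents no longer share a vocabulary, which is why the approaches discussed at the workshop rely on corpora, dictionaries or thesauri to bridge the gap.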

A total of 27 participants attended the workshop, representing 9 different European countries, as well as invited speakers from the United States and Korea, who helped to broaden the discussions beyond the European perspective. Apart from the geographical diversity of the participants, their backgrounds in Information Retrieval, Computational Linguistics, Lexicography, Controlled Vocabulary Thesauri and Internet Technology also helped to bring many different viewpoints to the discussions of the work presented.

To set the scene for the workshop, Doug Oard of the University of Maryland gave an overview of Cross-Language Information Retrieval in the USA, including a schematic breakdown of the various approaches: corpus-based (parallel, comparable or unaligned corpora) versus knowledge-based (dictionaries or ontologies). He presented a substantial amount of US-based research on cross-language retrieval, and showed that current approaches achieve between 50% and 75% of the performance of the comparable monolingual retrieval task. He was followed by Sung-Huyn Myaeng of the National University Taejon, Korea, who gave an in-depth presentation of the particular problems of working with Asian languages, including the use of different scripts, the problem of word segmentation and the similar problem of compound noun analysis. This was appropriately followed by Martin Duerst, University of Zurich, who, in recognition of the increasing role of the World Wide Web in this area of research, detailed the emerging HTTP and HTML standards for supporting multi-script and multi-language information on the Web.

Other presentations from European researchers focussed on the approaches being adopted for cross-language and multi-language retrieval in various projects such as Twenty-One, MULINEX, Aquarelle, ILIAD and MedExplore, some of which are funded by the European Commission. A common sentiment expressed was that, even in cases where multilinguality was not a core concern of the project consortia, it was a topic that had to be addressed given the European dimension. We therefore saw some novel approaches to cross-language retrieval being taken by these researchers. Another important theme was the identification, conflation and use of multi-word terms for cross-language retrieval, given the observation that these can greatly reduce translation ambiguities.
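The effect of multi-word terms on translation ambiguity can be shown with a small dictionary-based sketch. It is not the method of any of the projects named above: the English-to-German entries, the two dictionaries and the function `translate_query` are illustrative assumptions, and only two-word phrases are handled.

```python
# Toy English-to-German dictionaries; all entries are illustrative assumptions.
WORD_DICT = {
    "hot": ["heiß", "scharf"],  # an ambiguous single word
    "dog": ["Hund"],
}
PHRASE_DICT = {
    ("hot", "dog"): ["Hotdog"],  # the multi-word term has a single translation
}


def translate_query(tokens, use_phrases):
    """Return one list of candidate translations per translated unit.

    With use_phrases=True, a two-word phrase found in PHRASE_DICT is
    translated as a unit; otherwise each word is looked up on its own.
    Words missing from the dictionary are passed through unchanged.
    """
    result = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if use_phrases and pair in PHRASE_DICT:
            result.append(PHRASE_DICT[pair])
            i += 2
        else:
            result.append(WORD_DICT.get(tokens[i], [tokens[i]]))
            i += 1
    return result


def count_readings(candidates):
    """Number of distinct target-language queries the candidates allow."""
    total = 1
    for options in candidates:
        total *= len(options)
    return total


word_by_word = translate_query(["hot", "dog"], use_phrases=False)
with_phrases = translate_query(["hot", "dog"], use_phrases=True)
print(word_by_word, count_readings(word_by_word))
print(with_phrases, count_readings(with_phrases))
```

Word-by-word lookup leaves two possible target queries for this input, while recognising the phrase first leaves one. The same effect grows quickly with longer queries, since the number of readings is the product of the per-word alternatives.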

From the Information Retrieval point of view, David Hull of Rank Xerox Research Centre, Grenoble, France, presented a weighted Boolean model for cross-language retrieval, and Páraic Sheridan of ETH Zurich presented a method of using a retrieval model to build information structures called similarity thesauri for cross-language retrieval. The presentation of similarity thesauri showed how this approach has also been implemented for cross-language retrieval of speech documents, and a demonstration of the EuroSpider retrieval system was given (http://www.eurospider.ch/). Approaches from the Computational Linguistics perspective were presented by Carol Peters and Eugenio Picchi of CNR, Italy, who showed how the use of comparable corpora together with lexical resources could bring to light useful translation equivalences for cross-language retrieval, and by Piek Vossen of the University of Amsterdam, who presented the EuroWordnet project (http://www.let.uva.nl/~ewn/), which is augmenting the Princeton Wordnet of English with wordnets in Dutch, Italian and Spanish. The workshop concluded with a discussion of the important issue of evaluating different approaches to cross-language information retrieval. Participants highlighted as highly significant the fact that this year's Text Retrieval Conference (TREC 6) will include a track evaluating cross-language retrieval.

The next DELOS workshop will address Image Indexing and Retrieval, and will take place in Pisa, Italy, 28-30 August 1997, in conjunction with the First European Conference on Research and Advanced Technology for Digital Libraries (see announcements on page 49).

Please contact:
Costantino Thanos - IEI-CNR
Tel: +39 50 593429
E-mail: thanos@iei.pi.cnr.it
