DIGITAL LIBRARIES
ERCIM News No.27 - October 1996



A System for Cross-Language Information Retrieval


by Carol Peters and Eugenio Picchi

We describe a system for querying comparable corpora, ie collections of texts in more than one language from a common domain. Given a particular term or set of terms in the texts in one language, contexts which contain lexically equivalent or related expressions can be retrieved from the texts in the other(s). The system has been developed to process Italian/English texts but could be extended to include other languages. The initial implementation was made with the requirements of contrastive linguistics in mind; however, the system could also have applications in the field of bilingual/multilingual document retrieval.

With the recent rapid diffusion over the Internet of world-wide distributed document bases, the question of multilingual information retrieval is becoming increasingly relevant, as the disadvantages of allowing English-only systems and document collections to dominate the global scene unchallenged are gradually being recognised. Natural language processing techniques and tools have already been incorporated into IR processes with varying degrees of success. We feel that such methodologies have an important role to play in the development of multilingual document systems in which users can formulate queries in their preferred languages and retrieve all relevant documents in whatever language they are stored. Below, we describe a strategy being studied for comparable corpus querying and explain why we feel this approach can be extended to cross-language information retrieval applications.

Comparable corpora are sets of texts in pairs (or multiples) of languages on the same topic or domain. Given a particular term or set of terms in a domain-specific corpus in one language, the aim is to identify contexts which contain equivalent or related expressions in a comparable corpus in another language. We do this by extracting lexical and linguistic knowledge from the first corpus, and projecting it onto the second. Our starting point is a basic tenet of corpus linguistics: a word acquires sense from its context. We thus attempt to isolate the vocabulary related to the terms in the corpus in one language (L1), hypothesising that lexically equivalent terms will be associated with a similar vocabulary in the comparable corpus for the other (L2).

Thus for any term, T, and using a well-known statistical procedure (Church and Hanks' Mutual Information Index), we calculate its set of significant collocates in our L1 corpus; the set of lemmas derived makes up the vocabulary, V1, that characterizes T in this particular subdomain corpus. Next, using our lexical tools (eg English/Italian morphological procedures, a bilingual lexical database), we construct a corresponding L2 vocabulary of translation equivalents (V2). Words or expressions that can be considered as lexically equivalent to our selected term in the L1 texts are then searched for in the L2 corpus, ie we identify those contexts in L2 in which there is a significant presence of the L2 vocabulary for T. The significance is determined on the basis of a statistical procedure that assesses the probability for different sets of L2 co-occurrences to represent lexically equivalent contexts for T. The L2 contexts retrieved are written to a file and listed in descending order of relevance to our L1 term.
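As an illustration only, the three steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the procedure actually implemented: the corpus is assumed to be already lemmatised, the bilingual lexical database is replaced by a plain dictionary, and the statistical significance test for L2 contexts (whose details are not given here) is replaced by a simple count of V2 items per context. The function names, the window size and the thresholds are all hypothetical. The mutual information score follows the usual Church and Hanks form, log2 of P(x,y) / (P(x) P(y)), with probabilities estimated from corpus frequencies.

```python
import math
from collections import Counter


def collocates(tokens, term, window=2, min_mi=1.0, min_pair=2):
    """Return {collocate: MI score} for lemmas co-occurring with `term`.

    MI(x, y) = log2(P(x, y) / (P(x) * P(y))), estimated as
    log2(f(x, y) * N / (f(x) * f(y))) over a corpus of N tokens.
    `window`, `min_mi` and `min_pair` are illustrative settings.
    """
    n = len(tokens)
    freq = Counter(tokens)
    pair = Counter()
    for i, tok in enumerate(tokens):
        if tok != term:
            continue
        # Count every lemma within `window` positions of this occurrence.
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i:
                pair[tokens[j]] += 1
    scores = {}
    for word, f_xy in pair.items():
        if f_xy < min_pair:  # MI is unreliable for very rare pairs
            continue
        mi = math.log2((f_xy * n) / (freq[term] * freq[word]))
        if mi >= min_mi:
            scores[word] = mi
    return scores


def translate_vocabulary(v1, lexicon):
    """Build V2: every translation equivalent listed for each V1 lemma.

    `lexicon` is a stand-in for a bilingual lexical database and maps
    an L1 lemma to a list of L2 lemmas.
    """
    v2 = set()
    for lemma in v1:
        v2.update(lexicon.get(lemma, ()))
    return v2


def rank_contexts(contexts, v2, min_hits=2):
    """Rank L2 contexts by the number of distinct V2 lemmas they contain.

    Assumption: a raw hit count stands in for the significance test
    described in the text.
    """
    scored = []
    for ctx in contexts:
        hits = len(v2.intersection(ctx))
        if hits >= min_hits:
            scored.append((hits, ctx))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored
```

In use, V1 would be taken as the keys returned by collocates() for T, passed through translate_vocabulary(), and the resulting V2 applied to the L2 contexts with rank_contexts(), which returns them in descending order of score.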

Although these procedures are still in an experimental phase, the first results are encouraging, ie we can retrieve contexts which refer to a particular concept represented in L1 by a given expression (term or set of terms), without the necessity for a known translation equivalent for that expression being present.

When we began this work our main interest was linguistic; however, we now intend to extend the procedures to run in a multilingual document query system. Most current IR systems which include a multilingual component use a thesaurus in order to search keywords across languages. However, a multilingual thesaurus that makes any attempt towards exhaustiveness is difficult and expensive to construct and maintain. Technical vocabulary is in continual development as new ideas mature and new processes are introduced. Any thesaurus must be frequently updated if it is to be useful for query and retrieval purposes. Even if a thesaurus is well constructed and includes pointers to semantically (eg synonyms, hyponyms, meronyms) and lexically (eg close collocates) related items, the users are still obliged to base their query on a keyword list rather than formulating a fully natural language query.

We thus intend to test two applications of our system: as a method that can be used when no multilingual thesaurus is available, and as a method for constructing and/or enriching a multilingual termbank. Our hypothesis is that a document base consisting of texts on the same topic in more than one language in itself constitutes a set of comparable corpora. It should thus be possible to apply procedures such as those outlined here to retrieve all documents in a second language which contain lexical equivalences to a term or set of terms searched in a first language even when no multilingual thesaurus is available. We also intend to test the system as a tool for the semi-automatic construction of a thesaurus in a second language on the basis of an existing thesaurus in L1. In this case, the system would be run for each term in the L1 thesaurus in order to retrieve corresponding L2 equivalent contexts. The terminologist could then select the relevant set of L2 (multiword) terms for each L1 item searched. At the same time, both the L1 and L2 thesauri could be enriched by automatically associating with each node of each side of the multilingual thesaurus all the significant collocates characterising that particular term. In this way, we can create a multilingual search tool which combines the features of a keyword-based tool with those of our comparable corpus procedure, and thus searches for both pre-identified multilingual thesaurus terms and cross-language lexical equivalences.
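The semi-automatic thesaurus construction described above might be organised along the following lines. This is a hypothetical sketch: the ThesaurusNode structure, its field names and the build_nodes function are assumptions made for illustration, and the two callables passed in stand for whatever collocate extraction and L2 context retrieval procedures are available. Only the L1 side is shown; the L2 terms are left empty because they are selected by the terminologist.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Set


@dataclass
class ThesaurusNode:
    """One L1 thesaurus entry with the material gathered for it (hypothetical layout)."""
    term: str
    collocates: Set[str] = field(default_factory=set)               # significant L1 collocates
    l2_candidates: List[List[str]] = field(default_factory=list)    # retrieved L2 contexts
    l2_terms: List[str] = field(default_factory=list)               # chosen later by the terminologist


def build_nodes(
    l1_terms: Iterable[str],
    find_collocates: Callable[[str], Iterable[str]],
    retrieve_l2_contexts: Callable[[Set[str]], Iterable[List[str]]],
) -> Dict[str, ThesaurusNode]:
    """Run the retrieval once per L1 term and store the results on its node.

    `find_collocates` and `retrieve_l2_contexts` are supplied by the caller;
    this function only wires them together.
    """
    nodes: Dict[str, ThesaurusNode] = {}
    for term in l1_terms:
        v1 = set(find_collocates(term))
        contexts = list(retrieve_l2_contexts(v1))
        nodes[term] = ThesaurusNode(term=term, collocates=v1, l2_candidates=contexts)
    return nodes
```

Keeping the collocates on the node alongside the selected terms is what would allow a single lookup to serve both keyword search and context-based search.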

There should be no problems in extending the system to cover additional languages, provided the necessary lexical and morphological resources are available. Any language can be adopted as the starting point, much as is currently done in the construction of many multilingual thesauri where one language (usually English) acts as the base. The vocabulary associated with any term in the corpus for this language (V1) will then be translated into all the other languages (constructing Vn vocabularies). Each comparable corpus (or set of documents) will then be searched for contexts with a significant co-occurrence of lexical items from the corresponding target language vocabulary for T. This will be the subject of a future study.
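The extension to several target languages could be sketched as follows. Again this is only an illustration under assumptions: each bilingual lexicon is a plain dictionary from a base-language lemma to a list of target-language lemmas, contexts are lists of lemmas, a raw count of vocabulary items stands in for the significance test, and the function names and language codes are hypothetical.

```python
def build_target_vocabularies(v1, lexicons):
    """Translate the base vocabulary V1 into one vocabulary Vn per target language.

    `lexicons` maps a language code to a bilingual dictionary
    {base lemma: [target lemmas]}; both are illustrative stand-ins.
    """
    return {
        lang: {t for lemma in v1 for t in lexicon.get(lemma, ())}
        for lang, lexicon in lexicons.items()
    }


def search_all(corpora, vocabularies, min_hits=2):
    """Search each comparable corpus with the vocabulary for its own language.

    `corpora` maps a language code to a list of contexts. Contexts with at
    least `min_hits` distinct vocabulary items are kept, best first.
    """
    results = {}
    for lang, contexts in corpora.items():
        vn = vocabularies.get(lang, set())
        hits = [(len(vn.intersection(ctx)), ctx) for ctx in contexts]
        hits = [h for h in hits if h[0] >= min_hits]
        hits.sort(key=lambda h: h[0], reverse=True)
        results[lang] = hits
    return results
```

Because V1 is computed once in the base language, adding a language in this sketch requires only a further lexicon and corpus, which mirrors the resource requirement noted above.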

Please contact:
Carol Peters - IEI-CNR
Tel: +39 50 593 429
E-mail: carol@iei.pi.cnr.it
or Eugenio Picchi - ILC-CNR
Tel: +39 50 560481
E-mail: picchi@ilc.pi.cnr.it
