ERCIM News No.36 - January 1999
EuroSearch: a Federation of European Search Engines
by Martin Braschler, Mounia Lalmas, Luigi Madella and Carol Peters
The objective of the EuroSearch project is to build a federation of European search engines. The main aims of the federation are to join forces in order to compete better with global search engines, to enhance the visibility of European web sites, and to help preserve European linguistic and cultural diversity. The technologies under development will provide linguistic support for querying over different search services and enable the automatic generation of catalogues that best reflect local cultures within the federation.
The EuroSearch consortium brings together industrial partners (Italia On-Line, Pisa; CINET, Barcelona; EuroSpider Information Technology, Zurich) and academic partners (CNR, Pisa, and Dortmund University). Its decision to build a federation of European search engines stemmed from the observation that the World Wide Web is still dominated by US culture and that little effort has so far been put into promoting European web sites. A study of the traffic on the services provided by the EuroSearch partners found that about 70% of incoming traffic comes from the same country or from countries using the same language; more than 50% of outgoing traffic is directed to the US, while almost all the rest remains in the country of origin. This situation is attributed mainly to language barriers between European countries, poor multilingual support in traditional search engines, and the US cultural domination of the most popular web catalogues.
The EuroSearch project thus aims at helping to restore linguistic and cultural equilibrium on the Web by building a pan-European federation of national search and categorization services. The main objectives of the federation are to:
- promote traffic across Europe by exchanging links and sharing services
- provide language support for query translation
- provide tools for automatic categorization in order to overcome the high costs of traditional catalogues, still affordable only by big international organisations.
The Cross-Language Approach
The aim of the EuroSearch distributed, multilingual service is to permit users to enter queries in their own or their preferred language, and to carry out search and information retrieval over some or all of the federation's national sites.
Differences in the partners' document collections and indexing mechanisms have led to the implementation of different search strategies, depending on the collection to be queried. The cross-language search component of EuroSearch thus activates two distinct types of searching:
- Query translation using a multilingual lexicon; this employs the pivot language concept, and semantic indicators are assigned to polysemous words to permit interactive sense disambiguation. Queries can also be expanded using corpus-extracted data.
- Similarity thesaurus technology; a multilingual similarity thesaurus contains entries linking each term in one language to a list of similar terms in another, each with a similarity value based on statistical co-occurrence, i.e. basically how often the terms co-occur in similar texts taken from training data.
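The similarity thesaurus idea in the second item can be illustrated with a minimal sketch. It is not the EuroSearch implementation: the training pairs are invented, the function name is hypothetical, and the Dice coefficient is used here only as one plausible co-occurrence measure, since the article does not say which statistic is applied.

```python
from collections import defaultdict


def build_similarity_thesaurus(pairs, top_k=3):
    """Link each source-language term to its most similar target-language terms.

    pairs: list of (source_terms, target_terms), one entry per pair of
    similar texts taken from the training data.
    """
    src_docs = defaultdict(set)  # source term -> ids of text pairs containing it
    tgt_docs = defaultdict(set)  # target term -> ids of text pairs containing it
    for doc_id, (src_terms, tgt_terms) in enumerate(pairs):
        for term in set(src_terms):
            src_docs[term].add(doc_id)
        for term in set(tgt_terms):
            tgt_docs[term].add(doc_id)

    thesaurus = {}
    for s, s_ids in src_docs.items():
        scored = []
        for t, t_ids in tgt_docs.items():
            shared = len(s_ids & t_ids)
            if shared:
                # Dice coefficient (an assumption, not taken from the article)
                dice = 2 * shared / (len(s_ids) + len(t_ids))
                scored.append((t, dice))
        # Highest similarity first; ties broken alphabetically
        scored.sort(key=lambda item: (-item[1], item[0]))
        thesaurus[s] = scored[:top_k]
    return thesaurus


# Invented Italian/English training pairs, for illustration only
TRAINING_PAIRS = [
    (["motore", "ricerca"], ["search", "engine"]),
    (["motore", "auto"], ["engine", "car"]),
    (["ricerca", "scientifica"], ["research", "science"]),
]

thesaurus = build_similarity_thesaurus(TRAINING_PAIRS)
for term, similar in thesaurus.items():
    print(term, similar)
```

At query time, each query term would be replaced or expanded by its highest-scoring entries, so no hand-built dictionary is needed for this strategy.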
The languages covered are currently German, Italian and Spanish, plus English. The two approaches are integrated through the development of common translation server interfaces and data exchange formats; this will facilitate future extensions of the EuroSearch components.
A preliminary simplified prototype of a Translation Server has been developed and integrated in the Arianna search engine, allowing queries in Italian to be formulated and directed to Alta Vista. This server will be extended with the addition of a corpus-based query expansion mechanism. In 1999 the integration of the linguistic resources on all the federated services will be completed.
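The lexicon-based query translation strategy described above can be sketched as follows. This is an illustration under stated assumptions, not the Translation Server itself: the lexicon entries, the uppercase pivot identifiers, the function name translate_query and the choose_sense callback (standing in for the interactive disambiguation step) are all hypothetical.

```python
# Hypothetical lexicon: source word -> list of (semantic indicator, pivot entry)
SOURCE_TO_PIVOT = {
    "it": {
        "rete": [("computing", "NETWORK"), ("fishing", "NET")],  # polysemous
        "ricerca": [("general", "SEARCH")],
    },
}

# Hypothetical lexicon: pivot entry -> translations in each target language
PIVOT_TO_TARGET = {
    "de": {"NETWORK": ["Netzwerk"], "NET": ["Netz"], "SEARCH": ["Suche"]},
}


def translate_query(terms, source, target, choose_sense):
    """Translate query terms from source to target through the pivot entries.

    choose_sense(term, senses) is called only for polysemous words and must
    return one of the (indicator, pivot) tuples; in an interactive service it
    would show the semantic indicators to the user.
    """
    translated = []
    for term in terms:
        senses = SOURCE_TO_PIVOT[source].get(term.lower())
        if not senses:
            translated.append(term)  # unknown terms are passed through unchanged
            continue
        if len(senses) > 1:
            _indicator, pivot = choose_sense(term, senses)
        else:
            _indicator, pivot = senses[0]
        translated.extend(PIVOT_TO_TARGET[target].get(pivot, [term]))
    return translated


def pick_computing(term, senses):
    """Stand-in for the user: prefer the sense tagged 'computing'."""
    for sense in senses:
        if sense[0] == "computing":
            return sense
    return senses[0]


print(translate_query(["ricerca", "rete"], "it", "de", pick_computing))
```

Routing every language pair through pivot entries means that adding a language requires only one new mapping to and from the pivot, rather than one lexicon per language pair.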
The Automatic Categorization Technology
Another important goal of the project is to facilitate the creation of Web catalogues by developing techniques for the automatic categorization of documents. In this way, even small corporations will be able to develop their own catalogues.
The categorization approach is based on an automatic textual analysis of web documents that associates weighted terms with documents. The determination of the weighted terms is based on the description-oriented indexing approach developed at the University of Dortmund. It takes into account two kinds of features:
- specific to web documents (whether a term appears in a title, a heading, or is highlighted)
- standard to text documents (term frequency).
The weights are probabilistically determined using the Least Square Polynomial (LSP) approach and a test-bed of pre-categorized documents taken from the Computers and Internet part of the Yahoo! catalogue.
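The following minimal sketch shows how a least-squares fit can turn such features into term weights. It is an assumption-laden illustration rather than the project's method: it uses a polynomial of degree one only, the training rows are invented, and the function names are hypothetical. Each row describes one term in one pre-categorized document, and the target is 1 if the term is a good indicator of the document's category and 0 otherwise.

```python
def fit_least_squares(rows, targets):
    """Solve the normal equations (X^T X) w = X^T y by Gauss-Jordan elimination."""
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are linearly dependent")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def features(in_title, in_heading, highlighted, term_freq):
    """Feature vector: constant term, three web-specific flags, term frequency."""
    return [1.0, float(in_title), float(in_heading), float(highlighted), float(term_freq)]


def term_weight(coeffs, x):
    """Evaluate the fitted (degree-one) polynomial for one feature vector."""
    return sum(c * v for c, v in zip(coeffs, x))


# Invented training data: (feature vector, target) per term-document pair
TRAINING = [
    (features(1, 0, 0, 3), 1.0),
    (features(0, 1, 0, 2), 1.0),
    (features(0, 0, 1, 1), 0.0),
    (features(0, 0, 0, 1), 0.0),
    (features(1, 1, 0, 5), 1.0),
    (features(0, 0, 0, 4), 1.0),
    (features(0, 0, 1, 2), 0.0),
]

rows = [x for x, _ in TRAINING]
targets = [y for _, y in TRAINING]
coeffs = fit_least_squares(rows, targets)
print(term_weight(coeffs, features(1, 0, 0, 2)))
```

The fitted value approximates the probability that a term is a good descriptor, which is why the weights can be read probabilistically; a real system would use far more training data and possibly higher-degree polynomial terms.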
This approach produces two main results:
- the automatic classification of new documents into appropriate categories
- the determination of documents that belong to given categories.
The approach is fully automatic, and is portable to the various languages involved in the federation. We are currently applying our techniques to German web documents from the DINO-online catalogue.
A preliminary on-line prototype is now running, and an engineered version is available in the Arianna catalogue. This is one of the first examples in the world of an automatically generated catalogue available on the Web.
For further information and demos, see the EuroSearch Web site at: http://eurosearch.iol.it/
Luigi Madella - Italia Online
Tel: +39 050 944258