ERCIM News No.20 - January 1995 - CNR

From Electronic Dictionaries to Lexical Reference Systems

by Carol Peters and Eugenio Picchi

Much research in computational lexicography in the late 80's focussed on the implementation of procedures to analyse and represent computationally the lexical data contained in machine-readable versions of published dictionaries. At the same time, corpus linguistics was developing strategies to process text archives and automatically extract lexical information which could be used in dictionary and lexical knowledge base construction. Now, in the early 90's, the first results of these studies are being applied by the publishing industry to the dictionary production process, and increasingly sophisticated versions of electronic dictionaries and lexical reference systems are becoming available.

In recent years, much research activity in the US and in Europe has concentrated on developing procedures to extract, analyse and exploit in various ways the lexical data contained on computer typesetting tapes provided by dictionary publishers. The Istituto di Linguistica Computazionale (ILC-CNR), Pisa, has been particularly active in this sector, both in its own research activity and in several EU-funded projects, e.g. MULTILEX, ACQUILEX and ET-10/51.

There have been two major lines of interest:

the construction of mono- and bilingual lexical databases which offer hypertextual access to the data as opposed to the static, alphabetical ordering of the printed volume;
the construction of lexicons for natural language processing applications, such as machine translation.

The starting point for all studies has been the definition of a computational model of the lexical entry. This has led to much discussion on the formalism to be adopted, the types of information to be included, and the most suitable structure to represent the complex network of relationships between the different elements in the lexicon.
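To make the idea of a computational model of the lexical entry more concrete, here is a minimal sketch in Python. It is an illustration only, not the formalism adopted in any of the projects mentioned: the class names (Sense, LexicalEntry), the fields and the helper related_entries are all assumptions, and the relationships are limited to synonym and antonym links between headwords.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sense:
    # One meaning of a headword, with links to other headwords (hypothetical fields).
    definition: str
    synonyms: List[str] = field(default_factory=list)
    antonyms: List[str] = field(default_factory=list)


@dataclass
class LexicalEntry:
    # A structured entry instead of a flat block of typeset text.
    headword: str
    part_of_speech: str
    senses: List[Sense] = field(default_factory=list)


def related_entries(lexicon: Dict[str, LexicalEntry], headword: str) -> List[str]:
    """Follow synonym and antonym links from one entry to other entries."""
    linked = set()
    for sense in lexicon[headword].senses:
        linked.update(sense.synonyms)
        linked.update(sense.antonyms)
    # Keep only the links that resolve to an entry actually present in the lexicon.
    return sorted(word for word in linked if word in lexicon)


# A tiny invented lexicon; "huge" is deliberately a dangling link.
lexicon = {
    "big": LexicalEntry(
        "big",
        "adjective",
        [Sense("of considerable size", synonyms=["large", "huge"], antonyms=["small"])],
    ),
    "large": LexicalEntry("large", "adjective", [Sense("of great size", synonyms=["big"])]),
    "small": LexicalEntry("small", "adjective", [Sense("of limited size", antonyms=["big"])]),
}

print(related_entries(lexicon, "big"))
```

Because the links are stored as data rather than as running text, a user can move from one entry to its neighbours directly, and unresolved links such as "huge" can be detected and reported.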

The publishing world has followed these studies with interest and, in addition to providing data for research purposes, has at times been directly involved in the projects. Most dictionary publishers are now aware that the computer should not be limited to the final stage of the publishing process and that their dictionaries must be based on corpus evidence. In fact, both the research world and industry (generally in collaboration) have been devoting considerable resources and effort to building up large language reference corpora, which provide the data for many different kinds of linguistic studies and also constitute the basis for the construction of new dictionaries by supplying attested evidence of real-world language usage.

At ILC-CNR, E. Picchi has developed a lexicographical workstation that can be used in all stages of dictionary production, from the definition of the entry structure to the printing or electronic publication of the finished product (see ERCIM News 11). The components of the system include mono- and bilingual dictionaries and text corpora. The core module is a procedure for on-line dictionary editing, which includes functions for windowing into and copying data from dictionaries and corpora. It is integrated with a structured indexing procedure that can be used to query the dictionary under compilation in order to check the regularity and consistency of the input.

The benefits are manifold: a computational model permits tight control over the work of the lexicographer, ensuring far greater consistency and coherence in the data, and facilitating revisions and updates. But perhaps the main advantage for the commercial world is the reusability of the computerized lexicon, i.e. from a single comprehensive database many different types of dictionaries can be derived: learners' dictionaries, lexicons for special languages, dictionaries of synonyms and antonyms, etc.
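The reusability argument can be sketched in a few lines of Python. The records, the field names ("level", "domain" and so on) and the three derivation functions below are hypothetical and chosen purely for illustration; they do not describe any publisher's actual database.

```python
# Hypothetical records standing in for a single comprehensive lexical database.
# All field names and values are assumptions made for this sketch.
DATABASE = [
    {"headword": "big", "definition": "of considerable size",
     "level": "basic", "domain": None,
     "synonyms": ["large"], "antonyms": ["small"]},
    {"headword": "lemma", "definition": "the citation form of a word",
     "level": "advanced", "domain": "linguistics",
     "synonyms": ["headword"], "antonyms": []},
    {"headword": "corpus", "definition": "a collection of texts used for linguistic study",
     "level": "advanced", "domain": "linguistics",
     "synonyms": [], "antonyms": []},
]


def learners_dictionary(database):
    """Keep only entries marked as basic vocabulary."""
    return {r["headword"]: r["definition"] for r in database if r["level"] == "basic"}


def special_language_lexicon(database, domain):
    """Keep only entries that belong to one subject domain."""
    return {r["headword"]: r["definition"] for r in database if r["domain"] == domain}


def synonym_antonym_dictionary(database):
    """Keep only the relational information, skipping entries that have none."""
    return {
        r["headword"]: {"synonyms": r["synonyms"], "antonyms": r["antonyms"]}
        for r in database
        if r["synonyms"] or r["antonyms"]
    }


print(learners_dictionary(DATABASE))
print(special_language_lexicon(DATABASE, "linguistics"))
print(synonym_antonym_dictionary(DATABASE))
```

Each derived dictionary is just a different selection and projection of the same records, so a correction made once in the database carries over to every product generated from it.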

For this reason, major dictionary publishers are now building up large, exhaustive lexical databases from which to generate their dictionaries: both printed and electronic dictionaries can be derived from the same source. This is important as electronic dictionaries, on disk or CD-ROM, are now quite common on the market and pocket versions are becoming increasingly popular, the advantages of fast and highly flexible access to the lexical data far outweighing the pleasure of a leisurely but limited consultation. It is to be expected that, in the future, the role of the printed dictionary will be greatly reduced and the trend will be towards on-line computerized lexicons, where the user can navigate freely through the structured information, or towards more complex lexical reference systems in which both mono- and bilingual dictionaries and corpora can be accessed and queried, and information derived from one source can be used to access or query another.

Please contact:
Carol Peters - IEI-CNR
Tel: +39 50 593429
E-mail: carol@vm.iei.pi.cnr.it
or
Eugenio Picchi - ILC-CNR
Tel: +39 50 560481
E-mail: picchi@icnucevm.cnuce.cnr.it