Building Harmonised Semantic Lexicons

ERCIM News No.38 - July 1999

by Nicoletta Calzolari

SIMPLE is a project sponsored by EC DGXIII in the framework of the Language Engineering programme. To our knowledge, this project represents the first attempt to develop wide-coverage semantic lexicons for a large number of languages (12), with a harmonised common model that encodes structured ‘semantic types’ and semantic (subcategorisation) frames. Although SIMPLE is a lexicon-building project, it also addresses challenging research issues and provides a framework for testing and evaluating the maturity of the current state of the art in lexical semantics grounded on, and connected to, a syntactic foundation.

Many theoretical approaches are currently tackling different aspects of semantics. However, such approaches have to be tested i) with wide-coverage implementations, and ii) with respect to their actual usefulness and usability in real-world systems, both monolingual and multilingual. The SIMPLE project addresses point i) directly, while providing the necessary platform to allow application projects to address point ii).

SIMPLE is consistent with the strategic EC policy that aims at providing a core set of language resources for the EU languages, and should be considered a follow-up to the PAROLE project (see http://www.ilc.pi.cnr); SIMPLE adds a semantic layer to a subset of the existing morphological and syntactic layers developed by PAROLE. The semantic lexicons (about 10,000 word meanings) are built in a harmonised way for the 12 PAROLE languages. These lexicons will be partially corpus-based, exploiting the harmonised and representative corpora built within PAROLE. In this way, the semantic encoding will respect actual corpus distinctions. The lexicons are designed with future cross-language linking in mind: they share, and are built around, the same core ontology and the same set of semantic templates. The ‘base concepts’ identified by EuroWordNet (about 800 senses at a high level in the taxonomy) are used as a common set of senses, so that a cross-language link for all 12 languages is already provided automatically through their link to the EuroWordNet Interlingual Index.
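The linking idea in the last sentence can be illustrated with a minimal sketch: if word senses in different languages point to the same interlingual index record, they are linked to each other without any pairwise mapping. The record identifiers, the sample words and the function names below are invented for illustration and are not taken from SIMPLE or EuroWordNet data.

```python
from collections import defaultdict

# Hypothetical entries: (language, lemma, interlingual index id).
# Ids and words are made up; real index records look different.
ENTRIES = [
    ("it", "cane", "ILI-0001"),
    ("en", "dog", "ILI-0001"),
    ("fr", "chien", "ILI-0001"),
    ("en", "house", "ILI-0002"),
    ("it", "casa", "ILI-0002"),
]


def build_index(entries):
    """Group lemmas by index record, then by language."""
    index = defaultdict(dict)
    for lang, lemma, ili in entries:
        index[ili].setdefault(lang, []).append(lemma)
    return index


def linked_lemmas(entries, lemma, src, tgt):
    """Return target-language lemmas sharing an index record with `lemma`."""
    results = []
    for by_lang in build_index(entries).values():
        if lemma in by_lang.get(src, []):
            results.extend(by_lang.get(tgt, []))
    return results
```

With this layout, adding one more language only requires linking its senses to the shared index, not to each of the other languages.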

The Model

In the first stage of the project, the formal representation of the ‘conceptual core’ of the lexicons was specified, i.e. the basic structured set of ‘meaning-types’ (the SIMPLE ontology). This constitutes a common starting point on which to base the building of the language-specific semantic lexicons. The development of 12 harmonised semantic lexicons requires strong mechanisms for guaranteeing uniformity and consistency. The multilingual aspect translates into the need to identify elements of the semantic vocabulary for structuring word meanings which are both language independent and able to capture linguistically useful generalisations for different NLP tasks.

The SIMPLE model is based on the recommendations of the EAGLES Lexicon/Semantics Working Group (http://www.ilc.pi.cnr.it/EAGLES96/rep2) and on extensions of Generative Lexicon theory. An essential characteristic is its ability to capture the various dimensions of word meaning. The basic vocabulary relies on an extension of ‘qualia structure’ for structuring the semantic/conceptual types as a representational device for expressing the multi-dimensional aspect of word meaning. The model has a high degree of generality in that it provides the same mechanisms for generating broad-coverage and coherent concepts independently of their grammatical/semantic category (entities, events, qualities, etc.).

In order to combine the theoretical framework with the practical lexicographic task of lexicon encoding, we have created a common ‘library’ of language-independent template-types, which act as ‘blueprints’ for any given type, reflecting the conditions of well-formedness and providing constraints for lexical items belonging to that type. The relevance of this approach for building consistent resources is that types both provide the formal specifications and guide subsequent encoding, thus satisfying theoretical and practical methodological requirements.
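As a rough illustration of how a template-type can act as a blueprint, the sketch below treats a template as a named set of required slots and checks a lexical entry against it. The template name, its supertype, the slot names (borrowed from the qualia roles of Generative Lexicon theory) and the sample entry are assumptions made for this example; the actual SIMPLE templates are richer and are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Template:
    """A language-independent blueprint for one semantic type."""
    name: str
    supertype: str
    required_slots: frozenset


@dataclass
class Entry:
    """A language-specific word meaning encoded against a template."""
    lemma: str
    language: str
    template: str
    slots: dict = field(default_factory=dict)


# Hypothetical template library with a single illustrative type.
TEMPLATES = {
    "Instrument": Template("Instrument", "Artifact",
                           frozenset({"formal", "telic"})),
}


def check(entry, templates):
    """Return a list of well-formedness problems (empty if none)."""
    template = templates.get(entry.template)
    if template is None:
        return [f"unknown template: {entry.template}"]
    missing = sorted(template.required_slots - set(entry.slots))
    return [f"missing slot: {slot}" for slot in missing]
```

Because every language encodes entries against the same template library, a check of this kind can be run identically on all 12 lexicons, which is one way the uniformity requirement could be enforced in practice.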

The large number of languages covered by SIMPLE is reflected in the size of its Consortium: Università di Pisa (coordinator: A. Zampolli), Erli (now Lexiquest)-Paris, Institute for Language and Speech Processing-Athens, Institut d'Estudis Catalans, University of Birmingham, Univ. of Sheffield, Det Danske Sprog-og Litteraturselskab, Center for Sprogteknologi-Copenhagen, Språkdata-Göteborgs Universitet, University of Helsinki, Instituut voor Nederlandse Lexicologie-Leiden, Université de Liège BELTEXT, Centro de Linguística da Universidade de Lisboa, Instituto de Engenharia de Sistemas e Computadores-Lisboa, Fundacion Bosch Gimpera Universitat de Barcelona, Institut für Deutsche Sprache, Istituto di Linguistica Computazionale - CNR Pisa, University of Graz.

Please contact:

Nicoletta Calzolari - ILC-CNR
Tel: +39 050 560 481
E-mail: glottolo@ilc.pi.cnr.it
