ILEX: a Lexical Resource for Italian
by Giacomo Ferrari
Natural Language Processing aims to build computational systems that handle natural language, for tasks such as information and knowledge storage and retrieval, document processing, and the generation of texts and abstracts, or that use natural language as a means of access to and interaction with computers.
After several decades in which researchers have carried out theoretical studies and built prototypes of natural language processing systems, it has become clear that such products are feasible in principle, but that the realization of large and complex natural-language-based services is held back by the unavailability of the linguistic knowledge these systems need in order to work. In particular, procedures for language analysis or generation lack adequate resources from which to acquire syntactic knowledge, i.e. grammar rules expressed in a computer-oriented formalism, and lexical knowledge, i.e. large computational dictionaries. Thus, in recent years, much effort has been devoted to building such linguistic resources, as well as other kinds of linguistic data banks, which can be used to support the development of refined and exhaustive knowledge of a language.
The Italian Situation
In Italy, as in other countries, lexical databases are in a more advanced state of development than other language resources, such as computational grammars, tagged corpora (textual databases in which words and segments of text are classified and labelled accordingly), or treebanks (repositories of syntactic trees for fragments of natural language sentences).
Efforts aimed at building a repository of lexical knowledge for Italian date back to the second half of the '60s, when the construction of a Machine Dictionary of Italian was begun by Antonio Zampolli and his group in Pisa. This first computational dictionary consists of a list of roughly 100,000 entries - fully tagged for their lexico-syntactic and usage categories - from which nearly one million forms can be (semi-)automatically derived. About 250,000 definitions are also provided.
Similar enterprises have also been carried out at the Universities of Venice and Turin, although neither has worked on such a large scale. In the following decades, there were several initiatives in a number of research centres and Italian companies, such as Synthema, Thamus, Olivetti, CSELT and Sogei. However, the resulting products were expected to satisfy only very specific requirements and thus were not designed to be generalizable; in addition, in all these projects the number of words treated has been relatively small compared with the Italian Machine Dictionary.
Recently, the European Community has funded a number of projects on natural language which - either as a primary result or as a by-product - have produced lexical data for Italian. This is the case of PAROLE, which is producing a set of about 20,000 words morpho-syntactically coded according to the guidelines given by EAGLES and GENELEX; of SPARKLE, whose objective is the implementation of tools for automatic lexical acquisition; and of CRISTAL, a project on intelligent information retrieval, which uses a dictionary of 40,000 forms from 8,100 entries. (See ERCIM News, No. 26, section on Computational Linguistics.)
The assumption that a computational dictionary can be used as a reference list of common knowledge lies at the base of the American project CYC, which has no counterpart for Italian. Another American project, WordNet, has produced a monolingual American English lexicon, accessible on the Web, in which words are connected by conceptual links. WordNet has stimulated the setting up of EuroWordNet, funded by the European Community, which operates on 30,000 nouns and 15,000 verbs for four European languages, including Italian.
The ILEX Project
Experience acquired so far in the building of lexical repositories for different purposes and using different methodologies has highlighted the need for a national language resource that is complete at all levels of lexical description and that can be employed in different kinds of natural language applications without serious porting efforts.
The ILEX project was thus begun two years ago by a consortium formed by the Istituto per la Ricerca Scientifica e Tecnologica in Trento and the Universities of Venice, Turin and Vercelli. The aim is to build a computational lexicon with a minimum core of 30,000 entries, fully coded in accordance with the most recent international standards in the sector. The following information will be encoded:
- Part-of-Speech (POS) and syntactic-semantic classes, plus all the codes necessary to accurately describe the morphological behaviour of a word; this information will be used by procedures for morphological analysis and generation
- syntactic subcategorization codes, describing the syntactic behaviour of entries, especially with respect to their argument structure and dependent constituents
- conceptual relations between words, in the style of WordNet, as well as compositional semantic information which can be used by programmes for the semantic analysis of sentences.
Other information will be encoded modularly, in separate but compatible files. The aim of ILEX is to create not only a repository of lexical knowledge but also a distributed access system, which takes advantage of Internet facilities to offer integration and modularity. The dictionary will be easily accessible at various levels (i.e. the entire vocabulary or sublexica), and updating procedures will be automatic, easy, and fast.
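To make the three levels of description more concrete, the following is a minimal sketch of how a single entry and a sublexicon view might be represented. It is not the ILEX format: the class names, field names, code values (such as "are-1" or "NP_subj NP_obj") and domain labels are all hypothetical, chosen only to illustrate the kinds of information listed above.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """One lexicon entry; every field name and code value is illustrative."""
    lemma: str
    pos: str                                        # Part-of-Speech, e.g. "V", "N"
    morph_class: str                                # hypothetical inflection-class code
    subcat: list = field(default_factory=list)      # subcategorization frames
    relations: dict = field(default_factory=dict)   # WordNet-style conceptual links
    domains: set = field(default_factory=set)       # labels used to carve out sublexica


class Lexicon:
    """A tiny in-memory lexicon keyed by (lemma, POS)."""

    def __init__(self):
        self._entries = {}

    def add(self, entry):
        self._entries[(entry.lemma, entry.pos)] = entry

    def lookup(self, lemma, pos=None):
        """Return all entries for a lemma, optionally restricted to one POS."""
        return [
            e for (l, p), e in self._entries.items()
            if l == lemma and (pos is None or p == pos)
        ]

    def sublexicon(self, domain):
        """Return a new Lexicon holding only the entries tagged with `domain`."""
        sub = Lexicon()
        for e in self._entries.values():
            if domain in e.domains:
                sub.add(e)
        return sub

    def __len__(self):
        return len(self._entries)


# Two hand-written sample entries (assumed codes, not taken from ILEX).
lexicon = Lexicon()
lexicon.add(LexicalEntry(
    lemma="mangiare", pos="V", morph_class="are-1",
    subcat=["NP_subj", "NP_subj NP_obj"],
    relations={"hypernym": ["consumare"]},
    domains={"food"},
))
lexicon.add(LexicalEntry(
    lemma="cane", pos="N", morph_class="m-e/i",
    relations={"hypernym": ["animale"]},
    domains={"zoology"},
))

food = lexicon.sublexicon("food")
print(len(lexicon), len(food))
```

Keeping the morphological, syntactic and semantic codes in separate fields mirrors the modular encoding described above: a morphological analyser would read only `pos` and `morph_class`, while a semantic component would read only `relations`.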
Giacomo Ferrari University of Vercelli
Tel: +39 161 228224