Activities at the Institute for Computational Linguistics of CNR

by Nicoletta Calzolari

The Institute for Computational Linguistics of the Italian National Research Council (ILC-CNR) of Pisa, known, together with the Department of Linguistics of Pisa University and the Consorzio Pisa Ricerche, as the 'Pisa Group', has been active in the field of Computational Linguistics (CL) since 1967.

The Pisa Group is now involved in a large number of national and international projects. These range from Text Processing (concordances, indices, lemmatisation, statistical analyses, etc.) to the building of a Reference Corpus of the Italian Language in co-ordination with parallel initiatives on other languages. They also include the development of large Textual Databases; the use and analysis of Machine-Readable Dictionaries; the development of large Lexical Knowledge Bases (monolingual and bilingual); the study of parallel/contrastive multilingual corpora; the implementation of morphology for several languages; the design of computational grammars and the development of parsers (in different frameworks); the implementation of Knowledge Representation languages and systems; the study of dialogue and natural language interfaces; Machine Translation; digital image processing; the acquisition of (lexical) information from large text corpora; and the application of natural language processing techniques in Information Retrieval and in the field of digital libraries.

As it would be impossible to provide a detailed description of all the ongoing activities here, we simply outline the main sectors of research and development. These cover central, mainstream themes in state-of-the-art CL, and are articulated within an overall design which encompasses both CL proper and so-called Literary and Linguistic Computing. This important convergence allows us to integrate the best of the two areas of interest.

We can mention six main sectors of activity:

Very Large Reusable Linguistic Resources

In recent years the Pisa Group has strongly promoted the concept of reusability of linguistic resources at the international level. In particular, it has been very active in promoting awareness of the need for adequate linguistic resources for all languages, and for actions directed at fostering development in this sector. It has co-ordinated, and continues to co-ordinate, a number of projects and activities of the EC: for the definition of common specifications (LRE-EAGLES); for the definition of a European infrastructure for the creation, management and distribution of resources; for the exploration of possible cooperation with the United States (NSF/ESPRIT, and EAGLES International Cooperation); for experimentation with methods for the re-use of existing resources (ESPRIT ACQUILEX); for the harmonised development of large generic Corpora and Lexicons for European languages based on common specifications (LE-PAROLE); for the (semi-)automatic acquisition of lexical information from large corpora (LE-SPARKLE); and for the collection and distribution of linguistic resources (the director of ILC is president of ELRA; see article by K. Choukri in this issue).

Methods and Tools for Text Processing

Conceived for both literary and linguistic work and paying particular attention to lexicographic needs, the Pisa text processing system has now been developed as a highly complex structure composed of independent modules. Its core is the DBT, a textual database system, and it includes components for morphological analysis and generation, POS tagging and lemmatisation, and lexical database management. The DBT is software for mono- and bilingual full-text access and analysis. Recently, a set of procedures has been added to the DBT system so that it can be used over the Internet.

Image Processing and Computational Philology

The combination of text and image processing techniques seems to offer interesting possibilities to various designers dealing with large collections of texts. A particular task is the computer-assisted presentation and translation of ancient manuscripts and old printed documents. In this field, the ILC has also developed methods and tools in the framework of the European project BAMBI.

This line of activity provides a system, particularly appropriate for classical scholars, which makes it easier to look up an image archive holding digital representations of the sources, to transcribe the text contained in the images, and to electronically match each word of the transcription with the portion of the image in which the word is inscribed.
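
As a rough illustration of the word-to-image matching just described, the following Python sketch shows one way a transcription could be linked, word by word, to regions of a page image. It is not taken from the BAMBI system: the names `Region`, `AlignedWord` and `find_word`, the page identifier and all coordinates are hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Region:
    """Rectangular portion of a page image, in pixel coordinates (assumed layout)."""
    page: str    # hypothetical identifier of the page image in the archive
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class AlignedWord:
    """One word of the transcription, linked to the image region where it appears."""
    text: str
    region: Region


def find_word(transcription: List[AlignedWord], query: str) -> List[Region]:
    """Return the regions of every transcribed word equal to `query`, ignoring case."""
    wanted = query.casefold()
    return [w.region for w in transcription if w.text.casefold() == wanted]


# Invented sample data: three words transcribed from a hypothetical page "folio_1r".
transcription = [
    AlignedWord("In", Region("folio_1r", 40, 30, 35, 22)),
    AlignedWord("principio", Region("folio_1r", 80, 30, 120, 22)),
    AlignedWord("in", Region("folio_1r", 40, 60, 30, 22)),
]

regions = find_word(transcription, "in")
for r in regions:
    print(r.page, r.x, r.y, r.width, r.height)
```

Storing the region alongside each word, rather than in a separate index, keeps the transcription and the image references in step when a scholar corrects a reading; a real system would probably also need to handle words split across lines.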

Data, Methods and Tools for Analysis and Generation of Linguistic Structures

Data, methods and tools are designed and developed to deal with linguistic structures at different levels of description. We give just a few examples. At the phonological level, the Italian lexicon (both lemmas and inflected word-forms) has been provided with phonological transcriptions, and a very large inventory of proper names, each accompanied by its phonological transcription, has been built within LRE ONOMASTICA.

A Reference Corpus of Contemporary Italian and an Italian Lexicon are being built within the framework of national and EC projects (among others LRE DELIS, LRE MULTEXT, ET-10-Cobuild, in addition to those mentioned above). A corpus of child language is being built within a national project to further the study of language acquisition. A semantic lexicon is being built in the framework of the LE EuroWordNet project, modelled on the Princeton WordNet, and linked to other European WordNets.

At the syntactic level two main lines are worth mentioning: i) the creation of grammars for Italian (e.g. in ATN and CGU (Complex Grammar Unit)), and ii) the implementation of development tools. In the EC projects COLSIT and LS-GRAM the aim is to port the EUROTRA grammars to the ALEP platform.

Within ESPRIT IDEAL and EUREKA PROMETHEUS, the ILC has contributed to the formulation of a theoretical model for communication and dialogue, applied to a number of man-machine interactions.

Knowledge Representation and Cognitive Research

The theoretical study of knowledge structures, analysed in their logical and cognitive components, allowed the development of a knowledge representation language in the form of a semantic network. This type of research has been developed in a number of national projects and in LRE CRISTAL, with the design of modules for conceptual modelling and for developing domain-specific ontologies.

Language Engineering Applications

The various data, methods, techniques and tools listed above are also used in a number of application projects, as components of systems in the broad area of information technology. We list a few of the relevant application areas here.

In the Information Retrieval field, the ILC has developed linguistic components in LRE RENOS for the extraction of legal terms from a corpus, while in LRE CRISTAL it has developed a multilingual environment for a natural language interface for the retrieval of information from financial texts. In LE TAMIC-P, a technical dictionary will be linked to a Knowledge Base in the pension domain.

In the digital libraries area, we can cite the MLAP MEMORIA project aimed at designing an intelligent reading environment to access large electronic libraries, exploiting both natural language and image processing techniques.

In the multimedia sector, some projects use NLP methods and tools in an environment which helps in learning, teaching and supporting the disabled, e.g. by improving vocal man-machine interaction.

In the didactic area, the ADDIZIONARIO project aims at creating a hypermedia linguistic laboratory to help children in the process of first language learning, in particular through the use of a multimedia dictionary.

Much fuller information on our activities can be found at our Web site:

Please contact:
Nicoletta Calzolari
Tel: +39-50-560481

