ERCIM News No.26 - July 1996 - CNR


by Antonio Zampolli

The discipline of Computational Linguistics (CL) as it is known today originated from the Machine Translation Research of the '50s and '60s. In 1966, a DARPA report attributed the failure of the machine translation project to achieve concrete results to the concentration of the work on 'small scale examples', and on 'miniature models of language'. However, despite the report's recommendations to work with 'real language problems, above a certain scale of grammar size, dictionary size and available corpora', in the following two decades, CL focused mainly on the computational implementation of linguistic models to deal with short, carefully selected sentences. This work produced excellent theoretical insights, but proved to be insufficient when the major funding agencies turned their attention to the potential of CL in answering the growing needs of managing and accessing the wealth of information transmitted by natural languages in the fast developing information society.

The demand for language industry products, to assist the traditional linguistic professions (translation, language teaching, etc.) and to develop new language processing applications (natural language interfaces, speech input and output, document retrieval and indexing, etc.), has led to the emergence of the language engineering paradigm, which requires the development of robust language processing components capable of dealing with real texts in concrete information and communication systems. This, in turn, requires the availability of reusable language resources, (typically large) sets of language data and descriptions, for building, training and evaluating written and spoken language processing systems: spoken and written corpora, lexica, grammars and terminologies.

In this way, the major national and international funding agencies and organisations have assumed and continue to have a key role in shaping our field. They are currently sponsoring a large part of the on-going research, through programmes which, determining the objectives of the largest projects, in practice define the main trends and strategies. For this reason, we felt it appropriate to invite leaders of the main North American (NSF) and European (EU) programmes to describe the general framework and the overall objectives of the sponsored activities (see articles by Ballim et al., etc.).

The global information society has clear multilingual implications. Recently, authoritative sources have warned that languages for which no adequate computer processing is being developed risk gradually losing their place in the global information society, with serious implications for the culture of which they are the vehicle, to the detriment of one of humanity's greatest values: cultural diversity. Bernard Quemada discusses the relationship between language technology and multilingualism, and presents a set of recommended actions.

International collaboration is particularly important for the progress of our field and the success of its applications, especially those aiming at producing multilingual information and communication services. Multilingual systems production requires close coordination between the partners of the different languages, to ensure the integrity of the components, and in particular the interoperability of the embedded language resources. Two major infrastructural European initiatives, EAGLES and ELRA, are described by Calzolari et al. and Choukri, respectively.

The paradigm shift is reflected in the topics of the articles presented by the ERCIM associated authors, which describe either individual projects or the general action lines of their Institutes. The current Zeitgeist is witnessed by the fact that several articles refer to the construction of (large) language resources, and/or focus on practical applications of real language use. The mandate and the programmes of some Institutes explicitly include the creation of multifunctional language resources, ie of resources intended for reuse by the R&D community (eg see the articles by Moens, Calzolari, Hajicová and Wittmann). Other language resources are created for direct use in the authors' specific systems.

Innovative methods are researched and tools constructed for extracting knowledge from language resources, e.g. identifying stylistic variations (Karlgren), term extraction from corpora and use of corpora for training language processing components (Samuelsson, Calzolari), and for structuring and organising the knowledge acquired (Calzolari and André et al.). In parallel, models, methods and tools are actively explored for annotating corpora and lexica with increasingly deep levels of linguistic description (see contributions from ILC-CNR and CWI). In this way, synergies and convergences are reinforced between abstract theoretical work and concrete data-driven activities.

Multilingual resources are developed for preparing multilingual applications: e.g. translation (CRCIM) and aids for the disabled (FORTH). Tools are developed for localising software (see the contribution from VTT). Robust linguistic processing tools are developed to annotate large, real language corpora: properly adapted, they can be incorporated as morphological, syntactic and semantic components in application systems (see contributions from Ballim et al., Prószéky and Calzolari).

The process of producing and using documents has received great attention in the language engineering framework. Documents transmit knowledge and present information organised for human understanding and work. The use of language technology achieves significant enhancements in all the document processing phases and in work productivity in general. Topics discussed in this issue's articles include document preparation and production, multilingual document generation, content representation and synthesis, document navigation and retrieval, and the extension of the multimedia capabilities of information systems (Alexa et al., André et al., Toussaint et al., Ballim et al., Pierrel et al.).

In the past, the activities in the field of speech and of (written natural) language processing have been developed separately for various reasons, including the different scientific and technical knowledge and disciplinary backgrounds required. Recently, the need for integration has become increasingly apparent.

The last decade has witnessed a dramatic improvement in speech recognition. The transition from laboratory demonstrations to commercial deployment has already begun, providing services like voice dialling, call routing and simple data entry. The next challenge is a fascinating one: to build spoken language interfaces in which both the user and the computer play active roles in conversation, in the user's own language. Speech interfaces are the most efficient, flexible and natural interfaces for humans, and will open access to the wealth of information and services in the information networks to a larger part of society. The realisation of these interfaces requires that language processing components work in synergy with speech recognition and generation components, to produce meaning representation and natural speech output. Activities in this direction are reported in the articles by Moens, Calzolari, Alexa et al., Stephanidis and Antona, Trancoso, Gyimóthy, and Pierrel et al. Trancoso refers not only to the activities of her Institute, but also to ELSNET and, in particular, to actions for joint student training.

Please contact:
Antonio Zampolli - ILC-CNR
Tel: +39 50 560481
