Natural Language Processing at INESC

by Luzia Wittmann

The Natural Language Group of INESC has developed Palavroso, a broad-coverage system for the automatic morphological processing of European and Brazilian Portuguese. It is intended to serve as the first block of a more complex system, as a base for the development of commercial products, and as a tool for scientific research on the Portuguese language.

The core of Palavroso is a rule-based morphological analyzer, to which lexicons of variable size can be linked. The current European Portuguese (EP) lexicon contains about 60,000 root words accepting up to 1,300,000 forms. The Brazilian Portuguese lexicon is nearing completion. Palavroso covers all of the inflectional morphology of Portuguese and correctly handles verbs with clitics (enclisis and mesoclisis), compounds, superlatives, augmentatives and diminutives.
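As a rough illustration of how a rule-based analyzer linked to a root lexicon can accept many more forms than it has entries, here is a minimal Python sketch. It is not Palavroso's implementation: the lexicon entries, the inflection classes, the suffix tables and the function names `analyze` and `count_forms` are all hypothetical, and the rules cover only a tiny, simplified subset of Portuguese inflection.

```python
# Toy lexicon: root -> inflection class (hypothetical entries, not Palavroso data).
LEXICON = {
    "cant": "verb_ar",  # cantar, "to sing"
    "fal": "verb_ar",   # falar, "to speak"
    "gat": "noun_o",    # gato, "cat"
}

# Suffix rules per inflection class: suffix -> features (illustrative subset only).
RULES = {
    "verb_ar": {
        "ar": "infinitive",
        "o": "present 1sg",
        "as": "present 2sg",
        "a": "present 3sg",
        "amos": "present 1pl",
        "am": "present 3pl",
    },
    "noun_o": {
        "o": "masc sg",
        "os": "masc pl",
        "a": "fem sg",
        "as": "fem pl",
    },
}

# Suffix used to build the citation form (lemma) for each class.
CITATION_SUFFIX = {"verb_ar": "ar", "noun_o": "o"}


def analyze(form):
    """Return every (lemma, features) pair licensed by the lexicon and rules."""
    form = form.lower()
    analyses = []
    # Try every split of the form into a non-empty root and a non-empty suffix.
    for i in range(1, len(form)):
        root, suffix = form[:i], form[i:]
        cls = LEXICON.get(root)
        if cls is not None and suffix in RULES[cls]:
            analyses.append((root + CITATION_SUFFIX[cls], RULES[cls][suffix]))
    return analyses


def count_forms():
    """Count the forms the toy lexicon accepts (one per root/suffix pairing)."""
    return sum(len(RULES[cls]) for cls in LEXICON.values())


print(analyze("cantamos"))
print(analyze("gatas"))
print(count_forms())
```

In this sketch three roots accept sixteen forms, which mirrors on a very small scale the relation between root words and accepted forms described above. A real analyzer would also need rules for irregular stems, spelling alternations, clitics and compounds, none of which are modelled here.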

The EP lexicon is compatible with the EAGLES recommendations and will be reused for the Portuguese lexicon of the LE-PAROLE project, in which INESC is participating as an associated partner of the Centre of Linguistics of the University of Lisbon (CLUL). The Group is sharing with CLUL the construction of a lexicon of 20,000 entries with morphosyntactic and syntactic information. The lexical entries will be selected with the help of the corpora tools based on Palavroso. In the same project, Palavroso will be used for tagging the Portuguese corpus (20 million running words).

Palavroso was designed to run on, and has been successfully installed on, two computer platforms, UNIX and Windows, and it is easily adaptable to other computer systems. Several applications have already been developed with Palavroso as their core. The most important are a set of corpora tools and a spelling checker named Correcto.

Correcto also runs on the UNIX and Windows platforms and, as it is intended to be commercialised, has been compared with the existing commercial spelling aids for European Portuguese. The results show that Correcto performs very well in all of the aspects measured. It is definitely better at providing fewer and more precise suggestions. In addition, its coverage of specific morphological phenomena (such as compounds and verbs with clitics) is far superior, owing to the underlying system. The measures and the method adopted have been published and are available. The adaptation of Correcto to Brazilian Portuguese is under way.

Contrastive studies of the European and Brazilian variants of Portuguese have been one of our research lines since 1994, bearing in mind that a common NLP effort across the several variants can be advantageous for the Portuguese language as a whole. The Natural Language Group produced a first survey of qualitative and quantitative differences between the two variants in a joint project with Logos Inc. (USA) and is now continuing work in this domain, expecting to obtain official funds to start a larger project in collaboration with CLUL and UNESP (University of the State of São Paulo, Brazil).

Created in 1987 as a joint centre with IBM (IBM-INESC Scientific Group), the Natural Language Group at INESC was reorganized in 1990 as a regular R&D group of INESC, losing contact with IBM and diversifying its cooperation at national and international levels. Since then the Group has been acquiring experience in several domains of NLP for Portuguese. At present, besides the activities mentioned above, the Group's main fields of interest are grammar checking, machine translation (including MT between closely related languages), computational aids for teaching Portuguese, and intelligent text retrieval.

Luzia Wittmann - INESC
