MorphoLogic - A Language Engineering Company from Hungary

by Gábor Prószéky

MorphoLogic, a Hungarian enterprise, was established in 1991 by a group of natural language (NL) researchers from the Hungarian Academy of Sciences and universities in Budapest. It is the only organization in Hungary that does R&D solely in the field of natural language processing. MorphoLogic has found that one way of bridging the gap between basic research and profit-oriented development is to do basic research within the company. The close ties that MorphoLogic maintains with academic labs in Budapest have resulted in a number of very profitable language products, and the sale of these products funds not only the company's profit-making activities but also its non-profit activities. MorphoLogic is involved in four EC-sponsored (Copernicus) projects, GLOSSER, GRAMLEX, MULTEXT-EAST and ELSnet goes East, with academic and industrial partners from more than ten European countries.

The name MorphoLogic refers to the company's focus on R&D work in morphology and syntax. R&D efforts over the past few years have focused on several related areas, each of which has one or more specific projects and partners associated with it. The Research Institute for Linguistics (RIL) of the Hungarian Academy of Sciences has been an important partner in the development of the string-based morpho-syntactic formalism, since the first users of commercial morpho-syntactic systems in Hungary were the lexicographers at RIL who were writing a corpus-based Historical Dictionary of Hungarian.

MorphoLogic's basic system consists of both a morphological analyser and a generator, and it can handle derivational and inflectional affixes as well as compounding. Both the linguistic description language and the internal database format with its search routines are in-house developments of MorphoLogic. The linguistic databases of these models cover various natural languages, from Hungarian through Eastern-European Slavic languages to German and English. All the kernel linguistic software has been written in standard C, so the MorphoLogic program modules are highly portable.

Spell-checking for highly inflectional, agglutinative languages such as Hungarian requires a thorough morphological analysis. This is very different from spell-checking for morphologically simple languages like English, where the task largely reduces to looking up the word in a word list. The morphology-based speller, called Helyes-e?, consists of lexicons and algorithms that enable the software to handle billions of possible words and to propose intelligent corrections for misspelled words. Helyes-e? can be customized by the user, and thus it is easily adapted to OCR, handwriting and speech recognition systems, where error types differ from typical typing errors.

The hyphenator, Helyesel, hyphenates any word-form, again using a morphological segmentation algorithm. This model is useful for languages in which morpheme boundaries override the usual hyphenation points; list-based hyphenation does not work in such languages. Helyesel also allows hyphenation with optional letter insertion or letter change.

Helyette, the so-called inflectional thesaurus, is a combination of a morphological analyser, a synonym dictionary and a morphological generator. It works by finding the lexical base of an input word and storing the inflectional information. It then offers the synonyms of the stem, and finally it generates the morpho-phonologically correct combination of the chosen synonym and the stored inflectional information. Helyette is meant to be language-independent. Its first implementation, for the complex suffix system of Hungarian, has been successful, and MorphoLogic is now looking to test the system on other languages.
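The three steps of the inflectional thesaurus (analyse, look up synonyms, regenerate) can be illustrated with a minimal Python sketch. It is not MorphoLogic's implementation: the two-stem lexicon, the suffix classes `back_a` and `front_e`, the tags `PL` and `INE`, and all function names are assumptions made up for this illustration, and the toy data covers only a handful of Hungarian word-forms.

```python
# Toy sketch of an inflectional thesaurus: analyse -> synonyms -> generate.
# All data and names below are hypothetical illustration material.

# Suffix allomorphs per (toy) stem class: plural and inessive case.
SUFFIXES = {
    "back_a": {"PL": "ak", "INE": "ban"},
    "front_e": {"PL": "ek", "INE": "ben"},
}
TAG_ORDER = ("PL", "INE")  # suffixes attach in this order

# Toy lexicon: stem -> suffix class.
STEMS = {"ház": "back_a", "épület": "front_e"}

# Toy synonym dictionary keyed by stem.
SYNONYMS = {"ház": ["épület"]}


def analyse(word):
    """Split a word-form into (stem, [tags]); return None if unknown."""
    for stem, cls in STEMS.items():
        if not word.startswith(stem):
            continue
        rest = word[len(stem):]
        tags = []
        for tag in TAG_ORDER:
            form = SUFFIXES[cls][tag]
            if rest.startswith(form):
                tags.append(tag)
                rest = rest[len(form):]
        if rest == "":
            return stem, tags
    return None


def generate(stem, tags):
    """Attach the suffix forms that match the stem's own class."""
    cls = STEMS[stem]
    return stem + "".join(SUFFIXES[cls][tag] for tag in tags)


def inflected_synonyms(word):
    """Offer synonyms of the stem, re-inflected like the input word."""
    analysis = analyse(word)
    if analysis is None:
        return []
    stem, tags = analysis
    return [generate(syn, tags) for syn in SYNONYMS.get(stem, [])]


if __name__ == "__main__":
    print(inflected_synonyms("házakban"))
```

The point of the sketch is that the synonym is not substituted as a bare string: the stored tags are re-realized with the suffix forms appropriate to the new stem, so the toy input "házakban" yields "épületekben" rather than a form carrying the original word's suffixes.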

The project concentrating on intelligent dictionaries is called MoBiDic (MorphoLogic Bilingual Dictionaries). The word or expression to be translated goes through a morphological segmentation, and its stem(s) form the actual query, which is matched either against the headwords or against the full entries. The latter option makes the lexical search similar to a free-text search with linguistic filters. MoBiDic is, therefore, able to treat dictionaries and corpora with the help of the same set of linguistic functions. Furthermore, the number of dictionaries is not limited: MoBiDic looks up a word in open dictionaries that you either buy or build yourself. The most recent features are support for any sort of multimedia and a well-defined API to MoBiDic, which is open to researchers.
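The idea of querying by stem rather than by surface form can be sketched as follows. This is only an illustration under stated assumptions, not the MoBiDic API: the entries, the suffix list, the crude suffix-stripping stemmer and the function names `stems_of` and `lookup` are all hypothetical.

```python
import re

# Hypothetical toy data: known stems and a few suffix strings.
KNOWN_STEMS = {"ház", "kert", "kapu"}
SUFFIXES = ("akban", "ekben", "ban", "ben", "ak", "ek", "ja")

# Toy bilingual dictionary: headword -> full entry text.
ENTRIES = {
    "ház": "house",
    "kert": "garden",
    "kapu": "gate; a ház kapuja: the gate of the house",
}


def stems_of(word):
    """Return the set of known stems a word-form can be reduced to."""
    word = word.lower()
    found = set()
    if word in KNOWN_STEMS:
        found.add(word)
    for suffix in SUFFIXES:
        if word.endswith(suffix) and word[:-len(suffix)] in KNOWN_STEMS:
            found.add(word[:-len(suffix)])
    return found


def lookup(query, full_entry=False):
    """Match the query's stems against headwords, optionally full entries."""
    query_stems = stems_of(query)
    hits = [headword for headword in ENTRIES if headword in query_stems]
    if full_entry:
        for headword, text in ENTRIES.items():
            tokens = re.findall(r"\w+", text)
            # Stem the entry's own words too, so inflected forms match.
            if headword not in hits and any(
                stems_of(token) & query_stems for token in tokens
            ):
                hits.append(headword)
    return hits


if __name__ == "__main__":
    print(lookup("házakban"))
    print(lookup("házakban", full_entry=True))
```

In headword mode the inflected query "házakban" finds only the entry for "ház"; in full-entry mode it also finds "kapu", whose entry text mentions "ház". Because the same stemming step is applied to queries and to entry text, the same function could in principle be pointed at a corpus instead of a dictionary.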

Two projects have been started recently. The first provides linguistic support for recognition tools. The second is the development of a parser that relies on the morphological engine Humor (High-speed Unification Morphology); the parser is called HumorESK, i.e., Humor Enhanced with Syntactic Knowledge.

