Data-driven Linguistics

by Marc Moens

When it was established in 1989, the Human Communication Research Centre in Edinburgh and Glasgow committed itself to expanding the scope of formal investigations of language to encompass as wide a range as possible of real language structure and real language use. Since then, HCRC has helped create, collect and disseminate corpus resources, which are available for study and use by researchers all over the world.

With the exception of sociolinguistics, traditional linguistics – including computational linguistics – has tended to concentrate on short, carefully constructed sentences. Not surprisingly, this has had consequences both for the types of theories developed, and for their applicability to real world problems. But recent years have seen a sea change in attitudes among researchers addressing human linguistic communication. Particularly in computationally-oriented research and development, people have turned away from abstract, theoretical work, towards concrete data-driven activities. This shift has been made possible because substantial bodies of text and speech have become available in electronic form. In turn, the shift in attitude has increased demand for real data, and as a result, there has been a dramatic growth in the number of new text collections or corpora.

These corpora tend to be large - in the order of hundreds of millions of words of text. By way of comparison, a page of printed text usually contains around 600 words, so a 100 million word corpus occupies more than 150,000 printed pages.
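The page estimate above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the figure of 600 words per page quoted in the text; the function name pages_for_corpus is illustrative only and does not refer to any HCRC tool.

```python
# Rough page count for a corpus, using the 600 words-per-page
# figure quoted in the text (an approximation, not a measurement).
WORDS_PER_PAGE = 600


def pages_for_corpus(word_count, words_per_page=WORDS_PER_PAGE):
    """Return the number of printed pages needed, rounding up."""
    # Ceiling division using integers only: -(-a // b) rounds a / b upwards.
    return -(-word_count // words_per_page)


if __name__ == "__main__":
    # A 100 million word corpus, as in the comparison above.
    print(pages_for_corpus(100_000_000))
```

Under these assumptions the result is a little under 170,000 pages, consistent with the "more than 150,000" quoted above.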

Our initial foray into the field of linguistic resources, the HCRC Map Task Corpus, was built up from 128 dialogues between people carrying out a simple cooperative task. Each of the two participants has a schematic map which the other cannot see, and the two collaborate to reproduce on one of the maps a route already printed on the other. The dialogues were annotated at several levels of detail, and these annotations, together with the maps and the digitally-recorded speech itself, are included on an eight-disc CD-ROM set.

Other corpus collection work was concerned with textual, rather than spoken, material, such as the European Corpus Initiative, carried out by HCRC under the aegis of the European Union and the Association for Computational Linguistics. Until our production and distribution of the ECI disc (100 million words in 22 languages), researchers in languages other than English had essentially no access to large amounts of real text in their language in electronic form. A new corpus collection project covers financial journalism between 1989 and 1991 across six European languages. The balanced nature of this collection makes it particularly suitable for comparative study, and as the basis for the development of multilingual systems.

As well as helping provide linguistic resources for worldwide use, HCRC carries out various research projects of its own, using these and other corpus resources. In the course of this work, a number of tools have been developed which help researchers find their way through these large corpora and derive significant generalisations from them. These tools are distributed to other R&D groups via HCRC's Language Technology Group. Full details on how the tools can be downloaded are available on the Web.

