DIGITAL LIBRARIES
ERCIM News No.27 - October 1996


Bipartite Graphs and Automatic Generation of Thesauri


by Michiel Hazewinkel

The key to information finding (information retrieval) in large and very large collections is METADATA, that is, extra information added to the main document (article, chapter, ...) designed to make it findable - particularly metadata in the form of subject indicators (key words and key phrases) and classification information.

A thesaurus and classification scheme can be of enormous value in making a given (large) corpus of material accessible. This is 'proved', for example, by the very considerable effort that Elsevier Science Publishers invests in maintaining the thesaurus of 30,000 concepts underlying EMBASE (Excerpta Medica): four full-time employees. Of course there are many other aspects to the general idea of metadata. For instance, the matter of suitable containers for metadata has attracted considerable interest in recent years.

Building a sufficiently large thesaurus by hand is an enormously expensive and laborious task; certainly if we are thinking of one of the size really needed to deal with a field like mathematics or computer science (let alone a field like physics). For mathematics I estimate that a thesaurus of key phrases (not single words - these are not information rich enough) should comprise about 120,000 terms. Thus it is natural to begin to think about the automatic generation of thesauri. By now there are (first generation) commercial and academic programs available for extracting index and thesaurus terms (noun phrases, prepositional noun phrases) from a corpus of electronic texts. Other linguistic software can be used to 'standardize' the raw list of key phrases thus obtained. The available data then consist of a bipartite graph between terms and documents. At this stage a number of mathematical and computer science problems arise, and I should like to mention some of them.
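The term-document bipartite graph can be sketched minimally as an incidence mapping from each extracted key phrase to the set of documents it occurs in; all term and document names below are illustrative, not taken from any real corpus:

```python
# Minimal sketch of a term-document bipartite graph: each term is
# linked to the documents it indexes. All names are illustrative.
term_docs = {
    "bipartite graph": {"doc1", "doc3"},
    "thesaurus": {"doc1", "doc2", "doc3"},
    "Hamming distance": {"doc2", "doc3"},
}

# The inverse mapping (document to terms) follows directly from the
# edge set; together the two mappings are the bipartite graph.
doc_terms = {}
for term, docs in term_docs.items():
    for d in docs:
        doc_terms.setdefault(d, set()).add(term)

print(sorted(doc_terms["doc3"]))  # the key phrases indexing doc3
```

Either side of the graph can serve as the index: terms point to documents for retrieval, and documents point to terms for comparing terms with one another.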

Thesaurus problems

A thesaurus, according to ISO standard 2788 and a number of other national and international standards, is much more than an alphabetical list of standardized key phrases. Fortunately, the bipartite graph lets each term be represented by its incidence vector over the documents (and each document by its incidence vector over the terms), so the Hamming distance defines a metric on both the set of terms and the set of documents. This gives a notion of distance between terms (and between documents), and thus a quantitative version of the thesaurus concept 'related term'. It also opens promising possibilities for such things as neighbourhood search. A much harder problem mathematically is how to capture (quantitatively) the thesaurus concepts of 'broader term' and 'narrower term'.
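The Hamming-distance construction can be sketched as follows, with an illustrative document list and term-incidence data; terms whose incidence vectors differ in few positions occur in nearly the same documents, which is what makes a small distance a plausible proxy for 'related term':

```python
# Sketch: Hamming distance between two terms, each represented by its
# 0/1 incidence vector over a fixed document list. Data illustrative.
documents = ["doc1", "doc2", "doc3", "doc4"]
term_docs = {
    "graph theory": {"doc1", "doc3"},
    "graph colouring": {"doc1", "doc3", "doc4"},
    "measure theory": {"doc2"},
}

def hamming(term_a, term_b):
    # Number of documents in which exactly one of the two terms occurs.
    va = [d in term_docs[term_a] for d in documents]
    vb = [d in term_docs[term_b] for d in documents]
    return sum(a != b for a, b in zip(va, vb))

print(hamming("graph theory", "graph colouring"))  # 1: closely related
print(hamming("graph theory", "measure theory"))   # 3: unrelated
```

A neighbourhood search around a term then amounts to listing all terms within a chosen Hamming radius of it.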

Bottom up classification

The notion of distance on the set of terms makes it possible to apply clustering techniques and thus to define a hierarchy. The problem is to relate this bottom up hierarchy with the top-down hierarchy represented by a classification scheme for a given field.
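A bottom-up hierarchy of the kind meant here can be sketched with single-linkage agglomerative clustering: repeatedly merge the closest pair of clusters until one remains. The toy distance table below is illustrative, not derived from a real corpus:

```python
# Sketch of bottom-up (agglomerative, single-linkage) clustering over
# a small symmetric term-distance table. Distances are illustrative.
dist = {
    ("ring", "field"): 1,
    ("ring", "topology"): 4,
    ("field", "topology"): 5,
}

def d(a, b):
    return dist.get((a, b)) or dist.get((b, a))

clusters = [frozenset({t}) for t in ("ring", "field", "topology")]
history = []  # the sequence of merges defines the hierarchy
while len(clusters) > 1:
    # Single linkage: cluster distance = minimum pairwise term distance.
    i, j = min(
        ((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
        key=lambda ij: min(d(a, b)
                           for a in clusters[ij[0]]
                           for b in clusters[ij[1]]),
    )
    merged = clusters[i] | clusters[j]
    history.append(merged)
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
    clusters.append(merged)

print(history)  # 'ring' and 'field' merge first, then 'topology' joins
```

The open problem the text names is then to reconcile the tree produced by such merges with the top-down tree of an existing classification scheme.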

Missing terms problem

The set of terms generated as above will have a tendency to be incomplete, particularly as regards the more general, broader terms (which can also be thought of as missing centres of clusters). The problem is how to recover these missing centres.
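One simple heuristic for a candidate missing centre, offered here only as an illustrative sketch (the heuristic and all data are assumptions, not the author's method), is the majority-vote incidence profile of a cluster: a broader term would be expected to occur in most of the documents its narrower terms occur in.

```python
# Sketch: candidate "missing centre" of a term cluster as the
# majority-vote incidence profile of its members. Data illustrative.
documents = ["doc1", "doc2", "doc3", "doc4"]
cluster = {
    "Banach space": {"doc1", "doc2"},
    "Hilbert space": {"doc1", "doc2", "doc3"},
    "normed space": {"doc1", "doc3"},
}

# A document belongs to the centre profile if a strict majority of
# the cluster's terms occur in it.
centre = [
    sum(doc in docs for docs in cluster.values()) * 2 > len(cluster)
    for doc in documents
]

# A hypothetical broader term (say, "functional analysis") would be
# sought with approximately this incidence pattern.
print([doc for doc, flag in zip(documents, centre) if flag])
```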

Matching problem

Generating a thesaurus in one go for a large area is not feasible (mathematics is as large a field as I care to contemplate in this respect; the life sciences or physics are far larger). Thus the problem arises of constructing several thesauri (to be thought of as charts forming part of an atlas) and of matching them, i.e. describing the overlap between them.
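In its simplest form, the matching problem can be sketched as finding the terms two charts share after a light normalization; the shared terms anchor the overlap between the charts. The term lists and the lower-casing normalization below are illustrative assumptions:

```python
# Sketch of the matching problem for two partial thesauri ("charts"):
# after simple normalization, their overlap is the set of shared
# terms. Term lists are illustrative.
algebra_chart = {"Ring", "Field", "Module", "Group representation"}
analysis_chart = {"field", "Banach space", "group representation"}

def norm(term):
    # Minimal standardization; real matching would need far more
    # linguistic normalization (inflection, synonyms, word order).
    return term.lower()

overlap = ({norm(t) for t in algebra_chart}
           & {norm(t) for t in analysis_chart})
print(sorted(overlap))  # terms occurring in both charts
```

The multilingual version of the problem mentioned below replaces this trivial normalization with translation, which is what makes it genuinely hard.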

These and various other related problems (such as the multilingual version of the Matching problem) form the subject matter of a number of research projects being carried out at CWI in the context of Digital Libraries. Related efforts involve interactive books and multimedia.

Please contact:
Michiel Hazewinkel - CWI
Tel: +31 20 5924204
E-mail: mich@cwi.nl

