ERCIM News No.25 - April 1996 - INRIA

Authors may index their own Web Documents

by Jacques André and Hélène Richy

The main way of accessing information on the World Wide Web is navigational. Many research projects and commercial crawlers have been designed for that purpose; in particular, the concept of cartography is one of the most successful. On the other hand, many studies are concerned with automatic indexing: tools are written to extract from (full) texts the pertinent information the reader is looking for.

Between these two approaches, structural and statistical, we propose a third one, based on the traditional technique: authors have the best knowledge of the contents of their documents, and they are able to give keywords summarizing their thoughts. However, many problems remain unsolved.

A first approach, using the structured document editor Grif, allowed us to produce large index tables for traditional printed books, such as the Cartulaire de Saint Laurent, the first cartulary written in French during the 14th century.

Extending such tools to the Web requires many improvements at various levels. Note that the notion of index is used here in a broad sense that also covers bibliographies, references, tables of contents, etc.

From the authoring system's point of view, a set of three tasks is useful:

a preliminary task is to decide which entities are to be indexed and how: a marking tool should enable the creation of such descriptions;
a second task consists in specifying how index tables will be constructed: an index selector should propose a list of index tables to be constructed;
finally, an index builder should produce structured and formatted index tables after collecting and sorting information.
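
The three tasks above can be illustrated with a minimal Python sketch. It is not the Grif or Thot implementation: the `Mark` structure, the function names `select_tables` and `build_index`, and the sample locations are all hypothetical, chosen only to show how marking, selecting and building could fit together.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """One author-supplied index mark (hypothetical structure)."""
    term: str      # keyword chosen by the author
    kind: str      # index table the mark belongs to, e.g. "subject"
    location: str  # where the marked entity lives, e.g. a page and anchor


def select_tables(marks):
    """Index selector: list the index tables that could be built."""
    return sorted({m.kind for m in marks})


def build_index(marks, kind):
    """Index builder: collect, sort and group the marks of one table."""
    entries = defaultdict(list)
    for m in marks:
        if m.kind == kind:
            entries[m.term].append(m.location)
    # Sort terms case-insensitively; sort and de-duplicate locations.
    return [
        (term, sorted(set(locs)))
        for term, locs in sorted(entries.items(), key=lambda kv: kv[0].casefold())
    ]


# Marking step: in a real editor the author would attach these marks
# to document elements; here they are simply listed by hand.
marks = [
    Mark("index", "subject", "doc1.html#s2"),
    Mark("Grif", "subject", "doc1.html#s1"),
    Mark("index", "subject", "doc2.html#s1"),
    Mark("André", "author", "doc1.html#top"),
]

tables = select_tables(marks)
subject_index = build_index(marks, "subject")
```

Keeping the marks separate from the tables means the same set of marks can feed several index tables (subject index, author index, and so on) without the author marking the document twice.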

For large documents, from the Web (i.e. the reader's) point of view, such an index is not a static document but rather an active one that has to be kept up to date. Several situations require index documents to be updated, such as:

the content of some previously indexed documents has changed
new pages have to be indexed
a new index table is required (with new options).

Various updating strategies may be proposed:

immediate updating, which is more or less unrealistic
updating when index tables are accessed, which supposes that all links are checked before the index tables are displayed
updating on the user's demand.
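
As an illustration of the second strategy listed above (updating when the index is accessed), here is a minimal Python sketch. It is an assumption-laden toy, not the system described in this article: the `LazyIndex` class, the `fetch` and `extract_terms` callables, and the use of content fingerprints to detect changes are all hypothetical choices, and an in-memory dictionary stands in for Web pages.

```python
import hashlib


def fingerprint(text):
    """Digest used to detect that a document's content has changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class LazyIndex:
    """Index table refreshed only when a reader accesses it."""

    def __init__(self, fetch, extract_terms):
        self.fetch = fetch                  # callable: () -> {location: text}
        self.extract_terms = extract_terms  # callable: text -> iterable of terms
        self.seen = {}                      # location -> fingerprint at last build
        self.table = {}                     # term -> sorted list of locations
        self.rebuilds = 0                   # number of rebuilds, for inspection

    def access(self):
        """Check every indexed document, rebuild if needed, return the table."""
        docs = self.fetch()
        current = {loc: fingerprint(text) for loc, text in docs.items()}
        if current != self.seen:  # changed, added or removed documents
            table = {}
            for loc, text in docs.items():
                for term in self.extract_terms(text):
                    table.setdefault(term, set()).add(loc)
            self.table = {t: sorted(locs) for t, locs in sorted(table.items())}
            self.seen = current
            self.rebuilds += 1
        return self.table


# Toy document store standing in for Web pages (hypothetical data).
pages = {"a.html": "index thot", "b.html": "grif"}
index = LazyIndex(lambda: dict(pages), str.split)

first = index.access()     # first access builds the table
second = index.access()    # nothing changed: the table is reused
pages["c.html"] = "thot"   # a new page appears
third = index.access()     # detected on access, table rebuilt
```

The cost of this strategy is visible in `access`: every document has to be checked before the table is shown, which is the price paid for never displaying a stale index.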

At INRIA-Rennes, we are working on such index manipulation in the context of the Thot system (Opera project/Inria). Work is in progress to implement a system based on the second strategy (updating when the index is accessed) within the framework of the Tamaya environment.

More information on the Web: http://www-bi.imag.fr/OPERA/BibOpera.html.

Please contact:
Jacques André or Hélène Richy - INRIA
Tel: +33 99 84 71 00
E-mail: jacques.andre@irisa.fr or helene.richy@irisa.fr
