Reactive Structured Documents

by Afzal Ballim, Christine Vanoirbeek, and Giovanni Coray

The emergence of document-oriented applications and WWW tools clearly demonstrates the evolving status of electronic documents. Recent developments such as the HotJava browser follow this approach and extend Internet browsing techniques by making it possible to attach behaviour to documents, transforming static documents into dynamic applications. The Laboratoire d'Informatique Theorique (LITH) at the Swiss Federal Institute of Technology (EPFL) pursues a number of themes along this line of research.

Structurally Marked Documents: The representation of electronic documents has moved away from the simple encoding practices of early usage and towards richer encodings that indicate the structure of the document and support multimedia extensions. Word processors have used various proprietary methods for representing texts, but recently there has been a tendency towards developing and supporting standards for this task. The Standard Generalised Mark-up Language (SGML) is one of the best known of these standards and has mechanisms for defining document classes.

Hypertext: While traditional texts are linear in nature (they are read straight through) there has been a trend in recent years with electronic texts to go beyond this linearity into texts that have a non-linear organisation (hypertexts) and contain non-textual elements (hypermedia).

Document Understanding: We are also interested in applying techniques of natural language processing and discourse analysis to document understanding, which is necessary for next-generation indexing and retrieval techniques for access to document collections such as the WWW.

Dynamic Documents: Once created, a normal textual document is a static and unchanging object. We are interested, however, in documents that can change interactively in response to the reader, as well as in response to collaborative creation and updating by multiple authors.

LITH is deeply involved in research on topics related to complex structured documents. Since 1993 LITH has actively participated in the development of the WWW and in related initiatives. We developed a new WWW browser (SpiderWooman) that extended the functions provided by early browsers. This software represented a new generation of WWW tools that unified the user interface of the OS with the web client and integrated a service-based approach.


A number of recent and ongoing projects at LITH emphasize the above-mentioned themes. A brief description of the major ones is given below.

IDEA: The Innovative Document Engineering Applications project addresses the problems met by information providers in handling large volumes of information obtained from heterogeneous sources. The approach is to exploit the use of a structured document at each stage of the information process. It is planned to develop methods to capture the implicit multiple structures of electronic documents and facilitate their integration in a local user environment. The project emphasizes three points: management of multilingual documents will be explicitly supported; information access and retrieval will be improved by using intelligent querying techniques and the logical and semantic structures of documents; and electronic dissemination will be investigated with a view to improving navigation in universal hypertext information systems.

Document Collections & Reactive Hypertext Documents: The preliminary objective of this project was to use the macro- and micro-structure of documents to establish hypertext links that make explicit the relations between the documents in a collection. In particular, the project was concerned with developing different measures that allow for the classification and generation of structures (hierarchies, graphs, aggregates and composites) that facilitate management of, and navigation within, a relatively large collection of documents. We now aim to extend this by integrating work on robust parsing by members of LITH. Applying robust analysis techniques to important structural elements of documents will allow for intelligent indexing and retrieval of documents. By parsing these elements, complex information about a document can be gathered and used for indexing and for the creation of dynamic documents that provide different views of a document collection.
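As a rough illustration of how a measure over documents can be turned into explicit hypertext links, the following minimal sketch proposes a link between two documents whenever their word-count vectors are sufficiently similar. The choice of cosine similarity, the threshold value, and the names term_counts, cosine and propose_links are all assumptions made for this example; they do not describe the measures actually developed in the project.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_counts(text):
    """Lower-case the text and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def propose_links(collection, threshold=0.3):
    """Return (id_a, id_b, score) for each document pair scoring >= threshold.

    `collection` maps a document id to its text. The threshold is an
    arbitrary illustrative value, not one taken from the project.
    """
    vectors = {doc_id: term_counts(text) for doc_id, text in collection.items()}
    links = []
    for a, b in combinations(sorted(vectors), 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            links.append((a, b, round(score, 3)))
    # Strongest proposed links first.
    return sorted(links, key=lambda link: -link[2])
```

Calling propose_links on a dictionary of document texts yields candidate links ordered by strength, which could then be grouped into graphs or hierarchies. A measure based on document structure rather than raw word counts would slot into the same place as cosine.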

HIPOCAMPE: The objective of this project is to automate the production of hypertext documents from pedagogical information text books. The process starts with an optical capture phase, followed by a structural analysis phase and the addition of conceptual links that connect diverse information in the document. An expert system proposes to the student a route through the hypertext document that takes account of the student's level of knowledge. LITH's contribution to this project concerns the structure of the document, the modelling of the hypertext, and the improvement of the user interface with new navigation methods.

AGENDA: The objective of this project is the realisation and validation of an original system that identifies, quantifies, and qualifies in real time the activity of a health care unit in the hospital environment. Our work is on the definition of a user-friendly and flexible information gathering environment based on the optical recognition of certain hand-written information contained in an ad hoc agenda used by the unit personnel.

DICA: This project aims to provide an environment for the cooperative development of distributed applications. Its main points of development are:

DICA intends to offer these services using a uniform visual metaphor in which the distinction between user interface and dynamic document blurs.

