Combining XML and Description Logics for Describing and Querying Documents
by Rim Alhulou and Amedeo Napoli
Representing and handling documents by their content is the core of ESCRIRE, a coordinated action involving three INRIA research teams.
In order to take advantage of the huge amount of information available on the Web and in document databases, it is necessary to design efficient techniques for retrieving, extracting and querying documents. Ontologies are playing an increasing role in these tasks, especially for content-based annotation and manipulation of documents. The objective of the ESCRIRE project is to use an ontology to annotate a set of abstracts of biological documents extracted from the NIH Medline public database, and then to query the annotated documents within a knowledge representation (KR) formalism. In the following, we restrict our attention to the use of a description logic system, namely RACER, within the ESCRIRE project.
In order to manipulate documents by their content, annotations are attached to documents and a domain ontology has been designed for this purpose. The annotations and the ontology are described within a pivot language based on XML. This pivot language relies on a set of syntactic rules controlled by a DTD. The pivot language plays the role of a bridge between documents and the description logics (DL) formalism: every element in the ontology and every annotation has a corresponding element in the DL formalism. The pivot language is also used to describe SQL-like queries, which are in turn represented within the DL formalism to be handled by the DL classifier. The pivot language has been built specifically for the needs of the application, and is not simply another XML-based language for document description.
Briefly, the ontology consists of a hierarchy of classes representing concepts, eg genes, and relations between classes, eg interactions between concepts. Each class in the ontology is described by a set of attributes and roles representing the properties of the class.
Two types of classes are available: defined classes with necessary and sufficient conditions, and primitive classes, with only necessary conditions.
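The difference between the two types of classes can be illustrated with a minimal Python sketch. It is not part of ESCRIRE: the names OntologyClass, is_instance, Gene, Regulator and the property strings are all hypothetical, and properties are reduced to plain labels. The point it shows is that membership in a defined class can be inferred from properties alone, whereas membership in a primitive class has to be asserted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyClass:
    name: str
    conditions: frozenset  # properties every instance must have
    defined: bool          # True: the conditions are also sufficient


def is_instance(cls, properties, asserted_classes=frozenset()):
    """Decide whether an object with these properties belongs to cls."""
    if cls.name in asserted_classes:
        return True  # explicit membership holds for both kinds of class
    # Only a defined class can be recognised from properties alone.
    return cls.defined and cls.conditions <= set(properties)


# Hypothetical classes, for illustration only.
gene = OntologyClass("Gene", frozenset({"has_name"}), defined=False)
regulator = OntologyClass(
    "Regulator", frozenset({"is_gene", "regulates_some_gene"}), defined=True
)
```

Under these assumptions, an object carrying has_name is not recognised as a Gene unless that membership is stated, while any object carrying both is_gene and regulates_some_gene is recognised as a Regulator.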
A document is composed of three parts: (1) a textual abstract; (2) a set of classic metadata (Dublin Core); and (3) a set of metadata concerning the content (annotations). The pivot language is used for representing the annotations according to the ontology, especially the objects and relations referenced in the documents, in this case, the genes and the interactions. Objects (instances of classes) and relations (instances of classes of relations) are described by their properties (names and values of roles and attributes), and the class to which they belong. The structure of a query follows the classical SELECT-FROM-WHERE schema, with some additional constructs available.
The classes and relations of the ontology are translated into concepts within the DL system. Relations are also represented as concepts. All attributes and roles are translated into roles in the DL system. The two types of classes in the ontology - defined and primitive - are transformed into DL concepts according to their status: defined classes become defined concepts in the DL system, while primitive classes become primitive concepts. However, the properties of a relation, eg reflexivity, antisymmetry and transitivity, must be managed by a module that is external to the DL system.
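What such an external module might do can be sketched as follows. This is an assumption about its general shape, not a description of the ESCRIRE implementation: the function names and the part_of example are hypothetical. The sketch takes the pairs asserted for a relation and adds the pairs implied by symmetry or transitivity, so that they could then be asserted in the DL system as well.

```python
def symmetric_closure(pairs):
    """Add (b, a) for every asserted pair (a, b)."""
    return set(pairs) | {(b, a) for (a, b) in pairs}


def transitive_closure(pairs):
    """Add (a, c) whenever (a, b) and (b, c) are present, until stable."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Hypothetical transitive relation between three individuals.
part_of = {("x", "y"), ("y", "z")}
completed = transitive_closure(part_of)
```

The naive fixed-point loop is chosen for readability; for large relations a dedicated graph algorithm would be preferable.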
Each document is then represented as an individual within the DL system. Objects and relations referenced in the ontology or in the documents are translated into individuals within the DL system. Moreover, an individual is related to all the individuals filling its roles and attributes. Individuals representing objects and relations are linked to those representing the documents where they are referenced.
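The translation of an annotated document into individuals can be sketched in Python. The XML fragment below is purely illustrative: its tag and attribute names (document, object, relation, attribute, role, id, class, ref) and the references role are assumptions, not the ESCRIRE DTD, and the assertions are plain tuples rather than RACER syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical pivot-language fragment (names are illustrative only).
PIVOT = """
<document id="doc1">
  <object id="g1" class="Gene"><attribute name="name" value="geneA"/></object>
  <object id="g2" class="Gene"><attribute name="name" value="geneB"/></object>
  <relation id="r1" class="Interaction">
    <role name="source" ref="g1"/>
    <role name="target" ref="g2"/>
  </relation>
</document>
"""


def to_assertions(xml_text):
    """Turn one annotated document into DL-style assertions (tuples)."""
    root = ET.fromstring(xml_text)
    doc = root.get("id")
    # The document itself becomes an individual.
    assertions = [("instance", doc, "Document")]
    for node in root:
        if node.tag not in ("object", "relation"):
            continue
        ind = node.get("id")
        # Objects and relations both become individuals of their class...
        assertions.append(("instance", ind, node.get("class")))
        # ...linked to the document in which they are referenced.
        assertions.append(("related", doc, "references", ind))
        # Each role or attribute links the individual to its filler.
        for prop in node:
            filler = prop.get("ref") or prop.get("value")
            assertions.append(("related", ind, prop.get("name"), filler))
    return assertions


assertions = to_assertions(PIVOT)
```

In this sketch the relation r1 is itself an individual, with its two roles pointing at the individuals g1 and g2, which mirrors the choice of representing relations as concepts.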
Classification and subsumption are the main reasoning methods in the DL systems, and they are used for processing queries. A query Q can be translated into one or more query concepts Ci in the DL system. Each query concept Ci is then classified in the concept hierarchy. The answer to the query Q is constituted by the set of instances of each classified query concept Ci.
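The principle of answering a query by classification can be illustrated with a deliberately simplified Python sketch. It assumes that every concept is a conjunction of atomic features, so that subsumption reduces to a subset test; a real DL system such as RACER relies on far more expressive reasoning. All concept, feature and individual names are hypothetical.

```python
def subsumes(general, specific):
    """A conjunction subsumes another if all its features are required there."""
    return general <= specific


def classify(query, hierarchy):
    """Return the named concepts above and below the query concept."""
    subsumers = {n for n, c in hierarchy.items() if subsumes(c, query)}
    subsumees = {n for n, c in hierarchy.items() if subsumes(query, c)}
    return subsumers, subsumees


def answer(query, individuals):
    """Instances of the query concept: individuals meeting every feature."""
    return sorted(n for n, feats in individuals.items() if subsumes(query, feats))


# Hypothetical concept hierarchy and individuals.
hierarchy = {
    "Gene": frozenset({"gene"}),
    "InteractingGene": frozenset({"gene", "interacts"}),
}
individuals = {
    "g1": frozenset({"gene", "interacts", "human"}),
    "g2": frozenset({"gene"}),
}

query = frozenset({"gene", "interacts", "human"})
subsumers, subsumees = classify(query, hierarchy)
result = answer(query, individuals)
```

Here the query concept is placed below both Gene and InteractingGene, and only g1 is returned as an instance. Note that classify returns all subsumers and subsumees rather than only the direct ones, which keeps the sketch short.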
A number of problems appeared during the development of the project. The following difficulties point to extensions that a KR formalism needs in order to represent and handle documents by their content. The DL system does not provide any special constructor for binary or n-ary relations, in particular for handling properties of relations such as reflexivity, symmetry and transitivity. The possibility of working with or without the closed-world assumption was not available, and would have been very useful. The translation and evaluation of queries were neither simple nor always efficient. A more sophisticated query-handling module, based on the formalism of conjunctive queries, still has to be designed and implemented for realistic document manipulation.
The first results of the ESCRIRE project show that a DL system such as RACER can be used with relative success for representing an ontology of a domain, and for describing and querying a set of documents in an XML-like form. More work must still be done to solve the problems mentioned above. A number of theoretical tools exist, but still have to be made practical for an effective and realistic manipulation of documents by their content.