Querying Heterogeneous Semi-Structured Data
by Vassilis Christophides, Michel Scholl and Anne-Marie Vercoustre
The central issue in the Aquarelle project is to provide uniform access to heterogeneous collections of data on the Internet. The information discovery system is based on a set of so-called Access Points which provide uniform Z39.50-based access to the various collections. This flat and minimal access model has the advantage of simplicity but does not exploit the benefits of querying archives using more structure and semantics.
In this minimal approach to heterogeneous data, Access Points are not used to describe the various cultural objects but only as a means of mapping local data structures into a 'common access model'. The advantages are:
- legacy cultural databases (ie, archives) are not altered within the Aquarelle network and integration of new information sources is trivial
- semantic discrepancies among heterogeneous data sources are roughly captured by this minimal access model
- mediation between the Access Server and heterogeneous data servers is simplified.
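The mapping of local data structures onto a flat common access model can be sketched as follows. This is only an illustration of the idea, not the Aquarelle implementation: the source names, the local field names and the Access Point names (title, creator, place) are all hypothetical.

```python
# Hypothetical per-source mappings from local field names to common Access Points.
ACCESS_POINT_MAPPINGS = {
    "museum_db": {"titre": "title", "auteur": "creator", "lieu": "place"},
    "archive_db": {"heading": "title", "author": "creator", "location": "place"},
}


def to_access_points(source, record):
    """Project a local record onto the flat common access model.

    Local fields that have no Access Point mapping are dropped.
    """
    mapping = ACCESS_POINT_MAPPINGS[source]
    return {mapping[field]: value for field, value in record.items() if field in mapping}


def search(records, access_point, term):
    """Case-insensitive substring search on a single Access Point."""
    term = term.lower()
    return [r for r in records if term in str(r.get(access_point, "")).lower()]


# Two records with different local structures end up in the same flat view.
flat_view = [
    to_access_points("museum_db", {"titre": "Chateau de Cognac", "salle": "B2"}),
    to_access_points("archive_db", {"heading": "Eglise de Saintes", "author": "unknown"}),
]
matches = search(flat_view, "title", "cognac")
```

In this sketch the local field "salle" has no Access Point and simply disappears from the common view.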
However, most of the existing structural and semantic richness of the archives and folders is lost for querying, since no translation is foreseen between the richly structured archives and folders on the one hand and the Access Points view in the Access Server on the other. The data source structure is extremely useful:
- for facilitating query refinement and improving precision, compared to keyword or full-text based search
- for addressing fine-grained chunks of information more easily than with hypertext navigation
- for enabling sophisticated data integration from various data sources.
New Trends in Querying Heterogeneous Data Sources
Providing integrated access to multiple, distributed, heterogeneous databases and other information sources has been studied in the database research community for well over a decade, from Multidatabase/Federated approaches (Pegasus, Amos, Garlic) to new-generation mediator-based systems (TSIMMIS, DISCO, Information Manifold).
The feature common to all multidatabase architectures is the existence of a Canonical or Common Data Model (CDM), which reduces the complexity of mapping data and commands between the different data models and languages of the component sources. Such an approach is appropriate for integrating a small number of sources whose structure is known and stable.
New-generation systems aim at integrating a large number of sources storing data, possibly with no structure or with implicit structure, such as the Web or Information Retrieval Systems (IRS). In this context, a global schema, or even a set of federated schemas, is hard to implement: the emerging mediation services embed the knowledge needed to process specific sources of information. Each source is wrapped with a translator (or wrapper) that logically converts the underlying data objects into a common information format.
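A minimal sketch of the wrapper idea is given below, assuming two toy sources (one JSON, one CSV). It does not reproduce the architecture of TSIMMIS, DISCO or Information Manifold; the class names, the source field names and the common format (label, description) are hypothetical.

```python
import csv
import io
import json


class JsonWrapper:
    """Wraps a hypothetical JSON source whose items have 'name' and 'desc' fields."""

    def __init__(self, text):
        self.text = text

    def objects(self):
        for item in json.loads(self.text):
            yield {"label": item["name"], "description": item.get("desc", "")}


class CsvWrapper:
    """Wraps a hypothetical CSV source with 'title' and 'notes' columns."""

    def __init__(self, text):
        self.text = text

    def objects(self):
        for row in csv.DictReader(io.StringIO(self.text)):
            yield {"label": row["title"], "description": row["notes"]}


class Mediator:
    """Sends one keyword query to every wrapped source and merges the results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, keyword):
        keyword = keyword.lower()
        return [
            obj
            for wrapper in self.wrappers
            for obj in wrapper.objects()
            if keyword in obj["label"].lower()
        ]


mediator = Mediator([
    JsonWrapper('[{"name": "Cognac farm", "desc": "18th century"}]'),
    CsvWrapper("title,notes\nSaintes church,12th century\n"),
])
hits = mediator.query("cognac")
```

The mediator never sees the native format of a source; adding a source means writing one more wrapper with an objects() method.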
Querying and Integrating Heterogeneous Data with Incomplete Structure
In the context of Aquarelle, the Verso research team of INRIA has been exploring a complementary approach to querying and integrating heterogeneous databases: the use of POQL, a language developed in the project on top of the O2 DBMS, as a first step towards querying data without complete knowledge of their structure. The idea is that the user does not know in advance which servers to query.
Instead of artificially mapping the structure and semantics of the folders onto Aquarelle Z39.50 Access Points (APs), one might use POQL for AP-based queries. The fact that the structure does not have to be totally specified allows several sources which do not have the same structure to be integrated, to a certain extent. The power of this approach to query both structure and data at the same time is illustrated by the INRIA demo (http://cosmos.inria.fr:8080/poql.html) on a database of the French Inventaire whose documents obey the SGML CI DTD.
For instance, a user who wants to find all folders containing Cognac in their title would issue the following query:
- select f
where x contains Cognac and name(#A) contains tl;
Here Folders is the name of a folder server database, f is a variable ranging over the folders, @P is a path variable expressing navigation through the unknown structure of the folders, and #A is a variable ranging over the attributes ending the paths. The filtering condition then specifies that the required attributes (ie, the Access Points) contain tl in their name and that the corresponding values x contain the string Cognac. This is logically equivalent to defining an Access Point title and mapping it to the related elements of folders, for instance tl-cl (for classeurs), tl-th (for thematics), tl-dos (for dossiers) and tl-obj (for objets) in the CI DTD.
The advantage of POQL is that one does not have to enumerate all the possibilities (tl-cl, tl-th, tl-dos, etc) in the query, nor specify the paths leading to those classeurs, thematics, etc. Furthermore, and this is more important, POQL queries are to a certain degree independent of the data structure.
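The effect of a path variable and an attribute variable can be imitated in a few lines of Python over nested dictionaries and lists. This is only a sketch of the idea, not POQL and not its evaluation strategy on O2: the function names and the nesting of the sample folders are hypothetical, and only the element names tl-cl, tl-th, tl-dos and tl-obj come from the CI DTD.

```python
def attribute_paths(node, path=()):
    """Yield (path, attribute_name, value) for every leaf attribute under node.

    The path plays the role of the path variable, the attribute name that of
    the attribute variable. Scalars stored directly in a list are skipped.
    """
    if isinstance(node, dict):
        for name, child in node.items():
            if isinstance(child, (dict, list)):
                yield from attribute_paths(child, path + (name,))
            else:
                yield path, name, child
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from attribute_paths(child, path + (index,))


def select_folders(folders, name_part, value_part):
    """Folders with some attribute whose name contains name_part and whose
    value contains value_part, wherever that attribute sits in the structure."""
    return [
        folder
        for folder in folders
        if any(
            name_part in name and value_part in str(value)
            for _path, name, value in attribute_paths(folder)
        )
    ]


# Hypothetical folders with differing structures.
folders = [
    {"id": "f1", "classeur": {"tl-cl": "Cognac et ses environs"}},
    {"id": "f2", "dossier": {"tl-dos": "Eglise de Saintes", "objets": [{"tl-obj": "Calice"}]}},
    {"id": "f3", "thematique": {"sections": [{"tl-th": "Distilleries de Cognac"}]}},
    {"id": "f4", "dossier": {"tl-dos": "Moulin", "desc": "near Cognac"}},
]
result_ids = [f["id"] for f in select_folders(folders, "tl", "Cognac")]
```

The same call works whether the title sits directly under a classeur or deep inside a thematic, which is the structure independence described above.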
Geo-Referenced Navigation and Querying
In many cultural heritage applications, folders are related to geographical areas, ie they are geo-referenced by a point or a zone depending on the scale. The association of folders to geographical maps is useful for at least two reasons:
- user interface, navigation, query refinement: instead of access point based access to information, the user might want to navigate through a geographical area zooming from a country scale down to a county scale before deciding which folder(s) to access. At each scale, points featuring folders are displayed on a background map
- querying: as in geographical information systems (GIS), the user might want to combine the usual search criteria with spatial ones: "give me the folders associated with the 18th century farms located within 5 km of Cognac".
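A query combining the usual criteria with a spatial one can be sketched as below, assuming folders geo-referenced by a single point. The folder records, their field names and the coordinates (including the approximate position used for Cognac) are illustrative assumptions, and the great-circle (haversine) distance stands in for whatever spatial operator a real GIS would provide.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def folders_near(folders, center, radius_km, **criteria):
    """Folders matching every attribute criterion and lying within radius_km of center."""
    lat0, lon0 = center
    return [
        f
        for f in folders
        if all(f.get(key) == value for key, value in criteria.items())
        and haversine_km(lat0, lon0, f["lat"], f["lon"]) <= radius_km
    ]


COGNAC = (45.696, -0.329)  # approximate coordinates, for illustration only

# Hypothetical geo-referenced folders.
geo_folders = [
    {"id": "farm-a", "kind": "farm", "century": 18, "lat": 45.72, "lon": -0.329},
    {"id": "farm-b", "kind": "farm", "century": 18, "lat": 45.80, "lon": -0.329},
    {"id": "church-c", "kind": "church", "century": 12, "lat": 45.70, "lon": -0.33},
]
nearby_farms = folders_near(geo_folders, COGNAC, 5, kind="farm", century=18)
```

Folders geo-referenced by a zone rather than a point would need a polygon test instead of a point distance.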
The current experiment by the French Ministry of Culture (Inventaire) jointly with INRIA and Euroclid aims at prototyping the access through the web to geo-referenced folders structured according to the CI SGML DTD including the two above features. The results of this experiment should be looked at in the Aquarelle context.
Michel Scholl INRIA
Tel: +33 01 39 63 53 29
Anne-Marie Vercoustre INRIA
Tel: +33 01 39 63 56 62