ERCIM News No.41 - April 2000

Components for Data Intensive XML Applications

by Peter Fankhauser, Gerald Huck and Ingo Macherius

XML-based applications blur the lines between Internet, publishing and database technology. While classic SGML focused on presentation-oriented publishing applications, typical XML deployments in e-commerce lean toward message-oriented middleware, transactions and data exchange scenarios. At the GMD Institute for Integrated Publication and Information Systems (IPSI), a component-based framework has been designed and implemented to allow for the rapid development of data-intensive, XML-based applications.

The eXtensible Markup Language (XML) is the next-generation data format for structured information interchange on the World Wide Web. Under the auspices of the World Wide Web Consortium (W3C), XML has grown into a family of standards integrating key technologies from three previously independent domains: documents, databases, and the Internet. This powerful mix is a strategic component of the rapidly growing ‘dot com’ industry.

A Native Approach to XML Processing

Many of today’s XML technologies originate from document processing, where scalability was not considered important. In e-commerce, however, scalability is essential. Using XML with database technologies overcomes this problem but raises others, such as mismatches in data model and query paradigm. XML is semistructured by nature, which is incompatible with the flat structure of relational tables, and it is not easily decomposed into objects. Nor do the query languages of these systems capture the expressiveness of XML.

We designed middleware components that overcome these mismatches without sacrificing the simplicity of XML. They provide two important DBMS capabilities: declarative queries and transaction-safe persistence. The components can thus cope with the weakly structured, high-volume XML data typically generated by wrappers from legacy data sources such as HTML pages.

Component 1: The Persistent DOM

The Document Object Model (DOM) is a platform- and language-neutral interface for XML, standardized by the W3C and widely used throughout the industry. It provides a standard set of objects for representing XML documents, a standard model of how these objects can be combined, and a standard interface for accessing and manipulating them. The PDOM (Persistent Document Object Model) is an object manager that transparently maps standardized W3C DOM API method calls into operations on binary files, enabling the processing of XML documents far beyond main memory limits. Unlike traditional databases, the PDOM does not require a schema and thus allows users to skip the design-intensive setup phase typical of a traditional DBMS.

The implementation is scalable and achieves a throughput of several MB of XML data per second on a standard PC. The PDOM is lightweight: its code size is relatively small, and it requires less programming effort than DBMS-based solutions.
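
To give a feel for the kind of W3C DOM calls that a persistent DOM maps onto its storage, here is a minimal sketch using Python's standard in-memory DOM implementation (xml.dom.minidom). It is not the PDOM and says nothing about its internals; the sample document, element names and helper functions are hypothetical and chosen only for illustration. Note that the second product has no price element, which the DOM accepts without any schema.

```python
from xml.dom.minidom import parseString

# Hypothetical sample data; element and attribute names are illustrative only.
SAMPLE = """<catalog>
  <product id="p1"><name>Aspirin</name><price>4.50</price></product>
  <product id="p2"><name>Ibuprofen</name></product>
</catalog>"""


def product_names(xml_text):
    """Collect the text of every <name> element using standard DOM calls."""
    doc = parseString(xml_text)
    names = []
    for product in doc.getElementsByTagName("product"):
        for name in product.getElementsByTagName("name"):
            # firstChild is the text node that holds the element's content.
            names.append(name.firstChild.data)
    return names


def append_product(xml_text, product_id, name):
    """Add a new <product> element through the DOM manipulation interface."""
    doc = parseString(xml_text)
    product = doc.createElement("product")
    product.setAttribute("id", product_id)
    name_element = doc.createElement("name")
    name_element.appendChild(doc.createTextNode(name))
    product.appendChild(name_element)
    doc.documentElement.appendChild(product)
    return doc


if __name__ == "__main__":
    print(product_names(SAMPLE))
    updated = append_product(SAMPLE, "p3", "Paracetamol")
    print(product_names(updated.toxml()))
```

Because the code touches only the standardized interface, the same calls could in principle run against any DOM implementation that exposes it, whether documents are held in memory or on disk.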

Component 2: The XQL Query Processor

The Extensible Query Language (XQL) is a declarative, path-oriented query language for XML. It includes most operations familiar from SQL, e.g. selection, restructuring, joins, and views, and handles the semistructured nature of XML. First introduced at W3C’s conference on XML query languages in 1998, XQL has since been implemented by several large IT vendors. IPSI’s query processor implements the complete XQL proposal and augments it with extensions for cross-document joins and restructuring of results. Its robust and efficient mix of algebraic and physical query optimization techniques yields superior performance. The processor can also be used on top of any W3C-compliant DOM implementation, including the PDOM.
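
The flavour of a declarative, path-oriented query can be sketched as follows. This is not XQL and not IPSI's processor: the expressions use the limited XPath subset built into Python's xml.etree.ElementTree, which is assumed here only as a stand-in, and the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; element and attribute names are illustrative only.
SAMPLE = """<catalog>
  <product id="p1"><name>Aspirin</name><price>4.50</price></product>
  <product id="p2"><name>Ibuprofen</name></product>
  <product id="p3"><name>Paracetamol</name><price>2.10</price></product>
</catalog>"""

root = ET.fromstring(SAMPLE)

# Selection by structure: products that have a <price> child.
# The missing price of p2 is tolerated, reflecting semistructured data.
priced_ids = [p.get("id") for p in root.findall(".//product[price]")]

# Path navigation: the names of all products directly below the root.
names = [n.text for n in root.findall("./product/name")]

# Selection by attribute value, followed by a further path step.
second_name = root.find(".//product[@id='p2']/name").text

if __name__ == "__main__":
    print(priced_ids)
    print(names)
    print(second_name)
```

Each query states what to retrieve as a path with filters rather than how to traverse the tree, which is what leaves a query processor free to choose an evaluation strategy.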

Component 3: The XML-Broker Data Server

Both the XQL processor and the PDOM were integrated into an XML data server, making their functionality accessible through HTTP. Our extensive experience showed the importance of diversity in interfaces. Thus we support the low-level DOM API, string-based queries, URLs with embedded queries, and means to post-process query results with XSLT (XSL Transformations). Complicated information processing tasks on XML data can be concatenated into pipelines, encoded in standard URL syntax, ready to be bookmarked and reused.
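
As a rough illustration of how a query and a post-processing step might be encoded in a single bookmarkable URL, consider the sketch below. The server address, the parameter names (doc, query, xslt) and the file names are all hypothetical; the article does not specify the actual URL scheme of the XML-Broker.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_pipeline_url(base_url, document, query, stylesheet=None):
    """Encode a query pipeline as a URL.

    The parameter names 'doc', 'query' and 'xslt' are assumptions made
    for this sketch, not the XML-Broker's real interface.
    """
    params = [("doc", document), ("query", query)]
    if stylesheet is not None:
        # Optional second pipeline stage: transform the query result.
        params.append(("xslt", stylesheet))
    # urlencode percent-escapes characters such as '/', '[' and ']'.
    return base_url + "?" + urlencode(params)


# Hypothetical server address and file names.
url = build_pipeline_url(
    "http://localhost:8080/broker",
    "catalog.xml",
    "//product[price]/name",
    "names-to-html.xsl",
)

# Decoding the URL recovers the original pipeline stages unchanged.
stages = parse_qs(urlsplit(url).query)

if __name__ == "__main__":
    print(url)
    print(stages)
```

Since the whole pipeline lives in the URL, it can be stored, shared and replayed with any HTTP client.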

Applications: From Molecules to Markets

The first deployment of our components was in the nationally funded RELIMO project, whose goal is the integration of data sources of interest to drug designers. This still-evolving technology was chosen by the IST-funded OPELIX and eBroker projects as a platform for B2B e-commerce applications. The software package is also being distributed on the Web; so far, more than one thousand copies have been downloaded. This popularity made its commercialisation viable and resulted in a product called the Infonyte XQL Suite.

Research meets Business

The components discussed in this article are part of an ambitious project at IPSI, the GMD’s XML Competence Center. This center has already become a focal point for cooperation between industry, research partners and standardization bodies such as W3C. Our state-of-the-art know-how and proven research background in database technology, document management, publishing, information retrieval, and graphical user interfaces make GMD-IPSI a natural partner for advanced XML information management issues, and their application in the evolving digital economy.

XML Competence Center: http://xml.darmstadt.gmd.de/
XQL and PDOM download: http://xml.darmstadt.gmd.de/xql/

Please contact:
Peter Fankhauser - GMD-IPSI
Tel: +49 6151 869 939
E-mail: xmlcc@darmstadt.gmd.de