What's Next in Database?
by Keith G Jeffery
This short paper attempts to encapsulate the thinking that has been going on in the ERCIM Database Research Group over the last 8 years. Of course, the members of the group are internationally known researchers and participants in communities covering their own special areas of interest, and so the group acts as a focus integrating many aspects of database (or, more generally, information intensive systems).
The predominant technology today is relational. Founded on theoretical work by Ted Codd and others in the late sixties, the technology took 15 years to mature to market acceptability. However, much data processing still uses earlier hierarchic and network-structured (CODASYL) database systems and even file systems. Relational databases have several advantages:
they are founded on solid theoretical principles;
they utilise predicate queries as opposed to navigational ones ("what I want", not "how to get it");
there is a standard query language, so interoperation is assisted;
they are easy to use and a solid basis for advanced application (business or technical) systems.
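The contrast between predicate and navigational access can be sketched as follows. This is a minimal illustration only: the part table, its columns and both function names are hypothetical, and Python's built-in sqlite3 module merely stands in for a relational system.

```python
import sqlite3

def build_parts_db():
    """Create an in-memory table of hypothetical parts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE part (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO part (id, name, price) VALUES (?, ?, ?)",
        [(1, "bolt", 0.10), (2, "gear", 4.50), (3, "shaft", 12.00)],
    )
    return conn

def cheap_parts_declarative(conn, limit):
    """Predicate query: state what is wanted; the system chooses the access path."""
    rows = conn.execute(
        "SELECT name FROM part WHERE price < ? ORDER BY name", (limit,)
    )
    return [name for (name,) in rows]

def cheap_parts_navigational(records, limit):
    """Navigational style: the program itself walks the records one by one."""
    found = []
    for record in records:
        if record["price"] < limit:
            found.append(record["name"])
    return sorted(found)

conn = build_parts_db()
records = [
    {"name": name, "price": price}
    for (name, price) in conn.execute("SELECT name, price FROM part")
]
print(cheap_parts_declarative(conn, 5.0))
print(cheap_parts_navigational(records, 5.0))
```

Both calls return the same names; the difference is that only the second has to spell out how the records are traversed.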
The major factors requiring a move forward from a relational basis are: Expressivity, Representativity, Distribution, Interoperability, and in a wider context (dealt with under Current Hot Topics below): WWW-database integration, Metadata, Data Warehousing and Data Mining.
Expressivity This is the feature that allows an end-user to communicate effectively with the information system. A user wants to ask "Where can I buy inexpensive but good coffee downtown?" - a request that is a long way, syntactically and semantically, above anything SQL (Structured Query Language) can express. Similarly, expressions involving the temporal dimension are handled poorly.
Representativity This feature concerns the ability of the system to represent data structures and content adequately. Relational systems originally handled data types such as text strings and multimedia content (e.g. images) poorly, and likewise more complex data structures: partitioning such structures across relations (tables) is awkward for querying and for user comprehension, and has performance implications because of excessive joins. Relational systems - despite having a date / time type - represented temporal information inadequately, especially temporally-defined versions of information.
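To make "temporally-defined versions of information" concrete, here is a small sketch of an as-of lookup over versions carrying a validity interval. The field names valid_from and valid_to, the function as_of and the price data are assumptions made for illustration; they are not taken from any particular system.

```python
from datetime import date

def as_of(versions, when):
    """Return the version whose [valid_from, valid_to) interval contains `when`.

    A valid_to of None marks the current, open-ended version.
    """
    for version in versions:
        starts_before = version["valid_from"] <= when
        still_valid = version["valid_to"] is None or when < version["valid_to"]
        if starts_before and still_valid:
            return version
    return None

# Hypothetical price history: each change creates a new version.
price_history = [
    {"price": 1.20, "valid_from": date(1997, 1, 1), "valid_to": date(1998, 1, 1)},
    {"price": 1.35, "valid_from": date(1998, 1, 1), "valid_to": None},
]

print(as_of(price_history, date(1997, 6, 1)))
```

A plain date / time column can store these values, but the interval logic has to be written by hand in every query, which is the inadequacy described above.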
Distribution Distributed database technology has been researched for over 30 years. The move from mainframe and dumb terminals to client-server computing allowed a client to access multiple information servers and various proprietary and standard protocols emerged. Distributing a centrally designed, homogeneous database system for performance or in order to concentrate data processing close to the users is a well-known and understood technology, although some other non-functional aspects of the system design are less well understood. Relational theory, with horizontal and vertical partitioning of the database, provides a solid underpinning. However, replication for performance and synchronisation of updates across networks pose availability and performance problems not yet solved adequately.
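Horizontal and vertical partitioning can be illustrated with a short sketch. Rows are modelled as plain dictionaries, and the customer data and function names are hypothetical; the point is only that horizontal fragments recombine by union and vertical fragments by a join on the key.

```python
def horizontal_partition(rows, predicate):
    """Split rows into those that satisfy the predicate and those that do not."""
    matching = [row for row in rows if predicate(row)]
    rest = [row for row in rows if not predicate(row)]
    return matching, rest

def vertical_partition(rows, key, columns):
    """Project the key plus the chosen columns, so the fragment can be rejoined."""
    return [{column: row[column] for column in (key, *columns)} for row in rows]

def rejoin(left, right, key):
    """Rebuild full rows by joining two vertical fragments on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]

customers = [
    {"id": 1, "name": "Ada", "city": "Paris", "credit": 500},
    {"id": 2, "name": "Bo", "city": "Oslo", "credit": 900},
    {"id": 3, "name": "Cy", "city": "Paris", "credit": 100},
]

# Horizontal: rows for Paris could live on a server close to Paris users.
paris, elsewhere = horizontal_partition(customers, lambda row: row["city"] == "Paris")

# Vertical: contact details and financial details stored separately.
contact = vertical_partition(customers, "id", ["name", "city"])
finance = vertical_partition(customers, "id", ["credit"])

print(rejoin(contact, finance, "id") == customers)
```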
Interoperability A major requirement studied since the seventies is the need to interoperate information systems. The user at a client workstation requires apparently homogeneous access to heterogeneous, distributed information servers. The heterogeneity occurs in hardware / software platform, names of attributes, types of attributes, character set or medium of representation, language of representation, media types used, storage architecture, access architecture, data structure, query language, availability of information on precision, accuracy, domain limits and the semantics of the data. Techniques of data exchange and data access have been developed, but much more needs to be done. The ubiquity of WWW has highlighted the need - user access to WWW information sources usually requires a multi-step process utilising the intelligence of the end-user to navigate and to resolve heterogeneity.
The Solutions Emerging
Expressivity Advanced user interfaces with graphical representations ("draw what you mean by downtown") and advanced interaction based on user models, domain ontologies and dialogue models - utilising logic programming technology - are a topic of active research with products emerging. In the temporal dimension the latest SQL standard incorporates much of the R&D work over the past decade on temporal data handling.
Representativity The object-oriented paradigm was intended to address the representativity issue. The use of object classes with inheritance of properties and the tight binding to methods that were class-specific allowed a direct connection between conceptual level modelling of an information system and the implementation. Versioning was included rather naturally, and complex data types were handled as specific classes - with associated methods. However, pure object-oriented systems, despite the excellence of some of them such as O2 developed at INRIA, have been less successful than expected, due partly to performance problems. Furthermore, the tight binding of data and program (method) is in direct opposition to the trend of the last 30 or more years to separate them to allow maximal flexibility and re-use.
Distribution Advanced R&D work on transactions, locking and commit over the last 15 years has produced excellent products. However, there are still some remaining problems of integrity especially with large multimedia objects when a check-out, check-in transaction style with reconciliation of any synchronous updates is required to avoid lengthy hangs while another user is updating. Such technology also requires version handling; this is particularly difficult with an object instance that has many sub-object instances of different versions. Typical application areas are CAD (Computer-Aided Design), office systems and multimedia editing.
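A check-out, check-in style with version handling might look like the sketch below. It is one simple variant under stated assumptions: the class VersionedStore and its method names are hypothetical, and a conflicting check-in is rejected rather than merged, leaving the reconciliation itself to the user or to a separate tool.

```python
class CheckInConflict(Exception):
    """Raised when the object changed after it was checked out."""

class VersionedStore:
    """Keeps one current (version, content) pair per named object."""

    def __init__(self):
        self._objects = {}

    def put_new(self, name, content):
        self._objects[name] = (1, content)

    def check_out(self, name):
        """Return (version, content) without holding a lock on the object."""
        return self._objects[name]

    def check_in(self, name, base_version, content):
        """Accept the update only if nobody checked in since base_version."""
        current_version, _ = self._objects[name]
        if current_version != base_version:
            raise CheckInConflict(
                f"{name}: based on version {base_version}, "
                f"but current version is {current_version}"
            )
        new_version = current_version + 1
        self._objects[name] = (new_version, content)
        return new_version

store = VersionedStore()
store.put_new("drawing", "original")
version_a, _ = store.check_out("drawing")   # first user
version_b, _ = store.check_out("drawing")   # second user, same base version
store.check_in("drawing", version_a, "edit by first user")
try:
    store.check_in("drawing", version_b, "edit by second user")
except CheckInConflict as conflict:
    print("needs reconciliation:", conflict)
```

Neither user waits while the other edits; the cost is that the second check-in must be reconciled against the newer version.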
There are no tools yet developed that distribute a database system reliably taking into account all the factors, such as network bandwidth, network reliability, performance, security, availability, privacy, client geographic location, server geographic location, time zone of location. A skilled analyst is still required for this task - largely because many of the values of the factors are uncertain or unavailable.
Interoperability The solution of the interoperability problems will almost certainly always involve the end-user providing additional guidance to the information system. However, modern R&D has provided increasingly sophisticated solutions allowing schema reconciliation so that data structures and attribute names in two different database schemas can be matched and the match proposed to the end-user for verification or correction. Machine translation is overcoming some of the language problems and there is intense R&D in this area.
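As a rough sketch of proposing attribute matches for the end-user to verify, the fragment below compares attribute names from two schemas by string similarity. This is a deliberately naive stand-in for real schema reconciliation, which would also weigh types, structure and semantics; the schemas, the threshold and the function names are assumptions.

```python
from difflib import SequenceMatcher

def normalise(name):
    """Lower-case the name and drop common separators before comparing."""
    return name.lower().replace("_", "").replace("-", "")

def similarity(a, b):
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def propose_matches(schema_a, schema_b, threshold=0.6):
    """For each attribute in schema_a, propose its closest name in schema_b.

    Proposals below the threshold are dropped; the rest are meant to be
    shown to the end-user for verification or correction.
    """
    proposals = []
    for attribute in schema_a:
        best = max(schema_b, key=lambda candidate: similarity(attribute, candidate))
        score = similarity(attribute, best)
        if score >= threshold:
            proposals.append((attribute, best, round(score, 2)))
    return proposals

schema_a = ["cust_name", "postcode", "dob"]
schema_b = ["CustomerName", "PostCode", "Balance"]
for attribute, candidate, score in propose_matches(schema_a, schema_b):
    print(f"{attribute} -> {candidate} (similarity {score})")
```

The attribute dob finds no sufficiently similar partner and is left for the user to resolve by hand.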
The problems concerning information about the data - such as accuracy and precision - are starting to be resolved by the use of metadata (data about data). Early attempts in the database community to standardise so-called data dictionaries through the IRDS standard were not universally successful. However, with the overpowering need caused by WWW there is renewed interest in this area and from W3C the RDF (Resource Description Framework) coded in XML (eXtensible Markup Language) has emerged as a widely-used metadata standard. However, much more work is required to agree definitions for particular application areas and to provide the tools to automate the interoperation process further.
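For illustration, a metadata record carrying precision and accuracy could be serialised as RDF in XML roughly as below. The RDF and Dublin Core namespace URIs are the published ones, but the choice of Dublin Core for the title, the example.org quality vocabulary, the dataset URI and the function name are all assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
EX = "http://example.org/quality#"  # hypothetical vocabulary for data quality

def describe_dataset(uri, title, precision, accuracy):
    """Build an RDF/XML description of one dataset and return it as text."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("dc", DC)
    ET.register_namespace("ex", EX)
    root = ET.Element(f"{{{RDF}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": uri}
    )
    ET.SubElement(description, f"{{{DC}}}title").text = title
    ET.SubElement(description, f"{{{EX}}}precision").text = precision
    ET.SubElement(description, f"{{{EX}}}accuracy").text = accuracy
    return ET.tostring(root, encoding="unicode")

print(describe_dataset(
    "http://example.org/data/rainfall",
    "Rainfall measurements",
    "0.1 mm",
    "+/- 0.5 mm",
))
```

The syntax is the easy part; as noted above, agreeing what a property such as precision means within an application area is where the remaining work lies.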
Current Hot Topics
In addition to R&D on the topics discussed above - bringing in the integration of database technology with object-orientation, logic programming, functional programming, artificial intelligence, multimedia and other areas of computer science and engineering - there are a few areas of work which are current and of wider applicability. They commonly draw on recent advances in the topics discussed above.
WWW-Database Integration From the early days of the WWW concept, the idea of using WWW to provide browser client access to pre-existing database-based information servers has been attractive. Work in the first half of the nineties provided the CGI (Common Gateway Interface) and several organisations (including some of the ERCIM institutes) developed Dataweb Technology, as it was named at CLRC. Significant ongoing R&D is attacking issues of performance and standards to avoid lock-in to proprietary solutions. Also some groups are working on techniques and tools analysing the structure and content of websites to provide a database-style conceptual model to assist uniform querying across database-based and non-database-based web servers.
The great advantage of Dataweb Technology is that the data content and structure are maintained independently of their access (through WWW forms as query templates) and presentation (through HTML (HyperText Markup Language) or XML with associated presentation control using CSS (Cascading Style Sheets) or XSL (eXtensible Stylesheet Language) respectively).
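The separation of data, access and presentation can be sketched in a few lines. This is not the Dataweb implementation: the report table, the form field names and both functions are hypothetical, with sqlite3 standing in for the database behind the web server.

```python
import sqlite3
from html import escape

# Hypothetical query template behind a web form with two fields.
QUERY_TEMPLATE = (
    "SELECT title, year FROM report "
    "WHERE year >= ? AND title LIKE ? ORDER BY year"
)

def build_report_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE report (title TEXT, year INTEGER)")
    conn.executemany(
        "INSERT INTO report (title, year) VALUES (?, ?)",
        [("Databases & the Web", 1997), ("Metadata", 1995), ("Web Mining", 1999)],
    )
    return conn

def run_form_query(conn, form):
    """Access: fill the query template with values submitted through the form."""
    params = (int(form["from_year"]), f"%{form['keyword']}%")
    return conn.execute(QUERY_TEMPLATE, params).fetchall()

def render_html(rows):
    """Presentation: turn result rows into HTML, independent of the query."""
    items = "".join(f"<li>{escape(title)} ({year})</li>" for title, year in rows)
    return f"<ul>{items}</ul>"

conn = build_report_db()
rows = run_form_query(conn, {"from_year": "1996", "keyword": "Web"})
print(render_html(rows))
```

Because the rows are produced before any markup exists, the same result could be rendered as XML instead by swapping only the rendering function.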
The integration has also encouraged the emergence of a different architecture; between client and server a mid-layer is inserted for the application server. It despatches Java applets to the client browser workstation and servlets to the information server(s) so providing a thread of control for the particular transaction in the application.
The integration has to span database-based websites, information retrieval-based websites, conventional document websites and websites with high multimedia content. This is a great challenge and increasingly attention is turning to the use of metadata to describe websites and intelligent assists to the client browser query process.
Metadata Metadata is arguably the key facility for interoperability and intelligently-assisted user access to global information resources. Metadata can be used by intelligent agents to expedite search and retrieval, handle security and privacy issues and re-route network traffic, improving availability. Metadata can be used to provide query assistance to the end user, in order to achieve the ultimate goal: "get me what I mean, not what I say".
Metadata has been used especially in the scientific / technical community for many years. The library community has large metadata collections in computerised catalogue systems. Catalogues of engineering parts or technical products are stored as metadata. Building upon the concepts of a database schema, data exchange in the seventies and even earlier utilised additional data describing the data being exchanged to allow automated processing at the receiving server and better interpretation of the results by the end-user. The business community, working from a basis of rigid standardisation for data exchange, has come to utilise metadata.
The explosion of demand for universal information access caused by WWW has really highlighted the need for metadata. The web-indexing engines (such as Alta Vista, Excite, HotBot, Yahoo, Lycos) build metadata databases to describe succinctly and with uniformity heterogeneous information servers. However, use of specialised metadata servers for specific application domains - utilising advanced database techniques with knowledge-based systems and a domain ontology to capture, store and provide knowledge about the application domain - is starting to emerge as a powerful way forward.
Data Warehousing and Data Mining The need for readily available management information for business led to data warehousing - essentially storing a restructured and usually summarised copy of the base data in a form suitable for statistical analysis and visualisation. The technology facilitates looking at data on, for example, sales by periods of time, by geographic region or by class of product. The technology features deeper exploration of the data (drill-down) and higher-level summarisations (roll-up). Long used in the science and technology domain - where data centres for many disciplines exist and are used heavily - data warehousing is having a profound effect on business decision-making.
Data Mining is the technique of finding patterns in data. The data are analysed using special functions to look for correlative patterns; the technique is much more brute force than classical multivariate statistics but has produced interesting results in several application domains, both scientific / technical and business. The basic concept that a computer system can detect the pattern (representing a hypothesis) in a mass of data mirrors one technique of human hypothesis creation. However, the performance issues are receiving attention, as is the theoretical underpinning of data mining. Current R&D includes work on utilising the results of multivariate statistical analysis as a guide for data mining, to improve the success of the mining techniques.
The rapid advances in database systems (or, more widely, information intensive systems) over the last 30 years or more indicate that the future is an exciting challenge. The usual issues of performance and accuracy will continue alongside issues such as improved representativity, expressivity, universality of access and universality of understanding of the information assisted by metadata and the WWW.
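Roll-up and drill-down amount to summarising the same facts over fewer or more dimensions. A minimal sketch, with hypothetical sales records and a hypothetical summarise function, might read:

```python
from collections import defaultdict

def summarise(facts, dimensions):
    """Total the 'amount' of each fact, grouped by the given dimensions."""
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[dimension] for dimension in dimensions)
        totals[key] += fact["amount"]
    return dict(totals)

sales = [
    {"region": "North", "product": "tea", "quarter": "Q1", "amount": 100.0},
    {"region": "North", "product": "coffee", "quarter": "Q1", "amount": 50.0},
    {"region": "South", "product": "tea", "quarter": "Q1", "amount": 70.0},
    {"region": "North", "product": "tea", "quarter": "Q2", "amount": 30.0},
]

# Roll-up: a higher-level summary by region only.
by_region = summarise(sales, ["region"])

# Drill-down: the same totals broken out by region and product.
by_region_product = summarise(sales, ["region", "product"])

print(by_region)
print(by_region_product)
```

A warehouse typically stores such summaries ahead of time rather than computing them on each request, which is what makes interactive exploration feasible on large base data.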
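The brute-force flavour of pattern finding can be shown with one simple technique, chosen here purely for illustration: counting how often pairs of items occur together and keeping the pairs above a support threshold. The transactions, the threshold and the function name are assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return item pairs whose share of transactions is at least min_support.

    Every pair in every transaction is enumerated, with no statistical
    model guiding the search.
    """
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }

transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "tea"],
    ["bread", "butter", "tea"],
]
print(frequent_pairs(transactions, min_support=0.5))
```

The number of candidate pairs grows quadratically with the number of distinct items, and faster still for larger item sets, which hints at why performance receives so much attention.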
Keith G Jeffery - CLRC
Tel: +44 1235 44 6103