Data Access and Integration
by Malcolm Atkinson
We can make decisions and discoveries with data collected within our own business or research. But we improve decisions or increase the chance and scope of discoveries when we combine information from multiple sources. Then correlations and patterns in the combined data can support new hypotheses, which can be tested and turned into useful knowledge. We see this requirement in medicine, engineering, sciences, economics and business.
One approach is to organise a communal data resource so that the data is maintained in one place, under one organisational scheme. There are three reasons why this doesn't work well for research data:
- Much of the data has already been collected under different regimes chosen by the groups who conducted the work.
- Integrating the collection process itself would lose a valuable asset: the skill and knowledge of the collectors, who understand their domain and know best how to organise it for their work.
- It is incompatible with human nature: people want to 'own' their own work and decide how it should be conducted. In health care and engineering, I have seen the process of trying to agree on a common model go on indefinitely, consuming the time of many experts but never converging.
An important trend in research is to organise larger shared data resources, but at the same time the number of collections is growing rapidly. Their diversity increases, while multi-disciplinary research increasingly requires integration of data from multiple autonomous and heterogeneous data resources managed by independent communities.
The Data Warehouse Approach
Another approach is the data warehouse, much used in retail and financial decision support, for example. Here, data from multiple data resources is copied, 'cleaned' and 'integrated' under a common schema. Typically, more data is added to the data warehouse from a standard set of sources on a regular periodic basis. This works well with a small set of stable data resources serving well-defined goals. In research, there is a large and growing set of resources and an open-ended set of goals, and each source changes, both in content and structure, as experiments and surveys progress and as understanding and technology advance. Data warehousing has its place in securing and concentrating data, but encouraging and capturing multi-source evolution is vital. It becomes part of the scientific communication processes and enables rapid use in other domains of new data, new discoveries and new classifications.
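The copy, clean and integrate cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two store extracts, their field names, the cleaning rules and the common 'sales' schema are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical extracts from two source systems; field names are illustrative.
STORE_A = [{"sku": " a-100 ", "qty": "3", "day": "2004-03-01"}]
STORE_B = [
    {"product": "A-100", "units": 5, "date": "2004-03-01"},
    {"product": "B-200", "units": None, "date": "2004-03-01"},
]


def clean_a(row):
    """Map a store A row to the common schema (product, quantity, day)."""
    return (row["sku"].strip().upper(), int(row["qty"]), row["day"])


def clean_b(row):
    """Map a store B row to the common schema, rejecting incomplete rows."""
    if row["units"] is None:
        return None
    return (row["product"].strip().upper(), int(row["units"]), row["date"])


def load(conn, rows, cleaner, source):
    """Copy one batch into the warehouse, tagging each row with its source."""
    loaded = 0
    for row in rows:
        cleaned = cleaner(row)
        if cleaned is None:
            continue  # 'cleaning' here simply drops rows that cannot be repaired
        conn.execute("INSERT INTO sales VALUES (?, ?, ?, ?)", cleaned + (source,))
        loaded += 1
    return loaded


def build_warehouse():
    """Create the common schema and run one periodic load from each source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (product TEXT, quantity INTEGER, day TEXT, source TEXT)"
    )
    load(conn, STORE_A, clean_a, "store_a")
    load(conn, STORE_B, clean_b, "store_b")
    return conn


def totals(conn):
    """Query the integrated copy without touching the original sources."""
    cur = conn.execute(
        "SELECT product, SUM(quantity) FROM sales GROUP BY product ORDER BY product"
    )
    return cur.fetchall()
```

Each source needs its own hand-written cleaner, and the warehouse only reflects a source as of its last load. That is acceptable for a few stable sources, and it is where the approach strains when sources multiply and change structure.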
A virtual data warehouse (I am indebted to Mark Parsons, EPCC, Edinburgh, for this term) would meet this requirement. It would enable researchers to combine data from a dynamically varying set of evolving data sources. It would accommodate all of the diversity, handling schema integration (the differences in the way data is structured and described) and data integration (the differences in the sets of values used and their representations). It would accommodate the changes in policy of data owners, and the changes in organisation and content of each data source. And it would do all of this while presenting its users with understandable and stable facilities that nevertheless reflect the new information. It would provide them with confidence and provenance information, so that they could use the resulting data as reliable evidence. At the same time, it would not place restrictions on the autonomy of the data providers, would assure them that their data was not misused and would assist in ensuring they gained credit for the quality and content of their data.
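To make the distinction between schema integration and data integration concrete, here is a minimal Python sketch of query-time integration that leaves the data with its owners. It is not a virtual data warehouse, only an illustration of three of its ingredients: a per-source schema mapping, a per-source data mapping, and a provenance tag on every result. The Source class, the two clinic sources and all field names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Source:
    name: str
    fetch: Callable[[], Iterable[dict]]  # reads the live source on every query
    fields: Dict[str, str]               # schema mapping: local field -> shared field
    convert: Dict[str, Callable]         # data mapping: shared field -> value converter


def query(sources: List[Source], predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Evaluate a predicate over all sources, expressed in the shared schema."""
    for src in sources:
        for raw in src.fetch():
            # Schema integration: rename local fields to the shared names.
            rec = {shared: raw[local] for local, shared in src.fields.items()}
            # Data integration: convert values to the shared representation.
            for name, fn in src.convert.items():
                rec[name] = fn(rec[name])
            if predicate(rec):
                rec["provenance"] = src.name  # record where the evidence came from
                yield rec


# Hypothetical sources: one clinic records Celsius, the other Fahrenheit.
CLINIC_A = [{"pid": "p1", "temp_c": 38.5}]
CLINIC_B = [{"patient": "p2", "temp_f": 98.6}, {"patient": "p3", "temp_f": 102.2}]

SOURCES = [
    Source(
        name="clinic_a",
        fetch=lambda: CLINIC_A,
        fields={"pid": "patient_id", "temp_c": "temperature_c"},
        convert={},
    ),
    Source(
        name="clinic_b",
        fetch=lambda: CLINIC_B,
        fields={"patient": "patient_id", "temp_f": "temperature_c"},
        convert={"temperature_c": lambda f: round((f - 32) * 5 / 9, 1)},
    ),
]

fevers = list(query(SOURCES, lambda r: r["temperature_c"] >= 38.0))
```

Because each source is read when the query runs, new records appear in the next query without a reload. The hard parts described above are untouched by this sketch: the mappings are still written by hand, and nothing here handles changes in a source's structure or its owner's access policy.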
A complete realisation of a virtual data warehouse is beyond the current state of the art; instead, various aspects of it are realised. Many research teams hand-craft a one-off solution. This does not scale well, as the skilled work of building schema and data mappings grows with the number and richness of the data sources integrated. It becomes hard to sustain as those resources evolve, and it rarely implements mechanisms to enforce data access policies or track data provenance. Projects such as myGrid and Discovery Net craft workflows to assemble relevant data for each analysis. The Virtual Data Technology (Pegasus, Chimera and DagMan) developed by the GriPhyN project encourages high-level expression of combined data integration and analysis, enabling the underlying system to plan and evaluate more efficiently. VDT exploits the predominant update pattern in physics data of incremental addition against a constant schema. Projects such as BIRN and GEON at SDSC catalogue data from multiple sources, describing their structure and data representation. For example, each geological survey's rock classification is described with an ontology and its coordinate system is defined. Tools are provided to manage these descriptions and use them to construct schema and data mappings.
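The idea of constructing mappings from catalogued descriptions, rather than writing each one by hand, can be illustrated with a small Python sketch. It does not reproduce the BIRN or GEON tools. The catalogue layout, the survey names and the rock terms are hypothetical, a flat shared vocabulary stands in for an ontology, and a depth unit stands in for a full coordinate system definition.

```python
# Hypothetical catalogue entries: each survey describes its own rock terms
# against a shared concept vocabulary and states the unit it uses for depth.
CATALOGUE = {
    "survey_x": {
        "terms": {"GRN": "granite", "BSL": "basalt"},
        "depth_unit": "m",
    },
    "survey_y": {
        "terms": {"granitic": "granite", "basaltic": "basalt"},
        "depth_unit": "ft",
    },
}

METRES_PER_UNIT = {"m": 1.0, "ft": 0.3048}


def build_mapping(catalogue, source, target):
    """Construct a record translator using only the two catalogue descriptions."""
    src, tgt = catalogue[source], catalogue[target]

    # Invert the target's description: shared concept -> target's local term.
    concept_to_target = {concept: term for term, concept in tgt["terms"].items()}

    # Compose source term -> shared concept -> target term.
    term_map = {
        term: concept_to_target[concept]
        for term, concept in src["terms"].items()
        if concept in concept_to_target
    }

    # Derive the unit conversion from the declared units.
    scale = METRES_PER_UNIT[src["depth_unit"]] / METRES_PER_UNIT[tgt["depth_unit"]]

    def translate(record):
        return {"rock": term_map[record["rock"]], "depth": record["depth"] * scale}

    return translate


# Translate a survey_y record into survey_x's terms and units.
to_x = build_mapping(CATALOGUE, "survey_y", "survey_x")
converted = to_x({"rock": "basaltic", "depth": 100.0})
```

With descriptions of this kind, adding a survey means writing one catalogue entry instead of one mapping for every other survey, which is the scaling benefit of describing each source against shared terms.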
Technical solutions to recurring distributed tasks underlie the assembly of data in data warehouses, communal or shared repositories, bespoke solutions, catalogued registries and in advances towards virtual data warehouses:
- data description and discovery
- access, authentication, authorisation & accounting
- data transformation
- data transport
- query, selection and aggregation
- data update, bulk load and archiving
- provenance tracking, audit trails and diagnostic logging.
The UK's OGSA-DAI (Open Grid Services Architecture Data Access and Integration) system provides a framework and set of components operating across grids or web services to deliver these mechanisms reliably. Many projects are building on this framework and extending its repertoire. It already handles a range of relational systems, XML databases and collections of files.
Perhaps the most important challenges that remain are:
- automatic adaptation of schema and data transformations in response to changes in a data source
- integrated optimisation of computation, data management and data movement
- high-level presentation of data composition, operations, analyses and transformations.
All three require a formal foundation, based on tractable mathematical models of sufficient scope and realism.
UK National e-Science Centre: http://www.nesc.ac.uk/
National e-Science Centre, Edinburgh, UK