Data Management in Climate Research
by Kerstin Kleese
Parallel environmental models are currently some of the most demanding codes we have. They push the available machines to their limits and still need more resources. However, the bottleneck in running these codes is not so much the code performance as the data handling strategies employed. The High Performance Computing Initiative Centre at CLRC, Daresbury Laboratory (UK) is setting up a new project to investigate this important issue. Besides analysing existing solutions, the project will also look for new, more flexible and portable approaches.
The UK environmental research community relies heavily on High Performance Computing (HPC) to facilitate its studies. Fortunately, the environmental model codes have a large potential for parallelization, which has been well explored by numerous research groups in the UK (e.g. UGAMP and OCCAM). Still, increased resolutions, longer runs or larger ensembles have a significant impact on the HPC resources required and can easily overwhelm current systems. Often data handling has the most dramatic influence on model run times, or determines whether a model can be run at all. Thus optimal data handling is vital for this type of application. Unfortunately, the problems do not stop with the end of a successful model run. These codes produce vast amounts of data which have to be archived for future analyses. Current data storage systems leave something to be desired in speed and ease of use. The situation for data retrieval is even worse. Tedious searching for archived data, long waiting times and the lack of selective extraction are common problems for modern scientists. Sometimes it is faster to run a model again than to retrieve data from a previous run.
It is already clear that the data requirements of the community will increase even more over the coming years. Big centres like the European Centre for Medium-Range Weather Forecasts expect the volume of their data archive to double every 18 months. New machine architectures potentially allow larger models to be run. Data handling presents a severe bottleneck for today's science; without new strategies it might prevent future progress.
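To give a feel for what a doubling time of 18 months implies, the following minimal sketch computes the growth factor of an archive after a given number of years. It assumes steady exponential growth, which is a simplification; the function name and the sample time spans are illustrative and do not come from any centre's planning figures.

```python
def archive_growth_factor(years, doubling_months=18.0):
    """Factor by which an archive grows after `years`, assuming its
    volume doubles every `doubling_months` months (idealised model)."""
    return 2.0 ** (years * 12.0 / doubling_months)


# Illustrative time spans only, chosen as multiples of the doubling time.
for years in (1.5, 3, 6, 9):
    factor = archive_growth_factor(years)
    print(f"after {years} years: {factor:.0f} times the initial volume")
```

Under this assumption an archive is four times its initial size after three years and 64 times after nine, so storage and retrieval systems sized for today's volume are quickly outgrown.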
For our project it was decided to take a holistic approach, identifying four main areas of interest:
- data handling within the model codes
- file access during run time (on different platforms)
- data archival
- data retrieval.
Although all these areas have been investigated separately, and some interesting in-house solutions exist, little has been done to offer a portable solution that can be easily adapted to the actual requirements of different sites.
The first step is to analyse the current situation. Research results of leading scientific groups concerning data handling strategies within model codes have been examined. A lot of work has been done in this area over the past years, and we can certainly benefit from that. We would like to compare the different approaches, trying to find similarities, differences and tendencies that could be useful for the community. Secondly, a list is in preparation covering the different file access mechanisms available during run time on various systems. This will give a clear overview of what is available, how fast it is and what the user can do to make the most of it. This information will serve as a basis for further investigations. Together with vendors and other sites we have started to gather more information about the data archival and retrieval systems that are in use today. The clear message so far is that there is a desperate need for more intelligent solutions.
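As an illustration of the kind of measurement such a survey of file access mechanisms might involve, the sketch below times sequential writes of the same amount of data with different block sizes. It is not the project's actual method: the function names, the block sizes and the data volume are assumptions made for this example, and a real study would also cover reads, parallel access and platform-specific I/O libraries.

```python
import os
import tempfile
import time


def time_write(path, total_bytes, block_size):
    """Write `total_bytes` zero bytes to `path` in blocks of `block_size`.

    Returns (bytes_written, elapsed_seconds). The file is flushed and
    synced so the timing includes pushing the data to the device.
    """
    block = b"\0" * block_size
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            chunk = block[: min(block_size, total_bytes - written)]
            f.write(chunk)
            written += len(chunk)
        f.flush()
        os.fsync(f.fileno())
    return written, time.perf_counter() - start


def compare_block_sizes(total_bytes=4 * 1024 * 1024,
                        block_sizes=(4096, 65536, 1048576)):
    """Time the same write volume for several block sizes (hypothetical values)."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for block_size in block_sizes:
            path = os.path.join(tmp, f"probe_{block_size}.bin")
            results[block_size] = time_write(path, total_bytes, block_size)
    return results


if __name__ == "__main__":
    for block_size, (written, elapsed) in compare_block_sizes().items():
        print(f"block size {block_size:>8} bytes: "
              f"{written} bytes written in {elapsed:.4f} s")
```

The timings depend heavily on the file system, caching and hardware, so the results are only meaningful when compared on the same platform under similar load.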
Our project will continue to analyse existing solutions. It will test which data handling strategies within model codes are best suited to which type of application. We will regularly investigate new machines and relevant changes to existing system architectures. A collaboration with a leading systems house has just started, to determine which off-the-shelf products could be used to provide more flexible data archival and retrieval mechanisms for scientific data.
Kerstin Kleese - CLRC
Tel: +44 1 925 60 3207