ERCIM News No.45 - April 2001 [contents]
MetaCenter - Building a Virtual Supercomputer
by Ludek Matyska, Michal Vocú and Ales Krenek
The MetaCenter project is building a nation-wide Czech computing and data storage GRID. Started in 1996 as an initiative of Masaryk University, its main goal is the development and deployment of middleware products that create a homogeneous environment on top of heterogeneous, geographically distributed computing resources.
Three academic high performance computing centers, located at Masaryk University in Brno, Charles University in Prague and West Bohemia University in Pilsen, currently cooperate under the umbrella of CESNET in projects whose long-term goal is the creation of an academic GRID for the Czech Republic: an environment which supports large-scale distributed and parallel applications and at the same time leads to more efficient use of the available computing resources. To fulfill this goal, ie to create a MetaCenter GRID environment which hides the details of individual computing resources and their distance from the end users, a set of appropriate middleware components is being built. The work is focused on the following areas:
Information services, both for end users, via a web-based interface, and for program developers, via appropriate APIs. Information about users, data sets and application programs is stored in an Oracle database within the perun system, which was developed as part of the MetaCenter project. The relevant data are regularly exported from the database to an LDAP-based directory service and to the Kerberos authentication system.
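The export step described above can be sketched as follows. This is an illustrative sketch only, not the perun implementation: the attribute names, the directory layout and the `metacentrum.cz` naming are assumptions chosen to show how a database user record might be serialized into an LDIF entry for the directory service.

```python
# Hypothetical sketch of the perun -> LDAP export: a user record from the
# central database is rendered as an LDIF entry. Schema and DN layout are
# invented for illustration.

def user_to_ldif(user):
    """Render one user record as an LDIF entry (illustrative schema)."""
    lines = [
        f"dn: uid={user['login']},ou=People,dc=metacentrum,dc=cz",
        "objectClass: posixAccount",
        f"uid: {user['login']}",
        f"cn: {user['name']}",
        f"uidNumber: {user['uid']}",
        f"homeDirectory: /afs/metacentrum.cz/home/{user['login']}",
    ]
    return "\n".join(lines)

if __name__ == "__main__":
    # Example record (fictitious user)
    print(user_to_ldif({"login": "jnovak", "name": "Jan Novak", "uid": 10042}))
```

A periodic job of this kind lets the directory service and the Kerberos database stay read-only mirrors of the authoritative data held in perun.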
Security, including uniform access to all computing resources. A single sign-on system based on the Kerberos 5 protocol was created, which allows users to authenticate just once per session. The Kerberos 5 implementation is based on the Heimdal system, which is being extended to suit MetaCenter requirements. Currently, a GRID-wide login name is associated with each user, but research towards virtual account mapping is under way; this will allow seamless collaboration with other GRIDs. The perun-based information services are built to allow on-demand creation of user accounts on individual machines and are therefore well prepared for the incorporation of virtual accounts and their mapping to actual physical persons.
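The virtual-account idea mentioned above can be illustrated with a minimal sketch. This is not the perun implementation: the `VirtualAccountPool` class, the `grid001`-style account names and the realm suffix are all hypothetical, chosen only to show the principle of mapping a GRID-wide identity to a local account on demand while keeping a record of who holds which account.

```python
# Hedged sketch of virtual-account mapping: a GRID-wide login is bound on
# demand to a pre-created local account, and the binding is remembered so the
# local account can always be traced back to a physical person.

class VirtualAccountPool:
    def __init__(self, local_accounts):
        self.free = list(local_accounts)   # unused local accounts on this node
        self.mapping = {}                  # GRID-wide login -> local account

    def local_account(self, grid_login):
        """Return the local account for a GRID-wide login, allocating on demand."""
        if grid_login not in self.mapping:
            if not self.free:
                raise RuntimeError("no free local accounts on this node")
            self.mapping[grid_login] = self.free.pop(0)
        return self.mapping[grid_login]

if __name__ == "__main__":
    pool = VirtualAccountPool(["grid001", "grid002"])
    # The same GRID identity always resolves to the same local account.
    print(pool.local_account("jnovak@METACENTRUM"))
    print(pool.local_account("jnovak@METACENTRUM"))
```

The appeal of the scheme is that collaborating GRIDs need agree only on identities, not on pre-existing local accounts at every site.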
Shared data space, giving users (and application programs) the illusion of location independence for programs and data sets. This is achieved via AFS, a distributed file system with globally unique file naming, and also via simple file transfer protocols using the single sign-on mechanism (currently the scp and ftp protocols are supported, with the goal of supporting their GRID-aware extensions such as GridFTP). High-capacity backup storage is also available and serves all GRID nodes uniformly. AFS is fully integrated with the single sign-on mechanism, and location independence is supported through replica servers of read-only data. All information about the physical placement of individual data volumes is stored in perun and hidden from end users, making administration of the whole GRID rather easy. Applications are also installed in AFS and are accessible through a system of modules. A module is a virtual entity representing access to a particular application. Instead of remembering the different locations of applications, scratch space etc., users wishing to access a particular application issue a single add <application> command which, among other things, creates the shell environment necessary to run the application. A global module name space (also stored in perun) ensures uniform access to applications regardless of their physical location in the GRID (and also supports easy use of floating licenses).
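The effect of the add <application> command can be sketched as below. The module table, AFS paths and license-server address are invented for illustration; the real module system operates on the user's shell environment, but the principle — one global table mapping an application name to the environment needed to run it — is the same.

```python
# Illustrative sketch of a global module name space: each module maps an
# application name to the environment settings needed to run it, wherever in
# the GRID the software physically lives. All paths here are hypothetical.

MODULES = {
    "gaussian": {"PATH": "/afs/metacentrum.cz/software/gaussian/bin"},
    "matlab":   {"PATH": "/afs/metacentrum.cz/software/matlab/bin",
                 "LM_LICENSE_FILE": "27000@license.metacentrum.cz"},
}

def add(application, env):
    """Merge a module's settings into a session environment, like `add <application>`."""
    for var, value in MODULES[application].items():
        if var == "PATH" and "PATH" in env:
            env[var] = value + ":" + env[var]   # prepend, keep existing PATH
        else:
            env[var] = value
    return env

if __name__ == "__main__":
    session = add("matlab", {"PATH": "/usr/bin"})
    print(session["PATH"])
```

Because the table lives in one place (in perun, in the real system), an application can be relocated or replicated in AFS without users noticing.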
Batch systems are used to control all non-interactive use of the MetaCenter GRID (in fact, ideas to support even interactive jobs via the batch queue system are currently being discussed and tested). The original system of choice was LSF (Load Sharing Facility), but despite its advantages it is currently being replaced by the OpenPBS batch system. The main reasons are cost (LSF is very costly, while OpenPBS is available under open licensing terms) and the ability to repair errors and create our own extensions, which are required to support more advanced and experimental scheduling policies; LSF is available only in binary form, which cannot be directly modified or extended. A set of global batch queues has been created, and users, when submitting their jobs, can leave the decision where the programs will run entirely to the batch system, thus increasing the total system throughput.
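The throughput argument behind global queues can be made concrete with a toy scheduler. This is a deliberately simplified sketch, not LSF or OpenPBS logic: node names and the "most free CPUs wins" rule are assumptions, standing in for the much richer scheduling policies the project experiments with.

```python
# Toy sketch of global-queue scheduling: when the user does not pin a job to a
# machine, the batch system places it on the least-loaded eligible node,
# raising overall throughput. Node data are invented for illustration.

def pick_node(nodes, cpus_needed):
    """Choose the eligible node with the most free CPUs, or None to keep queuing."""
    eligible = [n for n in nodes if n["free_cpus"] >= cpus_needed]
    if not eligible:
        return None  # job waits in the global queue until resources free up
    return max(eligible, key=lambda n: n["free_cpus"])["name"]

nodes = [
    {"name": "brno1",  "free_cpus": 2},
    {"name": "praha1", "free_cpus": 8},
    {"name": "plzen1", "free_cpus": 0},
]

if __name__ == "__main__":
    print(pick_node(nodes, 4))  # -> praha1, the only node with 4 free CPUs
```

Being able to replace exactly this kind of placement rule with an experimental one is what the open OpenPBS sources allow and the binary-only LSF does not.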
High-speed network infrastructure is provided by the nation-wide academic backbone CESNET 2 (2.5 Gb/s between Prague and Brno, and 34 Mb/s between Pilsen and Prague, to be upgraded this year to at least 1 Gb/s) and by metropolitan area networks with 155 Mb/s connections. This network provides the high bandwidth and low latency necessary for distributed and parallel applications. Recently purchased PC clusters located in Prague and Brno are connected directly to the backbone via 1 Gb/s uplinks, providing an environment for studying the influence of latency on distributed applications.
Figure: Czech computing and data storage GRID centres.
Since 2000 the MetaCenter project has been involved in two important Europe-wide Grid activities: the European Grid Forum (http://www.egrid.org/) and the DataGrid project (http://grid.web.cern.ch). Within the framework of the former, we took an active part in a successful demonstration of EGrid functionality during the SC2000 conference in November 2000. The demonstration presented a dynamic, migrating scientific computation (a 'worm') built on the Cactus and Globus metacomputing toolkits. Nine supercomputing centres in seven European countries were involved in the experiment. Within the DataGrid project, scheduling, related security problems and information services are our primary areas of interest.
Ludek Matyska - CRCIM / Masaryk University
Tel: +420 5 41 512 310