The Production-Level Hungarian ClusterGrid Initiative
by Péter Stefán
A breakthrough represented by the Hungarian ClusterGrid was achieved when production grid development commenced in mid-2002, involving hundreds of desktop computer nodes. Launching ClusterGrid was the result of concentrated research and development work and of organizational and management efforts. The investment has already paid off many times over: the 500 Gflops of supercomputing power of a production PC grid are now used in many scientific fields.
The Hungarian ClusterGrid Infrastructure handles the Grid problem in a slightly different way to contemporary grid systems, building up from low-level hardware to the application level by introducing a layered architectural model. As its name suggests, the basic building blocks of the grid are PC labs that perform dual functions. During the day, the labs, which are located at universities, high schools or even public libraries, serve an educational purpose in an office-like environment. Whenever they are not fulfilling this function, typically during nights and weekends, they are used for supercomputing in a Unix environment.
The production grid is used by researchers from many scientific fields and their industrial partners: mathematics, biology, chemistry, and data mining are a few examples of scientific areas that need large amounts of computational power and make use of such facilities with true parallel and parameter study applications. One of the most popular case studies is simulating the radiation process within the heater elements of a nuclear reactor.
At the technical level, there are many novel technological elements introduced in the ClusterGrid. The first innovative feature compared to traditional grid systems is that the labs are connected to one another through private computer networking (see Figure), using the capabilities of the high-bandwidth Hungarian Academic Network in order to enhance security and user confidence in the whole system.
The second innovative element is the use of dynamic user mapping during job execution. One of the most serious causes of bottlenecks in contemporary solutions is the insufficient separation of user and job credentials, which leads to monitoring and authorization problems. Furthermore, it is not necessary in the ClusterGrid architecture to have the submitter's user credentials configured at all clusters or supercomputers in the grid. This gives much more freedom to jobs traversing different resources (in fact the job becomes an atomic unit on which different operations, such as execution, transfer, storage etc, can be defined).
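The idea of dynamic user mapping can be illustrated with a minimal sketch. It assumes that each cluster keeps a pool of generic local accounts and binds one of them to a job only for the duration of its execution; the class name, the account names and the job identifier below are hypothetical and are not taken from the ClusterGrid software.

```python
from dataclasses import dataclass, field


@dataclass
class AccountPool:
    """Pool of generic local accounts on one cluster (hypothetical names)."""
    free: list = field(default_factory=lambda: ["grid001", "grid002"])
    leases: dict = field(default_factory=dict)  # job id -> local account

    def lease(self, job_id: str) -> str:
        """Bind a job to a free local account for the duration of its run."""
        if job_id in self.leases:
            return self.leases[job_id]
        if not self.free:
            raise RuntimeError("no free local account on this cluster")
        account = self.free.pop(0)
        self.leases[job_id] = account
        return account

    def release(self, job_id: str) -> None:
        """Return the account to the pool once the job has finished."""
        self.free.append(self.leases.pop(job_id))


pool = AccountPool()
account = pool.lease("job-42")   # the submitter needs no account of their own here
print(account)
pool.release("job-42")
```

In such a scheme the job, not the submitter, is the unit that the cluster knows about, which is why the submitter's credentials do not have to be configured on every resource.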
The third innovative idea is the Web-service transaction-based, stateful, and distributed resource broker, which provides interoperable gateway functionality to those grid systems built on classical disciplines, using an XML/SOAP interface. The broker itself contains a simple implementation of all basic grid services, such as grid information systems, job execution systems, and file transfer subsystems.
The fourth innovative element is the job definition format that allows a job to be defined in both static and in dynamic terms. Jobs are built up in directory hierarchies, ie all pieces of binary, library, input and output files are encapsulated into the structure, and at the same time the job is also a temporal entity, ie a set of operating system processes on different hosts and the relationships (communication, data transfer) between them. The runtime execution structure provides the following features: the job is allowed to take its complete environment to the place of the execution (even a licence file, or organization/virtual organization certificate), the job can be customized, and workflow definitions can be treated as part of the job. Sub-jobs (ie jobs within the master job) can be defined and executed, and meta-jobs, such as code compilation, can easily be treated as ordinary grid jobs.
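The static side of such a job definition can be sketched as a directory tree that carries everything the job needs. The layout below is an assumption made for illustration only: the article names the kinds of files that are encapsulated, but not the directory names, so `bin`, `lib`, `input`, `output` and `subjobs` are hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical layout; the article names the kinds of files, not the directories.
JOB_PARTS = ("bin", "lib", "input", "output")


def create_job(root: Path, name: str, sub_jobs: tuple = ()) -> Path:
    """Create a job directory that carries its whole environment with it."""
    job_dir = root / name
    for part in JOB_PARTS:
        (job_dir / part).mkdir(parents=True, exist_ok=True)
    for sub in sub_jobs:
        # A sub-job is an ordinary job nested inside the master job.
        create_job(job_dir / "subjobs", sub)
    return job_dir


def list_job(job_dir: Path) -> list:
    """Return the relative paths of all directories in the job, sorted."""
    return sorted(p.relative_to(job_dir).as_posix()
                  for p in job_dir.rglob("*") if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    job = create_job(Path(tmp), "master", sub_jobs=("compile",))
    for entry in list_job(job):
        print(entry)
```

Because a sub-job has the same structure as its parent, a meta-job such as code compilation can be handled by the same code path as any other job.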
Figure: Connection of grid clusters via MPLS VPN over the Hungarian Academic Network.
The ClusterGrid Infrastructure, which currently involves 1100 compute nodes, has a cumulative performance of 500 Gflops (500 billion floating-point operations per second), which is comparable with that of the top five hundred clusters. The system is also cost effective: the measured performance is achieved at an annual operational cost of 40 000 Euro. The framework works not only for the integration of PC clusters, but also of heterogeneous resources such as supercomputers.
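The figures quoted for the infrastructure can be turned into per-node and per-Gflops values with simple arithmetic. The short calculation below uses only the three numbers given in the article (1100 nodes, 500 Gflops, 40 000 Euro per year); the derived values are averages and say nothing about individual nodes.

```python
nodes = 1100                 # compute nodes in the grid
total_gflops = 500.0         # cumulative performance
annual_cost_eur = 40_000.0   # annual operational cost

gflops_per_node = total_gflops / nodes
cost_per_gflops = annual_cost_eur / total_gflops
cost_per_node = annual_cost_eur / nodes

print(f"average performance per node: {gflops_per_node:.2f} Gflops")   # 0.45
print(f"annual cost per Gflops:       {cost_per_gflops:.0f} Euro")     # 80
print(f"annual cost per node:         {cost_per_node:.2f} Euro")       # 36.36
```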
In the future, NIIF/HUNGARNET plans to improve the national production grid in both qualitative and quantitative terms. This means increasing the number of compute nodes to two thousand (or more), installing storage nodes and eliminating data network bottlenecks. The introduction of new technical solutions such as job gateways, SOAP-interfaced Web portals and experimental IPv6 grid technologies is also of key importance for the forthcoming development.
Péter Stefán, Office for National Information and Infrastructure Development (NIIF/HUNGARNET), Hungary