The GridML Project: Next-Generation Data Mining on High-Performance, Parallel Distributed Systems

by Csaba Szepesvári

The aim of the GridML project is to develop a prototype of an advanced, next-generation data-mining system, capable of more efficiently exploiting the vast computational resources of distributed, parallel computational environments. Researchers will look not only into the problem of predicting the performance of the algorithms separately, but also at the whole process of data mining, where the cost of making such predictions is also taken into account.

GridML (Grid-based Machine Learning) is a research project supported by the Economic Competitiveness Operative Programme (GVOP) of the Hungarian Government. The project began in January 2005 and will run for two years. The GridML Consortium consists of four partners: SZTAKI (coordinator) with the participation of the Machine Learning Group and the Laboratory of Parallel and Distributed Systems, the Research Group on Artificial Intelligence from the University of Szeged, AAM Consulting Inc, and Magyar Telekom Ltd (T-Systems).

There exist numerous applications areas for data mining. Examples include marketing, chemistry, CRM, environmental control and security, just to mention a few.

One characteristic feature of data mining is that it works with extremely flexible models, often involving dozens, sometimes hundreds of tunable parameters. The promise is that such flexible models are capable of capturing the complex relationships hidden in data. The price one must pay is a significant increase in the computational cost of fitting such models to the data. However, experience has shown that the quality of the resulting solutions makes this approach worthwhile.

Often however, the process is expensive not only in terms of computational resources but also in a monetary sense. This is primarily due to the high human resource costs, which follow from the current practice of assigning several data-mining experts for the full duration of the project. The tasks of these experts range from defining the problem, through obtaining and cleaning the data, to selecting the right model and tuning the models’ and learning algorithms’ parameters. Out of these, the latter two contribute most significantly to the total time spent on the project.

Due to the high cost of experienced experts, many data-mining projects are abandoned or never see the light of day. Another problem with the current approach is that the quality of the solutions obtained depends largely on the preferences or the experience of the experts involved. This significantly decreases the predictability of the results of the mining process, making companies uncertain of whether it is worth starting the project. It is not only industrial data mining that suffers from the issue of ad hoc parameter and model selection: researchers also face this problem due to the lack of sufficient computational power.

One possibility to overcome the computational bottleneck is to employ a large number of computers in parallel, connected into a cluster or Grid. Projects devoted to migrating data-mining algorithms to such distributed environments have existed for some time. The two most widespread approaches are either to parallelize the algorithms themselves or to let many instances of the algorithms (same or different) run in parallel.

It would be naive to think that distributed data mining is a universal cure. With the increase of computational resources, researchers and data-mining experts see new opportunities that would not previously have been feasible, and the available resources an soon become insufficient. Hence, the central issue investigated in our project is how to better utilize the available resources, with the ultimate goal of the project being to develop a prototype data-mining system that organizes computations in such a way that given a fixed computational budget it produces the best possible models in a reliable manner.

As such, the main research theme of the project belongs to the domain of meta-learning, a relatively new field concerned with the relationship of tasks or domains and learning algorithms. However, unlike most approaches to meta-learning, in our project the emphasis is on the online performance of the algorithms. Most previous methods produce an initial ranking of the various algorithms without giving further advice on how best to utilize the resources. In contrast, our approach not only provides an initial ranking, but also continuously monitors the performance of the various methods throughout the mining process, and then feeds the result back to the resource allocation strategy. The resource allocation problem is treated as an online learning problem, the best-known example of which is the so-called multi-armed bandit problem where a gambler faces the problem of deciding which arm of a K-slot machine to pull to maximize his total reward in a series of trials. The payoffs of the machines are random and initially unknown. After trying each machine at least once, the gambler has to decide if a machine with a high estimated payoff should be chosen or one that has a lower estimated payoff but which has been tried less often.

Researchers plan to apply bandit theory to optimizing the execution of data-mining algorithms on Grids.

In addition to researching the performance of various advanced online resource allocation schemes for optimizing distributed data mining, another goal of the project is to create a prototype system capable of meeting several technical requirements such as accountability, security and the automated documentation of the mining process.

Link:
http://zaphod.aml.sztaki.hu/wg/index.pl/gridml

Please contact:
Csaba Szepesvári, SZTAKI, Hungary
Tel: +36 1 279 6262
E-mail: szcsabasztaki.hu