Data Mining Applied in Anaerobic Wastewater Treatment
by Simon Lambert
The research and development project TELEMAC is creating an adaptable, customisable system for the monitoring and control of anaerobic wastewater treatment plants. One of the techniques being employed is data mining - the extraction of useful knowledge from data using a variety of techniques.
Coordinated by ERCIM, TELEMAC is a project within the European IST program that is developing innovative approaches to managing anaerobic wastewater treatment plants, with particular application to the wine industry (see also ERCIM News No. 48). One of the features of the project is the integration of a variety of approaches, including soft sensors, fault detection and isolation, and remote monitoring and access of multiple plants. The project has now been running for two years and has made significant progress.
Anaerobic digester at industrial scale (Sauza, Mexico).
Within TELEMAC, CCLRC has been working specifically on data mining. This is an important approach, since data, such as pH, temperature, and more advanced measurements such as volatile fatty acids (VFA), is constantly being accumulated. Data mining opens up the prospect of learning from this data in order to manage plants better. A number of possibilities are being studied:
- developing models or rules that help to predict dangerous conditions on the plant from trends in sensor readings
- detecting faulty sensors through inconsistent sets of readings
- partially substituting for expensive sensors by combining readings from more commonly available sensors.
Within the TELEMAC project, data is available from a number of plants of differing types and sizes. These range from large industrial waste-processing plants, through pilot-scale plants with a full range of instrumentation, to small laboratory-scale set-ups used for running specific experiments.
Data mining is often regarded as one part of the broader problem of knowledge discovery. Knowledge Discovery in Databases (KDD) is defined as 'a non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data' and data mining as 'exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns' (Kumar, V. & Joshi, M., Tutorial on high-performance data mining, at University of Minnesota). Although much of the activity is data mining, the goal is usually knowledge discovery. Data mining is not simply a matter of running algorithms for rule induction or neural networks; a considerable amount of preliminary work is usually required, and TELEMAC is no exception. The overall process includes stages such as data selection, data cleaning (for example, dealing with missing values or outliers), data reduction or enrichment, data preparation, the data mining itself, and reporting (through visualisation, statistics, etc.).
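As an illustration of the data cleaning stage, the following minimal sketch masks outliers and then fills gaps in a series of sensor readings. It is not the TELEMAC or Clementine procedure: the function names, the pH values, the forward-fill strategy and the median-based outlier test (with its threshold of 3.5) are all assumptions chosen for the example.

```python
from statistics import median

def forward_fill(values):
    """Replace None with the most recent observed value (leading Nones stay None)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def flag_outliers(values, threshold=3.5):
    """Flag readings whose median/MAD-based modified z-score exceeds the threshold."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    mad = median(abs(v - med) for v in observed)
    if mad == 0:
        return [False] * len(values)
    return [v is not None and 0.6745 * abs(v - med) / mad > threshold
            for v in values]

# Hypothetical pH readings with two gaps and one implausible spike.
ph = [7.1, 7.0, None, 7.2, 12.9, 7.1, None, 7.0]

flags = flag_outliers(ph)
# Treat flagged readings as missing, then fill every gap from the previous value.
cleaned = forward_fill([None if f else v for v, f in zip(ph, flags)])
```

Masking outliers before filling means a spike is never copied forward into a neighbouring gap. Whether forward filling is appropriate depends on how quickly the measured quantity changes between samples.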
The software being used at CCLRC for data mining is Clementine, a product available from SPSS. Clementine was developed in Europe as a commercial, general-purpose data mining tool and was adopted by the Business and Information Technology Department of CCLRC in 1998 for scientific applications such as the EC-funded project DECAIR. Like a number of visualisation systems (for example AVS, NAG IRIS Explorer, IBM Open DX), it presents a visual interface in which a user connects modules together so that data flows from one end to the other. Visualisation software such as XMDV is also being used.
Initial work focussed on using data mining to recover a known model from synthetic data, thereby giving prima facie assurance of the applicability of the approach. The fit was produced by predicting three key variables simultaneously, using a Pruned Neural Net modelling tool. The data was split into a test set and a training/validation set, and the training/validation set was in turn split randomly 50/50. The reason for this three-way split is to avoid over-training the neural net, a common difficulty faced in practice. It has proved possible to reproduce the synthetic data to a good level of accuracy.
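The three-way split can be sketched as follows. This is only an illustration of the idea, not the Clementine implementation: the function name, the fixed random seed and the 20% test fraction are assumptions, since the size of the test set is not stated here. Only the 50/50 division of the remaining records follows the description above.

```python
import random

def three_way_split(records, test_fraction=0.2, seed=0):
    """Return (training, validation, test) lists drawn from a shuffled copy."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)   # assumed test-set size
    test = shuffled[:n_test]
    remainder = shuffled[n_test:]

    half = len(remainder) // 2                    # random 50/50 split
    return remainder[:half], remainder[half:], test

# Record indices stand in for rows of synthetic plant data.
train, validation, test = three_way_split(range(100))
```

The training set is used to fit the network, the validation set to decide when to stop training or pruning, and the test set is held back so that the final accuracy is measured on records that played no part in either decision.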
Exploratory work has also been done on confidence and prediction intervals for TELEMAC data. This refers to the distinction between the accuracy of the fit itself (the regression) and the accuracy of the predictions arising from its use. These are different, since an individual predicted value carries extra variability due to noise.
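The difference between the two kinds of interval is easiest to see for a straight-line least-squares fit, which is used here purely for illustration; the TELEMAC models are neural networks, for which the computation is more involved. The function name and the data below are hypothetical. In this simple case each interval is the fitted value plus or minus a Student t quantile times the corresponding standard error.

```python
from math import sqrt

def interval_standard_errors(xs, ys, x0):
    """Fit y = a + b*x by least squares and return, at x0, the fitted value,
    the standard error of the fit and the standard error of a new observation."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual variance estimate with n - 2 degrees of freedom.
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s2 = rss / (n - 2)

    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    se_fit = sqrt(s2 * leverage)          # uncertainty of the regression line
    se_pred = sqrt(s2 * (1 + leverage))   # adds the noise of one new reading
    return intercept + slope * x0, se_fit, se_pred

# Hypothetical paired readings.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
y_hat, se_fit, se_pred = interval_standard_errors(xs, ys, x0=3.5)
```

The extra `1 +` term in the prediction standard error is the noise of the individual observation, so the prediction interval is always wider than the confidence interval at the same point.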
Simple modelling of target variables for some of the industrial data is also under way. The objective is to determine whether a neural net model constructed in one time period could be used to make predictions of a target variable in another time period. The technique used is the Extended Pruning approach with a measurement of VFA as the target, and it gave 84% accuracy when applied to data from another time period from a particular plant.
Techniques of rule induction are being applied to estimate values of sensor readings based on more easily obtained values, and to determine how reliable the models so developed remain over time. Rules are generated in forms such as the following:
Variable X falls in a particular range of high values if variable Y falls within a particular range of low values; this rule scores (N, p).
Here N and p are indications of the degree of satisfaction of the rule. Rule induction has the advantage that the rules generated are often meaningful to human domain experts, and can be critiqued and validated by them.
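The exact definitions of N and p are not given here. One common reading, assumed in the sketch below, is that N counts the records that satisfy the condition of the rule and p is the fraction of those records for which the conclusion also holds. The function name, the records and the thresholds are hypothetical.

```python
def score_rule(records, condition, conclusion):
    """Return (N, p): N records satisfy the condition, and a fraction p of
    those also satisfy the conclusion. Both meanings are assumptions."""
    covered = [r for r in records if condition(r)]
    n = len(covered)
    p = sum(1 for r in covered if conclusion(r)) / n if n else 0.0
    return n, p

# Hypothetical readings; the rule tested is "VFA is high if pH is low".
records = [
    {"ph": 6.2, "vfa": 3.1},
    {"ph": 6.4, "vfa": 2.8},
    {"ph": 6.3, "vfa": 1.0},
    {"ph": 7.1, "vfa": 0.6},
    {"ph": 7.3, "vfa": 0.4},
]

n, p = score_rule(
    records,
    condition=lambda r: r["ph"] < 6.5,    # "Y in a range of low values"
    conclusion=lambda r: r["vfa"] > 2.0,  # "X in a range of high values"
)
```

Under this reading, a rule with a large N and a p close to 1 is both widely applicable and reliable, which gives a domain expert a quick basis for deciding which rules merit closer inspection.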
A key question for future work is the handling of the evolution of the state of the plant over time. This will begin with the approach of Dorffner to neural networks for time series processing, in which a model is built up from training records that contain both prior time values and current values. This technique is often known as lagging.
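Lagging can be sketched as a simple transformation of a time series into training records. The function name, the choice of two lags and the VFA values are assumptions made for the example.

```python
def make_lagged_records(series, n_lags):
    """Turn a time series into (prior values, current value) training records."""
    records = []
    for t in range(n_lags, len(series)):
        prior = series[t - n_lags:t]   # the n_lags readings before time t
        records.append((prior, series[t]))
    return records

# Hypothetical VFA readings at successive sampling times.
vfa = [0.5, 0.6, 0.8, 1.1, 1.5]
lagged = make_lagged_records(vfa, n_lags=2)
```

Each record pairs a window of earlier readings with the reading to be predicted, so a model with no built-in notion of time can still be trained on the recent history of the plant. The first `n_lags` readings produce no record because they lack a full history.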
A further important question is the place of data mining in the final TELEMAC system. The project as a whole is now considering the form of the final TELEMAC system and its tools, but at the time of writing it seems that data mining will operate as a component at the Telecontrol Centre (responsible for monitoring several plants), and will run periodically to update the knowledge base. It remains an open question as to how to detect when it is necessary to re-run the data mining algorithms for a particular plant.
Simon Lambert, CCLRC
Responsible for data mining work
Tel: +44 1235 445716
Bruno Le Dantec, ERCIM
Tel: +33 4 9238 5013
Olivier Bernard, INRIA
Tel: +33 4 9238 7785