Data Mining applied to Air Pollution
by Brian J Read
An understanding of the behaviour of air pollution is needed to predict it and then to guide action to ameliorate it. Calculations with dynamical models are based on the relevant physics and chemistry. To help with the design and validation of such models, a complementary approach is described here. It examines data on air quality empirically by data mining, in particular with machine learning techniques, aiming for a better understanding of the phenomenon and a more direct interpretation of the data.
The work of the Database Group at CLRC has long concentrated on the practical application of data management technology. The emphasis is on helping users at the laboratory and externally to exploit the value in data. Implementing databases and providing easy access is the basis for this, supplemented by data exploration and decision support tools. More recently, interest has extended to data mining, or more fully Knowledge Discovery in Databases (KDD). This may be defined as "the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data". Data mining is just the discovery stage of the whole KDD process. Indeed, most of the work in practice lies in the preparatory stages of data selection and data cleaning. Extensive data exploration is essential if the data mining is to yield intelligent results.
Data mining is multi-disciplinary: it covers expert systems, database technology, statistics, machine learning ("AI") and data visualisation. It goes beyond directed querying of a database (e.g. by using SQL) by looking for hypotheses or questions rather than detailed answers. Most interest is in mining commercial data, for example credit profiling or market basket analysis. However, it is starting to be used in scientific applications too. CLRC, as a leading research laboratory, has masses of data. Thus there is the motivation to see how data mining techniques might supplement the more traditional scientific analysis in formulating and testing hypotheses. Of specific interest are the induction of rules and neural net models.
Turning to environmental data, the measurement and possible control of air pollution are increasingly topical. In applying the KDD process, our objectives are two-fold:
- to improve our understanding of the relevant factors and their relationships, including the possible discovery of non-obvious features in the data that may suggest better formulations of the physical models
- to induce models solely from the data, both so that dynamical simulations can be compared with them and because they may be useful in their own right, offering (short-term) predictive power.
The investigation uses urban air quality measurements taken hourly in the City of Cambridge (UK). These are especially useful since simultaneous weather data from the same location are also available. The objectives are, for example, to look for and interpret possible correlations between each pollutant (NO, NO2, NOx, CO, O3 and PM10) and (a) the other pollutants and (b) the weather (wind strength and direction, temperature, relative humidity and radiance). Of particular interest are lags, that is, one attribute seeming to affect another with a delay of perhaps hours or days. Other factors are possible. For example, NOx is clearly lower on Sundays because there is less traffic.
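As an illustration of the search for lags, the following minimal sketch correlates one hourly series with another shifted by 0 to max_lag hours and reports the lag with the strongest correlation. It is not the method used in the study: the function names, the choice of Pearson correlation and the wind/NOx pairing in the usage note are assumptions made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / sqrt(var_x * var_y)


def lagged_correlations(driver, response, max_lag):
    """Correlate driver[t] with response[t + lag] for lag = 0..max_lag.

    Both series are assumed to be hourly, equally long and free of gaps.
    """
    result = {}
    for lag in range(max_lag + 1):
        paired_driver = driver[: len(driver) - lag]
        paired_response = response[lag:]
        result[lag] = pearson(paired_driver, paired_response)
    return result


def best_lag(driver, response, max_lag):
    """Return the lag (in hours) with the largest absolute correlation."""
    corrs = lagged_correlations(driver, response, max_lag)
    return max(corrs, key=lambda lag: abs(corrs[lag]))
```

A call such as best_lag(wind_speed, nox, 24) would then suggest the delay, within a day, at which wind speed is most strongly related to NOx. A strong lagged correlation is only a hint for further examination and does not by itself show that one attribute affects the other.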
The initial analysis concentrated on the daily maxima of the pollutants. This simplifies the problem, and the results provide a guide for a later full analysis. The peak values were further expressed as bands (e.g. low, medium and high). The bands relate directly to the standards or targets recommended by the Expert Panel on Air Quality Standards (EPAQS), in terms that the public can appreciate. The two principal machine learning techniques used are neural networks and the induction of rules and decision trees. Expressing their predictions as band values makes the results of such rules and models easier to understand.
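The reduction to daily maxima and bands can be sketched as follows. This is only an illustration: the band boundaries below are hypothetical placeholders, not the EPAQS values, and the function names are invented for the example.

```python
from datetime import datetime

# Hypothetical band boundaries for illustration only; NOT the EPAQS values.
# Each entry is (upper bound, label); values at or above the last bound are "high".
EXAMPLE_THRESHOLDS = [(100.0, "low"), (200.0, "medium")]
TOP_BAND = "high"


def daily_maxima(readings):
    """Reduce (timestamp, value) hourly readings to {date: maximum value}.

    Readings whose value is None are treated as missing and skipped.
    """
    maxima = {}
    for timestamp, value in readings:
        if value is None:
            continue
        day = timestamp.date()
        if day not in maxima or value > maxima[day]:
            maxima[day] = value
    return maxima


def to_band(value, thresholds=EXAMPLE_THRESHOLDS, top_band=TOP_BAND):
    """Map a value to the first band whose upper bound it falls below."""
    for upper, label in thresholds:
        if value < upper:
            return label
    return top_band


def daily_bands(readings):
    """Return {date: band label} for the daily maximum of each day."""
    return {day: to_band(peak) for day, peak in daily_maxima(readings).items()}
```

The band labels produced by daily_bands could then serve as the target attribute for a rule or decision tree learner, with the real boundaries substituted for the placeholders.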
Work so far supports the common experience in data mining that most of the effort is in data preparation and exploration. The data must be cleaned to allow for missing and bad measurements. Detailed examination leads to transforming the data into more effective forms. The modelling process is very iterative, using statistics and visualisation to guide strategy. The temporal dimension with its lagged correlations adds significantly to the search space for the most relevant parameters.
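One simple way to allow for missing and bad measurements is sketched below, under stated assumptions: readings outside a plausible range are marked as missing, and only short gaps with valid neighbours on both sides are filled by linear interpolation. The range limits, the maximum gap length and the function names are illustrative choices, not those of the study.

```python
def mark_bad(values, lower=0.0, upper=None):
    """Replace out-of-range readings with None so they count as missing.

    The default lower bound of 0.0 assumes concentrations cannot be negative;
    the upper bound is optional and instrument-specific.
    """
    cleaned = []
    for v in values:
        if v is None or v < lower or (upper is not None and v > upper):
            cleaned.append(None)
        else:
            cleaned.append(v)
    return cleaned


def fill_short_gaps(values, max_gap=2):
    """Linearly interpolate runs of at most max_gap missing values.

    A run is filled only if it has valid readings on both sides;
    longer runs and runs at either end of the series stay missing.
    """
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < n and filled[i] is None:
            i += 1
        end = i  # first index after the run of missing values
        gap = end - start
        if start > 0 and end < n and gap <= max_gap:
            left, right = filled[start - 1], filled[end]
            step = (right - left) / (gap + 1)
            for k in range(gap):
                filled[start + k] = left + step * (k + 1)
    return filled
```

Leaving long gaps unfilled is a deliberate choice here: interpolating across many hours would invent structure in exactly the lagged relationships the analysis is trying to detect.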
More extensive investigation is needed to establish under what circumstances data mining might be as effective as dynamical modelling. (For instance, urban air quality varies greatly from street to street depending on buildings and traffic.) A feature of data mining is that it can short-circuit the post-interpretation of the output of numerical simulations by directly predicting the probability of exceeding pollution thresholds. More generally, data mining analysis might offer a reference model in the validation of simulation calculations.
Brian J Read - CLRC
Tel: +44 1235 44 6492