SPECIAL THEME: E-GOVERNMENT
ERCIM News No.48, January 2002

Secure Dissemination of Census Results using Interactive Probabilistic Models

by Jirí Grim, Pavel Bocek and Pavel Pudil


A research team at the Institute of Information Theory and Automation of the Czech Academy of Sciences has proposed a new user-friendly method of interactively presenting census results. The method is based on estimating a probabilistic model of the original microdata in the form of a discrete distribution mixture, which can be used as the knowledge base of a probabilistic expert system.

Last year, the countries of the European Union organised a complete, coordinated census as an important basis for future co-operation. Both its scope - the entire population of the European Union was included - and its corresponding cost made this General Census unique. The huge expense is officially justified by the key significance of a census for national economies and their institutions, a significance that will only grow in the coming ‘e-government’ age of decision-making. Unfortunately, despite these considerable costs, the availability of census results is greatly limited by the necessary confidentiality conditions. The new user-friendly method of interactive presentation of census results developed at the Institute of Information Theory and Automation of the Czech Academy of Sciences can help to solve this problem. The method is based on estimating a probabilistic model of the original microdata in the form of a discrete distribution mixture, which can be used as the knowledge base of a probabilistic expert system. The final software product is able to deduce any required information solely from the estimated model - without any further contact with the original data.

In this way the information contained in census data can be made freely accessible without any risk of confidentiality violation.

The availability of census results is greatly limited by the need to preserve confidentiality. Since an anonymous respondent can often be identified by combining a sufficient amount of external information, the individual census records (microdata) must not be directly accessible to general users. Thus the confidentiality conditions, however unavoidable, are rather restrictive for economic and social research and cause under-utilisation of data that have been collected at great cost.

Census results are mostly published in the form of tables. The exact relative frequencies of suitable feature combinations are stored in the cells of multiway tables. Usually only low-order tables (eg, 6-10 variables) can be stored and distributed because of technical limitations, since the number of table entries increases quickly with the number of combined variables. For example, if we combine only pairs of variables (questions), we obtain hundreds of thousands of possible table cells, many of which could be interesting and useful for users in specific situations. The problem of accessibility of census information can hardly be solved by choosing some ‘relevant’ subsets of variables, since potential users may formulate very specific and diverse queries. Regardless of the choice adopted, a huge part of the potential statistical information would remain unpublished. Moreover, appropriate techniques must be employed to test whether the published cells are sufficiently anonymous.
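The growth in the number of table cells can be illustrated with a short calculation. The sketch below is only an illustration: the questionnaire size (60 questions with 10 answer categories each) is a hypothetical assumption, not a figure from the census, and the function name `count_cells` is ours.

```python
from itertools import combinations
from math import prod


def count_cells(category_counts, order):
    """Total number of cells over all tables that combine `order` variables.

    `category_counts[i]` is the number of answer categories of variable i.
    A table over a given combination of variables has one cell per
    combination of their categories.
    """
    return sum(prod(combo) for combo in combinations(category_counts, order))


# Hypothetical questionnaire (an assumption for illustration only):
# 60 questions, each with 10 possible answers.
category_counts = [10] * 60

cells_for_pairs = count_cells(category_counts, 2)
cells_for_triples = count_cells(category_counts, 3)

print("cells in all two-way tables:  ", cells_for_pairs)
print("cells in all three-way tables:", cells_for_triples)
```

Under these assumed sizes the two-way tables alone already contain on the order of hundreds of thousands of cells, and each additional variable multiplies the total further.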

The most informative method of publishing is the dissemination of representative subsets of microdata displaying statistical properties similar to those of the original census database. For this purpose, the selected subsets of microdata have to be made anonymous using various techniques, such as data swapping and the identification and perturbation of unsafe records, to prevent any disclosure of individual respondents. Different disclosure risk models are used to guide the identification of unsafe records in a microdata file so as to provide maximum data protection with minimum loss of information content. Unfortunately, both the choice of a subset of the original data and the manipulation of the chosen original records negatively influence the accuracy of the contained statistical information. Despite careful preprocessing, the distribution of microdata remains a sensitive task because of the residual disclosure risk. For this reason, access to microdata is not guaranteed in all European countries and is regulated in almost all cases.
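To make the notion of an ‘unsafe’ record concrete, the following sketch flags records whose combination of identifying attributes is rare in the file. This is a deliberately simple frequency rule chosen for illustration, not one of the disclosure risk models referred to above; the field names, the sample records and the threshold are all hypothetical.

```python
from collections import Counter


def unsafe_records(records, key_fields, threshold=3):
    """Return indices of records whose key combination occurs fewer than
    `threshold` times, i.e. records that could be singled out by someone
    who knows those attributes from an external source."""
    keys = [tuple(record[field] for field in key_fields) for record in records]
    frequency = Counter(keys)
    return [i for i, key in enumerate(keys) if frequency[key] < threshold]


# Hypothetical microdata sample (invented values, for illustration only).
sample = [
    {"district": "A", "age_band": "30-39", "occupation": "teacher"},
    {"district": "A", "age_band": "30-39", "occupation": "teacher"},
    {"district": "A", "age_band": "30-39", "occupation": "teacher"},
    {"district": "B", "age_band": "80-89", "occupation": "surgeon"},
]

flagged = unsafe_records(sample, ["district", "age_band", "occupation"])
print("indices of records needing protection:", flagged)
```

Records flagged in this way would then be candidates for swapping or perturbation, which is exactly the step that reduces the accuracy of the released data.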

In view of these problems there is a common interest in developing new techniques to exploit the full information potential of census data. With this aim we have proposed a new flexible and user-friendly method of interactive presentation of census results by means of a probabilistic expert system. The method is based on maximum likelihood estimation of the underlying joint probability distribution of data records in the form of a discrete distribution mixture with product components. In this way the statistical properties of the data are described in a highly compressed form by a distribution mixture, which can be used without change as the knowledge base of a probabilistic expert system. Once estimated from the original data, the mixture model contains all statistical information about the microdata. Hence the final software product can derive statistical information from the estimated model without any further access to the original data, and the information supplied by the census can be made generally accessible without any risk to respondent anonymity.
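The idea can be sketched in a few lines of code. A discrete mixture with product components has the form P(x) = sum over m of w(m) * product over n of f_n(x_n | m), where the symbols w and f_n are our notation. The sketch below fits such a model with the EM algorithm, a standard way of computing maximum likelihood estimates for mixtures, which we assume here; the article does not name the estimation procedure. All function names, the synthetic data and the small smoothing constant are illustrative assumptions, not the authors' implementation.

```python
import random
from math import prod


def fit_mixture(records, n_categories, n_components, n_iter=50, seed=0):
    """Fit P(x) = sum_m w[m] * prod_n tables[m][n][x[n]] by EM."""
    rng = random.Random(seed)
    n_vars = len(n_categories)
    weights = [1.0 / n_components] * n_components
    # tables[m][n][v] = P(x_n = v | component m), randomly initialised
    tables = []
    for _ in range(n_components):
        component = []
        for k in n_categories:
            raw = [rng.random() + 0.5 for _ in range(k)]
            component.append([r / sum(raw) for r in raw])
        tables.append(component)
    for _ in range(n_iter):
        # E-step: accumulate posterior component memberships per record.
        totals = [0.0] * n_components
        counts = [[[1e-6] * k for k in n_categories]  # tiny smoothing term
                  for _ in range(n_components)]
        for x in records:
            joint = [weights[m] * prod(tables[m][n][x[n]] for n in range(n_vars))
                     for m in range(n_components)]
            norm = sum(joint)
            for m in range(n_components):
                q = joint[m] / norm
                totals[m] += q
                for n in range(n_vars):
                    counts[m][n][x[n]] += q
        # M-step: re-estimate weights and component tables.
        weights = [t / len(records) for t in totals]
        tables = [[[c / sum(row) for c in row] for row in component]
                  for component in counts]
    return weights, tables


def probability(weights, tables, evidence):
    """P(evidence), where evidence maps variable index -> value.
    Variables not listed are summed out, which for product components
    simply means leaving their factors out."""
    return sum(w * prod(component[n][v] for n, v in evidence.items())
               for w, component in zip(weights, tables))


def conditional(weights, tables, target, given):
    """P(target | given), computed from the model alone."""
    return (probability(weights, tables, {**given, **target})
            / probability(weights, tables, given))


# Synthetic 'microdata' (invented): three yes/no questions, two hidden groups.
data_rng = random.Random(1)
records = []
for _ in range(500):
    p_yes = 0.9 if data_rng.random() < 0.5 else 0.1
    records.append(tuple(int(data_rng.random() < p_yes) for _ in range(3)))

weights, tables = fit_mixture(records, n_categories=[2, 2, 2], n_components=2)

# From here on only the model is used; the records are no longer needed.
print("P(q1 = yes | q0 = yes) =", conditional(weights, tables, {1: 1}, {0: 1}))
```

The design point this illustrates is that any marginal or conditional query reduces to sums of products of the stored component tables, so answering a user's question never requires going back to the individual records.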

The fundamental motivation of our research has been the application of the proposed method to the General Census of the Czech Republic in 2001, organised in coordination with all the countries of the European Union. However, the project has an obvious international dimension, since the same approach could be applied in other European countries. The following facts illustrate its significance:

  • since it is based on a probabilistic model, the method makes the statistical information contained in a census freely available to a large community of potential users
  • the confidentiality protection is perfectly guaranteed by avoiding user contact with the original microdata
  • the user may formulate questions relating to the statistical information without any constraints
  • the final software product could be easily distributed on CD or via the Internet
  • the proposed solution could effectively increase the information potential of the statistical offices.

The project presentation was awarded the ‘F. de P. Hanika Memorial Award’ at the Eleventh European Meeting on Cybernetics and Systems Research in Vienna, April 1992. The practical application of the proposed method to the General Census in 2001 can build on its complete verification on a database of 535,000 Prague households from the 1991 Czechoslovakian census. In that experiment all aspects of the proposed solution were successfully tested.

Link:
Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic:
http://www.utia.cas.cz/RO

Please contact:
Jirí Grim, CRCIM - UTIA
Tel: +420 2 6605 2215
E-mail: grim@utia.cas.cz