ERCIM News No. 55, October 2003
SPECIAL THEME: Machine Perception

Video Understanding and Indexing for Surveillance: Image Perception, Quality and Understanding

by Tamás Szirányi

Motion tracking and scene analysis, especially in surveillance systems, require high-level interpretation of possible shapes and the events they take part in, even under incomplete viewing conditions and transient motion. To this end, a multi-camera surveillance system and new, efficient algorithms have been developed at the Analogical and Neural Computing Systems Laboratory of SZTAKI, in which fragments of motion and of bodies are grouped through statistical inference.

What is the common thread between indexing archive films and registering objects in surveillance tasks? In both cases, some understanding of the scene is required. In the case of films, we can be sure that the director's intention lies behind the movement of the camera. In surveillance, however, there is no director, no area of focus and not even a structured scene, so statistical inference is needed to extract meaningful information. Effects such as illumination or association can help to clarify incomplete visual data and support human visual understanding. These effects are exploited in composed films, and could also be helpful in interpreting poor surveillance data.

If surveillance problems are related back to better-defined scenes, then Bayesian approaches can lead to an understanding of unanticipated events. The main problem is the duration of events. While film sequences have a beginning and an end, surveillance events are usually transient, dissolving scenes. The arbitrary motions of low-resolution objects (think of a police camera surveying a whole street) are not easily interpreted. However, if we compose beforehand a set of possible events from motion samples and statistical induction, then real-life surveillance events can be analysed more easily. Although successfully connecting these two distinct areas of visual analysis is an achievement, problems can occur on both sides: the definition of objects and motion in transient events, the indexing of sequences and the interpretation of scenes.
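The idea of composing a set of possible events beforehand and then inferring which one a transient observation belongs to can be illustrated with a minimal Bayesian sketch. This is not the algorithm used in the system described here. The event names, priors, Gaussian speed models and the function `event_posterior` are all hypothetical, chosen only to show how a few noisy speed measurements update the probability of each pre-defined event.

```python
import math

# Hypothetical event models (assumptions for illustration only): each event
# has a prior probability and a Gaussian model of the observed object speed,
# measured in pixels per frame.
EVENT_MODELS = {
    "standing": {"prior": 0.5, "mean": 0.2, "std": 0.3},
    "walking": {"prior": 0.4, "mean": 2.0, "std": 0.7},
    "running": {"prior": 0.1, "mean": 6.0, "std": 1.5},
}


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def event_posterior(speeds, models=EVENT_MODELS):
    """Return P(event | speeds), treating speed samples as independent."""
    log_scores = {}
    for name, model in models.items():
        # Start from the log prior and add one log likelihood per sample.
        log_p = math.log(model["prior"])
        for s in speeds:
            # The tiny constant avoids log(0) for extreme outliers.
            log_p += math.log(gaussian_pdf(s, model["mean"], model["std"]) + 1e-300)
        log_scores[name] = log_p
    # Normalise in a numerically stable way (subtract the largest log score).
    top = max(log_scores.values())
    unnormalised = {k: math.exp(v - top) for k, v in log_scores.items()}
    total = sum(unnormalised.values())
    return {k: v / total for k, v in unnormalised.items()}


if __name__ == "__main__":
    # A short, transient track with speeds of about 2 pixels per frame.
    print(event_posterior([1.8, 2.3, 2.1]))
```

With these assumed models, a short track with speeds of around 2 pixels per frame gives most of the posterior mass to "walking", while an empty track simply returns the priors.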

Figure: Tracking and centre definition in a street scene containing slowly and irregularly moving objects of indefinite shape.

Motion and shape are hard to describe in real-life, noisy video surveillance. Using a greater number of cameras can help to obtain super-resolution or registered tracks, but this requires the cameras to be calibrated. The definition of shape or motion in outdoor applications is ambiguous. With unfixed cameras and transient object motion, the challenge is to apply statistical inference and optimisation to estimate the relative positions of the cameras and of the moving objects simultaneously. In developing a multi-camera surveillance system, our task is the detection of motion, and the segmentation and characterisation of moving objects. Information from the various camera units is exchanged to obtain more precise data on the direction and vectors of motion, and to recognise the object and its behaviour. These tasks need fast data transmission, comparison of images, and continuous evaluation and interpretation of image characteristics, which in turn requires high-speed processing to ensure real-time operation. We are working on relaxation-based object-tracking methods for indefinite object shapes, where moving objects following different paths may occlude each other or partially fade into other objects in low-contrast areas.
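The first steps named above, detecting motion and locating the centre of a moving region, can be sketched with simple frame differencing. This is a deliberately basic stand-in, not the relaxation-based tracking method under development. The grey-level threshold, the list-of-lists frame format and the function names `motion_mask` and `centroid` are assumptions made for illustration.

```python
def motion_mask(prev_frame, curr_frame, threshold=20):
    """Mark pixels whose grey level changed by more than `threshold`.

    Frames are equal-sized lists of rows of integer grey levels (0-255).
    """
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


def centroid(mask):
    """Return the (row, col) centre of the marked pixels, or None if none moved."""
    count = row_sum = col_sum = 0
    for r, row in enumerate(mask):
        for c, moving in enumerate(row):
            if moving:
                count += 1
                row_sum += r
                col_sum += c
    if count == 0:
        return None
    return (row_sum / count, col_sum / count)


if __name__ == "__main__":
    # A 4x4 dark background, then a bright 2x2 blob appearing in the middle.
    previous = [[0] * 4 for _ in range(4)]
    current = [row[:] for row in previous]
    for r in (1, 2):
        for c in (1, 2):
            current[r][c] = 200
    print(centroid(motion_mask(previous, current)))
```

A plain centroid like this is sensitive to noise and to objects that occlude one another, which is one reason a statistical grouping of motion fragments is attractive for indefinite shapes.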

This research has strong connections with theoretical work being done at the Laboratory of Image Processing at Veszprém University (led by Tamás Szirányi), where a new semi-automatic digital restoration system is being introduced for restoring motion pictures in film archives. The automation is controlled by occasional operator interactions. For this purpose, the film analysis is supported by cut detection and film indexing based on colour information and motion activity. Data are represented in XML, which also helps to keep the processes well defined and controllable. When restoring defective films, corresponding scene sequences must be registered with each other by scene-based indexing.
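Cut detection based on colour information is often illustrated by comparing colour histograms of consecutive frames. The sketch below takes that generic approach and should not be read as the restoration system's actual method. The bin count, the distance threshold, the flat list-of-pixels frame format and the function names `colour_histogram` and `detect_cuts` are all assumptions.

```python
def colour_histogram(frame, bins=4):
    """Normalised joint RGB histogram of a non-empty frame.

    A frame is a flat list of (r, g, b) tuples with components in 0-255.
    """
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in frame:
        # Clamp so that bin counts which do not divide 256 stay in range.
        ri = min(r // step, bins - 1)
        gi = min(g // step, bins - 1)
        bi = min(b // step, bins - 1)
        hist[ri * bins * bins + gi * bins + bi] += 1.0
    total = len(frame)
    return [h / total for h in hist]


def detect_cuts(frames, threshold=0.5):
    """Return indices i where a cut is declared between frames i-1 and i."""
    cuts = []
    prev = None
    for i, frame in enumerate(frames):
        hist = colour_histogram(frame)
        if prev is not None:
            # Half the L1 distance between normalised histograms lies in [0, 1].
            dist = 0.5 * sum(abs(a - b) for a, b in zip(prev, hist))
            if dist > threshold:
                cuts.append(i)
        prev = hist
    return cuts


if __name__ == "__main__":
    red_shot = [(250, 10, 10)] * 16
    blue_shot = [(10, 10, 250)] * 16
    print(detect_cuts([red_shot, red_shot, blue_shot, blue_shot]))
```

Histogram differences of this kind ignore where colours appear in the frame, so in practice they would be combined with other cues, such as the motion activity mentioned above.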

Our surveillance work is supported by appropriate hardware and network tools. These include intelligent camera units with processing and optimised networking capabilities, ultra-fast image-analysis engines based on Cellular Neural Network processors, and wired and radio transmission of images and data. The camera system is robust to arbitrary connections and geometry. We are developing the system for automatic grouping and calibration.

Another important issue involves human perception and the information content of the digital image, including the artistic interpretation of a scene. We have run several tests with human observers to qualify methods in which artefacts and objects must be separated.

Our project is partly supported by the Hungarian National Research and Development Program: TeleSense NKFP 2001/02/035. With this activity, we joined a new Network of Excellence project run by ERCIM: MUSCLE (Multimedia Understanding through Semantics, Computation and Learning).

Links:
http://www.sztaki.hu/~sziranyi
http://lab.analogic.sztaki.hu

Please contact:
Tamás Szirányi, SZTAKI
Tel: +36 1 279 6106
E-mail: sziranyi@sztaki.hu