ERCIM News No.47, October 2001
Scene Interpretation for Video Communication
by Monique Thonnat
A Video Communication System based on Scene Interpretation can improve permanent and informal communication between distant individuals. Such a system was built by the Orion team at Inria Sophia Antipolis between 1999 and 2000.
By Scene Interpretation, we mean detecting and tracking several individuals in a scene (eg an office) and recognizing their behavior (eg two persons are meeting near the blackboard). Video communication is intended to allow permanent and informal communication and group awareness between spatially separated individuals (including video-conferencing). Current video communication (VC) systems raise two problems. First, the static nature of the cameras interferes with real-world activity; ie the VC system should automatically follow and zoom in on the main individual in the office. Second, the level of availability (protection of privacy) has to be set automatically; otherwise, studies show that VC systems will be less accepted.
For these reasons, we propose using an interpretation system as part of the global VC system to automate these tasks.
Given image sequences of everyday office life, our proposed interpretation system is able to analyze the behavior of individuals. This system is composed of three modules. First, the detection and classification module detects moving regions in the images by subtracting a background image, and classifies these regions into mobile objects, labelling each with its type, such as individual or noise. Second, the tracking module associates the mobile objects with already tracked targets that correspond to real individuals. Third, the scenario recognition module recognizes the scenarios that describe the behavior of the tracked individuals. Finally, thanks to the scenario analysis, the VC system can decide what to send to the network: a zoomed sub-image of the camera or a filtered (blurred) image. For example, if the system recognizes a work meeting situation, the VC system can automatically broadcast a blurred image to protect the visitors' privacy. Moreover, if the scenario "the user is writing on the blackboard" is recognized, the system can broadcast an image zoomed in on the blackboard area.
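To make the first module more concrete, here is a minimal sketch of detection by background subtraction, written in Python. It is an illustration under stated assumptions, not the Orion implementation: the threshold value, the function names and the area-based rule used to separate individuals from noise are all hypothetical, and every changed pixel is grouped into a single region, whereas a real detector would separate connected regions.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]          # grayscale image: rows of intensities (0-255)
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def moving_mask(frame: Image, background: Image, threshold: int = 30) -> List[List[bool]]:
    """Mark pixels whose intensity differs from the background by more than threshold.

    The threshold value is an assumption made for this sketch.
    """
    return [
        [abs(pixel - reference) > threshold for pixel, reference in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


def bounding_box(mask: List[List[bool]]) -> Optional[Box]:
    """Return the box enclosing all marked pixels, or None if nothing moved."""
    coords = [(r, c) for r, row in enumerate(mask) for c, on in enumerate(row) if on]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))


def classify(box: Box, min_area: int = 4) -> str:
    """Hypothetical rule: regions smaller than min_area pixels are labelled noise."""
    top, left, bottom, right = box
    area = (bottom - top + 1) * (right - left + 1)
    return "individual" if area >= min_area else "noise"


# Toy data: a uniform 4x5 background and a frame with a bright 2x2 patch.
background = [[10] * 5 for _ in range(4)]
frame = [row[:] for row in background]
for r in (1, 2):
    for c in (2, 3):
        frame[r][c] = 200

mask = moving_mask(frame, background)
box = bounding_box(mask)
label = classify(box) if box is not None else None
```

On this toy input the bright patch is the only moving region, so `box` encloses it and the region is labelled as an individual under the assumed area rule.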
In this article, we focus (1) on the tracking module and (2) on the behavior recognition module. Tracking is a central issue in scene interpretation, as the loss of a tracked object prevents the analysis of its behavior. We have developed a tracking method based on a 3D model of the scene (including 3D geometric and semantic information), on explicit models of individuals (including the 3D size of the mobile objects and the presence of skin areas) and of individuals' trajectories, and on the computation of several possible paths for each individual (using a time delay to improve the robustness of the algorithm). The resulting algorithm is able to cope with severe occlusions, merging and splitting of individuals, and errors in detection and classification. For the behavior recognition module, we have developed a description language, close to natural language, which enables users to specify the behaviors they want to be recognized. In this way, users can configure their own VC system and set the desired level of privacy.
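The idea of computing several possible paths and delaying the decision can be illustrated with a small Python sketch. It is only an assumption-laden toy: the article does not give the cost function or the search strategy, so the example scores a path by the distance travelled in the image plane and enumerates every combination of candidates over a short buffer of frames (which is only practical for small buffers). The names `path_cost` and `best_path` are hypothetical.

```python
from itertools import product
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def path_cost(start: Point, path: Sequence[Point]) -> float:
    """Total distance travelled from start through each point of the path.

    Using travelled distance as the cost is an assumption of this sketch.
    """
    cost = 0.0
    previous = start
    for point in path:
        cost += dist(previous, point)
        previous = point
    return cost


def best_path(start: Point, candidates_per_frame: List[List[Point]]) -> Tuple[Point, ...]:
    """Enumerate every path through the buffered frames and keep the cheapest one."""
    return min(
        product(*candidates_per_frame),
        key=lambda path: path_cost(start, path),
    )


# Last confirmed position of one tracked individual.
start = (0.0, 0.0)

# Detections in the next three buffered frames. In the first frame a noise
# detection (0.0, 0.8) lies closer to the target than the true one (1.0, 0.0).
buffered = [
    [(1.0, 0.0), (0.0, 0.8)],
    [(2.0, 0.0)],
    [(3.0, 0.0)],
]

# A frame-by-frame choice would pick the nearest detection in the first frame.
greedy_first = min(buffered[0], key=lambda p: dist(start, p))

# Waiting for the whole buffer lets the smoother overall path win.
chosen = best_path(start, buffered)
```

Here the immediate nearest-neighbour choice selects the noise detection, while the delayed choice over three frames keeps the straight trajectory, which is the kind of robustness the time delay is meant to provide.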
This interpretation system has been tested on several office sequences. The longest sequence is 15 minutes long. Some of these sequences are available on the web, at http://www-sop.inria.fr/orion/personnel/Alberto.Avanzi. These sequences show people entering the office together, crossing and overlapping each other. An example of recognition of the scenario "two people are meeting in the office" is illustrated in the figure. This scenario illustrates the case where guests are visiting the office and the availability level should decrease (we don't want to broadcast pictures of guests). Guests can be recognized according to the place where they sit. In a 3D animation, a flag (cube) showing the availability level changes its color according to the number of people in the office: green (nobody in the office, maximum availability), orange (low probability of guests in the office, medium level of availability) and red (high probability of guests in the office, meeting situation, no availability). Currently, the two main limitations of the system are the mixing up of individual identifiers in difficult cases (eg long crossings) and the handling of situations where individuals are confused with mobile objects corresponding to noise (eg a chair moved by a person).
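The three-colour availability flag can be summarized in a few lines of Python. This is a sketch of the mapping as described in the text, not the system's actual code: the function name, the representation of guest presence as a probability between 0 and 1, and the 0.5 cut-off between "low" and "high" are all assumptions.

```python
from enum import Enum


class Availability(Enum):
    GREEN = "maximum availability"
    ORANGE = "medium availability"
    RED = "no availability"


def availability_level(
    people_in_office: int,
    guest_probability: float,
    high_threshold: float = 0.5,  # assumed cut-off between low and high probability
) -> Availability:
    """Map the office state to the colour of the availability flag."""
    if people_in_office == 0:
        # Nobody in the office.
        return Availability.GREEN
    if guest_probability >= high_threshold:
        # High probability of guests: meeting situation.
        return Availability.RED
    # Somebody is present, but guests are unlikely.
    return Availability.ORANGE


empty_office = availability_level(people_in_office=0, guest_probability=0.0)
working_alone = availability_level(people_in_office=1, guest_probability=0.1)
meeting = availability_level(people_in_office=3, guest_probability=0.9)
```

Keeping the cut-off as a parameter reflects the goal of letting each user configure the desired level of privacy.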
The first results of this Video Communication System based on Scene Interpretation are encouraging. We are planning to model more sophisticated scenarios and to complete the scene model to include 3D scene objects that are able to move (like chairs and doors). This new model will prevent paths from being mixed with mobile objects corresponding to the motion of these scene objects. This work has been done in collaboration with the IIHM and PRIMA teams from the University Joseph Fourier in Grenoble.