Adaptive Scene and Traffic Control in Networked Multimedia Systems

by Leif Arne Rønningen

The Distributed Multimedia Plays (DMP) Systems Architecture provides three-dimensional multiview video and sound collaboration between performers over packet networks. To guarantee an end-to-end time delay under twenty milliseconds, and to obtain high network resource utilization, the perceived quality of audio-visual content is allowed to vary with the traffic in the network. Typical applications are in the next generation of televisions (Multimedia Home Platform (MHP) extended with DMP), games, education, and virtual meetings that have a realistic feel. The architecture is also suitable for use in creating virtual collaborations for jazz sessions, music lessons and distributed opera.

Research at the Norwegian University of Science and Technology is building on the concept of DMP, which was first proposed in 1996 as an extension to MHP. A collaborative effort is underway between researchers in the departments of Psychology, Telematics and Art and Media Studies, with research looking at the design and testing of visual sequences that support a comparison between realistic meetings and virtual collaborative meetings using three-dimensional, high-resolution display.. The temporal and spatial resolution will be varied for individual objects in scenes. Eye movements and object focus are tracked when subjects watch selected video clips. Practical applications for this developing technology could be in use before 2015.

A DMP terminal system is shown in Figure 1. The two upper layers must always be present, while the two lower layers represent wireless and fibre alternatives. Network nodes include resolution and traffic control in addition to normal routing functions.

Figure 1: The DMP Terminal System.

Scene Characteristics
To give the virtual collaboration a natural feel, the size, form and position of objects should be near natural size, the displayed picture shall be flicker-free, and individual pixels should not be visible at a viewing distance of fifty centimetres. The normal comfortable distance between people generally varies between fifty centimeters and several metres, but may also be several hundred metres. The natural viewing area of human eyes is used as the ‘frame’ of the scene.

Sub-objects and 2D Interlace
In the RL algorithm - an integrated resolution and traffic control algorithm - each object is divided into two, four or more sub-objects, which are sent in separate streams. This is called the 2D interlace.

Video Segmentation and Multiview Shooting
Object recognition and tracking is carried out by analysing video sequences, shot by multiview cameras. Some of the most important factors for DMP are face and eye recognition and tracking. Note that the motion estimation of object-oriented scene objects is moved away from the coder (compressor) to the object- and eye-tracking systems.

Compression and Coding
Compression algorithms can make use of the Discrete Cosine Transform together with Huffman coding, or wavelet transforms that represent data in the time-frequency space. The Wavelet transform makes level scalability easy.

The RL Resolution and Traffic Control Algorithm
A detailed description of the RL-C algorithm can be found in the reference. Simulations show that the end-to-end delay can be guaranteed. However, formal tests must still be conducted to show when and to what extent the time-varying resolution of audio-visual content is acceptable.

Service and Scene Setup, and Address Allocation
SMIL (Synchronized Multimedia Integration Language) is used for scene composition. The SIP/TCP/IPv6 protocols are used for set-up, and the SIP URL identifies services, scenes and subscenes. Parameters in the combined RTP/UDP/IPv6 protocol header identify the same for content transfer. IPv6 addresses will be allocated in a private way, and the SIP and RTP protocols are adapted for this application.

Variable-Resolution Display
A viewer focuses on one scene object at a time, and most viewers focus on the same object at any given time. An eye-tracking system identifies these objects, and orders the encoder on the shooting side to shoot and represent these objects with high resolution.

A Virtual Dinner Scenario
Researcher A in Trondheim enters his dining room, sits on his sofa and requests a Virtual Dinner with researcher B in Padova, also sitting on his sofa. This interaction generates different levels of traffic from A sent to B. The system ‘finds’ two faces and a plate with food for researcher B. A and B talk for about 30 seconds (7 Gbps), then researcher A arises (0.5-1 sec), goes out for a plate of food (5-10 sec), is absent (1-2 min), returns (5-10 sec) and sits down (0.5-2 sec). This increases traffic to nearly 60 Gbps, which then drops to the background of 127 Mbps. The system tracks the plate and the food. They start eating and talking, during which their faces, arms, hands, plates and food dominate the data rate, about 8 Gbps.

Figure 2: Virtual Dinner, traffic generated by researcher A (time-scale not correct)

Researcher B stands up and walks across the room. After a few minutes they need to talk to researcher C in Poznan, and establish a three party DMP. After eating, B leaves the room. A asks the system to disconnect B. Meanwhile, researcher C has to leave home and go to his office, but he wants to continue the session with A while travelling to the office.
PhD funding has been provided from the Research Council of Norway (NFR).

Link:
http://www.item.ntnu.no/~leifarne/

Please contact:
Leif Arne Rønningen, Norwegian University of Science and Technology (NTNU), Norway
Tel: +47 9003 0473
E-mail: leifarneitem.ntnu.no