by Claude Labit and Jean-Pierre Leduc
The rapid growth of communication services has prompted the investigation of new algorithmic avenues to address the perpetual rate/distortion trade-off in video coding applications. The problem is now posed under very severe constraints: bit-rate budgets are shrinking (eventually down to 8 kbit/s for domestic communication networks), while perceptual criteria demand high levels of reconstruction quality. Two approaches are explored. The first extends the standards already developed for image communication through successive improvements. The second proposes a complete change of the basic tools, introducing new methodological techniques that have already been validated by the Computer Vision community.
The compatible approach
The main issue in image-sequence coding, compared with still-image coding, lies in appropriate processing of the temporal changes, i.e. essentially the motion information. High performance can be reached by combining motion-compensated temporal prediction with linear transform techniques to form the so-called hybrid coding approach. The following reviews its basic algorithmic modules and the promising options that could reasonably increase their performance:
Many efforts have been made over the last two decades to improve the models and algorithms related to the estimation of motion vector fields. The foundations of these improvements can be classified as follows:
Increase of the motion model complexity: polynomial (affine or quadratic) models, identified over square or polygonal regions, enable the estimation of finer motion representations. Going far beyond pure translational motion, these techniques take into account components such as rotation, divergence and deformation. These different motion models can be organized in an ordered hierarchy, locally selecting the most efficient model on a region basis.
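As a minimal sketch of such a model (a standard 6-parameter affine field; the parameter names are illustrative, not taken from the text), the displacement at each pixel of a region is a linear function of its coordinates, with pure translation as the special case where the linear terms vanish:

```python
import numpy as np

def affine_motion_field(params, xs, ys):
    """Evaluate a 6-parameter affine motion model at pixel coordinates.

    params = (a0, a1, a2, b0, b1, b2); the displacement at (x, y) is
        dx = a0 + a1*x + a2*y
        dy = b0 + b1*x + b2*y
    A pure translation is the special case a1 = a2 = b1 = b2 = 0;
    rotation, divergence and deformation are combinations of the
    linear coefficients.
    """
    a0, a1, a2, b0, b1, b2 = params
    dx = a0 + a1 * xs + a2 * ys
    dy = b0 + b1 * xs + b2 * ys
    return dx, dy

# A small rotation about the origin: dx = -w*y, dy = w*x (w = 0.1).
xs, ys = np.meshgrid(np.arange(4), np.arange(4))
dx, dy = affine_motion_field((0.0, 0.0, -0.1, 0.0, 0.1, 0.0), xs, ys)
```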
Choice of alternative estimation frameworks: Block Matching Algorithms (BMA) are based only on temporal correlation measures. Other commonly experimented families of motion estimators are the gradient-based methods, namely the pel-recursive and block-recursive approaches.
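An exhaustive-search BMA can be sketched as follows (the sum of absolute differences is used here as the temporal correlation measure; block size and search range are illustrative choices):

```python
import numpy as np

def block_match(prev, curr, y0, x0, bsize=8, search=4):
    """Exhaustive-search BMA: find the displacement of the block at
    (y0, x0) in `curr` that minimizes the sum of absolute
    differences (SAD) against the previous frame."""
    block = curr[y0:y0 + bsize, x0:x0 + bsize]
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0 or y + bsize > prev.shape[0] or x + bsize > prev.shape[1]:
                continue  # candidate block falls outside the frame
            sad = np.abs(prev[y:y + bsize, x:x + bsize] - block).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv

# Synthetic check: shift a random frame by (2, 1) and recover the vector.
rng = np.random.default_rng(0)
prev = rng.random((32, 32))
curr = np.roll(prev, shift=(-2, -1), axis=(0, 1))
mv = block_match(prev, curr, 8, 8)
```

The double loop makes the cost quadratic in the search range, which is precisely why the gradient-based (pel- and block-recursive) alternatives mentioned above are attractive.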
Introduction of regularization techniques: to smooth the motion estimates and, simultaneously, reduce the encoding cost of such an information map, regularization techniques have to be added. These introduce smoothness constraints into the estimation process itself, under the constraint of preserving motion discontinuities along the borders of moving objects.
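A typical form of such a regularized criterion (a generic first-order smoothness formulation, not one prescribed by this text) is:

$$
E(\mathbf{d}) \;=\; \sum_{s} \bigl(I_t(s) - I_{t-1}(s - \mathbf{d}(s))\bigr)^2 \;+\; \lambda \sum_{\langle s, r\rangle} w_{sr}\,\lVert \mathbf{d}(s) - \mathbf{d}(r)\rVert^2 ,
$$

where the first sum is the displaced frame difference over pixels $s$, the second runs over neighbouring pixel pairs $\langle s, r\rangle$, $\lambda$ balances data fidelity against smoothness, and the weights $w_{sr}$ are switched off across detected motion discontinuities so that object borders are preserved.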
Well-conditioned motion estimation stage: rather than estimating the motion feature everywhere, i.e. at each pixel of a given region, it is more efficient to select the pixels which are relevant to a given motion and to treat the others as outliers.
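One simple way to make this selection (an illustrative robust rule; the text does not prescribe a specific test) is to flag as outliers the pixels whose motion-compensation residual deviates too far, in robust-statistics terms, from the bulk of the region:

```python
import numpy as np

def split_inliers(residuals, k=3.0):
    """Classify per-pixel motion-compensation residuals: pixels whose
    absolute residual stays within k robust standard deviations
    (median + k * 1.4826 * MAD) are kept as inliers of the dominant
    motion; the rest are flagged as outliers."""
    r = np.abs(residuals)
    med = np.median(r)
    mad = np.median(np.abs(r - med))
    thresh = med + k * 1.4826 * mad
    return r <= thresh

# Five pixels fit the dominant motion well; one clearly does not.
res = np.array([0.1, -0.2, 0.05, 0.15, 5.0, -0.1])
mask = split_inliers(res)
```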
Multiresolution motion analysis: multiscale techniques (based on a single resolution of image data but several embedded levels of motion labels) or multigrid techniques (based on pyramidal representations of both image and motion data) are now exploited to estimate large motion magnitudes, to speed up the convergence of iterative estimation procedures, or to avoid local minima when minimizing a functional that is not purely monotonic.
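The multigrid variant rests on a dyadic image pyramid, which can be sketched with plain 2x2 averaging (the reduction filter is an illustrative choice):

```python
import numpy as np

def dyadic_pyramid(image, levels=3):
    """Build a dyadic pyramid by 2x2 averaging; level 0 is full
    resolution. Each coarser level halves the apparent motion, so a
    large displacement becomes a small one that an iterative estimator
    can track, then refine level by level back to full resolution."""
    pyr = [image.astype(float)]
    for _ in range(levels - 1):
        a = pyr[-1]
        h, w = a.shape[0] // 2 * 2, a.shape[1] // 2 * 2
        a = a[:h, :w]  # crop odd rows/columns before pairing
        pyr.append((a[0::2, 0::2] + a[1::2, 0::2]
                    + a[0::2, 1::2] + a[1::2, 1::2]) / 4.0)
    return pyr

pyr = dyadic_pyramid(np.ones((16, 16)), levels=3)
```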
Two strategies have to be distinguished. In the first, no motion features are transmitted, which requires a predictive version of the motion vector for motion compensation; this often amounts to using the previously estimated motion vector, or an average of previously estimated vectors, to predict the motion feature at the current pixel. The second approach (as in the BMA technique) assumes transmission of the motion field, and the motion compensation error is then nearly identical to the motion estimation error.
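The first strategy can be sketched with a componentwise median over causal neighbours (one common choice for the "previous estimates" predictor mentioned above; a plain average would serve equally well):

```python
def predict_vector(left, above, above_right):
    """Predict the current motion vector from three causal neighbours
    (already decoded at both coder and decoder, so nothing needs to be
    transmitted) using a componentwise median."""
    def median3(a, b, c):
        return sorted((a, b, c))[1]
    return (median3(left[0], above[0], above_right[0]),
            median3(left[1], above[1], above_right[1]))

mv = predict_vector((1, 0), (2, 1), (2, -1))
```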
Whereas the usual DCT can be considered a powerful tool to decorrelate image data and take perceptual content into account, it is now well known that other orthogonal transforms can provide a better frequency-based approach with respect to time-frequency localization and regularity, properties that matter greatly for iterated versions of analysis/synthesis filters and also for robustness to quantization noise.
Along these lines, subband decomposition has been introduced using the filter bank formalism or wavelet theory. For VLBR purposes, the inherent multiscale representation of image data within a dyadic subband decomposition can obviously be very useful to design a hierarchy of reconstruction qualities (and associated bit rates), and appears adequate for versatile, compatible versions of similar encoding strategies fitted to a hierarchy of services or communication networks. Moreover, for ATM network technology, such an adaptive approach is necessary to adapt the coding options to the instantaneous priority of the communication services and to the bit rate currently available to transmit them.
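One level of such a dyadic split can be sketched with the Haar filter pair, the simplest analysis filter bank (the normalization here is illustrative):

```python
import numpy as np

def haar_subbands(img):
    """One level of a dyadic subband split using the Haar filter pair:
    returns the LL approximation and the LH, HL, HH detail bands, each
    at half resolution. Iterating on LL yields the multiscale
    hierarchy discussed above."""
    a, b = img[0::2, :], img[1::2, :]          # vertical low/high split
    lo, hi = (a + b) / 2.0, (a - b) / 2.0
    def hsplit(x):
        c, d = x[:, 0::2], x[:, 1::2]          # horizontal low/high split
        return (c + d) / 2.0, (c - d) / 2.0
    ll, lh = hsplit(lo)
    hl, hh = hsplit(hi)
    return ll, lh, hl, hh

img = np.arange(16.0).reshape(4, 4)
ll, lh, hl, hh = haar_subbands(img)
```

On this smooth ramp image the energy concentrates in LL while HH vanishes, which is exactly what makes coarse-to-fine bit allocation attractive.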
The main extension of the current scalar quantization scheme consists of taking into account the local (spatial or temporal) redundancies of successive image data. This naturally leads to vector quantization, built with or without codebook learning. The main reason why this compression tool was rarely used in the past was essentially its computational complexity. However, current approaches (e.g. lattice vector quantization or tree-structured VQ) have substantially alleviated this problem by imposing a geometric (tree-based or lattice-based) structure on the vector space to be quantized.
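A minimal sketch of the baseline technique (plain nearest-neighbour VQ with a tiny hand-made codebook, purely for illustration) makes the complexity issue concrete:

```python
import numpy as np

def vq_encode(vectors, codebook):
    """Plain nearest-neighbour VQ: each input vector is replaced by the
    index of its closest codeword. The exhaustive search is O(K) per
    vector for K codewords; the lattice- and tree-structured variants
    mentioned above cut this to roughly O(log K) by exploiting the
    geometric structure of the codebook."""
    # Squared Euclidean distances between every vector and every codeword.
    d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
idx = vq_encode(np.array([[0.1, 0.2], [0.9, 0.8], [0.1, 0.9]]), codebook)
```

Only the indices are transmitted; the decoder recovers an approximation by table lookup into the same codebook.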
Some examples of alternative frameworks
Analysis/synthesis filter banks decompose and reconstruct the signal on the basis of frequency coefficients. To draw a methodological link with other methods, let us say that, rather than (or as a complement to) frequency information, the image signal can also be decomposed by a structure- or object-oriented partition. This approach segments the image (resp. the spatiotemporal sequence) into homogeneous regions based on texture (resp. motion) features and on the detection of spatial edges (resp. motion discontinuities). This general approach can also be called an analysis/synthesis scheme, because it concatenates an analysis stage (at the coder) and a synthesis stage (at the decoder) of objects. Such an approach enables the introduction of algorithmic tools developed, sometimes over many years, by other scientific communities: Computer Vision, Computer Graphics, object-oriented programming, etc.
As concluding remarks, let us simply affirm that, rather than the discovery of a universal codec, future research will propose an ever-growing hierarchy of embedded solutions, providing encoding schemes well fitted to dedicated application fields and perceptually efficient when a human observer is, finally, the crucial referee.
To illustrate these considerations, we have proposed two new algorithmic studies as a partial contribution to this challenge. The first considers that, to achieve very low bit rates, the usual globally measured quality criterion is inadequate: in a given scene, some parts are of great interest (designated here as "regions of interest") and need high reconstruction quality, contrary to other regions (e.g. the fixed background) where no semantic information has to be perfectly reconstructed. The second study, using long-term temporal linking of an object-based segmentation within a Markov Random Field formulation, reports experiments on how these tools can increase the performance of a motion-compensated coding scheme.