Machine Perception and Understanding: Introduction
by Eric Pauwels
The unrelenting pace of innovation in computer and communication technology continues to affirm the central role played by information in modern society. Indeed, typical digital storage capacity is doubling every year, and bandwidth in both wired and wireless networks is increasing at an even faster rate, supporting instant and almost ubiquitous access to an abundance of resources. As a consequence, there is a pressing need for tools that can assist us in exploring the huge quantities of data with which we are confronted when managing large multimedia databases or monitoring complex sensor streams. These data can have intricate spatial, spectral or dynamic structures (eg text that refers to images, audio punctuating video) and are potentially an extremely valuable source of information. However, unless we can extract the knowledge buried in the bits and bytes, all these data serve little purpose. In this respect, it is becoming increasingly clear that in order to be efficient, data processing needs to be content-based. As the enormous size of these collections precludes comprehensive human supervision, the only viable alternative is the development of reliable machine perception and understanding, and in particular, the automatic creation of semantically rich metadata that can be used as input for ensuing high-level processing or decision support.
Addressing these challenges is quite a daunting task. Fortunately, we are witnessing prodigious activity in key scientific and technological areas that promise to have a profound impact on the way we tackle this deluge of information. First, progress in signal processing for the different modalities (image, audio, speech, etc) has given rise to sophisticated tools capable of performing reliably on specialised sub-problems (eg face- or motion-detection and text-to-speech synthesis). In addition, researchers are increasingly turning their attention to cross-modal integration, combining different modalities to maximise information extraction and robustness. Concurrently, progress in statistical and machine learning has boosted the wider acceptance of automated learning methodologies, and it has transpired that these techniques can contribute significantly to the automatic exploration and structuring of large datasets.
To underscore the urgency of the data-mining problem and its potential impact on society and industry, the European Commission has made semantic-based knowledge systems a priority theme in its call for Information Society Technologies for the upcoming Sixth Framework. ERCIM has positioned itself to play an important role by submitting a Network of Excellence (NoE) on Multimedia Understanding through Semantics, Computation and Learning (MUSCLE, currently in the negotiation phase). This NoE will run for four years and should stimulate closer collaboration between European groups working on research projects that aim to integrate machine perception and learning for multimedia data-mining. The consortium members have agreed to focus on different aspects of single- and cross-modal processing (in video, audio and speech) as well as various flavours of statistical learning.
To encourage close coordination of effort and durable scientific integration, MUSCLE will set itself two 'Grand Challenges'. These are ambitious research projects that involve the whole spectrum of expertise represented within the consortium and, as such, will act as focal points. The first challenge focuses on natural high-level interaction with multimedia databases. In this vision, it should become possible to query a multimedia database at a high semantic level. Think Ask Jeeves for multimedia content: one can address a search engine using natural language and it will take appropriate action, or at least ask intelligent, clarifying questions. This is an extremely complicated problem and will involve a wide range of techniques, including natural language processing, interfacing technology, learning and inferencing, merging of different modalities, federation of complex metadata, appropriate representations and interfaces, etc. The second Grand Challenge is related more closely to machine perception and addresses the problem of detecting and recognising humans and their behaviour in videos. At first glance, this might seem rather a narrow scope, but it has become clear that robust performance will rely heavily on the integration of various complementary modalities such as vision, audio and speech. Applications are legion: surveillance and intrusion detection, face recognition and registration of emotion or affect, and automatic analysis of sports videos and movies, to name just a few. For more information on this Network, we invite the reader to visit the MUSCLE Web page (see below).
This special issue highlights the breadth and depth of the research related to machine perception and understanding that is currently being conducted by various ERCIM groups. There continues to be a strong interest in biologically inspired approaches, which is hardly surprising, since nature still easily outperforms the most intricate technical solutions. Another trend that reasserts itself is the reliance on computationally intensive methodologies to extract statistical information and simulate computational models. Equally varied are the applications on display, ranging from mobile robots to support for virtual studios. Researchers are also enthusiastically embracing the advances in sensor and (wireless) communication technology that support the creation of networks of interacting and context-aware components, thus moving closer to the vision of genuine ambient intelligence. It is our hope that this Special Issue will offer the reader a taste of the exciting developments on which the various ERCIM laboratories are working.
Eric Pauwels, CWI