MultimediaN: Personalized Information Delivery
by Marcel Worring and Nellie Schipper
How can we make the best use of the abundance of multimedia information? The multidisciplinary PID project in the Netherlands is developing methods for capturing information, and then automatically indexing and presenting it to users in an optimal way.
The amount of multimedia information being captured and produced is constantly increasing. While providing the right information to a user is already difficult for structured information, it is much harder in the case of multimedia information. When multimedia collections become large, complete manual annotation is no longer an option. As a consequence, automatic indexing of multimedia is becoming an essential ingredient in any modern information system.
The state of the art in automatic video indexing is evaluated in the yearly TRECVID, an international video retrieval benchmark that focuses on news data. Our Mediamill team (UvA/TNO) has participated in all editions. Much progress has already been made in this field, and automatic indexing techniques now perform well enough to be useful for interactive retrieval. In TRECVID we have shown that successful indexing requires employing all available information about the data. For news video this means combining information from both the speech and the visual channel. Furthermore, analysis should not be restricted to the content of the two channels, but should also consider how the data is captured and what recurring use of style can be observed. For TRECVID 2004 we indexed 32 concepts, and we are now scaling up to 50-100. From there we use ontologies to increase, by orders of magnitude, the number of concepts for which an index is available.
Ultimately this will lead to automatic annotation both for produced video, like news and film, and non-produced video captured with a security camera or by someone walking around with a camera. In the latter case, the user can employ the speech channel for spoken annotation. Further, it is clear that the use of video restricts us to a two-dimensional representation of the world. Ideally, the three-dimensional world and the objects within it could also be stored in the database. We are working on 3D reconstruction methods from video for this purpose.
In the end, our information systems will be filled with large collections of images, videos, and 3D worlds, together with annotations of these data items. Deciding what to present to the user depends on a number of different factors. What device is the person using? Is the user sitting behind her PC in her office, or is she walking around in the field with her PDA? The decision also depends on the task being performed and the context in which it is performed. A cognitive engineering approach is vital in order to provide the user with the right information by taking into account the task, context and user capabilities.
Clearly, only a multidisciplinary approach can bring together all of the above. Experts are needed in computer vision, machine learning, information systems, information visualization and human-computer interaction. These different disciplines have been brought together in the PID project.
The PID project is part of the large-scale MultimediaN project funded by the Dutch government. It started on 1st April, 2004, and will have a total duration of four years. The PID consortium consists of research institutes (University of Amsterdam, TNO), system integrators (LogicaCMG, Compano/Ziuz) and application holders (Dutch Olympic Committee, Dutch Forensic Institute, the police). Thus, the project covers the whole chain from research to applications.
To study the above methodologies, a number of concrete applications are being pursued. For each, the whole chain from data capturing to presentation is considered, but each application places its emphasis on one of the elements. The application being developed with the Dutch Forensic Institute is the 3D reconstruction of crime scenes using video cameras. Indexing of the crime scene will be performed using a combination of speech and visual analysis. A project in collaboration with the Dutch Olympic Committee is developing a personal coach, which captures the 3D movement of athletes and combines this information with data from other sensors, such as heart rate. This information is then used for tracking and improving the athletes' performance. For the police, the emphasis is on the cognitive side, aiming at equipping police on the job with attentive mobile devices that provide them with relevant information. Finally, home videos are considered, where the emphasis lies on creating summaries of the data and finding relations within the data.
Marcel Worring, University of Amsterdam, The Netherlands
Tel: +31 20 5257521
Nellie Schipper, TNO, The Netherlands