Structuring Multimedia Archives with Static Documents
by Denis Lalanne and Rolf Ingold
If we consider static documents as structured and thematic vectors towards multimedia archives, they can be used as a tool for structuring events such as meetings. Here we present a method for bridging the gap between static documents and temporal multimedia data, such as audio and video. This is achieved by first extracting electronic document structures, then aligning them with multimedia meeting data, and finally using them as interfaces to access multimedia archives.
Interfaces to textual-document libraries are improving, but search and browsing interfaces for multimedia-document libraries are still in the early stages of development. Most existing systems are mono-modal: they allow searching for images, videos or sound, but only one medium at a time. For this reason, much current research in image and video analysis focuses on automatically creating indexes and pictorial video summaries to help users browse through multimedia corpuses. However, such methods are often based on low-level visual features and lack semantic information. Other research projects use language-understanding techniques or text captions derived from OCR in order to create more powerful indexes and search mechanisms. Our assumption is that in a large proportion of multimedia applications (e.g. lectures, meetings and news), classical printed documents or their electronic counterparts (referred to here as printable documents) play a central role in the thematic structure of discussions.
Unlike other multimedia data, static documents are highly thematic and structured, and thus relatively easy to index and retrieve. Documents carry a variety of structures that can be useful for indexing and structuring multimedia archives, but such structures are often hard to extract from audio or video. It is therefore essential to find links between documents and multimodal annotations of meeting data, such as audio and video.
A significant research trend toward recording and analysing meetings has recently emerged. This is done mostly in order to advance research on multimodal content analysis and multimedia information retrieval, which are key features for designing future communication systems. Many research projects aim at archiving recordings of meetings in forms suitable for later browsing and retrieval. However, most of these projects do not take into account the printed documents that often form part of the information available during a meeting. We believe printable documents could provide a natural and thematic means for browsing and searching through large multimedia repositories.
For this reason, we have designed and implemented a tool that automatically extracts the hidden structures contained in PDF documents. The semantic value of layout and logical structures is largely underestimated, and we believe that extracting them can substantially improve document indexing and retrieval, as well as linking with other media.
In order to browse multimedia corpuses using documents as interfaces, it is necessary to build links between printable documents, which are inherently non-temporal, and other temporal media. We use the term temporal document alignment to refer to the operation of extracting the relationships between document excerpts, at variable levels of granularity, and the meeting time at which they were presented. Temporal document alignment creates links between document extracts and the time intervals in which they were in either the speech focus or the visual focus. It is thus possible to align document parts with audio and video extracts, and by extension with any annotation of audio, video and/or gesture.
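The links produced by temporal document alignment can be pictured as a small table of records, each tying a document element to a time interval and a kind of focus. The following Python sketch is only an illustration under assumed names: the `Alignment` record, the `element_id` naming scheme, the helper functions and the sample intervals are all hypothetical and are not taken from the tool described here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Alignment:
    """One link between a document element and a meeting time interval."""
    element_id: str  # hypothetical identifier, e.g. "doc1/article2"
    start: float     # seconds from the start of the meeting (inclusive)
    end: float       # seconds from the start of the meeting (exclusive)
    focus: str       # "speech" or "visual"


def intervals_for(alignments: List[Alignment], element_id: str) -> List[Tuple[float, float]]:
    """Return the intervals in which an element was in focus, sorted by start time."""
    return sorted((a.start, a.end) for a in alignments if a.element_id == element_id)


def elements_at(alignments: List[Alignment], t: float) -> List[str]:
    """Return the ids of all elements in speech or visual focus at meeting time t."""
    return sorted({a.element_id for a in alignments if a.start <= t < a.end})


# Invented sample data: one document with two articles discussed in one meeting.
ALIGNMENTS = [
    Alignment("doc1/article1", 0.0, 95.0, "visual"),
    Alignment("doc1/article2", 95.0, 240.0, "speech"),
    Alignment("doc1/article1", 300.0, 360.0, "speech"),
]
```

With records of this kind, going from a document part to the audio or video extracts in which it was discussed is a lookup by element, and going from a point in the recording back to the document is a lookup by time.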
In the FRIDOC multimedia browser that we have developed, users can first search at a cross-meeting level by typing in a set of keywords, which retrieves all relevant documents. Clicking on a document or an article then allows users to view the multimedia data attached to this element and to jump directly to the portions of meetings in which it was in focus. At the intra-meeting level, all the components (documents, audio/video, transcription and annotations) are synchronized through the meeting time, thanks to the document alignments; clicking on one of them causes all the components to display their content for the same point in the meeting. For instance, clicking on a journal article cues audio/video clips from the time at which it was discussed, cues the speech transcription from the same time period, and displays the document that was projected.
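The two browsing levels can be sketched in a few lines of Python. This is an illustrative sketch only and does not reflect how the FRIDOC browser is implemented: the `INDEX` and `TRANSCRIPTS` structures, the function names and all sample data are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical index: document id -> (text, [(meeting id, start, end)] in seconds).
INDEX: Dict[str, Tuple[str, List[Tuple[str, float, float]]]] = {
    "article-a": ("election results and turnout", [("meeting-1", 120.0, 300.0)]),
    "article-b": ("football transfer news",
                  [("meeting-1", 300.0, 420.0), ("meeting-2", 0.0, 60.0)]),
}

# Hypothetical transcripts: meeting id -> [(start, end, utterance)].
TRANSCRIPTS: Dict[str, List[Tuple[float, float, str]]] = {
    "meeting-1": [
        (120.0, 200.0, "Let us start with the election piece."),
        (300.0, 350.0, "Now the sports page."),
    ],
}


def search(keywords: List[str]) -> List[str]:
    """Cross-meeting level: ids of documents whose text contains every keyword."""
    wanted = [k.lower() for k in keywords]
    return sorted(doc_id for doc_id, (text, _) in INDEX.items()
                  if all(k in text.lower().split() for k in wanted))


def jump_targets(doc_id: str) -> List[Tuple[str, float]]:
    """Meeting positions (meeting id, start time) at which a document was in focus."""
    return [(meeting, start) for meeting, start, _ in INDEX[doc_id][1]]


def cue(meeting_id: str, t: float) -> Dict[str, Optional[object]]:
    """Intra-meeting level: what each component should show at meeting time t."""
    document = next((doc_id for doc_id, (_, spans) in INDEX.items()
                     for m, s, e in spans if m == meeting_id and s <= t < e), None)
    utterance = next((text for s, e, text in TRANSCRIPTS.get(meeting_id, [])
                      if s <= t < e), None)
    return {"media_offset": t, "document": document, "transcript": utterance}
```

In this sketch, a click on a search result would call `jump_targets` to find where to go, and `cue` would then tell the audio/video player, the transcript view and the document view what to display for that moment.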
This work demonstrates the role of static documents as structured and thematic vectors towards multimedia archives and proposes a method for bridging the gap between static documents and multimedia meeting archives. The results obtained so far through user evaluations suggest that documents are an efficient means of accessing multimedia corpuses, such as multimedia meeting repositories or multimedia conference archives.
Rolf Ingold, University of Fribourg, Switzerland