SOMLib - New Approaches for Information Presentation and Handling

by Andreas Rauber

While tools exist that allow us to search through vast amounts of text within seconds, most systems fail to help the user get an overview of the information available or maintain orientation within an information space, and they also fail to convey meta-information in an intuitively graspable way. SOMLib is a digital library system that addresses these issues by providing automatic content-based organisation and metaphor-graphics-based visualisation, facilitating exploration and understanding of information spaces.

With the increasing availability of information in electronic form, be it online magazines, legal or medical document databases, project archives, or documentation on a company-internal intranet, advanced digital library systems that support users in interacting with large information repositories are gaining in importance. Yet, while databases and search engines help us retrieve snippets of information, current tools fail to give us a feeling of 'where' information is available and how different facts relate to each other. Keeping an overview of factors such as the information available, the topics covered by a given site, and the amount of information available on a given topic is only poorly supported. While modern information systems offer powerful search methods, it seems difficult to provide equally powerful means of organising and structuring the information.

What we would like are ways of organising and representing information that let us use the concepts we rely on constantly, and unconsciously, when handling and navigating real-world information spaces. Libraries, bookstores, project documentation in binders, working material and paper collections are all conventionally organised by thematic criteria, among others. This allows us to see immediately which kind of information is available in which section of an archive, how many reports have been filed on a specific topic in a binder, and so on. Spatial location also makes it easier to find a paper or report a second time, as it is easier to recall roughly where a given document was located than to remember sufficiently precise search criteria or its relative position within a listing.

The SOMLib system: from text collections via content-based organisation, to metaphor-graphics-based representation of document repositories facilitating intuitive browsing and exploration.

With the SOMLib digital library, we created a system providing content-based organisation of document repositories, facilitating intuitive browsing and exploration of the information space. It builds on and incorporates works in the fields of information retrieval, neural networks, information visualisation, and usability analysis.

Low-level features based on word frequencies are extracted from the text to provide a domain- and language-independent content representation of text documents in a high-dimensional vector space. The 'self-organising map' (SOM), a popular unsupervised neural network model, is further used to cluster the document feature vectors, performing a topology-preserving mapping of the documents from the high-dimensional vector space onto a two-dimensional map space. Documents are thus grouped according to their mutual similarity, having documents on similar topics mapped onto neighbouring locations on the map. Using the 'growing hierarchical self-organising map' (GHSOM), a novel extension of the SOM, we can further detect subject hierarchies in a document collection, with the neural network adapting its size and structure automatically during its unsupervised training process to reflect the topical hierarchy. Individual SOMs can further be integrated to form a network of referencing maps.
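The two steps described above, word-frequency features followed by SOM clustering, can be illustrated with a minimal sketch. It is not the SOMLib implementation: the function names (tf_vectors, best_matching_unit, train_som), the whitespace tokenisation, the plain term-frequency weighting, the 2x2 grid and the linearly decaying learning rate and neighbourhood radius are all assumptions chosen for brevity.

```python
import math
import random
from collections import Counter

def tf_vectors(docs):
    """Term-frequency vectors over a shared vocabulary, scaled to unit length.
    Whitespace tokenisation is an assumption made for this sketch."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    vectors = []
    for c in counts:
        v = [float(c[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vocab, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_matching_unit(weights, v):
    """Grid position of the unit whose weight vector is closest to v."""
    return min(weights, key=lambda pos: sq_dist(weights[pos], v))

def train_som(vectors, rows=2, cols=2, epochs=200, lr0=0.5, radius0=1.0, seed=0):
    """Train a small rectangular SOM; all hyperparameters are illustrative."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # decays linearly from 1 towards 0
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.01)   # keep the radius strictly positive
        for v in rng.sample(vectors, len(vectors)):
            bmu = best_matching_unit(weights, v)
            for pos, w in weights.items():
                grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2 * radius ** 2))  # Gaussian neighbourhood
                for i in range(dim):
                    w[i] += lr * h * (v[i] - w[i])
    return weights

if __name__ == "__main__":
    docs = [
        "court ruling law appeal court",
        "law court judge appeal",
        "guitar music concert band",
        "music band album concert",
    ]
    vocab, vectors = tf_vectors(docs)
    som = train_som(vectors)
    for doc, vec in zip(docs, vectors):
        print(best_matching_unit(som, vec), doc)
```

Each document is assigned to the grid position of its best-matching unit. Because neighbouring units are updated together during training, documents with similar word distributions tend to end up on the same or adjacent units, which is the topology-preserving property the text refers to.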

By mining the weight vector structure of the trained maps using the 'LabelSOM' technique, the system automatically extracts keywords describing the various topical clusters. This is based on the analysis of the feature distributions within each cluster. It helps users in identifying which topics are present in a given document collection and where they are located on the map.
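One way to picture this kind of labelling is sketched below. It is a simplified approximation, not the published LabelSOM procedure: for one map unit, it keeps the terms whose weight is above a threshold and ranks them by how little the documents mapped to that unit deviate from the unit's weight in that dimension. The function name label_unit and the parameters n_labels and min_weight are hypothetical.

```python
def label_unit(weight, mapped, vocab, n_labels=3, min_weight=0.1):
    """Pick up to n_labels terms that describe one map unit.

    weight : weight vector of the unit
    mapped : feature vectors of the documents mapped onto the unit
    vocab  : term for each vector dimension
    A term qualifies if its weight is at least min_weight (an assumed
    threshold); qualifying terms are ranked by per-feature deviation,
    lowest first, with larger weights breaking ties.
    """
    candidates = []
    for i, term in enumerate(vocab):
        if weight[i] < min_weight:
            continue  # term is effectively absent from this cluster
        deviation = sum((v[i] - weight[i]) ** 2 for v in mapped)
        candidates.append((deviation, -weight[i], term))
    return [term for _, _, term in sorted(candidates)[:n_labels]]

if __name__ == "__main__":
    vocab = ["court", "law", "music"]
    unit_weight = [0.7, 0.6, 0.0]
    docs_on_unit = [[0.7, 0.7, 0.0], [0.7, 0.5, 0.0]]
    print(label_unit(unit_weight, docs_on_unit, vocab, n_labels=2))
```

The threshold matters: a term that never occurs in a cluster also has a small deviation, so without it the unit would be labelled with words its documents share only by absence.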

Finally, the 'libViewer' provides an intuitive representation of the documents in a repository, using real-world metaphors such as different document types, spine widths and dust to convey metadata in an intuitively graspable way.

Using the 'SOMLib' system, users can browse a document collection in the form of bookshelves and find clusters of documents on similar topics located in neighbouring boxes, with the topic of each box being described by a set of automatically extracted keywords, and metadata being depicted in the form of different document representations. In combination with conventional approaches for searching and dynamically sorting text archives we thus have a powerful tool at our disposal, which allows us to obtain and maintain an overview of the amount and type of information available, and to detect relationships between different documents. This means we can better handle, interact with, and use the available information.

The 'SOMLib' system has been applied in numerous domains and in a variety of languages, for example to organise legal databases, newspaper archives, scientific document collections and Web search results. Recently, the principles of this work have been extended to digital music archives as part of the 'SOM-enhanced Jukebox' (SOMeJB) system. By analysing the frequency spectra of audio files and transforming them into time-invariant representations that incorporate psycho-acoustic models, the system facilitates organisation and exploration by musical genre.

Andreas Rauber is currently an ERCIM Research Fellow at INRIA.

Please contact:
Andreas Rauber, Vienna University of Technology
E-mail: rauber@ifs.tuwien.ac.at
http://www.ifs.tuwien.ac.at/~andi