DELOS KICK-OFF WORKSHOP

INRIA - Sophia Antipolis, 4-6 March 1996

Pre-Workshop Abstracts

Invited Speakers

The Development of Operational Digital Libraries: Clifford Lynch, University of California
Toward Digital Libraries: An Overview of The University of Michigan Digital Library Project: Bill Birmingham, University of Michigan
The Stanford Integrated Digital Library Project: Andreas Paepcke, Stanford University
The Networked Computer Science Technical Reports Library (NCSTRL): A framework for exploring technical and policy issues in a distributed digital library: Carl Lagoze, Cornell University

DELOS Speakers

Retroconversion of Library Catalogues and Multilingual Information Retrieval: Peter Schauble and Paraic Sheridan - ETH
On Modelling Multimedia Information Retrieval: Carlo Meghini - IEI-CNR
Digital Libraries Research at GMD-IPSI: Accessing Multimedia Documents by Knowledge Discovery Methods and Intelligent Retrieval: Reginald Ferber - GMD-IPSI
Designing Interactive IR combining Adaptation to User Tasks and Strategies with Non-topic Document Analysis: Preben Hansen - SICS/CRIT and Jussi Karlgren - SICS/HUMLE
Mathematical Problems from the Taxonomy of Mathematics: Michael Hazewinkel - CWI
A Framework for Modelling, Pricing and Charging in Digital Libraries: J. Sairamesh and C. Nikolaou - FORTH in cooperation with D. F. Ferguson and Y. Yemini
End-User Payment Systems in a Digital Library Environment: Albert Fischer and Karen Hunter - Elsevier Science
An SGML workbench for Digital Libraries: Jacques Ducloy - INRIA
Dienst architecture issues with respect to replication: Laszlo Kovacs, Andras Micsik - MTA SZTAKI
Transforming Conventional Library Systems into Digital Libraries: Ole Husby - BIBSYS/SINTEF
Digital Libraries and Related Research at RAL: Judy Lay - RAL-CLRC

Invited Speakers

The Development of Operational Digital Libraries

Clifford Lynch, University of California

My presentation will focus on efforts in the United States to move towards operational digital libraries. I will discuss the evolving view of the digital library and its relationship to traditional libraries on the one hand and networked information services on the other. I will survey some of the work in progress based in libraries, in research projects such as the ARPA/NASA/NSF funded efforts, in access to Federal government information, and in the commercial sector. As part of this survey I will also summarize some of the key research issues, including both technical questions and economic, social and legal problems such as intellectual property management and economic frameworks.

Toward Digital Libraries: An Overview of The University of Michigan Digital Library Project

Bill Birmingham, University of Michigan

The ubiquity and accessibility of large information networks, perhaps best exemplified by the World Wide Web (WWW), are radically changing the way in which we conceive of libraries. In a "digital library", a library without walls, a library patron can both access a broader range of information than is typically expected today and draw upon a vast array of services to process this information. The University of Michigan Digital Library Project (UMDL) is a research project concerned with the complex array of technical and socioeconomic issues implied by digital libraries. In this talk, I will give an overview of the UMDL project, with particular emphasis on our vision of digital libraries and the software systems needed to realize this vision. In particular, we see digital libraries as heterogeneous systems populated by a vast number of software agents that represent information goods and services. These agents collaborate opportunistically to perform tasks, forming an "engineered" economy.

The Stanford Integrated Digital Library Project

Andreas Paepcke, Stanford University

Our work is based on the premise that digital libraries will not just be online catalogs and collections, but will be made up of geographically widespread services that support users in their tasks. The Stanford project comprises five thrusts, each of which contributes to this vision: digital library infrastructure, user interfaces, economic issues, software agents and information retrieval technologies. Technologies supporting interoperability among collections and services are being developed throughout all of this work. We sketch highlights from these activities and then describe our InfoBus architecture, which is based on CORBA distributed object technology. We build proxy objects to provide interfaces to online services with different interaction models and access protocols. We sketch our information access protocol that takes advantage of this distributed object environment.
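
The proxy idea in this abstract can be illustrated with a minimal sketch. It is not the InfoBus code and does not use CORBA: it uses plain Python objects, and every class and method name below (SearchProxy, SessionService, StatelessService, search_all) is a hypothetical stand-in chosen for illustration. It shows only how two services with different interaction models might be hidden behind one common interface.

```python
from abc import ABC, abstractmethod
from typing import List


class SearchProxy(ABC):
    """Hypothetical uniform interface that every proxy exposes."""

    @abstractmethod
    def search(self, query: str) -> List[str]:
        ...


class SessionService:
    """Stand-in for a stateful service: open, submit, fetch, close."""

    def __init__(self, records):
        self._records = records
        self._hits = None  # None means "no session open"

    def open(self):
        self._hits = []

    def submit(self, query):
        if self._hits is None:
            raise RuntimeError("session not open")
        self._hits = [r for r in self._records if query.lower() in r.lower()]

    def fetch(self):
        return list(self._hits)

    def close(self):
        self._hits = None


class StatelessService:
    """Stand-in for a one-shot request/response service."""

    def __init__(self, records):
        self._records = records

    def get(self, params):
        q = params["q"].lower()
        return {"results": [r for r in self._records if q in r.lower()]}


class SessionProxy(SearchProxy):
    """Hides the open/submit/fetch/close dialogue behind search()."""

    def __init__(self, service):
        self._service = service

    def search(self, query):
        self._service.open()
        try:
            self._service.submit(query)
            return self._service.fetch()
        finally:
            self._service.close()


class StatelessProxy(SearchProxy):
    """Translates search() into a single parameterised request."""

    def __init__(self, service):
        self._service = service

    def search(self, query):
        return self._service.get({"q": query})["results"]


def search_all(proxies, query):
    """Send one query to every proxy and concatenate the results in order."""
    hits = []
    for proxy in proxies:
        hits.extend(proxy.search(query))
    return hits
```

A caller only ever sees search(), so a new service can be added by writing one more proxy class, without changing search_all.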

The Networked Computer Science Technical Reports Library (NCSTRL): A framework for exploring technical and policy issues in a distributed digital library

Carl Lagoze, Cornell University

The Networked Computer Science Technical Reports Library (NCSTRL) is an international collaboration with three major goals. First, NCSTRL makes available the technical reports from over 35 computer science departments and laboratories in North America and Europe. This collection is expected to grow with the endorsement of the Computing Research Association (CRA) in the U.S. and the participation of ERCIM in Europe. Second, the NCSTRL collection provides a framework for experimentation and demonstration of developing digital library technology. This technology is rapidly evolving as a result of research efforts such as the NSF/ARPA/NASA Digital Library Initiative in the U.S. and the ERCIM/DELOS project in Europe. Finally, NCSTRL is an opportunity for exploration of the complex policy issues in developing and managing a federated digital library. Protecting intellectual property rights and maintaining the quality of the collection and service are primary concerns. In this talk, we will review the current NCSTRL technology and its origins, describe the future technical direction of the project, and discuss some of the policy issues and mechanisms for exploring them.

DELOS Speakers

Retroconversion of Library Catalogues and Multilingual Information Retrieval

Peter Schauble and Paraic Sheridan - ETH

We will present an overview of two areas of ongoing research at ETH, directly related to digital libraries in Europe. We are presently co-operating with the Zentralbibliothek Zurich in the digitization of their 2.2-million-card catalogue. We have developed new techniques to allow effective retrieval on the full texts of the scanned cards, even though OCR achieves a word recognition rate of only 66%. These techniques have been integrated into our information retrieval system, SPIDER, and have been shown to increase retrieval performance by over 30% in experiments on a sample of digitized cards. We are also actively working in the area of multi-lingual information retrieval, allowing users to query a system in one language and retrieve documents in other languages. We have adapted proven information retrieval techniques (the use of corpus-based similarity thesauri for query expansion) to the multi-lingual problem, and we have recently demonstrated their effectiveness, again using the SPIDER retrieval system. In experiments over a collection of more than 90,000 Italian documents we have shown that the SPIDER system can retrieve Italian documents in response to German queries with *better* effectiveness than a baseline system retrieving Italian documents in response to Italian queries.
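
To make the idea of a corpus-based similarity thesaurus more concrete, here is a minimal sketch. It is not the SPIDER implementation. It assumes, for illustration only, a tiny corpus in which each "document" holds the German and Italian tokens of one aligned pair, raw term counts as weights, and cosine similarity between term vectors. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def term_vectors(docs):
    """Represent each term by the documents it occurs in (term -> {doc_id: count})."""
    vectors = defaultdict(Counter)
    for doc_id, tokens in enumerate(docs):
        for token in tokens:
            vectors[token][doc_id] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(weight * v[doc] for doc, weight in u.items() if doc in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand(query, vectors, candidates, k=2):
    """Return the k candidate terms most similar to the query taken as a whole."""
    scores = {
        cand: sum(cosine(vectors[q], vectors[cand]) for q in query if q in vectors)
        for cand in candidates
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Toy aligned corpus: German and Italian tokens of each pair share a document.
docs = [
    ["bibliothek", "katalog", "biblioteca", "catalogo"],
    ["bibliothek", "buch", "biblioteca", "libro"],
    ["wetter", "regen", "tempo", "pioggia"],
]
italian_terms = {"biblioteca", "catalogo", "libro", "tempo", "pioggia"}
vectors = term_vectors(docs)
print(expand(["bibliothek"], vectors, italian_terms, k=2))
```

With this toy data the German query term "bibliothek" is expanded to Italian terms that co-occur with it, which is the mechanism that lets a query in one language reach documents in another.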

On Modelling Multimedia Information Retrieval

Carlo Meghini - IEI-CNR

Sophisticated usage of multimedia document repositories, such as content-based retrieval, requires sophisticated modelling and exploitation of documents and user information needs. While there is widespread awareness of the enormous potential of multimedia document collections, there seems to be less commitment to tackling the modelling issue in a radical way. We argue in favour of the latter, proposing logic as a natural candidate for the modelling role, and show various applications of the logical modelling paradigm to areas of the vast multimedia world.

Digital Libraries Research at GMD-IPSI: Accessing Multimedia Documents by Knowledge Discovery Methods and Intelligent Retrieval

Reginald Ferber - GMD-IPSI

The Integrated Publication and Information Systems Institute (IPSI) within the German National Research Center for Information Science (GMD) is dedicated to basic research and prototypical developments in the areas of electronic publishing and information supply. A main focus of this research is the content based approach to publishing environments and retrieval systems.

To retrieve multimedia documents, a unified approach is chosen that integrates textual and nontextual parts of a document. Within this approach, advanced techniques of textual information retrieval can be applied to the textual parts and to possible textual annotations of nontextual parts. These include the use of the documents' structure and of knowledge extracted from large (domain-specific) corpora. For the nontextual part, classification methods from knowledge discovery and statistics can be applied. To enhance query expansion on the conceptual level, we employ a retrieval engine based on abductive reasoning.

Designing Interactive IR combining Adaptation to User Tasks and Strategies with Non-topic Document Analysis

Preben Hansen - SICS/CRIT and Jussi Karlgren - SICS/HUMLE

Research in information retrieval and document analysis has traditionally concentrated on building general, task-independent representations of the content of documents. Throughout the history of information retrieval, however, the research community has been aware that the interaction between information-seeking users and the tools used to access information sources is important in itself. Information can be sought for various reasons and with various ideas of how to determine which documents are relevant. This research plan outlines a framework that rests on two observations: a) more knowledge can be found from texts and their users than a shallow approximation of text topic - texts have, besides content, STYLE and ECOLOGY, both of which can be automatically identified from a text base and its usage statistics and used for text categorization - and b) users have information seeking STRATEGIES that can be recognized through user studies and supported through interface design. Finding ways to describe and evaluate the problems of search behavior and of browsing/navigation through a hypermedia/hypertext system is important. In conclusion, we find that beyond the technical design challenges of Digital Libraries and other information retrieval systems, there is a need to address other aspects, e.g. non-topical text analysis, information seeking strategies, user interface design, and user tasks and navigation. We will build information retrieval tools which will support high level information seeking strategies in different ways. We will use techniques from previous SICS projects on adaptive hypermedia where an information system adapts to its perception of user task and background, and displays information of different type and quantity accordingly. In this case, the perceived background and task of the user will not change the information itself as in the case of adaptive hypermedia, but primarily the tool setup and default tool parameter settings offered to the user. Tools for the user will include not only standard tools for content search, but also tools for genre or text style identification and social filtering.

Mathematical Problems from the Taxonomy of Mathematics

Michael Hazewinkel - CWI

The AM department of CWI has a number of projects aimed at the creation of tools to find information in large bodies of scientific papers. In this talk I will try to present three of them and to say something on the mathematical (and linguistic) problems that arise from them.

As a starting point let's take the basic data as available in the ZMG database (STN/FIZ Karlsruhe). This is the database behind the Zentralblatt fuer Mathematik und Grenzgebiete, one of the two basic abstracting journals in mathematics. The basic data are: a collection of some 700,000 articles in the form of abstracts, classification data and key words/phrases. Thus we have a large bipartite graph between a collection of documents and a collection of terms (the 3.5M key phrases), which tells us which key phrases occur in which documents.

The project OTIS looks at this collection of key phrases and aims to develop tools to generate from it an adequate thesaurus for mathematics and other areas and to match these thesauri with each other. The project BUC'M1 concentrates on the bipartite graph just mentioned and is aimed at the problem of optimally transferring taxonomic information from the collection of documents to the collection of key phrases. The project BUC'M2 aims to use the bipartite graph to generate an additional classification scheme for mathematics. (The acronym BUC'M stands for 'Bottom Up Classification in Mathematics'.)
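
As an illustration of what transferring taxonomic information across the bipartite graph could look like, here is a minimal sketch. It is not the BUC'M1 method, which the abstract does not describe: it assumes a simple majority vote in which each key phrase inherits the classification codes of the documents it occurs in. The function names, document identifiers and class codes ("ALG", "TOP") are hypothetical.

```python
from collections import Counter, defaultdict


def phrase_class_votes(doc_phrases, doc_classes):
    """For each key phrase, count the class codes of the documents containing it."""
    votes = defaultdict(Counter)
    for doc, phrases in doc_phrases.items():
        for phrase in set(phrases):  # count each document once per phrase
            for code in doc_classes.get(doc, ()):
                votes[phrase][code] += 1
    return votes


def best_class(votes, phrase):
    """Return the most frequent class code for a phrase, or None if unseen."""
    counts = votes.get(phrase)
    if not counts:
        return None
    # Most votes first; ties are broken alphabetically for determinism.
    return min(counts, key=lambda code: (-counts[code], code))


# Toy bipartite graph: documents on one side, key phrases on the other.
doc_phrases = {
    "d1": ["hopf algebra", "quantum group"],
    "d2": ["hopf algebra", "coalgebra"],
    "d3": ["quantum group", "knot invariant"],
}
doc_classes = {"d1": ["ALG"], "d2": ["ALG"], "d3": ["TOP"]}

votes = phrase_class_votes(doc_phrases, doc_classes)
print(best_class(votes, "hopf algebra"))
print(best_class(votes, "knot invariant"))
```

A majority vote ignores how informative a phrase is, so any realistic approach would presumably weight phrases and documents; the sketch only shows the direction of the transfer, from documents to key phrases.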

A Framework for Modelling, Pricing and Charging in Digital Libraries

J. Sairamesh and C. Nikolaou - FORTH in cooperation with D. F. Ferguson and Y. Yemini

Digital Libraries will have a major influence on the design of future information systems. They will set the stage for future complex information technologies to evolve and provide "transparent" services to a variety of users. We consider commercial Digital Libraries as information economies consisting of several players: authors and publishers who create and sell their collections, suppliers (e.g. computer systems) who provide information storage, indexing and access services, information-agents who provide searching and presentation services, and users who request services.

In such an economic framework, one can envision suppliers and information-agents competing to provide services for information storage, searching, access and presentation. In providing such services, several issues arise, among them pricing and the Quality of Service (QoS) offered for accessing and viewing information objects. These issues play an important role in allocating resources such as processing time, network bandwidth and buffers, memory, cache and network I/O. Using this framework, we present the interactions among the players, service models, pricing and charging/billing mechanisms (QoS based), and corresponding implementation issues in large digital libraries.

End-User Payment Systems in a Digital Library Environment

Albert Fischer and Karen Hunter - Elsevier Science

As distribution becomes easier in an electronic environment, end-users will need less effort to access information from publishers as compared to the traditional library system. As a result, end-users may subscribe on a personal basis to materials which are not part of the core electronic collection of the library. At present, a wide variety of pricing models are under development for a digital environment, mostly geared towards the library. Additionally, one begins to think of direct end-user payment according to some electronically mediated system, with possibly a neutral agent in the middle between buyers and sellers. It is conceivable that a proper direct end-user payment system integrated in the digital library will facilitate the transition of academic information distribution to the electronic environment. The aim of the presentation is to discuss some possible schemes, and to review their pros and cons.

An SGML workbench for Digital Libraries

Jacques Ducloy - INRIA

We present DILIB (Document and Information LIBrary), a workbench for Scientific or Technical Information and Document Engineering, and some of its applications. This workbench has been designed in order to make investigations on heterogeneous sets of data or to build various Information Retrieval Applications. It contains information (bibliographic records, specimens of data coded with various formats), and tools (library of functions, interfaces with software products) and uses SGML for coding information or designing tool interfaces. Its kernel is a toolkit whose basic part consists of an SGML tree handling library. Another part contains a set of components for building Information Retrieval Systems or Applications. A main target of using DILIB is now prototyping or developing applications for Digital Libraries.

Dienst architecture issues with respect to replication

Laszlo Kovacs, Andras Micsik - MTA SZTAKI

Cooperative and interoperability aspects of distributed digital document libraries are discussed. The commitment of ERCIM institutions to the Dienst protocol has created the need to develop a general, detailed architecture for Dienst. A new Dienst architecture augmented by a replication service is suggested. Replication is described in detail.

Transforming Conventional Library Systems into Digital Libraries

Ole Husby - BIBSYS/SINTEF

BIBSYS is a shared library system for Norwegian university libraries, other academic institutions and the National Library. Traditional retrieval services are centered around the catalogue. Our web-gateway was launched in early 1994, as one of the very first of this type. Other access methods include Z39.50 operation.

The presentation will focus on ongoing activities in the following fields:

Interconnecting with other systems, including bibliographic services, fulltext systems and multimedia document stores.
Delivering the primary document to the user.
Improved access to the journal articles held in the libraries, with links from article databases on CD-ROM or accessible online.

BIBSYS is a partner in the ONE, MECANO and UNIVERSE projects of the EU Library Programme.

Digital Libraries and Related Research at RAL

Judy Lay - RAL-CLRC

In the mid-1980s we implemented an in-house Library system which involved integrating a free text retrieval system (STATUS, developed/marketed by STATUS IQ, England) for the library catalogue and a relational database system (IBM's SQL/DS) for circulation control, i.e. library loans, recalls, etc. In the early 1990s we integrated IBM's Office Vision (formerly known as PROFS) and STATUS to produce a new product - PROFOUND. Documents in PROFS could be retrieved on words in the title or on key words. PROFOUND provided a means of indexing all words in a document and a query interface for users. This was more a document management / retrieval system than a digital library as such, although many of the features are common.

We started a project in the late 1980s with an institute from each of the G7 countries to exchange information on funded research projects. A subset of information about each project was stored locally at each institute using the locally available information retrieval system. Protocols were developed for sending remote queries to the other institutes for more detailed information about selected projects, performing the query, and returning the results of the query to the user. The protocols controlled and monitored the remote queries across the networks. This was the earliest demonstration of a heterogeneous distributed database system working and is clearly a technology of interest for heterogeneous library systems (i.e. including ones not using the same server technology).

We are involved in WWW activities, including the World Wide Web Consortium and the ERCIM WWW Working Group, and we use web technology to disseminate information (notices, information bulletins) and to provide access to the internal financial system, staff directories, etc. for CCLRC staff. Clearly, the WWW extensions we are working on for richer types, query, optimised performance, caching, security and so on are all relevant.

RAL has acquired Dienst and we are setting up the server. We are new to Digital Libraries but our experience from other projects will enable us to participate in and contribute to the DELOS and SAMOS projects.