by Paolo Ciaccia, Paolo Tiberio and Pavel Zezula
The signature file access method is now recognized as a suitable storage structure for handling heterogeneous data sources. A joint project between the "Centro di Studio per l'Interazione Operatore-Calcolatore" (CIOC-CNR), Bologna, and Brno University, Czech Republic, aims at high-performance implementations that exploit I/O parallelism. Several parallel signature file organizations have been studied.
Because of their simplicity, search power, low storage overhead, and ability to deal with many kinds of data, signature files have become a basic retrieval technique for a large variety of database and information retrieval applications. Signatures are binary abstractions of objects obtained by means of hash functions applied to terms characterizing the objects, such as attribute values, significant words in a text, features of an image, etc. For a given query, i.e. a conjunction of terms, the search is performed on the signature file in order to reduce the I/O cost of accessing the data objects in the file.
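The signature scheme described above can be illustrated with a minimal sketch of superimposed coding. The signature width, the number of bits set per term, and all function names below are illustrative assumptions, not the project's actual parameters:

```python
import hashlib

F = 64       # signature width in bits (illustrative choice)
M_BITS = 3   # bits set per term (illustrative choice)

def term_signature(term: str) -> int:
    """Hash a term to a word with at most M_BITS bits set."""
    sig = 0
    for i in range(M_BITS):
        h = hashlib.md5(f"{term}:{i}".encode()).digest()
        bit = int.from_bytes(h[:4], "big") % F
        sig |= 1 << bit
    return sig

def object_signature(terms) -> int:
    """Superimpose (OR together) the signatures of all terms
    characterizing an object."""
    sig = 0
    for t in terms:
        sig |= term_signature(t)
    return sig

def may_match(obj_sig: int, query_sig: int) -> bool:
    """An object signature qualifies if it contains every query bit.
    This is a filter: false drops are possible, false dismissals are not."""
    return obj_sig & query_sig == query_sig
```

Only objects whose signatures pass `may_match` need to be fetched and checked against the actual query, which is how the signature file reduces I/O on the data file.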
For large databases, a sequential search of the signature file can be expensive. We have thus designed a hash-based dynamic organization, called Quick Filter (QF), that uses fixed-size buckets (i.e. disk blocks) and stores together signatures sharing a common suffix, called the signature key. Employing buckets of fixed size means that the number of buckets and the size of the signature key change as the signature file grows or shrinks. QF does not use directories to address buckets, and the percentage of saved disk accesses grows with the file size.
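Directoryless suffix-based addressing can be sketched in the style of linear hashing; the split-pointer rule below is an assumption about the mechanics, not a description of QF's exact algorithm:

```python
def bucket_address(sig: int, level: int, split_ptr: int) -> int:
    """Address a bucket by the signature's low-order suffix bits
    (the signature key), with no directory.

    level:     current number of suffix bits in use
    split_ptr: address of the next bucket to be split; buckets below
               it have already been split and use one extra bit
    """
    key = sig & ((1 << level) - 1)            # low-order `level` bits
    if key < split_ptr:                       # already-split bucket:
        key = sig & ((1 << (level + 1)) - 1)  # use one more suffix bit
    return key
```

Because the address is computed directly from the signature itself, the file can grow one bucket at a time without any directory structure, which matches the dynamic behavior described above.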
Additional major improvements to the signature file performance are only possible with parallel I/O processing. When D disks are available, the response time could be reduced by a factor of D, provided disks can be accessed in parallel and an appropriate strategy for signature allocation is adopted.
Hamming Filter (HF) is a dynamic key-based organization for multiple disks. It uses linear error-correcting codes for allocating signatures to disks and manages the signatures stored on each disk by means of the QF organization. Specifically, dissimilar signatures are placed on the same disk, which has the desirable effect that many disks, very often all of them, are searched in parallel to process a query. At the same time, the QF approach clusters similar signatures in disk buckets, so the total number of disk accesses is minimized. For specific file sizes, the organization is able to guarantee optimum performance for a very high percentage of queries, provided the right code is used and the number of disks is a power of 2.
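One way to realize "dissimilar signatures on the same disk" with a linear code is to allocate each signature key to the disk indexed by its syndrome: keys in the same coset of the code differ by a codeword, hence by at least the code's minimum distance. The sketch below uses the classical [7,4] Hamming code (7-bit keys, 8 disks) purely as an illustration of this principle; HF's actual codes and key lengths depend on the file size:

```python
# Parity-check matrix of the [7,4] Hamming code:
# column j is the binary representation of j + 1.
H = [
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
]

def disk_of(key: int) -> int:
    """Map a 7-bit signature key to one of 8 disks via its syndrome.
    Keys assigned to the same disk lie in the same coset, so they
    differ in at least 3 bit positions (the code's minimum distance),
    i.e. similar keys end up on different disks."""
    bits = [(key >> (6 - j)) & 1 for j in range(7)]
    syndrome = 0
    for row in H:
        s = sum(r * b for r, b in zip(row, bits)) % 2
        syndrome = (syndrome << 1) | s
    return syndrome
```

With this mapping, any two keys at Hamming distance 1 or 2 land on different disks, so the signatures qualifying for a query tend to spread over all disks and can be retrieved in parallel.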
Performance of HF can deteriorate in the case of highly dynamic files, when a mismatch occurs between the chosen code parameters and the current file size. Since the ability to deal efficiently with dynamic files is critical for many application environments (e.g. office information systems, news servers, etc.), we have defined a new organization, called Hamming+ Filter (H+F), that uses a sequence of error-correcting codes and supports a smooth migration of signatures between disks. When a bucket splits due to file growth, some of its signatures are stored in a new bucket, which is allocated to a different disk (see figure). This on-line management policy guarantees high performance levels, regardless of file size, and does not affect file availability, because no global reorganization is required. Theoretical analysis has identified conditions for optimal response time and characterized both expected and worst-case behavior. A prototype implementation confirmed the validity of the approach.
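The split-and-migrate step can be sketched as follows. The split rule (partition on the next suffix bit) and the `allocate_disk` hook are hypothetical placeholders for H+F's actual policy:

```python
def split_bucket(bucket, level, allocate_disk):
    """Split an overflowing bucket whose signature key is `level` bits
    long: signatures whose bit `level` is 0 stay in place, the others
    migrate to a new bucket. The new bucket is placed on the disk
    chosen by `allocate_disk` (a hypothetical allocation hook, e.g.
    the code-based mapping of the current file size)."""
    stay, move = [], []
    for sig in bucket:
        (move if (sig >> level) & 1 else stay).append(sig)
    new_disk = allocate_disk(move)
    return stay, move, new_disk
```

Only the signatures of the split bucket move, which is why the file stays available and no global reorganization is ever needed.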
At present, our research, which has also led to the implementation of H+F on a cluster of Unix workstations, includes the study of new parallel signature file organizations that can fit specific application requirements. In addition, we are applying the basic ideas underlying H+F to the parallelization of other data structures, in particular spatial (multi-dimensional) ones.