PROTAL - Protein Structure Prediction and Accurate Alignment of Protein Sequences and Structures

ERCIM News No.32 - January 1998

by Heinz Mevissen, Joachim Selbig, Ralf Thiele, Ralf Zimmer and Thomas Lengauer

The PROTAL project develops computer-aided, theoretical methods for predicting the structure and function of given protein sequences. The three-dimensional structure of a protein determines its biochemical function within the organism's complex network of interactions and regulation. Theoretical prediction methods aim to shortcut labour-intensive, time-consuming and expensive experimental structure determination, and so to narrow the growing gap between the large number of proteins sequenced in genome and large-scale sequencing projects and the number of known protein structures. Predictions and models of the binding and active sites of particular proteins are of special interest to the biotechnology and pharmaceutical industries.

The PROTAL project aims to develop scoring schemes and methods that support the construction of structural models for protein sequences based on the homology modeling paradigm, ie inferring partial structures from similarities with experimentally determined protein structures. The scoring schemes and algorithmic methods are integrated in the software package ToPLign (TOolbox for Protein aLIGNment).

The PROTAL project develops new hypotheses and models for the factors that determine protein structure, which are essential as scoring schemes for prediction methods. In particular, new empirical potentials for amino acid interactions are derived from databases of experimentally determined structures. To model such interactions appropriately, we developed a new contact definition based on Voronoi tessellations of protein structures. The potentials are used both to identify appropriate targets and to compute structurally compatible alignments.
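One common way to derive an empirical pair potential from a structure database is the inverse Boltzmann approach, which compares observed contact counts with the counts expected if residue types paired at random. The Python sketch below illustrates that general idea only; it is not the PROTAL potential. The function name `contact_potential`, the pseudocount and the input format (a list of residue pairs already found to be in contact, eg by a Voronoi-based contact definition) are assumptions made for illustration.

```python
import math
from collections import Counter

def contact_potential(contacts, pseudocount=1.0):
    """Illustrative inverse-Boltzmann pair potential (hypothetical helper).

    contacts: iterable of (residue_a, residue_b) pairs observed in contact.
    Returns a dict mapping a sorted residue-type pair to -ln(observed/expected).
    Negative values mean the pair is seen more often than random pairing predicts.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    for a, b in contacts:
        pair_counts[tuple(sorted((a, b)))] += 1
        residue_counts[a] += 1
        residue_counts[b] += 1

    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    types = sorted(residue_counts)

    potential = {}
    for i, a in enumerate(types):
        for b in types[i:]:
            fa = residue_counts[a] / total_residues
            fb = residue_counts[b] / total_residues
            # Unordered pairs of two different types can occur in two orders.
            expected_fraction = fa * fb if a == b else 2 * fa * fb
            # The pseudocount (an assumption) avoids log(0) for unseen pairs.
            observed = pair_counts[(a, b)] + pseudocount
            expected = expected_fraction * total_pairs + pseudocount
            potential[(a, b)] = -math.log(observed / expected)
    return potential

# Toy data: like residues contact each other more often than unlike ones.
toy_contacts = [("L", "L")] * 8 + [("K", "K")] * 8 + [("L", "K")] * 2
toy_potential = contact_potential(toy_contacts)
```

In this toy data set the L-L pair receives a negative (favourable) value and the K-L pair a positive one, because like pairs are over-represented relative to random pairing.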

Protein fold recognition allows the rapid identification of evolutionarily, structurally or functionally related proteins of known structure, so-called targets. Accurate homology modeling of such proteins requires high-quality alignments of the sequences in question to the identified target structures. Refined alignments specify the coordinates of identically matched amino acids and determine which amino acid side chains have to be placed and which backbone loop regions have to be modeled.

Significant similarity between sequences of unknown structure and known protein families is demonstrated with new and extended methods for visualizing alignments and assessing their reliability, for tree-based and clustering-based multiple profile alignments, and for computing all compatible phylogenetic trees together with their associated multiple alignments. New threading approaches are developed to show sequence-structure compatibility. The 123D threading method computes optimal alignments with respect to so-called contact capacity potentials (CCP), which provide a detailed measure of hydrophobicity and contact environments. The RDP (recursive dynamic programming) threading algorithm is used to refine the computed alignments and structural models. It concentrates on the most important regions of the protein and optimizes the mapping recursively: subalignments are extended optimally by dynamic programming with respect to a full pair interaction potential of the parts already mapped.
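The dynamic programming step at the core of such sequence-structure alignments can be illustrated with a small example. The following Python sketch is not the 123D or RDP implementation: it is a plain global alignment of a sequence against a list of structural environments, in which a made-up two-state environment ("B" for buried, "E" for exposed) and a toy scoring function stand in for contact capacity potentials. The names `thread` and `toy_score`, the residue classes and the linear gap penalty are all assumptions.

```python
def toy_score(residue, environment):
    """Hypothetical stand-in for a contact capacity potential.

    Rewards hydrophobic residues in buried ("B") positions and polar
    residues in exposed ("E") positions; penalizes the opposite.
    """
    hydrophobic = residue in "AVLIMFWC"
    buried = environment == "B"
    return 1.0 if hydrophobic == buried else -1.0

def thread(sequence, environments, score=toy_score, gap=-1.0):
    """Global alignment of a sequence to structure positions (higher is better).

    Returns (best_score, pairs), where pairs lists the matched
    (sequence_index, structure_index) positions.
    """
    n, m = len(sequence), len(environments)
    # dp[i][j]: best score aligning sequence[:i] with environments[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(sequence[i - 1], environments[j - 1]),
                dp[i - 1][j] + gap,   # residue i left unaligned
                dp[i][j - 1] + gap,   # structure position j left unaligned
            )

    # Trace back one optimal path, preferring matches over gaps on ties.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + score(sequence[i - 1], environments[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs

best, mapping = thread("LL", ["B", "E", "B"])
```

In this example the two leucines are mapped to the two buried positions and the exposed position in between is skipped, which costs one gap penalty but avoids an unfavourable match. A real threading potential would also score pair interactions between mapped residues, which is what makes the full problem harder than this position-wise sketch.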

For all alignment and threading methods, the inherent inaccuracy of biological data has to be taken into account, both in the methods themselves and in their statistical validation. To analyse how alignments depend on the parameters of the scoring scheme, we use a new parametric optimization method which is able to compute, for a given algorithm and type of scoring function, all distinct optimal alignments over the whole range of parameter settings. This allows competing methods to be evaluated systematically and gives insight into the reliability of computed alignments.
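The dependence of an optimal alignment on a scoring parameter can be made visible with a simple experiment. The Python sketch below is a coarse approximation of the idea, not the parametric optimization method itself: instead of computing the exact parameter regions, it samples the gap penalty on a grid and groups consecutive values that yield the same optimal alignment. The functions `align` and `sweep_gap_penalty`, the match and mismatch scores, and the tie-breaking rule are assumptions.

```python
def align(a, b, gap, match=1.0, mismatch=-1.0):
    """Global alignment of two strings; returns the matched index pairs.

    gap is the (negative) score of one gap position. Ties are broken in a
    fixed order (diagonal, then up, then left), so one of possibly several
    co-optimal alignments is reported.
    """
    n, m = len(a), len(b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + sub(i, j),
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)

    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + sub(i, j):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return tuple(reversed(pairs))

def sweep_gap_penalty(a, b, penalties):
    """Group consecutive gap penalties that give the same optimal alignment.

    penalties: increasing positive values; each is applied as score -penalty.
    Returns a list of (alignment, [penalties]) regions.
    """
    regions = []
    for penalty in penalties:
        alignment = align(a, b, gap=-penalty)
        if regions and regions[-1][0] == alignment:
            regions[-1][1].append(penalty)
        else:
            regions.append((alignment, [penalty]))
    return regions

regions = sweep_gap_penalty("AB", "BA", [0.5, 1.0, 2.0, 3.0])
```

For the sequences "AB" and "BA" the sweep finds two regions: with cheap gaps (0.5 and 1.0) a single matching pair flanked by gaps is optimal, while with expensive gaps (2.0 and 3.0) the gap-free alignment with two mismatches wins. A grid can miss narrow regions between sampled values, which is the limitation an exact parametric method avoids.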

The Figure shows our successful ToPLign prediction of the structure of the active site of the thymidine kinase of the herpes virus, which is important for the development of inhibitors. The sequence was kindly provided by BASF AG. The methods developed have also been applied successfully in an international competition to predict the structures of proteins that are close to being solved experimentally.

A prototype of the alignment tool ToPLign, including the fast threading method 123D and the RDP method, has been distributed since mid-1994 and has been updated since then. The software is also available via the World Wide Web and is mirrored at the National Cancer Institute (NCI/NIH) in Frederick, MD, USA. It is in use at various university institutes and some US biotechnology companies.

Please contact:

Heinz Mevissen - GMD
Tel: +49 2241 14 2784
E-mail: heinz-theodor.mevissen@gmd.de

Joachim Selbig - GMD
Tel: +49 2241 14 2792
E-mail: joachim.selbig@gmd.de

Ralf Thiele - GMD
Tel: +49 2241 14 2302
E-mail: ralf.thiele@gmd.de

Ralf Zimmer - GMD
Tel: +49 2241 14 2818
E-mail: ralf.zimmer@gmd.de

Thomas Lengauer - GMD
Tel: +49 2241 14 2777
E-mail: thomas.lengauer@gmd.de