Bioinformatics in the Fast Lane
by Finn Drabløs and Ståle Fjeldstad
By using special-purpose search processors, a standard PC can be accelerated to perform complex pattern-matching at 10 teraoperations per second. This platform is used by the Norwegian University of Science and Technology (NTNU) and Interagon AS for biomedical research.
The flow of data generated by genome-related research is growing exponentially, and this trend is likely to continue. Full genome sequencing is becoming a standard approach even for large genomes. Several research groups are developing novel sequencing techniques, trying to bring the total sequencing cost for a human genome under US$1000. This will open up new opportunities for personalized medicine, with treatment being tailored to specific genetic profiles. Other techniques are also contributing to this data flow, particularly from areas such as the analysis of gene products by proteomics, microarray analysis of gene expression patterns and the mapping of genetic variations, as in Single Nucleotide Polymorphisms (SNPs). At the same time, increasingly complex approaches are being developed to extract essential information from these data. More computing power is therefore needed for such large-scale analysis of genome data.
The Interagon Pattern Matching Chip (PMC) is a special-purpose search processor, capable of searching for complex approximate patterns in arbitrary data. The architecture is massively parallel, making it possible to search the same data stream with a large number of queries simultaneously. Query features include regular expressions, with additional functionality such as proximity, adjacency and order conditions, as well as alphanumerical comparisons and approximate matching at both the character and expression levels.
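To make the idea of running many approximate queries over one data stream concrete, here is a minimal software sketch in Python. It is not the PMC query language or its hardware algorithm: the function names, the use of Hamming distance as the mismatch measure, and the toy sequence are all assumptions made for illustration. The PMC evaluates its queries in parallel, whereas this loop handles them one after another.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def search_stream(data, queries, max_mismatches=1):
    """Return {query: [start positions]} for approximate hits in data.

    A window counts as a hit when it differs from the query in at most
    max_mismatches characters (hypothetical matching rule).
    """
    hits = {q: [] for q in queries}
    for q in queries:
        for i in range(len(data) - len(q) + 1):
            if hamming(data[i:i + len(q)], q) <= max_mismatches:
                hits[q].append(i)
    return hits


# Toy data stream and two queries, searched with one mismatch allowed.
data = "ACGTACGAACGT"
result = search_stream(data, ["ACGT", "GAAC"], max_mismatches=1)
print(result)
```

The software version does work proportional to the number of queries times the length of the data, which is the cost that a parallel search processor is meant to absorb.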
Sixteen PMC chips, each with its own dedicated memory, are mounted on a PCI-compliant plug-in card. Up to six cards can be inserted in a standard workstation, turning ordinary PCs into high-performance search tools. This enables a single standard PC to perform close to 10 teraoperations (10^13 operations) per second for pattern-matching purposes. Our in-house PMC-equipped Linux cluster is capable of 80 teraoperations per second.
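The figures above can be combined in a short back-of-the-envelope calculation. The sketch below treats "close to 10 teraoperations" as an upper bound and assumes a fully populated PC with six cards. The per-chip rate and the number of equivalent PCs in the cluster are derived here for illustration and are not stated figures.

```python
CHIPS_PER_CARD = 16
MAX_CARDS_PER_PC = 6
PC_OPS_PER_SEC = 10e12       # "close to 10 teraoperations", taken as an upper bound
CLUSTER_OPS_PER_SEC = 80e12  # quoted capacity of the in-house cluster

# Chips in one fully populated PC.
chips_per_pc = CHIPS_PER_CARD * MAX_CARDS_PER_PC

# Approximate upper bound on the operations per second of a single chip.
ops_per_chip = PC_OPS_PER_SEC / chips_per_pc

# Number of fully populated PCs that would give the cluster figure
# (assumption: the actual node configuration is not described).
equivalent_pcs = CLUSTER_OPS_PER_SEC / PC_OPS_PER_SEC

print(chips_per_pc, f"{ops_per_chip:.2e}", equivalent_pcs)
```

Under these assumptions a fully equipped PC holds 96 chips, each contributing on the order of 10^11 operations per second.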
Novel software tools have been developed in order to make use of this immense computational power for biomedical research. The main software component is based on evolutionary algorithms, where Darwinian principles are used to identify essential information in large and complex data sets. Important examples are SNP analysis and siRNA design.
An SNP is a genetic variation at a single position (nucleotide) in the genome. This variation may affect gene regulation or gene product properties. The total effect of a large number of variations makes up our genetic 'personality', that is, our genetic disposition to cancer, strokes, adverse drug effects or a long life. Genetic variation alone does not determine an individual's medical history, but it has an influence on their risk of being affected by diseases for which there is a genetic component. However, the correlation between genetic variation and disease risk is complex and difficult to identify. Large data sets with genetic data from both patients and non-affected controls are needed in order to identify significant correlations. We have used PMC technology and evolutionary algorithms on data sets containing several hundred SNP candidates, in order to identify SNP subsets that are associated with specific clinical outcomes. This is an important contribution towards personalized medicine and a better understanding of complex diseases.
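As a rough illustration of how an evolutionary algorithm can search for SNP subsets associated with an outcome, consider the following Python sketch. It is not the method used in the work described here: the synthetic data, the fitness rule ("affected if any SNP in the subset carries the variant"), the mutation operator and all function names are assumptions chosen to keep the example small.

```python
import random


def fitness(subset, genotypes, outcomes):
    """Fraction of individuals predicted correctly by the hypothetical rule
    'affected if any SNP in the subset carries the variant allele'."""
    correct = 0
    for row, affected in zip(genotypes, outcomes):
        predicted = any(row[i] for i in subset)
        correct += (predicted == affected)
    return correct / len(outcomes)


def mutate(subset, n_snps, rng):
    """Swap one SNP in the subset for a SNP not currently in it."""
    child = set(subset)
    child.remove(rng.choice(sorted(child)))
    child.add(rng.choice([i for i in range(n_snps) if i not in child]))
    return frozenset(child)


def evolve(genotypes, outcomes, subset_size=2, pop_size=20,
           generations=30, seed=0):
    """Evolve SNP subsets: keep the fitter half, refill by mutation."""
    rng = random.Random(seed)
    n_snps = len(genotypes[0])

    def score(subset):
        return fitness(subset, genotypes, outcomes)

    population = [frozenset(rng.sample(range(n_snps), subset_size))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        survivors = population[:pop_size // 2]
        offspring = [mutate(rng.choice(survivors), n_snps, rng)
                     for _ in survivors]
        population = survivors + offspring
    return max(population, key=score)


# Synthetic data: 200 individuals, 12 SNPs, outcome driven by SNPs 3 and 7.
data_rng = random.Random(1)
genotypes = [[data_rng.randint(0, 1) for _ in range(12)] for _ in range(200)]
outcomes = [bool(row[3] or row[7]) for row in genotypes]

best = evolve(genotypes, outcomes)
print(sorted(best), fitness(best, genotypes, outcomes))
```

In this toy setting the fitness function is cheap. With several hundred SNP candidates and large patient and control groups, scoring each candidate subset against the data becomes the dominant cost, which is where fast pattern matching would be expected to help.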
Small interfering RNA (siRNA) is the basis of a novel technique for the selective blocking of gene expression. It employs a natural defence mechanism against foreign RNA, where short (21-23 nucleotides) double-stranded RNA is incorporated into a silencing complex. This complex then cleaves messenger RNA (mRNA) that is complementary to the short RNA fragments. Since mRNA is the intermediate in the synthesis of protein from genomic information, siRNA represents a flexible mechanism for selective silencing of specific genes. This can be used in research, but also has potential as a gene-based therapy. However, it is a prerequisite that the siRNA is designed with high efficacy and specificity, enabling the targeted gene to be selectively knocked down without side effects from non-selective binding to mRNAs from other genes. PMC technology and genetic programming have been used to predict the efficacy of existing siRNA designs, and this analysis has shown that many existing designs may knock down more than one gene. The same approach has also been used to design improved siRNAs.
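The specificity problem can be sketched as an approximate search: a guide strand may silence any mRNA that contains a site nearly complementary to it. The Python example below is a simplified illustration under stated assumptions, not the efficacy or specificity model used in this work. The gene names, the shortened 9-nucleotide guide, the toy transcripts and the rule "at most one mismatch counts as a potential off-target" are all hypothetical.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Sequence that pairs with the given RNA strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def off_target_genes(guide, transcripts, intended, max_mismatches=1):
    """List genes other than the intended one whose mRNA contains a site
    within max_mismatches of perfect complementarity to the guide."""
    site = reverse_complement(guide)
    hits = []
    for gene, mrna in transcripts.items():
        if gene == intended:
            continue
        windows = (mrna[i:i + len(site)]
                   for i in range(len(mrna) - len(site) + 1))
        if any(mismatches(w, site) <= max_mismatches for w in windows):
            hits.append(gene)
    return hits


# Toy example: real guides are 21-23 nucleotides; 9 are used for brevity.
guide = "UUCGAAGUA"
transcripts = {
    "geneA": "GGUACUUCGAACC",     # intended target, perfect site
    "geneB": "AAUACUUGGAAUU",     # near-complementary site, one mismatch
    "geneC": "CCCCGGGGAAAACCCC",  # no similar site
}
off_targets = off_target_genes(guide, transcripts, intended="geneA")
print(off_targets)
```

Here the guide designed against geneA would also be flagged as a possible silencer of geneB. Screening every candidate design against all known transcripts in this way is a large approximate-matching workload.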
Interagon and the NTNU bioinformatics laboratory are a part of FUGE, a national initiative for functional genomics in Norway. FUGE is coordinated by the Norwegian Research Council. The initiative includes, in addition to bioinformatics, laboratories for proteomics, structural biology, microarray work, biobanks, SNP analysis and molecular imaging.
Finn Drabløs, NTNU
Ståle Fjeldstad, Interagon AS