Knowledge discovery in biology has profited enormously the past few years from technological improvements, especially in large-scale sequencing projects. This development, in turn, created new challenges for computational tools for discovery to exhaustively and efficiently analyze and interpret greater amounts of raw data. Two new major issues of the molecular biological community are:
These challenges prompted many new developments, for example, integrated data warehouses and distributed database federations to manage primary and derived genetic data; and new methods to analyze and interpret genomic sequences.
The ultimate challenge in molecular biology is to make sense of the billions of four-letter sequences and to understand their significance in the composition and behavior of biological beings. Mere statistical analysis often seems to capture the functional aspect hidden within genomic fragments inadequately. Hence, some researchers attempt to elucidate higher structure information from plain sequences by using knowledge discovery methods. Here, we discuss three such approaches.
Data mining for regulatory elements in yeast. Regulatory elements in the genome are stretches of DNA that exert control over other genes and thereby influence the phenotype and behavior of the organism. The Transcription Factor Combination Discoverer [1] is a data mining tool for finding and analyzing combinations of transcription-factor binding sites and potential promoter classes in the yeast genome.
Yeast genes and transcription-factor binding-site descriptions are taken from the MIPS and IMD databases. A combination of binding sites is characterized by the following parameters:
Coverage: The number of its occurrences in upstream regions.
Goodness: The ratio of the number of its occurrences in the upstream regions vs. the number of occurrences in random regions (of the same length and number).
Unexpectedness: The ratio of its occurrences vs. the expected number of occurrences based on the individual sites.
Combinations with high values for all these parameters can possibly define promoter classes. A high value of the coverage parameter ensures the combination is present in at least the given number of upstream regions. The goodness parameter ensures the frequency of the combination in upstream regions is not a mere consequence of a high frequency in the genome as the whole. The unexpectedness parameter ensures the frequency of the combination is not only a consequence of the high frequency of individual participating binding sites.
Automated clustering and assembly of large EST collections. Short DNA sequences experimentally confirmed to correspond to actual gene products are called "expressed sequence tags" (ESTs). It is an ongoing challenge to find the correct placement of ESTs and their corresponding genes on the chromosome.
The REX algorithm [4] is a simple yet effective algorithm that integrates the clustering and assembly steps necessary to create a consensus assembled sequence from EST data. It is fast and reflects an intuitive idea of extending an anchor sequence in direction with ESTs derived from an arbitrary number of sequence databases. A rough assembled sequence is created in the process of extending an anchor. This rough assembly is then improved by analyzing a multiple alignment of contigs to make consensus base calls and to remove potential insertion and deletion errors. This method also addresses the issues of splice variants and unspliced DNA data.
Genetic algorithms for protein threading. Since the prediction of protein tertiary structure from protein sequence turned out to be extremely difficult, researchers have attempted to succeed on the inverse route: Given a valid, native protein tertiary structure, which protein sequences fit that structure best?
This NP-hard problem is called "protein threading" and it is hoped that its reverse application will be beneficial to the original protein folding problem. Genetic Algorithms (GAs) have been used [3] in order to significantly improve the ability of threading algorithms to find the optimal alignment of a sequence to a structure, such as the alignment with the minimum free energy.
Major progress has been reported in the design of a representation of the threading alignment as a string of fixed length. The described algorithm is shown to perform well even without predefined core elements. The authors maintain that GA threading is capable of finding consistently good solutions of full alignments in search spaces of size up to 1070.
The three approaches presented here are meant to illustrate the status of awareness of computational biologists of methods from automatic knowledge discovery. It is there, but the latest algorithms are not always used and more attention to the domain of computational biology would certainly be welcome.
However, it must be emphasized the application of knowledge discovery methods to biological problems depends crucially on profound understanding of the application domain, in this case a domain where 90% experimental accuracy is considered highly significant, fuzzy rules are common, and the ability to abstract from irrelevant details is important but difficult. The sample studies presented also reflect the fact the issues in molecular biology today are many and subtle and most obvious approaches to the great problems have already been tried, and yet, there is much room for improvement.
1. Brazma, A., Vilo, J., Ukkonen, E., and Valtonen, K. Data mining for regulatory elements in yeast. In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology 5 (1997), 6574.
2. Schulze-Kremer, S. Ontologies for molecular biology. In Proceedings of the Third Pacific Symposium on Biocomputing 3 (1998), 693704.
3. Yadgari, J., Amir, A., and Unger, R. Genetic algorithms for protein threading. In Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology 6 (1998), 193202.
4. Yee, D. P. and Conklin, D. (1998). Automated clustering and assembly of large EST collections. In Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology 6 (1998), 203211.
©1999 ACM 0002-0782/99/1100 $5.00
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
The Digital Library is published by the Association for Computing Machinery. Copyright © 1999 ACM, Inc.
No entries found