
| THE GROUP | WHAT DO WE DO? | PREVIOUS WORK |
|---|
The main impetus of our work comes from the rapid maturing of the Human Genome Project and its many satellite genomic projects. These projects have produced, and continue to produce at an ever-increasing rate, massive amounts of genomic data in data banks such as GenBank and the Protein Data Bank on the World Wide Web.
Life is the most complex of all systems. With the advance of the Human Genome Project and other genome projects, the database of symbolic sequences of biological origin grows rapidly. As of February 2003, GenBank contained more than 22 billion bases in more than 18 million sequences [0]. Analyzing and understanding these sequences is a challenge to the whole of natural science. The complete sequencing in February 2001 of the 3 billion bp genome of Homo sapiens [1] (DNA reference sequence completed in April 2003) has fundamentally altered the way research in the life sciences will henceforth be conducted. The existence and open accessibility of GenBank and the Protein Data Bank, in conjunction with the vastly increased power of inexpensive computers, have caused a sea change in the way people think about and do research in the life sciences. The life sciences are no longer the exclusive domain of biologists. Biologists may be the best people to generate genomic data, but they may no longer be the best people to analyze and mine information from such data. Physicists and statisticians may well be better qualified than traditional biologists to mine information from the vast data banks, study and analyze the data, and draw conclusions.
Portraits of Whole Genomes
In visualizing very long DNA sequences, including the complete genomes of several bacteria, yeast and segments of human genes, we encounter fractal-like patterns underlying these important biological objects. The method used here to visualize genomes of organisms may well serve as a convenient tool to trace, e.g., the evolutionary relatedness of species. We describe the method and explain the origin of the observed fractal-like patterns [2]. |
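The text above does not spell out the visualization method. One way such a genome "portrait" can be built is to count every K-letter string in the sequence and place each count in its own cell of a 2^K x 2^K grid. The sketch below assumes that approach. The particular placement of G, C, A and T in the 2 x 2 base cell, and the names `CELL` and `kstring_portrait`, are illustrative assumptions and are not taken from this page.

```python
# Assumed 2 x 2 layout for single letters: (row, column) of each base.
CELL = {"G": (0, 0), "C": (0, 1), "A": (1, 0), "T": (1, 1)}


def kstring_portrait(seq, k):
    """Return a 2^k x 2^k grid of counts of all overlapping k-letter strings."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        row = col = 0
        for letter in seq[i:i + k]:
            if letter not in CELL:
                break  # skip strings containing ambiguous letters such as N
            r, c = CELL[letter]
            # Each letter refines the cell position by one binary digit.
            row, col = 2 * row + r, 2 * col + c
        else:
            grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for line in kstring_portrait("GATTACAGATTACA", 2):
        print(line)
```

Rendering the grid as an image (for example with one color per count range) turns the table of counts into a picture; cells that stay empty correspond to K-strings that never occur in the sequence.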
|
| This is a National Research Program for Genomic Medicine Project. |
|
Protein is the link between genotype and phenotype. The life of an organism as scripted in its genome is executed by its proteins. The folding of a polypeptide into the specific three-dimensional structure that is the protein is a phenomenon of immense interest and challenge to physicists. Computer simulation of (a domain of) a protein recently achieved its first breakthrough when a molecular dynamics computation executed on massively parallel supercomputers succeeded in tracing, for a real time of one microsecond (but a computer time of 4 months), the folding path of a 34-residue polypeptide from its extended state to the almost native (folded) state [3]. For a typical protein with about 300 residues, the magnitude of the problem is at least 100-fold greater.
We have previously acquired expertise and solid research results by studying the protein design problem and inter-residue interactions in the lattice model using combinatorics and the Monte Carlo method [4,5]. We are currently undertaking a project to study protein folding, structure and function using the molecular dynamics method, executed through massively distributed computing. We hope to engage thousands, perhaps even tens or hundreds of thousands, of personal computer users as clients to help us do protein simulation calculations on their computers when these would otherwise be idle. A pioneering project of this type has been successfully carried out by the Pande group at Stanford University [6]. Our project, constructed with public freeware (GROMACS for molecular simulation and COSM for networking) and called Protein@CBL, was released in November 2003. We are using this facility to study the folding, structure, and function of several proteins, including the kinetics and thermodynamics of the folding of Trp-cage, a small artificial peptide [7].
One of the pressing problems in the extraction of biological information from a genome is gene recognition and the identification of the many control signal sequences. Gene recognition is relatively simple in microbial genomes because they lack introns. Because about 80% of a microbial genome is coding region, gene finding in such a genome is more aptly described as finding the much shorter non-coding regions. The situation is the opposite in eukaryotic genomes. There, coding regions comprise a small portion of the genome, and genes are broken into even smaller segments, the exons. Our goal is to make high-performance gene-finding and signal-sequence-finding tools. So far we have developed algorithms whose sensitivity and specificity are very high (better than 95% overall and better than 98% for genes longer than 1000 nucleotides) for genes in microbial genomes but only moderate (about 70%) for eukaryotes [8].
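The accuracy figures above can be made concrete with a small calculation. The page does not state how sensitivity and specificity are defined, so the sketch below assumes a convention common in gene-finding work: sensitivity is the fraction of annotated genes that are predicted, and specificity is the fraction of predicted genes that are annotated. Matching genes by exact (start, end) coordinates, the function name `gene_level_accuracy` and the example coordinates are all hypothetical.

```python
def gene_level_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) for exact-match gene predictions.

    Assumed definitions (not taken from the source):
      sensitivity = TP / (TP + FN), the fraction of annotated genes found
      specificity = TP / (TP + FP), the fraction of predictions that are correct
    """
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)   # predicted and annotated
    fn = len(annotated - predicted)   # annotated but missed
    fp = len(predicted - annotated)   # predicted but not annotated
    sensitivity = tp / (tp + fn) if annotated else 0.0
    specificity = tp / (tp + fp) if predicted else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    # Made-up (start, end) coordinates, for illustration only.
    annotated = [(100, 400), (600, 1800), (2000, 2300), (2500, 4000)]
    predicted = [(100, 400), (600, 1800), (2500, 4000), (4200, 4500)]
    print(gene_level_accuracy(predicted, annotated))
```

Restricting both lists to genes longer than 1000 nucleotides before calling the function would give a length-filtered figure of the kind quoted above.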
Our goal is to push our knowledge of the evolution of life as close as possible to its very origin. Our recent textual analysis [7-11] of complete genomes reveals clear evidence suggesting that segmental duplication is likely the most important mechanism that has driven the large-scale growth and evolution of genomes. The study covers all complete prokaryotic and eukaryotic genomes or chromosomes available in GenBank: about 280 sequences ranging in length from 0.3 to 230 million bases. Our study shows that: (i) The statistics of word frequencies in a genome are the same as those of a matching random sequence of a much shorter length, and this short effective length is (within a factor of two or three) universal for all genomes. For example, for two-letter words, the universal effective length is about 300 bases, compared with the lengths of genomes that range from 200 kb to 230 Mb. (ii) Genomes are maximally self-similar. (iii) Individual words are randomly distributed throughout the genome. These properties impose strict constraints on mechanisms for genome growth. We show that an extremely simple universal growth model [7,8,11], based on an early onset of maximally stochastic segmental duplications, generates sequences having the genomic properties cited above. The growth model has far-reaching implications for the evolution of genomes and for the first time allows us to think about the very early ancestor of present-day genomes, when it was only about 300 bases long [11].
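The idea behind property (i) can be illustrated with a toy experiment. The sketch below is not the growth model of the cited papers. It assumes a much simpler rule (copy a random segment and insert it at a random position) and measures the spread of two-letter word counts as the standard deviation divided by the mean. The function names, the 300-base seed, the 100-base maximum segment length and the 20,000-base target length are illustrative choices.

```python
import random
from itertools import product
from statistics import mean, pstdev

ALPHABET = "ACGT"


def random_sequence(length, rng):
    """Uniformly random sequence over A, C, G, T."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def word_counts(seq, k):
    """Counts of all 4^k overlapping k-letter words (seq must use only ACGT)."""
    counts = {"".join(w): 0 for w in product(ALPHABET, repeat=k)}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return counts


def relative_spread(seq, k):
    """Standard deviation of the word counts divided by their mean."""
    values = list(word_counts(seq, k).values())
    return pstdev(values) / mean(values)


def grow_by_duplication(seed, target_length, max_segment, rng):
    """Toy growth rule: repeatedly copy a random segment to a random position."""
    seq = seed
    while len(seq) < target_length:
        seg_len = rng.randint(1, min(max_segment, len(seq)))
        start = rng.randrange(len(seq) - seg_len + 1)
        insert_at = rng.randrange(len(seq) + 1)
        seq = seq[:insert_at] + seq[start:start + seg_len] + seq[insert_at:]
    return seq[:target_length]


if __name__ == "__main__":
    rng = random.Random(0)
    seed = random_sequence(300, rng)
    grown = grow_by_duplication(seed, 20_000, 100, rng)
    plain = random_sequence(20_000, rng)
    print("seed  :", round(relative_spread(seed, 2), 3))
    print("grown :", round(relative_spread(grown, 2), 3))
    print("random:", round(relative_spread(plain, 2), 3))
```

In a run of this toy, the grown sequence is expected to keep a spread of word counts close to that of its short seed, while a plain random sequence of the same final length has a much narrower spread. That is the sense in which a long sequence can have the word statistics of a much shorter random one.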
Evolution and Phylogeny
We are also working on understanding the evolution of DNA uptake signal sequences in human pathogens [12,15] and on phylogeny based on the analysis of whole genomes [16]. |
|
Analysis of Bioelectromagnetic Signals
The neural system and the brain, especially those of higher mammals, are the most complex, and least understood, of biological systems. As part of the newly established Brain Research Center, a four-university project involving Tsing-Hua, Chiao-Tung, Yang-Ming and Central Universities, we are starting a research project on the analysis of brain function and brain modeling. Currently we are developing a capability to interpret MEG (magnetoencephalogram) data using the maximum entropy method [17]. This project is run in close collaboration with the Brain-Imaging Facility at the Taipei Veterans General Hospital. We are also collaborating with the Neuroscience Lab of B.C. Shyu at the Inst. of Biomedical Science, Academia Sinica, to analyze and interpret electrophysiology data from rats [18]. |
|
[0] The GenBank. http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
|
[1] International Human Genome Sequencing Consortium,
"Initial sequencing and analysis of the human genome",
Nature 409 (2001) 860;
J. C. Venter, et al., "The sequence of the human genome",
Science, 291 (2001) 1304.
|
[2] B.L. Hao, H.C. Lee and S.Y. Zhang,
"Fractal related to long DNA sequences and complete genomes",
Chaos, Solitons and Fractals, 11 (2000) 825-836.
|
[3] Y. Duan and P.A. Kollman, Science, 282, 740 (1998).
|
[4] C.T. Shih, et al. and HCL, "The Mean-Field HP Model, Designability and Alpha-Helices in Protein Structures",
Phys. Rev. Lett. 84 (2000) 574-577.
|
[5] Z.H. Wang and HCL, "Origin of the Native Driving Force for Protein
Folding",
Phys. Rev. Lett. 84 (2000) 386-389.
|
[6] C.D. Snow et al. "Absolute comparison of simulated and experimental protein-folding dynamics". Nature 420 (2002) 102-106.
|
[7] JL Lo, CY Tseng, HC Lee,
"Kinetics and Thermodynamics in the Folding of Trp-Cage:
Simulation by Parallel-Tempering",
(preprint)
|
[8] Hong-Da Chen,
"Gene identification and gene search based on phase differences
in DNA sequences", (in Chinese)
NCU MSc. Thesis (2003)
|
[9] L.C. Hsieh, L.F. Luo, F.M. Ji and H.C. Lee, "Minimal model for genome evolution and growth",
Phys. Rev. Lett. 90 (2003) 018101-104.
|
[10] LC Hsieh, TY Chen, CH Chang, WL Fan and HC Lee,
"Universality in large-scale structure of complete genomes",
Genome Biology, 5 (2004) 7
|
[11] HD Chen, CH Chang, LC Hsieh and HC Lee,
"Divergence and Shannon information in genomes",
Phys. Rev. Lett. 94, 178103 (2005)
|
[12] TY Chen, LC Hsieh and HC Lee,
"Shannon Information and Self-Similarity in Complete Genomes",
Computer Physics Communications 168 (May 2005)
|
[13] CH Chang, LC Hsieh, TY Chen, HD Chen, LF Luo and HC Lee,
"Shannon Information in Complete Genomes",
J. Bioinfo. & Comp. Biology 3:3 (June 2005)
|
[14] M. Bakkali, TY Chen, HC Lee and RJ Redfield,
"Evolutionary stability of uptake signal sequence in the Pasteurellaceae",
PNAS, 101 (2004) 4513-4518
|
[15] Dominique Chu, HC Lee and Tom Lenaerts,
"Emergence of Uptake Sequence in Bacterial DNA",
Artificial Life 11:3 (Summer 2005)
|
[16] LC Hsieh, CY Tseng, LF Luo, FM Ji and HC Lee,
"Oligo-distance: a sequence distance determined by word frequencies",
JBCB (preprint)
|
[17] CY Tseng, HY Bai and HC Lee,
"Investigation on Maximum Entropy Method for biomagnetic sources
reconstruction",
(preprint)
|
[18] Zi-Hao Wang, Ming-Hua Chang, Jenq-Wei Yang, Jyh-Jang Sun,
H.C. Lee and Bai-Chuang Shyu,
"Intra-cortical complexities revealed in the primary somatosensory
cortex of rats",
Cytology, Cellular and
System Neuroscience (preprint)
| |