H. C. (Paul) LEE 


Computational Biology Laboratory

COMPUTATIONAL BIOLOGY and SYSTEMS BIOLOGY

Introduction

Genomes are books of life [0]. The genome of each organism encodes almost all the instructions for how the physical body of the organism is to be constructed and much of how its life is to be maintained. At the same time, the genome is also a historical record. It is rich in clues to the history of its own growth and evolution, some obvious and many others obscure. We are interested in both aspects of the genome. We gain insight into, and make inferences about, the growth and evolution of genomes by using mathematics and physical models to analyze and interpret the statistical properties of genomic sequences. We study the relation between the genome and human diseases by integrating DNA microarray and protein microarray data with informatics, computation, statistical analysis and modeling.

Portraits of Whole Genomes

In visualizing very long DNA sequences, including the complete genomes of several bacteria, yeast and segments of human genes, we encounter fractal-like patterns underlying these biological objects of prominent importance. The method used here to visualize genomes of organisms may well be used as a convenient tool to trace, e.g., evolutionary relatedness of species. We describe the method and explain the origin of the observed fractal-like patterns [1].
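To make the idea of a genome "portrait" concrete, here is a minimal sketch of one way such a picture can be built: count every overlapping k-letter word and place each count in a cell of a 2^k x 2^k grid. The function names, the base ordering "ACGT", and the cell-addressing scheme are assumptions made for illustration; they are not taken from [1].

```python
from collections import Counter

BASES = "ACGT"  # assumed ordering; it only fixes which cell each word lands in


def kmer_counts(seq, k):
    """Count overlapping k-letter words made only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if all(b in BASES for b in word):
            counts[word] += 1
    return counts


def word_to_cell(word):
    """Map a word to (row, col) in a 2^k x 2^k grid.

    Each letter contributes one bit to the row and one bit to the column,
    so words sharing a prefix fall in the same sub-square of the grid.
    """
    row = col = 0
    for b in word:
        idx = BASES.index(b)
        row = 2 * row + idx // 2
        col = 2 * col + idx % 2
    return row, col


def portrait(seq, k):
    """Return the grid of word counts for the sequence."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for word, n in kmer_counts(seq, k).items():
        r, c = word_to_cell(word)
        grid[r][c] = n
    return grid


if __name__ == "__main__":
    for row in portrait("ACGTACGTTTGACA", 2):
        print(row)
```

Because the addressing is recursive, words that share a prefix fall in the same sub-square at every scale, so any systematically under-represented word leaves a pattern that repeats across scales. Rendering the grid as an image is left to whatever plotting tool is at hand.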

     pic9z.jpg

Order of Genomes

If genomes are information carriers, do they manifest special textual properties? We use an order index, $\phi$, to quantify the randomness/order of genomes. The index maps genomes to a number from 0 (random and of infinite length) to 1 (fully ordered). For a random sequence, $\phi = L^{-1/2}$ (that is, $\ln\phi = -\tfrac{1}{2}\ln L$), where $L$ is the sequence length. This suggests that we can use length as a measure of randomness: the equivalent length of a sequence whose index is $\phi$ is $L_e = \phi^{-2}$. That is, the sequence is as random as a random sequence of length $L_e$. Another way to measure the randomness of a sequence is to ask how many mutations per site ($n_\mu$) it is from being ordered. This is given by the relation $\phi = e^{-2 n_\mu}$, so long as $n_\mu$ is not greater than $n_{\mu c} = \tfrac{1}{4}\ln L$, the critical mutation density at which the sequence becomes random (and cannot be more so). For sequences in the 1 to 100 Mb range, $n_{\mu c}$ is about 4.0(0.16). We studied all complete genomes in GenBank (about 800 in November 2006) [2] and found that their $\phi$ values are concentrated in a very narrow range, $\phi_g \sim 0.031$. This implies $L_e \sim$ 250 b to 10 kb, and an equivalent mutation density of $n_{\mu e} \sim 1.8$/site. That is, genomes are halfway towards being random, or "at the edge of chaos" [3]. We argue that this narrow range represents the neighborhood of a fixed point in the space of sequences, in which the sequences are in a state of maximum information capacity. Our in silico studies show that a minimal model of genome growth based on random segmental duplication does drive a genome-length sequence to such a fixed point [4].
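The relations in this section can be checked with a few lines of arithmetic. The sketch below assumes the two relations quoted above, φ = exp(-2 nμ) and nμc = (1/4) ln L, and derives the equivalent length by asking for the length at which the critical density equals the sequence's own mutation density, which gives Le = φ^(-2). The function names are ours, chosen for illustration only.

```python
import math


def mutation_density(phi):
    """Mutations per site implied by an order index: phi = exp(-2 * n_mu)."""
    return -math.log(phi) / 2.0


def critical_density(length):
    """Critical mutation density n_mu_c = (1/4) * ln(L) for sequence length L."""
    return math.log(length) / 4.0


def equivalent_length(phi):
    """Length L_e at which critical_density(L_e) == mutation_density(phi).

    (1/4) ln L_e = -(1/2) ln phi  =>  L_e = phi ** -2.
    """
    return phi ** -2


if __name__ == "__main__":
    phi_g = 0.031  # typical genomic value quoted in the text
    print("n_mu for phi_g:", round(mutation_density(phi_g), 2))
    print("L_e for phi_g:", round(equivalent_length(phi_g)), "bases")
    for length in (1e6, 1e8):
        print("n_mu_c for L =", int(length), ":", round(critical_density(length), 2))
```

With φ = 0.031 the implied mutation density comes out a little above 1.7 per site and the equivalent length is of order 1 kb, in line with the values quoted in the text.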
     The_Order_curve_v1.jpg

Symmetry in Genomic Sequences

The cause of symmetry is usually subtle, and its study often leads to a deeper understanding of the bearer of the symmetry. To gain insight into the dynamics driving the growth and evolution of genomes, we conducted a comprehensive study of textual symmetries in about 800 complete chromosomes. We focused on symmetry based on our belief that, in spite of their extreme diversity, genomes must share common dynamical principles and mechanisms that drive their growth and evolution, and that the most robust footprints of such dynamics are symmetry related. We found that while complement and reverse symmetries are essentially absent in genomic sequences, inverse symmetry (complement plus reverse) is prevalent in complex patterns in most chromosomes, a vast majority of which have near-maximum global inverse symmetry. We also discovered relations that can quantitatively account for the long-observed but unexplained phenomenon of skews in genomes. Our results suggest that segmental and whole-genome inverse duplications are important mechanisms in genome growth and evolution, probably because they are efficient means by which the genome can exploit its double-stranded structure to enrich its code inventory [5].
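The three textual symmetries named above can be illustrated on word counts from a single strand. The sketch below compares the count of each k-letter word with the count of its complement, its reverse, or its inverse (reverse complement). The `asymmetry` score is a hypothetical measure introduced here for illustration; it is not the index used in [5].

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(word):
    return word.translate(_COMPLEMENT)


def reverse(word):
    return word[::-1]


def inverse(word):
    """Reverse complement: complement plus reverse."""
    return complement(word)[::-1]


def kmer_counts(seq, k):
    """Count overlapping k-letter words made only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def asymmetry(counts, transform):
    """Hypothetical score: 0 if every word is as frequent as its image
    under `transform`, 1 if words and their images never co-occur."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # The transforms are involutions, so this set is closed under them.
    words = set(counts) | {transform(w) for w in counts}
    diff = sum(abs(counts[w] - counts[transform(w)]) for w in words)
    return diff / (2.0 * total)


if __name__ == "__main__":
    counts = kmer_counts("ACGT", 2)
    for name, t in (("complement", complement), ("reverse", reverse), ("inverse", inverse)):
        print(name, asymmetry(counts, t))
```

The toy sequence "ACGT" equals its own reverse complement, so it scores as perfectly inverse-symmetric while showing no complement or reverse symmetry at the two-letter level.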
     InvSym_Fig4.jpg

Minimal Model for Genome Growth: Random Segmental Duplication

Our goal is to push our knowledge of the evolution of life as close as possible to its very origin. Our recent textual analysis of complete genomes reveals clear evidence suggesting that segmental duplication is likely the most important mechanism that has driven the large-scale growth and evolution of genomes. The study covers all complete prokaryotic and eukaryotic genomes or chromosomes available in GenBank: about 280 sequences ranging in length from 0.3 to 230 million bases. Our study shows that: (i) The statistics of word frequencies in a genome are the same as those of a matching random sequence of a much shorter length, and this short effective length is (within a factor of two or three) universal for all genomes. For example, for two-letter words, the universal effective length is about 300 bases, compared with the lengths of genomes that range from 200 kb to 230 Mb. (ii) Genomes are maximally self-similar. (iii) Individual words are randomly distributed throughout the genome. These properties impose strict constraints on mechanisms for genome growth. We show that an extremely simple universal growth model based on an early onset of maximally stochastic segmental duplications generates sequences having the genomic properties cited above. The growth model has far-reaching implications for the evolution of genomes and for the first time allows us to think about the very early ancestor of present-day genomes, when it was only about 300 bases long [6-10].
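A growth process of this kind is easy to sketch. The code below starts from a short random seed and repeatedly copies a randomly chosen segment to a randomly chosen position until a target length is reached. It is only a toy version under stated assumptions: the default seed length of 300 echoes the effective length quoted above, while the segment-length cap, the uniform choice of positions, and the absence of point mutations are simplifications of ours, not the published model of [6-10].

```python
import random


def grow_by_segmental_duplication(seed_length=300, target_length=20000,
                                  max_segment=500, rng=None):
    """Grow a sequence from a random seed by random segmental duplication.

    All parameter values are illustrative assumptions.
    """
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    seq = [rng.choice("ACGT") for _ in range(seed_length)]
    while len(seq) < target_length:
        # Pick a segment of random length and position in the current sequence.
        seg_len = rng.randint(1, min(max_segment, len(seq)))
        start = rng.randrange(len(seq) - seg_len + 1)
        segment = seq[start:start + seg_len]
        # Insert the copy at a random position (possibly at either end).
        insert_at = rng.randrange(len(seq) + 1)
        seq[insert_at:insert_at] = segment
    return "".join(seq[:target_length])


if __name__ == "__main__":
    genome = grow_by_segmental_duplication(target_length=5000)
    print(len(genome), genome[:60])
```

Word-frequency statistics of the output can then be compared with those of a purely random sequence of the same length to see how duplication broadens the distribution.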

       CSB_Fig4_2.jpg   CSB_Fig5.jpg   RSD_Le_Result2009_v1.jpg

Folding of Prion Protein

In the template-assistance model, the normal prion protein (PrPC), the pathogenic cause of prion diseases such as Creutzfeldt-Jakob disease (CJD) in humans, bovine spongiform encephalopathy (BSE) in cattle, and scrapie in sheep, converts to the infectious prion (PrPSc) through an autocatalytic process triggered by a transient interaction between PrPC and PrPSc. Conventional studies suggest that the S1-H1-S2 region in PrPC is the template of the S1-S2 β-sheet in PrPSc, and that the conformational conversion of PrPC into PrPSc may involve an unfolding of H1 in PrPC and its refolding into the β-sheet in PrPSc. Here we conduct a series of simulation experiments to test the idea of a transient interaction in the template-assistance model. We find that the integrity of H1 in PrPC is vulnerable to a transient interaction that alters the native dihedral angles at residue Asn143, which connects the S1 flank to H1, but not to interactions that alter the internal structure of the S1 flank, nor to those that alter the relative orientation between H1 and the S2 flank [11].
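For readers unfamiliar with the quantity being perturbed, a dihedral (torsion) angle is defined by four consecutive atoms along the backbone. The sketch below computes it from four Cartesian coordinates using the standard atan2 formula. The coordinates in the example are made up for illustration and are not taken from any prion structure or from the simulations in [11].

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four points p0-p1-p2-p3.

    0 degrees is the cis arrangement, +/-180 degrees is trans.
    """
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    norm_b2 = math.sqrt(_dot(b2, b2))
    x = _dot(n1, n2)
    y = _dot(b2, _cross(n1, n2)) / norm_b2
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Hypothetical coordinates: a planar cis arrangement and a quarter turn.
    print(dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0)))
    print(dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1)))
```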

     Prion_conformation_v1.jpg

Localizing Bioelectromagnetic Sources in the Brain

Magnetoencephalography (MEG) provides dynamic spatial-temporal insight into neural activities in the cortex. Because the possible number of sources is far greater than the number of MEG detectors, the problem of localizing sources directly from MEG data is ill-posed. Here we develop a novel approach based on a sequence of data processing procedures that includes a clustering process, a new filter analysis, and an application of the maximum entropy method. We examine the performance of our method and compare it with the minimum-norm least-squares inverse method using artificial noisy MEG data [12]. We are also collaborating with the Neuroscience Lab of B.C. Shyu at the Inst. of Biomedical Science, Academia Sinica, to analyze and interpret electrophysiology data from rats [13].
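The ill-posedness, and the minimum-norm baseline mentioned above, can be seen in a toy linear model b = A x with fewer sensors (rows of A) than candidate sources (columns). Among the infinitely many source vectors that reproduce the data, the minimum-norm solution is x = A^T (A A^T)^(-1) b. The sketch below implements only this baseline, in plain Python, for a made-up lead-field matrix with full row rank; it does not implement the clustering, filtering, or maximum-entropy steps of [12].

```python
def min_norm_solution(lead_field, data):
    """Minimum-norm solution x = A^T (A A^T)^-1 b for an underdetermined system.

    lead_field: list of sensor rows (m sensors x n sources, m < n, full row rank).
    data: list of m sensor readings.
    """
    m, n = len(lead_field), len(lead_field[0])
    # Gram matrix G = A A^T (m x m), augmented with the data column.
    aug = [[sum(a * b for a, b in zip(ri, rj)) for rj in lead_field] + [float(data[i])]
           for i, ri in enumerate(lead_field)]
    # Solve G y = b by Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * m
    for r in range(m - 1, -1, -1):
        rest = sum(aug[r][c] * y[c] for c in range(r + 1, m))
        y[r] = (aug[r][m] - rest) / aug[r][r]
    # Back-project to source space: x = A^T y.
    return [sum(lead_field[i][j] * y[i] for i in range(m)) for j in range(n)]


if __name__ == "__main__":
    # Hypothetical toy setup: 2 sensors, 4 candidate sources.
    A = [[1, 0, 1, 0],
         [0, 1, 0, 1]]
    b = [2, 4]
    print(min_norm_solution(A, b))
```

In this toy case each sensor sees two sources equally, and the minimum-norm estimate splits each reading evenly between them. That smearing of activity across indistinguishable sources is the characteristic behavior of the baseline.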

     MEG_F2_2008_v1.jpg

Systems Biology: A Cancer Research Project Integrating Exon-array, Protein-array, Informatics, Computation, and Modeling

In this project we propose to develop a research and development protocol that integrates high-throughput protein-chip technology, exon-microarray measurements, bioinformatics, computation, and modeling. The protocol is meant to construct protein-protein interaction maps and biological pathways, to identify disease-induced aberrations in these structures, and to track how those aberrations change when the research model is subjected to a variety of controlled stress conditions. A notable feature of the proposal is the use of exon arrays, from which we expect to obtain information on alternative splicing and, through protein-chip technology, to gain insight into the relations between isoforms and pathway aberrations. The proposal aims to generate two intellectual products. One concerns the design and manufacture of a protein chip (or chips) and the development of its high-throughput application in the detection of protein-protein interaction networks (PPIN) as well as protein-DNA interactions. While it will involve procedural development, this part is mostly chemistry and hardware. The other is the development of the integrated research protocol itself, which will involve informatics, computation, modeling, statistics, and systems development [14].

     IntegratedFlowchart.jpg

REFERENCES

[0] International Human Genome Sequencing Consortium, "Initial sequencing and analysis of the human genome", Nature 409 (2001) 860; J. C. Venter, et al., "The sequence of the human genome", Science, 291 (2001) 1304.
[1] B.L. Hao, H.C. Lee and S.Y. Zhang, "Fractal related to long DNA sequences and complete genomes", Chaos, Solitons and Fractals, 11 (2000) 825-836.
[2] The GenBank. http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
[3] SG Kong, et al., "Genomes: at the edge of chaos with maximum information capacity", ArXiv:0708.1598v1 (2007).
[4] SG Kong, et al., "Quantitative measure of randomness and order for complete genomes", Phys. Rev. E 79 (2009) 061911
[5] SG Kong, et al., "Inverse symmetry in genomes and whole-genome inverse duplication", (To appear) PLoS ONE (2009)
[6] L.S. Hsieh, L.F. Luo, F.M. Ji and H.C. Lee, "Minimal model for genome evolution and growth", Phys. Rev. Lett. 90 (2003) 018101-104.
[7] LS Hsieh, TY Chen, CH Chang, WL Fan and HC Lee, "Universality in large-scale structure of complete genomes", Genome Biology, 5 (2004) 7
[8] HD Chen, CH Chang, LC Hsieh and HC Lee, "Divergence and Shannon information in genomes", Phys. Rev. Lett. 94, 178103 (2005)
[9] TY Chen, LC Hsieh and HC Lee, "Shannon Information and Self-Similarity in Complete Genomes", Computer Physics Communications 168 (May 2005)
[10] CH Chang, LC Hsieh, TY Chen, HD Chen, LF Luo and HC Lee, "Shannon Information in Complete Genomes", J. Bioinfo. & Comp. Biology 3 (2005) 587-608
[11] Chih-Yuan Tseng, Chun-Ping Yu and H.C. Lee, "Integrity of H1 helix in prion protein revealed by molecular dynamic simulations to be especially vulnerable to changes in the relative orientation of H1 and its S1 flank", Eur. Biophys. J. 38 (2009) 601-611
[12] Hung-I Pai, Chih-Yuan Tseng and HC Lee, "Data processing approach for localizing bio-magnetic sources in the brain", ArXiv:0903.0859v1 [q-bio.QM] (2009)
[13] Zi-Hao Wang, Ming-Hua Chang, Jenq-Wei Yang, Jyh-Jang Sun, H.C. Lee and Bai-Chuang Shyu, "Intra-cortical complexities revealed in the primary somatosensory cortex of rats", Brain Research 1082 (2006) 102-114
[14] JS Chen, TS Tsou, QD Ling, SC Wang and HCL. National Science Council Grant No. 98-2627-M-008-02 (2009-2012).

