H. C. (Paul) LEE 


Computational Biology Laboratory

COMPUTATIONAL BIOLOGY and SYSTEMS BIOLOGY

Systems Biology of Complex Diseases

The causes of many complex diseases and disorders, including cancer, diabetes, aging, and Alzheimer's disease, have a genomic basis. We study changes in the genomic profiles of patients with such diseases to gain insight into the molecular basis of the disease, in particular insights that may suggest gene-based treatments, and to identify drugs for repurposing whose genomic profiles suggest they may have the desired therapeutic properties. As material we use publicly available data on the genomic profiles of patient cohorts, databases on the genomic properties of drugs and bioactive molecules, and many other databases. Our methods are statistical and physical, and include network theory.

A Gene Set-Based Approach for Characterizing Bioactive Compounds in Terms of Biological Functional Groups

We constructed a gene-set-analysis-based version of CMap, the Gene-Set Connectivity Map (GSCMap), in which all the genomic profiles in CMap are converted to functional profiles using gene sets from the Molecular Signatures Database. We showed that GSCMap essentially eliminated cell-type dependence, a weakness of CMap, and yielded significantly better performance on sample clustering and drug-target association. As a first application of GSCMap we constructed the platform Gene-Set Local Hierarchical Clustering (GSLHC) for gaining insight into the coordinated actions of biological functions and for facilitating the classification of heterogeneous subtypes of drug-driven responses. GSLHC was shown to tightly cluster drugs with known similar properties. We used GSLHC to identify the therapeutic properties and putative targets of 18 compounds listed in CMap whose characteristics were previously unknown, eight of which show signs of anti-cancer activity. The GSLHC website http://cloudr.ncu.edu.tw/gslhc/ contains 1,857 local hierarchical clusters accessible by querying 555 of the 1,309 drugs and small molecules listed in CMap. The figure at right suggests that the molecule 0175029-0000 listed in CMap, whose therapeutic properties were previously unknown, has a genomic profile highly similar to those of several cyclin-dependent kinase inhibitors and DNA topoisomerase inhibitors [17].
     2015GSLHC-Fig7.PNG
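
The conversion from a genomic profile to a functional profile described above can be illustrated with a minimal sketch. The scoring rule used here (the mean z-score of the member genes of each gene set), the function name, and the toy gene sets are assumptions made for illustration only; they are not the scoring used in GSCMap.

```python
from statistics import mean, pstdev


def functional_profile(expression, gene_sets):
    """Convert a gene-level profile into a gene-set-level profile.

    expression: dict mapping gene name -> differential expression value
    gene_sets:  dict mapping gene-set name -> iterable of gene names
    The score of a gene set is the mean z-score of its member genes that
    appear in the profile (an illustrative choice, not the GSCMap rule).
    """
    values = list(expression.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("profile has no variation")
    z = {gene: (value - mu) / sigma for gene, value in expression.items()}

    profile = {}
    for name, genes in gene_sets.items():
        members = [z[g] for g in genes if g in z]
        if members:  # skip gene sets with no measured member
            profile[name] = mean(members)
    return profile


# Hypothetical toy data: values and gene-set names are made up.
expression = {"CDK1": 2.0, "CCNB1": 1.5, "TOP2A": 1.0,
              "IL6": -1.0, "TNF": -1.5, "ALB": -2.0}
gene_sets = {
    "CELL_CYCLE": ["CDK1", "CCNB1", "TOP2A"],
    "INFLAMMATION": ["IL6", "TNF"],
    "UNMEASURED": ["XYZ1"],
}
profile = functional_profile(expression, gene_sets)
for name, score in sorted(profile.items()):
    print(name, round(score, 2))
```

Two compounds can then be compared through their gene-set scores rather than through individual genes, which is the sense in which the profile becomes "functional".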

Functional Module Connectivity Map (FMCM): A Framework for Searching Repurposed Drug Compounds for Systems Treatment of Cancer and an application to Colorectal Adenocarcinoma

We devised the Functional Module Connectivity Map (FMCM) for the discovery of repurposed drug compounds for systems treatment of complex diseases, and applied it to colorectal adenocarcinoma. FMCM uses multiple functional gene modules to query the Connectivity Map (CMap). The functional modules were built around hub genes identified, through a gene selection by trend-of-disease-progression (GSToP) procedure, from condition-specific gene-gene interaction networks constructed from sets of cohort gene expression microarrays. The candidate drug compounds were restricted to drugs predicted to have minimal harmful intracellular side effects. Among the 46 drug candidates selected by FMCM for colorectal adenocarcinoma treatment, 65% had literature support for association with anti-cancer activities, and 60% of the drugs predicted to have harmful effects on cancer had been reported to be associated with carcinogens or immune suppressors. In cell viability tests, we identified four candidate drugs (GW-8510, etacrynic acid, ginkgolide A, and 6-azathymine) as having high inhibitory activity against cancer cells. Through microarray experiments we confirmed the novel functional links predicted for three candidate drugs: phenoxybenzamine (broad effects), GW-8510 (cell cycle), and imipenem (immune system). The figure at right is the drug-function association map [16].
     2014Fig-drug-function-map.png

A Trend-of-Disease-Progression Procedure Works Well for Identifying Cancer Genes from Multi-State Cohort Gene Expression Data for Human Colorectal Cancer

In a novel approach, we used the network and disease progression properties of individual genes in state-specific gene-gene interaction networks (GGINs) to select cancer genes for human colorectal cancer (CRC), and obtained a much higher hit rate of known cancer genes than methods not based on network theory. We constructed GGINs by integrating gene expression microarray data from four states, namely healthy control (Nor), adenoma (Ade), inflammatory bowel disease (IBD) and CRC, with a protein-protein interaction database and Gene Ontology. We tracked changes in the network degrees and clustering coefficients of individual genes in the GGINs as the disease state changed from one to another. From these we inferred that the state sequences Nor-Ade-CRC and Nor-IBD-CRC both exhibited a trend of (disease) progression (ToP) toward CRC, and devised a ToP procedure for selecting cancer genes for CRC. Of the 141 candidates selected using ToP, ~50% had literature support as cancer genes, compared to hit rates of 20% to 30% for standard methods using only gene expression data. Among the 16 candidate cancer genes that encoded transcription factors, 13 were known to be tumorigenic and three were novel: CDK1, SNRPF, and ILF2. The figure at right shows how the partial networks connected to the genes CDK1, SNRPF, ILF2, and MCM10 grow from Nor to CRC [15].

     2013ToPFig6partialgenenetwork.png
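
Tracking the network degree and clustering coefficient of a gene across disease states, as described above, can be sketched as follows. The toy networks, the placeholder gene names (A to E), and the simple "never decreases and ends higher" trend test are assumptions for illustration; the published ToP selection criteria are not reproduced here.

```python
from itertools import combinations


def build_adj(edges):
    """Build an undirected adjacency map (gene -> set of neighbours)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree(adj, gene):
    """Number of interaction partners of a gene (0 if absent)."""
    return len(adj.get(gene, ()))


def clustering_coefficient(adj, gene):
    """Fraction of pairs of neighbours that are themselves linked."""
    nbrs = adj.get(gene, set())
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj.get(u, ()))
    return 2.0 * links / (k * (k - 1))


def has_rising_trend(values):
    """True when values never decrease and end higher than they start.

    A stand-in for a trend-of-progression test (assumed, not the ToP rule).
    """
    return all(b >= a for a, b in zip(values, values[1:])) and values[-1] > values[0]


# Hypothetical state-specific networks; only CDK1 is a name from the text.
states = {
    "Nor": [("CDK1", "A"), ("D", "E")],
    "Ade": [("CDK1", "A"), ("CDK1", "B"), ("D", "E")],
    "CRC": [("CDK1", "A"), ("CDK1", "B"), ("CDK1", "C"), ("A", "B"), ("D", "E")],
}
networks = {name: build_adj(edges) for name, edges in states.items()}
order = ["Nor", "Ade", "CRC"]

cdk1_degrees = [degree(networks[s], "CDK1") for s in order]
d_degrees = [degree(networks[s], "D") for s in order]
print("CDK1 degrees:", cdk1_degrees, "rising:", has_rising_trend(cdk1_degrees))
print("D degrees:", d_degrees, "rising:", has_rising_trend(d_degrees))
```

In this toy example the degree of CDK1 grows along Nor-Ade-CRC while that of gene D stays flat, so only CDK1 would pass the assumed trend test.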

Copy number variation in genomes

Detection of copy number variation (CNV) in DNA is an important method for understanding the pathogenesis of cancer. The trend towards ever larger sample sizes and higher-resolution microarrays has vastly increased the importance of a fast and statistically informed algorithm for extracting CNV from microarray data. We have developed a clustering algorithm, SAD, constructed with a strategy in which all operational decisions are based on simple and rigorous applications of statistical principles, measurement theory, and precise mathematical relations [14]. Compared with existing packages, SAD is simpler in formulation, more user-friendly, much faster and less memory-hungry, offers higher accuracy, and supplies quantitative statistics for its predictions. SAD's running time scales linearly with array size; on a typical modern notebook, it completes high-quality CNV analyses for a 250 thousand-probe array in ~1 second and a 1.8 million-probe array in ~8 seconds. The SAD program may be downloaded here. The speed of SAD makes large-scale and in-depth applications practical, such as checking for systematic errors among the microarrays from a given study.
     SAD_validation.jpg

Quantitative Genome Evolution

Genomes are books of life [0]. The genome of each organism encodes almost all the instructions for how the physical body of the organism is to be constructed and much of how its life is to be maintained. At the same time, the genome is also a historical record. It is rich in clues to the history of its own growth and evolution, some obvious and many others obscure. We are interested in both aspects of the genome. We gain insight into, and make inferences about, the growth and evolution of genomes by using mathematics and physical models to analyze and interpret the statistical properties of genomic sequences. We study the relation between the genome and human diseases by integrating DNA microarray and protein microarray data with informatics, computation, statistical analysis and modeling.

Order of Genomes

If genomes are information carriers, do they manifest special textual properties? We use an order index, φ, to quantify the randomness/order of genomes. The index maps a genome to a number from 0 (random and of infinite length) to 1 (fully ordered). For a random sequence φ = L^{-1/2}, where L is the sequence length. This suggests that we can use length as a measure of randomness: the equivalent length of a sequence whose index is φ is L_e = φ^{-2}. That means the sequence is as random as a random sequence of length L_e. Another way to measure the randomness of a sequence is to say how many mutations per site (n_μ) it is from being ordered. This is given by the relation φ = e^{-2 n_μ}, so long as n_μ is not greater than n_μc = (1/4) ln L, the critical mutation density at which the sequence becomes random (and cannot be more so). For sequences in the 1 to 100 Mb range, n_μc is about 4.0(0.16). We studied all complete genomes in GenBank (about 800 in November 2006) [2] and found that their φ values are concentrated in a very narrow range, φ_g ~ 0.031. This implies L_e ~ 250 b to 10 kb and an equivalent mutation density of n_μe ~ 1.8/site. That is, genomes are halfway towards being random, or "at the edge of chaos" [3]. We argue that this narrow range represents the neighborhood of a fixed point in the space of sequences, in which the sequences are in a state of maximum information capacity. Our in silico studies show that a minimal model of genome growth based on random segmental duplication does drive any genome-length sequence to such a fixed point [4].
     The_Order_curve_v1.jpg
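
The arithmetic linking the order index to the equivalent length and the mutation density can be checked with a short sketch. It assumes the two relations stated above, φ = exp(-2 n_μ) and n_μc = (1/4) ln L, which together give φ = L^(-1/2) for a random sequence of length L and hence L_e = φ^(-2). The function names are hypothetical, and how φ itself is computed from a sequence is not covered here.

```python
import math


def equivalent_length(phi):
    """L_e = phi**-2: length of a random sequence with order index phi."""
    return phi ** -2


def mutation_density(phi):
    """n_mu = -(1/2) ln(phi), obtained by inverting phi = exp(-2 n_mu)."""
    return -0.5 * math.log(phi)


def critical_mutation_density(length):
    """n_mu_c = (1/4) ln(L): density at which a sequence becomes random."""
    return 0.25 * math.log(length)


phi_g = 0.031  # typical genomic value quoted in the text
print("equivalent length (bases):", round(equivalent_length(phi_g)))
print("equivalent mutation density (per site):", round(mutation_density(phi_g), 2))
for length in (1e6, 1e8):
    print("L =", int(length), "n_mu_c =", round(critical_mutation_density(length), 2))
```

For φ_g = 0.031 this gives an equivalent length of roughly one kilobase, inside the 250 b to 10 kb range quoted above, and a mutation density a little above 1.7 per site.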

Symmetry in Genomic Sequences

The cause of symmetry is usually subtle, and its study often leads to a deeper understanding of the bearer of the symmetry. To gain insight into the dynamics driving the growth and evolution of genomes, we conducted a comprehensive study of textual symmetries in about 800 complete chromosomes. We focused on symmetry based on our belief that, in spite of their extreme diversity, genomes must share common dynamical principles and mechanisms that drive their growth and evolution, and that the most robust footprints of such dynamics are symmetry related. We found that while complement and reverse symmetries are essentially absent in genomic sequences, inverse (complement plus reverse) symmetry is prevalent in complex patterns in most chromosomes, a vast majority of which have near-maximum global inverse symmetry. We also discovered relations that can quantitatively account for the long-observed but unexplained phenomenon of skews in genomes. Our results suggest that segmental and whole-genome inverse duplications are important mechanisms in genome growth and evolution, probably because they are efficient means by which the genome can exploit its double-stranded structure to enrich its code inventory [5].
     InvSym_Fig4.jpg
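
The three textual transformations named above (reverse, complement, and inverse, i.e. complement plus reverse) can be made concrete with a small sketch. The symmetry score used here, one minus a normalized difference between the k-letter word counts of a sequence and those of its transform, is an assumption chosen for simplicity and is not the measure used in [5].

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse(seq):
    return seq[::-1]


def complement(seq):
    return seq.translate(COMPLEMENT)


def inverse(seq):
    """Complement plus reverse (the reverse complement)."""
    return complement(seq)[::-1]


def kmer_counts(seq, k):
    """Count all overlapping k-letter words in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def symmetry_score(seq, transform, k=2):
    """1.0 when seq and transform(seq) have identical k-mer counts,
    0.0 when they share no k-mers (illustrative measure, assumed)."""
    a = kmer_counts(seq, k)
    b = kmer_counts(transform(seq), k)
    total = sum(a.values())
    if total == 0:
        raise ValueError("sequence shorter than k")
    diff = sum(abs(a[w] - b[w]) for w in set(a) | set(b))
    return 1.0 - diff / (2.0 * total)


# A toy sequence built by appending the inverse of a segment to itself,
# mimicking a single inverse duplication.
segment = "AAAAGGGG"
seq = segment + inverse(segment)
for name, transform in (("reverse", reverse), ("complement", complement), ("inverse", inverse)):
    print(name, symmetry_score(seq, transform))
```

The toy sequence is exactly symmetric under the inverse transformation but not under reverse or complement alone, which is the qualitative pattern reported above for real chromosomes.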

Minimal Model for Genome Growth: Random Segmental Duplication

Our goal is to push our knowledge of the evolution of life as close as possible to its very origin. Our recent textual analysis of complete genomes reveals clear evidence suggesting that segmental duplication is likely the most important mechanism that has driven the large-scale growth and evolution of genomes. The study covers all complete prokaryotic and eukaryotic genomes or chromosomes available in GenBank: about 280 sequences ranging in length from 0.3 to 230 million bases. Our study shows that: (i) The statistics of word frequencies in a genome are the same as those of a matching random sequence of a much shorter length, and this short effective length is (within a factor of two or three) universal for all genomes. For example, for two-letter words, the universal effective length is about 300 bases, compared with the lengths of genomes that range from 200 kb to 230 Mb. (ii) Genomes are maximally self-similar. (iii) Individual words are randomly distributed throughout the genome. These properties impose strict constraints on mechanisms for genome growth. We show that an extremely simple universal growth model based on an early onset of maximally stochastic segmental duplications generates sequences having the genomic properties cited above. The growth model has far-reaching implications for the evolution of genomes and for the first time allows us to think about the very early ancestor of present-day genomes, when it was only about 300 bases long [6-10].

       CSB_Fig4_2.jpg   CSB_Fig5.jpg   RSD_Le_Result2009_v1.jpg
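
A growth process of the kind described above can be sketched in a few lines. This is a toy version under stated assumptions: the seed is a random sequence of 300 bases (the effective length quoted in the text), each step copies a randomly chosen segment and inserts the copy at a random position, segment lengths are drawn uniformly up to a hypothetical cap `max_seg`, and point mutations are omitted. It is not the published model of [6-10].

```python
import random


def grow_by_segmental_duplication(target_len, seed_len=300, max_seg=1000, rng=None):
    """Grow a sequence to target_len by random segmental duplication.

    seed_len: length of the initial random sequence
    max_seg:  largest segment copied in one step (assumed parameter)
    rng:      optional random.Random instance for reproducibility
    """
    if target_len < seed_len:
        raise ValueError("target_len must be at least seed_len")
    rng = rng or random.Random()
    seq = [rng.choice("ACGT") for _ in range(seed_len)]
    while len(seq) < target_len:
        # Never copy more than is available or more than is still needed.
        longest = min(max_seg, len(seq), target_len - len(seq))
        seg_len = rng.randint(1, longest)
        start = rng.randrange(len(seq) - seg_len + 1)
        segment = seq[start:start + seg_len]
        insert_at = rng.randrange(len(seq) + 1)
        seq[insert_at:insert_at] = segment  # insert the copy in place
    return "".join(seq)


genome = grow_by_segmental_duplication(5000, rng=random.Random(0))
print(len(genome), genome[:60])
```

Because every base in the grown sequence descends from the short seed, word-frequency statistics of the result can be compared with those of a short random sequence, which is the kind of comparison summarized in point (i) above.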

Other Topics

Portraits of Genomes

In visualizing very long DNA sequences, including the complete genomes of several bacteria, yeast and segments of human genes, we encounter fractal-like patterns underlying these biological objects of prominent importance. The method used here to visualize genomes of organisms may well be used as a convenient tool to trace, e.g., evolutionary relatedness of species. We describe the method and explain the origin of the observed fractal-like patterns [1].

     pic9z.jpg
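
One common way to draw a two-dimensional portrait of a long DNA sequence is to count every K-letter string and place the counts on a 2^K by 2^K grid, with each letter selecting a quadrant at successively finer scales. The sketch below assumes this construction; the particular quadrant assignment of the four letters is an arbitrary choice made for illustration and may differ from the layout used in [1].

```python
# (row bit, column bit) chosen by each letter; an assumed layout.
QUADRANT = {"G": (0, 0), "C": (0, 1), "A": (1, 0), "T": (1, 1)}


def portrait(seq, k):
    """Return a 2**k by 2**k grid of counts of all k-letter strings in seq.

    Each successive letter of a word refines the cell position by one bit
    in the row and one bit in the column. Words containing letters other
    than A, C, G, T (for example N) are skipped.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        row = col = 0
        for ch in seq[i:i + k]:
            if ch not in QUADRANT:
                break
            r, c = QUADRANT[ch]
            row = 2 * row + r
            col = 2 * col + c
        else:  # runs only when no letter was rejected
            grid[row][col] += 1
    return grid


grid = portrait("GGTTGNTT", 2)
for line in grid:
    print(line)
```

Strings that are absent or rare in a genome leave empty or sparse cells at every scale of the grid, which is one way patterns of a fractal-like appearance can arise in such a picture.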

Folding of Prion Protein

In the template-assistance model, normal prion protein (PrPC), the pathogenic cause of prion diseases such as Creutzfeldt-Jakob disease (CJD) in humans, bovine spongiform encephalopathy (BSE) in cattle, and scrapie in sheep, converts to infectious prion (PrPSc) through an autocatalytic process triggered by a transient interaction between PrPC and PrPSc. Conventional studies suggest that the S1-H1-S2 region in PrPC is the template of the S1-S2 β-sheet in PrPSc, and that the conformational conversion of PrPC into PrPSc may involve an unfolding of H1 in PrPC and its refolding into the β-sheet in PrPSc. Here we conduct a series of simulation experiments to test the idea of transient interaction in the template-assistance model. We find that the integrity of H1 in PrPC is vulnerable to a transient interaction that alters the native dihedral angles at residue Asn143, which connects the S1 flank to H1, but not to interactions that alter the internal structure of the S1 flank, nor to those that alter the relative orientation between H1 and the S2 flank [11].

     Prion_conformation_v1.jpg

Localizing Bioelectromagnetic Sources in the Brain

Magnetoencephalography (MEG) provides dynamic spatial-temporal insight into neural activities in the cortex. Because the possible number of sources is far greater than the number of MEG detectors, the problem of localizing sources directly from MEG data is ill-posed. Here we develop a novel approach based on a sequence of data processing procedures that includes a clustering process, a new filter analysis, and an application of the maximum entropy method. We examine the performance of our method and compare it with the minimum-norm least-square inverse method using artificial noisy MEG data [12]. We are also collaborating with the Neuroscience Lab of B.C. Shyu at the Inst. of Biomedical Science, Academia Sinica, to analyze and interpret electrophysiology data from rats [13].
     MEG_F2_2008_v1.jpg

REFERENCES

[0] International Human Genome Sequencing Consortium, "Initial sequencing and analysis of the human genome", Nature 409 (2001) 860; J. C. Venter, et al., "The sequence of the human genome", Science, 291 (2001) 1304.
[1] B.L. Hao, H.C. Lee and S.Y. Zhang, "Fractal related to long DNA sequences and complete genomes", Chaos, Solitons and Fractals, 11 (2000) 825-836.
[2] GenBank.
[3] SG Kong, et al., "Genomes: at the edge of chaos with maximum information capacity", ArXiv:0708.1598v1 (2007).
[4] SG Kong, et al., "Quantitative measure of randomness and order for complete genomes", Phys. Rev. E 79, 061911 (2009) .
[5] SG Kong, et al., "Inverse symmetry in genomes and whole-genome inverse duplication", PLoS ONE 4(11): e7553 (2009)
[6] L.S. Hsieh, L.F. Luo, F.M. Ji and H.C. Lee, "Minimal model for genome evolution and growth", Phys. Rev. Lett. 90 (2003) 018101-104.
[7] LS Hsieh, TY Chen, CH Chang, WL Fan and HC Lee, "Universality in large-scale structure of complete genomes", Genome Biology, 5 (2004) 7
[8] HD Chen, CH Chang, LC Hsieh and HC Lee, "Divergence and Shannon information in genomes", Phys. Rev. Lett. 94, 178103 (2005)
[9] TY Chen, LC Hsieh and HC Lee, "Shannon Information and Self-Similarity in Complete Genomes", Computer Physics Communications 168 (May 2005)
[10] HD Chen, WL Fan, SG Kong and HC Lee, "Universal Global Imprints of Genome Growth and Evolution - Equivalent Length and Cumulative Mutation Density", PLoS ONE 5(4): e9844 (2010)
[11] Chih-Yuan Tseng, Chun-Ping Yu and H.C. Lee, "Integrity of H1 helix in prion protein revealed by molecular dynamic simulations to be especially vulnerable to changes in the relative orientation of H1 and its S1 flank", Eur. Biophys. J. (2009) 38:601-611; "From laws of inference to protein folding dynamics", Phys. Rev. E 82, 021914 (2010)
[12] Hung-I Pai, Chih-Yuan Tseng and HC Lee, "Data processing approach for localizing bio-magnetic sources in the brain", ArXiv:0903.0859v1 [q-bio.QM] (2009)
[13] Zi-Hao Wang, Ming-Hua Chang, Jenq-Wei Yang, Jyh-Jang Sun, H.C. Lee and Bai-Chuang Shyu, "Intra-cortical complexities revealed in the primary somatosensory cortex of rats", Brain Research 1082 (2006) 102-114
[14] CH Chen, et al., "An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes", Nucleic Acids Research 39, e89 (2011)
[15] Feng-Hsiang Chung, Henry Hsin-Chung Lee and HC Lee, "A Trend-of-Disease-Progression Procedure Works Well for Identifying Cancer Genes from Multi-State Cohort Gene Expression Data for Human Colorectal Cancer", PLoS ONE 8(6): e65683 (2013)
[16] Feng-Hsiang Chung, et al., "Functional Module Connectivity Map (FMCM): A Framework for Searching Repurposed Drug Compounds for Systems Treatment of Cancer and an application to Colorectal Adenocarcinoma", PLoS ONE 9(1): e86299 (2014)
[17] FH Chung, et al., "Gene-Set Local Hierarchical Clustering (GSLHC) - A Gene Set-Based Approach for Characterizing Bioactive Compounds in terms of Biological Functional Groups", PLoS ONE 10(10): e0139889 (2015)

