HARVARD UNIVERSITY
Graduate School of Arts and Sciences
THESIS ACCEPTANCE CERTIFICATE
The undersigned, appointed by the
Department of Chemistry and Chemical Biology
have examined a thesis entitled
Small Molecule-Based Approach to Chemistry and Biology:Synthesis, Measurement, and Analysis
presented by Young-kwon Kim
candidate for the degree of Doctor of Philosophy and hereby
Signature
Typed name: P
Signature
Typed name: Prof. David Liu
Signature
Typed name: Prof. Daniel Kahne
Date: December 7, 2005
Small Molecule-Based Approach to Chemistry and Biology:
Synthesis, Measurement, and Analysis
A thesis presented
by
Young-kwon Kim
to
The Department of Chemistry and Chemical Biology
in partial fulfillment of the requirements
for the degree of Doctor of Philosophy
in the subject of
Chemistry and Chemical Biology
Harvard University Cambridge, Massachusetts
December 2005
UMI Number: 3205917
Copyright 2005 by Kim, Young-kwon
All rights reserved.
UMI Microform 3205917. Copyright 2006 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346, Ann Arbor, MI 48106-1346
© 2005 Young-kwon Kim. All rights reserved.
Small Molecule-Based Approach to Chemistry and Biology:
Synthesis, Measurement, and Analysis
Young-kwon Kim
7 December 2005
Research Adviser: Professor Stuart L. Schreiber
Abstract
Small molecules have long played important roles in the advancement of biology; however, little meta-insight has been gained during this period. This thesis presents two studies that aim to uncover the relationships between chemical space and biological measurement space.
The first chapter comprises literature surveys of chemical descriptor space, biological measurement space (outputs), and analysis methods to link them. It emphasizes the role of diversity-oriented synthesis in populating accessible chemical space (inputs).
The second chapter describes a methodology that uses well-defined inputs provided by diversity-oriented synthesis and robust readouts from a series of chemical genetic modifier screens. Subsequent multidimensional data analysis confirms the intuition of the scientists while adding methodical rigor, and simultaneously discovers novel patterns of biological activity that correlate with stereochemistry in a subtle and unexpected way. Significant variations in biological outcomes were found to result from the stereochemical and skeletal elements in small molecules. Such insights facilitate efficient searching and probing of chemical space.
The third chapter reports the development of analytical implements and illustrates that the relevance network is robust and flexible. The resulting analysis environment enables the visualization of significant associations between small molecules. A larger number of structurally and functionally heterogeneous inputs (small molecules) are efficiently examined based on a small-molecule annotation dataset and subsequently validated. Furthermore, novel hypotheses on the biological mechanisms of small molecules are proposed using already annotated small molecules.
Table of Contents
Abstract
Abbreviations
Dedication
Chapter 1 Introduction
1.1 Chemical descriptor space
1.2 Biological measurement space
1.3 Multidimensional data analysis
1.4 Sampling chemical space by diversity-oriented synthesis
Chapter 2 Case Study I
2.1 Relationship of skeletal and stereochemical diversity to cellular measurement space
2.2 Supporting information
Chapter 3 Case Study II
3.1 Construction and analysis of relevance network from small-molecule annotation
3.2 Supporting information
Abbreviations
acetic acid
activation domain
acute myelogenous leukemia
angiotensin
adenosine 5’-triphosphate
binding domain
building block
5-bromo-2’-deoxyuridine
adenosine 3’,5’-cyclic monophosphate
calcein-acetoxymethylesters
ceric ammonium nitrate
cholecystokinin receptor
methylene chloride
acetonitrile
chloroform
chemical global positioning system
chemical ionization-mass spectrometry
comprehensive medicinal chemistry
central nervous system
comparative molecular field analysis
dichloromethane
DNA deoxyribonucleic acid
DOS diversity-oriented synthesis
EC50 effective concentration of half-maximal effect
EDC 1-ethyl-3-(3’-dimethylaminopropyl)carbodiimide hydrochloride
EI-MS electron impact-mass spectrometry
ELISA enzyme-linked immunosorbent assay
EM expectation-maximization
Et2O diethyl ether
EtOAc ethyl acetate
Et ethyl
ES-MS electrospray-mass spectrometry
FAB-MS fast atom bombardment-mass spectrometry
FTIR Fourier transform infrared spectrometry
GA genetic algorithm
GE-HTS gene expression-based high-throughput screening
GPCR G protein coupled receptor
GRIND grid-independent descriptors
h hours
HCS high-content screening
HDAC histone deacetylase
hydroxysteroid dehydrogenase
hydroxytryptamine
high-throughput screening
Hertz
high-pressure liquid chromatography
highest scoring common substructure
iso-propyl alcohol
knowledge discovery in database
Kyoto encyclopedia of genes and genomes
tandem liquid chromatography-mass spectrometry
magic angle spinning nuclear magnetic resonance spectroscopy
multi-component reaction
multidimensional scaling
methyl
2,4,6-trimethylphenyl
methanol
megahertz
minutes
magnesium sulfate
mass spectrometry
MACCS-II drug data report
3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide
sodium sulfate
nuclear magnetic resonance spectroscopy
NR nuclear receptor
PCA principal component analysis
PCR polymerase chain reaction
PEG polyethylene glycol
PSA polar surface area
p-TsOH para-toluenesulfonic acid
PyBOP benzotriazol-1-yloxytripyrrolidinophosphonium hexafluorophosphate
pybox pyridine-bis(oxazoline)
PyBroP bromotripyrrolidinophosphonium hexafluorophosphate
pyr pyridine
QSAR quantitative structure activity relationship
QUINAP 1-(2-diphenylphosphino-1-naphthyl)isoquinoline
RNA ribonucleic acid
RNAi RNA interference
ROF rule-of-five
SMILES simplified molecular input line entry specification
SMM small-molecule microarray
SOM self-organizing map
SOSA selective optimization of side activities
TBS tert-butyldimethylsilyl
TES triethylsilyl
TfOH trifluoromethanesulfonic acid
Y2H yeast two-hybrid
Y3H yeast three-hybrid
[M] Macrobeads: silyl-functionalized, 500-600 µm PS, 1% cross-linked by divinylbenzene
To my parents
Chapter 1 Introduction
1.1 Chemical descriptor space
1.2 Biological measurement space
1.3 Multidimensional data analysis
1.4 Sampling chemical space by diversity-oriented synthesis
1.1 Chemical descriptor space
1.1.1 Chemical (descriptor) space
Frequently, the term “chemical space” is used as a colloquialism referring to a conceptual framework for formulating relations between molecular structures and/or properties. Chemical space, which encompasses all possible small organic molecules, has no theoretical limit, but can be reduced according to practical concerns: synthetic feasibility, user accessibility, drug-like properties, and the ability to modulate biological processes.¹
[Figure 1.1. Panels: chemical space (in silico; data mining and analysis; computational scientist); feasible chemical space (in cerebro; strategy and methodology; synthetic chemist); accessible chemical space (in vivo, in vitro). Surviving caption fragment: collection of commercially available reagents.]
Based on synthetic feasibility, chemical space is reduced to “feasible chemical space”. Feasible chemical space can also be defined in various ways, even without real synthetic considerations. For example, in silico combinatorial enumerations of common appendages and core skeletons in chemical databases delineate the boundary of a chemical space.²
1 (a) Dobson, C. M. Nature 2004, 432, 824-828. (b) Lipinski, C.; Hopkins, A. Nature 2004, 432, 855-861. (c) Hann, M. M.; Oprea, T. I. Curr. Opin. Chem. Biol. 2004, 8, 255-263. (d) Bohacek, R. S.; McMartin, C.; Guida, W. C. Med. Res. Rev. 1996, 16, 3-50. (e) Czarnik, A. C. Chemtracts 1995, 8, 13-18.
2 Ertl, P. J. Chem. Inf. Comput. Sci. 2003, 43, 374-380.
Further reduction to “accessible chemical space” can be primarily based on scientific demands. For example, accessible chemical space for the chemical biologist is populated by natural products, commercially available compounds, and libraries derived from diversity-oriented synthesis, each ready for interrogating biological systems of interest. These compounds should be of sufficient quantity and purity, and should carry adequate explicit/implicit structural information. For synthetic chemists, the development of novel synthetic strategies and methodologies might expand feasible chemical space significantly; indeed, the execution of diversity-oriented synthesis can extensively populate the accessible chemical space.
Chemical descriptor space: mathematical definition
Chemical descriptor space is defined as a vector (metric) space spanned by a number of chemical descriptors calculated for each small molecule. In general, each of n selected chemical descriptors adds a dimension to an n-dimensional vector space, and each small molecule is assigned coordinates in this vector space according to the scaled values of its chemical descriptors (Figure 1.2a). For visualization, an n-dimensional chemical descriptor space can be projected onto fewer dimensions by a variety of dimensionality reduction methods. As shown in Figure 1.2b, each axis is replaced by a latent variable derived from the original descriptor set. Sometimes chemical space is partitioned by binning each descriptor, yielding the cells shown in Figure 1.2c.³
Figure 1.2 Chemical descriptor space. (a) n-dimensional chemical descriptor space. (b) For visualization, n-dimensional chemical descriptor space can be reduced to a two- or three-dimensional space using proper dimensionality reduction methods (panel title: reduced space by latent variables). Each axis is represented by a latent variable from the original descriptor set. (c) Cell-based representation of chemical descriptor space (panel title: 18 cells divided by 5 partitioning).
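The three views in Figure 1.2 can be illustrated with a minimal sketch. The descriptor table, the choice of autoscaling, the use of principal component analysis as the dimensionality reduction method, and the equal-width binning are all assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np

def scale_descriptors(X):
    """Autoscale each descriptor column to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant descriptors unscaled
    return (X - X.mean(axis=0)) / std

def project(X_scaled, k=2):
    """Project centred data onto its first k principal components (latent variables)."""
    # Rows of Vt are the principal axes of the centred descriptor matrix.
    _, _, Vt = np.linalg.svd(X_scaled, full_matrices=False)
    return X_scaled @ Vt[:k].T

def assign_cells(X, bins=3):
    """Bin every descriptor into equal-width intervals; return one cell index tuple per molecule."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = np.where(hi > lo, (hi - lo) / bins, 1.0)
    idx = np.floor((X - lo) / width).astype(int)
    return [tuple(row) for row in np.clip(idx, 0, bins - 1)]

# Hypothetical descriptor table: rows are molecules, columns are
# (molecular weight, log P, rotatable-bond count).
X = np.array([[150.2, 2.1, 1],
              [180.2, 1.2, 3],
              [342.3, -3.0, 5],
              [78.1, 2.0, 0]])
Z = scale_descriptors(X)         # coordinates in n-dimensional descriptor space
coords_2d = project(Z, k=2)      # reduced space for visualization
cells = assign_cells(X, bins=3)  # cell-based partitioning
```

Each molecule ends up with three representations: a scaled coordinate vector, a two-dimensional projection, and a cell label.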
3 (a) Bajorath, J. J. Chem. Inf. Comput. Sci. 2001, 41, 233-245. (b) Gasteiger, J.; Engel, T. Chemoinformatics: a textbook (Wiley-VCH, Weinheim, 2003), pp 15-268.
Role of chemical descriptor space
The role of chemical descriptor space can be divided into two elements: storage and retrieval of chemical information related to large compound collections in databases, and rigorous analysis of the properties (i.e., measurement space) of small molecules in relation to their structural features as encoded by chemical descriptors. The process of assigning each small molecule in feasible (F) or accessible (A) chemical space to chemical descriptor space based on its chemical descriptors can be referred to as “representation” (Figure 1.3). On the other hand, analysis of chemical descriptor space and measurement space can generate a number of hypothetical models to be tested. These models are testing grounds for the practical significance of chemical descriptor space as a valid method for linking chemical space and measurement space.⁴
Moreover, the construction of chemical descriptor space is much cheaper and more consistent than both empirical synthesis and biological testing. Therefore, chemical descriptor space might make possible valid predictions of routes from accessible to feasible chemical space. For example, thoughtful extension of validated models from the analysis of accessible chemical space and measurement space might provide guidelines for a second-phase synthesis directed at molecules with improved measured outcomes.
In short, dynamic integration of synthetic chemistry, assay measurements, and data analysis might enable us to constantly evaluate overall processes in order to provide probabilistic, statistically significant predictions.
4 Strausberg, R. L.; Schreiber, S. L. Science 2003, 300, 294-295.
1.1.2 Chemical descriptors
Representation: search and retrieval
Molecular structures are usually represented, manipulated, and stored as molecular graphs. Graph theory is a well-established branch of mathematics that has found applications in chemistry as well as in many other areas. A graph is an abstract formalism that contains nodes connected by edges. In a graph representation (Figure 1.4a) the nodes correspond to atoms and the edges to bonds. Note that hydrogen atoms are often omitted. Atom and bond attributes (for example, element type and bond order) are important when performing operations on the molecular graph. A graph represents only the topology of a molecule; that is, the way the nodes are connected. Therefore, a given graph may be drawn in many different ways and may not obviously correspond to a “standard” chemical diagram.
[Figure 1.4. (a) Graph representations of (R)-carvone, (R)-2-methyl-5-(prop-1-en-2-yl)cyclohex-2-enone, and (S)-carvone, (S)-2-methyl-5-(prop-1-en-2-yl)cyclohex-2-enone. (b) Connection table (number of atoms, number of bonds, coordinates). (c) ... information in red. (d) Examples of ambiguous graphical representations.]
The most common way to encode a molecular graph in numeric form is a connection table. The simplest type of connection table consists of two sections: a list of the atom numbers and positions of the atoms in the molecule, and a list of the bonds, specified as pairs of bonded atoms. As shown in Figure 1.4, a connection table can encode a variety of information. A simpler way to represent a molecular graph is through the use of linear notation; SMILES (Simplified Molecular Input Line Entry Specification) has been used extensively in this way, and can further be used to encode stereochemical information (Figure 1.4c).⁵ᵃ Graphical representation has a relative deficiency, since it allows only one valence bond model in each structure representation. For example, benzene can be represented in two ways, which increases the time needed to search and retrieve exact structures based on graphical representation. Approximately 0.5% of compounds from commercial collections (550K) contain tautomers or ambiguous functional groups (Figure 1.4d).⁵ᵇ
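A minimal sketch of the two-section connection table described above, using acetic acid with hydrogens omitted. The layout, coordinates, and function names are hypothetical and do not follow any particular file format.

```python
# Atom block: (element, x, y); atom numbers are the 1-based list positions.
atoms = [("C", 0.0, 0.0), ("C", 1.3, 0.75), ("O", 2.6, 0.0), ("O", 1.3, 2.25)]
# Bond block: (first atom, second atom, bond order).
bonds = [(1, 2, 1), (2, 3, 1), (2, 4, 2)]

def header(atoms, bonds):
    """Counts line of the table: number of atoms and number of bonds."""
    return len(atoms), len(bonds)

def to_graph(atoms, bonds):
    """Build the molecular graph as {atom number: {neighbour: bond order}}."""
    graph = {i: {} for i in range(1, len(atoms) + 1)}
    for a, b, order in bonds:
        graph[a][b] = order
        graph[b][a] = order
    return graph

graph = to_graph(atoms, bonds)
```

The coordinates play no role in `to_graph`: the graph captures only topology, so the same molecule drawn differently yields the same adjacency structure.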
Since graphs can be constructed in many different ways, it is necessary to have methods to determine whether two graphs are the same. In graph-theoretic terms this problem is known as graph isomorphism. No polynomial-time algorithm is known for the general graph isomorphism problem (the related subgraph isomorphism problem is NP-complete), and the cost of exhaustive comparison grows rapidly with molecular size and with the number of compounds to be compared. There are two distinct approaches for efficient retrieval of compounds from databases: generation of new chemical descriptors, such as binary (“bit-string”) fingerprints, and development of efficient algorithms (e.g., heuristic models).⁵
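To make the cost of exhaustive comparison concrete, the sketch below checks isomorphism of two small labelled graphs by trying every atom renumbering, after a cheap invariant test that rejects most non-matching pairs early. It is a brute-force illustration under assumed data structures (element labels plus 0-based edge lists), not the method of any cited work, and it is usable only for very small graphs because the number of permutations grows factorially.

```python
from itertools import permutations

def invariant(labels, edges):
    """Cheap graph invariant: sorted (element, degree) pairs."""
    degree = [0] * len(labels)
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return sorted(zip(labels, degree))

def is_isomorphic(labels_a, edges_a, labels_b, edges_b):
    """Return True if some renumbering of graph A reproduces graph B."""
    if len(labels_a) != len(labels_b) or len(edges_a) != len(edges_b):
        return False
    if invariant(labels_a, edges_a) != invariant(labels_b, edges_b):
        return False  # screened out without any permutation search
    target = {frozenset(e) for e in edges_b}
    n = len(labels_a)
    for perm in permutations(range(n)):
        if any(labels_a[i] != labels_b[perm[i]] for i in range(n)):
            continue
        if all(frozenset((perm[i], perm[j])) in target for i, j in edges_a):
            return True
    return False

# Propan-1-ol numbered from the carbon end and from the oxygen end, and propan-2-ol.
propan_1_ol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
renumbered = (["O", "C", "C", "C"], [(0, 1), (1, 2), (2, 3)])
propan_2_ol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (1, 3)])
```

The invariant test plays the same role as a fingerprint screen: it is fast, and it can only rule candidates out, never confirm a match.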
5 (a) Weininger, D. J. Chem. Inf. Comput. Sci. 1988, 28, 31-36. (b) Trepalin, S. V.; Skorenko, A. V.; Balakin, K. V.; Nasonov, A. F.; Lang, S. A.; Ivashchenko, A. A.; Savchuk, N. P. J. Chem. Inf. Comput. Sci. 2003, 43, 852-860. (c) Wang, X.; Wang, J. T. L. J. Chem. Inf. Comput. Sci. 2000, 40, 442-451. (d) Rhodes, N.; Willett, P.; Calvet, A.; Dunbar, J. B.; Humblet, C. J. Chem. Inf. Comput. Sci. 2003, 43, 443-448. (e) Raymond, J. W.; Gardiner, E. J.; Willett, P. J. Chem. Inf. Comput. Sci. 2002, 42, 305-316.
A chemical descriptor describes a molecular structure in quantitative terms. The use of chemical descriptors makes possible further understanding and even prediction of chemical and biological properties in structural terms. From this perspective, we can discuss the representation of molecular structure (chemical descriptors), methods of comparing small molecules based on these representations, and how these methods can relate to measured properties. Such considerations lead to the concept of “molecular similarity”, its various definitions and uses, and how these definitions have evolved in recent years. Molecular similarity, as a paradigm, contains many implicit and explicit assumptions with respect to the prediction of biological activity.⁶
Traditionally, chemists have described the topological, geometric, and electronic features of molecular structure across three levels: constitution, configuration, and conformation. Likewise, classification of chemical descriptors is often based on their dimensionality.⁷
One-dimensional (1D) descriptors include bulk properties such as volume, molecular weight, log P, molar refractivity, and simple counts of atom or bond types (e.g., heavy atom counts, rotatable bond counts).
Two-dimensional (2D) descriptors include topological indices and other graph-based descriptors, derived from graph-theoretic decomposition of the connectivity matrix into real numbers (molecular connectivity indices) or integers (Wiener indices)⁸ based solely on the constitution of compounds. Kier and Hall extended topological indices to include electronic and valence state information, deriving “electro-topological” descriptors which were further refined to “E-state fields”.⁸
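As a small worked example of an integer-valued topological index, the Wiener index is the sum of shortest-path bond counts over all pairs of heavy atoms in the hydrogen-suppressed graph. The sketch below computes it by breadth-first search; the function name and the edge-list input format are assumptions for illustration.

```python
from collections import deque

def wiener_index(n_atoms, edges):
    """Sum of topological distances (bond counts) over all atom pairs."""
    adj = [[] for _ in range(n_atoms)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    total = 0
    for start in range(n_atoms):
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2  # every pair was counted from both ends

n_butane = wiener_index(4, [(0, 1), (1, 2), (2, 3)])   # linear chain
isobutane = wiener_index(4, [(0, 1), (0, 2), (0, 3)])  # branched isomer
```

The two C4 isomers have the same atom counts, so a 1D count descriptor cannot separate them, whereas the index depends on branching and therefore on constitution.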
6 Willett, P.; Barnard, J. M.; Downs, G. M. J. Chem. Inf. Comput. Sci. 1998, 38, 983-996.
7 (a) Bajorath, J. J. Chem. Inf. Comput. Sci. 2001, 41, 233-245. (b) Livingstone, D. J. J. Chem. Inf. Comput. Sci. 2000, 40, 195-209, and references therein. (c) Wehrens, R. et al. Anal. Chim. Acta 1999, 400, 413-424.
8 (a) Randic, M. J. Mol. Graph. Model. 2001, 20, 19-35. (b) Randic, M. J. Chem. Inf. Comput. Sci. 2004, 44, 373-371, and references therein. (c) Hall, L. H.; Monhey, B.; Kier, L. B. J. Chem. Inf. Comput. Sci. 1991, 31, 76-82. (d) Kellogg, G. E.; Kier, L. B.; Gaillard, P.; Hall, L. H. J. Comput.-Aided Mol. Des. 1996, 10, 513-520. (e) Torrens, F. Comb. Chem. High Throughput Screen. 2003, 6, 801-809. (f) Xue, L.; Godden, J.; Bajorath, J. J. Chem. Inf. Comput. Sci. 1999, 39, 881-886. (g) McGregor, M. J.; Pallai, P. V. J. Chem. Inf. Comput. Sci. 1997, 37, 443-448. (h) James, C. A.; Weininger, D. Daylight theory manual (Daylight Chemical Information Systems, Inc., Irvine, CA, 1995). (i) Ghuloum, A. M.; Sage, C. R.; Jain, A. J. J. Med. Chem. 1999, 42, 1739-1748.
Fingerprint descriptors incorporate diverse sets of chemical descriptors in a binary bit-string format with various sizes and complexities, and are designed to be “barcodes” for a molecule. Such formats can capture structural or topological features and/or properties of molecules at the same time. One of the principal differences between fingerprint designs is whether or not specific bit positions within the string can be reliably mapped to specific chemical features (absence or presence of a pre-defined structural fragment) or descriptor values. This is the case in keyed designs, such as MACCS keys or MFPs.⁸ By contrast, in hashed or folded representations, where features are mapped to corresponding or overlapping bit segments to enhance uniqueness, single bit positions lose apparent physical meaning.⁸
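The difference between keyed and hashed designs can be sketched with toy feature sets. The key dictionary, the feature names, and the use of MD5 as the hash function are assumptions for illustration only and do not reproduce MACCS keys, MFPs, or any vendor fingerprint.

```python
import hashlib

# Hypothetical key dictionary: bit k always means "feature KEYS[k] is present".
KEYS = ["carbonyl", "hydroxyl", "aromatic ring", "amine"]

def keyed_fingerprint(features):
    """One bit per pre-defined key, so every bit position has a fixed meaning."""
    return [1 if key in features else 0 for key in KEYS]

def hashed_fingerprint(features, n_bits=16):
    """Hash arbitrary features onto a fixed-length string; bits may collide."""
    bits = [0] * n_bits
    for feat in features:
        digest = hashlib.md5(feat.encode("utf-8")).digest()
        bits[int.from_bytes(digest[:4], "big") % n_bits] = 1
    return bits

features = {"carbonyl", "amine", "C-C-N path"}
keyed = keyed_fingerprint(features)
hashed = hashed_fingerprint(features)
```

The keyed string silently ignores "C-C-N path" because no key exists for it, while the hashed string records it but can no longer say which feature set a given bit.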
Originally, the derivation of three-dimensional (3D) descriptors required geometrical information from points in 3D space. These descriptors are calculated in molecular interaction fields, which require that individual compounds be aligned for property calculations.⁹ The group of field-based descriptors differs from other groups because they use three-dimensional information within a molecule for their derivation. Because the methods of generation require a sufficient number of data points (“grid points”) for a sensible resolution, they are computationally much more demanding than two-dimensional descriptors.⁹ For example, typical 2D descriptors range from 0.5 to 5 kilobits per molecule, while 3D descriptors require more than 3 megabits per molecule. In some studies, a subset of field-based descriptors found to be invariant upon rotation or derived from back-projection algorithms was used.⁹ Furthermore, Gaussian representations based on quantum similarity methods have replaced the grid methods for describing the general shape of molecules.⁹
9 (a) Pastor, M.; Cruciani, G.; McLay, I.; Pickett, S.; Clementi, S. J. Med. Chem. 2000, 43, 3233-3243. (b) Stiefl, N.; Baumann, K. J. Chem. Inf. Comput. Sci. 2003, 46, 1390-1407, and references therein. (c) Carbo, R.; Leyda, L.; Arnau, M. Int. J. Quantum Chem. 1980, 17, 1185-1189. (d) Carbo, R.; Calabuig, B. J. Chem. Inf. Comput. Sci. 1992, 32, 600-606. (e) Grant, J. A.; Pickup, B. T. J. Phys. Chem. 1995, 99, 3503-3510. (f) Sheridan, R. P.; Ramaswamy, N.; Rusinko III, A.; Bauman, N.; Haraki, K. S.; Venkataraghavan, R. J. Chem. Inf. Comput. Sci. 1989, 29, 255-260. (g) Sheridan, R. P.; Miller, M. D.; Underwood, D. J.; Kearsley, S. K. J. Chem. Inf. Comput. Sci. 1996, 36, 128-136. (h) Good, A. C.; Ewing, T. J.; Gschwend, D. A.; Kuntz, I. D. J. Comput.-Aided Mol. Des. 1995, 9, 1-12.
Most shape-based descriptors do not encode the shape of a molecule in one fragment, but instead use several small important features to discover feature relationships by statistical associations. These methods are free from alignment problems and are usually implemented as a bit-string representation of features, saving computation time. They are often referred to as multiple-point pharmacophores: two-point pharmacophores (2PP), which are known as atom pairs and represent all possible pairs of atoms in the molecule, three-point pharmacophores (3PP), and four-point pharmacophores (4PP).⁹
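A minimal sketch of a two-point (atom pair) descriptor: every pair of heavy atoms is recorded as (type, type, separation). Here the atom type is just the element symbol and the separation is the number of bonds on the shortest path; both are simplifying assumptions, since published atom-pair schemes use richer atom types and their own distance conventions.

```python
from collections import Counter, deque

def bond_distances(adj, start):
    """Breadth-first search: number of bonds from `start` to each reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def atom_pairs(labels, edges):
    """Count (element, element, bond distance) triples over all atom pairs."""
    adj = [[] for _ in labels]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    pairs = Counter()
    for i in range(len(labels)):
        dist = bond_distances(adj, i)
        for j in range(i + 1, len(labels)):
            if j in dist:
                first, second = sorted((labels[i], labels[j]))
                pairs[(first, second, dist[j])] += 1
    return pairs

# Acetic acid heavy atoms: CH3-C(=O)-OH, bond orders ignored.
acetic_acid = atom_pairs(["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)])
```

No alignment is needed because the triples depend only on connectivity; each distinct triple can then be mapped to a bit position to obtain the bit-string form mentioned in the text.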
Surface-based descriptors rely on the intuitive notion that macromolecule-ligand interactions are mostly mediated by molecular surfaces (e.g., the van der Waals surface).¹⁰ In one case, refinements over geometric organizations of polar and nonpolar surface areas showed significant improvements in the prediction of physicochemical descriptors.¹⁰
Spread and variability of chemical descriptors
Since most descriptors have been developed for different purposes, the ranges and distribution patterns of descriptor values are heterogeneous. Prior to using collections of descriptors from different sources, it is wise to scale them. Descriptors can be scaled based on the observed range of values in the dataset to reveal any peculiarities in the dataset.¹¹ For example, fingerprint scaling is a method to increase the performance of similarity search calculations. It is based on the detection of local patterns with higher information content, representing specific compound classes; application of scaling factors has been shown to improve search results for different sets of fingerprints.¹¹ Standard, chemically meaningful scaling based on the mean value and absolute deviation of a variety of descriptors in different chemical databases (ACD, CMC, MDDR) has been carried out to filter outliers.¹¹
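One way to read "scaling based on the mean value and absolute deviation" is to centre each descriptor on its mean and divide by its mean absolute deviation, then flag molecules whose scaled value exceeds a cutoff. The sketch below follows that reading; the exact procedure of the cited work is not given in the text, and the cutoff of 3 and the descriptor values are assumptions.

```python
import numpy as np

def mad_scale(X):
    """Centre each descriptor on its mean and divide by its mean absolute deviation."""
    X = np.asarray(X, dtype=float)
    centre = X.mean(axis=0)
    mad = np.abs(X - centre).mean(axis=0)
    mad[mad == 0] = 1.0  # constant descriptors carry no scale information
    return (X - centre) / mad

def flag_outliers(X, threshold=3.0):
    """Mark molecules with any scaled descriptor beyond the threshold."""
    return np.any(np.abs(mad_scale(X)) > threshold, axis=1)

# Hypothetical table: columns are molecular weight and log P; the last row is extreme.
X = np.array([[100, 1.0], [110, 1.2], [120, 1.4], [130, 1.6], [140, 1.8],
              [150, 2.0], [160, 2.2], [170, 2.4], [180, 2.6], [2000, 2.8]])
flags = flag_outliers(X)
```

Whether a flagged molecule is removed or kept is a separate decision from detecting it.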
10 (a) Stanton, D.; Jurs, P. Anal. Chem. 1990, 62, 2323-2329. (b) Gaillard, P.; Carrupt, P. A.; Testa, B.; Boudon, A. J. Comput.-Aided Mol. Des. 1994, 8, 83-96. (c) Polanski, J.; Walczak, B. Comput. Chem.
Nevertheless, consideration of outliers is really a question of finding a balance between extremes. For example, outliers would have a high probability of being identified by dissimilarity-based searching if left in the dataset, or they might tend to artificially compress major populations into smaller spaces during cell-based partitioning. In most studies, however, it is better for outliers to remain in the dataset.¹¹
Correlation between descriptors
Despite their heterogeneous origins, there is significant overlap in the information content of chemical descriptors.¹² Implicit information encoded in various 2D descriptors, for example, can be extracted, making it possible to use a subset of cost-effective descriptors. In a series of alkyl-phenyl compounds, there are significant correlations between 2D topological indices and parameters related to conformations, demonstrating that 3D properties can be extracted without resorting to geometric optimizations.¹² For the calculation of polar surface area, essentially identical results are obtained using either 3D calculation or 2D topological indices.¹²
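Overlap in information content can be detected with a pairwise Pearson correlation matrix, keeping only one descriptor from each highly correlated group. The greedy selection rule, the 0.95 cutoff, and the descriptor values below are assumptions for illustration; listing the cheaper 2D descriptor first makes it the one that is retained.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.95):
    """Greedily keep descriptors whose |Pearson r| with every kept one is below threshold."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    keep = []
    for j in range(len(names)):
        if all(abs(corr[j, k]) < threshold for k in keep):
            keep.append(j)
    return [names[j] for j in keep]

# Hypothetical values: polar surface area from a 2D scheme and from 3D geometry, plus log P.
names = ["PSA_2D", "PSA_3D", "logP"]
X = np.array([[20.0, 20.2, 3.1],
              [37.0, 37.3, 0.5],
              [64.0, 63.6, 2.2],
              [90.0, 90.1, 1.0],
              [110.0, 110.4, 2.9]])
selected = drop_correlated(X, names)
```

In this toy table the two surface-area columns are almost perfectly correlated, so the 3D value adds little beyond the 2D one.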
Performance of chemical descriptors: retrospective analysis
Retrospective analysis of descriptor performance is carried out either by using simulated property prediction experiments or by examining the coverage of different bioactivity types in diverse subsets. The ability to distinguish biologically active and inactive compound sets by various clustering methods was evaluated over a range of structural descriptors; the most effective descriptors were 2D keyed fingerprints.¹³ In a following study by the same group, the prediction of known physical properties as a metric showed the same trend as well.¹³ Independent studies also demonstrated that 2D fingerprint-based descriptors were most effective both in selecting active compounds and in sampling representative subsets of active compounds.¹³
12 (a) Oprea, T. I. J. Braz. Chem. Soc. 2002, 13, 811-815. (b) Quigley, J. M.; Naughton, S. M. J. Chem. Inf. Comput. Sci. 2002, 42, 976-982. (c) Estrada, E.; Molina, E.; Peromodo-Lopez, I. J. Chem. Inf. Comput. Sci. 2001, 41, 1015-1021. (d) Ertl, P. et al. J. Med. Chem. 2000, 43, 3714-3717.
13 (a) Brown, R. D.; Martin, Y. C. J. Chem. Inf. Comput. Sci. 1996, 36, 572-584. (b) Brown, R. D.; Martin, Y. C. J. Chem. Inf. Comput. Sci. 1997, 37, 1-9. (c) Matter, H. J. Med. Chem. 1997, 40, 1219-1229. (d) Potter, T.; Matter, H. J. Med. Chem. 1998, 41, 478-488. (e) Patterson, D. E. et al. J. Med. Chem. 1996, 39, 3049-3059. (f) Cruciani, G. et al. J. Med. Chem. 2002, 45, 2685-2694. (g) Ooms, F. et al. Biochim. Biophys. Acta 2002, 1587, 118-125.
Often, application of the similarity principle is validated by the mathematical concept of neighborhood behavior.¹³ A number of descriptors were assessed by their distributions for a subset of related compounds and biological activities; a 3D grid-based descriptor (CoMFA) and 2D fingerprints were found to outperform other chemical descriptors. These conclusions, however, should be treated with some caution due to the limited sizes of the datasets.
Solubility data and blood-brain barrier penetration serve as test cases for pharmacokinetic aspects of descriptor analysis.¹³ Comparison of the descriptors applied to these data sets revealed that surface-based 3D descriptors (VolSurf) demonstrated the most consistent and reliable performance, grid-independent 3D descriptors (GRIND) showed intermediate performance, and 2D fingerprint-type descriptors (UNITY fingerprints, ISIS keys) underperformed for pharmacokinetic profiling.
1.1.3 Navigating chemical descriptor space
Historically, the notion of similarity is used mainly in the early stages of the development of a particular science, and later it may be quantified and explained with accuracy as the theory of this science develops. For example, the periodic table was originally founded on similarity between elements, and these “similarities” were later explained based on electrons and the
14 (a) Hansch, C. Acc. Chem. Res. 1969, 2, 232-239. (b) Isaacs, N. Physical Organic Chemistry (Second edition, Prentice Hall, 1996), pp 146-192.
the indoleacetic acid derivatives started from using the electronic coefficients of the Hammett equation.¹⁴ A general shortcoming common to these methods is that their scope is limited to structurally closely related series of compounds, so they are inappropriate for correlating data where compounds fall into many different structural classes. Prediction of activity outside the structural classes of established biological interest is thus problematic most of the time. A second shortcoming is their weakness in accommodating data represented by inactive compounds. Essentially, existing structure-activity correlation methodologies are only useful for optimizing a previously known “lead” structure, and not for generating new “leads”.
Substructure search
Substructural analysis is often dubbed Free-Wilson analysis, as Free and Wilson published one of the early works in this area.¹⁵ Substructure search involves the retrieval of all molecules in a database that contain a user-defined query substructure, irrespective of the environments in which the query substructure occurs.¹⁵ It is equivalent to determining whether one graph is entirely contained within another, a problem known as subgraph isomorphism. In practice, efficient search of a vast database requires two steps. The first step involves the use of screens to rapidly eliminate molecules that cannot possibly match the substructure query. The remaining structures are then subjected to the more time-consuming subgraph isomorphism procedure to determine which of them truly match, based on the presence or absence of the structural features represented. There are a number of different graph-theoretic algorithms, such as the maximum common edge subgraph (MCES), maximum weight clique, and k-cut methods, that show similar or superior performance to subgraph isomorphism.¹⁵
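The two-step procedure can be sketched as follows. The screen here is a simple element-count comparison and the second step is a naive backtracking subgraph match that ignores bond orders; both are stand-ins chosen for brevity, and the database contents and function names are hypothetical.

```python
from collections import Counter

def passes_screen(q_labels, m_labels):
    """Step 1: a molecule with too few atoms of some element cannot contain the query."""
    return not (Counter(q_labels) - Counter(m_labels))

def subgraph_match(q_labels, q_edges, m_labels, m_edges):
    """Step 2: backtracking search for an embedding of the query graph in the molecule."""
    m_bonds = {frozenset(e) for e in m_edges}
    q_adj = [[] for _ in q_labels]
    for i, j in q_edges:
        q_adj[i].append(j)
        q_adj[j].append(i)

    def extend(mapping):
        i = len(mapping)  # next query atom to place
        if i == len(q_labels):
            return True
        for cand in range(len(m_labels)):
            if cand in mapping or m_labels[cand] != q_labels[i]:
                continue
            # Every bond to an already placed query atom must exist in the molecule.
            if all(frozenset((cand, mapping[k])) in m_bonds for k in q_adj[i] if k < i):
                if extend(mapping + [cand]):
                    return True
        return False

    return extend([])

def substructure_search(query, database):
    q_labels, q_edges = query
    hits = []
    for name, (m_labels, m_edges) in database.items():
        if passes_screen(q_labels, m_labels) and subgraph_match(q_labels, q_edges, m_labels, m_edges):
            hits.append(name)
    return hits

database = {
    "ethylamine": (["C", "C", "N"], [(0, 1), (1, 2)]),
    "ethanol": (["C", "C", "O"], [(0, 1), (1, 2)]),
    "trimethylamine": (["N", "C", "C", "C"], [(0, 1), (0, 2), (0, 3)]),
}
query = (["N", "C", "C"], [(0, 1), (1, 2)])  # N-C-C chain
hits = substructure_search(query, database)
```

Ethanol is rejected by the screen alone, while trimethylamine passes the screen but fails the graph match because it has no C-C bond, which is why the screen can only eliminate candidates and never confirm them.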
Global substructure analysis of frequent substructures using drug databases might well
lead to the identification of minimal motifs relevant to biological activity.''*“ This approach
'S (a) Free, S M.; Wilson, J W J Med Chem 1964, 7, 395-398 (b) Merlot, C et al Curr Opin Drug.Dise Dev 2002, 5, 391-399 (c) Hagadone, T R J Chem Inf’ Comput Sci., 1992, 32, 515-521.
!5 (a) Sheridan, R P J Chem Inf, Comput Sci 1998, 38, 915-924 (b) Horton, D A; Bourne, G T.;Smythe, M L Chem Rev 2003, 703, 893-930 (c) Tounge, B A.; Reynolds, C H J Chem Inf.
Trang 27resonates well with the concept of “privileged (sub)structures”, referring to substructural
elements found in compounds enriched for biological activity.'* Application of
retro-synthetic analysis and comparative filtering of database ranked a number of substructures aspreferred building blocks for library design.'** Furthermore, “highest scoring common
substructure” (HSCS) was analyzed using a smaller dataset to provide skeletal information for
postulated lead compounds related to a certain biological activity.’
stimulated interest in the use of multiple reference structures to identify further molecules for
biological screening.'”* Similarity searching in large chemical databases requires
representations of the molecules that are both effective, i.e., can differentiate between molecules that are different, and efficient, i.e., quick to calculate, in operation In general, there is a conflict between these two traits in that the most effective methods of representation tend to be the least efficient to calculate, and vice versa, so a suitable compromise needs to be made Based on the set of descriptors or features chosen, comparison of molecules is usually
17b
performed using similarity measures (e.g., similarity coefficlents) ˆ Many similarity
coefficients have been developed, and these can broadly be divided into association, correlation,
Comput Sci 2004, 44, 1810-1815 (d) Lewell, X Q.; Judd, D B.; Watson, S P.; Hann, M M J Chem Inf Comput Sci 1998, 38, 511-522 (e) Sheridan, R P J Chem Inf Comput Sci 2003, 43, 1037-1050.
17 (a) Hert, J.; Willett, P.; Wilton, D J.; Acklin, P.; Azzaoui, K.; Jacoby, E.; Schuffenhauer, A J Chem
Inf Comput Sci 2004, 44, 1177-1185 (b) Holliday, J D.; Salim, N.; Whittle, M.; Willett, P J Chem Inf Comput Sci 2003, 43, 819-828 (c) Chen, X.; Reynolds, C H J Chem Inf Comput Sci 2002, 42, 1407-1414.
distance, and probabilistic coefficients. In the case of comparison of two bit-string type
descriptors, association coefficients (e.g., Tanimoto coefficient) try to capture fragments
common to the two molecules to be compared and give a result in the range [0, 1], where 1
represents identical molecules. Correlation coefficients (e.g., Pearson coefficient) give values
in the range [-1, 1] and represent the correlation between two vectors representing two
molecules. Distance coefficients (e.g., Euclidean coefficient) focus on differences between two
molecules and are a measure of dissimilarity, giving results in the range [0, +inf). There are a number of factors influencing the performance of similarity coefficients in mining larger
datasets. Sometimes a judicious combination of similarity coefficients improves the global
outcomes of the analysis.’ ”°
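(a) $D(A, B) = \sqrt{\sum_i (a_i - b_i)^2}$
(b) $r(A, B) = \frac{\sum_i (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_i (a_i - \bar{a})^2 \, \sum_i (b_i - \bar{b})^2}}$
(c) $T(A, B) = \frac{\gamma}{\alpha + \beta - \gamma}$
Figure 1.6 Similarity coefficients. (a) Euclidean coefficient. (b) Pearson correlation coefficient. (c) Tanimoto coefficient (α, β: fragments in molecules A and B, respectively; γ: common fragments in both A and B).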
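The three families of coefficients discussed above can be illustrated with a minimal Python sketch. It assumes each molecule is encoded as an equal-length bit string (1 = fragment present); the two example fingerprints are hypothetical and do not come from any dataset cited here.

```python
import math

def euclidean(a, b):
    """Distance coefficient: 0 for identical vectors, unbounded above."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Correlation coefficient between two descriptor vectors, in [-1, 1]."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("Pearson correlation is undefined for a constant vector")
    return cov / math.sqrt(var_a * var_b)

def tanimoto(a, b):
    """Association coefficient: common on-bits over the union of on-bits."""
    common = sum(1 for x, y in zip(a, b) if x and y)
    on_a = sum(1 for x in a if x)
    on_b = sum(1 for y in b if y)
    union = on_a + on_b - common
    # Convention chosen for this sketch: two empty bit strings count as identical.
    return common / union if union else 1.0

# Hypothetical 8-bit fragment fingerprints for two molecules.
mol_a = [1, 1, 0, 1, 0, 0, 1, 0]
mol_b = [1, 0, 0, 1, 1, 0, 1, 0]

print("Tanimoto :", tanimoto(mol_a, mol_b))
print("Pearson  :", pearson(mol_a, mol_b))
print("Euclidean:", euclidean(mol_a, mol_b))
```

For bit strings the three measures use the same information (shared and unshared bits) but weight it differently, which is one reason combinations of coefficients can behave differently from any single one.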
Similarity paradox
There are numerous examples illustrating the so-called “similarity paradox”, in which a small change in the chemical structure leads to a drastic change in the biological response. This failure of the similarity principle can be viewed on two levels. One level is the aspect of representation: whether or not the assessment of similarity correctly quantifies the intuitive similarity between two compounds. The other level comes from the complexity of the
biological systems and the responses of interest.
Structural similarity is more evident in actions when the hypothetical “lock and key” mode
of macromolecular interactions is prevalent. Nevertheless, this mode may not always be the
case; similar compounds frequently bind in very different orientations in the protein active site, bind to a different conformation of a protein, or bind to a different protein altogether. In fact,
such observations are strengthened by the notion that medicinal chemists need to make a large number of compounds to represent any structural class, even as they are designing the
compounds to interact with a biological target of known structure. Moreover, numerous
meaningful biological responses result from the complex interactions of genetic and
R:H AT; antagonist enantiomer binds D; receptor agonist of calcium channel CCK1 agonist
R:/-Pr = AT, agonist with 1/ 1,250 fold affinity enantiomer: antagonist enantiomer: antagonist
Figure 1.7 Examples of similarity paradox. (a) Substitution at one position alters the functional action of a
small molecule. (b) Butaclamol is an example in which the affinities of the (+)- and (-)-enantiomers differ from receptor to receptor for the same compound. The (S)-(-) form of the calcium channel ligand Bay K 8644
is an agonist (stabilizing the open calcium channel), whereas the (R)-(+) form is a weak antagonist, a
calcium channel blocker (stabilizing the closed channel). Corresponding differences are observed for a
CCK1 ligand, where one stereoisomer is an agonist, whereas its enantiomer is an antagonist. (c) Because
of the chiral nature of our sensory receptors, the enantiomers of limonene and carvone differ in their
typical odor. (d) For two diastereomers of the wine lactone, the odor threshold values (i.e., the lowest
concentration in air that can be smelled by a person) differ by about 8 orders of magnitude.'**!
As illustrated in Figure 1.7, it is not evident that there is any structural relationship between compounds and their odors. Interestingly, a model based on the vibrational spectra of odorants
showed promising results in correlating with odor. For example, degeneracy, the ability of
structurally different components to carry out a similar function or produce an equivalent output within a system, takes part in many levels of biological systems to yield robust behavior over
the course of evolutionary pressure.'TM
18 (a) Edelman, G M.; Gally, J A Proc Natl Acad Sci U.S.A. 2001, 98, 13763-13768 (b) Beely, N
R A Drug Disc Today 2000, 5, 354-363 (c) Ariëns, E J.; Wuis, E W.; Veringa, E J Biochem.
Diversity analysis (assessment)
Diversity analysis generally deals with two different questions: “which compound set spans the largest chemical space?” or “which compound set is most similar to (a) reference
compound(s)?” The first question involves maximizing dissimilarity of compounds to explore new areas of chemical space, while the second entails maximizing similarity between
compounds in focused region.'”* A related notion in diversity design is that of selecting
compounds to “fill holes” in some diversity descriptor-defined space, a strategy most commonly put forward in connection with pharmacophore-based molecular descriptors. In principle, this is
a perfectly sensible procedure, but if and only if the descriptor(s) defining the space have already been shown to be valid, e.g., showing a neighborhood behavior. Such analyses can also
be applied to different problems such as non-redundant subset selection from chemical
databases, or global analysis and comparison of chemical databases.'””
Distance-based methods. In distance-based approaches, diversity is generally expressed as some measure of pairwise dissimilarities. One drawback is that the cost of calculation scales
with the square of the number of compounds, O(N^2), becoming prohibitive for large collections of compounds. Aggressive applications of cost-effective searching methods (e.g., decision trees, simulated annealing, and nearest-neighbors) and metrics have reduced the scale to O(N log N) or
even O(N).'”° Besides computational costs, this approach has a tendency to spread out
compounds too much in descriptor spaces, making it difficult to locate diversity “voids”
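The quadratic cost of a distance-based diversity measure can be seen in a minimal Python sketch. It assumes each compound is represented as a set of fragment identifiers and uses the Tanimoto (Jaccard) distance as the pairwise dissimilarity; the three-compound library is hypothetical.

```python
from itertools import combinations

def tanimoto_distance(a, b):
    """Dissimilarity between two fragment sets: 1 - |A & B| / |A | B|."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union

def mean_pairwise_dissimilarity(fingerprints):
    """Average dissimilarity over all N(N-1)/2 pairs, hence O(N^2) work."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical library: each set holds the fragment IDs present in one compound.
library = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
print(mean_pairwise_dissimilarity(library))
```

Doubling the library size roughly quadruples the number of pairs evaluated, which is the scaling problem that the cheaper search strategies mentioned in the text try to avoid.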
Optimization-based methods are effective ways of sampling large spaces evenly. For example, a variance-based approach starts by finding a subset of compounds with descriptors of
Pharmacol 1988, 37, 9 (d) Schramm, M.; Thomas, G.; Towart, R.; Franckowiak, G Nature 1983, 303,
535 (e) Franckowiak, G.; Bechem, M.; Schramm, M.; Thomas, G Eur J Pharmacol 1985, 114, 223 (f) de Tullio, P.; Delarge, J.; Pirotte, B Curr Med Chem 1999, 6, 433 (g) Hughes, J.; Dockray, G J.; Hill, D.; Garcia, L.; Pritchard, M C.; Forster, E.; Toescu, E.; Woodruff, G.; Horwell, D C Regul Pept
1996, 65, 15 (h) Beinborn, M.; Quinn, S M.; Kopin, A S J Biol Chem 1998, 273, 14146 (i)
Friedman, L.; Miller, J G Science 1971, 172, 1044 (j) Guth, H Helv Chim Acta 1996, 79, 1559.
19 (a) Young, S S.; Ge, N Curr Opin Drug Disc Dev 2004, 7, 318-324 (b) Bayada, D M et al J
Chem Inf Comput Sci 1999, 39, 1-10 (c) Agrafiotis, D K.; Lobanov, V S J Chem Inf Comput Sci 1999, 39, 51-58 (d) Martin, E J.; Critchlow, R E J Comb Chem 1999, 1, 32-45.
the least possible correlation, and tests the significance of each descriptor in predicting relevant dependent variables (e.g., biological activity). The most widely used method is D-optimal design, but the results are model-dependent, and tend to favor the extremes of chemical
space, |Ӣ
Cell-based designs operate within a pre-defined low-dimensional space (Figure 1.2c). One
of the important advantages of cell-based partitioning (binning) methods is a common frame of reference for comparing different datasets, allowing focused design of a library based on the properties of interest.20 Another advantage is that they are very fast and cost-effective, scaling
as O(N). The main drawback of these methods is that they are restricted to a low-dimensional space because the number of cells required to dissect a space rapidly becomes prohibitive as the
number of dimensions grows.20 It is also believed that diversity space is a relative space (i.e.,
there is no absolute origin), which has to be oriented to a reference point (e.g., drug-like space
for a drug-discovery program). Hence, global modeling, especially through the use of an
insufficient number of descriptors, has come under fire for whether or not it properly represents
chemical space.ˆ°°*
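A minimal Python sketch of cell-based partitioning is given here as an illustration. It assumes a two-descriptor space (molecular weight and logP) with hypothetical axis ranges and compound values; each compound is assigned to a cell in a single pass, which is where the O(N) scaling comes from.

```python
from collections import Counter

def cell_index(point, lows, highs, bins):
    """Map a descriptor vector to the tuple of bin indices of its cell."""
    index = []
    for value, low, high in zip(point, lows, highs):
        fraction = (value - low) / (high - low)
        # Clamp so that out-of-range values fall into the outermost bins.
        index.append(min(max(int(fraction * bins), 0), bins - 1))
    return tuple(index)

def occupancy(points, lows, highs, bins):
    """Count compounds per cell in one pass over the data (O(N))."""
    return Counter(cell_index(p, lows, highs, bins) for p in points)

# Hypothetical (molecular weight, logP) values for three compounds.
compounds = [(150.0, 1.0), (160.0, 1.2), (450.0, 4.5)]
lows, highs, bins = (0.0, -2.0), (600.0, 8.0), 4

cells = occupancy(compounds, lows, highs, bins)
total_cells = bins ** len(lows)  # grows exponentially with the number of dimensions
print(cells, "occupied:", len(cells), "of", total_cells)
```

With 4 bins per axis, two descriptors give 16 cells, but ten descriptors would already give 4**10 (over a million), which illustrates why the approach is restricted to low-dimensional spaces. Empty cells in the occupancy table correspond to the diversity "voids" discussed earlier.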
ChemGPS was built to provide a consistent metric by always calculating its properties in relation to a set of compounds having extreme properties (“satellites”). The role of satellites is
to serve as boundary conditions imposed by property space. Since it is a global model based on
interpolation, the ChemGPS prediction is expected to be robust.21
1.1.4 Mapping Chemical Descriptor Space
Here, I have reviewed some literature covering retrospective analyses of chemical space
20 (a) Yi, B.; Hughes-Oliver, J M.; Zhu, L.; Young, S S J Chem Inf Comput Sci 2002, 42,
1221-1229 (b) Godden, J W.; Furr, J R.; Bajorath, J J Chem Inf Comput Sci 2003, 43, 182-188 (c) Menard, P R.; Mason, J S.; Morize, I.; Bauerschmidt, S J Chem Inf Comput Sci 1998, 38, 1204-1213 (d) Schnur, D J Chem Inf Comput Sci 1999, 39, 36-45.
21 (a) Oprea, T I.; Gottfries, J J Comb Chem 2001, 3, 157-166 (b) Bergstrom, C A et al J Chem
Inf Comput Sci 2004, 44, 1477-1488.
based on activity, drug-similarity, and sources in chemical descriptor space. Most studies used
an available database as a surrogate to define a class of compounds. For example, the ACD22 and SPRESI23 databases are commonly regarded as representing inactive, non-drug compounds. A number
of databases of drugs and pharmacologically interesting agents (e.g., WDI,24 MDDR,25 CMC26)
are treated as surrogates of drug space. More sophisticated studies pre-treated databases for the purposes of the experimental comparison. An interesting problem associated with the use of these databases is that they grow over time; therefore, they cannot always be treated as if sampled from a static probability distribution, i.e., the probability distribution can vary dramatically over time. It has been noted that there has been a shift to higher molecular weight for compounds in clinical trials over the past few years. This suggests that the concept of a drug, and hence the characteristics of
drug molecules, is not static but evolves over time.
Activity-based mapping
Selection of compound sets in a screening campaign can be based on the coverage of a whole set in some diversity space and/or distances between compounds. On the other hand, if there are known chemotypes with the same biological activity, these become the seeds for
22 Available Chemicals Directory (2002.1 version) contains grade and bulk chemicals. The database is
available from MDL Information Systems Inc., 14600 Catalina Street, San Leandro, CA, 94577.
Website: http://www.mdli.com/products/acd.html
23 The SPRESI database is produced by the All-Union Institute of Scientific and Technical Information of
the Academy of Science of the USSR (VINITI) in Moscow and the Central Information Processing for Chemistry (ZIC) in Berlin. This database consists of data extracted from 1,000 journals and patents, books, and other sources from 1975 to 1990. SPRESI is distributed by Daylight Chemical Information Systems, Inc., Mission Viejo, CA.
24 World Drug Index (Derwent Information, London, UK).
25 MACCS-II Drug Data Report (2002.1 version) contains biologically active compounds in the different
stages of drug development as presented in the patent literature, journals, meetings and congresses. The database is available from MDL Information Systems Inc., 14600 Catalina Street, San Leandro, CA,
94577. Website: www.mdli.com/products/mddr.html
26 Comprehensive Medicinal Chemistry (2002.1 version) contains compounds used or studied as
medicinal agents in humans and pharmaceutical compounds. It is derived from the Drug Compendium in Pergamon’s Comprehensive Medicinal Chemistry. The database is available from MDL Information Systems Inc., San Leandro, CA, 94577. Website: http://www.mdli.com/products/cmc.html
compound selection. From the structural information of such seeds, a representative set of compounds with similarity (i.e., “activity-enriched clusters” or “activity-prioritized screening
lists”) can be chosen for subsequent screening.27a For example, application of k-means
clustering with topological substructure analysis is able to distinguish selected sets of
antibacterial agents from others.27b CNS activities mined from the WDI were shown to be classified by substructure analysis.27c Retrospective modeling using seven decision trees of
various complexity over 15,000 compounds isolated a very small number (37) of “highly active” or “moderately active” compounds identified in an HTS experiment. The combination
of multiple decision trees with diverse classifiers is surprisingly successful, resulting in higher
fold-enrichments.27d
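Fold-enrichment is used above without a formula; a common definition, assumed here, is the hit rate in the selected subset divided by the hit rate in the whole collection. The Python sketch below uses the 37 actives among 15,000 compounds mentioned in the text, while the selection of 500 compounds containing 10 actives is purely hypothetical.

```python
def fold_enrichment(actives_selected, n_selected, actives_total, n_total):
    """Hit rate in the selected subset relative to the hit rate overall."""
    if n_selected <= 0 or n_total <= 0 or actives_total <= 0:
        raise ValueError("subset size, library size and total actives must be positive")
    selected_rate = actives_selected / n_selected
    baseline_rate = actives_total / n_total
    return selected_rate / baseline_rate

# 37 actives in 15,000 compounds (from the text); a model that picks
# 500 compounds containing 10 of those actives is a hypothetical outcome.
print(round(fold_enrichment(10, 500, 37, 15000), 2))
```

A value of 1 means the selection does no better than random picking; values well above 1 indicate that the model concentrates actives in the prioritized list.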
The increase in the number of compounds with reliable biological annotation enables activity modeling with higher resolution. Hierarchical consensus modeling of a series of neural networks trained by compounds from the MDDR database and random
compounds generated sets of compounds that act on biological targets belonging to specific
gene families. Similar studies discriminating between modulators of, for example, GPCR and non-GPCR targets,
or GPCRs and kinases, produced efficient predictive models for whether or not a compound could
become a GPCR ligand.28a Applying this information to ligand-based library design might well focus the exploration of chemical space relevant to GPCR-targeting.28b
Mapping drug-space
Drug-like properties collectively come from a wide range of in vivo mechanisms, many of which are not well characterized; therefore, it is currently impossible to predict such properties
27 (a) Young, S S et al Stat Sci 2001, 16, 154-168 (b) Molina, E.; Diaz, H G.; Gonzalez, M P.;
Rodriguez, M.; Uriarte, E J Chem Inf Comput Sci 2004, 44, 515-521 (c) Engkvist, O.; Wrede, P.; Rester, U J Chem Inf Comput Sci 2003, 43, 155-160 (d) Van Rhee, A M J Chem Inf Comput Sci 2003, 43, 941-948.
28 (a) Manallack, D T.; Pitt, W R.; Gancia, E.; Montana, J G.; Livingstone, D J.; Ford, M G.; Whitley,
D C J Chem Inf Comput Sci 2002, 42, 1256-1262 (b) Balakin, K V.; Tkachenko, S E.; Lang, S A.; Okun, I.; Ivashchenko, A A.; Savchuk, N P J Chem Inf Comput Sci 2002, 42, 1332-1342.
from first principles. Instead, heuristic methods, rules based on relatively simple compound properties and derived from available experimental data, are used. The most widely used drug-like
filter was developed by Lipinski by analyzing tendencies in simple molecular descriptors
for a reference set of 2,300 compounds all passing phase I clinical trials.29 This filter
encompasses the following criteria: molecular weight ≤ 500, clogP ≤ 5, number of H-bond
donors ≤ 5, and number of H-bond acceptors ≤ 10.
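The criteria listed above can be expressed as a simple filter. The Python sketch below is illustrative only: the `Compound` record, its field names, and the two example compounds are hypothetical, and treating each bound as inclusive is an assumption of this sketch.

```python
from dataclasses import dataclass

@dataclass
class Compound:
    """Hypothetical record holding precomputed descriptors for one molecule."""
    name: str
    mol_weight: float
    clogp: float
    h_donors: int
    h_acceptors: int

def lipinski_violations(c):
    """Return the list of criteria that the compound fails."""
    checks = [
        ("molecular weight", c.mol_weight <= 500),
        ("clogP", c.clogp <= 5),
        ("H-bond donors", c.h_donors <= 5),
        ("H-bond acceptors", c.h_acceptors <= 10),
    ]
    return [label for label, ok in checks if not ok]

def passes_filter(c, max_violations=0):
    """True if the compound fails no more than max_violations criteria."""
    return len(lipinski_violations(c)) <= max_violations

# Hypothetical descriptor values, not taken from any database cited here.
small = Compound("example-small", 300.0, 2.5, 2, 5)
large = Compound("example-large", 750.0, 6.2, 6, 12)
print(passes_filter(small), lipinski_violations(large))
```

Because the filter only tests four scalar descriptors, many non-drug compounds also pass it, which is consistent with the view that these criteria are necessary rather than sufficient.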
Though the use of these simple physicochemical descriptors can characterize drug-space
very well, it does not have enough power to distinguish drug-like space from others; therefore,
it is more appropriate to think of these criteria as necessary but not sufficient conditions to describe drug-like space. For example, it was shown that there are more compounds in the ACD (surrogate of “non-drugs”) that are compliant with Lipinski’s rules, compared with
compounds from the MDDR database (surrogate of “drugs”).? From the carefully curated,
expanded dataset, 80% of the compounds in the non-drug space are compliant with Lipinski’s
first pass through the liver and lungs.
A similar drug-like filter proposed by Veber came from the analysis of a proprietary
database of 1,100 drug candidates and systematic oral bioavailability data from a single animal
29 (a) Lipinski, C A.; Lombardo, F.; Dominy, B W.; Feeney, P J Adv Drug Deliv Rev 1997, 23, 3-25 (b) Xu, J.; Stevenson, J J Chem Inf Comput Sci 2000, 40, 1177-1187 (c) Oprea, T I J Comput-Aided Mol Des 2000, 14, 251-264.
species (rat). Only two molecular properties, a polar surface area less than 140 Å² and fewer
than twelve rotatable bonds, were enough for the prediction of sufficient oral bioavailability.30a
Recently, hierarchical classification schemes were reported. Here, the dominant charge at biological pH determines different properties (e.g., rotatable bonds, PSA, Lipinski’s rules) that
govern the bioavailability of compounds.30b
Data mining studies addressing drug-likeness cover substructures (e.g., frameworks, side chains, and functional groups), descriptors (e.g., atom environments, drug-like indices, and pharmacophore filters), and the use of supervised-learning algorithms (e.g., neural networks and
recursive-partitioning methods).°**
Using the Comprehensive Medicinal Chemistry (CMC) and the MDDR as representatives of drug-like molecule databases and the ACD as a surrogate for nondrug-like molecules, neural network models to classify drug-like and nondrug-like molecules were reported. These
analyses are based on both one-dimensional descriptors (molecular weight, topological indices,
atom types) and two-dimensional descriptors.31a A genetic algorithm was used to distinguish
between drug-like and nondrug-like compounds using relatively simple descriptors (molecular
weights, the numbers of H-bond donors and acceptors, rotatable bonds, and aromatic rings).31b
Here, compounds from the WDI were assumed to comprise a drug-like dataset, and compounds from the SPRESI database were presumed to be a nondrug-like dataset. Decision-tree analysis
was also applied to the same dataset.31c Although these approaches showed reasonable accuracy
in their performances, each is highly dependent upon training datasets, and lacking in
30 (a) Veber, D F.; Johnson, S R.; Cheng, H.-Y.; Smith, B R.; Ward, K W.; Kopple, K D J Med
Chem 2002, 45, 2615-2623 (b) Martin, Y J Med Chem 2005, 48, 3164-3170 (c) Ghose, A K.; Viswanadhan, V N.; Wendoloski, J J Comb Chem 1999, 1, 55-68 (d) Bemis, G W.; Murcko, M A J Med Chem 1996, 39, 2887-2893 (e) Muegge, I.; Heald, S L.; Brittelli, D J Med Chem 2001,
44, 1841-1846.
31 (a) Frimurer, T M.; Bywater, R.; Narum, L.; Lauritsen, L N.; Brunak, S J Chem Inf Comput Sci
2000, 40, 1315-1324, and earlier works include (i) Ajay; Walters, W P.; Murcko, M A J Med Chem
1998, 41, 3314-3324 (ii) Sadowski, J.; Kubinyi, H A J Med Chem 1998, 41, 3325-3329 (b) Gillet, V J.; Willett, P.; Bradshaw, J J Chem Inf Comput Sci 1998, 38, 165-179 (c) Wagener, M.; van
Geerestein, V J J Chem Inf Comput Sci 2000, 40, 280-292.
Mapping based on the origin of compounds
The significant roles of small molecules in modulating biological processes might become more apparent by just enumerating compounds found naturally in biological systems. A
collection of these compounds, with chemical structures, is located in the COMPOUND section
of the KEGG/LIGAND database, the total number of which is 13,000 (as of August 2005). These are roughly classified, according to the source, into 10% drug-related compounds, 30% phytochemical compounds (secondary metabolites in plants), and 60% metabolites and other compounds originating mostly from the KEGG metabolic pathways (Figure 1.8).32 The use of synthetic compounds without any biological origin might come from the pioneering work of
Paul Ehrlich. He discovered arsphenamine (Salvarsan), which greatly improved the treatment of
syphilis, by systematically screening over 600 synthetic compounds available.
[Figure 1.8 structures omitted; representative compounds: D-Glucose, Cinnamate, DNA, Menthol, Ergosterol]
Figure 1.8 Common substructures of the top clusters from the KEGG/LIGAND database. For each cluster, a representative compound is shown with its name, and the maximum common subgraph is in red.
Comprehensive analysis of the structural and property differences between drugs, natural products, and combinatorial libraries showed that natural products and combinatorial libraries lie at two extremes, with drugs being intermediate in character.33 The results of this analysis suggest that the design of a synthetic library should veer towards natural-product-like features. Comparative analysis based on the database of 10,495 natural products, a collection of
32 (a) Hattori, M.; Okuno, Y.; Goto, S.; Kanehisa, M J Am Chem Soc 2003, 125, 11853-11865 (b) Goto, S.; Okuno, Y.; Hattori, M.; Nishioka, T.; Kanehisa, M Nucleic Acids Res 2002, 30, 401-404.
33 (a) Feher, M.; Schmidt, J M J Chem Inf Comput Sci 2003, 43, 218-227 (b) Lee, M-L.; Schneider,
G J Comb Chem 2001, 3, 284-289 (c) Zuccotto, F J Chem Inf Comput Sci 2003, 43, 1542-1552.
combinatorial libraries, and the WDI revealed a number of structural features prevalent in natural products: they have higher molecular weights, have more stereogenic centers and fewer rotatable bonds, have larger, more complex and diverse ring systems, are lower in
nitrogen, sulfur, and halogen content, are higher in oxygen content, and have more hydrogen bond donors and acceptors. Notably, due to both acyclic and cyclic conformational constraints, natural products tend to be comparatively rigid, a property that may be associated with reduced
entropic cost of binding and improved oral bioavailability. Combinatorial libraries were also
found to be significantly more hydrophobic than either drugs or natural products.33
Shannon entropy was applied to quantify the information content of major descriptors within
a compound library, and the distribution of these measures was sufficiently distinct among
combinatorial libraries.** Subsets of property and substructure descriptors for differentiating
between natural compounds (Chapman and Hall compendium of natural products) and synthetic
compounds (ACD) were utilized to build a simple regression model.**>*
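How Shannon entropy quantifies the information content of a descriptor can be sketched in a few lines of Python. The sketch assumes that descriptor values are first binned into a fixed-range histogram; the bin count, the range, and the two example value lists are hypothetical and are not taken from the cited studies.

```python
import math
from collections import Counter

def shannon_entropy(values, low, high, bins):
    """Entropy (in bits) of a descriptor's histogram over [low, high]."""
    counts = Counter(
        min(max(int((v - low) / (high - low) * bins), 0), bins - 1)
        for v in values
    )
    n = len(values)
    # Only occupied bins contribute; empty bins have p = 0 and add nothing.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical descriptor values for two small libraries.
spread_out = [0.5, 1.5, 2.5, 3.5]   # one value in each of four bins
clustered = [1.1, 1.2, 1.3, 1.4]    # all values in the same bin

print(shannon_entropy(spread_out, 0.0, 4.0, 4))
print(shannon_entropy(clustered, 0.0, 4.0, 4))
```

A library whose values spread evenly over the bins reaches the maximum entropy of log2(bins), whereas a library concentrated in one bin has zero entropy, so comparing entropies per descriptor gives a rough indication of which descriptors vary most within and between collections.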
34 (a) Bajorath, J J Comput-Aided Mol Design 2002, 16, 431-439 (b) Stahura, F L.; Godden, J W.; Xue, L.; Bajorath, J J Chem Inf Comput Sci 2000, 40, 1245-1252 (c) Buckingham, J Dictionary of
Natural Products (Chapman & Hall/CRC, 2002).
1.2 Biological measurement space
1.2.1 Complexity and diversity in biological space
Although living systems follow the basic laws of physics and chemistry, biological problems
are not answered purely based on these laws, simply because of the diversity of and complexity
residing in biological systems The complete sequence of the human genome provides the
means to identify all of the heritable elements in biological systems; however, it has become clear that the detailed inventory of cellular components (i.e., genes, macromolecules, and
metabolites) is not sufficient to understand the systems’ behavior.1 Biological responses cannot
be rationally predicted without a comprehensive understanding of the intracellular biochemical and genetic interactions, which give rise to general principles of regulation and to the diverse nature of
responses on different levels.
[Figure 2.1 diagram omitted; labels: Universal; Robustness, Fragility; {metabolites, hormones, etc.}; Universal design: information flow; Natural variations; Biodiversity]
Figure 2.1 Complexity and diversity in biological systems. Phenotypic bifurcations (robustness and fragility) of biological systems may result from complex interactions. Nevertheless, a complex
biological system is organized in a modular, hierarchical manner, which is universal. The components of
each system are very different from each other (at the bottom of the complexity axis). On the other hand, evolutionary constraints of biological systems along the diversity axis might afford important clues to
phenotypic variations.
1 (a) Lander, E S et al Nature 2001, 409, 860-921 (b) Austin, C P Curr Opin Chem Biol 2003, 7,
511-515.
2 Kitano, H Science 2002, 295, 1662-1664.
According to the central dogma of molecular biology, DNA is the ultimate repository of
biological complexity.’ In general, it is accepted that information storage, information
processing, and the implementation of diverse cellular programs would be located in distinct
domains of organization: genome, transcriptome, proteome, and metabolome.*? Nonetheless,
the functional distinctness of these organizational levels has recently been scrutinized. For example, although long-term information is stored almost exclusively in the genome, the
proteome is essential for short-term information storage, and transcription factor-controlled information retrieval is strongly influenced by the state of the metabolome.
Each eukaryotic cell is very complex since it is composed of an exceedingly large number of
macromolecules that interact with each other and with low molecular-weight components (e.g.,
metabolites and hormones) to yield nonlinear behavior that has been fine-tuned by natural
selection to achieve specific functional properties. Furthermore, cellular processes may be disassembled into basic “operating units” or “modules”, subsystems of interacting
macromolecules and low-molecular-weight components that perform a given function (e.g., signal transduction, protein synthesis, and cell-cycle regulation) in a largely context-independent
manner.4 Consequently, biological systems are complex, but also modular and
hierarchical, awareness of which opens new avenues to understanding.
Such complexity may form the basis for phenotypic variations along the complexity axis (Figure 2.1). Furthermore, along the complexity axis, one can observe a shift from the specific (at the bottom level) to the universal (at the top level) in biological systems. Undoubtedly, the exact catalog of components (i.e., genes, metabolites, and proteins) is unique
to each species. For example, only 4% of metabolites are shared between 43 organisms
3 (a) Schreiber, S L Nat Chem Biol 2005, 1, 64-67 (b) Mangelsdorf, D J.; Evans, R M Cell 1995,
83, 841-850.
4 (a) Hartwell, L H.; Hopfield, J J.; Leibler, S.; Murray, A W Nature 1999, 402, C47-C52 (b) Petty, H
R ChemBioChem 2004, 5, 1359-1364 (c) Tyson, J J.; Chen, K C.; Novak, B Curr Opin Cell Biol
2003, 15, 221-231.
examined; however, main metabolic pathways and modules are frequently shared.5
These modules, with groups of heterogeneous and unique components, are assumed to
interact to form larger networks.’ There is unambiguous proof for the existence of such cellular
networks; indeed, the proteome organizes itself into a protein-interaction network and
metabolites are interconverted through intricate reaction networks. Theoretical conclusions that the global organization of such networks is governed by the same principles may come as a
surprise, but offer a new perspective on cellular organization. It remains, however, to be seen whether or not an even higher degree of universality is present on the module level. The
hierarchical relationship among modules, in turn, is apparently quite universal, shared by all metabolic and protein interaction networks studied.
On the other hand, principles governing biological systems may come from the elucidation
of evolutionary constraints selected over the changes in environment and over internal failures (i.e., DNA damage, genetic malfunctions). Without the modulation of the time domain, the comparative surveys of biodiversity (Figure 2.1) can impart some evidence. The survival of
living systems implies that the critical parameters of essential modules should be robust; that is,
they are insensitive to many environmental and genetic perturbations. Evolvability, on the other hand, requires that other parameters of modules should be sensitive to genetic changes.6 It is important to understand how such robustness and sensitivity can be reconciled within each functional module. Emergent properties frequently found over various complex systems may
give clues to these contradictory observations.7
> (a) Lee, T I et al, Science 2002, 298, 799-804 (b) Milo, R et al., Science 2002, 298, 824-829 (c)
Jeong, H.; Tombor, B.; Albert, R.; Oltvai, Z N.; Barabasi, A L Nature 2000, 407, 651-654.
6 Kirschner, M.; Gerhart, J Proc Natl Acad Sci U.S.A 1998, 95, 8420-8427
7 (a) Carlson, J M.; Doyle, J Proc Natl Acad Sci U.S.A 2002, 99, 2538-2545 (b) Zhou, T.; Carlson,
J M.; Doyle, J Proc Natl Acad Sci U.S.A 2002, 99, 2049-2054.