Doctoral thesis: Small molecule-based approach to chemistry and biology: Synthesis, measurement, and analysis



HARVARD UNIVERSITY

Graduate School of Arts and Sciences

THESIS ACCEPTANCE CERTIFICATE

The undersigned, appointed by the

Department of Chemistry and Chemical Biology

have examined a thesis entitled

Small Molecule-Based Approach to Chemistry and Biology: Synthesis, Measurement, and Analysis

presented by Young-kwon Kim

candidate for the degree of Doctor of Philosophy and hereby

Signature

Typed name: P

Signature

Typed name: Prof. David Liu

Signature

Typed name: Prof. Daniel Kahne

Date: December 7, 2005


Small Molecule-Based Approach to Chemistry and Biology:

Synthesis, Measurement, and Analysis

A thesis presented

by

Young-kwon Kim

to

The Department of Chemistry and Chemical Biology

in partial fulfillment of the requirements

for the degree of Doctor of Philosophy

in the subject of

Chemistry and Chemical Biology

Harvard University Cambridge, Massachusetts

December 2005


UMI Number: 3205917


UMI Microform 3205917. Copyright 2006 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.

ProQuest Information and Learning Company

300 North Zeeb Road

P.O. Box 1346, Ann Arbor, MI 48106-1346


Small Molecule-Based Approach to Chemistry and Biology:

Synthesis, Measurement, and Analysis

Young-kwon Kim

Research Adviser: Professor Stuart L. Schreiber

7 December 2005

Abstract

Small molecules have long played important roles in the advancement of biology; however, little meta-insight has been gained during this period. This thesis presents two studies that aim to uncover the relationships between chemical space and biological measurement space.

The first chapter comprises literature surveys of chemical descriptor space, biological measurement space (outputs), and analysis methods to link them. An emphasis on the role of diversity-oriented synthesis in populating accessible chemical space (inputs) is offered.

The second chapter describes a methodology that uses well-defined inputs provided by diversity-oriented synthesis and robust readouts from a series of chemical genetic modifier screens. Subsequent multidimensional data analysis confirms the intuition of the scientists yet adds methodical rigor, while simultaneously discovering novel patterns of biological activity that correlate with stereochemistry in a subtle and unexpected way. Significant variations in biological outcomes were found to result from the stereochemical and skeletal elements in small molecules. Such insights facilitate efficient searching and probing of chemical space.

The third chapter reports the development of analytical implements and illustrates that the relevance network is robust and flexible. The resulting analysis environment enables the visualization of significant associations between small molecules. A larger number of structurally and functionally heterogeneous inputs (small molecules) are efficiently examined based on a small-molecule annotation dataset and subsequently validated. Furthermore, novel hypotheses on the biological mechanisms of small molecules are proposed using already annotated small molecules.


Table of Contents

Abstract

Abbreviations

Dedication

Chapter 1 Introduction

1.1 Chemical descriptor space

1.2 Biological measurement space

1.3 Multidimensional data analysis

1.4 Sampling chemical space by diversity-oriented synthesis

Chapter 2 Case Study I

2.1 Relationship of skeletal and stereochemical diversity to cellular measurement space

2.2 Supporting information

Chapter 3 Case Study II

3.1 Construction and analysis of relevance network from small-molecule annotation

3.2 Supporting information


acetic acid

activation domain

acute myelogenous leukemia

angiotensin

adenosine 5’-triphosphate

binding domain

building block

5-bromo-2’-deoxyuridine

adenosine 3’,5’-cyclic monophosphate

calcein-acetoxymethyl esters

ceric ammonium nitrate

cholecystokinin receptor

methylene chloride

acetonitrile

chloroform

chemical global positioning system

chemical ionization-mass spectrometry

comprehensive medicinal chemistry

central nervous system

comparative molecular field analysis

dichloromethane

DNA deoxyribonucleic acid

DOS diversity-oriented synthesis

EC50 effective concentration of half-maximal effect

EDC 1-ethyl-3-(3’-dimethylaminopropyl)carbodiimide hydrochloride

EI-MS electron impact-mass spectrometry

ELISA enzyme-linked immunosorbent assay

EM expectation-maximization

Et2O diethyl ether

EtOAc ethyl acetate

Et ethyl

ES-MS electrospray-mass spectrometry

FAB-MS fast atom bombardment-mass spectrometry

FTIR Fourier transform infrared spectrometry

GA genetic algorithm

GE-HTS gene expression-based high-throughput screening

GPCR G protein-coupled receptor

GRIND grid-independent descriptors

h hours

HCS high-content screening

HDAC histone deacetylase

hydroxysteroid dehydrogenase

hydroxytryptamine

high-throughput screening

Hertz

high-pressure liquid chromatography

highest scoring common substructure

iso-propyl alcohol

knowledge discovery in database

Kyoto encyclopedia of genes and genomes

tandem liquid chromatography-mass spectrometry

magic angle spinning nuclear magnetic resonance spectroscopy

multi-component reaction

multidimensional scaling

methyl

2,4,6-trimethylphenyl

methanol

megahertz

minutes

magnesium sulfate

mass spectrometry

MACCS-II drug data report

3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide

sodium sulfate

nuclear magnetic resonance spectroscopy

NR nuclear receptor

PCA principal component analysis

PCR polymerase chain reaction

PEG polyethylene glycol

PSA polar surface area

p-TsOH para-toluenesulfonic acid

PyBOP benzotriazol-1-yloxytripyrrolidinophosphonium hexafluorophosphate

pybox pyridine-bis(oxazoline)

PyBroP bromotripyrrolidinophosphonium hexafluorophosphate

pyr pyridine

QSAR quantitative structure activity relationship

QUINAP [1-(2-diphenylphosphino-1-naphthyl)isoquinoline]

RNA ribonucleic acid

RNAi RNA interference

ROF rule-of-five

SMILES simplified molecular input line entry specification

SMM small-molecule microarray

SOM self-organizing map

SOSA selective optimization of side activities

TBS tert-butyldimethylsilyl

TES triethylsilyl


TfOH trifluoromethanesulfonic acid

Y2H yeast two-hybrid

Y3H yeast three-hybrid

[M] Macrobeads

Silyl functionalized, 500-600 μm PS, 1% cross-linked by divinylbenzene

To my parents


Chemical descriptor space

Biological measurement space

Multidimensional data analysis

Sampling chemical space by diversity-oriented synthesis


1.1 Chemical descriptor space

1.1.1 Chemical (descriptor) space

Frequently, the term “chemical space” is used as a colloquialism referring to a conceptual framework for formulating relations between molecular structures and/or properties. Chemical space, which encompasses all possible small organic molecules, has no theoretical limit, but can be reduced according to practical concerns: synthetic feasibility, user accessibility, drug-like properties, and the ability to modulate biological processes.¹

[Figure: chemical space (in silico; data mining and analysis; computational scientist), feasible chemical space (in cerebro; strategy and methodology; synthetic chemist), and accessible chemical space (in vivo, in vitro). Surviving caption fragment: “collection of commercially available reagents.”]

Based on synthetic feasibility, chemical space is reduced to “feasible chemical space”. Feasible chemical space can also be defined in various ways, even without real synthetic considerations. For example, in silico combinatorial enumerations of common appendages and core skeletons in chemical databases delineate the boundary of a chemical space.²

Further reduction to “accessible chemical space” can be primarily based on scientific demands. For example, accessible chemical space for the chemical biologist is populated by natural products, commercially available compounds, and libraries derived from diversity-oriented synthesis, each ready for interrogating biological systems of interest. These compounds should be of sufficient quantity and purity, and should come with adequate explicit/implicit structural information. For synthetic chemists, the development of novel synthetic strategies and methodologies might expand feasible chemical space significantly; indeed, the execution of diversity-oriented synthesis can extensively populate the accessible chemical space.

1 (a) Dobson, C. M. Nature 2004, 432, 824-828. (b) Lipinski, C.; Hopkins, A. Nature 2004, 432, 855-861. (c) Hann, M. M.; Oprea, T. I. Curr. Opin. Chem. Biol. 2004, 8, 255-263. (d) Bohacek, R. S.; McMartin, C.; Guida, W. C. Med. Res. Rev. 1996, 16, 3-50. (e) Czarnik, A. C. Chemtracts 1995, 8, 13-18.

2 Ertl, P. J. Chem. Inf. Comput. Sci. 2003, 43, 374-380.

Chemical descriptor space: mathematical definition

Chemical descriptor space is a vector (metric) space defined by a number of chemical descriptors for each small molecule. In general, each of n selected chemical descriptors adds a dimension to an n-dimensional vector space, and each small molecule is assigned to coordinates in this vector space according to the scaled values of its chemical descriptors (Figure 1.2a). For visualization, an n-dimensional chemical descriptor space can be projected onto fewer dimensions by a variety of dimensionality reduction methods. As shown in Figure 1.2b, each axis is replaced by a latent variable derived from the original descriptor set. Sometimes chemical space is partitioned by a number of binned descriptors, represented by a number of cells as shown in Figure 1.2c.³
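The coordinate assignment and cell-based partitioning described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in the thesis: the three descriptors, their values, the min-max scaling, and the choice of three bins per axis are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def scale_descriptors(table: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Min-max scale every descriptor column to [0, 1] over the whole dataset."""
    names = list(table)
    n_desc = len(table[names[0]])
    lows = [min(table[m][j] for m in names) for j in range(n_desc)]
    highs = [max(table[m][j] for m in names) for j in range(n_desc)]
    scaled = {}
    for m in names:
        scaled[m] = [
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(table[m], lows, highs)
        ]
    return scaled


def assign_cell(point: List[float], bins: int) -> Tuple[int, ...]:
    """Map scaled coordinates to a cell index by binning each axis."""
    return tuple(min(int(x * bins), bins - 1) for x in point)


# Hypothetical descriptor values: (molecular weight, log P, rotatable bonds).
raw = {
    "mol_A": [150.2, 2.7, 1.0],
    "mol_B": [310.4, 0.5, 6.0],
    "mol_C": [230.3, 1.6, 3.0],
}
coords = scale_descriptors(raw)          # coordinates in descriptor space
cells = {name: assign_cell(vec, bins=3) for name, vec in coords.items()}
```

Each molecule ends up with one coordinate vector and one cell index; molecules sharing a cell index would be treated as neighbours in a cell-based partitioning.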

[Figure 1.2 panel titles: n-dimensional chemical descriptor space; reduced space by latent variables; 18 cells divided by 5 partitioning]

Figure 1.2 Chemical descriptor space. (a) n-dimensional chemical descriptor space. (b) For visualization, n-dimensional chemical descriptor space can be reduced into two- or three-dimensional space using proper dimensionality reduction methods. Each axis is represented by a latent variable from the original descriptor set. (c) Cell-based representation of chemical descriptor space.

3 (a) Bajorath, J. J. Chem. Inf. Comput. Sci. 2001, 41, 233-245. (b) Gasteiger, J.; Engel, T. Chemoinformatics: a textbook (Wiley-VCH, Weinheim, 2003), pp 15-268.

Role of chemical descriptor space

The role of chemical descriptor space is divided into two elements: storage and retrieval of chemical information related to large compound collections in databases, and rigorous analysis of the properties (i.e., measurement space) of small molecules associated with their structural features encoded by chemical descriptors. The process of assigning each small molecule in feasible (F) or accessible chemical space (A) to chemical descriptor space based on its chemical descriptors can be referred to as “representation” (Figure 1.3). On the other hand, analysis of chemical descriptor space and measurement space can generate a number of hypothetical models to be tested. These models are testing grounds for the practical significance of chemical descriptor space as a valid method for linking chemical space and measurement space.⁴

Moreover, the construction of chemical descriptor space is much cheaper and more consistent than both empirical synthesis and biological testing. Therefore, chemical descriptor space might make possible valid predictions of routes from accessible to feasible chemical space. For example, thoughtful extension of validated models from the analysis of accessible chemical space and measurement space might provide guidelines for a second-phase synthesis directed at molecules with improved measured outcomes.

In short, dynamic integration of synthetic chemistry, assay measurements, and data analysis might enable us to constantly evaluate overall processes in order to provide probabilistic, statistically significant predictions.

4 Strausberg, R. L.; Schreiber, S. L. Science 2003, 300, 294-295.

1.1.2 Chemical descriptors

Representation: search and retrieval

Molecular structures are usually represented, manipulated, and stored as molecular graphs. Graph theory is a well-established branch of mathematics that has found applications in chemistry as well as in many other areas. A graph is an abstract formalism that contains nodes connected by edges. In a graph representation (Figure 1.4a) the nodes correspond to atoms and the edges to bonds. Note that hydrogen atoms are often omitted. These atom and bond attributes are important when performing operations on the molecular graph. A graph represents only the topology of a molecule; that is, the way the nodes are connected. Therefore, a given graph may be drawn in many different ways and may not obviously correspond to a “standard” chemical diagram.

[Figure 1.4: (R)-carvone, (R)-2-methyl-5-(prop-1-en-2-yl)cyclohex-2-enone, and (S)-carvone, (S)-2-methyl-5-(prop-1-en-2-yl)cyclohex-2-enone, shown with a connection table (number of atoms, number of bonds, atom coordinates). Surviving caption fragment: “information in red (d) Examples of ambiguous graphical representations.”]

The most common method to encode molecular graphs in a numeric format is a connection table. The simplest type of connection table consists of two sections: a list of the atom numbers and positions of the atoms in the molecule, and a list of the bonds, specified as pairs of bonded atoms. As shown in Figure 1.4, a connection table can encode a variety of information. A simpler way to represent a molecular graph is through the use of linear notation; SMILES (Simplified Molecular Input Line Entry Specification) has been used extensively in this way, and can further be used to encode stereochemical information (Figure 1.4c).⁵
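A connection table of the simplest kind can be written down directly as two Python containers. The sketch below is illustrative only; ethanol (SMILES "CCO") is chosen as an assumed example, coordinates are left out, and the helper names are hypothetical. It also shows why one topology admits several equally valid numberings.

```python
from collections import defaultdict

# Connection table for ethanol (SMILES "CCO"), hydrogens omitted.
atoms = {1: "C", 2: "C", 3: "O"}      # atom number -> element
bonds = [(1, 2, 1), (2, 3, 1)]        # (atom i, atom j, bond order)


def adjacency(bond_list):
    """Turn the bond section into a node -> neighbours mapping (the graph)."""
    neighbours = defaultdict(set)
    for i, j, _order in bond_list:
        neighbours[i].add(j)
        neighbours[j].add(i)
    return dict(neighbours)


def renumber(atom_map, bond_list, mapping):
    """Apply a new atom numbering; the molecule itself is unchanged."""
    new_atoms = {mapping[i]: element for i, element in atom_map.items()}
    new_bonds = [(mapping[i], mapping[j], order) for i, j, order in bond_list]
    return new_atoms, new_bonds


graph = adjacency(bonds)
# The same molecule with the oxygen numbered first (as in SMILES "OCC").
atoms2, bonds2 = renumber(atoms, bonds, {1: 3, 2: 2, 3: 1})
graph2 = adjacency(bonds2)
```

Both tables describe ethanol, yet they differ entry by entry, which is the reason structure retrieval needs either a canonical numbering or an isomorphism test.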

Graphical representation has a relative deficiency, since it allows only one valence bond model in each structure representation. For example, benzene can be represented in two ways, which increases the time needed to search and retrieve exact structures based on graphical representation. Approximately 0.5% of compounds from commercial collections (550K) contain tautomers or ambiguous functional groups (Figure 1.4d).⁵

Since graphs can be constructed in many different ways, it is necessary to have methods to determine whether two graphs are the same. In graph-theoretic terms this problem is known as graph isomorphism. No polynomial-time algorithm is known for the general graph isomorphism problem, and the computational time involved can increase steeply for larger molecules and compound sets. There are two distinct approaches for efficient retrieval of compounds from databases: generation of new chemical descriptors, such as binary (“bit-string”) fingerprints, and development of efficient algorithms (e.g., heuristic models).⁵
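To make the cost of exact graph comparison concrete, the following sketch tests isomorphism by trying every atom-to-atom mapping. It is a deliberately naive illustration under stated assumptions (bond orders are ignored, and the function and variable names are hypothetical); production systems rely on canonical labelling or far better search strategies.

```python
from itertools import permutations


def bond_set(pairs):
    """Store bonds as unordered index pairs."""
    return {frozenset(p) for p in pairs}


def is_isomorphic(atoms_a, bonds_a, atoms_b, bonds_b):
    """Brute-force check: does some renumbering of A reproduce B exactly?"""
    # Cheap invariants first: same element counts and same number of bonds.
    if sorted(atoms_a) != sorted(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    n = len(atoms_a)
    for perm in permutations(range(n)):          # n! candidate mappings
        if any(atoms_a[i] != atoms_b[perm[i]] for i in range(n)):
            continue                             # element labels must agree
        mapped = {frozenset(perm[i] for i in bond) for bond in bonds_a}
        if mapped == bonds_b:
            return True
    return False


ethanol = (["C", "C", "O"], bond_set([(0, 1), (1, 2)]))
ethanol_renumbered = (["O", "C", "C"], bond_set([(0, 1), (1, 2)]))
dimethyl_ether = (["C", "O", "C"], bond_set([(0, 1), (1, 2)]))

same = is_isomorphic(*ethanol, *ethanol_renumbered)
different = is_isomorphic(*ethanol, *dimethyl_ether)
```

The loop over all permutations grows factorially with the number of atoms, which is why fingerprints are used to avoid most of these comparisons.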

A chemical descriptor is a term describing a molecular structure in quantitative terms. The use of chemical descriptors makes possible further understanding and even prediction of chemical and biological properties in structural terms. From this perspective, we can discuss the representation of molecular structure (chemical descriptors), methods of comparing small molecules based on these representations, and how these methods can relate to measured properties. Such considerations lead to the concept of “molecular similarity”, its various definitions and uses, and how these definitions have evolved in recent years. Molecular similarity, as a paradigm, contains many implicit and explicit assumptions with respect to the prediction of biological activity.⁶

5 (a) Weininger, D. J. Chem. Inf. Comput. Sci. 1988, 28, 31-36. (b) Trepalin, S. V.; Skorenko, A. V.; Balakin, K. V.; Nasonov, A. F.; Lang, S. A.; Ivashchenko, A. A.; Savchuk, N. P. J. Chem. Inf. Comput. Sci. 2003, 43, 852-860. (c) Wang, X.; Wang, J. T. L. J. Chem. Inf. Comput. Sci. 2000, 40, 442-451. (d) Rhodes, N.; Willett, P.; Calvet, A.; Dunbar, J. B.; Humblet, C. J. Chem. Inf. Comput. Sci. 2003, 43, 443-448. (e) Raymond, J. W.; Gardiner, E. J.; Willett, P. J. Chem. Inf. Comput. Sci. 2002, 42, 305-316.

Traditionally, chemists have described molecular structure in terms of topological, geometric, and electronic features encoded across three levels: constitution, configuration, and conformation. Likewise, classification of chemical descriptors is often based on their dimensionality.⁷

One-dimensional (1D) descriptors include bulk properties such as volume, molecular weight, log P, molar refractivity, and simple counts of atom or bond types (e.g., heavy atom counts, rotatable bond counts).

Two-dimensional (2D) descriptors include topological indices and other graph-based descriptors, derived from graph-theoretic decomposition of the connectivity matrix with real numbers (molecular connectivity indices) or integers (Wiener indices),⁸ based solely on the constitution of compounds. Kier and Hall extended topological indices to include electronic and valence state information, deriving “electro-topological” descriptors which were further refined to “E-state fields”.⁸
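As a worked example of an integer topological index, the Wiener index is the sum of the shortest-path (bond-count) distances over all pairs of heavy atoms. The sketch below computes it by breadth-first search; the atom numbering and the two butane isomers are chosen for illustration, and the function assumes a connected, hydrogen-suppressed graph.

```python
from collections import deque


def wiener_index(n_atoms, bonds):
    """Sum of topological distances over all unordered atom pairs."""
    adj = {i: [] for i in range(n_atoms)}
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    total = 0
    for start in range(n_atoms):
        # Breadth-first search gives bond-count distances from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2  # every pair was counted from both ends


n_butane = wiener_index(4, [(0, 1), (1, 2), (2, 3)])    # C-C-C-C chain
isobutane = wiener_index(4, [(0, 1), (0, 2), (0, 3)])   # central C with 3 methyls
```

The two isomers have the same 1D descriptors (formula, heavy atom count) but different Wiener indices, 10 for the chain and 9 for the branched skeleton, which shows how a 2D index captures constitution.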

Fingerprint descriptors incorporate diverse sets of chemical descriptors in a binary bit-string format with various sizes and complexities, and are designed to be “barcodes” for a molecule. Such formats can capture structural or topological features and/or properties of molecules at the same time. One of the principal differences between various fingerprint designs is whether or not specific bit positions within the string can be reliably mapped to specific chemical features (absence or presence of a pre-defined structural fragment) or descriptor values. This is the case in keyed designs, such as MACCS keys or MFPs.⁸ By contrast, in hashed or folded representations, where features are mapped to corresponding or overlapping bit segments to enhance uniqueness, single bit positions lose apparent physical meaning.⁸

6 Willett, P.; Barnard, J. M.; Downs, G. M. J. Chem. Inf. Comput. Sci. 1998, 38, 983-996.

7 (a) Bajorath, J. J. Chem. Inf. Comput. Sci. 2001, 41, 233-245. (b) Livingstone, D. J. J. Chem. Inf. Comput. Sci. 2000, 40, 195-209, and references therein. (c) Wehrens, R. et al. Anal. Chim. Acta 1999, 400, 413-424.

8 (a) Randic, M. J. Mol. Graph. Model. 2001, 20, 19-35. (b) Randic, M. J. Chem. Inf. Comput. Sci. 2004, 44, 373-371, and references therein. (c) Hall, L. H.; Mohney, B.; Kier, L. B. J. Chem. Inf. Comput. Sci. 1991, 31, 76-82. (d) Kellogg, G. E.; Kier, L. B.; Gaillard, P.; Hall, L. H. J. Comput.-Aided Mol. Des. 1996, 10, 513-520. (e) Torrens, F. Comb. Chem. High Throughput Screen. 2003, 6, 801-809. (f) Xue, L.; Godden, J.; Bajorath, J. J. Chem. Inf. Comput. Sci. 1999, 39, 881-886. (g) McGregor, M. J.; Pallai, P. V. J. Chem. Inf. Comput. Sci. 1997, 37, 443-448. (h) James, C. A.; Weininger, D. Daylight theory manual (Daylight Chemical Information Systems, Inc., Irvine, CA, 1995). (i) Ghuloum, A. M.; Sage, C. R.; Jain, A. J. J. Med. Chem. 1999, 42, 1739-1748.
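The difference between keyed and hashed designs can be illustrated with a toy example. Nothing here reproduces MACCS keys or any real fingerprint: the four-entry key dictionary, the feature strings, the 16-bit length, and the use of CRC32 as the hash are all assumptions made for the sketch.

```python
import zlib

# Hypothetical key dictionary: bit k answers "is feature KEYS[k] present?"
KEYS = ["C=O", "OH", "NH2", "aromatic ring"]


def keyed_fingerprint(features):
    """Each bit position has a fixed, interpretable meaning."""
    return [1 if key in features else 0 for key in KEYS]


def hashed_fingerprint(features, n_bits=16):
    """Each feature is hashed to a bit; different features may collide."""
    bits = [0] * n_bits
    for feature in features:
        # zlib.crc32 is deterministic across runs, unlike Python's salted hash().
        bits[zlib.crc32(feature.encode("utf-8")) % n_bits] = 1
    return bits


molecule_features = {"C=O", "OH"}          # assumed, pre-extracted features
keyed = keyed_fingerprint(molecule_features)
hashed = hashed_fingerprint(molecule_features)
```

In the keyed string, bit 0 always means "carbonyl present"; in the hashed string, a set bit only says that at least one feature hashed to that position, so the mapping back to chemistry is lost.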

Originally, the derivation of three-dimensional (3D) descriptors required geometrical information from points in 3D space. These descriptors are calculated in molecular interaction fields, which require that individual compounds be aligned for property calculations.⁹ The group of field-based descriptors differs from other groups because they use three-dimensional information within a molecule for their derivation. Because the methods of generation require a sufficient number of data points (“grid points”) for a sensible resolution, they are computationally much more demanding than two-dimensional descriptors.⁹ For example, typical 2D descriptors range from 0.5 to 5 kilobits per molecule, while 3D descriptors require more than 3 megabits per molecule. In some studies, a subset of field-based descriptors found to be invariant upon rotation or derived from back-projection algorithms was used.⁹ Furthermore, Gaussian representations based on quantum similarity methods have replaced the grid methods for describing the general shape of molecules.⁹

Most shape-based descriptors encode the shape of a molecule not in one fragment, but instead use several small important features to discover feature relationships by statistical associations. These methods are free from alignment problems and are usually implemented with a bit-string representation of features, saving computation time. They are often referred to as multiple-point pharmacophores: two-point pharmacophores (2PP), which are known as atom pairs and represent all possible pairs of atoms in the molecule, three-point pharmacophores (3PP), and four-point pharmacophores (4PP).⁹

9 (a) Pastor, M.; Cruciani, G.; McLay, I.; Pickett, S.; Clementi, S. J. Med. Chem. 2000, 43, 3233-3243. (b) Stiefl, N.; Baumann, K. J. Chem. Inf. Comput. Sci. 2003, 46, 1390-1407, and references therein. (c) Carbo, R.; Leyda, L.; Arnau, M. Int. J. Quantum Chem. 1980, 17, 1185-1189. (d) Carbo, R.; Calabuig, B. J. Chem. Inf. Comput. Sci. 1992, 32, 600-606. (e) Grant, J. A.; Pickup, B. T. J. Phys. Chem. 1995, 99, 3503-3510. (f) Sheridan, R. P.; Ramaswamy, N.; Rusinko III, A.; Bauman, N.; Haraki, K. S.; Venkataraghavan, R. J. Chem. Inf. Comput. Sci. 1989, 29, 255-260. (g) Sheridan, R. P.; Miller, M. D.; Underwood, D. J.; Kearsley, S. K. J. Chem. Inf. Comput. Sci. 1996, 36, 128-136. (h) Good, A. C.; Ewing, T. J.; Gschwend, D. A.; Kuntz, I. D. J. Comput.-Aided Mol. Des. 1995, 9, 1-12.
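An atom-pair (two-point) description can be sketched by enumerating every pair of atoms together with the number of bonds separating them. The version below is a simplification under stated assumptions: atoms are typed by element symbol only, the separation is a topological (bond-count) distance rather than a 3D distance, the graph is assumed connected, and the function names are hypothetical.

```python
from collections import Counter, deque
from itertools import combinations


def topological_distances(n_atoms, bonds):
    """Bond-count distances from every atom, by breadth-first search."""
    adj = {i: [] for i in range(n_atoms)}
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    all_dist = []
    for start in range(n_atoms):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        all_dist.append(dist)
    return all_dist


def atom_pairs(atoms, bonds):
    """Count (type, type, distance) triples over all pairs of atoms."""
    dist = topological_distances(len(atoms), bonds)
    pairs = Counter()
    for i, j in combinations(range(len(atoms)), 2):
        first, second = sorted((atoms[i], atoms[j]))
        pairs[(first, second, dist[i][j])] += 1
    return pairs


# Ethanol, hydrogens omitted: C0-C1-O2.
ethanol_pairs = atom_pairs(["C", "C", "O"], [(0, 1), (1, 2)])
```

Each distinct triple can then be assigned a bit or a count in a feature vector, which requires no alignment of the molecules being compared.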

Surface-based descriptors rely on the intuitive notion that macromolecule-ligand interactions are mostly mediated by the molecular surfaces (e.g., van der Waals surface).¹⁰ In one case, refinements over geometric organizations of polar and nonpolar surface areas showed significant improvements in the prediction of physicochemical descriptors.¹⁰

Spread and variability of chemical descriptors

Since most descriptors have been developed for different purposes, the ranges and distribution patterns of descriptor values are heterogeneous. Prior to using collections of descriptors from different sources, it is wise to scale them. Descriptors can be scaled based on the observed range of values in the dataset to sense any peculiarities in the dataset.¹¹ For example, fingerprint scaling is a method to increase the performance of similarity search calculations. It is based on the detection of local patterns with higher information content, representing specific compound classes; application of scaling factors has been shown to improve search results for different sets of fingerprints.¹¹ Standard, chemically meaningful scaling based on the mean value and absolute deviation of a variety of descriptors in different chemical databases (ACD, CMC, MDDR) has been used to filter outliers.¹¹
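Scaling by mean value and absolute deviation, followed by an outlier flag, might look like the sketch below. The molecular-weight values and the cutoff of 2.0 scaled units are hypothetical, and the cited studies may define their scaling and filtering differently.

```python
def mean_abs_dev_scale(values):
    """Centre on the mean and divide by the mean absolute deviation."""
    mean = sum(values) / len(values)
    mad = sum(abs(v - mean) for v in values) / len(values)
    return [(v - mean) / mad if mad else 0.0 for v in values]


def flag_outliers(values, threshold=2.0):
    """Mark values lying more than `threshold` scaled units from the mean."""
    return [abs(z) > threshold for z in mean_abs_dev_scale(values)]


# Hypothetical molecular weights for six compounds; the last one is atypical.
mol_weights = [180.0, 200.0, 220.0, 210.0, 190.0, 900.0]
scaled = mean_abs_dev_scale(mol_weights)
outliers = flag_outliers(mol_weights)
```

Whether a flagged compound is then removed or kept is a separate decision that depends on the analysis to follow.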

10 (a) Stanton, D.; Jurs, P. Anal. Chem. 1990, 62, 2323-2329. (b) Gaillard, P.; Carrupt, P. A.; Testa, B.; Boudon, A. J. Comput.-Aided Mol. Des. 1994, 8, 83-96. (c) Polanski, J.; Walczak, B. Comput. Chem.

Nevertheless, consideration of outliers is really a question of finding a balance between extremes. For example, outliers would have a high probability of being identified by dissimilarity-based searching if left in the dataset, or they might tend to artificially compress major populations into smaller spaces during cell-based partitioning. In most studies, however, it is better for outliers to remain in the dataset.¹¹

Correlation between descriptors

Despite their heterogeneous origins, there is significant overlap in the information content of chemical descriptors.¹² Implicit information encoded in various 2D descriptors, for example, can be extracted, making it possible to use a subset of cost-effective descriptors. In a series of alkyl-phenyl compounds, there are significant correlations between 2D topological indices and parameters related to conformations, demonstrating that 3D properties can be extracted without resorting to geometric optimizations.¹² For the calculation of polar surface area, essentially identical results are obtained using either 3D calculation or 2D topological indices.¹²
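Overlap in information content between descriptors can be detected with a pairwise correlation scan, after which one member of each highly correlated pair can be dropped. The sketch below uses the Pearson coefficient; the descriptor names, values, and the 0.95 cutoff are assumptions for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two descriptor columns."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / sqrt(var_x * var_y)


def redundant_pairs(columns, cutoff=0.95):
    """List descriptor pairs whose absolute correlation reaches the cutoff."""
    names = list(columns)
    return [
        (a, b)
        for idx, a in enumerate(names)
        for b in names[idx + 1:]
        if abs(pearson(columns[a], columns[b])) >= cutoff
    ]


# Hypothetical descriptor columns for four compounds.
descriptors = {
    "heavy_atoms": [5, 10, 15, 20],
    "mol_weight": [70.0, 142.0, 210.0, 284.0],
    "logP": [2.0, -1.0, 3.0, 0.5],
}
redundant = redundant_pairs(descriptors)
```

Here heavy atom count and molecular weight are nearly collinear, so only one of them needs to be kept, while log P contributes largely independent information.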

Performance of chemical descriptors: retrospective analysis

Retrospective analysis of descriptor performance is carried out either by using simulated property prediction experiments or by examining the coverage of different bioactivity types in the diverse subsets. The ability to distinguish biologically active and inactive compound sets by various clustering methods was evaluated over a range of structural descriptors; the most effective descriptors were the 2D keyed fingerprints.¹³ In a follow-up study by the same group, the prediction of known physical properties as a metric showed the same trend as well.¹³ Independent studies also demonstrated that 2D fingerprint-based descriptors were most effective both in selecting active compounds and in sampling representative subsets of active compounds.¹³

12 (a) Oprea, T. I. J. Braz. Chem. Soc. 2002, 13, 811-815. (b) Quigley, J. M.; Naughton, S. M. J. Chem. Inf. Comput. Sci. 2002, 42, 976-982. (c) Estrada, E.; Molina, E.; Perdomo-Lopez, I. J. Chem. Inf. Comput. Sci. 2001, 41, 1015-1021. (d) Ertl, P. et al. J. Med. Chem. 2000, 43, 3714-3717.

13 (a) Brown, R. D.; Martin, Y. C. J. Chem. Inf. Comput. Sci. 1996, 36, 572-584. (b) Brown, R. D.; Martin, Y. C. J. Chem. Inf. Comput. Sci. 1997, 37, 1-9. (c) Matter, H. J. Med. Chem. 1997, 40, 1219-1229. (d) Potter, T.; Matter, H. J. Med. Chem. 1998, 41, 478-488. (e) Patterson, D. E. et al. J. Med. Chem. 1996, 39, 3049-3059. (f) Cruciani, G. et al. J. Med. Chem. 2002, 45, 2685-2694. (g) Ooms, F. et al. Biochim. Biophys. Acta 2002, 1587, 118-125.

Often, application of the similarity principle is validated by the mathematical concept of neighborhood behavior.¹³ A number of descriptors were assessed by their distributions for a subset of related compounds and biological activities; 3D grid-based descriptors (CoMFA) and 2D fingerprints were found to outperform other chemical descriptors. These conclusions, however, should be treated with some caution due to the limited sizes of the datasets.

Solubility data and blood-brain barrier penetration serve as test cases for pharmacokinetic aspects of descriptor analysis.¹³ Comparison of the descriptors applied to these data sets revealed that surface-based 3D descriptors (VolSurf) demonstrated the most consistent and reliable performance, grid-independent 3D descriptors (GRIND) showed intermediate performance, while 2D fingerprint-type descriptors (UNITY fingerprints, ISIS keys) underperformed for pharmacokinetic profiling.

1.1.3 Navigating chemical descriptor space

Historically, the notion of similarity is used mainly in the early stages of the development of a particular science, and later it may be quantified and explained with accuracy as the theory of this science develops. For example, the periodic table was originally founded on similarity between elements, and these “similarities” were later explained based on electrons and the

14 (a) Hansch, C. Acc. Chem. Res. 1969, 2, 232-239. (b) Isaacs, N. Physical Organic Chemistry (Second edition, Prentice Hall, 1996), pp 146-192.

the indoleacetic acid derivatives started from using the electronic coefficients of the Hammett equation.¹⁴ A general shortcoming common to these methods is that the scope is limited to structurally closely related series of compounds, so they are inappropriate for correlation of data where compounds fall into many different structural classes. Prediction of activity outside the structural classes of established biological interest is thus problematic most of the time. A second shortcoming is their weakness in accommodating data represented by inactive compounds. Essentially, existing structure-activity correlation methodologies are only useful for optimizing a previously known “lead” structure, and not for generating new “leads”.

Substructure search

Substructural analysis is often dubbed Free-Wilson analysis, as Free and Wilson published one of the early works in this area.¹⁵ Substructure search involves the retrieval of all molecules in a database that contain a user-defined query substructure, irrespective of the environments in which the query substructure occurs.¹⁵ It is equivalent to determining whether one graph is entirely contained within another, a problem known as subgraph isomorphism. Efficient search of a vast database nevertheless requires two steps. The first step involves the use of screens to rapidly eliminate molecules that cannot possibly match the substructure query. The remaining structures are subjected to the more time-consuming subgraph isomorphism procedure to determine which of them truly match based on the presence or absence of the structural features represented. There are a number of different graph-theoretic algorithms, such as the maximum common edge subgraph (MCES), maximum weight clique, and k-cut methods, that show similar or superior performance to subgraph isomorphism.¹⁵
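The two-step search described above, a fast screen followed by atom-by-atom matching, can be sketched as follows. This is an illustration under stated assumptions: the three screen bits (bit 0 "contains O", bit 1 "contains N", bit 2 "contains a C-O bond"), the tiny database, and the brute-force matcher are all hypothetical, and bond orders are ignored.

```python
from itertools import permutations


def passes_screen(query_bits, mol_bits):
    """A molecule can match only if it sets every bit the query sets."""
    return query_bits & mol_bits == query_bits


def has_substructure(q_atoms, q_bonds, m_atoms, m_bonds):
    """Brute-force subgraph matching: map each query bond onto a molecule bond."""
    m_bond_set = {frozenset(b) for b in m_bonds}
    for image in permutations(range(len(m_atoms)), len(q_atoms)):
        if any(q_atoms[i] != m_atoms[image[i]] for i in range(len(q_atoms))):
            continue
        if all(frozenset((image[i], image[j])) in m_bond_set for i, j in q_bonds):
            return True
    return False


def substructure_search(query, database):
    hits = []
    for name, mol in database.items():
        if not passes_screen(query["bits"], mol["bits"]):
            continue  # step 1: cheap rejection
        if has_substructure(query["atoms"], query["bonds"], mol["atoms"], mol["bonds"]):
            hits.append(name)  # step 2: confirmed by graph matching
    return hits


chain = [(0, 1), (1, 2)]
database = {
    "ethanol": {"atoms": ["C", "C", "O"], "bonds": chain, "bits": 0b101},
    "ethylamine": {"atoms": ["C", "C", "N"], "bonds": chain, "bits": 0b010},
    "dimethyl ether": {"atoms": ["C", "O", "C"], "bonds": chain, "bits": 0b101},
}
query = {"atoms": ["C", "C", "O"], "bonds": chain, "bits": 0b101}  # C-C-O
hits = substructure_search(query, database)
```

Ethylamine is rejected by the screen alone, whereas dimethyl ether passes the screen but fails the graph match, which is why the second step cannot be skipped.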

Global substructure analysis of frequent substructures using drug databases might well

lead to the identification of minimal motifs relevant to biological activity.''*“ This approach

'S (a) Free, S M.; Wilson, J W J Med Chem 1964, 7, 395-398 (b) Merlot, C et al Curr Opin Drug.Dise Dev 2002, 5, 391-399 (c) Hagadone, T R J Chem Inf’ Comput Sci., 1992, 32, 515-521.

!5 (a) Sheridan, R P J Chem Inf, Comput Sci 1998, 38, 915-924 (b) Horton, D A; Bourne, G T.;Smythe, M L Chem Rev 2003, 703, 893-930 (c) Tounge, B A.; Reynolds, C H J Chem Inf.

Trang 27

resonates well with the concept of “privileged (sub)structures”, referring to substructural

elements found in compounds enriched for biological activity.'* Application of

retro-synthetic analysis and comparative filtering of database ranked a number of substructures aspreferred building blocks for library design.'** Furthermore, “highest scoring common

substructure” (HSCS) was analyzed using a smaller dataset to provide skeletal information for

postulated lead compounds related to a certain biological activity.’

stimulated interest in the use of multiple reference structures to identify further molecules for biological screening.¹⁷ᵃ Similarity searching in large chemical databases requires representations of the molecules that are both effective (i.e., able to differentiate between molecules that are different) and efficient (i.e., quick to calculate). In general, these two traits conflict: the most effective representations tend to be the least efficient to calculate, and vice versa, so a suitable compromise needs to be made. Based on the set of descriptors or features chosen, comparison of molecules is usually performed using similarity measures (e.g., similarity coefficients).¹⁷ᵇ Many similarity coefficients have been developed, and these can broadly be divided into association, correlation,

Comput. Sci. 2004, 44, 1810-1815. (d) Lewell, X. Q.; Judd, D. B.; Watson, S. P.; Hann, M. M. J. Chem. Inf. Comput. Sci. 1998, 38, 511-522. (e) Sheridan, R. P. J. Chem. Inf. Comput. Sci. 2003, 43, 1037-1050.

17 (a) Hert, J.; Willett, P.; Wilton, D. J.; Acklin, P.; Azzaoui, K.; Jacoby, E.; Schuffenhauer, A. J. Chem. Inf. Comput. Sci. 2004, 44, 1177-1185. (b) Holliday, J. D.; Salim, N.; Whittle, M.; Willett, P. J. Chem. Inf. Comput. Sci. 2003, 43, 819-828. (c) Chen, X.; Reynolds, C. H. J. Chem. Inf. Comput. Sci. 2002, 42, 1407-1414.


distance, and probabilistic coefficients. When two bit-string descriptors are compared, association coefficients (e.g., the Tanimoto coefficient) try to capture fragments common to the two molecules and give a result in the range [0, 1], where 1 represents identical molecules. Correlation coefficients (e.g., the Pearson coefficient) represent the correlation between the two vectors representing the molecules and give values in the range [−1, 1]. Distance coefficients (e.g., the Euclidean coefficient) focus on differences between two molecules and are a measure of dissimilarity, giving results in the range [0, +∞). A number of factors influence the performance of similarity coefficients in mining larger datasets. Sometimes a judicious combination of similarity coefficients improves the global outcome of the analysis.¹⁷

[Figure 1.6: the equations in panels (a)-(c) are not legible in the source scan.]

Figure 1.6 Similarity coefficients. (a) Euclidean coefficient. (b) Pearson correlation coefficient. (c) Tanimoto coefficient, T(A, B) (α, β, γ: fragment counts; γ: fragments common to both molecules).
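The three coefficients named in Figure 1.6 can be illustrated with a minimal Python sketch. It assumes the standard textbook definitions for equal-length binary fingerprints (these are not transcribed from the figure), and the function names and example bit strings are hypothetical.

```python
import math


def tanimoto(a, b):
    """Association coefficient: shared on-bits / on-bits present in either string."""
    common = sum(x & y for x, y in zip(a, b))
    either = sum(a) + sum(b) - common
    # Two all-zero strings are treated as identical (an assumption of this sketch).
    return 1.0 if either == 0 else common / either


def euclidean(a, b):
    """Distance coefficient: 0 for identical strings, grows with differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Correlation coefficient between the two bit vectors, in [-1, 1]."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("Pearson coefficient is undefined for a constant vector")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical 8-bit fingerprints (1 = fragment present).
fp_a = [1, 1, 0, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 0, 1, 1, 0]

print(tanimoto(fp_a, fp_b), pearson(fp_a, fp_b), euclidean(fp_a, fp_b))
```

Identical fingerprints give a Tanimoto coefficient of 1 and a Euclidean distance of 0, reflecting that the former measures similarity and the latter dissimilarity.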

Similarity paradox

There are numerous examples illustrating the so-called “similarity paradox”, in which a small change in chemical structure leads to a drastic change in biological response. This failure of the similarity principle can be viewed on two levels. One level is that of representation: whether or not the assessment of similarity correctly quantifies the intuitive similarity between two compounds. The other level comes from the complexity of the biological systems and the responses of interest.

Structural similarity is more evident in action when the hypothetical “lock and key” mode of macromolecular interaction is prevalent. Nevertheless, this mode may not always apply; similar compounds frequently bind in very different orientations in the protein active site, bind to a different conformation of a protein, or bind to a different protein altogether. In fact, such observations are strengthened by the notion that medicinal chemists need to make a large number of compounds to represent any structural class, even as they are designing the compounds to interact with a biological target of known structure. Moreover, numerous meaningful biological responses result from the complex interactions of genetic and

[Figure 1.7: structures not reproduced. Legible panel labels (receptor subscripts illegible): R: H, AT antagonist; R: i-Pr, AT agonist; enantiomer binds D receptor with 1/1,250-fold affinity; agonist of calcium channel, enantiomer: antagonist; CCK1 agonist, enantiomer: antagonist.]

Figure 1.7 Examples of the similarity paradox. (a) Substitution at one position alters the functional action of a small molecule. (b) Butaclamol is an example in which the affinities of the (+)- and (−)-enantiomers differ from receptor to receptor for the same compound. The (S)-(−) form of the calcium channel ligand Bay K 8644 is an agonist (stabilizing the open calcium channel), whereas the (R)-(+) form is a weak antagonist, a calcium channel blocker (stabilizing the closed channel). Corresponding differences are observed for a CCK1 ligand, where one diastereomer is an agonist, whereas its enantiomer is an antagonist. (c) Because of the chiral nature of our sensory receptors, the enantiomers of limonene and carvone differ in their typical odor. (d) For two diastereomers of the wine lactone, the odor threshold values (i.e., the lowest concentration in air that can be smelled by a person) differ by about 8 orders of magnitude.¹⁸

As illustrated in Figure 1.7, it is not evident that there is any structural relationship between compounds and their odors. Interestingly, a model based on the vibrational spectra of odorants showed promising results in correlating with odor. For example, degeneracy, the ability of structurally different components to carry out a similar function or produce an equivalent output within a system, operates at many levels of biological systems to yield robust behavior under evolutionary pressure.¹⁸ᵃ

18 (a) Edelman, G. M.; Gally, J. A. Proc. Natl. Acad. Sci. U.S.A. 2001, 98, 13763-13768. (b) Beely, N. R. A. Drug Disc. Today 2000, 5, 354-363. (c) Ariëns, E. J.; Wuis, E. W.; Veringa, E. J. Biochem.


Diversity analysis (assessment)

Diversity analysis generally deals with two different questions: “which compound set spans the largest chemical space?” or “which compound set is most similar to (a) reference compound(s)?” The first question involves maximizing the dissimilarity of compounds to explore new areas of chemical space, while the second entails maximizing the similarity between compounds in a focused region.¹⁹ᵃ A related notion in diversity design is that of selecting compounds to “fill holes” in some descriptor-defined diversity space, a strategy most commonly put forward in connection with pharmacophore-based molecular descriptors. In principle, this is a perfectly sensible procedure, but if and only if the descriptor(s) defining the space have already been shown to be valid, e.g., by showing neighborhood behavior. Such analyses can also be applied to different problems such as non-redundant subset selection from chemical databases, or global analysis and comparison of chemical databases.¹⁹ᵇ

Distance-based methods. In distance-based approaches, diversity is generally expressed as some measure of pairwise dissimilarities. One drawback is that the cost of calculation scales with the square of the number of compounds, O(N²), becoming prohibitive for large collections of compounds. Aggressive application of cost-effective searching methods (e.g., decision trees, simulated annealing, and nearest-neighbor searches) and metrics has reduced the scaling to O(N log N) or even O(N).¹⁹ᶜ Besides computational cost, this approach has a tendency to spread compounds out too much in descriptor space, making it difficult to locate diversity “voids”.
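The quadratic cost of the distance-based approach can be made concrete with a small sketch. The diversity score below (mean pairwise Tanimoto dissimilarity) is only one of many possible measures and is chosen here as an assumption; fingerprints are represented as sets of on-bit indices, and all names and data are hypothetical.

```python
from itertools import combinations


def tanimoto_dissimilarity(a, b):
    """1 - Tanimoto similarity for fingerprints stored as sets of on-bit indices."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def mean_pairwise_dissimilarity(fingerprints):
    """Average dissimilarity over all N(N-1)/2 pairs, hence O(N^2) comparisons."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(tanimoto_dissimilarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical three-compound library.
library = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
print(mean_pairwise_dissimilarity(library))
```

Doubling the library size roughly quadruples the number of pairs visited, which is the scaling problem that tree-based and nearest-neighbor searches are meant to relieve.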

Optimization-based methods are effective ways of sampling large spaces evenly. For example, a variance-based approach starts by finding a subset of compounds with descriptors of

Pharmacol. 1988, 37, 9. (d) Schramm, M.; Thomas, G.; Towart, R.; Franckowiak, G. Nature 1983, 303, 535. (e) Franckowiak, G.; Bechem, M.; Schramm, M.; Thomas, G. Eur. J. Pharmacol. 1985, 114, 223. (f) de Tullio, P.; Delarge, J.; Pirotte, B. Curr. Med. Chem. 1999, 6, 433. (g) Hughes, J.; Dockray, G. J.; Hill, D.; Garcia, L.; Pritchard, M. C.; Forster, E.; Toescu, E.; Woodruff, G.; Horwell, D. C. Regul. Pept. 1996, 65, 15. (h) Beinborn, M.; Quinn, S. M.; Kopin, A. S. J. Biol. Chem. 1998, 273, 14146. (i) Friedman, L.; Miller, J. G. Science 1971, 172, 1044. (j) Guth, H. Helv. Chim. Acta 1996, 79, 1559.

19 (a) Young, S. S.; Ge, N. Curr. Opin. Drug Disc. Dev. 2004, 7, 318-324. (b) Bayada, D. M. et al. J. Chem. Inf. Comput. Sci. 1999, 39, 1-10. (c) Agrafiotis, D. K.; Lobanov, V. S. J. Chem. Inf. Comput. Sci. 1999, 39, 51-58. (d) Martin, E. J.; Critchlow, R. E. J. Comb. Chem. 1999, 1, 32-45.


the least possible correlation, and tests the significance of each descriptor in predicting relevant dependent variables (e.g., biological activity). The most widely used method is D-optimal design, but the results are model-dependent and tend to favor the extremes of chemical space.¹⁹ᵈ

Cell-based designs operate within a pre-defined low-dimensional space (Figure 1.2c). One of the important advantages of cell-based partitioning (binning) methods is a common frame of reference for comparing different datasets, allowing focused design of a library based on the properties of interest.²⁰ᵃ Another advantage is that they are very fast and cost-effective, scaling as O(N). The main drawback of these methods is that they are restricted to a low-dimensional space, because the number of cells required to dissect a space rapidly becomes prohibitive as the number of dimensions grows.²⁰ᵇ It is also believed that diversity space is a relative space (i.e., there is no absolute origin), which has to be oriented to a reference point (e.g., drug-like space for a drug-discovery program). Hence, global modeling, especially with an insufficient number of descriptors, has come under fire over whether or not it properly represents chemical space.²⁰
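A brief sketch may help show why cell-based partitioning is fast yet confined to few dimensions. It assumes equal-width bins on each descriptor axis; the descriptor ranges, bin count, and compound values are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cell_index(descriptors, lows, highs, bins):
    """Map one compound's descriptor vector to the tuple of bin indices of its cell."""
    index = []
    for value, low, high in zip(descriptors, lows, highs):
        fraction = (value - low) / (high - low)
        # Clamp so that out-of-range values fall into the edge cells.
        index.append(min(bins - 1, max(0, int(fraction * bins))))
    return tuple(index)


def occupancy(compounds, lows, highs, bins):
    """Single pass over the compounds, i.e., O(N) for a fixed number of descriptors."""
    return Counter(cell_index(c, lows, highs, bins) for c in compounds)


def total_cells(bins, dimensions):
    """Number of cells needed to dissect the space: bins ** dimensions."""
    return bins ** dimensions


# Hypothetical (molecular weight, logP) values for three compounds.
compounds = [(250.0, 1.5), (260.0, 1.7), (480.0, 4.5)]
cells = occupancy(compounds, lows=(0.0, -2.0), highs=(1000.0, 8.0), bins=10)
print(cells, total_cells(10, 2), total_cells(10, 8))
```

With ten bins per axis, two descriptors need 100 cells, whereas eight descriptors already need 100 million, most of which would be empty for any realistic library.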

ChemGPS was built to provide a consistent metric by always calculating its properties in relation to a set of compounds having extreme properties (“satellites”). The role of the satellites is to serve as boundary conditions imposed by property space. Since it is a global model based on interpolation, the ChemGPS prediction is expected to be robust.²¹

1.1.4 Mapping Chemical Descriptor Space

Here, I have reviewed some literature covering retrospective analyses of chemical space

20 (a) Yi, B.; Hughes-Oliver, J. M.; Zhu, L.; Young, S. S. J. Chem. Inf. Comput. Sci. 2002, 42, 1221-1229. (b) Godden, J. W.; Furr, J. R.; Bajorath, J. J. Chem. Inf. Comput. Sci. 2003, 43, 182-188. (c) Menard, P. R.; Mason, J. S.; Morize, I.; Bauerschmidt, S. J. Chem. Inf. Comput. Sci. 1998, 38, 1204-1213. (d) Schnur, D. J. Chem. Inf. Comput. Sci. 1999, 39, 36-45.

21 (a) Oprea, T. I.; Gottfries, J. J. Comb. Chem. 2001, 3, 157-166. (b) Bergstrom, C. A. et al. J. Chem. Inf. Comput. Sci. 2004, 44, 1477-1488.


based on activity, drug-similarity, and sources in chemical descriptor space. Most studies used an available database as a surrogate to define a class of compounds. For example, the ACD²² and SPRESI²³ databases are commonly regarded as representing inactive, non-drug compounds. A number of databases of drugs and pharmacologically interesting agents (e.g., WDI,²⁴ MDDR,²⁵ CMC²⁶) are treated as surrogates of drug space. More sophisticated studies pre-treated databases for the purposes of the experimental comparison. An interesting problem associated with the use of these databases is that they grow over time; therefore, they cannot always be treated as if sampled from a static probability distribution, i.e., the probability distribution can vary dramatically over time. It has been noted that there has been a shift to higher molecular weight for compounds in clinical trials over the past few years. This suggests that the concept of a drug, and hence the characteristics of drug molecules, is not static but evolves over time.

Activity-based mapping

Selection of compound sets in a screening campaign can be based on the coverage of a whole set in some diversity space and/or distances between compounds On the other hand, if there are known chemotypes with the same biological activity, these become the seeds for

22 Available Chemicals Directory (2002.1 version) contains grade and bulk chemicals. The database is available from MDL Information Systems Inc., 14600 Catalina Street, San Leandro, CA, 94577. Website: http://www.mdli.com/products/acd.html

23 The SPRESI database is produced by the All-Union Institute of Scientific and Technical Information of the Academy of Science of the USSR (VINITI) in Moscow and the Central Information Processing for Chemistry (ZIC) in Berlin. This database consists of data extracted from 1,000 journals and patents, books, and other sources from 1975 to 1990. SPRESI is distributed by Daylight Chemical Information Systems, Inc., Mission Viejo, CA.

24 World Drug Index (Derwent Information, London, UK).

25 MACCS-II Drug Data Report (2002.1 version) contains biologically active compounds in the different stages of drug development as presented in the patent literature, journals, meetings and congresses. The database is available from MDL Information Systems Inc., 14600 Catalina Street, San Leandro, CA, 94577. Website: www.mdli.com/products/mddr.html

26 Comprehensive Medicinal Chemistry (2002.1 version) contains compounds used or studied as medicinal agents in humans and pharmaceutical compounds. It is derived from the Drug Compendium in Pergamon’s Comprehensive Medicinal Chemistry. The database is available from MDL Information Systems Inc., San Leandro, CA, 94577. Website: http://www.mdli.com/products/cmc.html


compound selection. From the structural information of such seeds, a representative set of similar compounds (i.e., “activity-enriched clusters” or “activity-prioritized screening lists”) can be chosen for subsequent screening.²⁷ᵃ For example, application of k-means clustering with topological substructure analysis is able to distinguish selected sets of antibacterial agents from others.²⁷ᵇ CNS activities mined from the WDI were shown to be classifiable by substructure analysis.²⁷ᶜ Retrospective modeling using seven decision trees of various complexity over 15,000 compounds isolated a very small number (37) of “highly active” or “moderately active” compounds identified in an HTS experiment. The combination of multiple decision trees with diverse classifiers is surprisingly successful, resulting in higher fold-enrichments.²⁷ᵈ

The increase in the number of compounds with reliable biological annotation enables activity prediction models with higher resolution. Hierarchical consensus modeling with a series of neural networks trained on compounds from the MDDR database and random compounds generated sets of compounds that act on biological targets belonging to specific gene families. Similar studies discriminating between modulators of different target classes, such as GPCR and non-GPCR targets, or GPCRs and kinases, produced efficient predictive models for whether or not a compound could become a GPCR ligand.²⁸ᵃ Applying this information to ligand-based library design might well focus the exploration of chemical space relevant to GPCR targeting.²⁸ᵇ

Mapping drug-space

Drug-like properties collectively come from a wide range of in vivo mechanisms, many of which are not well characterized; therefore, it is currently impossible to predict such properties

27 (a) Young, S. S. et al. Stat. Sci. 2001, 16, 154-168. (b) Molina, E.; Diaz, H. G.; Gonzalez, M. P.; Rodriguez, M.; Uriarte, E. J. Chem. Inf. Comput. Sci. 2004, 44, 515-521. (c) Engkvist, O.; Wrede, P.; Rester, U. J. Chem. Inf. Comput. Sci. 2003, 43, 155-160. (d) Van Rhee, A. M. J. Chem. Inf. Comput. Sci. 2003, 43, 941-948.

28 (a) Manallack, D. T.; Pitt, W. R.; Gancia, E.; Montana, J. G.; Livingstone, D. J.; Ford, M. G.; Whitley, D. C. J. Chem. Inf. Comput. Sci. 2002, 42, 1256-1262. (b) Balakin, K. V.; Tkachenko, S. E.; Lang, S. A.; Okun, I.; Ivashchenko, A. A.; Savchuk, N. P. J. Chem. Inf. Comput. Sci. 2002, 42, 1332-1342.


from first principles. Instead, heuristic methods (rules based on relatively simple compound properties and derived from available experimental data) are used. The most widely used drug-like filter was developed by Lipinski by analyzing tendencies in simple molecular descriptors for a reference set of 2,300 compounds, all of which had passed phase I clinical trials.²⁹ᵃ This filter encompasses the following criteria: molecular weight ≤ 500, clogP ≤ 5, number of H-bond donors ≤ 5, and number of H-bond acceptors ≤ 10.
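The four criteria above translate directly into a simple filter. The following is a minimal sketch, not a validated implementation: the descriptor values are assumed to be precomputed elsewhere, each bound is treated as inclusive (an assumption about boundary cases), and the class name, function name, and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    """Precomputed descriptors for one compound (hypothetical container)."""
    mol_weight: float
    clogp: float
    h_donors: int
    h_acceptors: int


def lipinski_violations(p):
    """Count how many of the four Lipinski criteria the compound fails."""
    checks = [
        p.mol_weight <= 500,
        p.clogp <= 5,
        p.h_donors <= 5,
        p.h_acceptors <= 10,
    ]
    return sum(not ok for ok in checks)


# Illustrative values only; they do not describe real compounds.
small = Properties(mol_weight=350.4, clogp=2.1, h_donors=2, h_acceptors=5)
large = Properties(mol_weight=720.0, clogp=6.3, h_donors=6, h_acceptors=12)
print(lipinski_violations(small), lipinski_violations(large))
```

Returning a violation count rather than a yes/no answer leaves the pass threshold to the caller, since such filters are often applied with some tolerance.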

Though these simple physicochemical descriptors can characterize drug space very well, they do not have enough power to distinguish drug-like space from others; therefore, it is more appropriate to think of these criteria as necessary but not sufficient conditions to describe drug-like space. For example, it was shown that more compounds in the ACD (surrogate of “non-drugs”) are compliant with Lipinski’s rules than compounds from the MDDR database (surrogate of “drugs”).²⁹ From the carefully curated, expanded dataset, 80% of the compounds in the non-drug space are compliant with Lipinski’s

first pass through the liver and lungs.

A similar drug-like filter proposed by Veber came from the analysis of a proprietary

database of 1,100 drug candidates and systematic oral bioavailability data from a single animal

29 (a) Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Adv. Drug Deliv. Rev. 1997, 23, 3-25. (b) Xu, J.; Stevenson, J. J. Chem. Inf. Comput. Sci. 2000, 40, 1177-1187. (c) Oprea, T. I. J. Comput.-Aided Mol. Des. 2000, 14, 251-264.


species (rat). Only two molecular properties, a polar surface area of less than 140 Å² and fewer than twelve rotatable bonds, were enough for the prediction of sufficient oral bioavailability.³⁰ᵃ Recently, hierarchical classification schemes were reported. Here, the dominant charge at biological pH determines which properties (e.g., rotatable bonds, PSA, Lipinski’s rules) govern the bioavailability of compounds.³⁰ᵇ

Data mining studies addressing drug-likeness cover substructures (e.g., frameworks, side chains, and functional groups), descriptors (e.g., atom environments, drug-like indices, and pharmacophore filters), and the use of supervised-learning algorithms (e.g., neural networks and recursive-partitioning methods).³⁰

Using the Comprehensive Medicinal Chemistry (CMC) database and the MDDR as representatives of drug-like molecule databases and the ACD as a surrogate for nondrug-like molecules, neural network models to classify drug-like and nondrug-like molecules were reported. These analyses are based on both one-dimensional descriptors (molecular weight, topological indices, atom types) and two-dimensional descriptors.³¹ᵃ A genetic algorithm was used to distinguish between drug-like and nondrug-like compounds using relatively simple descriptors (molecular weight, the numbers of H-bond donors and acceptors, rotatable bonds, and aromatic rings).³¹ᵇ Here, compounds from the WDI were assumed to comprise a drug-like dataset, and compounds from the SPRESI database were presumed to be a nondrug-like dataset. Decision-tree analysis was also applied to the same dataset.³¹ᶜ Although these approaches showed reasonable accuracy, each is highly dependent upon its training datasets, and lacking in

30 (a) Veber, D. F.; Johnson, S. R.; Cheng, H.-Y.; Smith, B. R.; Ward, K. W.; Kopple, K. D. J. Med. Chem. 2002, 45, 2615-2623. (b) Martin, Y. J. Med. Chem. 2005, 48, 3164-3170. (c) Ghose, A. K.; Viswanadhan, V. N.; Wendoloski, J. J. J. Comb. Chem. 1999, 1, 55-68. (d) Bemis, G. W.; Murcko, M. A. J. Med. Chem. 1996, 39, 2887-2893. (e) Muegge, I.; Heald, S. L.; Brittelli, D. J. Med. Chem. 2001, 44, 1841-1846.

31 (a) Frimurer, T. M.; Bywater, R.; Narum, L.; Lauritsen, L. N.; Brunak, S. J. Chem. Inf. Comput. Sci. 2000, 40, 1315-1324, and earlier works include (i) Ajay; Walters, W. P.; Murcko, M. A. J. Med. Chem. 1998, 41, 3314-3324. (ii) Sadowski, J.; Kubinyi, H. J. Med. Chem. 1998, 41, 3325-3329. (b) Gillet, V. J.; Willett, P.; Bradshaw, J. J. Chem. Inf. Comput. Sci. 1998, 38, 165-179. (c) Wagener, M.; van Geerestein, V. J. J. Chem. Inf. Comput. Sci. 2000, 40, 280-292.


Mapping based on the origin of compounds

The significant roles of small molecules in modulating biological processes might become more apparent by simply enumerating the compounds found naturally in biological systems. A collection of these compounds, with chemical structures, is located in the COMPOUND section of the KEGG/LIGAND database, and totals 13,000 compounds (as of August 2005). These are roughly classified, according to source, into 10% drug-related compounds, 30% phytochemical compounds (secondary metabolites in plants), and 60% metabolites and other compounds originating mostly from the KEGG metabolic pathways (Figure 1.8).³² The use of synthetic compounds without any biological origin might be traced to the pioneering work of Paul Ehrlich. He discovered arsphenamine (Salvarsan), which greatly improved the treatment of syphilis, by systematically screening over 600 available synthetic compounds.

[Figure 1.8: structures not reproduced. Representative compounds shown: D-Glucose, Cinnamate, DNA, Menthol, Ergosterol.]

Figure 1.8 Common substructures of the top clusters from the KEGG/LIGAND database. For each cluster, a representative compound is shown with its name, and the maximum common subgraph is in red.

Comprehensive analysis of the structural and property differences between drugs, natural products, and combinatorial libraries showed that natural products and combinatorial libraries lie at two extremes, with drugs being intermediate in character.³³ The results of this analysis suggest that the design of a synthetic library should veer towards natural-product-like features. Comparative analysis based on a database of 10,495 natural products, a collection of

32 (a) Hattori, M.; Okuno, Y.; Goto, S.; Kanehisa, M. J. Am. Chem. Soc. 2003, 125, 11853-11865. (b) Goto, S.; Okuno, Y.; Hattori, M.; Nishioka, T.; Kanehisa, M. Nucleic Acids Res. 2002, 30, 401-404.

33 (a) Feher, M.; Schmidt, J. M. J. Chem. Inf. Comput. Sci. 2003, 43, 218-227. (b) Lee, M.-L.; Schneider, G. J. Comb. Chem. 2001, 3, 284-289. (c) Zuccotto, F. J. Chem. Inf. Comput. Sci. 2003, 43, 1542-1552.


combinatorial libraries, and the WDI revealed a number of structural features prevalent in natural products: they are higher in molecular weight, have more stereogenic centers and fewer rotatable bonds, have larger, more complex and diverse ring systems, are lower in nitrogen, sulfur, and halogen content, are higher in oxygen content, and have more hydrogen bond donors and acceptors. Notably, due to both acyclic and cyclic conformational constraints, natural products tend to be comparatively rigid, a property that may be associated with a reduced entropic cost of binding and improved oral bioavailability. Combinatorial libraries were also found to be significantly more hydrophobic than either drugs or natural products.³³

Shannon entropy was applied to quantify the information content of major descriptors within a compound library, and the distribution of these measures was sufficiently distinct among combinatorial libraries.³⁴ Subsets of property and substructure descriptors for differentiating between natural compounds (Chapman and Hall compendium of natural products) and synthetic compounds (ACD) were utilized to build a simple regression model.³⁴
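As a rough illustration of how Shannon entropy can summarize the spread of one descriptor across a library, the sketch below bins descriptor values into equal-width histogram bins and computes the entropy of the bin frequencies. The binning scheme, bin count, and example values are assumptions made for illustration and are not taken from the cited studies.

```python
import math
from collections import Counter


def shannon_entropy(values, low, high, bins):
    """Entropy (in bits) of a descriptor's histogram over equal-width bins."""
    if not values:
        return 0.0
    counts = Counter(
        min(bins - 1, max(0, int((v - low) / (high - low) * bins)))
        for v in values
    )
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical descriptor values for two small libraries.
narrow_library = [1.1, 1.2, 1.3, 1.4]  # all values fall into one bin
broad_library = [0.5, 1.5, 2.5, 3.5]   # one value in each of four bins

print(shannon_entropy(narrow_library, 0.0, 4.0, 4))
print(shannon_entropy(broad_library, 0.0, 4.0, 4))
```

A descriptor whose values crowd into a single bin carries no information for distinguishing compounds within the library, while an evenly spread descriptor reaches the maximum of log2(bins) bits.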

34 (a) Bajorath, J. J. Comput.-Aided Mol. Des. 2002, 16, 431-439. (b) Stahura, F. L.; Godden, J. W.; Xue, L.; Bajorath, J. J. Chem. Inf. Comput. Sci. 2000, 40, 1245-1252. (c) Buckingham, J. Dictionary of Natural Products (Chapman & Hall/CRC, 2002).


1.2 Biological measurement space

1.2.1 Complexity and diversity in biological space

Although living systems follow the basic laws of physics and chemistry, biological problems cannot be answered purely on the basis of these laws, simply because of the diversity and complexity residing in biological systems. The complete sequence of the human genome provides the means to identify all of the heritable elements in biological systems; however, it has become clear that a detailed inventory of cellular components (i.e., genes, macromolecules, and metabolites) is not sufficient to understand the systems’ behavior.¹ Biological responses cannot be rationally predicted without a comprehensive understanding of the intracellular biochemical and genetic interactions, which give rise to general principles of regulation and to the diverse nature of responses on different levels.

[Figure 2.1: diagram not reproduced. Legible labels: Universal; Robustness; Fragility; {metabolites, hormones etc.}; Universal design: information flow; Natural Variations; Biodiversity.]

Figure 2.1 Complexity and diversity in biological systems. Phenotypic bifurcations (robustness and fragility) of a biological system may result from complex interactions. Nevertheless, a complex biological system is organized in modular, hierarchical manners, which are universal. The components of each system are very different from each other (at the bottom of the complexity axis). On the other hand, evolutionary constraints on biological systems along the diversity axis might afford important clues to phenotypic variations.

1 (a) Lander, E. S. et al. Nature 2001, 409, 860-921. (b) Austin, C. P. Curr. Opin. Chem. Biol. 2003, 7, 511-515.

2 Kitano, H. Science 2002, 295, 1662-1664.


According to the basic dogma of molecular biology, DNA is the ultimate repository of biological complexity.³ In general, it is accepted that information storage, information processing, and the implementation of diverse cellular programs are located in distinct domains of organization: genome, transcriptome, proteome, and metabolome.³ Nonetheless, the functional distinctness of these organizational levels has recently been scrutinized. For example, although long-term information is stored almost exclusively in the genome, the proteome is essential for short-term information storage, and transcription factor-controlled information retrieval is strongly influenced by the state of the metabolome.

Each eukaryotic cell is very complex, since it is composed of an exceedingly large number of macromolecules that interact with each other and with low-molecular-weight components (e.g., metabolites and hormones) to yield nonlinear behavior that has been fine-tuned by natural selection to achieve specific functional properties. Furthermore, cellular processes may be disassembled into basic “operating units” or “modules”, subsystems of interacting macromolecules and low-molecular-weight components that perform a given function (e.g., signal transduction, protein synthesis, and cell-cycle regulation) in a largely context-independent manner.⁴ Consequently, biological systems are complex, but also modular and hierarchical, awareness of which opens new avenues to understanding.

Such complexity may form the basis for phenotypic variations along the complexity axis (Figure 2.1). Furthermore, along the complexity axis, one can observe a shift from the specific (at the bottom level) to the universal (at the top level) in certain biological systems. Undoubtedly, the exact catalog of components (i.e., genes, metabolites, and proteins) is unique to each species. For example, only 4% of metabolites are shared between 43 organisms

3 (a) Schreiber, S. L. Nat. Chem. Biol. 2005, 1, 64-67. (b) Mangelsdorf, D. J.; Evans, R. M. Cell 1995, 83, 841-850.

4 (a) Hartwell, L. H.; Hopfield, J. J.; Leibler, S.; Murray, A. W. Nature 1999, 402, C47-C52. (b) Petty, H. R. ChemBioChem 2004, 5, 1359-1364. (c) Tyson, J. J.; Chen, K. C.; Novak, B. Curr. Opin. Cell Biol. 2003, 15, 221-231.


examined; however, main metabolic pathways and modules are frequently shared.⁵

These modules, with groups of heterogeneous and unique components, are assumed to interact to form larger networks.⁵ There is unambiguous proof for the existence of such cellular networks; indeed, the proteome organizes itself into a protein-interaction network, and metabolites are interconverted through intricate reaction networks. Theoretical conclusions that the global organization of such networks is governed by the same principles may come as a surprise, but offer a new perspective on cellular organization. It remains to be seen, however, whether or not an even higher degree of universality is present at the module level. The hierarchical relationship among modules, in turn, is apparently quite universal, shared by all metabolic and protein interaction networks studied.

On the other hand, principles governing biological systems may come from the elucidation of evolutionary constraints selected over changes in environment and over internal failures (i.e., DNA damage and genetic malfunctions). Even without modulation of the time domain, comparative surveys of biodiversity (Figure 2.1) can provide some evidence. The survival of living systems implies that the critical parameters of essential modules should be robust; that is, they are insensitive to many environmental and genetic perturbations. Evolvability, on the other hand, requires that other parameters of modules be sensitive to genetic changes.⁶ It is important to understand how such robustness and sensitivity can be reconciled within each functional module. Emergent properties frequently found across various complex systems may give clues to these contradictory observations.⁷

5 (a) Lee, T. I. et al. Science 2002, 298, 799-804. (b) Milo, R. et al. Science 2002, 298, 824-829. (c) Jeong, H.; Tombor, B.; Albert, R.; Oltvai, Z. N.; Barabasi, A. L. Nature 2000, 407, 651-654.

6 Kirschner, M.; Gerhart, J. Proc. Natl. Acad. Sci. U.S.A. 1998, 95, 8420-8427.

7 (a) Carlson, J. M.; Doyle, J. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 2538-2545. (b) Zhou, T.; Carlson, J. M.; Doyle, J. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 2049-2054.

Title	Small Molecule-Based Approach to Chemistry and Biology: Synthesis, Measurement, and Analysis
Author	Young-kwon Kim
Advisors	Prof. David Liu, Prof. Daniel Kahne
Institution	Harvard University
Major	Chemistry and Chemical Biology
Type	Thesis
Year of publication	2005
City	Cambridge

Format
Pages	237
File size	23.84 MB