Orthogonal partial least squares discriminant analysis in metabolomics for disease characterization

ORTHOGONAL PARTIAL LEAST SQUARES DISCRIMINANT ANALYSIS IN METABOLOMICS FOR KIDNEY AND CATARACT DISEASE CHARACTERIZATION

CHEW AI PING (B.Sc. (Hons.), NUS)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF CHEMISTRY
NATIONAL UNIVERSITY OF SINGAPORE
2012

Declaration

I hereby declare that this thesis is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in the thesis. This thesis has also not been submitted for any degree in any university previously.

Chew Ai Ping
25 June 2012

Acknowledgements

It is my honour to thank the following who have made this thesis possible.

Firstly, I thank Professor Sam Li, my main supervisor, for the support and patient guidance these few years, from the start of the project to the end of the write-up for this thesis. I also thank Dr. Ong Eng Shi for being the co-supervisor for this project, and for starting me on this project with the kind and thoughtful help in obtaining and running the samples. I also thank Professor Ong Choon Nam, NUS, for kindly agreeing to release the samples, and for his prompt replies to my questions and offering assistance in any way possible.

I thank my lab mates for their support in my studies, research, and also for giving valuable advice where needed. They are Drs. Lau Hiu Fung, Law Wai Siang, Tok Junie, Zuo Xinbing, Wu Huanan, Liu Feng, Grace Birungi, Jiang Zhangjian, Ms Elaine Tay, Ms Fang Guihua, Ms Gan Peipei, Ms Lü Min, Ms Huang Yan, Mr Jon Ashley, Mr Chen Baisheng, and Mr Lin Junyu. I also thank Mr Ting Aik Leong, whose help in running the samples has also made this thesis possible.

I thank the National University of Singapore for giving me the financial support and the chance to take up this degree under the Research Scholarship programme.
I sincerely thank the pastors, full-time staff, elders, leaders, and brothers and sisters of the Tabernacle Church and Missions, Singapore, for loving, teaching, guiding, and spurring me towards completing my thesis.

I sincerely thank my family for their unfailing support and love given these few years while I undertook studies for my Master’s degree.

Finally, all thanks and glory be to God, who has made all things possible through Him and in Him.

Table of Contents

Declaration
Acknowledgements
Table of Contents
Summary
List of Tables
List of Figures
List of Abbreviations
List of Symbols
Chapter 1 Introduction
1.1 Metabolomics
1.1.1 Overview
1.1.2 Metabolomics in Disease Diagnosis
1.1.3 Non-targeted and Targeted Approaches in Metabolomics
1.1.4 Using Urine for Metabolomic Analysis
1.2 Analytical and Separation Techniques in Metabolomics
1.2.1 Nuclear Magnetic Resonance
1.2.2 Mass Spectrometric Techniques in Metabolomics
1.2.3 Separation Techniques in Metabolomics
1.2.3.1 Overview
1.2.3.2 Gas Chromatography
1.2.3.3 High Performance Liquid Chromatography
1.3 Chemometrics in Metabolomics
1.3.1 Overview
1.3.2 Principal Component Analysis
1.3.3 Partial Least Squares/ Projection to Latent Structures
1.3.4 Orthogonal Partial Least Squares Discriminant Analysis
1.3.5 Pre-treatment of Data for Chemometric Analysis
1.4 Chronic Kidney Disease
1.4.1 Overview of Chronic Kidney Disease
1.4.2 Diagnosis of Chronic Kidney Disease
1.4.3 Metabolomics and Chemometrics for Chronic Kidney Disease
1.5 Cataract Disease
1.5.1 Overview of Cataract Disease
1.5.2 Diagnosis of Cataract Disease
1.5.3 Metabolomics and Chemometrics for Cataract Disease
1.6 Approach and Scope of Study
Chapter 2 Materials and Methods
2.1 Materials
2.2 Urine Sample Collection
2.3 Equipment and Procedure for HPLC-MS/MS
2.4 Extraction and Normalization of Chromatogram Peak Areas
2.5 Chemometric Analysis
2.6 Statistical Analysis
Chapter 3 Results and Discussion for Chronic Kidney Disease
3.1 Results for Chronic Kidney Disease
3.1.1 Results for Control vs. Chronic Kidney Disease ESI+ Dataset
3.1.2 Results for Control vs. Chronic Kidney Disease ESI- Dataset
3.1.3 Results for Combined ESI+ and ESI- Dataset
3.2 Discussion for Chronic Kidney Disease
3.3 Summary
Chapter 4 Results and Discussion for Cataract Disease
4.1 Results for Cataract Disease
4.1.1 Results for Control vs. Cataract Disease ESI+ Dataset
4.1.2 Results for Control vs. Cataract Disease ESI- Dataset
4.1.3 Results for Combined ESI+ and ESI- Dataset
4.2 Discussion for Cataract Disease
4.3 Summary
Chapter 5 Conclusion and Future Work
References
Summary

This thesis shows how metabolomics and multivariate statistical methods such as Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA) can be used to study and enhance understanding of two diseases. The study utilizes univariate and multivariate statistical techniques to determine the differences in a targeted set of metabolites for healthy controls and two groups of diseased persons. Urine samples were collected from healthy controls and patients suffering from chronic kidney disease (CKD). High performance liquid chromatography-tandem mass spectrometry analysis was performed on each sample, and chromatographic and mass spectrometric data were obtained. After pre-treatment of the data through normalization and scaling, principal component analysis and OPLS-DA were used to visualize the differences in these two classes. Further statistical analysis was employed to determine fluctuations in target metabolites to understand disease pathology, and also to identify potential biomarker candidates for CKD. This same method was also employed for a separate group of patients suffering from cataract disease for further validation. The thesis is then concluded with a summary of the main findings, a discussion on the challenges faced, and suggestions for future work in metabolomic studies of CKD and cataract disease.

List of Tables

Table 1 Description of stages of chronic kidney disease (adapted from [50, 117, 128], originally from [131])
Table 2 Metabolites identified in human urine samples for Controls and CKD patients in ESI+ mode
Table 3 Metabolites identified in human urine samples for Controls and CKD patients in ESI- mode
Table 4 Metabolites identified in human urine samples for Controls and cataract disease patients in ESI+ mode
Table 5 Metabolites identified in human urine samples for Controls and cataract disease patients in ESI- mode

List of Figures

Figure 1 Representative TICs (ESI+) of (A) Control, and (B) Patient with CKD.
Figure 2 (A) PCA scores plot for Control ESI+ data; (B) DModX scores plot for Control ESI+ data; (C) PCA scores plot for CKD ESI+ data.
Figure 3 PCA scores plot for Control and CKD ESI+ dataset
Figure 4 OPLS-DA scores plot for Control against CKD ESI+ dataset
Figure 5 Cross-validation scores plot for Control and CKD ESI+ dataset
Figure 6 Random permutation test scores plot for Control and CKD ESI+ dataset
Figure 7 (A) VIP and (B) Loadings plot for Control-CKD ESI+ dataset. Interval bars denote the jack-knife confidence intervals for each metabolite.
Figure 8 Representative TICs (ESI-) of (A) Control, and (B) Patient with Chronic Kidney Disease.
Figure 9 (A) PCA scores plot for Control ESI- data; (B) DModX scores plot for Control ESI- data; (C) PCA scores plot for CKD ESI- data
Figure 10 PCA scores plot for Control and CKD ESI- dataset
Figure 11 OPLS-DA scores plot for Control against CKD ESI- dataset
Figure 12 Cross-validation scores plot for Control and CKD ESI- dataset
Figure 13 Random permutation test scores plot for Control and CKD ESI- dataset
Figure 14 (A) VIP and (B) Loadings plot for Control-CKD ESI- dataset. Interval bars denote the jack-knife confidence intervals for each metabolite.
Figure 15 (A) PCA scores plot for Control combined ESI+/ESI- data; (B) DModX scores plot for Control combined ESI+/ESI- data; (C) PCA scores plot for CKD combined ESI+/ESI- data
Figure 16 PCA scores plot for Control and CKD combined dataset
Figure 17 OPLS-DA scores plot for Control against CKD combined dataset
Figure 18 Cross-validation scores plot for Control and CKD combined dataset
Figure 19 Random permutation test scores plot for Control and Cataract Disease combined dataset
Figure 20 (A) VIP and (B) Loadings plot for Control and CKD combined dataset. Interval bars denote the jack-knife confidence intervals for each metabolite.
Figure 21 Representative TICs (ESI+) of (A) Healthy Control, and (B) Patient with Cataract Disease
Figure 22 (A) PCA scores plot and (B) DModX scores plot for Cataract Disease ESI+ data
Figure 23 PCA scores plot for Control and Cataract Disease ESI+ dataset
Figure 24 OPLS-DA scores plot for Control against Cataract Disease ESI+ dataset
Figure 25 Cross-validation scores plot for Control and Cataract Disease ESI+ dataset
Figure 26 Random permutation test scores plot for Control and Cataract Disease ESI+ dataset
Figure 27 (A) VIP and (B) Loadings plot for Control-Cataract Disease ESI+ dataset. Interval bars denote the jack-knife confidence intervals for each metabolite.
Figure 28 Representative TICs (ESI-) of (A) Healthy control, and (B) Patient with Cataract Disease
Figure 29 (A) PCA scores plot for Cataract Disease ESI- data; (B) DModX scores plot for Cataract Disease ESI- dataset
Figure 30 PCA scores plot for Control and Cataract Disease ESI- dataset
Figure 31 OPLS-DA scores plot for Control against Cataract Disease ESI- dataset
Figure 32 Cross-validation scores plot for Control and Cataract Disease ESI- dataset
Figure 33 Random permutation test scores plot for Control and Cataract Disease ESI- dataset
Figure 34 (A) VIP and (B) Loadings plot for Control-Cataract Disease ESI- dataset. Interval bars denote the jack-knife confidence intervals for each metabolite.
Figure 35 (A) PCA scores plot and (B) DModX scores plot for Cataract Disease combined ESI+/ESI- data
Figure 36 PCA scores plot for Control and Cataract Disease combined ESI+/ESI- dataset
Figure 37 OPLS-DA scores plot for Control and Cataract Disease combined dataset
Figure 38 Cross-validation scores plot for Control and Cataract Disease combined dataset
Figure 39 Cross-validation scores plot for Control and Cataract Disease combined dataset
Figure 40 (A) VIP and (B) Loadings plot for Control and Cataract disease combined datasets. Interval bars denote the jack-knife confidence intervals for each metabolite.
List of Abbreviations

ANN: Artificial neural network
CKD: Chronic kidney disease
CV: Cross validation
DA: Discriminant analysis
EIC: Extracted ion chromatogram
ESI+/-: Electrospray ionization (positive/ negative mode)
GC: Gas chromatography
GFR: Glomerular filtration rate
HPLC: High performance liquid chromatography
LC: Liquid chromatography
MS: Mass spectrometry
MS/MS: Tandem mass spectrometry
NMR: Nuclear magnetic resonance
OPLS: Orthogonal projections to latent structures/ orthogonal partial least squares
PCA: Principal component analysis
PLS: Partial least squares/ Projection to latent structures
RT: Retention time
SIMCA: Soft independent modelling of class analogy
SPE: Solid-phase extraction
TIC: Total ion chromatogram
UPLC: Ultra Performance Liquid Chromatography
VIP: Variable importance plot/ Variable influence on projection

List of Symbols

D-crit: Critical distance
DModX: Distance to the model in X-space
m/z: Mass-to-charge ratio
Q2X(cum): Cross-validation parameter representing the predictability of the model
Q2Y(cum): Cross-validation parameter showing the cumulative predicted variation in the Y matrix, representing the predictive ability of the model
R2X(cum): Cumulative modelled variation in the X matrix, representing the total explained variance in the model
R2Y(cum): Coefficient of determination of OPLS-DA model, showing the cumulative modelled variation in the Y matrix, and representing the goodness of fit of the model in explaining the variation by the components in the model
t[a]: X-score of component a in the model
tcv: Cross-validated X-score of component a in the model
to: Orthogonal X-scores of (uncorrelated) component in the OPLS-DA model, also representing within class variation
tp: Predictive component in the OPLS-DA model, also representing between-class variation
w[a]: Loading vector of component a
X: Matrix of predictor variables
Y: Matrix of response variables

Chapter 1 Introduction

1.1 Metabolomics

1.1.1 Overview

Metabolomics is the area of study that is
concerned with the metabolome, which comprises small molecule components (of size less than 1 kDa [1]) associated with the biochemical processes of a given organism [2]. Examples of such small molecules include simple sugars, fatty acid amides, and amino acids. The presence and quantity of these metabolites reflect what goes on within and outside the cell. The goal of metabolomics is not only to determine the disease pathology, the role of metabolites in the biochemistry of the organism, and potential biomarkers, but also ultimately to determine the molecular structure of these biomarkers [3]. Overall, metabolomic studies greatly aid our understanding of the biology of an organism at a systems level.

Nicholson et al., in their landmark paper, defined metabolomics as the study of the “dynamic multiparametric metabolic response of living systems to pathophysiological stimuli or genetic modification” [4], helping us to understand how living systems actually work. Specifically, metabolomics becomes useful when organisms are under stress as a result of disease (“pathophysiological stimuli”) or perturbations to the genetic content of the cells (“genetic modification”) [5]. Knowledge of the cellular response under such conditions helps researchers identify potential therapeutic targets. In this manner, therapy does not end with symptomatic treatment that merely addresses the metabolic flux, but has a total cure as its end goal.

Metabolomics, or metabonomics, has been gaining ground in the field of systems-level “omics” research [6], which also includes genomics, transcriptomics, and proteomics. It is a relatively new area of study compared to its sister disciplines [7] and complements these fields [8]. The advantage of metabolomics over its counterparts is that the metabolome is much more closely correlated to the actual cellular response than the genome or proteome [3, 7, 8].
Also, the amount of data generated is smaller because there are fewer metabolites than genes or proteins [2, 7]. In addition, each metabolite may be involved in one or several pathways, contributing to the complex expressed phenotype of the organism. It is this set of downstream biochemical pathways, and not just a single pathway, that metabolomics aims to map out [2, 8]. This allows us to obtain a more timely and accurate understanding of cellular and systemic processes.

1.1.2 Metabolomics in Disease Diagnosis

Metabolomics is increasingly becoming a valuable tool to study disease pathology, and to screen for, diagnose, and determine the effect of treatment on diseases. A wide variety of diseases is being studied, with the analytical methods used varying with the disease and the aim of the study. For example, Jung et al. have successfully used proton NMR (1H-NMR) and targeted metabolic profiling with multivariate analysis to distinguish patients with cerebral infarctions from healthy controls by analysis of urine and plasma [9]. Also, Kim et al. have combined toxicology with metabolomics to determine urinary biomarkers for human gastric cancer using a mouse model [10]. In a recent study, Bao et al. devised a novel method of measuring the systemic effects of various drug treatments on type 2 diabetes mellitus (T2DM) instead of just obtaining the conventional glucose measurement for T2DM [11]. Further, in an attempt to obtain a comprehensive understanding of and to diagnose renal cell carcinoma, Kind et al. have successfully utilized various separation techniques coupled with mass spectrometry and subsequent multivariate analysis to analyze and discriminate patient urine from healthy controls in a small pilot study [3]. Given the increasing number of parameters in metabolomic analysis, there is an even greater need for reliable and informative multivariate techniques to analyse these data.
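The need for multivariate techniques can be illustrated with a toy sketch. The numbers below are made up for illustration and are not taken from any study cited here; the effect size is a simple standardized mean difference, not PLS-DA or OPLS-DA. The sketch only shows that a combination of two variables can separate two classes even when neither variable does so clearly on its own.

```python
from statistics import mean, stdev


def effect_size(group_a, group_b):
    """Absolute difference of the group means in units of the pooled standard deviation."""
    pooled_sd = ((stdev(group_a) ** 2 + stdev(group_b) ** 2) / 2) ** 0.5
    return abs(mean(group_a) - mean(group_b)) / pooled_sd


# Hypothetical intensities for six subjects per class (illustration only).
# Both metabolites vary widely between subjects, but the difference between
# them shifts consistently from one class to the other.
baseline = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
jitter = [0.05, -0.05, 0.02, -0.02, 0.04, -0.04]

control_m1 = list(baseline)
control_m2 = [b + 0.5 + j for b, j in zip(baseline, jitter)]
patient_m1 = [b + 0.2 for b in baseline]
patient_m2 = [b - 0.3 + j for b, j in zip(baseline, jitter)]

# Univariate view: each metabolite considered on its own.
d_m1 = effect_size(control_m1, patient_m1)
d_m2 = effect_size(control_m2, patient_m2)

# Multivariate view: a linear combination (metabolite 2 minus metabolite 1).
control_combo = [m2 - m1 for m1, m2 in zip(control_m1, control_m2)]
patient_combo = [m2 - m1 for m1, m2 in zip(patient_m1, patient_m2)]
d_combo = effect_size(control_combo, patient_combo)

print(f"metabolite 1 alone: {d_m1:.2f}")
print(f"metabolite 2 alone: {d_m2:.2f}")
print(f"combination:        {d_combo:.2f}")
```

In this constructed case the between-subject variation masks the class difference in each single variable, while the combination cancels that shared variation.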
The combination of multivariate statistical tools with metabolomics has been shown to be powerful for disease screening involving non-targeted determinations. One such study of interest is that by Michell et al. In their metabolomic analysis of Parkinson’s disease patient serum and urine samples, they were able to separate female Parkinson’s patients from their age-matched controls using partial least squares discriminant analysis (PLS-DA) based on the urine data, despite not finding strong individual biomarkers responsible for this separation. They surmise that there is a unique metabolic pattern of Parkinson’s disease contributed by certain metabolites [12]. In a separate study, Kemperman et al. observed that while multivariate statistical analysis was able to show discriminatory peptide peaks, univariate analysis failed to show these as discriminatory due to “a very large biological variation among the proteinuric patient group” [13]. These studies show the necessity of multivariate techniques in view of the nature of the samples and data obtained.

1.1.3 Non-targeted and Targeted Approaches in Metabolomics

There are two general approaches to metabolomic studies: non-targeted and targeted. Non-targeted or global profiling approaches in metabolomics aim to capture as many features of an organism’s metabolic profile as possible. This approach allows researchers to obtain a holistic picture of the types and concentrations (relative or absolute) of the metabolites, so that comparisons can be made between study groups in order to determine patterns of changes which are useful for diagnosis [14]. Non-targeted approaches such as metabolic fingerprinting may not identify the specific metabolites involved in disease pathology, but consider the combination of analytes and their concentrations in totality [15].
This approach allows for the “simultaneous analysis of multiple end products”, providing a “more powerful and robust means by which to stratify disease severity, progression and to assess drug efficacy than the analysis of any single marker over a patient population” [16]. For example, Vallejo et al. have used capillary electrophoresis coupled with ultraviolet detection and subsequent metabolic fingerprinting to distinguish between normal rats and diabetic rats on antioxidant treatment [17]. Issaq et al. have also successfully utilized metabolomic profiling with high performance liquid chromatography-mass spectrometry (HPLC-MS) to detect bladder cancer using urine samples in their proof-of-concept study. Their study does not use the traditional techniques, which are either less sensitive towards low-grade tumours (i.e. urine cytology) or more invasive (i.e. cystoscopy) [5]. Novel biomarkers may also be identified in non-targeted approaches, e.g. by structural studies through NMR or tandem mass spectrometry.

Given the knowledge of metabolites and their interactions in specific biochemical pathways, one can also capitalise on targeted approaches to study specific metabolites or groups of metabolites [18] using reference spectra for analysis [19]. The time needed for post-acquisition data processing and identification of metabolites is shorter as well [19]. Metabolite and pathway databases and search engines such as the Human Metabolome Database [20], Kyoto Encyclopaedia of Genes and Genomes database [21, 22] and the METLIN Metabolite Database [23] are useful resources in this area of pathway analysis. Researchers can also make use of targeted analysis to determine how the concentrations of particular metabolites in a system vary with concentration changes of other metabolites. For example, Grison et al. have successfully used targeted profiling to determine a metabolic signature for chronic caesium exposure [24]. Also, Wu et al.
have compared the metabolite profiles of salt-tolerant and salt-intolerant soybean plants and, through multivariate analysis, found that secondary metabolites such as isoflavones and saponins distinguished these two varieties [25]. One limitation of the targeted approach is that, since only known metabolites can be identified and quantified, it is not possible to discover novel compounds as biomarkers [18]. Yet the numerous successes using this approach show that there is a need and use for such targeted studies.

1.1.4 Using Urine for Metabolomic Analysis

While many types of body fluids (biofluids) have been used for metabolomic studies, the choice of biofluid is highly dependent on the disease being studied. The choices of biofluid include blood serum [12, 26-29], plasma [27, 30-32], cerebrospinal fluid [33], urine [3, 5, 7, 12, 29, 34-43], saliva [44, 45], tears [46], and even vitreous humour [15]. Urine has the advantage of being easily obtained in volumes large enough for multiple analyses [1, 18, 47]. It is also one of the least invasive body fluids to collect from patients [10], allowing for multiple collections at different times [18] with minimal discomfort to study subjects [1]. Furthermore, urine is the biofluid through which the majority of metabolic waste products are excreted from most organ systems in the body and therefore can provide much information about the body’s biochemical processes as a whole system [3], since it is not subject to strict homeostatic regulation as is serum [48]. In addition, obtaining or preparing urine samples is usually more straightforward than for other biofluids such as blood [49], serum [18], plasma [1], or tears. Moreover, the concentrations of metabolites are often higher in urine [47], which makes detection and determination easier.
It has also been found that in the study of renal diseases, measurements of kidney function are generally more accurate when using urine measurements than plasma, provided sufficient and accurate volumes of urine samples can be obtained [50]. Metabolic changes that take place at the cellular level are easily reflected in the urine since, other than blood, it is the biofluid to which most of the kidney is exposed [2, 3].

Urine as a biofluid for analysis, however, also has its disadvantages. There may be large variations in terms of volume and therefore the degree of dilution of metabolites [17], resulting in a very wide dynamic range [1] and concentration differences of five-thousand fold or more [17, 51]. These differences represent natural variation, and may be exacerbated under conditions of disease [1]. In addition, as with other body fluids, the concentrations of metabolites may not correspond to their importance in disease pathology [17]. Also, xenobiotics may be present [1], and these may or may not be directly related to the organism’s core metabolism; if so, they may provide valuable information on the varied interactions of the organism with its environment. The analysis method chosen must therefore be able to deal with these problems associated with metabolic studies involving urine, in addition to being reliable and reproducible [14].

Despite these limitations in terms of variation of urine volume, metabolite concentration differences, and the presence of xenobiotics, urine remains one of the preferred biofluids for metabolomic studies. For the current study, the advantages of using urine far outweigh the disadvantages, as will be discussed further in the following sections, and urine is therefore the body fluid chosen for this study of chronic kidney disease (CKD).
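The dilution problem described above is commonly handled by normalizing each sample before comparison. The following is a minimal sketch, assuming total-area normalization, which is only one of several common choices; the normalization actually used in this thesis is described in Section 2.4 and may differ. The metabolite names and peak areas are hypothetical.

```python
import math


def total_area_normalize(peak_areas):
    """Scale one sample's peak areas so that they sum to 1."""
    total = sum(peak_areas.values())
    if total <= 0:
        raise ValueError("sample has no positive signal")
    return {name: area / total for name, area in peak_areas.items()}


# Hypothetical peak areas: the same urine measured concentrated and 10x diluted.
concentrated = {"creatinine": 5000.0, "hippurate": 2500.0, "citrate": 2500.0}
diluted = {name: area / 10 for name, area in concentrated.items()}

norm_conc = total_area_normalize(concentrated)
norm_dil = total_area_normalize(diluted)

# After normalization the two samples have the same relative profile,
# so the difference in urine volume no longer dominates the comparison.
for name in concentrated:
    print(name, round(norm_conc[name], 3), round(norm_dil[name], 3))

# A log transform is a common way to compress a wide dynamic range:
# a five-thousand-fold concentration difference spans under 4 log10 units.
fold_range_log10 = math.log10(5000.0)
print(round(fold_range_log10, 2))
```

Total-area normalization assumes that the summed signal is comparable between subjects; where that assumption is doubtful, other reference quantities are often preferred.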
1.2 Analytical and Separation Techniques in Metabolomics

1.2.1 Nuclear Magnetic Resonance

The expansion of metabolomics research can also be attributed to improvements in technologies that allow sensitive, specific, and reproducible studies to be carried out. Traditionally, NMR was, and still is, a major analysis technique employed for the purposes of mapping the metabolome [18, 52]. The main advantages that NMR affords are its reproducibility [53] over different runs and across different instruments [2] and its ability to detect a wide range of metabolites [19], allowing for the building of compound libraries [46]. Furthermore, sample preparation is usually minimal [2, 19] and non-destructive [8], and analysis times are short as well [8]. NMR is also able to analyse intact tissue through high-resolution magic-angle spinning [54]. In addition, NMR allows for the molecular structures of biomarkers to be discerned in two-dimensional structural studies [55]. It allows researchers to determine metabolite profile patterns through metabolic fingerprinting to classify groups of subjects without actual identification of the molecules involved [53]. For example, Brindle et al. have used 1H-NMR to successfully profile human serum for the accurate diagnosis of coronary heart disease [56], while Keun et al. have successfully used 13C-NMR to investigate urine in metabolomic studies [57]. Further, Kang et al. have also successfully used NMR with orthogonal partial least squares discriminant analysis (OPLS-DA), a multivariate statistical tool, to discriminate between Korean and Chinese herbal medicines [58]. As the number of variables being analysed increases, it is apparent that multivariate tools become necessary in order to obtain a more complete understanding of the systems being studied. These multivariate tools must also allow for a logical and systematic way of handling the information obtained.
It is with this thought in mind that multivariate statistical techniques feature in our study; they are reviewed further later in this chapter.
However, a main drawback of NMR is its inherent lack of analytical sensitivity [2, 8, 46], which results in the inability to detect metabolites which have a concentration lower than 5 µM [18]. Spin-spin coupling also causes complications in data interpretation [19]. Several recent advances in NMR technology include microprobes and miniature probe coils for smaller volumes of sample [53] and cryoprobes for better sensitivity and shorter acquisition times [53, 59]. However, high-throughput profiling does not seem possible if the issues of complicated spectra and difficult compound identification are to be resolved [19]. In addition, the high cost and space requirements of equipment [48] may also mean that not all laboratories will be appropriately equipped.
Therefore, while NMR is a very useful and powerful technology for metabolite profiling, it does not allow for high-throughput studies on metabolites of very low concentrations. In view of these considerations, other analytical methods such as mass spectrometry and chromatographic separation techniques need to be considered.
1.2.2 Mass Spectrometric Techniques in Metabolomics
MS is a useful and often necessary tool for the identification and quantification of metabolites in metabolomics investigations [2] through their molecular fragmentation patterns [18]. It also has a higher sample salt tolerance than NMR and has improved to reach picomole detection levels [2]. In addition, the inherently higher sensitivity of MS compared with NMR means that it is usually used for targeted studies [8]. It must, however, be noted that MS and NMR techniques are complementary, as each has its own limitations and advantages.
For MS, the choice of ionisation technique is very important as it needs to be suitable for the type of metabolites under study.
Commonly used ionization techniques include electron ionization (EI), chemical ionization (CI), and atmospheric ionization. EI is the ionization technique most commonly used with gas chromatography (GC), but it suffers from the lack of molecular ions for some compounds [60]. A common alternative would therefore be CI, which helps to produce the molecular ions for compounds that do not do so with EI [60]. The most commonly used atmospheric ionization technique, electrospray ionization (ESI) [61], is suitable for the metabolomic analysis of urine as the metabolites are usually polar or ionic. The lack of extensive fragmentation also means that molecular ions can be detected with high sensitivity [1]. ESI also has both positive and negative-ion modes, allowing for wider coverage of metabolites [1]. ESI is therefore the choice of ionisation technique for this study.
The choice of mass analysers is also dependent on the type of metabolic analysis being carried out. If “global, untargeted metabolic profiling studies” are to be carried out, high resolution mass analysers such as time-of-flight (TOF) and quadrupole-TOF (Q-TOF) instruments are suitable for resolution of “co-eluting metabolites having the same nominal mass” [1]. On the other hand, if targeted studies on a select group of metabolites are to be carried out, low-resolution mass analysers such as single quadrupole, triple quadrupole, and ion trap instruments are sufficient for detection and quantification of the metabolites being investigated [1]. Triple-quadrupole instruments are usually chosen over single-quadrupole mass analyzers due to the former’s higher sensitivity and ability for selective reaction monitoring [62]. As this study utilises a targeted approach, a triple quadrupole mass analyser would be sufficient for detection and quantification of the metabolites under study.
As the composition of biofluids is highly complex, it is advantageous for metabolomic studies to utilise separation techniques prior to MS analysis. Direct infusion techniques have been used in metabolomics to determine metabolic profiles in both plants [63] and animals with high sensitivity and selectivity [53], but the quality of chromatograms is usually adversely affected by matrix effects [18, 46], incomplete ionization [18], or ion suppression or enhancement [64]. Although direct injection techniques such as desorption ESI and extractive ESI could be used to counter matrix effects encountered in urine analysis [18], coupling MS with an orthogonal separation method allows for a more complete and accurate measurement of the metabolites present [1]. In light of these considerations and the complexity of the samples obtained, it was decided that a separation method prior to ionization was necessary to improve the quality of data obtained.
1.2.3 Separation Techniques in Metabolomics
1.2.3.1 Overview
The advancement of metabolomics has also been largely supported by developments in separation and mass spectrometric technologies. As the number of variables under investigation increases, there is a need to distinguish and analyse each metabolite separately so that a more accurate understanding of the condition being studied can be obtained [2]. Separation technologies include chromatographic methods such as HPLC, gas chromatography (GC), and capillary electrophoresis. Other extraction methods, such as solid phase extraction, have also earned favour in metabolomics [65]. The use of GC and HPLC in metabolomics will be discussed in more detail in this sub-section.
1.2.3.2 Gas Chromatography
GC, specifically capillary GC, has been widely used in the field of metabolomics in conjunction with mass spectrometry due to the reproducible high quality spectra obtained using this method [66]. As Kind et al.
note, GC-MS has been extensively used since the 1970s [3]. Compound libraries for reference can be compiled in-house, obtained from commercial sources, or imported from other external sources such as the National Institute of Standards and Technology [61]. These extensive libraries make GC-MS a tool of choice for identification of metabolites [46].
GC-MS has been widely used in metabolomic studies involving multivariate determination of the diseased state through analysis of urine. Halket et al. have explored a method of determining urinary organic acids using GC-MS with pattern recognition techniques to identify metabolic disorders [67]. Zhang et al. have also successfully used multivariate OPLS-DA modelling to determine 40 differentiating metabolites for osteosarcoma in GC-MS analysis of serum and urine, as well as discern the energy metabolism disruptions through their targeted analysis [29]. More recently, Pasikanti et al. have devised and validated a method where GC-MS was coupled with principal component analysis (PCA) and OPLS-DA to differentiate between genders based on a global metabolomic analysis on urine samples [49]. These studies show the utility of GC-MS coupled with multivariate statistical tools in metabolomic studies.
However, for GC, sample derivatization is necessary to obtain volatile forms of the non-volatile analytes [18]. In the study by Pasikanti et al., sample derivatization using BSTFA and the presence of co-eluting compounds made it difficult to identify the low-abundance metabolites in particular [18]. Sample pre-treatment to remove interfering molecules and to enhance the concentration of desired metabolites is therefore one main limitation associated with techniques such as GC-MS [18] as well as HPLC-MS. Many metabolites in urine are non-volatile and polar or ionic, and tedious sample derivatization is required prior to GC analysis [1] in order to decrease their polarity and increase volatility [66].
Thermal degradation of metabolites may also occur due to the high temperatures utilised in GC [46], further warranting the need for sample derivatization [66]. In the case of urine, urease treatment is necessary to protect the column and enhance the quality of spectra obtained [18, 66]. These preparation steps complicate and lengthen the total amount of time needed for analysis, and unwanted artefacts may also be introduced.
1.2.3.3 High Performance Liquid Chromatography
Like GC, high performance liquid chromatography (HPLC) has also been extensively used in research, as reviewed by Kind et al. [3]. As with GC, HPLC is frequently coupled with MS in analysis. The combination of orthogonal separation techniques improves separation and identification of metabolites [3]. Once thought of as only a potentially powerful tool for metabolomics [68], HPLC-MS has proven to be very useful in this field. For example, Jia et al. have used HPLC-MS to determine the plasma phospholipid profiles of mice with immunoglobulin A nephropathy – which is “the most common form of glomerulonephritis” [69]. They have found that the combination of HPLC-MS with multivariate modelling by PCA and partial least-squares discriminant analysis (PLS-DA) is successful in differentiating healthy mice from their diseased counterparts and identifying relevant biomarkers [69]. The inherent sensitivity, specificity and efficiency of MS [69] coupled with the high peak capacity of HPLC have made it possible to accurately determine large numbers of metabolites in a short length of time [46].
In other works, Plumb et al. have successfully used HPLC-MS to screen rat urine in drug development and detect drug metabolites in biological fluids [31, 70]. Idborg-Björkman et al. have also used HPLC-MS and two-way data analysis to screen for biomarkers in rat urine [71]. Similarly, Yang et al.
have used HPLC-based metabonomics in the diagnosis of liver cancer to decrease the false-positive rate [72], and further built on this work by exploring strategies for HPLC-based metabonomics research [73]. Furthermore, Chen et al. in their targeted analysis of urine metabolites have utilised Rapid Resolution Liquid Chromatography (RRLC) and multivariate analysis, and leveraged metabolic correlation networks to determine potential biomarkers of breast cancer and gain a greater understanding of the interactions between the putative biomarkers [74]. Yin et al. have also used a similar method to study liver cirrhosis and hepatocellular carcinoma [75]. Therefore, HPLC has proven to be a necessary and powerful tool for studies involving disease screening and diagnosis.
In addition, urine, which is the choice of biofluid for this study, is particularly suitable for analysis by reversed-phase HPLC-MS [1]. As mentioned previously, urine contains many dissolved non-volatile analytes with various degrees of polarity. Urine can be injected directly into the column either in a diluted or neat form [14]. Apart from removing particulates and appropriate dilution, there is minimal sample preparation for HPLC-MS analysis of the low molecular weight metabolites [1, 14]. Although compound identification is more difficult than in GC-MS due to the lack of standard reference spectra libraries [61], the use of reference standards coupled with the use of metabolite databases containing reference spectra alleviates this difficulty associated with HPLC-MS.
Therefore, comparing GC-MS and HPLC-MS, it is felt that the use of HPLC-MS would be more advantageous due to the nature of the target metabolites for this study. There is no need for derivatization of analytes in HPLC-MS, unlike GC-MS [76], reducing the likelihood of mistakes introduced in the sample preparation step, and increasing the likelihood of accurately identifying novel compounds [48].
Also, the overall duration for sample analysis and post-processing would be lower as time is not needed for sample derivatization.
1.3 Chemometrics in Metabolomics
1.3.1 Overview
The trend towards having many variables (such as chromatographic and spectroscopic information) describing one observation (each sample) has been fuelled by the advances in the above-mentioned analysis technologies, as well as a need to determine disease pathology not in terms of only one metabolite but as a combination of metabolite responses in various physiological states. Moreover, there is a large dynamic range of metabolite concentrations, and the most important metabolites may not be the most abundant ones [77]. Appropriate multivariate chemometrics tools therefore have to be employed in order to summarise, interpret, and visualise this wealth of data generated from metabolomic experiments [77].
Chemometrics techniques are useful tools to help the researcher understand the acquired data. The underlying principle of chemometrics technologies is this – mathematical operations and transformations are used to determine if there are underlying patterns or trends in multi- or megavariate data, known as latent variables [78]. These latent variables are able to summarise the variation shown in the data as the absolute data obtained are usually highly collinear [78]. In addition, these latent structures may or may not be the variables being measured themselves. Alternatively, a priori information about the sample can be used in the analysis, and the data summarised to show whether the measured variables are correlated with the known information. Multivariate analysis methods are therefore necessary as they can help researchers to determine underlying patterns across large sets of data, and also because disease causation is seldom due to a single metabolite, but is usually multifactorial in origin [2].
Many chemometrics techniques have been used in the field of metabolomics.
These multi- and megavariate techniques can be broadly divided into two categories, namely supervised and unsupervised methods. These include hierarchical classification analysis [79], PCA [80], linear discriminant analysis, PLS methods [81], OPLS [82], soft independent modelling of class analogy [83], support vector machines [84] and artificial neural networks (ANNs) [85]. The tool of choice is largely dependent on two factors: the nature of data obtained, and the nature of information that is needed about the data obtained. Given these tools, the researcher must still make wise decisions based on careful experimental and study designs, and the researcher’s assumptions will also affect the kind of data-processing performed using these chemometrics tools.
1.3.2 Principal Component Analysis
The tool of choice for preliminary visualisation of underlying trends in data is PCA [86]. PCA has been described as the “workhorse of chemometrics” [87]. Indeed, PCA is useful because it not only allows researchers to visualise data in a reduced dimensionality, but also reveals any groupings or clustering in the samples observed, representing similarities in samples [2, 34]. This explains its use in what is known as exploratory data analysis [86], where it gives the researcher an overview of the data [17]. In addition, PCA is also able to show differences in data by calculation of orthogonal components to maximise the variance [5], represented as separation between different groups or clusters along the orthogonal components [2, 5]. Also, outliers are easily recognised in the score plots, and these inform researchers if there is a need to remove these samples [12, 86]. If classification information about the samples is unavailable, PCA can be used to show the trends and patterns that allow us to classify the samples accordingly, and carry out further analysis.
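As an illustration of the projection described above, a minimal sketch of PCA by singular value decomposition is given here. It assumes NumPy is available; the function name `pca` and the toy data matrix are hypothetical and are not taken from this work, and no scaling beyond mean-centring is applied.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA by singular value decomposition of the mean-centred data matrix."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # mean-centre each variable (column)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # sample coordinates (score plot)
    loadings = Vt[:n_components].T                     # variable contributions (loadings plot)
    explained = s[:n_components] ** 2 / np.sum(s ** 2) # fraction of variance per component
    return scores, loadings, explained

# Hypothetical toy data: 6 samples (rows) x 3 variables (columns), two groups of 3.
X = np.array([
    [1.0, 2.0, 1.0],
    [1.2, 1.9, 1.1],
    [0.9, 2.1, 0.9],
    [5.0, 6.0, 2.0],
    [5.1, 5.8, 2.1],
    [4.9, 6.2, 1.9],
])
scores, loadings, explained = pca(X)
```

In this toy case the two groups fall on opposite sides of the first component, which is the kind of clustering a score plot is used to reveal; no class labels were supplied to the calculation.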
Examination of the loadings plots will usually reveal the variable or variables most responsible for the groupings observed [48]. PCA is thus an unsupervised technique: unlike supervised models, it does not use class information in data analysis. PCA can also show whether such between-class variations are significant enough to outweigh the within-class variation.
However, as stated, PCA is but a preliminary visualisation method. Being an unsupervised technique, it tends not to be able to separate samples when there is large chemical noise or other sources of variation which may be irrelevant and distracting, such as instrumental drift and artefacts [3]. There may also be cases where the intra- or inter-subject variation is too large for clustering to take place. There is therefore a need for supervised chemometric tools which can help researchers focus on the relevant sources of variation that are being studied [77].
1.3.3 Partial Least Squares/ Projection to Latent Structures
PLS, on the other hand, is an example of a supervised classification tool which utilises known class information in data analysis. PLS is a powerful form of multivariate analysis as it is able to handle data which are “strongly collinear, noisy, and [contain] numerous X-variables” [88]. For metabolomics studies, there are often two or more groups of samples being studied – these could be control and diseased, different phases of disease, or different types of treatment. These dependent variables are collectively termed the Y matrix, and may be discrete or continuous [77]. Discriminant analysis may also be applied where the Y variables consist of variables denoting group belonging. PLS when combined with discriminant analysis (PLS-DA) maximises class separation and builds prediction models based on this information given [2, 17].
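A minimal sketch of how PLS-DA can be set up for two classes is given below. It assumes NumPy, a single 0/1 dummy response for class membership, and a NIPALS-style PLS1 fit; the function names, the 0.5 decision threshold and the toy data are illustrative assumptions and do not describe the software used in this work.

```python
import numpy as np

def pls1_fit(X, y, n_components=1):
    """Fit a PLS1 regression (single response) with a NIPALS-style loop."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean               # centred X and y residuals
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w /= np.linalg.norm(w)                  # weight: direction of max covariance with y
        t = E @ w                               # scores
        tt = t @ t
        p = E.T @ t / tt                        # X loadings
        c = f @ t / tt                          # y loading
        E = E - np.outer(t, p)                  # deflate X
        f = f - c * t                           # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    B = W @ np.linalg.solve(P.T @ W, q)         # regression coefficients
    return B, x_mean, y_mean

def pls_da_predict(X, B, x_mean, y_mean, threshold=0.5):
    """Assign class 1 when the predicted dummy response exceeds the threshold."""
    y_hat = (np.asarray(X, dtype=float) - x_mean) @ B + y_mean
    return (y_hat > threshold).astype(int)

# Hypothetical toy data: 3 control samples (class 0) and 3 diseased samples (class 1).
X = np.array([[1.0, 2.0, 1.0], [1.2, 1.9, 1.1], [0.9, 2.1, 0.9],
              [5.0, 6.0, 2.0], [5.1, 5.8, 2.1], [4.9, 6.2, 1.9]])
y = np.array([0, 0, 0, 1, 1, 1])                # dummy variable denoting group belonging
B, x_mean, y_mean = pls1_fit(X, y)
predicted = pls_da_predict(X, B, x_mean, y_mean)
```

In practice the number of components and the predictive ability would be assessed by cross-validation rather than by refitting the training samples as done here.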
1.3.4 Orthogonal Partial Least Squares Discriminant Analysis
OPLS is also a supervised classification tool that is a relatively new extension of PLS [82]. Where class information is known about the samples, OPLS becomes a very powerful dimension-reduction and visualisation tool. It affords better interpretability and transparency compared to PLS, as the PLS model is rotated [89] so that the variation in the data is separated into two components – those that are related to the Y matrix (and therefore class separation [77]), and those that are unrelated (orthogonal) to the Y matrix [77, 90-92]. Such separation of components is important, as it allows researchers to understand the main causes of variation that separate these two classes of samples by relating it with known variables [34].
Like PLS, OPLS can be used in conjunction with discriminant analysis in metabonomics studies [93]. In this case, the values in the descriptive Y matrix are dummy variables used solely to assign class belonging [94]. OPLS-DA therefore increases class separation, interpretation, and identification of the metabolite information [34, 91]. For example, Whelehan et al. have used it to detect ovarian cancer through analysis of the proteomic profiles of 191 subjects [91], while Qiu et al. have used it to diagnose human colorectal cancer [26]. These improvements therefore allow OPLS-DA to be used as a chemometric tool for disease diagnosis.
For our investigation, we have chosen to use PCA and OPLS-DA as they are linear methods, and produce models which are more easily interpreted than those of non-linear methods such as ANNs [95, 96]. Also, as stated by Wiklund et al., “multivariate models such as PLS and OPLS include both statistical significance based on cross-validation and confidence intervals based on jack-knifing estimations as well as magnitude and reliability of the data provided by good visualization” [77].
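The separation of Y-related and Y-orthogonal variation described in this section can be sketched as follows. This is one common single-response formulation of the orthogonal filtering step, written with NumPy under the assumption of one orthogonal component; the function name `opls_filter` and the toy data are hypothetical, and the sketch is not a description of any particular software package.

```python
import numpy as np

def opls_filter(X, y):
    """Remove one Y-orthogonal component from centred X (single response)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    w /= np.linalg.norm(w)                       # predictive weight vector
    t = Xc @ w                                   # predictive scores
    p = Xc.T @ t / (t @ t)                       # predictive loadings
    w_orth = p - (w @ p) * w                     # part of the loading not aligned with w
    w_orth /= np.linalg.norm(w_orth)
    t_orth = Xc @ w_orth                         # orthogonal scores (uncorrelated with y)
    p_orth = Xc.T @ t_orth / (t_orth @ t_orth)   # orthogonal loadings
    X_filtered = Xc - np.outer(t_orth, p_orth)   # X with the orthogonal variation removed
    return X_filtered, t_orth, p_orth, yc

# Hypothetical toy data: variable 1 carries the class difference, variable 2 carries
# within-class variation unrelated to class, and variable 3 is a mixture of both.
X = np.array([[1.0, -2.0, -0.50], [1.2, 0.0, 0.60], [0.9, 2.0, 1.45],
              [5.0, -2.0, 1.50], [5.1, 0.0, 2.55], [4.9, 2.0, 3.45]])
y = np.array([0, 0, 0, 1, 1, 1])                 # dummy variable for class belonging
X_filtered, t_orth, p_orth, yc = opls_filter(X, y)
```

A PLS model fitted to `X_filtered` then needs to describe only the class-related variation, which is the source of the improved interpretability noted above.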
These multivariate models are more powerful than univariate tests [77] as the latter do not show how groups of biomarkers are more powerful than the individual biomarkers themselves [97].
1.3.5 Pre-treatment of Data for Chemometric Analysis
The data that are used in such pattern recognition techniques must be properly processed prior to the use of these techniques, otherwise spurious correlations and patterns might be mistakenly identified. For example, in the case of HPLC-MS data, peak retention times between different runs must be aligned in the pre-processing analysis as retention times and baselines may deviate from run to run [46]. Also, data reduction in the form of peak-picking is necessary such that only the true analytical peaks remain, reducing the noise in the spectra and correlations to structures unrelated to class information [62].
Furthermore, it is imperative for the researcher to understand clearly the nature of the data obtained so that the appropriate peak alignment tools and scaling methods can be employed. Projection-based multivariate methods are sensitive to the scaling of the data [77, 92], which is in turn dependent on the data acquired for all samples. Scaling methods that could be used include auto-, mean-centred, pareto, and level scaling [17]. Weiss and Kim in their review of metabolomics in kidney disease also lend support that metabolomic data need to be suitably scaled and transformed to attain symmetry and normality [2].
In metabolomics, normalization is carried out in order to reduce systematic errors in the data so that biologically significant changes in metabolite concentrations may be discovered [98, 99]. Normalization also helps to make the data obtained across samples comparable in size, such as correcting for urine dilution effects [100].
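For reference, the four scaling methods named earlier in this section (auto, mean-centred, pareto and level scaling) can be sketched per variable as below. The definitions used are the commonly cited ones (division of the centred values by the standard deviation, nothing, the square root of the standard deviation, and the mean, respectively); the function name and the example values are hypothetical, and the sample standard deviation (n - 1) is an assumption.

```python
from math import sqrt
from statistics import mean, stdev

def scale_variable(values, method="auto"):
    """Scale one variable (one column of the data table) across all samples."""
    m = mean(values)
    s = stdev(values)                      # sample standard deviation (n - 1)
    centred = [v - m for v in values]
    if method == "centre":                 # mean-centring only
        return centred
    if method == "auto":                   # unit-variance scaling
        return [c / s for c in centred]
    if method == "pareto":                 # divide by sqrt of the standard deviation
        return [c / sqrt(s) for c in centred]
    if method == "level":                  # divide by the mean
        return [c / m for c in centred]
    raise ValueError(f"unknown scaling method: {method}")

# Hypothetical peak areas of one metabolite in three samples.
peak_areas = [2.0, 4.0, 6.0]
auto_scaled = scale_variable(peak_areas, "auto")
pareto_scaled = scale_variable(peak_areas, "pareto")
level_scaled = scale_variable(peak_areas, "level")
```

Auto scaling gives every variable equal weight, whereas pareto scaling shrinks large peaks less aggressively and so retains more of the original data structure.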
There are generally six different methods of normalization for mass spectra: (1) with reference to mean, (2) with reference to median, (3) linear rescaling according to the largest and smallest values, (4) with reference to total ion count, (5) with reference to the peak of maximum intensity, and (6) with reference to an internal standard ([101, 102], cited in [103]). For studies using urine as the biofluid, there are also the options of normalization to total urine volume, urine osmolality, range of peak intensities, or to an endogenous compound such as creatinine [17, 18, 99].
Depending on the nature of the dataset, the choice of normalization method is important as it also affects the identity and ranking of biomarkers found [103]. It is known that urine samples have a wide dynamic range in terms of the metabolite concentrations. Therefore, even though it has been found that there is usually a high level of consistency among methods in terms of the important compounds identified [103], there is still a need to choose the best method of normalization.
Normalization to total urine volume is usually not ideal as there are several limitations in using this as the reference. This method tends to introduce errors such as those due to inaccurate sample collection by controls and patients, whether for 24-hour collections (especially for children [104]) or for spot collections. It is also known that urine volume can be affected by hydration level ([105, 106], cited in [34]). Also, as the concentrations of metabolites change with the total urine volume, normalization to total urine volume may result in spurious correlations between metabolite concentrations and disease state. Therefore, normalization to urine volume is not usually the choice of reference as it varies too widely for meaningful comparisons to be made, whether in an intra- or inter-subject manner [99].
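The six reference-based normalization methods listed at the start of this discussion can be sketched for a single sample as follows. The sketch assumes that each method divides the peak intensities by its reference value (or rescales linearly in the case of method 3); the function name, method labels and example intensities are hypothetical.

```python
from statistics import mean, median

def normalize(intensities, method="total", internal_standard=None):
    """Normalize one sample's peak intensities to a chosen reference."""
    if method == "minmax":                          # (3) linear rescaling to [0, 1]
        lo, hi = min(intensities), max(intensities)
        return [(v - lo) / (hi - lo) for v in intensities]
    references = {
        "mean": mean,                               # (1) mean intensity
        "median": median,                           # (2) median intensity
        "total": sum,                               # (4) total ion count / total signal
        "max": max,                                 # (5) peak of maximum intensity
    }
    if method == "internal_standard":               # (6) intensity of a spiked standard
        if internal_standard is None:
            raise ValueError("internal_standard intensity is required")
        ref = internal_standard
    elif method in references:
        ref = references[method](intensities)
    else:
        raise ValueError(f"unknown normalization method: {method}")
    return [v / ref for v in intensities]

# Hypothetical example: the same urine sample measured concentrated and two-fold diluted.
concentrated = [10.0, 30.0, 60.0]
dilute = [5.0, 15.0, 30.0]
norm_concentrated = normalize(concentrated, "total")
norm_dilute = normalize(dilute, "total")
```

After normalization to the total signal, the concentrated and diluted measurements give the same relative profile, which is the dilution correction that normalization is intended to provide.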
In addition, it is not recommended to use a single compound such as creatinine as the reference [13], since it may vary widely across individuals, especially in cases of kidney disease [18, 47]. Although urinary creatinine excretion for each person is relatively stable, it is not a good reference as many factors can affect its concentration in urine. Significant changes in urine creatinine within a so-called healthy population have been found in the study conducted by Saude et al. [47]. It has also been shown that creatinine fold change among the wider population can be highly varied as well [104]. Further, it has been found that urinary excretion of creatinine is affected especially in kidney disease due to its degradation in the body [18, 50]. These findings also lend support to the choice of not using urinary creatinine as the internal reference for the normalization of metabolite concentrations.
As the correct choice of normalization can improve differentiation between study groups, it is felt that normalization to a form of total ion count is appropriate. This was observed in a study by Warrack et al. [99], which showed improved discrimination between dose groups. They recommend that urine samples be normalized to both the ‘mass spectrometry total useful signal’ (MSTUS) as well as osmolality. In their study, it was found that normalizing to total urine volume or to creatinine levels actually caused the group separation to become unclear, hence the recommendation to use osmolality and the MSTUS instead [99]. Therefore, the current work in normalizing to the 40 targeted metabolites and four m/z regions, instead of normalizing to creatinine or urine volume, finds its support here as well.
1.4 Chronic Kidney Disease
1.4.1 Overview of Chronic Kidney Disease
CKD is a “life-threatening condition characterized by progressive and irreversible loss of renal function” [107].
The term itself is non-generic, and represents the declining kidney function which arises from various diseases [108]. In terms of pathophysiology, kidney disease manifests itself in changes in the “glomerular filtration rate (GFR), glomerular permeability, tubular function, tubular damage, urinary reflux, obstruction to urinary flow, and deposition of collagen” [109]. According to Eknoyan et al., it is a disease which affects 5-10% of the world population [110]. Sabanayagam has also found that Singapore has a relatively high prevalence of CKD in its population compared to the rest of the world, with a prevalence of 10.8% ([111], cited in [112]). Despite its high occurrence, there is a general lack of awareness of the prevalence of CKD [113]. This underscores the need for an investigation into possible methods of early detection based on local subjects.
In addition, the disease itself is also asymptomatic in its early stages, possibly until only 25% of GFR remains [109], making it difficult to treat as patients’ conditions worsen [114]. Symptoms that may appear are also generic and may not be a cause for alarm until quite late into the progression of the disease [115]. Also, the rate of progression of disease can be unpredictable as there is no one ‘standard’ rate at which the disease progresses [50, 108, 109], and the rate of progression may also be dependent on the individual, the underlying disease and other risk factors [50]. There is therefore a need to understand the mechanisms of CKD so that earlier diagnosis and effective treatments can be carried out.
Several risk factors that increase the incidence of CKD include existing medical conditions such as hypertension and diabetes, poor choice of lifestyle such as the use of tobacco, familial history of CKD, increased age, and pre-natal conditions such as low birth weight [116].
In the United States, approximately a quarter of non-diabetic chronic failure cases are caused by hypertension alone [117]. It has also been found that Asians with a family history of kidney disease are at a high risk of contracting non-diabetic kidney disease as well [117]. In Singapore, there is an increasing rate of new patients with end stage renal disease (ESRD), which is when renal transplants are required for survival [118]. While there were 194 cases per million population in 1999, the number rose to 254 per million population in 2009 [118]. There has also been an increasing trend of diabetes-related ESRD, with 58.8% of new ESRD cases being diabetics [118]. There is therefore a need for methods to detect the early onset of CKD before ESRD sets in.
Besides diabetes, CKD has been found to be positively associated with other debilitating or life-threatening conditions such as cardiovascular disease, hypertension and other vascular diseases [116]. It has also been found that inflammation, oxidative stress, insulin resistance and endothelial dysfunction increase as the disease progresses, even from the early stages [119]. In addition, when kidney function decreases, persons may suffer from other physiological disturbances as metabolites, toxins, water, electrolytes, acid-base balance and endocrine function deviate from the norm [120]. Excess retention of a number of solutes results in the uremic syndrome, which is a state of increased oxidative stress [119] and a poor prognosis indicator. Also, as the disease progresses, many bodily systems have to compensate for the accumulation of metabolites within the blood, and the patient has to make adjustments to his lifestyle, especially his diet, as his kidney function decreases [119]. Clearly, many metabolic pathways are disrupted in kidney disease patients [119], and there is a need for successful early screening before the onset of more serious conditions.
It has also been found that most types of CKD lead to a common phenotype [50, 107], and if untreated lead to chronic renal failure (CRF) [120-123]. CRF has already reached epidemic levels [120, 124-127]. Current treatments are ineffective as the mechanism of progression is not totally clear [108, 120], and these treatments only address symptomatic issues to delay the onset of CRF. Furthermore, when CKD has progressed to a substantial degree, the underlying causes may be obscured such that it is difficult to determine the root cause of CKD [115]. These make it even more important to diagnose CKD in its early stages.
In the study of CKD, urine is arguably one of the best choices of body fluids to be used [2, 7] as the kidneys are part of the urinary tract [7]. Hence, the pathophysiological disturbances to the kidney in CKD would be closely and clearly reflected in the metabolite composition of the urine. Other system-wide biological effects will also be reflected in the urine, as it is not a homeostatically controlled biofluid [1]. Urine is therefore the choice of biofluid for this current CKD study.
1.4.2 Diagnosis of Chronic Kidney Disease
Diagnosis of CKD is accomplished by detection of abnormally high levels of urinary protein, abnormal urinary sediments, abnormal results from imaging tests or biopsies, and a measurement of the GFR [117]. Measurements of GFR can be made through determining the levels of endogenous markers that have a high correlation with the progression of the disease, such as serum and urinary creatinine, blood urea nitrogen and urate [109, 128]. The serum level of a low molecular weight endogenous protein, cystatin C, has been found to be a more accurate early indicator of kidney dysfunction than that of serum creatinine [115, 129].
Alternatively, exogenous markers such as 51Cr ethylenediamine tetra-acetic acid, iohexol, inulin, and iothalamate can be used as well [109, 117, 128, 130]. Stages of the disease are assigned based on the GFR, regardless of the aetiology.
It has been advised that early diagnosis and management of CKD (even before symptoms appear) are important factors towards a better clinical outcome [113, 114, 131, 132]. Indeed, an accurate determination of the GFR is crucial as it best represents the remaining amount of kidney function present in a person [50]. A diagnosis of CKD is made when the GFR falls below 60 mL/min/1.73 m² body surface area for three or more months, with or without kidney damage [131]. The five stages of progression of CKD are summarised in Table 1. As the GFR decreases, the degree of severity of CKD correspondingly increases [50]. When a patient reaches stage 5, renal replacement therapy is necessary to sustain life, and preparation for therapy commences usually in stage 4 [131].
Table 1 Description of stages of chronic kidney disease (adapted from [50, 117, 128], originally from [131])
Stage   Description                                   GFR (mL/min/1.73 m²)
1       Kidney damage with normal or increased GFR    ≥ 90
2       Kidney damage with mild decrease in GFR       60-89
3       Moderate decrease in GFR                      30-59
4       Severe decrease in GFR                        15-29
5       Kidney failure                                < 15 or on dialysis
Albuminuria or proteinuria is often an indicator of both diabetic and non-diabetic kidney disease [114, 115, 117, 131, 133, 134]. Screening for CKD can therefore be done in the form of a simple urine dipstick test, which detects albuminuria of 300 mg/L [117]. It can also be done by measuring the urinary albumin-to-creatinine ratio (ACR) [117]. If, over the course of three months, the urine dipstick is positive, or if the urine ACR is increased, it means that the person is suffering from CKD ([135], cited in [117]).
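The staging in Table 1 can be read as a simple lookup on the GFR, sketched below. The function name and parameters are hypothetical; the sketch assumes that stages 1 and 2 are assigned only when there is evidence of kidney damage, treats the stage boundaries as continuous (for example, a GFR of 89.5 falls in stage 2), and does not model the three-month chronicity criterion.

```python
def ckd_stage(gfr, kidney_damage=False, on_dialysis=False):
    """Return the CKD stage (1-5) for a GFR in mL/min/1.73 m², or None if no CKD."""
    if gfr < 0:
        raise ValueError("GFR cannot be negative")
    if on_dialysis or gfr < 15:
        return 5                     # kidney failure
    if gfr < 30:
        return 4                     # severe decrease in GFR
    if gfr < 60:
        return 3                     # moderate decrease in GFR
    if not kidney_damage:
        return None                  # GFR >= 60 without kidney damage: not staged
    return 2 if gfr < 90 else 1      # kidney damage with mild decrease / normal GFR

# Hypothetical examples.
stage_a = ckd_stage(95, kidney_damage=True)
stage_b = ckd_stage(45)
stage_c = ckd_stage(75)
```

Here `stage_a` is 1, `stage_b` is 3 and `stage_c` is `None`, since a GFR of 75 without evidence of kidney damage does not meet the definition of CKD given above.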
However, one common limitation associated with these tests (and with screening tests in general) is that patients do not usually undergo routine screening unless they are known to be at risk or symptoms have started to appear. In the case of CKD, the appearance of symptoms usually means that the disease has progressed to a rather severe stage, which usually necessitates immediate treatment.

As mentioned above, early diagnosis and determination of the rate of progression of CKD promote better clinical outcomes. It is with this in mind that many researchers endeavour to develop models which are able to predict and diagnose the occurrence of CKD as well as determine its rate of progression. The Cockcroft-Gault equation [136], the Modification of Diet in Renal Disease formula [50] and the Brochner-Mortensen equation for children [130] have all been used to help determine the GFR in CKD patients so that an accurate diagnosis can be made and the most effective treatment plan subsequently implemented. Other predictive models of CKD progression include that by Soares et al. for children [108] and those by Chonchol et al. and Madero et al., which use uric acid levels [137, 138]. In particular, Soares et al. found that two variables, a GFR lower than 30 mL/min and severe proteinuria, were indicators of poor clinical outcomes for children with chronic renal insufficiency [108]. However, it must be noted that all such equations are derived from select populations and need to be validated prior to use in a new population; alternatively, new equations must be derived if the current ones cannot be modified to suit the population under study [50].

Several biomarkers associated with CKD have been found by various research groups.
These include oxidative stress, insulin resistance, hyperlipidemia, hyperuricemia, proteinuria, anemia, nitric oxide synthase (NOS)/asymmetric dimethylarginine (ADMA), aldosterone, transforming growth factor β (TGF-β), and sympathetic nervous system activation (as cited in [50]). More recently, determination of the levels of C-reactive protein (CRP) in plasma has been found to be useful as well [139]. Some groups have also studied combinations of biomarkers: Peralta et al. found that measuring creatinine, cystatin C, and the urinary ACR together determines the presence of CKD more accurately than any one of the markers alone [140].

Lederer and Ouseph note that CKD at its onset may not be diagnosed, especially in older patients or patients who are chronically ill [116]. Metabonomics with multivariate modelling therefore aims to screen for potential patients based on the patterns in their urinary metabolite profiles, potentially avoiding biopsies, which are highly invasive and non-routine.

1.4.3 Metabolomics and Chemometrics for Chronic Kidney Disease

As noted by Weiss and Kim, “the real, most tangible and immediate future goal of the use of metabolomics in kidney disease, as with other renal biomarker research, rests with its ability to predict disease occurrence before either phenotypic changes or evidence of disease detected using standard laboratory assays” [2]. Indeed, the ultimate goal of using more sensitive instruments and more sophisticated statistical techniques is to improve the detection and diagnosis of diseases in their early stages.

Although it has not been common practice to use metabolomics and chemometrics in the study of kidney diseases [2], there are existing studies using these methods. Jia et al. have successfully utilised UPLC-QTOF-MS with multivariate pattern recognition techniques in a non-targeted serum metabonomics study to classify controls and chronic renal failure patients [120].
To our knowledge, while there have been many studies on the analysis of urine from CKD patients, there has thus far been no comprehensive study on the targeted metabolomic analysis of urine using HPLC-MS/MS and multivariate modelling in screening for CKD.

1.5 Cataract Disease

1.5.1 Overview of Cataract Disease

Cataract disease is a condition whereby visual acuity is reduced due to the lens of the eye turning opaque [141] as a result of malfunctioning lens metabolism [142]. In the 2010 report by the World Health Organisation Prevention of Blindness and Deafness Programme, Pascolini and Mariotti estimate that cataract disease is the second major cause of visual impairment (33%) and the main cause of blindness worldwide (51%) [143]. It is also “a major contributing cause of low vision, blindness, and low visual function scores” [144], after glaucoma. Cataract is known to be highly prevalent in Singapore [144], and its incidence is likely to continue increasing due to the aging population [145]. In addition, the incidence of cataract among the younger population is increasing as well [146]. There is therefore a need for more rapid means of early detection of cataract disease.

In addition, cataracts tend to develop slowly and painlessly, and may not be detectable until vision is noticeably affected [141, 147], partly because their signs and symptoms are shared with other conditions [147]. The causes of cataract disease are varied, including congenital conditions due to infections and systemic malfunctions, adult cataracts due to aging, systemic diseases such as diabetes mellitus, local eye disease such as uveitis, trauma, or even unknown causes [148]. The type and route of formation of cataracts also differ, and include nuclear, anterior or posterior subcapsular, and cortical cataracts [148]. This underscores the need for better understanding, early screening and treatment of cataract disease.
1.5.2 Diagnosis of Cataract Disease

Currently, diagnosis of cataract disease has to be done by an optometrist or ophthalmologist through a visual acuity test and a slit-lamp eye examination. For the latter, the use of pupil-dilating eye drops is necessary [147]. In addition, tonometry, which involves testing the intra-ocular pressure of the eyeball, may be done for a complete eye examination. The whole procedure is painless, not very invasive, and takes about 45 minutes to an hour.

There has been some debate on the best method of staging cataract progression, though common staging methods include the Lens Opacities Classification System III and the Oxford Clinical Cataract Classification and Grading System, which are based on slit-lamp examinations and comparison against reference photographs [149]. However, across the vast array of methods available there is still a certain degree of subjectivity, and it has been argued that it would be most useful clinically to test the visual function of patients along with these objective measurements in deciding whether surgery is necessary [149]. It would therefore be useful to have a more objective way of screening for and staging the progression of cataract disease before visual function is greatly reduced.

1.5.3 Metabolomics and Chemometrics for Cataract Disease

Metabolomics or metabonomics has only very recently been applied to the study of the eye and its diseases [8]. One advantage of using metabolomics on body fluids is that the need for extracting eye tissue samples is reduced, which is significant as it is difficult to obtain such samples from the eye [8]. Chen et al. have recently performed global metabolite profiling of human tears through HPLC-MS/MS, and have identified 60 metabolites in normal human tear fluid [46].
Tear fluid is known to be challenging to analyse, not only because of the small volume available [150], but also because of the wide dynamic range of molecules in this complex mixture [151]. The unprecedented work by Chen et al. shows that metabolomics is indeed versatile in helping researchers and clinicians understand the expressed phenotype of an organism through its various organs and systems.

As far as eye diseases are concerned, it is possible to use body fluids other than those directly associated with the eye itself to diagnose and screen for patients suffering from a particular disease. The body does not consist of isolated systems which do not interact; it is an integrated whole in which all the different bodily systems work together for the proper functioning of the organism. Hammond et al. have recently reported their findings on the non-targeted metabolomic analysis of age-related nuclear cataract through GC and HPLC-MS/MS analysis of plasma [32]. They report that cortical cataract is strongly correlated with 3-methoxytyrosine, while nuclear cataract is strongly correlated with laurate, 4-ethylphenyl sulphate and malate, reflecting the change in metabolic pathways in the pathophysiology of cataract [32]. Their work shows that it is possible to study diseases of the eye using body fluids other than those directly in contact with the eye.

1.6 Approach and Scope of Study

Based on the above, we have decided to use ESI with rapid polarity switching coupled with a triple quadrupole for the analysis of our forty targeted metabolites. The speed, sensitivity, accuracy, range, ease of use, and medium throughput make this combination suitable for targeted analysis of metabolites in urine [14]. The multivariate HPLC-MS metabolomic data will then be analysed and visualised using the chemometric tools PCA and OPLS-DA.
The approach of this study is as follows. Urine samples from healthy controls and from patients suffering from CKD are obtained from an existing local cohort study. HPLC-MS/MS analysis is performed on each sample, and chromatographic and mass spectrometric data are obtained. After pre-treatment of the data, PCA and OPLS-DA are used to visualise the differences between these two classes. Further statistical analysis is also employed to determine fluctuations in target metabolites in order to understand disease pathology and to identify potential biomarker candidates for CKD.

In order to show the applicability of this method for discriminating between control and diseased groups using urine as the biofluid for analysis, it is also applied to a separate local cohort of patients suffering from cataract disease. These urine samples were similarly obtained from an existing local cohort study.

Chapter 2 Materials and Methods

2.1 Materials

HPLC grade methanol and acetonitrile were obtained from APS (Blacktown, NSW, Australia). Pure water was obtained through a Millipore Milli-Q water system (Bedford, MA, USA). Formic acid was obtained from Merck (Darmstadt, Germany).

2.2 Urine Sample Collection

Anonymised urine samples were obtained from the National University Hospital, Singapore. Thirty samples each from healthy controls, patients with chronic kidney disease (CKD), and patients with cataract disease were acquired for this study. Metadata such as physiological and demographic information on the healthy and diseased subjects were not released for this study. All urine samples were stored at -20 °C until analysis.

2.3 Equipment and Procedure for HPLC-MS/MS

The liquid chromatography tandem mass spectrometry (HPLC-MS/MS) method used follows that published by Law et al. [34, 38]. For the analysis, 30 µL of each urine sample was diluted to 70 µL with deionised water.
The LC system used was an Agilent 1200 RRLC system (Waldbronn, Germany) with a binary gradient pump, autosampler, column oven and diode-array detector. An Agilent 6410 triple quadrupole mass spectrometer was coupled to this system. Gradient elution was performed using mobile phase (A) 0.1% formic acid in water and (B) 0.1% formic acid in acetonitrile. The gradient profile used was 5% (B) at 0 min, 100% (B) in 10 min, and reverting to initial conditions in 5 min, using a flow rate of 200 µL/min and an oven temperature of 50 °C. For all analyses, 5 µL of sample was injected. A reversed-phase Zorbax SB-C18 column, 50 × 2.0 mm, 1.8 µm (Agilent Technologies, USA), was used for LC separation.

Mass spectra were collected in both positive and negative electrospray ionisation (ESI+ and ESI-) modes, with a product ion m/z range of 100 to 800. Capillary temperature was set at 350 °C. Drying gas flow rate was 10 L/min, and nebulizer nitrogen gas pressure was 50 psi. Targets were set at 2 × 10⁷ ions using automatic gain control. ESI voltage was set at 4.5 kV, capillary voltage at 10 V, and lens tube offset at 0 V.

2.4 Extraction and Normalization of Chromatogram Peak Areas

To facilitate the chemometric analysis, retention time (RT), mass (m/z) and peak area had to be extracted from the total ion chromatograms (TICs). This was performed using the Agilent MassHunter Workstation Software Qualitative Analysis programme (Version B01.02, Build 1.2.122.1, Agilent Technologies, Inc., USA). Extracted ion chromatograms (EICs) of m/z range 100-110, 200-210, 300-310, and 400-410 were obtained from the TICs. The resultant EICs were subjected to manual peak detection: chromatograms were data-reduced by separating true analytical peaks from noise peaks, and baseline-to-baseline integration was performed on the true peaks. The retention time of each peak was taken at the peak top.
Retention time-m/z pairs (RTm/z) of peak areas were tabulated in a Microsoft Excel® spreadsheet and manually aligned with reference to the original chromatograms. The resultant pre-processed data formed a three-dimensional (3D) data table of retention times, m/z ratios, and peak intensities.

In addition, the relative concentrations of 40 metabolites were determined for targeted analysis. The retention times and m/z values were determined in previous work by Law et al. according to their current HPLC-MS method [34, 38]. These 3D data were then tabulated in Microsoft Excel for every sample analysed.

Prior to further analysis, peak areas were normalised within each sample to remove spurious correlations resulting from changes in urine volume [99].

2.5 Chemometric Analysis

Normalised peak areas were then subjected to chemometric analysis using the SIMCA P+ software (Version 12.0, Umetrics, Umeå, Sweden) as described in the following paragraphs.

The normalised peak intensities were square-rooted, mean-centred, and univariate scaled. All datasets were preliminarily screened using PCA to ensure that there were no major outliers in the samples prior to further chemometric analysis. Using the SIMCA P+ software, PCA models containing two to four orthogonal components were obtained using the ‘Autofit’ setting.

For multivariate analysis by OPLS-DA, normalised data were also square-rooted, mean-centred and univariate scaled. Data were reduced to one predictive component, t, and two or three non-predictive orthogonal components, t_o, using the software’s ‘Autofit’ setting. Scores plots were obtained for t_o against t.

The R2X(cum) value obtained in the modelling process is a measure of the amount of X variation explained by the model. In addition, measures of fit to the Y data (R2Y(cum)) and of predictive power (Q2Y(cum)) were obtained for each model.
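The pre-treatment steps described in Sections 2.4 and 2.5 can be sketched outside SIMCA P+ as follows. This is an illustration under stated assumptions, not the software's implementation: the function names are hypothetical, the within-sample normalisation is assumed to be division by the total peak area of that sample (the text does not specify the method), and 'univariate scaled' is assumed to mean scaling each variable to unit variance.

```python
import math


def normalise_to_total(sample):
    """Divide each peak area by the sample's total peak area (assumed method)."""
    total = sum(sample)
    return [area / total for area in sample]


def pretreat(peak_areas):
    """Normalise, square-root, mean-centre and unit-variance scale a data table.

    peak_areas: list of samples (rows), each a list of peak areas for the same
    variables (columns) in the same order. Returns a table of the same shape.
    """
    normalised = [normalise_to_total(sample) for sample in peak_areas]
    rooted = [[math.sqrt(v) for v in sample] for sample in normalised]

    n = len(rooted)
    scaled_columns = []
    for column in zip(*rooted):
        mean = sum(column) / n
        centred = [v - mean for v in column]
        # Sample standard deviation (n - 1 in the denominator).
        sd = math.sqrt(sum(c * c for c in centred) / (n - 1))
        # A constant variable carries no information; leave it at zero.
        scaled_columns.append([c / sd if sd > 0 else 0.0 for c in centred])

    # Transpose back to one row per sample.
    return [list(row) for row in zip(*scaled_columns)]


if __name__ == "__main__":
    toy = [[10, 30, 60], [20, 20, 60], [5, 45, 50]]  # made-up peak areas
    for row in pretreat(toy):
        print([round(v, 3) for v in row])
```

After this treatment every variable has zero mean and unit variance across samples, so that abundant and scarce metabolites contribute on a comparable footing to the PCA and OPLS-DA models.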
If the R2Y(cum) and Q2Y(cum) values are close to 1.0, the model has good predictive ability based on the X data [41]. The default seven-fold full cross-validation setting was used in construction of the OPLS-DA models in SIMCA P+ to minimise overfitting of the data to the constructed models [2, 7]. To further test the validity of the models, one third of all samples were used as the test set, while the remaining samples were used to build the OPLS-DA models.

2.6 Statistical Analysis

The normalised data were subjected to a two-tailed Mann-Whitney test using the PASW-SPSS 18.0 statistical analysis software (SPSS, Chicago, IL, USA). A P value of less than 0.05 was taken as significant, and a P value of less than 0.01 as very significant, for consideration of a metabolite as a potential biomarker. This non-parametric test was chosen, as recommended by Gibbons and Chakraborti, since normality of the data cannot be assumed and the sample size is rather small (30 for each group) [152]. As this test uses rank values, extreme values will not affect the result much [153].

Chapter 3 Results and Discussion for Chronic Kidney Disease

3.1 Results for Chronic Kidney Disease

3.1.1 Results for Control vs. Chronic Kidney Disease ESI+ Dataset

A comparison of representative TICs obtained in the ESI+ mode for the control and CKD datasets is shown in Figure 1. Visual inspection of the chromatograms shows the differences in the urine profiles of healthy controls and patients diagnosed with CKD. Based on the TICs alone, the time regions of 2.3-3.0 min, 5.0-7.0 min, and 10.0-12.0 min show visible perturbations in the diseased state.

Figure 1 Representative TICs (ESI+) of (A) Control, and (B) Patient with CKD.

To obtain an overview of underlying trends and potential outliers prior to further multivariate statistical analysis, PCA was carried out for the control ESI+ dataset.
As shown in the scores plot for the control dataset (Figure 2A), observation C9 lies outside the Hotelling’s T2 range (significance level = 0.05) and could be a potential outlier. In order to verify whether this observation should be excluded from subsequent analysis, a graph of the distance of the observation to the X data plane (DModX) was plotted (Figure 2B). Since the DModX value for observation C9 is below the critical value (D-crit), it is not considered to be an outlier, and was kept for subsequent analysis. A similar PCA plot was constructed for the CKD ESI+ dataset, and no significant outliers were found (Figure 2C). Hence, all the observations for this dataset were retained for further analysis as well.

Figure 2 (A) PCA scores plot for Control ESI+ data; (B) DModX scores plot for Control ESI+ data; (C) PCA scores plot for CKD ESI+ data.

Multivariate statistical analysis using PCA resulted in a model which could not satisfactorily separate the control and CKD classes based on the ESI+ data alone (Figure 3). The constructed model had two components, designated t[1] and t[2]. R2X(cum) = 0.199, which indicates that 19.9% of the variation in X (i.e. the profile of peak intensities of each variable for each sample) could be explained by the model. A two-dimensional scatter plot of this unsupervised model shows that the inter-class variation is concentrated in the first principal component t[1]. However, the significant overlap between classes signifies that there are other sources of variation in the samples, and their contribution causes the inter-class variation to be insufficient for class separation. The degree of overlap of the two classes shows that the variation in metabolic profile is largely unrelated to the class differences [91]. This is also reflected in the Q2(cum) value of -0.0952, showing that this model is inadequate for class prediction.
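A negative Q2 value, as obtained above, is possible because goodness-of-fit and goodness-of-prediction statistics of this kind are generally of the form 1 - (residual sum of squares)/(total sum of squares). The sketch below illustrates that general form with made-up numbers; the function name is hypothetical, and the exact cumulative calculation performed by SIMCA P+ over several components is not reproduced here.

```python
def explained_fraction(observed, predicted):
    """Return 1 - RSS/SS, where SS is the sum of squares about the mean.

    With fitted values as `predicted` this gives an R2-type statistic; with
    cross-validated predictions (each value predicted by a model built
    without that observation) it gives a Q2-type statistic.
    """
    mean = sum(observed) / len(observed)
    ss_total = sum((o - mean) ** 2 for o in observed)
    ss_resid = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_resid / ss_total


if __name__ == "__main__":
    y = [0, 0, 1, 1]                      # illustrative class labels only
    fitted = [0.1, 0.0, 0.9, 1.0]         # hypothetical fitted values
    cross_validated = [0.8, 0.6, 0.3, 0.4]  # hypothetical poor predictions
    print("R2-type:", explained_fraction(y, fitted))
    print("Q2-type:", explained_fraction(y, cross_validated))
```

Whenever the cross-validated predictions are worse than simply predicting the mean, the residual sum of squares exceeds the total sum of squares and the statistic falls below zero, which is why a negative Q2(cum) indicates a model with no predictive value.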
Figure 3 PCA scores plot for Control and CKD ESI+ dataset

Due to the poor transparency and interpretability of the PCA model, an OPLS-DA model was constructed to better visualise the separation between the control and CKD groups. The OPLS-DA plot based on these data shows that there is a clear separation between the healthy controls and CKD patients along the t[1] predictive component axis (Figure 4). The model explains about 17% of the total variation in X (R2X(cum) = 0.174), the Y matrix being the two different sample classes. It was also found that 8.12% of the variation in the sample data correlated directly with class separation (R2X = 0.0812). Furthermore, a cumulative R2Y value of 0.973 indicates that the model is able to account for the variation in Y well. In addition, as Q2Y(cum) = 0.830, the OPLS-DA model is able to predict class membership better than chance.

Figure 4 OPLS-DA scores plot for Control against CKD ESI+ dataset

The first test of validation for the OPLS-DA model is to examine the cross-validation score plots based on the internal seven-fold full cross-validation employed in construction of the model. This is done by plotting both the score of the predictive component, t, and its cross-validated counterpart, tcv, for each observation used to construct the model. As shown in Figure 5, most of the samples were predicted to their own class in the cross-validation. Only samples C36, C38, and KY16 were predicted to belong to the opposite class. However, these samples did not have absolute tcv values that were too high, and the model was therefore taken as valid overall.

Figure 5 Cross-validation scores plot for Control and CKD ESI+ dataset

The validity of the OPLS-DA model and possible biomarkers was further verified by using a random permutation test in SIMCA P+ [74]. A three-component PLS-DA model was constructed and the test was performed using 100 permutations with the option to recalculate the permutations.
The R2 and Q2 intercepts obtained were 0.801 and -0.220 respectively (Figure 6). Also, all of the calculated Q2 values were lower than the Q2 value of the PLS-DA model, which is another indication of the validity of the model. Furthermore, using the criteria Q2Y > 0.5, 0 < R2Y - Q2Y < 0.3, R2 intercept < 0.4 and Q2 intercept < 0.05 [74], this model was found to be satisfactory in goodness of fit and validity, even though it did not meet the requirement for the R2 intercept.

Figure 6 Random permutation test scores plot for Control and CKD ESI+ dataset

A further test of the validity of the model was carried out by randomly holding out one-third of the samples from each of the control and CKD classes to form a test set. The remaining observations constituted a working set, which was used in the construction of an OPLS-DA model. This working set model was then used to classify the samples in the test set. In this second test, a higher percentage of correctly classified test set samples would also be an indicator of the robustness of the model built on the control-CKD ESI+ dataset. This procedure was carried out a total of three times. It was found that the model achieved an average of 96.7% accuracy with Fisher’s probability p[...]
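The hold-out validation described in this section, in which one third of each class is randomly set aside as a test set, can be sketched as follows. The function name, the seed argument and the class labels are hypothetical, and the OPLS-DA modelling step itself, which was performed in SIMCA P+, is not reproduced; only the splitting and the averaging over three repetitions are shown.

```python
import random


def stratified_holdout(labels, test_fraction=1 / 3, seed=None):
    """Randomly hold out a fraction of the samples from each class.

    labels: list of class labels, one per sample.
    Returns (working_indices, test_indices) as sorted lists of positions.
    """
    rng = random.Random(seed)

    # Group sample positions by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    # Draw the test samples separately within each class.
    test = []
    for indices in by_class.values():
        n_test = round(len(indices) * test_fraction)
        test.extend(rng.sample(indices, n_test))

    test_set = set(test)
    working = [i for i in range(len(labels)) if i not in test_set]
    return working, sorted(test)


if __name__ == "__main__":
    labels = ["control"] * 30 + ["CKD"] * 30
    for repeat in range(3):  # the procedure was carried out three times
        working, test = stratified_holdout(labels, seed=repeat)
        print(f"repeat {repeat + 1}: {len(working)} working, {len(test)} test")
```

Drawing the test samples within each class keeps the class proportions of the working set and the test set equal, so that the reported classification accuracy is not distorted by an unbalanced split.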
