IT training machine learning and systems engineering ao, amouzegar rieger 2010 10 13

Lecture Notes in Electrical Engineering, Volume 68

Sio-Iong Ao, Burghard Rieger, Mahyar A. Amouzegar

Machine Learning and Systems Engineering

Editors

Dr. Sio-Iong Ao, International Association of Engineers, Hung To Road 37-39, Unit 1, 1/F, Hong Kong, Hong Kong SAR, publication@iaeng.org

Prof. Dr. Burghard Rieger, Universität Trier, FB II Linguistische Datenverarbeitung, Computerlinguistik, Universitätsring 15, 54286 Trier, Germany

Prof. Mahyar A. Amouzegar, College of Engineering, California State University, Long Beach, CA 90840, USA

ISSN 1876-1100, e-ISSN 1876-1119
ISBN 978-90-481-9418-6, e-ISBN 978-90-481-9419-3
DOI 10.1007/978-90-481-9419-3
Springer Dordrecht Heidelberg London New York
Library of Congress Control Number: 2010936819

© Springer Science+Business Media B.V. 2010
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

Cover design: SPi Publisher Services
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Preface

A large international conference on Advances in Machine Learning and Systems Engineering was held at UC Berkeley, California, USA, October 20–22, 2009, under the auspices of the World Congress on Engineering and Computer Science (WCECS 2009). The WCECS is organized by the International Association of Engineers (IAENG). IAENG is a non-profit international association for engineers and computer scientists, which was founded in 1968 and has been undergoing rapid expansion in recent years. The WCECS conferences have served as excellent venues for the engineering community to meet and to exchange ideas. Moreover, WCECS continues to strike a balance between
theoretical and application development. The conference committees have been formed with over two hundred members, who are mainly research center heads, deans, department heads (chairs), professors, and research scientists from over thirty countries; the full committee list is available at our congress web site (http://www.iaeng.org/WCECS2009/committee.html). The conference participants are truly international, representing high-level research and development from many countries. The responses for the congress have been excellent. In 2009, we received more than six hundred manuscripts, and after a thorough peer review process 54.69% of the papers were accepted.

This volume contains 46 revised and extended research articles written by prominent researchers participating in the conference. Topics covered include expert systems, intelligent decision making, knowledge-based systems, knowledge extraction, data analysis tools, computational biology, optimization algorithms, experiment designs, complex system identification, computational modeling, and industrial applications. The book offers the state of the art of tremendous advances in machine learning and systems engineering and also serves as an excellent reference text for researchers and graduate students working on machine learning and systems engineering.

Sio-Iong Ao
Burghard B. Rieger
Mahyar A. Amouzegar

Contents

Multimodal Human Spacecraft Interaction in Remote Environments
  1 Introduction
  2 The MIT SPHERES Program
    2.1 General Information
    2.2 Human-SPHERES Interaction
    2.3 SPHERES Goggles
  3 Multimodal Telepresence
    3.1 Areas of Application
    3.2 The Development of a Test Environment
  4 Experimental Setup
    4.1 Control via ARTEMIS
    4.2 The Servicing Scenarios
  5 Results of the Experiments
    5.1 Round Trip Delays due to the Relay Satellite
    5.2 Operator Force Feedback
  6 Summary
  References

A Framework for Collaborative Aspects of Intelligent Service Robot
  1 Introduction
  2 Related Works
    2.1 Context-Awareness Systems
    2.2 Robot Grouping and Collaboration
  3 Design of the System
    3.1 Context-Awareness Layer
    3.2 Grouping Layer
    3.3 Collaboration Layer
  4 Simulated Experimentation
    4.1 Robot Grouping
    4.2 Robot Collaboration
  5 Conclusion
  References

Piecewise Bezier Curves Path Planning with Continuous Curvature Constraint for Autonomous Driving
  1 Introduction
  2 Bezier Curve
    2.1 The de Casteljau Algorithm
    2.2 Derivatives, Continuity and Curvature
  3 Problem Statement
  4 Path Planning Algorithm
    4.1 Path Planning Placing Bezier Curves within Segments (BS)
    4.2 Path Planning Placing Bezier Curves on Corners (BC)
  5 Simulation Results
  6 Conclusions
  References

Combined Heuristic Approach to Resource-Constrained Project Scheduling Problem
  1 Introduction
  2 Basic Notions
  3 Algorithm
  4 Generalisation for Multiproject Schedule
  5 KNapsack-Based Heuristic
  6 Stochastic Heuristic Methods
  7 Experimentation
  8 Conclusion
  References

A Development of Data-Logger for Indoor Environment
  1 Introduction
  2 Sensors Module
    2.1 Temperature Sensor
    2.2 Humidity Sensor
    2.3 CO and CO2 Sensor
  3 LCD Interface to the Microcontroller
  4 Real Time Clock Interface to the Microcontroller
  5 EEPROM Interface to the Microcontroller
  6 PC Interface Using RS-232 Serial Communication
  7 Graphical User Interface
  8 Schematic of the Data Logger
  9 Software Design of Data Logger
    9.1 Programming Steps for I2C Interface
    9.2 Programming Steps for LCD Interface
    9.3 Programming Steps for Sensor Data Collection
  10 Results and Discussion
  11 Conclusions
  References

Multiobjective Evolutionary Optimization and Machine Learning: Application to Renewable Energy Predictions
  1 Introduction
  2 Material and Methods
    2.1 Support Vector Machines
    2.2 Multiobjective Evolutionary Optimization
    2.3 SVM-MOPSO Trainings
  3 Application
  4 Results and Discussion
  5 Conclusions
  References

Hybriding Intelligent Host-Based and
Network-Based Stepping Stone Detections
  1 Introduction
  2 Research Terms
  3 Related Works
  4 Proposed Approach: Hybrid Intelligence Stepping Stone Detection (HI-SSD)
  5 Experiment
  6 Result and Analysis
    6.1 Intelligence Network Stepping Stone Detection (I-NSSD)
    6.2 Intelligence Host-Based Stepping Stone Detection (I-HSSD)
    6.3 Hybrid Intelligence Stepping Stone Detection (HI-SSD)
  7 Conclusion and Future Work
  References

Open Source Software Use in City Government
  1 Introduction
  2 Related Research
  3 Research Goals
  4 Methodology
  5 Survey Execution
  6 Survey Results
  7 Analysis: Interesting Findings
    7.1 Few Cities Have All Characteristics

M. Lee et al.

Fig. Representative data from the daily routine for each of the three axes of the tri-axial device

While the movements and postures contained within the routine are by no means a complete set of all possible activities that a given person might perform, they form a basic set of simple activities which underlie a person's daily life, and they are likely to provide a great deal of information about the person's balance, gait and activity levels if they can be accurately identified.

2.3 Feature Extraction

Features were computed on 512-sample windows of acceleration data, with 256 samples overlapping between consecutive windows. At a sampling frequency of 100 Hz, each window represents 5.12 s. Maximum acceleration, mean and standard deviation of the acceleration channels were computed over sliding windows; the use of sliding windows with 50% overlap has demonstrated success in past works. The 512-sample window size enabled fast computation of the FFTs used for some of the features. The DC feature for normalization is the mean acceleration value of the signal over the window. Use of the mean of maximum acceleration features has been shown to result in accurate recognition of certain postures and activities.

3 Results

This section describes the experiments and experimental results of the human posture recognition system.
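The windowed features of Sect. 2.3 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, only one acceleration axis is processed, the FFT-based features are omitted, and the population form of the standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import fmean, pstdev

WINDOW = 512    # samples per window (from the text)
OVERLAP = 256   # samples shared by consecutive windows (50% overlap)
FS_HZ = 100     # sampling frequency in Hz (from the text)


def sliding_windows(signal, size=WINDOW, overlap=OVERLAP):
    """Yield consecutive full windows of `size` samples sharing `overlap` samples."""
    step = size - overlap
    for start in range(0, len(signal) - size + 1, step):
        yield signal[start:start + size]


def window_features(window):
    """Return mean (the DC feature), standard deviation and maximum of one window."""
    return {
        "mean": fmean(window),
        "std": pstdev(window),  # assumption: population standard deviation
        "max": max(window),
    }


def extract_features(signal):
    """Compute the feature dictionary for every full window of one axis."""
    return [window_features(w) for w in sliding_windows(signal)]


# Duration covered by one window, in seconds.
window_seconds = WINDOW / FS_HZ
```

Calling `extract_features` once per axis of the tri-axial device would give three feature lists that can be concatenated window by window before classification.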
In the pilot test, a subject performed continuous posture changes including standing, sitting, lying, walking and running. In the experiment, each posture was recognized third

44 Review of Daily Physical Activity Monitoring System

Table. Clustering results of different postures in a continuous motion
Parameters and real posture: Standing, Sitting, Lying, Walking, Running, Average
Jaccard score: 0.99 1 0.99 0.98
Purity, Efficiency: 0.99 1 1 0.99 1 0.99 0.99

Fig. Activity recognition result using the fuzzy c-means classification algorithm

Mean and standard deviation of acceleration and correlation features were extracted from the acceleration data. Activity recognition on these features, performed using the fuzzy c-means classification algorithm, recognized standing, sitting, lying, walking and running with 99.5% accuracy, as shown in the table and figure.

4 Discussion and Conclusion

This paper proposes a system for recognizing ambulatory movements in daily life. A portable acceleration sensor module has been designed and implemented to measure human body motion. A small portable device utilizing a single tri-axis accelerometer was developed, which detects features of ambulatory movements including vertical position shifts. The classification method, based on the fuzzy c-means classification algorithm, achieved a recognition accuracy of over 99% on five activities (standing, sitting, lying, walking and running). These results are competitive with prior activity recognition results that only used laboratory data. However, several limitations are also observed for the system. Firstly, the collected data came from younger subjects (age 24–33). Secondly, a single accelerometer placed at the waist typically does not measure ascending and descending stairs. Accelerometers are preferable for detecting the frequency and intensity of vibrational human motion [28]. Many studies have demonstrated the usefulness of accelerometers for the evaluation of physical activity, mostly focusing on the detection of level walking or active/rest
discrimination [29–31]. This study was a pilot test of the feasibility of our system. Further application of the present technique may be helpful in the health promotion of both young and elderly people, and in the management of obese, diabetic, hyperlipidemic and cardiac patients. Efforts are being directed at making the device smaller and allowing data collection over longer time periods. Implementation of real-time processing firmware and encapsulation of the hardware are left for future work.

Acknowledgments This study was supported by a grant of the Seoul R&BD Program, Republic of Korea (10526), and by the Ministry of Knowledge Economy (MKE) and the Korea Industrial Technology Foundation (KOTEF) through the Human Resource Training Project for Strategic Technology.

References

1. P. Zimmet, K.G. Alberti, J. Shaw, Global and societal implications of the diabetes epidemic. Nature 414, 782–787 (2001)
2. S.M. Grundy, B. Hansen, S.C. Smith Jr., J.I. Cleeman, R.A. Kahn, Clinical management of metabolic syndrome: report of the American Heart Association/National Heart, Lung, and Blood Institute/American Diabetes Association conference on scientific issues related to management. Circulation 109, 551–556 (2004)
3. R.H. Eckel, S.M. Grundy, P.Z. Zimmet, The metabolic syndrome. Lancet 365, 1415–28
4. P.D. Thompson, D. Buchner, I.L. Pina, et al., Exercise and physical activity in the prevention and treatment of atherosclerotic cardiovascular disease: a statement from the Council on Clinical Cardiology (Subcommittee on Exercise, Rehabilitation, and Prevention) and the Council on Nutrition, Physical Activity, and Metabolism (Subcommittee on Physical Activity). Circulation 107, 3109–3116 (2003)
5. J. Tuomilehto, J. Lindstrom, J.G. Eriksson, et al., Prevention of type 2 diabetes mellitus by changes in lifestyle among subjects with impaired glucose tolerance. N. Engl. J. Med. 344, 1343–1350 (2001)
6. Y. Ohtaki, M. Susumago, A. Suzuki, et al., Automatic classification of ambulatory movements and evaluation of energy consumptions utilizing accelerometers and a barometer. Microsyst. Technol. 11, 1034–1040 (2005)
7. D.R. Bassett Jr., Validity and reliability issues in objective monitoring of physical activity. Res. Q. Exerc. Sport 71, S30–S36 (2000)
8. C.L. Craig, A.L. Marshall, M. Sjostrom, A.E. Bauman, et al., International physical activity questionnaire: 12-country reliability and validity. Med. Sci. Sports Exerc. 35, 1381–1395 (2003)
9. K.M. Allor, J.M. Pivarnik, Stability and convergent validity of three physical activity assessments. Med. Sci. Sports Exerc. 33, 671–676 (2001)
10. M.J. LaMonte, B.E. Ainsworth, C. Tudor-Locke, Assessment of physical activity and energy expenditure, in Obesity: Etiology, Assessment, Treatment and Prevention, ed. by R.E. Andersen (Human Kinetics, Champaign, IL, 2003), pp. 111–117
11. C.V.C. Bouten, W.P.H.G. Verboeket-Van Venne, et al., Daily physical activity assessment: comparison between movement registration and doubly labeled water. J. Appl. Physiol. 81, 1019–1026 (1996)
12. U.S. Department of Health & Human Services, Physical Activity and Health: A Report of the Surgeon General (U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, National Center for Chronic Disease Prevention and Health Promotion, The President's Council on Physical Fitness and Sports, Atlanta, GA, 1996)
13. P.S. Freedson, K. Miller, Objective monitoring of physical activity using motion sensors and heart rate. Res. Q. Exerc. Sport 71, 21–29 (2000)
14. H.J. Montoye, H.C.G. Kemper, W.H.M. Saris, R.A. Washburn, Measuring Physical Activity and Energy Expenditure (Human Kinetics, Champaign, IL, 1996)
15. R.K. Dishman, R.A. Washburn, D.A. Schoeller, Measurement of physical activity. Quest 53, 295–309 (2001)
16. A. Bhattacharya, E.P. McCutcheon, E. Shvartz, J.E. Greenleaf, Body acceleration distribution and O2 uptake in humans during running and jumping. J. Appl. Physiol. 49, 881–887 (1980)
17. G.J. Welk, Physical Activity Assessments for Health-Related Research (Human Kinetics,
Champaign, IL, 2002)
18. C.V.C. Bouten, A.A.H.J. Sauren, M. Verduin, J.D. Janssen, Effects of placement and orientation of body-fixed accelerometers on the assessment of energy expenditure during walking. Med. Biol. Eng. Comput. 35, 50–56 (1997)
19. K.Y. Chen, D.R. Bassett, The technology of accelerometry-based activity monitors: current and future. Med. Sci. Sports Exerc. 37(11), S490–S500 (2005)
20. E.L. Melanson, P.S. Freedson, Physical activity assessment: a review of methods. Crit. Rev. Food Sci. Nutr. 36, 385–396 (1996)
21. T.G. Ayen, H.J. Montoye, Estimation of energy expenditure with a simulated three-dimensional accelerometer. J. Ambul. Monit. 1(4), 293–301 (1988)
22. K.R. Westerterp, Physical activity assessment with accelerometers. Int. J. Obes. 23(Suppl 3), S45–S49 (1999)
23. B.G. Steele, B. Belza, K. Cain, C. Warms, J. Coopersmith, J. Howard, Bodies in motion: monitoring daily activity and exercise with motion sensors in people with chronic pulmonary disease. J. Rehabil. Res. Dev. 40(Suppl 2), 45–58 (2003)
24. M.J. Lamonte, B.E. Ainsworth, Quantifying energy expenditure and physical activity in the context of dose response. Med. Sci. Sports Exerc. 33, S370–S378 (2001)
25. R.K. Dishman, R.A. Washburn, D.A. Schoeller, Measurement of physical activity. Quest 53, 295–309 (2001)
26. K.R. Westerterp, Physical activity assessment with accelerometers. Int. J. Obes. 23(Suppl 3), S45–S49 (1999)
27. M.J. Mathie, A.C.F. Coster, N.H. Lovell, B.G. Celler, Accelerometry: providing an integrated, practical method for long-term, ambulatory monitoring of human movement. Physiol. Meas. 25, R1–R20 (2004)
28. C.V.C. Bouten, K.T.M. Koekkoek, M. Verduin, R. Kodde, J.D. Janssen, A triaxial accelerometer and portable data processing unit for the assessment of daily physical activity. IEEE Trans. Biomed. Eng. 44(3), 136–147 (1997)
29. A.K. Nakahara, E.E. Sabelman, D.L. Jaffe, Development of a second generation wearable accelerometric motion analysis system, in Proceedings of the First Joint EMBS/BMES Conference, 1999, p. 630
30. K. Aminian, P. Robert, E.E. Buchser,
B. Rutschmann, D. Hayoz, M. Depairon, Physical activity monitoring based on accelerometry. Med. Biol. Eng. Comput. 37, 304–308 (1999)
31. M.J. Mathie, N.H. Lovell, C.F. Coster, B.G. Celler, Determining activity using a triaxial accelerometer, in Proceedings of the Second Joint EMBS/BMES Conference, 2002, pp. 2481–2482

Chapter 45
A Study of the Protein Folding Problem by a Simulation Model

Omar Gaci

Abstract In this paper, we propose a simulation model to study the protein folding problem. We describe the main properties of proteins and present the protein folding problem according to the existing approaches. Then, we propose to simulate the folding process when a protein is represented by an amino acid interaction network. This is a graph whose vertices are the protein's amino acids and whose edges are the interactions between them. We propose a genetic algorithm for reconstructing the graph of interactions between secondary structure elements, which describes the structural motifs. The performance of our algorithms is validated experimentally.

1 Introduction

Proteins are biological macromolecules participating in the large majority of the processes which govern organisms. The roles played by proteins are varied and complex. Certain proteins, called enzymes, act as catalysts and increase by several orders of magnitude, with remarkable specificity, the speed of multiple chemical reactions essential to the organism's survival. Proteins are also used for the storage and transport of small molecules or ions, control the passage of molecules through cell membranes, etc. Hormones, which transmit information and allow the regulation of complex cellular processes, are also proteins.

Genome sequencing projects generate an ever increasing number of protein sequences. For example, the Human Genome Project has identified over 30,000 genes which may encode about 100,000 proteins. One of the first tasks when annotating a new genome is to assign functions to the proteins produced by the genes. To fully understand the
biological functions of proteins, knowledge of their structure is essential.

O. Gaci
Le Havre University, 25 rue Phillipe Lebon, 76600 Le Havre, France
e-mail: omar.gaci@gmail.com

S.-I. Ao et al. (eds.), Machine Learning and Systems Engineering, Lecture Notes in Electrical Engineering 68, DOI 10.1007/978-90-481-9419-3_45, © Springer Science+Business Media B.V. 2010

In their natural environment, proteins adopt a native, compact three-dimensional form. This process is called folding and is not fully understood. The process is a result of interactions between the protein's amino acids, which form chemical bonds.

In this study, we propose to study the protein folding problem. We describe this biological process through the historical approaches to solving this problem. Then, we treat proteins as networks of interacting amino acid pairs [1]. In particular, we consider the subgraph induced by the set of amino acids participating in the secondary structure, also called Secondary Structure Elements (SSE). We call this graph the SSE interaction network (SSE-IN). We begin by reviewing related work on this kind of model. Then, we present a genetic algorithm able to reconstruct the graph whose vertices represent the SSE and whose edges represent spatial interactions between them. In other words, this graph is another way to describe the motifs involved in the protein secondary structures.

2 The Protein Folding Problem

Several tens of thousands of protein sequences are encoded in the human genome. A protein is comparable to an amino acid chain which folds to adopt its tertiary structure. This 3D structure enables a protein to achieve its biological function. In vivo, each protein must quickly find its native, functional structure among innumerable alternative conformations.

Protein 3D structure prediction is one of the most important problems of bioinformatics, yet it remains unsolved in the majority of cases. The problem is summarized by the following
question: given a protein defined by its sequence of amino acids, what is its native structure? In other words, we want to determine the structure whose amino acids are correctly organized in three dimensions so that this protein can correctly achieve its biological function.

Unfortunately, an exact answer is not always possible, which is why researchers have developed models to provide a feasible solution for any unknown sequence. However, models of protein folding lead to NP-hard optimization problems [2]. Such models consider a conformational space in which the modeled protein tries to reach its minimum energy level, which corresponds to its native structure. Therefore, an exact and efficient resolution algorithm seems improbable; in fact, no model is yet able to entirely define the general principles of protein folding.

2.1 The Levinthal Paradox

The first observation of spontaneous and reversible folding in vitro was carried out by Anfinsen [3]. He deduced that the native structure of a protein corresponds to a conformation with minimal free energy, at least under suitable environmental conditions. But if protein folding is indeed under thermodynamic control, a judicious question is how a protein can find, in a reasonable time, its structure of lowest energy among an astronomical number of possible conformations.

As an example, a protein of 100 residues can adopt $2^{100}$ ($\approx 10^{30}$) distinct conformations when we suppose that only two possibilities are accessible to each residue. If the passage from one conformation to another is carried out in $10^{-13}$ s (which corresponds to the time necessary for a rotation around a bond), this protein would need at least $10^{17}$ s, i.e. approximately three billion years, "to test" all possible conformations. Proteins, however, manage to find their native structures in a lapse of time which is about the
millisecond to the second. The apparent incompatibility between these facts, raised initially by Levinthal [4], was quickly elevated to a paradox and has caused much ink to flow since.

Levinthal himself gave the solution to his paradox: proteins do not explore the entirety of their conformational space, and their folding needs to be "guided", for example via the fast formation of certain interactions which determine the continuation of the process.

2.2 Motivations

To understand how a protein accomplishes its biological function, and to be able to act on the cellular processes in which the protein intervenes, it is essential to know its structure. Many native protein structures have been determined experimentally, primarily by X-ray crystallography or by Nuclear Magnetic Resonance (NMR), and indexed in a publicly accessible database, the Protein Data Bank (PDB) [5]. However, the application of these experimental techniques takes considerable time [6, 7]. Indeed, the number of known protein sequences [8] is much larger than the number of solved structures [5], and this gap continues to grow quickly.

The design of methods for predicting a protein's structure from its sequence is a problem with major stakes, and one which has fascinated many scientists for several decades. Various avenues have been followed with the aim of solving this problem, which is elementary in theory but extremely complex in practice.

3 Approaches to Study the Protein Folding Problem

The existing models for studying the protein folding problem depend, amongst other things, on the way the native structure is supposed to be reached. Either a protein folds by following a preferential folding path [9], or a protein folds by searching for the native state in an energy landscape organized as a funnel.

The first hypothesis implies the existence of preferential paths for the folding process. In the simplest case, the folding mechanism is comparable to a linear reaction. Thus, when the steps are
sufficiently specific, only a local region of the conformational space will be explored. This concept is nowadays obsolete, since we know that parallel folding paths exist.

The second hypothesis defines folding in the following way (Dill, 1997): a parallel flow process of an ensemble of chain molecules; folding is seen as more like trickle of water down mountainsides of complex shapes, and less like flow through a single gallery. In other words, folding can be described as a set of transitions between structures whose energies become progressively lower. This guides the protein, by a funnel effect, toward the conformational state whose energy level is the minimum, that is, the native conformation. The polypeptide chain then explores only a fraction of the accessible states.

This last hypothesis is the one accepted in this chapter: protein folding is a process by which a large number of conformations are accessible and which leads a sequence into its native structure with the lowest energy level (see Fig. 1).

Fig. 1 Evolution in the description of folding paths in an energy landscape. Top, protein folding according to the Anfinsen theory: a protein can adopt a large number of conformations. Bottom-left, a protein folds by following only one specific path in the conformational space. Bottom-right, from a denatured conformation, a protein searches for its native structure, whose energy level is minimal, in a minimum time

3.1 Latest Approach

Many systems, both natural and artificial, can be represented by networks, that is, by sites or vertices bound by links. The study of these networks is interdisciplinary because they appear in scientific fields like physics, biology, computer science or information technology. These studies are conducted with the aim of explaining how elements interact with each other inside the network and what general laws govern the observed network properties.

From physics and computer
science to biology and social sciences, researchers have found that a broad variety of systems can be represented as networks, and that there is much to be learned by studying these networks. Indeed, studies of the Web [10], of social networks [11] and of metabolic networks [12] have brought to light common non-trivial properties of networks which a priori have nothing in common. The ambition is to understand how large networks are structured, how they evolve, and which phenomena act on their constitution and formation.

In [13], the authors propose to consider a protein as an interaction network whose vertices represent the amino acids and in which an edge describes a specific type of interaction (which differs according to the object of study). Thus a protein, a molecule composed of atoms, becomes a set constituted by individuals (the amino acids) and by interactions (to be defined according to the study), which evolves in a particular environment (describing the experimental conditions).

The vocabulary evolves but the aim remains the same: we want to better understand the protein folding process by means of modeling. The interaction network of a protein is initially the one built from the primary structure. The goal is to predict the graph of the tertiary structure through a discrete simulation process.

3.2 The Amino Acid Interaction Network

The 3D structure of a protein is represented by the coordinates of its atoms. This information is available in the Protein Data Bank (PDB), which collects all experimentally solved protein structures. Using the coordinates of two atoms, one can compute the distance between them. We define the distance between two amino acids as the distance between their Cα atoms. Considering the Cα atom as the "center" of the amino acid is an approximation, but it works well enough for our purposes. Let us denote by $N$ the number of amino acids in the protein. A contact map matrix is an $N \times N$ 0-1 matrix whose element $(i, j)$ is one if
there is a contact between amino acids $i$ and $j$ and zero otherwise. It provides useful information about the protein. For example, the secondary structure elements can be identified using this matrix. Indeed, α-helices spread along the main diagonal, while β-sheets appear as bands parallel or perpendicular to the main diagonal [14]. There are different ways to define the contact between two amino acids. Our notion is based on spatial proximity, so that the contact map can account for non-covalent interactions. We say that two amino acids are in contact if and only if the distance between them is below a given threshold. A commonly used threshold (in Å) is given in [1], and this is the value we use.

Consider a graph with $N$ vertices (each vertex corresponds to an amino acid) and the contact map matrix as its adjacency matrix. It is called the contact map graph. The contact map graph is an abstract description of the protein structure taking into account only the interactions between the amino acids.

First, we consider the graph induced by the entire set of amino acids participating in folded proteins. We call this graph the three-dimensional structure elements interaction network (3DSE-IN), see Fig. 2. We also consider the subgraph induced by the set of amino acids participating in SSE. We call this graph the SSE interaction network (SSE-IN) (see Fig. 2).

In [15] the authors rely on amino acid interaction networks (more precisely, they use SSE-IN) to study some of their properties, in particular the role played by certain nodes and how the graph compares to general interaction network models. From this point of view, the protein folding problem can be tackled with graph theory.

To manipulate an SSE-IN or a 3DSE-IN, we need a PDB file, which is transformed by a parser we have developed. This parser generates a new file which is read by the GraphStream library [16] to display the SSE-IN in two or three dimensions.

4 Folding a Protein in a Topological Space by Bio-Inspired Methods

In this
section, we treat proteins as amino acid interaction networks (see Fig. 2). We describe a bio-inspired method we use to fold amino acid interaction networks. In particular, we want to fold an SSE-IN to predict the motifs which describe the secondary structure.

4.1 Genetic Algorithms

The concept of genetic algorithms was proposed by John Holland [17] to describe adaptive systems modeled on biological processes. Genetic algorithms are inspired by the concept of natural selection proposed by Charles Darwin. The vocabulary employed here is borrowed from evolutionary theory and genetics. We speak about individuals (potential solutions), populations, genes (which are the variables), chromosomes, parents, descendants, reproduction, etc.

Fig. 2 Protein 1DTP SSE-IN (top) and the 1DTP 3DSE-IN (bottom). From a PDB file, a parser we have developed produces a new file which corresponds to the SSE-IN graph displayed by the GraphStream library [16]

At the beginning, we consider a population in which there exists a solution that is not yet optimal. Then, the genetic algorithm makes this population evolve by an iterative process. Certain individuals reproduce, others mutate or disappear, and only the well-adapted individuals are supposed to survive. The genetic heritage passed between generations must help to produce individuals which are better and better adapted, so as to approach the optimal solution.

4.2 Motif Prediction

In previous works [18], we have studied the protein SSE-IN. In particular, we have identified some of their properties, such as the degree distribution and the way in which the amino acids interact. These works have allowed us to determine criteria discriminating between the different structural families. We have established a parallel between structural families and topological metrics describing the protein SSE-IN.

Using these results, we have proposed a method to deduce the family of an
unclassified protein based on the topological properties of its SSE-IN, see [19].

Thus, we consider a protein defined by its sequence in which the amino acids participating in the secondary structure are known. Then, we apply a method that associates a family with this sequence, and we rely on that family to predict the fold shape of the protein. This work consists in associating the family which is the most compatible with the unknown sequence. The following step is to fold the unknown sequence SSE-IN relying on the family topological properties.

To fold a SSE-IN, we rely on the Levinthal hypothesis, also called the kinetic hypothesis. Thus, the folding process is oriented and the proteins do not explore their entire conformational space. In this paper, we use the same approach: to fold a SSE-IN we limit the topological space by associating a structural family to a sequence [19]. Since the structural motifs which describe a structural family are limited, we propose a genetic algorithm (GA) to enumerate all possibilities.

In this section, we present a method based on a GA to predict the graph whose vertices represent the SSE and whose edges represent spatial interactions between two amino acids involved in two different SSE. In what follows, this graph is called the Secondary Structure Interaction Network (SS-IN) (see Fig. 3).

Fig. SS-IN of 2OUF (left) and its associated incidence matrix (right). The vertices represent the different α-helices and an edge exists when two amino acids interact.

4.3 Dataset

In what follows, we use a dataset composed of proteins which have no fold family in the SCOP v1.73 classification and to which we have associated a family in [19].

4.4 Overall Description

The GA has to predict the adjacency matrix of an unknown sequence when it is represented by a chromosome. The initial population is composed of proteins of the associated family with the same number of SSEs. During the genetic process, genetic operators are applied to create new
individuals with new adjacency matrices. We want to predict the adjacency matrix of the studied protein when only its chromosome is known.

Here, we represent a protein by an array of alleles. Each allele represents a SSE, notably its size, that is, the number of amino acids which compose it. The size is normalized, so that genomes have alleles whose values do not exceed 100. The position of an allele corresponds to the position in the sequence of the SSE it represents. At the same time, we associate with each genome its SS-IN incidence matrix.

The fitness function we use to evaluate the performance of a chromosome is the L1 distance between this chromosome and the target sequence.

4.5 Genetic Operators

Our GA uses the common genetic operators and also a specific topological operator.

The crossover operator uses two parents to produce two children. It produces two new chromosomes and matrices. After generating two random cut positions (one applied to the chromosomes and another to the matrices), we swap the two chromosome parts and the two matrix parts, respectively. This operator can produce incidence matrices which are not compatible with the structural family; the topological operator solves this problem.

The mutation operator is applied to a small fraction (about 1%) of the generated children. It modifies the chromosome and the associated matrix. For the chromosomes, we define two operators: the two-position swap and the one-position mutation. For the associated matrix, we define four operators: the row translation, the column translation, the two-position swap and the one-position mutation. These common operators may produce matrices which describe a SS-IN that is incoherent with the fold family associated with the sequence. To eliminate these cases, we develop a topological operator.

The topological operator is used to exclude the incompatible children generated by our GA. The principle is the following: we have deduced a fold family for the
sequence, from which we extract an initial population of chromosomes. We then compute the diameter, the characteristic path length and the mean degree to evaluate the average topological properties of the family for the particular number of SSEs. After the GA generates a new individual by crossover or mutation, we compare the associated SS-IN matrix with the properties of the initial population, admitting an error rate of up to 20%. If the new individual is not compatible, it is rejected.

4.6 Algorithm

Starting from an initial population of chromosomes from the associated family, the population evolves according to the genetic operators. When the global population fitness can no longer improve between two generations, the process is stopped (see Algorithm).

The genetic process is the following: after the initial population is built, we extract a fraction of parents according to their fitness and we make them reproduce to produce children. Then, we select the new generation by including the chromosomes which are not among the parents, plus a fraction of the parents, plus a fraction of the children. It remains to compute the fitness of the new generation.

When the algorithm stops, the final population is composed of individuals close to the target protein in terms of SSE length distribution, because of the choice of our fitness function. As a side effect, their associated matrices are expected to be close to the adjacency matrix of the studied protein, which is what we want to predict.

In order to test the performance of our GA, we pick three chromosomes at random from the final population and we compare their associated matrices to the ...

... advances in machine learning and systems engineering and also serves as an excellent reference text for researchers and graduate students working on machine learning and systems engineering. ...

... Avenue, Cambridge, MA 02139-4307, USA, e-mail: estoll@MIT.edu. S.-I. Ao et al. (eds.), Machine Learning and Systems Engineering, Lecture Notes in Electrical Engineering 68, DOI 10.
1007/978-90-481-9419-3_1, ... satellites. Torques and forces are calculated and directly commanded to the thrusters, which will cause a motion of the SPHERES satellite. The satellites measure their position and attitude and transmit
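
Returning to the genetic algorithm of Sects. 4.4 to 4.6 of the protein folding chapter, the chromosome encoding, the L1 fitness and the topological operator can be illustrated with a short Python sketch. It is not the author's implementation, and every function name in it is hypothetical. It assumes that SSE sizes are normalized by scaling the largest SSE of a protein to 100, that the SS-IN is given as a symmetric 0/1 matrix, that unreachable vertex pairs are ignored when computing path lengths, and that the 20% error rate is applied to each of the three metrics separately. The chapter does not spell out any of these details.

```python
from collections import deque
from statistics import mean


def normalize_sizes(sse_sizes, scale=100):
    """Build a chromosome: scale SSE sizes so the largest allele equals `scale`.

    Assumption: the chapter only states that sizes are normalized up to 100.
    """
    largest = max(sse_sizes)
    return [scale * size / largest for size in sse_sizes]


def l1_fitness(chromosome, target):
    """L1 distance between two chromosomes of equal length (lower is better)."""
    return sum(abs(a - b) for a, b in zip(chromosome, target))


def bfs_distances(adj, source):
    """Hop distances from `source` in an unweighted graph given as a 0/1 matrix."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, linked in enumerate(adj[u]):
            if linked and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def topological_metrics(adj):
    """Return (diameter, characteristic path length, mean degree) of a SS-IN.

    Assumption: pairs of vertices with no path between them are skipped.
    """
    n = len(adj)
    lengths = []
    for s in range(n):
        dist = bfs_distances(adj, s)
        # Count each unordered pair once.
        lengths.extend(d for t, d in dist.items() if t > s)
    diameter = max(lengths, default=0)
    path_length = mean(lengths) if lengths else 0.0
    mean_degree = sum(sum(row) for row in adj) / n
    return diameter, path_length, mean_degree


def is_compatible(adj, family_metrics, tolerance=0.20):
    """Topological filter: every metric must lie within `tolerance` of the family average."""
    return all(
        abs(value - reference) <= tolerance * reference
        for value, reference in zip(topological_metrics(adj), family_metrics)
    )


# Toy data (invented for illustration): a 4-SSE chain and a 4-SSE star.
chain = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
star = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
family_average = topological_metrics(chain)
print(is_compatible(chain, family_average), is_compatible(star, family_average))
```

In a full GA loop, a child produced by crossover or mutation would be kept only if `is_compatible` accepts its matrix, and the surviving chromosomes would then be ranked by `l1_fitness` against the target chromosome.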
