Computational Biology

Dariusz Mrozek

Scalable Big Data Analytics for Protein Bioinformatics
Efficient Computational Solutions for Protein Structures

Computational Biology
Volume 28

Editors-in-Chief
Andreas Dress, CAS-MPG Partner Institute for Computational Biology, Shanghai, China
Michal Linial, Hebrew University of Jerusalem, Jerusalem, Israel
Olga Troyanskaya, Princeton University, Princeton, NJ, USA
Martin Vingron, Max Planck Institute for Molecular Genetics, Berlin, Germany

Editorial Board
Robert Giegerich, University of Bielefeld, Bielefeld, Germany
Janet Kelso, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
Gene Myers, Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
Pavel A. Pevzner, University of California, San Diego, CA, USA

Advisory Board
Gordon Crippen, University of Michigan, Ann Arbor, MI, USA
Joe Felsenstein, University of Washington, Seattle, WA, USA
Dan Gusfield, University of California, Davis, CA, USA
Sorin Istrail, Brown University, Providence, RI, USA
Thomas Lengauer, Max Planck Institute for Computer Science, Saarbrücken, Germany
Marcella McClure, Montana State University, Bozeman, MO, USA
Martin Nowak, Harvard University, Cambridge, MA, USA
David Sankoff, University of Ottawa, Ottawa, ON, Canada
Ron Shamir, Tel Aviv University, Tel Aviv, Israel
Mike Steel, University of Canterbury, Christchurch, New Zealand
Gary Stormo, Washington University in St. Louis, St. Louis, MO, USA
Simon Tavaré, University of Cambridge, Cambridge, UK
Tandy Warnow, University of Illinois at Urbana-Champaign, Champaign, IL, USA
Lonnie Welch, Ohio University, Athens, OH, USA

The Computational Biology series publishes the very latest, high-quality research devoted to specific issues in computer-assisted analysis of biological data. The main emphasis is on current scientific developments and innovative techniques in computational biology (bioinformatics), bringing to light methods from mathematics, statistics and computer
science that directly address biological problems currently under investigation. The series offers publications that present the state-of-the-art regarding the problems in question; show computational biology/bioinformatics methods at work; and finally discuss anticipated demands regarding developments in future methodology. Titles can range from focused monographs, to undergraduate and graduate textbooks, and professional text/reference works.

More information about this series at http://www.springer.com/series/5769

Dariusz Mrozek
Silesian University of Technology
Gliwice, Poland

ISSN 1568-2684
Computational Biology
ISBN 978-3-319-98838-2
ISBN 978-3-319-98839-9 (eBook)
https://doi.org/10.1007/978-3-319-98839-9
Library of Congress Control Number: 2018950968

© Springer Nature Switzerland AG 2018
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors
or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

For my always smiling and beloved wife Bożena, and my lively and infinitely active sons Paweł and Henryk, with all my love.
To my parents, thank you for your support, concern and faith in me.

Foreword

High-performance computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business. Big Data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

This timely book by Dariusz Mrozek gives you a quick introduction to the area of proteins and their structures, protein structure similarity searching carried out at main representation levels, and various techniques that can be used to accelerate similarity searches using high-performance Cloud computing and Big Data concepts. It presents introductory concepts of a formal model of 3D protein structures for functional genomics, comparative bioinformatics, and molecular modeling, and the use of multi-threading for efficient approximate searching on protein secondary structures. In addition, there is material on finding 3D protein structure similarities accelerated with high-performance computing techniques.

The book is required reading for anyone working in the area of data analytics for structural bioinformatics and the use of high-performance computing. It explores the area of proteins and their structures in depth and provides
practical approaches to many problems that may be encountered. It is especially useful to applications developers, scientists, students, and teachers.

I have enjoyed and learned from this book and feel confident that you will as well.

Knoxville, USA
June 2018

Jack Dongarra
University of Tennessee

Preface

International efforts focused on understanding living organisms at various levels of molecular organization, including genomic, proteomic, metabolomic, and cell signaling levels, have led to a huge proliferation of biological data collected in dedicated, and frequently public, repositories. The amount of data deposited in these repositories increases every year, and the cumulative volume has grown to sizes that are difficult to handle with traditional analysis tools. This growth of biological data is stimulated by various international projects, such as 1000 Genomes. The project aims at sequencing genomes of at least one thousand anonymous participants from a number of different ethnic groups in order to establish a detailed catalog of human genetic variations. As a result, it generates terabytes of genetic data.

Apart from international initiatives and projects, like the 1000 Genomes, the proliferation of biological data is further accelerated by newly developed technologies for DNA sequencing, like next-generation sequencing (NGS) methods. These methods are getting faster and less expensive every year. They produce huge amounts of genetic data that require fast analysis in various phases of molecular profiling, medical diagnostics, and treatment of patients who suffer from serious diseases. Indeed, for the last three decades we have witnessed the continuous exponential growth of biological data in repositories such as GenBank, Sequence Read Archive (SRA), RefSeq, Protein Data Bank, and UniProt/SwissProt. The specificity of the data has inspired the scientific community to develop many algorithms that can be used to analyze the data and draw useful conclusions. A huge volume of
the biological data has made many of the existing algorithms inefficient due to their computational complexity. Fortunately, the rapid development of computer science in the last decade has brought many technological innovations that can also be used in the field of bioinformatics and life sciences. Algorithms of significant practical value, which until recently were perceived as too time-consuming, can now be used efficiently by applying the latest technological achievements, like Hadoop and Spark for analyzing Big Data sets, multi-threading, graphics processing units (GPUs), or cloud computing.

Scope of the Book

The book focuses on proteins and their structures. It presents various scalable solutions for protein structure similarity searching carried out at main representation levels and for prediction of 3D structures of proteins. It specifically focuses on various techniques that can be used to accelerate similarity searches and protein structure modeling processes. But, why proteins?
somebody may ask. I could answer the question by following Arthur M. Lesk in his book entitled Introduction to Protein Science: Architecture, Function, and Genomics. Because proteins are where the action is. Understanding proteins, their structures, functions, mutual interactions, activity in cellular reactions, interactions with drugs, and expression in body cells is a key to efficient medical diagnosis, drug production, and treatment of patients.

I have been fascinated with proteins and their structures for fifteen years. I fell in love with the beauty of protein structures at first sight, inspired by the research conducted by the late Lech Znamirowski from the Silesian University of Technology, Gliwice, Poland. I decided to continue his research on proteins and the development of new efficient tools for their analysis and exploration.

I believe this book will be interesting for scientists, researchers, and software developers working in the field of structural bioinformatics and biomedical databases. I hope that readers of the book will find it interesting and helpful in their everyday work.

Chapter Overview

The content of the book is divided into four parts. The first part provides background information on proteins and their representation levels, including a formal model of a 3D protein structure used in computational processes, and a brief overview of technologies used in the solutions presented in this book.

• Chapter 1: Formal Model of 3D Protein Structures for Functional Genomics, Comparative Bioinformatics, and Molecular Modeling. This chapter shows how proteins can be represented in computational processes performed in scientific fields, such as functional genomics, comparative bioinformatics, and molecular modeling. The chapter provides a general definition of protein spatial structure that is then referenced to four representation levels of protein structure: primary, secondary, tertiary, and quaternary structures.
• Chapter 2: Technological Roadmap. This chapter
provides a technological roadmap for solutions presented in this book. It covers a brief introduction to the concept of Cloud computing, cloud service, and deployment models. It also defines the Big Data challenge and presents the benefits of using multi-threading in scientific computations. It then explains graphics processing units (GPUs) and CUDA architecture. Finally, it focuses on relational databases and the SQL language used for declarative querying.

The second part of the book is focused on Cloud services that are utilized in the development of scalable and reliable cloud applications for 3D protein structure similarity searching and protein structure prediction.

• Chapter 3: Azure Cloud Services. Microsoft Azure Cloud Services support development of scalable and reliable cloud applications that can be used for scientific computing. This chapter provides a brief introduction to the Microsoft Azure cloud platform and its services. It focuses on Azure Cloud Services that allow building a cloud-based application with the use of Web roles and Worker roles. Finally, it shows a sample application that can be quickly developed on the basis of these two types of roles and the role of queues in passing messages between components of the built system.
• Chapter 4: Scaling 3D Protein Structure Similarity Searching with Cloud Services. In this chapter, you will see how the Cloud computing architecture and Azure Cloud Services can be utilized to scale out and scale up protein similarity searches by utilizing the system, called Cloud4PSi, that was developed for the Microsoft Azure public cloud. The chapter presents the architecture of the system, its components, communication flow, and advantages of using a queue-based model over the direct communication between computing units. It also shows results of various experiments confirming that the similarity searching can be successfully scaled on cloud platforms by using computation units of different sizes and by adding more
computation units.
• Chapter 5: Cloud Services for Efficient Ab Initio Predictions of 3D Protein Structures. In this chapter, you will see how Cloud Services may help to solve problems of protein structure prediction by scaling the computations in a role-based and queue-based Cloud4PSP system, deployed in the Microsoft Azure cloud. The chapter shows the system architecture, the Cloud4PSP processing model, and results of various scalability tests that speak in favor of the presented architecture.

The third part of the book shows the utilization of scalable Big Data computational frameworks, like Hadoop and Spark, in massive 3D protein structure alignments and identification of intrinsically disordered regions in protein structures.

• Chapter 6: Foundations of the Hadoop Ecosystem. At the moment, the Hadoop ecosystem covers a broad collection of platforms, frameworks, tools, libraries, and other services for fast, reliable, and scalable data analytics. This chapter briefly describes the Hadoop ecosystem and focuses on two elements of the ecosystem: Apache Hadoop and Apache Spark.

11 Exploration of Protein Secondary Structures …

Assumption: ≤

The following terms are compliant with the defined grammar G_pss:

• h(1;10): an α-helix of length 1–10 elements;
• e(2;5),h(10;*),c(1;20): a β-strand of length 2–5 elements, followed by an α-helix of length at least 10 elements, and a loop of length 1–20 elements;
• e(10;15),?(5;20),h(35): a β-strand of length 10–15 elements, followed by any element of length 5–20, and an α-helix of exactly 35 elements.

With such a representation of the query pattern, we can start the search process using one of the functions exposed by the PSS-SQL extension.

11.3.2 Sample Queries in PSS-SQL

The PSS-SQL extension provides a set of functions and procedures for processing protein secondary structures. Three of the functions can be effectively invoked from the SQL commands, usually
the SELECT statement.

The containSequence function verifies whether a particular protein or a set of database proteins contains the structural pattern specified as a query pattern. The function returns the Boolean value true if the database protein contains the specified pattern, or false if the protein does not include the pattern. A sample invocation of the function is shown in Listing 11.1.

    SELECT protID, protAC
    FROM ProteinTbl
    WHERE name LIKE '%Escherichia coli%'
      AND dbo.containSequence(id, 'secondary', 'h(;),c(),?(),c(;)') = 1

Listing 11.1 Sample query invoking containSequence function and returning identifiers of proteins from Escherichia coli containing the given secondary structure pattern

The sample query returns identifiers and Accession Numbers of proteins from Escherichia coli having the structural region containing an α-helix of the length

11.3 SQL as the Interface Between User and the Database

Table 11.2 Input arguments of PSS-SQL functions

Argument       Description
@ProteinId^a   Unique identifier of a protein in the database table that contains sequences of SSEs (e.g., id field in case of the ProteinTbl)
               Database field containing sequences of SSEs of proteins (e.g., secondary)
               Query pattern represented by a set of segments, e.g., h(2;10),c(1;5),?(2;*)
               Optional simple or complex filtering criteria that limit the list of proteins processed during the search, e.g., length 150

    ORDER BY AC, s.startPos

    -- invoking sequencePosition and standard JOIN
    SELECT p.protAC AS AC, p.name, s.startPos, s.endPos,
           p.[primary], s.matchingSeq, p.secondary
    FROM ProteinTbl AS p
    JOIN dbo.sequencePosition('secondary',
             'e(1;10),c(0;5),h(5;6),c(0;5),e(1;10),c(5)',
             'p.name LIKE ''%Staphylococcus aureus%'' AND p.length > 150') AS s
      ON p.id = s.proteinId
    ORDER BY AC, s.startPos
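The segment notation used in the query patterns above, such as e(1;10),c(0;5),h(5;6), can also be tried out away from the database. The following is a minimal Python sketch, not the PSS-SQL implementation: it assumes that a sequence of SSEs is stored as a string over the letters H, E, and C, it translates a pattern into a regular expression, and it reports exact matches only. The regular expression stands in for the alignment-based search described in this chapter, so gaps and k-best matches are not modeled. The names pattern_to_regex and find_matches, and the 0-based positions, are hypothetical.

```python
import re

# Assumed one-letter encoding of SSEs: H = alpha-helix, E = beta-strand,
# C = loop. The '?' symbol stands for any type of element.
SSE_CLASS = {"h": "H", "e": "E", "c": "C", "?": "[HEC]"}

# One segment: type, lower bound, and an optional upper bound or '*'.
SEGMENT = re.compile(r"^([hec?])\((\d+)(?:;(\d+|\*))?\)$")


def pattern_to_regex(pattern: str) -> str:
    """Translate a pattern such as 'e(2;5),h(10;*),c(1;20)' into a regex."""
    parts = []
    for seg in pattern.replace(" ", "").split(","):
        m = SEGMENT.match(seg)
        if m is None:
            raise ValueError(f"malformed segment: {seg!r}")
        kind, low, high = m.groups()
        if high is None:            # h(35): exact length
            quant = "{%s}" % low
        elif high == "*":           # h(10;*): at least 10 elements
            quant = "{%s,}" % low
        else:                       # e(2;5): length interval
            if int(low) > int(high):
                raise ValueError(f"lower bound exceeds upper bound: {seg!r}")
            quant = "{%s,%s}" % (low, high)
        parts.append(SSE_CLASS[kind] + quant)
    return "".join(parts)


def find_matches(pattern: str, sse: str):
    """Return (start, end, matching sequence) for non-overlapping exact matches."""
    rx = re.compile(pattern_to_regex(pattern))
    return [(m.start(), m.end() - 1, m.group()) for m in rx.finditer(sse)]


if __name__ == "__main__":
    query = "e(1;10),c(0;5),h(5;6),c(0;5),e(1;10),c(5)"
    print(pattern_to_regex(query))
    print(find_matches(query, "CCEEECCHHHHHCEECCCCCHH"))
```

A sketch like this is only useful for checking that a pattern says what was intended before it is passed to a function such as sequencePosition; the filtering on the segment index and the alignment itself remain on the database side.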
Listing 11.2 Sample query invoking sequenceMatch and sequencePosition table functions and returning information on proteins from Staphylococcus aureus having the length greater than 150 residues and containing the given secondary structure pattern

These sample queries return Accession Numbers (ACs) and names of proteins from Staphylococcus aureus that are longer than 150 residues and have a structural region containing a β-strand of length 1 to 10 elements, an optional loop of up to 5 elements, an α-helix of length 5 to 6 elements, an optional loop of up to 5 elements, a β-strand of length 1–10 elements, and a 5-element loop, i.e., the pattern e(1;10),c(0;5),h(5;6),c(0;5),e(1;10),c(5). Partial results of the query from Listing 11.2 are shown in Fig. 11.12. A detailed description of the output fields of the sequenceMatch and sequencePosition functions is given in Table 11.3.

Results of the PSS-SQL queries are originally returned in a tabular form. However, adding an extra FOR XML clause at the end of the SELECT statement, as in the example in Listing 11.3, produces results in the XML format, which can be easily transformed into an HTML Web page by using an appropriate XSLT transformation file and, finally, published on the Internet. Partial results of the query from Listing 11.3 are presented in Fig. 11.13.

Table 11.3 Output table of sequenceMatch and sequencePosition functions

Field         Description
proteinId     Unique identifier of the protein that contains the specified pattern
startPos      Position where the pattern starts in the target protein from a database
endPos        Position where the pattern ends in the target protein from a database
length        Length of the segment that matches the given pattern
matchingSeq   Exact sequence of SSEs that matches the pattern defined in the query

Fig. 11.13 Partial results of the query from Listing 11.3

An additional function, superimpose, that was used in the presented query (Listing 11.3) visualizes the
alignment of the matched sequence and the database sequence of SSEs.

    SELECT p.protAC AS AC, p.name, s.startPos, s.endPos, s.matchingSeq,
           p.[primary],
           dbo.superimpose(s.matchingSeq, p.secondary) AS alignment
    FROM ProteinTbl AS p
    CROSS APPLY dbo.sequenceMatch(p.id, 'secondary',
             'e(1;10),c(0;5),h(5;6),c(0;5),e(1;10),c(5)') AS s
    WHERE p.name LIKE '%Staphylococcus aureus%' AND p.length > 150
    ORDER BY AC, s.startPos
    FOR XML RAW('protein'), ROOT('proteins'), ELEMENTS

Listing 11.3 Sample query invoking sequenceMatch table function and returning results as an XML document by using the FOR XML clause

11.4 Efficiency of the PSS-SQL

The efficiency of the PSS-SQL query language was examined in various experiments. Tests were performed on the Microsoft SQL Server 2012 Enterprise Edition working on nodes of the virtualized cluster controlled by the Hyper-V hypervisor hosted on Microsoft Windows 2008 R2 Datacenter Edition 64-bit. The host server had the following parameters: 2x Intel Xeon CPU E5620 2.40 GHz, RAM 92 GB, 3x HDD 1 TB 7200 RPM. Cluster nodes were configured to use virtual CPU cores and 4 GB RAM per node and worked under the Microsoft Windows 2008 R2 Enterprise Edition 64-bit operating system.

Most of the tests were performed on the database storing 6,360 protein structures. However, in order to compare our language to one of the competitive solutions, some tests were performed on the database storing 248,375 protein structures.

During the experiments we measured execution times for various query patterns. The query patterns were passed as a parameter of the sequencePosition function. Tests were performed for queries containing the following sample patterns:

• SSE1: e(4;20),c(3;10),e(4;20),c(3;10),e(15),c(3;10),e(1;10)
• SSE2: h(30;40),c(1;5),?(50;60),c(5;10),h(29),c(1;5),h(20;25)
• SSE3: h(10;20),c(1;10),h(243),c(1;10),h(5;10),c(1;10),h(10;15)
• SSE4:
e(1;10),c(1;5),e(27),h(1;10),e(1;10),c(1;10),e(5;20)
• SSE5: e(5;20),h(2;5),c(2;40),?(1;30),e(5;*)

Pattern SSE1 represents a protein structure built only with β-strands connected by loops. Pattern SSE2 consists of several α-helices connected by loops and one undefined segment of SSEs ('?' wildcard symbol). Patterns SSE3 and SSE4 have regions that are unique in the database, i.e., h(243) in pattern SSE3 and e(27) in pattern SSE4. Pattern SSE5 has a wildcard symbol '*' for undetermined length, which slows down the search process.

In order to verify the influence of particular acceleration techniques on the execution times, tests were carried out for the PSS-SQL in three variants:

• without multi-threading (–MT),
• with multi-threading, but without multiple scanning of the segment index (+MT–MSSI),
• with multi-threading and with multiple scanning of the segment index (+MT+MSSI).

Results of the tests presented in Fig. 11.14 show that the performance of the +MT–MSSI variant is higher, and in the case of SSE1 and SSE2 much higher, than that of the –MT variant (implemented in the original PSS-SQL). For +MT+MSSI we can see an additional improvement in performance. It is difficult to estimate the overall acceleration, because it depends strongly on the uniqueness of the pattern. The more unique the pattern is, the more proteins are filtered out based on the segment index, the fewer proteins are aligned, and the less time we need to obtain results. We can see it clearly in Fig. 11.14 for patterns SSE3 and SSE4, which have precisely defined, unique regions h(243) and e(27). For universal patterns, like SSE1 and SSE2, for which we can find many fitting proteins or multiple alignments, we can observe longer execution times. In such cases, the parallelization and multiple scanning of the segment index start playing a more significant role. In these cases, the length of the pattern influences the alignment time: for longer patterns we experienced longer response times. We have not observed any dependency between the type of the SSE and the response time. However, specifying wildcards in the query pattern increases the waiting period, which is visible for the pattern SSE5 (Fig. 11.15).

In Fig. 11.15 for the pattern SSE5, we can also see how beneficial the use of the MSSI technique can be. In this particular case, the execution time was reduced from 920 seconds in –MT (original PSS-SQL), and 550 seconds in +MT–MSSI, to 15 seconds in +MT+MSSI, which gives a 61.33-fold speedup over the –MT variant and a 36.67-fold speedup over the +MT–MSSI variant.

Fig. 11.14 Execution time for various query patterns SSE1–SSE4 and for three variants of the PSS-SQL language: without multi-threading (–MT), with multi-threading, but without multiple scanning of the segment index (+MT–MSSI), with multi-threading and with multiple scanning of the segment index (+MT+MSSI). Axes: query (SSE1 to SSE4) against time (s).

Fig. 11.15 Execution time for query pattern SSE5 for three variants of the PSS-SQL language: without multi-threading (–MT), with multi-threading, but without multiple scanning of the segment index (+MT–MSSI), with multi-threading and with multiple scanning of the segment index (+MT+MSSI). Axes: query (SSE5) against time (s).

11.5 Discussion

The PSS-SQL language complements existing relational database management systems, which are not designed to process biological data, such as protein secondary structures stored as sequences of secondary structure elements. By extending the standard SELECT, UPDATE, and DELETE statements of the SQL language, it provides a declarative method for retrieving, modifying and deleting records. Records that satisfy the criteria given by a user can be returned in a table-like form or as an XML document, which is easy to display as a Web page. In such a way, the PSS-SQL extension to RDBMS provides a kind of domain-specific language for
processing protein secondary structures. This is especially important for relational database designers and a wide group of biological data analysts and bioinformaticians.

The PSS-SQL language can be used for the fast classification of proteins based on their secondary structures. For example, systems such as SCOP [20] and CATH [21] make use of the secondary structure description of protein structures in order to classify proteins into classes and families. PSS-SQL can also support protein 3D structure prediction by homology modeling, where an appropriate structure profile can be found based on the primary and secondary structure, and the secondary structure can be superimposed on the protein of the unknown 3D structure before performing a free energy minimization.

Comparing the PSS-SQL to other languages presented in Sect. 11.1, we can notice that all variants of the PSS-SQL extend the syntax of the SQL. This makes the PSS-SQL similar to PiQL [26], rather than to ProteinQL [27]. ProteinQL was developed for the object-oriented database and relies on its own domain-specific database and dedicated ProteinQL interpreter and translator. As opposed to ProteinQL, both PiQL and PSS-SQL extend capabilities of a relational database management system (RDBMS). They extend the syntax of the SQL language by providing additional functions that can be nested in particular clauses of the SQL commands. However, the form of queries provided by users is different. PiQL accepts query patterns in a full form, like in BLAST [1], a tool used for fast local matching of biomolecular sequences of DNA and proteins.

Query patterns provided in PSS-SQL are similar to those presented by Hammel and Patel in [10]. The pattern defined in a query does not have to be specified strictly. Segments in the pattern can be specified as intervals and they can have undefined lengths. Both languages allow specifying query patterns with undefined types of the SSE or patterns where some SSE segments may occur optionally. Therefore,
the search process has an approximate character, regarding various possible options for segment matching. The possibility of defining patterns that include optional segments allows users to specify gaps in a particular place.

The described version of the PSS-SQL also uses the method of scanning the segment index in order to accelerate the search process. The method was adopted from the work of Hammel and Patel [10]. However, after multiple scans of the segment index, Hammel and Patel used sort-merge join operations in order to join segments from the same candidate proteins and decide whether they meet the specified query conditions or not. The novelty of PSS-SQL is that it relies on the alignment of the found segments. Alignment implemented in PSS-SQL gives the unique possibility of finding many matches for the same database protein and returning k-best matches, matches that in some particular cases can be separated by gaps. These are not the gaps defined by a user and specified by an optional segment, but the gaps providing better alignment of particular regions. This type of matching is typical for similarity searching between biomolecular sequences, such as DNA/RNA sequences or amino acid sequences. The presented approach extends the spectrum of searching and guarantees the optimality of the results according to the assumed scoring system.

Despite the fact that PSS-SQL uses the alignment procedure, which is computationally complex, it achieved quite good performance. We have compared the efficiency of the PSS-SQL (+MT+MSSI variant) and the language presented by Hammel and Patel for single-predicate exact match queries with various selectivity (between 0.3 and 6%) using the database storing 248,375 proteins (515 MB for ProteinTbl, 254 MB for the segment table storing 11,986,962 segments). The PSS-SQL was on average 5.14 times faster than the Comm-Seg implementation, 3.28 times faster than the Comm-CSP implementation, both implemented on a commercial ORDBMS, and 1.84 times faster than the ISS-MISS(1)
implementation on Periscope/SQ. This shows that PSS-SQL compensates for the efficiency loss caused by the alignment procedure by using the segment index. In such a way, the PSS-SQL joins wide capabilities of the alignment process (possible gaps, mismatches, and many solutions), provides optimality and quality of results, and guarantees efficiency of scanning databases of secondary structures.

11.6 Summary

Integrating methods of protein secondary structure similarity searching with database management systems provides an easy way to manipulate biological data without the necessity of using external data mining applications. The PSS-SQL extension presented in this chapter is a successful example of such integration. PSS-SQL is certainly a good option for biological and biomedical data analysts who want to process their data on the server side. This has many advantages that are typical for such processing in the client-server architecture. The entire logic of data processing is performed on the database server, which reduces the load on the user's computer. Therefore, data exploration is performed while retrieving data from a database. Moreover, the amount of data returned to the user, and the network traffic between the server and the user application, are much reduced.

The use of multi-threading makes it possible to utilize the available computing power more efficiently. The PSS-SQL adapts to the number of processing units possessed by the server hosting the database management system and to the number of cores used by the database system. This results in better performance of the language while scanning huge databases of protein secondary structures.

Parallelization of calculations in bioinformatics brings tangible benefits and reduces the execution time of many algorithms. In this chapter, we could see one of many examples of such parallelization. For the latest information on the PSS-SQL, please visit the project home page:
http://zti.polsl.pl/dmrozek/science/pss-sql.htm

For readers who are interested in other examples, I recommend the book Parallel Computing for Bioinformatics and Computational Biology by Albert Y. Zomaya [31] for further reading.

References

1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990). http://www.sciencedirect.com/science/article/pii/S0022283605803602
2. Anvik, J., MacDonald, S., Szafron, D., Schaeffer, J., Bromling, S., Tan, K.: Generating parallel programs from the wavefront design pattern. In: Proceedings 16th International Parallel and Distributed Processing Symposium, p (2002)
3. Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Natale, D.A., O'Donovan, C., Redaschi, N., Yeh, L.L.: UniProt: the universal protein knowledgebase. Nucleic Acids Res. 32(suppl-1), D115–D119 (2004). https://doi.org/10.1093/nar/gkh131
4. Berman, H., et al.: The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000)
5. Boutet, E., Lieberherr, D., Tognolli, M., Schneider, M., Bansal, P., Bridge, A.J., Poux, S., Bougueleret, L., Xenarios, I.: UniProtKB/Swiss-Prot, the manually annotated section of the UniProt knowledgebase: how to use the entry view, 23–54 (2016)
6. Can, T., Wang, Y.F.: CTSS: a robust and efficient method for protein structure alignment based on local geometrical and biological features. In: Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference (CSB2003), pp. 169–179 (2003)
7. Date, C.: An Introduction to Database Systems, 8th edn. Addison-Wesley, USA (2003)
8. Frishman, D., Argos, P.: Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. Protein Eng. 9(2), 133–142 (1996)
9. Gibrat, J., Madej, T., Bryant, S.: Surprising similarities in structure comparison. Curr. Opin. Struct. Biol. 6(3), 377–385 (1996)
10. Hammel, L., Patel, J.M.:
Searching on the secondary structure of protein sequences In: Bernstein, P.A., Ioannidis, Y.E., Ramakrishnan, R., Papadias, D (eds.) VLDB ’02: Proceedings of the 28th International Conference on Very Large Databases, pp 634–645 Morgan Kaufmann, San Francisco (2002) 11 Joosten, R.P., te Beek, T.A., Krieger, E., Hekkelman, M.L., Hooft, R.W., Schneider, R., Sander, C., Vriend, G.: A series of PDB related databases for everyday needs Nucleic Acids Res 39(suppl-1), D411–D419 (2011) https://doi.org/10.1093/nar/gkq1105 12 Kabsch, W., Sander, C.: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features Biopolymers 22(12), 2577–2637 (1987) 13 Källberg, M., Wang, H., Wang, S., Peng, J., Wang, Z., Lu, H., Xu, J.: Template-based protein structure modeling using the RaptorX web server Nat Protoc 7, 1511–1522 (2012) 14 Liu, W., Schmidt, B.: Parallel design pattern for computational biology and scientific computing applications In: 2003 Proceedings of IEEE International Conference on Cluster Computing, pp 456–459 (2003) 15 Małysiak-Mrozek, B., Kozielski, S., Mrozek, D.: Server-side query language for protein structure similarity searching, pp 395–415 Springer, Berlin (2012) https://doi.org/10.1007/9783-642-23172-8_26 16 Mrozek, D., Małysiak-Mrozek, B.: CASSERT: a two-phase alignment algorithm for matching 3D structures of proteins In: Kwiecie´n, A., Gaj, P., Stera, P (eds.) 
Computer Networks Communications in Computer and Information Science, vol 370, pp 334–343 Springer International Publishing, Berlin (2013) References 309 17 Mrozek, D., Wieczorek, D., Małysiak-Mrozek, B., Kozielski, S.: PSS-SQL: protein secondary structure - structured query language In: 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, pp 1073–1076 (2010) 18 Mrozek, D., Małysiak-Mrozek, B., Socha, B., Kozielski, S.: Selection of a consensus area size for multithreaded wavefront-based alignment procedure for compressed sequences of protein secondary structures In: Kryszkiewicz, M., Bandyopadhyay, S., Rybinski, H., Pal, S.K (eds.) Pattern Recognition and Machine Intelligence Lecture Notes Computer Science, vol 9124, pp 472–481 Springer International Publishing, Cham (2015) 19 Mrozek, D., Socha, B., Kozielski, S., Małysiak-Mrozek, B.: An efficient and flexible scanning of databases of protein secondary structures J Intell Inf Syst 46(1), 213–233 (2016) https:// doi.org/10.1007/s10844-014-0353-0 20 Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C.: SCOP: a structural classification of proteins database for the investigation of sequences and structures J Mol Biol 247(4), 536– 540 (1995) http://www.sciencedirect.com/science/article/pii/S0022283605801342 21 Orengo, C., Michie, A., Jones, S., Jones, D., Swindells, M., Thornton, J.: CATH a hierarchic classification of protein domain structures Structure 5(8), 1093–1109 (1997) http://www sciencedirect.com/science/article/pii/S0969212697002608 22 Shapiro, J., Brutlag, D.: FoldMiner and LOCK2: protein structure comparison and motif discovery on the Web Nucleic Acids Res 32, 536–41 (2004) 23 Smith, T., Waterman, M.: Identification of common molecular subsequences J Mol Biol 147(1), 195–197 (1981) http://www.sciencedirect.com/science/article/pii/ 0022283681900875 24 Socha, B.: Multithreaded execution of the Smith-Waterman algorithm in the query language for protein secondary structures 
Master’s thesis, Institute of Informatics, Silesian University of Technology, Gliwice, Poland (2013)
25. Stephens, S.M., Chen, J.Y., Davidson, M.G., Thomas, S., Trute, B.M.: Oracle database 10g: a platform for BLAST search and regular expression pattern matching in life sciences. Nucleic Acids Res. 33(suppl-1), D675–D679 (2005). https://doi.org/10.1093/nar/gki114
26. Tata, S., Friedman, J.S., Swaroop, A.: Declarative querying for biological sequences. In: 22nd International Conference on Data Engineering (ICDE’06), pp. 87–98 (2006)
27. Wang, Y., Sunderraman, R., Tian, H.: A domain specific data management architecture for protein structure data. In: 2006 International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 5751–5754 (2006)
28. Wieczorek, D., Małysiak-Mrozek, B., Kozielski, S., Mrozek, D.: A declarative query language for protein secondary structures. J. Med. Inform. Technol. 16, 139–148 (2010)
29. Wieczorek, D., Małysiak-Mrozek, B., Kozielski, S., Mrozek, D.: A method for matching sequences of protein secondary structures. J. Med. Inform. Technol. 16, 133–137 (2010)
30. Yang, Y., Faraggi, E., Zhao, H., Zhou, Y.: Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of query and corresponding native properties of templates. Bioinformatics 27(15), 2076–2082 (2011). http://dx.doi.org/10.1093/bioinformatics/btr350
31. Zomaya, A.Y.: Parallel Computing for Bioinformatics and Computational Biology: Models, Enabling Technologies, and Case Studies, 1st edn. Wiley-Interscience, New York (2006)

Index

A
Accuracy, 231
Aligned Fragment Pairs (AFP), 72, 163, 253
Alignment, 74
Alpha helix, 8, 9, 287
Amino acid chain
Amino acids
Amino acid sequence
Apache Spark, 226
Area Under the Curve (AUC), 232
Asynchronous task execution, 114
Atoms
Azure BLOB, 111
Azure SQL Database, 111

B
Backbone, 12
Backtracking, 261, 291
Beta sheet, 287
Beta strand, 8, 9, 287
Big Data, 33, 137
  value, 35
  variety, 34
  velocity, 34
  veracity, 34
  5V model of big data, 34
  volume, 34
Block index, 263
BLOSUM62, 270
Bond angles, 17
Bonded interactions, 22
Bond lengths, 5, 16
Bonds

C
Cartesian coordinates
Central Processing Unit (CPU), 55, 180
Cloud computing, 30, 71
  characteristics, 31
  cloud service models, 31
  community cloud, 33
  deployment models, 33
  hybrid cloud, 33
  Infrastructure as a Service (IaaS), 31
  Platform as a Service (PaaS), 32
  private cloud, 33
  public cloud, 33
  Software as a Service (SaaS), 32
Clustered index, 288
Coalesced access, 263, 278
Coil, 287
Combinatorial Extension (CE), 71
Compute Unified Device Architecture (CUDA), 39, 40, 262
  blocks, 41
  grid, 41
  kernel, 40
  kernel execution, 42
  threads, 40, 41
Conformational energy, 20
Consensus modes, 219, 222
Contact patterns, 252
Coulomb potential, 22
CPU cores, 36, 86
Critical section, 293
CUDA streams, 267
CUDA transactions, 262

D
Database Management System (DBMS), 284, 285, 298
Declarative query language, 284
Dihedral angles, 17
Disulfide bridges, 12, 14
3D protein structure
Dynamic programming, 258, 278

E
Electrostatic potential, 22
Energy minimization, 15
Enzymes
Extensible Markup Language (XML), 302

F
False negatives, 230
False positives, 230
Fixed Number of Proteins Package-based (FNPP-based) scheduling scheme, 84
Flexible structure AlignmenT by Chaining Aligned fragment pairs allowing Twists (FATCAT), 71
F-measure, 232
Fold recognition, 10
Force fields, 20
FOR XML clause, 302
Functional groups, 12
Fuzzy set, 224
Fuzzy smoothing filter, 224

G
Gap penalty, 258, 259, 290
Gaps, 73
General-Purpose Graphics Processing Units (GPGPU), 39
Genetic code
Global memory, 262, 264, 278
GPU applications, 152
GPU-CASSERT, 261
GPU devices, 39
Graphics Processing Unit (GPU), 39, 262
  constant cache, 40
  constant memory, 40, 41, 273
  global memory, 40, 41, 273, 278
  registers, 273
  scalar processor cores, 41
  scalar processors, 40
  shared memory, 40, 41, 273
  streaming multiprocessor, 41, 273
  texture cache, 40
  texture memory, 40, 41, 278
  warp, 41
Grid computing, 105

H
Hadoop, 35, 53, 138, 187, 188
  application master, 142, 189
  cluster managers, 143
  head node, 194
  JobTracker, 141, 159
  resource management, 142
  resource manager, 142, 194
  scheduling, 142
  sequential files, 156, 189, 200
  splits, 140, 156
  TaskTracker, 141, 160
  Yet Another Resource Negotiator (YARN), 142, 144
Hadoop Distributed File System (HDFS), 35, 138, 156, 162, 187, 226
  blocks, 139
  block size, 139, 156, 189
  datanode, 139, 196
  namenode, 139, 194
  replication, 139
Hadoop ecosystem, 146
HBase, 148, 187, 188
HDInsight, 188, 226
HDInsight4PSi, 187
High resolution alignment, 257
Hive, 53
Homology modeling, 10
Horizontal scaling, 76
Hyper-threading, 36

I
IDPP, 218
IDPP meta-predictor, 218
IDP predictors, 217
InfiniBand, 58
Infrastructure as a Service (IaaS), 51, 54
Interatomic distances, 16
Inter-residue distances, 252
Intrinsically Disordered Proteins (IDPs), 216
Intrinsically disordered proteins meta-predictor, 218
Intrinsically Disordered Regions (IDRs), 216
Intrinsically Unstructured Proteins (IUPs), 216

K
Kafka, 53
Kernel, 267

L
LLAP, 53
Local alignment, 285
Loop, 8, 9, 287
Low resolution alignment, 257, 278

M
Macromolecules
Manager role, 81
MapReduce, 35, 140, 187
  jobs, 140
  key–value, 163, 192
  map function, 192
  map-only pattern, 156, 159
  mapper class, 162
  Map phase, 140, 189
  MapReduce 1.0, 141
  MapReduce 2.0, 142
  Map task, 140, 157, 189
  MRv1, 141
  MRv2, 142
  Reduce phase, 140, 189
  Reduce task, 140, 157
  shuffle, 140
Matthews correlation coefficient, 232
Membership function, 224
Messages, 80, 85
Microsoft Azure, 51, 52
  application, 52
  app services, 54
  BLOB, 53, 87, 187
  cloud explorer, 63
  cloud services, 52, 59
  cloud storage account, 60
  compute, 52
  compute emulator, 64
  configuration file, 61
  data services, 53
  definition file, 61
  fabric, 54
  HDInsight, 53, 187
  messages, 63
  mobile services, 53
  networking, 54
  queue client, 60
  queues, 59, 87, 99
  roles, 99
  SQL database, 53, 187
  tables, 53, 87, 187
  virtual machines, 52
  web sites, 53
Microsoft Azure virtual machine size, 55–58
Microsoft SQL Server, 285, 299
Minimum Path End (MPE), 291
mmCIF files, 162, 184
Molecular dynamics, 20
Molecular mechanics, 20
Molecular residue descriptors, 255, 264, 270
Monte Carlo method, 111
Multi-core CPUs, 38
Multi-core processor, 36
Multiprocessor, 41, 273
Multi-threaded application, 37, 98, 291
Multithreading, 36
Mutual-exclusion lock, 293

N
Non-bonded interactions, 22

O
Object-Oriented Database (OODB), 285
Oracle, 285

P
Page-locked memory, 267
Parallel alignments, 270
PDB files, 162, 184
PDBML files, 184
Peptide bond, 6, 18
Platform as a Service (PaaS), 51, 54
Polypeptide sequence
Positive Predictive Value (PPV), 232
Potential energy, 20
PowerShell, 189
Precision, 232
Prediction job, 113
PredictionManager role, 111
PredictionWorker role, 113
Primary structure, 6
Processing model, 114
Protein conformation
Protein Data Bank (PDB), 183
Protein folding, 10
Proteins
Protein Secondary Structure - Structured Query Language (PSS-SQL), 298, 300–302
Protein sequence
Protein similarity searching, 251
Protein spatial structure, 4
Protein structure matching, 257
Protein structure prediction, 103, 104, 108
  ab initio methods, 105
  comparative methods, 105
  Critical Assessment of protein Structure Prediction (CASP), 105
  fold recognition, 105
  force fields, 104
  homology modeling, 105
  methods, 104
  physical methods, 104
  potential energy function, 108
  secondary structure prediction, 105
  Warecki–Znamirowski (WZ) method, 109
Protein synthesis

Q
Qualification threshold, 259, 273
Quaternary structure, 13
Quaternions, 261
Query pattern, 299, 300
Query profile, 266, 267, 278
Queues, 54, 79, 113, 130

R
R, 53
Ramachandran plot, 19
Recall, 231
Receiver Operating Characteristic (ROC) curves, 232
Reduced chains of secondary structures, 254, 255, 264, 265
Relational database, 42, 284
  columns, 42
  records, 42
  rows, 42
  table, 42
  table attributes, 43
  tuples, 42
Relational Database Management System (RDBMS), 43
Relative coordinates, 16
Remote Direct Memory Access (RDMA), 58
Resilient Distributed Data set (RDD), 226
Root Mean Square Deviation (RMSD), 73, 261

S
Scalability, 45
  horizontal scaling, 46
  scaling out/in, 46
  scaling up/down, 46
  vertical scaling, 46
Scaling out, 76
Scaling up, 76
Scheduling computations, 114
Searcher role, 86
Secondary structure, 8, 283
Secondary Structure Elements (SSEs), 252, 264, 287
Secondary structure types, 287
Segment index, 288
Segment table, 287
SELECT statement, 284, 300, 302
Semaphore, 293
Sensitivity, 231
Service Bus, 54
Shape signatures, 252
Similarity matrix, 257, 264, 292
Similarity measure, 265
Similarity searching, 70
Single Instruction, Multiple Data (SIMD), 41
Single Instruction, Multiple Thread (SIMT), 41
Singular Value Decomposition (SVD), 261
Software as a Service (SaaS), 51, 54
Spark, 53, 143, 187
  actions, 145
  data partitions, 144, 226
  driver program, 143, 227
  executors, 144, 226
  FIFO queue, 226
  pipe transformation, 145, 226
  RDD caching, 145
  RDD fault tolerance, 145
  Resilient Distributed Data set (RDDs), 144
  saveAsTextFile action, 145, 227
  spark context, 143
  transformations, 144
Spark-IDPP, 218
Spark-IDPP meta-predictor, 226
Specificity, 231
Sterical collisions, 19
Storm, 53
Structural alignment, 70
Structural alignment speed, 197
Structural descriptors, 263
Structural similarities, 69
Structured Query Language (SQL), 44, 284, 298, 300, 301
  FROM clause, 44
  query, 44
  SELECT statement, 44
  WHERE clause, 44
Superposition, 70, 74, 260, 265
Synchronization, 38, 295

T
Task scheduling, 81
Tertiary structure, 10, 11
Thread, 37, 262, 292, 293
Thread index, 263
Torsion angles, 17, 19
Transact-SQL, 285
True negatives, 230
True Positive Rate (TPR), 231
True positives, 230
Turn, 287
Twists, 73
Two-phase alignment algorithm, 257

V
Valence angles, 17
van der Waals potential, 21
Vertical scaling, 76
Virtual Hard Drive (VHD), 78
Virtual Machines (VMs), 54, 55, 187
  series, 55
  sizes, 55
Virtualization, 30
Visual Studio.NET, 63

W
Warp, 262
Wavefront, 292
Web role, 52, 59, 78, 111, 130
Worker role, 52, 59, 81, 86, 130

Y
Yet Another Resource Negotiator (YARN), 189
  containers, 194
  node managers, 142, 189

Z
Zookeeper, 148