Woodhead Publishing Series in Biomedicine: Number 16

Open source software in life science research
Practical solutions in the pharmaceutical industry and beyond

Edited by Lee Harland and Mark Forster

Published by Woodhead Publishing Limited, 2012

Woodhead Publishing Limited, 80 High Street, Sawston, Cambridge, CB22 3HJ, UK
www.woodheadpublishing.com
www.woodheadpublishingonline.com
Woodhead Publishing, 1518 Walnut Street, Suite 1100, Philadelphia, PA 19102–3406, USA
Woodhead Publishing India Private Limited, G-2, Vardaan House, 7/28 Ansari Road, Daryaganj, New Delhi – 110002,
India
www.woodheadpublishingindia.com

First published in 2012 by Woodhead Publishing Limited
ISBN: 978-1-907568-97-8 (print); ISBN: 978-1-908818-24-9 (online)
Woodhead Publishing Series in Biomedicine ISSN: 2050-0289 (print); ISSN: 2050-0297 (online)
© The editor, contributors and the Publishers, 2012

The right of Lee Harland and Mark Forster to be identified as authors of the editorial material in this Work has been asserted by them in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

British Library Cataloguing-in-Publication Data: A catalogue record for this book is available from the British Library.
Library of Congress Control Number: 2012944355

Typeset by RefineCatch Limited, Bungay, Suffolk
Printed in the UK and USA

For Anna, for making
everything possible
Lee Harland

Thanks to my wife, children and other family members, for their support and understanding during this project
Mark Forster

Contents

List of figures and tables
Foreword
About the editors
About the contributors
Introduction

1 Building research data handling systems with open source tools
Claus Stie Kallesøe
1.1 Introduction
1.2 Legacy
1.3 Ambition
1.4 Path chosen
1.5 The ’ilities
1.6 Overall vision
1.7 Lessons learned
1.8 Implementation
1.9 Who uses LSP today?
1.10 Organisation
1.11 Future aspirations
1.12 References

2 Interactive predictive toxicology with Bioclipse and OpenTox
Egon Willighagen, Roman Affentranger, Roland C Grafström, Barry Hardy, Nina Jeliazkova and Ola Spjuth
2.1 Introduction
2.2 Basic Bioclipse–OpenTox interaction examples
2.3 Use Case 1: Removing toxicity without interfering with pharmacology
2.4 Use Case 2: Toxicity prediction on compound collections

This intranet solution (shown in Figure 22.1) does, of course, have a number of drawbacks. First, there is the cost of hosting the software, testing it and managing updates, as well as providing training to the IT helpdesk and the users. These costs are not insignificant, and in many cases make open source software as expensive to run and support as commercial alternatives, especially when training material of a high standard is not already available. This cost is then duplicated many times at other corporations that are doing similar research. On top of that, as many companies begin to work in more collaborative ways, they open B2B (business to business) connections through their firewalls. This has led to more concern about the security vulnerabilities of the software hosted internally. The vision of the Sequence Services
Project is to try to reduce the total costs of running and maintaining such a system, while ensuring that functionality, performance and, above all, security are not compromised. This new vision is shown in Figure 22.2, where the services are hosted and maintained by a third-party vendor.

Figure 22.1 Deploying open source software and data inside the data centres of corporations

Figure 22.2 Vision for a new cloud-based shared architecture

Pistoia chose an open source suite of software that is well known in the bioinformatics world as its test software for the proof of concept work. Before starting there were discussions with the institute that wrote the software, building an agreement to communicate the results of the project and, in particular, to share any recommendations or security vulnerabilities that were discovered.

In order to test and maintain security a number of approaches were taken. The first was to run an extensive ethical hack of the chosen open source software. Pistoia did this by employing the professional services of a global IT company with a specialist security division, producing an extensive report containing a breakdown of vulnerabilities. Although I cannot list specific security issues found in the Pistoia project, I would like to highlight the most common vulnerabilities found in such software. For those who are particularly interested in this topic, a valuable resource, including a list of the current top 10 vulnerabilities, can be found at the Open Web Application Security Project (OWASP) website [12].

■ Application allows uploading of malware. Whenever a file is being uploaded, at minimum the software should check whether this is the type of file it is expecting, and stop all others. A further step could be taken by running a virus check on the file.
■ Insufficient account lockout. When a user has authenticated and gained access to the system, there should be a method to log the user out automatically after a set period of inactivity.
■ Unauthorised read access. A common example is that a user who has been given access to a system can reach areas that they shouldn’t by manually entering a URL.
■ Cross site framing/scripting. This is when an attacker can use a bug to re-direct a user to a website that the attacker controls. This will usually be made to look like the original site, and can be used to harvest information like login IDs and passwords.
■ Autocomplete not disabled. The system should not offer to remember passwords or similar security tokens.
■ Web server directory indexing enabled. This is often left on by default, and can give an attacker some indication of where weaknesses exist or where sensitive data may reside.
■ Sensitive information disclosed in URL. Some web-hosted software will place certain pieces of information, such as search terms, in a URL (search=web+site+security+). This could be viewed by an attacker if the connection is not secured.
■ Verbose error messages. An error message should ideally be written so that it is clear to the user that an error has occurred; however, it should not contain any information that an attacker may find useful (such as the user’s login ID, software versions, etc.).
This can all be put in an error log and viewed securely by a support team.
■ SQL injection. This is a technique whereby an attacker can run a database search that the software would not normally allow, by manipulating the input to the application. This is not difficult to prevent if developers are thinking about security, so that the system only ever runs validated input that it is expecting.
■ Concurrent sessions. Users should not be able to log into the system more than once at any given time.
■ Web server advertises version information in headers. Revealing the version of the server software currently running makes it easier for hackers to search for vulnerabilities for that version.

Even if a system has been found to have no known vulnerabilities during testing, there are further steps that can be taken to minimise any risks.

■ Regular security re-testing, at least at each version release – this, of course, costs money to maintain.
■ Use of virtual private clouds, where private data are clearly segregated.
■ Use of IP filtering at the firewall to reject any IP address other than the approved ones. Although IP addresses can be spoofed, a hacker would need to know the correct ones to spoof.
■ Use of appropriate validated authentication standards, obviating the need for services to provide their own ad hoc systems.
■ Use of remote encryption key servers, which would prevent even the vendor hosting the software from being able to view the private data they are hosting.
■ Use of two-factor authentication. Many users will use the same short password for every service, meaning that once an attacker has found a user name and password at one site, they can quickly try it at many others. Two-factor authentication usually takes the form of a device that gives a seemingly random number that changes on a regular basis, which must be entered along with the traditional user name and password.
■ The use of appropriate insurance to provide some financial compensation if a service
is down or breached.

22.7 Conclusion

As I hope I have highlighted, just as under-secretary Kennedy explained in his answer at the beginning of this chapter, there are many costs associated with the use of open source software in government departments or in industry. As soon as there is a need to store or process sensitive data such as medical records, the costs and complexity can increase considerably. The cost benefits of open source software being free to download and use can quickly be lost once training, support, security, legal reviews, update management, insurance and so on are taken into account.

However, in many areas of scientific research, open source software projects are truly world leading. For industry to get maximum value from open source software it needs to actively participate in the development process. Organisations like Pistoia show one way that this participation can be done, where many companies come together in areas that are important (yet pre-competitive). This approach has many benefits.

■ Shared requirements – allowing industry to speak with one voice, rather than many different ones. This can obviously help with understanding how software might be used, in prioritising future work and in improving and adopting open standards.
■ Shared costs and risks – with many companies all sharing the cost of things like expert security reviews of software, the share of the cost to each company can quickly become very reasonable.
■ Feedback and collaboration – even if industry is not participating directly in the writing of a given open source software package, it can help to build relationships with the authors. This way the authors can better understand industry requirements. Importantly, authors will not see a list of security vulnerabilities merely as a criticism, but rather as a contribution to making the software better for everyone.

Open source software
is very likely to play an important role in the life sciences industry in the future, with companies not only using open source software, but also actively contributing back to the community. Indeed, earlier in this book (Chapter 1) we saw a great example of exactly this from Claus Stie Kallesøe’s description of the LSP4All software.

Another exciting example of industry using open source software to help drive innovation is demonstrated by the Pistoia Alliance’s recent ‘Sequence Squeeze’ competition. The Pistoia Alliance advertised a challenge to identify improved algorithms to compress the huge volumes of data produced by NGS. In order to find an answer they reached out to the world by offering a $15 000 prize for the best new algorithm. All algorithms had to be submitted via the SourceForge website, and under the BSD2 licence. The BSD2 licence was deliberately chosen by Pistoia as it has minimal requirements about how the software can be used and re-distributed – there are no requirements on users to share back any modifications (although, obviously, it would be nice if they did). This should lower the legal barriers within companies to adopting the software, and is a great example of industry embracing open source standards to help drive innovation. The challenge was open to anyone and entries were ranked on various statistics concerning data compression and performance. A detailed breakdown of the results of the challenge is available from [13].

In conclusion, it is clear that free and open source software is not really ‘cost-free’ in an industrial setting. It is also true that there can be issues that make deployment difficult, particularly where sensitive data are concerned. However, by developing new models for hosting and supporting such software we could be witnessing a new direction for the use of free and open source software in commercial environments.

22.8 References
[3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] http://weblogs.mozillazine.org/asa/archives/2009/05/firefox_at_270.html http://www.state.gov/secretary/rm/2009a/july/125949.htm http://www.dnalc.org/view/15073-Completion-of-a-draft-of-the-humangenome-Bill-Clinton.html http://en.wikipedia.org/wiki/Human_Genome_Project http://en.wikipedia.org/wiki/DNA_sequencing The 1000 Genomes Project Consortium, A map of human genome variation from population-scale sequencing Nature 2010;467:1061–73 Barnes, MR et al Lowering industry firewalls: pre-competitive informatics initiatives in drug discovery Nature Reviews Drug Discovery 2009;8:701–8 http://www.pistoiaalliance.org http://en.wikipedia.org/wiki/SpamAssassin http://www.OSS-watch.ac.uk Wilbanks, J Intellectual Property Aspects Of Collaboration in Collaborative Computational Technologies for Biomedical Research (Ekins S, Hupcey MAZ, Williams AJ eds.) Wiley (London) ISBN0470638036 https://www.owasp.org/index.php/Main_Page http://www.sequencesqueeze.org/ Published by Woodhead Publishing Limited, 2012 521 10 20 30 40 41R Index 1000 genomes project 252, 508, 521 Active Directory 242, 268 Adobe Acrobat 354 Adverse Events Reporting System database (AERS) 52, 61 agile development 271 agrochemicals 2, alerting 45, 346, 404, 443, 450, 492 Altmetric 363, 365 AlzSWAN 410, 420 Amazon 327, 468–9, 472, 480 AMES test 44 clinical Analytics 7, 453–5, 457, 459, 461, 463, 465, 467, 469, 471, 473–5, 477–9 clinical data 365, 386, 453, 465, 475 AnnoJ 273–5, 283 Apache 17, 33, 156, 253–4, 260, 272, 278–9, 283, 285, 289–91, 295–7, 303, 322, 327–8, 330, 349–50, 419, 452, 468, 471, 477, 479–80, 505, 515 Apache Tika 328, 330, 349 ArrayExpress 60, 187, 224, 229, 233 ASK query 401, 406, 413–14 AstraZeneca 39, 285–7, 290–1, 325–8, 332–3, 335–6, 338, 341–2, 344, 347, 349 Asynchronous JavaScript and XML (AJAX) 19, 253, 289 backup 121, 129, 241–2, 245, 247, 249–50, 260, 295, 500 Bacula 245–6, 260 Basic Local Alignment Search Tool (BLAST) 250–1, 254, 270, 283 
BeanShell 138 Beowulf 242 BGZip 252, 260 big data 6, 22, 217, 263, 265, 267, 271, 273, 275, 277, 279, 281, 283, 435–7, 439, 441–2, 451, 453, 468–9, 476 bigData database 445 Bingo 69–70, 86, 281 Bioclipse 5, 35–61, 490, 504 BioInvestigation Index 180, 183 BioJava 169, 255, 258, 261, 425, 430, 490, 504 BioLDA 431 BioLinux 180, 187 Biological Process Maps 341–2 biomarker 188, 223, 237, 240, 392, 408, 412 BioPerl 255, 260, 490, 504 BioPortal 180, 187, 226, 238, 426 BioSharing 173, 175–6, 186–7 BLAST-Like Alignment Tool (BLAT) 243, 254 blog 272, 293, 299–300, 304–9, 322–3, 352, 354, 363, 371, 377, 380, 505, 521 microblogging 299, 304–9 Broad Institute 252, 283 business intelligence 7, 475 business model 4, 6–7, 125, 157, 391, 505, 513 cancer 44, 190, 200, 219, 227, 233, 235, 253, 326–7, 344, 406, 474, 506 Cassandra 471–2, 480 Published by Woodhead Publishing Limited, 2012 523 Index Chem2Bio2RDF 6, 59, 421–7, 429–33 ChemBioGrid 422, 425 ChEMBL 43–4, 422, 426 ChemDoodle 71, 82–3, 86 ChemDraw Digester 72–4, 76–7 Chemical entities of biological interest (ChEBI) 66, 183, 188, 429 chemical structure 9, 21, 31–2, 35–6, 45–6, 66, 69–70, 73, 76, 86, 91, 432 cheminformatics/chemogenomics 34, 60–1, 63–4, 70, 83, 151, 158, 297, 422, 424–6, 429–30, 432–3, 452 Chemistry Development Kit 39–40, 43, 61, 66–7, 85–6, 160, 425, 430 chemometrics 60, 90, 115–16, 125, 128 ChemSpider 63–5, 67–75, 77–85, 107, 357, 365, 422 ChIP-Seq 168, 189–90, 193, 196–7, 207–8, 210–11, 213, 215, 219, 229 Clojure 138 cloud computing 15, 125, 246, 505, 512 commercial vendors 64, 124–5, 157, 170, 245, 258–9, 481, 487, 514 competitor intelligence 401, 407 Compound Design Database 287 compute cluster 108, 112, 115–16, 120, 156–7, 169, 242–3, 276–8, 285, 289, 295–6, 298, 328, 385, 431–2, 441, 444, 469, 471–2, 474, 476 copyleft 2, 33 copyright 2, 361–3 corporate database 9–11, 16, 158–9, 231 CouchDB 461, 470, 479–80 crystallography 64, 85 curation 68, 79–80, 84, 173–4, 179–81, 183, 185–7, 221, 225–6, 232–5, 379, 
412, 441 524 cytogenetics 239 Cytoscape 121, 128, 430 CytoSure 255, 260 D2R Server 427 data integrity 491, 496 data mining 10, 113, 129, 158, 237, 353, 409, 422, 424, 433, 474 data warehouse 10, 32, 386, 397, 454 Debian 17, 33, 266, 276, 283 decision tree 152, 168 Design Tracker 6, 285–7, 289, 291–7 diagnostic 116, 239–40, 479 disambiguation 183, 315, 329–32, 378–80 disaster contingency (disaster planning, point of failure) disease resistance 266 disk array 241–3, 246, 249 Django 285, 291–2, 296, 298 Documentum 320, 443 DokuWiki 245, 260 drug repositioning 325, 339, 341 drug target 235, 391–2, 394, 397, 407, 419 Dryad 363, 365 Eclipse platform 112, 128, 151, 155, 253, 260 electronic lab notebook 511 Ensembl 216, 220, 250, 252, 260, 281, 283, 392, 396, 419 Enterprise Research Data Management 11 enterprise search 367, 369, 378–9, 384, 388, 450, 452 European Bioinformatics Institute (EBI) 183, 221–9, 231–5, 237, 250 European Federation of Pharmaceutical Industries and Associations (EFPIA) 181, 188 European Molecular Biology Open Software Suite (EMBOSS) 251, 260, 490, 504 Experimental Factor Ontology 226, 233, 238 Published by Woodhead Publishing Limited, 2012 Index Extensible Stylesheet Language Transformations (XSLT) 67, 71, 461 Extjs 19, 22–3, 33 faceted search 327, 336, 348 failover 285, 291, 294–5, 347 FAST search 450 FASTA 171, 270–1, 275, 282, 445, 499 FASTQ 168–9, 216, 251, 260, 270–1, 283 Fiji 134, 138 Firefox 23, 72, 275, 505–6 Flare 334, 348 Flex 334, 341, 369 freedom to operate 29 Galaxy 187, 190, 219, 276–9, 281–3 GBrowse 169, 171, 273, 281 Gene Expression Atlas 5, 221–2, 224, 231–2, 238 gene set enrichment analysis (GSEA) 327, 384, 387, 425, 436 GeneData 183, 188 GenePix 254, 260 General Public License (GPL) 1–2, 19, 33, 63, 90, 128, 133, 151, 190, 275, 305, 377, 382, 466, 470, 482 GeneSpring 224, 254–5 Genome Analysis ToolKit 252, 260 geospatial search 328 GGA Software 34, 69–70, 82–4, 86, 160, 171, 278, 394, 466 GGobi 125, 129 Git 19, 21, 32–3, 185, 188 
GlaxoSmithKline (GSK) 28, 44, 90, 124, 257, 261, 301, 349 Google File System 468, 480 Google Patents 70, 86 Groovy 40, 253 Hadoop 453, 468, 476–7, 480 HBase 2, 10, 471, 507 Health Level Seven International (HL7) 453, 455–9, 463–4, 474, 479 high availability 285, 291, 296 high content imaging 266 histopathology 186 HTML5 71–2, 85, 389, 504 Human Genome Project 506 HUPO Proteomics Standards Initiative 95, 183 ICD10 (International Classification of Diseases) 453, 479 Illumina 229, 265, 270, 282 image analysis 127, 131–2, 139, 146–9, 267, 273 ImageJ 5, 131–45, 147–9 Imglib 158, 164, 171 InChI 38, 45, 66–9, 72, 81–2, 85, 87, 207, 211, 427 Indigo 153, 158, 160, 171 InfoVis 334 Innovative Medicines Initiative (IMI) 30, 181, 186 Integrated Genome Viewer 273 Investigation Study Assay (ISA) 173, 176–81, 183, 185–7 Java Universal Network Graph 334 Javascript Object Notation (JSON) 227, 271, 274–5, 283, 328, 372–3, 383, 398, 472 JBrowse 273, 275, 283 JChemPaint 70, 86 Johnson and Johnson 369, 374, 385 JSpecView 63, 71, 82–3, 86 Ketcher 21, 32, 34, 70, 82, 86 KnowIt 374–6, 380, 382–9 Knowledge Management 6, 59, 79, 188, 320, 342, 356, 362, 367–8, 370, 379, 381, 389, 391, 409–10, 418, 435, 442, 446, 448–50 Konstanz Information Miner (KNIME) 5, 58, 61, 89, 112–15, 117, 126–8, 131, 139–40, 149, 151–3, 155–9, 161–4, 166–71, 339, 490, 504 Published by Woodhead Publishing Limited, 2012 525 Index Laboratory Information Management Systems (LIMS) 121, 271 LAMP Stack 17, 33, 126, 285, 303 Latent Dirichlet Allocation 431 Learn Chemistry 78–80, 83, 87 licencing 12, 19, 31, 124–5, 132–3, 245, 251, 254–5, 258, 289, 326, 483, 495, 509, 513–15, 521 Life Science Platform (LSP) 9, 30–2, 520 Lightweight Directory Access Protocol (LDAP) 151, 307, 334, 372, 394 Linked Data 6, 16, 29–30, 34, 46, 48, 58–9, 61, 69, 82, 84, 169, 180, 187, 274, 367, 369–70, 382–5, 388–9, 418, 421–7, 429–33, 435, 437, 439–46, 448–52, 457, 479, 519 RDF 6, 16, 29–30, 46, 48, 59, 69, 82, 169, 187, 274, 383–5, 389, 421–7, 
429–33, 435, 439–44, 452, 457, 519 triple store 30, 382–5, 423, 425, 427, 430, 440–1, 445 semantic web 6, 34, 58–9, 61, 84, 180, 367, 385, 388–9, 418, 421–4, 432, 439, 452, 479 SPARQL 43, 382–6, 389, 423, 425, 427, 430, 440, 443, 452 Linux 1, 17, 19, 33, 95, 121, 133, 151, 155–6, 180, 187, 232, 241–4, 247, 253, 257, 268, 285, 289–90, 295–8, 303, 484, 505, 510, 515 Lucene/SOLR 6, 322, 325, 327–33, 336, 339, 341, 344, 346–50, 397, 419, 437, 452 Lundbeck 9–14, 16, 18, 21, 23–5, 27–33 machine learning 34, 61, 125, 152 macro 131–3, 136–45, 147–9, 314 Malware 518 MapReduce 468–9, 472–3, 476–80 Markush structure 76 Mass spectroscopy 5, 89–95, 97, 99, 101, 103, 107, 109, 113, 115, 117, 119, 121, 123, 125–9, 179, 183, 265, 482 526 MATLAB 116, 138 Maven 124 MediaWiki 6, 30, 59, 78–83, 86–7, 285, 290–1, 297, 299, 313–18, 322–3, 367, 369–71, 373–9, 381–9, 391, 393–5, 397–8, 401, 407, 409, 411, 413, 415, 417–20 Medical Subject Headings (MeSH) 3, 132, 331, 339, 349, 397, 404, 420, 435, 476, 512, 519 medicinal chemistry 24, 61, 68, 286–7, 292–4, 401 metabolomics 5, 89–91, 93–5, 97, 99, 101, 103–4, 106–9, 112–15, 117, 119, 121, 123–9, 177, 180, 183, 188, 221, 326, 420 metadata 5, 21, 30, 81, 174–5, 177–81, 186, 224, 233–5, 270–1, 273–4, 276, 330, 333, 348, 354, 363, 385–6, 388, 398, 442, 446, 448, 466, 475 microdata/RDFa 6, 30, 169, 383, 389, 423, 427, 431–2, 440 Microsoft Excel 144, 147, 209, 328, 344 Microsoft Windows 23, 95, 107, 121, 133–4, 151, 155, 199–201, 204–5, 207, 211–12, 241–4, 253, 259, 267–8, 277, 289, 303–5, 307, 312 mind-mapping 409 Minimum Information Standards 174–5, 183, 185, 187–8, 225 Minimal Information About a Microarray Experiment (MIAME) 183, 188, 225, 237 Mirth 458–63, 478–9 Model-View-Controller 22–3, 34, 271 Mongrel 272 Moores law 132, 264, 282 Mozilla Rhino 138 MSPcrunch 250, 260 MySQL 32, 34, 157, 243, 249, 251, 254, 258, 260, 285, 290–1, 294–6, 303, 372, 407, 461, 466–7, 478–80 mzData 95, 98–9, 101 Published by Woodhead Publishing Limited, 2012 Index 
mzMine 89, 101–4, 106–8, 112–13, 126–7
mzML 95, 101, 126
mzXML 95, 101, 126
National Center for Biotechnology Information (NCBI) 71, 85–6, 252, 258–60, 264, 365, 419–20
National Library of Medicine (NLM) 71, 464
NetBeans 253, 260
NetCDF 95, 101, 226, 228, 232, 238
NetVault 245–6, 260
network attached storage 246
NeuronJ 133–4, 141, 149
Nginx 279, 283, 285, 295–6, 298
NIH Image 132–3
NMR spectroscopy 1, 179
NoSQL 453, 461, 466–9, 471–2, 475, 477–8, 480
Omero 272–3, 282–3
Ondex 121, 129, 319
ontology 38–9, 43–4, 59, 63–5, 67, 85, 173, 180, 183, 185, 187–8, 196, 201, 213, 220, 225–6, 233, 235, 238, 252, 271, 283, 339, 342, 373, 379, 396, 398, 401, 404, 419–20, 423–7, 430, 456
Ontology of Biomedical Investigations 183
Open Parser for Systematic IUPAC Nomenclature (OPSIN) 66–7
Open Source Chemistry Analysis Routines (OSCAR) 63–7, 85
Open Source Software Advisory Service 515
Open Web Application Security Project 512, 518, 521
OpenBabel 63, 70, 73–4, 76–7, 86
OpenEye 84, 291, 296, 298
OpenIdentify 336
OpenPHACTS 30–1, 34, 84, 87, 357, 365, 423, 432
OpenTox 5, 35–47, 49–51, 53–61
Oracle 11–13, 16, 21–2, 31–2, 231–2, 251, 258, 260, 296, 397, 466
Oxford Gene Technology 239
Pareto scaling 117
patents 69–70, 86, 240, 326, 338, 341, 354, 435, 443–4, 449, 510
pathway 36, 60, 186, 214, 220, 226–7, 235, 255, 342, 392–6, 401, 404, 406, 408, 413, 415, 417, 421, 423–4, 426, 429, 432, 441
patient 193, 223, 325–6, 339, 342, 408, 412, 426, 455–8, 468, 471–2, 475, 477, 482–4, 491
PDF 127–9, 149, 162–3, 220, 280, 323, 328, 330, 351, 353–7, 359–63, 365, 417, 419
peak detection 89, 102–4, 127
Pentaho 475–8, 480
Peregrine 332, 349
Perl 12, 17–18, 43, 113, 156, 158, 204, 206, 217, 233, 253, 255, 257, 260, 270, 289, 330, 401, 410, 417, 458, 463, 490, 503–4
pharmacokinetics 223
Pharmamatrix 395
php 78, 80–1, 86–7, 96, 126–7, 129, 171, 253–4, 258, 260, 285, 290–1, 297, 303, 382, 389, 420, 521
Phusion Passenger 272, 283
Picard 252, 260
Pipeline Pilot 334, 339, 349
Pistoia Alliance 34, 176, 187, 505, 511–14, 517, 520–1
Portland Press 363
PostgreSQL 466
pre-competitive 3, 7, 84, 187, 234–5, 408, 482, 505, 510–11, 520–1
PredTox 173, 181–3, 186, 188
Prefuse 334, 342, 344, 346, 348–9
Project Prospect 64–5, 84–5
proteomics 90, 95, 121, 126, 128, 183, 187–8, 221, 326
provenance 174, 362, 383, 385, 418
PubChem 68, 74, 85, 422, 426–7, 429
publisher 8, 63–5, 67–8, 353–5, 362–4
PubMed 70–1, 86, 331, 346, 365, 372–3, 395, 431, 433, 435
Python 113, 138, 158, 257, 278, 285, 289, 291–2, 295–7, 317, 330, 349
Quantitative Structure-Activity Relationship (QSAR) 36, 39–40, 45, 58, 60–1, 297, 429
Quick Response Codes 140–1, 148–9
R - scripting and statistics language
  R based 129, 228, 255, 257, 278
  R BioConductor 95, 97, 98, 127, 128, 187, 190, 193, 215, 216, 217, 219, 255, 261, 490, 504
  R code 115, 120
  R development 44, 65, 67, 84, 127, 156, 170, 192, 237, 253, 255, 282, 425, 487
  R environment 11, 39, 40, 209
  R foundation 127, 490, 504
  R GUI 129
  R language 89, 95, 115, 138, 142, 255, 257, 291, 296, 317, 477
  R Java 232, 477
  R journal 129, 355
  R Knime 113, 114, 117
  R method 94, 115, 116, 141
  R node 89, 114, 115, 118, 119, 120, 122, 152, 153, 154, 159, 161, 163, 168, 169, 170, 242, 468, 469
  R package 95, 97, 102, 112, 117, 128, 129, 149, 180, 190, 215, 282
  R project 7, 27, 33, 65, 224, 253, 258, 260, 287, 292, 293, 297, 298, 342, 344, 393, 396, 410, 509
  R programming 255, 334, 477
  R regression 128
  R script 89, 117, 228
  R serve 80, 242, 243, 247, 268, 272, 278, 290, 333, 374, 425, 427, 476, 477
  R snippet 113, 115, 118, 119, 120
  R Studio 95, 98, 253, 260
  R statistics / statistical 39, 89, 95, 107, 127, 138, 158, 190, 205, 229, 258, 341, 504
  R suite 296, 298, 477
  R sweave 125
Rattle 125, 129
RDKit 32, 34, 158, 162–3, 171, 490, 504
REACH regulations 35
Really Simple Syndication (RSS) 280, 309, 315, 346, 373, 381, 404, 435, 443, 448–9
Red Hat 4, 7, 17, 33, 157, 232, 290, 296–8, 484, 510
regulatory issues 7, 125, 481–8, 491, 493, 495, 497, 499–500, 504
  software validation 125, 483, 497, 504
  regulatory agencies 484
  GAMP 481, 483–7, 491, 495, 499–500, 504
  regulated environment 7, 125
  Eudralex 483, 488, 493, 499, 504
Representational State Transfer (REST) 18, 60, 66, 71, 226, 334, 472–3, 478
Riak 471–3
Ruby-on-Rails 17–18, 23, 33–4, 138, 271–2, 283
SADI framework 59, 61
Sainsbury Laboratory 265, 283
SAM format 169, 189, 195–6, 204, 220
Samba 241, 259, 283
SAMtools 220, 252, 260, 283
SAVANT 273, 283
scalability 15–16, 265–6, 281, 291, 303, 305, 309, 313–14, 328–9, 382, 432, 441, 444, 454, 466–8, 471, 473, 499, 512
schema 117, 183, 278, 281, 283, 292, 329, 339, 396, 422, 427, 440, 465–8, 471, 473
SciBorg 64, 66, 85
Scuttle 309–12, 321, 323
Security 15, 25, 37, 124–5, 237, 250, 304, 348, 363, 380–1, 425, 455, 476, 496, 500, 505, 510, 512–13, 515–20
Semantic Forms 371, 376, 379, 413, 420
Semantic Link Association Prediction 343, 431
Semantic MediaWiki 6, 30, 59, 81, 314, 367, 369–71, 373–9, 381–9, 391, 393–5, 397, 401, 407, 409, 411, 413, 415, 417–20
semantic search 389, 423, 435, 442, 444, 446, 450–1
Sequence Services 34, 511–16
SharePoint 305, 320, 323, 380–1, 389, 420, 443–4
Simile 334, 430
Simple Object Access Protocol (SOAP) 56, 71, 381, 446, 476
Single Nucleotide Polymorphism 98–9, 190, 200, 251, 253, 266, 279, 281, 389
SMILES 43, 45–8, 52–3, 160–1, 290–1, 427
Snoopy 81, 87
social bookmarking 299–300, 307, 309–12, 317, 323, 371, 380
social networking 6, 299–300, 309, 311, 388, 407, 418, 431, 482
software as a service 485, 487
Sourceforge 60, 85–7, 220, 252, 260, 283, 323, 420, 430, 433, 479, 504, 521
SQL server 66–70, 72, 251, 260
SQLite 69, 277–8, 283
Stallman 1, 133
statistical clustering 108, 112, 116, 120, 328, 385, 432, 444, 469, 472, 474
Status.Net 304–9, 321, 323
storage 7, 10, 21, 30, 90, 121, 180, 222, 241, 246–7, 266–8, 270, 277, 282, 295, 361, 381–2, 423–5, 436, 440, 445, 453, 465–8, 471–3, 496, 499, 507–8
support vector machine 116, 152
sustainability 156, 176, 185, 364
Symyx 11, 17, 21, 31–2
systems chemical biology 421, 424, 433
Tabix 252, 260
tagging 183, 311, 323, 325, 329, 332–3, 346, 401, 404, 406, 413, 417
Targetpedia 391, 394–8, 404, 406–7, 410, 412, 418–19, 450
Taverna 58, 61, 282–3, 339, 490, 504
Text Mining 59, 64, 67, 175, 353, 355, 357, 359, 363, 365, 395
Third Dimension Explorer 385–6
Tomcat 232, 253, 260, 328
toxicology 5, 10, 35–7, 39–41, 43–5, 47, 49, 51–61, 126, 181, 187–8, 222–3, 379, 421, 424, 426, 429, 431
TripleMap 6, 59, 423, 435, 437, 439, 441–6, 448–51
Tungsten Enterprise 295
Twitter 300, 304–5, 323, 325, 448, 452, 474, 480
Unified Medical Language System (UMLS) 463–5, 468, 473, 477–8, 480
URL shortener 305, 380
US 21 CFR Part 11 479, 482, 490
Utopia documents 6, 351, 353–5, 357, 359, 361, 363–5
Variant Effect Predictor 252
VCFTools 252, 260
virtual machine 133, 304, 322
virtual server 296, 298
Web service 21, 43, 60, 76, 80, 254, 273–5, 283, 291, 309, 339, 353, 381, 408, 422, 425, 469
Web2.0 289, 299–300, 303, 310, 319, 322
Weka 158, 258, 476–7, 480
Wellcome Trust
widget 21, 183, 430
Wikipedia 1, 32–3, 41, 78, 80, 87, 126, 149, 248, 260, 282–3, 294, 300, 313, 315, 318, 323, 355, 367–9, 388, 392–6, 401, 419, 433, 480, 506, 521
WordPress 505
World Wide Web Consortium 180, 187, 439, 441, 451
XCMS 97–8, 100–1, 127, 129
XML Database 467
XPath/XQuery 467
xrite 148
Yammer 304, 323

... free /open source software in industry 505 Simon Thornber 22.1 505 22.2 Background 506 22.3 Open source innovation 508 22.4 Open source software in the pharmaceutical industry 510 22.5 Open source. .. engineering: An introductory engineering and life science approach K G Clarke 55 Quality assurance problem solving and training strategies for success in the pharmaceutical and life science industries...
Woodhead Publishing Limited, 2012
Woodhead Publishing Series in Biomedicine: Number 16
Open source software in life science research
Practical solutions in the pharmaceutical industry and beyond