METHODS IN MEDICAL INFORMATICS
Fundamentals of Healthcare Programming in Perl, Python, and Ruby

CHAPMAN & HALL/CRC
Mathematical and Computational Biology Series

Aims and scope:
This series aims to capture new developments and summarize what is known over the entire spectrum of mathematical and computational biology and medicine. It seeks to encourage the integration of mathematical, statistical, and computational methods into biology by publishing a broad range of textbooks, reference works, and handbooks. The titles included in the series are meant to appeal to students, researchers, and professionals in the mathematical, statistical and computational sciences, fundamental biology and bioengineering, as well as interdisciplinary researchers involved in the field. The inclusion of concrete examples and applications, and programming techniques and examples, is highly encouraged.

Series Editors
N. F. Britton, Department of Mathematical Sciences, University of Bath
Xihong Lin, Department of Biostatistics, Harvard University
Hershel M. Safer
Maria Victoria Schneider, European Bioinformatics Institute
Mona Singh, Department of Computer Science, Princeton University
Anna Tramontano, Department of Biochemical Sciences, University of Rome La Sapienza

Proposals for the series should be submitted to one of the series editors above or directly to:
CRC Press, Taylor & Francis Group
4th Floor, Albert House
1-4 Singer Street
London EC2A 4BQ
UK

Jules J. Berman

Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2011 by Taylor and Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed in the United States of America on acid-free paper

International Standard Book Number: 978-1-4398-4182-2 (Hardback)

This book contains information obtained from
authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Berman, Jules J.
Methods in medical informatics : fundamentals of healthcare programming in Perl, Python, and Ruby / Jules J. Berman.
p. ; cm. (Chapman & Hall/CRC mathematical and computational biology series ; 39)
Includes bibliographical references and index.
ISBN 978-1-4398-4182-2 (alk. paper)
Medical informatics -- Methodology. Medicine -- Data processing. I. Title. II. Series: Chapman and
Hall/CRC mathematical & computational biology series ; 39.
[DNLM: Medical Informatics -- methods. Programming Languages. Computing Methodologies. W 26.5 B516m 2011]
R858.B4719 2011
610.285 dc22 2010011244

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com

For Irene

Contents

Preface
Nota Bene
About the Author

Part I Fundamental Algorithms and Methods of Medical Informatics

Chapter 1 Parsing and Transforming Text Files
  1.1 Peeking into Large Files
    1.1.1 Script Algorithm
    1.1.2 Analysis
  1.2 Paging through Large Text Files
    1.2.1 Script Algorithm
    1.2.2 Analysis
  1.3 Extracting Lines that Match a Regular Expression
    1.3.1 Script Algorithm
    1.3.2 Analysis
  1.4 Changing Every File in a Subdirectory
    1.4.1 Script Algorithm
    1.4.2 Analysis
  1.5 Counting the Words in a File
    1.5.1 Script Algorithm
    1.5.2 Analysis
  1.6 Making a Word List with Occurrence Tally
    1.6.1 Script Algorithm
    1.6.2 Analysis
  1.7 Using Printf Formatting Style
    1.7.1 Script Algorithm
    1.7.2 Analysis

Chapter 2 Utility Scripts
  2.1 Random Numbers
    2.1.1 Script Algorithm
    2.1.2 Analysis
  2.2 Converting Non-ASCII to Base64 ASCII
    2.2.1 Script Algorithm
    2.2.2 Analysis
  2.3 Creating a Universally Unique Identifier
    2.3.1 Script Algorithm
    2.3.2 Analysis
  2.4 Splitting Text into Sentences
    2.4.1 Script Algorithm
    2.4.2 Analysis
  2.5 One-Way Hash on a Name
    2.5.1 Script Algorithm
    2.5.2 Analysis
  2.6 One-Way Hash on a File
    2.6.1 Script Algorithm
    2.6.2 Analysis
  2.7 A Prime Number Generator
    2.7.1 Script Algorithm
    2.7.2 Analysis

Chapter 3 Viewing and Modifying Images
  3.1 Viewing a JPEG Image
    3.1.1 Script Algorithm
    3.1.2 Analysis
  3.2 Converting between Image Formats
    3.2.1 Script Algorithm
    3.2.2 Analysis
  3.3 Batch Conversions
    3.3.1 Script Algorithm
    3.3.2 Analysis
  3.4 Drawing a Graph from List Data
    3.4.1 Script Algorithm
    3.4.2 Analysis
  3.5 Drawing an Image Mashup
    3.5.1 Script Algorithm
    3.5.2 Analysis

Chapter 4 Indexing Text
  4.1 ZIPF Distribution of a Text File
    4.1.1 Script Algorithm
    4.1.2 Analysis
  4.2 Preparing a Concordance
    4.2.1 Script Algorithm
    4.2.2 Analysis
  4.3 Extracting Phrases
    4.3.1 Script Algorithm
    4.3.2 Analysis
  4.4 Preparing an Index
    4.4.1 Script Algorithm
    4.4.2 Analysis
  4.5 Comparing Texts Using Similarity Scores
    4.5.1 Script Algorithm
    4.5.2 Analysis

Part II Medical Data Resources

Chapter 5 The National Library of Medicine's Medical Subject Headings (MeSH)
  5.1 Determining the Hierarchical Lineage for MeSH Terms
    5.1.1 Script Algorithm
    5.1.2 Analysis
  5.2 Creating a MeSH Database
    5.2.1 Script Algorithm
    5.2.2 Analysis
  5.3 Reading the MeSH Database
    5.3.1 Script Algorithm
    5.3.2 Analysis
  5.4 Creating an SQLite Database for MeSH
    5.4.1 Script Algorithm
    5.4.2 Analysis
  5.5 Reading the SQLite MeSH Database
    5.5.1 Script Algorithm
    5.5.2 Analysis

Chapter 6 The International Classification of Diseases
  6.1 Creating the ICD Dictionary
    6.1.1 Script Algorithm
    6.1.2 Analysis
  6.2 Building the ICD-O (Oncology) Dictionary
    6.2.1 Script Algorithm
    6.2.2 Analysis

Chapter 7 SEER: The Cancer Surveillance, Epidemiology, and End Results Program
  7.1 Parsing the SEER Data Files
    7.1.1 Script Algorithm
    7.1.2 Analysis
  7.2 Finding the Occurrences of All Cancers in the SEER Data Files
    7.2.1 Script Algorithm
    7.2.2 Analysis
  7.3 Finding the Age Distributions of the Cancers in the SEER Data Files
    7.3.1 Script Algorithm
    7.3.2 Analysis

Chapter 8 OMIM: The Online Mendelian Inheritance in Man
  8.1 Collecting the OMIM Entry Terms
    8.1.1 Script Algorithm
    8.1.2 Analysis
  8.2 Finding Inherited Cancer Conditions
    8.2.1 Script Algorithm
    8.2.2 Analysis