SUPPORTING ON-THE-FLY DATA INTEGRATION FOR
BIOINFORMATICS
DISSERTATION
Presented in Partial Fulfillment of the Requirements for
the Degree Doctor of Philosophy in the Graduate School of The Ohio State University
By
Xuan Zhang, M.S.
* * * * * The Ohio State University
Science
UMI Number: 3246116
3246116 2007
UMI Microform Copyright
All rights reserved This microform edition is protected against unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company
300 North Zeeb Road P.O Box 1346 Ann Arbor, MI 48106-1346
by ProQuest Information and Learning Company
Xuan Zhang
2007
The use of computational tools and on-line data knowledgebases has changed the way biologists conduct their research. The fusion of biology and information science is expected to continue. Data integration is one of the challenges faced by bioinformatics. In order to build an integration system for modern biological research, three problems have to be solved. First, a large number of existing data sources have to be incorporated, and when new data sources are discovered, they should be utilized right away. Second, the variety of biological data formats and access methods has to be addressed. Finally, the system has to be able to understand the rich and often fuzzy semantics of biological data.
Motivated by the above challenges, a system and a set of tools have been implemented to support on-the-fly integration of biological data. Metadata about the underlying data sources is the backbone of the system. Data mining tools have been developed to help users write the descriptors semi-automatically. Using an automatic code generation approach, we have developed several tools for bioinformatics integration needs. An automatic data wrapper generation tool is able to transform data between heterogeneous data sources. Another code generation system can create programs to answer projection, selection, cross product and join queries on flat-file data.
Real bioinformatics requests have been used to test our system and tools. These case studies show that our approach can reduce the human effort involved in an information integration system. Specifically, it makes the following contributions: 1) Data mining tools allow new data sources to be understood with ease and integrated into the system on-the-fly. 2) Changes in data format are localized by using the metadata descriptors, so the system maintenance cost is low. 3) Users interact with our system through high-level declarative interfaces, which reduces programming effort. 4) Our tools process data directly from flat files and require no database support; data parsing and processing are done implicitly. 5) Request analysis and request execution are separated, so our tools can be used in a data grid environment.
This is dedicated to the ones I love. To my parents, who believe in women in engineering. To my husband, who never stops criticizing. And to my daughter, whose smile is the best reward in the world.
I would like to express my deepest gratitude to my advisor, Professor Gagan Agrawal. He has been a great mentor and a wonderful colleague to me. I am so fortunate to have had the opportunity to learn from him not only how to conduct research but also how to be a better person.
I also want to sincerely thank Professor Hakan Ferhatosmanoglu and Professor Yusu Wang for serving on my dissertation committee.
2003-present Graduate Research Associate,
Ohio State University
PUBLICATIONS
Xuan Zhang, Ruoming Jin, Gagan Agrawal. “Assigning Schema Labels Using Ontology And Heuristics”. In Proceedings of IEEE Symposium on Bioinformatics and Bioengineering (BIBE’06), October 2006.
Xuan Zhang, Gagan Agrawal. “A Tool for Supporting Integration Across Multiple Flat-File Datasets”. In Proceedings of IEEE Symposium on Bioinformatics and Bioengineering (BIBE’06), October 2006.
Xuan Zhang, Gagan Agrawal. “Enabling Information Integration and Workflows in a Grid Environment with Automatic Wrapper Generation”. In Proceedings of IEEE/ACM International Workshop on Grid Computing (GRID2005), November 2005.
Kaushik Sinha, Xuan Zhang, Ruoming Jin, Gagan Agrawal. “Using data mining techniques to learn layouts of flat-file biological datasets”. In Proceedings of IEEE Symposium on Bioinformatics and Bioengineering (BIBE’05), October 2005.
Kaushik Sinha, Xuan Zhang, Ruoming Jin, Gagan Agrawal. “Learning layouts of biological datasets semi-automatically”. In Proceedings of International Workshop
Xuan Zhang, Xiaoyang Gao, Gagan Agrawal. “Integrated Retrieval from Biological Databases Using an SQL Extension”. In Proceedings of Workshop on Bioinformatics and Computational Biology (BCB2003), December 2003.
Leonid Glimcher, Xuan Zhang, and Gagan Agrawal. “Scaling and Parallelizing a Scientific Feature Mining Application Using a Cluster Middleware”. In Proceedings of International Parallel and Distributed Processing Symposium (IPDPS2004), April 2004.
FIELDS OF STUDY
Major Field: Computer Science and Engineering
Studies in Bioinformatics Integration System: Prof. Gagan Agrawal
TABLE OF CONTENTS
Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures
Chapters:
1 Introduction
1.1 Motivation
1.2 Our Approach
1.3 Advantages
1.4 Thesis Organization
2 Literature Review
2.1 Biological Information Integration Systems
2.2 Grid Projects on Bioinformatics
2.3 Metadata Description
2.4 Wrappers
2.5 Biological Query
2.6 Semantic and Ontology
3 Schema Mining
3.1 Overall Context, Challenges, and System Overview
3.1.1 Challenges in Schema Mining
3.1.2 Summary of the Steps
3.2 Algorithm
3.2.1 Data Cleaning and Summarization
3.2.2 Scoring Function
3.2.3 Mining with Ontology
3.2.4 Mining with Heuristics
3.2.5 Ontology Database
3.3 Experimental Results
3.3.1 Score Clustering
3.3.2 Statistical Evaluation
3.3.3 Discussion
3.4 Summary
4 Automatic Wrapper Generation
4.1 System Overview
4.2 Technical Issues and Challenges
4.3 Metadata Description Language
4.4 System Implementation and Key Algorithms
4.4.1 Wrapper Generation System
4.4.2 Wrappers
4.5 Case Studies and Experimental Results
4.5.1 TRANSFAC-to-Reference
4.5.2 SWISSPROT-to-FASTA
4.6 Summary
5 Query Multiple Flat-File Datasets
5.1 Challenges and Our Approach
5.2 System Overview
5.3 Query Language
5.4 Query Analysis
5.4.1 Descriptor Parser
5.4.2 Application Analyzer
5.5 Query Execution
5.6 Experiments
5.6.1 POST-BLAST QUERY
5.6.2 CHIP-SUPPLEMENT QUERY
5.6.3 OMIM-PLUS
5.7 Summary
6 Query Flat-File Datasets Using Indices
6.1 Challenges and System Overview
6.1.1 Indexing Biological Data
6.2 Algorithms and System Implementation
6.2.1 Languages
6.2.2 Query Analysis
6.2.3 Query Execution: The Query-Proc Program
6.3 Experimental Results
6.3.1 General Database Search with Index
6.3.2 Similarity Search on Sequence Databases
6.4 Summary
7 Case Studies
7.1 System Overview
7.2 Case Study I: Gene Name Nomenclature
7.2.1 Case Study I.I: Nomenclature Across Species
7.2.2 Case Study I.II: Nomenclature Over Time
7.3 Case Study II: Correlation Between Gene’s Function and Location
7.4 Summary
8 Future Work
8.1 Understandability and Usability
8.2 Efficiency
8.2.1 Caching Data
8.2.2 Caching Responses
8.3 Functionality
8.3.1 Ontology for bioinformatics tools
8.3.2 Reason about workflows
9 Conclusion
Bibliography
LIST OF TABLES
3.1 Profile Table for Token Categorization
3.2 Schema Mining Algorithm Evaluation
4.1 WRAPINFO data structure for the TRANSFAC-to-Reference Example
7.1 Summary of Databases
7.2 Usage of Registered Gene Names
7.3 Usage of Gene Names in Other Communities
7.4 Summary of Major Cellular Component and Molecular Function GO Terms
LIST OF FIGURES
3.1 Overview of Metadata Learning for Biological Data
3.2 General Function for Schema Mining Score Calculation
3.3 Score Calculation with Heuristics
3.4 Results Evaluation
3.5 Pseudo-code of Approximate Frequent Token Mining Algorithm
3.6 Score Calculation with Ontology
3.7 Results of Attribute Labelling with Ontology
3.8 Results of Attribute Labelling with Heuristics
4.1 Overview of the Wrapper Generation System
4.2 The Descriptor for the Reference Table in the TRANSFAC-to-Reference Example
4.3 Automatic Generated Schema Mapping File for the TRANSFAC-to-Reference Example
4.4 Logical View of TRANSFAC Data Layout as a Tree
4.5 Overview of the Wrapper
4.6 The Algorithm for DataReader of Wrapper
4.7 The Algorithm for DataWriter of Wrapper
4.8 Results from TRANSFAC-to-Reference Problem
4.9 Results from SWISSPROT-to-FASTA Problem
4.10 The Descriptor for TRANSFAC in the TRANSFAC-to-Reference Example
5.1 Overview of the System
5.2 Query for POST-BLAST example
5.3 Types of Query Specified with Query Language
5.4 Internal Representation of the metadata for BLASTP
5.5 QUERYINFOR for POST-BLAST Example
5.6 Value Buffer for POST-BLAST Example
5.7 Performance on POST-BLAST Example
5.8 Performance on CHIP-SUPPLEMENT Example
5.9 Algorithm for the Synchronizer of query-proc
6.1 Overview of the Query System Using Indices
6.2 Query Example
6.3 The Metadata Descriptor for Yeast Genome
6.4 QUERYINFOR for Example Yeast Genome Query
6.5 Performance of Answering BLAST-ENHANCE Query
6.6 Performance of CYGD Similarity Search Using Singh’s Algorithm
6.7 Performance of GENBANK Similarity Search Using Ferhatosmanoglu’s Algorithm
6.8 Algorithm of Example Indexing Functions for Yeast Genome IDs
6.9 The Algorithm for the Synchronizer Using Indices
7.1 Overview of the On-the-Fly Biological Data Integration System and Tools
7.2 The Metadata Descriptor for dictyBase
7.3 Performance of Entry Selection by Species
7.4 Trends of Nomenclature Between Swiss-Prot and Genome Databases
7.5 Performance of Historical Analysis
7.6 Correlation Analysis Workflow
7.7 Correlation Between Cellular Component and Molecular Functions
7.8 The Modification of Descriptor When Swiss-Prot Format Changes
7.9 Identification of Gene Name Attributes Using Schema Labelling Tool
CHAPTER 1
INTRODUCTION
In this dissertation, a framework and a set of tools are proposed and implemented for the on-the-fly integration of biological data. They can minimize the human involvement in integrating new resources and reduce the maintenance cost when participating autonomous data resources are updated. Our approaches are mainly based on data mining and code generation.
1.1 Motivation
Biologists today spend a large amount of time and effort querying multiple remote or local data sources, running data analysis programs and interpreting the results. As a result, integration has become an important phase in the biology research process. Integration allows biologists to combine knowledge from multiple disciplines [56, 110, 47, 88, 53] and has become a critical issue in biological research in recent years. However, the explosion of biological data and computation resources has made manual integration no longer feasible.
First, the quantity of biological data is overwhelming. In August 2005, the INSDC announced that the DNA sequence database exceeded 100 gigabases [13]. GenBank (please see http://www.ncbi.nlm.nih.gov/Genbank/) statistics showed that it contained 65,369,091,950 bases in 61,132,599 sequence records in its traditional divisions as of August 2006 [14]. New biological data is being produced at a phenomenal rate. It has been reported that, on average, biological databases grow exponentially and double in size about every 15 months [12]. The number of data repositories is increasing, too. Manually tracking all the data resources is infeasible.
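The doubling rate quoted above can be turned into concrete numbers. The following is a minimal sketch, assuming idealized exponential growth with a fixed 15-month doubling period; the function names are hypothetical and are not part of the system described in this thesis. Under this assumption a database grows by a factor of roughly 1.74 per year and needs about 50 months to grow tenfold.

```python
import math


def projected_size(initial_size: float, months: float,
                   doubling_months: float = 15.0) -> float:
    """Size after `months`, assuming the size doubles every `doubling_months`."""
    return initial_size * 2 ** (months / doubling_months)


def months_to_grow(factor: float, doubling_months: float = 15.0) -> float:
    """Months needed for the size to grow by `factor` under the same assumption."""
    return doubling_months * math.log2(factor)


if __name__ == "__main__":
    # Yearly growth factor and time to a tenfold increase.
    print(round(projected_size(1.0, 12), 2))
    print(round(months_to_grow(10), 1))
```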
Second, the interoperability between these biological services is poor. These data resources are usually developed autonomously and may represent the same kind of information heterogeneously. They are represented in a variety of formats, and may be organized in flat files or in relational or object-oriented databases. One main reason for the variety of data representations is that biological concepts are usually complex and the data are semi-structured. Another reason is that there is little collaboration between different data authorities, and therefore there are few constraints when designing data representation formats. Unlike data in classic database systems, biological data is usually accessible through user-friendly web interfaces and downloadable files. For example, a biologist using microarray technology to uncover the genetic basis of a disease needs to go through the following steps: 1) mapping the site of a reactive spot in the microarray output to its gene sequence, 2) comparing the sequence to known sequences to find protein or DNA homologues, 3) mining information about these homologues, and 4) annotating the unknown sequence with information from the mined sources. The whole process involves querying multiple distributed databases, including sequence databases such as SWISSPROT, annotation databases such as GeneCards and literature databases such as PubMed. These databases communicate their query results differently. Their formats range from ASN.1 format for SWISSPROT and loosely structured HTML format for GeneCards to structured XML format for PubMed. This microarray research process also involves computational tools, such as BLAST, that require their inputs in particular formats. The heterogeneity between the data layouts prevents the biologist from carrying out the workflow directly. For example, the biologist cannot run a BLAST search on SWISSPROT directly, because the BLAST program asks for sequences to be stored in FASTA format, whereas SWISSPROT data are stored in a different and much more complicated form.
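To make the format gap concrete, the sketch below shows the kind of hand-written conversion that the SWISSPROT-to-FASTA step requires. It is only an illustration, not the wrapper produced by the tools in this thesis. It assumes the line-oriented Swiss-Prot flat-file layout (two-letter line codes such as ID, AC and SQ, indented sequence lines, and `//` as the entry terminator); the function name and the `>accession|entry-name` header layout are hypothetical choices.

```python
from typing import Iterable, Iterator


def swissprot_to_fasta(lines: Iterable[str], width: int = 60) -> Iterator[str]:
    """Stream Swiss-Prot style flat-file lines and yield FASTA lines."""
    entry_id, accession, seq_parts, in_seq = "", "", [], False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("ID "):
            entry_id = line.split()[1]
        elif line.startswith("AC ") and not accession:
            # Keep only the first (primary) accession number.
            accession = line.split()[1].rstrip(";")
        elif line.startswith("SQ "):
            in_seq = True  # sequence lines follow until the terminator
        elif line.startswith("//"):
            if seq_parts:
                sequence = "".join(seq_parts)
                yield f">{accession}|{entry_id}"
                for start in range(0, len(sequence), width):
                    yield sequence[start:start + width]
            entry_id, accession, seq_parts, in_seq = "", "", [], False
        elif in_seq:
            # Sequence lines are split into blocks separated by spaces.
            seq_parts.append(line.replace(" ", ""))


if __name__ == "__main__":
    sample = [
        "ID   TEST_HUMAN   Reviewed;   20 AA.",
        "AC   P12345; Q99999;",
        "DE   Illustrative entry, not real data.",
        "SQ   SEQUENCE   20 AA;",
        "     MKTAYIAKQR QISFVKSHFS",
        "//",
    ]
    for out in swissprot_to_fasta(sample, width=10):
        print(out)
```

Every such converter has to be rewritten when either format changes, which is the maintenance burden the metadata-driven approach aims to remove.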
Third, a variety of tools exist that assist biologists in searching, mining and analyzing biological data. Famous examples are FASTA [81], BLAST [5] and ClustalW [101]. Most of these tools are free, either through downloading of source code or through Web interfaces. They are important for many analysis workflows, and an integration system without any tools offers limited support for bioinformatics research. Several collections of computer applications are freely available to the public. Examples include the online list at Bioexplorer.Net (please see http://www.bioexplorer.net/Methods and Protocols/) and the book Bioinformatics: Methods and Protocols by Stephen Misener and Stephen A. Krawetz [72]. There are also a large number of hard-coded scripts, written in languages such as Perl, that perform specific data manipulation tasks. As the number of analysis tools increases, keeping track of updates to the existing services and enlisting all the new services cannot be achieved manually.
Some biological resources provide their services through web services and grid services. Many biological and biomedical problems, such as molecular modelling for drug design, genetic/biochemical networks and protein-protein interactions, require computationally intensive numerical operations on a large and, in many cases, distributed data domain. Grid technology provides a powerful data-sharing environment for solving these large-scale computing and data-intensive applications. Famous examples include BIRN [51], BioGrid [74] and DataGrid WP10 (please see http://edg-wp10.healthgrid.org/). Integration on top of web services or grid services poses many new challenges. These services are not designed for direct access by humans, especially by biologists with little training in programming. Their availability is constantly changing. In such an environment, it is desirable to have an integration system that can dynamically include and exclude data sources and analysis services, and a computer-driven integration approach with less human interference is more likely to succeed.
1.2 Our Approach
Motivated by the challenges of bioinformatics integration, we designed a system to increase the level of automation of the integration process. When a new data source is found, it is examined using data mining techniques. Suggestions about the layout and schema of the data are made so that the essential metadata is collected. This metadata contains all the information necessary to interpret and process a data source. Once the metadata is determined, the new data can be related to other data sets in the integration system automatically. We developed a wrapper generation tool. When the data or a subset of it is required by another component within the integration system, the necessary data transformation can be performed automatically. Similarly, query execution tools find a solution to a user’s request by analyzing the metadata about all relevant resources. All the data processing tools generate executable programs which interact with data sets directly.
The first step toward integration is to understand the dataset. This can be an expensive process, especially when sufficient documentation is not available. The current solutions usually involve human intervention, which is both slow and error-prone. We have developed a set of data mining techniques to examine a dataset and summarize information about its layout and schema. In particular, heuristics and ontology knowledge are used to assign meaningful labels to data attributes. These labels can further be used to construct the schema. The details are discussed in Chapter 3.
Joint efforts with other graduate students in our research group have resulted in a set of algorithms to learn an unknown dataset’s layout. Schema learning and layout learning together gather the metadata necessary for integration. To capture both parts of the metadata, we designed a declarative description language for biological data. The layout description is similar in flavor to the Data Format Definition Language being developed by the DFDL Working Group in the Global Grid Forum (please see http://forge.gridforum.org/projects/dfdl-wg). Such descriptions provide sufficient information for the system to understand the layout of binary or character flat files, without relying on any domain- or format-specific information. The schema description language follows the XML DTD format to capture hierarchical data structures, so semi-structured data can be represented. The layout description language is based on the common characteristics of biological flat-file datasets. The details of the metadata description language are presented in Section 4.3.
In order to solve the problem of data format disagreement, a tool was developed to generate wrappers that convert data automatically. Using the metadata information, this tool establishes a schema mapping between the input and output. A wrapper is then generated from the mapping. This wrapper is able to discover and transform information from one low-level dataset to another low-level dataset. The implementation is discussed in Chapter 4.
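The idea of deriving a wrapper from metadata rather than writing it by hand can be sketched as follows. This is a deliberately simplified illustration and not the descriptor language or generator presented in Chapter 4: the dictionary-based descriptors, the name-equality mapping rule, and the function names are all assumptions made for the example, and the "generated" wrapper is a Python closure rather than generated source code.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical, much simplified descriptors: ordered field names and a delimiter.
SOURCE = {"fields": ["accession", "gene", "organism", "length"], "delimiter": "\t"}
TARGET = {"fields": ["gene", "accession"], "delimiter": ","}


def build_mapping(source: Dict, target: Dict) -> List[int]:
    """Map each target field to its position in the source (matched by name)."""
    positions = {name: i for i, name in enumerate(source["fields"])}
    missing = [name for name in target["fields"] if name not in positions]
    if missing:
        raise ValueError(f"no source attribute for target fields: {missing}")
    return [positions[name] for name in target["fields"]]


def make_wrapper(source: Dict, target: Dict) -> Callable[[Iterable[str]], Iterator[str]]:
    """Return a wrapper that converts source records to target records."""
    mapping = build_mapping(source, target)

    def wrapper(lines: Iterable[str]) -> Iterator[str]:
        # Records are handled one at a time, so memory use stays constant.
        for line in lines:
            values = line.rstrip("\n").split(source["delimiter"])
            yield target["delimiter"].join(values[i] for i in mapping)

    return wrapper


if __name__ == "__main__":
    convert = make_wrapper(SOURCE, TARGET)
    for record in convert(["P1\tabc\tyeast\t10\n", "P2\txyz\tyeast\t20\n"]):
        print(record)
```

With this arrangement, a change in either format only requires editing the corresponding descriptor; the wrapper is rebuilt from the new mapping.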
Processing queries across multiple data resources is a central function of integration systems. For datasets that are large and infrequently used, the overhead paid for loading them into a database system may not be well justified. Moreover, when a dataset’s format changes often, extra human effort is constantly required for parsing it correctly. Many biological queries belong to this category. We developed a query processing tool targeting this scenario. It requires no database support or utility programs. It performs "lazy parsing", processing a dataset only when it is queried. Queries, written in an SQL-like language, are analyzed together with the datasets’ metadata. Executable programs are generated to process the data files, test the values, and deposit the answers in a desired format. This query processing system is presented in Chapter 5.
Indexing techniques have been widely used to improve the search performance of database systems. Many algorithms have also been developed for answering queries that cannot be simply answered with "yes" or "no". A famous example is sequence similarity search. We incorporated indexing techniques into the query processing tool. The upgraded query tool is able to answer more queries with better performance. The indexing functions are treated as plug-in modules to the query execution program and can be reused by different datasets and queries. This indexing-enhanced query tool is discussed in Chapter 6.
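The two ideas above, lazy parsing of flat files and pluggable indices, can be illustrated with a small sketch. It is an assumption-laden toy rather than the query tool of Chapters 5 and 6: records are tab-delimited lines held in a list, the query is given as Python arguments instead of the SQL-like language, and `run_query` and `build_index` are hypothetical names. The sample rows are illustrative only.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Sequence, Tuple

Record = Dict[str, str]


def parse(line: str, fields: Sequence[str], delimiter: str = "\t") -> Record:
    """Parse one flat-file line into a record; called only when a line is examined."""
    return dict(zip(fields, line.rstrip("\n").split(delimiter)))


def run_query(lines: Sequence[str], fields: Sequence[str], select: Sequence[str],
              where: Callable[[Record], bool],
              candidates: Optional[Iterable[int]] = None) -> Iterator[Tuple[str, ...]]:
    """Selection and projection; `candidates` restricts the scan to indexed lines."""
    rows = range(len(lines)) if candidates is None else candidates
    for i in rows:
        record = parse(lines[i], fields)
        if where(record):
            yield tuple(record[name] for name in select)


def build_index(lines: Sequence[str], fields: Sequence[str], key: str) -> Dict[str, List[int]]:
    """A plug-in exact-match index: attribute value -> line numbers."""
    index: Dict[str, List[int]] = {}
    for i, line in enumerate(lines):
        index.setdefault(parse(line, fields)[key], []).append(i)
    return index


if __name__ == "__main__":
    fields = ["id", "gene", "chromosome"]
    data = ["YAL001C\tTFC3\t1\n", "YAL002W\tVPS8\t1\n", "YBR001C\tNTH2\t2\n"]
    # Full scan: genes on chromosome 1.
    print(list(run_query(data, fields, ["gene"], lambda r: r["chromosome"] == "1")))
    # Indexed lookup: only the matching line is parsed.
    by_id = build_index(data, fields, "id")
    print(list(run_query(data, fields, ["gene"], lambda r: True, by_id.get("YBR001C", []))))
```

Because the index is just a function from a key to candidate line numbers, a different index (for example, one supporting similarity search) could be swapped in without changing `run_query`.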
1.3 Advantages
2. The metadata description language is general enough that it can represent most of the biological data in flat files.
3. No database support is assumed for either the source or the target. Data sets are treated as flat files and computational tools are treated as black boxes.
4. Human involvement is limited. Resources can be discovered on-the-fly, and their metadata can be learned. Integration can be achieved semi-automatically.
5. The integration system scales well with respect to the number of data resources managed. For each resource, only one descriptor needs to be written, irrespective of any other resources it may be integrated with. Changes are localized within metadata descriptors. The maintenance costs are low.
6. Our approach can efficiently transform large volumes of data between a large number of sources even when the data is out-of-core. Values are processed in a streaming fashion. Linear performance growth is guaranteed.
In Chapter 7, we use two comprehensive case studies to demonstrate our tools. The first example examines gene name nomenclature in the biology research community. The second study concerns the correlation between genes’ sub-cellular locations and their products’ functions. Both case studies involve several heterogeneous flat-file datasets. They show that our tools can be used in various integration applications with reasonable performance.
1.4 Thesis Organization
The thesis is organized as follows. I first review the related work in Chapter 2. The data mining algorithms for schema learning are discussed in Chapter 3. The wrapper generation system is presented in Chapter 4. In the same chapter, the metadata description language is also introduced. The tool used to answer database-like queries and its enhancement with indexing ability are presented in Chapters 5 and 6, respectively. Chapter 7 summarizes the system and tools using two case studies. Finally, Chapter 8 proposes future work and Chapter 9 concludes the thesis.
CHAPTER 2
LITERATURE REVIEW
In this chapter, we summarize related work in the areas related to biological data integration.
2.1 Biological Information Integration Systems
Information or data integration has been widely studied for more than two decades. One of the early approaches to information integration was the use of federated databases [89]. The use of mediators has been another dominant approach, and it has been taken by projects like TSIMMIS [46] and InfoHarness [91].
Integration of data sources has become a critical issue in biological research in recent years, as it allows biologists to combine knowledge from multiple disciplines [56, 88, 53]. Several systems exist that support a unified query interface over heterogeneous data sources. The Sequence Retrieval System (SRS) [1] is one of the earliest bioinformatics integration systems. It follows a federation approach, and the underlying databanks are kept in their original formats. It integrates the databanks through metadata and supports queries via a Web interface and application programming interfaces (APIs). The metadata is written in Icarus, an internal programming language, and can be edited so that SRS can be customized. SRS is a flat-file-based system that aims to remain independent of the technology used for data storage. A token server is fed with rules in order to extract data fields within a data file. The Kleisli system [110] is a mediator system with a complex object data model and a high-level query language, sSQL. A local data warehouse can be created with Kleisli. It also has a query optimizer to evaluate various query plans. In order to access the various underlying data sources, a large number of wrappers are hand-written. The Transparent Access to Multiple Bioinformatics Sources [47], also known as TAMBIS, is another famous bioinformatics integration system. An ontology is the core of the TAMBIS system. Users form their requests in terms of retrieving instances described by concepts. The TAMBIS ontology is supplied as a software component that replies to questions posed by other components. Other famous bioinformatics integration systems include K2 [32], a successor system to Kleisli, the P/FDM Mediator and DiscoveryLink [53].
There are also efforts [77, 4, 70, 37, 87, 52] on developing workflow management systems to help biologists set up complex analysis processes on their data. The analyses are captured as "scientific workflows" in which data flow from one analytical step to another under certain control.
2.2 Grid Projects on Bioinformatics
Analysis of large and/or geographically distributed scientific datasets [24] has emerged as a key component of grid computing. Many projects have been working on technologies to support data grids. The most popular directions have been replica services [20, 25], reliable and predictable data transfers [3, 105], and constructing workflows [2, 34]. Metadata cataloging and metadata services have also received much attention lately [35, 92].
The myGrid project has been developing technologies for integrating a variety of services on the web through the use of a web service composition language [111]. IBM has been developing the Bioinformatic Workflow Builder Interface (BioWBI) for creating web-service-based workflows for biological researchers. These two efforts typically require: 1) use of XML for exchange of data between different sources, and 2) Java wrappers around existing applications. Both requirements can introduce overheads.
In a data grid, data from multiple sources may have to be analyzed using a variety of analysis tools or services. This can introduce a number of challenges. In recent years, several research groups have initiated work addressing some of these challenges. For example, Chimera is a system for supporting virtual data views and demand-driven data derivation [43, 116]. Similarly, CoDIMS-G is a system providing grid services for data and program integration [41]. A number of projects have focused on scientific workflows. The workflow management research group under the Global Grid Forum’s Scheduling and Resource Management area is active in this field, and has compiled a list of existing projects.
2.3 Metadata Description
Almost all existing bioinformatics integration systems employ some standard to describe the underlying data sources. For example, SRS [40] describes each data source using Icarus. The information captured includes the type and structure of the data, relationships to other data sources, the indexing and presentation methods, and the mapping to external object models. These metadata languages are closely interleaved with other components of the integration system. BinX and Binary Format Description (BFD) [11] are metadata descriptors that give a machine-readable view of a binary file. Metadata cataloging and metadata services have received much attention lately [35, 92]. Their focus is on mechanisms for storing, discovering, and accessing metadata. Metadata catalogs have been used by the Artemis project [104] for supporting uniform access to heterogeneous scientific data sources in a Grid environment. Though the focus is not specifically on biological data, the approach is quite similar to that of the mediator-based bioinformatics systems.
2.4 Wrappers
One of the important challenges in data integration is that the data formats or layouts used by different data sources and expected by different data analysis tools can vary significantly. The common way of addressing this problem has been through hand-written wrappers. The function of a wrapper program is to transform the data from one source into a format that is understandable by another. Hand-written wrappers are the traditional approach taken by biologists. For example, SEQIO [86] is a package for reading and writing biological sequences from and to fifteen well-known formats, and it has a file conversion program called fmtseq.
In the mediator-based integration systems, as reviewed in [56], the DBMS and mediators submit queries to the wrappers, which translate them into local queries for the data sources. These wrappers are also responsible for retrieving the data from the sources and translating it into the common integrated data representation. In many of the existing systems, wrappers are manually written. The Kleisli Query System [110] provides each data resource with a specific wrapper. These wrappers give the Kleisli Query Engine a unified object view of the data. The query plan of the TAMBIS system [47] is written in the Collection Programming Language (CPL) [19], which is supplied with a library of wrapper services. Biomediator [88] relies on wrappers to convert all data from the various sources to XML format before further processing. In these systems, the wrappers are hard-wired and impose a constraint on the systems’ scalability. DiscoveryLink [53] allows its users to define their own wrappers and re-configure the system through a registration process at a relatively higher level. Yet, the wrappers still have to be hand-coded.
Besides these mediator-based systems, the Genomics Unified Schema (GUS) uses the data warehouse approach [33]. Knowledge-based Integration of Neuroscience Data (KIND) combines wrappers for each source with ontologies [48]. Both rely on manually written wrappers. BACIIS [71] is the only federated biological database system that we are aware of that is able to automatically derive extraction rules and store them in the source wrappers. However, the data source schema files used by BACIIS can only describe HTML pages, and each individual schema is mapped to a common domain ontology contained in BACIIS.
In a scientific workflow, whenever the data generated by the previous step does not meet the input requirements of the next step, a wrapper must be introduced as an intermediate step to prepare the data. Hand-written wrappers are used in workflow management systems, too. In the Kepler project [4], a suite of data transformation actors (XQuery, Perl, etc.) is included for the purpose of linking semantically compatible but syntactically incompatible web services together. Users are responsible for providing wrappers written with these actors. Pegasys is a system for creating workflows for the analysis of sequence data [87]. The system is based on a specialized data structure for sequence data and is restricted to a fixed number of analysis packages. Wrappers are included as tools for users to choose from.
The issues of format conversion for flat files in a grid environment have not received much attention. In research on integration services for the Grid, such as CoDIMS-G [41], wrappers are inserted between the control component and the database management systems. Hand-written wrappers are used in the testbeds. IBM’s CLIO project uses database schemas to derive the transformations needed to integrate the datasets [82]. This work assumes that data is stored in databases with well-defined query interfaces, and is therefore not applicable to flat files. The support for external tables in Oracle’s recent implementation allows tables stored in flat files to be accessed from a database (see www.dbasupport.com/oracle/ora9i/External Tables9i.shtml). The data must be stored in a table format, or an access driver must be written. The system cannot support semi-structured data, which is common in bioinformatics.
Automatic wrapper generation has been an active research topic. Currently, however, most automatic wrapper generation research has focused on extracting information from tabular structures in HTML files into database systems [8, 84, 23, 45]. As web services become more and more popular, these efforts could have a significant impact. ROADRUNNER [29] generates the record layout structure by comparing HTML pages. Data fields are annotated by the user after this inference process. Heuristics about HTML pages are crucial to ROADRUNNER. For example, it relies on tags to tell a field name from a field instance, and on the presence of closing tags to distinguish optional and repeating patterns. These features make it hard to extend this approach to data files other than HTML files. Arasu et al. have proposed an approach [7] in which no heuristics about HTML are used. However, multiple pages generated from the same template must be collected for template construction. This, although useful for web-service-based applications, may not be suitable for some bioinformatics applications, where all records are listed in only one flat file or where the computer programs generating the datasets are computationally expensive. The Web extractor developed by Hammer [54] could be used for flat files as well as HTML pages. However, it requires a declarative specification which states how to extract information hierarchically. When applied to biological datasets, which may grow exponentially, this approach may have performance drawbacks when the data becomes out-of-core. Moreover, the above research assumes the target to be either a database system or a simple tabular form. Data conversion between flat files of arbitrary format has not been studied by any of these efforts.
2.5 Biological Query
The requests posed by biological research vary, and some may go beyond the relational algebra assumed by traditional database management systems. Searching, mining and browsing are all common. Besides the wide range of query types, biologists also wish to access data through different semantic layers. Various languages have been proposed to satisfy these needs. The relational data model and SQL are the most common and are utilized in systems such as DiscoveryLink [53]. Gene Logic’s Object Protocol Model (OPM) [22] provides an object-oriented database with the OPM-MQL query language. TAMBIS [47] uses an ontology to guide query formulation. A TAMBIS query is formulated in terms of the concepts and relationships in the ontology. SRS [40] developed its own query language, which supports string comparison and a link operator.
Several systems address the query optimization problem. OPM’s query optimization is rule-based [22]. Both K2 [32] and DiscoveryLink [53] perform cost-based plan optimization after query rewriting. Knowledge-based Integration of Neuroscience Data (KIND) [50] adds domain knowledge to query optimization. Eckman et al. have focused on optimizing the execution of queries that access multiple biological databases in a distributed environment [38].
Closely related to querying biological data is biological data indexing. A large number of research efforts have focused on join techniques for relational databases, as reviewed in [113]. SRS [40] supports flat-file databanks and maps the data into object models. Indexing on a single data element is possible with the SRS token server. The indexing algorithms for biological data vary greatly in the data types being indexed, the indexing mechanisms, and the interfaces. For example, the efforts in [26, 76, 97] use traditional electronic dictionary approaches to index medical literature, while [17, 18] focus on indexing the full text of medical documents. Indexing medical images has been studied in [66, 99]. Progress has also been made on indexing string data to assist similarity search on biological sequences. The frequency transformation based [60, 63, 79] and index tree based [100, 59] approaches have been very successful. Although data indexing is supported by many integration systems, none claims to be able to support all the indexing schemes.
2.6 Semantic and Ontology
Semantic interoperability is another research area related to information integration. Use of ontologies and ontology alignment [57] and semantic mediation [49] are among the approaches proposed for data integration. Many algorithms for schema mapping have been proposed, and are reviewed by Rahm and Bernstein [83]. For example, Cupid [68] considers not only linguistic similarity, but also structural similarity between each pair of schema elements. Ontologies have recently been used in data mining in a number of ways, including improving mining algorithms [16], identifying disease gene candidates [102], and modeling the mining process [21], among others. Ontologies have also been used to guide the discovery of semi-structured data schemas. Most of these efforts [78, 75, 107] focus on the organization of the data entities.
CHAPTER 3
SCHEMA MINING
The first step in integrating an unknown dataset is to understand it. Both its semantic and its syntactic information need to be collected. For biological datasets, this is not a trivial task. Because of the complexity of biological concepts, bioinformatics data is usually semi-structured. Flat-file formats have been commonly used for their high flexibility. There are no established standards for these flat-file layouts. As a result, their readiness for data-consuming programs, such as integration systems, is poor. This poses a big challenge for biological data integration systems and increases the difficulty of on-the-fly integration.
We have designed a three-step approach for improving the accessibility of bioinformatics data and reducing the human involvement in the integration process. The scheme starts with learning data layouts semi-automatically [93, 94]. In the second step, parsers or wrappers are automatically generated using the learned layout [114]. Finally, to allow integration of this data with other sources or data analysis programs, we need to correctly assign labels to each distinct attribute in the datasets. This last step is referred to as schema mining or schema label assignment, and is the focus of this chapter. Note that once the schemas from different data sources become available, the metadata are finalized, and data is ready to be transformed automatically between different datasets using the tool discussed in Chapter 4 and queried using the tools discussed in Chapters 5 and 6.
The schema of an arbitrary dataset contains two parts: the identities of the entities and their organization. Most schema mining efforts have addressed the second aspect [78, 75, 107]. Unfortunately, when data is represented in flat-files, neither kind of information is available. In this chapter, we present an approach for identifying the data entities or attributes in a flat-file dataset. Our approach is based on the application of data mining techniques, and an unsupervised learning method is used. Each attribute that needs to be labelled is represented by a set of frequently occurring values found for that attribute. We then apply an ontology with heuristics to score the likelihood of this attribute having each of the labels from a set of available labels. Further, a clustering scheme is used to find cutoff values on the computed scores, and to assign labels. This allows our approach to be effective even when there is considerable noise and other inconsistencies in the dataset.
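As a rough illustration of this pipeline, the sketch below summarizes an attribute by its most frequent values, scores each candidate label by the fraction of those values that a toy ontology places directly under the label, and chooses a cutoff at the largest gap between sorted scores. The ontology contents, the scoring formula, the gap-based cutoff (a simple stand-in for the clustering scheme), and all function names are assumptions made for illustration only; the actual algorithms are those described in Section 3.2.

```python
from collections import Counter

# Hypothetical toy ontology: term -> parent term (is-a edges).
TOY_ONTOLOGY = {
    "protein": "molecule type",
    "dna": "molecule type",
    "rna": "molecule type",
    "homo sapiens": "organism",
    "mus musculus": "organism",
}


def summarize(values, top_k=3):
    """Represent an attribute by its top_k most frequent normalized values."""
    counts = Counter(v.strip().lower() for v in values)
    return [value for value, _ in counts.most_common(top_k)]


def score_labels(summary, ontology, labels):
    """Score each label by the fraction of summary values that are its children."""
    scores = {}
    for label in labels:
        hits = sum(1 for value in summary if ontology.get(value) == label)
        scores[label] = hits / len(summary) if summary else 0.0
    return scores


def cutoff_by_largest_gap(scores):
    """Split the sorted scores at the largest gap; return labels above the gap."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        return [label for label, _ in ranked]
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    split = gaps.index(max(gaps))
    return [label for label, _ in ranked[: split + 1]]


# Made-up attribute values, including one typo ("protien") as noise.
raw_values = ["Protein", "DNA", "protein", "RNA", "protien", "DNA", "protein", "rna"]
summary = summarize(raw_values)
scores = score_labels(summary, TOY_ONTOLOGY, ["molecule type", "organism"])
assigned = cutoff_by_largest_gap(scores)
```

In this toy run the infrequent typo never enters the summary, which is one way that working with frequently occurring values can tolerate noise.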
For evaluating our work, we created a small ontology, which is based on selective sampling of information from a number of sources. Detailed experimental evaluation of the work using three well-known biological datasets has demonstrated the effectiveness of our approach.
Overall, this work makes the following contributions. We have introduced the problem of schema labeling for flat-file data layouts. We have shown how data mining techniques from a number of areas can be adapted and incorporated to address this problem. Specifically, our work builds on existing techniques for computing frequently occurring items, scoring with ontologies, and clustering. Finally, our experimental evaluation has shown the effectiveness of our approach.
The rest of this chapter is organized as follows. The major technical challenges and the overview of our approach are discussed in Section 3.1. The algorithms are described in detail in Section 3.2. Experimental results are presented in Section 3.3. Finally, we summarize this work in Section 3.4.
3.1 Overall Context, Challenges, and System Overview
In this section, we describe the overall approach for supporting on-the-fly integration of biological datasets proposed by our team. The discussion will then focus on schema mining or attribute labeling. We will discuss the challenges and provide an overview of our approach.
Jointly with the work by another graduate student in our research group, we proposed the following approach for bioinformatics data integration, as shown in Figure 3.1. The layout of a flat-file bioinformatics dataset is first learned semi-automatically in two steps. The first step is to infer the delimiters used by the dataset using d score, a metric we have defined. The d score is based on the frequency and position of token sequences, and the algorithm provides a superset of the actual delimiters. The incorrect delimiters can be removed manually by users. In the second step, the layout descriptor of the dataset is generated from the correct delimiter set, based on the relative order of the delimiters. More details can be found in [93] and [94].
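The d score itself is defined in [93] and [94] and is not reproduced here. As a loose stand-in, the sketch below ranks candidate delimiters by how often a token opens a line and returns every token above a threshold, a candidate set that a user would then prune by hand. The whitespace tokenization, the threshold, the sample lines, and the function name are illustrative assumptions, not the published metric.

```python
from collections import Counter


def candidate_delimiters(lines, min_fraction=0.2):
    """Return tokens that open at least min_fraction of the non-empty lines.

    A simplified frequency-and-position heuristic, not the d score of [93, 94].
    """
    leading = Counter()
    total = 0
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        total += 1
        leading[tokens[0]] += 1  # position 0 means the start of the line
    if total == 0:
        return set()
    return {tok for tok, n in leading.items() if n / total >= min_fraction}


# Made-up lines in the style of two-character line codes.
sample = [
    "OS   Homo sapiens",
    "OG   Mitochondrion",
    "OS   Mus musculus",
    "OG   Chloroplast",
]
candidates = candidate_delimiters(sample)
```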
To integrate the dataset, we also need to gather its semantic information. One essential step is to assign a label or name to every attribute, and it is illustrated by the rest of Figure 3.1. We call this step schema mining and will focus on it in this chapter.
Figure 3.1: Overview of Metadata Learning for Biological Data (diagram labels: Dataset Flat File; Layout Learning; Layout Descriptor; Parser Generation; Parser; Parsing; Raw Attribute Values; Value Cleaning and Summarization; Attribute Summaries; Score Calculation; Scores; Schema Mining)
The metadata collected can be used in various ways. From the layout descriptor, a parser can be created automatically. This process was published in [114] and will be discussed in Chapter 4. The generated parser parses datasets in a streaming fashion by continuously looking for the next possible delimiters. Metadata layout descriptors also provide enough information to query the datasets. The tools developed for this purpose will be presented in Chapters 5 and 6.
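A minimal sketch of the streaming idea follows: the parser reads one line at a time and emits an attribute each time the next delimiter is seen, so the whole file never has to be held in memory. The delimiter list, the line-prefix matching rule, the sample text, and the function name are assumptions for illustration; the generated parsers discussed in Chapter 4 are driven by a layout descriptor rather than by this hard-coded logic.

```python
import io


def stream_attributes(lines, delimiters):
    """Yield (delimiter, value) pairs, scanning forward for the next delimiter."""
    current, buffer = None, []
    for line in lines:
        line = line.rstrip("\n")
        matched = next((d for d in delimiters if line.startswith(d)), None)
        if matched is not None:
            if current is not None:
                yield current, " ".join(buffer)  # previous attribute is complete
            current, buffer = matched, [line[len(matched):].strip()]
        elif current is not None:
            buffer.append(line.strip())  # continuation of the current value
    if current is not None:
        yield current, " ".join(buffer)


# Made-up two-attribute record; any iterable of lines (such as an open file) works.
text = "OS   Homo sapiens\n     (Human)\nOG   Mitochondrion\n"
records = list(stream_attributes(io.StringIO(text), ["OS", "OG"]))
```

Because `stream_attributes` is a generator, a caller can process each attribute as soon as the following delimiter appears, which is the property that matters for datasets too large to fit in memory.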
3.1.1 Challenges in Schema Mining
We now discuss the major challenges involved in schema mining or automatic schema labeling for biological data. We also give an overview of our data mining based approach.
Flat-file data formats usually use constant strings or delimiters for separating data values. A data mining algorithm developed in our group is able to return the delimiter set. One possible approach for schema labeling could be to use these delimiters to obtain information about the attributes. Unfortunately, for biological datasets, this information is usually very limited or incomplete. In GenBank, for example, there is no delimiter for the most important attribute, the sequence data. Each biological sequence string begins on the line immediately below the word “ORIGIN”, which is the delimiter for another attribute that gives a local pointer to the sequence start. In other datasets, the delimiters differ greatly from natural English words. For example, SWISSPROT, one of the most popular biological databases, uses two-character strings to indicate line types. Line codes “OS” and “OG” lead the lines for organism species and organelle, respectively. In either case, it is unlikely that a simple lexical analysis could successfully identify the attributes from the delimiters only.
We proposed a different approach to identify data entities. Rather than focusing on delimiters or other summary forms of the data, we take a bottom-up approach. It is based on the observation that a schema is a reasonable abstraction of the dataset. For a given attribute, its data and its schema label can be related using the is-a operator, which is similar to the relationship of the terms within an ontology hierarchy. For example, protein is a valid value for the attribute of molecule type, since the assertion protein is-a molecule type is true. Because of the same assertion, in an ontology tree about molecule type, protein is a valid child. Thus, in the ideal situation, where the ontology database is complete and the attribute values do not have any errors, the following two statements hold about an attribute with label att.
be distinguished from values alone. In this case, both have the label “date”. Their full attribute names can only be obtained by consulting the database designers or documents.
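The is-a relationship between an attribute value and a candidate label can be sketched as a walk up an ontology tree. The parent map below is a made-up fragment used only for illustration, and `is_a` is a hypothetical helper name, not part of the system described here.

```python
# Hypothetical ontology fragment: each term maps to its parent (is-a edge).
PARENT = {
    "protein": "molecule type",
    "enzyme": "protein",
    "dna": "molecule type",
    "molecule type": None,
}


def is_a(term, label, parent=PARENT):
    """Return True if label is reached by following is-a edges up from term."""
    seen = set()  # guards against accidental cycles in the map
    node = parent.get(term)
    while node is not None and node not in seen:
        if node == label:
            return True
        seen.add(node)
        node = parent.get(node)
    return False
```

Here `is_a("protein", "molecule type")` holds, matching the example in the text, while a string that does not appear in the ontology at all simply yields `False`.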
However, a complication with the above approach arises because, for any real database, the above two statements are not always true. First of all, a complete and comprehensive biological ontology database is needed for this approach to work. Such an ontology database does not exist and is extremely difficult, if not impossible, to construct. Biology is a quickly evolving research field. Its vocabulary expands quickly and new discoveries about the existing entities are added constantly. There do exist many biological ontology databases, like the NCBI taxonomy database and Gene Ontology (GO), but they do not cover all aspects of the biological sciences. Certainly, one can improve the completeness by combining the existing ontology databases, but some simple information, like biological molecule type, may still be missing. At the same time, the merged database could be so large that the mining process may become extremely slow. To address this problem, we have created our own ontology database, where we have used a selective sampling method to control the size of the ontology and manually added some terms to cover more categories.
The second problem with our proposed approach arises because biological databases are not free of errors. Many databases are manually entered and/or curated. Some of the biological data fields are treated as free text, and little, if any, quality checking is done on them. As a result, typos and errors are common in bioinformatics data. Different biological databases are maintained by different institutions with different requirements on data quality control. For example, in the Uniprot SWISSPROT data file, we found many occurrences of “sequenceof”, which is obviously a typo with a missing space. Therefore, it is required that the schema labeling algorithm is robust to