FUNDAMENTALS OF DATABASE SYSTEMS Fourth Edition, Part 10

29.2 Multimedia Databases

• Marketing, advertising, retailing, entertainment, and travel: There are virtually no limits to using multimedia information in these applications, from effective sales presentations to virtual tours of cities and art galleries. The film industry has already shown the power of special effects in creating animations and synthetically designed animals and aliens. The use of predesigned stored objects in multimedia databases will expand the range of these applications.

• Real-time control and monitoring: Coupled with active database technology (see Chapter 24), multimedia presentation of information can be a very effective means for monitoring and controlling complex tasks such as manufacturing operations, nuclear power plants, patients in intensive care units, and transportation systems.

Commercial Systems for Multimedia Information Management. There are no DBMSs designed for the sole purpose of multimedia data management, and therefore none has the range of functionality required to fully support all of the multimedia information management applications discussed above. However, several DBMSs today support multimedia data types; these include Informix Dynamic Server, DB2 Universal Database (UDB) of IBM, Oracle 9 and 10, CA-JASMINE, Sybase, and ODB II. All of these DBMSs have support for objects, which is essential for modeling a variety of complex multimedia objects. One major problem with these systems is that the "blades, cartridges, and extenders" for handling multimedia data are designed in a very ad hoc manner. The functionality is provided without much apparent attention to scalability and performance. There are products available that operate either stand-alone or in conjunction with other vendors' systems to allow retrieval of image data by content. They include Virage, Excalibur, and IBM's QBIC. Operations on multimedia need to be standardized.
The MPEG-7 and other standards are addressing some of these issues.

Chapter 29 Emerging Database Technologies and Applications

29.2.5 Selected Bibliography on Multimedia Databases

Multimedia database management is becoming a very heavily researched area, with several industrial projects under way. Grosky (1994, 1997) provides two excellent tutorials on the topic. Pazandak and Srivastava (1995) provide an evaluation of database systems related to the requirements of multimedia databases. Grosky et al. (1997) contains contributed articles, including a survey on content-based indexing and retrieval by Jagadish (1997). Faloutsos et al. (1994) also discuss a system for image querying by content. Li et al. (1998) introduce image modeling in which an image is viewed as a hierarchical structured complex object with both semantics and visual properties. Nwosu et al. (1996) and Subramanian and Jajodia (1997) have written books on the topic. Lassila (1998) discusses the need for metadata for accessing multimedia information on the web; the semantic web effort is summarized in Fensel (2000). Khan (2000) did a dissertation on ontology-based information retrieval. Uschold and Gruninger (1996) is a good resource on ontologies. Corcho et al. (2003) compare ontology languages and discuss methodologies to build ontologies. Multimedia content analysis, indexing, and filtering are discussed in Dimitrova (1999). A survey of content-based multimedia retrieval is provided by Yoshitaka and Ichikawa (1999).
The following WWW references may be consulted for additional information:

CA-JASMINE (Multimedia ODBMS): http://www.cai.com/products/jasmine.htm
Excalibur technologies: http://www.excalib.com
Virage, Inc. (Content-based image retrieval): http://www.virage.com
IBM's QBIC (Query by Image Content) product:

29.3 GEOGRAPHIC INFORMATION SYSTEMS

Geographic information systems (GIS) are used to collect, model, store, and analyze information describing physical properties of the geographical world. The scope of GIS broadly encompasses two types of data: (1) spatial data, originating from maps, digital images, administrative and political boundaries, roads, and transportation networks, as well as physical data such as rivers, soil characteristics, climatic regions, and land elevations; and (2) nonspatial data, such as socio-economic data (like census counts), economic data, and sales or marketing information. GIS is a rapidly developing domain that offers highly innovative approaches to meet some challenging technical demands.

29.3.1 GIS Applications

It is possible to divide GISs into three categories: (1) cartographic applications, (2) digital terrain modeling applications, and (3) geographic objects applications. Figure 29.3 summarizes these categories. In cartographic and terrain modeling applications, variations in spatial attributes are captured, for example, soil characteristics, crop density, and air quality. In geographic objects applications, objects of interest are identified from a physical domain, for example, power plants, electoral districts, property parcels, product distribution districts, and city landmarks. These objects are related to pertinent application data, which may be, for this specific example, power consumption, voting patterns, property sales volumes, product sales volume, and traffic density. The first two categories of GIS applications require a field-based representation, whereas the third category requires an object-based one.
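The difference between the two representations can be sketched in a few lines of Python. This is only an illustration: the grid values, the `PowerPlant` class, and every field name below are hypothetical and are not taken from any GIS product. A field-based representation supplies an attribute value for every location in a region, while an object-based representation stores discrete objects together with their application data.

```python
from dataclasses import dataclass

# Field-based (hypothetical data): a soil attribute sampled on a regular
# grid, so every cell of the region has a value.
SOIL_PH = [
    [6.1, 6.3, 6.8],
    [5.9, 6.2, 6.7],
]


def field_value(grid, row, col):
    """Return the attribute value stored for one grid cell."""
    return grid[row][col]


# Object-based (hypothetical data): discrete objects of interest, each
# carrying its own location and application attributes.
@dataclass
class PowerPlant:
    name: str
    x: float
    y: float
    power_consumption_mwh: float


PLANTS = [
    PowerPlant("Plant A", 2.0, 5.0, 1200.0),
    PowerPlant("Plant B", 7.5, 1.0, 800.0),
]


def total_consumption(plants):
    """Aggregate an application attribute over the stored objects."""
    return sum(p.power_consumption_mwh for p in plants)
```

In the field-based case a query names a location and gets back a value; in the object-based case a query selects objects and then reads their attributes.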
The cartographic approach involves special functions that can include the overlapping of layers of maps to combine attribute data that will allow, for example, the measuring of distances in three-dimensional space and the reclassification of data on the map. Digital terrain modeling requires a digital representation of parts of the earth's surface using land elevations at sample points that are connected to yield a surface model such as a three-dimensional net (connected lines in 3D) showing the surface terrain. It requires functions of interpolation between observed points as well as visualization. In object-based geographic applications, additional spatial functions are needed to deal with data related to roads, physical pipelines, communication cables, power lines, and such. For example, for a given region, comparable maps can be used for comparison at various points of time to show changes in certain data such as locations of roads, cables, buildings, and streams.

FIGURE 29.3 A possible classification of GIS applications (adapted from Adam and Gangopadhyay (1997)).

GIS Applications
  Cartographic: irrigation; crop yield analysis; land evaluation; planning and facilities management; landscape studies; traffic pattern analysis
  Digital Terrain Modeling Applications: earth science resource studies; civil engineering and military evaluation; soil surveys; air and water pollution studies; flood control; water resource management
  Geographic Objects Applications: car navigation systems; geographic market analysis; utility distribution and consumption; consumer product and services economic analysis

29.3.2 Data Management Requirements of GIS

The functional requirements of the GIS applications above translate into the following database requirements.

Data Modeling and Representation. GIS data can be broadly represented in two formats: (1) vector and (2) raster. Vector data represents geometric objects such as points, lines, and polygons.
Thus a lake may be represented as a polygon, and a river by a series of line segments. Raster data is characterized as an array of points, where each point represents the value of an attribute for a real-world location. Informally, raster images are n-dimensional arrays where each entry is a unit of the image and represents an attribute. Two-dimensional units are called pixels, while three-dimensional units are called voxels. Three-dimensional elevation data is stored in a raster-based digital elevation model (DEM) format. Another format, the triangular irregular network (TIN), is a topological vector-based approach that models surfaces by connecting sample points as vertices of triangles and has a point density that may vary with the roughness of the terrain. Rectangular grids (or elevation matrices) are two-dimensional array structures. In digital terrain modeling (DTM), the model also may be used by substituting the elevation with some attribute of interest such as population density or air temperature. GIS data often includes a temporal structure in addition to a spatial structure. For example, traffic flow or average vehicular speeds in traffic may be measured every 60 seconds at a set of points in a roadway network.

Data Analysis. GIS data undergoes various types of analysis. For example, in applications such as soil erosion studies, environmental impact studies, or hydrological runoff simulations, DTM data may undergo various types of geomorphometric analysis: measurements such as slope values, gradients (the rate of change in altitude), aspect (the compass direction of the gradient), profile convexity (the rate of change of gradient), plan convexity (the convexity of contours), and other parameters. When GIS data is used for decision support applications, it may undergo aggregation and expansion operations using data warehousing, as we discussed in Section 28.3.
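Two of the geomorphometric measurements named above, slope and aspect, can be computed directly from a rectangular elevation grid. The sketch below is a minimal illustration under stated assumptions rather than the method of any particular GIS: the function name, the grid layout (row 0 is the northern edge), the use of central differences, and the sample grid are all assumptions. Aspect is taken here as the compass bearing of the downslope direction, a common GIS convention that is also an assumption.

```python
import math


def slope_and_aspect(dem, row, col, cell_size):
    """Slope (degrees) and aspect (compass degrees) at an interior DEM cell.

    Assumptions: dem is a list of rows with row 0 as the northern edge,
    columns run west to east, and cell_size is the grid spacing in the
    same units as the elevations.
    """
    # Central differences approximate the elevation gradient.
    dz_dx = (dem[row][col + 1] - dem[row][col - 1]) / (2 * cell_size)  # eastward
    dz_dy = (dem[row - 1][col] - dem[row + 1][col]) / (2 * cell_size)  # northward

    slope = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    if dz_dx == 0 and dz_dy == 0:
        return slope, None  # flat cell: aspect is undefined

    # Compass bearing (0 = north, 90 = east) of the downslope direction.
    aspect = math.degrees(math.atan2(-dz_dx, -dz_dy)) % 360
    return slope, aspect


# Hypothetical 3 x 3 DEM that rises 10 units per cell toward the east.
EAST_RISING = [
    [0, 10, 20],
    [0, 10, 20],
    [0, 10, 20],
]
```

For the sample grid with a cell size of 10, the center cell has a slope of about 45 degrees and faces west (a bearing of about 270 degrees), since the terrain climbs toward the east.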
In addition, geometric operations (to compute distances, areas, volumes), topological operations (to compute overlaps, intersections, shortest paths), and temporal operations (to compute interval-based or event-based queries) are involved. Analysis involves a number of temporal and spatial operations, which were discussed in Chapter 24.

Data Integration. GISs must integrate both vector and raster data from a variety of sources. Sometimes edges and regions are inferred from a raster image to form a vector model, or conversely, raster images such as aerial photographs are used to update vector models. Several coordinate systems such as Universal Transverse Mercator (UTM), latitude/longitude, and local cadastral systems are used to identify locations. Data originating from different coordinate systems requires appropriate transformations. Major public sources of geographic data, including the TIGER files maintained by the U.S. Department of Commerce, are used for road maps by many Web-based map drawing tools (e.g., http://maps.yahoo.com). Often there are high-accuracy, attribute-poor maps that have to be merged with low-accuracy, attribute-rich maps. This is done with a process called "rubber-banding," in which the user defines a set of control points in both maps and the low-accuracy map is transformed by lining up the control points. A major integration issue is to create and maintain attribute information (such as air quality or traffic flow), which can be related to and integrated with appropriate geographical information over time as both evolve.

Data Capture. The first step in developing a spatial database for cartographic modeling is to capture the two-dimensional or three-dimensional geographical information in digital form, a process that is sometimes impeded by source map characteristics such as resolution, type of projection, map scales, cartographic licensing, diversity of measurement techniques, and coordinate system differences.
Spatial data can also be captured from remote sensors in satellites such as Landsat, NOAA, and Advanced Very High Resolution Radiometer (AVHRR) as well as SPOT HRV (High Resolution Visible Range Instrument), which is free of interpretive bias and very accurate. For digital terrain modeling, data capture methods range from manual to fully automated. Ground surveys are the traditional approach and the most accurate, but they are very time consuming. Other techniques include photogrammetric sampling and digitizing cartographic documents.

29.3.3 Specific GIS Data Operations

GIS applications are conducted through the use of special operators such as the following:

1. Interpolation: This process derives elevation data for points at which no samples have been taken. It includes computation at single points, computation for a rectangular grid or along a contour, and so forth. Most interpolation methods are based on triangulation that uses the TIN method for interpolating elevations inside the triangle based on those of its vertices.

2. Interpretation: Digital terrain modeling involves the interpretation of operations on terrain data such as editing, smoothing, reducing details, and enhancing. Additional operations involve patching or zipping the borders of triangles (in TIN data), and merging, which implies combining overlapping models and resolving conflicts among attribute data. Conversions among grid models, contour models, and TIN data are involved in the interpretation of the terrain.

3. Proximity analysis: Several classes of proximity analysis include computations of "zones of interest" around objects, such as the determination of a buffer around a car on a highway. Shortest path algorithms using 2D or 3D information are an important class of proximity analysis.

4. Raster image processing: This process can be divided into two categories: (1) map algebra, which is used to integrate geographic features on different map layers to produce new maps algebraically; and (2) digital image analysis, which deals with analysis of a digital image for features such as edge detection and object detection. Detecting roads in a satellite image of a city is an example of the latter.

5. Analysis of networks: Networks occur in GIS in many contexts that must be analyzed and may be subjected to segmentations, overlays, and so on. Network overlay refers to a type of spatial join where a given network (for example, a highway network) is joined with a point database (for example, incident locations) to yield, in this case, a profile of high-incident roadways.

Other Database Functionality. The functionality of a GIS database is also subject to other considerations.

• Extensibility: GISs are required to be extensible to accommodate a variety of constantly evolving applications and corresponding data types. If a standard DBMS is used, it must allow a core set of data types with a provision for defining additional types and methods for those types.

• Data quality control: As in many other applications, quality of source data is of paramount importance for providing accurate results to queries. This problem is particularly significant in the GIS context because of the variety of data, sources, and measurement techniques involved and the absolute accuracy expected by application users.

• Visualization: A crucial function in GIS is related to visualization: the graphical display of terrain information and the appropriate representation of application attributes to go with it.
Major visualization techniques include (1) contouring through the use of isolines, spatial units of lines or arcs of equal attribute values; (2) hillshading, an illumination method used for qualitative relief depiction using varied light intensities for individual facets of the terrain model; and (3) perspective displays, three-dimensional images of terrain model facets using perspective projection methods from computer graphics. These techniques impose cartographic data and other three-dimensional objects on terrain data, providing animated scene renderings such as those in flight simulations and animated movies.

Such requirements clearly illustrate that standard RDBMSs or ODBMSs do not meet the special needs of GIS. It is therefore necessary to design systems that support the vector and raster representations and the spatial functionality as well as the required DBMS features. A popular GIS software package called ARC/INFO, which is not a DBMS but integrates RDBMS functionality in the INFO part of the system, is briefly discussed in the subsection that follows. More systems are likely to be designed in the future to work with relational or object databases that will contain some of the spatial and most of the nonspatial information.

29.3.4 An Example of a GIS Software: ARC/INFO

ARC/INFO, a popular GIS software package launched in 1981 by Environmental System Research Institute (ESRI), uses the arc node model to store spatial data. A geographic layer, called a coverage in ARC/INFO, consists of three primitives: (1) nodes (points), (2) arcs (similar to lines), and (3) polygons. The arc is the most important of the three and stores a large amount of topological information. An arc has a start node and an end node (and it therefore has direction too). In addition, the polygons to the left and the right of the arc are also stored along with each arc. As there is no restriction on the shape of the arc, shape points that have no topological information are also stored along with each arc.
The database managed by the INFO RDBMS thus consists of three required tables: (1) node attribute table (NAT), (2) arc attribute table (AAT), and (3) polygon attribute table (PAT). Additional information can be stored in separate tables and joined with any of these three tables. The NAT contains an internal ID for the node, a user-specified ID, the coordinates of the node, and any other information associated with that node (e.g., names of the intersecting roads at the node). The AAT contains an internal ID for the arc, a user-specified ID, the internal IDs of the start and end nodes, the internal IDs of the polygons to the left and the right, a series of coordinates of shape points (if any), the length of the arc, and any other data associated with the arc (e.g., the name of the road the arc represents). The PAT contains an internal ID for the polygon, a user-specified ID, the area of the polygon, the perimeter of the polygon, and any other associated data (e.g., name of the county the polygon represents).

Typical spatial queries are related to adjacency, containment, and connectivity. The arc node model has enough information to satisfy all three types of queries, but the RDBMS is not ideally suited for this type of querying. A simple example shows how much of a relational database has to be searched to extract adjacency information. Assume that we are trying to determine whether two polygons, A and B, are adjacent to each other. We would have to exhaustively look at the entire AAT to determine whether there is an edge that has A on one side and B on the other. The search cannot be limited to the edges of either polygon, because we do not explicitly store all the arcs that make up a polygon in the PAT. Storing all the arcs in the PAT would be redundant because all the information is already there in the AAT.
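The adjacency test described above can be sketched with Python's built-in sqlite3 module standing in for the INFO RDBMS. This is a hedged illustration, not ARC/INFO's actual schema: the table name `aat`, the column names, and the sample arcs are all assumptions made for the example. Because the table has no index on the polygon columns, the query has to examine every arc row, which is the exhaustive scan the text describes.

```python
import sqlite3

# Hypothetical arcs: (arc_id, start_node, end_node, left_poly, right_poly).
SAMPLE_ARCS = [
    (1, 1, 2, "A", "B"),        # shared boundary between A and B
    (2, 2, 3, "B", "OUTSIDE"),
    (3, 3, 1, "A", "OUTSIDE"),
    (4, 4, 5, "C", "OUTSIDE"),  # C touches neither A nor B
]


def build_aat(rows):
    """Create an in-memory arc attribute table (simplified, assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE aat ("
        " arc_id INTEGER PRIMARY KEY,"
        " start_node INTEGER, end_node INTEGER,"
        " left_poly TEXT, right_poly TEXT)"
    )
    conn.executemany("INSERT INTO aat VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def are_adjacent(conn, a, b):
    """True if some arc has polygon a on one side and polygon b on the other."""
    row = conn.execute(
        "SELECT COUNT(*) FROM aat"
        " WHERE (left_poly = ? AND right_poly = ?)"
        "    OR (left_poly = ? AND right_poly = ?)",
        (a, b, b, a),
    ).fetchone()
    return row[0] > 0
```

With the sample data, `are_adjacent(build_aat(SAMPLE_ARCS), "A", "B")` is true and the same call for "A" and "C" is false. Both orderings of the two polygons must be checked because an arc's direction determines which polygon counts as left and which as right.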
ESRI has released Arc/Storm (Arc Store Manager), which allows multiple users to use the same GIS, handles distributed databases, and integrates with other commercial RDBMSs like ORACLE, INFORMIX, and SYBASE. While it offers many performance and functional advantages over ARC/INFO, it is essentially an RDBMS embedded within a GIS.

29.3.5 Problems and Future Issues in GIS

GIS is an expanding application area of databases, reflecting an explosion in the number of end users using digitized maps, terrain data, space images, weather data, and traffic information support data. As a consequence, an increasing number of problems related to GIS applications has been generated and will need to be solved:

1. New architectures: GIS applications will need a new client-server architecture that will benefit from existing advances in RDBMS and ODBMS technology. One possible solution is to separate spatial from nonspatial data and to manage the latter entirely by a DBMS. Such a process calls for appropriate modeling and integration as both types of data evolve. Commercial vendors find that it is more viable to keep a small number of independent databases with an automatic posting of updates across them. Appropriate tools for data transfer, change management, and workflow management will be required.

2. Versioning and object life-cycle approach: Because of constantly evolving geographical features, GISs must maintain elaborate cartographic and terrain data, a management problem that might be eased by incremental updating coupled with update authorization schemes for different levels of users. Under the object life-cycle approach, which covers the activities of creating, destroying, and modifying objects as well as promoting versions into permanent objects, a complete set of methods may be predefined to control these activities for GIS objects.

3. Data standards: Because of the diversity of representation schemes and models, formalization of data transfer standards is crucial for the success of GIS. The international standardization body (ISO TC211) and the European standards body (CEN TC278) are now in the process of debating relevant issues, among them conversion between vector and raster data for fast query performance.

4. Matching applications and data structures: Looking again at Figure 29.3, we see that a classification of GIS applications is based on the nature and organization of data. In the future, systems covering a wide range of functions, from market analysis and utilities to car navigation, will need boundary-oriented data and functionality. On the other hand, applications in environmental science, hydrology, and agriculture will require more area-oriented and terrain model data. It is not clear that all this functionality can be supported by a single general-purpose GIS. The specialized needs of GISs will require that general-purpose DBMSs be enhanced with additional data types and functionality before full-fledged GIS applications can be supported.

5. Lack of semantics in data structures: This is evident especially in maps. Information such as highway and road crossings may be difficult to determine based on the stored data. One-way streets are also hard to represent in the present GISs. Transportation CAD systems have incorporated such semantics into GIS.

29.3.6 Selected Bibliography for GIS

There are a number of books written on GIS. Adam and Gangopadhyay (1997) and Laurini and Thompson (1992) focus on GIS database and information management problems. Kemp (1993) gives an overview of GIS issues and data sources. Huxhold (1991) gives an introduction to Urban GIS. Maguire et al. (1991) have a very good collection of GIS-related papers. Antenucci (1998) presents a discussion of the GIS technologies.
Shekhar and Chawla (2002) discuss issues and approaches to spatial data management, which is at the core of all GIS. Demers (2002) is another recent book on the fundamentals of GIS. Bossomaier and Green (2002) is a primer on GIS operations, languages, metadata paradigms, and standards. Peng and Tsou (2003) discuss Internet GIS, which includes a suite of emerging new technologies aimed at making GIS more mobile, powerful, and flexible, as well as better able to share and communicate geographic information. The TIGER files for road data in the United States are managed by the U.S. Department of Commerce (1993). Laser-Scan's Web site (http://www.lsl.co.uk/papers) is a good source of information. Environmental System Research Institute (ESRI) has an excellent library of GIS books for all levels at http://www.esri.com. The GIS terminology is defined at http://www.esri.com/library/glossary/glossary.html. The University of Edinburgh maintains a GIS WWW resource list at http://www.geo.ed.ac.uk/home/giswww.html

29.4 GENOME DATA MANAGEMENT

29.4.1 Biological Sciences and Genetics

The biological sciences encompass an enormous variety of information. Environmental science gives us a view of how species live and interact in a world filled with natural phenomena. Biology and ecology study particular species. Anatomy focuses on the overall structure of an organism, documenting the physical aspects of individual bodies. Traditional medicine and physiology break the organism into systems and tissues and strive to collect information on the workings of these systems and the organism as a whole. Histology and cell biology delve into the tissue and cellular levels and provide knowledge about the inner structure and function of the cell. This wealth of information, which has been generated, classified, and stored for centuries, has only recently become a major application of database technology. Genetics has emerged as an ideal field for the application of information technology.
In a broad sense, it can be thought of as the construction of models based on information about genes (which can be defined as basic units of heredity) and populations, and the seeking out of relationships in that information. The study of genetics can be divided into three branches: (1) Mendelian genetics, (2) molecular genetics, and (3) population genetics. Mendelian genetics is the study of the transmission of traits between generations. Molecular genetics is the study of the chemical structure and function of genes at the molecular level. Population genetics is the study of how genetic information varies across populations of organisms.

Molecular genetics provides a more detailed look at genetic information by allowing researchers to examine the composition, structure, and function of genes. The origins of molecular genetics can be traced to two important discoveries. The first occurred in 1869 when Friedrich Miescher discovered nuclein and its primary component, deoxyribonucleic acid (DNA). In subsequent research DNA and a related compound, ribonucleic acid (RNA), were found to be composed of nucleotides (a sugar, a phosphate, and a base, which combined to form nucleic acid) linked into long polymers via the sugar and phosphate. The second discovery was the demonstration in 1944 by Oswald Avery that DNA was indeed the molecular substance carrying genetic information. Genes were thus shown to be composed of chains of nucleic acids arranged linearly on chromosomes and to serve three primary functions: (1) replicating genetic information between generations, (2) providing blueprints for the creation of polypeptides, and (3) accumulating changes, thereby allowing evolution to occur. Watson and Crick found the double-helix structure of DNA in 1953, which gave molecular genetics research a new direction (see footnote 6). The discovery of DNA and its structure is hailed as probably the most important biological work of the last 100 years, and the field it opened may be the scientific frontier for the next 100. In 1962, Watson, Crick, and Wilkins won the Nobel Prize for physiology/medicine for this breakthrough (see footnote 7).

6. See Nature, 171:737, 1953.
7. http://www.pbs.org/wgbh/aso/databank/entries/do53dn.html

29.4.2 Characteristics of Biological Data

Biological data exhibits many special characteristics that make management of biological information a particularly challenging problem. We will thus begin by summarizing the characteristics related to biological information, focusing on a multidisciplinary field called bioinformatics that has emerged, with graduate degree programs now in place in several universities. Bioinformatics addresses information management of genetic information with special emphasis on DNA sequence analysis. It needs to be broadened into a wider scope to harness all types of biological information: its modeling, storage, retrieval, and management. Moreover, applications of bioinformatics span design of targets for drugs, study of mutations and related diseases, anthropological investigations on migration patterns of tribes, and therapeutic treatments.

Characteristic 1: Biological data is highly complex when compared with most other domains or applications. Definitions of such data must thus be able to represent a complex substructure of data as well as relationships and to ensure that no information is lost during biological data modeling. The structure of biological data often provides an additional context for interpretation of the information. Biological information systems must be able to represent any level of complexity in any data schema, relationship, or schema substructure, not just hierarchical, binary, or table data.
As an example, MITOMAP is a database documenting the human mitochondrial genome (see footnote 8). This single genome is a small, circular piece of DNA encompassing information about 16,569 nucleotide bases; 52 gene loci encoding messenger RNA, ribosomal RNA, and transfer RNA; 1000 known population variants; over 60 known disease associations; and a limited set of knowledge on the complex molecular interactions of the biochemical energy-producing pathway of oxidative phosphorylation. As might be expected, its management has encountered a large number of problems; we have been unable to use the traditional RDBMS or ODBMS approaches to capture all aspects of the data.

Characteristic 2: The amount and range of variability in data is high. Hence, biological systems must be flexible in handling data types and values. With such a wide range of possible data values, placing constraints on data types must be limited, since this may exclude unexpected values (e.g., outlier values) that are particularly common in the biological domain. Exclusion of such values results in a loss of information. In addition, frequent exceptions to biological data structures may require a choice of data types to be available for a given piece of data.

Characteristic 3: Schemas in biological databases change at a rapid pace. Hence, for improved information flow between generations or releases of databases, schema evolution and data object migration must be supported. The ability to extend the schema, a frequent occurrence in the biological setting, is unsupported in most relational and object database systems. Presently, systems such as GenBank re-release the entire database with new schemas once or twice a year rather than incrementally changing the system as changes become necessary. A database that evolved incrementally would provide a timely and orderly mechanism for following changes to individual data entities in biological databases over time.
This sort of tracking is important for biological researchers to be able to access and reproduce previous results.

Characteristic 4: Representations of the same data by different biologists will likely be different (even when using the same system). Hence, mechanisms for "aligning" different biological schemas or different versions of schemas should be supported. Given the complexity of biological data, there are a multitude of ways of modeling any given entity, with the results often reflecting the particular focus of the scientist. While two individuals may produce different data models if asked to interpret the same entity, these models will likely have numerous points in common. In such situations, it would be useful to biological investigators to be able to run queries across these common points. This could be accomplished by linking data elements in a network of schemas.

Characteristic 5: Most users of biological data do not require write access to the database; read-only access is adequate. Write access is limited to privileged users called curators. For example, the database created as part of the MITOMAP project has on average more than [...]

8. Details of MITOMAP and its information complexity can be seen in Kogelnik et al. (1997, 1998) and at http://www.mitomap.org.

... Swiss-Prot + TrEMBL) and summarize important issues of database management of such resources. They discuss three main types of databases: Sequence Databases such as DDBJ/EMBL/GenBank Nucleotide Sequence Database; Secondary Databases such as PROSITE, PRINTS and Pfam; and Integrated Databases such as InterPro, that integrates data from six major protein signature databases (Pfam, PRINTS, ProDom, PROSITE, SMART,...
[...] Journal of the ACM
KDD: Knowledge Discovery in Databases
LNCS: Lecture Notes in Computer Science
NCC: Proceedings of the National Computer Conference (published by AFIPS)
OOPSLA: Proceedings of the ACM Conference on Object-Oriented Programming Systems, Languages, and Applications
PODS: Proceedings of the ACM Symposium on Principles of Database Systems
SIGMOD: Proceedings of [...]

[...] only a very high-level view of the data at the time of searching and thus cannot easily make use of any knowledge gleaned from the structure of the GDB tables. Search methods are most useful when users are simply looking for an index into map or probe data. Exploratory ad hoc searching of the database is not encouraged by present interfaces. Integration of the database structures of GDB and OMIM (see below) [...]

[...] which a term node may have multiple parents and multiple children. A child term can be an instance of (is a) or a part of its parent. In the latest release of the GO database, there are over 13,000 terms and more than 18,000 relationships between terms. The annotation of gene products is operated independently by each of the collaborating databases. A subset of [...]

[...] Management of Data
TKDE: IEEE Transactions on Knowledge and Data Engineering (journal)
TOCS: ACM Transactions on Computer Systems (journal)
TODS: ACM Transactions on Database Systems (journal)
TOIS: ACM Transactions on Information Systems (journal)
TOOIS: ACM Transactions on Office Information Systems (journal)
TSE: IEEE Transactions on Software Engineering (journal)
VLDB: Proceedings of the International [...]
[...] searching through a static interface.

The Genome Database (GDB). Created in 1989, the Genome Database (GDB) is a catalog of human gene mapping data; gene mapping is a process that associates a piece of information with a particular location on the human genome. The degree of precision of this location on the map depends upon the source of the data, but it is usually not at the level of individual nucleotide bases. GDB data includes [...]

[...] words, the number of users requiring write access is small. Users generate a wide variety of read-access patterns into the database, but these patterns are not the same as those seen in traditional relational databases. User-requested ad hoc searches demand indexing of often unexpected combinations of data instance classes.

Characteristic 6: Most biologists are not likely to have any knowledge of the internal [...]

[...] links to SCOP, CATH, PFAM and PROSITE. Karp (1996) discusses the problems of interlinking the variety of databases mentioned in this section. He defines two types of links: those that integrate the data and those that relate the data between databases. These were used to design the Ecocyc database. Some of the important web links include the following: The Human Genome sequence [...]

[...] supported. In addition, no mechanism for evolving the schema is documented. Table 29.1 summarizes the features of the major genome-related databases, as well as HGMDB and ACEDB databases. Some additional protein databases exist; they contain information about protein structures. Prominent protein databases include SWISSPROT at the University of Geneva, Protein Data Bank (PDB) at Brookhaven National Laboratory, [...]
[...] Structure Database (EMSD), which is a relational database (http://www.ebi.ac.uk/msd) (Boutselakis et al., 2003), is designed to be a single access point for protein and nucleic acid structures and related information. The database is derived from Protein Data Bank (PDB) entries. The search database contains an extensive set of derived properties, goodness-of-fit indicators, and links to other EBI databases.

[...] user of the database is not able to access the structure of the data directly for querying or other functions, although complete snapshots of the database are available for export in a number of [...]

[...] set of well-defined vocabularies of terms and relationships. The terms are organized in the form of directed acyclic graphs (DAGs), in [...]

TABLE 29.1 Summary of the Major Genome-Related Databases

Database Name | Major Content | Initial Technology | Current Technology | DB Problem Areas | Primary Data Types
Genbank | DNA/RNA [...] | Text files | [...]
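
The Gene Ontology fragments above describe terms arranged as directed acyclic graphs, where a term may have several parents and a child is linked to each parent by an "is a" or "part of" relationship. The following is a minimal Python sketch of that structure, not the GO database's actual schema or API. The class name `TermGraph`, its methods, and the term labels are all hypothetical and chosen only for illustration; they are not real GO identifiers.

```python
from collections import defaultdict


class TermGraph:
    """Toy DAG of ontology terms; edges point from a child term to its parents."""

    ALLOWED_KINDS = ("is_a", "part_of")

    def __init__(self):
        # child term -> list of (parent term, relationship kind)
        self.parents = defaultdict(list)

    def ancestors(self, term):
        """Return every term reachable by following parent links upward."""
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            for parent, _kind in self.parents.get(current, []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def add_relationship(self, child, parent, kind):
        """Link child to parent, refusing edges that would break acyclicity."""
        if kind not in self.ALLOWED_KINDS:
            raise ValueError(f"unknown relationship kind: {kind}")
        if child == parent or child in self.ancestors(parent):
            raise ValueError("relationship would create a cycle")
        self.parents[child].append((parent, kind))


# Illustrative labels only (assumed for this sketch).
g = TermGraph()
g.add_relationship("nucleus", "organelle", "is_a")
g.add_relationship("nucleus", "cell", "part_of")  # a second parent
g.add_relationship("organelle", "cellular_component", "is_a")
g.add_relationship("cell", "cellular_component", "is_a")

print(sorted(g.ancestors("nucleus")))
```

Storing parent links per child keeps the "multiple parents" case natural, and checking the parent's ancestors before adding an edge is one simple way to keep the graph acyclic. A production system would more likely enforce this in the database layer.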
