Part 2 book “Fundamentals of database systems” has contents: Indexing structures for files, algorithms for query processing and optimization, physical database design and tuning, concurrency control techniques, database recovery techniques, database security, distributed databases, distributed databases,… and other contents.
part File Structures, Indexing, and Hashing This page intentionally left blank chapter 17 Disk Storage, Basic File Structures, and Hashing D atabases are stored physically as files of records, which are typically stored on magnetic disks This chapter and the next deal with the organization of databases in storage and the techniques for accessing them efficiently using various algorithms, some of which require auxiliary data structures called indexes These structures are often referred to as physical database file structures, and are at the physical level of the threeschema architecture described in Chapter We start in Section 17.1 by introducing the concepts of computer storage hierarchies and how they are used in database systems Section 17.2 is devoted to a description of magnetic disk storage devices and their characteristics, and we also briefly describe magnetic tape storage devices After discussing different storage technologies, we turn our attention to the methods for physically organizing data on disks Section 17.3 covers the technique of double buffering, which is used to speed retrieval of multiple disk blocks In Section 17.4 we discuss various ways of formatting and storing file records on disk Section 17.5 discusses the various types of operations that are typically applied to file records We present three primary methods for organizing file records on disk: unordered records, in Section 17.6; ordered records, in Section 17.7; and hashed records, in Section 17.8 Section 17.9 briefly introduces files of mixed records and other primary methods for organizing records, such as B-trees These are particularly relevant for storage of object-oriented databases, which we discussed in Chapter 11 Section 17.10 describes RAID (Redundant Arrays of Inexpensive (or Independent) Disks)—a data storage system architecture that is commonly used in large organizations for better reliability and performance Finally, in Section 17.11 we describe three developments in the storage systems area: storage area networks (SAN), network583 584 Chapter 17 Disk Storage, Basic File Structures, and Hashing attached storage (NAS), and iSCSI (Internet SCSI—Small Computer System Interface), the latest technology, which makes storage area networks more affordable without the use of the Fiber Channel infrastructure and hence is getting very wide acceptance in industry Section 17.12 summarizes the chapter In Chapter 18 we discuss techniques for creating auxiliary data structures, called indexes, which speed up the search for and retrieval of records These techniques involve storage of auxiliary data, called index files, in addition to the file records themselves Chapters 17 and 18 may be browsed through or even omitted by readers who have already studied file organizations and indexing in a separate course The material covered here, in particular Sections 17.1 through 17.8, is necessary for understanding Chapters 19 and 20, which deal with query processing and optimization, and database tuning for improving performance of queries 17.1 Introduction The collection of data that makes up a computerized database must be stored physically on some computer storage medium The DBMS software can then retrieve, update, and process this data as needed Computer storage media form a storage hierarchy that includes two main categories: ■ ■ Primary storage This category includes storage media that can be operated on directly by the computer’s central processing unit (CPU), such as the computer’s main memory and smaller but faster cache memories Primary storage usually provides fast access to data but is of limited storage capacity Although main memory capacities have been growing rapidly in recent years, they are still more expensive and have less storage capacity than secondary and tertiary storage devices Secondary and tertiary storage This category includes magnetic disks, optical disks (CD-ROMs, DVDs, and other similar storage media), and tapes Hard-disk drives are classified as secondary storage, whereas removable media such as optical disks and tapes are considered tertiary storage These devices usually have a larger capacity, cost less, and provide slower access to data than primary storage devices Data in secondary or tertiary storage cannot be processed directly by the CPU; first it must be copied into primary storage and then processed by the CPU We first give an overview of the various storage devices used for primary and secondary storage in Section 17.1.1 and then discuss how databases are typically handled in the storage hierarchy in Section 17.1.2 17.1.1 Memory Hierarchies and Storage Devices In a modern computer system, data resides and is transported throughout a hierarchy of storage media The highest-speed memory is the most expensive and is therefore available with the least capacity The lowest-speed memory is offline tape storage, which is essentially available in indefinite storage capacity 17.1 Introduction At the primary storage level, the memory hierarchy includes at the most expensive end, cache memory, which is a static RAM (Random Access Memory) Cache memory is typically used by the CPU to speed up execution of program instructions using techniques such as prefetching and pipelining The next level of primary storage is DRAM (Dynamic RAM), which provides the main work area for the CPU for keeping program instructions and data It is popularly called main memory The advantage of DRAM is its low cost, which continues to decrease; the drawback is its volatility1 and lower speed compared with static RAM At the secondary and tertiary storage level, the hierarchy includes magnetic disks, as well as mass storage in the form of CD-ROM (Compact Disk–Read-Only Memory) and DVD (Digital Video Disk or Digital Versatile Disk) devices, and finally tapes at the least expensive end of the hierarchy The storage capacity is measured in kilobytes (Kbyte or 1000 bytes), megabytes (MB or million bytes), gigabytes (GB or billion bytes), and even terabytes (1000 GB) The word petabyte (1000 terabytes or 10**15 bytes) is now becoming relevant in the context of very large repositories of data in physics, astronomy, earth sciences, and other scientific applications Programs reside and execute in DRAM Generally, large permanent databases reside on secondary storage, (magnetic disks), and portions of the database are read into and written from buffers in main memory as needed Nowadays, personal computers and workstations have large main memories of hundreds of megabytes of RAM and DRAM, so it is becoming possible to load a large part of the database into main memory Eight to 16 GB of main memory on a single server is becoming commonplace In some cases, entire databases can be kept in main memory (with a backup copy on magnetic disk), leading to main memory databases; these are particularly useful in real-time applications that require extremely fast response times An example is telephone switching applications, which store databases that contain routing and line information in main memory Between DRAM and magnetic disk storage, another form of memory, flash memory, is becoming common, particularly because it is nonvolatile Flash memories are high-density, high-performance memories using EEPROM (Electrically Erasable Programmable Read-Only Memory) technology The advantage of flash memory is the fast access speed; the disadvantage is that an entire block must be erased and written over simultaneously Flash memory cards are appearing as the data storage medium in appliances with capacities ranging from a few megabytes to a few gigabytes These are appearing in cameras, MP3 players, cell phones, PDAs, and so on USB (Universal Serial Bus) flash drives have become the most portable medium for carrying data between personal computers; they have a flash memory storage device integrated with a USB interface CD-ROM (Compact Disk – Read Only Memory) disks store data optically and are read by a laser CD-ROMs contain prerecorded data that cannot be overwritten WORM (Write-Once-Read-Many) disks are a form of optical storage used for 1Volatile memory typically loses its contents in case of a power outage, whereas nonvolatile memory does not 585 586 Chapter 17 Disk Storage, Basic File Structures, and Hashing archiving data; they allow data to be written once and read any number of times without the possibility of erasing They hold about half a gigabyte of data per disk and last much longer than magnetic disks.2 Optical jukebox memories use an array of CD-ROM platters, which are loaded onto drives on demand Although optical jukeboxes have capacities in the hundreds of gigabytes, their retrieval times are in the hundreds of milliseconds, quite a bit slower than magnetic disks This type of storage is continuing to decline because of the rapid decrease in cost and increase in capacities of magnetic disks The DVD is another standard for optical disks allowing 4.5 to 15 GB of storage per disk Most personal computer disk drives now read CDROM and DVD disks Typically, drives are CD-R (Compact Disk Recordable) that can create CD-ROMs and audio CDs (Compact Disks), as well as record on DVDs Finally, magnetic tapes are used for archiving and backup storage of data Tape jukeboxes—which contain a bank of tapes that are catalogued and can be automatically loaded onto tape drives—are becoming popular as tertiary storage to hold terabytes of data For example, NASA’s EOS (Earth Observation Satellite) system stores archived databases in this fashion Many large organizations are already finding it normal to have terabyte-sized databases The term very large database can no longer be precisely defined because disk storage capacities are on the rise and costs are declining Very soon the term may be reserved for databases containing tens of terabytes 17.1.2 Storage of Databases Databases typically store large amounts of data that must persist over long periods of time, and hence is often referred to as persistent data Parts of this data are accessed and processed repeatedly during this period This contrasts with the notion of transient data that persist for only a limited time during program execution Most databases are stored permanently (or persistently) on magnetic disk secondary storage, for the following reasons: ■ ■ ■ Generally, databases are too large to fit entirely in main memory The circumstances that cause permanent loss of stored data arise less frequently for disk secondary storage than for primary storage Hence, we refer to disk—and other secondary storage devices—as nonvolatile storage, whereas main memory is often called volatile storage The cost of storage per unit of data is an order of magnitude less for disk secondary storage than for primary storage Some of the newer technologies—such as optical disks, DVDs, and tape jukeboxes—are likely to provide viable alternatives to the use of magnetic disks In the future, databases may therefore reside at different levels of the memory hierarchy from those described in Section 17.1.1 However, it is anticipated that magnetic 2Their rotational speeds are lower (around 400 rpm), giving higher latency delays and low transfer rates (around 100 to 200 KB/second) 17.2 Secondary Storage Devices disks will continue to be the primary medium of choice for large databases for years to come Hence, it is important to study and understand the properties and characteristics of magnetic disks and the way data files can be organized on disk in order to design effective databases with acceptable performance Magnetic tapes are frequently used as a storage medium for backing up databases because storage on tape costs even less than storage on disk However, access to data on tape is quite slow Data stored on tapes is offline; that is, some intervention by an operator—or an automatic loading device—to load a tape is needed before the data becomes available In contrast, disks are online devices that can be accessed directly at any time The techniques used to store large amounts of structured data on disk are important for database designers, the DBA, and implementers of a DBMS Database designers and the DBA must know the advantages and disadvantages of each storage technique when they design, implement, and operate a database on a specific DBMS Usually, the DBMS has several options available for organizing the data The process of physical database design involves choosing the particular data organization techniques that best suit the given application requirements from among the options DBMS system implementers must study data organization techniques so that they can implement them efficiently and thus provide the DBA and users of the DBMS with sufficient options Typical database applications need only a small portion of the database at a time for processing Whenever a certain portion of the data is needed, it must be located on disk, copied to main memory for processing, and then rewritten to the disk if the data is changed The data stored on disk is organized as files of records Each record is a collection of data values that can be interpreted as facts about entities, their attributes, and their relationships Records should be stored on disk in a manner that makes it possible to locate them efficiently when they are needed There are several primary file organizations, which determine how the file records are physically placed on the disk, and hence how the records can be accessed A heap file (or unordered file) places the records on disk in no particular order by appending new records at the end of the file, whereas a sorted file (or sequential file) keeps the records ordered by the value of a particular field (called the sort key) A hashed file uses a hash function applied to a particular field (called the hash key) to determine a record’s placement on disk Other primary file organizations, such as B-trees, use tree structures We discuss primary file organizations in Sections 17.6 through 17.9 A secondary organization or auxiliary access structure allows efficient access to file records based on alternate fields than those that have been used for the primary file organization Most of these exist as indexes and will be discussed in Chapter 18 17.2 Secondary Storage Devices In this section we describe some characteristics of magnetic disk and magnetic tape storage devices Readers who have already studied these devices may simply browse through this section 587 588 Chapter 17 Disk Storage, Basic File Structures, and Hashing 17.2.1 Hardware Description of Disk Devices Magnetic disks are used for storing large amounts of data The most basic unit of data on the disk is a single bit of information By magnetizing an area on disk in certain ways, one can make it represent a bit value of either (zero) or (one) To code information, bits are grouped into bytes (or characters) Byte sizes are typically to bits, depending on the computer and the device We assume that one character is stored in a single byte, and we use the terms byte and character interchangeably The capacity of a disk is the number of bytes it can store, which is usually very large Small floppy disks used with microcomputers typically hold from 400 KB to 1.5 MB; they are rapidly going out of circulation Hard disks for personal computers typically hold from several hundred MB up to tens of GB; and large disk packs used with servers and mainframes have capacities of hundreds of GB Disk capacities continue to grow as technology improves Whatever their capacity, all disks are made of magnetic material shaped as a thin circular disk, as shown in Figure 17.1(a), and protected by a plastic or acrylic cover Figure 17.1 (a) A single-sided disk with read/write hardware (b) A disk pack with read/write hardware Track (a) Actuator Arm Read/write head Spindle Disk rotation (b) Cylinder of tracks (imaginary) Actuator movement 17.2 Secondary Storage Devices 589 A disk is single-sided if it stores information on one of its surfaces only and doublesided if both surfaces are used To increase storage capacity, disks are assembled into a disk pack, as shown in Figure 17.1(b), which may include many disks and therefore many surfaces Information is stored on a disk surface in concentric circles of small width,3 each having a distinct diameter Each circle is called a track In disk packs, tracks with the same diameter on the various surfaces are called a cylinder because of the shape they would form if connected in space The concept of a cylinder is important because data stored on one cylinder can be retrieved much faster than if it were distributed among different cylinders The number of tracks on a disk ranges from a few hundred to a few thousand, and the capacity of each track typically ranges from tens of Kbytes to 150 Kbytes Because a track usually contains a large amount of information, it is divided into smaller blocks or sectors The division of a track into sectors is hard-coded on the disk surface and cannot be changed One type of sector organization, as shown in Figure 17.2(a), calls a portion of a track that subtends a fixed angle at the center a sector Several other sector organizations are possible, one of which is to have the sectors subtend smaller angles at the center as one moves away, thus maintaining a uniform density of recording, as shown in Figure 17.2(b) A technique called ZBR (Zone Bit Recording) allows a range of cylinders to have the same number of sectors per arc For example, cylinders 0–99 may have one sector per track, 100–199 may have two per track, and so on Not all disks have their tracks divided into sectors The division of a track into equal-sized disk blocks (or pages) is set by the operating system during disk formatting (or initialization) Block size is fixed during initialization and cannot be changed dynamically Typical disk block sizes range from 512 to 8192 bytes A disk with hard-coded sectors often has the sectors subdivided into blocks during initialization Blocks are separated by fixed-size interblock gaps, which include specially coded control information written during disk initialization This information is used to determine which block on the track follows each (a) Track Sector (arc of track) (b) Three sectors Two sectors One sector 3In some disks, the circles are now connected into a kind of continuous spiral Figure 17.2 Different sector organizations on disk (a) Sectors subtending a fixed angle (b) Sectors maintaining a uniform recording density 590 Chapter 17 Disk Storage, Basic File Structures, and Hashing interblock gap Table 17.1 illustrates the specifications of typical disks used on large servers in industry The 10K and 15K prefixes on disk names refer to the rotational speeds in rpm (revolutions per minute) There is continuous improvement in the storage capacity and transfer rates associated with disks; they are also progressively getting cheaper—currently costing only a fraction of a dollar per megabyte of disk storage Costs are going down so rapidly that costs as low 0.025 cent/MB—which translates to $0.25/GB and $250/TB—are already here A disk is a random access addressable device Transfer of data between main memory and disk takes place in units of disk blocks The hardware address of a block—a combination of a cylinder number, track number (surface number within the cylinder on which the track is located), and block number (within the track) is supplied to the disk I/O (input/output) hardware In many modern disk drives, a single number called LBA (Logical Block Address), which is a number between and n (assuming the total capacity of the disk is n + blocks), is mapped automatically to the right block by the disk drive controller The address of a buffer—a contiguous Table 17.1 Specifications of Typical High-End Cheetah Disks from Seagate Description Cheetah 15K.6 Cheetah NS 10K Model Number Height Width Length Weight ST3450856SS/FC 25.4 mm 101.6 mm 146.05 mm 0.709 kg ST3400755FC 26.11 mm 101.85 mm 147 mm 0.771 kg 450 Gbytes 400 Gbytes 8 Capacity Formatted Capacity Configuration Number of disks (physical) Number of heads (physical) Performance Transfer Rates Internal Transfer Rate (min) Internal Transfer Rate (max) Mean Time Between Failure (MTBF) 1051 Mb/sec 2225 Mb/sec 1211 Mb/sec 1.4 M hours Seek Times Avg Seek Time (Read) Avg Seek Time (Write) Track-to-track, Seek, Read Track-to-track, Seek, Write Average Latency Courtesy Seagate Technology 3.4 ms (typical) 3.9 ms (typical) 0.2 ms (typical) 0.4 ms (typical) ms 3.9 ms (typical) 4.2 ms (typical) 0.35 ms (typical) 0.35 ms (typical) 2.98 msec 1158 Index Phantoms, transaction support in SQL, 771 PHP arrays, 486–488 bibliographic references, 497 collecting data from forms and inserting records, 493–494 connecting to databases, 491–493 features, 484–485 functions, 488–490 overview of, 481–482 retrieval queries, 494–495 server variables and forms, 490–491 simple example of, 482–484 summary and exercises, 496–497 variables, data types, and constructs, 485–486 PHP Extension and Application Repository (PEAR), 491 Phrase queries, types of queries in IR systems, 1008 Physical clustering, of records on disks, 617 Physical data independence, in three-schema architecture, 36 Physical data models, 30 Physical database design See also Database design bibliographic references, 740 data organization in, 587 denormalization as design decision related to query speed, 731–732 in ER (Entity-Relationship) model, 202 factors influencing, 727–729 indexing decisions, 730–731 overview of, 9, 326–327 summary and exercises, 739–740 tuning and, 735–736 Physical database file structures, 583 Physical database phase, in database design, 311 Physical indexes vs logical, 668 ordering primary and clustering indexes, 642 Physical problems/catastrophes, recovery needed due to, 751 Physical relationships, between file records, 617 Pile file (heap), 602 Pipelined evaluation, converting query trees into query execution plans, 710 Pipelining, combining operations using, 700 Pivoting (rotation) functionality of data warehouses, 1078 working with data cubes, 1070–1072 PL/SQL designing database programming language from scratch, 449 impedance mismatch and, 450 writing database applications with, 447 Plaintext, 864 Point events (facts), in temporal databases, 946 Pointers, blocks of data and, 597 Points on maps, 959–960 in temporal databases, 945 Policies access control for e-commerce and Web, 854–855 flow policies, 860 for label-based security, 853 security policies, 836 Polygons, on maps, 960 Polyinstantiation, in mandatory access control, 849–850 Polymorphism (operator overloading) defined, 369 in OO systems, 357 overview of, 367–368 specifying in SQL, 375–376 populating (loading) databases, 33 Populations, in statistical database security, 859 Positional iterator, SQLJ, 461–462 Positive literals, in Datalog language, 973 Precedence graph (serialization graph), 763–765 Precision metrics finding relevant information and, 1019 measures of relevance in IR, 1015–1017 Precision, vs security, 841 Precompilers DML commands and, 42 embedded SQL and, 452 in SQL programming, 449 Predicate-defined (conditiondefined) subclasses, 252, 264 Predicate dependency graph, 982 Predicate locking, 801 Predicates as arity or degree of p, 973 built-in, 972–973 fact-defined and rule-defined, 978 interpretation of, 976 in Prolog languages, 970–972 relational schemas and, 66 Prediction, as goal of data mining, 1037 Preprocessors embedded SQL and, 452 in SQL programming, 449 in Web usage analysis, 1025–1027 Presentation layer (client), in threetier client/server architecture, 892 Pretty Good Privacy (PGP), 854 Primary file organization B-trees as, 651 data organization and, 587 Primary indexes cost functions for SELECT operations, 713 methods for simple selection, 686 for ordered records (sorted files), 605 overview of, 633–635 searching nondense multilevel primary index, 646 tables comparing index types, 642 types of ordered indexes, 632 PRIMARY KEY clause, CREATE TABLE command, 95 Primary keys defined, 519 normal forms based on, 516–517 primary indexes and, 633 relational model constraints, 69 Primary site, concurrency control techniques for distributed databases, 910–911 Primary storage, 584 Prime attributes, 519, 526 Printer servers, in client/server architecture, 45 Privacy information privacy vs information security, 841–842 issues in database security, 866–867 protecting in statistical databases, 859 Private keys, in public (asymmetric) key algorithms, 864 Privileged software, 19 Index Privileges discretionary, 842–844 granting/revoking, 111, 844–846 limits on propagation of, 846–847 unauthorized escalation and abuse, 855, 858 views for specifying, 844 Proactive updates, valid time relations and, 949 Probabilistic model, for information retrieval, 1005–1006 Procedural DMLs, 37–38 Process-driven design, 310 PROCESS RULES, in active database systems, 938 Processes in database design, 322 multiprogramming and, 744 Processors, parallel, 1079 Program-data independence, 11–12, 23–24 Program-operation independence, 12 Program variables, 599 Programming languages advantages/disadvantages of, 477 approaches to database programming, 449 DBMS, 36–38 impedance mismatch and, 450 object-orientation creating compatibility between, 369 Web databases See PHP XML, 432–436 Programs, insulation between programs and data, 11–13 PROJECT operations algorithms for, 696–697 Query processing and optimizing, 696–697 in relational algebra, 149–150 Projection attributes, SELECT command and, 98 Projective operators, types of spatial operators, 961 Prolog language See also Datalog language logic programming and, 970 notation, 970–973 Proof-theoretic interpretation, of rules in deductive databases, 975 Properties, of association rules, 1041 Properties of relational decompositions dependency preservation, 552–553 dependency-preserving and nonadditive join decomposition into 3NF schemas, 560–563 dependency-preserving decomposition into 3NF schemas, 558–559 insufficiency of normal forms and, 552 nonadditive join decomposition into BCNF schemas, 559–560 nonadditive (lossless) join, 553–556 overview of, 544, 551 successive nonadditive join decompositions, 557 testing binary decompositions for nonadditive join property, 553–556 Protocols concurrency control, 777 deadlock prevention, 785–787 for ensuring serializability of transaction schedules, 767–768 Proximity queries, 1008 PSM (Persistent stored modules), 474–476 Public (asymmetric) key algorithms, 863–865 Public keys, in public (asymmetric) key algorithm, 864 Publishing XML documents, 431 Punctuation marks, text preprocessing in information retrieval, 1011 Pure time conditions, 955 QBE (Query-By-Example) basic retrieval in, 1091–1095 domain calculus and, 183, 185 grouping, aggregation, and database modification in, 1095–1098 overview of, 1091 QMF (Query Management Facility), 185 Quadtrees, 963 Qualified aggregations, in UML class diagrams, 228 Qualified associations, in UML class diagrams, 228 Qualifier conditions, XPath, 432 Quality control, data warehousing and, 1080 Quantifiers collection operators in OQL, 403–405 existential and universal, 177–178 1159 transforming, 180 using in queries, 180–182 Queries See also OQL (object query language); SQL (Structured Query Language) content-based retrieval, 965 database tuning and, 736–738 defined, design decisions related to query speed, 731–732 evaluating nonrecursive Datalog queries, 981–983 information retrieval, 1007–1009 interactive interface for, 40 IR systems, 1007–1009 keyword-based, 39 modes of interaction in IR systems, 999 physical database design and, 728–729 processing in databases, 19–20 in Prolog languages, 973 retrieval queries from database tables, 494–495 spatial, 958, 961 statistical, 859 TSQL2, 954–956 Query blocks, 681 Query-By-Example See QBE (Query-By-Example) Query compilers, 41 Query decomposition, 905–907 Query execution plans converting query trees into, 709–710 creating, 679 Query graphs creating, 679 notation for, 179–180, 701–703 Query languages DML as, 38 for federated databases, 886 SQL See also SQL (Structured Query Language) TSQL2 See also SQL (Structured Query Language) Query Management Facility (QMF), 185 Query mapping, 901 Query modification, 135 Query optimizer, 41, 679 Query processing and optimizing aggregate functions, 698–699 bibliographic references, 725 catalog information used in cost functions, 712–713 1160 Index converting query trees into query execution plans, 709–710 cost components of query execution, 711–712 cost functions for JOIN, 715–718 cost functions for SELECT, 713–715 DBMS module for, 20 disjunctive selection conditions, 688 external sorting, 682–685 heuristic algebraic optimization algorithm, 708–709 heuristic optimization of query trees, 703–706 heuristics used in query optimization, 700–701 hybrid hash-join, 696 implementing JOIN operations, 689–690 implementing SELECT operations, 685 join selection factors, 693–694 multiple relation queries and JOIN ordering, 718–719 nested-loop joins, 690–693 notation for query trees and query graphs, 701–703 operations, 700 OUTER JOIN operations, 699–700 overview of, 679–681 partition-hash joins, 694–696 PROJECT operations, 696–697 query optimization in Oracle, 721–722 search methods for complex selection, 686–687 search methods for simple selection, 685–686 selectivity and cost estimates in query optimization, 710–711 selectivity of conditions and, 687–688 semantic query optimization, 722–723 set operations, 697–698 summary and exercises, 723–725 transformation rules for relational algebra operations, 706–708 translating SQL queries into relational algebra, 681–682 Query processing and optimizing, in distributed databases data transfer costs for distributed query processing, 902–904 distributed query processing using semijoin operation, 904 overview of, 901–902 query update and decomposition, 905–907 Query results cursors for looping over tuples in, 450 ordering, 106–107 path expressions and, 400–402 retrieval queries from database tables, 494–495 Query (transaction) server, in twotier client/server architecture, 47 Query trees converting into query execution plans, 709–710 creating, 679 notation for, 163–165, 701–703 optimization of, 703–706 R-Trees, for spatial indexing, 962 RAID (Redundant Array of Inexpensive Disks) levels, 620–621 overview of, 617–619 performance improvements, 619–620 reliability improvements, 619 RAM (Random Access Memory), 585 Random access storage devices, 592 Randomizing function (hash function), 606 Range queries, 686, 961 Range relations, of tuple variables, 175–176 Rational Rose data modeler, 338 database design with, 337 tools and options for data modeling, 338–342 RBAC (role-based access control), 851–852 RBG (red, blue, green) colors, 967 RDBMS (relational database management systems) creating indexes, 731 ORDBMS (object-relational database management systems), 354 providing application flexibility, 23–24 two-tier client/server architectures and, 46 RDBs (relational databases) designing See relational database design overview of, 395–396 schemas See relational database schemas RDF (Resource Description Framework), 436 Reachability, of objects, 363 Read command, hard disks, 591 Read-only transaction, 745 READ operation, transactions, 751 Read (or Get) operation, on files, 600 Read phase, of optimistic concurrency control, 794 Read-set, of transaction, 747 Read timestamp, 789 Read-write conflicts, in transaction schedules, 757 Read/write heads, on hard disks, 591 Read/write, OSs controlling disk read/write, 40 Read-write transactions, 745–747 read_item(X), 746 Real-time database technology, Reasoning mechanisms, in knowledge representation, 268 Recall metrics, in IR, 1015–1017, 1019 Recall/precision curve, in IR, 1017 Record-at-a-time DMLs, 38 Record-based data models, 31 Record pointers, 609 Records See also Files (of records) anchor record (block anchor), 633 blocking, 597 catalog information used in query cost estimation, 712 fixed-length and variable-length, 595–597 inserting, 493–494 mixed, 616–617 ordered (sorted files), 603–606 phantom records, concurrency control techniques, 800–801 placing file records on disk, 594 spanned/unspanned, 597–598 in SQL/CLI, 464–468 types of, 594–595 unordered (heap files), 601–602 Recoverability, transaction schedules based o, 757–759 Recovery See also Backup and recovery; Database recovery techniques Index transaction management in distributed databases, 912–913 types of failures and, 750–751 Recursive closure operations, in relational algebra, 168–169 Recursive relationships, 168, 215 Recursive rules, in Prolog languages, 972 Red, blue, green (RBG) colors, 967 REDO phase, of ARIES recovery algorithm, 823 Redo transaction, 753 REDO, write-ahead logging and, 810–811 Redundancy, controlling in databases, 17–18 Redundant Array of Inexpensive Disks (RAID) See RAID (Redundant Array of Inexpensive Disks) REF keyword, specifying relationships via reference, 376 Reference types, OIDs using, 373–374 References foreign key, 73 representing object relationships, 360 specifying relationships via reference, 376 Referencing relations, 73 Referential integrity constraints inclusion dependencies and, 571 integrity constraints in databases, 21 relational data model and, 73–74 specifying in SQL, 95–96 Reflexive associations, in UML class diagrams, 227 Regression function, 1058 Regression, in data mining, 1057–1058 Regression rule, 1057 Regular entity types, 219, 287–288 Relation extension, 62 Relation intension, 62 Relation nodes notation for, 703 in query graphs, 179 Relation schemas domains and, 61 goodness of, 501–502 in relational databases, 501 Relation (table) level, assigning privileges at, 842–843 Relational algebra aggregate functions and grouping, 166–168 bibliographic references, 194–195 CARTESIAN PRODUCT operation, 155–157 complete set of relational algebra operations, 161, 164 DIVISION operation, 162–163 EQUIJOIN and NATURAL JOIN operations, 159–161 examples of queries in, 171–174 generalized projection, 165–166 JOIN operation, 157–158 notation for query trees, 163–165 OUTER JOIN operations, 169–170 OUTER UNION operation, 170–171 overview of, 145–146 PROJECT operation, 149–150 recursive closure operations, 168–169 RENAME operation, 151–152 SELECT operation, 147–149 sequences of operations, 151 summary and exercises, 185–194 transformation rules for operations, 706–708 translating SQL queries into, 681–682 UNION, INTERSECTION, and MINUS operations, 152–155 Relational calculus domain (relational) calculus, 183–185 overview of, 146–147 tuple relational calculus See Tuple relational calculus Relational completeness, of relational query languages, 174 Relational data model bibliographic references, 85 characteristics of relations, 63–66 classifying DBMSs and, 49 concepts, 60–61 constraints, 67–70 correspondence to ER model, 293 Delete operation, 77–78 domains, attributes, tuples, and relations, 61–63 formal languages for See Relational algebra; Relational calculus Insert operation, 76–77 1161 integrity, referential integrity, and foreign keys, 73–74 in list of data model types, 31 mapping from EER model to See EER-to-Relational mapping mapping from ER model to See ER-to-Relational mapping notation, 66–67 other types of constraints, 74–75 overview of, 50, 59–60 practical language for See SQL (Structured Query Language) schemas, 70–73 SQL compared with, 97 summary and exercises, 79–85 transactions and, 79 update operations, 75–76, 78–79 Relational database design algorithms for, 557, 566–567 attribute semantics in, 503–507 bibliographic references, 302, 579 bottom-up approach to, 544 Boyce-Codd normal form (BCNF), 529–531 dependency preservation properties of decompositions, 552–553 dependency-preserving and nonadditive join decomposition into 3NF schemas, 560–563 dependency-preserving decomposition into 3NF schemas, 558–559 disallowing possibility for spurious tuples, 510–513 domain-key normal form (DKNF), 574–575 equivalence of sets of functional dependencies, 549 first normal form (1NF), 519–523 formal analysis of relational schemas, 513 formal definition of fourth normal form, 533–534, 568–570 functional dependencies based on arithmetic functions and procedures, 572–574 functional dependency and, 513–516 general definition of second normal form, 526–527 general definition of third normal form, 528 goodness of relational schemas, 501–502 1162 Index inclusion dependencies, 571–572 inference rules for functional and multivalued dependencies, 568 inference rules for functional dependencies, 545–549 informal guidelines for relational schemas, 503, 513 join dependencies and fifth normal form, 534–535 key definitions, 518–519 mapping from EER model to relational model See EER-toRelational mapping mapping from ER model to relational model See ER-toRelational mapping minimal sets of functional dependencies, 549–551 multivalued dependency and fourth normal form, 531–533 nonadditive join decomposition into 4NF relations, 570 nonadditive join decomposition into BCNF schemas, 559–560 nonadditive (lossless) join properties of decompositions, 553–556 normal forms based on primary keys, 516–517 normalization of relations, 517–518 NULL values and dangling tuples and, 563–565 overview of, 285 practical use of normal forms, 518 reducing NULL values in tuples, 509–510 reducing redundant information in tuples, 507–509 relational decomposition and insufficiency of normal forms, 552 second normal form (2NF), 523 successive nonadditive join decompositions, 557 summary and exercises, 299–301, 575–578 template dependencies, 572 testing binary decompositions for nonadditive join property, 557 third normal form (3NF), 523–525 top-down and bottom-up approaches, 502 tuning and, 733 Relational database management systems See RDBMS (relational database management systems) Relational database schemas algorithms for schema design, 557 bibliographic references, 542 clear semantics for attributes in, 503–507 components of, 70–73 disallowing possibility for spurious tuples, 510–513 formal analysis of, 513 functional dependency and, 513–516 informal guidelines, 503, 513 overview of, 501–502 reducing NULL values in tuples, 509–510 reducing redundant information in tuples, 507–509 relation schemas in, 501 summary and exercises, 535–542 Relational database state, 70 Relational design by analysis, 543 Relational design by synthesis, 544 Relational expressions, 983 Relational OLAP (ROLAP), 1079 Relational operators in deductive database systems, 980–981 relational expressions and, 983 Relations (relation states) See also Tables alternative definition of, 64–65 column-based storage of, 669–670 defined, 61 interpretation (meaning) of, 66 legality of, 514 normalization of, 517–518 ordering tuples in, 63 ordering values within tuples, 64 overview of, 62–63 values and NULLS in tuples, 65–66 Relations, temporal bitemporal time, 950–952 transaction time, 949–950 valid time, 947–949 Relationship relation (lookup table) mapping of binary 1:1 relationship types, 289 mapping of binary 1:N relationship types, 290 mapping of binary M:N relationship types, 290–291 Relationships in data modeling, 31 in ODMG object model, 386 references to, 360 representing in OO systems, 356 specifying by reference, 376 symbols for, 1084 University student database example, Relationships, in EER model class/subclass relationships, 247 specific relationship types and, 249–250 Relationships, in ER model attributes of relationship types, 218 constraints on binary relationship types, 216–218 degree of relationship greater than two, 228–232 degree of relationship type, 213–214 overview of, 212 relationship types, sets, and instances, 212–213 relationships as attributes, 214 role names and recursive relationships, 215 Relevant sets, in probabilistic model for IR, 1005 Reliability, in distributed databases, 881, 882 Remote commands, for SQL injection attacks, 857 RENAME operation, in relational algebra, 151–152 Reorganize operation, on files, 600 Repeating field or groups, in file records, 595 Repeating history, in ARIES recovery algorithm, 821 Replication active rules for maintaining consistency of replicated tables, 943 in distributed databases, 897 example of fragmentation, allocation, and replication, 898–901 transparency of, 880 Representational (or implementation) data models, 31 Requirements collection and analysis phase in database design, 200, 311–313 database design starting with, of information system (IS) life cycle, 307 Index Reset operations, on files, 599 Resource Description Framework (RDF), 436 Response time, physical database design and, 326 Restrict option, of delete operation, 77 Result equivalence, of transaction schedules, 762 Result relations, 75 Result tables, in QBE, 1095 Retrieval operations database design and, 728 from database tables, 494–495 on files, 599 modes of interaction in IR systems, 999 objects, 362 QBE (Query-By-Example), 1091–1095 types of relational data model operations, 75 Retrieval transactions, 322 Retroactive update, valid time relations and, 949 Return values, of PHP functions, 490 Reverse engineering, Rational Rose and, 338 Revoking privileges, 844, 845–846 Rewrite blocks, file organization and, 602 Rewrite time, as disk parameter, 1089 RIFT (rotation invariant feature transform), 968 Rigorous two-phase locking, 785 Rivest, Ron, 865 ROLAP (relational OLAP), 1079 Role-based access control (RBAC), 851–852 Role hierarchy, in role-based access control, 851 Role names, and recursive relationships, 215 Roll-up display functionality of data warehouses, 1078 working with data cubes, 1070–1072 ROLLBACK (or ABORT) operation, 752 Rollbacks, in database recovery, 813–815, 950 Root element, XML schema language, 429 Root tag, XML documents, 423 Roots, of tree structures, 646 Rotation See Pivoting (rotation) Rotation invariant feature transform (RIFT), 968 Rotational delay (rd) as disk parameter, 1087 on hard disks, 591 Row-level access control, 852–853 Row-level triggers, 937 Rows See Tuples (rows) Rows, in SQL, 89 RSA encryption algorithm, 865 Rule consideration, in active databases deferred consideration, 942 overview of, 938–939 Rule-defined predicates (views), 978 Rule sets, in active database systems, 938 Rules, in deductive databases interpretation of, 975–977 overview of, 21, 932 in Prolog/Datalog notation, 970–972 safe, 979–980 Runtime database processor DBMS component modules, 42 query execution and, 679 Runtime, specifying SQL queries at, 458–459 Safe expressions, in tuple relational calculus, 182–183 Safe rules, in deductive databases, 979–980 Sampling algorithm, in data mining, 1042 SANs (Storage Area Networks), 621–622 Saturation, hue, saturation, and value (HSV), 967 SAX (Simple API for XML), 423 Scale-invariant feature transform (SIFT), 968 Scan operations, files, 600 Scanner, for SQL, 679 Schedules (histories), of transactions characterizing based on recoverability, 757–759 characterizing based on serializability, 759–760 equivalence of, 768–770 overview of, 755–757 serial, nonserial, and conflictserializable schedules, 761–763 1163 testing conflict serializability of, 763–765 Schema conceptual design, 313–321 entity type describing for entity sets, 208 instances and database state and, 32–33 ontologies and, 272 relational See Relational database schemas relational data model and, 70–73 three-schema architecture See Three-schema architecture Schema construct, 32, 222 Schema diagram, 32 Schema evolution, 33 Schema matching, types of Web information integration, 1023 Schema, SQL change statements, 137–139 names, 89 overview of, 89–90 Schema (view) integration, 316–317, 319–321 Schemaless XML documents, 422 Scientific applications, 25 Scope, variable, 490 Scripting languages, PHP as, 482 SCSI (Small Computer System Interface), 591 SDL (storage definition language), 37, 110 Search engines overview of, 998–999 vertical and metasearch, 1018 Search fields, 648 Search trees, 647–649 Searches conversational, 1029–1030 faceted, 1028–1029 information retrieval See IR (Information Retrieval) measures of relevance, 1014–1015 methods for complex selection, 686–687 methods for simple selection, 685–686 navigational, informational, and transactional, 996 social searches, 1029 Web See Web search and analysis Second normal form (2NF) general definition of, 526–527 overview of, 523 Secondary access path, 631 1164 Index Secondary file organization, 587 Secondary indexes advantages of, 668 cost functions for SELECT, 714 methods for simple selection, 686 overview of, 636–642 tables comparing index types, 642 types of ordered indexes, 632–633 Secondary keys, 636 Secondary storage, 584, 711 Secret key algorithms, 863 Sectors, of hard disk, 589 Security vs precision, 841 Web security, 1028 Security and authorization subsystem, DBMS, 19 Security, database See Database security Seek time (s) as disk parameter, 1087 on hard disks, 591 Segmentation, automatic analysis of images, 967 SELECT command, SQL aggregate functions used in, 125 basic form of, 97–98 FROM clause, 107 DISTINCT keyword with, 103 information retrieval with, 97 projection attributes and selection conditions, 98, 100 in SQL retrieval queries, 129–130 SELECT-FROM-WHERE structure, of SQL queries, 98–100 SELECT operations cost functions for, 713–715 disjunctive selection conditions, 688 on files, 599 implementing, 685 in relational algebra, 147–149 search methods for complex selection, 686–687 search methods for simple selection, 685–686 selectivity of conditions, 687–688 SELECT operator (σ), 147 Select-project-join queries, 179 Selection cardinality, 712 Selection conditions in domain calculus, 184 SELECT command and, 98, 100 SELECT operation and, 147 Selection, functionality of data warehouses, 1079 Selective inheritance, in ODBs (object databases), 368 Selectivity and cost estimates, in query optimization catalog information used in cost functions, 712–713 cost components of query execution, 711–712 cost functions for JOIN, 715–718 cost functions for SELECT, 713–715 multiple relation queries and JOIN ordering, 718–719 overview of, 710–711 Selectivity, of conditions, 687–688 Self-describing data, 10–11, 416 Semantic constraints relational model constraints, 68 template dependencies and, 572 types of constraints, 74 Semantic data models abstraction concepts in, 268 aggregation and association, 269–271 classification and instantiation, 268 compared with knowledge representation, 267–268 ER (Entity-Relationship) model, 245 identification, 269 for information retrieval, 1006–1007 specialization and generalization, 269 Semantic query optimization, 722–723 Semantic relationships, in semantic model for IR, 1006 Semantic Web, 272–273 Semantics approach to IR, 1000 of attributes, 503–507, 514 equivalence of transaction schedules and, 769–770 heterogeneity of in federated databases, 886–887 integrity constraints and, 21 tagging images, 969 Semijoin operation, 904 Semistructured data, 416–417 Separators, XPath, 432 Sequence diagrams, UML, 329, 331 Sequential order, in accessing data blocks, 592 Sequential patterns in data mining, 1037 describing knowledge discovered by data mining, 1039 discovery of, 1057 in pattern discovery phase of Web usage analysis, 1027 Serial schedules, 761 Serializability, of transaction schedules characterizing schedules based on, 759–760 serial, nonserial, and conflictserializable schedules, 761–763 testing conflict serializability of schedules, 763–765 used for concurrency control, 765–768 view serializability, 768–769 Serialization (precedence) graph, 763–765 Servers client program calling database server, 451 database servers, 42 DBMS module for, 29 parallel architecture for, 1079 PHP variables, 490–491 server level in two-tier client/ server architecture, 47 specialized servers in client/server architecture, 45–46 Set-at-a-time DMLs, 38 Set constructor, 359 SET DIFFERENCE operation algorithms for, 697–698 in relational algebra, 152–155 Set null (set default) option, in delete operations, 77–78 Set operations algorithms for, 697–698 query processing and optimizing, 697–698 SQL, 104 Set types, in network data model, 51 Sets equivalence of, 549 explicit sets of values in SQL, 122 SQL table as multiset of tuples, 97 tables as, 103–105 Shadow directory, 820 Shadow paging, 820–821 Shamir, Adi, 865 Shape, automatic analysis of images, 967 Shape descriptors, 965 Shared nothing architecture, 887–888 Index Shared subclasses (multiple inheritance), 256, 297 Shared variables, embedded SQL and, 452 Sharing data and multiuser transactions, 13–14 Sharing databases, Shrinking (second) phase, in twophase locking, 782 SIFT (scale-invariant feature transform), 968 Simple API for XML (SAX), 423 Simple (atomic) attributes, in ER model, 205–207 Simple Object Access Protocol (SOAP), 436 Simultaneous update, 949 Single inheritance, subclasses and, 256–257 Single-level indexes clustering indexes, 635–636 overview of, 632–633 primary indexes, 633–635 secondary indexes, 636–642 tables comparing index types, 642 Single-loop joins cost functions for, 716 methods for implementing joins, 689 Single-quoted strings, PHP text processing, 485–486 Single-relation options, for mapping specialization or generalization, 295 Single-sided disks, 589 Single time points, in temporal databases, 946 Single-user systems, 49 Single-user transaction processing system, 744–745 Single-valued attributes, in ER model, 206 Singular value decompositions (SVD), 967 Slice and dice, functionality of data warehouses, 1078 Small Computer System Interface (SCSI), 591 SMART document retrieval system, 998 SMP (symmetric multiprocessor), 1079 Snowflake schema, for multidimensional data models, 1073–1074 SOAP (Simple Object Access Protocol), 436 Social searches, 1029 Software costs, choosing a DBMS, 323 Software developers, 16 Software engineers database actors on the scene, 16 design and testing of applications, 199 Sort-merge joins cost functions for, 717 methods for implementing joins, 689–690 Sort-merge strategy, 683 Sorting external, 682–685 functionality of data warehouses, 1078 implementing aggregate operations, 699 ordered records (sorted files), 603–606 Space utilization, physical database design and, 326 Spamming, Web spamming, 1028 Spanned/unspanned organization, of records, 597 Sparse indexes, 633 Spatial analysis, 959 Spatial applications, 25 Spatial databases applications of spatial data, 964–965 data indexing, 961–963 data mining, 963–964 data types and models, 959–960 dynamic operators, 961 operators, 960–961 overview of, 957–959 Spatial joins/overlays, 961 Spatial outliers, 965 Special purpose DBMSs, 50 Specialization/generalization constraints on, 251–254 definitions, 264 design choices for, 263–264 EER-to-Relational mapping, 294–297 generalization, 250–251 hierarchies and lattices, 254–257 in knowledge representation, 269 notation for, 1084–1085 refining conceptual schemas, 257–258 specialization, 248–250 UML (Unified Modeling Language), 265–266 1165 Specialized servers, in client/server architecture, 45 Specific attributes (local attributes), of subclass, 249 Specific relationship types, subclasses and, 249–250 Specification, conceptualization and, 272 Speech input and output, queries and, 39 SQL-99, 942–943 SQL/CLI (Call Level Interface) database programming with, 464–468 library of functions, 448 SQL injection attacks code injection, 856 function call injection, 856–857 protecting against, 858 risks associated with, 857–858 SQL manipulation, 856 types of, 855 SQL programming techniques approaches to database programming, 449–450 bibliographic references, 479 database programming techniques and issues, 448–449 dynamic SQL, 448, 458–459 embedded SQL See Embedded SQL function calls See Function calls, database programming with impedance mismatch, 450 overview of, 447–448 sequence of interactions in, 451 SQL/PSM (SQL/Persistent Stored Modules) See SQL/PSM (SQL/ Persistent Stored Modules) summary and exercises, 477–478 SQL/PSM (SQL/Persistent Stored Modules) overview of, 473 specifying persistent stored modules, 475–476 stored procedures and functions, 473–475 SQL (Structured Query Language) See also Embedded SQL * (asterisk) for retrieving all attribute values of selected tuples, 102–103 aliases, 101–102 bibliographic references, 114 CHECK clauses for specifying constraints on tuples, 97 1166 Index clauses in simple SQL queries, 107 common data types, 92–94 CREATE TABLE command, 90–92 data definition in, 89 dealing with ambiguous attribute names, 100–101 DELETE command, 109 embedding SQL commands in Java, 459–461 external sorting, 682–685 INSERT command, 107–109 list of features in, 110–111 manipulation by SQL injection attacks, 856 missing or unspecified WHERE clauses, 102 naming constraints, 96–97 object-relational features in, 354 ordering query results, 106–107 overview of, 87–89 QBE compared with, 1098 schema and catalog concepts in, 89–90 SELECT-FROM-WHERE structure of queries, 98–100 servers, 47 specifying attribute constraints and default values, 94–95 specifying key and referential integrity constraints, 95–96 substring pattern matching and arithmetic operators, 105–106 summary and exercises, 111–114 tables as sets in, 103–105 temporal data types, 945 transaction support, 770–772 translating SQL queries into relational algebra, 681–682 UDT (user-defined types) in, 111 UPDATE command, 109–110 SQL (Structured Query Language), advanced features aggregate functions, 124–126 ALTER command, 138–139 bibliographic references, 143 clauses in retrieval queries, 129–130 comparisons involving NULL and three-valued logic, 116–117 correlated nested queries, 119–120 CREATE ASSERTION command, 131–132 CREATE TRIGGER command, 132–133 CREATE VIEW command, 134–135 DROP command, 138 EXISTS and NOT EXISTS functions, 120–122 explicit sets and renaming of attributes, 122 GROUP BY clause, 126–129 HAVING clause, 127–129 inline views, 137 nested queries, 117–119 outer and inner joins, 123–124 overview of, 115 schema change statements, 137 summary and exercises, 139–143 UNIQUE function, 122 view implementation and update, 135–137 views (virtual tables) in, 133–134 SQL (Structured Query Language), ODB extensions to dot notation for build path expressions, 376 encapsulation of operations, 374–375 inheritance and polymorphism, 375–376 OIDs (object identifiers) using reference types, 373–374 overview of, 369–370 specifying relationships via reference, 376 tables based on UDTs, 374 UDTs and complex structures for objects, 370–373 SQLJ embedding SQL command in Java, 459–461 retrieving multiple tuples using iterators, 461–464 SQLODE communication variable, 454 SQLSTATE communication variable, 454 Standards database approach and, 22 database design specification, 328 SQL, 88 Star schema, 1073 Starvation, concurrency control and, 788 State in ODMG object model, 382 relational database state, 70–72 transaction, 751–752 State constraints, 75 Statechart diagrams, UML, 329, 333 Statement-level active rules, in STARBURST example, 940–942 Statement-level triggers overview of, 937 in STARBURST example, 940 Statement records, in SQL/CLI, 464–468 Static (early) binding, in ODMS, 368 Static files, 601 Static hashing, 610 Static Web pages, 420 Statistical analysis, in pattern discovery phase of Web usage analysis, 1026 Statistical approach, to IR, 1000–1002 Statistical database security, 859–860 Statistical databases, 837–838, 874 Statistical queries, 859 Steal/no-steal techniques in database recovery, 811–812 UNDO/REDO recovery algorithm, 819 Stem, of words, 1010 Stemming, text preprocessing in information retrieval, 1010 Stopwords in keyword queries, 1007 removal, 1009–1010 text/document sources, 966 Storage allocation of file blocks on disk, 598 bibliographic references, 630 buffer management and, 593–594 column-based storage of relations, 669–670 cost components of query execution, 711 covert channels, 861 database storage, 586–587 database storage reorganization, 43 database tuning and, 733 file headers (descriptors) and, 598 file systems and See Files (of records) files, fixed-length records, and variable-length records, 595–597 hardware structures of disk devices, 588–592 iSCSI (Internet SCSI), 623–624 magnetic tape devices, 592–593 Index measuring capacity, 585 memory hierarchies and, 584–586 NAS (network-attached storage), 622–623 overview of, 583–584 parallelization of access See RAID (Redundant Array of Inexpensive Disks) placing file records on disk, 594 record blocking and, 597 records and record types, 594–595 SANs (Storage Area Networks), 621–622 secondary storage devices, 587 spanned/unspanned records, 597–598 summary and exercises, 624–630 Storage Area Networks (SANs), 621–622 Storage definition language (SDL), 37, 110 Storage medium, physical, 584 Stored attributes, in ER model, 206 Stored data manager module, DBMS, 40, 42 Stored procedures, 21, 473–475 Stream-based processing, 700 Streaming XML documents, 423 Strict hierarchies, 255 Strict schedules, 759 Strict timestamp ordering, 790–791 Strict two-phase locking, 784–785 Strings pattern matching, 105 PHP text processing, 485 Strong entity types, 219, 287 Struct (tuple) constructors, 358–359 Structural constraints, of relationships, 218 Structural diagrams, UML, 329 Structured data extracting, 1022 overview of, 416 vs unstructured, 993–994 Structured domains, in UML class diagrams, 227 Structured literals, 378 Subclasses in EER model, 246–248, 264 generalizing into superclasses, 250 as leaf classes in UML, 265 options for mapping specialization or generalization, 294 predicate-defined and userdefined, 252 shared, 256 specific attributes (local attributes) of, 249 specific relationship types and, 249–250 union types or categories, 258–260 Subset of Cartesian product, 63 Subsets, of attributes, 68–69 Substring pattern matching, in SQL, 105–106 Subtrees, 646 Subtypes, 247, 365–366 SUM function aggregate functions in SQL, 124–125 grouping and, 166, 168 implementing aggregate operations, 698 Superclass/subclass relationships in EER model, 264 overview of, 247 union types or categories, 258–260 Superclasses base class and, 265 in EER model, 246–248, 264 generalization and, 250 options for mapping specialization or generalization, 294 specialization and, 248 Superkeys defined, 518 relational model constraints, 69 Supertypes, 247, 365 Superuser accounts, 838 Supervised learning classification and, 1051 neural networks and, 1058 Support, for association rules, 1040 Surrogate keys, 298 Survivability, challenges in database security, 867 SVD (singular value decompositions), 967 Symmetric key algorithms, 863 Symmetric multiprocessor (SMP), 1079 Synonyms, thesaurus as collection of, 1010 Syntactic analysis, in semantic model for IR, 1006 System accounts, 838 catalog, 42 definition in database application life cycle, 308 1167 recovery needed due to system error, 750 security issues at system level, 836 System designers, 16 System environment DBMS module, 40–42 tools, application environments, and communication facilities, 43–44 utilities for, 42–43 System independent mapping, in choosing a DBMS, 326 System logs See also Logs/logging auditing and, 839–840 database recovery and, 808 tracking transaction operations, 753–754 Systems analyst, 16 Table inheritance, in SQL, 376 Tables ALTER TABLE command, 138–139 assigning privileges at table level, 842–843 base tables (relations) vs virtual relations, 90 basing on UDTs, 374 DROP TABLE command, 138 in relational model, 60, 61 retrieval queries from database tables, 494–495 in SQL, 89 SQL table as multiset of tuples, 97, 103–105 virtual See Views Tags HTML, 418–419 semistructured data and, 417 Tape jukeboxes, 586 Tape, magnetic, 592–593 Tape reel, 592 Taxonomies, 272 Technical metadata, in data warehousing, 1078 Templates dependencies, 572 in Query-By-Example, 1091 Temporal aggregation, 957 Temporal databases attribute versioning for incorporating time in OODBs, 953–954 bitemporal time relations, 950–952 options for storing tuples in temporal relations, 952–953 overview of, 943–945 1168 Index querying constructs using TSQL2 language, 954–956 time representation, calendars and time dimensions, 945–947 time series data, 957 transaction time relations, 949–950 valid time relations, 947–949 Temporal intersection join, 952 Temporal normal form, 952 Temporal variables, 948 Temporary updates (dirty reads), concurrency control and, 748–749 Term frequency-inverse document frequency See TF-IDF (term frequency-inverse document frequency) Terminated state, transactions, 752 Terms (keywords) modes of interaction in IR systems, 999 sets of terms in Boolean model for IR, 1002 Ternary relationships choosing between binary and ternary relationships, 228–231 constraints on, 232 in ER (Entity-Relationship) model, 213–214 Tertiary storage, 584, 586 Testing conflict serializability of schedules, 763–765 in database application life cycle, 308 Texels (texture elements), 967 Text preprocessing in information retrieval, 1009–1012 sources in multimedia databases, 966 storing XML document as, 431 Texture, automatic analysis of images, 967 TF-IDF (term frequency-inverse document frequency) applying to inverted indexing, 1013 in vector space model for IR, 1003–1004 Thematic analysis, for spatial databases, 959 Theorem proving, in deductive databases, 976 Thesaurus ontologies, 272 text preprocessing in information retrieval, 1010–1011 Third normal form (3NF) dependency-preserving and nonadditive join decomposition into, 558–563 dependency-preserving decomposition into, 558–559 general definition of, 528 overview of, 523–525 Thomas’s write rule, 791 Threats, to database security, 836–837 Three-phase commit (3PC) protocol, 908 three-schema architecture data independence and, 35–36 levels of, 34–35 overview of, 33 Three-tier architectures client/server architecture, 892–894 PHP, 482 for Web applications, 47–49 Three-valued logic, 116–117 Time constraints, on queries and transactions, 729 TIME data type, 945 Time dimensions, in temporal databases, 945–947 Time periods, in temporal databases, 946 Time representation, in temporal databases, 945–947 Time series management systems, 957 patterns in, 1039, 1057 as specialized database applications, 25 in temporal databases, 946, 957 Time-varying attributes, 953 Timeouts, for dealing with deadlocks, 788 TIMESTAMP data type, SQL, 93, 945 Timestamp ordering (TO) basic, 789–790 for concurrency control, 777 multiversion technique based on, 792 strict timestamp ordering, 790–791 Thomas’s write rule, 791 Timestamps overview of, 789 read and write, 789 transaction time relations and, 949 Timing channels, covert, 861 TO See Timestamp ordering (TO) Tool developers, 17 Tools, DBMS, 43–44 Top-down methodology for conceptual refinement, 257 for database design, 502 for schema design, 315–316 Topical relevance, in IR, 1015 Topological operators, 960 Topological relationships, among spatial objects, 959 Topologies, network, 879 Total categories, 260 Total participation, binary relationships and, 217 Total specialization constraint, 253 Tracks, on hard disks, 589 Trade-off analysis, 345 Training costs, in choosing a DBMS, 323–324 Transaction-id, 753 Transaction processing systems ACID properties, 754–755 bibliographic references, 775 characterizing schedules based on recoverability, 757–759 characterizing schedules based on serializability, 759–760 commit point of transactions, 754 concurrency control, 747–750 database design and, 306 equivalence of schedules, 769–770 overview of, 743–744 recovery, 750–751 schedules (histories) of transactions, 756–757 serial, nonserial, and conflictserializable schedules, 761–763 serializability used for concurrency control, 765–768 single-user vs multiuser, 744–745 SQL support for transactions, 770–772 summary and exercises, 772–774 system log, 753–754 testing conflict serializability of schedules, 763–765 transaction states and operations, 751–752 transactions, database items, read/write operations, and DBMS buffers, 745–747 view equivalence and view serializability, 768–769 Index Transaction processing systems, in distributed databases catalog management, 913 concurrency control, 909–912 operating system support, 909 overview of, 907–908 recovery, 912–913 two-phase and three-phase commit protocols, 908–909 Transaction Table, in ARIES recovery algorithm, 822 Transaction time, in temporal databases, 946 Transaction time relations, in temporal databases, 949–950 Transaction timestamp, 786 Transactional databases, distinguishing data warehouses from, 1069 Transactional searches, 996 Transactions ACID properties, 754–755 canned, 15 commit point of, 754 committed and aborted, 750 defined, designing, 322–323 interactive, 801 multiuser, 13–14 recovery needed due to transaction error, 750 relational data model and, 79 schedules (histories) of, 756–757 SQL transaction control commands, 111 states and operations, 751–752 throughput in physical database design, 327 types of, 745 Transfer rate (tr), disk blocks, 1088 Transformation approach, to image database queries, 966 Transience collections, 367 data, 586 object lifetime and, 378 objects, 355, 363 Transition constraints, 75 Transition tables, in STARBURST example, 940 Transitive closure, of relations, 168 Transitive dependencies, in 3NF, 523–524 Transparency autonomy as complement to, 882 in distributed databases, 879–881 Tree data models See Hierarchical data models Tree structures See also B+-trees; B-trees decision making in database design, 730 FP-tree (frequent-pattern tree) algorithm, 1043–1045 leaf-deep trees, 718 overview of, 646–647 R-trees, 962 search trees, 647–649 specialization hierarchy, 255 TV-trees (telescoping vector trees), 967 Triggers active rules specified by, 933 associating with database tables, 21 before, after, and instead triggers, 938 CREATE TABLE command, 132–133 CREATE TRIGGER command, 936 creating in SQL, 111 overview of, 932 row-level and statement-level, 937 specifying constraints, 74 in SQL-99, 942–943 Truth values, of atoms, 184 TSQL2 language, 954–956 Tuning databases design, 735–736 guidelines for, 738–739 implementation and, 311 indexes, 734–735 overview of, 733–734 queries, 736–738 system implementation and tuning, 327–328 Tuple-based constraints, 97 Tuple relational calculus examples of queries in, 178–179 existential and universal quantifiers, 177–178 expressions and formulas, 176–177 notation for query graphs, 179–180 overview of, 174–175 safe expressions, 182–183 SQL based on, 88 transforming universal and existential quantifiers, 180 1169 tuple variables and range relations, 175–176 universal quantifier used in queries, 180–182 Tuple versioning approach, to implementing temporal databases, 947–953 bitemporal time relations, 950–952 implementation considerations, 952–953 transaction time relations and, 949–950 valid time relations and, 947–949 Tuples (rows) classification in mandatory access control, 848 combining using JOIN operation, 157–158 comparison of values in, 118 component values of, 67 dangling tuples in relational design, 563–565 defined, 61 disallowing spurious, 510–513 eliminating duplicates, 150 hypothesis tuples, 572 n-tuple for relations, 62 ordering in relations, 64 ordering values within, 64–65 reducing NULL values in, 509–510 reducing redundant information in, 507–509 retrieving all attribute values of selected, 102–103 retrieving multiple tuples in SQLJ, 461–464 retrieving multiple tuples using cursors, 455–457 SQL table as multiset of, 97 storing in temporal relations, 952–953 unspecified WHERE clause and, 102 valid time relations and, 948 values and NULLS in, 65–66 versioning for incorporating time in relational databases, 953 Tuples variables aliases and, 101 looping with iterators, 98 range relations and, 175–176 TV-trees (telescoping vector trees), 967 1170 Index Two-phase commit (2PC) protocol recovery in multidatabase systems, 825–826 transaction management in distributed databases, 908 Two-phase locking basic locks, 784 binary locks, 778–780 conversion of locks, 782 overview of, 777–778 serializability guaranteed by, 782–784 shared/exclusive (read/write) locks, 780–782 variations on two-phase locking, 784–785 Two-tier client/server architecture, 46–47 Two-way joins, 689 Type (class) hierarchies constraints on extents corresponding to, 366–367 inheritance and, 369 in OO systems, 356 simple model for inheritance, 364–366 Type-compatible relations, 697 Type constructors atom constructor, 358 collection constructor, 359 defined, 369 ODB features included in SQL, 370 ODL and, 359–360 struct (tuple) constructor, 358–359 Type generator, 358–359 UDT (user-defined types) creating, 370–373 in SQL, 111 tables based on, 374 UML (Unified Modeling Language) class diagrams, 226–228 for database application design, 329 as design specification standard, 328 diagram types, 329–334 notation for ER diagrams, 224 object modeling with, 200 representing specialization/generalization in, 265–266 University student database example, 334–337 UMLS metathesaurus, 1010–1011 Unary relational operations CARTESIAN PRODUCT operation, 155–157 overview of, 146 PROJECT operation, 149–150 SELECT operation, 147–149 UNION, INTERSECTION, and MINUS operations, 152–155 Unbalanced trees, 646 Unconstrained write assumption, 769 UNDO/NO-REDO recovery immediate update techniques, 818–819 overview of, 807, 809 Undo operations, transactions, 753 UNDO phase, of ARIES recovery algorithm, 823 UNDO/REDO recovery immediate update techniques, 819 overview of, 807, 809 UNDO, write-ahead logging and, 810–811 Unidirectional associations, in UML class diagrams, 227 Unified Modeling Language See UML (Unified Modeling Language) UNION operation algorithms for, 697–698 in relational algebra, 152–155 SQL set operations, 104 Union types (categories) EER-to-Relational mapping, 297–299 modeling, 258–260 UNIQUE function, SQL, 122 Unique identity, in ODMS, 357 UNIQUE KEY clause, CREATE TABLE command, 96 Unique keys, in relational models, 70 Uniqueness constraints on entity attributes, 208–209 factors influencing physical database design, 729 integrity constraints in databases, 21 overview of, 68–70 specifying in SQL, 95–96 Universal quantifiers transforming, 180 in tuple relational calculus, 177–178 used in queries, 180–182 Universal relation assumption, 552 Universal relation schema, 552 Universal relations, 544 Universe of discourse (UoD), University student database example data records in, 6–9 EER schema applied to, 260–263 Unordered (heap files) records, 601–602 Unrepeatable read problem, 750 Unstructured data HTML and, 418–420 information retrieval dealing with, 993–994 Unsupervised learning clustering and, 1054 neural networks and, 1058 UoD (universe of discourse), Update anomalies, avoiding redundant information in tuples, 507 UPDATE command, SQL active rules and, 936 overview of, 109–110 Update operations bitemporal databases and, 950 database design and, 728 factors influencing physical database design, 729 operations on files, 599 query processing in distributed databases, 905–907 in relational data model, 78–79 types of relational data model operations, 75 Update transactions, 322 Usage projections, data warehousing and, 1080 Use case diagrams, UML, 329–331 User accounts, database security and, 839–840 User-defined subclasses, 252, 264 User-defined time, 947 User-defined types See UDT (userdefined types) User-friendly interfaces, 38 User interfaces GUIs (graphical user interfaces), 20, 39, 1061 multiple users, 20 User labels, combining with data labels, 869–870 Users classifying DBMSs by number of, 49 Index database actors on the scene, 15–16 measures of relevance in IR, 1015 multiuser transactions, 13–14 types of users in information retrieval, 995–996 Utilities, DBMS system, 42–43 Valid event data, 957 Valid state database states, 33 relational databases, 71 Valid time databases, 946 Valid time, in temporal databases, 946 Valid time relations, in temporal databases, 947–949 valid XML documents, 422–425 Validation in database application life cycle, 307–308 of queries, 679 Validation (optimistic) concurrency control, 777, 794–795 Validation phase, of optimistic concurrency control, 794 Value, hue, saturation, and, 967 Value references, in RDBs, 396 Value sets (domains), of attributes, 209–210 Values stored in records, 594 in tuples, 65–66 Values (literals) atomic formulas as, 973 atomic literals, 378 collection literals, 382 complex types for, 358–360 in OO systems, 358 structured literals, 378 Variable-length records, 595–597 Variables bind variables (parameterized statements), 858 communication variables in SQL, 454 domain, 183 instance, 356 iterator variables, in OQL, 399–400 limited, 980 PHP, 485–486 PHP server, 490–491 PHP variable names, 484–485 program, 599 in Prolog languages, 971 scope, 490 shared, 452 temporal, 948 tuple, 98, 101, 175–176 VDL (view definition language), 37 Vector space model, for information retrieval, 1003–1005 Vertical fragmentation, in distributed databases, 881, 895 Vertical partitioning, database tuning and, 735 Vertical propagation, of privileges, 847 Vertical search engines, 1018 Very large databases, 586 Victim selection algorithm, for deadlock prevention, 788 Video applications, 25 Video clips, in multimedia databases, 932, 965 Video segments, in multimedia databases, 966 Video sources, in multimedia databases, 966 View definition language (VDL), 37 View equivalence, of transaction schedules, 768–769 View integration approach, in conceptual schema design, 315 View materialization, 135 View serializability, of transaction schedules, 768–769 Views data warehouses compared with, 1079–1080 database designers creating, 15 granting/revoking privileges, 844 multiple views of data supported in databases, 12 specifying as named queries in OQL, 402–403 Views (virtual tables), SQL vs base tables, 134 CREATE VIEW command, 134–135 implementation and update, 135–137 inline views, 137 overview of, 89, 133–134 Virtual data, in views, 12 Virtual data warehouses, 1070 Virtual private databases (VPDs), 868–869 Virtual relations, specifying with CREATE VIEW command, 90 1171 Virtual tables See Views (virtual tables), SQL Visible/hidden attributes, of objects, 361 Vocabularies in inverted indexing, 1012 searching, 1013–1014 Volatile storage, 586 Voting method, distributed concurrency control based on, 912 VPDs (virtual private databases), 868–869 Wait-die transaction timestamp, 786 Wait-for graph, 787 WAL (write-ahead logging), 810–812 WANs (wide area networks), 879 Weak entity types, 219–220, 288–289 Web access control policies for, 854–855 hypertext documents and, 415 interchanging data on, 24 Web analysis, 1019, 1027 Web applications, architectures for, 47–49 Web-based user interfaces, 38 Web browsers, 38 Web clients, 38 Web content analysis agent-based approach to, 1024–1025 concept hierarchies in, 1024 database-based approach to, 1025 ontologies and, 1023–1024 overview of, 1022 segmenting Web pages and detecting noise, 1024 structured data extraction, 1022 types of Web analysis, 1019 Web information integration, 1022–1023 Web crawlers, 1028 Web databases, programming See PHP Web forms, collecting data from/inserting record into, 493–494 Web interface, for database applications, 449 Web Ontology Language (OWL), 969 Web pages analyzing link structure of, 1020–1021 1172 Index content analysis, 1024 ranking, 1000 Web query interface integration, 1023 Web search and analysis analyzing link structure of Web pages, 1020–1021 comparing with information retrieval, 1018–1019 HITS ranking algorithm, 1021–1022 overview of, 1018 PageRank algorithm, 1021 practical uses of Web analysis, 1027–1028 searching the Web, 1020 Web content analysis, 1022–1025 Web searches combining browsing and retrieval, 1000 Web usage analysis, 1025–1027 Web security, 1028 Web servers middle tier in three-tier architecture, 48 specialized servers in client/server architecture, 45 Web Services Description Language (WSDL), 436 Web spamming, 1028 Web structure analysis analyzing link structure of Web pages, 1020–1022 types of Web analysis, 1019 Web usage analysis pattern analysis phase of, 1027 pattern discovery phase of, 1026–1027 preprocessing phase of, 1025–1026 types of Web analysis, 1019 Well-formed XML, 422–425 WHERE clause DELETE command, 109 explicit sets of values in, 122 missing or unspecified, 102 in SQL retrieval queries, 129–130 UPDATE command, 109–110 Wide area networks (WANs), 879 Wildcard (*) types of queries in IR systems, 1008–1009 using with XPath, 433 WITH CHECK OPTION, view updates and, 137 WordNet thesaurus, 1011 Wound-wait transaction timestamp, 786 Wrappers, structured data extraction and, 1022 Write-ahead logging (WAL), 810–812 Write command, hard disks and, 591 Write phase, of optimistic concurrency control, 794 Write-set, of transactions, 747 Write timestamp, 789 Write-write conflicts, in transaction schedules, 757 write_item(X), 746 WSDL (Web Services Description Language), 436 XML access control, 853–854 XML declaration, 423 XML (eXtended Markup Language) data model, 51 interchanging data on Web using, 24 XML (Extensible Markup Language) bibliographic references, 443 converting graphs into trees, 441 hierarchical (tree) data model, 420–422 hierarchical XML views over flat or graph-based data, 436–440 languages, 432 languages related to, 436 overview of, 415–416 storing/extracting XML documents from databases, 431–432, 442 structured, semistructured, and unstructured data, 416–420 summary and exercises, 442–443 well-formed and valid documents, 422–425 XML schema language, 425–430 XPath, 432–434 XQuery, 434–435 XML schema language, 425–430 example schema file, 426–428 list of concepts in, 428–429 overview of, 425 XPath, 432–434 XQuery, 434–435 XSL (Extensible Stylesheet Language), 415, 436 XSLT (Extensible Stylesheet Language Transformations), 415, 436 ... reserved for databases containing tens of terabytes 17.1 .2 Storage of Databases Databases typically store large amounts of data that must persist over long periods of time, and hence is often referred... NULL Overflow buckets Bucket 321 Record pointer 761 91 22 NULL Record pointer 1 82 Record pointer Bucket Record pointer 981 6 52 Record pointer Record pointer 72 522 NULL Record pointer Record... (b) Name Ssn 123 456789 Smith, John 12 21 25 29 (c) Name = Smith, John Ssn = 123 456789 Figure 17.5 Three record storage formats (a) A fixed-length record with six fields and size of 71 bytes (b)