Design Decisions that Do Not Affect Program Logic

The discussion in this section makes frequent reference to the term block. This is the term used in the Oracle DBMS product to refer to the smallest amount of data that can be transferred between the storage medium and main memory. The corresponding term in IBM’s DB2 database management system is page.

7.4.1 Indexes

Indexes provide one of the most commonly used methods for rapidly retrieving specified rows from a table without having to search the entire table.

Each table can have one or more indexes specified. Each index applies to a particular column or set of columns. For each value of the column(s), the index lists the location(s) of the row(s) in which that value can be found. For example, an index on Customer Location would enable us to readily locate all of the rows that had a value for Customer Location of (say) New York.
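As a rough illustration of the idea (not of how any DBMS physically stores an index), an index can be pictured as a mapping from each column value to the positions of the rows that hold it. The table contents and the names `customers` and `build_index` are invented for this sketch.

```python
from collections import defaultdict

# Hypothetical table: each row is (customer_no, customer_location).
customers = [
    (1, "New York"),
    (2, "Boston"),
    (3, "New York"),
    (4, "Chicago"),
]

def build_index(rows, column):
    """Map each value of the chosen column to the row positions holding it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index

location_index = build_index(customers, column=1)

# Locate the New York rows without scanning the whole table.
new_york_rows = [customers[p] for p in location_index["New York"]]
print(new_york_rows)
```

A real index is itself stored in blocks and kept in key sequence, which is what the later discussion of index structures is concerned with.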

The specification of each index includes:

■ The column(s).

■ Whether it is unique (i.e., whether there can be no more than one row for any given value; see “Index Properties” section).

■ Whether it is the sorting index (see “Index Properties” section).

■ The structure of the index (for some DBMSs; see “Balanced Tree Indexes” and “Bit-Mapped Indexes” sections).

The advantages of an index are that:

■ It can improve data access performance for a retrieval or update.

■ Retrievals that only refer to indexed columns do not need to read any data blocks (access to indexes is often faster than direct access to data blocks bypassing any index).

The disadvantages are that each index:

■ Adds to the data access cost of a create transaction or an update transaction in which an indexed column is updated.

■ Takes up disk space.

■ May increase lock contention (see Section 7.5.1).

■ Adds to the processing and data access cost of reorganize and table load utilities.

Whether an index will actually improve the performance of an individual query depends on two factors:

1. Whether the index is actually used by the query.

2. Whether the index confers any performance advantage on the query.

Index Usage by Queries

DML (Data Manipulation Language)³ only specifies what you want, not how to get it. The optimizer built into the DBMS selects the best available access method based on its knowledge of indexes, column contents, and so on. Thus, index usage cannot be explicitly specified but is determined by the optimizer during DML compilation. How it implements the DML will depend on:

■ The DML clauses used, in particular the predicate(s) in the where clause (see Figure 7.1 for examples).

■ The tables accessed, their size, and content.

■ What indexes there are on those tables.

³ This is the SQL query language, often itself called “SQL,” and most commonly used to retrieve data from a relational database.

FIGURE 7.1

Retrieval and update queries.

select EMP_NO, EMP_NAME, SALARY
from EMPLOYEE
where SALARY > 80000;

update EMPLOYEE
set SALARY = SALARY * 1.1
where SALARY > 80000;

Some predicates will preclude the use of indexes; these include:

■ Negative conditions (e.g., “not equals” and those involving NOT).

■ LIKE predicates in which the comparison string starts with a wildcard.

■ Comparisons including scalar operators (e.g., +) or functions (e.g., data type conversion functions).

■ ANY/ALL subqueries, as in Figure 7.2.

■ Correlated subqueries, as in Figure 7.3.

Certain update operations may also be unable to use indexes. For example, while the retrieval query in Figure 7.1 can use an index on the Salary column if there is one, the update query in the same figure cannot.
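The effect of predicate form on index usage can be observed by asking an optimizer for its access plan. The sketch below uses SQLite (through Python's standard `sqlite3` module) purely as a convenient stand-in; it is not one of the DBMSs discussed here, and other optimizers may decide differently. The index name `IX_SALARY` and the helper `plan` are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table EMPLOYEE (EMP_NO integer primary key, EMP_NAME text, SALARY integer)"
)
conn.execute("create index IX_SALARY on EMPLOYEE (SALARY)")

def plan(sql):
    """Return SQLite's access plan for a statement as one string."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    return "; ".join(row[3] for row in rows)  # column 3 holds the plan text

select = "select EMP_NO, EMP_NAME, SALARY from EMPLOYEE where "

plain_plan = plan(select + "SALARY > 80000")       # simple comparison
scalar_plan = plan(select + "SALARY + 0 > 80000")  # scalar operator on the column
negative_plan = plan(select + "SALARY <> 80000")   # negative condition

for label, text in [("plain", plain_plan),
                    ("scalar", scalar_plan),
                    ("negative", negative_plan)]:
    print(label, "->", text)
```

In SQLite, only the first plan mentions `IX_SALARY`; the other two fall back to scanning the table. The exact wording of the plan text varies between SQLite versions.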

Note that the DBMS may require that, after an index is added, a utility is run to examine table contents and indexes and recompile each SQL query. Failure to do this would prevent any query from using the new index.

Performance Advantages of Indexes

Even if an index is available and the query is formulated in such a way that it can use that index, the index may not improve performance if more than a certain proportion of rows are retrieved. That proportion depends on the DBMS.
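One rough way to see why such a threshold exists is to count block reads. The model below is an assumption made for illustration, not a formula used by any particular DBMS: a full scan reads every data block once, while index access reads `index_depth` index blocks and then, in the worst case of matching rows scattered across the table, one data block per matching row. All figures are hypothetical.

```python
def full_scan_blocks(total_rows, rows_per_block):
    """Blocks read by scanning the whole table (ceiling division)."""
    return -(-total_rows // rows_per_block)

def index_access_blocks(matching_rows, index_depth=3):
    """Worst-case blocks read via the index: one data block per matching row."""
    return index_depth + matching_rows

def index_is_cheaper(total_rows, rows_per_block, fraction_retrieved, index_depth=3):
    """Compare the two estimates for a query retrieving the given fraction of rows."""
    matching = int(total_rows * fraction_retrieved)
    return (index_access_blocks(matching, index_depth)
            < full_scan_blocks(total_rows, rows_per_block))

# Hypothetical table: 1,000,000 rows at 50 rows per block, i.e. 20,000 blocks.
for fraction in (0.001, 0.01, 0.05):
    print(fraction, index_is_cheaper(1_000_000, 50, fraction))
```

Under these assumptions the break-even point is roughly one matching row per data block (about 2 percent of rows here). Real optimizers work from richer statistics, and a sorting index changes the picture because matching rows then share blocks.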

FIGURE 7.2

An ALL subquery.

select EMP_NO, EMP_NAME, SALARY
from EMPLOYEE
where SALARY > all
    (select SALARY from EMPLOYEE where DEPT_NO = '123');

FIGURE 7.3

A correlated subquery.

select EMP_NO, EMP_NAME
from EMPLOYEE as E1
where exists
    (select *
     from EMPLOYEE as E2
     where E2.EMP_NAME = E1.EMP_NAME
     and E2.EMP_NO <> E1.EMP_NO);

Index Properties

If an index is defined as unique, each row in the associated table must have a different value in the column or columns covered by the index. Thus, this is a means of implementing a uniqueness constraint, and a unique index should therefore be created on each table’s primary key as well as on any other sets of columns having a uniqueness constraint. However, since the database administrator can always drop any index (except perhaps that on a primary key) at any time, a unique index cannot be relied on to be present whenever rows are inserted. As a result, most programming standards require that a uniqueness constraint be explicitly tested for whenever inserting a row into the relevant table or updating any column participating in that constraint.
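The two lines of defense described above, the unique index and the explicit test in the program, might be sketched as follows. SQLite is used only as a convenient stand-in, and the index name `UX_EMP_NO` and the function `insert_employee` are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table EMPLOYEE (EMP_NO integer, EMP_NAME text)")
conn.execute("create unique index UX_EMP_NO on EMPLOYEE (EMP_NO)")

def insert_employee(emp_no, emp_name):
    """Insert a row after explicitly testing the uniqueness constraint."""
    exists = conn.execute(
        "select 1 from EMPLOYEE where EMP_NO = ?", (emp_no,)
    ).fetchone()
    if exists:
        return False  # rejected by the explicit test, with or without the index
    conn.execute("insert into EMPLOYEE values (?, ?)", (emp_no, emp_name))
    return True

first = insert_employee(1, "Smith")
second = insert_employee(1, "Jones")  # duplicate EMP_NO

# Bypassing the explicit test shows the unique index enforcing the same rule.
try:
    conn.execute("insert into EMPLOYEE values (1, 'Brown')")
    index_rejected = False
except sqlite3.IntegrityError:
    index_rejected = True

print(first, second, index_rejected)
```

Note that a test-then-insert sequence like this is only reliable under concurrent access if it runs inside a transaction with suitable locking, which this single-connection sketch does not attempt to show.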

The sorting index (called the clustering index in DB2) of each table is the one that controls the sequence in which rows are stored during a bulk load or reorganization that occurs during the existence of that index. Clearly there can be only one such index for each table. Which column(s) should the sorting index cover? In some DBMSs there is no choice; the index on the primary key will also control row sequence. Where there is a choice, any of the following may be worthy candidates, depending on the DBMS:

■ Those columns most frequently involved in inequalities (e.g., where > or >= appears in the predicate).

■ Those columns most frequently specified as the sorting sequence.

■ The columns of the most frequently specified foreign key in joins.

■ The columns of the primary key.

The performance advantages of a sorting index are:

■ Multiple rows relevant to a query can be retrieved in a single I/O operation.

■ Sorting is much faster if the rows are already more or less in sequence (note that rows can get out of sequence between reorganizations).
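The first of these advantages can be made concrete with a small simulation that counts how many blocks hold the rows in a key range, once with rows stored in key sequence and once with rows stored in arbitrary sequence. The block capacity of 10 rows and the key range are assumptions chosen for the example.

```python
import random

ROWS_PER_BLOCK = 10  # assumed block capacity

def blocks_touched(stored_keys, low, high):
    """Count distinct blocks holding keys in [low, high] for a given storage order."""
    return len({
        position // ROWS_PER_BLOCK
        for position, key in enumerate(stored_keys)
        if low <= key <= high
    })

sorted_layout = list(range(1000))          # rows stored in key sequence
shuffled_layout = sorted_layout[:]
random.Random(0).shuffle(shuffled_layout)  # rows stored in arbitrary sequence

in_sequence = blocks_touched(sorted_layout, 100, 149)
out_of_sequence = blocks_touched(shuffled_layout, 100, 149)
print(in_sequence, out_of_sequence)
```

With rows in key sequence, the 50 rows in the range occupy 5 blocks; with rows out of sequence they are spread over many more, each of which costs a separate read.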

By contrast, creating a sorting index on one or more columns may confer no advantage over a nonsorting index if those columns are mostly involved in index-only processing (i.e., if those columns are mostly accessed only in combination with each other or are mostly involved in = predicates).

Consider creating other (nonunique, nonsorting) indexes on:

■ Columns searched or joined with a low hit rate.

■ Foreign keys.

■ Columns frequently involved in aggregate functions, existence checks, or DISTINCT selection.

■ Sets of columns frequently linked by AND in predicates.

■ Code and meaning columns for a classification table if there are other less-frequently accessed columns.

■ Columns frequently retrieved.

Indexes on any of the following may not yield any performance benefit:

■ Columns with low cardinality (the number of different values is significantly less than the number of rows) unless a bit-mapped index is used (see “Bit-Mapped Indexes” section).

■ Columns with skewed distribution (many occurrences of one or two particular values and few occurrences of each of a number of other values).

■ Columns with low population ( NULL in many rows).

■ Columns that are frequently updated.

■ Columns that take up a significant proportion of the row length.

■ Tables occupying a small number of blocks, unless the index is to be used for joins, a uniqueness constraint, or referential integrity, or if index-only processing is to be used.

■ Columns with the varchar (variable length) data type.

Balanced Tree Indexes

Figure 7.4 illustrates the structure of a balanced tree index (often referred to as a B-tree index) used in most relational DBMSs. Note that the depth of the tree may be only one (in which case the index entries in the root block point directly to data blocks); two (in which case the index entries in the root block point to leaf blocks in which index entries point to data blocks); three (as shown in the figure); or more than three (in which case the index entries in nonleaf blocks point to other nonleaf blocks). The term balanced refers to the fact that the tree structure is symmetrical: every leaf block is the same number of levels below the root. If insertion of a new record causes a particular leaf block to fill up, the index entries must be redistributed evenly across the index with additional index blocks created as necessary, leading eventually to a deeper index.

FIGURE 7.4

Balanced tree index structure. [Diagram: a root block points to nonleaf blocks, which point to leaf blocks, which in turn point to data blocks.]

Particular problems may arise with a balanced tree index on a column or columns on which inserts are sequenced (i.e., each additional row has a higher value in those column(s) than the previous row added). In this case, the insertion of new index entries is focused on the rightmost (highest value) leaf block, rather than evenly across the index, resulting in more frequent redistribution of index entries that may be quite slow if the entire index is not in main memory. This makes a strong case for random, rather than sequential, primary keys.
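The concentration of sequenced inserts on the rightmost leaf block can be seen in a much simplified model that tracks only the leaf level of the index. The leaf capacity of four entries and the policy of splitting a full leaf in half are assumptions made for this sketch; a real DBMS also maintains the nonleaf levels and may redistribute entries differently.

```python
import bisect
import random

def insert_all(keys, capacity=4):
    """Insert keys into sorted leaf blocks; return (leaves, inserts hitting the rightmost leaf)."""
    leaves = [[]]
    rightmost_hits = 0
    for key in keys:
        # The first key of each later leaf acts as the separator between leaves.
        separators = [leaf[0] for leaf in leaves[1:]]
        i = bisect.bisect_right(separators, key)
        if i == len(leaves) - 1:
            rightmost_hits += 1
        leaf = leaves[i]
        bisect.insort(leaf, key)
        if len(leaf) > capacity:
            # Leaf is over-full: split it into two leaves of roughly equal size.
            mid = len(leaf) // 2
            leaves[i:i + 1] = [leaf[:mid], leaf[mid:]]
    return leaves, rightmost_hits

sequential_keys = list(range(1000))
random_keys = sequential_keys[:]
random.Random(1).shuffle(random_keys)

seq_leaves, seq_hits = insert_all(sequential_keys)
rand_leaves, rand_hits = insert_all(random_keys)
print(seq_hits, rand_hits)
```

With sequenced keys every one of the 1,000 inserts lands on the rightmost leaf, whereas with randomly ordered keys the inserts are spread across the leaves.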

Bit-Mapped Indexes

Another index structure provided by some DBMSs is the bit-mapped index. This has an index entry for each value that appears in the indexed column. Each index entry includes a column value followed by a series of bits, one for each row in the table. Each bit is set to one if the corresponding row has that value in the indexed column and zero if it has some other value. This type of index confers the most advantage where the indexed column is of low cardinality (the number of different values is significantly less than the number of rows). By contrast, such an index may impact negatively on the performance of an insert operation into a large table as every bit in every index entry that represents a row after the inserted row must be moved one place to the right. This is less of a problem if the index can be held permanently in main memory (see Section 7.4.3).
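A minimal sketch of the structure just described, using a Python integer as the series of bits for each value. The column contents and function names are invented, and storing row 0 in the lowest-order bit is simply a convenience of this representation, not a statement about how any DBMS lays out its bitmaps.

```python
def build_bitmap_index(values):
    """Return {value: bitmap}, where bit i is 1 if row i holds that value."""
    index = {}
    for row_number, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_number)
    return index

def matching_rows(bitmap):
    """List the row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical low-cardinality columns for six rows.
status = ["A", "I", "A", "A", "I", "A"]
state = ["NY", "NY", "CA", "NY", "CA", "CA"]

status_index = build_bitmap_index(status)
state_index = build_bitmap_index(state)

# A predicate such as STATUS = 'A' and STATE = 'NY' becomes a bitwise AND.
active_in_ny = matching_rows(status_index["A"] & state_index["NY"])
print(active_in_ny)
```

Combining predicates with cheap bitwise operations like this is one reason such an index suits low-cardinality columns: there are few bitmaps to keep, and each one identifies many rows.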

Indexed Sequential Tables

A few DBMSs support an alternative form of index referred to as ISAM (indexed sequential access method). This may provide better performance for some types of data population and access patterns.

Hash Tables

Some DBMSs provide an alternative to an index to support random access in the form of a hashing algorithm to calculate block numbers from key values.

Tables managed in this fashion are referred to as hashed random (or “hash” for short). Again, this may provide better performance for some types of data population and access patterns. Note that this technique is of no value if partial keys are used in searches (e.g., “Show me the customers whose names start with ‘Smi’”) or a range of key values is required (e.g., “Show me all customers with a birth date between 1/1/1948 and 12/31/1948”), whereas indexes do support these types of query.
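The difference between full-key and partial-key searches on a hashed table can be sketched as follows. The number of blocks, the customer names, and the use of CRC-32 as the hashing algorithm are all assumptions made for the example (CRC-32 is chosen only because it gives repeatable results); the sketch also ignores what happens when a block fills up.

```python
import zlib

N_BLOCKS = 8  # assumed number of blocks allocated to the table

def block_for(key):
    """Hash a key value to a block number."""
    return zlib.crc32(key.encode("utf-8")) % N_BLOCKS

blocks = [[] for _ in range(N_BLOCKS)]
for name in ["Smith", "Smithers", "Smiley", "Jones", "Brown"]:
    blocks[block_for(name)].append(name)

def find_exact(name):
    """A full-key search computes the block number and reads exactly one block."""
    return name in blocks[block_for(name)], 1

def find_prefix(prefix):
    """A partial-key search cannot be hashed, so every block must be read."""
    hits = sorted(n for block in blocks for n in block if n.startswith(prefix))
    return hits, N_BLOCKS

print(find_exact("Jones"))
print(find_prefix("Smi"))
```

Each function returns its result together with the number of blocks read: one for the full-key search, all eight for the partial-key search, since names that start with the same letters hash to unrelated blocks.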

Heap Tables

Some DBMSs provide for tables to be created without indexes. Such tables are sometimes referred to as heaps. If the table is small (only a few blocks) an index may provide no advantage. Indeed if all the data in the table will fit into a single block, accessing a row via an index requires two blocks to be read (the index block and the data block) compared with reading in and scanning (in main memory) the one block; in this case, an index degrades performance. Even if the data in the table require two blocks, the average number of blocks read to access a single row is still less than the two necessary for access via an index. Many reference (or classification) tables fall into this category.
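The arithmetic behind this comparison can be written out as below, assuming that the required row is equally likely to be in any block, that a scan stops as soon as the row is found, and that the index occupies a single block.

```python
def average_scan_reads(table_blocks):
    """Average blocks read to find one row by scanning, stopping when it is found."""
    return (table_blocks + 1) / 2

INDEX_READS = 2  # one index block plus one data block, as in the text

for table_blocks in (1, 2, 3, 4):
    scan = average_scan_reads(table_blocks)
    verdict = "index helps" if scan > INDEX_READS else "index does not help"
    print(table_blocks, scan, verdict)
```

On these assumptions a scan averages 1 read for a one-block table and 1.5 reads for a two-block table, matches the index at three blocks, and only falls behind it from four blocks upward.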

Note, however, that the DBMS may require that an index be created for the primary key of each table that has one, and a classification table will certainly require a primary key. If so, performance may be improved by one of the following:

■ Creating an additional index that includes both code (the primary key) and meaning columns; any access to the classification table that requires both columns will use that index rather than the data table itself (which is now in effect redundant but only takes up space rather than slowing down access).

■ Assigning the table to main memory in such a way that ensures the classification table remains in main memory for the duration of each load of the application (see Section 7.4.3).
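The first option, an index containing both the code and meaning columns, can be tried out in SQLite, used here only as a convenient stand-in for the DBMSs under discussion. The table and index names are invented, and for simplicity the sketch omits the primary key index that the DBMS may require.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table STATUS_TYPE (CODE text, MEANING text, NOTES text)")
conn.execute("create index IX_CODE_MEANING on STATUS_TYPE (CODE, MEANING)")

# Ask the optimizer how it would look up the meaning for a given code.
rows = conn.execute(
    "explain query plan select MEANING from STATUS_TYPE where CODE = 'A'"
).fetchall()
plan = "; ".join(row[3] for row in rows)  # column 3 holds the plan text
print(plan)
```

SQLite reports that it will use `IX_CODE_MEANING` as a covering index, which is its term for answering the query from the index alone without reading the table's own blocks.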

7.4.2 Data Storage

A relational DBMS provides the database designer with a variety of options (depending on the DBMS) for the storage of data.

Table Space Usage

Many DBMSs enable the database designer to create multiple table spaces to which tables can be assigned. Since these table spaces can each be given different block sizes and other parameters, tables with similar access patterns can be stored in the same table space and each table space then tuned to optimize the performance for the tables therein. The DBMS may even allow you to interleave rows from different tables, in which case you may be able to arrange, for example, for the Order Item rows for a given order to follow the Order row for that order, if they are frequently retrieved together. This reduces the average number of blocks that need to be read to retrieve an entire order. The facility is sometimes referred to as clustering, which may lead to confusion with the term clustering index (see “Index Properties” section).

Free Space

When a table is loaded or reorganized, each block may be loaded with as many rows as can fit (unless rows are particularly short and there is a limit imposed by the DBMS on how many rows a block can hold). If a new row is inserted and the sorting sequence implied by the primary index dictates that the row should be placed in an already full block, that row must be placed in another block. If no provision has been made for additional rows, that will be the last block (or if that block is full, a new block following the last block). Clearly this “overflow” situation will cause a degradation over time of the sorting sequence implied by the primary index and will reduce any advantages conferred by the sorting sequence of that index.

This is where free space enters the picture. A specified proportion of the space in each block can be reserved at load or reorganization time for rows subsequently inserted. A fallback can also be provided by leaving every nth block empty at load or reorganization time. If a block fills up, additional rows that belong in that block will be placed in the next available empty block. Note that once this happens, any attempt to retrieve data in sequence will incur extra block reads. This caters, of course, not only for insertions but for increases in the length of existing rows, such as those that have columns with the varchar data type.

The more free space you specify, the more rows can be fitted in or increased in length before performance degrades and reorganization is necessary. At the same time, more free space means that any retrieval of multiple consecutive rows will need to read more blocks. Obviously for those tables that are read-only, you should specify zero free space. In tables that have a low frequency of create transactions (and update transactions that increase row length), zero free space is also reasonable, since additional data can be added after the last row. Free space can and should be allocated for indexes as well as data.
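The trade-off can be quantified with some simple arithmetic. The block size, row length, and row count below are hypothetical, and the calculation ignores the block header and other overheads that a real DBMS would deduct.

```python
def rows_per_block(block_size, row_length, pct_free):
    """Rows loaded into each block when pct_free percent of it is left empty."""
    usable = block_size * (100 - pct_free) // 100
    return usable // row_length

def blocks_needed(total_rows, block_size, row_length, pct_free):
    """Blocks occupied by the table immediately after load or reorganization."""
    per_block = rows_per_block(block_size, row_length, pct_free)
    return -(-total_rows // per_block)  # ceiling division

# Hypothetical figures: 8,192-byte blocks, 100-byte rows, 1,000,000 rows.
for pct_free in (0, 10, 20):
    print(pct_free,
          rows_per_block(8192, 100, pct_free),
          blocks_needed(1_000_000, 8192, 100, pct_free))
```

With these figures, reserving 20 percent free space cuts the rows loaded per block from 81 to 65 and raises the number of blocks from 12,346 to 15,385, so a retrieval of many consecutive rows reads about a quarter more blocks in exchange for room to absorb later inserts.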

Table Partitioning

Some DBMSs allow you to divide a table into separate partitions based on one of the indexes. For example, if the first column of an index is the state code, a separate partition can be created for each state. Each partition can be independently loaded or reorganized and can have different free space and other settings.

Drive Usage

Choosing where a table or index is on disk enables you to use faster drives for more frequently accessed data or to avoid channel contention by distributing across multiple disk channels tables that are accessed in the same query.

Compression

One option that many DBMSs provide is the compression of data in the stored table (e.g., shortening of null columns or text columns with trailing space). While this may save disk space and increase the number of rows per block, it can add to the processing cost.

Distribution and Replication

Modern DBMSs provide many facilities for distributing data across multiple networked servers. Among other things, distributing data in this manner can confer performance and availability advantages. However, this is a specialist topic and is outside the scope of this brief overview of physical database design.

7.4.3 Memory Usage

Some DBMSs support multiple input/output buffers in main memory and enable you to specify the size of each buffer and allocate tables and indexes to particular buffers. This can reduce or even eliminate the need to swap frequently accessed tables or indexes out of main memory to make room for other data. For example, a buffer could be set up that is large enough to accommodate all the classification tables in their entirety. Once they are all in main memory, any query requiring data from a classification table does not have to read any blocks for that purpose.
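The benefit of a dedicated buffer can be illustrated with a small simulation that counts block reads from disk. The least-recently-used replacement policy, the buffer sizes, and the workload are all assumptions made for this sketch; the replacement policy of a real DBMS may differ.

```python
from collections import OrderedDict

class Buffer:
    """A least-recently-used buffer that counts block reads from disk."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.disk_reads = 0

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # already in main memory
            return
        self.disk_reads += 1                   # must be read from disk
        self.blocks[block_id] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict the least recently used block

# Hypothetical workload: 3 classification blocks are reread between bursts
# of 10 previously unread transaction blocks.
workload = []
for burst in range(5):
    workload += [("class", i) for i in range(3)]
    workload += [("txn", burst * 10 + i) for i in range(10)]

# One shared buffer of 8 blocks.
shared = Buffer(capacity=8)
for block in workload:
    shared.get(block)

# The same 8 blocks of memory, split into a dedicated buffer and a general one.
dedicated, general = Buffer(capacity=3), Buffer(capacity=5)
for block in workload:
    (dedicated if block[0] == "class" else general).get(block)

print(shared.disk_reads, dedicated.disk_reads + general.disk_reads)
```

In the shared buffer each burst of transaction blocks pushes the classification blocks out, so all 65 accesses go to disk. With the dedicated buffer the classification blocks are read once (3 reads) and stay in main memory, giving 53 reads in total for the same amount of memory.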
