Addison Wesley SQL Performance Tuning Sep 2002 ISBN 0201791692

EXPLAIN

The non-standard EXPLAIN statement (see Table 17-1) is the vital way to find out what the optimizer has done. We haven't mentioned it up to now because this book's primary goal has been to show what you can do before the fact. But EXPLAIN is the way to measure whether your estimates correspond to DBMS reality.

In many shops, it's customary to get an EXPLAIN for every SQL statement before submitting it for execution. That is quite reasonable. What's perhaps less reasonable is the custom of trying out every transformation one can think of and submitting them all for explanation. That is mere floundering. Understanding principles (in other words, estimating what's best before the fact) is more reliable and less time consuming. So don't flounder: read this book!

Here is an example of a typical EXPLAIN output, from Informix:

    SET EXPLAIN ON

    QUERY:
    SELECT column1, column2 FROM Table1;

    Estimated cost: 3
    Estimated # of rows returned: 50

    1) Owner1.Table1 : SEQUENTIAL SCAN

With most DBMSs, the EXPLAIN result goes to a table or file so you can select the information you need. In many cases, the EXPLAIN statement's output is much harder to follow than the short example we show here, which is why graphic tools like IBM's Visual EXPLAIN are useful.

Chapter 17 Cost-Based Optimizers

Supposedly, a rule-based optimizer (RBO) differs from a cost-based optimizer (CBO). Consider this SQL statement:

    SELECT * FROM Table1
      WHERE column2 = 55

Assume that column2 is an indexed, non-unique column. A rule-based optimizer will find column2 in the system catalog and discover that it is indexed, but not uniquely indexed. The RBO then combines this data with the information that the query uses the equals operator. A common assumption in the field of optimization is that "=" search conditions will retrieve 5% of all rows. (In contrast, the assumption for greater-than conditions is that they will retrieve 25% of all rows.)
That is a narrow search, and usually it's faster to perform a narrow search with a B-tree rather than scan all rows in the table. Therefore the rule-based optimizer makes a plan: find matching rows using the index on column2. Notice that the rule-based optimizer is using a non-volatile datum (the existence of an index) and a fixed assumption (that equals searches are narrow).

A cost-based optimizer can go further. Suppose the system catalog contains three additional pieces of information: (1) that there are 100 rows in Table1, (2) that there are two pages in Table1, and (3) that the value 55 appears 60 times in the index for column2. Those facts change everything. The equals operation will match on 60% of the rows, so it's not a narrow search. And the whole table can be scanned using two page reads, whereas an index lookup would take three page reads (one to look up in the index, two more to fetch the data later). Therefore the cost-based optimizer makes a different plan: find matching rows using a table scan. Notice that the cost-based optimizer is using volatile data (the row and column values that have been inserted) and an override (that the contents are more important than the fixed assumptions).

In other words, a cost-based optimizer is a rule-based optimizer that has additional, volatile information available to it so that it can override the fixed assumptions that would otherwise govern its decisions. The terminology causes an impression that one optimizer type is based on rules while the other is based on cost. That's unfortunate because both optimizer types use rules and both optimizer types have the goal of calculating cost. The reality is that cost-based is an extension of rule-based, and a better term would have been something like "rule-based++."
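The page-read arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular DBMS computes cost: the function names, the single index page read, and the rule that an index lookup touches at most every data page once are all hypothetical simplifications chosen to reproduce the numbers in the text (two pages, 60 matching rows).

```python
def rule_based_plan(column_is_indexed: bool, operator: str) -> str:
    """Rule-based choice: uses only the existence of an index and the
    fixed assumption that an equals search is narrow."""
    if column_is_indexed and operator == "=":
        return "index lookup"
    return "table scan"


def cost_based_plan(column_is_indexed: bool,
                    table_pages: int,
                    matching_rows: int,
                    index_page_reads: int = 1) -> str:
    """Cost-based choice: overrides the fixed assumption with statistics.

    Assumption (hypothetical cost model): a table scan costs one read per
    page, and an index lookup costs the index reads plus one read for each
    data page that holds a matching row, capped at the number of pages.
    """
    if not column_is_indexed:
        return "table scan"
    scan_cost = table_pages
    data_fetch_cost = min(matching_rows, table_pages)
    index_cost = index_page_reads + data_fetch_cost
    return "index lookup" if index_cost < scan_cost else "table scan"


if __name__ == "__main__":
    # Without statistics, the equals search is assumed to be narrow.
    print(rule_based_plan(True, "="))
    # With the statistics from the text: 2 pages, value 55 appears 60 times.
    print(cost_based_plan(True, table_pages=2, matching_rows=60))
```

With the figures from the text, the index lookup costs three page reads against two for the scan, so the sketch picks the table scan. With a large table and only a few matching rows, the same function goes back to the index, which is the sense in which cost-based is an extension of rule-based.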
Most vendors claim that their DBMSs have cost-based optimizers, as you can see from Table 17-1. The claims don't mean much by themselves. What's important is whether the optimizer estimates cost correctly and how it acts on the estimate. In this chapter, we'll look at the actions DBMSs take to fulfill their claims.

Table 17-1 Cost-Based Optimizers

               Claims to    "Explains" the      "Updates" Statistics
               be CBO       Access Plan         for the Optimizer
    IBM        Yes          EXPLAIN             RUNSTATS
    Informix   Yes          SET EXPLAIN         UPDATE STATISTICS
    Ingres     Yes          EXECUTE QEP         optimizedb utility
    InterBase  Yes          SELECT … PLAN       SET STATISTICS
    Microsoft  Yes          EXPLAIN             UPDATE STATISTICS
    MySQL      No           EXPLAIN             ANALYZE TABLE
    Oracle     Yes          EXPLAIN PLAN FOR    ANALYZE
    Sybase     Yes          SET SHOWPLAN ON     UPDATE STATISTICS

Notes on Table 17-1:

Claims to be CBO column
    This column is "Yes" if the DBMS's documentation makes the claim that it operates with a cost-based optimizer.

"Explains" the Access Plan column
    This column shows the non-standard statement provided by the DBMS so that you can examine the access plan the optimizer will use to resolve an SQL statement. For example, if you want to know how Oracle will resolve a specific SELECT statement, just execute an EXPLAIN PLAN FOR statement for the SELECT.

"Updates" Statistics for the Optimizer column
    This column shows the non-standard statement or utility the DBMS provides so that you can update volatile information, or statistics, for the optimizer. For example, if your DBMS is MySQL and you've just added many rows to a table and want the optimizer to know about them, just execute an ANALYZE TABLE statement for the table.

Appendix B Glossary

This glossary contains only terms that specifically are used for SQL optimization. For terms that apply to the subject of SQL in general, consult the 1,000-term glossary on our Web site, ourworld.compuserve.com/homepages/OCELOTSQL/glossary.htm

Before the definition there may be a "Used by" note. For example, "Used by: Microsoft, Sybase" indicates that Microsoft and Sybase
authorities prefer the term and/or definition. The words "Used by: this book only" indicate a temporary and nonstandard term that exists only for this book's purposes.

When a word has multiple meanings, the first definition is marked "[1]" and subsequent definitions are marked with incremented numbers.

"See" and "see also" refer to other terms in this glossary.

Numbers/Symbols

_rowid
    Used by: MySQL
    See also [row identifier]

1NF
    First normal form table, a table that contains only scalar values.

2NF
    Second normal form table, a 1NF table that contains only columns that are dependent upon the entire primary key.

3NF
    Third normal form table, a 2NF table whose non-key columns are also mutually independent; that is, each column can be updated independently of all the rest.

A

access plan
    The plan used by the optimizer to resolve an SQL statement.

ADO
    ActiveX Data Objects, an API that enables Windows applications to access a database.

aggregate function
    See [set function]

API
    Application Programming Interface, the method by which a programmer writing an application program can make requests of the operating system or another application.

applet
    A Java program that can be downloaded and executed by a browser.

B

B-tree
    A structure for storing index keys; an ordered, hierarchical, paged assortment of index keys. Some people say the "B" stands for "Balanced."
back compression
    Making index keys shorter by throwing bytes away from the back.
    See also [compression]
    See also [front compression]

balanced tree
    See [B-tree]

BDB
    Berkeley DB, an embedded database system that is bundled with MySQL.

big-endian
    A binary data transmission/storage format in which the most significant bit (or byte) comes first.

binary sort
    A sort of the codes that are used to store the characters.

bitmap
    Multiple lines of bits.

bitmap index
    An index containing one or more bitmaps.
    See also [bitmap]

bit vector
    A line of bits.

block
    Used by: Oracle
    [1] Oracle-speak for the smallest I/O unit in a database. All other DBMSs use the word page.
    [2] When Job #1 has an exclusive lock on Table1 and Job #2 tries to access Table1, Job #2 must wait; it is blocked.
    See also [page]

block fetch
    A fetch that gets multiple rows at once.

blocking unit
    See [page]

BNF
    Backus-Naur Form, a notation used to describe the syntax of programming languages.

bookmark
    Used by: Microsoft
    A pointer in an index key. If the data is in a clustered index, the bookmark is a clustered index key. If the data is not in a clustered index, the bookmark is an RID.
    See also [row locator]

buffer
    See [buffer pool]

buffer pool
    A fixed-size allocation of memory, used to store an in-memory copy of a bunch of pages.

bulk INSERT
    A multiple-row INSERT submitted to the DBMS as a single unit for processing all at once.

bytecode
    An intermediate form of code in which executable Java programs are represented. Bytecode is higher level than machine code, but lower level than source code.

... UNION

shared (lock mode)
    A lock that may coexist with any number of other shared locks, or with one update lock, on the same object.

shift
    Movement of rows caused by a change in the row size. When the length of a row is changed due to an UPDATE or DELETE operation, that row and all subsequent rows on the page may have to be moved or shifted.
    See also [migration]

shrinking update
    A data-change statement that decreases the size of a row.

skew
    An
    observation about the distribution of values in a set. If value1, value2, and value3 each occur five times, there is no skew: the values are evenly distributed. On the other hand, if value1 occurs five times and value2 occurs ten times and value3 occurs 100 times, there is skew: the values are unevenly distributed.

sort key
    Used by: this book only
    A string with a series of one-byte numbers that represents the relative ordering of characters.

sort-merge join
    A method for producing a joined table. Given two input tables Table1 and Table2, processing is as follows:
    (a) Sort Table1 rows according to join-column values.
    (b) Sort Table2 rows according to join-column values.
    (c) Merge the two sorted lists, eliminating rows where no duplicates exist.

SPL
    Used by: Informix
    Stored Procedure Language, Informix term for stored procedures.

splitting (an index)
    A process whereby the DBMS makes a newly inserted or updated key fit into the index when the key won't fit in the current page. To make the key fit, the DBMS splits the page: it takes some keys out of the current page and puts them in a new page.

statistics
    Volatile data about the database, stored in the system catalog so that the optimizer has access to it.

stmt
    Statement container, an ODBC resource.

strong-clustered index
    Used by: this book only
    See also [clustered index]

subquery
    A SELECT within another SQL statement, usually within another SELECT.

subselect
    See [subquery]

synchronized (method)
    An attribute of a Java method that provides concurrency control among multiple threads sharing an object.

syntax
    Used by: this book only
    A choice of words and their arrangement in an SQL statement.

T

table scan
    A search of an entire table, row by row.

tablespace
    A file or group of files that contain data.

third normal form
    See [3NF]

throughput
    The number of operations the DBMS can do in a time unit.

tid
    Used by: Ingres
    See also [row identifier]

transaction
    A series of SQL statements that constitute an atomic unit of work: either all are committed
    as a unit or they are all rolled back as a unit. A transaction begins with the first statement since the last transaction end and finishes with a transaction end (either COMMIT or ROLLBACK) statement.

transform
    The process of rewriting an SQL statement to produce the same result, but with different syntax. When two SQL statements have different syntax but will predictably and regularly produce the same outputs, they are known as transforms of one another.

transitively dependent
    A concept used in normalization. If column2 is dependent on column1 and column3 is dependent on column2, then it is also true that column3 is dependent on column1; the Law of Transitivity applies to dependence too.

U

UDT
    User-defined data type, a data type defined by a user using the SQL CREATE TYPE statement.

Unicode
    A character encoding. Used in Java and sometimes used by DBMSs to support the SQL NCHAR/NCHAR VARYING data type.

unique index
    An index with perfect selectivity, that is, an index with no duplicate values allowed. Standard SQL allows multiple NULLs in a unique index, since a NULL is not considered to be equal to any other value, including another NULL. Many DBMSs, however, accept only one NULL in a unique index. Some DBMSs won't allow even a single NULL.

uniquifier
    Used by: Microsoft
    A 4-byte value added to a clustered index key to make it unique.

update (lock mode)
    A lock that may coexist with any number of shared locks, but not with another update lock nor with an exclusive lock on the same object.

URL
    Uniform Resource Locator, the electronic address for an Internet site.

V

vector
    A one-dimensional array.

versioning
    A mechanism that sometimes doesn't use shared locks for rows. Versioning is also known as Multi Version Concurrency Control (MVCC).

W

weak-clustered index
    Used by: this book only
    See also [clustered index]

UPDATE

The SQL Standard description of the typical UPDATE format is:

    UPDATE <Table name> SET
      { <column> = <value> [, <column> = <value> ] | ROW = <row value> }
      [ WHERE <search condition> ]

For example:

    UPDATE Table1 SET column1 = 1, column2 = 2, column3 = 3
      WHERE column1 <> 1 OR column2 <> 2 OR column3 <> 3

This example updates multiple columns with the same SET clause, which is better than the alternative: using three UPDATE statements would reduce locking but increase logging (GAIN: -8/8 with three UPDATEs instead).

A little-known fact about the SET clause is that evaluation must be from left to right. Therefore, if any assignment is likely to fail, put that assignment first in the SET clause. For example, if the column3 = 3 assignment in the last UPDATE is likely to fail, change the statement to:

    UPDATE Table1 SET column3 = 3, column1 = 1, column2 = 2
      WHERE column1 <> 1 OR column2 <> 2 OR column3 <> 3
    GAIN: 6/8

With this change, the DBMS will fail immediately and not waste time setting column1 = 1 and column2 = 2.

The UPDATE example also contains a WHERE clause that specifies a precise exclusion of the conditions that are going to be true when the SET clause is done. This is like saying, "Make it so unless it's already so." This WHERE clause is redundant, but you might be lucky and find that no rows need to be updated (GAIN: 5/8 if no data change required). By the way, this trick does not work if any of the columns can contain NULL: a comparison such as column1 <> 1 is neither true nor false when column1 is NULL, so a row whose columns are all NULL would be skipped rather than updated.

Dependent UPDATE

Often two data changes occur in a sequence and are related. For example, you might UPDATE the customer balance then INSERT into a transaction table:

    BEGIN
      UPDATE Customers SET balance = balance + 500
        WHERE cust_id = 12345;
      INSERT INTO Transactions VALUES (12345, 500);
    END

This is improvable. First of all, it would be safer to say that the INSERT should only happen if the UPDATE changed at least one row (the number of changed rows is available to a host program at the end of any data-change statement). Second, if it's true that this is a sequence that happens regularly, then the INSERT statement should be in a trigger. (It goes without saying that the whole thing should be in a stored procedure; see Chapter 11, "Stored Procedures.")

When the sequence is two UPDATEs rather than an INSERT and an UPDATE, it
sometimes turns out that the best optimizations involve ON UPDATE CASCADE (for a primary/foreign key relationship), or that the columns being updated should be merged into one table.

Batch UPDATE

An UPDATE statement contains a SET clause (the operation) and a WHERE clause (the condition). Which comes first, the condition or the operation? You may not remember when batch processing was the norm. It's what used to happen in the days when a single sequential pass of the data was optimal, because of the nature of the devices being employed. The essential loop that identifies a batch processor goes like this: get next record, do every operation on that record, repeat. That is:

    For each row {
      Do every operation relevant to this row
    }

This contrasts with the normal SQL set orientation, which goes like this: find the records and then do the operations on them. That is:

    For each operation {
      Do every row relevant to this operation
    }

You can change an SQL DBMS into a batch processor by using the CASE operator. For example:

    /* the set-oriented method */
    UPDATE Table1 SET column2 = 'X'
      WHERE column1 < 100
    UPDATE Table1 SET column2 = 'Y'
      WHERE column1 >= 100
      OR column1 IS NULL

    /* the batch-oriented method */
    UPDATE Table1 SET column2 =
      CASE
        WHEN column1 < 100 THEN 'X'
        ELSE 'Y'
      END
    GAIN: 5/7

Portability
    InterBase doesn't support CASE. The gain shown is for only seven DBMSs.

The batch-oriented method is reasonable if you are updating 100% of the rows in the table. Generally you should be leery of statements that select everything and then decide what to do based on the selection; such code might just be a transplant from a legacy system. In this example, though, there is an advantage because the WHERE clauses are dispensed with, and because the rows are processed in ROWID order.

The Bottom Line: UPDATE

Update multiple columns with the same UPDATE … SET clause, rather than with multiple UPDATE statements.

UPDATE … SET clause evaluation must be from left to right. If any assignment is likely to fail,
put it first in the SET clause.

It can be helpful to add a redundant WHERE clause to an UPDATE statement, in cases where it's possible that no rows need to be updated. This won't work if any of the columns can contain NULL.

If you're updating all rows of a table, use batch processing for the UPDATE.

Check out ON UPDATE CASCADE (for a primary/foreign key relationship) if you find you're doing multiple UPDATE statements on related tables in sequence.

Consider whether columns belonging to multiple tables should belong to the same table if they're frequently being updated in sequence.

Put related (and frequently done) UPDATE statements into triggers and/or stored procedures.

Chapter 11 Stored Procedures

A stored procedure is something that is both stored (that is, it's a database object) and procedural (that is, it can contain constructs like IF, WHILE, BEGIN/END). Like a procedure in any language, an SQL stored procedure can accept parameters, declare and set variables, and return scalar values. A stored procedure may contain one or more SQL statements, including SELECT, so it can return result sets too. Listing 11-1 shows an example of a stored procedure declaration, in SQL Standard syntax.

Portability
    MySQL does not support stored procedures. All gains shown in this chapter are for only seven DBMSs.

Also, because stored procedures were around long before the SQL Standard got around to requiring them, every DBMS uses slightly different syntax to define stored procedures. A discussion of the differences is beyond the scope of this book.

Listing 11-1 Stored procedure declaration, SQL standard syntax

    CREATE PROCEDURE          /* always CREATE PROCEDURE or FU... */
    Sp_proc1                  /* typically names begin with Sp... */
    (param1 INT)              /* parenthesized parameter list */
    MODIFIES SQL DATA         /* SQL data access characteristi... */
    BEGIN
      DECLARE num1 INT;       /* variable declaration */
      IF param1 <> 0 THEN     /* IF statement */
        SET param1 = 1;       /* assignment statement */
      END IF;                 /* terminates IF block */
      UPDATE Table1 SET       /* ordinary SQL statement */
        column1 = param1;
    END

...

... Taking an invariant assignment out of a loop

host program
    A computer program, written in a non-SQL language, containing calls to an SQL API or containing embedded SQL statements.

hot spot
    A page in either the index or data file that every job wants...

... table, then there will also be a lock on the table itself, as a separate lock record

in-to-out
    A plan to process a subquery; for each row in the inner query, look up in the outer query.

isolation level (of a transaction)
    In standard SQL, a setting that determines the type of
