Normalizing the relations in a database separates entities into their own relations and makes it possible for you to enter, modify, and delete data without disturbing entities other than the one directly being modified. However, normalization is not without its downside.
When you split relations so that relationships are represented by matching primary and foreign keys, you force the DBMS to perform matching operations between relations whenever a query requires data from more than one table. For example, in a normalized database you store data about an order in one relation, data about a customer in a second relation, and data about the order lines in yet a third relation. The operation typically used to bring the data into a single table so you can prepare an output such as an invoice is known as a join.
In theory, a join looks for rows with matching values between two tables and creates a new row in a result table every time it finds a match. In practice, however, performing a join involves manipulating more data than the simple combination of the two tables being joined would suggest. Joins of large tables (those of more than a few hundred rows) can significantly slow down the performance of a DBMS.
To understand what can happen, you need to know something about the relational algebra join operation. As with all relational algebra operations, the result of a join is a new table.
Note: Relational algebra is a set of operations used to manipulate and extract data from relations. Each operation performs a single manipulation of one or two tables. To complete a query, a DBMS uses a sequence of relational algebra operations; relational algebra is therefore procedural. SQL, on the other hand, is based on the relational calculus, which is nonprocedural, allowing you to specify what you want rather than how to get it. A single SQL retrieval command can require a DBMS to perform any or all of the operations in the relational algebra.
6.8.1 Equi-Joins
In its most common form, a join forms new rows when data in the two source tables match. Because we are looking for rows with equal values, this type of join is known as an equi-join (or a natural equi-join). As an example, consider the two tables in Figure 6.6.
Notice that the ID number column is the primary key of the customers table and that the same column is a foreign key in the orders table. The ID number column in orders therefore serves to relate orders to the customers to which they belong.
Assume that you want to see the names of the customers who placed each order. To do so, you must join the two tables, creating combined rows wherever there is a matching ID number. In database terminology, we are joining the two tables over ID number. The result table can be found in Figure 6.7.
An equi-join can begin with either source table. (The result should be the same regardless of the direction in which the join is performed.) The join compares each row in one source table with the rows in the second. Each time a row in the first table matches a row in the second in the column or columns over which the join is being performed, a new row is placed in the result table.
FIGURE 6.6
Two tables with a primary key–foreign key relationship.
FIGURE 6.7
The joined result table.
Assuming that we are using the customers table as the first source table, producing the result table in Figure 6.7 might therefore proceed conceptually as follows:
1. Search orders for rows with an ID number of 001. Because there are no matching rows in orders, do not place a row in the result table.
2. Search orders for rows with an ID number of 002. There are two matching rows in orders. Create two new rows in the result table, placing the same customer information at the end of each row in orders.
3. Search orders for rows with an ID number of 003. There is one matching row in orders. Place one new row in the result table.
4. Search orders for rows with an ID number of 004. There are two matching rows in orders. Place two rows in the result table.
5. Search orders for rows with an ID number of 005. There are no matching rows in orders. Therefore, do not place a row in the result table.
6. Search orders for rows with an ID number of 006. There are three matching rows in orders. Place three rows in the result table.
Notice that if an ID number does not appear in both tables, then no row is placed in the result table. This behavior categorizes this type of join as an inner join.
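The conceptual walkthrough above can be sketched as a pair of nested loops. This is only an illustration, not how any particular DBMS implements a join. The customer names and order numbers below are hypothetical, because the contents of Figure 6.6 are not reproduced here; only the ID numbers and the number of matching orders per customer follow the six steps in the text.

```python
# Hypothetical sample data: names and order numbers are invented for
# illustration. ID numbers and match counts follow the walkthrough
# (002 has two orders, 003 one, 004 two, 006 three; 001 and 005 none).
customers = [
    ("001", "Customer 1"),
    ("002", "Customer 2"),
    ("003", "Customer 3"),
    ("004", "Customer 4"),
    ("005", "Customer 5"),
    ("006", "Customer 6"),
]
orders = [
    (1001, "002"),
    (1002, "002"),
    (1003, "003"),
    (1004, "004"),
    (1005, "004"),
    (1006, "006"),
    (1007, "006"),
    (1008, "006"),
]


def equi_join(customers, orders):
    """Inner equi-join of customers and orders over the ID number."""
    result = []
    for id_numb, name in customers:               # one step per customer
        for order_numb, order_id_numb in orders:  # search orders for matches
            if order_id_numb == id_numb:
                # Place the customer information at the end of the order row.
                result.append((order_numb, order_id_numb, name))
    return result


joined = equi_join(customers, orders)
for row in joined:
    print(row)
```

Customers 001 and 005 contribute nothing to `joined`, which is the inner join behavior described above.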
6.8.2 What Is Really Going On: Product and Restrict
From a relational algebra point of view, a join can be implemented using two other operations: product and restrict. As you will see, this sequence of operations requires the manipulation of a great deal of data and, if implemented by a DBMS, can result in very slow query performance.
The restrict operation retrieves rows from a table by matching each row against logical criteria (a predicate). Those rows that meet the criteria are placed in the result table; those that do not meet the criteria are omitted.
The product operation (the mathematical Cartesian product) makes every possible pairing of rows from two source tables. In Figure 6.8, for example, the product of the customers and orders tables produces a result table with 48 rows (the six customers times the eight orders). The ID number column appears twice because it is a part of both source tables.
Note: Although 48 rows may not seem like a lot, consider the size of a product table created from tables with 100 and 1000 rows: 100 × 1000 = 100,000 rows. Manipulating a table of this size can tie up a lot of disk input/output and central processing unit (CPU) time.
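The growth of the product table is simple multiplication, and a short sketch makes the arithmetic concrete. The row contents here are placeholders (plain integers standing in for rows); only the row counts come from the text.

```python
from itertools import product


def product_size(rows_a, rows_b):
    """Number of rows in the Cartesian product of two tables."""
    return rows_a * rows_b


# Materialize the small case: six customer rows paired with eight order rows.
# range() values are placeholders for real rows.
small_product = list(product(range(6), range(8)))
print(len(small_product))        # 48

# For larger tables, compute the size instead of building the product.
print(product_size(100, 1000))   # 100000
```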
FIGURE 6.8
The product of the customers and orders tables.
In some rows, the ID number is the same. These are the rows that would have been included in a join. We can therefore apply a restrict predicate to the product table to end up with the same table provided by the join you saw earlier. The predicate's logical condition can be written as:
customers.id_numb = orders.id_numb
The rows that are selected by this predicate appear in black in Figure 6.9; those eliminated by the predicate are in gray. Notice that the black rows are exactly the same as those in the result table of the join (Figure 6.7).
FIGURE 6.9
The product of the customers and orders tables after applying a restrict predicate.
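The product-then-restrict sequence can be sketched in a few lines. As before, the customer names and order numbers are hypothetical stand-ins for the figure data; the ID numbers and row counts follow the text.

```python
from itertools import product

# Hypothetical sample data (names and order numbers are invented).
customers = [(f"{n:03d}", f"Customer {n}") for n in range(1, 7)]
order_ids = ["002", "002", "003", "004", "004", "006", "006", "006"]
orders = [(1001 + i, id_numb) for i, id_numb in enumerate(order_ids)]

# Step 1: product. Every customer row is paired with every order row.
product_table = [cust + order for cust, order in product(customers, orders)]

# Step 2: restrict. Keep only rows where the two ID number columns match,
# that is, customers.id_numb = orders.id_numb.
# Row layout: (customers.id_numb, name, order_numb, orders.id_numb)
restricted = [row for row in product_table if row[0] == row[3]]

print(len(product_table))   # 48 rows built
print(len(restricted))      # 8 rows kept
```

All 48 rows are built even though only 8 survive the predicate, which is the overhead the text describes.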
Note: Although this may seem like a highly inefficient way to implement a join, it is actually quite flexible, in particular because the comparison between the columns over which the join is being performed does not have to be equality. A user could just as easily request a join where the value in table A was greater than the value in table B, and so on.
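That flexibility can be seen by passing the predicate in as a parameter. The function name `theta_join` and the two single-column tables below are hypothetical, chosen only to show that the same product-and-restrict machinery handles both an equality and a greater-than comparison.

```python
import operator
from itertools import product


def theta_join(table_a, table_b, predicate):
    """Product followed by restrict, with an arbitrary comparison."""
    return [(a, b) for a, b in product(table_a, table_b) if predicate(a, b)]


# Hypothetical single-column tables.
table_a = [3, 5, 9]
table_b = [3, 7]

equal_rows = theta_join(table_a, table_b, operator.eq)    # A = B
greater_rows = theta_join(table_a, table_b, operator.gt)  # A > B

print(equal_rows)     # [(3, 3)]
print(greater_rows)   # [(5, 3), (9, 3), (9, 7)]
```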
6.8.3 The Bottom Line
Because of the processing overhead created when performing a join, some database designers make a conscious decision to leave tables unnormalized. For example, if Lasers Only always accessed the line items at the same time it accessed order information, then a designer might choose to combine the line item and order data into one table, knowing full well that the unnormalized relation exhibits anomalies. The benefit is that retrieval of order information will be faster than if it were split into two tables.
Should you leave unnormalized relations in your database to achieve better retrieval performance? In this author's opinion, there is rarely any need to do so.
Assuming that you are working with a relatively standard DBMS that supports SQL as its query language, there are SQL syntaxes that you can use when writing queries that avoid joins. That being the case, it does not seem worth the problems that unnormalized relations present to leave them in the database. Careful writing of retrieval queries can provide performance that is nearly as good as that of retrieval from unnormalized relations.
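The text does not name the syntaxes it has in mind. One candidate, shown here purely as an assumption, is a subquery with IN, which answers "which customers have placed orders?" without writing a join. The sketch uses Python's built-in sqlite3 module with hypothetical table contents; how any given DBMS actually executes either query is up to its optimizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id_numb TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_numb INTEGER PRIMARY KEY,
        id_numb TEXT REFERENCES customers (id_numb)
    );
""")

# Hypothetical sample data (names and order numbers are invented).
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(f"{n:03d}", f"Customer {n}") for n in range(1, 7)],
)
order_ids = ["002", "002", "003", "004", "004", "006", "006", "006"]
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1001 + i, id_numb) for i, id_numb in enumerate(order_ids)],
)

# Written as a join.
with_join = conn.execute("""
    SELECT DISTINCT customers.id_numb, customers.name
    FROM customers JOIN orders ON customers.id_numb = orders.id_numb
    ORDER BY customers.id_numb
""").fetchall()

# Written with a subquery instead of a join.
with_subquery = conn.execute("""
    SELECT id_numb, name
    FROM customers
    WHERE id_numb IN (SELECT id_numb FROM orders)
    ORDER BY id_numb
""").fetchall()

conn.close()
print(with_join == with_subquery)   # True
```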
Note: For a complete discussion of writing SQL queries to avoid joins, see Harrington's book, SQL Clearly Explained, Second Edition, also published by Morgan Kaufmann.