Planning Data Stores

The enterprise data architect helps an organization plan the most effective use of information throughout the organization. An organization's data store configuration includes multiple types of data stores, as illustrated in the following figure, each with a specific purpose:

■ Operational databases, or OLTP (online transaction processing) databases, collect first-generation transactional data that is essential to the day-to-day operation of the organization and unique to the organization. An organization might have an operational data store to serve each unit or function within it. Regardless of the organization's size, an organization with a single, focused purpose may very well have only one operational database.

For performance, operational stores are tuned for a balance of data retrieval and updates, so indexes and locking are key concerns. Because these databases receive first-generation data, they are subject to data update anomalies and benefit from normalization. A typical organizational data store configuration includes several operational data stores feeding multiple data marts and a single master data store (see the figure).

(Figure: Sales, Manufacturing, and mobile Sales operational databases, the last at an alternate location, feeding the Sales and Manufacturing data marts, the data warehouse, and a reference database.)

■ Caching data stores, sometimes called reporting databases, are optional read-only copies of all or part of an operational database. An organization might have multiple caching data stores to deliver data throughout the organization. Caching data stores might use SQL Server replication or log shipping to populate the database and are tuned for high-performance data retrieval.

■ Reference data stores are primarily read-only and store generic data that is required by the organization but seldom changes, much like the reference section of a library. Examples of reference data might be unit-of-measure conversion factors or ISO country codes. A reference data store is tuned for high-performance data retrieval.

■ Data warehouses collect large amounts of data from multiple data stores across the entire enterprise, using an extract-transform-load (ETL) process to convert the data from the various formats and schemas into a common format designed for ease of data retrieval. Data warehouses also serve as the archival location, storing historical data and relieving some of the data load from the operational data stores. The data is also pre-aggregated, making research and reporting easier, thereby improving the accessibility of information and reducing errors.

Because the primary task of a data warehouse is data retrieval and analysis, the data-integrity concerns present with an operational data store don't apply. Data warehouses are designed for fast retrieval and are not normalized like master data stores. They are generally designed using a basic star schema or snowflake design. Locks generally aren't an issue, and indexing is applied without adversely affecting inserts or updates.

Chapter 70, "BI Design," discusses star schemas and snowflake designs used in data warehousing.

The analysis process usually involves more than just SQL queries; it uses data cubes that consolidate gigabytes of data into dynamic pivot tables.
Business intelligence (BI) is the combination of the ETL process, the data warehouse data store, and the acts of creating and browsing cubes.

A common data warehouse is essential for ensuring that the entire organization researches the same data set and achieves the same result for the same query, a critical aspect of the Sarbanes-Oxley Act and other regulatory requirements.

■ Data marts are subsets of the data warehouse with pre-aggregated data organized specifically to serve the needs of one organizational group or one data domain.

■ Master data store, or master data management (MDM), refers to the data warehouse that combines data from throughout the organization. The primary purpose of the master data store is to provide a single version of the truth for organizations with a complex set of data stores and multiple data warehouses.

Smart Database Design

My career has focused on turning around database projects that were previously considered failures and on recommending solutions for ISV databases that are performing poorly. In nearly every case, the root cause of the failure was the database design. It was too complex, too clumsy, or just plain inadequate. Without exception, where I found poor database performance, I also found data modelers who insisted on modeling alone or who couldn't write SQL queries to save their lives.

Throughout my career, what began as an observation has been reinforced into a firm conviction: the database schema is the foundation of the database project, and an elegant, simple database design outperforms a complex one, both in terms of the development process and the final performance of the database application. This is the basic idea behind Smart Database Design.

While I believe in a balanced set of goals for any database, including performance, usability, data integrity, availability, extensibility, and security, all things being equal, the crown goes to the database that always provides the right answer with lightning speed.

Database system

A database system is a complex system. By complex, I mean that the system consists of multiple components that interact with one another, as shown in Figure 2-1. The performance of one component affects the performance of the other components and thus the entire system. Stated another way, the design of one component will set up the other components, and the whole system, either to work well together or to frustrate those trying to make the system work.

FIGURE 2-1: The database system is the collective effort of the server environment, maintenance jobs, the client application, and the database.

Instead of randomly trying performance tips (and the Internet has an overwhelming number of SQL Server performance and optimization tips), it makes more sense to think about the database as a system and then figure out how the components of the database system affect one another. You can then use this knowledge to apply the performance techniques in a way that provides the most benefit.

Every database system contains four broad technologies or components: the database itself, the server platform, the maintenance jobs, and the client's data access code, as illustrated in Figure 2-2.
Each component affects the overall performance of the database system:

■ The server environment is the physical hardware configuration (CPUs, memory, disk spindles, I/O bus), the operating system, and the SQL Server instance configuration, which together provide the working environment for the database. The server environment is typically optimized by balancing the CPUs, memory, and I/O, and by identifying and eliminating bottlenecks.

■ The database maintenance jobs are the steps that keep the database running optimally (index defragmentation, DBCC integrity checks, and maintaining index statistics).

■ The client application is the collection of data access layers, middle tiers, front-end applications, ETL (extract, transform, and load) scripts, report queries, and SSIS (SQL Server Integration Services) packages that access the database. These can not only affect the user's perception of database performance, but can also reduce the overall performance of the database system.

■ Finally, the database component includes everything within the data file: the physical schema, T-SQL code (queries, stored procedures, user-defined functions (UDFs), and views), indexes, and data.

FIGURE 2-2: Smart Database Design is the premise that an elegant physical schema makes the data intuitively obvious and enables writing great set-based queries that respond well to indexing. This in turn creates short, tight transactions, which improves concurrency and scalability, while reducing the aggregate workload of the database. This flow from layer to layer becomes a methodology for designing and optimizing databases.

All four database components must function well together to produce a high-performance database system; if one of the components is weak, then the database system will fail or perform poorly.

However, of these four components, the database itself is the most difficult component to design, and it drives the design of the other three. For example, the database workload determines the hardware requirements. Maintenance jobs and data access code are both designed around the database, and an overly complex database will complicate both the maintenance jobs and the data access code.

Physical schema

The base layer of Smart Database Design is the database's physical schema. The physical schema includes the database's tables, columns, primary and foreign keys, and constraints. Basically, the "physical" schema is what the server creates when you run data definition language (DDL) commands.

Designing an elegant, high-performance physical schema typically involves a team effort and requires numerous design iterations and reviews. Well-designed physical schemas avoid overcomplexity by generalizing similar types of objects, thereby creating a schema with fewer entities. While designing the physical schema, make the data obvious to the developer and easy to query. The prime consideration when converting the logical database design into a physical schema is how much work is required for a query to navigate the data structures while maintaining a correctly normalized design. Not only is such a schema a joy to use, but it also makes it easier to code correct queries, reducing the chance of data integrity errors caused by faulty queries.
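As a minimal sketch of these ideas (the Customer and SalesOrder tables below are hypothetical examples, not part of any sample database referenced in this book), the DDL declares the keys and constraints explicitly so that the relationships are obvious to anyone writing queries against the schema:

CREATE TABLE dbo.Customer (
    CustomerID   INT IDENTITY NOT NULL PRIMARY KEY,      -- surrogate primary key
    CustomerName NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.SalesOrder (
    SalesOrderID INT IDENTITY NOT NULL PRIMARY KEY,
    CustomerID   INT NOT NULL
        REFERENCES dbo.Customer (CustomerID),             -- foreign key protects data integrity
    OrderDate    DATETIME NOT NULL DEFAULT GETDATE(),
    IsArchived   BIT NOT NULL DEFAULT 0                   -- illustrative status flag
);

Because every relationship is declared, the navigation path from SalesOrder to Customer is self-evident, and a faulty join is far less likely.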
Other hallmarks of a well-designed schema include the following:

■ The primary and foreign keys are designed for raw physical performance.

■ Optional data (e.g., second address lines, name suffixes) is designed using patterns (nullable columns, surrogate nulls, or missing rows) that protect the integrity of the data both within the database and through the query.

Conversely, a poorly designed (either non-normalized or overly complex) physical schema encourages developers to write iterative code, code that uses temporary buckets to manipulate data, or code that will be difficult to debug or maintain.

Agile Modeling

Agile development is popular for good reasons. It gets the job done more quickly and often produces a better result than traditional methods. Agile development also fits well with database design and development.

The traditional waterfall process steps through four project phases: requirements gathering, design, development, and implementation. While this method may work well for some endeavors, when creating software, the users often don't know what they want until they see it, which pushes discovery beyond the requirements-gathering phase and into the development phase.

Agile development addresses this problem by replacing the single long waterfall with numerous short cycles, or iterations. Each iteration builds out a working model that can be tested, and enables users to play with the software and further discover their needs. When users see rapid progress and trust that new features can be added, they become more willing to allow features to be planned into the life cycle of the software, instead of insisting that every feature be implemented in the next version.

When I'm developing a database, each iteration is usually 2–5 days long and is a mini cycle of discovery, coding, unit testing, and more discoveries with the client. A project might consist of a dozen of these tight iterations; and with each iteration, more features are fleshed out in the database and code.

Set-based queries

SQL Server is designed to handle data in sets. SQL is a declarative language, meaning that the SQL query describes the problem and the Query Optimizer generates an execution plan to resolve the problem as a set.

Application programmers typically develop while-loops that handle data one row at a time. Iterative code is fine for application tasks such as populating a grid or combo box, but it is inappropriate for server-side code. Iterative T-SQL code, typically implemented via cursors, forces the database engine to perform thousands of wasteful single-row operations instead of handling the problem in one larger, more efficient set. The performance cost of these single-row operations is huge. Depending on the task, SQL cursors perform about half as well as set-based code, and the performance differential grows with the size of the data. This is why set-based queries, based on an obvious physical schema, are so critical to database performance.
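To make the contrast concrete, here is a small sketch using the hypothetical SalesOrder table defined earlier (the archiving scenario is likewise invented for illustration). The cursor version issues one UPDATE per row; the set-based version hands the optimizer the entire problem in a single statement:

-- Iterative approach: thousands of wasteful single-row operations
DECLARE @OrderID INT;
DECLARE OrderCursor CURSOR FOR
    SELECT SalesOrderID FROM dbo.SalesOrder WHERE OrderDate < '20080101';
OPEN OrderCursor;
FETCH NEXT FROM OrderCursor INTO @OrderID;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.SalesOrder SET IsArchived = 1 WHERE SalesOrderID = @OrderID;
    FETCH NEXT FROM OrderCursor INTO @OrderID;
END;
CLOSE OrderCursor;
DEALLOCATE OrderCursor;

-- Set-based approach: one statement, one pass over the data
UPDATE dbo.SalesOrder
SET IsArchived = 1
WHERE OrderDate < '20080101';

Both scripts produce the same result, but the set-based statement lets the Query Optimizer choose an efficient plan instead of being forced into row-by-row work.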
A good physical schema and set-based queries set up the database for excellent indexing, further improving the performance of the query (see Figure 2-2).

However, queries cannot overcome the errors of a poor physical schema and won't solve the performance issues of poorly written code. It's simply impossible to fix a clumsy database design by throwing code at it. Poor database designs tend to require extra code, which performs poorly and is difficult to maintain. Unfortunately, poorly designed databases also tend to have code that is tightly coupled (refers directly to tables), instead of code that accesses the database's abstraction layer (stored procedures and views). This makes it that much harder to refactor the database.

Indexing

An index is an organized pointer used to locate information in a larger collection. An index is useful only when it matches the needs of a question. In that case, it becomes the shortcut between a question and the right answer. The key is to design the fewest number of shortcuts between the right questions and the right answers.

A sound indexing strategy identifies a handful of queries that represent 90% of the workload and, with judicious use of clustered indexes and covering indexes, solves those queries without expensive bookmark lookup operations.
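Continuing the hypothetical SalesOrder example, a covering index carries the queried columns in the index itself, so a frequent query can be answered without a bookmark lookup against the base table (the INCLUDE clause shown here is available in SQL Server 2005 and later):

-- Covers queries that filter by customer and read only the included columns
CREATE NONCLUSTERED INDEX IX_SalesOrder_CustomerID
    ON dbo.SalesOrder (CustomerID)
    INCLUDE (OrderDate, IsArchived);

-- Answered entirely from the index; no lookup into the base table is needed
SELECT OrderDate, IsArchived
FROM dbo.SalesOrder
WHERE CustomerID = 42;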
An elegant physical schema, well-written set-based queries, and excellent indexing reduce transaction duration, which implicitly improves concurrency and sets up the database for scalability.

Nevertheless, indexes cannot overcome the performance difficulties of iterative code. Poorly written SQL code that returns unnecessary columns is much more difficult to index and will likely not take advantage of covering indexes. Moreover, it's extremely difficult to properly index an overly complex or non-normalized physical schema.

Concurrency

SQL Server, as an ACID-compliant database engine, supports transactions that are atomic, consistent, isolated, and durable. Whether the transaction is a single statement or an explicit transaction within BEGIN TRAN ... COMMIT TRAN statements, locks are typically used to prevent one transaction from seeing another transaction's uncommitted data. Transaction isolation is great for data integrity, but locking and blocking hurt performance.

Multi-user concurrency can be tuned by limiting the extraneous code within logical transactions, setting the transaction isolation level no higher than required, keeping trigger code to a minimum, and perhaps using snapshot isolation.

A database with an excellent physical schema, well-written set-based queries, and the right set of indexes will have tight transactions and perform well with multiple users. When a poorly designed database displays symptoms of locking and blocking issues, no amount of transaction isolation level tuning will solve the problem. The sources of the concurrency issue are the long transactions and the additional workload caused by a poor database schema, a lack of set-based queries, or missing indexes. Concurrency tuning cannot overcome the deficiencies of a poor database design.

Advanced scalability

With each release, Microsoft has consistently enhanced SQL Server for the enterprise, and these technologies can extend the scalability of heavy-transaction databases.

The Resource Governor, new in SQL Server 2008, can restrict the resources available to different sets of queries, enabling the server to maintain the service-level agreement (SLA) for some queries at the expense of other, less critical queries.

Indexed views were introduced in SQL Server 2000. They actually materialize the view as a clustered index and can enable queries to select from joined data without hitting the joined tables, or to pre-aggregate data. In effect, an indexed view is a custom covering index that can cover across multiple tables.
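As a brief sketch (again reusing the hypothetical Customer and SalesOrder tables), an indexed view must be created WITH SCHEMABINDING and is then materialized by building a unique clustered index on it:

CREATE VIEW dbo.vCustomerOrderCounts
WITH SCHEMABINDING
AS
SELECT c.CustomerID,
       COUNT_BIG(*) AS OrderCount      -- COUNT_BIG(*) is required in an aggregated indexed view
FROM dbo.SalesOrder AS o
    JOIN dbo.Customer AS c ON o.CustomerID = c.CustomerID
GROUP BY c.CustomerID;
GO

-- The unique clustered index materializes the aggregated rows
CREATE UNIQUE CLUSTERED INDEX IX_vCustomerOrderCounts
    ON dbo.vCustomerOrderCounts (CustomerID);

Once the index exists, queries that count orders per customer can be satisfied from the materialized view without touching either base table.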
Partitioned tables can automatically segment data across multiple filegroups, which can serve as an auto-archive device. By reducing the size of the active data partition, the requirements for maintaining the data, such as defragmenting the indexes, are also reduced.
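A minimal sketch of the idea follows; the boundary dates, the filegroup names (fgArchive and fgCurrent), and the table itself are assumptions made purely for illustration, and the filegroups would have to exist before the script is run:

-- Partition function: splits rows into date ranges
CREATE PARTITION FUNCTION pfOrderYear (DATETIME)
    AS RANGE RIGHT FOR VALUES ('20070101', '20080101');

-- Partition scheme: maps each range to a filegroup
CREATE PARTITION SCHEME psOrderYear
    AS PARTITION pfOrderYear TO (fgArchive, fgArchive, fgCurrent);

-- Rows are placed in the appropriate partition automatically, based on OrderDate
CREATE TABLE dbo.SalesOrderHistory (
    SalesOrderID INT NOT NULL,
    OrderDate    DATETIME NOT NULL
) ON psOrderYear (OrderDate);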
Service Broker can collect transactional data and process it after the fact, thereby providing an "over-time" load leveling as it spreads a five-second peak load over a one-minute execution without delaying the calling transaction.

While these high-scalability features can extend the scalability of a well-designed database, they are limited in their ability to add performance to a poorly designed database, and they cannot overcome long transactions caused by a lack of indexes, iterative code, or the many other problems caused by an overly complex database design.

The database component is the principal factor determining the overall monetary cost of the database. A well-designed database minimizes hardware costs, simplifies data access code and maintenance jobs, and significantly lowers both the initial and the total cost of the database system.

A performance framework

By describing the dependencies between the schema, queries, indexing, transactions, and scalability, Smart Database Design is a framework for performance. The key to mastering Smart Database Design is understanding the interaction, or cause-and-effect relationship, between these hierarchical layers (schema, queries, indexing, concurrency). Each layer enables the next layer; conversely, no layer can overcome deficiencies in lower layers. The practical application of Smart Database Design takes advantage of these dependencies when developing or optimizing a database by employing the right best practices within each layer to support the next layer.

Reducing the aggregate workload of the database component has a positive effect on the rest of the database system. An efficient database component reduces the performance requirements of the server platform, increasing capacity. Maintenance jobs are easier to plan and also execute faster when the database component is designed well. There is less client access code to write, and the code that does need to be written is easier to write and maintain. The result is an overall database system that's simpler to maintain, cheaper to run, easier to connect to from the data access layer, and that scales beautifully.

Although it's not a perfect analogy, picturing a water fountain on a hot summer day can help demonstrate how shorter transactions improve overall database performance. If everyone takes a small, quick sip from the fountain, then no queue forms; but as soon as someone fills up a liter-sized Big Gulp cup, others begin to wait. Regardless of the amount of hardware resources available to a database, time is finite, and the greatest performance gain is obtained by eliminating the excess work of wastefully long transactions, or throwing away the Big Gulp cup. The quick sips of a well-designed query hitting an elegant, properly indexed database will outperform, and be significantly easier on the budget than, the Big Gulp cup: a poorly written query or cursor running against a poorly designed database that is missing an index.

Striving for database design excellence is a smart business move with an excellent estimated return on investment. From my experience, every day spent on database design saves two to three months of development and maintenance time. In the long term, it's far cheaper to design the database correctly than to throw money or labor at project overruns or hardware upgrades.

The cause-and-effect relationship between the layers helps diagnose performance problems as well. When a system is experiencing locking and blocking problems, the cause is likely found in the indexing or query layers. I've seen databases that were drowning under the weight of poorly written code. However, the root cause wasn't the code; it was the overly complex, anti-normalized database design that was driving the developers to write horrid code.

The bottom line? Designing an elegant database schema is the first step in maximizing the performance of the overall database system, while reducing costs.

Issues and objections

I've heard objections to the Smart Database Design framework, and I'd like to address them here.

Some say that buying more hardware is the best way to improve performance. I disagree. More hardware only masks the problem until it explodes later. Performance problems tend to grow exponentially as database size grows, whereas hardware performance grows more or less linearly over time. One can almost predict when even the "best" hardware available will no longer suffice to deliver acceptable performance. In several cases, I've seen companies spend incredible amounts to upgrade their hardware and see little or no improvement because the bottleneck was the transaction locking and blocking and the poor code. Sometimes, a faster CPU only waits faster. Strategically, reducing the workload is cheaper than increasing the capacity of the hardware.

Some claim that fixing one layer can overcome deficiencies in lower layers. It's true that a poor schema will perform better when properly indexed than without indexes. However, adding the indexes doesn't really solve the deficiencies; it only masks them. The code is still doing extra work to compensate for the poor schema. The cost of developing code and designing correct indexes is still higher for the poor schema. Any data integrity or extensibility risks are still there.

Some argue that they would like to apply Smart Database Design but can't because the database is a third-party database and they can't modify the schema or the code. True, for most third-party products, the database schema and queries are not open for optimization, and this can be very frustrating if the database needs optimization. However, most vendors are interested in improving their product and keeping their clients happy. Both clients and vendors have contracted with me to help identify areas of opportunity and suggest solutions for the next revision.

Some say they'd like to apply Smart Database Design but can't because any change to the schema would break hundreds of other objects. It's true that databases without abstraction layers are expensive to alter. An abstraction layer decouples the database from the client applications, making it possible to change the database component without affecting the client applications. In the absence of a well-designed abstraction layer, the first step toward gaining system performance is to create one. As expensive as it may seem to refactor the database and every application so that all communications go through an abstraction layer, the cost of not doing so could very well be that IT can't respond to the organization's needs, forcing the company to outsource or to develop wasteful extra databases. At worst, the failure of the database to be extensible could force the end of the organization.

In both the case of the third-party database and the case of the missing abstraction layer, it's still a good idea to optimize at the lowest level possible and then move up the layers; but the best performance gains are made when you can start optimizing at the lowest level of the database component, the physical schema.

Some say that a poorly designed database can be fixed by adding more layers of code and converting the database to an SOA-style application. I disagree. The database should be refactored with a clean, normalized design and a proper abstraction layer. This will reduce the overall workload and solve a host of usability and performance issues far better than simply wrapping a poorly designed database with more code.

Summary

When introducing the optimization chapter in her book Inside SQL Server 2000, Kalen Delaney correctly writes that optimization can't be added to a database after it has been developed; it has to be designed into the database from the beginning.

This chapter presented the concept of the Information Architecture Principle, unpacked the six database objectives, and then discussed Smart Database Design, showing the dependencies between the layers and how each layer enables the next. In a chapter packed with ideas, I'd like to highlight the following:

■ The database architect position should be equally involved in the enterprise-level design and the project-level designs.

■ Any database design or implementation can be measured by six database objectives: usability, extensibility, data integrity, performance, availability, and security. These objectives don't have to compete; it's possible to design an elegant database that meets all six.

■ Each day spent on the database design will save three months later.

■ Extensibility is the most expensive database objective to correct after the fact. A brittle database, one that has ad hoc SQL directly accessing the tables from the client, is the worst design possible. It's simply impossible to fix a clumsy database design by throwing code at it.

■ Smart Database Design is the premise that an elegant physical schema makes the data intuitively obvious and enables writing great set-based queries that respond well to indexing. This in turn creates short, tight transactions, which improves concurrency and scalability while reducing the aggregate workload of the database. This flow from layer to layer becomes a methodology for designing and optimizing databases.

■ Reducing the aggregate workload of the database has a greater positive effect than buying more hardware.

From this overview of data architecture, the next chapter digs deeper into the concepts and patterns of relational database design, which are critical for usability, extensibility, data integrity, and performance.