Creating the Physical Database Schema

IN THIS CHAPTER

Creating the database files
Creating tables
Creating primary and foreign keys
Configuring constraints
Creating the user data columns
Documenting the database schema
Creating indexes

The longer I work with databases, the more I become convinced that the real magic is the physical schema design. No aspect of application development has more potential to derail an application or enable it to soar than the physical schema, not even indexing. (This idea is crucial to my view of Smart Database Design as expressed in Chapter 2, "Data Architecture.")

The primary features of the application are designed at the data schema level. If the data schema supports a feature, then the code will readily bring the feature to life; but if the feature is not designed into the tables, then the client application can jump through as many hoops as you can code and it will never work right.

The logical database schema, discussed in Chapter 3, "Relational Database Design," is a necessary design step to ensure that the business requirements are well understood. However, a logical design has never stored nor served up any data. In contrast, the physical database schema is an actual data store that must meet the Information Architecture Principle's call to make information "readily available in a usable format for daily operations and analysis by individuals, groups, and processes." It's the physical design that meets the database objectives of usability, scalability, integrity, and extensibility.

This chapter first discusses designing the physical database schema and then focuses on the actual implementation of the physical design:

■ Creating the database files
■ Creating the tables
■ Creating the primary and foreign keys
■ Creating the data columns
■ Adding data-integrity constraints
■ Creating indexes (although indexes can easily be added or modified after the physical schema is implemented)

While this is the chapter entitled "Creating the Physical Database Schema," it's actually just the core chapter of 10 chapters that encompass the design and creation of databases. This chapter focuses on the syntax of creating the database, whereas Chapter 3, "Relational Database Design," discusses how to design a logical schema. Similarly, while this chapter covers the syntax and mechanics of creating indexes, Chapter 64, "Indexing Strategies," explores how to fine-tune indexes for performance. Part III, "Beyond Relational," has three chapters related to database design: Chapters 17–19 discuss modeling and working with data that's not traditionally thought of as relational, including hierarchical data, spatial data, XML data, full-text searches, and storing BLOBs using Filestream. Implementation of the database often considers partitioning the data, which is covered in Chapter 68, "Partitioning." And whereas this chapter focuses on building relational databases, Chapter 70, "BI Design," covers creating data warehouses.

What's New with the Physical Schema?

SQL Server 2008 supports several new data types, and an entirely new way of storing data in the data pages: sparse columns. Read on!
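As a quick taste of sparse columns, here is a minimal sketch; the table and column names are hypothetical, chosen only for illustration rather than taken from the book's sample databases:

CREATE TABLE dbo.ProductAttribute (
    ProductAttributeID INT IDENTITY NOT NULL PRIMARY KEY,
    ProductName VARCHAR(50) NOT NULL,
    -- Sparse columns must be nullable; they consume no storage when NULL,
    -- which suits attributes that apply to only a few rows
    Color VARCHAR(20) SPARSE NULL,
    Voltage INT SPARSE NULL
);

Sparse columns trade free NULLs for slightly more expensive non-NULL values, so they pay off only when the column is NULL for most rows.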
Designing the Physical Database Schema

If there's one area where I believe the SQL Server community is lacking in skills and emphasis, it's translating a logical design into a decent physical schema. When designing the physical schema, the design team should begin with a clean logical design and/or well-understood and documented business rules, and then brainstorm until a simple design emerges that performs excellently, is extensible and flexible, has appropriate data integrity, and is usable by all those who will consume the data. I firmly believe it's not a question of compromising one database attribute for another; all these goals can be met by a well-designed, elegant physical schema.

Translating the logical database schema into a physical database schema may involve the following changes:

■ Converting complex logical designs into simpler, more agile table structures
■ Converting logical many-to-many relationships into two physical one-to-many relationships with an associative, or junction, table (sketched in code at the end of this section)
■ Converting logical composite primary keys into surrogate (computer-generated) single-column primary keys
■ Converting the business rules into constraints or triggers, or, better yet, into data-driven designs

Logical to physical options

Every project team develops the physical database schema drawing from these two disciplines (logical data modeling and physical schema design) in one of the following possible combinations:

■ A logical database schema is designed and then implemented without the benefit of physical schema development. This plan is a sure way to develop a slow and unwieldy database schema. The application code will be frustrating to write, and the code will not be able to overcome the performance limitations of the design.
■ A logical database schema is developed to ensure that the business requirements are understood. Based on the logical design, the database development team develops a physical database schema. This method can result in a fast, usable schema. Developing the schema in two stages is a good plan if the development team is large enough that one team designs and collects the business requirements while another team develops the physical database schema. Make sure that having a completed logical database schema does not squelch the team's creativity as the physical database schema is designed.
■ The third combination of logical and physical design methodologies merges the two into a single development step, as the database development team develops a physical database schema directly from the business requirements. This method can work well provided that the design team fully understands logical database modeling, physical database modeling, and advanced query design.

The key task in designing a physical database schema is brainstorming multiple possible designs, each of which meets the user requirements and ensures data integrity. Each design is evaluated based on its simplicity, the performance of possible query paths, flexibility, and maintainability.
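As promised in the conversion list above, here is a minimal sketch of the junction-table and surrogate-key conversions; the Customer and Event entities, and every column name, are hypothetical examples rather than objects from the book's sample databases:

CREATE TABLE dbo.Customer (
    CustomerID INT IDENTITY NOT NULL PRIMARY KEY,   -- surrogate key
    CustomerName NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.Event (
    EventID INT IDENTITY NOT NULL PRIMARY KEY,      -- surrogate key
    EventName NVARCHAR(100) NOT NULL
);

-- The logical many-to-many relationship between Customer and Event becomes
-- two one-to-many relationships through this associative (junction) table
CREATE TABLE dbo.CustomerEvent (
    CustomerEventID INT IDENTITY NOT NULL PRIMARY KEY,
    CustomerID INT NOT NULL REFERENCES dbo.Customer (CustomerID),
    EventID INT NOT NULL REFERENCES dbo.Event (EventID)
);

Each row in CustomerEvent records one customer attending one event, and the single-column surrogate keys replace the composite meaningful keys a purely logical model would tend to produce.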
Refining the data patterns

The key to simplicity is refining the entity definition with a lot of team brainstorming so that each table does more work, rearranging the data patterns until an elegant and simple pattern emerges. This is where a broad repertoire of database experience aids the design process.

Often the solution is to view the data from multiple angles, finding the commonality among them. Users are too close to the data, and they seldom correctly identify the true entities. What a user might see as multiple entities, a database design team might model as a single entity with dynamic roles. Combining this quest for simplicity with some data-driven design methods can yield normalized databases with higher data integrity, more flexibility and agility, and dramatically fewer tables.

Designing for performance

A normalized logical database design without the benefit of physical database schema optimization will perform poorly, because the logical design alone doesn't consider performance. Issues such as lock contention, composite keys, excessive joins for common queries, and table structures that are difficult to update are just some of the problems that a logical design might bring to the database.

Designing for performance is greatly influenced by the simplicity or complexity of the design. Each unnecessary complexity requires additional code, extra joins, and breeds even more complexity.

One particular performance decision concerns the primary keys. Logical database designs tend to create composite, meaningful primary keys. The physical schema can benefit from redesigning these as single-column surrogate (computer-generated) keys. The section on creating primary keys later in this chapter discusses this in more detail.

Responsible denormalization

A popular myth is that the primary task of translating a logical design into a physical schema is denormalization. Denormalization, purposefully breaking the normal forms, is the technique of duplicating data within the database to make it easier to retrieve. Interestingly, the Microsoft Word spell checker suggests replacing "denormalization" with "demoralization." Within the context of a transactional, OLTP database, I couldn't agree more.

Normalization is described in Chapter 3, "Relational Database Design."

Here are two examples of denormalizing a data structure: including the customer name in an [Order] table would enable retrieving the customer name when querying an order without joining to the Customer table, and including the CustomerID in a ShipDetail table would enable joining directly from the ShipDetail table to the Customer table while bypassing the OrderDetail and [Order] tables. Both of these examples violate normalization because the attributes don't depend on the primary key.

Some developers regularly denormalize portions of the database in an attempt to improve performance. While it might seem that this would improve performance because it reduces the number of joins, I have found that in practice the additional code (procedures, triggers, constraints, etc.) required to keep the data consistent, or to renormalize data for use in set-based queries, actually costs performance. In my consulting practice, I've tested a normalized design vs. a denormalized design several times. In every case the normalized design was about 15% faster than the denormalized design.
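To picture the ShipDetail example above as queries, here is a hedged sketch of the two join paths; every table and column name is an assumption made for illustration, not an actual sample-database schema:

-- Normalized path: walk up through OrderDetail and [Order] to reach Customer
SELECT C.CustomerName
  FROM dbo.ShipDetail AS SD
    JOIN dbo.OrderDetail AS OD ON SD.OrderDetailID = OD.OrderDetailID
    JOIN dbo.[Order] AS O ON OD.OrderID = O.OrderID
    JOIN dbo.Customer AS C ON O.CustomerID = C.CustomerID;

-- Denormalized shortcut: the duplicated CustomerID column in ShipDetail
-- skips two joins, at the cost of keeping that extra copy consistent
SELECT C.CustomerName
  FROM dbo.ShipDetail AS SD
    JOIN dbo.Customer AS C ON SD.CustomerID = C.CustomerID;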
Best Practice

There's a common saying in the database field: "Normalize 'til it hurts, then denormalize 'til it works." Poppycock! That's a saying of data modelers who don't know how to design an efficient physical schema. I never denormalize apart from the two cases identified here as responsible denormalization:

■ Denormalize aggregate data, such as account balances or inventory-on-hand quantities, within OLTP databases for performance, even though such data could be calculated from the inventory transaction table or the account transaction ledger table. These aggregates may be calculated using a trigger or a persisted computed column.
■ If the data is not original and is primarily there for OLAP or reporting purposes, data consistency is not the primary concern. For performance, denormalization is a wise move.

The architecture of the databases, and which databases or tables are being used for which purpose, are the driving factors in any decision to denormalize a part of the database.

If the database requires both OLTP and OLAP, the best solution might be to create a few tables that duplicate data for their own distinct purposes. The OLTP side might need its own tables to maintain the data, but the reporting side might need that same data in a single, wide, fast table from which it can retrieve data without any joins or locking concerns. The trick is to correctly populate the denormalized data in a timely manner.

Indexed views are basically denormalized clustered indexes. Chapter 64, "Indexing Strategies," discusses setting up an indexed view. Chapter 70, "BI Design," includes advice on creating a denormalized reporting database and data warehouse.

Designing for extensibility

Maintenance over the life of the application will cost significantly more than the initial development. Therefore, during the initial development process you should consider it a primary objective to make the physical design, code, and data as easy as possible to maintain. The following techniques may reduce the cost of database maintenance:

■ Enforce a strong T-SQL-based abstraction layer.
■ Always normalize the schema.
■ Favor data-driven designs; they are more flexible, and therefore more extensible, than rigid designs.
■ Use a consistent naming convention.
■ Avoid overly complex and unwieldy data structures when simpler data structures will suffice.
■ Develop with scripts instead of using Management Studio's UI (a rerunnable-script sketch appears after the CREATE DATABASE example below).
■ Enforce the data-integrity constraints from the beginning. Polluted data is a bear to clean up after even a short time of loose data-integrity rules.
■ Develop the core feature first, and once that's working, add the bells and whistles.
■ Document not only how a procedure works, but also why it works.

Creating the Database

The database is the physical container for all database schemas, data, and server-side programming. SQL Server's database is a single logical unit, even though it may exist in several files. Database creation is one of those areas in which SQL Server requires little administrative work, but you may decide instead to fine-tune the database files with more sophisticated techniques.

The Create DDL command

Creating a database using the default parameters is very simple. The following data definition language (DDL) command is taken from the Cape Hatteras Adventures sample database:

CREATE DATABASE CHA2;

The CREATE command will create a data file with the name provided and a .mdf file extension, as well as a transaction log with an .ldf extension.
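In keeping with the earlier advice to develop with scripts, a creation script is usually written so it can be rerun. Here is a minimal sketch; it assumes the sample database can safely be dropped and re-created, and that no other connections are holding it open:

IF EXISTS (SELECT * FROM sys.databases WHERE name = 'CHA2')
    DROP DATABASE CHA2;    -- remove any previous copy of the sample database
CREATE DATABASE CHA2;      -- re-create it with the default options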
Of course, more parameters and options are available than the previous basic CREATE command suggests. By default, the database is created as follows:

■ Default collation: the server collation
■ Initial size: a data file of 3MB and a transaction log of 1MB
■ Location: the default location for the data file and transaction log is determined during setup and can be changed on the Database Settings page of the Server Properties dialog

While these defaults might be acceptable for a sample or development database, they are sorely inadequate for a production database. Better alternatives are explained as the CREATE DATABASE command is covered.

Using Object Explorer, creating a new database requires only that the database name be entered in the New Database form, as shown in Figure 20-1. Use the New Database menu command from the Databases node's context menu to open the New Database form.

FIGURE 20-1: The simplest way to create a new database is by entering the database name in Object Explorer's New Database page.

The New Database page includes several individual subpages: the General, Options, and Filegroups pages, as shown in Table 20-1. For existing databases, the Files, Permissions, Extended Properties, Mirroring, and Log Shipping pages are added to the Database Properties page (not shown).

TABLE 20-1: Database Property Pages

Page: General
  New database: Create a new database, setting the name, owner, collation, recovery model, full-text indexing, and data file properties.
  Existing database: View (read-only) general properties: name, last backup, size, collation.

Page: Files
  New database: n/a
  Existing database: View and modify the database owner, collation, recovery model, full-text indexing, and database files.

Page: Filegroups
  New database: View and modify filegroup information.
  Existing database: View and modify filegroup information.

Page: Options
  New database: View and modify database options such as auto shrink, ANSI settings, page verification method, and single-user access.
  Existing database: View and modify database options such as auto shrink, ANSI settings, page verification method, and single-user access.

Page: Permissions
  New database: n/a
  Existing database: View and modify server roles, users, and permissions. See Chapter 50, "Authorizing Securables," for more details.

Page: Extended Properties
  New database: n/a
  Existing database: View and modify extended properties.

Page: Mirroring
  New database: n/a
  Existing database: View and configure database mirroring, covered in Chapter 47, "Mirroring."

Page: Transaction Log Shipping
  New database: n/a
  Existing database: View and configure log shipping, covered in Chapter 46, "Log Shipping."

Database-file concepts

A database consists of two files (or two sets of files): the data file and the transaction log. The data file contains all system and user tables, indexes, views, stored procedures, user-defined functions, triggers, and security permissions.

The write-ahead transaction log is central to SQL Server's design. All updates to the data file are first written and verified in the transaction log, ensuring that all data updates are written to two places.

Never store the transaction log on the same disk subsystem as the data file. For the sake of the transactional-integrity ACID properties and the recoverability of the database, it's critical that a failing disk subsystem not be able to take out both the data file and the transaction log file.
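A quick way to audit current file placement across the server is to query the sys.master_files catalog view, which contains a row for every file of every database; a sketch:

-- Sketch: one row per database file, with its type and physical location
SELECT DB_NAME(database_id) AS DatabaseName,
       type_desc,                 -- ROWS (data) or LOG
       physical_name
  FROM sys.master_files
  ORDER BY DatabaseName, type_desc;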
The transaction log contains not only user writes but also system writes such as index writes, page splits, table reorganizations, and so on. After one intensive update test, I inspected the log using Lumigent's Log Explorer and was surprised to find that about 80 percent of all entries represented system activities, not user updates.

Because the transaction log contains not only the current information but also all updates to the data file, it has a tendency to grow and grow. Administering the transaction log involves backing it up and truncating it as part of the recovery plan, as discussed in Chapter 41, "Recovery Planning." How SQL Server uses the transaction log within transactions is covered in Chapter 66, "Managing Transactions, Locking, and Blocking."

Configuring file growth

Prior to SQL Server version 7, the data files required manual size adjustment to handle additional data. Fortunately, for about a decade now, SQL Server has been able to grow files automatically by means of the following options (see Figure 20-2):

■ Enable Autogrowth: As the database begins to hold more data, the file size must grow. If autogrowth is not enabled, an observant DBA will have to adjust the size manually. If autogrowth is enabled, SQL Server automatically adjusts the size according to one of the following growth parameters:
  ■ In percent: When the data file needs to grow, this option expands it by the percent specified. Growing by percent is the best option for smaller databases. With very large files, this option may add too much space in one operation and hurt performance while the data file is being resized. For example, adding 10 percent to a 5GB data file adds 500MB; writing 500MB could take a while.
  ■ In megabytes: When the data file needs to grow, this option adds the specified number of megabytes to the file. Growing by a fixed size is a good option for larger data files.

Best Practice

The default setting is to grow the data file by 1MB. Autogrow events require database locks, which severely impact performance. Imagine a database that grows by a couple of gigabytes: it will have to endure 2,048 tiny autogrow events. On the other hand, a very large autogrowth event will consume more time. The best solution is to turn autogrow off and manually increase the file size during the database maintenance window. However, if that's not expedient, then I recommend setting autogrow to a reasonable mid-size growth. For my open-source O/R DBMS product, Nordic, I set both the initial size and the autogrow increment to 100MB.

■ Maximum file size: Setting a maximum size can prevent the data file or transaction log file from filling the entire disk subsystem, which would cause trouble for the operating system. The maximum size for a data file is 16 terabytes, and log files are limited to 2 terabytes. This does not limit the size of the database, because a database can include multiple files.

FIGURE 20-2: With Management Studio's New Database form, a new database is configured for automatic file growth and a maximum size of 20GB.

Automatic file growth can be specified in code by adding the file options to the CREATE DATABASE DDL command. File size can be specified in kilobytes (KB), megabytes (MB), gigabytes (GB), or terabytes (TB); megabytes is the default. File growth can be set to a size or a percent.
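For a database that already exists, the same growth settings can be changed in code with ALTER DATABASE ... MODIFY FILE. Here is a hedged sketch applying the fixed-growth best practice above; the database and logical file names are placeholders and must match the names reported by sys.database_files:

-- Sketch: switch a data file to a fixed 100MB growth increment and cap its size
ALTER DATABASE SalesDB
    MODIFY FILE (NAME = SalesDB_Data, FILEGROWTH = 100MB, MAXSIZE = 200GB);

Setting FILEGROWTH = 0 disables autogrowth entirely for that file.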
The following code creates the NewDB database with an initial data-file size of 10MB, a maximum size of 200GB, and a file growth of 100MB. The transaction log file is initially 5MB, with a maximum size of 10GB and a growth of 100MB:

CREATE DATABASE NewDB
  ON PRIMARY
    (NAME = NewDB,
     FILENAME = 'c:\SQLData\NewDB.mdf',
     SIZE = 10MB,
     MAXSIZE = 200GB,
     FILEGROWTH = 100MB)
  LOG ON
    (NAME = NewDBLog,
     FILENAME = 'd:\SQLLog\NewDBLog.ldf',
     SIZE = 5MB,
     MAXSIZE = 10GB,
     FILEGROWTH = 100MB);

All the code in this chapter and all the sample databases are available for download from the book's website, www.sqlserverbible.com. In addition, the site includes extensive queries using the catalog views that relate to this chapter.

If autogrowth is not enabled, then the files require manual adjustment if they are to handle additional data. File size can be adjusted in Management Studio by editing it in the database properties form. An easy way to determine the files and file sizes for the current database from code is to query the sys.database_files catalog view (sys.master_files covers every database on the server).
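For example, here is a sketch of that query, reporting sizes in megabytes (size and max_size are stored as counts of 8KB pages, and a max_size of -1 means unlimited):

SELECT name,
       type_desc,
       physical_name,
       size * 8 / 1024 AS SizeMB,
       CASE max_size WHEN -1 THEN NULL
            ELSE max_size * 8 / 1024
       END AS MaxSizeMB
  FROM sys.database_files;   -- files of the current database only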