■ SQL Data Services (SDS): The database side of Microsoft Azure is a full-featured relational SQL Server in the cloud that provides high availability and scalable performance at a very reasonable cost, without capital expenses or software licenses. I'm a huge fan of SDS and I host my ISV software business on SDS. Version 1 does have a few limitations: 10GB per database, no heaps (I can live with that), no access to the file system or other SQL Servers (distributed queries, etc.), and you're limited to SQL Server logins.

Besides the general descriptions here, Appendix A includes a chart detailing the differences between the multiple editions.

Exploring the Metadata

When SQL Server is initially installed, it already contains several system objects. In addition, every new user database contains several system objects, including tables, views, stored procedures, and functions. Within Management Studio's Object Explorer, the system databases appear under the Databases ➪ System Databases node.

System databases

SQL Server uses five system databases to store system information, track operations, and provide a temporary work area. In addition, the model database is a template for new user databases. These five system databases are as follows:

■ master: Contains information about the server's databases. In addition, objects in master are available to other databases. For example, stored procedures in master may be called from a user database.
■ msdb: Maintains lists of activities, such as backups and jobs, and tracks which database backup goes with which user database.
■ model: The template database from which new databases are created. Any object placed in the model database will be copied into any new database.
■ tempdb: Used for ad hoc tables by all users, batches, stored procedures (including Microsoft stored procedures), and the SQL Server engine itself. If SQL Server needs to create temporary heaps or lists during query execution, it creates them in tempdb. tempdb is dropped and recreated when SQL Server is restarted.
■ resource: This hidden database, added in SQL Server 2005, contains information that was previously in the master database and was split out from the master database to make service pack upgrades easier to install.

Metadata views

Metadata is data about data. One of Codd's original rules for relational databases is that information about the database schema must be stored in the database using tables, rows, and columns, just like user data. It is this data about the data that makes it easy to write code to navigate and explore the database schema and configuration. SQL Server has several types of metadata:

■ Catalog views: Provide information about static metadata — such as tables, security, and server configuration.
■ Dynamic management views (DMVs) and functions: Yield powerful insight into the current state of the server and provide data about things such as memory, threads, stored procedures in cache, and connections.
■ System functions and global variables: Provide data about the current state of the server, the database, and connections for use in scalar expressions.
■ Compatibility views: Simulate the system tables from SQL Server 2000 and earlier for backward compatibility. Note that compatibility views are deprecated, meaning they'll disappear in SQL Server 11, the next version of SQL Server.
■ Information schema views: The ANSI SQL-92 standard nonproprietary views used to examine the schema of any database product. Portability as a database design goal is a lost cause, and these views are of little practical use for any DBA or database developer who exploits the features of SQL Server. Note that they have been updated for SQL Server 2008, so if you used them in the past they may need to be tweaked.
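To make the distinctions concrete, here's a short tour that queries one example of each metadata type. These are real system views and functions in SQL Server 2008; treat the sketch as illustrative rather than exhaustive:

```sql
-- Catalog view: static metadata (every user table in the current database)
SELECT name, create_date
FROM sys.tables;

-- Dynamic management view: current server state (one row per session)
SELECT session_id, login_name, status
FROM sys.dm_exec_sessions;

-- System function and global variable: scalar facts for use in expressions
SELECT SERVERPROPERTY('ProductVersion') AS ProductVersion,
       @@SPID AS CurrentSessionID;

-- Information schema view: the ANSI-standard window into the same schema
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES;
```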
These metadata views are all listed in Management Studio's Object Explorer under the Database ➪ Views ➪ System Views node, or under the Database ➪ Programmability ➪ Functions ➪ Metadata Function node.

What's New?

There are about 50 new features in SQL Server 2008, as you'll discover in the What's New in 2008 sidebars in many chapters. Everyone loves lists (and so do I), so here's my list highlighting the best of what's new in SQL Server 2008.

Paul's top-ten new features in SQL Server 2008:

10. PowerShell — The new Windows scripting language has been integrated into SQL Server. If you are a DBA willing to learn PowerShell, this technology has the potential to radically change how you do your daily job.

9. New data types — Specifically, I'm more excited about Date, Time, and DateTime2 than Spatial and HierarchyID.

8. Tablix — Reporting Services gains the Tablix and Dundas controls, and loses that IIS requirement.

7. Query processing optimizations — The new star join optimizations provide incredible out-of-the-box performance gains for some types of queries. Also, although partitioned tables were introduced in SQL Server 2005, the query execution plan performance improvements and new UI for partitioned tables in SQL Server 2008 will increase their adoption rate.

6. Filtered indexes — The ability to create a small, targeted nonclustered index over a very large table is the perfect logical extension of indexing, and I predict it will be one of the most popular new features (see the first sketch following this list).

5. Management Data Warehouse — A new consistent method of gathering performance data for further analysis by Performance Studio, custom reports, and third parties lays the foundation for more good stuff in the future.

4. Data compression — The ability to trade CPU cycles for reduced I/O can significantly improve the scalability of some enterprise databases (see the second sketch following this list). I believe this is the sleeper feature that will be the compelling reason for many shops to upgrade to SQL Server 2008 Enterprise Edition.

The third top new feature is Management Studio's many enhancements. Even though it's just a tool and doesn't affect the performance of the engine, it will help database developers and DBAs be more productive, and it lends a more enjoyable experience to every job role working with SQL Server:

3. Management Studio — The primary UI is supercharged with multi-server queries and configuration servers, IntelliSense, the T-SQL debugger, customizable Query Editor tabs, an Error List in the Query Editor, easily exported data from query results, the ability to launch Profiler from the Query Editor, Object Search, a vastly improved Object Explorer Details page, a new Activity Monitor, improved ways to work with query plans, and it's faster.

And for the final top two new features, one for developers and one for DBAs:

2. Merge and table-valued parameters — Wow! It's great to see new T-SQL features on the list. Table-valued parameters alone are the compelling reason I upgraded my Nordic software to SQL Server 2008. Table-valued parameters revolutionize the way application transactions communicate with the database, which earns them the top SQL Server 2008 database developer feature and the number two spot in this list. The new merge command combines insert, update, and delete into a single transaction and is a slick way to code an upsert operation (see the third sketch following this list). I've recoded many of my upsert stored procedures to use merge with excellent results.

1. Policy-based management (PBM) — PBM means that servers and databases can be declaratively managed by applying and enforcing consistent policies, instead of running ad hoc scripts. This feature has the potential to radically change how enterprise DBAs do their daily jobs, which is why it earns the number one spot on my list of top ten SQL Server 2008 features.
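Here's a minimal sketch of a filtered index. The table and column names are hypothetical; the new piece of 2008 syntax is the WHERE clause on CREATE INDEX, which keeps the index small by covering only the rows that matter:

```sql
-- Hypothetical work-queue table: most rows are closed, few are open
CREATE TABLE dbo.WorkOrder (
    WorkOrderID int IDENTITY PRIMARY KEY,
    OrderStatus varchar(10) NOT NULL,   -- 'Open' or 'Closed'
    DueDate date NULL
);

-- The filtered index contains only the open rows, so it stays tiny
-- even if the table accumulates millions of closed work orders.
CREATE NONCLUSTERED INDEX IX_WorkOrder_Open
    ON dbo.WorkOrder (DueDate)
    WHERE OrderStatus = 'Open';
```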
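Data compression is enabled per table or index, with ROW and PAGE as the two levels; PAGE trades the most CPU for the largest I/O savings. A quick sketch against a hypothetical table, including the system procedure that estimates the savings before you commit to anything:

```sql
-- Estimate the space savings first, without changing the table
EXEC sp_estimate_data_compression_savings
    @schema_name = 'dbo',
    @object_name = 'SalesHistory',
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';

-- Rebuild the table with page-level compression
ALTER TABLE dbo.SalesHistory
    REBUILD WITH (DATA_COMPRESSION = PAGE);
```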
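To show why I'm so enthusiastic about the pairing, here's a small sketch (all object names are hypothetical) that passes a set of rows to a stored procedure as a table-valued parameter and upserts them with a single merge statement:

```sql
-- Target table and a user-defined table type for the parameter
CREATE TABLE dbo.Customer (
    CustomerID   int PRIMARY KEY,
    CustomerName varchar(50) NOT NULL
);
GO
CREATE TYPE dbo.CustomerTableType AS TABLE (
    CustomerID   int PRIMARY KEY,
    CustomerName varchar(50) NOT NULL
);
GO
CREATE PROCEDURE dbo.Customer_Upsert
    @Customers dbo.CustomerTableType READONLY   -- TVPs must be READONLY
AS
BEGIN
    -- One MERGE inserts the new rows and updates the existing ones
    MERGE dbo.Customer AS target
    USING @Customers AS source
        ON target.CustomerID = source.CustomerID
    WHEN MATCHED THEN
        UPDATE SET CustomerName = source.CustomerName
    WHEN NOT MATCHED THEN
        INSERT (CustomerID, CustomerName)
        VALUES (source.CustomerID, source.CustomerName);
END;
GO

-- The caller fills a table variable and ships the whole set in one round-trip
DECLARE @NewCustomers dbo.CustomerTableType;
INSERT @NewCustomers VALUES (1, 'Adventure Works'), (2, 'Northwind');
EXEC dbo.Customer_Upsert @Customers = @NewCustomers;
```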
Going, Going, Gone?

With every new version of SQL Server, some features change or are removed because they no longer make sense with the newer feature set. Discontinued means a feature used to work in a previous SQL Server version but no longer appears in SQL Server 2008. Deprecated means the feature still works in SQL Server 2008, but it's going to be removed in a future version. There are two levels of deprecation: Microsoft releases both a list of the features that will be gone in the next version, and a list of the features that will be gone in some future version but will still work in the next version. Books Online has details about all three lists (just search for deprecated), but here are the highlights:

Going Eventually (Deprecated)

These features are deprecated and will be removed in some future version of SQL Server. You should try to remove them from your code:

■ SQLOLEDB
■ Timestamp (although the synonym rowversion continues to be supported)
■ Text, ntext, and image data types
■ Older full-text catalog commands
■ Sp_configure 'user instances enabled'
■ Sp_lock
■ SQL-DMO
■ Sp stored procedures for security, e.g., sp_adduser
■ Setuser (use Execute as instead)
■ System tables
■ Group by all

Going Soon (Deprecated)

The following features are deprecated and will be removed in the next version of SQL Server. You should definitely remove these commands from your code:

■ Older backup and restore options
■ SQL Server 2000 compatibility level
■ DATABASEPROPERTY command
■ sp_dboption
■ FastFirstRow query hint (use Option(Fast n))
■ ANSI-89 (legacy) outer join syntax (*=, =*); use ANSI-92 syntax instead
■ Raiserror integer string format
■ Client connectivity using DB-Lib and Embedded SQL for C

Gone (Discontinued)

The following features are discontinued in SQL Server 2008:

■ SQL Server 6, 6.5, and 7 compatibility levels
■ Surface Area Configuration Tool (unfortunately)
■ Notification Services
■ Dump and Load commands (use Backup and Restore)
■ Backup log with No-Log
■ Backup log with truncate_only
■ Backup transaction
■ DBCC Concurrencyviolation
■ sp_addgroup, sp_changegroup, sp_dropgroup, and sp_helpgroup (use security roles instead)

The very useful Profiler trace feature can report the use of any deprecated features.
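Besides Profiler, SQL Server 2008 keeps a running tally of deprecated-feature usage in a performance counter object, which you can read straight from a DMV. A quick sketch; the exact object_name varies with the instance name, so the LIKE filter is deliberately loose:

```sql
-- One row per deprecated feature; cntr_value counts uses since startup
SELECT instance_name AS DeprecatedFeature,
       cntr_value    AS UsesSinceStartup
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Deprecated Features%'
  AND cntr_value > 0
ORDER BY cntr_value DESC;

-- While you're cleaning house, rewrite legacy outer joins as well:
-- Deprecated ANSI-89:  SELECT ... FROM a, b WHERE a.id *= b.aid
-- Supported ANSI-92:   SELECT ... FROM a LEFT OUTER JOIN b ON a.id = b.aid
```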
Summary

If SQL Server 2005 was the "kitchen sink" version of SQL Server, then SQL Server 2008 is the version that focuses squarely on managing the enterprise database.

Some have written that SQL Server 2008 is the second step of a two-step release. In the same way that SQL Server 2000 was part two to SQL Server 7, the theory is that SQL Server 2008 is part two to SQL Server 2005. At first glance this makes sense, because SQL Server 2008 is an evolution of the SQL Server 2005 engine, in the same way that SQL Server 2000 was built on the SQL Server 7 engine.

However, as I became intimate with SQL Server 2008, I changed my mind. Consider the significant new technologies in SQL Server 2008: policy-based management, Management Data Warehouse, PowerShell, data compression, and Resource Governor. None of these technologies existed in SQL Server 2005. In addition, consider the killer technologies introduced in SQL Server 2005: how many of them are actually being extended in SQL Server 2008? The most talked about new technology in SQL Server 2005 was CLR. Hear much about CLR in SQL Server 2008? Nope. Service Broker has some minor enhancements. Two SQL Server 2005 new technologies, HTTP endpoints and Notification Services, are actually discontinued in SQL Server 2008. Hmmm, I guess they should have been on the SQL Server 2005 deprecation list.

No, SQL Server 2008 is more than a SQL Server 2005 sequel. SQL Server 2008 is a fresh new vision for SQL Server. SQL Server 2008 is the first punch of a two-punch setup focused squarely on managing the enterprise-level database. SQL Server 2008 is a down payment on the big gains coming in SQL Server 11. I'm convinced that the SQL Server Product Managers nailed it and that SQL Server 2008 is the best direction possible for SQL Server. There's no Kool-Aid here — it's all way cool.

Data Architecture

IN THIS CHAPTER

Pragmatic data architecture
Evaluating database designs
Designing performance into the database
Avoiding normalization over-complexity
Relational design patterns

You can tell by looking at a building whether there's an elegance to the architecture, but architecture is more than just good looks. Architecture brings together materials, foundations, and standards. In the same way, data architecture is the study of defining what a good database is and how one builds a good database. That's why data architecture is more than just data modeling, more than just server configuration, and more than just a collection of tips and tricks. Data architecture is the overarching design of the database, how the database should be developed and implemented, and how it interacts with other software. In this sense, data architecture can be related to the architecture of a home, a factory, or a skyscraper. Data architecture is defined by the Information Architecture Principle and the six attributes by which every database can be measured.

Enterprise data architecture extends the basic ideas of designing a single database to include designing which types of databases serve which needs within the organization, how those databases share resources, and how they communicate with one another and other software. In this sense, enterprise data architecture is community planning or zoning, and is concerned with applying the best database meta-patterns (e.g., relational OLTP database, object-oriented database, multidimensional) to an organization's various needs.

Author's Note

Data architecture is a passion of mine, and without question the subject belongs in any comprehensive database book.
Because it's the foundation for the rest of the book — the "why" behind the "how" of designing, developing, and operating a database — it makes sense to position it toward the beginning of the book. Even if you're not in the role of database architect yet, I hope you enjoy the chapter and that it presents a useful viewpoint for your database career. Keep in mind that you can return to read this chapter later, at any time when the information might be more useful to you.

Information Architecture Principle

For any complex endeavor, there is value in beginning with a common principle to drive designs, procedures, and decisions. A credible principle is understandable, robust, complete, consistent, and stable. When an overarching principle is agreed upon, conflicting opinions can be objectively measured, and standards can be decided upon that support the principle.

The Information Architecture Principle encompasses the three main areas of information management: database design and development, enterprise data center management, and business intelligence analysis.

Information Architecture Principle: Information is an organizational asset, and, according to its value and scope, must be organized, inventoried, secured, and made readily available in a usable format for daily operations and analysis by individuals, groups, and processes, both today and in the future.

Unpacking this principle reveals several practical implications. There should be a known inventory of information, including its location, source, sensitivity, present and future value, and current owner. While most organizational information is stored in IT databases, un-inventoried critical data is often found scattered throughout the organization in desktop databases, spreadsheets, scraps of paper, Post-it notes, and (the most dangerous of all) inside the heads of key employees.

Just as the value of physical assets varies from asset to asset and over time, the value of information is also variable and so must be assessed. Information may be highly valuable to an individual or department but less valuable to the organization as a whole; information that is critical today might be meaningless in a month; and information that may seem insignificant individually might become critical for organizational planning once aggregated.

If the data is to be made easily available in the future, then current designs must be loosely coupled to avoid locking the data in a rigid, but brittle, database.

Database Objectives

Based on the Information Architecture Principle, every database can be architected or evaluated by six interdependent database objectives. Four of these objectives are primarily a function of design, development, and implementation: usability, extensibility, data integrity, and performance. Availability and security are more a function of implementation than design.

With sufficient design effort and a clear goal of meeting all six objectives, it is fully possible to design and develop an elegant database that does just that. The idea that one attribute is gained only at the expense of the other attributes is a myth.

Each objective can be measured on a continuum. The data architect is responsible for informing the organization about these six objectives, including the cost associated with meeting each objective, the risk of failing to meet the objective, and the recommended level for each objective.
It's then the organization's privilege to prioritize the objectives relative to their costs.

Usability

The usability of a data store (the architectural term for a database) involves the completeness of meeting the organization's requirements, the suitability of the design for its intended purpose, the effectiveness of the format of data available to applications, the robustness of the database, and the ease of extracting information (by programmers and power users). The most common reason why a database is less than usable is an overly complex or inappropriate design.

Usability is enabled in the design by ensuring the following:

■ A thorough and well-documented understanding of the organizational requirements
■ Life-cycle planning of software features
■ Selecting the correct meta-pattern (e.g., relational OLTP database, object-oriented database, multidimensional) for the data store
■ Normalization and correct handling of optional data
■ Simplicity of design
■ A well-defined abstraction layer with stored procedures and views

Extensibility

The Information Architecture Principle states that the information must be readily available today and in the future, which requires the database to be extensible, able to be easily adapted to meet new requirements. Data integrity, performance, and availability are all mature and well understood by the computer science and IT professions. While there may be many badly designed, poorly performing, and often down databases, plenty of professionals in the field know exactly how to solve those problems. I believe the least understood database objective is extensibility.

Extensibility is incorporated into the design as follows:

■ Normalization and correct handling of optional data
■ Generalization of entities when designing the schema
■ Data-driven designs that not only model the obvious data (e.g., orders, customers), but also enable the organization to store the behavioral patterns, or process flow
■ A well-defined abstraction layer with stored procedures and views that decouples the database from all client access, including client apps, middle tiers, ETL, and reports (a minimal sketch of such a layer follows this list)

Extensibility is also closely related to simplicity. Complexity breeds complexity, and inhibits adaptation.
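Since the abstraction layer appears in nearly every objective's checklist, here's a bare-bones sketch of the idea, using hypothetical object names. Client code never touches the table directly; it sees only a view for reads and a stored procedure for writes, so the underlying schema can later be refactored without breaking callers:

```sql
-- The base table stays private to the database
CREATE TABLE dbo.Customer (
    CustomerID   int IDENTITY PRIMARY KEY,
    CustomerName varchar(50) NOT NULL,
    IsActive     bit NOT NULL DEFAULT 1
);
GO
-- A view exposes a stable shape for reads
CREATE VIEW dbo.vActiveCustomer
AS
SELECT CustomerID, CustomerName
FROM dbo.Customer
WHERE IsActive = 1;
GO
-- A stored procedure exposes a stable contract for writes
CREATE PROCEDURE dbo.pCustomer_Add
    @CustomerName varchar(50)
AS
INSERT dbo.Customer (CustomerName) VALUES (@CustomerName);
GO
```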
Data integrity

The ability to ensure that persisted data can be retrieved without error is central to the Information Architecture Principle, and it was the first major problem tackled by the database world. Without data integrity, a query's answer cannot be guaranteed to be correct; consequently, there's not much point in availability or performance. Data integrity can be defined in multiple ways:

■ Entity integrity involves the structure (primary key and its attributes) of the entity. If the primary key is unique and all attributes are scalar and fully dependent on the primary key, then the integrity of the entity is good. In the physical schema, the table's primary key enforces entity integrity.
■ Domain integrity ensures that only valid data is permitted in the attribute. A domain is a set of possible values for an attribute, such as integers, bit values, or characters. Nullability (whether a null value is valid for an attribute) is also a part of domain integrity. In the physical schema, the data type and nullability of the column enforce domain integrity.
■ Referential integrity refers to the domain integrity of foreign keys. Domain integrity means that if an attribute has a value, then that value must be in the domain. In the case of the foreign key, the domain is the list of values in the related primary key. Referential integrity, therefore, is not an issue of the integrity of the primary key but of the foreign key.
■ Transactional integrity ensures that every logical unit of work, such as inserting 100 rows or updating 1,000 rows, is executed as a single transaction. The quality of a database product is measured by its transactions' adherence to the ACID properties: atomic — all or nothing; consistent — the database begins and ends the transaction in a consistent state; isolated — one transaction does not affect another transaction; and durable — once committed, always committed.

In addition to these four generally accepted definitions of data integrity, I add user-defined data integrity:

■ User-defined integrity means that the data meets the organization's requirements. Simple business rules, such as a restriction to a domain, limit the list of valid data entries. Check constraints are commonly used to enforce these rules in the physical schema.
■ Complex business rules limit the list of valid data based on some condition. For example, certain tours may require a medical waiver. Implementing these rules in the physical schema generally requires stored procedures or triggers.
■ Some data-integrity concerns can't be checked by constraints or triggers. Invalid, incomplete, or questionable data may pass all the standard data-integrity checks. For example, an order without any order detail rows is not a valid order, but no SQL constraint or trigger traps such an order. The abstraction layer can assist with this problem, and SQL queries can locate incomplete orders and help in identifying other less measurable data-integrity issues, including wrong data, incomplete data, questionable data, and inconsistent data.

Integrity is established in the design by ensuring the following:

■ A thorough and well-documented understanding of the organizational requirements
■ Normalization and correct handling of optional data
■ A well-defined abstraction layer with stored procedures and views
■ Data quality unit testing using a well-defined and understood set of test data
■ Metadata and data audit trails documenting the source and veracity of the data, including updates
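Here's how the first few definitions map onto a physical schema, using hypothetical tour-booking tables in the spirit of the medical-waiver example. The primary keys enforce entity integrity, the data types and nullability enforce domain integrity, the foreign key enforces referential integrity, and a check constraint enforces a simple user-defined rule:

```sql
CREATE TABLE dbo.Tour (
    TourID       int IDENTITY PRIMARY KEY,  -- entity integrity
    TourName     varchar(50) NOT NULL,      -- domain integrity: type + nullability
    MaxGroupSize int NOT NULL
        CONSTRAINT ckTourGroupSize
        CHECK (MaxGroupSize > 0)            -- user-defined integrity
);

CREATE TABLE dbo.Booking (
    BookingID   int IDENTITY PRIMARY KEY,
    TourID      int NOT NULL
        CONSTRAINT fkBookingTour
        REFERENCES dbo.Tour (TourID),       -- referential integrity
    BookingDate date NOT NULL
);
```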
Performance/scalability

Presenting readily usable information is a key aspect of the Information Architecture Principle. Although the database industry has achieved a high degree of performance, the ability to scale that performance to very large databases with more connections is still an area of competition between database engine vendors.

Performance is enabled in the database design and development by ensuring the following:

■ A well-designed schema with normalization and generalization, and correct handling of optional data
■ Set-based queries implemented within a well-defined abstraction layer with stored procedures and views
■ A sound indexing strategy that determines which queries should use bookmark lookups and which queries would benefit most from clustered and nonclustered covering indexes that eliminate bookmark lookups
■ Tight, fast transactions that reduce locking and blocking
■ Partitioning, which is useful for advanced scalability

Availability

The availability of information refers to the information's accessibility when required, regarding uptime, locations, and the availability of the data for future analysis. Disaster recovery, redundancy, archiving, and network delivery all affect availability.

Availability is strengthened by the following:

■ Quality, redundant hardware
■ SQL Server's high-availability features
■ Proper DBA procedures regarding data backup and backup storage
■ Disaster recovery planning

Security

The sixth database objective based on the Information Architecture Principle is security. Any organizational asset must be secured according to its value and sensitivity.

Security is enforced by the following:

■ Physical security and restricted access to the data center
■ Defensively coding against SQL injection
■ Appropriate operating system security
■ Reducing the surface area of SQL Server to only those services and features required
■ Identifying and documenting ownership of the data
■ Granting access according to the principle of least privilege
■ Cryptography — data encryption of live databases, backups, and data warehouses
■ Metadata and data audit trails documenting the source and veracity of the data, including updates
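As a closing sketch, here's what the principle of least privilege can look like in T-SQL when paired with the abstraction layer sketched earlier; the role and object names are hypothetical. The application's role may call the published procedures and views but is never granted direct table access:

```sql
-- A database role for the application, granted only what it needs
CREATE ROLE AppUserRole;

-- Permission to use the abstraction layer...
GRANT SELECT  ON dbo.vActiveCustomer TO AppUserRole;
GRANT EXECUTE ON dbo.pCustomer_Add   TO AppUserRole;

-- ...and nothing else: with no direct table permissions, ad hoc updates
-- fail, and a compromised login can do only what the procedures allow.
```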