Table Structures and Indexing

Part of the document Pro SQL Server 2012 Relational Database Design and Implementation (pages 448 - 507)


A lot of us have jobs where we need to give people structure, but that is different from controlling.

—Keith Miller

To me, the true beauty of the relational database engine comes from its declarative nature. As a programmer, I simply ask the engine a question, and it answers it. The questions I ask are usually pretty simple: just give me some data from a few tables, correlate it on some of the data, do a little math perhaps, and give me back these pieces of information (and naturally do it incredibly fast, if you don’t mind). Generally, the engine obliges with an answer extremely quickly. But how does it do it? If you thought it was magic, you would not be right. It is a lot of complex code implementing a massive number of extremely sophisticated algorithms that allow the engine to answer your questions in a timely manner. With every passing version of SQL Server, that code gets better at turning your relational request into a set of operations that gives you the answers you desire in remarkably small amounts of time. These operations will be shown to you on a query plan, which is a blueprint of the algorithms used to execute your query. I will use query plans often in this chapter and others to show you how your design choices can affect the way work gets done.

Our job as data-oriented designers and programmers is to assist the query optimizer (which takes your query and turns it into a plan of how to run the query), the query processor (which takes the plan and uses it to do the actual work), and the storage engine (which manages IO for the whole process) by first designing and implementing as close to the relational model as possible by normalizing your structures, using good set-based code (no cursors), following best practices with coding T-SQL, and so on. This is a design book, so I won’t cover T-SQL coding, but it is a skill you should master. Consider Apress’s Beginning T-SQL 2012 by Kathi Kellenberger (Aunt Kathi!) and Scott Shaw or perhaps one of Itzik Ben-Gan’s Inside SQL books on T-SQL for some deep learning on the subject. Once you have built your system correctly, the next step is to help out by adjusting the physical structures using indexing, filegroups, files, partitioning, and everything else you can do to adjust the physical layers to assist the optimizer deal with your commonly asked questions.

When it comes to tuning your database structures, you must maintain a balance between doing too much and too little. Indexing strategies are a great example of this. If you don’t use indexes enough, searches will be slow, as the query processor could have to read every row of every table for every query (which, even if it seems fast on your machine, can cause the concurrency issues we will cover in the next chapter by forcing the query processor to lock a lot more resources than is necessary). Use too many indexes, and modifying data could take too long, as indexes have to be maintained. Balance is the key, kind of like matching the amount of fluid to the size of the glass so that you will never have to answer that annoying question about a glass that has half as much fluid as it can hold. (The answer is either that the glass is too large or the waitress needs to refill your glass immediately, depending on the situation.)

Everything we have done so far has been centered on the idea that the quality of the data is the number one concern. Although this is still true, in this chapter, we are going to assume that we’ve done our job in the logical and implementation phases, so the data quality is covered. Slow and right is always better than fast and wrong (how would you like to get paid a week early, but only get half your money?), but the obvious goal of building a computer system is to do things right and fast. Everything we do for performance should affect only the performance of the system, not the data quality in any way.

We have technically added indexes in previous chapters as a side effect of adding primary key and unique constraints (in that a unique index is built by SQL Server to implement the uniqueness condition). In many cases, those indexes will turn out to be a lot of what you need to make normal queries run nicely, since the most common searches that people will do will be on identifying information. Of course, you will likely discover that some of the operations you are trying to achieve won’t be nearly as fast as you hope. This is where physical tuning comes in, and at this point, you need to understand how tables are structured and consider organizing the physical structures.

The goal of this chapter is to provide a basic understanding of the types of things you can do with the physical database implementation, including the indexes that are available to you, how they work, and how to use them in an effective physical database strategy. This understanding relies on a base knowledge of the physical data structures on which we based these indexes—in other words, of how the data is structured in the physical SQL Server storage engine. In this chapter, I’ll cover the following:

Physical database structure: An overview of how the database and tables are stored.

This acts mainly as foundation material for subsequent indexing discussion, but the discussion also highlights the importance of choosing and sizing your datatypes carefully.

Indexing: A survey of the different types of indexes and their structure. I’ll demonstrate many of the index settings and how these might be useful when developing your strategy, to correct any performance problems identified during optimization testing.

Index usage scenarios: I’ll discuss a few specialized cases of how to apply and use indexes.

Index Dynamic Management View queries: In this section, I will introduce a couple of the dynamic management views that you can use to help determine what indexes you may need and to see which indexes have been useful in your system.

Once you understand the physical data structures, it will be a good bit easier to visualize what is occurring in the engine and then optimize data storage and access without affecting the correctness of the data. It’s essential to the goals of database design and implementation that the physical storage not affect the physically implemented model. This is what Codd’s eighth rule, also known as the Physical Data Independence rule—is about. As we discussed in Chapter 1, this rule states that the physical storage can be implemented in any manner as long as the users don’t have to know about it. It also implies that if you change the physical storage, the users shouldn’t be affected. The strategies we will cover should change the physical model but not the model that the users (people and code) know about. All we want to do is enhance performance, and understanding the way SQL Server stores data is an important step.

Note

■ I am generally happy to treat a lot of the deeper internals of SQL Server as a mystery left to the engine to deal with. For a deeper explanation, consider any of Kalen Delaney’s books, which are where I go whenever I feel the pressing need to figure out why something that seems bizarre is occurring. The purpose of this chapter is to give you a basic feeling for what the structures are like, so you can visualize the solution to some problems and understand the basics of how to lay out your physical structures.

Some of the samples may not work 100% the same way on your computer, depending on your hardware situation or changes to the optimizer from updates or service packs.


Physical Database Structure

In SQL Server, databases are physically structured as several layers of containers that allow you to move parts of the data around to different disk drives for optimum access. As discussed in Chapter 1, a database is a collection of related data. At the logical level, it contains tables that have columns that contain data. At the physical level, databases are made up of files, where the data is physically stored. These files are basically just typical Microsoft Windows files, and they are logically grouped into filegroups that control where they are stored on a disk. Each file contains a number of extents; an extent is a 64-KB allocation in a database file that’s made up of eight individual contiguous 8-KB pages. The page is the basic unit of data storage in SQL Server databases. Everything that’s stored in SQL Server is stored on pages of several types (data, index, overflow, and others), but those three are the ones that are most important to you (I will list the others later in the section called “Extents and Pages”). The following sections describe each of these containers in more detail, so you understand the basics of how data is laid out on disk.
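To see why datatype sizing matters at this level, consider how many rows fit on a single 8-KB page: roughly 8,060 bytes of each data page are usable for row data, so the width of a row directly determines how many pages (and therefore how much I/O) a table requires. The row sizes in this sketch are hypothetical, chosen only to illustrate the arithmetic:

```sql
-- Approximate rows per 8-KB page for two hypothetical row widths.
-- 8060 bytes is an approximation of the in-row data space per page;
-- actual counts vary with row and page overhead.
SELECT 8060 / 60  AS narrowRowsPerPage, -- ~60-byte rows: 134 rows per page
       8060 / 600 AS wideRowsPerPage;   -- ~600-byte rows: 13 rows per page
-- At 1,000,000 rows, the narrow table needs roughly 7,500 pages to scan,
-- while the wide one needs roughly 77,000: a tenfold difference in I/O.
```

The same row count can therefore cost ten times the reads simply because of careless datatype choices, which is why the upcoming indexing discussion keeps coming back to page counts.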

Note

■ Because of the extreme variety of hardware possibilities and needs, it’s impossible in a book on design to go into serious depth about how and where to place all your files in physical storage. I’ll leave this task to the DBA-oriented books. For detailed information about choosing and setting up your hardware, check out http://msdn.microsoft.com or most any of Glenn Berry’s writing. Glenn’s blog (at http://sqlserverperformance.wordpress.com/ at the time of this writing) contains a wealth of information about SQL Server hardware, particularly CPU changes.

Files and Filegroups

Figure 10-1 provides a high-level depiction of the objects used to organize the files (I’m ignoring logs in this chapter, because you don’t have direct access to them).

Figure 10-1. Database storage organization (a database contains one or more filegroups: Primary, FileGroup2, through FileGroupN; each filegroup contains one or more files, File1 through FileN, each stored at its own path)

At the top level of a SQL Server instance, we have the database. The database is comprised of one or more filegroups, which are logical groupings of one or more files. We can place different filegroups on different disk drives (hopefully on a different disk drive controller) to distribute the I/O load evenly across the available hardware. It’s possible to have multiple files in the filegroup, in which case SQL Server allocates space across each file in the filegroup. For best performance, it’s generally best to have no more files in a filegroup than you have physical CPUs (not including hyperthreading, though the rules with hyperthreading are changing as processors continue to improve faster than one could write books on the subject).

A filegroup contains one or more files, which are actual operating system files. Each database has at least one primary filegroup, whose files are called primary files (commonly suffixed as .mdf, although there’s no requirement to give the files any particular names or extension). Each database can possibly have other secondary filegroups containing the secondary files (commonly suffixed as .ndf), which are in any other filegroups. Files may only be a part of a single filegroup. SQL Server proportionally fills files by allocating extents in each filegroup equally, so you should make all of the files the same size if possible. (There is also a file type for full-text indexing and a filegroup type we used in Chapter 7 for filestream types of data that I will largely ignore in this chapter as well. I will focus only on the core file types that you will use for implementing your structures.)

You control the placement of objects that store physical data pages at the filegroup level (code and metadata is always stored on the primary filegroup, along with all the system objects). New objects created are placed in the default filegroup, which is the PRIMARY filegroup (every database has one as part of the CREATE DATABASE statement, or the first file specified is set to primary) unless another filegroup is specified in any CREATE <object> commands. For example, to place an object in a filegroup other than the default, you need to specify the name of the filegroup using the ON clause of the table- or index-creation statement:

CREATE TABLE <tableName>

(…) ON <fileGroupName>

This command assigns the table to the filegroup, but not to any particular file. Where in the files the object is created is strictly out of your control.

Tip

■ If you want to move a table to a different filegroup, you can use the MOVE TO option of the ALTER TABLE statement if the table has a clustered index, or for a heap (a table without a clustered index, covered later in this chapter), create a clustered index on the object on the filegroup you want it and then drop it. For nonclustered indexes, use the DROP_EXISTING setting on the CREATE INDEX statement.
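The tip above might be sketched as follows; the table, index, and filegroup names here are hypothetical, and if the clustered index backs a PRIMARY KEY constraint, the key definition in the CREATE INDEX statement must match the existing one exactly:

```sql
-- Move a table with a clustered index by rebuilding that index on the
-- target filegroup; the data pages move along with the clustered index.
CREATE UNIQUE CLUSTERED INDEX PKCustomer
    ON dbo.Customer (CustomerId)
    WITH (DROP_EXISTING = ON)
    ON SECONDARY;

-- The same setting relocates a nonclustered index in a single operation,
-- avoiding a separate DROP INDEX followed by CREATE INDEX.
CREATE INDEX XCustomer_Name
    ON dbo.Customer (Name)
    WITH (DROP_EXISTING = ON)
    ON SECONDARY;
```

DROP_EXISTING rebuilds the index in one step, which is generally cheaper than dropping and recreating it, because nonclustered indexes need not be rebuilt twice when the clustered index moves.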

Use code like the following to create indexes and specify a filegroup:

CREATE INDEX <indexName> ON <tableName> (<columnList>) ON <filegroup>;

Use the following type of command (or use ALTER TABLE) to create constraints that in turn create indexes (UNIQUE, PRIMARY KEY):

CREATE TABLE <tableName>
( …
  <primaryKeyColumn> int CONSTRAINT PKTableName PRIMARY KEY ON <fileGroup>
… );

For the most part, having just one filegroup and one file is the best practice for a large number of databases. If you are unsure if you need multiple filegroups, my advice is to build your database on a single filegroup and see if the data channel provided can handle the I/O volume (for the most part, I will avoid making too many such generalizations, as tuning is very much an art that requires knowledge of the actual load the server will be under).

As activity increases and you build better hardware with multiple CPUs and multiple drive channels, you might place indexes on their own filegroup, or even place files of the same filegroup across different controllers.

In the following example, I create a sample database with two filegroups, with the secondary filegroup having two files in it. (I put this sample database in an SQL\Data folder in the root of the C drive to keep the example simple and able to work on the types of drives that many of you will probably be testing my code on, but it is rarely a good practice to place your files on the C:\ drive when you have others available. I generally put a \sql directory on every drive and put everything SQL related in subfolders of that directory to keep things consistent across all of our servers. Put the files wherever works best for you.)

CREATE DATABASE demonstrateFilegroups ON PRIMARY
( NAME = Primary1, FILENAME = 'c:\sql\data\demonstrateFilegroups_primary.mdf', SIZE = 10MB),
FILEGROUP SECONDARY
( NAME = Secondary1, FILENAME = 'c:\sql\data\demonstrateFilegroups_secondary1.ndf', SIZE = 10MB),
( NAME = Secondary2, FILENAME = 'c:\sql\data\demonstrateFilegroups_secondary2.ndf', SIZE = 10MB)
LOG ON
( NAME = Log1, FILENAME = 'c:\sql\log\demonstrateFilegroups_log.ldf', SIZE = 10MB);

You can define other file settings, such as minimum and maximum sizes and growth. The values you assign depend on what hardware you have. For growth, you can set a FILEGROWTH parameter that allows you to grow the file by a certain size or percentage of the current size, and a MAXSIZE parameter, so the file cannot just fill up existing disk space. For example, if you wanted the file to start at 1GB and grow in chunks of 100MB up to 2GB, you could specify the following:

CREATE DATABASE demonstrateFileGrowth ON PRIMARY
( NAME = Primary1, FILENAME = 'c:\sql\data\demonstrateFileGrowth_primary.mdf',
  SIZE = 1GB, FILEGROWTH = 100MB, MAXSIZE = 2GB)
LOG ON
( NAME = Log1, FILENAME = 'c:\sql\data\demonstrateFileGrowth_log.ldf', SIZE = 10MB);

The growth settings are fine for smaller systems, but it’s usually better to make the files large enough that there’s no need for them to grow. File growth can be slow and cause ugly bottlenecks when OLTP traffic is trying to use a file that’s growing. When SQL Server is running on a desktop operating system such as Windows XP or later (and at this point you probably ought to be using something later, like Windows 7, or presumably Windows 8 or 9 depending on when you are reading this) or on a server operating system such as Windows Server 2003 or later (again, it is 2012, so emphasize “or later”), you can improve things by using instant file initialization, though only for data files. Instead of initializing the files, the space on disk can simply be allocated and not written to immediately. To use this capability, the service account cannot be LocalSystem, and the user account that SQL Server runs under must have the SE_MANAGE_VOLUME_NAME Windows permission. Even with instant file initialization, it’s still better to have some idea of what size your data will be and allocate space proactively: you then have cordoned off the space ahead of time, no one else can take it from you, and you won’t fail when the file tries to grow and there isn’t enough space. In either event, the DBA staff should be on top of the situation and make sure that you don’t run out of space.

You can query the sys.filegroups and sys.database_files catalog views to view the files in the newly created database:

USE demonstrateFilegroups;

GO

SELECT CASE WHEN fg.name IS NULL
            THEN CONCAT('OTHER-', df.type_desc COLLATE DATABASE_DEFAULT)
            ELSE fg.name END AS file_group,
       df.name AS file_logical_name,
       df.physical_name AS physical_file_name
FROM   sys.filegroups fg
       RIGHT JOIN sys.database_files df
          ON fg.data_space_id = df.data_space_id;

This returns the following results:

file_group  file_logical_name  physical_file_name
----------  -----------------  ------------------------------------------------
PRIMARY     Primary1           c:\sql\data\demonstrateFilegroups_primary.mdf
OTHER-LOG   Log1               c:\sql\log\demonstrateFilegroups_log.ldf
SECONDARY   Secondary1         c:\sql\data\demonstrateFilegroups_secondary1.ndf
SECONDARY   Secondary2         c:\sql\data\demonstrateFilegroups_secondary2.ndf

The LOG file isn’t technically part of a filegroup, so I used a right outer join to sys.database_files and gave it a default filegroup name of OTHER- plus the type of file, to make the results include all files in the database. You may also notice a couple of other interesting things in the code. First, the CONCAT function, new to SQL Server 2012, adds strings together. Second is the COLLATE DATABASE_DEFAULT clause. The strings returned by the system views are in the collation of the server, while the literal 'OTHER-' is in the database collation. If the server collation doesn’t match the database collation, this query would fail without it.

There’s a lot more information than just names in the catalog views I’ve referenced already in this chapter. If you are new to the catalog views, dig in and learn them. There is a wealth of information in those views that will be invaluable to you when looking at systems to see how they are set up and to determine how to tune them.
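As one example of the kind of question the catalog views can answer, the following sketch maps each user table and index to the data space it is stored on; sys.data_spaces covers both filegroups and partition schemes, so the same query works for partitioned tables:

```sql
-- Which filegroup (or partition scheme) holds each user table and index?
SELECT OBJECT_SCHEMA_NAME(i.object_id) AS schema_name,
       OBJECT_NAME(i.object_id) AS table_name,
       i.name AS index_name,       -- NULL for a heap
       ds.name AS data_space_name,
       ds.type_desc                -- ROWS_FILEGROUP or PARTITION_SCHEME
FROM   sys.indexes i
       JOIN sys.data_spaces ds
          ON i.data_space_id = ds.data_space_id
WHERE  OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER  BY schema_name, table_name, i.index_id;
```

The results you get back depend entirely on the objects in your database, but a query like this is a quick way to verify that your ON <filegroup> clauses actually put things where you intended.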

Tip

■ An interesting feature of filegroups is that you can back up and restore them individually. If you need to restore and back up a single table for any reason, placing it in its own filegroup can achieve this.
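A minimal sketch of such a backup, assuming the SECONDARY filegroup from the earlier example and a hypothetical backup path; note that restoring a filegroup backup to a consistent point still requires the rest of the backup chain (full and log backups):

```sql
-- Back up only the files in the SECONDARY filegroup.
BACKUP DATABASE demonstrateFilegroups
   FILEGROUP = 'SECONDARY'
   TO DISK = 'c:\sql\backup\demonstrateFilegroups_secondary.bak';
```

This is one of the few scenarios where isolating a single large table in its own filegroup pays off even on modest hardware.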

These databases won’t be used anymore, so if you created them, just drop them if you desire:

USE master;
GO
DROP DATABASE demonstrateFilegroups;
GO
DROP DATABASE demonstrateFileGrowth;

Extents and Pages

As shown in Figure 10-2, files are further broken down into a number of extents, each consisting of eight separate 8-KB pages where tables, indexes, and so on are physically stored. SQL Server allocates space in a database only in extents. When files grow, you will notice that their size is incremented only in 64-KB increments.

Figure 10-2. Files and extents
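To see this allocation at work, sys.dm_db_index_physical_stats reports the number of pages a table or index occupies (dbo.Customer here is a hypothetical table name; very small objects may show fewer than eight pages because they initially borrow single pages from mixed extents):

```sql
-- Page counts and fullness for every index (or the heap) of one table.
-- SAMPLED mode is needed for avg_page_space_used_in_percent to be populated.
SELECT OBJECT_NAME(ps.object_id) AS object_name,
       ps.index_type_desc,
       ps.page_count,
       ps.avg_page_space_used_in_percent
FROM   sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Customer'),
                                      NULL, NULL, 'SAMPLED') AS ps;
```

Watching page_count change as you insert rows or rebuild indexes is a practical way to connect the abstract page-and-extent picture to real storage behavior.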

