Data Protection with Check Constraints and Triggers
You can’t, in sound morals, condemn a man for taking care of his own integrity. It is his clear duty.
—Joseph Conrad

One of the weirdest things I see in database implementations is that people spend tremendous amounts of time designing the correct database storage (or, at least, what seems like tremendous amounts of time to them) and then leave the data unprotected, with tables treated more or less like buckets that will accept anything, opting to let code outside of the database layer do all of the data protection. Honestly, I do understand the allure: the more constraints you apply, the harder development is in the early stages of the project, and the programmers honestly do believe that they will catch everything. The problem is, there is rarely a way to be 100% sure that all code written will always enforce every rule.
The second argument against using automatically enforced data protection is that programmers want complete control over the errors they will get back and over what events may occur that can change data.
Later in this chapter, I will suggest methods that will let data formatting or even cascading insert operations occur to make sure that certain conditions in the data are met, making coding a lot easier. While the data being manipulated “magically” can be confusing initially, you have to think of the data layer as part of the application, not as a bucket with no limitations. Keeping the data from becoming an untrustworthy calamity of random bits is in everyone’s best interest.
Perhaps, in an ideal world, you could control all data input carefully, but in reality, the database is designed and then turned over to the programmers and users to "do their thing." Those pesky users immediately exploit any weakness in your design to meet the requirements that they "thought they gave you in the first place." Any time I've forgotten to apply a UNIQUE constraint in a place where one naturally belonged (yeah, I am preaching to myself along with the choir in this book sometimes), it's amazing to me how quickly data duplications start to occur. Ultimately, user perception is governed by the reliability and integrity of the data users retrieve from your database. If they detect data anomalies in their data sets (usually in skewed report values), their faith in the whole application plummets faster than a skydiving elephant who packed lunch instead of a parachute.
After all, your future reputation is based somewhat on the perceptions of those who use the data on a daily basis.
One of the things I hope you will feel as you read this chapter (and keep the earlier ones in mind) is that, if at all possible, the data storage layer should own protection of the fundamental data integrity. Not that the other code shouldn’t play a part: I don’t want to have to wait for the database layer to tell me that a required value is missing, but at the same time, I don’t want a back-end loading process to have to use application code to validate
that the data is fundamentally correct either. If the column allows NULLs, then I should be able to assume that a NULL value is at least in some context allowable. If the column is an nvarchar(20) column with no other data checking, I should be able to put any Unicode character in the column, and up to 20 of them at that. The primary point of data protection is that the application layers ought to do a good job of making it easy for the user, but the data layer can realistically be made nearly 100 percent trustworthy, whereas the application layers cannot. At a basic level, you expect keys to be validated, data to be reasonably formatted and fall within acceptable ranges, and required values to always exist, just to name a few. When those criteria can be assured, the rest won't be so difficult, since the application layers can trust that the data they fetch from the database meets them, rather than having to revalidate.
I like to have the data validation and protection logic as close as possible to the data it guards because then you have to write this logic only once. It's all stored in the same place, and it takes forethought to bypass. At the same time, I believe you should also implement all data protection rules in the client, including the ones you have put at the database-engine level. This is mostly for usability's sake: no user wants to wait for a round-trip to the server to find out that a column value is required when the UI could simply have indicated this, either with a simple message or with visual cues. You build these simple validations into the client so users get immediate feedback. Putting code in multiple locations like this bothers a lot of people because they think it's
• Bad for performance
• More work
As C.S. Lewis had one of his evil characters note, "By mixing a little truth with it they had made their lie far stronger." Both of these are, in fact, true statements, but in the end, it is a matter of degree.
Putting code in several places is a bit worse on performance, usually in a minor way, but done right, it will help, rather than hinder, the overall performance of the system. Is it more work? Well, initially it is for sure. I certainly can’t try to make it seem like it’s less work to do something in multiple places, but I can say that it is completely worth doing. In a good user interface, you will likely code even simple rules in multiple places, such as having the color of a column indicate that a value is required and having a check in the submit button that looks for a reasonable value instantly before trying to save the value where it is again checked by the business rule or object layer.
The real problem we must solve is that data can come from multiple locations:
• Users using custom front-end tools
• Users using generic data manipulation tools, such as Microsoft Access
• Routines that import data from external sources
• Raw queries executed by data administrators to fix problems caused by user error
Each of these poses different issues for your integrity scheme. What's most important is that each of these scenarios (with the possible exception of the second) forms part of nearly every database system developed. To handle each scenario, the data must be safeguarded using mechanisms that work without relying on the user to apply them.
If you decide to implement your data logic in a tier other than directly in the database, you have to make sure that you implement it (and, far more importantly, implement it correctly) in every single one of those clients. If you update the logic, you have to update it in multiple locations. If a client is "retired" and a new one introduced, the logic must be replicated in that new client. You're much more susceptible to coding errors if you have to write the code in more than one place. Having your data protected in a single location helps prevent programmers from forgetting to enforce a rule in one situation, even if they remember everywhere else.
Because of concurrency, every statement is apt to fail due to a deadlock, or a timeout, or the data validated in the UI no longer being in the same state as it was even milliseconds ago. In Chapter 11, we will cover concurrency, but suffice it to say that errors arising from issues in concurrency are often exceedingly random in
appearance and must be treated as occurring at any time. And concurrency is the final nail in the coffin of using a client tier only for integrity checking. Unless you elaborately lock all users out of the database objects you are using, the state could change and a database error could occur. Are the errors annoying? Yes, they are, but they are the last line of defense between having excellent data integrity and something quite the opposite.
In this chapter, I will look at the two basic building blocks of enforcing data integrity in SQL Server: first, check constraints, declarative objects that allow you to define predicates that new and modified rows in a table must satisfy, and then triggers, stored-procedure-style objects that can fire after a table's contents have changed.
Check Constraints
Check constraints are one of a class of declarative constraints that are part of the base implementation of a table. Basically, constraints are SQL Server devices used to enforce data integrity automatically on a single column or row. You should use constraints as extensively as possible to protect your data, because they're simple and, for the most part, have minimal overhead.
One of the greatest aspects of all of SQL Server’s constraints (other than defaults) is that the query optimizer can use them to optimize queries, because the constraints tell the optimizer about some additional quality aspect of the data. For example, say you place a constraint on a column that requires that all values for that column must fall between 5 and 10. If a query is executed that asks for all rows with a value greater than 100 for that column, the optimizer will know without even looking at the data that no rows meet the criteria.
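To see this in action, here is a minimal sketch; the dbo.RatingDemo table and its names are hypothetical, used only to illustrate the point, and are not part of this chapter's examples:

CREATE TABLE dbo.RatingDemo (
    RatingDemoId int NOT NULL CONSTRAINT PKRatingDemo PRIMARY KEY,
    Rating int NOT NULL
       CONSTRAINT chkRatingDemo$Rating$Range CHECK (Rating BETWEEN 5 AND 10)
);

--because the trusted constraint guarantees Rating <= 10, the optimizer can
--determine that no rows can match without reading the table's data
SELECT RatingDemoId, Rating
FROM   dbo.RatingDemo
WHERE  Rating > 100;

If the constraint is trusted, the plan for the SELECT typically reduces to a constant scan that never touches the table at all.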
SQL Server has five kinds of declarative constraints:
• NULL: Determines if a column will accept NULL for its value. Though NULL constraints aren’t technically constraints, they behave like them.
• PRIMARY KEY and UNIQUE constraints: Used to make sure your rows contain only unique combinations of values over a given set of key columns.
• FOREIGN KEY: Used to make sure that any migrated keys have only valid values that match the key columns they reference.
• DEFAULT: Used to set an acceptable default value for a column when the user doesn’t provide one. (Some people don’t count defaults as constraints, because they don’t constrain updates.)
• CHECK: Used to limit the values that can be entered into a single column or an entire row.
We introduced NULL, PRIMARY KEY, UNIQUE, and DEFAULT constraints in enough detail in Chapter 6; they are pretty straightforward, without a lot of variation in the ways you will use them. In this section, I will focus the examples on the various ways to use check constraints to implement data protection patterns for your columns and rows. You use CHECK constraints to disallow improper data from being entered into columns of a table. CHECK constraints are executed after DEFAULT constraints (so a defaulted value must still satisfy any CHECK constraint on the column) and INSTEAD OF triggers (covered later in this chapter) but before AFTER triggers. CHECK constraints cannot change the values being inserted or updated but are used to verify the validity of the supplied values.
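To see the ordering between DEFAULT and CHECK constraints, consider this small sketch (the dbo.OrderDemo table is hypothetical and not part of the chapter's sample schema). The DEFAULT supplies the value first, and the CHECK constraint then validates it, so the INSERT fails with error 547:

CREATE TABLE dbo.OrderDemo (
    OrderDemoId int NOT NULL CONSTRAINT PKOrderDemo PRIMARY KEY,
    Quantity int NOT NULL
       CONSTRAINT DfltOrderDemo_Quantity DEFAULT (0)
       CONSTRAINT chkOrderDemo$Quantity$GreaterThanZero CHECK (Quantity > 0)
);

--no Quantity supplied, so the DEFAULT provides 0, which the CHECK then rejects
INSERT INTO dbo.OrderDemo (OrderDemoId)
VALUES (1);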
The biggest complaint that is often lodged against constraints is about the horrible error messages you will get back. It is one of my biggest complaints as well, and there is very little you can do about it, although I will posit a solution to the problem later in this chapter. It will behoove you to understand one important thing: all statements should have error handling as if the database might give you back an error—because it might.
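As a minimal sketch of that kind of error handling (the table, the error-number translation, and the message text here are purely illustrative, not the solution presented later in the chapter), you can wrap modifications in TRY...CATCH and translate constraint violations into friendlier messages:

BEGIN TRY
    --dbo.SomeTable is a stand-in for whatever table you are modifying
    INSERT INTO dbo.SomeTable (SomeTableId, Name)
    VALUES (1, 'A value that might violate a constraint');
END TRY
BEGIN CATCH
    --547 is the error number SQL Server raises for constraint violations
    IF ERROR_NUMBER() = 547
      BEGIN
        THROW 50000, 'The value supplied violates a data protection rule.', 1;
      END;
    ELSE
      BEGIN
        THROW; --let any other error bubble up unchanged
      END;
END CATCH;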
There are two flavors of CHECK constraint: column and table. Column constraints reference a single column and are used when the individual column is referenced in a modification. CHECK constraints are considered table constraints when more than one column is referenced in the criteria. Fortunately, you don’t have to worry about declaring a constraint as either a column constraint or a table constraint. When SQL Server compiles the constraint, it verifies whether it needs to check more than one column and sets the proper internal values.
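For illustration, here is a hypothetical table (not part of the chapter's sample schema) with one of each; SQL Server classifies them based on how many columns each predicate references, regardless of where you declare them:

CREATE TABLE dbo.DateRangeDemo (
    DateRangeDemoId int NOT NULL CONSTRAINT PKDateRangeDemo PRIMARY KEY,
    StartDate date NOT NULL,
    EndDate date NOT NULL,

    --references a single column, so it is treated as a column constraint
    CONSTRAINT chkDateRangeDemo$StartDate$NotBefore2000
        CHECK (StartDate >= '20000101'),

    --references two columns, so it is treated as a table constraint
    CONSTRAINT chkDateRangeDemo$EndDate$NotBeforeStartDate
        CHECK (EndDate >= StartDate)
);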
We’ll be looking at building CHECK constraints using two methods:
• Simple expressions
• Complex expressions using user-defined functions
The two methods are similar, but functions allow you to build more complex validations, at the cost of code that can be harder to manage. In this section, we'll look at some examples of constraints built using each of these methods and then at a scheme for dealing with errors from constraints. First, though, let's set up a simple schema that will form the basis of the examples in this section.
The examples in this section on creating CHECK constraints use the sample tables shown in Figure 7-1.
Figure 7-1. The example schema
To create and populate the tables, execute the following code (in the downloads, I include a simple create database for a database named Chapter7 and will put all objects in that database):
CREATE SCHEMA Music;
GO
CREATE TABLE Music.Artist (
    ArtistId int NOT NULL,
    Name varchar(60) NOT NULL,
    CONSTRAINT PKMusic_Artist PRIMARY KEY CLUSTERED (ArtistId),
    CONSTRAINT AKMusic_Artist_Name UNIQUE NONCLUSTERED (Name)
);

CREATE TABLE Music.Publisher (
    PublisherId int NOT NULL CONSTRAINT PKMusic_Publisher PRIMARY KEY,
    Name varchar(20) NOT NULL,
    CatalogNumberMask varchar(100) NOT NULL
        CONSTRAINT DfltMusic_Publisher_CatalogNumberMask DEFAULT ('%'),
    CONSTRAINT AKMusic_Publisher_Name UNIQUE NONCLUSTERED (Name)
);

CREATE TABLE Music.Album (
    AlbumId int NOT NULL,
    Name varchar(60) NOT NULL,
    ArtistId int NOT NULL,
    CatalogNumber varchar(20) NOT NULL,
    PublisherId int NOT NULL, --not requiring this information
    CONSTRAINT PKMusic_Album PRIMARY KEY CLUSTERED (AlbumId),
    CONSTRAINT AKMusic_Album_Name UNIQUE NONCLUSTERED (Name),
    CONSTRAINT FKMusic_Artist$records$Music_Album
        FOREIGN KEY (ArtistId) REFERENCES Music.Artist (ArtistId),
    CONSTRAINT FKMusic_Publisher$published$Music_Album
        FOREIGN KEY (PublisherId) REFERENCES Music.Publisher (PublisherId)
);
Then seed the tables with the following data:
INSERT INTO Music.Publisher (PublisherId, Name, CatalogNumberMask)
VALUES (1, 'Capitol',
        '[0-9][0-9][0-9]-[0-9][0-9][0-9a-z][0-9a-z][0-9a-z]-[0-9][0-9]'),
       (2, 'MCA', '[a-z][a-z][0-9][0-9][0-9][0-9][0-9]');

INSERT INTO Music.Artist (ArtistId, Name)
VALUES (1, 'The Beatles'), (2, 'The Who');

INSERT INTO Music.Album (AlbumId, Name, ArtistId, PublisherId, CatalogNumber)
VALUES (1, 'The White Album', 1, 1, '433-43ASD-33'),
       (2, 'Revolver', 1, 1, '111-11111-11'),
       (3, 'Quadrophenia', 2, 2, 'CD12345');
A likely problem with this design is that it isn't normalized well enough for a realistic solution. Publishers usually have a mask that's valid at a given point in time, but everything changes. If the publishers lengthen the size of their catalog numbers or change to a new format, what happens to the older data? For a functioning system, it would be valuable to have a release-date column and a catalog number mask that was valid for a given range of dates. Of course, if you implemented the table as presented, the enterprising user, to get around the improper design, would create publisher rows such as 'MCA 1989-1990', 'MCA 1991-1994', and so on and mess up the data for future reporting needs, because then you'd have work to do to correlate values from the MCA company (and your table would not even technically be in First Normal Form!).
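A sketch of what that more complete design might look like is shown here; this table is purely illustrative and is not used by the rest of the chapter's examples:

CREATE TABLE Music.PublisherCatalogNumberMask (
    PublisherId int NOT NULL,
    EffectiveStartDate date NOT NULL,
    EffectiveEndDate date NULL, --NULL means the mask is still in effect
    CatalogNumberMask varchar(100) NOT NULL,
    CONSTRAINT PKMusic_PublisherCatalogNumberMask
        PRIMARY KEY (PublisherId, EffectiveStartDate),
    CONSTRAINT FKMusic_Publisher$defines$Music_PublisherCatalogNumberMask
        FOREIGN KEY (PublisherId) REFERENCES Music.Publisher (PublisherId)
);

The Album table would then carry a release date that could be matched to the mask that was in effect when the album was released.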
As a first example of a check constraint, consider a business rule that no artist whose name contains the word 'Pet' followed by the word 'Shop' is allowed. You could code that rule as follows (note that all examples assume a case-insensitive collation, which is almost certainly the norm):
ALTER TABLE Music.Artist WITH CHECK
ADD CONSTRAINT chkMusic_Artist$Name$NoPetShopNames CHECK (Name NOT LIKE '%Pet%Shop%');
Then, test by trying to insert a new row with an offending value:
INSERT INTO Music.Artist(ArtistId, Name) VALUES (3, 'Pet Shop Boys');
This returns the following result:

Msg 547, Level 16, State 0, Line 1
The INSERT statement conflicted with the CHECK constraint "chkMusic_Artist$Name$NoPetShopNames". The conflict occurred in database "Chapter7", table "Music.Artist", column 'Name'.
This keeps my music collection database safe from at least one band from the ’80s.
When you create a CHECK constraint, the WITH NOCHECK setting (the default is WITH CHECK) gives you the opportunity to add the constraint without checking the existing data in the table.
Let’s add a row for another musician who I don’t necessarily want in my table:
INSERT INTO Music.Artist(ArtistId, Name) VALUES (3, 'Madonna');
Later in the process, a new requirement arrives: no artist whose name contains the word “Madonna” may be added to the database. But if you attempt to add a check constraint to enforce it
ALTER TABLE Music.Artist WITH CHECK
ADD CONSTRAINT chkMusic_Artist$Name$noMadonnaNames CHECK (Name NOT LIKE '%Madonna%');
rather than the happy “Command(s) completed successfully.” message you so desire to see, you see the following:
Msg 547, Level 16, State 0, Line 1
The ALTER TABLE statement conflicted with the CHECK constraint "chkMusic_Artist$Name$noMadonnaNames". The conflict occurred in database "Chapter7", table "Music.Artist", column 'Name'.
To allow the constraint to be added, you can specify it using WITH NOCHECK rather than WITH CHECK: you want the new constraint in place, but there's data in the table that conflicts with it, and it is deemed too costly to fix or clean up the existing data.
ALTER TABLE Music.Artist WITH NOCHECK
ADD CONSTRAINT chkMusic_Artist$Name$noMadonnaNames CHECK (Name NOT LIKE '%Madonna%');
The statement is executed, and the check constraint is added to the table definition; using NOCHECK means that the bad value does not affect the creation of the constraint. This is OK in some cases but can be very confusing, because anytime a modification statement references the column, the CHECK constraint is fired, and the next time you try to set the column to that same bad value, an error occurs. In the following statement, I simply set every row of the table to the same name it already has stored in it:
UPDATE Music.Artist SET Name = Name;
This gives you the following error message:
Msg 547, Level 16, State 0, Line 1
The UPDATE statement conflicted with the CHECK constraint "chkMusic_Artist$Name$noMadonnaNames".
The conflict occurred in database "Chapter7", table "Music.Artist", column 'Name'.
“What?” most users will exclaim. If the value was in the table, shouldn't it already be good? The user is correct. This kind of thing will confuse the heck out of everyone and cost you greatly in support, unless the data in question is never used. But if it's never used, just delete it, or include a time range for the values: CHECK (Name NOT LIKE '%Madonna%' OR RowCreateDate < '20111201') could be a reasonable compromise. Using NOCHECK and leaving the values unchecked is, in many ways, almost worse than leaving the constraint off entirely.
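As a sketch of that compromise (assuming a RowCreateDate column, which the sample Music.Artist table does not actually have):

--assumes a RowCreateDate column that the chapter's Music.Artist table does not have
ALTER TABLE Music.Artist WITH CHECK
   ADD CONSTRAINT chkMusic_Artist$Name$noNewMadonnaNames
       CHECK (Name NOT LIKE '%Madonna%' OR RowCreateDate < '20111201');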
■ Tip If a data value could be right or wrong, based on external criteria, it is best not to be overzealous in your enforcement. Unless you can be 100 percent sure, you will still need to make sure that the data is correct before you use it later.
One of the things that makes constraints excellent beyond the obvious data integrity reasons is that if the constraint is built using WITH CHECK, the optimizer can make use of this fact when building plans if the constraint didn’t use any functions and just used simple comparisons such as less than, greater than, and so on. For example, imagine you have a constraint that says that a value must be less than or equal to 10. If, in a query, you look for all values of 11 and greater, the optimizer can use this fact and immediately return zero rows, rather than having to scan the table to see whether any value matches.
If a constraint is built using WITH CHECK, it's considered trusted, because the optimizer can trust that all values conform to the CHECK constraint. You can determine whether a constraint is trusted by using the sys.check_constraints catalog view:
SELECT definition, is_not_trusted FROM sys.check_constraints
WHERE object_schema_name(object_id) = 'Music' AND name = 'chkMusic_Artist$Name$noMadonnaNames';
This returns the following results (with some minor formatting, of course):
definition                      is_not_trusted
------------------------------- --------------
(NOT [Name] LIKE '%Madonna%')   1
Make sure, if at all possible, that is_not_trusted = 0 for all rows, so that the system trusts all your CHECK constraints and the optimizer can use the information when building plans.
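To find every untrusted CHECK constraint in the database at once (a simple helper query, not one from the chapter's examples), you can drop the name filter:

--lists any CHECK constraints the optimizer cannot trust
SELECT OBJECT_SCHEMA_NAME(object_id) AS SchemaName,
       OBJECT_NAME(parent_object_id) AS TableName,
       name AS ConstraintName
FROM   sys.check_constraints
WHERE  is_not_trusted = 1;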
■ Caution Creating CHECK constraints using the CHECK option (instead of NOCHECK) on a tremendously large table can take a very long time to apply, so often you'll feel like you need to cut corners to get it done fast. The problem is that the shortcut on design or implementation often costs far more in later maintenance or, even worse, in the user experience. If at all possible, it's best to get everything set up properly, so there is no confusion.
To make the constraint trusted, you will need to clean up the data and use ALTER TABLE <tableName> WITH CHECK CHECK CONSTRAINT <constraintName> to have SQL Server check the constraint and set it to trusted.
Of course, this method suffers from the same issues as creating the constraint with NOCHECK in the first place (mostly, it can take forever!). But without checking the data, the constraint will not be trusted, not to mention that forgetting to reenable the constraint is too easy. For our constraint, we can try to check the values:
ALTER TABLE Music.Artist WITH CHECK CHECK CONSTRAINT chkMusic_Artist$Name$noMadonnaNames;