SQL and the CLR

Many world-class and mission-critical corporate applications have been created using T-SQL and SQL Server, so why integrate SQL Server with the CLR? The fact is, integrating the CLR provides a host of benefits to developers and DBAs that weren’t possible or weren’t easy with SQL Server 2000 and earlier. It also opens up a plethora of questions about the applicability of this still reasonably new technology.

In the last two sections, I approached the topic of stored procedures and ad hoc access to data, but as of SQL Server 2005, there’s another interesting architectural option to consider. Beyond using T-SQL to code objects, you can use a .NET language to write objects that run not in the interpreted manner that T-SQL objects do but rather in what is known as the SQLCLR. The SQLCLR is a version of the CLR hosted by SQL Server, and it serves as the platform on which .NET languages build objects that SQL Server can use just like T-SQL objects.

With the SQLCLR, Microsoft provides a choice in how to program SQL Server objects by making the enhanced programming architecture of the CLR available to them. By hosting the CLR inside SQL Server, developers and DBAs can develop SQL Server objects using any .NET-compatible language, such as C# or Visual Basic. This opens up an entirely new world of possibilities for programming SQL Server objects and makes the integration of the CLR one of the most powerful new development features of SQL Server.

Back when the CLR was introduced to us database types, it was probably the most feared new feature of SQL Server. Now that adoption of SQL Server 2005 and later versions is almost completely the norm, use of the CLR may still be the problem we suspected, but the fact is, properly built objects written in the CLR need to follow many of the same principles as T-SQL, and the CLR can be very useful when you need it.

Microsoft chose to host the CLR inside SQL Server for many reasons; some of the most important motivations follow:

• Rich language support: .NET integration allows developers and DBAs to use any .NET-compatible language for coding SQL Server objects. This includes such popular languages as C# and VB.NET.

• Complex procedural logic and computations: T-SQL is great at set-based logic, but .NET languages are superior for procedural/iterative code. .NET languages have enhanced looping constructs that are more flexible and perform far better than T-SQL. You can more easily factor .NET code into functions, and it has much better error handling than T-SQL. T-SQL has some computational commands, but .NET has a much larger selection of computational commands. Most important for complex code, .NET ultimately compiles into native code while T-SQL is an interpreted language. This can result in huge performance wins for .NET code.

• String manipulation, complex statistical calculations, custom encryption, and so on: As discussed earlier, heavy computational requirements such as string manipulation, complex statistical calculations, and custom encryption algorithms that don’t use the native SQL Server encryption fare better with .NET than with T-SQL in terms of both performance and flexibility.

• .NET Framework classes: The .NET Framework provides a wealth of functionality within its many classes, including classes for data access, file access, registry access, network functions, XML, string manipulation, diagnostics, regular expressions, arrays, and encryption.

• Leveraging existing skills: Developers familiar with .NET can be productive immediately in coding SQL Server objects. Familiarity with languages such as C# and VB.NET, as well as with the .NET Framework, is of great value. Microsoft has made the server-side data-access model in ADO.NET similar to the client-side model, using many of the same classes to ease the transition. This is a double-edged sword, as it’s necessary to determine where using .NET inside SQL Server provides an advantage over using T-SQL. I’ll consider this topic further throughout this section.

• Easier and safer substitute for extended stored procedures: You can write extended stored procedures in C++ to provide additional functionality to SQL Server. This ability necessitates an experienced developer fluent in C++ and able to handle the risk of writing code that can crash the SQL Server engine. Stored procedures written in .NET that extend SQL Server’s functionality can operate in a managed environment, which eliminates the risk of code crashing SQL Server and allows developers to pick the .NET language with which they’re most comfortable.

• New SQL Server objects and functionality: If you want to create user-defined aggregates or user-defined types (UDTs) that extend the SQL Server type system, .NET is your only choice. You can’t create these objects with T-SQL. There’s also some functionality only available to .NET code that allows for streaming table-valued functions.

• Integration with Visual Studio: Visual Studio is the premier development environment from Microsoft for developing .NET code. This environment has many productivity enhancements for developers. The Professional and higher versions also include a new SQL Server project, with code templates for developing SQL Server objects with .NET. These templates significantly ease the development of .NET SQL Server objects. Visual Studio .NET also makes it easier to debug and deploy .NET SQL Server objects.

However, while these are all good reasons for the concept of mixing the two platforms, it isn’t as if the CLR objects and T-SQL objects are equivalent. As such, it is important to consider the reasons that you might choose the CLR over T-SQL, and vice versa, when building objects. The inclusion of the CLR inside SQL Server offers an excellent enabling technology that brings with it power, flexibility, and design choices. And of course, as we DBA types are cautious people, there’s a concern that the CLR is unnecessary and will be misused by developers.

Although any technology has the possibility of misuse, you shouldn’t dismiss the SQLCLR without consideration as to where it can be leveraged as an effective tool to improve your database designs.

What really makes the case for using the CLR instead of T-SQL for some objects is that, in some cases, T-SQL just does not provide native access to the type of coding you need without looping and doing all sorts of machinations. In T-SQL, it is the SQL queries and smooth handling of data that make it a wonderful language to work with. In almost every case, if you can fashion a SQL query to do the work you need, T-SQL will be your best bet. However, once you have to start using cursors and/or T-SQL control-of-flow language (for example, looping through the characters of a string or through rows in a table), performance will suffer mightily. This is because T-SQL is an interpreted language. In a well-thought-out T-SQL object, you may have a few non-SQL statements, variable declarations, and so on. Your statements will not execute as fast as they could in a CLR object, but the difference will often just be milliseconds if not microseconds.

The real difference comes if you start to perform looping operations, as the number of operations grows fast and really starts to cost. In the CLR, the code is compiled and runs very fast. For example, I needed to get the most recent time that a row had been modified from a join of multiple tables. There were three ways to get that information. The first method is to issue a correlated subquery in the SELECT clause. I will demonstrate the query using several columns from the SalesOrderHeader and Customer tables in the AdventureWorks2012 database:

SELECT SalesOrderHeader.SalesOrderID,
       (SELECT MAX(DateValue)
        FROM (SELECT SalesOrderHeader.OrderDate AS DateValue
              UNION ALL
              SELECT SalesOrderHeader.DueDate AS DateValue
              UNION ALL
              SELECT SalesOrderHeader.ModifiedDate AS DateValue
              UNION ALL
              SELECT Customer.ModifiedDate AS DateValue) AS dates
       ) AS lastUpdateTime
FROM   Sales.SalesOrderHeader
         JOIN Sales.Customer
            ON Customer.CustomerID = SalesOrderHeader.CustomerID;

Yes, Oracle users will probably note that this subquery performs the task that their GREATEST function does (or so I have been told many times). The approach in this query is a very good one and works adequately for most cases, but it is not necessarily the fastest way to answer the question that I’ve posed. A second approach, and a far more natural one for most programmers, is to build a T-SQL scalar user-defined function:

CREATE FUNCTION dbo.date$getGreatest
(
    @date1 datetime,
    @date2 datetime,
    @date3 datetime = NULL,
    @date4 datetime = NULL
)
RETURNS datetime
AS
BEGIN
    RETURN (SELECT MAX(dateValue)
            FROM (SELECT @date1 AS dateValue
                  UNION ALL
                  SELECT @date2
                  UNION ALL
                  SELECT @date3
                  UNION ALL
                  SELECT @date4) AS dates);
END;

Now to use this, you can code the solution in the following manner:

SELECT SalesOrderHeader.SalesOrderID,
       dbo.date$getGreatest (SalesOrderHeader.OrderDate,
                             SalesOrderHeader.DueDate,
                             SalesOrderHeader.ModifiedDate,
                             Customer.ModifiedDate) AS lastUpdateTime
FROM   Sales.SalesOrderHeader
         JOIN Sales.Customer
            ON Customer.CustomerID = SalesOrderHeader.CustomerID;

This is a pretty decent approach, though it is actually slower to execute than the native T-SQL approach in the tests I have run. There is some overhead in using user-defined functions, and since the algorithm is the same, you are merely adding cost. The third method is to employ a CLR user-defined function. The function I will create is pretty basic and uses what is really a brute force algorithm:

'Requires: Imports System.Data.SqlTypes and Imports Microsoft.SqlServer.Server
<SqlFunction(IsDeterministic:=True, DataAccess:=DataAccessKind.None, _
             Name:="date$getMax_CLR", _
             IsPrecise:=True)> _
Public Shared Function MaxDate(ByVal inputDate1 As SqlDateTime, _
                               ByVal inputDate2 As SqlDateTime, _
                               ByVal inputDate3 As SqlDateTime, _
                               ByVal inputDate4 As SqlDateTime _
                               ) As SqlDateTime
    Dim outputDate As SqlDateTime

    'brute force: keep the larger value after each comparison
    If inputDate2 > inputDate1 Then outputDate = inputDate2 Else outputDate = inputDate1
    If inputDate3 > outputDate Then outputDate = inputDate3
    If inputDate4 > outputDate Then outputDate = inputDate4

    Return New SqlDateTime(outputDate.Value)
End Function

Generally, I just let VS .NET build and deploy the object into tempdb for me and script it out to distribute to other databases. (I have included this VB script in the downloads in a .vb file. If you want to build it yourself, you will use SQL Server Data Tools. I have also included the T-SQL binary representations in the download for those who just want to build the object and execute it, though other than the binary representation, it will seem very much like the T-SQL versions. In both cases, you will need to enable the CLR using the sp_configure setting 'clr enabled'.)
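As a rough illustration of that setup, the following sketch enables the CLR and then catalogs the function by hand. The assembly name DateFunctions, the file path, and the class name UserDefinedFunctions are assumptions made for this example only; the scripts in the download will use their own names and a binary literal in place of the file path.

```sql
-- Allow user CLR code to run on this instance
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO

-- Catalog the compiled assembly (hypothetical name and path)
CREATE ASSEMBLY DateFunctions
FROM 'C:\Assemblies\DateFunctions.dll'
WITH PERMISSION_SET = SAFE;
GO

-- Bind a T-SQL function name to the MaxDate method in the assembly.
-- The class name is an assumption; VB projects usually need the root
-- namespace included, as in [RootNamespace.UserDefinedFunctions].
CREATE FUNCTION dbo.date$getMax_CLR
(
    @inputDate1 datetime,
    @inputDate2 datetime,
    @inputDate3 datetime,
    @inputDate4 datetime
)
RETURNS datetime
AS EXTERNAL NAME DateFunctions.UserDefinedFunctions.MaxDate;
GO
```

SAFE is the most restrictive permission set, and it is enough here because the function touches nothing beyond its own parameters.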

For cases where the number of data parameters is great (ten or so in my testing on moderate, enterprise-level hardware), the CLR version will execute several times faster than either of the other versions. This generally holds wherever you need procedural, function-style logic rather than set-based logic.

After deploying, the call you make still looks just like normal T-SQL (the first execution may take a bit longer because the just-in-time compiler needs to compile the binaries the first time):

SELECT SalesOrderHeader.SalesOrderID,
       dbo.date$getMax_CLR (SalesOrderHeader.OrderDate,
                            SalesOrderHeader.DueDate,
                            SalesOrderHeader.ModifiedDate,
                            Customer.ModifiedDate) AS lastUpdateTime
FROM   Sales.SalesOrderHeader
         JOIN Sales.Customer
            ON Customer.CustomerID = SalesOrderHeader.CustomerID;

Ignoring for a moment the performance factors, some problems are just easier to solve using the CLR, and the solution is just as good as, if not better than, using T-SQL. For example, to get a value from a comma-delimited list in T-SQL requires either a looping operation or the use of techniques requiring a Numbers table (as introduced in Chapter 12). This technique is slightly difficult to follow and is too large to reproduce here as an illustration.

However, in .NET, getting a comma-delimited value from a list is a built-in operation:

Dim tokens() As String = Strings.Split(s.ToString(), delimiter.ToString(), _
                                       -1, CompareMethod.Text)
'return string at array position specified by parameter
If tokenNumber > 0 AndAlso tokens.Length >= tokenNumber.Value Then
    Return tokens(tokenNumber.Value - 1).Trim()
End If

In this section, I have probably made the CLR implementation sound completely like sunshine and puppy dogs. For some usages, particularly functions that don’t access data other than what is in parameters, it certainly can be that way. Sometimes, however, the sun burns you, and the puppy dog bites you and messes up your new carpet. The fact is, the CLR is not bad in and of itself but it must be treated with the respect it needs. It is definitely not a replacement for T-SQL. It is a complementary technology that can be used to help you do some of the things that T-SQL does not necessarily do well.

The basic thing to remember is that while the CLR offers some great value, T-SQL is the language on which almost all of your objects should be based. A good practice is to continue writing your routines using T-SQL until you find that it is just too difficult or slow to get done using T-SQL; then try the CLR.

In the next two sections, I will cover the guidelines for choosing either T-SQL or the CLR.

Guidelines for Choosing T-SQL

Let’s get one thing straight: T-SQL isn’t going away anytime soon. On the contrary, it’s being enhanced, along with the addition of the CLR. Much of the same code that you wrote with T-SQL back in SQL Server 7 or 2000 is still best done the same way with SQL Server 2005, 2008, 2012, and most likely many versions of SQL Server to come. If your routines primarily access data, I would first consider using T-SQL. The CLR is a complementary technology that will allow you to optimize some situations that could not be optimized well enough using T-SQL.

The exception to this guideline of using T-SQL for SQL Server routines that access data is if the routine contains a significant amount of conditional logic, looping constructs, and/or complex procedural code that isn’t suited to set-based programming. What’s a significant amount? You must review that on a case-by-case basis. It is also important to ask yourself, “Is this task even something that should be done in the data layer, or is the design perhaps suboptimal and a different application layer should be doing the work?”

If there are performance gains or the routine is much easier to code and maintain when using the CLR, it’s worth considering that approach instead of T-SQL. T-SQL is the best tool for set-based logic and should be your first consideration if your routine calls for set-based functionality (which should be the case for most code you write). I suggest avoiding rewriting your T-SQL routines in the CLR unless there’s a definite benefit. If you are rewriting routines, do so only after trying a T-SQL option and asking in the newsgroups and forums if there is a better way to do it. T-SQL is a very powerful language that can do amazing things if you understand it. But if you have loops or algorithms that can’t be done easily, the CLR is there to get you compiled and ready to go.

Keep in mind that T-SQL is constantly being enhanced, sometimes with tremendous leaps in functionality. In SQL Server 2012, Microsoft added vastly improved windowing functions, query paging extensions, and quite a lot of new functions to handle tasks that are unnatural in relational code. In 2008, they added such features as MERGE, table-valued parameters, and row constructors; and in 2005, we got CTEs (which gave us recursive queries), the ability to PIVOT data, new TRY...CATCH syntax for improved error handling, and other features that we can now take advantage of. If there are new T-SQL features you can use to make code faster, easier to write, and/or easier to maintain, you should consider this approach before trying to write the equivalent functionality in a CLR language.
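To give a flavor of how these additions can simplify code, here is a sketch (not one of the three methods timed earlier) that rewrites the greatest-date subquery using a row constructor and returns only the first page of results with the paging extension. The page size of 50 is an arbitrary choice for illustration.

```sql
SELECT SalesOrderHeader.SalesOrderID,
       -- VALUES row constructor as a derived table (SQL Server 2008 and later)
       (SELECT MAX(DateValue)
        FROM (VALUES (SalesOrderHeader.OrderDate),
                     (SalesOrderHeader.DueDate),
                     (SalesOrderHeader.ModifiedDate),
                     (Customer.ModifiedDate)) AS dates (DateValue)
       ) AS lastUpdateTime
FROM   Sales.SalesOrderHeader
         JOIN Sales.Customer
            ON Customer.CustomerID = SalesOrderHeader.CustomerID
ORDER BY SalesOrderHeader.SalesOrderID
-- OFFSET/FETCH paging (SQL Server 2012 and later): first 50 rows only
OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY;
```

The logic is the same as the UNION ALL form; the row constructor is simply shorter to write and read.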

Note

■ Truthfully, if T-SQL is used correctly with a well-designed database, almost all of your code will fit nicely into T-SQL code with only a function or two possibly needing to be created using the CLR.

Guidelines for Choosing a CLR Object

The integration of the CLR is an enabling technology. It’s not best suited for all occasions, but it has some advantages over T-SQL that merit consideration. As we’ve discussed, CLR objects compile to native code and are better suited to complex logic and CPU-intensive code than T-SQL. One of the best scenarios for the CLR approach to code is writing scalar functions that don’t need to access data. Typically, these will perform an order (or orders) of magnitude faster than their T-SQL counterparts. CLR user-defined functions can take advantage of the rich support of the .NET Framework, which includes optimized classes for functions such as string manipulation, regular expressions, and math functions. In addition to CLR scalar functions, streaming table-valued functions are another great use of the CLR. They allow you to expose arbitrary data structures (such as the file system or registry) as rowsets and allow the query processor to understand the data.

The next two scenarios where the CLR can be useful are user-defined aggregates and CLR-based UDTs. You can only write user-defined aggregates with .NET. They allow a developer to implement any aggregate, in the manner of SUM or COUNT, that SQL Server doesn’t already provide. Complex statistical aggregations would be a good example. I’ve already discussed .NET UDTs. These have a definite benefit when used to extend the type system with additional primitives such as point, SSN, and date (without time) types. As I discussed in Chapter 6, you shouldn’t use .NET UDTs to define business objects in the database.
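To make the aggregate scenario a bit more concrete, here is a sketch of how a CLR aggregate might be registered and then used like any built-in aggregate. The assembly StatFunctions, the class Median, and the aggregate name dbo.Median are all hypothetical, and the sketch assumes the assembly has already been cataloged with CREATE ASSEMBLY.

```sql
-- Bind a T-SQL aggregate name to a .NET class (hypothetical names)
CREATE AGGREGATE dbo.Median (@value float)
RETURNS float
EXTERNAL NAME StatFunctions.Median;
GO

-- Call it like SUM or COUNT; user-defined aggregates must be
-- referenced with their schema name
SELECT SalesOrderHeader.TerritoryID,
       dbo.Median(SalesOrderHeader.TotalDue) AS medianTotalDue
FROM   Sales.SalesOrderHeader
GROUP BY SalesOrderHeader.TerritoryID;
```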

CLR Object Types

This section provides a brief discussion of each of the different types of objects you can create with the CLR. You’ll also find additional discussion about the merits (or disadvantages) of using the CLR for each type.

You can build any of the following types of objects using the CLR:

• User-defined functions

• Stored procedures

• Triggers

• User-defined aggregates

• User-defined types

CLR User-Defined Functions

When the CLR was added to SQL Server, using it would have been worth the effort had it allowed you only to implement user-defined functions. Scalar user-defined functions that are highly computational are the sweet spot of coding SQL Server objects with the CLR, particularly when you have more than a statement or two executing such functions. In fact, functions are the only type of objects that I have personally created and used with the CLR. I have seen some reasonable uses of several others, but they are generally fringe use cases. Those functions have been a tremendous tool for improving performance of several key portions of the systems I have worked with.

You can make both table-valued and scalar functions, and they will often be many times faster than corresponding T-SQL objects when there is no need for data other than what you pass in via parameters. CLR
