Nielsen c19.tex V4 - 07/21/2009 1:02pm Page 492
Part III Beyond Relational

so searching for LIKE 'word%' is fast, but LIKE '%word%' is terribly slow. Searching for strings within a string can't use the b-tree structure of an index to perform a fast index seek, so it must perform a table scan instead, as demonstrated in Figure 19-1. It's like looking for all the instances of "Paul" in the telephone book. The phone book isn't indexed by first name, so each page must be scanned.

FIGURE 19-1
Filtering by a WHERE clause value that begins with a wildcard is not "sargable" (that is, not a searchable argument), so it forces the Query Optimizer to use a full scan.

Using Integrated Full-Text Search 19

Basically, Integrated Full-Text Search (iFTS) extends SQL Server beyond the traditional relational data searches by building an index of every significant word and phrase. In addition, the full-text search engine adds advanced features such as the following:

■ Searching for one word near another word
■ Searching with wildcards
■ Searching for inflectional variations of a word (such as run, ran, running)
■ Weighting one word or phrase as more important to the search than another word or phrase
■ Performing fuzzy word/phrase searches
■ Searching character data with embedded binary objects stored with SQL Server
■ Using Full-Text Search in the WHERE clause or as a data source like a subquery

Full-Text Search must be installed with the instance of SQL Server. If it's not installed on your instance, it may be added later using the SQL Server Installation Center (see Chapter 4, "Installing SQL Server 2008").

What's New with Full-Text Search?

The history of Full-Text Search began in late 1998 when Microsoft reengineered one of its search engines (Site Server Search, designed for websites) to provide search services for SQL Server 7.
The engine was called MSSearch, and it also provided search services to Exchange Content Indexing and SharePoint Portal Server 2001.

I liked Full-Text Search when it was first introduced back in SQL Server 7, and I'm glad that it's still here and that Microsoft is continuing to invest in it.

Microsoft continued to improve iFTS's performance and scalability with SQL Server 2000 and SQL Server 2005. Also, in case you didn't follow the evolution of Full-Text Search back in SQL Server 2005, Microsoft worked on bringing Full-Text Search closer to industry standards:

■ The list of noise words was renamed to the industry-standard term stoplist.
■ The many set-up stored procedures were simplified into normal DDL CREATE, ALTER, and DROP commands.

With SQL Server 2008, the old stored procedure methods of setting up Full-Text Search are deprecated, meaning they will be removed in a future version.

SQL 2008 Integrated Full-Text Search (iFTS) is the fourth-generation search component for SQL Server, and this new version is by far the most scalable and feature-rich. SQL 2008 iFTS ships in the Workgroup, Standard, and Enterprise editions of SQL Server.

With SQL Server 2008, Full-Text Search is no longer dependent on the indexing service of Windows. Instead, it is now fully integrated within SQL Server, which means that the SQL Server development team can advance Full-Text Search features without depending on an outside release cycle.

The integration of FTS in the SQL engine should also result in better performance because the Query Optimizer can make an informed decision whether to invoke the full-text engine before or after applying non-FTS filters.
Minor enhancements include the following:

■ A number of new DMVs expose the workings of iFTS
■ Forty new languages
■ Noise words management with T-SQL using CREATE FULLTEXT STOPLIST
■ Thesaurus stored in system table and instance-scoped

All the code samples in this chapter use the Aesop's Fables sample database. The Aesop_Create.sql script will create the database and populate it with 25 of Aesop's fables. The database create script as well as the chapter code can be downloaded from www.sqlserverbible.com.

Integrated Full-Text Search is not installed by default by SQL Server 2008 Setup. The option is under the Database Engine Services node. To add Integrated Full-Text Search to an existing instance, use the Programs and Features application in the Control Panel to launch SQL Server Setup in a mode that allows changing the SQL Server components.

Microsoft is less than consistent with the naming of Integrated Full-Text Search. Management Studio and Books Online sometimes call it only Full-Text Search or Full-Text Indexing. In this chapter I use Integrated Full-Text Search or iFTS. If I use a different term, it's only because the sentence is referring to a specific UI command in Management Studio.

Configuring Full-Text Search Catalogs

A full-text search catalog is a collection of full-text indexes for a single SQL Server database. Each catalog may store multiple full-text indexes for multiple tables, but each table is limited to only one catalog. Typically, a single catalog will handle all the full-text searches for a database, although dedicating a single catalog to a very large table (one with over one million rows) will improve performance.

Catalogs may index only user tables (not views, temporary tables, table variables, or system tables).

Creating a catalog with the wizard

Although creating and configuring a full-text search catalog with code is easy, the task is usually done once and then forgotten.
Unless the repeatability of a script is important for redeploying the project, the Full-Text Indexing Wizard is sufficient for configuring full-text search.

The wizard may be launched from within Management Studio's Object Explorer. With a table selected, use the context menu and select Full-Text Index ➪ Define Full-Text Index.

The Full-Text Indexing Wizard starts from a selected database and table and works through multiple steps to configure the full-text catalog, as follows:

1. Select a unique index that full-text can use to identify the rows indexed with full-text. The primary key is typically the best choice for this index; however, any non-nullable, unique, single-column index is sufficient. If the table uses composite primary keys, another unique index must be created to use full-text search.

2. Choose the columns to be full-text indexed, as shown in Figure 19-2. Valid column data types are character data types (char, nchar, varchar, nvarchar, text, ntext, and xml) and binary data types (binary, varbinary, varbinary(max), and the deprecated image). (Indexing binary images is an advanced topic covered later in this chapter.) You may need to specify the language used for parsing the words, although the computer default will likely handle this automatically.

Full-text search can also read documents stored in binary, varbinary, varbinary(max), and image columns. Using full-text search with embedded BLOBs (binary large objects) is covered later in this chapter.

FIGURE 19-2
Any valid columns are listed by the Full-Text Indexing Wizard and may be selected for indexing.

3. Enable change tracking if desired. This will automatically update the catalog when the data changes. The Automatic option means that Change Tracking is enabled and automatically updated. The Manual option means that updates are manual but change tracking is still enabled.
Change Tracking can also be completely disabled.

4. Select a catalog or opt to create a new catalog. The stoplist may also be selected.

5. Skip creating a population schedule; there's a better way to keep the catalog up-to-date. (The strategies for maintaining a full-text index are discussed later in the chapter.)

6. Click Finish.

When the wizard is finished, if Change Tracking was selected in step 3, then the Start full population check box was also automatically selected, so Full-Text Search should begin a population immediately and iFTS will be set to go as soon as all the data is indexed. Depending on the amount of data in the indexed columns, the population may take a few seconds, a few minutes, or a few hours to complete.

If Change Tracking was disabled, then the iFTS indexes are empty and need to be populated. To initially populate the catalog, right-click on the table and select Full-Text Index Table ➪ Enable Full-Text Index, and then Full-Text Index Table ➪ Start Full Population from the context menu. This directs SQL Server to begin passing data to Full-Text Search for indexing.

Creating a catalog with T-SQL code

To implement full-text search using a method that can be easily replicated on other servers, your best option is to create a SQL script. Creating a catalog with code means following the same steps as the Full-Text Indexing Wizard. Creating full-text catalogs and indexes uses normal DDL CREATE statements. The following code configures a full-text search catalog for the Aesop's Fables sample database:

    USE AESOP;

    CREATE FULLTEXT CATALOG AesopFT;

    CREATE FULLTEXT INDEX ON dbo.Fable(Title, Moral, Fabletext)
      KEY INDEX FablePK ON AesopFT
      WITH CHANGE_TRACKING AUTO;

Use the ALTER FULLTEXT INDEX command to change the full-text index to manual population.
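As a minimal sketch of that change (it is not part of the chapter's downloadable code), the following statements assume the full-text index on dbo.Fable created above. SET CHANGE_TRACKING MANUAL keeps change tracking on but stops the automatic push, START UPDATE POPULATION pushes the tracked changes on demand, and the OBJECTPROPERTYEX call is one way to see whether a population is still running. The column alias PopulateStatus is only an illustrative name.

```sql
USE Aesop;

-- Keep change tracking enabled, but stop pushing changes automatically.
ALTER FULLTEXT INDEX ON dbo.Fable SET CHANGE_TRACKING MANUAL;

-- Later, push the rows that change tracking has flagged.
ALTER FULLTEXT INDEX ON dbo.Fable START UPDATE POPULATION;

-- Report the population status of the table's full-text index (0 = idle).
SELECT OBJECTPROPERTYEX(OBJECT_ID('dbo.Fable'), 'TableFulltextPopulateStatus')
  AS PopulateStatus;

-- Return to the background population used when the index was created.
ALTER FULLTEXT INDEX ON dbo.Fable SET CHANGE_TRACKING AUTO;
```

Running the statements one at a time makes it easier to see the status value change while a population is in progress.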
Pushing data to the full-text index

Full-text indexes are different from data engine clustered and non-clustered indexes that are updated as part of the ACID transaction (see Chapter 66, "Managing Transactions, Locking, and Blocking" for details on ACID transactions). Full-text indexes are updated only when the Database Engine passes new data to the full-text engine. That's both a benefit and a drawback. On the one hand, it means that updating the full-text index doesn't slow down large-text updates. On the other hand, the full-text index is not real-time in the way SQL Server data is. If a user enters a résumé and then searches for it using full-text search before the full-text index has been updated, then the résumé won't be found.

Every full-text index begins empty, and if data already exists in the SQL Server tables, then it must be pushed to the full-text index by means of a full population. A full population re-initializes the index and passes data for all rows to the full-text index. A full population may be performed with Management Studio or T-SQL code. Because the data push is driven by SQL Server, data is sent from one table at a time regardless of how many tables might be full-text indexed in a catalog. If the full-text index is created for an empty SQL Server table, then a full population is not required.

Two primary methods of pushing ongoing changes to a full-text index exist:

■ Incremental populations: An incremental population uses a timestamp to pass any rows that have changed since the last population. This method can be performed manually from Management Studio or by means of T-SQL code or scheduled as a SQL Server Agent job (typically, each evening). Incremental population requires a rowversion (timestamp) column in the table. Incremental populations present two problems.
First, a built-in delay occurs between the time the data is entered and the time the user can find the data using full-text search. Second, incremental populations consolidate all the changes into a single process that consumes a significant amount of CPU time during the incremental change. In a heavily used database, the choice is between performing incremental populations each evening and forcing a one-day delay each time or performing incremental populations at scheduled times throughout the day and suffering performance hits at those times.

■ Change tracking and background population (default): SQL Server can watch for data changes in columns that are full-text indexed and then send what is effectively a single-row incremental population every time a row changes. While this method seems costly in terms of performance, in practice, the effect is not noticeable. The full-text update isn't fired by a trigger, so the update transaction doesn't need to wait for the data to be pushed to the full-text index. Instead, the full-text update occurs in the background slightly behind the SQL DML transaction. The effect is a balanced CPU load and a full-text index that appears to be near real-time. Change tracking can also be configured to require manual pushes of only the changed data.

Best Practice

If the database project incorporates searching for words within columns, use full-text search with change tracking and background population. It's the best overall way to balance search performance with update performance.

Maintaining a catalog with Management Studio

Within Management Studio, the iFTS catalogs are maintained with the right-click menu for each table.
The menu offers the following maintenance options under Full-Text Index Table:

■ Define Full-Text Indexing on Table: Launches the Full-Text Indexing Wizard to create a new catalog as described earlier in the chapter
■ Enable/Disable Full-Text Index: Turns iFTS on or off for the catalog
■ Delete Full-Text Index: Drops the selected table from its catalog
■ Start Full Population: Initiates a data push of all rows from the selected SQL Server table to its full-text index catalog
■ Start Incremental Population: Initiates a data push of rows that have changed since the last population in the selected table from SQL Server to the full-text index
■ Stop Population: Halts any currently running full-text population push
■ Track Changes Manually: Enables Change Tracking but does not push any data to the index
■ Track Changes Automatically: Performs a full or incremental population and then turns on change tracking so that SQL Server can update the index
■ Disable Change Tracking: Temporarily turns off change tracking
■ Apply Tracked Changes: Pushes updates of rows that have been flagged by change tracking to the full-text index as the changes occur
■ Update Index: Pushes an update of all rows that change tracking has flagged to the full-text index
■ Properties: Launches the Full-Text Search Property page, which can be used to modify the catalog for the selected table

Maintaining a catalog in T-SQL code

Each of the previous Management Studio iFTS maintenance commands can be executed from T-SQL code.
The following examples demonstrate iFTS catalog-maintenance commands applied to the Aesop's Fables sample database:

■ Full population:

    ALTER FULLTEXT INDEX ON Fable START FULL POPULATION;

■ Incremental population:

    ALTER FULLTEXT INDEX ON Fable START INCREMENTAL POPULATION;

■ Remove a full-text catalog:

    DROP FULLTEXT INDEX ON dbo.Fable;
    DROP FULLTEXT CATALOG AesopFT;

Word Searches

Once the catalog is created, iFTS is ready for word and phrase queries. Word searches are performed with the CONTAINS keyword. The effect of CONTAINS is to pass the word search to the iFTS component within SQL Server and await the reply. Word searches can be used within a query in one of two ways, CONTAINS or CONTAINSTABLE.

The Contains function

CONTAINS operates within the WHERE clause, much like a WHERE IN(subquery). The parameters within the parentheses are passed to the iFTS engine, which returns an "include" or "omit" status for each row.

The first parameter passed to the iFTS engine is the column name to be searched, or an asterisk for a search of all columns from one table. If the FROM clause includes multiple tables, then the table must be specified in the CONTAINS parameter. The following basic iFTS query searches all indexed columns for the word "Lion":

    USE Aesop;

    SELECT Title
      FROM Fable
      WHERE CONTAINS (Fable.*, 'Lion');

The following fables contain the word "Lion" in either the fable title, moral, or text:

    Title
    The Dogs and the Fox
    The Hunter and the Woodman
    The Ass in the Lion's Skin
    Androcles

Integrated Full-Text Search is not case sensitive. Even if the server is configured for a case-sensitive collation, iFTS will accept column names regardless of the case.

The ContainsTable function

Not only will iFTS work within the WHERE clause, but the CONTAINSTABLE function operates as a table or subquery and returns the result set from the full-text search engine.
This SQL Server feature opens up the possibility of powerful searches.

CONTAINSTABLE returns a result set with two columns. The first column, Key, identifies the row using the unique index that was defined when the catalog was configured. The second column, Rank, reports the ranking of the rows using values from 1 (low) to 1000 (high). There is no high/median/low range or fixed range to the rank value; the rank compares the row with other rows only with regard to the following factors:

■ The frequency/uniqueness of the word in the table
■ The frequency/uniqueness of the word in the column

Therefore, a rare word will be ranked as statistically more important than a common word.

Because rank is only a relative ranking, it's useful for sorting the results, but never assume that a certain rank value indicates significance, and therefore never filter by a rank value in the WHERE clause.

The same parameters that define the iFTS for CONTAINS also define the search for CONTAINSTABLE. The following query returns the raw data from the iFTS engine:

    SELECT *
      FROM CONTAINSTABLE (Fable, *, 'Lion');

Result:

    KEY   RANK
    3     86
    4     80
    20    48
    14    32

The key by itself is useless to a human, but joining the CONTAINSTABLE results with the Fable table, as if CONTAINSTABLE were a derived table, enables the query to return the Rank and the fable's Title, as follows:

    SELECT Fable.Title, FTS.Rank
      FROM Fable
        INNER JOIN CONTAINSTABLE (Fable, *, 'Lion') AS FTS
          ON Fable.FableID = FTS.[KEY]
      ORDER BY FTS.Rank DESC;

Result:

    Title                         Rank
    Androcles                     86
    The Ass in the Lion's Skin    80
    The Hunter and the Woodman    48
    The Dogs and the Fox          32

A fourth CONTAINSTABLE parameter, top n limit, reduces the result set from the full-text search engine much as the SQL SELECT TOP predicate does. The limit is applied assuming that the result set is sorted descending by rank so that only the highest ranked results are returned.
The following query demonstrates the top n limit throttle:

    SELECT Fable.Title, FTS.Rank
      FROM Fable
        INNER JOIN CONTAINSTABLE (Fable, *, 'Lion', 2) AS FTS
          ON Fable.FableID = FTS.[KEY]
      ORDER BY FTS.Rank DESC;

Result:

    Title                         Rank
    Androcles                     86
    The Ass in the Lion's Skin    80

The advantage of using the top n limit option is that the full-text search engine can pass less data back to the query. It's more efficient than returning the full result set and then performing a SQL TOP in the SELECT statement. It illustrates the principle of performing the data work at the server instead of the client. In this case, the full-text search engine is the server process and SQL Server is the client process.

Advanced Search Options

Full-text search is powerful, and you can add plenty of options to the search string. The options described in this section work with CONTAINS and CONTAINSTABLE.

Multiple-word searches

Multiple words may be included in the search by means of the OR and AND conjunctions. The following query finds any fables containing both the word "Tortoise" and the word "Hare" in the text of the fable:

    SELECT Title
      FROM Fable
      WHERE CONTAINS (FableText, 'Tortoise AND Hare');

Result:

    Title
    The Hare and the Tortoise

One significant issue pertaining to the search for multiple words is that while full-text search can easily search across multiple columns for a single word, it searches for multiple words only if those words are in the same column. For example, the fable "The Ants and the Grasshopper" includes the word "thrifty" in the moral and the word "supperless" in the text of the fable itself. But searching for "thrifty and supperless" across all columns yields no results, as shown here:

    SELECT Title
      FROM Fable
      WHERE CONTAINS (*, 'Thrifty AND supperless');

Result:

    (0 row(s) affected)

Two solutions exist, and neither one is pretty.
The query can be reconfigured so that the AND conjunction is at the WHERE-clause level, rather than within the CONTAINS parameter. The problem with this solution is performance. The following query requires two remote scans to the full-text search engine:

    SELECT Title
      FROM Fable
      WHERE CONTAINS (*, 'Thrifty')
        AND CONTAINS (*, 'supperless');

Result:

    Title
    The Ants and the Grasshopper
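The same WHERE-level AND can also be written with CONTAINSTABLE when the rank of each word is wanted. What follows is only a sketch of that variation, not the chapter's code: the aliases T and S, the column aliases, and the choice to sort by the sum of the two ranks are illustrative assumptions, and the query still makes two calls to the full-text search engine.

```sql
USE Aesop;

-- Each CONTAINSTABLE call searches all indexed columns for one word.
-- Joining both result sets on the key keeps only the fables that
-- contain both words, even when the words are in different columns.
SELECT F.Title,
       T.[RANK] AS ThriftyRank,
       S.[RANK] AS SupperlessRank
  FROM dbo.Fable AS F
    INNER JOIN CONTAINSTABLE (dbo.Fable, *, 'Thrifty') AS T
      ON F.FableID = T.[KEY]
    INNER JOIN CONTAINSTABLE (dbo.Fable, *, 'supperless') AS S
      ON F.FableID = S.[KEY]
  ORDER BY T.[RANK] + S.[RANK] DESC;
```

Because rank is only relative, the summed value is useful for ordering the rows and nothing more.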