CHAPTER 12 Using XML to transport relational data

entities: the XML Schema.

Again, the subject of XML and XML Schema exceeds the scope of this chapter; therefore, let's emphasize one principal benefit of using the XML Schema. In SQL Server, the XML standard is implemented as a data type, and because all data types typically represent (implement) a specific data domain, the purpose of the XML Schema with regard to the XML data type is to enforce its domain. The XML Schema guarantees that the discography data coming into or going out of our database is valid; that is, it complies with the business rules.

Data domain

A data domain defines which values are allowed in a specific data element (such as a variable or a column). For instance, a data element of a numerical type can only contain numerical data (numbers); it can't contain letters or punctuation marks (with the obvious exception of the decimal point). A data element of the integer numerical type can only contain whole numbers, and no other characters.

The XML data domain is similar to the two examples in the previous paragraph, but is governed by a much more complex set of rules defined by an XML Schema.

ENTITIES OF PRINCIPAL IMPORTANCE

Let's take another look at the physical model. We can see two entities that stand out as being more significant to the business than the rest.

The first such entity is the Album. Even from Joe's narrative it should be quite clear that the Album represents a principal business entity. It contains all the information vital to the discography business: all the data about the Tracks and about the Album itself.

The other principal entity, also verifiable in both the logical and the physical models as well as in Joe's statements, is the Band. Bands represent (at least in our particular model) the groups of Persons collectively responsible for the existence of the discography business.

This means that we'll require two XML Schemas: one to represent the Albums, and one to represent the Bands. By using two separate schemas, we'll also be able to isolate the two principal business entities (allowing independent exchange of information regarding each of them), and we'll be able to eliminate some redundancy. We'll illustrate that last statement in a minute.

Let's now implement our physical model in the form of XML Schemas. We'll be implementing the same data model as before, but this time using a different technology. See listing 1.

Listing 1: The Album XML Schema

In listing 1, we can observe how our data model can be implemented as an XML Schema from the perspective of the Album entity: a Discography contains one or more Albums, which contain one or more Tracks written by one or more Authors and performed by one or more Bands.

Because we'll be using a separate XML Schema for the Band entity, we can leave out the Band Members from the Album definition, eliminating unnecessary data redundancy, as shown in the Band XML Schema listing.

Listing: The Band XML Schema

In the Band XML Schema listing we can observe how the data model can be implemented from the perspective of the Band entity: a Discography is a collection of one or more Bands, containing one or more Members. This way, each individual Band entity can exist independently of any Album entity, but the consistency of the Discography as a whole remains intact as long as each Album entity references the appropriate Band entity (or entities).

In both listings we can observe that the XML Schemas import a third one. This is due to yet another simplification, based on the fact that both the Album and the Band XML Schemas use a shared collection of types. These shared types are defined in the Common XML Schema. For instance, the Person entity is present in the Album XML as well as the Band XML; therefore both can use the same type for the Person entity, rather than explicitly implementing two separate types with the same set of properties.

Listing: Common XML Schema

A few comments on the structure of the XML Schemas

The entities are implemented as XML elements. Their attributes are implemented as XML attributes of the XML element implementing the corresponding entity.

The relationships between the entities are implemented in the structuring of the XML and the nesting of XML elements. For example, following the logical model rule which states that the Discography entity contains Album entities, the Album element is placed inside the Discography element, and because an Album entity contains Track entities, the latter are represented by elements nested inside the Album element.

Now, if you want to see an example of the amount of redundancy eliminated because we chose to separate the two principal entities and implemented two XML Schemas instead of one, look at the XML examples containing partial discography data of two well-known rock bands published at http://www.manning.com/SQLServerMVPDeepDives (both XML Schemas are also located there).

Enabling and maintaining the data flow

After implementing the data store part of the data model, we can now focus on the operational part of the logical model. We mentioned three data management operations that will be supported by our solution: entity creation, entity modification, and entity retrieval.

Regarding their relationship to the data flow, we can divide the supported data management operations into two groups:

Inbound operations: Govern the flow of data into the database.
Create and Update are both inbound operations.
Outbound operations: Govern the flow of data out of the database. Read is the outbound operation.

With inbound operations, our objective should be clear. We'll have to

- Extract the data from the XML source.
- Insert into the data store the data that doesn't yet exist there.
- Update data that already exists in the data store to reflect the data extracted from the source.

With outbound operations, the objective is to

- Read the data from the database and return it in XML format.

Preparing the inbound data flow

Before we begin coding, we must consider all the relevant facts about the XML sources used in our solution. Both XML Schemas allow the XML to contain more than one entity. The related entities are nested in the source, which reflects the relationships between them. Not only must we extract the entities from the XML source, but we also have to do this in the correct order.

How do we determine the correct order? By reviewing the physical model, shown in figure 1, the dependency of individual sets of data can be observed (follow the arrows and identify where they all point to). When importing the data into the database, we should start with the independent entities and finish with the dependent ones.

This is a valid order of inbound operations for the Album XML Schema:

1. Title (doesn't depend on any other entity)
2. Album (depends on Title)
3. Track (depends on Title and Album)
4. Person (doesn't depend on any other entity)
5. Track Author (depends on Person and Track)
6. Band (doesn't depend on any other entity)
7. Track Performer (depends on Band and Track)

This is a valid order of inbound operations for the Band XML Schema:

1. Band (doesn't depend on any other entity)
2. Person (doesn't depend on any other entity)
3. Band Member (depends on Person and Band)

EXTRACTING DATA FROM XML USING TRANSACT-SQL

You can choose from three data retrieval methods implemented in SQL Server 2005 and SQL Server 2008, and all the details regarding them are available in Books Online. In this chapter, we only need to know the bare essentials about these methods:

- The purpose of the value() method is to extract the value from a single XML data element (a singleton) and return it in the designated data type. We'll use this method to extract the values from the XML nodes.
- The purpose of the query() method is to read data from one or more XML nodes and return a sequence of XML data elements or a single XML data element. The query() method can also be used to create XML data, but in this chapter we'll only use it to retrieve data. The return type of the query() method is XML. We'll use this method to specify the target of the extraction operation and to transform the source data if needed.
- The purpose of the nodes() method is to read data from an XML entity and return a set of XML nodes. This method returns a row of XML data for each node in the XML entity that matches the given criteria. We'll use this method to retrieve the data from the XML source in the form of a dataset representing a single entity or a single relationship between our entities.

The execution of all three methods is governed through an XQuery statement or an XPath expression passed to each of the methods as an argument. A detailed explanation of XQuery and XPath expressions is once again outside the scope of this chapter, but a brief version of the explanation is presented in the sidebar, "A few words on XPath expressions and XQuery statements."

A few words on XPath expressions and XQuery statements

The XPath expression is the principal expression used in retrieving data from XML entities. It guides the XML processor as it traverses the XML entity toward the targets containing the data that you want to extract.

For example, the /orders/order/orderDate XPath expression points to all elements named orderDate that exist inside elements named order, which in turn exist inside the element named orders, which exists at the root of the XML entity. We could compare the XPath expression with the FROM clause of a Transact-SQL (T-SQL) query.

An XPath expression can be extended with an XPath predicate, the purpose of which is to restrict the traversal of the XML entity even further. For example, the /orders/order/orderDate[. > 20080101] XPath expression contains an XPath predicate (enclosed in square brackets) restricting the XPath expression to point to only those elements named orderDate that contain values greater than 20080101. We could compare the XPath predicate with the WHERE clause of a T-SQL query.

Compared to the XPath expression, the XQuery statement provides additional functionality needed in extracting the data from XML entities and transforming it. An XQuery statement can also be used to write XML data. One or more XPath expressions are used in every XQuery statement. In this chapter, no data management operations against XML entities will require any knowledge of XQuery.

In table 3, we can see the XPath expressions pointing to individual entities of the Album XML Schema, and in table 4 we can see the XPath expressions pointing to individual entities of the Band XML Schema.

Table 3: XPath expressions used to extract the entities from the Album XML

    Entity   XPath expression
    Title    /ma:discography/ma:album
             /ma:discography/ma:album/ma:track
    Album    /ma:discography/ma:album
    Track    /ma:discography/ma:album/ma:track
    Person   /ma:discography/ma:album/ma:track/ma:author
    Band     /ma:discography/ma:album/ma:track/ma:band

Note that in tables 3 and 4, the names of the elements are prefixed with a reference to the respective XML namespace implemented by each XML Schema. You can observe all of the XML namespace declarations in the XML Schema listings earlier in this chapter. The namespaces are declared in the xmlns attributes of the root (schema) element of each XML Schema. Each XML Schema also targets a specific XML namespace, as declared in the targetNamespace attribute of the schema element. This specifies the namespace of the XML entity in which a particular XML Schema is used.

Table 4: XPath expressions used to extract the entities from the Band XML

    Entity   XPath expression
    Band     /mb:bands/mb:band
    Person   /mb:bands/mb:band/mb:member

A few words about XML namespaces

First of all, the subject of XML namespaces exceeds the scope of this chapter. But what you should know about XML namespaces in order to understand their role in these examples is that they represent the business domain in which a particular XML entity exists.

In our examples, we've introduced three XML namespaces: one for Album data, another for Band data, and a third to represent a shared domain used by both the Album and the Band domains.

Think about it: does an Album represent the same business entity as a Band? No, absolutely not!
Therefore, if we've decided on using XML to represent each of them, we need a way to distinguish between them, and this is where XML namespaces come in. An XML entity that exists in the Album namespace can't be mistaken for an XML entity that exists in the Band namespace, although they're both represented as XML. In plain English: an Album can't be a Band and a Band can't be an Album.

Microsoft SQL Server 2005 and later versions support XML namespaces and introduce two methods used to declare them using T-SQL. Throughout this chapter we'll be using the WITH XMLNAMESPACES clause to declare XML namespaces that will be used in XPath expressions.

All the details regarding XML namespaces in SQL Server and the WITH XMLNAMESPACES clause can be found in Books Online. General information regarding XML namespaces can also be found online: http://www.w3.org/TR/xml-names/

Importing the data

Using the XPath expressions listed in tables 3 and 4, we can prepare individual T-SQL SELECT statements used to extract the data from the XML source. In these SELECT statements, we'll use the XML retrieval methods mentioned earlier, and in the final definition of the query, we'll include them in INSERT statements that will be used to import the data extracted from the XML source into the corresponding tables of our Discography database.

Note that in the INSERT statements, we'll also have to prevent certain constraint violations; most of all, we'll need to prevent the import of data that already exists in the database.

EXTRACTING ALBUM DATA

The source of the Album data is an XML entity based on the Album XML Schema shown in listing 1 earlier in this chapter. This XML Schema provides the structure to hold the data for the Title, Album, and Person entities, including data for the Track Author and Track Performer associative entities.

All the details regarding XPath functions implemented in SQL Server are available in the Books Online article titled "XQuery Functions against the xml Data Type."

In the following examples, @xml designates a variable of the XML type holding the XML data in question. First off, the following listing shows the code to extract the titles.

Listing: Extracting the titles

    with xmlnamespaces
        (
        'http://www.w3.org/2001/XMLSchema-instance' as xsi
        ,'http://schemas.milambda.net/Music' as m
        ,'http://schemas.milambda.net/Music-Album' as ma
        )
    -- album titles
    select  Discography.Album.query('data(@ma:title)').value('.', 'nvarchar(450)') as Title
    from    @xml.nodes('/ma:discography/ma:album') Discography (Album)
    union
    -- track titles
    select  Discography.Track.query('data(@ma:title)').value('.', 'nvarchar(450)')
    from    @xml.nodes('/ma:discography/ma:album/ma:track') Discography (Track)

The Title entity contains both the Album and the Track titles. Because in SQL Server 2005 it's not possible to specify a union XPath expression, the two sets must be merged into one using the T-SQL UNION clause. Using the union XPath expression, the query could be simplified as shown in the following listing.

Listing: Simplified query with union XPath expression

    with xmlnamespaces
        (
        'http://www.w3.org/2001/XMLSchema-instance' as xsi
        ,'http://schemas.milambda.net/Music' as m
        ,'http://schemas.milambda.net/Music-Album' as ma
        )
    select  Discography.Album.query('data(@ma:title)').value('.', 'nvarchar(450)') as Title
    from    @xml.nodes('/ma:discography/ma:album | /ma:discography/ma:album/ma:track') Discography (Album)

Next up, the following listing shows the code to extract the albums.

Listing: Extracting the albums

    with xmlnamespaces
        (
        'http://www.w3.org/2001/XMLSchema-instance' as xsi
        ,'http://schemas.milambda.net/Music' as m
        ,'http://schemas.milambda.net/Music-Album' as ma
        )
    select  Discography.Album.query('data(@ma:title)').value('.', 'nvarchar(450)') as Title
            ,Discography.Album.query('data(@ma:published)').value('.', 'datetime') as Published
    from    @xml.nodes('/ma:discography/ma:album') Discography (Album)

The next listing shows the code to extract the tracks.

Listing: Extracting the tracks

    with xmlnamespaces
        (
        'http://www.w3.org/2001/XMLSchema-instance' as xsi
        ,'http://schemas.milambda.net/Music' as m
        ,'http://schemas.milambda.net/Music-Album' as ma
        )
    select  Discography.Track.query

[...]

CHAPTER 13 Full-text searching

    ALTER FULLTEXT INDEX ON Production.ProductDescription SET CHANGE_TRACKING AUTO;

Or to turn it off:

    ALTER FULLTEXT INDEX ON Production.ProductDescription SET CHANGE_TRACKING OFF;

If you need to, you can also disable or enable a full-text index:

    ALTER FULLTEXT INDEX ON Production.ProductDescription DISABLE;
    ALTER FULLTEXT INDEX ON Production.ProductDescription ENABLE;

The DISABLE command will turn off all change tracking, but will leave the data in the index intact. You may want to disable an index if you'll no longer be using a table for updates, or if you're about to make a huge number of updates to the table and have change tracking set to AUTO. Be aware, though, that while the index is disabled, no change tracking is performed. Thus, immediately after issuing the ENABLE command, you'll want to issue the ALTER command with START FULL POPULATION to rebuild the index.

Finally, you may want to remove the full-text index altogether. To do so, issue the DROP command:

    DROP FULLTEXT INDEX ON Production.ProductDescription;

Querying full-text indexes

So far we've created a catalog, and then created a full-text index to put in the catalog. But users are strange creatures; not only do they expect us to keep their data safe, but they expect to get it back!
So let's proceed to step three in our one-two-three: querying our full-text index.

Basic searches

SQL Server provides an assortment of ways to query our full-text index. The first we'll look at is the CONTAINS keyword, which is added to the WHERE clause of a query. The form is CONTAINS(column, 'word'), where column is the name of the column you want to search and 'word' is the word or phrase you want to look for. If you want to search all of the columns in the table you've indexed, you can use an asterisk (*) in place of the column name. Let's do a simple search of our product description table:

    SELECT ProductDescriptionID AS PDID, [Description]
      FROM [Production].[ProductDescription] pd
     WHERE CONTAINS(pd.[Description], 'ride');

Running this query returns 10 rows. Note that we could also have used CONTAINS(*, 'ride') and received the same results, because we only have one column in the full-text index.

CONTAINS looks for an exact match of the word you pass in. Most of the time, searching for an exact match is what you'll want to do. Sometimes, though, you'll want something less exact, which returns a broader scope of results. For those times, SQL Server provides the FREETEXT keyword. The format is identical to CONTAINS: FREETEXT(column, 'text')

    SELECT ProductDescriptionID AS PDID, [Description]
      FROM [Production].[ProductDescription] pd
     WHERE FREETEXT(pd.[Description], 'ride');

When you run this, you get 22 rows back. If you look through the results, in addition to the word ride, you'll also see descriptions that contain the word riding. This is because of the way FREETEXT functions. FREETEXT works in a two-step process:

Stemming: The full-text search engine takes the word, in this case ride, and adds the variants of the word to its search. Thus it would have ride, rides, rode, ridden, and riding in the list.
Thesaurus: After doing the stemming, FREETEXT then goes to the thesaurus and retrieves the list of words that go with ride. For example, it might add words such as drive, commute, and transportation to the list. The process is repeated for each word found in the stemming process.

FREETEXT then performs a search on all of the words and returns the results to you. Whereas CONTAINS looks for an exact match, FREETEXT looks to match the meaning of the word you're searching for. Which will be better for your application depends on your users and their needs.

So far, we've looked at some fairly simple queries, but we don't have to limit ourselves. Here's a slightly more complex example that uses FREETEXT and joins data from three different tables in the AdventureWorks database. This example is closer to what you might do in a production application:

    SELECT [Name], ProductNumber, [Description]
      FROM [Production].[Product] p
         , [Production].[ProductDescription] pd
         , [Production].[ProductModelProductDescriptionCulture] pmpdc
     WHERE p.ProductModelID = pmpdc.ProductModelID
       AND pmpdc.ProductDescriptionID = pd.ProductDescriptionID
       AND FREETEXT(pd.[Description], 'shift');

FORMSOF

CONTAINS and FREETEXT are both powerful, but there are times when you want something a little looser than CONTAINS, but perhaps not quite as free as FREETEXT. SQL Server provides a way to get to this middle ground by using FORMSOF inside the text string we're searching for. The syntax is admittedly arcane, so let's look at an example:

    SELECT [Name], ProductNumber, [Description]
      FROM [Production].[Product] p
         , [Production].[ProductDescription] pd
         , [Production].[ProductModelProductDescriptionCulture] pmpdc
     WHERE p.ProductModelID = pmpdc.ProductModelID
       AND pmpdc.ProductDescriptionID = pd.ProductDescriptionID
       AND CONTAINS(pd.[Description], 'FORMSOF(INFLECTIONAL, light)' );

As you can see, to use this, place FORMSOF inside the text string you're passing into the CONTAINS keyword. This strange syntax throws a lot of people, so I'm going to say it again to emphasize the point: you must enclose FORMSOF inside the string you're passing to the CONTAINS keyword.

After FORMSOF, in parentheses, use either the word INFLECTIONAL or THESAURUS, then a comma, and then the word we're looking for (in this example, light). When using FORMSOF INFLECTIONAL, the full-text engine will go through the stemming process, generating the words light, lightest, lit, and so on. But unlike FREETEXT, it'll stop there and not go to the thesaurus to add any more words.

Conversely, you can also use the thesaurus without the stemming by using FORMSOF THESAURUS:

    SELECT [Name], ProductNumber, [Description]
      FROM [Production].[Product] p
         , [Production].[ProductDescription] pd
         , [Production].[ProductModelProductDescriptionCulture] pmpdc
     WHERE p.ProductModelID = pmpdc.ProductModelID
       AND pmpdc.ProductDescriptionID = pd.ProductDescriptionID
       AND CONTAINS(pd.[Description], 'FORMSOF(THESAURUS, light)' );

In this second case, the full-text engine will look for all the words in its thesaurus that match the word light, but will not look for any stemmers of light.

Phrases, NEAR, OR, and prefixed terms

We can pull a few other tricks out of our hat when performing searches. It's possible to search for exact phrases by enclosing them inside double quotes. If we wanted to search for the phrase stiff ride, all we'd have to do is pass it into the CONTAINS predicate like this:

    CONTAINS(pd.[Description], '"stiff ride"' )

In this case, the full-text engine will only return results when it finds that exact phrase.

Let's say your marketing folks are typical of those in most companies: they throw great parties, but aren't very consistent when it comes to data entry. Sometimes they use stiff ride, but other times stiff, ride and perhaps even stiff stable ride. Yet we still want to find results when stiff and ride are in the same description, in close proximity. For those searches, the NEAR keyword was created. Phrase your CONTAINS clause like

    CONTAINS(pd.[Description], 'stiff NEAR ride' )

and it will return all of the results you're looking for.

In order to use NEAR most effectively, it should be combined with the ranking features provided by CONTAINSTABLE and FREETEXTTABLE, discussed later in this chapter. Behind the scenes, NEAR returns all results where both words are found. It then assigns a rank to them, based on how far apart in the text the two words occur. The closer together, the higher the rank; the more distant, the lower. A rank of zero is given when the two words are more than 50 words apart. Therefore, you'll need to combine NEAR with the ranking feature so that results can be sorted in a way most useful to your users.

You may also see times where a tilde (~) is substituted for the word NEAR, as in

    CONTAINS(pd.[Description], 'stiff ~ ride' )

Although this syntax is acceptable, it's not nearly as readable and thus not widely used. I highly encourage you to stick to the word NEAR for readability.

It's also possible to pass an OR clause into the full-text engine:

    CONTAINS(pd.[Description], 'stiff OR ride' )

This will return all matches where either word, stiff or ride, is in the results.

The final way to search is using prefixed terms. This is the closest to a form of the traditional SQL LIKE syntax, but with some important differences. LIKE performs pattern matching, and will search for the characters anywhere in the text. If you were to enter

    and pd.[Description] like '%light%'

as part of a SQL statement, it would return a hit on semilightweight. Assuming you only wanted search results that began with the word light, this would be an undesirable result. Prefixed term search, on the other hand, only looks for full words (not patterns) that begin with the word you're searching for. To use a prefixed term, append an asterisk (*) to the end of the word, like so:

    CONTAINS(pd.[Description], '"light*"' )

If you look at the results of a query with light*, you'll see the word lightweight returned. This word hasn't been in any of our previous results because it was neither a stemmer nor a thesaurus match. Instead, the full-text engine went to the index and found all words that began with light and bypassed any stemming or thesaurus activity.

It's necessary to enclose the prefixed term inside double quotes; otherwise the search engine will only look for light and ignore the *. Additionally, as the name prefixed terms suggests, the * only has this effect at the end of the search word: full-text search matches prefixes, so placing the * at the front of the word doesn't give you a suffix search.

Note that you may have seen light-weight in some results. This is because the hyphen (-) acted as a word breaker and the full-text engine considered light-weight to be two words: light and weight.

Ranking

When searching, especially with FREETEXT, it can be desirable to order your search results in terms of relevance, returning the results that most closely match your search term or phrase first and those that match the least last. To achieve this, SQL Server provides two more functions: CONTAINSTABLE and FREETEXTTABLE. These directly correspond to the way CONTAINS and FREETEXT searching work, but instead of being used in the WHERE clause, they return a table. Because they both act the same in terms of how to use them, we'll use FREETEXTTABLE for our example:

    SELECT [KEY], RANK
      FROM FREETEXTTABLE([Production].[ProductDescription]
                        , [Description]
                        , 'light' );

The first thing to notice is that FREETEXTTABLE is used in the FROM clause. As stated, both of these return a table for you to work with. Into the function we pass three parameters. The first is the name of the table we're free-text searching, in this case Production.ProductDescription. The second item is the column we're searching, here Description. We could've also passed in an * to search through all full-text-indexed columns in our table.
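The all-columns form mentioned above might look like the following. This is only a sketch, assuming the same full-text index on Production.ProductDescription used throughout this chapter; because that index covers a single column, it should behave the same as naming [Description] explicitly.

```sql
-- Sketch: pass * as the column argument to search every
-- full-text-indexed column of the table.
-- Assumes the chapter's full-text index on Production.ProductDescription.
SELECT [KEY], [RANK]
  FROM FREETEXTTABLE([Production].[ProductDescription], *, 'light');
```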
The final item is the word we're looking for; here I used light.

Note our SELECT statement. Both FREETEXTTABLE and CONTAINSTABLE return two columns: KEY and RANK. KEY is the unique key for the row in the source table. RANK is an indicator of relevance.

RANK will be a number from 0 to 1000. Remember some important things when dealing with this number. First of all, it's not sequential. There will be gaps in the numbers returned. Second, the numbers aren't unique. You may have multiple rows returned that all have the same rank.

The following table shows the results of the preceding query run against our AdventureWorks database. As you can see, we have multiple rows with a rank of 113, as well as 310.

Table: Simple ranking query results

    KEY    RANK
    249    113
    409    310
    457    113
    704    310
    1090   113
    1183   437
    1185   310
    1199   310
    1206   310

I think you'd agree that this simple query isn't overly useful. Let's look at a slightly more complex query:

    SELECT fts.[KEY], fts.[RANK], [Description]
      FROM [Production].[ProductDescription] AS pd
     INNER JOIN FREETEXTTABLE([Production].[ProductDescription]
                             , [Description]
                             , 'light' ) AS fts
        ON fts.[KEY] = pd.ProductDescriptionID
     ORDER BY fts.[RANK];

The following table shows the results from the query.

Table: Results for medium-complexity ranking query

    KEY    RANK   Description
    249    113    Value-priced bike with many features of our top-of-the-line models. Has the same light, stiff frame, and the quick acceleration we're famous for.
    457    113    This bike is ridden by race winners. Developed with the AdventureWorks Cycles professional race team, it has an extremely light heat-treated aluminum frame, and steering that allows precision control.
    1090   113    Our lightest and best quality aluminum frame made from the newest alloy; it is welded and heat-treated for strength. Our innovative design results in maximum comfort and performance.
    1185   310    Aluminum cage is lighter than our mountain version; perfect for long-distance trips.
    1199   310    Light-weight, wind-resistant, packs to fit into a pocket.
    1206   310    Simple and light-weight. Emergency patches stored in handle.
    704    310    A light yet stiff aluminum bar for long-distance riding.
    409    310    Alluminum-alloy frame provides a light, stiff ride, whether you are racing in the velodrome or on a demanding club ride on country roads.
    1183   437    Affordable light for safe night riding - uses three AAA batteries.

This gives us more information than the previous query, but let's take it further and, as shown in listing 1, create a query you might use in a real-world production application.

Listing 1: Real-world example using FREETEXTTABLE

    SELECT fts.[KEY], fts.[RANK], [Name]
         , ProductNumber, [Description]
      FROM [Production].[ProductDescription] pd
     INNER JOIN FREETEXTTABLE([Production].[ProductDescription]
                             , [Description], 'light' ) AS fts
        ON fts.[KEY] = pd.ProductDescriptionID
     INNER JOIN [Production].[ProductModelProductDescriptionCulture] AS pmpdc
        ON pmpdc.ProductDescriptionID = pd.ProductDescriptionID
     INNER JOIN [Production].[Product] AS p
        ON p.ProductModelID = pmpdc.ProductModelID
     -- ORDER BY added so that the most relevant rows come first, as the text describes
     ORDER BY fts.[RANK] DESC;

Here we combine data from multiple tables and order them in a manner most relevant to the user. Because this query returns 43 rows, I'll leave it to your inquiring mind to key it in and see the results.

Whereas querying completes step three of our one-two-three, you'll often want to customize the behavior of your searches, especially when using FREETEXT. To do that, we'll need to look at customizing the thesaurus and stopwords.

Custom thesaurus and stopwords

Using FREETEXT or FORMSOF THESAURUS, it's possible to search for words using the thesaurus to augment your search. A natural question is, can we add our own words to the thesaurus?
The answer is a definite yes. In this section, we'll look at how to customize your thesaurus, and then cover the concept of stopwords.

Custom thesaurus

The first step is tracking down the name of your thesaurus file. SQL Server stores the location in the registry. Open regedit or your favorite registry tool and navigate to HKEY_LOCAL_MACHINE > SOFTWARE > Microsoft > Microsoft SQL Server > [insert your instance name here] > MSSearch > Language > [insert your language abbreviation here].

WARNING Be careful not to accidentally change your registry entries. We only want to examine values here.

My instance name is MSSQL10.MSSQLSERVER. For residents of the United States, the language abbreviation is ENU (short for English, US). (If you don't know your abbreviation, don't worry; shortly we'll look at a simple query that will let you discover it.) Regardless of where you live, navigate to the branch with your appropriate language abbreviation. Looking at the value named TsaurusFile, you'll discover the file name is tsenu.xml.

Now we need to go to the appropriate place on your hard drive. Assuming your instance is typical, this will be in C:\Program Files\Microsoft SQL Server\[your instance name]\MSSQL\FTData\. For me, this was C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\FTData\. Now that we've located the file, let's navigate to the folder and open the file in your favorite text editor. (Be sure your text editor can handle Unicode files. Notepad works fine for this.)

TIP If you're running on Vista, Windows 7, or Windows Server, make sure you run the text editor in Administrator mode so you can save changes!
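If you aren't sure which instance folder to look for, a query can give you a starting point. The following is a minimal sketch, not part of the original walkthrough. It assumes the default SQL Server 2008 folder naming (MSSQL, the major version number, a dot, and the instance name). An installation can use a custom instance ID, and SQL Server 2008 R2 uses a different prefix, so treat the result as a guess to confirm against the registry.

```sql
-- PARSENAME splits the four-part version string on dots;
-- part 4 is the leftmost piece, the major version (10 for SQL Server 2008).
DECLARE @major varchar(10);
SET @major = PARSENAME(CONVERT(varchar(32), SERVERPROPERTY('ProductVersion')), 4);

-- @@SERVICENAME returns MSSQLSERVER for a default instance,
-- or the instance name for a named instance.
SELECT @@SERVICENAME                          AS ServiceName
     , 'MSSQL' + @major + '.' + @@SERVICENAME AS LikelyInstanceFolder;  -- assumed naming convention
```

The LikelyInstanceFolder column is only a hint for where to look under C:\Program Files\Microsoft SQL Server\; the registry value remains the authoritative source.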
Take a look at the following listing.

Listing: Default thesaurus XML file

    <XML ID="Microsoft Search Thesaurus">
    <!--  Commented out
        <thesaurus xmlns="x-schema:tsSchema.xml">
        <diacritics_sensitive>0</diacritics_sensitive>
            <expansion>
                <sub>Internet Explorer</sub>
                <sub>IE</sub>
                <sub>IE5</sub>
            </expansion>
            <replacement>
                <pat>NT5</pat>
                <pat>W2K</pat>
                <sub>Windows 2000</sub>
            </replacement>
            <expansion>
                <sub>run</sub>
                <sub>jog</sub>
            </expansion>
        </thesaurus>
    -->
    </XML>

The first thing to notice is that the file is commented out. We'll want to uncomment it as the first step. Next you'll see the <thesaurus> section. We won't need to change it, nor the <diacritics_sensitive> tag, which specifies accent sensitivity. The next three sections are examples. Ultimately you'll delete them and replace them with your own, but let's take a moment to look at what's there.

The first section is an <expansion> tag. With an expansion set, all terms are equivalent: if the user enters any of those terms, it's the same as if they'd entered all of the terms. Thus in the first example, if a user were to type in Internet Explorer, a full-text search would return all records that contained Internet Explorer, IE, or IE5.

Replacement sets, the next section, are something you'll use less often. With replacements, SQL Server doesn't look for the word in the <pat> (pattern) tag; instead it looks for the word in the <sub> tag. In this case, if a user types in W2K, the full-text search engine will instead look for Windows 2000.

One true-life situation I can think of where this would be useful is addresses. If you know your system converts all street or state abbreviations to their full expanded name, then you could use a replacement to catch the abbreviations in full-text searches. For example:

    <replacement>
        <pat>St</pat>
        <pat>Str</pat>
        <sub>Street</sub>
    </replacement>

Thus a user typing in Elm St would be able to find Elm Street in your system. You could also extend this to states. Let me reiterate: this example assumes your source system automatically replaces all abbreviations with their full words. Using a replacement prevents the full-text search engine from looking for words that won't be there.

Replacements can also be useful in situations of error correction. For example, your database of addresses is used by your company's offshore
support center, where they frequently misspell Street as Stret. You could add Stret to the thesaurus and help the users. Of the two, you'll probably use expansions most of the time, but know that replacement sets exist and what they can be useful for.

Finally, please note there are a handful of restrictions around the thesaurus file:

- You must have system administrator rights to be able to edit the file.
- You should make sure the editor understands Unicode format.
- Entries can't be empty.
- Phrases placed in the thesaurus have a maximum of 512 characters.
- You may not have any duplicate entries among the <sub> tags of expansion sets or the <pat> tags of replacement sets.

So let's create a simple example to test out our custom thesaurus. Open up your tsenu.xml file, and change it to this:

    <XML ID="Microsoft Search Thesaurus">
        <thesaurus xmlns="x-schema:tsSchema.xml">
        <diacritics_sensitive>0</diacritics_sensitive>
            <expansion>
                <sub>light</sub>
                <sub>doodleysquat</sub>
            </expansion>
        </thesaurus>
    </XML>

Here I'm making the word light and the word doodleysquat substitutable for each other. In case you're wondering, we're picking a nonsense word that we're sure we won't find in the AdventureWorks database. Save the changes.

Unfortunately, saving the changes isn't enough to make the full-text search engine pick up the updates to our thesaurus file. We have to tell SQL Server the file has been updated. In SQL Server 2008, this is fairly simple. Make sure you're in the right database for your catalog (in this case, AdventureWorks2008) and execute the stored procedure:

    exec sys.sp_fulltext_load_thesaurus_file 1033;
    go

The 1033 on the end refers to the locale identifier (LCID) for the language of your thesaurus file; 1033 is for English, US. To discover your LCID, use this simple query:

    select [name], lcid
    from sys.fulltext_languages
    order by [name];

Okay, we're at the finish line. Assuming the reload succeeded, the full-text engine should have picked up your new, customized thesaurus file. Let's go back to a query we used earlier, slightly altered:

    SELECT [Name],
           ProductNumber, [Description]
    FROM [Production].[Product] p
       , [Production].[ProductDescription] pd
       , [Production].[ProductModelProductDescriptionCulture] pmpdc
    WHERE p.ProductModelID = pmpdc.ProductModelID
      AND pmpdc.ProductDescriptionID = pd.ProductDescriptionID
      AND CONTAINS(pd.[Description], 'FORMSOF(Thesaurus, doodleysquat)' );

We should get the same results as if we'd used the word light instead of doodleysquat.

As you can see, adding custom entries to the SQL Server full-text search thesaurus isn't difficult; there are only a few steps you need to follow to make it happen. Once you know how, you can use the functionality to make searches more productive for your users.

Stopwords and stoplists

Every language has some words that are used so often that indexing them for full-text searching would be useless: words such as a, an, and, the, or, and so forth. These words are known as noise words and will be ignored when you're doing any full-text indexing.

In some companies, certain words become noise words. For example, your company may have a rule that its company name must appear in all comment records in the form of a copyright notice. In that case, your company name would become a noise word because it appears so often as to be meaningless to search for.

SQL Server 2005 had support for custom noise words for its full-text search, but it was in the form of a simple text file that applied to the entire server. SQL Server 2008 has moved to a new concept, stopwords. This change is much deeper than just a rebranding. With the change comes a lot more flexibility and functionality.

SQL Server 2008 introduces two new tools: stopwords and stoplists. A stoplist acts as a named container for a group of stopwords. You can then associate a stoplist with one or more tables. This is a great enhancement over noise words, which applied to the entire server. Now you can associate a group of stopwords, in a stoplist, with specific tables without affecting the rest of the tables in the database
or server.

Let's run a query to demonstrate. We'll use the same query we've used elsewhere in this chapter:

    SELECT [Name], ProductNumber, [Description]
    FROM [Production].[Product] p
       , [Production].[ProductDescription] pd
       , [Production].[ProductModelProductDescriptionCulture] pmpdc
    WHERE p.ProductModelID = pmpdc.ProductModelID
      AND pmpdc.ProductDescriptionID = pd.ProductDescriptionID
      AND CONTAINS(pd.[Description], 'shifting');

To create a stoplist, you can use the first of the new SQL Server 2008 commands, CREATE FULLTEXT STOPLIST:

    CREATE FULLTEXT STOPLIST ArcanesStoplist;

The stoplist will act as a holder for a specific set of words that we want to ignore. We refer to that group of words by the name we gave it, ArcanesStoplist. Now we need to add some words to the list. Here are two ways to do so; both use the ALTER FULLTEXT STOPLIST statement:

    ALTER FULLTEXT STOPLIST ArcanesStoplist
        ADD 'shifting' LANGUAGE 1033;

    ALTER FULLTEXT STOPLIST ArcanesStoplist
        ADD 'light' LANGUAGE 'English';

The command is straightforward: use the ALTER FULLTEXT STOPLIST statement and provide the name of the list you want to add a word to. Then comes the command ADD, followed by the word you want to add. Next you have to specify the language. You can specify the language in two ways, either by using the language ID (in this example, 1033) or the name of the language.

If you were to jump the gun and rerun our query, you'd think it would now ignore the word shifting because we just added it as a stopword to our stoplist. But there's still one more step. You need to attach your stoplist to a table that has a full-text index on it. This is a major improvement over 2005, where stopwords were implemented as noise words, one simple text file that applied to the entire server. SQL Server 2008 now allows you to get granular with the application of custom groups of words. You're
limited to one stoplist per table, though. On the plus side, one stoplist can be applied to multiple tables. Here's the code to associate our stoplist with a table:

    ALTER FULLTEXT INDEX ON [Production].[ProductDescription]
        SET STOPLIST ArcanesStoplist;

All we need to do is specify the table name and the stoplist to associate with that table. Now go run our test query. You should get back zero rows.

Congratulations! You've now associated your stoplist with the full-text index. I'm sure you don't want to leave it this way, so let's look at what it will take to clean up the mess. First, you can decide you no longer want the stoplist associated with the full-text index. Time to use the ALTER command again:

    ALTER FULLTEXT INDEX ON [Production].[ProductDescription]
        SET STOPLIST system;

Setting the stoplist to the keyword SYSTEM switches the table from your custom stoplist back to the standard system set of stopwords. You can also use the word OFF instead of SYSTEM to turn off stopwords altogether for the specified table. If you want a custom list that starts from the system stopwords, you can create a stoplist as a copy of the system stoplist (CREATE FULLTEXT STOPLIST with the FROM SYSTEM STOPLIST clause), and then add or remove words as needed.

There may be times when you want to remove only a word or two from a stoplist, but not disassociate the entire list. It's easy to remove individual words from the list:

    ALTER FULLTEXT STOPLIST ArcanesStoplist
        DROP 'shifting' LANGUAGE 1033;

    ALTER FULLTEXT STOPLIST ArcanesStoplist
        DROP 'light' LANGUAGE 'English';

The syntax for DROP is identical to ADD, except for using the word DROP instead of ADD. Finally, you may want to drop the stoplist altogether. There's a DROP statement for that as well:

    DROP FULLTEXT STOPLIST ArcanesStoplist;

This covers the basic use of stopwords and
stoplists in 2008. Let's take a moment now to look at some advanced queries that will help you manage your stopwords and stoplists.

USEFUL QUERIES PERTAINING TO STOPWORDS AND STOPLISTS

Our first query returns a list of all the user-defined stoplists in our database:

    SELECT stoplist_id, name
    FROM sys.fulltext_stoplists;

Our next query returns a list of stopwords for our user-defined stoplists in the database. Note the joins to get the associated stoplist name and language:

    SELECT sl.name AS StoplistName
         , sw.stopword AS StopWord
         , lg.alias AS LanguageAlias
         , lg.name AS LanguageName
         , lg.lcid AS LanguageLCID
    FROM sys.fulltext_stopwords sw
    JOIN sys.fulltext_stoplists sl
        ON sl.stoplist_id = sw.stoplist_id
    JOIN master.sys.syslanguages lg
        ON lg.lcid = sw.language_id;

This next query gets a list of all of the stopwords that ship with SQL Server 2008:

    SELECT ssw.stopword, slg.name
    FROM sys.fulltext_system_stopwords ssw
    JOIN sys.fulltext_languages slg
        ON slg.lcid = ssw.language_id;

Used together, these queries can help you discover the state of your stopwords and stoplists.

Useful system queries

So far we've learned how to create catalogs and full-text indexes, along with a variety of ways to get information back out of our indexes. All are great tools for the database developer. If you're a database administrator, though, there are a few more tricks you can use to administer and take care of your full-text indexes.

Basic queries to discover what catalogs, indexes, and columns exist

Let's start with a few basic queries. First, let's retrieve a list of the catalogs that are associated with the current database:

    SELECT fulltext_catalog_id, name, is_default
    FROM sys.fulltext_catalogs;

This may return one or more rows, depending on how many catalogs you have. If you created the AdventureWorksFTC catalog, it'll appear in the name column. The
fulltext_catalog_id is an auto-incrementing numeric primary key without any meaning. The is_default column will contain a 1 if this catalog is the default; otherwise the value will be 0.

This is useful, so let's dig further. Let's get a list of all of the indexes in our database. To do so, we'll delve into the sys.fulltext_indexes view, joining it to the sys.fulltext_catalogs view we used in the previous example:

    -- List names of all FTS indexes
    SELECT cat.[name] AS CatalogName
         , object_name(fti.object_id) AS table_name
         , fti.is_enabled
         , fti.change_tracking_state_desc
    FROM sys.fulltext_indexes fti
    JOIN sys.fulltext_catalogs cat
        -- match each index to its own catalog; without this the two views cross join
        ON cat.fulltext_catalog_id = fti.fulltext_catalog_id
    ORDER BY cat.[name], table_name;

The results of this query are shown in the following table. Our query returns four rows, including the row for the Document table we created earlier in the chapter. Note that the query shows that all of these indexes are enabled, and it also displays the change-tracking mode being used.

Table: Query results to list all full-text indexes

    CatalogName          table_name            is_enabled    change_tracking_state_desc
    AdventureWorksFTC    Document              1             AUTO
    AdventureWorksFTC    JobCandidate          1             AUTO
    AdventureWorksFTC    ProductDescription    1             AUTO
    AdventureWorksFTC    ProductReview         1             AUTO

We can extend the prior query one step further, adding in information about the unique index used for full-text searching and the associated stoplist for a given table, as shown in the following listing.

Listing: Full information about tables and full-text searching

    SELECT c.name AS CatalogName
         , t.name AS TableName
         , idx.name AS UniqueIndexName
         , case i.is_enabled
             when 1 then 'Enabled'
             else 'Not Enabled'
           end AS IsEnabled
         , i.change_tracking_state_desc
         , sl.name AS StoplistName
    FROM sys.fulltext_indexes i
    JOIN sys.fulltext_catalogs c
        ON i.fulltext_catalog_id = c.fulltext_catalog_id
    JOIN sys.tables t
        ON i.object_id = t.object_id
    JOIN sys.indexes idx
        ON i.unique_index_id = idx.index_id AND
           i.object_id = idx.object_id
    LEFT JOIN sys.fulltext_stoplists sl
        ON sl.stoplist_id = i.stoplist_id;

So far we've looked at catalogs and tables. The next logical question most would ask is "Which columns are full-text indexed?" I'm glad you asked! The query shown in the following listing will tell us. The results of the query are shown in the table after it.

Listing: List all columns that are full-text indexed

    SELECT t.[Name] AS TableName
         , c.[Name] AS ColumnName
         , (case ColumnProperty(t.[object_id], c.[Name], 'IsFulltextIndexed')
              when 1 then 'True'
              when 0 then 'False'
              else 'Invalid Input'
            end) AS IsFullTextIndexed
    FROM sys.tables t
    JOIN sys.all_columns c
        ON t.[object_id] = c.[object_id]
    WHERE ColumnProperty(t.[object_id], c.[Name], 'IsFulltextIndexed') = 1
    ORDER BY t.[Name], c.column_id;

Table: Query results to list all columns that are full-text indexed

    TableName             ColumnName         IsFullTextIndexed
    Document              DocumentSummary    True
    Document              Document           True
    JobCandidate          Resume             True
    ProductDescription    Description        True
    ProductReview         Comments           True

Note that our ProductDescription.Description column appears in the list. If you want to check the status of all columns, and not only the ones that are full-text indexed, omit the WHERE clause from the query.

Advanced queries

It would be nice to return more information about our catalogs, such as their size, how many items are in them, and so forth. SQL Server provides a useful function called FullTextCatalogProperty. You provide it two parameters: the name of the catalog and the name of the property you wish to examine. A handful of the properties you can check this way turn out to be useful, so let's look at a query that returns these as a single result set. The following listing shows the query, and the table after it shows the results.

Listing: Using FullTextCatalogProperty to get information

    SELECT [name] AS CatalogName
         , FullTextCatalogProperty('AdventureWorksFTC',
               'IndexSize') AS IndexSizeMB
         , FullTextCatalogProperty('AdventureWorksFTC', 'ItemCount') AS ItemCount
         , FullTextCatalogProperty('AdventureWorksFTC', 'UniqueKeyCount') AS UniqueKeyCount
         , CASE FullTextCatalogProperty('AdventureWorksFTC', 'PopulateStatus')
             WHEN 0 THEN 'Idle'
             WHEN 1 THEN 'Full population in progress'
             WHEN 2 THEN 'Paused'
             WHEN 3 THEN 'Throttled'
             WHEN 4 THEN 'Recovering'
             WHEN 5 THEN 'Shutdown'
             WHEN 6 THEN 'Incremental population in progress'
             WHEN 7 THEN 'Building index'
             WHEN 8 THEN 'Disk is full. Paused.'
             WHEN 9 THEN 'Change tracking'
             ELSE 'Error reading FullTextCatalogProperty PopulateStatus'
           END AS PopulateStatus
         , CASE is_default
             WHEN 1 THEN 'Yes'
             ELSE 'No'
           END AS IsDefaultCatalog
    FROM sys.fulltext_catalogs
    ORDER BY [name];

Table: Query results for catalog information query

    Catalog name             Index size (MB)    Item count    Unique key count    Populate status    Is default catalog
    AdventureWorksFTC        0                  762           3194                Idle               Yes
    AW2008FullTextCatalog    0                  762           3194                Idle               No

After the catalog name, you'll see the size of the index, in megabytes. Don't be alarmed that what you see in the table is 0; this means the catalog is less than 1 megabyte in size. This is because we're running our query against the small sample database.

The next column is the Item Count. This indicates the number of rows from the source system that were indexed. Unique Key Count, on the other hand, is an indication of the number of entries in the catalog; in other words, how many unique word/primary key combinations are in the catalog.

Populate Status returns a number from 0 to 9, indicating what activity the full-text catalog is performing. Because we don't want to have to remember what each number means, the query uses a simple case statement to let us know in language we can understand.

Finally, the Is Default Catalog column tells us whether the catalog is the default for this database. Normally it returns 1 (for
default) or 0 (for not the default), but because the case statement was useful for populate status, we might as well use it here as well.

At the beginning of the chapter, we talked about the varbinary(max) data type. As you'll recall, varbinary(max) is used to store a document whose type SQL Server understands; the full-text engine will open the document and index the contents. It's possible to retrieve a list of all document types understood by SQL Server, using the following query:

    SELECT document_type, path, [version], manufacturer
    FROM sys.fulltext_document_types;

This returns four columns. Because SQL Server 2008 understands 50 types right out of the box, I won't list them all here. The main column is document_type; in it you'll see extensions such as txt, doc, xls, aspx, cmd, cpp, and many more. If the document saved in a varbinary(max) column has one of the extensions from the document_type column, SQL Server will understand it. SQL Server uses the library located in the path column to do the work of opening and reading the document. The last two columns, version and manufacturer, are informational and let us know who made the library and what version it is.

If your full-text performance begins to suffer over time, you might want to check how many fragments exist. Internally, SQL Server stores index data in special tables it calls full-text index fragments. A table can have one or more fragments associated with it, but if that number gets too high, it can degrade performance. The query shown in the following listing will tell you how many fragments exist for your full-text index.

Listing: Determining the number of fragments for your full-text indexes

    -- See how many fragments exist for each full text index.
    -- If multiple closed fragments exist for a table, do a REORGANIZE to help performance.
    SELECT t.name AS TableName
         , f.data_size
         , f.row_count
         , case f.status
    ...
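The listing above is cut off in this excerpt, so what follows is a separate, minimal sketch of the same idea rather than the author's query. It counts queryable fragments per table from sys.fulltext_index_fragments, assuming (per the view's documentation) that status 4 means a closed fragment that is ready for queries and status 6 means a fragment ready for queries that is also being used as merge input. The catalog name in the final statement is the one used throughout this chapter; substitute your own.

```sql
-- Tables whose full-text index has more than one queryable fragment.
SELECT t.name           AS TableName
     , COUNT(*)         AS QueryableFragments
     , SUM(f.row_count) AS TotalRows
     , SUM(f.data_size) AS TotalBytes
FROM sys.fulltext_index_fragments AS f
JOIN sys.tables AS t
    ON t.object_id = f.table_id
WHERE f.status IN (4, 6)      -- assumption: 4 = closed, 6 = merge input; both queryable
GROUP BY t.object_id, t.name
HAVING COUNT(*) > 1
ORDER BY QueryableFragments DESC;

-- If a table shows many fragments, reorganizing its catalog merges them.
ALTER FULLTEXT CATALOG AdventureWorksFTC REORGANIZE;
```

A reorganize rebuilds index structures and can take a while on large catalogs, so it's probably best run during a quiet period.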