CHAPTER 50 SQL Server Full-Text Search

Here are some examples using FREETEXT and FREETEXTTABLE:

    USE AdventureWorks;
    SELECT * FROM Person.Contact WHERE FREETEXT(*, 'Barack Obama');

    USE AdventureWorks;
    SELECT *
    FROM Sales.Individual AS s
    JOIN (SELECT [key], rank FROM FREETEXTTABLE(Person.Contact, *, '"jon"', 100)) AS k
        ON k.[key] = s.ContactID
    ORDER BY rank DESC;

Notice that the FREETEXTTABLE example does the functional equivalent of a CONTAINSTABLE query because the search term is wrapped in double quotation marks.

Stop Lists

Stop lists are used when you want to hide words in searches or to keep out of the index those words that would otherwise bloat your full-text index and might cause performance problems. Stop lists (also known as noise word lists or stop word lists) are a legacy component from decades ago, when disk space was very expensive. Back then, using stop lists could save considerable disk space. However, with disk space now relatively cheap, the use of stop lists is no longer as critical as it once was.

You can create your own stop word list by expanding your database in SSMS, right-clicking the Full-Text Stoplists node, and selecting New Full-Text Stoplist. You can create your own stop list, base it on the system stop list, create an empty one, or base it on another stop list in a different database. Each full-text index can have its own stop list, which is a frequently demanded feature because some search consumers want to prevent some words from being indexed in one table but want those words indexed in a different table.

After you create a stop word list, you can maintain it by right-clicking it in the Full-Text Stoplists node and selecting Properties. Figure 50.5 illustrates this option. The options are to add a stop word, delete a stop word, delete all stop words, and clear the stop list.
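The same stop list tasks can also be scripted. The following is a minimal T-SQL sketch rather than a definitive procedure. It assumes SQL Server 2008 with the database at compatibility level 100 and an existing full-text index on Person.Contact; the stop list name MyStoplist and the stop words are hypothetical.

```sql
-- Create a stop list based on the system stop list (MyStoplist is a hypothetical name).
CREATE FULLTEXT STOPLIST MyStoplist FROM SYSTEM STOPLIST;

-- Add two illustrative stop words for English (LCID 1033).
ALTER FULLTEXT STOPLIST MyStoplist ADD 'adventureworks' LANGUAGE 1033;
ALTER FULLTEXT STOPLIST MyStoplist ADD 'contoso' LANGUAGE 1033;

-- Delete a single stop word again.
ALTER FULLTEXT STOPLIST MyStoplist DROP 'contoso' LANGUAGE 1033;

-- Associate the stop list with an existing full-text index.
ALTER FULLTEXT INDEX ON Person.Contact SET STOPLIST MyStoplist;

-- Review the stop words currently stored in each user-defined stop list.
SELECT sl.name, sw.stopword, sw.language
FROM sys.fulltext_stoplists AS sl
JOIN sys.fulltext_stopwords AS sw
    ON sw.stoplist_id = sl.stoplist_id;
```

ALTER FULLTEXT STOPLIST also accepts DROP ALL LANGUAGE and DROP ALL, which correspond to deleting all stop words for one language and clearing the stop list.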
After selecting the option you want, you can enter a stop word and the language to which that stop word applies. Keep in mind that stop lists are applied both at query time (while searching) and at index time (while indexing). Changes made to a stop list are reflected immediately in searches but are applied only to newly indexed content. Stop words that were indexed before the change remain in the catalog until you rebuild the catalog. It is a best practice to rebuild your catalog as soon as you have made changes to your stop word list. To rebuild your full-text catalog, right-click the catalog in SSMS and select Rebuild.

Full-Text Search Maintenance

After you create full-text catalogs and indexes that you can query, you have to maintain them. The catalogs and indexes largely maintain themselves, so you need to focus on backing them up and restoring them as well as on tuning your search solution for optimal performance.

In SQL Server 2008, the full-text catalogs become fragmented over time, especially if you are using the Automatic (Track Changes Automatically) setting. You can check the level of fragmentation by using the following command, which returns one row for each fragment that is available for querying; many rows for the same table_id indicate a fragmented index:

    SELECT table_id, status
    FROM sys.fulltext_index_fragments
    WHERE status = 4 OR status = 6;

FIGURE 50.5 Maintaining a full-text stop list.

If you notice that your full-text indexes are highly fragmented, you should optimize them. Here is the command you would use to do this:

    ALTER FULLTEXT CATALOG AdventureWorks2008 REORGANIZE;

Full-Text Search Performance

SQL Server FTS performance is most sensitive to the number of rows in the result set and the number of search terms in the query. You should limit your result set to a practical number; most searchers are conditioned to look only at the first page of results, and if they don't see what they need there, they refine the search and search again. A good practical limit for the number of rows to return is 200.
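One way to apply such a limit is the optional top_n_by_rank argument of CONTAINSTABLE (FREETEXTTABLE accepts the same argument). The following sketch assumes that Person.Contact in AdventureWorks has a full-text index keyed on ContactID; the search term and the selected columns are only illustrative.

```sql
USE AdventureWorks;

-- Return at most the 200 highest-ranked matches for the search term.
SELECT c.ContactID, c.FirstName, c.LastName, k.[RANK]
FROM CONTAINSTABLE(Person.Contact, *, '"jon"', 200) AS k
JOIN Person.Contact AS c
    ON c.ContactID = k.[KEY]
ORDER BY k.[RANK] DESC;
```

Limiting the rows inside the full-text function means fewer rows have to be joined back to the base table than if the limit were applied afterward with TOP.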
You should try, if at all possible, to use simple queries because they perform better than more complex ones. As a rule, you should use CONTAINS rather than FREETEXT because it offers better performance, and you should use CONTAINSTABLE rather than FREETEXTTABLE for the same reason.

Several factors are involved in delivering an optimal Full-Text Search solution. Consider the following:

. Avoid indexing binary content. Convert it to text, if possible. Most IFilters do not perform as well as the text IFilter.
. Use an integer column for the unique key index on the base table.
. Partition large tables into smaller tables. There seems to be a sweet spot around 50 million rows, but your results may vary. Ensure that for large tables, each table has its own catalog. Place this catalog on a RAID 10 array, preferably on its own controller.
. SQL Full-Text Search benefits from multiple processors, preferably four or more. A sweet spot exists on eight-way machines or better. You will find 64-bit hardware also offers substantial performance benefits over 32-bit.
. Dedicate at least 512MB to 1GB of RAM to MSFTESQL by setting the maximum server memory to 1GB less than the installed memory. Set resource usage to 5 to give a performance boost to the indexing process (that is, sp_fulltext_service 'resource_usage', 5), set ft crawl bandwidth (max) and ft notify bandwidth (max) to 0, and set max full-text crawl range to the number of CPUs on your system. Use sp_configure to make these changes.

Full-Text Search Troubleshooting

The first question you should ask yourself when you have a problem with SQL Full-Text Search is this: "Is the problem with searching or with indexing?" To help you make this determination, Microsoft has included three DMVs in SQL Server 2008:

. sys.dm_fts_index_keywords
. sys.dm_fts_index_keywords_by_document
. sys.dm_fts_parser

The first two DMVs display the contents of your full-text index.
The first DMV returns the following columns:

. Keyword: Each keyword in varbinary form.
. Display_term: The keyword as indexed; all the accents are removed from the word.
. Column_ID: The column ID where the word exists.
. Document_Count: The number of rows (documents) in which the word occurs in that column.

The second DMV breaks down the keywords by document. Like the first DMV, it contains the Keyword, Display_term, and Column_ID columns, but in addition it contains the following two columns:

. Document_ID: The row in which the keyword occurs.
. Occurrence_count: The number of times the word occurs in the cell (a cell is a row-column combination; for example, the contents of the third column in the fifth row).

The first DMV, sys.dm_fts_index_keywords, is used primarily to determine candidate noise words; it can also be used to diagnose indexing problems. The second DMV, sys.dm_fts_index_keywords_by_document, is used to determine what is stored in your index for a particular cell.

Here are some examples of their usage:

    SELECT * FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID('MyTable'));
    SELECT * FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('MyTable'));

These two DMVs are used to determine what occurs at index time. The third DMV, sys.dm_fts_parser, is used primarily to determine what happens at search time; in other words, how SQL Server Full-Text Search interprets your search phrase. Here is an example of its usage:

    SELECT * FROM sys.dm_fts_parser(@QueryString, @LCID, @StopListID, @AccentSensitive);

@QueryString is your search word or phrase, @LCID is the locale ID for your language (determinable by querying sys.fulltext_languages), @StopListID is the ID of your stop list (determinable by querying sys.fulltext_stoplists), and @AccentSensitive sets accent sensitivity (0 is not sensitive to accents, 1 is sensitive to accents).
Here is an example of how this works:

    SELECT * FROM sys.dm_fts_parser('café', 1033, 0, 1);
    SELECT * FROM sys.dm_fts_parser('café', 1033, 0, 0);

In the second example, you will notice that the Display_term is cafe and not café. These queries return the following columns:

. Keyword: This is a varbinary representation of your keyword.
. Group_id: The query parser builds a parse tree of the search phrase. If you have any Boolean searches, it assigns different group IDs to each part of the search term. For example, in the search phrase '"Hillary Clinton" OR "Barack Obama"', Hillary and Clinton belong to Group ID 1, and Barack and Obama belong to Group ID 2.
. Phrase_id: Some words are indexed in multiple forms; for example, data-base is indexed as data, base, and database. In this case, data and base have the same phrase ID, and database has another phrase ID.
. Occurrence: This is the position of the term in the parse result (1 for the first term, 2 for the second, and so on).
. Special_term: This column refers to any delimiters that the parser finds in the search phrase. Possible values are Exact Match, End of Sentence, End of Paragraph, and End of Chapter.
. Display_term: This is how the term would be stored in the index.
. Expansion_type: This is the type of expansion, whether it is a thesaurus expansion (4), an inflectional expansion (2), or not expanded (0). For example, the following query shows the stemmed variants of the word run:

    SELECT * FROM sys.dm_fts_parser('FORMSOF(INFLECTIONAL, run)', 1033, 0, 0);

. Source_Term: This is the source term as it appears in your query.

When troubleshooting indexing problems, you should consult the full-text error log, which can be found in C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\LOG and starts with the prefix SQLFT followed by the database ID (padded with leading zeros), the catalog ID (query sys.fulltext_catalogs for this value), and then the extension .log.
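The expected log file name for each catalog in the current database can be derived with a query along the following lines. This is a sketch under one stated assumption: that the database ID and the catalog ID are each zero-padded to five digits, which is inferred from file names of the form SQLFT0001800005.LOG. Verify the result against the files in your own LOG directory.

```sql
-- Build the expected full-text log file name for each catalog in the current database.
-- Assumption: the database ID and the catalog ID are each zero-padded to five digits.
SELECT name AS catalog_name,
       'SQLFT'
       + RIGHT('00000' + CAST(DB_ID() AS varchar(10)), 5)
       + RIGHT('00000' + CAST(fulltext_catalog_id AS varchar(10)), 5)
       + '.LOG' AS expected_log_file
FROM sys.fulltext_catalogs;
```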
You may find many versions of the log, each with a numerical extension, such as SQLFT0001800005.LOG.4; this is the fourth version of this log. These full-text indexing logs can be read with any text editor. You might find entries in this log that indicate documents were retried or documents failed indexing, in addition to error messages returned from the IFilters.

Summary

SQL Server 2008 Full-Text Search offers extremely fast and powerful querying of textual content stored in tables. In SQL Server 2008, the full-text index creation statements are highly symmetrical with the table index creation statements, making SQL Server FTS much more intuitive to use than in previous versions. Also new is the tremendous increase in indexing and querying speeds. These features make SQL Server Full-Text Search a very attractive component of SQL Server 2008.

CHAPTER 51 SQL Server 2008 Analysis Services

IN THIS CHAPTER

. What's New in SSAS
. Understanding SSAS and OLAP
. Understanding the SSAS Environment Wizards
. An Analytics Design Methodology
. An OLAP Requirements Example: CompSales International

SQL Server 2008 Analysis Services (SSAS) continues to expand with numerous data warehousing, data mining, and online analytical processing (OLAP)-rich tools and technologies. Microsoft continues to attack the data warehousing/business intelligence (BI) market by pouring millions and millions of dollars into this area. Microsoft knows that the world is hungry for analytics and is betting the farm on it. As a part of its internal project named "Madison," Microsoft has been acquiring other complementary BI technologies to accelerate its plans (such as acquiring the MPP data warehousing appliance company DATAllegro and rolling it under its BI offering). Other more traditional (and much more expensive) OLAP and BI platforms such as Cognos, Hyperion, Business Objects, and MicroStrategy are being challenged, if not completely replaced, by this new version of SSAS.
A chief data architect from a prominent Silicon Valley company said recently, "I can now build [using SSAS] sound, extremely usable, highly scalable, OLAP cubes myself, faster and smarter than the entire data warehouse team could do only a few years ago." This is what Microsoft has been trying to bring to the forefront for years: "BI for the masses."

What's New in SSAS

SQL Server 2005 was the big jump into completely redeploying Analysis Services, from the architecture, to the development environment, to the multidimensional languages supported, and even to the wizard-driven deployments. SQL Server 2008 R2 raises this core work up a few more notches with enhancements at almost every part of SSAS and with the addition of major scale-out capabilities. Following are some of the top new features and enhancements:

. Microsoft has improved and streamlined the Cube Designer.
. Several subtle enhancements have been made around the Dimension and Aggregation Designers.
. You can now create attribute relationships with the new Attribute Relationship Designer.
. You can use subspace computations to optimize performance for your Multidimensional Expressions (MDX) queries.
. Multidimensional OLAP (MOLAP) enables write-back capabilities that support high-performance "what if" scenarios.
. A shared read-only Analysis Services database between several Analysis Services servers enables you to "scale out" easily and efficiently.
. You are able to use localized analytical data in native languages, including translation capabilities and automatic currency conversions.
. A highly compressed and optimized data cache is maintained automatically.
. Backup performance is optimized.
. SQL Server PowerPivot for Excel is a new feature.
. The master data hub in SQL Server 2008 R2 helps manage your master data services more efficiently.

And, last but not least,
SQL Server 2008 R2 Parallel Data Warehouse is a highly scalable, appliance-based data warehouse solution built on massively parallel processing (MPP).

Understanding SSAS and OLAP

Because OLAP is at the heart of SSAS, you need to understand what it is and how it solves the requirements of decision makers in a business. As you might already know, data warehousing requirements typically include all the capability needed to report on a business's transactional history, such as sales history. This transactional history is often organized into subject areas and tiers of aggregated information that can support some online querying and usually much more batch reporting. Data warehouses and data marts typically extract data from online transaction processing (OLTP) systems and serve data up to these business users and reporting systems. In general, these are all called decision support systems (DSS), or BI systems, and the latency of this data is determined by the business requirements it must support. Typically, this latency is daily or weekly, depending on the business needs, but more and more, we are seeing real-time (or near-real-time) reporting requirements.

FIGURE 51.1 Multidimensional representation of business facts. (An OLAP cube with TIME, PRODUCT, and GEOGRAPHY dimensions, each organized as a hierarchy; the highlighted Sales Units cell holds 996 for France, February 2010, and IBM Laptop Model 451D.)

OLAP falls squarely into the realm of BI. The purpose of OLAP is to provide a mostly online reporting environment that can support various end-user reporting requirements. Typically, OLAP data is represented as OLAP cubes. A cube is a multidimensional representation of basic business facts that can be accessed easily and quickly to provide you with the specific information you need to make a critical decision.
It is useful to note that a cube can be composed of 1 to N dimensions. However, remember that the business facts represented in a cube must exist for all the dimensions being defined for the fact. In other words, all dimensional values (that is, intersections) have to be present for a fact value to be stored in the cube.

Figure 51.1 illustrates the Sales_Units historical business fact, which is the intersection of time, product, and geography dimensional data. For a particular point in time (February 2010), for a particular product (IBM laptop model 451D), and in a particular country (France), the sales units were 996 units. With an OLAP cube, you can easily see how many of these laptop computers were sold in France in February 2010.

Basically, cubes enable you to look at business facts via well-defined and organized dimensions (time, product, and geography dimensions, in this example). Note that each of these dimensions is further organized into hierarchical representations that correspond to the way data is looked at from the business point of view. This provides the capability to drill down from a higher, broader level into the next level (like drilling down into a specific country's data within a geographic region, such as France's data within the European geographic region).

SSAS directly supports this and other data warehousing capabilities. In addition, SSAS allows a designer to implement OLAP cubes using a variety of physical storage techniques that are directly tied to data aggregation requirements and other performance considerations. You can easily access any OLAP cube built with SSAS via the PivotTable Service, you can write custom client applications by using MDX with OLE DB for OLAP or ActiveX Data Objects Multidimensional (ADO MD), and you can use a number of third-party "OLE DB for OLAP" compliant tools.
Microsoft utilizes something called the Unified Dimensional Model (UDM) to conceptualize all multidimensional representations in SSAS. It is also worth noting that many of the leading OLAP and statistical analysis software vendors have joined the Microsoft Data Warehousing Alliance and are building front-end analysis and presentation tools for SSAS. The data mining capabilities that are part of SSAS provide a new avenue for organized data discovery. This includes using SQL Server Data Mining Extensions (DMX).

This chapter takes you through the major components of SSAS, discusses a mini-methodology for OLAP cube design, and leads you through creating and managing a robust OLAP cube that can easily be used to meet a company's BI needs.

Understanding the SSAS Environment Wizards

Welcome to the "land of wizards." This implementation of SSAS, as with older versions of SSAS, is heavily wizard oriented. SSAS has a Cube Wizard, a Dimension Wizard, a Partition Wizard, a Storage Design Wizard, a Usage Analysis Wizard, a Usage-Based Optimization Wizard, an Aggregation Wizard, a Calculated Cells Wizard, a Mining Model Wizard, and a few other wizards. All of them are useful, and many of their capabilities are also available through editors and designers. Using a wizard is helpful for those who need a little structure in the definition process and who want to rely on defaults for much of what they need. The wizards are also plug-and-play oriented and have been made available in all SQL Server and .NET development environments. In other words, you can access these wizards from wherever you need to, when you need to. All the wizard-based capabilities can also be coded in MDX, DMX, and Analysis Services Scripting Language (ASSL).

Figure 51.2 shows how SSAS fits into the overall scheme of the SQL Server 2008 environment. SSAS has become completely integrated into the SQL Server platform.
Utilizing many different mechanisms, such as SSIS and direct data source access capabilities, a vast amount of data can be funneled into the SSAS environment. Most of the cubes you build will likely be read-only because they will be for BI. However, a write-enabled capability (WriteBack) is available in SSAS for situations that meet certain data updatability requirements.

As you can also see in Figure 51.2, the basic components in SSAS are all focused on building and managing data cubes. SSAS consists of the analysis server, processing services, integration services, and a number of data providers. SSAS has both server-based and client-/local-based capabilities. This essentially provides a complete platform for OLAP.

FIGURE 51.2 SSAS as part of the overall SQL Server 2008 environment. (OLTP databases and a multidimensional data warehouse feed the SSAS processing engine and the UDM, partly through SSIS packages; the server-based engine, msmdsrv.exe, and the local cube engine, msmdlocal.exe, serve client applications over XMLA through ADO MD.NET, OLE DB for OLAP, and ADO MD.)

You create cubes by preprocessing aggregations (that is, precalculated summary data) that reflect the desired levels within dimensions and support the type of querying that will be done. These aggregations provide the mechanism for rapid and uniform response times to queries. You create them before the user uses the cube. All queries utilize either these aggregations, the cube's source data, a copy of this data in a client cube, data in cache, or a combination of these sources. A single Analysis Server can manage many cubes.
You can have multiple SSAS instances on a single machine.

By orienting around UDM, SSAS allows for the definition of a cube that contains data measures and dimensions. Each cube dimension can contain a hierarchy of levels to specify the natural categorical breakdown that users need to drill down into for more details. Look back at Figure 51.1, and you can see a product hierarchy, time hierarchy, and geography hierarchy representation.

The data values within a cube are represented by measures (the facts). Each measure of data might utilize different aggregation options, depending on the type of data. Unit data might require the SUM (summarization) function, Date of Receipt data might require the MAX function, and so on. Members of a dimension are the actual level values, such as the particular product number, the particular month, and the particular country.

Microsoft has removed most of the earlier limits within SSAS. SSAS supports up to 2,147,483,647 of almost anything within its environment (for example, dimensions in a database, attributes in a dimension, databases in an instance, levels in a hierarchy, cubes in a database, measures in a cube). In reality, you will probably not have more than a handful of dimensions. Remember that dimensions are the paths to the interesting facts. Dimension members should be textual and are used as criteria for queries and as row and column headers in query results.
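For readers who think in relational terms, the roles of measures, dimension members, and hierarchy levels can be approximated with a GROUP BY ROLLUP query. This is only an analogy in T-SQL, not how SSAS stores or processes cubes. The table variable, its columns, and the sample rows are hypothetical; only the 996-unit value for France in February 2010 comes from the Figure 51.1 example.

```sql
-- Hypothetical fact rows: one row per country, month, and product.
DECLARE @Sales TABLE (
    Country      varchar(30),
    SalesYear    int,
    SalesMonth   int,
    Product      varchar(50),
    SalesUnits   int,   -- additive measure, aggregated with SUM
    DateReceived date   -- measure aggregated with MAX
);

INSERT INTO @Sales VALUES
    ('France',  2010, 1, 'IBM Laptop Model 451D', 450, '2010-01-28'),
    ('France',  2010, 2, 'IBM Laptop Model 451D', 996, '2010-02-25'),
    ('Germany', 2010, 2, 'IBM Laptop Model 451D', 333, '2010-02-26');

-- ROLLUP adds a subtotal row for each level of the Country > Year > Month
-- grouping, similar in spirit to the precalculated aggregations in a cube.
SELECT Country, SalesYear, SalesMonth,
       SUM(SalesUnits)   AS TotalSalesUnits,
       MAX(DateReceived) AS LastDateReceived
FROM @Sales
GROUP BY ROLLUP (Country, SalesYear, SalesMonth);
```

Run the script as a single batch, because the table variable goes out of scope at the end of the batch. Rows in which a grouping column is NULL are the subtotals for the next level up, which is roughly what a user sees when drilling up in a cube.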