Hands-On Microsoft SQL Server 2008 Integration Services (part 58)

In this model, the dimensions are denormalized, and a business view of a dimension is represented as a single table. The dimensions have simple primary keys, and the fact table has a set of foreign keys pointing to the dimensions; the combination of these keys forms a compound primary key for the fact table. This structure makes the model simple for users to understand. Writing select queries is easier because of the simple joins between the dimensions and the fact table. Query performance is also generally better because fewer joins are needed than in other models. Finally, a star schema can easily be extended by adding new dimensions, as long as the fact entity holds the related data.

Snowflake Model

Sometimes it is difficult to denormalize a dimension; in other words, it makes more sense to keep a dimension in a normalized form, especially when multiple levels of relationships exist in the dimension data or a child member has multiple parents in a dimension. In this model, the dimension suitable for snowflaking is split into its hierarchies, which results in multiple tables linked to each other via relationships, generally one-to-many. A many-to-many relationship is handled using a bridge table between the dimensions, sometimes called a FactLessFact table. For example, the AdventureWorksDW2008 product dimension DimProduct is a snowflake dimension that is linked to DimProductSubcategory, which is further linked to the DimProductCategory table (refer to Figure 12-2). This structure makes sense to database developers and data modelers and helps users write useful queries, especially when an OLAP tool such as SSAS supports such a structure and optimizes the running of snowflaked queries.
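The snowflaked product dimension described above can be sketched in a few lines. This is only an illustration, not the AdventureWorksDW2008 schema itself: it uses Python's built-in sqlite3 module instead of SQL Server, the fact table FactSales and all sample rows are hypothetical, and only the three dimension table names come from the text. It shows that rolling sales up to the category level requires walking two extra joins through the snowflaked tables.

```python
import sqlite3


def build_snowflake(conn):
    """Create a tiny snowflaked product dimension plus a hypothetical fact table."""
    conn.executescript("""
        CREATE TABLE DimProductCategory (
            ProductCategoryKey INTEGER PRIMARY KEY,
            CategoryName       TEXT NOT NULL
        );
        CREATE TABLE DimProductSubcategory (
            ProductSubcategoryKey INTEGER PRIMARY KEY,
            SubcategoryName       TEXT NOT NULL,
            ProductCategoryKey    INTEGER NOT NULL
                REFERENCES DimProductCategory(ProductCategoryKey)
        );
        CREATE TABLE DimProduct (
            ProductKey            INTEGER PRIMARY KEY,
            ProductName           TEXT NOT NULL,
            ProductSubcategoryKey INTEGER NOT NULL
                REFERENCES DimProductSubcategory(ProductSubcategoryKey)
        );
        -- Hypothetical fact table; sample rows are invented for illustration.
        CREATE TABLE FactSales (
            ProductKey  INTEGER NOT NULL REFERENCES DimProduct(ProductKey),
            SalesAmount REAL NOT NULL
        );
        INSERT INTO DimProductCategory VALUES (1, 'Bikes'), (2, 'Accessories');
        INSERT INTO DimProductSubcategory VALUES
            (1, 'Road Bikes', 1), (2, 'Mountain Bikes', 1), (3, 'Helmets', 2);
        INSERT INTO DimProduct VALUES
            (1, 'Road bike A', 1), (2, 'Mountain bike B', 2), (3, 'Helmet C', 3);
        INSERT INTO FactSales VALUES (1, 1000.0), (2, 500.0), (3, 35.0), (1, 200.0);
    """)


def sales_by_category(conn):
    """Roll sales up to category level: fact -> product -> subcategory -> category."""
    return conn.execute("""
        SELECT c.CategoryName, SUM(f.SalesAmount)
        FROM FactSales AS f
        JOIN DimProduct            AS p ON p.ProductKey = f.ProductKey
        JOIN DimProductSubcategory AS s ON s.ProductSubcategoryKey = p.ProductSubcategoryKey
        JOIN DimProductCategory    AS c ON c.ProductCategoryKey = s.ProductCategoryKey
        GROUP BY c.CategoryName
        ORDER BY c.CategoryName
    """).fetchall()
```

Calling build_snowflake on an in-memory connection and then sales_by_category returns one row per category. In a star schema the category name would sit directly on a single denormalized product table, so the same roll-up would need only one join.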
However, business users might find a snowflake schema a bit difficult to work with and would prefer a star schema, so you have to find a balance in choosing when to go for a snowflake schema. Breaking a dimension into a snowflake schema saves some space, but this is not high on the preference list: first, disk space is not very costly, and second, dimension tables are not huge, so the savings are not expected to be considerable in many cases. Snowflaking is done for functional reasons rather than for savings in disk space. Finally, queries written against a snowflake schema tend to use more joins than those against a star schema (because more dimension tables are involved), and this affects query performance. You need to test user acceptance of the speed at which results are returned.

[Figure 12-2: AdventureWorksDW2008 simplified snowflake schema]

Chapter 12: Data Warehousing and SQL Server 2008 Enhancements

Building a Star Schema

The very first requirement in designing a data warehouse is to focus on the subject area for which the business has engaged you to build a star schema. It's easy to sway in different directions while building a data warehouse, but whatever you do later in the process, you must always keep a focus on the business value you are delivering with the model. As a first step, capture all the business requirements and the purposes for which the business wants to do this activity. You might end up meeting several business managers and process owners to understand the requirements and match them against the data available in source systems. At this stage, you will be creating a high-level data source mappings document to meet the requirements. Once you have identified at a high level that the requirements can be met with the available data, the next step is to define the dimensions and the measures in the process.
While defining measures, or facts, it is important to discuss with the business and understand clearly the level of detail they will be interested in. This decides the grain of your fact data. Typically, you would want to keep the lowest grain of data so that you have maximum flexibility for future changes. In fact, defining the grain is one of the first steps in building a data warehouse, as it is the cornerstone for collating the required information, for instance, defining the roll-up measures. At this stage, you are ready to create a low-level star schema data model and will be defining attributes and fields for the dimension and fact tables.

As part of the process, you will also define primary keys for the dimension tables, which will also exist in the fact tables as foreign keys. These primary keys need to be a new set of keys known as surrogate keys. Using surrogate keys instead of source system keys or business keys provides many benefits in the design; for instance, they protect against changes in source system keys, maintain history (by using the SCD transformation), integrate data from multiple sources, and handle late-arriving members, including facts for which dimension members are missing. Generally, a surrogate key is an auto-incrementing non-null integer value, like an identity, and forms a clustered index on the dimension. However, the Date dimension is a special dimension, commonly used in data warehouses, with a primary key based on the date instead of a continuous number; for instance, 20091231 and 20100101 are the keys of two consecutive dates, but they are not consecutive serial numbers.

While working on dimensions, you will need to identify some special dimensions. First, look for role-playing dimensions. Refer to Figure 12-2 and note that the DateKey of the DimDate dimension is connected multiple times with the fact table, once each for the OrderDateKey, DueDateKey, and ShipDateKey columns.
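The date-based key and the three date columns described above can be illustrated with a small sketch. It rests on several assumptions: it uses Python's built-in sqlite3 module rather than SQL Server, the helper date_key, the fact table FactSales, the FullDate column, and the sample row are all hypothetical, and only DimDate, DateKey, OrderDateKey, DueDateKey, and ShipDateKey come from the text. The query joins the same DimDate table three times under different aliases, once per date column.

```python
import sqlite3
from datetime import date, timedelta


def date_key(d):
    """Build a yyyymmdd integer key, e.g. 2009-12-31 -> 20091231 (hypothetical helper)."""
    return d.year * 10000 + d.month * 100 + d.day


def build_date_demo(conn):
    """Create DimDate and a hypothetical fact table that references it three times."""
    conn.executescript("""
        CREATE TABLE DimDate (
            DateKey  INTEGER PRIMARY KEY,   -- date-based key, not a serial number
            FullDate TEXT NOT NULL
        );
        CREATE TABLE FactSales (
            OrderDateKey INTEGER NOT NULL REFERENCES DimDate(DateKey),
            DueDateKey   INTEGER NOT NULL REFERENCES DimDate(DateKey),
            ShipDateKey  INTEGER NOT NULL REFERENCES DimDate(DateKey),
            SalesAmount  REAL NOT NULL
        );
    """)
    # Populate two weeks of dates spanning the year boundary.
    start = date(2009, 12, 28)
    days = [start + timedelta(days=i) for i in range(14)]
    conn.executemany(
        "INSERT INTO DimDate VALUES (?, ?)",
        [(date_key(d), d.isoformat()) for d in days],
    )
    # One invented order: ordered 2009-12-31, due 2010-01-05, shipped 2010-01-01.
    conn.execute(
        "INSERT INTO FactSales VALUES (?, ?, ?, ?)",
        (date_key(date(2009, 12, 31)), date_key(date(2010, 1, 5)),
         date_key(date(2010, 1, 1)), 99.0),
    )


def order_due_ship_dates(conn):
    """Join DimDate three times, one alias per date column on the fact table."""
    return conn.execute("""
        SELECT od.FullDate, dd.FullDate, sd.FullDate
        FROM FactSales AS f
        JOIN DimDate AS od ON od.DateKey = f.OrderDateKey
        JOIN DimDate AS dd ON dd.DateKey = f.DueDateKey
        JOIN DimDate AS sd ON sd.DateKey = f.ShipDateKey
    """).fetchall()
```

Note that date_key(date(2009, 12, 31)) is 20091231 and date_key(date(2010, 1, 1)) is 20100101: the dates are one day apart while the keys differ by 8870, which is why such a key cannot be treated as a running number.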
Here, DimDate is acting as a role-playing dimension. Another case is to figure out degenerate dimensions in your data and place them alongside the facts in the fact table. Next, you need to identify any indicators or flags used in the facts that are of low cardinality and can be grouped together in one junk dimension table. Holding these miscellaneous flags together in one place makes users' lives much easier when they need to classify the analysis by some indicators or flags. Finally, you will complete the exercise by listing the attributes' change types, that is, whether they are Type 1 or Type 2 candidates. These attributes, especially Type 2, help in maintaining history in the data warehouse. Keeping history using SCD transformations has been discussed earlier in the chapter as well as in Chapter 10.

Though your journey to implement a data warehouse still has a long way to go, after implementing the preceding measures, you will have implemented a basic star schema structure. From here it will be easy to work with and extend this model further to meet specific business process requirements, such as real-time data delivery or snowflaking.

SQL Server 2008 R2 Features and Enhancements

Several features have been provided in SQL Server 2008 R2 that can help you in your data warehousing project by either providing new functionality or improving the performance of commonly used features. It is not possible to cover them all here without stretching beyond the scope of this book, especially when those features do not reside in the realm of Integration Services. But I will still try to cover some features that most data warehousing projects will use, while other features such as data compression, sparse columns, new data types, large UDTs, and minimal logging are not covered.
SQL Server 2008 R2 Data Warehouse Editions

Microsoft has just launched SQL Server 2008 R2, which is built upon the strong foundations and successes of SQL Server 2008 and is targeted at very large-scale data warehouses, higher mission-critical scale, and self-service business intelligence. SQL Server 2008 R2 has introduced two new premium editions to meet the demands of large-scale data warehouses.

SQL Server 2008 R2 Datacenter

This edition is built on the Enterprise Edition code base but provides the highest levels of scalability and manageability. SQL Server 2008 R2 Datacenter is designed for the highest levels of scalability that the SQL Server platform can provide, along with virtualization and consolidation, and it delivers a high-performing data platform. Typical implementations include a large-scale data warehouse server that can scale up to support tens of terabytes of data, provide Master Data Services, and implement very large-scale BI applications such as self-service or PowerPivot for SharePoint. Following are the key features:

• As the Enterprise Edition is restricted to up to 25 instances and 4 virtual machines (VMs), the Datacenter Edition is the next level if you need more than 25 instances or more VMs. This also provides application and Multi-Server Management for enrolling and gaining insights.
• The Datacenter Edition has no limit on maximum server memory; rather, it is restricted only by the limits of the operating system. For example, it can support up to 2TB of RAM if running on Windows Server 2008 R2 Datacenter Edition.
• It supports more than 8 processors and up to 256 logical processors for the highest levels of scale.
• It has the highest virtualization support for maximum ROI on consolidation and virtualization.
• It provides high-scale complex event processing with SQL Server StreamInsight.
• Advanced features such as the Resource Governor, data compression, and backup compression are included.

SQL Server 2008 R2 Parallel Data Warehouse

Since acquiring DATAllegro, a provider of large-volume, high-performance data warehouse appliances, Microsoft has been working on consolidating hardware and software solutions for high-end data warehousing under a project named Madison. The SQL Server 2008 R2 Parallel Data Warehouse is the result of Project Madison. The Parallel Data Warehouse is an appliance-based, highly scalable, highly reliable, and high-performance data warehouse solution. Using SQL Server 2008 on Windows Server 2008 in a massively parallel processing (MPP) configuration, Parallel Data Warehouse can scale from tens to hundreds of terabytes, providing better and more predictable performance, increased reliability, and lower cost per terabyte. It comes with preconfigured hardware and software that is carefully balanced in one appliance, thus making deployment quick and easy. Massively parallel processing enables it to perform ultra-fast loading and high-speed backups, thus addressing two of the major challenges facing modern data warehouses: data load and backup times. You can integrate existing SQL Server 2008-based data marts or mini data warehouses with parallel data warehouses via a hub-and-spoke architecture. This product was targeted to be shipped alongside the SQL Server 2008 R2 release; however, its release has been slightly delayed awaiting feedback from the customer Technology Adoption Program (TAP). The Parallel Data Warehouse principles and architecture are detailed later in this chapter.

SQL Server 2008 R2 Data Warehouse Solutions

Microsoft has recognized the need to develop data warehouse solutions to build upon the successes of SQL Server 2008.
This has resulted in Microsoft partnering with several industry-leading hardware vendors to create best-of-breed balanced configurations combining hardware and software to achieve the highest levels of performance. Two such solutions are now available under the names of Fast Track Data Warehouse and Parallel Data Warehouse.

Fast Track Data Warehouse

The Fast Track Data Warehouse solution implements a CPU core-balanced approach on the symmetric multiprocessor (SMP)-based SQL Server data warehouse using a core set of configurations and database best practice guidelines. The Fast Track reference architecture is a combination of hardware that is balanced for performance across all the components and software configurations, such as Windows OS configurations, SQL Server database layout, and indexing, along with a whole raft of other settings, best practices, and documents to implement all of these objectives. The Fast Track Data Warehouse servers can have two-, four-, or eight-processor configurations and can scale from 4 terabytes to 30-plus terabytes, and even more if compression capabilities are used.

The earlier available reference architectures have been found to suffer from various performance issues for the simple reason that they were not specifically designed to suit the needs of one particular problem and hence suffer from an unbalanced architecture. For example, you may have seen a server that is busy processing heavy I/O while its CPU utilization stays too low to reflect the workload. This is a simple example of the mismatch, or imbalance, that exists between various components in currently available servers. The Fast Track Data Warehouse servers are built on the use cases of a particular scenario; that is, they are built to capacity to match the required workload on the server rather than with a one-size-fits-all approach.
This approach of designing a balanced server provides predictable performance and minimizes the risk of over-specifying components, such as by providing CPU or storage that will never be utilized. The predictable performance and scalability are achieved by adopting core principles, best practices, and methodologies, some of which are listed next.

• It is built for data warehouse workloads. First of all, understand that the data warehouse workload is quite different from that on OLTP servers. While OLTP transactions are made up of small read and write operations, data warehouse queries usually perform large read and write operations. The data warehouse queries are generally fewer in number, but they are more complex, requiring high aggregations, and generally have date range restrictions. Furthermore, OLTP transactions generate more random I/O, which causes slow response. To overcome this, a large number of disks have traditionally been used, along with some other optimization techniques such as building heavy indexes. These techniques have their own maintenance overheads and cause the data warehouse loading time to increase. The Fast Track Data Warehouse uses a new way of optimizing data warehouse workloads by laying data out in a sequential architecture. Considering that a data warehouse workload requires ranges of data, reading sequential data off the disk drives is much more efficient than random I/O. All efforts are targeted at preserving sequential storage. The data is preferred to be served from disk rather than from memory, as the performance achieved is much higher with sequential storage. This results in fewer indexes, yielding savings in maintenance, decreased loading time, less fragmentation of data, and reduced storage requirements.
• It offers a holistic approach to component architecture.
c e server is built with a balance across all components, starting with disks, disk controllers, fiber channels HBAs, and the Windows operating system, and ranging up to the SQL Server and then to the CPU core. For example, on the basis of how much data can be consumed per CPU core (200 MBps), the number of CPU cores is calculated for a given workload and then the backward calculations are applied to all the components to support the same bandwidth or capacity. is balance or the synchronization in response by individual components provides the required throughput to match the capabilities of the data warehouse application. It is optimized for workload type. c e Fast Track Data Warehouse servers are designed and built considering the very nature of the database application. To capture these details, templates and tools are provided to design and build the fast track server. Several industry-leading vendors are participating in this program to provide you out-of-the-box performance reference architecture servers. Businesses can benefit from reduced hardware testing and tuning and rapid deployment. Parallel Data Warehouse When your data growth needs can no longer be satisfied with the scale-up approach of the Fast Track Data Warehouse, you can choose the scale-out approach of Parallel Data Warehouse, which has been built for very large data warehouse applications using Microsoft SQL Server 2008. The symmetric multiprocessing (SMP) architecture used in the Fast Track Data Warehouse is limited by the capacity of the components such as CPU, memory, and hard disk drives that form part of the single computer. This limitation is addressed by scaling out to a configuration consisting of multiple computing nodes that have dedicated resources—i.e., CPUs, memory, and hard disk space, along with an instance of SQL Server—connected in an MPP configuration. 
Architecture and Hardware

The appliance hardware is built on industry-standard technology and is not proprietary to one manufacturer, so you can choose from well-known hardware vendors such as HP, Dell, IBM, EMC, and Bull. This way, you can keep your hardware maintenance costs low, as it will integrate nicely with already-existing infrastructure.

As mentioned earlier, the Parallel Data Warehouse is built on Windows Server 2008 Enterprise nodes with their own dedicated hardware connected via a high-speed fiber channel link, with each node running an instance of the SQL Server 2008 database server. The MPP appliance basically has one control node and several compute nodes, depending on the data requirements. This configuration is extendable from single-rack to multirack configurations; in the latter case, one rack could act as a control node. The nodes are connected in a configuration called Ultra Shared Nothing (refer to Figure 12-3), in which the large database tables are partitioned across multiple nodes to improve query performance. This architecture has no single point of failure, and redundancy has been built in at all component levels.

Applications or users send requests to a control node that balances the requests intelligently across all the compute nodes. Each compute node processes the request it gets from the control node using its local resources and passes the results back to the control node, which then collates the results before returning them to the requesting application or user. As the data is evenly distributed across multiple nodes and the nodes process the requests in parallel, queries run many times faster on an MPP appliance than on an SMP database server. Like a Fast Track Data Warehouse server, an MPP appliance is built under tight specifications and carefully balanced configurations to eliminate performance bottlenecks.
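The request flow between the control node and the compute nodes described above follows a scatter-gather pattern, which the following toy sketch imitates. It is a conceptual model only and makes no claim about how Parallel Data Warehouse is implemented: the function names, the modulo-based distribution, the thread pool standing in for separate servers, and the sample rows are all assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, node_count):
    """Partition rows across compute nodes by a simple modulo on the order id."""
    nodes = [[] for _ in range(node_count)]
    for order_id, region, amount in rows:
        nodes[order_id % node_count].append((order_id, region, amount))
    return nodes


def compute_node_query(local_rows):
    """Each compute node aggregates only the rows it holds locally."""
    partial = {}
    for _, region, amount in local_rows:
        partial[region] = partial.get(region, 0) + amount
    return partial


def control_node_query(nodes):
    """The control node fans the request out, then collates the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(compute_node_query, nodes))
    totals = {}
    for partial in partials:
        for region, amount in partial.items():
            totals[region] = totals.get(region, 0) + amount
    return totals


# Invented sample data: (order_id, region, amount).
SAMPLE_ROWS = [
    (1, "North", 100), (2, "South", 250), (3, "North", 50),
    (4, "East", 75), (5, "South", 25), (6, "East", 10),
]
```

With three nodes, distribute(SAMPLE_ROWS, 3) places two rows on each node, and control_node_query returns the same regional totals a single server would compute. Each node touches only its own share of the data, which is the property the shared-nothing design relies on for parallel speedup.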
Reference configurations have been designed for different use case scenarios, taking into account different types of workloads such as data loading, reporting, and ad hoc queries. A control node automatically distributing workload evenly, compute nodes being able to work on queries autonomously, system resources balanced against each other, and the design of reference configurations on use case scenarios enable an MPP appliance to achieve predictable performance. Scalability follows from here with the simple addition of capacity as the data volumes grow.

[Figure 12-3: Parallel Data Warehouse architecture. A control node (Node-0) and compute nodes (Node-1, Node-2, ..., Node-N) connected by a high-speed fiber channel network]

Hub-and-Spoke Architecture

Another important advantage of an MPP appliance is that it can be deployed in a hub-and-spoke architecture. In this way, you can use an MPP appliance as a hub while the spokes could be either MPP appliances or standard SQL Server 2008-based symmetric multiprocessing (SMP) servers (see Figure 12-4). Typically, department users will connect to spokes to access data in their required formats. In this configuration, the MPP appliance at the hub hosts the enterprise data at the lowest granularity, and the spokes contain data for their relevant department in the schema and aggregations they require. This is possible because the spokes could host any database application, such as a SQL Server 2008 data mart or a SQL Server Analysis Services data mart, as best fits the user requirements. So, this architecture with an MPP appliance at the hub and SMP database servers or MPP appliances as spokes is a specialized configuration in which a grid of computers forms a very large-scale data warehouse in a federated model. This grid of computers can be connected via a high-speed network.
Also, the nodes of an MPP appliance are connected via a high-speed link and the hub processes data differently in different nodes, enabling parallel high-speed data transfer from node to node between hub and spoke units. Data transfer speeds approaching 500GB per minute can be achieved, thus minimizing the overhead associated with export and load operations.

[Figure 12-4: A parallel data warehouse in hub-and-spoke architecture. Labels: SQL Server 2008 Fast Track Data Warehouse, SQL Server 2008 Analysis Services, SQL Server 2008 Reporting Services, SQL Server 2008 R2 Parallel Data Warehouse, SQL Server 2008]

The SQL Server 2008 R2 Parallel Data Warehouse MPP appliance integrates very well with BI applications such as Integration Services, Reporting Services, and Analysis Services. So, if you have an existing SQL Server 2008 data mart, it can easily be added as a node in the grid. As spokes can be any SQL Server 2008 database application, this architecture provides a best-fit approach to the problem. The enterprise data is managed at the center in the MPP appliance under the enforcement of IT policies and standards, while a business unit can still have a spoke that it manages autonomously. This flexible model is a huge business benefit and provides quick deployments of data marts, bypassing sensitive political issues. This way, you can easily expand an enterprise data warehouse by adding a node that is configured according to the business unit requirements. The Parallel Data Warehouse hub-and-spoke architecture utilizes the available processing power in the best possible way by distributing work across multiple locations in the grid.
While the basic data management, such as cleansing, standardization, and metadata management, is done in the hub according to the enterprise policies, the application of business rules relevant to the business units and the analytical processing are handled in the relevant spokes. The hub-and-spoke model offers benefits such as parallel high-speed data movement among different nodes in the grid, the distribution of workload, and the massively parallel architecture of the hub, where all nodes work in parallel autonomously, thus providing outstanding performance. With all these listed benefits and many more that can be realized in individual deployment scenarios, the hub-and-spoke reference architecture provides the best of both worlds: the ease of data management of centralized data warehouses and the flexibility to build data marts on use-case scenarios, as with federated data marts.

SQL Server 2008 R2 Data Warehouse Enhancements

In this section some of the SQL Server 2008 enhancements are covered, such as backup compression, the T-SQL MERGE statement, change data capture, and partition-aligned indexed views.

Backup Compression

Backup compression is a new feature provided in SQL Server 2008 Enterprise Edition and above and, due to its popularity, has since been included in the Standard Edition as well in the SQL Server 2008 R2 release. Backup compression helps to speed up ...

Date posted: 04/07/2014, 15:21