This chapter introduces the basic concepts of data warehouses. A data warehouse is a particular database targeted toward decision support. It takes data from various operational databases and other data sources and transforms it into new structures that better fit the task of business analysis. Data warehouses are based on a multidimensional model, where data are represented as hypercubes, with dimensions corresponding to the various business perspectives and cube cells containing the measures to be analyzed. In Sect. 3.1, we study the multidimensional model and present its main characteristics and components. Section 3.2 gives a detailed description of the most common operations for manipulating data cubes. In Sect. 3.3, we present the main characteristics of data warehouse systems and compare them against operational databases. The architecture of data warehouse systems is described in detail in Sect. 3.4. As we shall see, in addition to the data warehouse itself, data warehouse systems are composed of back-end tools, which extract data from the various sources to populate the warehouse, and front-end tools, which are used to extract the information from the warehouse and present it to users. We conclude in Sect. 3.5 by describing SQL Server, a representative suite of business intelligence tools.
An Overview of Data Warehousing
In the early 1990s, organizations recognized the need for advanced data analysis to enhance decision-making in a competitive and dynamic environment. Traditional operational databases, designed for daily business operations, fell short as they lacked historical data and struggled with complex queries involving multiple tables. Additionally, integrating data from various operational systems proved challenging due to inconsistencies in data definitions and content. To address these issues, data warehouses emerged as a viable solution to meet the increasing demands of decision-making users.
According to Inmon's classic definition, a data warehouse is a collection of subject-oriented, integrated, nonvolatile, and time-varying data designed to support management decisions. This definition highlights key features, particularly that a data warehouse focuses on specific subjects of analysis tailored to the analytical needs of managers at different decision-making levels. For instance, a retail company's data warehouse may include data for analyzing inventory and product sales.
Data warehousing involves the integration of data from various operational and external systems, resulting in a nonvolatile repository that retains data over extended periods. In a data warehouse, data modification and removal are restricted, allowing only the purging of obsolete information. Additionally, data warehouses are time-varying, enabling the tracking of data evolution, such as sales trends over months or years.

The design of operational databases typically follows four phases: requirements specification, conceptual design, logical design, and physical design. During the requirements specification phase, user needs across the organization are gathered to create a database schema that effectively addresses queries, utilizing conceptual models like entity-relationship diagrams.
The entity-relationship (ER) model outlines an application conceptually, independent of implementation details. This design is subsequently transformed into a logical model, which serves as a framework for database applications. Currently, the relational model is the most widely adopted logical model for databases. The final step involves physical design, which tailors the logical model to a specific implementation platform, resulting in a physical model.
Operational relational databases are highly normalized to maintain consistency under frequent updates and to minimize redundancy, but this can lead to slower query performance because data are partitioned across many tables. This approach is often unsuitable for data warehouses, which need to provide a comprehensive understanding of the data and efficient performance for complex analytical queries. As a result, a different design model, known as multidimensional modeling, has been adopted for data warehouse applications. This model represents data as a collection of facts connected to various dimensions, where facts, such as sales in stores, serve as the focus of analysis and include numeric measures for quantitative evaluation. Dimensions allow these measures to be analyzed from different perspectives, such as examining sales across various stores, over different time periods, or by geographical location. Additionally, dimensions often include hierarchical attributes, enabling users to explore data at multiple levels of detail, like month–quarter–year for time and city–state–country for location.
Data warehouses should be designed similarly to operational databases, following a four-step process: requirements specification, conceptual design, logical design, and physical design. However, the lack of a widely accepted conceptual model for data warehouse applications often leads to complex logical schemas that are not user-friendly. To address this issue, we advocate a conceptual model that operates above the logical level, enhancing the design of data warehouses. In this book, we use the MultiDim model, which effectively captures the intricate characteristics of data warehouses at a higher level of abstraction. A detailed exploration of conceptual modeling for data warehouses is presented in Chapter 4.
The multidimensional model is typically represented through relational tables structured as star schemas and snowflake schemas, which connect a central fact table to multiple dimension tables. Star schemas use a single, denormalized table for each dimension, while snowflake schemas employ normalized tables that reflect the hierarchies within dimensions. An OLAP server then constructs a data cube from this relational representation, offering a multidimensional perspective of the data warehouse. For a deeper treatment of logical modeling, refer to Chapter 5.
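To make the distinction concrete, the following SQL sketch shows a possible star schema for a hypothetical sales cube; the table and column names (SalesFact, DimDate, DimStore, and so on) are illustrative and are not part of the Northwind schema used later in the book.

CREATE TABLE DimDate (
  DateKey   INTEGER PRIMARY KEY,
  FullDate  DATE,
  Month     INTEGER,
  Quarter   INTEGER,
  Year      INTEGER );

CREATE TABLE DimStore (
  StoreKey  INTEGER PRIMARY KEY,
  StoreName VARCHAR(40),
  City      VARCHAR(40),   -- the City-State-Country hierarchy is kept
  State     VARCHAR(40),   -- denormalized in a single dimension table
  Country   VARCHAR(40) );

CREATE TABLE SalesFact (
  DateKey     INTEGER REFERENCES DimDate (DateKey),
  StoreKey    INTEGER REFERENCES DimStore (StoreKey),
  ProductKey  INTEGER,
  Quantity    INTEGER,        -- measure
  SalesAmount DECIMAL(10,2),  -- measure
  PRIMARY KEY (DateKey, StoreKey, ProductKey) );

In a snowflake schema, DimStore would instead be split into normalized Store, City, State, and Country tables linked by foreign keys.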
After implementing a data warehouse, analytical queries can be executed using MDX (MultiDimensional eXpressions), the standard language for multidimensional databases. Recently, Microsoft introduced Data Analysis Expressions (DAX) as an alternative querying language. A comparison of MDX, DAX, and SQL is explored in Chapters 6 and 7.
The physical level focuses on implementation issues, making physical design essential for ensuring adequate response times to complex ad hoc queries. To enhance system performance, three primary techniques are employed: materialized views, indexing, and data partitioning. Notably, bitmap indexes are preferred in data warehousing contexts, while operational databases typically utilize B-tree indexes. Extensive research on these techniques was conducted, especially in the late 1990s, and the findings have been integrated into both traditional and modern OLAP engines for big data. Chapter 8 provides a review and analysis of these advancements.
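As a simple illustration, a materialized view precomputes and stores an aggregate that many analytical queries need; the following sketch uses the PostgreSQL/Oracle syntax and the hypothetical SalesFact and DimDate tables introduced above.

CREATE MATERIALIZED VIEW SalesByMonth AS
  SELECT D.Year, D.Month, F.ProductKey, SUM(F.SalesAmount) AS TotalSales
  FROM SalesFact F JOIN DimDate D ON F.DateKey = D.DateKey
  GROUP BY D.Year, D.Month, F.ProductKey;

Queries asking for monthly sales can then read this much smaller view instead of scanning the whole fact table.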
A significant distinction between operational databases and data warehouses lies in the data integration process: data warehouses extract data from multiple source systems, and these data must be transformed to conform to the warehouse model before being loaded. This process, known as extraction, transformation, and loading (ETL), is critical for the success of data warehousing projects. Despite extensive research, there is still no consensus on a standardized ETL design methodology, and many challenges are addressed in an ad hoc manner, although various proposals exist regarding ETL conceptual design. We study the design and implementation of ETL processes in Chapter 9.
Data analysis involves leveraging the contents of a data warehouse to deliver crucial insights for decision-making. This process relies on three primary tools: querying, which employs the OLAP paradigm to extract relevant data and uncover valuable knowledge; key performance indicators (KPIs), which are measurable objectives that assess organizational performance; and dashboards, which are interactive reports that visually present warehouse data, including KPIs, to offer a comprehensive overview of an organization's performance for informed decision support. Further exploration of data analysis can be found in Chapters 6 and 7.
Designing a data warehouse is a complex process that involves multiple phases, each addressing specific considerations: requirements specification, conceptual design, logical design, and physical design. Requirements can be gathered through three different approaches: collecting input from users, analyzing the source systems, or a combination of both. The chosen approach significantly influences the subsequent conceptual design phase. A comprehensive methodology for data warehouse design is presented in Chapter 10.

Emerging Data Warehousing Technologies
At the start of the 21st century, data warehouse systems had established foundational concepts, yet the field continues to evolve. New data types and models, such as spatial data, have been integrated into both commercial and open-source systems. Additionally, innovative architectures are being developed to manage the vast amounts of data required for modern decision-support systems.
In data warehousing, a common assumption is that dimensions remain static, with only facts and their measures linked to specific time frames. However, this overlooks the reality that dimensions can evolve over time, such as changes in a product's price or category. To address this issue within relational databases, the widely adopted solution is the concept of slowly changing dimensions.
An innovative solution to managing time-varying information is the use of temporal databases, which offer specific structures and mechanisms for this purpose. By integrating temporal databases with data warehouses, we create what is known as temporal data warehouses, enhancing the capability to handle dynamic data effectively.
Current database and data warehouse systems offer limited capabilities for handling time-varying data, making SQL queries complex and often inefficient. Additionally, MDX lacks temporal support, highlighting the need to enhance traditional OLAP operators for better exploration of time-varying data, a concept known as temporal OLAP (TOLAP). Further insights into temporal data warehouses are discussed in Chapter 11.
In real-world scenarios, data warehouse schemas evolve over time to meet new application requirements, often necessitating modifications to the data. This process typically involves removing outdated data and incorporating new information. However, when such modifications are impractical, maintaining multiple schema versions leads to multiversion data warehouses. In these systems, current data is added according to the latest schema, while data from previous schemas is preserved for analysis. This allows users and applications to work with older schema versions while new users can access the most current version, a concept further explored in Chapter 11.
Spatial data has seen a significant rise in usage across diverse fields, including public administration, transportation, environmental systems, and public health. It encompasses both physical objects on the Earth's surface, like streets and cities, as well as geographic phenomena such as temperature and altitude. The volume of spatial data is rapidly increasing, driven by technological advancements in remote sensing and global navigation satellite systems (GNSS), including the Global Positioning System (GPS) and the Galileo system.
Spatial databases are designed for efficient storage and manipulation of spatial data, but they often fall short in supporting decision-making processes. To address this limitation, spatial data warehouses have been developed, integrating spatial database and data warehouse technologies. These warehouses enhance data analysis, visualization, and manipulation capabilities. The analysis performed within these systems is referred to as spatial OLAP (SOLAP), allowing users to explore spatial data similarly to traditional OLAP methods using tables and charts. Further insights into spatial data warehouses are explored in Chapter 12.
The analysis of mobility data, which involves tracking the positions of moving objects over time, has gained significant importance with the advent of advanced positioning devices. For instance, traffic data can now be collected through sequences of GPS signals emitted by vehicles during their journeys, enabling detailed insights into movement patterns. This process is commonly referred to as mobility analysis.
Emerging data warehousing technologies are evolving to accommodate the complexities of mobility data analysis. As moving objects generate extensive sequences of positional data, these sequences are typically segmented into manageable units known as trajectories. This segmentation is crucial for effectively analyzing movement data. Consequently, the development of mobility data warehouses is becoming increasingly important, allowing for more efficient handling of this dynamic information.
The interconnectedness of various domains such as web, transportation, communication, biological, and economic data is effectively modeled using graphs, leading to the emergence of graph databases and analytics, specifically graph data warehousing and OLAP. The property graph data model is pivotal for native graph databases, which store data as nodes and edges, making them particularly efficient for path traversal computations; this is explored in Chapter 13, which focuses on Neo4j, a leading graph database. Additionally, the semantic web serves as a dynamic source of multidimensional information, represented in a machine-processable format through the Resource Description Framework (RDF) and the Web Ontology Language (OWL), enabling domain ontologies to establish a common terminology. Semantic annotations enhance the description of unstructured and semistructured data, with numerous applications, particularly in fields like medicine, creating extensive repositories of semantically annotated data that improve decision-support systems. Consequently, data warehousing technologies must adapt to accommodate semantic web data, a topic examined in Chapter 14.
In the evolving landscape of big data, the rise of massive-scale data sources presents significant challenges for the data warehouse community. To address these issues, new database architectures such as distributed storage, NoSQL systems, column-store databases, and in-memory databases are emerging. Traditional ETL processes and data warehouse solutions struggle to handle the vast volume and variety of data, highlighting the need for integrated systems that can manage structured, unstructured, and real-time analytics. Innovations like NewSQL, HTAP paradigms, data lakes, Delta Lake, polyglot architectures, and cloud data warehouses are being developed in response to these demands from both academia and industry. Chapter 15 delves into these recent advancements in the field.
Review Questions
1.1 Why are traditional databases called operational or transactional? Why are these databases inappropriate for data analysis?
1.2 Discuss four main characteristics of data warehouses.
1.3 Describe the different components of a multidimensional model, that is, facts, measures, dimensions, and hierarchies.
1.4 What is the purpose of online analytical processing (OLAP) systems and how are they related to data warehouses?
1.5 Specify the different steps used for designing a database. What are the specific concerns addressed in each of these phases?
1.6 Explain the advantages of using a conceptual model when designing a data warehouse.
1.7 What is the difference between the star and the snowflake schemas?
1.8 Specify several techniques that can be used for improving performance in data warehouse systems.
1.9 What is the extraction, transformation, and loading (ETL) process?
1.10 What languages can be used for querying data warehouses?
1.11 Describe what is meant by the term data analytics. Give examples of techniques that are used for exploiting the content of data warehouses.
1.12 Why do we need a method for data warehouse design?
1.13 What is spatial data? What is mobility data? Give examples of applications for which such kinds of data are important.
1.14 Explain the differences between spatial databases and spatial data warehouses.
1.15 What is big data and how is it related to data warehousing? Give examples of technologies that are used in this context.
1.16 Give examples of applications where graph data models can be used.
1.17 Describe why it is necessary to take into account web data in the context of data warehousing. Motivate your answer by elaborating an example application scenario.
This chapter introduces fundamental database concepts, focusing on modeling, design, and implementation. It outlines a four-step design process: requirements specification, conceptual design, logical design, and physical design, which helps separate concerns and tailor applications to user needs and specific database technologies. The Northwind case study is introduced as a reference throughout the book. Additionally, the chapter reviews the entity-relationship model and the relational model, which are essential for database design. Physical design considerations are also discussed, providing foundational knowledge for understanding subsequent chapters while encouraging readers to explore further resources on the topic.
Database Design
Databases are fundamental building blocks of modern information systems. A database is a shared collection of logically related data, together with a description of that data, designed to fulfill the information requirements and support the operational activities of an organization. A database is managed by a database management system (DBMS), which is a software system used to define, create, manipulate, and administer a database.
Designing a database system is a complex undertaking typically divided into four phases, described next.
Requirements specification is essential for gathering user needs related to database systems. Various techniques developed by both academics and practitioners assist in identifying crucial system properties, standardizing requirements, and prioritizing them effectively.
Conceptual design focuses on creating a user-centered representation of a database, free from implementation details. This process uses a conceptual model to identify the key concepts relevant to the application. Among the various conceptual models, the entity-relationship model is one of the most commonly employed for database application design. Object-oriented modeling techniques based on the Unified Modeling Language (UML) can also be effectively applied.
Logical design focuses on converting the conceptual representation of a database into a logical model supported by database management systems (DBMSs). The relational model is the most widely used logical model today, although other models such as the object-relational, object-oriented, and semistructured models also exist. This book primarily emphasizes the relational model.
The physical design phase focuses on adapting the logical database representation into a physical model tailored to a specific DBMS platform. Popular DBMSs include SQL Server, Oracle, DB2, MySQL, and PostgreSQL.
A key goal of this separation into conceptual, logical, and physical levels is to achieve data independence, ensuring that higher-level schemas remain unaffected by changes in lower-level schemas. This concept is divided into two types: logical data independence and physical data independence. Logical data independence means that modifications to the structure of relational tables do not impact the conceptual schema, as long as application requirements remain constant. Physical data independence means that changes in the physical storage of data, such as sorting records on a disk, do not alter the conceptual or logical schemas, although users may notice differences in response times.
In the remainder of this chapter, we provide an overview of the entity-relationship model and the relational model, the most commonly used conceptual and logical frameworks, and we discuss important physical design considerations. The discussion is based on the Northwind relational database, which serves as a practical example for explaining database design concepts. In the subsequent chapter, we explore a data warehouse derived from the Northwind database, where we delve into data warehousing and OLAP concepts.
The Northwind Case Study
The Northwind Company exports various goods and requires a relational database for effective data management and storage. The key characteristics of the data to be stored are as follows:
• Customer data, which must include an identifier, the customer’s name, contact person’s name and title, full address, phone, and fax.
• Employee data, including identifiers, names, titles, courtesy titles, birth dates, hire dates, addresses, home phone numbers, phone extensions, and photographs. Photos are stored in the file system, and the database keeps the path to each photo. Additionally, employees are organized in a hierarchy, reporting to higher-level colleagues within the organization.
• Territory data: the company operates across various territories, which are grouped into specific regions. Employees can be assigned to multiple territories; however, territories are not exclusive to any one employee, as each territory can be associated with multiple employees.
• Shipper data, that is, information about the companies that Northwind hires to provide delivery services. For each of them, the company name and phone number must be kept.
• Supplier data, including the company name, contact name and title, full address, phone, fax, and home page.
• Product data: Northwind trades a variety of products, each identified by a unique identifier, name, quantity per unit, and unit price, along with a status indicating whether the product has been discontinued. To manage inventory effectively, the company tracks the number of units in stock, the units ordered but not yet delivered, and the reorder level, which triggers production or acquisition when reached. Products are grouped into categories, each with a name, description, and image, and each product is provided by a single supplier.
• Order data, including the order identifier, submission date, required delivery date, actual delivery date, the employee involved, the customer, the assigned shipper, the freight cost, and the complete destination address.
• An order can contain many products, and for each of them the unit price, the quantity, and the discount that may be given must be kept.
Conceptual Database Design
The entity-relationship (ER) model is a widely used conceptual framework for designing database applications. While there is consensus on the meanings of its various concepts, multiple visual notations have been developed to represent them. For a detailed overview of the notation adopted in this book, refer to Appendix A.
Figure 2.1 shows the ER model for the Northwind database. We next introduce the main ER concepts using this figure.
Fig. 2.1 Conceptual schema of the Northwind database
Entity types represent groups of real-world objects relevant to an application, such as Employees, Orders, and Customers. An individual object within an entity type is referred to as an entity or instance, and the collection of these instances is known as the population of the entity type.
From the application point of view, all entities of an entity type have the same characteristics.
Real-world objects are interconnected and do not exist in isolation; relationship types represent these associations. For instance, relationship types like Supplies, ReportsTo, and HasCategory illustrate the connections between different objects. Each association between objects within a relationship type is known as a relationship or an instance, while the complete set of these associations is referred to as the population of the relationship type.
The participation of an entity type in a relationship type is called a role and is represented by the line connecting them. Each role is annotated with cardinalities that indicate the minimum and maximum number of times an entity participates in the relationship. For instance, the role of Products in the relationship Supplies has cardinality (1,1), indicating that each product participates exactly once, that is, it has exactly one supplier. Conversely, the role of Suppliers in Supplies has cardinality (0,n), allowing a supplier to participate any number of times. Additionally, the (1,n) cardinality between Orders and OrderDetails signifies that each order participates at least once and possibly many times. Roles are categorized as optional or mandatory depending on whether their minimum cardinality is 0 or 1, and as monovalued or multivalued depending on whether their maximum cardinality is 1 or n.
A relationship type can connect two or more entity types; it is binary when it involves two entity types and n-ary when it involves more than two. Binary relationship types can be further categorized according to their maximum cardinalities into one-to-one, one-to-many, and many-to-many types. For instance, the Supplies relationship is one-to-many, since a product is linked to at most one supplier, but a supplier can provide multiple products. Conversely, the OrderDetails relationship is many-to-many, as an order can include several products, and a product can appear in multiple orders.
In certain relationship types, such as ReportsTo, the same entity type can appear multiple times, leading to what is known as a recursive relationship. To distinguish the roles the entity type plays in such relationships, role names are used. For instance, in the ReportsTo relationship, the roles are designated as Subordinate and Supervisor.
Objects and the relationships between them possess structural characteristics that define their nature. Attributes are used to record these characteristics of entity or relationship types.
For example, Address and HomePage are attributes of Suppliers, while UnitPrice, Quantity, and Discount are attributes of OrderDetails.
Attributes have cardinalities that specify the number of values they can hold for each instance. When the cardinality is (1,1), it is not shown in the diagrams; thus, each supplier has exactly one address and at most one home page, making the HomePage attribute optional, with cardinality (0,1). Attributes are mandatory when their minimum cardinality is 1, and monovalued or multivalued depending on whether they hold a single value or multiple values; in our example, all attributes are monovalued, although the Phone attribute of Customers could be labeled (1,n) if a customer may have several phone numbers. Additionally, attributes can be complex, such as the Name attribute of Employees, which is composed of FirstName and LastName, while simple attributes have no components. Finally, some attributes may be derived, like the NumberOrders attribute of Products, whose value is computed through a formula based on other schema elements, in this case the number of orders in which a product participates.
In real-world applications, certain attributes, known as identifiers, uniquely identify specific objects. For instance, in the Employees entity type, EmployeeID serves as the identifier, ensuring that each employee has a distinct value for this attribute. While the identifiers illustrated in the figure are simple, consisting of a single attribute, identifiers may also be composed of two or more attributes.
Fig. 2.2 Relationship type OrderDetails modeled as a weak entity type
Weak entity types lack an identifier of their own and are depicted with a double line around their name box, while strong entity types possess identifiers. There are no weak entity types in Fig. 2.1; however, the many-to-many relationship OrderDetails between Orders and Products could alternatively be modeled as a weak entity type, as shown in Fig. 2.2.
Logical Database Design
The Relational Model
Relational databases have been a reliable solution for information storage across various application domains for decades. Despite the emergence of alternative database technologies, the relational model remains the preferred method for managing the essential data that supports the daily operations of organizations.
The relational model, introduced by Codd in 1970, gained significant success due to its simplicity and intuitive design, grounded in a robust formal theory based on mathematical relations, set theory, and first-order predicate logic. This foundation facilitated the development of declarative query languages and various optimization techniques, resulting in efficient implementations. However, it wasn't until the early 1980s that the first commercial relational database management systems (RDBMSs) emerged. The model features a straightforward data structure: a relation, or table, is composed of one or more attributes, or columns, and a relational schema describes the structure of a set of relations.
Fig. 2.4 Relational schema of the Northwind database
The relational schema of the Northwind database, illustrated in Fig. 2.4, is derived from the conceptual schema shown in Fig. 2.1 by applying a set of translation rules to the ER schema. This schema includes various relations, such as Employees, Customers, and Products, each containing multiple attributes. For instance, the Employees relation has attributes EmployeeID, FirstName, and LastName, among others.
In what follows, we write R.A to denote the attribute A of relation R.
In the relational model, each attribute is defined over a domain, or data type, such as integer, float, date, or string, which specifies a set of values together with the operations that can be applied to them. The model requires attributes to be atomic and monovalued; thus, a complex attribute like Name of the entity type Employees must be decomposed into its simple components, FirstName and LastName. A relation R is described by a schema R(A1:D1, A2:D2, ..., An:Dn), where R is the name of the relation and each attribute Ai is defined over the domain Di. The relation R is associated with a set of tuples; each tuple (t1, t2, ..., tn) is an element of the Cartesian product D1 × D2 × ... × Dn, so the set of tuples is a subset of this product. This set is often referred to as the instance or extension of the relation. The degree, or arity, of a relation is the number of attributes in its relation schema.
The relational model allows several types of integrity constraints to be defined declaratively.
An attribute can be declared non-null, indicating that null values are not permitted. In Fig. 2.4, only the attributes annotated with cardinality (0,1) are allowed to contain null values.
In a relational database, a key is defined by one or more attributes that ensure the uniqueness of tuples within a relation, meaning that no two tuples can have identical values in the key columns. Keys are indicated by underlining the corresponding attributes; a key made up of multiple attributes is referred to as a composite key, while a single-attribute key is known as a simple key.
In the relational model, each table must have a primary key, such as the simple key EmployeeID in the Employees table, while other tables, like EmployeeTerritories, have composite keys, in this case consisting of both EmployeeID and TerritoryID. Additionally, the attributes that make up the primary key cannot contain null values, ensuring data integrity and uniqueness within the database.
Referential integrity establishes a connection between two tables, where a foreign key in one table references the primary key of another table. This ensures that the values of the foreign key also appear as values of the referenced primary key. For instance, the EmployeeID attribute in the Orders table references the primary key of the Employees table, guaranteeing that every employee listed in an order also appears in the Employees table. Referential integrity may also involve foreign and primary keys composed of several attributes.
A check constraint establishes a condition that must be satisfied when inserting or modifying a record in a table. For instance, in the Orders table, a check constraint can ensure that the OrderDate is less than or equal to the RequiredDate of each order. Note that many DBMSs restrict check constraints to a single record: references to data in other tables or in other tuples of the same table are not allowed. Therefore, check constraints can be used only to verify simple conditions.
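As an illustration, the following sketch shows how the Orders table of Fig. 2.4 could declare these constraints in SQL; the column list is abridged and the data types are indicative only.

CREATE TABLE Orders (
  OrderID      INTEGER NOT NULL,
  CustomerID   CHAR(5) NOT NULL,
  EmployeeID   INTEGER NOT NULL,
  OrderDate    DATE NOT NULL,
  RequiredDate DATE NOT NULL,
  ShippedDate  DATE,             -- cardinality (0,1): null values allowed
  Freight      DECIMAL(10,2),
  PRIMARY KEY (OrderID),
  FOREIGN KEY (CustomerID) REFERENCES Customers (CustomerID),
  FOREIGN KEY (EmployeeID) REFERENCES Employees (EmployeeID),
  CHECK (OrderDate <= RequiredDate) );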
Declarative integrity constraints alone are insufficient to capture the extensive range of constraints present in various application domains, necessitating the use of triggers for their implementation. A trigger is a named event-condition-action rule that is automatically activated upon modifications to a relation. Additionally, triggers can facilitate the computation of derived attributes, exemplified by the attribute NumberOrders in the Products table. Each time an insert, update, or delete occurs in the OrderDetails table, a trigger ensures that the value of the attribute is updated accordingly.
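As a minimal sketch of such a trigger, the following uses SQL Server's T-SQL syntax and assumes that a NumberOrders column has been added to the Products table; other DBMSs use a different trigger syntax.

CREATE TRIGGER UpdateNumberOrders
ON OrderDetails
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
  -- Recompute the derived attribute for the products affected by the modification
  UPDATE P
  SET NumberOrders = (SELECT COUNT(*) FROM OrderDetails D
                      WHERE D.ProductID = P.ProductID)
  FROM Products P
  WHERE P.ProductID IN (SELECT ProductID FROM inserted
                        UNION
                        SELECT ProductID FROM deleted);
END;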
The translation of a conceptual schema (written in the ER or any other conceptual model) to an equivalent relational schema is called a mapping.
This mapping is well known and has been implemented in most database design tools, which automatically translate conceptual schemas into logical ones, typically targeting the relational model and producing the table definitions for the various RDBMSs.
We now outline seven rules that are used to map an ER schema into a relational one.
Rule 1: A strong entity type E is mapped to a table T containing the simple monovalued attributes and the simple components of the monovalued complex attributes of E. The identifier of E defines the primary key of T. T also defines non-null constraints for the mandatory attributes. Note that additional attributes will be added to this table by subsequent rules. For example, the strong entity type Products in Fig. 2.1 is mapped to the table Products in Fig. 2.4, with key ProductID.
Rule 2: Consider a weak entity type W with owner (strong) entity type O. Assume that Wid is the partial identifier of W and Oid is the identifier of O. W is mapped in the same way as a strong entity type, that is, to a table T. In this case, T must also include Oid as an attribute, with a referential integrity constraint to attribute O.Oid. Moreover, the identifier of T is the union of Wid and Oid.
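For instance, applying Rule 2 to the weak entity type OrderDetails of Fig. 2.2 would produce a table along the following lines (a sketch; the data types are illustrative):

CREATE TABLE OrderDetails (
  OrderID   INTEGER NOT NULL REFERENCES Orders (OrderID),  -- identifier of the owner
  LineNo    INTEGER NOT NULL,                              -- partial identifier
  UnitPrice DECIMAL(10,2) NOT NULL,
  Quantity  INTEGER NOT NULL,
  Discount  DECIMAL(4,2) NOT NULL,
  PRIMARY KEY (OrderID, LineNo) );                         -- union of both identifiers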
Normalization
When evaluating a relational schema, it is essential to identify potential redundancies within the relations, as these can lead to anomalies during insertions, updates, and deletions.
Fig. 2.7 Examples of relations that are not normalized
In the relation OrderDetails of Fig. 2.7a, each product is associated with a discount percentage, leading to redundancy because this information is repeated in all the orders that contain the product. To eliminate this redundancy, the Discount attribute should be removed from the OrderDetails table and placed in the Products table, so that the discount of a product is stored only once.
The relation Products in Fig. 2.7b includes category information, such as name, description, and picture, which is repeated for each product of the same category. This redundancy can lead to inconsistencies when updating category descriptions; the category attributes should therefore be removed from the Products table and a separate Categories table created. Similarly, the relation EmployeeTerritories in Fig. 2.7c includes an additional attribute, KindOfWork, which repeats information for employees assigned to multiple territories. To address this issue, the KindOfWork attribute should be removed from the EmployeeTerritories table and a new table, EmpWork, created to associate employees with the kinds of work they perform.
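As a sketch of the first decomposition (column names follow the figure, data types are illustrative), the category attributes are moved to their own table and referenced from Products:

CREATE TABLE Categories (
  CategoryID   INTEGER PRIMARY KEY,
  CategoryName VARCHAR(30),
  Description  VARCHAR(200),
  Picture      BLOB );

CREATE TABLE Products (
  ProductID   INTEGER PRIMARY KEY,
  ProductName VARCHAR(40),
  -- ... other product attributes ...
  CategoryID  INTEGER REFERENCES Categories (CategoryID) );  -- category data stored once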
Dependencies and normal forms help identify redundancies in relational databases. A functional dependency is a constraint between two sets of attributes of a relation. For a relation R and attribute sets X and Y, a functional dependency X → Y holds if, in every instance of the relation, each value of X is associated with at most one value of Y.
In this case, we say that X determines Y. A key is a particular case of a functional dependency: the attributes composing the key functionally determine all the attributes of the relation.
The redundancies illustrated in Fig. 2.7a and 2.7b can be expressed as functional dependencies. In the OrderDetails relation of Fig. 2.7a, the functional dependency ProductID → Discount holds. Similarly, in the Products relation of Fig. 2.7b, the functional dependencies ProductID → CategoryName and CategoryName → Description hold.
The redundancy in the relation EmployeeTerritories of Fig. 2.7c is captured by a multivalued dependency. This occurs when a set of attributes X determines several values of another set Y, independently of the remaining attributes of the relation. In this case, the multivalued dependency EmployeeID →→ KindOfWork holds, and similarly TerritoryID →→ KindOfWork. It is important to note that functional dependencies are a particular case of multivalued dependencies, meaning that every functional dependency is also a multivalued dependency.
A normal form serves as an integrity constraint that ensures a relational schema adheres to specific properties. Since the inception of the relational model in the 1970s, various types of normal forms have been established to enhance data integrity and organization.
Normal forms are also defined for other models, including the entity-relationship model. For detailed definitions of these normal forms, readers are encouraged to consult database textbooks.
Relational Query Languages
Relational databases allow data to be queried through various formalisms, primarily categorized into two types of query languages. Procedural languages require users to specify the operations needed to obtain the desired results, while declarative languages enable users to simply state what they wish to retrieve, leaving it to the DBMS to determine a procedural query to be executed.
This section introduces relational algebra and SQL, two languages that will be used throughout this book. Relational algebra is a procedural query language, whereas SQL is a declarative query language.
Relational algebra consists of operations for manipulating relations, categorized into unary operations, which take one relation as input, and binary operations, which take two relations. The algebra is closed, meaning that all operations yield relations, which allows operations to be combined to form complex queries. Operations are further classified into basic operations, which cannot be derived from others, and derived operations, which simplify the expression of queries. The first unary operation is projection, denoted by π C1,...,Cn(R), which returns the columns C1, ..., Cn of the relation R. Thus, it can be seen as a vertical partition of R into two relations: one containing the columns mentioned in the expression, and the other containing the remaining columns.
For the database given in Fig. 2.4, an example of a projection is:
π FirstName,LastName,HireDate(Employees)
This operation returns the three specified attributes from the Employees table.
The selection operation, denoted by σϕ(R), retrieves the tuples from the relation R that satisfy the Boolean condition ϕ. This operation effectively partitions a table horizontally into two distinct sets: the tuples that satisfy the condition and those that do not, while maintaining the original structure of R in the output.
A selection operation over the database given in Fig. 2.4 is:
σ HireDate≥'01/01/2012' ∧ HireDate≤'31/12/2014'(Employees)
This operation returns the employees hired between 2012 and 2014.
In relational algebra, the result of an operation is a relation that can serve as input for subsequent operations. To enhance the readability of queries, it is useful to store intermediate results in temporary relations. We denote this by T ← Q, indicating that relation T holds the result of query Q.
Thus, combining the two previous examples, we can ask for the first name, last name, and hire date of all employees hired between 2012 and 2014. The query reads:
Temp1 ← σ HireDate≥'01/01/2012' ∧ HireDate≤'31/12/2014'(Employees)
Result ← π FirstName,LastName,HireDate(Temp1)
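For comparison, the same query can be written declaratively in SQL; the following sketch assumes ISO date strings are accepted by the DBMS.

SELECT FirstName, LastName, HireDate
FROM Employees
WHERE HireDate >= '2012-01-01' AND HireDate <= '2014-12-31';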
The rename operation, denoted by ρ A1→B1,...,Ak→Bk(R), returns a relation in which the attributes A1, ..., Ak of R are renamed to B1, ..., Bk, respectively. Therefore, the resulting relation has the same tuples as the relation R, although the schemas of the two relations are different.
We now discuss the binary operations inherited from classical set theory, which require their operands to be union compatible. Two relations R1(A1, ..., An) and R2(B1, ..., Bn) are union compatible if they have the same degree n and, for each i from 1 to n, the domains of Ai and Bi are the same.
The three operations union, intersection, and difference on two union-compatible relations R1 and R2 are defined as follows:
• The union operation, denoted by R1 ∪ R2, returns the tuples that are in R1, in R2, or in both, removing duplicates.
• The intersection operation, denoted by R1 ∩ R2, returns the tuples that are in both R1 and R2.
• The difference operation, denoted by R1 \ R2, returns the tuples that are in R1 but not in R2.
When the relations are union compatible but have different attribute names, the attribute names of the first relation are retained in the result. The union operation can be used to formulate queries such as "Retrieve the identifiers of the employees from the UK or those reported by a UK employee":
UKEmps ← σ Country='UK'(Employees)
Result1 ← π EmployeeID(UKEmps)
Result2 ← ρ ReportsTo→EmployeeID(π ReportsTo(UKEmps))
Result ← Result1 ∪ Result2
The relation UKEmps contains the employees based in the UK. Result1 projects their EmployeeID, while Result2 contains the values of the ReportsTo attribute of these employees, renamed to EmployeeID. The union of Result1 and Result2 yields the desired result.
The intersection allows for the formulation of queries such as "Identifiers of employees from the UK reported by another employee from the UK." This is achieved by replacing the previous expression with the following: Result ← Result1 ∩ Result2.
The difference can be used to formulate queries such as "Identify the employees from the UK who are not reported by any UK employee," achieved by substituting the previous expression with the following: Result ← Result1 \ Result2.
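The corresponding SQL formulations use the set operators UNION, INTERSECT, and EXCEPT. For instance, a sketch of the union query over the schema of Fig. 2.4 is:

SELECT EmployeeID
FROM Employees
WHERE Country = 'UK'
UNION
SELECT ReportsTo
FROM Employees
WHERE Country = 'UK' AND ReportsTo IS NOT NULL;

Replacing UNION with INTERSECT or EXCEPT yields the intersection and difference queries, respectively.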
The Cartesian product, denoted by R1 × R2, takes two relations and returns a new one whose schema is composed of all the attributes of R1 and R2 (renamed if necessary) and whose instance is obtained by concatenating each pair of tuples from R1 and R2. Consequently, the number of tuples in the result is the product of the cardinalities of the two relations.
The Cartesian product, while often lacking standalone meaning, becomes highly valuable when paired with a selection. For instance, to obtain the names of the products supplied by Brazilian suppliers, we use the Cartesian product to combine data from the Products and Suppliers tables. We keep only the essential attributes: ProductID, ProductName, and SupplierID from the Products table, along with SupplierID and Country from the Suppliers table. To avoid naming conflicts, we must rename the SupplierID attribute in one of the tables.
Temp1 ← π ProductID,ProductName,SupplierID(Products)
Temp2 ← ρ SupplierID→SupID(π SupplierID,Country(Suppliers))
Temp3 ← Temp1 × Temp2
The Cartesian product Temp3 combines each product with every supplier, but we are only interested in the tuples that relate a product to its own supplier. We therefore keep the matching tuples, select the suppliers from Brazil, and project the desired column, ProductName:
Temp4 ← σ SupplierID=SupID(Temp3)
Result ← π ProductName(σ Country='Brazil'(Temp4))
The join operation, denoted by R1 ⋈ϕ R2, combines two relations based on a condition ϕ over their attributes. It returns a relation whose schema contains all the attributes of R1 and R2 (renamed if necessary) and whose instance is composed of each pair of tuples from R1 and R2 that satisfies the condition ϕ. The operation is basically a combination of a Cartesian product and a selection.
Using the join operation, the query “Name of the products supplied by suppliers from Brazil” will read:
Result ← π ProductName(σ Country='Brazil'(Products ⋈ SupplierID=SupID Temp2))
The join thus combines the Cartesian product computing Temp3 and the selection computing Temp4 into a single, more concise expression. There are several variants of the join operation, such as the equijoin, in which the join condition involves only equalities, as in the example above.
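In SQL, the same query is typically written with an explicit join, as in the following sketch:

SELECT P.ProductName
FROM Products P JOIN Suppliers S ON P.SupplierID = S.SupplierID
WHERE S.Country = 'Brazil';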
Physical Database Design
The goal of physical database design is to define the storage, access, and relationships of database records to optimize the performance of database applications. This design process encompasses various aspects such as query processing, physical data organization, indexing, transaction processing, and concurrency management. In this section, we offer a concise overview of these issues, which will be explored in greater detail for data warehouses in Chapter 8.
Effective physical database design hinges on understanding the specific characteristics of the application, particularly the data properties and usage patterns. This involves analyzing the frequently executed transactions and queries that significantly affect performance, identifying the operations that are critical for the organization, and recognizing peak load periods with high database demand. Such insights are essential for pinpointing potential performance problems in the database.
Various factors can be used to measure the performance of database applications. Transaction throughput is the number of transactions that can be processed per unit of time; it must be high in systems that process heavy loads, such as electronic payment platforms. Response time, the time taken to complete a single transaction, is vital for user satisfaction and must be minimized. Finally, the disk storage required for the database files is another key consideration. Typically, a trade-off must be made among transaction throughput, response time, and disk storage requirements to achieve optimal performance.
The space-time trade-off highlights that improving operational speed often requires additional memory, and conversely, reducing memory consumption may lead to longer processing times. For example, a compression algorithm can effectively minimize the storage space required for a large file; however, this comes with the increased time needed for the decompression process.
The query-update trade-off highlights the balance between data accessibility and structural complexity. Imposing a structured format on the data can make retrieval more efficient, but more intricate structures require additional time to set up and to maintain as data changes. For instance, sorting records by a key field facilitates quicker searches, but introduces extra overhead during insertions to maintain the sorted order.
After implementing the initial physical design, it is essential to monitor the system and adjust it based on observed performance and evolving requirements. Many DBMSs offer tools to facilitate system monitoring and tuning.
The diverse functionalities of modern Database Management Systems (DBMSs) necessitate a thorough understanding of the specific data storage and retrieval techniques employed by the chosen DBMS, making it essential for effective physical design.
A database is structured in secondary storage as one or more files, where each file contains multiple records and each record comprises several fields. Typically, each tuple in a relation corresponds to a record in a file. When a user requests a particular tuple, the DBMS maps this logical record into a physical disk address and retrieves it into main memory using the file access routines of the operating system.
Data is stored on disk in units known as disk blocks, whose size is defined by the operating system during formatting. DBMSs store data in database blocks (or pages), and it is advisable to align disk block and database block sizes for efficient data retrieval. The choice of database block size is influenced by various factors, including the management of concurrent access through locking mechanisms. When a record is locked for modification by one transaction, other transactions are typically prevented from modifying it, although they can still read it. In many DBMSs, locking occurs at the page level, so larger page sizes increase the likelihood that several transactions access the same page. Additionally, to ensure optimal disk efficiency, the database block size should be equal to or a multiple of the disk block size.
A DBMS maintains a buffer in main memory that holds multiple database pages, enabling quick access to data during query processing without the need to read from disk. When a query is issued, the query processor first checks whether the required data records are in the buffer. If they are found, the data is retrieved or modified directly in the buffer, and any altered pages are marked for eventual writing back to disk. If the necessary pages are not in the buffer, the DBMS reads them from disk, potentially replacing existing pages using an algorithm such as least recently used (LRU). This caching mechanism significantly enhances query performance by minimizing disk access.
File organization refers to the arrangement of data in a file on secondary storage; the main types are heap, sequential, and hash organization. Heap files store records in insertion order, allowing efficient insertion but slower retrieval, since records must be read sequentially. Sequential files sort records on specific fields, enabling quick access when searching by those attributes, though insertion and deletion are more complicated because the order must be maintained. Hash files apply a hash function to determine the storage address (bucket) of each record, offering rapid access for retrieval, but can face performance issues due to collisions when buckets reach capacity. To enhance record retrieval, indexes can be used with all of these file organizations, providing efficient access based on the chosen indexing fields; several indexes on different fields of the same file can be created.
There are many different types of indexes We describe below some cate- gories of indexes according to various criteria.
• Indexes can be clustered or nonclustered, also known as primary and secondary indexes. A clustered index physically orders the records of the data file on the indexed field(s), whereas a nonclustered index does not affect the physical order of the data. Each file can have only one clustered index but may have multiple nonclustered indexes.
• Indexes can be single-column or multiple-column, depending on the number of fields they index. The order of the columns in a multiple-column index significantly affects data retrieval efficiency, so it is advisable to place the most restrictive column first to enhance performance.
• Indexes can be unique or nonunique: the former do not allow duplicate values, while nonunique indexes do.
• Indexes can be sparse or dense. A dense index contains an entry for every record in the data file, whereas a sparse index contains fewer entries than the number of records; a sparse index requires the data file to be ordered on the indexing key. Consequently, nonclustered indexes must be dense, since the file is not ordered on their key, whereas clustered indexes can be sparse.
• Indexes can be single-level or multilevel. Multilevel indexes speed up searches by splitting a large index file into smaller segments and building an index on those segments. Although this structure reduces the number of blocks accessed when searching for a record, insertion and deletion become more involved because all index levels are physically ordered. Dynamic multilevel indexes address this problem by leaving extra space in each block for new entries; they are commonly implemented with data structures such as B-trees and B+-trees, which are supported by most DBMSs.
A DBMS lets the designer create indexes on any field, improving access speed at the cost of additional storage for the indexes and some overhead during updates. Since indexed values are kept sorted, indexes efficiently support partial-match and range searches, and in relational systems they considerably speed up join operations on the indexed fields.
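As a brief illustration, the following SQL statements sketch how such indexes could be declared; the table and column names (Orders, CustomerID, OrderDate, Customers, Email) are hypothetical, and the exact syntax (e.g., for clustered indexes) varies across DBMSs.

-- Nonclustered index on a single column
CREATE INDEX IX_Orders_Customer ON Orders (CustomerID);
-- Composite index: the most restrictive column is placed first
CREATE INDEX IX_Orders_Customer_Date ON Orders (CustomerID, OrderDate);
-- Unique index: duplicate values of the indexed field are rejected
CREATE UNIQUE INDEX IX_Customers_Email ON Customers (Email);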
We will see in Chap. 8 that data warehouses require physical design solutions different from those of operational databases, which are tuned to support heavy transaction loads rather than complex analytical queries.
Summary
This chapter provided an overview of essential database concepts, including the steps for designing database systems: requirements specification, conceptual design, logical design, and physical design. The Northwind case study was used to illustrate these concepts, with the entity-relationship model as the conceptual framework. We examined the relational model as a logical representation and outlined the mapping rules for translating an entity-relationship schema into a relational schema. We also touched on normalization, which prevents redundancies and inconsistencies in relational databases. Finally, we introduced two languages for manipulating relational databases, relational algebra and SQL, and addressed several aspects of physical database design.
Bibliographic Notes
For a comprehensive coverage of the concepts discussed in this chapter, readers are encouraged to consult the textbooks [70, 79]. An overview of requirements engineering can be found in [59], while conceptual database design is explored in [171], which uses UML [37] rather than the traditional entity-relationship model. Logical database design is addressed in [230]. The components of the SQL:1999 standard are described in detail in [151, 153], and subsequent versions of the standard are covered in [133, 152, 157, 272]. Physical database design is thoroughly examined in [140].
Review Questions
2.1 What is a database? What is a database management system?
2.2 Describe the four phases used in database design.
2.3 Define the following terms: entity type, entity, relationship type, relationship, role, cardinality, and population.
2.4 Illustrate with an example each of the following kinds of relationship types: binary, n-ary, one-to-one, one-to-many, many-to-many, and recursive.
2.5 Discuss different kinds of attributes according to their cardinality and their composition. What are derived attributes?
2.6 What is an identifier? What is the difference between a strong and a weak entity type? Does a weak entity type always have an identifying relationship? What is an owner entity type?
2.7 Discuss the different characteristics of the generalization relationship.
2.8 Define the following terms: relation (or table), attribute (or column), tuple (or line), and domain.
2.9 Explain the various integrity constraints that can be described in the relational model.
2.10 Describe the rules for translating an entity-relationship schema into a relational schema. Give an example of a construct of the entity-relationship model that can be translated into the relational model in different ways.
2.11 Illustrate with examples the different types of redundancy that may occur in a relation. How can redundancy in a relation induce problems in the presence of insertions, updates, and deletions?
2.12 What is the purpose of functional and multivalued dependencies? What is the difference between them?
2.13 What are the operations of the relational algebra? Describe the different types of joins, such as inner and outer joins. How can joins be expressed in terms of other relational algebra operations, such as selection and Cartesian product?
2.14 What is SQL? What are the sublanguages of SQL?
2.15 What is the general structure of SQL queries? How can the semantics of an SQL query be expressed with the relational algebra?
2.16 Discuss the differences between the relational algebra and SQL. Why is relational algebra an operational language, whereas SQL is a declarative language?
2.17 Explain what duplicates are in SQL and how they are handled.
2.18 Describe the general structure of SQL queries with aggregation and sorting. State the basic aggregation operations provided by SQL.
2.19 What are subqueries in SQL? Give an example of a correlated subquery.
2.20 What are common table expressions in SQL? What are they needed for?
2.21 What is the objective of physical database design? Which factors are used to measure the performance of database applications? Discuss the trade-offs that must be considered, for example, between normalization and denormalization and between storage cost and performance.
2.22 Explain the different types of file organization. Discuss their respective advantages and disadvantages.
2.23 What is an index? Why are indexes needed? Explain the various types of indexes.
2.24 What is clustering? What is it used for?
Exercises
Exercise 2.1. A French horse race fan wants to set up a database to analyze the performance of the horses as well as the betting payoffs.
A racetrack is described by a name (e.g., Hippodrome de Chantilly), a location (e.g., Chantilly, Oise, France), an owner, a manager, a date opened, and a description. A racetrack hosts a series of horse races.
A horse race has a name (e.g., Prix Jean Prat), a category (i.e., Group 1,
2, or 3), a race type (e.g., thoroughbred flat racing), a distance (in meters), a track type (e.g., turf right-handed), qualification conditions (e.g., 3-year-old excluding geldings), and the first year it took place.
A meeting is held on a certain date at a racetrack and is composed of one or several races. For a meeting, the following information is kept: weather
(e.g., sunny, stormy), temperature, wind speed (in km per hour), and wind direction (N, S, E, W, NE, etc.).
Each race in a meeting has a number (unique within the meeting), a scheduled departure time, and a number of participating horses. The application keeps track of the distribution of the prize money among the first finishers (e.g., €228,000 for the first place and €88,000 for the second place), as well as the time of the fastest horse.
For each race on a given date, several types of bets are proposed (e.g., tiercé and quarté+), each with several bet options (e.g., "in order," "in any order," and a bonus for the quarté+). The payoffs are given for each bet type and option with respect to a base amount (e.g., quarté+ for €2), stating the win amount and the number of winners for each option.
A horse is described by its name, breed (e.g., thoroughbred), sex, and important dates such as foaling (birth), gelding (castration, for males only), and death. The lineage of a horse is given by its sire (father) and dam (mother), and its appearance is described by its coat color (e.g., bay or chestnut). The horse's owner, breeder, and trainer are also recorded.
In a race, each horse is assigned a number and carries a particular weight in order to balance the chances of the participants. The application keeps track of the finishing position and the victory margin of each horse. Design an ER schema for this application. Translate your ER schema into the relational model, indicating the keys of each relation as well as the referential integrity and non-null constraints.
Exercise 2.2. A Formula One fan club wants to set up a database to keep track of the results of all the seasons since the first Formula One World Championship in 1950.
A racing season spans a year, is defined by a start and an end date, and includes multiple races governed by specific regulations. Each race is identified by its order in the season and an official name (e.g., the 2013 Formula One Shell Belgian Grand Prix), and is described by its date, local and UTC times, weather conditions, pole position (driver name and time), and fastest lap (driver name, time, and lap number). Every race belongs to a Grand Prix, for which the active years are recorded (e.g., 1950–1956).
For example, the Belgian Grand Prix had hosted a total of 58 races as of 2013. Each race takes place on a circuit, such as the Circuit de Spa-Francorchamps located in Spa, Belgium, and is characterized by a type (e.g., race, road, or street), a number of laps, a circuit length, and a race distance (the last two in kilometers). In addition, a lap record is kept, giving the best time, the driver who set it, and the corresponding year.
Notice that the course of a circuit may be modified over the years. For example, the Spa-Francorchamps circuit was shortened from 14 to 7 km in 1979. Further, a Grand Prix may use several circuits over the years, as is the case for the Belgian Grand Prix.
A racing team, such as Scuderia Ferrari, is described by its name, location (e.g., Maranello, Italy), and key personnel (e.g., Stefano Domenicali). For each team, its history is recorded, including its debut Grand Prix, the number of races participated in, and the championships won by the constructor and by its drivers, as well as its highest race finish, total victories, pole positions, and fastest laps. In a given season, a team races under a full name that usually includes its current sponsor (e.g., Scuderia Ferrari Marlboro from 1997 to 2011), with a designated chassis (e.g., F138) and engine (e.g., Ferrari).
Each driver is described by her/his name, nationality, birth date and place, race entries, championships won, race victories, podium finishes, total career points, pole positions, fastest laps, highest race finish, and best grid position. A team employs two main drivers and may also have up to six test drivers, although the number of test drivers can vary. While main drivers are usually associated with a team for the whole season, they may not compete in every race. Each team is allocated two consecutive numbers for its main drivers, where the number 1 is assigned to the previous season's constructors' champion; the number 13 is rarely used, having appeared only once, in the 1963 Mexican Grand Prix.
In a Grand Prix, drivers must participate in a qualifying session that determines their starting order for the race. The qualifying results record, for each driver, her/his position and the times obtained in the three segments Q1, Q2, and Q3. For the race itself, the results recorded for each driver comprise the finishing position (optional), the number of laps completed, the total race time, the reason for retirement or disqualification (both optional), and the points awarded (only for the top eight finishers). Design an ER schema for this application, stating any unspecified requirements and integrity constraints and making the necessary assumptions for completeness. Translate your ER schema into the relational model, indicating the keys of each relation as well as the referential integrity and non-null constraints.
Exercise 2.3. Write in relational algebra and in SQL the following queries over the Northwind database:
(a) Name, address, city, and region of employees.
(b) Names of employees and of customers located in Brussels related through orders shipped by Speedy Express.
(c) Title and name of employees who have sold at least one product.
(d) Names and titles of employees and of their supervisors.
(e) Products sold or bought in London.
(f) Employees who live in the same city as some of their customers.
(g) Products that have not been ordered.
(h) Customers who bought all products.
(i) Categories of products together with the average price of the products in each category.
(j) Companies that supply more than three products.
(k) Total sales by employee, ordered by employee identifier.
(l) Employees who sell products of more than seven suppliers.
Multidimensional Model
Hierarchies
The granularity of a data cube is determined by the combination of the levels of its dimensions; in the cube of our example, these are the Category level of the Product dimension, the Quarter level of the Time dimension, and the City level of the Customer dimension.
To extract strategic knowledge from a cube, it is necessary to view its data at several levels of detail, for example, at the month level or aggregated by country. Hierarchies allow this by relating lower, detailed concepts (children) to higher, more general ones (parents), thus defining the schema of each dimension. A dimension instance comprises the members at all levels of a dimension. For example, in the Product dimension, products roll up to categories; the Time dimension goes from Day up to Month, Quarter, Semester, and Year; and the Customer dimension goes from Customer up to City, State, Country, and Continent, each hierarchy usually ending in a distinguished top level All.
Fig. 3.2 Hierarchies of the Product (Product → Category → All), Time (Day → Month → Quarter → Semester → Year → All), and Customer (Customer → City → State → Country → Continent → All) dimensions
Figure 3.3 shows part of the Product dimension at the instance level: each product in the hierarchy is linked to its corresponding category. All categories roll up to a member called all, the only member of the distinguished level All. This member is used to aggregate measures across the whole hierarchy, for example, to obtain the total sales for all products.
Real-world applications exhibit many different kinds of hierarchies. The hierarchy of Fig. 3.3 is balanced, since every product has the same number of levels up to the root of the hierarchy. Chapters 4 and 5 study these kinds of hierarchies in detail, covering both their conceptual representation and their implementation in current data warehouse and OLAP systems.
1 Note that, as indicated by the ellipses, not all nodes of the hierarchy are shown.
Fig 3.3 Members of a hierarchy Product → Category
Measures
In a data cube, each measure is associated with an aggregation function that combines several measure values into a single one. Aggregation takes place when the level of detail at which the data are visualized is changed by traversing the hierarchies of the dimensions. For example, moving along the Customer hierarchy from City to Country aggregates, typically with the SUM operation, the sales of all customers of the same country. At the extreme, visualizing the cube at the All level of every dimension hierarchy yields a single cell containing the overall sum of the Quantity measure for all products, customers, and time periods.
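As a rough illustration, if the cube were stored relationally in hypothetical Sales and Customer tables, rolling up from City to Country with the SUM function could be expressed in SQL as follows:

-- Aggregate the Quantity measure of all customers of the same country
SELECT C.Country, SUM(S.Quantity) AS TotalQuantity
FROM Sales S JOIN Customer C ON S.CustomerKey = C.CustomerKey
GROUP BY C.Country;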
Summarizability refers to the correct aggregation of cube measures along dimension hierarchies so that consistent results are obtained. To guarantee summarizability, conditions such as the following must hold.
• Disjointness of instances: the grouping of instances in a level with respect to their parent in the next level must result in disjoint subsets. For example, in the hierarchy of Fig. 3.3, a product cannot belong to two categories; otherwise, its sales would be counted twice, once for each category.
• Completeness: all instances must be included in the hierarchy, and each instance must be related to one parent in the next level. For example, in the Time hierarchy, every day of the period of interest must exist and be assigned to a month; otherwise, the sales on the missing dates would be excluded from the totals.
• Correctness: It refers to the correct use of the aggregation functions.
As explained next, measures can be of various types, and this determines the kind of aggregation function that can be applied to them.
According to the way in which they can be aggregated, measures can be classified as follows.
• Additive measures can be meaningfully summarized along all the dimensions, using addition. These are the most common type of measures. For example, the measure Quantity in the cube of Fig. 3.1 is additive: it can be summarized when the hierarchies in the Product, Time, and Customer dimensions are traversed.
• Semiadditive measures can be meaningfully summarized using addition along some, but not all, dimensions. A typical example is inventory quantities, which cannot be meaningfully added along the Time dimension, for instance, by adding the inventories of two different quarters.
• Nonadditive measures cannot be meaningfully summarized using addition across any dimension. Typical examples are item price, cost per unit, and exchange rate.
Defining the aggregation function for each measure is essential, especially for semiadditive and nonadditive measures. For instance, a semiadditive measure such as inventory can be averaged along the Time dimension and summed along the other dimensions. Similarly, nonadditive measures such as item price or exchange rate are often aggregated with the average, although other functions such as the minimum, the maximum, or the count may be appropriate, depending on the semantics of the application. To allow users to interactively explore the data cube at different granularities, optimization techniques based on aggregate precomputation are used. Incremental aggregation mechanisms avoid recomputing the whole aggregation from scratch each time the data warehouse is queried; however, whether this is feasible depends on the aggregate function used, which leads to a further classification of measures.
• Distributive measures are defined by an aggregation function that can be computed in a distributed way. Suppose the data are partitioned into subsets and the aggregate function is applied to each subset; the function is distributive if the result obtained by applying a (possibly different) function to the partial results equals the result of applying the original function to the whole data set. Count, sum, minimum, and maximum are distributive, whereas distinct count is not. For example, partitioning the measure values {3,3,4,5,8,4,7,3,8} into the subsets {3,3,4}, {5,8,4}, and {7,3,8} and summing the distinct counts of the subsets yields 8, whereas the distinct count of the whole set is only 5.
• Algebraic measures are defined by an aggregation function that can be expressed as a scalar function of distributive functions. A typical example is the average, which is computed by dividing the sum by the count, both of which are distributive functions (a sketch of this idea in SQL is given after this list).
• Holistic measures are measures that cannot be computed from other subaggregates. Typical examples include the median, the mode, and the rank. They are expensive to compute, especially when the data are modified, since they must be recomputed from scratch.
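As a rough illustration of how an algebraic measure can be derived from distributive subaggregates, suppose a hypothetical table PartialSales stores, for each partition of the data, the precomputed subaggregates SumQty and CntQty; the global average can then be obtained without rescanning the detailed data:

-- The average (algebraic) is a scalar function of sum and count (both distributive)
SELECT SUM(SumQty) * 1.0 / SUM(CntQty) AS AvgQuantity
FROM PartialSales;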
OLAP Operations
A distinctive feature of the multidimensional model is that it allows data to be analyzed from multiple perspectives and at several levels of detail. OLAP operations let users materialize these perspectives, providing an interactive environment for data analysis. These OLAP operations can be compared to the relational algebra operations introduced in Chap. 2.
Figure 3.4 illustrates the OLAP operations discussed in this section. The analysis starts from the cube in Fig. 3.4a, which shows quarterly sales quantities (in thousands) by product category and customer city for the year 2012.
The roll-up operation aggregates measures along a dimension hierarchy to obtain measures at a coarser granularity. The syntax for this operation is:
ROLLUP(CubeName, (Dimension → Level)*, AggFunction(Measure)*)
where Dimension → Level indicates the level of a dimension to which the roll-up is performed and AggFunction is the aggregation function applied to summarize the measure. An aggregation function must be specified for every measure to be kept in the resulting cube; measures for which no function is given are removed from the cube. Note that rolling up a dimension to the All level (Dimension → All) effectively removes that dimension from the cube.
In our example, the following roll-up operation computes the sales quantities by country:
ROLLUP(Sales, Customer → Country, SUM(Quantity))
The result is shown in Fig. 3.4b: the Customer dimension, which contained four cities, now contains two countries. The values for Paris and Lyon in a given quarter and category are aggregated into the value for France, and similarly for Germany. When querying a cube, a common need is to roll up a few dimensions to particular levels and to aggregate all the other dimensions to the All level. Such a sequence of n roll-up operations can be abbreviated with the ROLLUP* operation, whose syntax is:
ROLLUP*(CubeName, [(Dimension → Level)*], AggFunction(Measure)*)
For example, the total quantity by quarter can be obtained as follows:
ROLLUP*(Sales, Time → Quarter, SUM(Quantity))
Fig. 3.4 OLAP operations: (a) original cube; (b) roll-up to the Country level; (c) drill-down to the Month level; (d) sort products by name; (e) pivot; (f) slice on City = 'Paris'
Fig. 3.4 OLAP operations (continued): (g) dice on City = 'Paris' or 'Lyon' and Quarter = 'Q1' or 'Q2'; (h) dice on Quantity > 15; (i) cube for 2011; (j) drill-across; (k) percentage change; (l) total sales by quarter and city
Fig. 3.4 OLAP operations (continued): (m) maximum sales by quarter and city; (n) top two sales by quarter and city; (o) top 70% of sales by city and category, ordered by ascending quarter; (p) top 70% of sales by city and category, ordered by descending quantity; (q) rank of quarters by category and city according to descending quantity
Fig. 3.4 OLAP operations (continued): (r) three-month moving average; (s) year-to-date sum; (t) union of the original cube and a cube with sales data for Spain
In the expression above, the Time dimension is rolled up to the Quarter level, while the Customer and Product dimensions are rolled up to the All level. On the other hand, in the expression
ROLLUP*(Sales, SUM(Quantity))
all the dimensions of the cube are rolled up to the All level, yielding a single cell that contains the overall sum of the Quantity measure.
When rolling up a dimension, it is often necessary to count the members of the dimension being removed from the cube. For example, a roll-up of the Sales cube to the Quarter level can also compute the number of distinct products sold in each quarter, adding a new measure ProdCount to the cube. We will see below other ways to add measures to a cube.
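In a relational rendering of the cube (hypothetical Sales and TimeDim tables), such a roll-up with an additional product count could be sketched in SQL as follows:

-- Quantity is summed, while the removed Product dimension is counted
SELECT T.Quarter, SUM(S.Quantity) AS Quantity,
       COUNT(DISTINCT S.ProductKey) AS ProdCount
FROM Sales S JOIN TimeDim T ON S.TimeKey = T.TimeKey
GROUP BY T.Quarter;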
Recursive hierarchies, in which a level rolls up to itself, are common in real-world situations; a typical example is an employee supervision hierarchy. Unlike fixed hierarchies, the number of levels of a recursive hierarchy depends on its members. To aggregate measures over recursive hierarchies, the RECROLLUP operation is used, which iterates a roll-up over the hierarchy until the top level is reached. Its syntax is:
RECROLLUP(CubeName, Dimension → Level, Hierarchy, AggFunction(Measure)*)
The drill-down operation performs the inverse of roll-up: it moves from a coarser level to a finer level within a hierarchy. Its syntax is:
DRILLDOWN(CubeName, (Dimension → Level)*)
where Dimension → Level is the level of a dimension to which the drill-down is performed.
In the cube of Fig. 3.4b, sales of the Seafood category in France are notably higher in the first quarter than in the other quarters. To find out why, we can drill down along the Time dimension to the Month level and check whether this peak occurred in a particular month.
As shown in Fig.3.4c, we discover that, for some reason yet unknown, sales in January soared both in Paris and in Lyon.
The sort operation returns a cube where the members of a dimension have been sorted according to the value of an expression, in ascending (ASC) or descending (DESC) order. With the ASC and DESC options, which are the default, members are sorted within their parent in the hierarchy, whereas the BASC and BDESC options sort across all members, regardless of the hierarchy.
In our example, the expression
SORT(Sales, Product, ProductName)
sorts the members of the Product dimension in ascending order of their ProductName attribute. If the cube has only one dimension, its members can be sorted according to a measure. For example, if SalesByQuarter is obtained from the original cube by aggregating sales by quarter for all cities and categories, the expression
SORT(SalesByQuarter, Time, Quantity DESC)
sorts the Time members in descending order of the Quantity measure.
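In a relational rendering (hypothetical Sales and Product tables), this kind of sorting corresponds to an ORDER BY clause over the aggregated result, as sketched below:

SELECT P.ProductName, SUM(S.Quantity) AS Quantity
FROM Sales S JOIN Product P ON S.ProductKey = P.ProductKey
GROUP BY P.ProductName
ORDER BY P.ProductName ASC;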
The pivot operation rotates the axes of a cube to provide an alternative presentation of its data. The syntax of the operation is
PIVOT(CubeName, (Dimension → Axis)*)
where the axes are specified as {X, Y, Z, X1, Y1, Z1, ...}.
In our example, to see the cube with the Time dimension on the x axis, we can rotate the axes of the original cube as follows:
PIVOT(Sales, Time → X, Customer → Y, Product → Z)
The result is shown in Fig 3.4e.
The slice operation removes a dimension of a cube (i.e., a cube of n−1 dimensions is obtained from a cube of n dimensions) by selecting one instance in a dimension level. The syntax of this operation is:
SLICE(CubeName, Dimension, Level = Value)
where the dimension Dimension is dropped by fixing a single value Value in the level Level. The other dimensions remain unchanged.
In our example, to visualize the data only for Paris, we apply a slice operation as follows:
SLICE(Sales, Customer, City = 'Paris')
The result is the subcube in Fig. 3.4f, a two-dimensional matrix showing the evolution of sales quantities by quarter and category, that is, a collection of time series. The slice operation assumes that the cube is at the appropriate level of the dimension to be fixed (the City level in our example); hence, a roll-up or drill-down is often needed before slicing.
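Under the same hypothetical relational representation, the slice on City = 'Paris' corresponds to a selection applied before the aggregation, as sketched below:

-- The Customer dimension is dropped by fixing a single city
SELECT T.Quarter, P.CategoryName, SUM(S.Quantity) AS Quantity
FROM Sales S
  JOIN Customer C ON S.CustomerKey = C.CustomerKey
  JOIN TimeDim T ON S.TimeKey = T.TimeKey
  JOIN Product P ON S.ProductKey = P.ProductKey
WHERE C.City = 'Paris'
GROUP BY T.Quarter, P.CategoryName;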
The dice operation keeps the cells of a cube that satisfy a Boolean condition ϕ over dimension levels, attributes, and measures.
The union operation is also used to display different granularities on the same dimension. For example, if SalesCountry is the cube in Fig. 3.4b, then the following operation
UNION(Sales, SalesCountry) results in a cube with sales measures summarized by city and by country.
Given two cubes with the same schema, the difference operation removes from the first cube the cells that exist in the second one. The syntax of the operation is: DIFFERENCE(CubeName1, CubeName2).
In our example, if TopTwoSales denotes the cube containing the top two sales by quarter and city, the operation DIFFERENCE(Sales, TopTwoSales) removes these cells from the original cube.
This will result in the cube in Fig 3.4u.
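If both cubes were materialized as relations with identical schemas, the difference operation would correspond to the SQL EXCEPT operator, as in this sketch (Sales and TopTwoSales assumed to be such relations):

-- Cells of Sales that also appear in TopTwoSales are removed
SELECT * FROM Sales
EXCEPT
SELECT * FROM TopTwoSales;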
The drill-through operation allows one to move from the lowest level of data in the cube to the data in the operational systems from which the cube was derived. This is useful, for example, for examining the origin of outlier values in a data cube. Note, however, that drill-through is not, strictly speaking, an OLAP operation, since its result is not a multidimensional cube.
Table 3.1 summarizes the OLAP operations presented in this section. In addition to these basic operations, OLAP tools provide a large number of mathematical, statistical, and financial functions for computing ratios, variances, interest, depreciation, currency conversions, and so on.
Data Warehouses
A data warehouse is a repository of integrated data obtained from several sources for the specific purpose of multidimensional data analysis. It is characterized as a subject-oriented, integrated, nonvolatile, and time-varying collection of data that supports management decision making. We review these characteristics next.
• Subject-oriented means that data warehouses focus on the analytical needs of different areas of an organization. These areas vary depending on the activities performed by the organization; for example, in a retail company the analysis may focus on product sales and inventory management. In contrast, operational databases focus on a specific application, such as recording product sales or replenishing inventory.
Table 3.1 Summary of OLAP operations

Add measure: Adds new measures to a cube, computed from other measures or dimensions.
Aggregation operations: Aggregate the cells of a cube, possibly after performing a grouping of cells.
Dice: Keeps the cells of a cube that satisfy a Boolean condition over dimension levels, attributes, and measures.
Difference: Removes the cells of a cube that are in another cube. Both cubes must have the same schema.
Drill-across: Merges two cubes that have the same schema and instances using a join condition.
Drill-down: Disaggregates measures along a hierarchy to obtain data at a finer granularity. It is the opposite of the roll-up operation.
Drop measure: Removes measures from a cube.
Pivot: Rotates the axes of a cube to provide an alternative presentation of its data.
Recursive roll-up: Performs an iteration of roll-ups over a recursive hierarchy until the top level is reached.
Rename: Renames one or several schema elements of a cube.
Roll-up: Aggregates measures along a hierarchy to obtain data at a coarser granularity. It is the opposite of the drill-down operation.
Roll-up*: Shorthand notation for a sequence of roll-up operations.
Slice: Removes a dimension from a cube by selecting one instance in a dimension level.
Sort: Orders the members of a dimension according to an expression.
Union: Combines the cells of two cubes that have the same schema but disjoint members.
• Integrated means that data obtained from several operational and external systems must be reconciled. This involves addressing inconsistencies in data definitions, formats, and codification, dealing with synonyms (fields with different names but the same meaning) and homonyms (fields with the same name but different meanings), and resolving multiple occurrences of the same data. In operational databases, these problems are typically addressed at design time.
• Nonvolatile means that the durability of the data is ensured by disallowing data modification and removal, thus extending the lifespan of the data compared to typical operational systems. Indeed, data warehouses accumulate data over long periods, often 5 to 10 years or more, whereas operational databases typically keep data for only the 2 to 6 months needed for daily operations and may overwrite older information as required.
• Time-varying means that the warehouse keeps track of how its data have evolved over time; it can hold different values for the same information, together with the time at which the changes occurred. For example, a data warehouse in a bank may store the average monthly balance of clients' accounts over a period of several years; a sketch of such a query is given below. In contrast, operational databases often lack explicit temporal support, since it may be unnecessary for day-to-day operations and is also difficult to implement.
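For instance, over a hypothetical AccountBalance table recording daily balances, the average monthly balance kept in the warehouse could be derived along these lines (standard SQL EXTRACT; SQL Server would use the YEAR and MONTH functions):

-- Average balance per client, per year and month
SELECT ClientID,
       EXTRACT(YEAR FROM BalanceDate) AS BalanceYear,
       EXTRACT(MONTH FROM BalanceDate) AS BalanceMonth,
       AVG(Balance) AS AvgMonthlyBalance
FROM AccountBalance
GROUP BY ClientID, EXTRACT(YEAR FROM BalanceDate), EXTRACT(MONTH FROM BalanceDate);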
A data warehouse is aimed at analyzing the data of an entire organization.
Departments of an organization often require only a portion of the corporate data warehouse, specialized for their needs. For example, a sales department may require only sales data, whereas a human resources department may require demographic data and data about its employees. These specialized subsets of a data warehouse are called data marts. However, data marts are not necessarily private to a department; they may be shared with other interested parts of the organization.
A data warehouse can be seen as a collection of data marts. This view corresponds to a bottom-up approach, in which the smaller data marts are built first and then merged to obtain the data warehouse. This approach suits organizations that want quick results or are reluctant to commit the time and effort needed to build a large data warehouse. In the traditional, top-down view, data marts are instead derived from the data warehouse, often as logical views over it.
Table 3.2 shows several aspects that differentiate operational database (or OLTP) systems from data warehouse (or OLAP) systems. We next analyze some of these differences in detail.
Typically, the users of OLTP systems are operations and office employees, who perform predefined tasks through transactional applications such as payroll or ticket reservation systems. Data warehouse users, on the other hand, are usually located at higher levels of the organization and use interactive OLAP tools for data analysis. Consequently, OLTP systems require current, detailed data, whereas data analysis requires historical, summarized data. The difference in data organization (aspect 4 in Table 3.2) follows from the type of use of OLTP and OLAP systems.
Regarding data structures, OLTP systems are optimized for rather small and simple transactions that are executed frequently and both read and write data. In the Northwind database application, for example, users often insert, modify, or delete orders. Thus, a typical OLTP transaction accesses only a small number of records.
Table 3.2 Comparison between operational databases and data warehouses

Aspect: Operational databases / Data warehouses
1. User type: Operators, office employees / Managers, account executives
2. Usage: Predictable, repetitive / Ad hoc, nonstructured
3. Data content: Current, detailed data / Historical, summarized data
4. Data organization: According to operational needs / According to analysis needs
5. Data structures: Optimized for small transactions / Optimized for complex queries
6. Access frequency: High / From medium to low
7. Access type: Read, insert, update, delete / Read, append only
9. Response time: Short / Can be long
11. Lock utilization: Needed / Not needed
13. Data redundancy: Low (normalized tables) / High (denormalized tables)
In contrast, OLAP systems must support complex aggregation queries that access many records in several tables, resulting in long and complex SQL queries. Moreover, OLTP systems are accessed frequently, for example, each time a purchase order is entered, whereas OLAP systems are accessed less often, for example, when orders are analyzed. Data warehouse records are also typically accessed in read mode. Finally, while a properly indexed OLTP system yields short response times, OLAP queries usually take longer to execute because of their complexity.
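The contrast can be illustrated with two queries over Northwind-like tables (table names and the order number are assumed for illustration): a typical OLTP access touches a single order, whereas a typical OLAP query aggregates the entire order history.

-- OLTP: read one order and its detail lines
SELECT O.OrderID, O.OrderDate, D.ProductID, D.Quantity
FROM Orders O JOIN OrderDetails D ON O.OrderID = D.OrderID
WHERE O.OrderID = 10248;

-- OLAP: total sales amount by customer country over all orders
SELECT C.Country, SUM(D.UnitPrice * D.Quantity) AS TotalSales
FROM Orders O
  JOIN OrderDetails D ON O.OrderID = D.OrderID
  JOIN Customers C ON O.CustomerID = C.CustomerID
GROUP BY C.Country;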
OLTP systems normally serve a high number of concurrent accesses and therefore require locking or other concurrency management mechanisms to ensure safe transaction processing. In contrast, OLAP systems are read-only: although queries can be submitted and computed concurrently, the number of concurrent users is usually low. Finally, OLTP systems are typically modeled with UML or the ER model, whereas data warehouses are modeled with the multidimensional model.
Data Warehouse Architecture
Back-End Tier
The back-end tier is composed of extraction, transformation, and loading (ETL) tools, used to feed the data warehouse with data from operational databases and other internal or external sources, and a data staging area (sometimes called an operational data store), an intermediate database in which the extracted data undergo successive transformations before being loaded into the data warehouse.
The extraction, transformation, and loading process, as the name indicates, is a three-step process as follows:
• Extraction gathers data from multiple, heterogeneous sources, which may include operational databases as well as files in various formats; the sources may be internal to the organization or external to it.
In order to solve interoperability problems, data are extracted whenever possible using application programming interfaces (APIs) such as ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity).
• Transformation converts the data from their source format into the warehouse format. This comprises cleaning, which removes errors and inconsistencies; integration, which reconciles data from different sources at both the schema and the data level; and aggregation, which summarizes the data according to the level of detail, or granularity, of the data warehouse.
• Loading feeds the data warehouse with the transformed data and also refreshes it, that is, propagates updates from the sources at a specified frequency in order to provide up-to-date data for decision making. Depending on organizational policies, the refresh frequency may vary from monthly to several times a day, or even near real time; a simplified sketch of this step is given after this list.
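As a simplified sketch of the loading step, rows validated in a hypothetical staging table could be appended to a warehouse fact table as follows (all table and column names are assumptions):

-- Append transformed and validated rows from the staging area to the fact table
INSERT INTO SalesFact (CustomerKey, ProductKey, TimeKey, Quantity)
SELECT CustomerKey, ProductKey, TimeKey, Quantity
FROM StagingSales
WHERE IsValid = 1;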
Data Warehouse Tier
The data warehouse tier in Fig. 3.5 is composed of an enterprise data warehouse and/or several data marts, and a metadata repository storing information about the data warehouse and its contents.
An enterprise data warehouse is a centralized warehouse that covers the entire organization, while a data mart is a specialized warehouse targeted at a particular functional or departmental area; a data mart can thus be seen as a small, local data warehouse. The data in a data mart can be obtained either from the enterprise data warehouse or collected directly from the data sources.
Another component of the data warehouse tier is the metadata repository.
Metadata, often described as "data about data," is traditionally classified into technical and business metadata. Business metadata describes the meaning (or semantics) of the data as well as the organizational rules, policies, and constraints that govern it. Technical metadata describes how data are structured and stored in a computer system and the applications and processes that manipulate them.
In a data warehouse, technical metadata describes the data warehouse system, the source systems, and the ETL process. In particular, a metadata repository may contain information such as the following.
• Metadata describing the structure of the data warehouse and the data marts at the conceptual, logical, and physical levels, together with security information (user authorization and access control) and monitoring information (usage statistics, error reports, and audit trails).
• Metadata describing the data sources and their schemas at the conceptual, logical, and physical levels, including ownership, update frequencies, legal limitations, and access methods.
• Metadata describing the ETL process, including data lineage (i.e., tracing warehouse data back to their sources), data extraction, cleaning and transformation rules, default values, data refresh and purging rules, and the algorithms used for summarization.
OLAP Tier
The OLAP tier is composed of an OLAP server, which presents business users with multidimensional data, regardless of how the data are actually stored underneath. Although most database products provide OLAP extensions and related tools for defining and querying data cubes, there is no standard language for data cube manipulation, so the technologies differ between systems. Among the relevant languages, XMLA (XML for Analysis) aims at providing a common language for exchanging multidimensional data between client applications and OLAP servers, while MDX (MultiDimensional eXpressions) and DAX (Data Analysis eXpressions) are query languages for OLAP databases. MDX has become a de facto standard supported by many OLAP vendors, whereas DAX, introduced by Microsoft, is intended to be easier to use by business users. In addition, the SQL standard has been extended with analytical capabilities, known as SQL/OLAP. We study MDX, DAX, and SQL/OLAP in Chaps. 6 and 7.
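As a small example of the SQL/OLAP analytical extensions, a window function can compute an aggregate such as a three-month moving average; the MonthlySales table and its columns are hypothetical:

SELECT SalesMonth, SalesAmount,
       AVG(SalesAmount) OVER (ORDER BY SalesMonth
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS MovAvg3Months
FROM MonthlySales;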
Front-End Tier
The front-end tier in Fig 3.5 is used for data analysis and visualization.
It contains client tools that allow users to exploit the contents of the data warehouse. Typical client tools include the following:
• OLAP tools allow users to interactively explore and manipulate the warehouse data. They make it easy to formulate complex ad hoc queries that may involve large amounts of data, without requiring any knowledge of the underlying systems.
• Reporting tools enable the design, production, and delivery of reports in various formats, including paper-based, interactive, and web-based reports. Reports rely on predefined queries, which request specific information in a specific format and are executed on a regular basis. Modern reporting also relies on key performance indicators (KPIs) and dashboards.
• Statistical toolsare used to analyze and visualize the cube data using statistical methods.
• Data mining tools allow users to analyze the data in order to discover valuable knowledge, such as patterns and trends, and to make predictions on the basis of the existing data.
In Chap. 7, we show some of the tools used to exploit the data warehouse, like data analysis tools, key performance indicators, and dashboards.
Variations of the Architecture
Some of the components described above may be missing in a real environment. In some situations there may be only an enterprise data warehouse without data marts, or an enterprise data warehouse may not exist at all. Building an enterprise data warehouse is a complex and costly task, whereas data marts are typically easier to build; however, when several data marts are created independently, integrating them later into an enterprise data warehouse can be difficult.
In other situations, an OLAP server may not exist and client tools may access the data warehouse directly (the connection between the data warehouse tier and the front-end tier). This is exemplified in Chap. 7, where queries for the Northwind case study are expressed in MDX and DAX for an OLAP server, as well as in SQL over the data warehouse. In an extreme situation, neither a data warehouse nor an OLAP server exists; this is called a virtual data warehouse, which defines a set of views over the operational databases designed for efficient access (the arrow connecting the data sources to the front-end tier). Although a virtual data warehouse is easy to build, it does not contain historical data.
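A virtual data warehouse could be sketched as a set of views such as the following one, defined directly over Northwind-like operational tables (names assumed for illustration):

-- Analysis-oriented view computed on demand from the operational tables
CREATE VIEW SalesByEmployee AS
SELECT E.LastName, COUNT(DISTINCT O.OrderID) AS NumOrders,
       SUM(D.UnitPrice * D.Quantity) AS TotalSales
FROM Orders O
  JOIN OrderDetails D ON O.OrderID = D.OrderID
  JOIN Employees E ON O.EmployeeID = E.EmployeeID
GROUP BY E.LastName;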
Moreover, a virtual data warehouse does not contain centralized metadata and provides no ability to clean and transform the data; it can also negatively affect the performance of the operational databases.
Finally, a data staging area may not be needed if the data in the source systems closely match the data in the warehouse, which typically happens when there are few data sources of high quality. This is, however, rarely the case in real-world applications.
3.5 Overview of Microsoft SQL Server BI Tools
A wide variety of business intelligence tools are available on the market. Major database providers such as Microsoft, Oracle, IBM, and Teradata offer their own suites, as do companies such as SAP, MicroStrategy, Qlik, and Tableau, alongside popular open-source platforms such as Pentaho. In this book, we use the Microsoft SQL Server tools as a representative suite to illustrate the concepts. SQL Server provides an integrated platform for building analytical applications; it is composed of three main components, described briefly below. References to other prominent business intelligence tools are given in the bibliographic notes.
Analysis Services supports the definition, querying, updating, and management of analytical databases in two modes, multidimensional and tabular. The main difference between them lies in the underlying paradigm (multidimensional or relational, respectively) and the associated query language, MDX for the multidimensional mode and DAX for the tabular mode. We study both modes and their languages in Chaps. 5, 6, and 7, using the analytical database of the Northwind case study.
Integration Services supports ETL processes: data can be extracted from various sources and then combined, cleaned, and summarized before being loaded into the data warehouse. We explain Integration Services and the ETL process in Chap. 9 using the Northwind case study.
Reporting Services is used to define, generate, store, and manage reports from various data sources, such as data warehouses and OLAP cubes. Reports can be personalized and delivered in a variety of formats and accessed through various clients, including web browsers and mobile applications; users access the reports through the server component of Reporting Services. We use Reporting Services in Chap. 7 to build dashboards for the Northwind case study. Several tools are used for developing and managing these components.
Visual Studio is a development platform that supports Analysis Services, Integration Services, and Reporting Services projects. SQL Server Management Studio (SSMS) provides integrated management of all SQL Server components. Power BI enables business users to analyze data and produce reports and dashboards themselves, following a self-service BI approach that minimizes the need for IT involvement. Power Pivot is an Excel add-in for building and analyzing data models.
Summary
This chapter introduced the multidimensional model, which is the basis of data warehouse systems, and distinguished online analytical processing (OLAP) from online transaction processing (OLTP). We defined the notion of a data cube along with its dimensions, hierarchies, and measures, classified measures according to how they can be aggregated, and discussed aggregation and summarizability. We then described the main OLAP operations, such as roll-up and drill-down, used to interactively manipulate a data cube. We also compared data warehouse systems with traditional database systems, described the basic architecture of data warehouse systems and several of its variations, and concluded with an overview of the Microsoft SQL Server BI tools.
Bibliographic Notes
Foundational data warehouse concepts are covered in the classic books by Kimball and by Inmon, the latter being the source of the definition of a data warehouse used in this chapter. The notion of hypercube underlying the multidimensional model was studied in early work, in particular in connection with the SQL roll-up and cube operators. OLAP hierarchies have also been examined, and the summarizability of measures has been defined and analyzed in several works; other classifications of measures have been proposed as well. More details and references on these topics are given in Chap. 5.
There is currently no universally accepted definition of the OLAP operations, unlike the well-established operations of the relational algebra. Several OLAP algebras have been proposed in the literature, each defining its own set of operations. A comparison of these algebras is given in [202], which highlights the need for a reference algebra for OLAP. The operations defined in this chapter were inspired by [50].
Regarding SQL Server, dedicated books cover Analysis Services, Integration Services, and Reporting Services. The tabular model of Microsoft Analysis Services and DAX are also studied in depth in specialized references.
Review Questions
3.1 What is the meaning of the acronyms OLAP and OLTP?
3.2 Using an example of an application domain that you are familiar with, describe the various components of the multidimensional model, that is, facts, measures, dimensions, and hierarchies.
3.3 Why are hierarchies important in data warehouses? Give examples of various hierarchies.
3.4 Discuss the role of measure aggregation in a data warehouse How can measures be characterized?
3.5 Give an example of a problem that may occur when summarizability is not verified in a data warehouse.
3.6 Describe the various OLAP operations using the example you defined in Question 3.2.
3.7 What is an operational database system? What is a data warehouse system? Explain several aspects that differentiate these systems.
3.8 Give some essential characteristics of a data warehouse How do a data warehouse and a data mart differ? Describe two approaches for building a data warehouse and its associated data marts.
3.9 Describe the various components of a typical data warehouse architecture. Identify variants of this architecture and specify in what situations they are used.
3.10 Briefly describe the components of Microsoft SQL Server.
Exercises
Exercise 3.1. Consider the data warehouse of a telephone provider, whose dimensions include the caller customer, the callee customer, the date and time of the call, the type of call, and the call program, and whose measures include the duration of the call and the amount collected. Specify the OLAP operations needed to answer the following queries:
(a) Total amount collected by each call program in 2012.
(b) Total duration of calls made by customers from Brussels in 2012.
(c) Total number of weekend calls made by customers from Brussels to customers in Antwerp in 2012.
(d) Total duration of international calls started by customers in Belgium in 2012.
(e) Total amount collected from customers in Brussels who are enrolled in the corporate program, in 2012.
Exercise 3.2. Consider the data warehouse of a train company, which keeps track of trips composed of segments, the trains used, the departure and arrival stations, the number of passengers, and the kilometers traveled. Specify the OLAP operations needed to answer the following queries:
(a) Total number of kilometers made by Alstom trains during 2012, for trips departing from French or Belgian stations.
(b) Total duration of international trips during 2012, that is, trips departing from a station of a country and arriving at a station of another country.
(c) Total number of trips that either departed from or arrived at Paris during July 2012.
(d) Average duration of train segments in Belgium in 2012.
(e) For each trip, the average number of passengers per segment, obtained by adding up the passengers of all segments of the trip.
Exercise 3.3. Consider a university data warehouse describing teaching and research activities. Teaching activities are organized along the department, professor, course, and academic semester dimensions and are measured by the number of hours taught and the number of credits. Research activities are organized along the professor, funding agency, project, and time dimensions, the latter covering start and end dates; each professor is affiliated with a department, and the measures are the number of person months and the funding amount. Define the OLAP operations, together with the necessary dimension hierarchies, to answer the following queries:
(a) Total teaching hours by department during the academic year 2012–2013.
(b) Total research project funding by department during the calendar year 2012.
(c) Total number of professors involved in research projects by department during 2012.
(d) Total number of courses delivered by each professor during the academic year 2012–2013.
(e) Number of projects started in 2012 by department and funding agency.
Utilizing conceptual models for database design offers significant advantages, primarily by enhancing communication between users and designers without requiring technical knowledge of the implementation platform. These models can be mapped to various logical structures, including relational, object-oriented, and graph models, making it easier to adapt to technological changes. In addition, by prioritizing user requirements, conceptual models support database maintenance and evolution, ensuring that modifications to logical and physical schemas can be managed more efficiently.
This chapter focuses on conceptual modeling for data warehouses, using the MultiDim model to define data requirements for data warehouse and OLAP applications. It begins with an overview of the model in Section 4.1. Section 4.2 emphasizes the importance of hierarchies in exploiting the full functionality of data warehouse and OLAP systems, classifying the various types of hierarchies and illustrating them graphically. Advanced aspects of conceptual modeling are discussed in Section 4.3, and the chapter concludes in Section 4.4 by revisiting the OLAP operations through a set of queries over the Northwind data warehouse.
Conceptual Modeling of Data Warehouses
The conventional database design process develops database schemas at three levels: conceptual, logical, and physical. At the conceptual level, a schema provides a clear overview of the users' data requirements, focusing on what data is needed rather than how it will be implemented. This design is typically carried out with entity-relationship models, ensuring a structured approach to database creation.
The Entity-Relationship (ER) model and the Unified Modeling Language (UML) are commonly used for conceptual schemas, which are then translated into the relational model through well-defined mapping rules. However, a standardized conceptual model for multidimensional data is lacking, and data warehouse designs are therefore often carried out directly at the logical level. This approach typically relies on star and snowflake schemas, which can be complex and difficult for the average user to understand. Consequently, there is a need for a conceptual data warehousing model that abstracts from and clarifies the logical level.
In this chapter, we introduce the MultiDim model to represent the essential components of data warehouse and OLAP applications at the conceptual level. The graphical notation of the model is summarized in Appendix A. Throughout the chapter, we use the Northwind data warehouse schema depicted in Fig. 4.1, which we also refer to as the Northwind data cube.
A schema in the MultiDim model is composed of a set of dimensions and a set of facts.
A dimension consists of either a single level or one or more hierarchies, where each hierarchy is in turn composed of several levels. Dimensions have no graphical representation of their own; they are depicted through their constituent components.
A level is analogous to an entity type in the ER model: it represents a set of real-world concepts that share common characteristics from the application's point of view. The individual occurrences of a level are called members.
Figure 4.1 shows, for instance, the Product and Category levels, each characterized by a set of attributes that describe their members. Each level has one or more identifiers, composed of one or more attributes, such as CategoryID for the Category level. The attributes of a level take values from domains such as integer, real, and string; however, this type information is not shown in the graphical representation of our conceptual schemas.
A fact relates several levels, as illustrated by the Sales fact in Fig. 4.1, which relates the Employee, Customer, Supplier, Shipper, Order, Product, and Date levels. Note that the same level can participate several times in a fact, playing different roles.
Each role is identified by a name and is represented by a separate link between the fact and the level. For instance, the Date level participates in the Sales fact through the OrderDate, DueDate, and ShippedDate roles. Instances of a fact are called fact members. The cardinality of the relationship between a fact and a level defines the minimum and maximum number of fact members that can be related to level members. In our example, the Sales fact has a one-to-many relationship with the Product level: each sale refers to a single product, while a product can be related to many sales. In addition, the Sales fact has a one-to-one relationship with the Order level: each sale corresponds to a single order line and, conversely, each order line corresponds to a single sale.
Fig 4.1 Conceptual schema of the Northwind data warehouse
A fact typically contains attributes called measures, which are usually numeric values that are analyzed along the various dimensions. For instance, the Sales fact in Fig. 4.1 includes the measures Quantity, UnitPrice, and Discount. The identifiers of the levels participating in a fact determine the granularity of the measures, that is, the level of detail at which measures are represented.
In roll-up operations, measures are aggregated along dimensions; the SUM aggregation function is assumed by default unless another one is indicated. Measures are classified as additive, semiadditive, or nonadditive, and are assumed to be additive by default. Semiadditive and nonadditive measures are annotated with the symbols '+!' and '+', respectively. In our example, Quantity is an additive measure, while UnitPrice is semiadditive. Furthermore, measures and level attributes may be derived from other measures or attributes in the schema; they are annotated with the symbol '/'. For example, NetAmount is a derived measure.
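Anticipating the relational representation of Chapter 5, the following minimal SQL sketch illustrates the distinction in practice. The table layout and the NetAmount formula are illustrative assumptions, not the Northwind definitions:
CREATE TABLE SalesFact (
  ProductKey   INT,
  CustomerKey  INT,
  OrderDateKey INT,
  Quantity     INT,            -- additive measure
  UnitPrice    DECIMAL(10,2),  -- semiadditive measure
  Discount     DECIMAL(4,2)
);
-- Additive measure: SUM is meaningful along every dimension.
SELECT ProductKey, SUM(Quantity) AS TotalQuantity
FROM SalesFact GROUP BY ProductKey;
-- Semiadditive measure: use AVG (or MIN/MAX) rather than SUM across products.
SELECT ProductKey, AVG(UnitPrice) AS AvgUnitPrice
FROM SalesFact GROUP BY ProductKey;
-- Derived measure computed from other measures (the formula is an assumption).
SELECT ProductKey, SUM(Quantity * UnitPrice * (1 - Discount)) AS NetAmount
FROM SalesFact GROUP BY ProductKey;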
A hierarchy comprises several related levels, where the lower level is called the child and the higher level the parent. These relationships are called parent-child relationships, and their cardinalities indicate the minimum and maximum number of members at one level that can be related to a member at the other level. For example, in the hierarchy relating the child level Product to the parent level Category, the cardinality is one-to-many: each product belongs to a single category, while a category can group many products.
A dimension may contain several hierarchies, each one expressing a particular analysis criterion; hierarchy names are therefore included to differentiate them. For instance, the Employee dimension has two hierarchies, Territories and Supervision. When a hierarchy is not intended for aggregation, its attributes are represented in a single level, as is the case for the City, Region, and Country attributes of the Employee dimension.
The levels of a hierarchy allow data to be analyzed at different degrees of detail; for example, the Product level gives specific information about products, while the Category level provides a more general view in terms of product categories. The level containing the most detailed data is called the leaf level, and its name usually determines the name of the dimension, except when the same level participates several times in a fact, in which case the role name defines the dimension name; these are known as role-playing dimensions.
The level representing the most general data is called the root level and is often denoted by a distinguished level called All, which contains a single member, denoted all. Whether to include this level in multidimensional schemas is left to the designer; since the All level is essentially meaningless in conceptual schemas, we usually omit it and include it only when clarity requires it.
Hierarchies
Balanced Hierarchies
A balanced hierarchy has, at the schema level, a single mandatory path, exemplified by the Product→Category relationship. At the instance level, such a hierarchy forms a tree whose branches all have the same length: each parent member has at least one child member, and each child member belongs to exactly one parent member. In our example, every category contains at least one product, and each product is assigned to exactly one category.
Unbalanced Hierarchies
An unbalanced hierarchy has, at the schema level, at least one level that is not mandatory, so that at the instance level some parent members may have no associated child members. For instance, in a bank hierarchy, some branches may have no agencies and some agencies may have no ATMs, yielding an unbalanced tree whose branches have different lengths. As in balanced hierarchies, each child member is related to exactly one parent member, as is the case for agencies and branches. Unbalanced hierarchies are useful for accommodating facts at different granularities, so that different facts can be associated with different levels, such as ATMs and agencies.
Fig 4.2 An unbalanced hierarchy (a) Schema; (b) Examples of instances
Recursive hierarchies, also called parent-child hierarchies, are a special case of unbalanced hierarchies in which members of the same level are related through the two roles of a parent-child relationship; note the distinction between parent-child hierarchies and parent-child relationships.
The Supervision hierarchy of the Employee dimension in Fig. 4.1 is an example: it represents an organizational chart in terms of the employee-supervisor relationship, where the Subordinate and Supervisor roles are linked through a parent-child relationship over the Employee level. As Fig. 4.3 shows, this hierarchy is unbalanced, since employees with no subordinates have no descendants in the instance tree.
Fig 4.3 Instances of the parent-child hierarchy in the Northwind data warehouse
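Anticipating the relational representation of Chapter 5, the following minimal SQL sketch shows how such a parent-child hierarchy can be traversed with a recursive query; the table layout and column names are assumptions for illustration only:
CREATE TABLE Employee (
  EmployeeKey   INT PRIMARY KEY,
  EmployeeName  VARCHAR(50),
  SupervisorKey INT REFERENCES Employee(EmployeeKey)  -- NULL for the root of the hierarchy
);
CREATE TABLE Sales (
  EmployeeKey INT REFERENCES Employee(EmployeeKey),
  SalesAmount DECIMAL(12,2)
);
-- Total sales made by each employee together with all her direct and
-- indirect subordinates, obtained by traversing the parent-child links.
WITH RECURSIVE Subordinates(AncestorKey, EmployeeKey) AS (
  SELECT EmployeeKey, EmployeeKey FROM Employee
  UNION ALL
  SELECT s.AncestorKey, e.EmployeeKey
  FROM Subordinates s JOIN Employee e ON e.SupervisorKey = s.EmployeeKey
)
SELECT s.AncestorKey AS EmployeeKey, SUM(f.SalesAmount) AS TotalSales
FROM Subordinates s JOIN Sales f ON f.EmployeeKey = s.EmployeeKey
GROUP BY s.AncestorKey;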
Generalized Hierarchies
Generalized hierarchies arise when the members of a level are of different types; for example, customers may be either companies or persons. In an ER model this situation is captured by a generalization relationship. Measures pertaining to customers may require different aggregation paths depending on the type: for companies the path is Customer→Sector→Branch, whereas for persons it is Customer→Profession→Branch. The MultiDim model represents such hierarchies explicitly, showing both the common and the type-specific levels as well as the parent-child relationships among them.
At the schema level, a generalized hierarchy contains multiple exclusive paths that share at least the leaf level, as illustrated in Fig. 4.4a, which shows the two aggregation paths for the two customer types within the same hierarchy. At the instance level, each member belongs to exactly one of the paths, as shown in Fig. 4.4b, where the symbol '⊗' indicates that the paths are exclusive for every member. The levels at which the alternative paths split and join are called, respectively, the splitting and joining levels.
Distinguishing splitting and joining levels matters for correctly aggregating measures during roll-up operations, a property known as summarizability. As discussed in Chapter 3, generalized hierarchies are in general not summarizable; for instance, not all customers can be rolled up to the Profession level. Thus, the aggregation mechanism must be adapted when a splitting level is reached in a roll-up operation.
Fig 4.4 A generalized hierarchy (a) Schema; (b) Examples of instances
In generalized hierarchies, it is not necessary that splitting levels are joined.
The hierarchy depicted in Fig. 4.5, used to analyze international publications, covers journals, books, and conference proceedings. While conference proceedings can be aggregated at the conference level, there is no common joining level for all paths. Generalized hierarchies also include an important particular case known as ragged hierarchies, exemplified by the Geography hierarchy shown in Fig. 4.1.
Some countries, such as Belgium, are divided into regions, whereas others, such as Germany, are not; smaller countries, such as the Vatican, have neither regions nor states. A ragged hierarchy is thus a generalized hierarchy in which alternative paths are obtained by skipping one or more intermediate levels. At the instance level, each child member has a single parent member, although the distance from the leaves to that parent may differ for different members.
Fig 4.5 A generalized hierarchy without a joining level
Fig 4.6 Examples of instances of the ragged hierarchy in Fig 4.1
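As an illustration only (anticipating the relational alternatives of Chapter 5), a ragged Geography hierarchy could be stored in a single denormalized table in which missing intermediate levels are represented by NULLs; the table and column names below are assumptions:
CREATE TABLE Geography (
  CityKey     INT PRIMARY KEY,
  CityName    VARCHAR(50),
  StateName   VARCHAR(50),   -- NULL when the country has no states
  RegionName  VARCHAR(50),   -- NULL when the country has no regions
  CountryName VARCHAR(50) NOT NULL
);
-- Roll up each city to its closest existing ancestor: the state if present,
-- otherwise directly the country.
SELECT CityName, COALESCE(StateName, CountryName) AS ParentMember
FROM Geography;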
Alternative Hierarchies
Alternative hierarchies arise when, at the schema level, several nonexclusive hierarchies share the same leaf level. For instance, the Date dimension may contain two hierarchies corresponding to the grouping of months into calendar years and into fiscal years. At the instance level, this yields a graph, as shown in the example, since a child member can be related to several parent members belonging to different levels. Alternative hierarchies are needed to analyze measures from a single perspective, such as time, using several alternative aggregation paths.
Fig 4.7 An alternative hierarchy (a) Schema; (b) Examples of instances
Although generalized and alternative hierarchies may share some levels, they model different situations. In a generalized hierarchy, a child member belongs to only one of the paths, whereas in an alternative hierarchy a child member belongs to all paths, and the user must choose one of them for a particular analysis.
Parallel Hierarchies
When a dimension has several hierarchies associated with it (possibly of different kinds), accounting for different analysis criteria, we are in the presence of parallel hierarchies. Parallel hierarchies are classified as dependent or independent depending on whether their component hierarchies share levels. For instance, Fig. 4.8 shows a dimension with two parallel independent hierarchies. The hierarchy ProductGroups is used for grouping products according to categories or departments.
Fig 4.8 An example of parallel independent hierarchies (ProductGroups and DistributorLocation)
The other hierarchy, DistributorLocation, groups products according to the distributors' divisions or regions, as shown in Fig. 4.8. In contrast, the parallel dependent hierarchies in Fig. 4.9 are used for analyzing sales across stores located in several countries. The hierarchy StoreLocation organizes stores geographically, while SalesOrganization reflects the company's organizational structure. Both hierarchies share the State level, which plays different roles depending on the hierarchy used for analysis. Sharing levels in a conceptual schema reduces the number of its elements without losing semantics, thus improving readability. To unambiguously define the levels composing the various hierarchies, the hierarchy name must be included in the shared level for the hierarchies that continue beyond it, as is the case for StoreLocation and SalesOrganization at the State level.
Fig 4.9 An example of parallel dependent hierarchies
Alternative and parallel hierarchies must be clearly distinguished. An alternative hierarchy bears a single hierarchy name, indicating that it is not meaningful to combine levels from its different component paths. Parallel hierarchies, in contrast, bear several hierarchy names, and levels from different component hierarchies can be combined. For instance, over the schema in Fig. 4.9 a user can meaningfully ask for the sales figures for stores in city A that belong to sales district B.
Fig 4.10 Parallel dependent hierarchies leading to different parent members of the shared level
In parallel dependent hierarchies, unlike alternative hierarchies, a leaf member may roll up to different members of a shared level depending on the hierarchy that is traversed. For instance, consider sales employees and their living place and territory assignments: traversing the Lives and Territories hierarchies from the Employee level to the State level may lead to different states for an employee who lives in one state but is assigned to another. A consequence is that aggregated measure values can be reused across the component hierarchies of an alternative hierarchy, but not across parallel dependent hierarchies. For example, if employees E1, E2, and E3 make sales of $50, $100, and $150, respectively, and all three live in state A, aggregating their sales through the Lives hierarchy gives $300 for state A, while the Territories hierarchy gives only $150. Both results are correct, since they correspond to different analysis criteria.
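The figures above can be reproduced with the following illustrative SQL sketch, in which the two hierarchies are represented by hypothetical assignment tables; the data and table layout are assumptions chosen to match the example:
CREATE TABLE EmployeeSales (Employee VARCHAR(5), SalesAmount DECIMAL(10,2));
CREATE TABLE Lives       (Employee VARCHAR(5), State CHAR(1));   -- state of residence
CREATE TABLE Territories (Employee VARCHAR(5), State CHAR(1));   -- state of assignment
INSERT INTO EmployeeSales VALUES ('E1', 50), ('E2', 100), ('E3', 150);
INSERT INTO Lives VALUES ('E1', 'A'), ('E2', 'A'), ('E3', 'A');
INSERT INTO Territories VALUES ('E1', 'B'), ('E2', 'B'), ('E3', 'A');  -- assumed assignments
-- Roll-up through the Lives hierarchy: state A totals 300.
SELECT l.State, SUM(s.SalesAmount) AS Total
FROM EmployeeSales s JOIN Lives l ON s.Employee = l.Employee
GROUP BY l.State;
-- Roll-up through the Territories hierarchy: state A totals only 150.
SELECT t.State, SUM(s.SalesAmount) AS Total
FROM EmployeeSales s JOIN Territories t ON s.Employee = t.Employee
GROUP BY t.State;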
Nonstrict Hierarchies
The parent-child relationships studied so far have been one-to-many: each child has a single parent, while a parent can have several children. Real-life applications, however, frequently exhibit many-to-many parent-child relationships: a diagnosis may belong to several diagnosis groups, a week may span two months, or a product may be classified into several categories.
A hierarchy that contains at least one many-to-many parent-child relationship is called nonstrict; otherwise it is strict. Whether a hierarchy is strict or nonstrict is independent of its kind, so all the hierarchies discussed above can be either strict or nonstrict. In what follows, we discuss the issues that arise when aggregating over nonstrict hierarchies.
Fig 4.11 Examples of instances of the nonstrict hierarchy in Fig 4.1
The Territories hierarchy in Fig. 4.1 is nonstrict: an employee may be assigned to several cities located in different states, as is the case for Janet Leverling in Fig. 4.11. Since a child member may have several parent members, the instances form an acyclic graph rather than a tree. We nevertheless use the term nonstrict hierarchy rather than acyclic classification graph, both because it conveys that users need to analyze measures at several levels of detail and because the notion of hierarchy is well established among data warehouse practitioners and researchers.
Nonstrict hierarchies give rise to the problem of double-counting measures in roll-up operations over many-to-many relationships. Figure 4.12a shows the strict case: employee Janet Leverling is assigned only to Atlanta, and her total sales of 100 are correctly aggregated by city and by state. Figure 4.12b, in contrast, shows the nonstrict case in which Janet is assigned to Atlanta, Orlando, and Tampa, which complicates the aggregation of her sales.
Fig 4.12 Double-counting problem when aggregating a sales amount measure in Fig 4.11 (a) Strict hierarchy; (b) Nonstrict hierarchy
This approach causes incorrect aggregated results, since the employee’s sales are counted three times instead of only once.
One way to avoid double counting in many-to-many relationships is to transform the nonstrict hierarchy into a strict one by creating a new member for each set of parent members; for instance, a new member representing the three cities Atlanta, Orlando, and Tampa, together with a corresponding member at the state level representing the fact that these cities belong to Florida and Georgia. Another option is to ignore the many-to-many relationship by choosing one primary member, such as Atlanta, and discarding the other parent members. Neither solution corresponds to the users' analysis needs, however: the former introduces artificial categories, while the latter simply ignores meaningful analysis scenarios.
Fig 4.13 A nonstrict hierarchy with a distributing attribute
An alternative approach to the double-counting problem is to indicate how measures are distributed among the several parent members of a many-to-many relationship. This is illustrated in Fig. 4.13, which shows a nonstrict hierarchy where employees may work in several sections. The schema includes a measure representing an employee's overall salary, that is, the sum of the salaries paid in each section. Suppose that an attribute stores the percentage of time for which an employee works in each section. We annotate this attribute in the relationship with the additional symbol '÷', indicating that it is a distributing attribute that determines how measures are divided between the several parent members of a many-to-many relationship. Choosing an appropriate distributing attribute is important in order to avoid approximate results when aggregating measures. For example, suppose that in Fig. 4.13 the distributing attribute represents the percentage of time that an employee works in a specific section. If the employee holds a higher position in one section, she may earn a higher salary there even though she works less time in that section; applying the percentage of time as a distributing attribute for the overall salary measure would then not give an exact result. Note also that when the distributing attribute is unknown, it can be approximated by considering the total number of parent members with which the child member is associated. In the example of Fig. 4.12, since Janet Leverling is associated with three cities, one third of the value of the measure would be accounted for each city.
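A possible relational rendering of this idea (names and layout are assumptions, not the book's mapping) uses a bridge table carrying the distributing attribute; the second query shows the even-split approximation mentioned above:
CREATE TABLE EmployeeSalary (
  EmployeeKey   INT PRIMARY KEY,
  OverallSalary DECIMAL(12,2)
);
CREATE TABLE EmployeeSection (          -- bridge table for the many-to-many relationship
  EmployeeKey INT REFERENCES EmployeeSalary(EmployeeKey),
  SectionKey  INT,
  Percentage  DECIMAL(5,4)              -- distributing attribute; sums to 1 per employee
);
-- Salary attributed to each section according to the distributing attribute.
SELECT b.SectionKey, SUM(e.OverallSalary * b.Percentage) AS SectionSalary
FROM EmployeeSalary e JOIN EmployeeSection b ON e.EmployeeKey = b.EmployeeKey
GROUP BY b.SectionKey;
-- When the distributing attribute is unknown, an even split approximates it.
SELECT b.SectionKey, SUM(e.OverallSalary / c.NbSections) AS ApproxSectionSalary
FROM EmployeeSalary e
JOIN EmployeeSection b ON e.EmployeeKey = b.EmployeeKey
JOIN (SELECT EmployeeKey, COUNT(*) AS NbSections
      FROM EmployeeSection GROUP BY EmployeeKey) c ON c.EmployeeKey = e.EmployeeKey
GROUP BY b.SectionKey;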
Fig 4.14 Transforming a nonstrict hierarchy into a strict one with an additional dimension
Figure 4.14 shows yet another solution to the problem of Fig. 4.13: the nonstrict hierarchy is transformed into independent dimensions, and the focus shifts from the overall salary of an employee to the salary paid by section. This approach is only applicable when the exact distribution of the measure is known, that is, when the amounts paid for the different sections are available; it cannot be used for nonstrict hierarchies without a distributing attribute, such as the one in Fig. 4.11. Moreover, although the salary measure can now be correctly aggregated through the roll-up operation from the Section to the Division level, the problem of double-counting employees remains. For instance, using the schema in Fig. 4.14 to compute the number of employees by section or division requires counting the employee instances in the fact, as shown in Fig. 4.15a, where five employees are assigned to various sections.
The double-counting problem reappears when the number of employees is aggregated across sections. Although the count of employees within each section is correct, reusing these aggregated values to compute the count per division yields an incorrect result: employees E1 and E2 are counted twice, giving a total of 7 instead of the correct value of 5.
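The following illustrative SQL sketch reproduces the counting example (table layout and data are assumptions): the first query reuses the per-section counts and double-counts employees E1 and E2, while the second counts distinct employees directly.
CREATE TABLE Affiliation (Employee VARCHAR(5), Section VARCHAR(5), Division VARCHAR(5));
INSERT INTO Affiliation VALUES
  ('E1','S1','D1'), ('E1','S2','D1'),   -- E1 and E2 each work in two sections
  ('E2','S1','D1'), ('E2','S2','D1'),
  ('E3','S1','D1'), ('E4','S3','D1'), ('E5','S3','D1');
-- Incorrect: summing the per-section counts double-counts E1 and E2 (result: 7).
SELECT Division, SUM(NbEmployees) AS NbEmployees
FROM (SELECT Division, Section, COUNT(DISTINCT Employee) AS NbEmployees
      FROM Affiliation GROUP BY Division, Section) PerSection
GROUP BY Division;
-- Correct: count distinct employees directly at the division level (result: 5).
SELECT Division, COUNT(DISTINCT Employee) AS NbEmployees
FROM Affiliation GROUP BY Division;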
In summary, nonstrict hierarchies can be handled in several ways:
• Transforming a nonstrict hierarchy into a strict one:
– Creating a new parent member for each group of parent members participating in a many-to-many relationship.
– Choosing one parent member as the primary member and ignoring the other parent members.
– Replacing the nonstrict hierarchy by two independent dimensions.
• Calculating approximate values of a distributing attribute.
The designer must select the most appropriate solution according to the situation at hand and the users' requirements, since each alternative has its own advantages and disadvantages and requires its own aggregation procedures.
Advanced Modeling Aspects
Facts with Multiple Granularities
Measures may be captured at multiple granularities. In Fig. 4.16, for instance, sales for the USA are registered at the state level, while European sales are registered at the city level. Similarly, in a medical data warehouse the diagnosis dimension may contain the levels diagnosis, diagnosis family, and diagnosis group: a patient may be related to a diagnosis at the most detailed level, but may also be assigned only a diagnosis family or group, reflecting different degrees of precision in the data.
Fig 4.16 Multiple granularities for a Sales fact
This situation can be modeled using exclusive relationships between the fact and the various granularity levels. The main challenge is then to guarantee correct analysis results when fact data are registered at multiple granularities.
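One possible relational sketch of such exclusive links (an assumption for illustration, not the book's mapping) uses two optional foreign keys and a check constraint, together with a roll-up query that brings every row to the state level:
CREATE TABLE City  (CityKey INT PRIMARY KEY, CityName VARCHAR(50), StateKey INT);
CREATE TABLE State (StateKey INT PRIMARY KEY, StateName VARCHAR(50));
CREATE TABLE Sales (
  ProductKey  INT,
  DateKey     INT,
  CityKey     INT REFERENCES City(CityKey),    -- used when sales are registered by city
  StateKey    INT REFERENCES State(StateKey),  -- used when sales are registered by state
  SalesAmount DECIMAL(12,2),
  -- exactly one of the two exclusive links must be present
  CHECK ((CityKey IS NULL AND StateKey IS NOT NULL) OR
         (CityKey IS NOT NULL AND StateKey IS NULL))
);
-- Roll everything up to the State level: city-grained rows go through City,
-- state-grained rows are taken as they are.
SELECT COALESCE(c.StateKey, s.StateKey) AS StateKey, SUM(s.SalesAmount) AS Total
FROM Sales s LEFT JOIN City c ON s.CityKey = c.CityKey
GROUP BY COALESCE(c.StateKey, s.StateKey);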
Many-to-Many Dimensions
In a many-to-many dimension, several members of the dimension can be associated with the same fact member, as in the bank example of Fig. 4.17, where an account may be jointly owned by several clients. Aggregating the balance over such a dimension counts each account balance once per account holder. Suppose, for instance, that at date D1 accounts A1 and A2 have balances of 100 and 500, respectively, and that A1 is shared by clients C1, C2, and C3, while A2 is shared by C1 and C2. The total balance of the two accounts is 600; however, aggregating the fact over the Date or the Client dimension yields 1,300, since the balance of A1 is counted three times and that of A2 twice.
Fig 4.17 Multidimensional schema for the analysis of bank accounts
The double-counting problem can be analyzed through multidimensional normal forms (MNFs), which state the conditions required for correct measure aggregation in the presence of complex hierarchies. The first multidimensional normal form (1MNF) requires each measure to be uniquely identified by the set of associated leaf levels, and is the basis for correct schema design. To check whether the schema in Fig. 4.17 satisfies the 1MNF, we must determine the functional dependencies between the leaf levels and the measures. Since the balance depends on the account and on the time instant considered, but not on the client, the measure is not determined by all leaf levels; the schema therefore does not satisfy the 1MNF, and the fact must be decomposed.
Fig 4.18 An example of double-counting problem in a many-to-many dimension
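The example can be reproduced with the following illustrative SQL sketch of the decomposed data (table layout and data are assumptions matching the figures):
CREATE TABLE Balance        (DateKey VARCHAR(5), AccountNo VARCHAR(5), Balance DECIMAL(10,2));
CREATE TABLE AccountHolders (AccountNo VARCHAR(5), Client VARCHAR(5));
INSERT INTO Balance VALUES ('D1','A1',100), ('D1','A2',500);
INSERT INTO AccountHolders VALUES ('A1','C1'), ('A1','C2'), ('A1','C3'), ('A2','C1'), ('A2','C2');
-- Correct total balance at date D1: 600.
SELECT SUM(Balance) FROM Balance WHERE DateKey = 'D1';
-- Joining the clients into the fact repeats each balance once per holder,
-- so the same aggregation now yields 1,300.
SELECT SUM(b.Balance)
FROM Balance b JOIN AccountHolders h ON b.AccountNo = h.AccountNo
WHERE b.DateKey = 'D1';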
How the fact is decomposed depends on the multivalued dependencies that hold in it (see Chapter 2). The Balance fact of Fig. 4.17 can be decomposed in two ways. If a joint account may have different clients assigned to it at different times, then time and account together multidetermine the clients, which leads to the decomposition of Fig. 4.19a, where the original fact is split into two facts, AccountHolders and Balance. If, instead, the holders of a joint account do not change over time, the clients are multidetermined by the accounts alone, independently of the date.
In this case, the link relating the Date level and the AccountHolders fact can be eliminated. Alternatively, this situation can be modeled with a nonstrict hierarchy, as shown in Fig. 4.19b.
Fig 4.19 Two possible decompositions of the fact in Fig 4.17 (a) Creating two facts; (b) Using a nonstrict hierarchy
The solutions in Fig. 4.19 eliminate the double-counting problem, but queries about individual clients still require considerable programming effort: Fig. 4.19a calls for a drill-across operation between the two facts, while Fig. 4.19b requires special aggregation procedures for nonstrict hierarchies.
Moreover, in Fig. 4.19a the two facts have different granularities, which complicates drill-across queries, since conversions between the finer and the coarser granularity are needed; for example, clients may have to be grouped to find those holding an account with a given balance, or a balance may have to be distributed among the account holders. Finally, the schemas in Fig. 4.19 can also represent the percentage of an account owned by each customer, either as a measure of the AccountHolders fact in Fig. 4.19a or as a distributing attribute of the many-to-many relationship in Fig. 4.19b.
Fig 4.20 Alternative decomposition of the fact in Fig 4.17
Another solution is to add a level that groups the clients participating in joint accounts, as shown in Fig. 4.20. In the example of Fig. 4.18, two groups would be created: one composed of clients C1, C2, and C3 and another composed of clients C1 and C2. Note, however, that the schema of Fig. 4.20 still does not satisfy the 1MNF, since the measure Balance is determined by Date and Account alone rather than by all leaf levels. The schema must therefore be decomposed as in Fig. 4.19, with the Client level replaced by a nonstrict hierarchy composed of the ClientGroup and Client levels. Finally, many-to-many dimensions can be avoided altogether by designating one client as the primary account owner, which allows the schema of Fig. 4.17 to be used without double counting measures; however, this may not correspond to reality and excludes the other clients from the analysis of joint accounts.
In summary, when a multidimensional schema contains a many-to-many dimension, the solutions illustrated in Fig. 4.19 can be used to avoid it. The choice among the alternatives depends on the functional and multivalued dependencies holding in the fact, the kinds of hierarchies in the schema, and the complexity of the resulting implementation.
Links between Facts
In some situations two related facts must be explicitly linked, even when they share dimensions. Consider, for instance, the facts Order and Delivery: both include the Customer and Date dimensions, while Order also has the Employee dimension and Delivery the Shipper dimension. Several orders can be associated with a single delivery and, conversely, a single order containing several products may be fulfilled through several deliveries. Even though the facts share the Customer and Date dimensions, these dimensions do not suffice to keep track of how orders are delivered, so an explicit link between the two facts is needed.
Fig 4.21 An excerpt of a conceptual schema for analyzing orders and deliveries
Links between facts can have one-to-one, one-to-many, or many-to-many cardinalities. In the scenario above, the link is many-to-many. If the link between Delivery and Order were one-to-many, each order would be associated with a single delivery, while a delivery could comprise several orders.
Querying the Northwind Cube Using the OLAP Operations
To conclude this chapter, we show how the OLAP operations introduced in Chapter 3 can express queries over a conceptual schema, independently of the underlying implementation, using the Northwind cube of Fig. 4.1.
Query 4.1 Total sales amount per customer, year, and product category.
ROLLUP*(Sales, Customer → Customer, OrderDate → Year, Product → Category, SUM(SalesAmount))
The ROLLUP* operation specifies the levels to which the Customer, OrderDate, and Product dimensions are rolled up, while the remaining dimensions are rolled up to the All level. The SUM operation aggregates the SalesAmount measure; all other measures of the cube are dropped from the result.
Query 4.2 Yearly sales amount for each pair of customer and supplier countries.
ROLLUP*(Sales, OrderDate → Year, Customer → Country, Supplier → Country, SUM(SalesAmount))
As in the previous query, a roll-up to the specified levels is performed, while a SUM operation aggregates the measure SalesAmount.
Query 4.3 Monthly sales by customer state compared to those of the previous year.
Sales1 ← ROLLUP*(Sales, OrderDate → Month, Customer → State, SUM(SalesAmount))
Sales2 ← RENAME(Sales1, SalesAmount → PrevYearSalesAmount)
Result ← DRILLACROSS(Sales2, Sales1,
    Sales2.OrderDate.Month = Sales1.OrderDate.Month AND
    Sales2.OrderDate.Year+1 = Sales1.OrderDate.Year AND
    Sales2.Customer.State = Sales1.Customer.State)
Here, we first perform a roll-up to aggregate the SalesAmount measure. We then make a copy of the resulting cube in which the measure is renamed PrevYearSalesAmount, stored in the cube Sales2. The DRILLACROSS operation joins the two cubes, merging cells that correspond to the same month of consecutive years and to the same customer state. Although the join condition on the Customer dimension is stated explicitly, it is not mandatory, since an equijoin is assumed for every dimension not mentioned in the join condition. In the remainder of this section, we omit such equijoins from the conditions of the DRILLACROSS operations.
Query 4.4 Monthly sales growth per product, that is, total sales per product compared to those of the previous month.
Sales1 ← ROLLUP*(Sales, OrderDate → Month, Product → Product, SUM(SalesAmount))
Sales2 ← RENAME(Sales1, SalesAmount → PrevMonthSalesAmount)
Result ← DRILLACROSS(Sales2, Sales1,
    ( Sales1.OrderDate.Month > 1 AND
      Sales2.OrderDate.Month+1 = Sales1.OrderDate.Month AND
      Sales2.OrderDate.Year = Sales1.OrderDate.Year ) OR
    ( Sales1.OrderDate.Month = 1 AND Sales2.OrderDate.Month = 12 AND
      Sales2.OrderDate.Year+1 = Sales1.OrderDate.Year ) )
Result ← ADDMEASURE(Result, SalesGrowth = SalesAmount - PrevMonthSalesAmount)
Here, we first perform a roll-up and make a copy of the resulting cube. The DRILLACROSS operation then joins the two cubes, where the join condition distinguishes two cases: for months starting from February (Month > 1), cells of consecutive months of the same year are merged, while for January the cells are merged with those of December of the previous year. Finally, a new measure SalesGrowth is computed as the difference between the sales amounts of the corresponding months.
Query 4.5 Three best-selling employees.
Sales1 ← ROLLUP*(Sales, Employee → Employee, SUM(SalesAmount))
Result ← MAX(Sales1, SalesAmount, 3)
Here, all the dimensions of the cube except Employee are rolled up to the All level, while the SalesAmount measure is summed. Then, the MAX operation keeps in the result only the three highest values of the measure.
Query 4.6 Best-selling employee per product and year.
Sales1 ← ROLLUP*(Sales, Employee → Employee,
Product → Product, OrderDate → Year, SUM(SalesAmount))
Result ← MAX(Sales1, SalesAmount) BY Product, OrderDate
In this query, we roll up the dimensions of the cube as specified. Then, the MAX operation is applied after grouping by Product and OrderDate.
Query 4.7 Countries that account for top 50% of the sales amount.
Sales1 ← ROLLUP*(Sales, Customer → Country, SUM(SalesAmount))
Result ← TOPPERCENT(Sales1, Customer, 50) ORDER BY SalesAmount DESC
Here, the Customer dimension is rolled up to the Country level, while the other dimensions are rolled up to the All level. The TOPPERCENT operation then selects the countries that together account for the top 50% of the total sales amount.
Query 4.8 Total sales and average monthly sales by employee and year.
Sales1 ← ROLLUP*(Sales, Employee → Employee, OrderDate → Month, SUM(SalesAmount))
Result ← ROLLUP*(Sales1, Employee → Employee, OrderDate → Year, SUM(SalesAmount), AVG(SalesAmount) AS AvgMonthlySales)
We first aggregate to the Employee and Month levels by summing the SalesAmount measure. A second roll-up to the Year level then computes the total sales and the average monthly sales.
Query 4.9 Total sales amount and discount amount per product and month.
Sales1 ← ADDMEASURE(Sales, TotalDisc = Discount * Quantity * UnitPrice)
Result ← ROLLUP*(Sales1, Product → Product, OrderDate → Month, SUM(SalesAmount), SUM(TotalDisc))
Here, we first compute a new measure TotalDisc from three other measures. Then, we roll up the cube to the Product and Month levels.
Query 4.10 Monthly year-to-date sales for each product category.
Sales1 ← ROLLUP*(Sales, Product → Category, OrderDate → Month, SUM(SalesAmount))
Result ← ADDMEASURE(Sales1, YTD = SUM(SalesAmount) OVER
OrderDate BY Year ALL CELLS PRECEDING)
We first aggregate to the Category and Month levels. We then add a new measure that applies the SUM function over a window containing all preceding cells of the same year; recall that the members of the Date dimension are assumed to be ordered chronologically.
Query 4.11 Moving average over the last 3 months of the sales amount by product category.
Sales1 ← ROLLUP*(Sales, Product → Category, OrderDate → Month, SUM(SalesAmount))
Result ← ADDMEASURE(Sales1, MovAvg3M = AVG(SalesAmount) OVER OrderDate 2 CELLS PRECEDING)
In the roll-up, we aggregate the SalesAmount measure by category and month. We then compute the moving average over a window containing the current month and the two preceding ones.
Query 4.12 Personal sales amount made by an employee compared with the total sales amount made by herself and her subordinates during 2017.
Sales1 ← SLICE(Sales, OrderDate.Year = 2017)
Sales2 ← ROLLUP*(Sales1, Employee → Employee, SUM(SalesAmount))
Sales3 ← RENAME(Sales2, SalesAmount → PersonalSales)
Sales4 ← RECROLLUP(Sales2, Employee → Employee, Supervision, SUM(SalesAmount))
Result ← DRILLACROSS(Sales4, Sales3)
We first restrict the cube to the year 2017 and aggregate the sales amount by employee, independently of the supervision hierarchy. After renaming the measure, the recursive roll-up iterates over the Supervision hierarchy, aggregating children into parents until the top level is reached. Finally, the drill-across produces the cube containing both measures.
Query 4.13 Total sales amount, number of products, and sum of the quantities sold for each order.
ROLLUP*(Sales, Order → Order, SUM(SalesAmount),
COUNT(Product) AS ProductCount, SUM(Quantity))
Here, we roll up all the dimensions except Order to the All level, while summing the SalesAmount and Quantity measures and counting the number of products.
Query 4.14 For each month, total number of orders, total sales amount, and average sales amount by order.
Sales1 ← ROLLUP*(Sales, OrderDate → Month, Order → Order, SUM(SalesAmount))
Result ← ROLLUP*(Sales1, OrderDate → Month, SUM(SalesAmount),
AVG(SalesAmount) AS AvgSales, COUNT(Order) AS OrderCount)
Here, we first roll up to the Month and Order levels. Then, we roll up again to remove the Order dimension and obtain the requested measures.
Query 4.15 For each employee, total sales amount, number of cities, and number of states to which she is assigned.
ROLLUP*(Sales, Employee → State, SUM(SalesAmount), COUNT(DISTINCT City)
AS NoCities, COUNT(DISTINCT State) AS NoStates)
Recall that Territories is a nonstrict hierarchy in the Employee dimension.
Here, we roll up to the State level, summing the SalesAmount measure and counting the distinct cities and states. The ROLLUP* operation takes care of the nonstrict hierarchy so that no double counting occurs, as discussed in Section 4.2.6.
Summary
Conceptual modeling is as important for data warehouses as it is for databases: it allows user requirements to be represented without committing to implementation details. To illustrate conceptual multidimensional modeling, we used the MultiDim model, which is based on the entity-relationship model and provides an intuitive graphical notation; such graphical representations considerably ease the understanding of application requirements by users and designers.
We gave a detailed classification of hierarchies, distinguishing them at both the schema and the instance level. We described balanced, unbalanced, and generalized hierarchies, all of which account for a single analysis criterion, and presented recursive and ragged hierarchies as particular cases of unbalanced and generalized hierarchies, respectively. We then introduced alternative hierarchies, which are composed of several hierarchies defining different aggregation paths for the same criterion, and parallel hierarchies, which account for different analysis criteria; parallel hierarchies are dependent or independent depending on whether they share levels. All these hierarchies can in addition be strict or nonstrict, depending on whether they contain many-to-many parent-child relationships. We also discussed advanced modeling aspects, namely facts with multiple granularities and many-to-many dimensions, which are often overlooked in the data warehouse literature. The logical-level implementation of these concepts is studied in Chapter 5. Finally, we showed how the OLAP operations of Chapter 3 can be applied over the conceptual model, using a set of queries over the Northwind data cube.
Bibliographic Notes
Conceptual data warehouse design was first introduced by Golfarelli et al.
Numerous conceptual multidimensional models have been proposed in the literature. Some of them, like the MultiDim model, use graphical notations based on the ER model, others are based on UML, yet others introduce their own notation, and some provide no graphical representation at all. These models also differ considerably in the kinds of hierarchies they support; a detailed comparison of how multidimensional models cope with hierarchies is given in [185]. The inclusion of explicit links between cubes in multidimensional models was proposed in [206]. Multidimensional normal forms were defined in [136, 137]. A survey of multidimensional design is given in [203].
The Object Management Group (OMG) has proposed the Common Warehouse Metamodel (CWM)1 as a standard for representing data warehouse and OLAP systems. This model provides a framework for representing metadata about data sources, data targets, transformations, and analysis.
1 https://www.omg.org/spec/CWM/1.1/PDF
The CWM is organized in layers comprising several submodels. Among these, the resource layer provides models for data representation, including the relational model, while the analysis layer includes a metamodel for OLAP that defines concepts such as dimensions and hierarchies. The CWM is able to represent all the kinds of hierarchies discussed in this chapter.
Review Questions
4.1 Discuss the following concepts: dimension, level, attribute, identifier, fact, role, measure, hierarchy, parent-child relationship, cardinalities, root level, and leaf level.
4.2 Explain the difference, at the schema and at the instance level, between balanced and unbalanced hierarchies.
4.3 Give an example of a recursive hierarchy. Explain how to represent an unbalanced hierarchy with a recursive one.
4.4 Explain the usefulness of generalized hierarchies To which concept of the entity-relationship model are these hierarchies related?
4.5 What is a splitting level? What is a joining level? Does a generalized hierarchy always have a joining level?
4.6 Explain why ragged hierarchies are a particular case of generalized hierarchies.
4.7 Explain in what situations alternative hierarchies are used.
4.8 Describe the difference between parallel dependent and parallel independent hierarchies.
4.9 Illustrate with examples the difference between generalized, alternative, and parallel hierarchies.
4.10 What is the difference between strict and nonstrict hierarchies?
4.11 Illustrate with an example the problem of double counting of measures for nonstrict hierarchies. Describe different solutions to this problem.
4.12 What is a distributing attribute? Explain the importance of choosing an appropriate distributing attribute.
4.13 What does it mean to have a fact with multiple granularities?
4.14 Relate the problem of double counting to the functional and multivalued dependencies that hold in a fact.
4.15 Why must a fact be decomposed in the presence of dependencies? Show an example of a fact that can be decomposed differently according to the dependencies that hold on it.
4.16 Think of real-world examples where two fact tables must be related to each other.
Exercises
Exercise 4.1. Design a MultiDim schema for an application domain that you are familiar with. The schema should contain a fact with associated levels and measures, at least two hierarchies, one of them with an exclusive relationship, and a parent-child relationship with a distributing attribute.
Exercise 4.2. Design a MultiDim schema for the telephone provider application in Ex. 3.1.
Exercise 4.3. Design a MultiDim schema for the train application in Ex. 3.2.
Exercise 4.4. Design a MultiDim schema for the university application given in Ex. 3.3, taking into account the different granularities of the time dimension.
Exercise 4.5. Design a MultiDim schema for a French horse racing application showing statistics about the prizes won by owners, trainers, jockeys, breeders, horses, sires, and damsires, as well as statistics about the payoffs of bets by type, race, racetrack, and horse.
Exercise 4.6. In each of the dimensions of the multidimensional schema of Ex. 4.5, identify the hierarchies (if any) and determine their type.
Exercise 4.7. Design a MultiDim schema for a Formula One application showing statistics about races, in particular the prizes won by drivers, teams, circuits, Grand Prix events, and seasons.
Exercise 4.8. Consider a time dimension composed of two alternative hierarchies: (a) day, month, quarter, and year, and (b) day, month, bimonth, and year. Design the conceptual schema of this dimension and show examples of instances.
Exercise 4.9. Consider the Foodmart cube, whose conceptual schema is shown below. Write, using the OLAP operations, the following queries:
(a) All measures for stores.
(b) All measures for stores in the states of California and Washington, summarized at the state level.
2 The queries of this exercise are based on a document written by Carl Nolan entitled “Introduction to Multidimensional Expressions (MDX).”
Conceptual schema of the Foodmart cube, with dimensions such as Store, Date, Customer, Product, and Promotion, and measures Store Sales, Store Cost, Unit Sales, Sales Average, and Profit
(c) Sales and unit measures by state, city, and store type for the stores of California and Washington.
(d) Sales average, profit, and unit sales in 2017 by semester, quarter, and month.
(e) Unit sales by customer city, compared with the corresponding state and national averages.
(f) The effect of promotions on sales.
(g) Monthly growth of the sales profit and of the unit sales.
(h) Sales measures by product category and subcategory.
(i) The store cities with the highest number of sales.
(j) The number of female customers.
(k) The maximum monthly unit sales for each product subcategory.
(l) Unit sales by brand.
Conceptual models facilitate the design of database applications by improving communication among the stakeholders of a project. To implement such models in a database management system, they must be translated into logical models. This chapter studies how the conceptual multidimensional model can be represented in the relational model. It starts with an overview of the three logical models for data warehouses, namely relational OLAP (ROLAP), multidimensional OLAP (MOLAP), and hybrid OLAP (HOLAP), and then examines the relational representation of data warehouses, focusing on four common implementations: the star, snowflake, starflake, and constellation schemas.
Section 5.3 presents the rules for translating a conceptual multidimensional model, in our case the MultiDim model, into the relational model. Section 5.4 discusses the representation of the time dimension, while Sections 5.5 and 5.6 study how hierarchies, facts with multiple granularities, and many-to-many dimensions are implemented in the relational model. Section 5.7 addresses slowly changing dimensions, which arise when dimension data in a warehouse are updated. Section 5.8 shows how a data cube can be represented and queried in the relational model using SQL. Finally, Section 5.9 illustrates these concepts by implementing the Northwind data warehouse in Analysis Services, using both the multidimensional and the tabular model, referred to as Analysis Services Multidimensional and Analysis Services Tabular, respectively.
Logical Modeling of Data Warehouses
There are several approaches for implementing a multidimensional model, depending on how the data cube is stored These are described next.
Relational OLAP (ROLAP) systems store multidimensional data in relational databases and rely on SQL extensions and special access methods to efficiently implement the OLAP operations. To improve performance, aggregates are often precomputed and stored in relational tables, although such aggregates and the associated indexing structures may consume considerable database space. Relational databases provide standardization and large storage capacity, but expressing OLAP operations over relational tables often results in complex SQL queries.
Multidimensional OLAP (MOLAP) systems store data in specialized multidimensional data structures, such as arrays, combined with hashing and indexing techniques that make OLAP operations efficient. While MOLAP systems allow data to be manipulated simply and naturally, they typically offer less storage capacity than ROLAP systems, and their proprietary nature limits portability.
Hybrid OLAP (HOLAP) systems combine the strengths of both approaches, exploiting the storage capacity of ROLAP and the processing power of MOLAP. In a typical HOLAP architecture, large volumes of detailed data are kept in a relational database, while aggregations are maintained in a separate MOLAP store.
Modern OLAP tools combine these models and usually rely on a relational database management system for the underlying data warehouse. For this reason, in what follows we study the ROLAP implementation in detail.
Relational Data Warehouse Design
The star schema is a relational representation of the multidimensional model, consisting of a central fact table surrounded by a dimension table for each dimension. In the example of Fig. 5.1, the fact table (shown in gray) contains the foreign keys ProductKey, StoreKey, PromotionKey, and DateKey, together with the measures Amount and Quantity. Referential integrity constraints relate the fact table to each of its dimension tables.
In a star schema, dimension tables are usually not normalized, so they may contain redundant data, especially when hierarchies are involved. For instance, in the Product dimension, products belonging to the same category share the same values for the category and department attributes; similarly, the Store dimension repeats the attributes describing the city and the state.
On the other hand, fact tables are usually normalized: their key is the union of the foreign keys, since this union functionally determines all the measures, while there is no functional dependency between the foreign key attributes.
Fig 5.1 An example of a star schema
In Fig. 5.1, the fact table Sales is normalized and its key is composed of ProductKey, StoreKey, PromotionKey, and DateKey.
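A possible DDL sketch of this star schema follows; column lists are abbreviated and the table and column names are assumptions based on Fig. 5.1 (the Date dimension is named DateDim here only to avoid the reserved word):
CREATE TABLE Product (
  ProductKey     INT PRIMARY KEY,     -- surrogate key
  ProductName    VARCHAR(50),
  CategoryName   VARCHAR(50),         -- hierarchy attributes stored denormalized
  DepartmentName VARCHAR(50)
);
CREATE TABLE Store (
  StoreKey  INT PRIMARY KEY,
  StoreName VARCHAR(50),
  CityName  VARCHAR(50),
  StateName VARCHAR(50)
);
CREATE TABLE Promotion (PromotionKey INT PRIMARY KEY, PromotionName VARCHAR(50));
CREATE TABLE DateDim   (DateKey INT PRIMARY KEY, FullDate DATE, Season VARCHAR(20));
CREATE TABLE Sales (
  ProductKey   INT REFERENCES Product(ProductKey),
  StoreKey     INT REFERENCES Store(StoreKey),
  PromotionKey INT REFERENCES Promotion(PromotionKey),
  DateKey      INT REFERENCES DateDim(DateKey),
  Amount       DECIMAL(12,2),
  Quantity     INT,
  PRIMARY KEY (ProductKey, StoreKey, PromotionKey, DateKey)
);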
The snowflake schema avoids such redundancy by normalizing the dimension tables. A dimension is therefore represented by several tables related by referential integrity constraints, and, as in the star schema, referential integrity constraints also relate the fact table to the dimension tables at the finest level of detail.
In a snowflake schema, as illustrated in Fig 5.2, the fact table remains consistent with Fig 5.1, but the dimensions Product and Store are now represented by normalized tables In the Product dimension, category information has been separated into a dedicated Category table, with only the CategoryKey retained in the original table This structure allows the CategoryKey to be repeated for each product within the same category while storing category details only once in the Category table While normalized tables facilitate easier maintenance and optimize storage space, they can impact performance due to the necessity of additional joins for queries that navigate through hierarchies For instance, executing the query “Total sales by category” in this schema requires more complex SQL operations compared to a star schema.
GROUP BY CategoryName while in the snowflake schema in Fig.5.2we need an extra join, as follows:
DateKey Date WeekdayFlag WeekendFlag Season
ProductKey StoreKey PromotionKey DateKey Amount Quantity
CityKey CityName CityPopulation CityArea StateKey
StateKey StateName StatePopulation StateArea StateMajorActivity
Fig 5.2 An example of a snowflake schema
SELECT CategoryName, SUM(Amount)
FROM Product P, Category C, Sales S
WHERE P.ProductKey = S.ProductKey AND P.CategoryKey = C.CategoryKey
GROUP BY CategoryName
A starflake schema combines elements of star and snowflake schemas, featuring a mix of normalized and denormalized dimensions. For example, a starflake schema is obtained by replacing the Product, Category, and Department tables of a snowflake schema with the single Product dimension table of a star schema, while keeping the other dimension tables, such as Store, normalized.
A constellation schema features multiple fact tables that share dimension tables. For instance, as illustrated in Fig 5.3, the schema includes two fact tables, Sales and Purchases, which both use the Date and Product dimensions. Constellation schemas may include both normalized and denormalized dimension tables.
We will discuss star and snowflake schemas further when we study the logical representation of hierarchies later in this chapter.
Fig 5.3 An example of a constellation schema
5.3 Relational Representation of Data Warehouses
We outline a series of rules for converting a multidimensional conceptual schema into a relational schema, building upon the guidelines established in Section 2.4.1, which detail the translation of an ER schema into the relational model.
Rule 1: A level L, provided it is not related to a fact with a one-to-one relationship, is mapped to a table TL that contains all attributes of the level. A surrogate key may be added to the table; otherwise, the identifier of the level is the key of the table. Note that additional attributes will be added to this table when mapping relationships using Rule 3.
Rule 2: A fact F is mapped to a table TF that contains all measures of the fact. A surrogate key may be added to the table. As in the previous rule, additional attributes will be added to this table when mapping relationships using Rule 3.
Rule 3: A relationship between a fact F and a dimension level L, or between a parent level P and a child level C in a hierarchy, can be mapped in three different ways depending on its cardinalities:
Rule 3a: If the relationship is one-to-one, the table corresponding to the fact (TF) or to the child level (TC) is extended with all the attributes of the dimension level or the parent level, respectively.
Rule 3b: If the relationship is one-to-many, the table corresponding to the fact (TF) or to the child level (TC) is extended with the surrogate key of the table corresponding to the dimension level (TL) or the parent level (TP), respectively; that is, the fact or child table contains a foreign key referencing the related table.
Rule 3c: If the relationship is many-to-many, a bridge table TB is created containing the surrogate keys of the tables corresponding to the fact (TF) and the dimension level (TL), or to the parent (TP) and child (TC) levels, respectively. If the relationship has a distributing attribute, an additional attribute is added to the bridge table to store this information.
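As an illustration of Rule 3c, the bridge table for a many-to-many relationship with a distributing attribute could be declared as follows; this is only a sketch, and the table and column names (Sales, Section, SalesSection, Percentage) are illustrative rather than taken from a particular figure.

CREATE TABLE SalesSection (
  SalesKey   INT NOT NULL,      -- surrogate key of the fact table (Rule 2)
  SectionKey INT NOT NULL,      -- surrogate key of the level table (Rule 1)
  Percentage DECIMAL(5,2),      -- distributing attribute, if any
  PRIMARY KEY (SalesKey, SectionKey),
  FOREIGN KEY (SalesKey) REFERENCES Sales (SalesKey),
  FOREIGN KEY (SectionKey) REFERENCES Section (SectionKey)
);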
Surrogate keys are usually generated for each dimension level in a data warehouse to ensure independence from the keys of the source systems, which may evolve over time. This also improves efficiency, since surrogate keys are integers, whereas the keys of the source systems may be less efficient, for example strings. Nevertheless, the keys of the source systems are kept in the dimensions to be able to match the data in the warehouse with the data in the sources. Alternatively, the keys of the source systems can be reused in the data warehouse.
A fact table obtained by the mapping rules above contains the surrogate key of each level related to the fact with a one-to-many relationship, one for each role the level plays. The key of the table is composed of the surrogate keys of all the participating levels. If an additional surrogate key is included in the fact table, the combination of the surrogate keys of all the levels becomes an alternate key.
As we will see in Sect. 5.5, more specialized rules are needed for mapping the various kinds of hierarchies that we studied in Chap. 4.
Applying the above rules to the Northwind conceptual data cube yields a Sales fact table with eight foreign keys, one for each one-to-many relationship with the fact. As discussed in Chap. 4, role-playing dimensions such as the Date dimension are represented in the relational model by several foreign keys, in this case OrderDateKey, DueDateKey, and ShippedDateKey. In addition, the Order dimension is related to the fact through a one-to-one relationship, so its attributes are included directly in the fact table.
In the relational representation of the Northwind data warehouse, the Order dimension is a fact (degenerate) dimension: its attributes OrderNo and OrderLineNo are stored in the fact table. The Sales fact table contains five measures: UnitPrice, Quantity, Discount, SalesAmount, and Freight. In addition, the many-to-many relationship between Employee and City is represented by the bridge table Territories, which contains two foreign keys.
The Northwind data warehouse also illustrates the two approaches for defining the keys of dimension levels, namely, using surrogate keys and reusing the keys of the source database. For instance, the Customer dimension has a surrogate key, CustomerKey, in addition to its database key, CustomerID, whereas the Supplier dimension reuses its database key as SupplierKey. The choice between the two approaches is made during the ETL process, which will be discussed in Chap. 9.
Time Dimension
Time information is present in a data warehouse both as foreign keys in the fact table, which indicate when a fact occurred, and as a time dimension, which defines the aggregation levels used to summarize facts over time.
In OLTP database applications, temporal information is typically derived from attributes of type DATE using functions of the DBMS; details such as whether a day is a weekend or a holiday are not stored explicitly but computed on the fly. In contrast, a data warehouse stores this information as attributes of the time dimension, since OLAP queries must summarize data quickly. For instance, the query "Total sales during weekends" can be written over the star schema of Fig 5.1 as follows:
SELECT SUM(Amount)
FROM Sales S, Date D
WHERE D.DateKey = S.DateKey AND D.WeekendFlag = 1
The granularity of the time dimension is determined by the application requirements. For instance, if only monthly data is needed, the time dimension will have 60 tuples for a 5-year span. However, if data must be kept at the granularity of a second, the time dimension would contain 155,520,000 tuples for the same period. To manage such a large dimension table, it is advisable to split the time dimension into two tables: one at the granularity of day, covering all dates in the time span of the data warehouse, and another at the granularity of second, covering every second within a single day.
In this way, the first table contains 1,825 tuples for 5 years, and the second one contains 86,400 tuples for the 3,600 seconds of each of the 24 hours of a day. The fact table then includes two foreign keys, one to each of the two time dimension tables, which can be populated automatically. Note also that a time dimension can have several hierarchies, as in the calendar and fiscal year example. Further, when a single hierarchy is used, care must be taken to satisfy the summarizability conditions: while a day aggregates correctly into both the month and the year levels, a week may span two different months and thus cannot be aggregated into the month level of the hierarchy.
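A minimal sketch of this split is given below; the table and column names are illustrative and only a few attributes are shown.

CREATE TABLE Date (
  DateKey     INT PRIMARY KEY,
  Date        DATE NOT NULL,
  WeekdayFlag BIT,
  WeekendFlag BIT,
  Season      VARCHAR(10)
);

CREATE TABLE TimeOfDay (
  TimeKey INT PRIMARY KEY,      -- one tuple per second of a day: 86,400 rows
  Hour    SMALLINT,
  Minute  SMALLINT,
  Second  SMALLINT
);

CREATE TABLE Sales (
  DateKey INT NOT NULL REFERENCES Date (DateKey),
  TimeKey INT NOT NULL REFERENCES TimeOfDay (TimeKey),
  Amount  DECIMAL(10,2)         -- other foreign keys and measures omitted
);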
Logical Representation of Hierarchies
Balanced Hierarchies
In a conceptual multidimensional schema, the levels of a dimension hierarchy are represented independently and are linked by parent-child relationships. Applying the mapping rules of Sect. 5.3 to balanced hierarchies yields snowflake schemas, where each level is mapped to a table containing the key, the attributes of the level, and the foreign keys of the relationships. For instance, applying Rules 1 and 3b to the Categories hierarchy gives a snowflake structure with the tables Product and Category, as shown in Fig 5.5a.
If a star schema is required instead, the hierarchies must be represented as flat tables that contain the keys and attributes of all the levels of a hierarchy in a single table. This can be achieved by denormalizing the tables that represent the several levels of the hierarchy.
As an example, the Date dimension of Fig 4.1 can be represented in a single table containing all its attributes, as shown in Fig 5.5b.
Fig 5.5 Relations for a balanced hierarchy: (a) snowflake structure; (b) flat table
Snowflake schemas are more effective than star schemas in representing hierarchical structures, as they allow for clear differentiation between levels and enable the reuse of levels across various hierarchies.
In a snowflake schema, each attribute can be clearly assigned to the level it describes, as illustrated by the Product and Category tables in Fig 5.5a. However, this design can reduce query performance because of the joins needed to combine the data of the tables composing a hierarchy.
Star schemas facilitate query formulation, since fewer joins are needed owing to denormalization, and they enable better system performance for typical star queries. However, they have notable drawbacks for modeling hierarchies. For instance, in the Store dimension it is not clear which attributes belong to which hierarchy level, and associating attributes with their corresponding levels is difficult, which makes the hierarchy structure hard to understand.
Unbalanced Hierarchies
Unbalanced hierarchies can violate the summarizability conditions, since non-leaf members without children may be excluded from the analysis. For example, consider two fact tables linked to the same dimension, one at the ATM level and another at the Agency level. Measures can be aggregated from agencies with ATMs and from branches with agencies. However, disaggregating from Agency to ATM requires knowing which agencies actually have ATMs; otherwise, OLAP tools may not handle this drill-down correctly. In addition, in a star schema representation, defining ATM as part of the primary key is problematic because of the NULL values for agencies without ATMs.
To address these problems of unbalanced hierarchies, placeholders (PH1, PH2, ..., PHn) can be introduced at the missing levels so that the hierarchy becomes balanced, after which the logical mapping for balanced hierarchies can be applied. When a child member has two or more consecutive missing parent levels, the measure values must be repeated so that aggregation remains correct, as illustrated by branch 2 in the figure. A specialized interface is also needed to hide the placeholders from users. Finally, there are cases where factual data in the same fact table exist at different granularities; these are discussed in Sect. 5.6.
Recall from Sect. 4.2.2 that parent-child hierarchies are a special case of unbalanced hierarchies. Mapping these hierarchies to the relational model yields a table that contains all the attributes of the level together with a foreign key linking each child member to its parent. For instance, the Employee table, with its SupervisorKey attribute, gives the relational representation of a parent-child hierarchy. Operating over such a structure is more complex, since recursive queries are needed to traverse the hierarchy. SQL and MDX support recursive queries, but DAX does not, which requires flattening the parent-child structure into a regular hierarchy with a separate column for each level.
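For example, a recursive query in SQL can traverse such a hierarchy. The sketch below computes, for each employee, the name of the top supervisor and the depth in the hierarchy; it assumes an Employee table with columns EmployeeKey, EmployeeName, and SupervisorKey, where SupervisorKey is NULL for the top of the hierarchy.

WITH Supervision (EmployeeKey, EmployeeName, TopSupervisor, HierLevel) AS (
  -- anchor: employees without a supervisor are the roots of the hierarchy
  SELECT EmployeeKey, EmployeeName, EmployeeName, 1
  FROM Employee
  WHERE SupervisorKey IS NULL
  UNION ALL
  -- recursive step: attach each employee to the row of its supervisor
  SELECT E.EmployeeKey, E.EmployeeName, S.TopSupervisor, S.HierLevel + 1
  FROM Employee E JOIN Supervision S ON E.SupervisorKey = S.EmployeeKey )
SELECT EmployeeKey, EmployeeName, TopSupervisor, HierLevel
FROM Supervision;

The resulting HierLevel values can also be used to flatten the hierarchy into one column per level, which is the representation required by DAX.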
Generalized Hierarchies
Generalized hierarchies account for the situation where dimension members are of different types, each with its own aggregation path. For example, customers can be companies or persons: companies are aggregated along the path Customer → Sector → Branch, whereas persons are aggregated along the path Customer → Profession → Branch.
Generalized hierarchies can be represented at the logical level in two main ways: creating separate tables for each level, which leads to snowflake schemas, or consolidating all levels into a single table in which null values are used for the attributes that do not apply. A hybrid approach is also possible, where one table contains the common levels and another the specific ones. Yet another option is to use separate fact and dimension tables for each hierarchy path. Whichever approach is chosen, metadata describing which tables compose the various aggregation paths must be kept, and constraints must be enforced to prevent incorrect queries, such as grouping by both Sector and Profession.
Fig 5.7 Relations for the generalized hierarchy in Fig 4.4
Applying the mapping rules of Sect. 5.3 to the generalized hierarchy of Fig 4.4 yields the relations shown in Fig 5.7. Although this schema captures the hierarchical structure, it does not allow one to traverse only the common levels of the hierarchy, for example, to go from Customer directly to Branch.
To ensure this possibility, we must add the following mapping rule.
Rule 4: A table corresponding to a splitting level in a generalized hierarchy has an additional attribute that is a foreign key to the next joining level, provided it exists. The table may also include a discriminating attribute that indicates the specific aggregation path of each member.
Fig 5.8 Improved relational representation of the generalized hierarchy in Fig 4.4
Figure 5.8 shows the result of applying this rule to the hierarchy of Fig 4.4. The Customer table contains two kinds of foreign keys. The first kind, SectorKey and ProfessionKey, points to the next specialized level and is obtained by applying Rules 1 and 3b of Sect. 5.3. The second kind, BranchKey, points to the next joining level and is obtained by applying Rule 4. In addition, the attribute CustomerType, whose value is either Person or Company, indicates the aggregation path of each member and thereby facilitates aggregation. Check constraints must be specified to ensure that only one of the specialized foreign keys has a value, depending on the value of the CustomerType attribute, as follows:
ALTER TABLE Customer ADD CONSTRAINT CustomerTypeCK
  CHECK ( CustomerType IN ('Person', 'Company') )
ALTER TABLE Customer ADD CONSTRAINT CustomerPersonFK
  CHECK ( CustomerType <> 'Person' OR
    ( ProfessionKey IS NOT NULL AND SectorKey IS NULL ) )
ALTER TABLE Customer ADD CONSTRAINT CustomerCompanyFK
  CHECK ( CustomerType <> 'Company' OR
    ( ProfessionKey IS NULL AND SectorKey IS NOT NULL ) )
The schema in Fig 5.8 allows one to choose different analysis paths, for example, using a specific level such as Profession or Sector, or using only the levels common to all members, such as analyzing all customers through the Customer and Branch levels. As in a snowflake schema, this structure requires join operations between several tables, but it significantly widens the analysis possibilities.
The mapping above can also be applied to ragged hierarchies, which are a special case of generalized hierarchies, as shown in Fig 5.4, where the City level has two foreign keys, to the State and the Country levels. Since ragged hierarchies have a unique path where some levels may be skipped, another option is to embed the attributes of an optional level into the preceding level, as was done with the State level, which contains two optional attributes corresponding to the Region level. Yet another approach is to modify the hierarchy at the instance level by adding placeholders for the missing intermediate levels, as was done for unbalanced hierarchies in Sect. 5.5.2, thus transforming a ragged hierarchy into a balanced one.
Alternative Hierarchies
Alternative hierarchies are mapped to the relational model in the usual way, as shown in Fig 5.9, which corresponds to the conceptual schema of Fig 4.7. Note that although generalized and alternative hierarchies can be clearly distinguished at the conceptual level (compare Figs 4.4a and 4.7), this distinction is blurred at the logical level (compare Figs 5.7 and 5.9).
Fig 5.9 Relations for the alternative hierarchy in Fig 4.7
Parallel Hierarchies
Parallel hierarchies are a combination of several hierarchies, so their logical mapping consists in combining the mappings of the component hierarchy types. For instance, applying this mapping to the schema of Fig 4.9 yields the relations shown in Fig 5.10.
Fig 5.10 Relations for the parallel dependent hierarchies in Fig 4.10
In parallel dependent hierarchies, a level shared by several hierarchies is mapped to a single table, such as the State table in this example. Since such shared levels play different roles in each hierarchy, views can be defined to facilitate writing queries and to improve data visualization.
For example, in Fig 5.10, the table State contains all the states where an employee lives, works, or both. Therefore, aggregating along the path Employee → City → State does not distinguish between the two roles. To consider only the states where employees live, we can define a view StateLives containing only those states where at least one employee resides.
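A sketch of such a view is given below; it assumes that the Employee table has a foreign key CityLivesKey to the City table and that City has a foreign key StateKey to the State table (the actual column names in Fig 5.10 may differ). A similar view, say StateWorks, could be defined for the states where employees work.

CREATE VIEW StateLives AS
SELECT DISTINCT S.*
FROM State S, City C, Employee E
WHERE E.CityLivesKey = C.CityKey AND C.StateKey = S.StateKey;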
Finally, note that although alternative and parallel dependent hierarchies can be distinguished at the conceptual level (see Figs 4.7 and 4.10), their logical representations in Figs 5.9 and 5.10 look similar, despite the several differences discussed in Sect. 4.2.5.
Nonstrict Hierarchies
Applying the mapping rules to a nonstrict hierarchy creates the tables for the levels, the relationships between them, and a bridge table that represents the many-to-many relationship. For instance, in the hierarchy of Fig 4.13, the bridge table EmplSection represents the many-to-many relationship between employees and sections. If the parent-child relationship has a distributing attribute, as in Fig 4.13, this attribute is stored in the bridge table and holds the values needed to distribute the measures; aggregation then requires a special procedure that uses the distributing attribute.
Fig 5.11 Relations for the nonstrict hierarchy in Fig 4.13
Alternatively, a nonstrict hierarchy can be transformed into a strict one by adding an extra dimension to the fact, as shown in Fig 4.14, after which the mapping for strict hierarchies can be applied. Several factors, discussed next, influence the choice between the two solutions.
Bridge tables require less space than the additional-dimension solution, since the fact table is not enlarged when child members are related to several parent members, and no extra foreign keys are added to the fact table, which would further increase the space required. On the other hand, bridge tables store the information about the parent-child relationship, and the distributing attribute if any, separately from the fact table.
Bridge tables also require joins, computations, and programming effort to aggregate measures correctly, so they are best suited to applications with only a few nonstrict hierarchies, and they work well when the distribution of the measures does not change over time. In contrast, the additional-dimension solution is better at representing changes of the measure distribution over time and allows measures to be aggregated along the hierarchy more easily.
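For instance, a measure can be aggregated to the parent level by weighting it with the distributing attribute stored in the bridge table. The sketch below assumes a fact table Payroll(EmployeeKey, Salary), a bridge table EmplSection(EmployeeKey, SectionKey, Percentage) where Percentage is a fraction between 0 and 1, and a Section level; these names are illustrative.

SELECT B.SectionKey, SUM(F.Salary * B.Percentage) AS TotalSalary
FROM Payroll F, EmplSection B
WHERE F.EmployeeKey = B.EmployeeKey
GROUP BY B.SectionKey;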
An alternative approach is to convert the many-to-many relationship into a one-to-many relationship by designating one parent as the "primary" one. The hierarchy thus becomes strict, and the corresponding mapping for simple hierarchies can be applied, as discussed in Sect. 4.3.2.
Advanced Modeling Aspects
Facts with Multiple Granularities
There are two main approaches for the logical representation of facts with multiple granularities. The first one uses multiple foreign keys, one for each alternative granularity, in a way similar to the approach used for generalized hierarchies. The second one removes the granularity variation at the instance level by means of placeholders, in a way similar to the approach used for unbalanced hierarchies.
Consider the example of Fig 4.16, where measures are registered at multiple granularities. Figure 5.12 shows the relational schema resulting from the first solution: the Sales fact table is related to both the City and the State levels through referential integrity constraints. The attributes CityKey and StateKey are optional, so constraints must be specified to ensure that only one of the two foreign keys has a value in each tuple.
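Such a constraint can be expressed as a table check constraint over the Sales table of Fig 5.12; the constraint name below is illustrative.

ALTER TABLE Sales ADD CONSTRAINT SalesGranularityCK
  CHECK ( ( CityKey IS NOT NULL AND StateKey IS NULL ) OR
          ( CityKey IS NULL AND StateKey IS NOT NULL ) );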
Fig 5.12 Relations for the fact with multiple granularities in Fig 4.16
Figure 5.13 shows an example of instances for the second solution above, where placeholders are used for facts that refer to nonleaf levels. There are two possible cases. In the first case, a fact member refers to a nonleaf member that has children; here, placeholder PH1 represents all cities other than the existing children. In the second case, a fact member refers to a nonleaf member without children; in this case, placeholder PH2 represents all (unknown) cities of the state.
Fig 5.13 Using placeholders for the fact with multiple granularities in Fig 4.16
In both solutions, care is needed to ensure the correct summarization of measures. In the first solution, aggregating at the state level requires a union of two subqueries, one for each alternative path. In the second solution, aggregating at the city level returns the placeholders in the result.
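For the first solution, the total quantity by state could be computed with a union of the two paths; the sketch below uses the Sales table of Fig 5.12 and assumes that the City table has a foreign key StateKey to the State table.

SELECT T.StateKey, SUM(T.Quantity) AS TotalQuantity
FROM ( -- facts registered at the city level, rolled up to their state
       SELECT C.StateKey, S.Quantity
       FROM Sales S, City C
       WHERE S.CityKey = C.CityKey
       UNION ALL
       -- facts registered directly at the state level
       SELECT S.StateKey, S.Quantity
       FROM Sales S
       WHERE S.StateKey IS NOT NULL ) T
GROUP BY T.StateKey;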
Many-to-Many Dimensions
Mapping a many-to-many dimension to the relational model yields the tables representing the fact and the dimension levels, together with a bridge table representing the many-to-many relationship between the fact table and the dimension. In the example, the bridge table BalanceClient relates the fact table Balance with the dimension table Client. A surrogate key was added to the Balance fact table so that the bridge table can reference the facts.
Fig 5.14 Relations for the many-to-many dimension in Fig 4.17
In Sect. 4.3.2, we discussed several solutions for decomposing a many-to-many dimension based on the functional dependencies that hold in the fact table. After such a decomposition, the traditional mapping to the relational model can be applied to the resulting schema.
Links between Facts
The logical representation of links between facts depends on their cardinalities. For one-to-one or one-to-many cardinalities, the surrogate key of the fact with cardinality one is added as a foreign key to the other fact.
In the case of many-to-many cardinalities, a bridge table with foreign keys to the two facts is needed.
For example, for the many-to-many link between the Order and Delivery facts in Fig 4.21, the relational schema of Fig 5.15 uses surrogate keys for both facts and a bridge table OrderDelivery that connects them. Figure 5.16 shows instances of this relationship. If, instead, the link were one-to-many, with each order associated with a single delivery, the representation would differ, as explained next.
Fig 5.15 Relations for the schema with a link between facts in Fig 4.21
In the latter case, a bridge table is not needed; instead, the DeliveryKey is added directly as a foreign key to the Order fact table.
Linking fact tables makes it possible to combine data from different cubes through a join operation, for example, to combine order and delivery data in a single sales analysis. In such a drill-across, only one instance of the CustomerKey is kept, assuming it is the same for the order and its deliveries, while both DateKeys are kept and renamed OrderDateKey and DeliveryDateKey. Note, however, that because of the many-to-many relationship between the facts, such a join may lead to double counting of measures.
Fig 5.17 Drill-across of the facts through their link in Fig 5.16
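A sketch of such a drill-across join over the schema of Fig 5.15 is given below; it assumes that the two fact tables are named Orders and Delivery, with surrogate keys OrderKey and DeliveryKey referenced by the bridge table OrderDelivery.

SELECT O.OrderKey, O.CustomerKey, O.DateKey AS OrderDateKey, O.Amount,
       D.DeliveryKey, D.ShipperKey, D.DateKey AS DeliveryDateKey, D.Freight
FROM Orders O, OrderDelivery B, Delivery D
WHERE O.OrderKey = B.OrderKey AND B.DeliveryKey = D.DeliveryKey;

Since an order may join with several deliveries, summing the Amount measure over this result would count it once per delivery, which is the double-counting issue mentioned above.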
Slowly Changing Dimensions
In many real-world situations, dimensions are not static: they change both at the structural level and at the instance level. Structural changes occur, for example, when an attribute is deleted from the data sources and must therefore be removed from the corresponding dimension table. Instance-level changes arise in two situations: when a correction must be applied to fix erroneous data in a dimension, and when the contextual conditions of the analysis scenario change, requiring the contents of the dimension table to be updated. This section addresses these kinds of changes.
To illustrate, we use a simplified version of the Northwind data warehouse, consisting of a Sales fact table related to the dimensions Date, Employee, Customer, and Product, and a single measure SalesAmount. We assume a star schema with a denormalized Product table that embeds the category data. Examples of the Sales fact table and the Product dimension table are given next.
DateKey EmployeeKey CustomerKey ProductKey SalesAmount
t1 e1 c1 p1 100
t2 e2 c2 p1 100
t3 e1 c3 p3 100
t4 e2 c4 p4 100

ProductKey ProductName UnitPrice CategoryName Description
p1 prod1 10.00 cat1 desc1
p2 prod2 12.00 cat1 desc1
p3 prod3 13.50 cat2 desc2
p4 prod4 15.00 cat2 desc2
New tuples are appended to the Sales fact table as new sales occur, but updates to the dimension tables may also be necessary. For instance, when the company starts selling a new product, a tuple must be inserted into the Product table, and erroneous product data must be corrected to maintain data quality.
Dimensions whose data change in this way, typically at a much slower pace than the facts, are called slowly changing dimensions (SCDs). For example, a product may be reassigned to a new category as a result of a new commercial or administrative policy, and the corresponding tuples must then be updated.
In the scenario above, consider a query asking for the total sales per em- ployee and product category, expressed as follows:
SELECT E.EmployeeKey, P.CategoryName, SUM(SalesAmount)
FROM Sales S, Employee E, Product P
WHERE S.EmployeeKey = E.EmployeeKey AND S.ProductKey = P.ProductKey
GROUP BY E.EmployeeKey, P.CategoryName
This query would return the following table:
EmployeeKey CategoryName SalesAmount
e1 cat1 100
e2 cat1 100
e1 cat2 100
e2 cat2 100
Suppose now that prod1 is reclassified from category cat1 to cat2 after the last sale recorded in the fact table. Simply overwriting the category with cat2 loses the historical category information: all previous sales took place while prod1 belonged to cat1, but after the update the query above would attribute them to cat2, as shown next, which does not reflect the situation at the time the sales occurred.
EmployeeKey CategoryName SalesAmount
e1 cat2 200
e2 cat2 200
This result is incorrect, since the sales affected by the category change were made while prod1 was still in cat1. If, on the contrary, the new category is a correction of an error (i.e., the actual category of prod1 has always been cat2), then the result is valid. In the first case, we must ensure that the results obtained while prod1 was in cat1 are preserved and that new aggregations are computed with the new category cat2.
There are three main ways of handling slowly changing dimensions. The simplest one, called type 1, overwrites the old attribute value with the new one, and thus the history is lost. Type 1 is appropriate when the change is a correction of an error in the dimension data.
In the second solution, called type 2, the tuples in the dimension table are versioned, and a new tuple is inserted each time a change takes place.
To keep track of the reclassification, a new tuple for product prod1 with the new category cat2 is inserted into the Product table. In this way, all sales prior to the time of the change are aggregated to cat1, while the sales after that time are aggregated to cat2. For this solution, a surrogate key must be used in addition to the business key, so that the several versions of a dimension member have different surrogate keys but share the same business key. In our case, the business key is stored in the ProductID column and the surrogate key in the ProductKey column. The Product table is also extended with two attributes, From and To, that indicate the validity interval of each tuple, as shown next.
ProductKey ProductID ProductName UnitPrice CategoryName Description From To
k1 p1 prod1 10.00 cat1 desc1 2010-01-01 2011-12-31
k11 p1 prod1 10.00 cat2 desc2 2012-01-01 9999-12-31
k2 p2 prod2 12.00 cat1 desc1 2010-01-01 9999-12-31
The table above shows the two versions of product prod1, with surrogate keys k1 and k11; the date 9999-12-31 in the To attribute indicates that the tuple is still valid, a common convention in temporal databases. With this scheme, a product appears in the fact table with as many surrogate keys as there are versions of it, and the business key is needed to trace all the tuples that correspond to the same product, for example, to count how many distinct products the company sold in a given period. A drawback is that the dimension grows with every attribute change, which may degrade the performance of joins with the fact table; more sophisticated techniques have been proposed to alleviate these issues and will be discussed later.
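As a minimal sketch, a type 2 change such as the reclassification of prod1 on 2012-01-01 can be applied with two statements over the Product table above; the surrogate key is assumed to be generated automatically, and the From and To columns are quoted because they clash with SQL keywords.

UPDATE Product
SET "To" = '2011-12-31'                     -- close the current version
WHERE ProductID = 'p1' AND "To" = '9999-12-31';

INSERT INTO Product (ProductID, ProductName, UnitPrice, CategoryName,
                     Description, "From", "To")
VALUES ('p1', 'prod1', 10.00, 'cat2', 'desc2', '2012-01-01', '9999-12-31');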
In the type 2 approach, an additional attribute, often called RowStatus, is sometimes added to indicate which is the current version of a member, as shown next for product prod1.
ProductKey ProductID ProductName UnitPrice CategoryName Description From To RowStatus
k1 p1 prod1 10.00 cat1 desc1 2010-01-01 2011-12-31 Expired
k11 p1 prod1 10.00 cat2 desc2 2012-01-01 9999-12-31 Current
...
Let us consider a snowflake representation for the Product dimension, where the categories are represented in a table Category, as given next.
ProductKey ProductName UnitPrice CategoryKey
p1 prod1 10.00 c1
p2 prod2 12.00 c1
p3 prod3 13.50 c2
p4 prod4 15.00 c2

CategoryKey CategoryName Description
c1 cat1 desc1
c2 cat2 desc2
c3 cat3 desc3
c4 cat4 desc4
In a type 2 snowflake representation, both the Product and the Category tables are extended with a surrogate key and the two temporal attributes From and To. For instance, if product prod1 changes its category to cat2, the Product table keeps the history of the product's categories through versioned tuples, such as a tuple with key k1 for prod1 (priced at 10.00) valid from 2010-01-01 to 2011-12-31, and a tuple with key k2 for prod2 (priced at 12.00) valid from 2010-01-01 onward. Care must also be taken to keep the Category table consistent: changes made at a higher level of the hierarchy, such as a modification of a category description, must be propagated down to all the lower levels that reference it, as illustrated next.
CategoryKey CategoryID CategoryName Description From To
l1 c1 cat1 desc1 2010-01-01 2011-12-31
l11 c1 cat1 desc11 2012-01-01 9999-12-31
l2 c2 cat2 desc2 2010-01-01 9999-12-31
l3 c3 cat3 desc3 2010-01-01 9999-12-31
l4 c4 cat4 desc4 2010-01-01 9999-12-31
This change must be propagated to the Product table, so that tuples that were valid before the change keep referencing the old version of the category (key l1), while the new tuples reference the new version (key l11), as shown next.
ProductKey ProductID ProductName UnitPrice CategoryKey From To
k1 p1 prod1 10.00 l1 2010-01-01 2011-12-31
k11 p1 prod1 10.00 l11 2012-01-01 9999-12-31
k2 p2 prod2 12.00 l1 2010-01-01 2011-12-31
k21 p2 prod2 12.00 l11 2012-01-01 9999-12-31
k3 p3 prod3 13.50 l2 2010-01-01 9999-12-31
k4 p4 prod4 15.00 l2 2011-01-01 9999-12-31
The third solution, called type 3, adds an extra column for each attribute subject to change, in which the new value of the attribute is stored. For instance, when product prod1 changes its category from cat1 to cat2, its description also changes from desc1 to desc2; the table below tracks these changes for the attributes CategoryName and Description.
ProductKey ProductName UnitPrice CategoryName NewCategoryName Description NewDescription
p1 prod1 10.00 cat1 cat2 desc1 desc2
p2 prod2 12.00 cat1 desc1
p3 prod3 13.50 cat2 desc2
p4 prod4 15.00 cat2 desc2
Note that only the last two versions of an attribute can be represented in this solution, and that the validity interval of the values is not stored.
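A minimal sketch of a type 3 change over the denormalized Product table is given below; the column names NewCategoryName and NewDescription and their data types are illustrative.

ALTER TABLE Product ADD NewCategoryName VARCHAR(40);
ALTER TABLE Product ADD NewDescription VARCHAR(100);

UPDATE Product
SET NewCategoryName = 'cat2', NewDescription = 'desc2'   -- keep old values in place
WHERE ProductKey = 'p1';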
Finally, note that the three solutions above can be applied individually or in combination within the same dimension: for instance, correction (type 1), tuple versioning (type 2), and additional columns (type 3) can each be used for different attributes of a single dimension table.
Performing OLAP Queries with SQL
Defining the Northwind Data Warehouse in Analysis Services
Multidimensional Model
A data warehouse in Analysis Services gathers data from one or more data stores. A data source provides the connection information, such as the server location, the login credentials, the method used to retrieve the data, and the security permissions. Analysis Services supports several types of data sources; for relational databases, SQL is the default query language. In our example, a single data source connects to the Northwind data warehouse.
A data source view (DSV) defines the relational schema used by an Analysis Services database, derived from one or more data sources. Loading data into the warehouse often requires transformations, such as selecting specific columns, adding derived columns, filtering rows, or merging columns. These tasks can be accomplished in the DSV by replacing source tables with SQL-based named queries or by defining named calculations for derived columns. In addition, if the source systems do not specify primary keys and foreign key relationships, these can be defined in the DSV.
Analysis Services also allows user-friendly names to be assigned to tables and columns, which improves visibility and navigation in large data warehouses, and customizable views called diagrams can be created within a DSV to display only selected tables.
The DSV of the Northwind data warehouse, shown in Fig 5.23, contains the Sales fact table and its dimension tables. Note that the ragged geography hierarchy has been transformed into a regular one. The figure also shows several named calculations, indicated by a special icon next to the attribute names, which are used when defining and browsing the dimensions:
• In the Employee dimension table, the named calculation EmployeeName combines the first and last name with the expression
FirstName + ' ' + LastName
• In the Date dimension table, the named calculations FullMonth, FullQuarter, and FullSemester are defined, respectively, by the expressions
MonthName + ' ' + CONVERT(CHAR(4), Year)
'Q' + CONVERT(CHAR(1), Quarter) + ' ' + CONVERT(CHAR(4), Year)
'S' + CONVERT(CHAR(1), Semester) + ' ' + CONVERT(CHAR(4), Year)
These calculations combine the month, quarter, or semester with the year.
• In the Sales fact table, the named calculation OrderLineDesc combines the order number and the order line using the expression
CONVERT(CHAR(5),OrderNo) + ' - ' + CONVERT(CHAR(1),OrderLineNo)
Fig 5.23 The data source view for the Northwind data warehouse
Analysis Services supports several types of dimensions as follows:
• A regular dimension has a direct one-to-many link between a fact table and a dimension table. An example is the dimension Product.
• A reference dimension is connected to the fact table indirectly, through another dimension. An example is the Geography dimension, which is related to the Sales fact table through the Customer and Supplier dimensions. Reference dimensions can be chained; for instance, another reference dimension could be defined on top of the Geography dimension.
• In a role-playing dimension, a single fact table is related to a dimension table several times. Examples are the dimensions OrderDate, DueDate, and ShippedDate, which all refer to the same Date dimension. A role-playing dimension is stored once and used several times.
• A fact dimension, also referred to as a degenerate dimension, is similar to a regular dimension, but the dimension data are stored in the fact table. An example is the dimension Order.
• In a many-to-many dimension, a fact can be related to multiple dimension members and, conversely, a member can be related to multiple facts. In the Northwind data warehouse, there is a many-to-many relationship between Employee and City, which is represented by the bridge table Territories. This table must be defined as a fact table in Analysis Services, as we will see later.
Dimensions can be defined either from a DSV, which provides data for the dimension, or from preexisting templates provided by Analysis Services.
An example of the latter is the time dimension, which does not need to be defined from a data source. Dimensions can be built from one or more tables.
We explain next how hierarchies are handled in Analysis Services, which distinguishes two types of hierarchies.
Attribute hierarchies correspond to a single column in a dimension table, such as the attribute ProductName in the Product dimension. Multilevel hierarchies, in contrast, are derived from two or more attributes, each attribute corresponding to a level, such as Product and Category. An attribute can participate in more than one multilevel hierarchy, for example, in a hierarchy combining Product and Brand. Analysis Services supports three types of multilevel hierarchies, depending on the relationships between their members: balanced, ragged, and parent-child hierarchies, which are detailed later in this section.
We now show how the various kinds of dimensions of the Northwind data warehouse are defined in Analysis Services, starting with the Product dimension, shown in Fig 5.24. The right pane shows the tables of the DSV from which the dimension is derived, while the left pane lists the attributes of the dimension. The central pane shows the hierarchy Categories, composed of the Category and Product levels, which are defined using CategoryKey and ProductKey, respectively. To obtain user-friendly names when browsing the hierarchy, the NameColumn property of these attributes is set to CategoryName and ProductName, respectively.
Fig 5.24 Definition of the Product dimension
Figure 5.25 shows some members of the Product dimension as displayed in the dimension browser, with product names grouped by category. Note the Unknown member at the bottom: every dimension has such a member, which is used to handle key errors.
When a fact table is processed and a corresponding key is not found in the Product dimension, the fact value can be assigned to the Unknown member of that dimension. The visibility of the Unknown member is controlled with the UnknownMember property; when it is set to be visible, it appears in the results of MDX queries.
Fig 5.26 Definition of the Date dimension
To define the Date dimension, a hierarchy named Calendar is created with the attributes Year, Semester, Quarter, Month, and Date, where the last two attributes have been renamed. To be able to use the MDX time functions, the Type property of the dimension must be set to Time, and the attributes must be mapped to the standard time subdivisions: DayNbMonth, MonthNumber, Quarter, Semester, and Year are typed as DayOfMonth, MonthOfYear, QuarterOfYear, HalfYearOfYear, and Year, respectively. The attributes composing a hierarchy must be related to their parent attributes through one-to-many relationships so that roll-up operations are correct; for instance, a quarter must roll up to its semester. In Analysis Services, this is enforced by defining a unique key for each attribute of the hierarchy. In the Northwind data warehouse, the values of the MonthNumber attribute are not unique, so a composite key made of MonthNumber and Year is required, which is achieved by setting the KeyColumns property accordingly, as shown in Fig 5.27. In this case, the NameColumn property must also be set to the attribute displayed in the hierarchy, namely FullMonth, and the same must be done for the Quarter and Semester attributes.
Fig 5.27 Definition of the key for attribute MonthNumber in the Calendar hierarchy
Fig 5.28 Definition of the relationships in the Calendar hierarchy
The relationships between the attributes of the Date dimension, shown in Fig 5.28, correspond to functional dependencies. In Analysis Services, relationships are of two types: flexible relationships, which can change over time (e.g., a product can be reassigned to another category), and rigid relationships, which cannot (e.g., a month is always related to the same year).
All the relationships shown in Fig 5.28 are rigid, as indicated by the solid arrowhead. Figure 5.29 shows some members of the Calendar hierarchy. As can be seen, the named calculations FullSemester (e.g., S2 1997), FullQuarter (e.g., Q2 1997), and FullMonth (e.g., January 1997) are displayed.
Fig 5.29 Browsing the hierarchy in the Date dimension
Tabular Model
A tabular data model gathers data from several kinds of sources, including relational and multidimensional databases, data feeds, and text files. To establish a connection, the authentication details required by the chosen data source must be supplied.
When importing data from a relational database, one can either select specific tables and views or write queries that define the data to be imported. To streamline the import, unnecessary columns and rows should be filtered out. A best practice when defining a tabular model is to use database views rather than tables: views decouple the physical database structure from the tabular model and can consolidate data from several tables into a single entity. Creating such views, however, requires the appropriate access rights.
A key design decision in a tabular model is whether to keep a snowflake dimension from the source data as several tables or to denormalize the source tables into a single model table. The benefits of a single model table usually outweigh those of multiple model tables: performance is better, and since hierarchies in a tabular model cannot span several tables, denormalization is required to define them. However, denormalization increases storage size because of redundant data, especially for large dimension tables. The best approach therefore depends on the data volume and on usability requirements.
Merging a snowflake dimension into a single model table can be done either at the data source or within the tabular model. Typically, a dimension is denormalized in the relational data source by means of a view. For instance, in our example we create a view ProductStar in the relational data warehouse for the Product dimension, defined as follows (the category attributes included are one possible choice):
CREATE VIEW ProductStar AS
SELECT P.ProductKey, P.ProductName, P.QuantityPerUnit, P.UnitPrice, P.Discontinued,
  C.CategoryName, C.Description
FROM Product P, Category C
WHERE P.CategoryKey = C.CategoryKey;
The Geography reference dimension, with the levels City, State, Country, and Continent, is used by the Customer, Supplier, and Employee dimensions. In the data warehouse, we therefore define the views CustomerStar and SupplierStar as follows.
CREATE VIEW CustomerStar AS
SELECT U.CustomerKey, U.CustomerID, U.CompanyName, U.Address, U.PostalCode,
C.CityName AS City, S.StateName AS State, Y.CountryName AS Country, N.ContinentName AS Continent
FROM Customer U, City C, State S, Country Y, Continent N
WHERE U.CityKey = C.CityKey AND C.StateKey = S.StateKey AND
S.CountryKey = Y.CountryKey AND Y.ContinentKey = N.ContinentKey;
CREATE VIEW SupplierStar AS
SELECT U.SupplierKey, U.CompanyName, U.Address, U.PostalCode,
C.CityName AS City, S.StateName AS State, Y.CountryName AS Country, N.ContinentName AS Continent
FROM Supplier U, City C, State S, Country Y, Continent N
WHERE U.CityKey = C.CityKey AND C.StateKey = S.StateKey AND
S.CountryKey = Y.CountryKey AND Y.ContinentKey = N.ContinentKey;
Which attributes of the snowflake dimension are included in the views determines which attributes are available for analysis. In our example, we included CityName, StateName, CountryName, and ContinentName, so other attributes, such as StateCapital and Population, are not available in the tabular model and cannot be used for filtering.
The Employee dimension is related to the Geography reference dimension through a many-to-many relationship materialized in the Territories bridge table, so an additional view over the geography tables (named, say, CityStar) is needed, defined as follows.
CREATE VIEW CityStar AS
SELECT C.CityKey, C.CityName AS City, S.StateName AS State,
Y.CountryName AS Country, N.ContinentName AS Continent
FROM City C, State S, Country Y, Continent N
WHERE C.StateKey = S.StateKey AND S.CountryKey = Y.CountryKey AND
  Y.ContinentKey = N.ContinentKey;
In the tabular model, we import these views, renaming the first three as Product, Customer, and Supplier, together with the tables Employee, Territories, Sales, Date, and Shipper.
Fig 5.35 Definition of a relationship in Analysis Services Tabular
Analysis Services Tabular has two storage modes. The default one is an in-memory columnar database called VertiPaq, which stores a copy of the data retrieved from the data source each time the data model is refreshed.
In-memory means that all the data reside in RAM, while columnar means that the data are organized by column, so that each column is stored and compressed independently, reducing scan time and memory usage. The alternative, DirectQuery, creates a metadata layer over an external database and translates DAX queries into SQL queries, which reduces the latency between data updates and their availability for analysis. The choice between the two storage modes depends on the application requirements and on the available hardware.
When tables are imported from a relational database, the import wizard detects relationships from the foreign key constraints; however, this does not apply to the views created above, which are therefore not connected to the Sales table. To define these relationships, several characteristics must be specified, namely their cardinality, filter direction, and whether they are active. Figure 5.36 shows the tabular model of the Northwind data warehouse after the relationships have been defined.
Fig 5.36 The tabular model of the Northwind data warehouse in Analysis Services
Relationships in a tabular model are based on a single column. At least one of the two tables involved in a relationship must use a column that has a unique value in each row; this column is typically the primary key of the table, although it can be any candidate key.
When a relationship in the relational database is based on several columns supported by a composite foreign key, the columns must be consolidated into a single column to define the relationship in the tabular model; this can be done with a calculated column that concatenates the values of the contributing columns. The typical cardinality of a relationship in a tabular model is one-to-many, meaning that each row in the lookup table can correspond to zero, one, or many rows in the related table, as in the relationship between the Customer and Sales tables. The cardinality can also be many-to-one or one-to-one, the latter occurring when the columns on both sides of the relationship are candidate keys of their respective tables.
The uniqueness requirement identifies the lookup table of the relationship. Any value in the related table that has no corresponding value in the lookup table is assigned to a special blank row, which is automatically added to the lookup table to collect all unmatched values. For instance, such a blank row would collect all the rows of the Sales table without a matching customer; it is created only when there is at least one row in the Sales fact table that does not exist in the Customer table. In a one-to-many relationship, the functions RELATEDTABLE and RELATED can be used in row-related calculations on the lookup and related tables, respectively.
We then need to define three calculated columns, named Level 1, Level 2, and Level 3, for the levels of the hierarchy; they are obtained from a column EmployeePath that contains, for each employee, the path of supervisors computed with the DAX PATH function. The Level 1 column is defined as follows.
VAR LevelKey = PATHITEM ( Employee[EmployeePath], 1, INTEGER )
RETURN LOOKUPVALUE ( Employee[EmployeeName], Employee[EmployeeKey], LevelKey )
The PATHITEM function retrieves the employee key at a given position in the path, and the LOOKUPVALUE function retrieves the name of the employee whose key equals the LevelKey variable. The other two columns are defined similarly, changing the second argument of the PATHITEM function. These calculated columns, shown in Fig 5.40, should be hidden. Finally, a hierarchy named Supervision is created from them, with the Hide Members property set to Hide Blank Members so that the hierarchy is displayed properly in client tools.
Fig 5.40 Flattening the Supervision hierarchy in Analysis Services Tabular
Summary
This chapter explores the logical design of relational data warehouses, examining various schema alternatives such as star, snowflake, starflake, and constellation schemas. It emphasizes the translation of conceptual multidimensional schemas into logical schemas, with a focus on the representation of hierarchies and the management of slowly changing dimensions. The implementation of OLAP operations in the relational model using SQL is also discussed, highlighting advanced SQL features that support OLAP queries. The chapter concludes with a practical example of implementing the Northwind data warehouse in Microsoft Analysis Services, using both the multidimensional and the tabular models.
Bibliographic Notes
For a comprehensive understanding of data warehouse modeling, refer to the book by Kimball and Ross, which specifically addresses slowly changing dimensions. Hierarchies in data warehousing are explored by Jagadish et al., while complex hierarchies are analyzed in various studies. The classic paper by Lenz and Shoshani delves into the issue of summarizability, with additional insights found in related works. Efforts to establish normal forms for multidimensional databases have been inspired by Codd's relational model. SQL, analytics, and OLAP are thoroughly discussed in several key texts, and numerous books on Analysis Services provide detailed insights into its functionalities for both the multidimensional and tabular models. Finally, studies on data aggregation at different temporal granularities are highlighted in the literature.
Review Questions
5.1 Describe the differences between relational OLAP (ROLAP), multidimensional OLAP (MOLAP), and hybrid OLAP (HOLAP).
5.2 Describe the differences between star schemas, snowflake schemas, starflake schemas, and constellation schemas.
5.3 Discuss the mapping rules for translating a MultiDim schema into a relational schema. Are these rules similar to those used for translating an ER schema into a relational schema?
5.4 Explain how a balanced hierarchy can be mapped into either normalized or denormalized tables. Discuss the advantages and disadvantages of these alternative mappings.
5.5 How do you transform at the logical level an unbalanced hierarchy into a balanced one?
5.6 Describe different approaches for representing generalized hierarchies at the logical level.
5.7 Is it possible to distinguish between generalized, alternative, and parallel dependent hierarchies at the logical level?
5.8 Explain how a nonstrict hierarchy can be represented in the relational model.
5.9 Describe with an example the various types of slowly changing dimensions. Analyze and discuss the pros and cons of each type.
5.10 Define the kinds of SQL/OLAP window functions: partitioning, window ordering, and window framing. Write, in English, queries of each class over the Northwind data warehouse.
5.11 Identify the kind of hierarchies that can be directly represented in Analysis Services Multidimensional and in Analysis Services Tabular.
5.12 Discuss how snowflake schemas are represented in Analysis Services Multidimensional and in Analysis Services Tabular.
Exercises
Exercise 5.1. Consider the data warehouse of the telephone provider given in Ex. 3.1. Draw a star schema diagram for the data warehouse.
Exercise 5.2. For the star schema obtained in the previous exercise, write in SQL the queries given in Ex. 3.1.
Exercise 5.3. Consider the data warehouse of the train application given in Ex. 3.2. Draw a snowflake schema diagram for the data warehouse with hierarchies for the train and station dimensions.
Exercise 5.4. For the snowflake schema obtained in the previous exercise, write in SQL the queries given in Ex. 3.2.
Exercise 5.5. Consider the university data warehouse described in Ex. 3.3. Draw a constellation schema for the data warehouse taking into account the different granularities of the time dimension.
Exercise 5.6. For the constellation schema obtained in the previous exercise, write in SQL the queries given in Ex. 3.3.
Exercise 5.7. Translate the MultiDim schema obtained for the French horse race application in Ex. 4.5 into the relational model.
Exercise 5.8. Translate the MultiDim schema obtained for the Formula One application in Ex. 4.7 into the relational model.
Exercise 5.9. Implement in Analysis Services a multidimensional model for the Foodmart data warehouse given in Fig 5.41.
Exercise 5.10. Implement in Analysis Services a tabular model for the Foodmart data warehouse given in Fig 5.41.
Exercise 5.11. The Research and Innovative Technology Administration (RITA) coordinates the US Department of Transportation's research programs and gathers extensive statistics on various transportation modes, including monthly data about flight segments between airports. Users can download annual ZIP files containing CSV data from 1990 to the present, covering, among other things, scheduled and actual flight departures, seats sold, freight transported, and distances traveled. The website provides a description of all the data fields.
Construct an appropriate data warehouse schema for the above application. Analyze the input data and motivate the choice of your schema.
1 http://www.transtats.bts.gov/
StoreID StoreType RegionID StoreName StoreNumber StoreStreetAddress StoreCity
StoreState StorePostalCode StoreCountry StoreManager StorePhone StoreFax FirstOpenedDate LastRemodelDate LeaseSqft StoreSqft GrocerySqft FrozenSqft MeatSqft CoffeeBar VideoStore SaladBar PreparedFood Florist
PromotionID PromotionDistrictID PromotionName MediaType Cost StartDate EndDate
ProductID ProductClassID BrandName ProductName SKU SRP GrossWeight NetWeight RecyclablePackage LowFat
UnitsPerCase CasesPerPallet ShelfWidth ShelfHeight ShelfDepth
ProductClassID ProductSubcategory ProductCategory ProductDepartment ProductFamily
RegionID SalesCity SalesStateProvince SalesDistrict SalesRegion SalesCountry SalesDistrictID
SalesID ProductID DateID CustomerID PromotionID StoreID StoreSales StoreCost UnitSales
Fig 5.41 Relational schema of the Foodmart data warehouse
Data Analysis in Data Warehouses
This chapter emphasizes the importance of data analysis within data warehousing, highlighting how the utilization of collected data can enhance decision-making processes.
This article introduces two key languages for defining and querying data warehouses: MDX (MultiDimensional eXpressions) and DAX (Data Analysis eXpressions) MDX, developed by Microsoft, is essential for multidimensional data analysis.
In 1997, the introduction of OLAP tools led to the widespread adoption of a de facto standard, particularly with Microsoft's Analysis Services and MDX However, users found multidimensional cubes challenging to comprehend and use effectively To address these concerns, Microsoft launched the tabular model and its associated DAX language in 2012, which have since gained significant popularity among users.
From the user's perspective, the tabular model offers a simpler conceptual framework compared to the multidimensional model, making it easier for design, analysis, and reporting However, both models, along with their respective languages MDX and DAX, will continue to coexist in the business intelligence landscape, as they cater to different application requirements This chapter provides an introduction to both languages, with MDX discussed in Section 6.1 and DAX in Section 6.2.
We continue this chapter describing two essential tools for data analysis.
In Section 6.3, we explore key performance indicators (KPIs), which are quantifiable metrics that assess the effectiveness of an organization in meeting its primary goals Section 6.4 delves into the use of dashboards for presenting KPIs and other critical organizational data, enabling managers to make timely and informed decisions.
Chapter 7 will apply the discussed topics to the Northwind case study, highlighting their relevance This chapter is not meant to provide an exhaustive overview, as there are numerous books dedicated to each subject At the conclusion, we will reference popular literature in these fields for further exploration.
159 © Springer-Verlag GmbH Germany, part of Springer Nature 2022
A Vaisman, E Zimányi, Data Warehouse Systems, Data-Centric Systems and Applications, https://doi.org/10.1007/978-3-662-65167-4_6
Introduction to MDX
Tuples and Sets
Two fundamental concepts in MDX are tuples and sets We illustrate them using the example cube given in Fig 6.1.
Fig 6.1 A simple three-dimensional cube with one measure
A tuple represents a specific cell within a multidimensional cube by specifying one member from one or more dimensions For instance, the cell located in the top left corner, which has a value of 21, reflects the sales of beverages in Paris during the first quarter To pinpoint this cell, it is sufficient to provide the coordinates for each dimension involved.
(Product.Category.Beverages, Date.Quarter.Q1, Customer.City.Paris)
In the expression provided, each dimension's coordinates are represented in the format Dimension.Level.Member This highlights that there are multiple methods to identify a member within a dimension in MDX.
In particular, the order of the members is not significant, and the previous tuple can also be stated as follows:
(Date.Quarter.Q1, Product.Category.Beverages, Customer.City.Paris)
Since a tuple points to a single cell, then it follows that each member in the tuple must belong to a different dimension.
On the other hand, a set is a collection of tuples defined using the same dimensions For example, the following set
{ (Product.Category.Beverages, Date.Quarter.Q1, Customer.City.Paris)
(Product.Category.Beverages, Date.Quarter.Q1, Customer.City.Lyon) } points to the previous cell with value 21 and the one behind it with value 12.
It is worth noting that a set may have one or even zero tuples.
A tuple does not need to specify a member from every dimension Thus, the tuple
The data for Customer.City.Paris highlights the segment of the cube that includes the sixteen front cells, specifically representing the sales figures for various product categories in Paris.
In the context of sales analysis, the combination of Customer.City.Paris and Product.Category.Beverages highlights the sales of beverages specifically in Paris When a specific member for a dimension is not indicated, the default member is assumed, usually represented by the All member that aggregates the total values for that dimension It is important to note that the default member may also refer to the current member within the scope of a query.
In a data cube, tuples interact with hierarchies, particularly in the Customer dimension, which comprises levels such as Customer, City, State, and Country Understanding this relationship is essential for effective data analysis and representation.
In the first quarter, total beverage sales in France are represented by the member France at the country level, highlighting the sales performance for this category within the region.
In MDX, measures function similarly to dimensions, allowing for the inclusion of multiple metrics within a cube For instance, if a cube contains the measures UnitPrice, Discount, and SalesAmount, the Measures dimension will encompass these three members Consequently, we can easily specify the desired measure using a tuple format.
(Customer.Country.France, Product.Category.Beverages, Date.Quarter.Q1,
If a measure is not specified, then adefault measure will be used.
Basic Queries
The syntax of a typical MDX query is as follows:
In an MDX query, the axis specification outlines the axes and selected members, allowing for up to 128 axes Each axis is numbered, with 0 representing the x-axis (COLUMNS), 1 for the y-axis (ROWS), 2 for the z-axis, and so forth The initial axes have predefined names, including COLUMNS, ROWS, PAGES, CHAPTERS, and SECTIONS Axes can also be referenced using the AXIS(number) format, where AXIS(0) corresponds to COLUMNS and AXIS(1) to ROWS Importantly, query axes cannot be skipped; a query must include lower-numbered axes if it includes higher-numbered ones, meaning a ROWS axis cannot exist without a COLUMNS axis.
The slicer specification in the WHERE clause is optional; if it is not included, the query will default to the cube's standard measure However, most queries include a slicer specification unless the goal is to display the Measures dimension.
The simplest form of an axis specification consists in taking the members of the required dimension, including those of theMeasuresdimension, as follows:
SELECT [Measures].MEMBERS ON COLUMNS,
[Customer].[Country].MEMBERS ON ROWS
This query presents a summary of all customer measures at the country level In MDX, square brackets are generally optional, except when dealing with names that contain spaces, numbers, or MDX keywords, in which case they are necessary For clarity, we will exclude any unnecessary square brackets in the following examples.
The above query will show a row with only null values for countries that do not have customers The next query uses the NONEMPTY function to remove such values.
SELECT Measures.MEMBERS ON COLUMNS,
NONEMPTY(Customer.Country.MEMBERS) ON ROWS
Alternatively, the NON EMPTYkeyword can be used as shown next.
SELECT Measures.MEMBERS ON COLUMNS,
NON EMPTY Customer.Country.MEMBERS ON ROWS
While both the NONEMPTY function and the NON EMPTY keyword produce identical results in this instance, there are subtle distinctions between them that extend beyond this introductory overview of MDX.
The query presented reveals all the measures stored within the cube; however, derived measures like Net Amount, as defined in Section 5.9.1, do not show up in the results To include these derived measures, it is necessary to utilize the ALLMEMBERS keyword.
SELECT Measures.ALLMEMBERS ON COLUMNS,
Customer.Country.MEMBERS ON ROWS
TheADDCALCULATEDMEMBERSfunction can also be used for this purpose.
Slicing
Consider now the query below, which shows all measures by year.
SELECT Measures.MEMBERS ON COLUMNS,
[Order Date].Year.MEMBERS ON ROWS
To restrict the result to Belgium, we can write the following query.
SELECT Measures.MEMBERS ON COLUMNS,
[Order Date].Year.MEMBERS ON ROWS
The query is designed to retrieve all measure values across all years specifically for customers residing in Belgium, highlighting the distinct behavior of the WHERE clause compared to traditional SQL.
The WHERE clause can include multiple members from various hierarchies, allowing for more precise query restrictions For instance, the previous query can be refined to focus specifically on customers residing in Belgium who have purchased products from the beverages category.
SELECT Measures.MEMBERS ON COLUMNS,
[Order Date].Year.MEMBERS ON ROWS
WHERE (Customer.Country.Belgium, Product.Categories.Beverages)
To retrieve data for multiple members within the same hierarchy, it is essential to incorporate a set in the WHERE clause For instance, the query below displays the measures for all years concerning customers who purchased products in the beverage category and reside in Belgium or France.
SELECT Measures.MEMBERS ON COLUMNS,
[Order Date].Year.MEMBERS ON ROWS
WHERE ( { Customer.Country.Belgium, Customer.Country.France },
Using a set in theWHEREclause implicitly aggregates values for all members in the set In this case, the query shows aggregated values for Belgium and France in each cell.
Consider now the following query, which requests the sales amount of customers by country and by year.
SELECT [Order Date].Year.MEMBERS ON COLUMNS,
Customer.Country.MEMBERS ON ROWS
Here, we specified in the WHEREclause the measure to be displayed If no measure is stated, then the default measure is used.
The WHERE clause allows for the combination of measures and dimensions in queries, enabling users to filter results effectively; for instance, a query can be constructed to display figures exclusively for the beverages category, yielding results similar to previous outputs.
SELECT [Order Date].Year.MEMBERS ON COLUMNS,
Customer.Country.MEMBERS ON ROWS
WHERE (Measures.[Sales Amount], Product.Category.[Beverages])
When a dimension is present in a slicer, it cannot be utilized in any axis within the SELECT clause However, the FILTER function can later be employed to filter members of dimensions that are displayed on an axis.
Navigation
The query results include aggregated values for all years, encompassing the All column To display only the individual year values without the All member, you can utilize the CHILDREN function, as demonstrated in the following SQL command: SELECT [Order Date].Year.CHILDREN ON COLUMNS,
The attentive reader may wonder why the memberAlldoes not appear in the rows of the above result The reason is that the expression
Customer.Country.MEMBERS we used in the query is a shorthand notation for
The Customer dimension includes a geographic hierarchy that allows for the selection of members at the Country level The All member, which represents the highest level of the hierarchy, is situated above the Continent level and does not belong to the Country level, hence it is excluded from the results Each attribute within a dimension establishes its own attribute hierarchy, ensuring that an All member is present in every hierarchy.
6.1 Introduction to MDX 165 dimension, for both the user-defined hierarchies and the attribute hierarchies. Since the dimension Customer has an attribute hierarchy Company Name, if in the above query we use the expression
Customer.[Company Name].MEMBERS the result will contain the All member, in addition to the names of all the customers Using CHILDRENinstead will not show theAllmember.
It is also possible to select a single member or an enumeration of members of a dimension An example is given in the following query.
SELECT [Order Date].Year.MEMBERS ON COLUMNS,
{ Customer.Country.France,Customer.Country.Italy } ON ROWS
The query analyzes customer sales amounts by year for France and Italy It can be expressed using various terms such as Customer.France or Customer.Geography.Country.France, which utilize fully qualified names that specify the dimension, hierarchy, and level of the member While unique member names can suffice without full qualification, employing them is advisable to eliminate any potential ambiguities.
The functionCHILDRENmay be used to retrieve the states of the countries above as follows:
SELECT [Order Date].Year.MEMBERS ON COLUMNS,
{ Customer.France.CHILDREN, Customer.Italy.CHILDREN } ON ROWS FROM Sales
The MEMBERS and CHILDREN functions do not allow for drilling down into a hierarchy; however, the DESCENDANTS function can be utilized for this purpose.
SELECT [Order Date].Year.MEMBERS ON COLUMNS,
DESCENDANTS(Customer.Germany, Customer.City) ON ROWS
WHERE Measures.[Sales Amount] shows the sales amount for German cities
The DESCENDANTS function, by default, shows only the members at the level indicated by its second argument An optional third argument allows users to choose whether to include or exclude descendants or children before and after the specified level.
• SELF, which is the default, displays values for theCitylevel as above.
• BEFOREdisplays values from the state to theCountrylevels.
• SELF_AND_BEFOREdisplays values from theCityto theCountrylevels.
• AFTERdisplays values from theCustomerlevel, since it is only level after City.
• SELF_AND_AFTERdisplays values from theCityandCustomer levels.
• BEFORE_AND_AFTERdisplays values from theCountryto theCustomer levels, excluding the former.
• SELF_BEFORE_AFTERdisplays values from theCountryto theCustomer levels.
• LEAVESdisplays values from theCitylevel as above, since this is the only leaf level betweenCountryandCity On the other hand, ifLEAVESis used without specifying the level, as in the following query:
DESCENDANTS(Customer.Geography.Germany, ,LEAVES) then the leaf level, that is,Customer will be displayed.
The ASCENDANTS function retrieves a set that encompasses a specified member along with all its ancestors within a hierarchy For instance, a query can be executed to obtain the sales amount for the customer "Du monde entier" and all its ancestors in the Geography hierarchy, which includes levels such as City, State, Country, Continent, and All.
SELECT Measures.[Sales Amount] ON COLUMNS,
ASCENDANTS(Customer.Geography.[Du monde entier]) ON ROWS
The functionANCESTORcan be used to obtain the result for an ancestor at a specified level, as shown next.
SELECT Measures.[Sales Amount] ON COLUMNS,
ANCESTOR(Customer.Geography.[Du monde entier], Customer.Geography.State)
Cross Join
While MDX queries can showcase up to 128 axes, most OLAP tools are limited to displaying two-dimensional tables To effectively combine multiple dimensions into a single axis, the CROSSJOIN function is utilized For instance, to present sales amounts for product categories by country and quarter in a matrix format, it is essential to merge the customer and time dimensions into one axis.
SELECT Product.Category.MEMBERS ON COLUMNS,
[Order Date].Calendar.Quarter.MEMBERS) ON ROWS
Alternatively, we can use the cross join operator ‘*’ as shown next.
SELECT Product.Category.MEMBERS ON COLUMNS,
[Order Date].Calendar.Quarter.MEMBERS ON ROWS
More than two cross joins can be applied, as shown in the following query.
SELECT Product.Category.MEMBERS ON COLUMNS,
Customer.Country.MEMBERS * [Order Date].Calendar.Quarter.MEMBERS * Shipper.[Company Name].MEMBERS ON ROWS
Subqueries
The WHERE clause is essential for slicing the cube, allowing users to select specific measures or dimensions for display For instance, a query can be utilized to retrieve the sales amount specifically for the beverages and condiments categories.
SELECT Measures.[Sales Amount] ON COLUMNS,
[Order Date].Calendar.Quarter.MEMBERS ON ROWS
WHERE { Product.Category.Beverages, Product.Category.Condiments }
Instead of using a slicer in theWHEREclause of above query, we can define a subquery in theFROM clause as follows:
SELECT Measures.[Sales Amount] ON COLUMNS,
[Order Date].Calendar.Quarter.MEMBERS ON ROWS
FROM ( SELECT { Product.Category.Beverages,
Product.Category.Condiments } ON COLUMNS FROM Sales )
This query retrieves the sales figures for each quarter, focusing exclusively on the beverages and condiments product categories Notably, unlike SQL, the outer query allows for the inclusion of attributes that are not selected in the subquery.
There is a key distinction between utilizing the WHERE clause and employing subqueries in SQL When the product category hierarchy is incorporated in the WHERE clause, it is restricted from appearing on any axis; however, this limitation does not apply when using a subquery, as demonstrated in the following example.
SELECT Measures.[Sales Amount] ON COLUMNS,
[Order Date].Calendar.Quarter.MEMBERS * Product.Category.MEMBERS
FROM ( SELECT { Product.Category.Beverages, Product.Category.Condiments }
The subquery may include more than one dimension, as shown next.
SELECT Measures.[Sales Amount] ON COLUMNS,
[Order Date].Calendar.Quarter.MEMBERS * Product.Category.MEMBERS
FROM ( SELECT ( { Product.Category.Beverages, Product.Category.Condiments },
{ [Order Date].Calendar.[Q1 2017], [Order Date].Calendar.[Q2 2017] } ) ON COLUMNS FROM Sales )
Subquery expressions can be nested to perform intricate multistep filtering operations For example, a query can be constructed to retrieve the quarterly sales figures for the top two countries in the beverages and condiments product categories.
SELECT Measures.[Sales Amount] ON COLUMNS,
[Order Date].Calendar.[Quarter].Members ON ROWS
FROM ( SELECT TOPCOUNT(Customer.Country.MEMBERS, 2,
Measures.[Sales Amount]) ON COLUMNS FROM ( SELECT { Product.Category.Beverages,
Product.Category.Condiments } ON COLUMNS FROM Sales ) )
The TOPCOUNT function sorts a dataset in descending order based on a specified expression and retrieves a defined number of elements with the highest values While a single nesting could have been employed, the above expression offers greater clarity and ease of understanding.
Calculated Members and Named Sets
Calculated members in a dimension are newly defined members or measures created using the WITH clause preceding the SELECT statement For example, a calculated member can be defined as: WITH MEMBER Parent.MemberName AS h expression i, where Parent refers to the parent of the calculated member and MemberName is its designated name Likewise, named sets can be established using the format: WITH SET SetName AS h expression i to define new sets.
Calculated members and named sets defined with the WITH clause are limited to the scope of a single query To extend their visibility across an entire session or within a cube, a CREATE statement is required This article will primarily focus on examples of calculated members and named sets created within queries Since these elements are computed at runtime, they do not incur any processing penalties for the cube or affect the number of stored aggregations.
Calculated members are frequently utilized to create new measures that connect existing ones For instance, a measure for calculating the percentage profit from sales can be established effectively.
WITH MEMBER Measures.[Profit%] AS
(Measures.[Sales Amount] - Measures.[Freight]) / (Measures.[Sales Amount]), FORMAT_STRING = '#0.00%'
SELECT { [Sales Amount], Freight, [Profit%] } ON COLUMNS,
The FORMAT_STRING defines the display format for the new calculated member In the format expression, a ‘#’ represents a digit or nothing, whereas a ‘0’ indicates a digit or a zero Additionally, the percent symbol ‘%’ denotes that the calculation yields a percentage, which involves multiplying by a factor of 100.
We can also create a calculated member in a dimension, as shown next.
WITH MEMBER Product.Categories.[All].[Meat & Fish] AS
Product.Categories.[Meat/Poultry] + Product.Categories.[Seafood]
SELECT { Measures.[Unit Price], Measures.Quantity, Measures.Discount,
Measures.[Sales Amount] } ON COLUMNS,
The query generates a calculated member that represents the total of the Meat/Poultry and Seafood categories As a child of the All member within the hierarchy of the Product dimension, this calculated member is classified at the Category level.
In the following query, we define a named set Nordic Countriescomposed of the countries Denmark, Finland, Norway, and Sweden.
WITH SET [Nordic Countries] AS
{ Customer.Country.Denmark, Customer.Country.Finland,
Customer.Country.Norway, Customer.Country.Sweden }
SELECT Measures.MEMBERS ON COLUMNS,
A named set can be categorized as either static or dynamic A static named set is defined by explicitly listing its members, ensuring that its results remain unchanged despite updates to the cube or session scope In contrast, a dynamic named set is reevaluated whenever there are modifications to the scope For instance, a dynamic named set can be illustrated through a query that retrieves various measures for the top five selling products.
TOPCOUNT ( Product.Categories.Product.MEMBERS, 5,
SELECT { Measures.[Unit Price], Measures.Quantity, Measures.Discount,
Measures.[Sales Amount] } ON COLUMNS,
Relative Navigation
In hierarchical data analysis, it's essential to compare the value of one member against others within the hierarchy MDX offers various methods for navigating these hierarchies, with the most frequently used being PREVMEMBER, NEXTMEMBER, CURRENTMEMBER, PARENT, FIRSTCHILD, and LASTCHILD For instance, to calculate a member's sales within the Geography hierarchy as a percentage of its parent's sales, a specific query can be employed.
WITH MEMBER Measures.[Percentage Sales] AS
(Measures.[Sales Amount], Customer.Geography.CURRENTMEMBER) / (Measures.[Sales Amount], Customer.Geography.CURRENTMEMBER.PARENT), FORMAT_STRING = '#0.00%'
SELECT { Measures.[Sales Amount], Measures.[Percentage Sales] } ON COLUMNS,
NON EMPTY DESCENDANTS(Customer.Europe, Customer.Country,
SELF_AND_BEFORE) ON ROWS
The CURRENTMEMBER function within the WITH clause retrieves the current member along a dimension during an iteration, while the PARENT function identifies the parent of a member In the SELECT clause, measures specific to European countries are presented The expression used to define the calculated member can be simplified accordingly.
(Measures.[Sales Amount]) / (Measures.[Sales Amount],
The calculated measure for the Geography hierarchy functions effectively for all members at various levels, except for the All member, which lacks a parent To address this issue, a conditional expression must be incorporated into the measure's definition.
WITH MEMBER Measures.[Percentage Sales] AS
(Measures.[Sales Amount]) / (Measures.[Sales Amount],
The IIF function consists of three parameters: a Boolean condition, a value for when the condition is true, and a value for when it is false In scenarios where the All member lacks a parent, the sales amount for its parent defaults to 0, resulting in a percentage sales value of 1 The GENERATE function iterates through a set of members, utilizing a second set as a template for the output For instance, to show the sales amount by category for customers in Belgium and France, it is essential to structure the data appropriately.
6.1 Introduction to MDX 171 enumerating in the query all customers for each country, the GENERATE function can be used as follows:
SELECT Product.Category.MEMBERS ON COLUMNS,
GENERATE({Customer.Belgium, Customer.France},
DESCENDANTS(Customer.Geography.CURRENTMEMBER, Customer))
The PREVMEMBER function is utilized to illustrate growth over a specified time frame The subsequent query reveals the net amount along with the incremental change from the previous time member for each month in 2016.
WITH MEMBER Measures.[Net Amount Growth] AS
(Measures.[Net Amount], [Order Date].Calendar.PREVMEMBER),
SELECT { Measures.[Net Amount], Measures.[Net Amount Growth] } ON COLUMNS,
DESCENDANTS([Order Date].Calendar.[2016], [Order Date].Calendar.[Month])
The expression above specifies two formats: one for positive numbers and another for negative numbers By utilizing NEXTMEMBER, it displays the net amount for each month in relation to the subsequent month As the Northwind cube's sales data begins in July 2016, the growth for the initial month cannot be assessed, resulting in it being equivalent to the net amount Consequently, a value of zero is applied for the previous period that falls outside the cube's range.
Instead of using the PREVMEMBER function, the LAG(n) function can be utilized to retrieve a member located a specified number of positions before a specific member in the member dimension When a negative number is provided, it returns a subsequent member; if zero is given, it returns the current member Consequently, the functions PREV, NEXT, and CURRENT can be substituted with LAG(1), LAG(-1), and LAG(0), respectively Additionally, there is a similar function called LEAD, where LAG(n) is equivalent to LEAD(-n).
Time-Related Calculations
Time period analysis plays a crucial role in business intelligence applications, allowing businesses to compare sales data across different time frames, such as monthly or quarterly figures from the current year to those of the previous year MDX offers a robust array of time series functions specifically designed for this type of analysis Although these functions are primarily utilized with a time dimension, they can also be effectively applied to other dimensions, enhancing their versatility in data analysis.
The PARALLELPERIOD function allows users to compare values of a specified member with those of the same relative position in a prior period, such as comparing quarterly data to the same quarter from the previous year While the PREVMEMBER function calculates growth in relation to the previous month, the PARALLELPERIOD function is ideal for assessing growth compared to the same period in the prior year.
WITH MEMBER Measures.[Previous Year] AS
(Measures.[Net Amount], PARALLELPERIOD([Order Date].Calendar.Quarter, 4)), FORMAT_STRING = '$###,##0.00'
MEMBER Measures.[Net Amount Growth] AS
Measures.[Net Amount] - Measures.[Previous Year],
SELECT { [Net Amount], [Previous Year], [Net Amount Growth] } ON COLUMNS,
[Order Date].Calendar.Quarter ON ROWS
The PARALLELPERIOD function retrieves data from the member corresponding to the same quarter from the previous year Given that the Northwind cube's sales data begins in July 2016, the query will return null values for the Previous Year measure for the initial four quarters, while the measures for Net Amount and Net Amount Growth will display consistent values.
The OPENINGPERIOD and CLOSINGPERIOD functions identify the first and last siblings among a member's descendants at a specific level For instance, to calculate the difference in sales quantity between a month and the opening month of the quarter, these functions can be utilized effectively.
WITH MEMBER Measures.[Quantity Difference] AS
OPENINGPERIOD([Order Date].Calendar.Month,
[Order Date].Calendar.CURRENTMEMBER.PARENT))
SELECT { Measures.[Quantity], Measures.[Quantity Difference] } ON COLUMNS,
[Order Date].Calendar.[Month] ON ROWS
In deriving the calculated member Quantity Difference, the opening period at the month level is taken for the quarter to which the month corresponds.
If CLOSINGPERIODis used instead, the query will show sales based on the final month of the specified season,
The PERIODSTODATE function generates a series of periods from a chosen level, beginning with the initial period and concluding with a designated member For instance, the expression can be used to create a set that includes all months leading up to and including June 2017.
PERIODSTODATE([Order Date].Calendar.Year, [Order Date].Calendar.[June 2017])
To create a calculated member that shows year-to-date data, such as monthly year-to-date sales, we must utilize both the PERIODSTODATE function and the SUM function.
6.1 Introduction to MDX 173 which returns the sum of a numeric expression evaluated over a set For example, the sum of sales amount for Italy and Greece can be displayed with the following expression.
SUM({Customer.Country.Italy, Customer.Country.Greece}, Measures.[Sales Amount])
In the expression below, the measure to be displayed is the sum of the current time member over the year level.
SUM(PERIODSTODATE([Order Date].Calendar.Year,
[Order Date].Calendar.CURRENTMEMBER), Measures.[Sales Amount])
Similarly, by replacingYearbyQuarterin the above expression we can obtain quarter-to-date sales For example, the following query shows year-to-date and quarter-to-date sales.
WITH MEMBER Measures.YTDSales AS
SUM(PERIODSTODATE([Order Date].Calendar.Year,
[Order Date].Calendar.CURRENTMEMBER), Measures.[Sales Amount])
SUM(PERIODSTODATE([Order Date].Calendar.Quarter,
[Order Date].Calendar.CURRENTMEMBER), Measures.[Sales Amount])
SELECT { Measures.[Sales Amount], Measures.YTDSales, Measures.QTDSales }
ON COLUMNS, [Order Date].Calendar.Month.MEMBERS ON ROWS
The Northwind data warehouse includes sales data beginning in July 2016 For August 2016, both YTDSales and QTDSales are calculated by summing the Sales Amount for July and August In contrast, YTDSales for December 2016 is derived from the total Sales Amount from July through December, while QTDSales for December 2016 reflects the sum of Sales Amount from October to December.
The xTD functions, including YTD (Year-to-Date), QTD (Quarter-to-Date), MTD (Month-to-Date), and WTD (Week-to-Date), are specifically designed for time dimensions Unlike other functions, these are applicable only to date-related data Each xTD function corresponds to a specific period level, with YTD representing the yearly level, QTD for quarterly, and so forth For instance, the measure YTDSales can be expressed using the xTD functions to yield equivalent results.
SUM(YTD([Order Date].Calendar.CURRENTMEMBER), Measures.[Sales Amount])
Moving averages are essential tools for addressing common business challenges, particularly in analyzing time series data like financial indicators and stock market trends They help to smooth out rapid fluctuations in data, making it easier to identify overarching trends However, selecting the appropriate smoothing period is crucial; a period that is too long can result in a flat average that obscures trends, while a period that is too short may reveal excessive peaks and troughs, hindering the ability to discern significant patterns.
The LAG function, when used with the range operator ':', enables the calculation of moving averages in MDX This operator generates a set of members that includes two specified members and all members in between For instance, to calculate the 3-month moving average of order numbers, one can utilize a specific query.
WITH MEMBER Measures.MovAvg3M AS
AVG([Order Date].Calendar.CURRENTMEMBER.LAG(2):
Measures.[Order No]), FORMAT_STRING = '###,##0.00'
SELECT { Measures.[Order No], Measures.MovAvg3M } ON COLUMNS,
[Order Date].Calendar.Month.MEMBERS ON ROWS
The AVG function returns the average of an expression evaluated over a set The LAG(2) function obtains the month preceding the current one by
The range operator calculates the average number of orders over a three-month period For the Northwind cube, which contains sales data starting from July 2016, the average for July is based solely on the number of orders for that month due to the absence of prior data In contrast, the average for August 2016 is derived from the data of both July and August From September 2016 onward, the average will be determined using the current month along with the two preceding months.
Filtering
Filtering is a process used to limit the display of axis members, distinguishing it from slicing, which is defined in the WHERE clause While filtering reduces the number of visible axis members, slicing affects the values associated with those members without altering their selection.
In data analysis, the NON EMPTY clause is commonly used to remove members of an axis with no values Additionally, the FILTER function allows for filtering a dataset based on specific conditions For instance, if we want to display the sales amount for 2017 categorized by city and product, we can apply a filter to focus on top-performing cities, specifically those with sales exceeding $25,000.
SELECT Product.Category.MEMBERS ON COLUMNS,
FILTER(Customer.City.MEMBERS, (Measures.[Sales Amount],
[Order Date].Calendar.[2017]) > 25000) ON ROWS
WHERE (Measures.[Sales Amount], [Order Date].Calendar.[2017])
As another example, the following query shows customers that in 2017 had profit margin below the one of their city.
WITH MEMBER Measures.[Profit%] AS
(Measures.[Sales Amount] - Measures.[Freight]) / (Measures.[Sales Amount]), FORMAT_STRING = '#0.00%'
MEMBER Measures.[Profit%City] AS
(Measures.[Profit%], Customer.Geography.CURRENTMEMBER.PARENT), FORMAT_STRING = '#0.00%'
SELECT { Measures.[Sales Amount], Measures.[Freight], Measures.[Net Amount],
Measures.[Profit%], Measures.[Profit%City] } ON COLUMNS,
FILTER(NONEMPTY(Customer.Customer.MEMBERS),
(Measures.[Profit%]) < (Measures.[Profit%City])) ON ROWS
Profit% calculates the profit percentage for the current member, while Profit%City extends this calculation to the parent entity, reflecting the profit of the state associated with the city.
Sorting
In cube queries, all the members in a dimension have a hierarchical order. For example, consider the query below:
SELECT Measures.MEMBERS ON COLUMNS,
Customer.Geography.Country.MEMBERS ON ROWS
The countries are organized hierarchically based on their continent, starting with European countries followed by North American countries To sort the countries alphabetically by name, the ORDER function can be utilized, as outlined in the following syntax.
ORDER(Set, Expression [, ASC | DESC | BASC | BDESC])
The expression can be either numeric or a string, with the default sort order set to ASC The 'B' prefix signifies that the hierarchical order may be disregarded In hierarchical sorting, members are first organized based on their hierarchy position, followed by sorting within each level Conversely, non-hierarchical sorting arranges members independently of the hierarchy For instance, countries can be listed in a non-hierarchical manner, allowing for flexible ordering.
SELECT Measures.MEMBERS ON COLUMNS,
ORDER(Customer.Geography.Country.MEMBERS,
Customer.Geography.CURRENTMEMBER.NAME, BASC) ON ROWS
The property NAME retrieves the name of a level, dimension, member, or hierarchy, while the UNIQUENAME property provides the corresponding unique name This query will display countries in alphabetical order, including Argentina, Australia, and Austria.
It is often the case that the ordering is based on an actual measure To sort the query above based on the sales amount, we can proceed as follows:
SELECT Measures.MEMBERS ON COLUMNS,
ORDER(Customer.Geography.Country.MEMBERS,
Measures.[Sales Amount], BDESC) ON ROWS
Sorting data based on multiple criteria in MDX can be challenging, as it differs from SQL where the ORDER function permits a single sorting expression For example, if we want to analyze sales amounts categorized by continent and category, and we wish to sort the results first by continent name and then by category name, we must utilize the GENERATE function to achieve this.
SELECT Measures.[Sales Amount] ON COLUMNS,
ORDER( Customer.Geography.Continent.ALLMEMBERS,
Customer.Geography.CURRENTMEMBER.NAME, BASC ),
Product.Categories.CURRENTMEMBER.NAME, BASC ) ) ON ROWS
The GENERATE function first sorts continents alphabetically in ascending order Then, it performs a cross join with the categories, also sorted in ascending order by name, to combine the data effectively.
Top and Bottom Analysis
To identify the top three best-selling cities based on sales amount, the HEAD function can be utilized to retrieve the initial members of the dataset Conversely, the TAIL function allows for the selection of a subset from the end of the dataset Thus, the query for "Top three best-selling store cities" can be effectively executed using these functions.
SELECT Measures.MEMBERS ON COLUMNS,
HEAD(ORDER(Customer.Geography.City.MEMBERS,
Measures.[Sales Amount], BDESC), 3) ON ROWS
The functionTOPCOUNT can also be used to answer the previous query.
SELECT Measures.MEMBERS ON COLUMNS,
TOPCOUNT(Customer.Geography.City.MEMBERS, 3,
Measures.[Sales Amount]) ON ROWS
To showcase the top three cities by sales amount along with their total sales, we can also include the combined sales of all other cities.
WITH SET SetTop3Cities AS TOPCOUNT(
Customer.Geography.City.MEMBERS, 3, [Sales Amount])
MEMBER Customer.Geography.[Top 3 Cities] AS
MEMBER Customer.Geography.[Other Cities] AS
(Customer.[All]) - (Customer.[Top 3 Cities])
SELECT Measures.MEMBERS ON COLUMNS,
{ SetTop3Cities, [Top 3 Cities], [Other Cities], Customer.[All] } ON ROWS FROM Sales
The process begins by identifying the three best-selling cities, referred to as SetTop3Cities Two new members are then added to the Geography hierarchy: Top 3 Cities, which aggregates the measures from SetTop3Cities, and Other Cities, representing the difference between the overall customer measures and those of Top 3 Cities The AGGREGATE function is employed to calculate each measure, utilizing the average for Unit Price and Discount, while summing the other measures.
The TOPPERCENT and TOPSUM functions are essential for top filter processing, as they retrieve the highest elements based on a specified percentage or value For instance, a query can be used to display cities where their sales account for 30% of total sales.
SELECT Measures.[Sales Amount] ON COLUMNS,
{ TOPPERCENT(Customer.Geography.City.MEMBERS, 30,
Measures.[Sales Amount]), Customer.Geography.[All] } ON ROWS
The BOTTOM functions provide a way to retrieve the lowest items from a list, such as using the BOTTOMSUM function to identify cities with cumulative sales under $10,000.
Aggregation Functions
MDX offers a variety of aggregation functions, including SUM and AVG, as well as MEDIAN, MAX, MIN, VAR, and STDDEV, which calculate the median, maximum, minimum, variance, and standard deviation of numeric values in a dataset For instance, a query can be executed to analyze product categories, revealing total, maximum, minimum, and average sales amounts over a one-month period in 2017.
WITH MEMBER Measures.[Maximum Sales] AS
MAX(DESCENDANTS([Order Date].Calendar.Year.[2017],
[Order Date].Calendar.Month), Measures.[Sales Amount])
MEMBER Measures.[Minimum Sales] AS
MIN(DESCENDANTS([Order Date].Calendar.Year.[2017],
[Order Date].Calendar.Month), Measures.[Sales Amount])
MEMBER Measures.[Average Sales] AS
AVG(DESCENDANTS([Order Date].Calendar.Year.[2017],
[Order Date].Calendar.Month), Measures.[Sales Amount])
SELECT { [Sales Amount], [Maximum Sales], [Minimum Sales], [Average Sales] }
Product.Categories.Category.MEMBERS ON ROWS
Our next query computes the maximum sales by category as well as the month in which they occurred.
WITH MEMBER Measures.[Maximum Sales] AS
MAX(DESCENDANTS([Order Date].Calendar.Year.[2017],
[Order Date].Calendar.Month), Measures.[Sales Amount])
MEMBER Measures.[Maximum Period] AS
TOPCOUNT(DESCENDANTS([Order Date].Calendar.Year.[2017],
[Order Date].Calendar.Month), 1, Measures.[Sales Amount]).ITEM(0).NAME SELECT { [Maximum Sales], [Maximum Period] } ON COLUMNS,
Product.Categories.Category.MEMBERS ON ROWS
The TOPCOUNT function identifies the tuple with the highest sales amount, while the ITEM function extracts the first member from that tuple, and the NAME function retrieves the name of that member The COUNT function is utilized to tally the number of tuples in a dataset, with an optional parameter that allows users to include or exclude empty cells For instance, it can determine the number of customers who purchased a specific product category by counting tuples that combine sales amounts with customer names Excluding empty cells is crucial to ensure the count reflects only those customers who made purchases in the relevant product category.
WITH MEMBER Measures.[Customer Count] AS
[Customer].[Company Name].MEMBERS}, EXCLUDEEMPTY)
SELECT { Measures.[Sales Amount], Measures.[Customer Count] } ON COLUMNS,
Product.Category.MEMBERS ON ROWS
Introduction to DAX
Expressions
DAX is a strongly typed language that includes various data types such as integers, floating-point numbers, currency (a fixed decimal number with four digits of precision stored as an integer), datetimes, Boolean values, strings, and binary large objects (BLOBs) The type system in DAX determines the resulting type of an expression based on the components utilized within it.
Functions in DAX are essential for performing calculations within a data model, utilizing input arguments that can be either required or optional Upon execution, a function yields a value, which can be a single value or a table DAX encompasses a variety of functions for diverse calculations, including date and time, logical, statistical, mathematical, trigonometric, and financial functions Additionally, DAX features several types of operators such as arithmetic (+, -, *, /), comparison (=, , >), text concatenation (&), and logical operators (&& and ||) These operators are overloaded, meaning their behavior varies based on the provided arguments.
Expressions in data modeling are composed of various elements such as tables, columns, measures, functions, operators, and constants They play a crucial role in defining measures, calculated columns, calculated tables, and queries Specifically, an expression for a measure or calculated column must yield a scalar value, like a number or string, whereas an expression for a calculated table or query must produce a table This concept was previously illustrated through the use of DAX expressions in the context of the Northwind case study in Section 5.9.2.
In a data model, tables consist of both columns and measures, which can be referenced in expressions like 'Sales'[Quantity] The table name can be omitted if it doesn't start with a number, contain spaces, or use reserved words like Date or Sum Additionally, when the expression is in the same context as the table, the table name can also be skipped However, it is advisable to include the table name for column references, enhancing readability, while omitting it for measure references, such as [Sales Amount], due to their differing calculation semantics.
Measuresare used to aggregate values from multiple rows in a table An example is given next, which uses the SUMaggregate function.
Sales[Sales Amount] := SUM( Sales[SalesAmount] )
In DAX, a measure must be defined within a specific table, such as tableSales The provided expression serves as a simplified version of the iterator function SUMX, which takes two arguments: the table to be scanned and an expression that is evaluated for each row within that table.
Sales[Total Cost] := SUMX( Sales, Sales[Quantity] * Sales[UnitCost] )
A measure can reference other measures as shown next.
Sales[Margin] := [Sales Amount] - [Total Cost]
Sales[Margin %] := [Margin] / [Sales Amount]
Aggregation functions such as MIN, MAX, COUNT, AVERAGE, and STDDEV each have a corresponding iterator function with an "X" suffix Many of these functions take a table expression as their first argument, which can also include a table function For instance, the FILTER function processes rows from a table expression and returns a new table containing only the rows that meet the logical condition specified in its second argument.
SUMX ( FILTER ( Sales, Sales[UnitPrice] >= 10 ), Sales[Quantity] * Sales[UnitPrice] )
Calculated columns are derived by an expression and can be used like any other column An example is given next.
Employee[Age] = INT( YEARFRAC( Employee[BirthDate], TODAY() ) )
In DAX, functions like TODAY provide the current date, YEARFRAC calculates the year fraction between two dates, and INT rounds numbers down to the nearest integer Unlike SQL, where calculated columns are evaluated at query time and do not utilize memory, DAX computes these columns during database processing and stores them in the model This approach enhances user experience by allowing complex calculations to be performed at process time rather than at query time.
Variables can be used to avoid repeating the same subexpression The following example classifies customers according to their total sales.
VAR TotalSales = SUM( Sales[SalesAmount] )
Here, SWITCHreturns different results depending on the value of an expres- sion Variables can be of any scalar data type, including tables.
Calculated tables are created from expressions and are integrated into the model A common use case for a calculated table is to generate a Date table when it is absent from the data sources For instance, in the following example, we consider that the Sales table includes a column named OrderDate.
VAR MinYear = YEAR( MIN( Sales[OrderDate] ) )
VAR MaxYear = YEAR( MAX( Sales[OrderDate] ) )
YEAR( [Date] ) >= MinYear && YEAR( [Date] ) =
MINX( VALUES ( Customer[Country] ), [Sales Amount] ) ) )MEASURE Sales[Cumul Perc] = [Cumul Sales] / [Sales Amount]
Customer[Country], "Sales Amount", [Sales Amount],
"Perc Sales", [Perc Sales], "Cumul Sales", [Cumul Sales],
RANKX( Total, [Cumul Sales], , ASC ) 0 ) )
Employee[EmployeeName], 'Date'[Year], "Sales Amount", [Sales Amount],
"Avg Monthly Sales", [Avg Monthly Sales] )
ORDER BY [Full Name], [Year]
To calculate the Average Monthly Sales, we first utilize the SUMMARIZE function to determine monthly sales figures, applying the FILTER function to isolate months with sales activity We then count the eligible months using the COUNTROWS function Finally, the SUMMARIZECOLUMNS function aggregates these measures in the main query to present the results.
7.2 Querying the Tabular Model in DAX 215
Query 7.9 Total sales amount and discount amount per product and month.
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
SUMX( Sales, Sales[Quantity] * Sales[Discount] * Sales[UnitPrice] )
'Product'[ProductName], 'Date'[MonthName], 'Date'[Year], 'Date'[MonthNumber],
"Sales Amount", [Sales Amount], "Total Discount", [Total Discount] )
ORDER BY [ProductName], [Year], [MonthNumber]
We compute the Total Discount measure by multiplying the quantity, the discount, and the unit price This measure is then used in the SUMMA- RIZECOLUMNSfunction in the main query.
Query 7.10 Monthly year-to-date sales for each product category.
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[YTDSales] = TOTALYTD ( SUM( [SalesAmount] ), 'Date'[Date] ) EVALUATE
Product[CategoryName], 'Date'[MonthName], 'Date'[Year], 'Date'[MonthNumber],
"Sales Amount", [Sales Amount], "YTDSales", [YTDSales] )
ORDER BY [CategoryName], [Year], [MonthNumber]
In the measure YTDSales, we use theTOTALYTDfunction to aggregate the measureSales Amountfor all dates of the current year up to the current date.
Query 7.11 Moving average over the last 3 months of the sales amount by product category.
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
AVERAGEX( VALUES ( 'Date'[YearMonth] ), [Sales Amount] ),
DATESINPERIOD( 'Date'[Date], MAX( 'Date'[Date] ), -3, MONTH ) ) EVALUATE
Product[CategoryName], 'Date'[MonthName], 'Date'[Year], 'Date'[MonthNumber],
"Sales Amount", [Sales Amount], "MovAvg3M", [MovAvg3M] )
ORDER BY [CategoryName], [Year], [MonthNumber]
The DATESINPERIOD function is utilized to select all dates from the three months preceding the current date, providing essential context for the CALCULATE function Subsequently, the AVERAGEX function calculates the Sales Amount measure for each YearMonth value, averaging up to three values within the measure's context.
Query 7.12 Personal sales amount made by an employee compared with the total sales amount made by herself and her subordinates during 2017.
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
Employee[EmployeeName], FILTER( 'Date', 'Date'[Year] = 2017 ),
"Personal Amount", [Sales Amount], "Subordinates Amount", [Subordinates Sales] ) ORDER BY [EmployeeName]
In this analysis, we utilize the parent-child hierarchy of the Employee dimension, focusing on the derived column SupervisionPath, which contains a delimited text of all supervisors' keys for each employee The measure Subordinate Sales employs the SELECTEDVALUE function to retrieve the employee key based on the current filter context This key is then used with the PATHCONTAINS function to identify employees within the specified supervision hierarchy The main query aggregates these measures while filtering the results to the year 2017.
Query 7.13 Total sales amount, number of products, and sum of the quan- tities sold for each order.
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[NbProducts] = COUNTROWS( VALUES ( Sales[ProductKey] ) ) MEASURE Sales[Quantity] = SUM( Sales[Quantity] )
Sales[OrderNo], "Sales Amount", [Sales Amount], "NbProducts", [NbProducts],
This query focuses on the dimensionOrder, derived from the fact table Sales It utilizes the measure NbProducts, employing the COUNTROWS function to calculate the number of distinct products within each order Additionally, the measure Quantity aggregates the total quantity of products Ultimately, the main query presents the required measures for analysis.
Query 7.14 For each month, total number of orders, total sales amount, and average sales amount by order.
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[Nb Orders] = COUNTROWS( SUMMARIZE( Sales, Sales[OrderNo] ) )MEASURE Sales[AvgSales] = DIVIDE ( [Sales Amount], [Nb Orders] )
Querying the Relational Data Warehouse in SQL
'Date'[MonthName], 'Date'[Year], 'Date'[MonthNumber], "Nb Orders", [Nb Orders],
"Sales Amount", [Sales Amount], "AvgSales", [AvgSales] )
The measure Nb Orderscomputes the number of orders while the measure AvgSales divides the measureSales Amount by the previous measure.
Query 7.15 For each employee, total sales amount, number of cities, and number of states to which she is assigned.
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
CROSSFILTER( Territories[EmployeeKey], Employee[EmployeeKey], BOTH ) ) MEASURE Sales[NoStates] =
CALCULATE( COUNTROWS( VALUES( EmployeeGeography[State] ),
CROSSFILTER( Territories[EmployeeKey], Employee[EmployeeKey], BOTH ) ) ), EVALUATE
Employee[FirstName], Employee[LastName], "Sales Amount", [Sales Amount],
To effectively navigate the many-to-many relationship between employees and cities in the Territories table, it's essential to understand that the relationship between Territories and Employees is unidirectional, while the link between Territories and Employee Geography is bidirectional This distinction allows filters from Employee Geography to affect the Employee table, but not vice versa To accurately calculate the measures NoCities and NoStates, we utilize the CROSSFILTER function to create a bidirectional relationship during the query evaluation The NoCities measure employs the COUNTROWS function to tally the rows in the Territories table, while the NoStates measure counts distinct values in the State column Finally, in the main query, we aggregate results using the SUMMARIZECOLUMNS function and filter out any rows that contain only blanks, retaining only the total number of states.
7.3 Querying the Relational Data Warehouse in SQL
The Northwind data warehouse schema, illustrated in Fig 5.4, prompts a reevaluation of previous SQL queries This is crucial as many OLAP tools automatically convert MDX or DAX queries into SQL, subsequently forwarding them to a relational database server.
To simplify month-related computations, we create the next two views.
SELECT DISTINCT Year, MonthNumber, MonthName
LAG(Year * 100 + MonthNumber) OVER (ORDER BY Year * 100 + MonthNumber) AS PrevMonth
FROM YearMonth ) SELECT Year, MonthNumber, MonthName, PrevMonth / 100 AS PM_Year,
PrevMonth % 100 AS PM_MonthNumber FROM YearMonthPrevMonth
The YearMonth view encompasses all years and months from the Date dimension, while the YearMonthPM view links each year and month to the corresponding month of the previous year This is achieved using the LAG window function in the YearMonthPrevMonth table, which creates a numerical expression in the format YYYYMM to represent the previous month For instance, January 2017 is associated with the expression 201612 The main query subsequently separates this expression into distinct year and month values.
Query 7.1 Total sales amount per customer, year, and product category.
FROM Sales S, Customer C, Date D, Product P, Category A
WHERE S.CustomerKey = C.CustomerKey AND S.OrderDateKey = D.DateKey AND
S.ProductKey = P.ProductKey AND P.CategoryKey = A.CategoryKey GROUP BY C.CompanyName, D.Year, A.CategoryName
ORDER BY C.CompanyName, D.Year, A.CategoryName
Here, we join the fact table with the involved dimension tables and aggregate the results by company, year, and category.
Query 7.2 Yearly sales amount for each pair of customer country and sup- plier countries.
SELECT CO.CountryName AS Country, SO.CountryName AS Country,
D.Year, SUM(SalesAmount) AS SalesAmount
FROM Sales F, Customer C, City CC, State CS, Country CO,
Supplier S, City SC, State SS, Country SO, Date D
WHERE F.CustomerKey = C.CustomerKey AND C.CityKey = CC.CityKey AND
CC.StateKey = CS.StateKey AND CS.CountryKey = CO.CountryKey AND F.SupplierKey = S.SupplierKey AND S.CityKey = SC.CityKey AND
SC.StateKey = SS.StateKey AND SS.CountryKey = SO.CountryKey AND F.OrderDateKey = D.DateKey
GROUP BY CO.CountryName, SO.CountryName, D.Year
ORDER BY CO.CountryName, SO.CountryName, D.Year
7.3 Querying the Relational Data Warehouse in SQL 219
Here, the tables of the geography dimension are joined twice with the fact table to obtain the countries of the customer and the supplier.
Query 7.3 Monthly sales by customer state compared to those of the previ- ous year.
SELECT DISTINCT StateName, Year, MonthNumber, MonthName
FROM Customer C, City Y, State S, YearMonth M
WHERE C.CityKey = Y.CityKey AND Y.StateKey = S.StateKey ),
SUM(SalesAmount) AS SalesAmount FROM Sales F, Customer C, City Y, State S, Date D
C.CityKey = Y.CityKey AND Y.StateKey = S.StateKey AND F.OrderDateKey = D.DateKey
GROUP BY S.StateName, D.Year, D.MonthNumber )
SELECT S.StateName, S.MonthName, S.Year, M1.SalesAmount,
FROM StateMonth S LEFT OUTER JOIN SalesStateMonth M1 ON
LEFT OUTER JOIN SalesStateMonth M2 ON
WHERE M1.SalesAmount IS NOT NULL OR M2.SalesAmount IS NOT NULL
ORDER BY S.StateName, S.Year, S.MonthNumber
The query begins by creating a table called tableStateMonth, which generates the Cartesian product of all customer states and months from the viewYearMonth Next, the tableSalesStateMonth calculates the monthly sales figures for each state The main query then executes two left outer joins between tableStateMonth and tableSalesStateMonth to derive the final results.
Query 7.4 Monthly sales growth per product, that is, total sales per product compared to those of the previous month.
SELECT DISTINCT ProductName, Year, MonthNumber, MonthName,
PM_Year, PM_MonthNumber FROM Product, YearMonthPM ),
SUM(SalesAmount) AS SalesAmount FROM Sales S, Product P, Date D
WHERE S.ProductKey = P.ProductKey AND S.OrderDateKey = D.DateKey GROUP BY ProductName, Year, MonthNumber )
SELECT P.ProductName, P.MonthName, P.Year, S1.SalesAmount,
S2.SalesAmount AS SalesPrevMonth, COALESCE(S1.SalesAmount,0) -
FROM ProdYearMonthPM P LEFT OUTER JOIN SalesProdMonth S1 ON
P.MonthNumber = S1.MonthNumber LEFT OUTER JOIN SalesProdMonth S2
ORDER BY ProductName, P.Year, P.MonthNumber
The ProdYearMonthPM table generates a Cartesian product of the Product table and the YearMonthPM view Meanwhile, the SalesProdMonth table calculates monthly sales by product The primary query then executes two left outer joins between the ProdYearMonthPM and SalesProdMonth tables to derive the final results.
Query 7.5 Three best-selling employees.
SELECT TOP(3) E.FirstName + ' ' + E.LastName AS EmployeeName,
We group the sales data by employee and use the SUM function to aggregate the sales of each group. The results are then sorted in descending order of the total sales, and the TOP function retrieves the first three entries.
Query 7.6 Best-selling employee per product and year.
WITH SalesProdYearEmp AS (
   SELECT P.ProductName, D.Year, SUM(S.SalesAmount) AS SalesAmount,
          E.FirstName + ' ' + E.LastName AS EmployeeName
   FROM Sales S, Employee E, Date D, Product P
   WHERE S.EmployeeKey = E.EmployeeKey AND S.OrderDateKey = D.DateKey AND
         S.ProductKey = P.ProductKey
   GROUP BY P.ProductName, D.Year, E.FirstName, E.LastName )
SELECT ProductName, Year, EmployeeName AS TopEmployee, SalesAmount
FROM SalesProdYearEmp S1
WHERE S1.SalesAmount = (
   SELECT MAX(SalesAmount) FROM SalesProdYearEmp S2
   WHERE S1.ProductName = S2.ProductName AND S1.Year = S2.Year )
ORDER BY ProductName, Year
The table SalesProdYearEmp computes the yearly sales by product and employee. The main query then retrieves the entries whose total sales equal the highest total sales for the corresponding product and year.
Query 7.7 Countries that account for top 50% of the sales amount.
WITH TotalSales AS (
   SELECT SUM(SalesAmount) AS SalesAmount FROM Sales ),
SalesCountry AS (
   SELECT CountryName, SUM(SalesAmount) AS SalesAmount
   FROM Sales S, Customer C, City Y, State T, Country O
   WHERE S.CustomerKey = C.CustomerKey AND C.CityKey = Y.CityKey AND
         Y.StateKey = T.StateKey AND T.CountryKey = O.CountryKey
   GROUP BY CountryName ),
CumulSalesCountry AS (
   SELECT S.*, SUM(SalesAmount) OVER (ORDER BY SalesAmount DESC
          ROWS UNBOUNDED PRECEDING) AS CumulSales
   FROM SalesCountry S )
SELECT C.CountryName, C.SalesAmount, C.SalesAmount / T.SalesAmount AS
       PercSales, C.CumulSales, C.CumulSales / T.SalesAmount AS CumulPerc
FROM CumulSalesCountry C, TotalSales T
WHERE C.CumulSales <= (
   SELECT MIN(CumulSales) FROM CumulSalesCountry
   WHERE CumulSales >= (
      SELECT 0.5 * SUM(SalesAmount) FROM SalesCountry ) )
ORDER BY SalesAmount DESC
The SalesCountry table aggregates the sales by country, while the CumulSalesCountry table computes the cumulative sales by defining a window containing all rows sorted by descending sales amount; for each row, it sums the sales of the current row and of all preceding rows in the window. The main query then retrieves the countries of CumulSalesCountry whose cumulative sales amount is less than or equal to the minimum cumulative value that reaches 50% of the total sales.
Query 7.8 Total sales and average monthly sales by employee and year.
WITH MonthlySalesEmp AS (
   SELECT E.FirstName + ' ' + E.LastName AS EmployeeName,
          D.Year, D.MonthNumber, SUM(SalesAmount) AS SalesAmount
   FROM Sales S, Employee E, Date D
   WHERE S.EmployeeKey = E.EmployeeKey AND S.OrderDateKey = D.DateKey
   GROUP BY E.FirstName, E.LastName, D.Year, D.MonthNumber )
SELECT EmployeeName, Year, SUM(SalesAmount) AS SalesAmount,
       AVG(SalesAmount) AS AvgMonthlySales
FROM MonthlySalesEmp
GROUP BY EmployeeName, Year
ORDER BY EmployeeName, Year
The table MonthlySalesEmp computes the monthly sales of each employee. By grouping this table by employee and year in the main query, the SUM and AVG functions yield, respectively, the total yearly sales and the average monthly sales of each employee.
Query 7.9 Total sales amount and discount amount per product and month.
SELECT P.ProductName, D.Year, D.MonthNumber,
       SUM(S.SalesAmount) AS SalesAmount,
       SUM(S.UnitPrice * S.Quantity * S.Discount) AS TotalDisc
FROM Sales S, Date D, Product P
WHERE S.OrderDateKey = D.DateKey AND S.ProductKey = P.ProductKey
GROUP BY P.ProductName, D.Year, D.MonthNumber
ORDER BY P.ProductName, D.Year, D.MonthNumber
Here, we group the sales by product and month. Then, the SUM aggregation function is used to obtain the total sales amount and the total discount amount.
Query 7.10 Monthly year-to-date sales for each product category.
WITH SalesCategoryMonth AS (
   SELECT CategoryName, Year, MonthNumber,
          SUM(SalesAmount) AS SalesAmount
   FROM Sales S, Product P, Category C, Date D
   WHERE S.OrderDateKey = D.DateKey AND
         S.ProductKey = P.ProductKey AND P.CategoryKey = C.CategoryKey
   GROUP BY CategoryName, Year, MonthNumber ),
CategoryMonth AS (
   SELECT DISTINCT CategoryName, Year, MonthNumber, MonthName
   FROM Category, YearMonth )
SELECT C.CategoryName, C.MonthName, C.Year, SalesAmount, SUM(SalesAmount)
   OVER (PARTITION BY C.CategoryName, C.Year ORDER BY
         C.MonthNumber ROWS UNBOUNDED PRECEDING) AS YTDSalesAmount
FROM CategoryMonth C LEFT OUTER JOIN SalesCategoryMonth S ON
     C.CategoryName = S.CategoryName AND C.Year = S.Year AND
     C.MonthNumber = S.MonthNumber
ORDER BY CategoryName, Year, C.MonthNumber
The query computes the year-to-date sales by category and month, including months with no sales. The table SalesCategoryMonth aggregates the sales amounts by category and month, while the table CategoryMonth computes the Cartesian product of all categories with the YearMonth view, which contains all months of the Date dimension. In the main query, a left outer join is performed with SalesCategoryMonth and, for each resulting row, a window is defined containing all rows with the same category and year. The rows of the window are sorted by month, and the cumulative sum is computed over the current row and all preceding rows.
7.3 Querying the Relational Data Warehouse in SQL 223
Query 7.11 Moving average over the last 3 months of the sales amount by product category.
SELECT C.CategoryName, C.MonthName, C.Year, SalesAmount,
AVG(SalesAmount) OVER (PARTITION BY C.CategoryName ORDER BY C.Year, C.MonthNumber ROWS 2 PRECEDING) AS MovAvg3M
FROM CategoryMonth C LEFT OUTER JOIN SalesCategoryMonth S ON
C.CategoryName = S.CategoryName AND C.Year = S.Year AND C.MonthNumber = S.MonthNumber
ORDER BY CategoryName, Year, C.MonthNumber
This query builds upon the previous one and uses the same temporary tables. It performs a left outer join between the CategoryMonth and SalesCategoryMonth tables. For each resulting row, a window is defined containing all tuples with the same category, ordered by year and month, and the average of the current row and the two preceding ones is computed.
Query 7.12 Personal sales amount made by an employee compared with the total sales amount made by herself and her subordinates during 2017.
WHERE SupervisorKey IS NOT NULL
SELECT EmployeeKey, SUM(S.SalesAmount) AS PersonalSales
WHERE S.OrderDateKey = D.DateKey AND D.Year = 2017
SUM(S.SalesAmount) AS SubordinateSales FROM Sales S, Supervision U, Date D
S.OrderDateKey = D.DateKey AND D.Year = 2017 GROUP BY SupervisorKey )
SELECT FirstName + ' ' + LastName AS EmployeeName, S1.PersonalSales,
FROM Employee E JOIN SalesEmp2017 S1 ON E.EmployeeKey = S1.EmployeeKey
LEFT OUTER JOIN SalesSubord2017 S2 ON
The table Supervision uses a recursive query to compute the transitive closure of the supervision relationship. The table SalesEmp2017 computes the total sales of each employee, while the table SalesSubord2017 computes the total sales made by the subordinates of each employee. The main query performs a left outer join between these two tables and adds the personal and subordinate sales in the SELECT clause. The COALESCE function handles the case of employees without subordinates.
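Since Query 7.12 is only partially shown above, the following sketch illustrates how the Supervision table can be obtained with a recursive common table expression. It assumes, as in the fragment above, that the Employee dimension has a SupervisorKey attribute; it is an illustration rather than the exact formulation of the original query.
-- Transitive closure of the supervision relationship: each supervisor is paired
-- with all of her direct and indirect subordinates
WITH Supervision(SupervisorKey, EmployeeKey) AS (
   -- Direct supervision links
   SELECT SupervisorKey, EmployeeKey
   FROM Employee
   WHERE SupervisorKey IS NOT NULL
   UNION ALL
   -- Indirect links: subordinates of the employees already reached
   SELECT S.SupervisorKey, E.EmployeeKey
   FROM Supervision S JOIN Employee E ON S.EmployeeKey = E.SupervisorKey )
SELECT * FROM Supervision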
Query 7.13 Total sales amount, number of products, and sum of the quantities sold for each order.
SELECT OrderNo, SUM(SalesAmount) AS SalesAmount,
       MAX(OrderLineNo) AS NbProducts, SUM(Quantity) AS Quantity
FROM Sales
GROUP BY OrderNo
ORDER BY OrderNo
The Sales fact table includes the order number and the order line number, which form a fact dimension. In the query, we group the sales by order number and use the SUM and MAX aggregation functions to obtain the requested values.
Query 7.14 For each month, total number of orders, total sales amount, and average sales amount by order.
SELECT OrderNo, OrderDateKey, SUM(SalesAmount) AS SalesAmount FROM Sales
SELECT Year, MonthNumber, COUNT(OrderNo) AS NoOrders,
SUM(SalesAmount) AS SalesAmount, AVG(SalesAmount) AS AvgAmount FROM OrderAgg O, Date D
The table OrderAgg computes the total sales amount of each order, keeping the date key so that it can be joined with the Date dimension table. Grouping the result by year and month then yields the requested aggregate values.
Query 7.15 For each employee, total sales amount, number of cities, and number of states to which she is assigned.
SELECT FirstName + ' ' + LastName AS EmployeeName,
SUM(SalesAmount) / COUNT(DISTINCT CityName) AS TotalSales,
COUNT(DISTINCT CityName) AS NoCities,
COUNT(DISTINCT StateName) AS NoStates
FROM Sales F, Employee E, Territories T, City C, State S
WHERE F.EmployeeKey = E.EmployeeKey AND E.EmployeeKey = T.EmployeeKey AND
T.CityKey = C.CityKey AND C.StateKey = S.StateKey
GROUP BY FirstName, LastName
The Territories table captures the many-to-many relationship between employees and cities. The query joins five tables and aggregates the result by employee. In the SELECT clause, the total SalesAmount is divided by the number of distinct cities assigned to each employee in the Territories table, which solves the double-counting problem discussed in Section 4.2.6.
If an attribute Percentage in table Territories states the percentage of time an employee is assigned to each city, the query above reads as follows:
SELECT FirstName + ' ' + LastName AS EmployeeName,
SUM(SalesAmount * T.Percentage) AS TotalSales,
COUNT(DISTINCT CityName) AS NoCities,
COUNT(DISTINCT StateName) AS NoStates
FROM Sales F, Employee E, Territories T, City C, State S
WHERE F.EmployeeKey = E.EmployeeKey AND E.EmployeeKey = T.EmployeeKey AND
      T.CityKey = C.CityKey AND C.StateKey = S.StateKey
GROUP BY FirstName, LastName
We can see that each sales amount is multiplied by the corresponding percentage before being summed, which accounts for the double-counting problem.
7.4 Comparison of MDX, DAX, and SQL
In the preceding sections, we used MDX, DAX, and SQL for querying the Northwind data warehouse. In this section, we compare these languages.
At first glance, SQL, MDX, and DAX appear to have similar syntax and functionality, since they can express the same set of queries. However, there are key differences among the three languages, which we discuss next.
The primary distinction lies in MDX's ability to reference multiple dimensions, which makes it particularly suited for querying multidimensional data. Although SQL can be used to query cubes, it handles only two dimensions, namely columns and rows, and DAX, being based on a tabular model, also faces difficulties with multiple dimensions and hierarchies. In practice, however, most OLAP tools cannot display results with more than two dimensions. In MDX, the cross join operator allows measures to be analyzed across several dimensions, whereas SQL uses the SELECT clause to define the column layout; DAX does not use a SELECT clause at all.
In SQL, the WHERE clause filters the data returned by a query, while in MDX it specifies a slice of the data. Although both concepts aim at restricting the result, they differ significantly. SQL's WHERE clause can include an arbitrary list of conditions to narrow down the data retrieved, whereas MDX's slicing reduces the number of dimensions, and each member in the WHERE clause must belong to a different dimension. Moreover, MDX's WHERE clause does not filter what appears on the axes; functions such as FILTER, NONEMPTY, and TOPCOUNT are used for that purpose. Similarly, in DAX filtering is achieved through the FILTER function; there is no WHERE clause or slicing concept.
Comparing the queries over the Northwind data warehouse in MDX, DAX, and SQL reveals further differences. In SQL, joins between tables must be stated explicitly, while in MDX and DAX they are implicit. In addition, SQL's inner join eliminates empty combinations unless outer joins are used to display them, whereas MDX and DAX require NON EMPTY and ISBLANK, respectively, to remove such combinations.
In SQL, roll-up operations require explicit aggregation through the GROUP BY clause and aggregation functions in the SELECT clause. In contrast, MDX performs this automatically using the aggregation functions defined in the cube, while DAX computes them during measure evaluation using the MEASURE statement. Additionally, SQL requires the display format to be specified in the query, whereas in MDX and DAX it is defined in the cube and in the model definition, respectively.
When comparing measures of the current period with those of a previous period, such as the previous month or the same month of the previous year, the approaches differ as well. In MDX, calculated members are created with the WITH MEMBER clause, while SQL requires temporary tables defined in the WITH clause to perform the necessary aggregations, together with an outer join that combines the measures of the current and previous periods. Obtaining the previous month in SQL is further complicated by the need to distinguish months in the same year from those in the previous year. DAX offers a more concise formulation, closer to MDX, using the MEASURE statement to compute the sales of the previous period. Both DAX and MDX thus provide simpler and more flexible computations, although they may appear cryptic to nonexpert users because of their more involved semantics.
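To make the SQL side of this comparison concrete, the following sketch shows an alternative formulation of the previous-month comparison of Query 7.4 using the LAG window function instead of an outer join. It reuses the SalesProdMonth temporary table and is equivalent only under the assumption that every product has a row for every month, since LAG merely looks at the preceding row.
WITH SalesProdMonth AS (
   SELECT P.ProductName, D.Year, D.MonthNumber,
          SUM(S.SalesAmount) AS SalesAmount
   FROM Sales S, Product P, Date D
   WHERE S.ProductKey = P.ProductKey AND S.OrderDateKey = D.DateKey
   GROUP BY P.ProductName, D.Year, D.MonthNumber )
SELECT ProductName, Year, MonthNumber, SalesAmount,
       -- Value of the preceding row within each product, in chronological order
       LAG(SalesAmount) OVER (PARTITION BY ProductName
          ORDER BY Year, MonthNumber) AS SalesPrevMonth
FROM SalesProdMonth
ORDER BY ProductName, Year, MonthNumber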
Top and bottom performance analysis can be carried out in all three languages: MDX provides the TOPCOUNT function, DAX provides TOPN, and SQL provides TOP. While such queries are straightforward, the languages differ significantly when cumulative sums are needed. In MDX, the TOPPERCENT function sorts the data in descending order and returns the tuples with the highest values whose cumulative total reaches a given percentage. SQL's TOP(n) PERCENT, in contrast, merely returns a percentage of the total number of tuples, so running sums must be computed with window functions, which complicates the query. DAX faces additional difficulties, since it has no equivalent of TOPPERCENT and does not support window functions, requiring a more involved combination of measure computations and filtering.
Query 7.8 illustrates the manipulation of aggregates at different granularities. MDX uses the ASCENDANTS function to move from a finer to a coarser granularity, SQL defines a temporary table at the finer granularity and aggregates it in the main query, and DAX follows a similar approach through measure computations and the SUMMARIZECOLUMNS function. For period-to-date calculations and moving averages, MDX uses the PERIODSTODATE and LAG functions, while DAX uses TOTALYTD and AVERAGEX. These functions simplify query writing for users familiar with the languages, whereas SQL relies on window functions, leading to more ad hoc query structures.
Query 7.12 illustrates aggregation in parent-child hierarchies, showing the simplicity of the MDX expression compared with the recursive query needed in SQL. DAX yields a succinct query as well, but it requires the prior computation of a measure storing the supervision path to compensate for the limited support of hierarchies in the tabular model. The dimensional capabilities of MDX are thus fully exploited when addressing recursive queries.
Queries 7.13 and 7.14 illustrate the manipulation of fact dimensions, in this case the Order dimension. The MDX formulations are concise, although their evaluation is not immediately obvious, while the SQL queries are easier to write, and DAX also handles fact dimensions well. Query 7.15 shows the manipulation of many-to-many dimensions: the SQL version must explicitly solve the double-counting problem during aggregation, yet remains simple; the MDX query solves it naturally through the bridge table defined at cube-creation time; and DAX requires additional care because of the constraints on the direction of the relationship between employees and their assigned territories.
To conclude, Table 7.1 summarizes some of the advantages and disadvantages of the three languages.
Table 7.1 Comparison of MDX, DAX, and SQL
MDX advantages:
• Expressive multidimensional model comprising facts, dimensions, hierarchies, measure groups
• Simple navigation within time dimension and hierarchies
• Easy to express business-related requests
• Fast, due to the existence of aggregations
MDX disadvantages:
• Extra effort for designing a cube and setting up aggregations
• Steep learning curve: manipulating an n-dimensional space
• Hard-to-grasp concepts: current context, execution phases, etc.
• Some operations are difficult to express, such as ordering on multiple criteria
DAX advantages:
• Simple model with tables joined implicitly through relationships
• Similar syntax as Excel functions aiming at self-service BI
• Built-in time-related calculations
• Fast, due to in-memory, compressed columnar storage
DAX disadvantages:
• Hard-to-grasp concepts, e.g., evaluation contexts
• Single functionality can be achieved in multiple ways
• Limited model, e.g., hierarchies in a single table, at most one active relationship between two entities
• Limited functionality for handling hierarchies
SQL advantages:
• Simple model using two-dimensional tables
• Standardized language supported by multiple systems
• Easy-to-understand semantics of queries
• Various ways to relate tables: joins, derived tables, correlated queries, common table expressions, etc.
SQL disadvantages:
• Tables must be joined explicitly in queries
• No concept of row ordering and hierarchies: navigating dimensions may be complex
• Analytical queries and time-related computations may be difficult to express
• Analytical queries may be costly
KPIs for the Northwind Case Study
KPIs in Analysis Services Multidimensional
In a multidimensional model, a cube can contain a collection of key performance indicators (KPIs), of which only the metadata are stored in the cube; a set of MDX functions uses this metadata to retrieve the KPI values. Each KPI is defined by five properties, namely value, goal, status, trend, and weight, each given by an MDX expression that returns a numeric value. The status and trend expressions return values between −1 and 1, which are used to display graphical indicators for these properties, while the weight determines the contribution of the KPI to its parent KPI, if any. Analysis Services automatically creates hidden calculated members on the Measures dimension for each KPI property, so that they can be used in MDX expressions.
Consider a sales performance KPI whose goal is a 15% year-over-year growth, which requires monitoring the monthly sales figures against this target. Sales performance is deemed satisfactory when actual sales reach at least 95% of the goal, an alert is raised when sales fall between 85% and 95%, and immediate action is required when sales drop below 85% of the goal. Tracking the sales trend is also important: a sales amount exceeding the expectation by 20% is a positive indicator that should be highlighted, whereas a 20% shortfall calls for prompt intervention.
The MDX query that computes the goal of the KPI is given next:
WITH MEMBER Measures.SalesPerformanceGoal AS
WHEN ISEMPTY(PARALLELPERIOD([Order Date].Calendar.Month, 12, [Order Date].Calendar.CurrentMember))
PARALLELPERIOD([Order Date].Calendar.Month, 12, [Order Date].Calendar.CurrentMember) )
SELECT { [Sales Amount], SalesPerformanceGoal } ON COLUMNS,
[Order Date].Calendar.Month.MEMBERS ON ROWS
The CASE statement sets the sales goal of a month to the actual monthly sales whenever the corresponding month of the previous year falls outside the time frame of the cube. Since the Northwind data warehouse records sales starting in July 2016, the sales goal therefore coincides with the actual sales up to June 2017.
In Visual Studio, we can establish a Key Performance Indicator (KPI) called Sales Performance by supplying MDX expressions for each specified property.
• Value: The measure defining the KPI is [Measures].[Sales Amount].
• Goal: The goal of increasing last year's sales amount by 15% is given by FORMAT(CASE ... END, '$###,##0.00'), where the CASE expression is as in the query above.
• Status: We select the traffic light indicator for displaying the status of the KPI. Therefore, the corresponding MDX expression must return a value between −1 and 1. The KPI browser displays a red traffic light when the status is −1, a yellow one when the status is 0, and a green one when the status is 1. The MDX expression is given next:
CASE
   WHEN KpiValue("Sales Performance") / KpiGoal("Sales Performance") >= 0.95 THEN 1
   WHEN KpiValue("Sales Performance") / KpiGoal("Sales Performance") < 0.85 THEN -1
   ELSE 0
END
The functions KpiValue and KpiGoal retrieve, respectively, the actual value and the goal value of a KPI. The status is obtained by dividing the actual value by the goal: the result is 1 if the ratio is at least 95%, −1 if it is below 85%, and 0 otherwise.
• Trend: Among the available graphical indicators, we select the status arrow. The associated MDX expression is given next:
WHEN ( KpiValue("Sales Performance") - KpiGoal("Sales Performance") ) / KpiGoal("Sales Performance") > 0.2
WHEN ( KpiValue("Sales Performance") - KpiGoal("Sales Performance") ) / KpiGoal("Sales Performance") = 0.2, 1,
( [Sales Amount] - [Sales Target] ) / [Sales Target] < -0.2, -1,
"Sales Amount", FORMAT( [Sales Amount], "$###,##0.00" ),
"Sales Target", FORMAT( [Sales Target], "$###,##0.00" ),
"Sales Status", [Sales Status], "Sales Trend", [Sales Trend] )
The Sales Target measure uses the variable SalesAmountPY to obtain the previous year's sales amount; the target is the current sales amount if no previous value exists, and a 15% increase over the previous value otherwise. The Sales Status measure returns 1 when the ratio of the actual value to the target is at least 95%, −1 when it is below 85%, and 0 otherwise. The Sales Trend measure compares the actual and target values, returning 1 when the relative difference is at least 20%, −1 when it falls below −20%, and 0 otherwise. To visualize these KPIs in Power BI Desktop, the Tabular Editor must be installed, after which the Sales Amount measure can be turned into a KPI.
Fig 7.2 Definition of the Sales Performance KPI in Power BI Desktop
Fig 7.3 Display of the Sales Performance KPI in Power BI Desktop
Dashboards for the Northwind Case Study
Dashboards in Reporting Services
Reporting Services is a server-based platform for creating reports from various data sources, and it comprises three main components: the client, the report server, and the report server databases. Visual Studio typically serves as the client for authoring dashboards. The report server is responsible for authentication, data processing, report rendering, scheduling, and delivery. Reports obtain their data from one or more data sources, which can connect to several providers such as SQL Server and Oracle, while two databases, ReportServer and ReportServerTempDB, store the metadata and temporary data associated with the reports.
Reporting Services offers a variety of elements for building dashboards, including several chart types and report objects such as gauges for KPIs, images, maps, data bars, sparklines, and indicators. These components can be combined with tabular data to enhance visualization and insight.
Fig 7.4 Defining the Northwind dashboard in Reporting Services using Visual Studio
The Northwind dashboard is defined in Visual Studio as illustrated in Figure 7.4, with the Northwind data warehouse as data source. The dashboard comprises five datasets, each associated with an SQL query and with the collection of fields returned by that query. The SQL query displayed in the dialog box corresponds to the top left chart of the report. Figure 7.5 shows the resulting dashboard, whose components we describe next.
Fig 7.5 Dashboard of the Northwind case study in Reporting Services
The top left chart shows the monthly sales together with those of the same month of the previous year. The SQL query is given next.
WITH MonthlySales AS (
   SELECT Year(D.Date) AS Year, Month(D.Date) AS Month,
          SUBSTRING(DATENAME(month, D.Date), 1, 3) AS MonthName,
          SUM(S.SalesAmount) AS MonthlySales
   FROM Sales S, Date D
   WHERE S.OrderDateKey = D.DateKey
   GROUP BY Year(D.Date), Month(D.Date), DATENAME(month, D.Date) )
SELECT MS.Year, MS.Month, MS.MonthName, MS.MonthlySales,
       PYMS.MonthlySales AS PYMonthlySales
FROM MonthlySales MS, MonthlySales PYMS
WHERE PYMS.Year = MS.Year - 1 AND MS.Month = PYMS.Month
The table MonthlySales computes the monthly sales figures; the MonthName column keeps the first three letters of each month, which are used to label the x-axis of the chart. The main query then joins two instances of MonthlySales to produce the final result.
The top right gauge compares the sales of April 2018 with those of April 2017, the last order being recorded on May 5, 2018. The gauge uses a gradient scale from white to light blue covering the range from 0% to 115%, reflecting the KPI goal of a 15% increase in monthly sales over the same month of the previous year.
WITH MonthlySales AS (
   SELECT Year(D.Date) AS Year, Month(D.Date) AS Month,
          SUM(S.SalesAmount) AS MonthlySales
   FROM Sales S, Date D
   WHERE S.OrderDateKey = D.DateKey
   GROUP BY Year(D.Date), Month(D.Date) ),
LastMonth AS (
   SELECT Year(MAX(D.Date)) AS Year, Month(MAX(D.Date)) AS MonthNumber
   FROM Sales S, Date D
   WHERE S.OrderDateKey = D.DateKey )
SELECT MS.Year, MS.Month,
       (MS.MonthlySales - PYMS.MonthlySales) / PYMS.MonthlySales AS Percentage
FROM LastMonth L, YearMonthPM Y, MonthlySales MS, MonthlySales PYMS
WHERE L.Year = Y.Year AND L.MonthNumber = Y.MonthNumber AND
      Y.PM_Year = MS.Year AND Y.PM_MonthNumber = MS.Month AND
      PYMS.Year = MS.Year - 1 AND MS.Month = PYMS.Month
The MonthlySales table computes the total sales of each month, while the LastMonth table obtains the year and month of the most recent order. The main query joins LastMonth with the YearMonthPM view to obtain the month preceding the last order, and further joins retrieve the sales of that month as well as those of the same month of the previous year.
The query for the middle left chart, which shows the shipping costs with respect to the total sales by month, is given next.
SELECT Year(D.Date) AS Year, Month(D.Date) AS Month,
       SUBSTRING(DATENAME(mm, D.Date), 1, 3) AS MonthName,
       SUM(S.SalesAmount) AS TotalSales, SUM(S.Freight) AS TotalFreight,
       SUM(S.Freight) / SUM(S.SalesAmount) AS Percentage
FROM Sales S, Date D
WHERE S.OrderDateKey = D.DateKey
GROUP BY Year(D.Date), Month(D.Date), DATENAME(mm, D.Date)
ORDER BY Year, Month, DATENAME(mm, D.Date)
Here we compute the total sales and the total freight cost by month, as well as the percentage that the latter represents of the former.
The gauge displayed in the middle right of Fig 7.5 illustrates the year-to-date shipping costs to sales ratio, with a target KPI aimed at keeping shipping costs below 5% of total sales.
WITH LastMonth AS (
   SELECT Year(MAX(D.Date)) AS Year, Month(MAX(D.Date)) AS Month
   FROM Sales S, Date D
   WHERE S.OrderDateKey = D.DateKey )
SELECT SUM(S.SalesAmount) AS TotalSales, SUM(S.Freight) AS TotalFreight,
       SUM(S.Freight) / SUM(S.SalesAmount) AS Percentage
FROM LastMonth L, YearMonthPM Y, Sales S, Date D
WHERE L.Year = Y.Year AND L.Month = Y.MonthNumber AND
      S.OrderDateKey = D.DateKey AND Year(D.Date) = Y.PM_Year AND
      Month(D.Date) <= Y.PM_MonthNumber
As for the previous gauge, we use LastMonth and YearMonthPM to compute the requested value for the last complete month, that is, April 2018.
Finally, the query for the bottom table showing the three lowest-performing employees is given next:
WITH LastDay AS (
   SELECT Year(MAX(Date)) AS Year, DATEPART(dy, MAX(Date)) AS DayNbYear
   FROM Sales S, Date D
   WHERE S.OrderDateKey = D.DateKey ),
TgtSales AS (
   SELECT S.EmployeeKey, SUM(S.SalesAmount) * 1.05 AS TargetSales
   FROM Sales S, Date D, LastDay L
   WHERE S.OrderDateKey = D.DateKey AND Year(D.Date) = L.Year - 1
   GROUP BY S.EmployeeKey ),
ExpSales AS (
   SELECT S.EmployeeKey, SUM(S.SalesAmount) AS YTDSales,
          SUM(S.SalesAmount) * 365 / L.DayNbYear AS ExpectedSales
   FROM Sales S, Date D, LastDay L
   WHERE S.OrderDateKey = D.DateKey AND Year(D.Date) = L.Year
   GROUP BY S.EmployeeKey, L.DayNbYear )
SELECT TOP(3) FirstName + ' ' + LastName AS Name, YTDSales, ExpectedSales,
       TargetSales, ExpectedSales / TargetSales AS Percentage,
       RANK() OVER (ORDER BY ExpectedSales / TargetSales) AS Rank
FROM TgtSales T, ExpSales S, Employee E
WHERE T.EmployeeKey = S.EmployeeKey AND S.EmployeeKey = E.EmployeeKey
ORDER BY Percentage
The table LastDay computes the year and the day-of-the-year number of the last order, while the table TgtSales computes the target sales of each employee, namely a 5% increase over the previous year's sales. The table ExpSales computes the year-to-date sales and the expected sales at the end of the year, obtained by accounting for the remaining days of the year. The main query joins the last two tables with the Employee table, computes the ratio of expected to target sales, and retrieves the three lowest-performing employees.
Dashboards in Power BI
This section shows how the dashboard discussed above can be implemented in Power BI, as illustrated in Figure 7.6. We detail next the DAX measures used for the various components of the dashboard.
Fig 7.6 Dashboard of the Northwind case study in Power BI
The top left chart shows the monthly sales compared with those of the same month in the previous year. The required measures are given next.
[Sales Amount] = SUM ( [SalesAmount] )
[PY Sales] = CALCULATE( [Sales Amount], SAMEPERIODLASTYEAR( 'Date'[Date] ) )
The top right gauge shows the percentage change in sales between the last month and the same month in the previous year The required measures are given next.
[LastOrderDate] = CALCULATE( MAX( 'Date'[Date] ), FILTER( ALL( 'Sales' ), TRUE ) ) [LM Sales] =
VAR LastOrderEOMPM = CALCULATE ( MAX ( 'Date'[YearMonth] ),
FILTER ( 'Date', 'Date'[Date] = EOMONTH ( [LastOrderDate], -1 ) ) ) RETURN
CALCULATE( [Sales Amount], FILTER( ALL( 'Date' ), 'Date'[YearMonth] = LastOrderEOMPM ) )
VAR LastOrderEOMPY = CALCULATE ( MAX ( 'Date'[YearMonth] ),
FILTER ( 'Date', 'Date'[Date] = EOMONTH ( [LastOrderDate], -13 ) ) ) RETURN
CALCULATE( [Sales Amount], FILTER( ALL( 'Date' ), 'Date'[YearMonth] = LastOrderEOMPY ) )
[Perc Change Sales] = DIVIDE([LM Sales] - [LMPY Sales], [LMPY Sales], 0)
The LastOrderDate measure obtains the date of the last order, LM Sales computes the sales of the month preceding that date, and LMPY Sales computes the sales of the same month in the previous year. For this we use the EOMONTH function, which returns the last day of the month a given number of months before or after a specified date. Perc Change Sales then uses the previous two measures to compute the percentage change in sales. Finally, we also define the minimum, maximum, and target values for the gauge.
Next, we show the measures for the middle left chart, which displays the shipping costs with respect to the total sales by month.
[Total Freight to Sales Ratio] = DIVIDE( [Total Freight], [Sales Amount], 0 )
The middle right gauge shows the year-to-date shipping costs to sales ratio.
In addition to computing this measure, we need to set the minimum, maximum, and target values for the gauge, as given next.
[YTD Freight] = CALCULATE( [Total Freight],
FILTER( 'Date', 'Date'[Year] = MAX( 'Date'[Year] ) ) )
[YTD Sales] = CALCULATE( [Sales Amount],
FILTER( 'Date', 'Date'[Year] = MAX( 'Date'[Year] ) ) )
[YTD Freight to Sales Ratio] = DIVIDE( [YTD Freight], [YTD Sales], 0 )
[Freight to Sales Ratio Min Value] = 0.0
[Freight to Sales Ratio Max Value] = 0.2
[Freight to Sales Ratio Target Value] = 0.05
Finally, the measures for the bottom table, which shows the end-of-year forecast for the three lowest-performing employees, are given next:
[LastOrderDayNbYear] = CALCULATE( MAX ( 'Date'[DayNbYear] ),
FILTER ( 'Date', 'Date'[Year] = MAX ( 'Date'[Year] ) ) )
[Expected Sales] = [YTD Sales] * DIVIDE( 365, [LastOrderDayNbYear], 1 )
[Target Sales] = CALCULATE( [Sales Amount] * 1.05,
FILTER(ALL('Date'), 'Date'[Year] = MAX ('Date'[Year] ) - 1))
[Expected Quota] = DIVIDE( [Expected Sales], [Target Sales], 0 )
[Quota Rank] = RANKX( FILTER( ALLSELECTED( Employee ),
The first step computes the day-of-the-year number of the last order date, which is 124 for May 4, 2018. The expected sales are obtained by multiplying the year-to-date sales by a factor accounting for the remaining days of the year. The target sales correspond to a 5% increase over the previous year's sales, and the expected quota is the ratio of the expected sales to the target sales. Finally, employees are ranked by their expected quota, and the Power BI interface filters the display to show only three employees.
Summary
This chapter explored the use of MDX, DAX, and SQL for querying data warehouses through a series of queries addressed to the Northwind data warehouse. We compared the strengths and weaknesses of the three languages, emphasizing their expressiveness. We then defined KPIs for the Northwind case study in both Analysis Services Multidimensional and Tabular. The chapter concluded by showing how to build dashboards for the Northwind case study using Reporting Services and Power BI.
Review Questions
7.1 What are key performance indicators or KPIs? What are they used for? Detail the conditions a good KPI must satisfy.
7.2 Define a collection of KPIs using an example of an application domain that you are familiar with.
7.3 Explain the notion of dashboard. Compare the different definitions for dashboards.
7.4 What types of dashboards do you know? How would you use them?
7.5 Comment on the dashboard design guidelines.
7.6 Define a dashboard using an example of an application domain that you are familiar with.
Implementation and Deployment
Physical Modeling of Data Warehouses
This section provides an overview of three fundamental techniques for improving data warehouse performance: materialized views, indexing, and partitioning. We study these techniques in more detail in the rest of the chapter.
In the relational model, a view is essentially a stored query, defined over base tables or other views, that can be used like an ordinary table. A materialized view is a view that is physically stored in the database. Materialized views improve query performance by precomputing expensive operations such as joins and aggregations and storing their results, so that queries that only access materialized views execute faster, at the price of additional storage space.
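As a minimal sketch of how such a precomputed aggregate can be declared, SQL Server implements materialized views as indexed views: a schema-bound view whose result is stored by a unique clustered index. The view and index names below are illustrative, and we assume that the Sales fact table resides in the dbo schema and that SalesAmount is not nullable, as indexed views require.
-- Schema-bound aggregate view over the fact table (COUNT_BIG(*) is mandatory)
CREATE VIEW dbo.SalesByProductDate WITH SCHEMABINDING AS
   SELECT ProductKey, OrderDateKey,
          SUM(SalesAmount) AS SumSales, COUNT_BIG(*) AS NbRows
   FROM dbo.Sales
   GROUP BY ProductKey, OrderDateKey;
GO
-- Creating the unique clustered index materializes the view's result
CREATE UNIQUE CLUSTERED INDEX SalesByProductDate_Idx
   ON dbo.SalesByProductDate (ProductKey, OrderDateKey);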
A well-known problem of materialized views is updating them: any change to the underlying base tables must be propagated to the view. For efficiency, such updates should be performed incrementally, avoiding the recomputation of the whole view from scratch; this requires capturing the modifications to the base tables and determining how they affect the content of the view. View maintenance has been studied extensively, and this chapter covers the foundational approaches in this field.
In data warehousing, the number of aggregates grows exponentially with the number of dimensions and hierarchies, so it is impractical to precompute and materialize all possible aggregations. A critical problem in data warehouse design is therefore the selection of the materialized views that minimize the total query response time and the maintenance cost, given limited resources such as storage space and materialization time. Many algorithms have been proposed for this selection problem, and some commercial DBMSs provide tools that choose materialized views based on historical query workloads. Once the materialized views have been defined, queries must be rewritten to take advantage of them, a process known as query rewriting, whose goal is to use the materialized views as much as possible, even when they only partially satisfy the query conditions. Selecting the best rewriting, in particular for aggregation queries, is a complex problem, and the existing algorithms impose restrictions on the original query and on the candidate materialized views to make the rewriting feasible.
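As a sketch of query rewriting, a query asking for the total sales per product could be answered from the SalesByProductDate view of the previous example instead of from the fact table; depending on the system and edition, this rewriting may be applied automatically by the optimizer or requested explicitly.
-- Original formulation over the fact table
SELECT ProductKey, SUM(SalesAmount) AS SalesAmount
FROM Sales
GROUP BY ProductKey;

-- Rewriting over the materialized view, which is typically much smaller
SELECT ProductKey, SUM(SumSales) AS SalesAmount
FROM SalesByProductDate
GROUP BY ProductKey;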
One drawback of materialized views is that the queries to be materialized must be anticipated, while many data warehouse queries are ad hoc and unpredictable. Queries that have not been precomputed must be evaluated at run time, which calls for efficient indexing methods. Traditional indexing techniques used in OLTP systems are not well suited to multidimensional data: OLTP transactions typically access only a small number of tuples, whereas data warehouse queries often access a much larger portion of the data. Two indexing methods commonly used in data warehouses are bitmap indexes and join indexes. Bitmap indexes are particularly effective for columns with low cardinality, and various compression techniques enhance their applicability. Join indexes, in turn, materialize the join between two tables by storing the pairs of row identifiers that participate in the join, effectively linking dimension values to the rows of the fact table. For instance, a join index between a sales fact table and a client dimension keeps, for each client, the list of row identifiers of the sales related to that client. Join indexes can be combined with bitmap indexes for improved query performance.
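Neither index type is defined in the SQL standard; the following sketch uses Oracle-style syntax as an illustration (SQL Server offers related functionality through columnstore and conventional indexes), and the index names are invented for the example.
-- Bitmap index on a low-cardinality column of the Product dimension
CREATE BITMAP INDEX Product_Category_BIdx ON Product (CategoryKey);

-- Bitmap join index: fact rows of Sales are indexed by the category of the
-- product sold, materializing the join with the Product dimension
CREATE BITMAP INDEX Sales_Category_BJIdx
   ON Sales (Product.CategoryKey)
   FROM Sales, Product
   WHERE Sales.ProductKey = Product.ProductKey;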
Partitioning is a technique used in relational databases whereby the contents of a relation are divided into several files that can be processed more efficiently. For instance, a table can be split so that frequently accessed attributes are stored in one partition while less frequently used attributes are placed in another. In data warehouses, partitioning is often time-based, each partition containing the data of a given time period, such as a year or a range of months.
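A minimal sketch of time-based partitioning in SQL Server syntax follows; it assumes that the date key is an integer of the form YYYYMMDD, and the partition function, scheme, table, and boundary values are all illustrative.
-- Map order date keys to yearly ranges
CREATE PARTITION FUNCTION SalesByYearPF (INT)
   AS RANGE RIGHT FOR VALUES (20160101, 20170101, 20180101);

-- Place every partition on the primary filegroup
CREATE PARTITION SCHEME SalesByYearPS
   AS PARTITION SalesByYearPF ALL TO ([PRIMARY]);

-- Simplified fact table partitioned on the order date key
CREATE TABLE SalesPart (
   ProductKey INT, CustomerKey INT, OrderDateKey INT,
   SalesAmount DECIMAL(10,2) )
ON SalesByYearPS (OrderDateKey);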
In the following sections we study these techniques in detail.
Materialized Views
A view in SQL is a derived relation defined with the CREATE VIEW statement and recomputed every time it is accessed. A materialized view, in contrast, is physically stored in the database; it improves query performance by acting as a cache that can be accessed directly, without querying the base relations. When the base relations are updated, however, the materialized views must be updated as well, a process known as view maintenance. Incremental view maintenance updates the view from the changes to the underlying relations without recomputing it entirely, and is a crucial technique in data warehousing.
The view maintenance problem can be analyzed through four dimensions:
• Information: Refers to the information available for view maintenance,like integrity constraints, keys, access to base relations, and so on.
• Modification: Refers to the types of changes handled by the maintenance algorithm, namely insertions, deletions, and updates, where updates are typically processed as a deletion followed by an insertion.
• Language: Refers to the language used to define the view, most often SQL Aggregation and recursion are also issues in this dimension.
• Instance: Refers to whether or not the algorithm works for every instance of the database, or for a subset of all instances.
In a sales database, the relation Sales(ProductKey, CustomerKey, Quantity) tracks product orders, while the materialized view TopProducts identifies products that have received orders exceeding 150 units from at least one customer.
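A possible definition of this view, using only the relation and the condition mentioned in the text, is the following sketch.
-- Products ordered in a quantity greater than 150 by at least one customer
CREATE VIEW TopProducts AS
   SELECT DISTINCT ProductKey
   FROM Sales
   WHERE Quantity > 150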
Inserting the tuple (p2,c3,110) into the table Sales does not affect the view, since the tuple does not satisfy the view's condition. Inserting the tuple (p2,c3,160), on the other hand, may alter the view. An algorithm can update the view without accessing the base relation: it simply adds the product to the view if it is not already present.
To handle the deletion of the tuple (p2,c3,160) from the table Sales, we must first verify whether p2 has been ordered by some other customer in a quantity greater than 150 before removing it from the view. This verification requires a scan of the Sales relation.
In summary, insertions can sometimes be handled using only the materialized view, whereas deletions require additional information. As another example, consider the view FoodCustomers, which contains the customers who have ordered at least one product in the food category, using a simplified and denormalized Product dimension. The view is defined as the projection over CustomerKey of the join between Sales and the products whose category is 'Food', as sketched below.
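Under these assumptions, the view could be defined along the following lines; this is a sketch using the CategoryName attribute of the denormalized Product dimension.
CREATE VIEW FoodCustomers AS
   SELECT DISTINCT CustomerKey
   FROM Sales S, Product P
   WHERE S.ProductKey = P.ProductKey AND P.CategoryName = 'Food'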
When the tuple (p3,c4,170) is inserted into the table Sales, we cannot know whether c4 must appear in the FoodCustomers view unless we check in the base relations whether p3 belongs to the food category.
These examples show that view maintenance problems must be characterized by the kind of update and by the operations used in the view definition. In the database literature, two main classes of view maintenance algorithms have been studied:
• Algorithms using full information, which means the views and the base relations.
• Algorithms using partial information, namely, the materialized views and the key constraints.
Three types of views are addressed by these algorithms: nonrecursive views, outer-join views, and recursive views. We only discuss the first two, since recursive views are beyond the scope of this book.
The counting algorithm is the basic method for nonrecursive views, which may include join, union, negation, and aggregation. The algorithm counts the number of alternative derivations of each tuple in the view, which makes it possible to decide whether a tuple must be deleted from the view when a tuple of a base relation is deleted. To illustrate the idea, consider again the relation FoodCustomers introduced above, now extended with the number of derivations:
CREATE VIEW FoodCustomers(CustomerKey, Count) AS (
   SELECT CustomerKey, COUNT(*) AS Count
   FROM Sales S, Product P
   WHERE S.ProductKey = P.ProductKey AND P.CategoryName = 'Food'
   GROUP BY CustomerKey )
Figure 8.1a shows an instance of the relation Sales, and Figure 8.1b shows the corresponding view FoodCustomers, where the additional column Count indicates the number of derivations of each tuple. For example, the tuple (c2,2) states that customer c2 purchased two items of the food category. We assume that the only products in this category are p1 and p2, and that the tuples shown pertain exclusively to these products.
Fig 8.1 An example of the counting algorithm: (a) Instance of the Sales relation; (b) View FoodCustomers, including the number of possible derivations of each tuple; (c) View FoodCustomers after the deletion of (p1,c2,100)
Suppose now that the tuple (p1,c2,100) is deleted from Sales. Customer c2 remains in the view FoodCustomers, since it can still be derived from the tuple (p2,c2,50). The counting algorithm handles this by computing the relation ∆−(FoodCustomers), which contains the tuples that can be derived from (p1,c2,100) and are therefore affected by its deletion, associating a −1 with each of them; in this example, ∆−(FoodCustomers) contains a single tuple for customer c2. For insertions, ∆+(FoodCustomers) analogously associates a 1 with each tuple. The updated view, shown in Fig 8.1c, is obtained by joining ∆−(FoodCustomers) with the materialized view FoodCustomers on the CustomerKey attribute and subtracting ∆−(FoodCustomers).Count from FoodCustomers.Count. Since c2 has two derivations (Fig 8.1b), it remains in the view with one derivation removed. If the tuple (p2,c2,50) were later deleted, c2 would also disappear from the view, as would c1 upon the deletion of (p1,c1,20).
We now consider views defined with an outer join, using the two relations Product(ProdID, ProdName, ShipID) and Shipper(ShipID, ShipName) shown in Figures 8.2a and 8.2b. An example of such a view is given next.
CREATE VIEW Proj_Shipper AS (
   SELECT P.ProdID, P.ProdName, S.ShipID, S.ShipName
   FROM Product P FULL OUTER JOIN Shipper S ON P.ShipID = S.ShipID )
Figure 8.2d shows modifications to the Product relation, given as insertions ∆+(Product) and deletions ∆−(Product); updates are treated as deletions followed by insertions. To maintain the view, the full outer join is rewritten as a left or a right outer join, depending on whether the updates concern the left or the right table of the join, and the result is then merged with the view to be updated.
SELECT P.ProdID, P.ProdName, S.ShipID, S.ShipName
FROM ∆ ( Product ) P LEFT OUTER JOIN Shipper S ON P.ShipID = S.ShipID
SELECT P.ProdID, P.ProdName, S.ShipID, S.ShipName
FROM Product P RIGHT OUTER JOIN ∆ ( Shipper ) S ON P.ShipID = S.ShipID
Data Cube Maintenance
In data warehouses, summary tables, which contain aggregate functions, are essential for efficient query answering. The problem is to keep these tables up to date as the source data are modified, while accessing the base data as little as possible and keeping the summary tables available to users as long as possible. This can be done in two ways: either the summary tables are recomputed from scratch, or incremental view maintenance techniques are used to propagate the changes.
Fig 8.3 An example of self-maintenance of a full outer join view: (a) Proj_Shipper; (b) ∆+(Product); (c) ∆−(Product); (d) ∆+(Product) joined with Proj_Shipper; (e) Updated view Proj_Shipper
The summary-delta algorithm is the basic method for maintaining summary tables; its main goal is to reduce the time during which the summary tables are unavailable to data warehouse users while they are being updated. Although this algorithm is representative, many other techniques have been proposed in the literature, for example to manage multiple versions of the summary tables.
An aggregate function is self-maintainable if its new value can be computed solely from its previous value and from the changes to the base data. To be self-maintainable, an aggregate function must be distributive. The five classic SQL aggregate functions are self-maintainable with respect to insertions, but not with respect to deletions; in particular, MAX and MIN cannot be made self-maintainable with respect to deletions.
The summary-delta algorithm consists of two phases, called propagate and refresh. Its main advantage is that the propagate phase can be performed concurrently with normal data warehouse operation, while only the refresh phase requires taking the warehouse offline. During the propagate phase, a summary-delta table is created that records the net changes to the summary table induced by the modifications of the source data; these changes are then applied to the summary table during the refresh phase.
We will explain the algorithm with a simplified version of the Sales fact table in the Northwind case study, whose schema we show next.
Sales(ProductKey, CustomerKey, DateKey, Quantity)
Consider a view DailySalesSum defined as follows:
CREATE VIEW DailySalesSum(ProductKey, DateKey, SumQuantity, Count) AS (
   SELECT ProductKey, DateKey, SUM(Quantity) AS SumQuantity,
          COUNT(*) AS Count
   FROM Sales
   GROUP BY ProductKey, DateKey )
The Count attribute is introduced to keep the view maintainable in the presence of deletions. During the propagate phase, two tables, ∆+(Sales) and ∆−(Sales), store, respectively, the insertions into and the deletions from the fact table. In addition, a summary-delta table records the net changes to the summary table:
CREATE VIEW SD_DailySalesSum(ProductKey, DateKey,
   SD_SumQuantity, SD_Count) AS
WITH Temp AS (
   ( SELECT ProductKey, DateKey,
            Quantity AS _Quantity, 1 AS _Count
     FROM ∆+(Sales) )
   UNION ALL
   ( SELECT ProductKey, DateKey,
            -1 * Quantity AS _Quantity, -1 AS _Count
     FROM ∆−(Sales) ) )
SELECT ProductKey, DateKey, SUM(_Quantity), SUM(_Count)
FROM Temp
GROUP BY ProductKey, DateKey
In the temporary table Temp of the view definition, each tuple coming from ∆+(Sales) stores a 1 in the _Count attribute, while each tuple coming from ∆−(Sales) stores a −1; the Quantity values are likewise multiplied by 1 or −1 depending on their origin. The outer SELECT then aggregates these values: SD_SumQuantity contains the net change in the sum of the quantities for each combination of ProductKey and DateKey, and SD_Count contains the net change in the number of tuples for that combination.
In the refresh phase, the net changes from the summary-delta table are applied to the summary table The following outlines a general algorithm for refreshing data when using the SUM aggregate function.
INPUT: Summary-delta table SD_DailySalesSum
OUTPUT: Updated summary table DailySalesSum
For each tuple T in SD_DailySalesSum DO
   IF NOT EXISTS ( SELECT * FROM DailySalesSum D WHERE T.ProductKey = D.ProductKey AND
      T.DateKey = D.DateKey )
   THEN INSERT T INTO DailySalesSum
   ELSE IF EXISTS ( SELECT * FROM DailySalesSum D WHERE T.ProductKey = D.ProductKey AND
      T.DateKey = D.DateKey AND T.SD_Count + D.Count = 0 )
   THEN DELETE T FROM DailySalesSum
   ELSE UPDATE DailySalesSum
      SET SumQuantity = SumQuantity + T.SD_SumQuantity,
          Count = Count + T.SD_Count
      WHERE ProductKey = T.ProductKey AND DateKey = T.DateKey
The algorithm examines each tuple T of the summary-delta table and checks whether the corresponding combination is present in the view. If it is not, T is inserted; if it is present but all the tuples of the combination (ProductKey, DateKey) have been deleted (i.e., T.SD_Count + D.Count = 0), the tuple is deleted from the view; otherwise, the existing tuple of the view is updated with the new sum and count. For example, in Fig 8.4, the insertion of the tuple (p4,c2,t4,100) appears in the changes, the summary-delta table records the net effect for the combination (p4,t4), and the update produces the final tuple (p4,t4,200,16).
Figure 8.5 illustrates the case of the MAX aggregate function. Here, the view DailySalesMax needs an additional column that counts the number of tuples attaining the maximum value, rather than simply counting the tuples of each combination of ProductKey and DateKey as done for SUM.
CREATE VIEW DailySalesMax(ProductKey, DateKey, MaxQuantity, Count) AS (
   SELECT ProductKey, DateKey, MAX(Quantity), COUNT(*)
   FROM Sales S1
   WHERE Quantity = (
      SELECT MAX(Quantity) FROM Sales S2
      WHERE S1.ProductKey = S2.ProductKey AND S1.DateKey = S2.DateKey )
   GROUP BY ProductKey, DateKey )
The summary-delta table, shown in Fig 8.5b, needs a column storing the maximum value of the inserted or deleted tuples, together with a column counting the number of such insertions or deletions:
Fig 8.4 An example of the propagate and refresh algorithm with aggregate function SUM: (a) Original view DailySalesSum; (b) Summary-delta table SD_DailySalesSum; (c) Updated view DailySalesSum
In this example, positive values correspond to insertions and negative values to deletions: the first four entries of the changes record additions of sales, while the last three record removals.
CREATE VIEW SD_DailySalesMax(ProductKey, DateKey,
   SD_MaxQuantity, SD_Count) AS (
   ( SELECT ProductKey, DateKey, MAX(Quantity), COUNT(*)
     FROM ∆+(Sales) S1
     WHERE Quantity = (
        SELECT MAX(Quantity) FROM ∆+(Sales) S2
        WHERE S1.ProductKey = S2.ProductKey AND S1.DateKey = S2.DateKey )
     GROUP BY ProductKey, DateKey )
   UNION ALL
   ( SELECT ProductKey, DateKey, MAX(Quantity), -1 * COUNT(*)
     FROM ∆−(Sales) S1
     WHERE Quantity = (
        SELECT MAX(Quantity) FROM ∆−(Sales) S2
        WHERE S1.ProductKey = S2.ProductKey AND S1.DateKey = S2.DateKey )
     GROUP BY ProductKey, DateKey ) )
Fig 8.5 An example of the propagate and refresh algorithm with aggregate function MAX: (a) Original view DailySalesMax; (b) Summary-delta table SD_DailySalesMax; (c) Updated view DailySalesMax
After the refresh, depicted in Fig 8.5c, the view is modified according to the contents of the summary-delta table. The tuple for p1 is inserted into the view, since the view has no entry for it. The tuple for p2 is not propagated, because its maximum value is lower than the one in the view. The tuple for p3 has the same maximum value as the view, so the counter is incremented to 7. The tuple for p4 requires updating the view, since its maximum value exceeds the current maximum, yielding both a new maximum and a new counter.
In the summary-delta table, the tuple for p5 has a quantity value smaller than the maximum in the view, so the view remains unchanged. The tuple for p6, which corresponds to deletions, requires further analysis, since deletions may invalidate the maximum stored in the view. The tuple for p7 illustrates the limit case: when the maximum value and the counter in the summary-delta table cancel those in the view, that is, all tuples holding the maximum value have been deleted, two scenarios arise. If other tuples with the combination (p7,t7) remain in the base table, the new maximum value and count must be recomputed from the base tables; otherwise, the tuple is simply deleted from the view.
The algorithm for refreshing the view DailySalesMax from the summary-delta table SD_DailySalesMax is left as an exercise.
Computation of a Data Cube
In Chapter 5, we discussed how a data cube can be computed with SQL queries in which the all value is represented by a null value. This requires taking the union of 2^n GROUP BY queries, one for each possible aggregation over the n dimensions. Computing the whole data cube directly from the base fact and dimension tables in this way is impractical for real-world applications, so several optimization strategies have been proposed; we explore them next to illustrate the main ideas.
Optimization methods are based on the notion of a data cube lattice, in which each node represents a possible aggregation of the fact data and there is an edge from node i to node j if j can be computed from i and i has exactly one more grouping attribute than j. For example, from an aggregate view of the Sales table by CustomerKey and ProductKey, the total sales amount by customer can be computed directly, without accessing the base table.
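For instance, assuming an aggregate table SalesByCustProd materialized at the (CustomerKey, ProductKey) granularity (the table name is illustrative), the coarser aggregation by customer can be rolled up from it rather than from the fact table:
-- Materialize the aggregate at the (CustomerKey, ProductKey) granularity
SELECT CustomerKey, ProductKey, SUM(SalesAmount) AS SalesAmount
INTO SalesByCustProd
FROM Sales
GROUP BY CustomerKey, ProductKey;

-- Roll up to the (CustomerKey) granularity without touching the fact table
SELECT CustomerKey, SUM(SalesAmount) AS SalesAmount
FROM SalesByCustProd
GROUP BY CustomerKey;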
Consider the computation of a four-dimensional data cube, represented by the lattice of Fig 8.6 with dimensions A, B, C, and D. The lattice contains only direct edges, such as the one from ABC to AB, indicating that the summary table AB can be computed from ABC; transitive edges, such as one from ABCD to AB, are not included.
Several simple optimization methods are commonly combined when computing the cube:
• Smallest-parent: Computes each view from the smallest previously computed view. In the lattice of Fig 8.6, for instance, the view AB can be computed from ABC, ABD, or ABCD, and this method chooses the smallest of them.
• Cache-results: Caches in memory an aggregation from which other ones can be computed.
• Amortize-scans: Computes in memory as many aggregations as possible,reducing the amount of table scans.
• Share-sorts: Applies only to methods based on sorting and aims at sharing costs between several aggregations.
• Share-partitions: Applies to hashing-based algorithms. When the hash table does not fit in main memory, the data are partitioned into chunks that fit in memory and the aggregation is performed for each chunk, so that the partitioning cost is shared among several aggregations.
These optimization techniques can be contradictory; for example, share-sorts may favor deriving AB from ABC, while ABD could be its smallest parent. Advanced cube computation algorithms therefore aim at combining several of these simple optimization techniques into efficient query evaluation plans. We discuss next a sorting-based method; similar hashing-based approaches also exist. Note that most algorithms require estimating the sizes of the aggregate views in the lattice.
The PipeSort algorithm combines four of the optimization techniques above. By exploiting the cache-results and amortize-scans strategies, it evaluates the nodes that share a common prefix in a single scan, a method known as pipelined evaluation in database query optimization. This allows the simultaneous computation of, for example, ABCD, ABC, AB, and A, provided the attribute order of the views coincides with the sorting order of the file.
For example, if the file is sorted by A, B, C, and D, with a single scan of the first five tuples we can compute the aggregations (a1,b1,c1,200), (a1,b1,c2,500), (a1,b1,700), (a1,b2,400), and (a1,1100).
The input of the algorithm is a data cube lattice in which each edge e_ij, where node i is the parent of node j, is labeled with two costs, S(e_ij) and A(e_ij). S(e_ij) is the cost of computing j from i if i is not sorted, while A(e_ij) is the cost of computing j from i if i is already sorted. Thus, A(e_ij) ≤ S(e_ij).
In addition, we consider the lattice organized into levels, where each level k contains the views with exactly k attributes, starting from All, where k = 0. This data structure is called a search lattice.
The algorithm generates a subgraph of the search lattice in which each node has a single parent from which it is computed, either sorted or unsorted. If a node's attribute order is a prefix of its parent's order, it can be computed without sorting, at a cost A(e_ij); otherwise, the parent must be sorted, at a cost S(e_ij). In the output graph, each node can have at most one outgoing edge labeled A, while several outgoing edges can be labeled S. The algorithm aims at producing an execution plan that minimizes the sum of the edge costs.
To create the minimum cost output graph, the algorithm proceeds level by level, from level 0 to level N−1, where N is the number of levels in the search lattice. It determines the best way of computing the nodes at level k from the nodes at level k+1, transforming the problem into a weighted bipartite matching problem. For each pair of levels (k, k+1), the algorithm makes k additional copies of each node at level k+1; the edges of the original nodes carry a cost A(e_ij), while the edges of the replicated nodes carry a cost S(e_ij). This transformation yields a bipartite graph, with edges connecting nodes of different levels but not nodes of the same level. The algorithm then computes a minimum cost matching, in which each node j at level k is matched with a node i at level k+1.
When node j is matched to node i through an A() edge, j determines the attribute order in which i will be computed, so that no sorting is needed. Conversely, if j is matched to i through an S() edge, i must be sorted in order to compute j.
Fig 8.7 Minimum bipartite matching between two levels in the cube lattice
Figure 8.7 shows the graph corresponding to levels 1 and 2 of the lattice in Fig. 8.6, where solid lines represent edges of type A(e_ij) and dashed lines represent edges of type S(e_ij). Figure 8.7a includes a duplicate of each node at level 2. Figure 8.7b shows that the computation of all views incurs a cost of A(e_ij), where, for instance, A is computed from AC, B from BA, and so forth.
The matching process is executed N times, corresponding to the number of grouping attributes, and yields an evaluation plan. The underlying heuristic is that if the cost is minimized for every pair of levels, the total cost of the plan will also be small. The output lattice fixes a sorting order for computing each node, which leads to the PipeSort evaluation strategy: all aggregations in any chain in which a node at level k is a prefix of a node at level k+1 in the output graph can be computed in a pipelined fashion.
The general scheme of the PipeSort algorithm is given next.
INPUT: A search lattice with the A() and S() edge costs
OUTPUT: An evaluation plan to compute all nodes in the search lattice
For each level k = 0 to N − 1
    Transform level k + 1 as follows:
        Create k additional copies of each level k + 1 node;
        Connect each copy node to the same set of level k nodes as the original node;
        Assign a cost A(e_ij) to the edges e_ij of the original nodes and
            a cost S(e_ij) to the edges of the copy nodes;
    Find the minimum cost matching on the transformed level k + 1 with level k;
    For each node i in level k + 1
        Fix the sort order of i as the order of the level k node connected to i by an A() edge;
Figure 8.8 shows an evaluation plan for computing the cube lattice of Fig. 8.6 using the PipeSort algorithm.
Following the minimum cost sort plan, the base fact table is first sorted in CBAD order to compute the aggregations CBA, CB, C, and All in a pipelined fashion. The base fact table is then sorted in BADC order to compute the aggregates BAD, BA, and B.
The remaining aggregates are obtained analogously from the sort orders ACDB and DBCA, and the views at level 1 (A, B, C, and D) are computed from the views at level 2, as determined by the bipartite graph matching in Fig. 8.7.
Fig 8.8 Evaluation plan for computing the cube lattice in Fig 8.6
Indexes for Data Warehouses
In database management systems (DBMSs), ensuring rapid data access is a primary concern. To achieve this, relational DBMSs select the most efficient access path for each query. A widely used method to enhance data retrieval speed is indexing, which allows the relevant data to be located quickly. Nearly all queries that look for data satisfying a given condition rely on indexes to be answered efficiently.
As an example, consider the following SQL query:
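A minimal sketch of such a query, which looks up a single employee by its key, is shown below; the literal key value is illustrative.

SELECT *
FROM Employee
WHERE EmployeeKey = 1234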
Without an index on attribute EmployeeKey, we should perform a complete scan of table Employee (unless it is ordered), whereas with the help of an index over such attribute, a single disk block access will do the job since this attribute is a key for the relation.
While indexing enables quick data retrieval, it also requires the index to be updated with every change to an indexed attribute, which can hinder update performance. Therefore, database administrators must exercise caution when defining indexes, since excessive indexing can negatively impact performance. The B+-tree is the most widely used indexing method in relational databases, and all major database vendors support various forms of B+-tree indexes.
A B+-tree index is a hierarchical structure consisting of a root node and pointers to lower levels, with the leaves containing the record identifiers of the associated data. Typically, each node's size matches that of a disk block, allowing a large number of keys per node, which results in a tree with few levels and fast record retrieval. This indexing method is particularly effective when the indexed attribute is a key of the file or when there are few duplicate values.
Queries in OLAP systems differ significantly from those in OLTP systems, which has led to the development of new indexing strategies. The main indexing requirements of data warehouse systems are as follows:
• Symmetric partial match queries: OLAP queries often involve ranges over several dimensions of a data cube, such as "Total sales from January 2006 to December 2010." To efficiently support such queries, all dimensions must be indexed symmetrically so that they can be searched simultaneously.
• Indexing at multiple levels of aggregation: Summary tables can be very large, and queries often ask for specific values of aggregated data. Summary tables must therefore be indexed in the same way as base, nonaggregated tables.
• Efficient batch update: Since updates in OLAP systems are less critical than in OLTP systems, more columns can be indexed. However, the refresh time of the data warehouse must be considered when designing the indexing schema, since the time needed to rebuild the indexes after a refresh can increase the warehouse's downtime.
• Sparse data: Typically, only 20% of the cells in a data cube are nonempty. The indexing schema must thus be able to deal efficiently with sparse and nonsparse data.
Bitmap indexes and join indexes are commonly used in data warehouse systems to cope with these requirements. We study these indexes next.
Figure 8.10a shows a simplified Product table containing six products. Figures 8.10b,c show the bitmap indexes for the attributes QuantityPerUnit and UnitPrice, which have, respectively, four and five distinct values. For each attribute value, a bit vector is created whose length equals the number of rows in the Product table, in this case six. A '1' in position i of a vector indicates that the product in row i has the attribute value corresponding to that vector, while a '0' indicates that it does not.
For example, in the first row of the table in Fig. 8.10b, the '1' in the vector labeled 25 indicates that the corresponding product (p1) has the value 25 for QuantityPerUnit. For readability, the product key is included in the first column of the bitmap index, although this column is not actually part of the index.
ProductKey  ProductName  QuantityPerUnit  UnitPrice  Discontinued  CategoryKey
p1          prod1        25               60         No            c1
p2          prod2        45               60         Yes           c1
p3          prod3        50               75         No            c2
p4          prod4        50               100        Yes           c2
p5          prod5        50               120        No            c3
p6          prod6        70               110        Yes           c4
Fig 8.10 An example of bitmap indexes for a Product dimension table ( a ) Product dimension table; ( b ) Bitmap index for attribute QuantityPerUnit ; ( c ) Bitmap index for attribute UnitPrice
To find the products with a unit price of 75, the query processor uses the bitmap index on the UnitPrice attribute of the Product table. It identifies the bit vector corresponding to the value 75 and locates the positions containing a '1', which indicate the matching records, in this case the third row of the table.
Queries that specify a search range, such as "Products with between 45 and 55 pieces per unit and a unit price between 100 and 200," require additional steps. First, we consult the index over QuantityPerUnit and identify the bit vectors labeled 45 and 50; the products with between 45 and 55 pieces per unit are those having a '1' in at least one of these vectors.
As shown in Fig. 8.11, identifying the products with a quantity per unit between 45 and 55 and a unit price between 100 and 200 requires an OR operation over the relevant QuantityPerUnit vectors and another OR over the relevant UnitPrice vectors, followed by an AND between the two resulting vectors. For the UnitPrice attribute, we take the vectors labeled 100, 110, and 120. This yields two vectors, OR1 and OR2, and the final AND between them shows that products p4 and p5 satisfy both conditions and thus answer the query.
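In SQL, the range query evaluated above with the bitmap vectors would simply be written as follows; the choice of the bitwise operations is made by the query processor, not by the user.

SELECT ProductKey, ProductName
FROM Product
WHERE QuantityPerUnit BETWEEN 45 AND 55
  AND UnitPrice BETWEEN 100 AND 200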
Bitmap indexes speed up query evaluation because they exploit efficient bitwise operations such as AND, OR, and NOT. Since these rely on simple bit comparisons, the resulting bit vector is produced with very little CPU overhead.
Bitmap indexes are most effective for attributes with low cardinality, since in that case they occupy much less space than traditional B+-tree indexes. For instance, in a Product table with 100,000 rows, a bitmap index on the UnitPrice attribute would require only 0.075 MB, whereas a B+-tree index would need approximately 0.4 MB. This space efficiency makes bitmap indexes attractive in OLAP systems. They are, however, not well suited to OLTP environments, since frequent updates are expensive for bitmap indexes, and the page-level locking used in these systems can lead to concurrency problems, given that a locked page may block access to a large number of index entries.
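In systems that support bitmap indexes natively, such as Oracle, declaring one is analogous to declaring a B+-tree index; the following is a sketch (SQL Server, as discussed later in this chapter, does not provide this statement):

CREATE BITMAP INDEX Product_UnitPrice_BIX
ON Product (UnitPrice)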
Bitmap indexes are generally sparse, with a small number of '1's among many '0's, which makes them well suited to compression. Even without compression, bitmap indexes are more space efficient than B+-trees for low-cardinality attributes, and compression also makes it possible to handle high-cardinality attributes. A drawback of compression is the decompression overhead incurred during query evaluation. A well-known compression technique is run-length encoding (RLE), which is the basis of many advanced methods discussed in the bibliographic notes section of this chapter.
Evaluation of Star Queries
Star queries leverage the star schema structure by joining the fact table with the dimension tables. As an example over the Northwind data warehouse, the following star query asks for the total sales of discontinued products, by customer name and product name:
SELECT C.CustomerName, P.ProductName, SUM(S.SalesAmount)
FROM Sales S, Customer C, Product P
WHERE S.CustomerKey = C.CustomerKey AND
    S.ProductKey = P.ProductKey AND P.Discontinued = 'Yes'
GROUP BY C.CustomerName, P.ProductName
We will study now how this query is evaluated by an engine using the indexing strategies studied above.
To evaluate our example query, we assume a B+-tree index on the dimension keys CustomerKey and ProductKey, a bitmap index on the Discontinued attribute of the Product dimension table, and bitmap indexes on the foreign key columns of the Sales fact table. The Product and Customer dimension tables and the Sales fact table are shown in Fig. 8.13a, c, and d, while the corresponding bitmap indexes are shown in Fig. 8.13b, e, and f.
The OLAP engine evaluates the query by first identifying in the bitmap index the records satisfying the condition Discontinued = 'Yes', which correspond to the ProductKey values p2, p4, and p6. It then accesses the bitmap vectors of the Sales fact table labeled with these values to perform the join between Product and Sales. Only the vectors for p2 and p4 yield matches, since there is no fact record for product p6. The relevant rows of the fact table, namely, the third, fourth, and sixth rows, are identified by the presence of a '1' in the corresponding vectors. From these rows, the CustomerKey values c2 and c3 are obtained.
ProductKey  ProductName  Discontinued
p1          prod1        No
p2          prod2        Yes
p3          prod3        No
p4          prod4        Yes
p5          prod5        No
p6          prod6        Yes

CustomerKey  CustomerName  Address          PostalCode
c1           cust1         35 Main St       7373
c2           cust2         Av Roosevelt 50  1050
c3           cust3         Av Louise 233    1080
c4           cust4         Rue Gabrielle    1180
Figure 8.13 illustrates the evaluation of the star query using bitmap indexes: it shows the Product and Customer dimension tables and the Sales fact table, together with bitmap indexes on the attributes Discontinued, CustomerKey, and ProductKey. Then, using the B+-tree indexes over the dimension tables, the product and customer names satisfying the query are retrieved, which completes the join of the dimension tables with the fact table. The corresponding records are cust2, cust3, prod2, and prod4, and the final answer of the query is composed of the tuples (cust2, prod2, 200) and (cust3, prod4, 100).
Note that the last join with Customer would not be needed if the query had been of the following form:
SELECT S.CustomerKey, P.ProductKey, SUM(SalesAmount)
FROM Sales S, Product P
WHERE S.ProductKey = P.ProductKey AND P.Discontinued = 'Yes'
GROUP BY S.CustomerKey, P.ProductKey
The query above only mentions attributes of the fact table Sales. Thus, the only join that needs to be performed is the one between Product and Sales.
We illustrate now the evaluation of star queries using bitmap join indexes.
The idea is to create a bitmap index on the fact table based on an attribute of a dimension table; this requires precomputing the join between the two tables and building the bitmap index for the dimension attribute over the fact table. For instance, a bitmap join index between Sales and Product on the Discontinued attribute allows the sales facts corresponding to discontinued products to be retrieved directly: it suffices to locate the vector labeled 'Yes' and examine the bits set to '1'. This spares the first steps that were required when plain bitmap indexes were used, at the price of precomputing the index offline.
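As an illustration, in Oracle a bitmap join index corresponding to this example could be declared as sketched below; the index name is arbitrary, and the join column of the dimension table is assumed to be declared as its primary key, as such systems require.

CREATE BITMAP INDEX Sales_ProdDiscontinued_BJIX
ON Sales (Product.Discontinued)
FROM Sales, Product
WHERE Sales.ProductKey = Product.ProductKey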
This strategy can significantly lower evaluation costs, especially when the SELECT clause does not include dimension attributes, in which case no join with the dimension tables through the B+-tree indexes is needed. In that case, the alternative query above can be answered, even in the worst case, with a single scan of the Sales table.
Partitioning
Partitioning divides a large table into smaller, more manageable pieces called partitions, in order to improve data management and processing efficiency. From the application point of view, a partitioned table behaves like a nonpartitioned one, so SQL statements are not affected by partitioning. Both tables and indexes can be partitioned, in flexible configurations; for example, a partitioned index can be defined on an unpartitioned table. There are two primary ways of partitioning: horizontal and vertical. Horizontal partitioning divides a table into smaller tables with the same structure but fewer records; it is particularly useful for queries that focus on recent data, and it simplifies data warehouse refreshing, since only the most recent partition needs to be accessed. Vertical partitioning, on the other hand, splits the attributes of a table into groups that are stored independently; a key must be kept in all partitions so that the original tuples can be reconstructed, and frequently accessed attributes can be stored together for better performance.
Partitioning enhances performance because smaller partitions can be loaded into memory and processed more efficiently. Several horizontal partitioning strategies exist. Round-robin partitioning distributes tuples evenly across partitions, facilitating parallel access, but retrieving an individual tuple requires scanning the whole relation. Hash partitioning applies a hash function to distribute rows uniformly, so that exact-match queries can be directed to a single partition. Range partitioning assigns tuples to partitions according to ranges of attribute values, which makes it particularly effective for temporal data, such as partitioning by date.
For example, the partition for a given month contains only the rows whose dates fall within that month. Range partitioning copes with nonuniform data distributions and supports both exact-match and range queries. List partitioning assigns tuples to partitions based on enumerated attribute values, while composite partitioning combines methods, such as range and hash partitioning, for more complex data organizations. Partition pruning allows queries to access only the partitions containing relevant data, significantly reducing response times; for instance, in a Sales fact table partitioned by month, a query over a given period reads only the corresponding partitions. Partitioning also improves the performance of joins, in particular when both tables involved are partitioned on their join attributes or when the referenced table is partitioned on its primary key.
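For example, assuming the Sales fact table is range-partitioned by month on a date column OrderDate (an illustrative schema), a query such as the following reads only the partition of January 2021 thanks to partition pruning:

SELECT SUM(SalesAmount)
FROM Sales
WHERE OrderDate >= DATE '2021-01-01'
  AND OrderDate <  DATE '2021-02-01'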
In scenarios where large joins are involved, breaking them into smaller joins between partitions can lead to substantial performance improvements, especially when leveraging parallel execution.
Partitioning also simplifies database management, since tables and indexes are broken into smaller, more manageable pieces on which maintenance operations can be targeted. For instance, a database administrator can back up a single partition rather than the entire table. Partitioning is likewise beneficial for index maintenance, enabling updates to specific data segments without affecting the whole index. It also improves data availability: if some partitions become unavailable, the others can remain online, so that applications continue to function. Furthermore, because each partition can reside in a separate tablespace, backup and recovery tasks can be executed independently, allowing the most active data to be recovered sooner than would be possible with an unpartitioned table.
Parallel Processing
Parallel query processing converts queries into execution plans that can be efficiently executed in parallel on multiprocessor systems. Efficient parallel execution is essential for optimizing performance measures such as query response time and throughput. Achieving it requires both parallel execution techniques and query optimization methods that identify the most efficient plan among a set of equivalent ones.
Parallelism in query processing comes in two main forms: interquery parallelism and intraquery parallelism. Interquery parallelism processes multiple queries simultaneously, whereas intraquery parallelism executes different steps of a single query concurrently. The two forms can be combined.
Parallel algorithms for query processing rely on data partitioning and must balance the degree of parallelism against communication costs. Not all operations can be parallelized; for instance, highly sequential algorithms like Quicksort are less suitable for parallel execution. In contrast, the sort-merge algorithm is well suited to parallel processing and is widely used in shared-nothing architectures.
There are three main parallel join algorithms for partitioned relations, derived from their nonparallel counterparts: the parallel sort-merge join, the parallel nested-loop join, and the parallel hash join. The parallel sort-merge join sorts both relations on the join attribute using a parallel merge sort and then performs a merge-like join on a single node. The parallel nested-loop join computes the Cartesian product of the two relations in parallel, allowing arbitrary join predicates beyond the equijoin. Finally, the parallel hash join addresses only the equijoin; it partitions the two relations R and
S into the same number p of mutually exclusive fragments R_i and S_i such that the join between R and S is equal to the union of the joins between R_i and S_i. Each join between R_i and S_i is performed in parallel. Of course, the algorithms above are the basis of actual implementations, which exploit main memory and multicore processors.
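In symbols, assuming that R and S are partitioned on the join attribute with the same hash function into p fragments each, the property exploited by the parallel hash join is:

R ⋈ S = (R_1 ⋈ S_1) ∪ (R_2 ⋈ S_2) ∪ · · · ∪ (R_p ⋈ S_p)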
PostgreSQL implements interquery parallelism by serving multiple connections in a multiuser environment and by processing the same query over different partitions, which is essential for OLAP queries. It also supports intraquery parallelism, allowing a single query to be executed by several parallel Workers so as to exploit multicore processors.
Fig 8.14 A parallel query plan in PostgreSQL
In PostgreSQL, a parallel query plan involves three kinds of components: the Process, the Gather node, and the Workers. The Process manages overall query execution. When a query can be parallelized, a Gather node is introduced as the root of the parallelizable part of the plan, and a number of Workers (background worker processes that run in parallel) are allocated to it. The Process executes all the serial parts of the query and divides the blocks of the scanned relations among the Workers so that access remains sequential. The Workers communicate with the Process through shared memory and return their partial results to it for final assembly. PostgreSQL supports parallel sequential scans, where the blocks of a table are divided among the Workers; parallel index scans, where the Workers cooperate to scan the pages of a B-tree index; and parallel aggregation, which follows a divide-and-conquer approach in which PartialAggregate nodes send their outputs to a FinalizeAggregate node that computes the final result. Parallel merge joins are also supported, where each Worker executes the inner loop of the join and sends its results to the Gather node, which produces the final output.
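As an illustration, the following PostgreSQL sketch asks the optimizer for the plan of an aggregate query over the Sales fact table. Depending on the table size and on the configuration parameter shown, the resulting plan contains the Gather, PartialAggregate, and FinalizeAggregate nodes described above; the table and column names follow the Northwind warehouse used in this chapter.

-- allow up to four Workers per Gather node (illustrative setting)
SET max_parallel_workers_per_gather = 4;
-- inspect the parallel plan chosen by the optimizer
EXPLAIN
SELECT CustomerKey, SUM(SalesAmount)
FROM Sales
GROUP BY CustomerKey;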
PostgreSQL allows users to declaratively define partitions, which improves query performance and bulk loading by operating on smaller tables, and also enables a higher degree of parallelism. Tables are partitioned according to a partition key, a specified list of columns or expressions.
We explain next the various kinds of partitioning supported, using a table Employee(EmpNbr, Position, Salary) as an example.
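Since the examples below show only the partitions themselves, we sketch here how the partitioned parent table would be declared; the list-partitioned case is shown, and for the range and hash examples the final clause would be PARTITION BY RANGE (Salary) and PARTITION BY HASH (EmpNbr), respectively (the column types are assumptions).

CREATE TABLE Employee (
    EmpNbr   INTEGER,
    Position TEXT,
    Salary   NUMERIC
) PARTITION BY LIST (Position);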
A list partition contains the rows with specific values of the partition key, while a default partition accommodates the values that do not belong to any defined partition. In our example, each position is assigned its own partition; only one partition is shown next.
CREATE TABLE EmployeeManager PARTITION OF Employee
FOR VALUES IN ('Manager');
A range partition contains the rows whose partition key values fall within a specified range, defined by a minimum and a maximum value, where the minimum is inclusive and the maximum is exclusive. For example, we can create three partitions based on salary ranges:
CREATE TABLE EmployeeSalaryLow PARTITION OF Employee
FOR VALUES FROM (MINVALUE) TO (50000);
CREATE TABLE EmployeeSalaryMiddle PARTITION OF Employee
FOR VALUES FROM (50000) TO (100000);
CREATE TABLE EmployeeSalaryHigh PARTITION OF Employee
FOR VALUES FROM (100000) TO (MAXVALUE);
A hash partition is defined by specifying a modulus and a remainder: each partition contains the rows for which the hash value of the partition key, divided by the modulus, yields the given remainder, so that the data are evenly distributed across the partitions.
An example is as follows (only one partition is shown, using a modulus of 4 and remainder 0):
CREATE TABLE EmployeeHash1 PARTITION OF Employee
FOR VALUES WITH (MODULUS 4, REMAINDER 0);
Finally, a multilevel partition is created by partitioning an already partitioned table. For example, we can combine the range and list partitioning above as shown next:
CREATE TABLE EmployeeSalaryLow PARTITION OF Employee
FOR VALUES FROM (MINVALUE) TO (50000)
PARTITION BY LIST (Position);
CREATE TABLE EmployeeSalaryLow_Manager PARTITION OF EmployeeSalaryLow
FOR VALUES IN ('Manager');
Table partitions are used by PostgreSQL for query processing. Consider, for example, the following query:
The query plan devised by the query optimizer is as follows, according to the architecture shown in Fig.8.14.
Parallel Seq Scan on EmployeeSalaryHigh
Parallel Seq Scan on EmployeeSalaryMiddle
Parallel Seq Scan on EmployeeSalaryLow
The optimizer determines the number of Workers required for a query according to the available resources; for instance, even if eight cores are available, it may decide that four Workers suffice, since using more could degrade performance. The Parallel Append node distributes the Workers across the partitions: one partition may be scanned by two Workers while the others are handled by one each, and once a Worker finishes with a partition it is reassigned so that the Workers remain evenly distributed across the remaining partitions.
Physical Design in SQL Server and Analysis Services
This section explores how the concepts studied in this chapter are applied in Microsoft SQL Server. We start with the support for materialized views. We then introduce the column-store index, a particular kind of index provided by SQL Server. Following this, we consider partitioning and describe how the three types of multidimensional data representation, namely, ROLAP, MOLAP, and HOLAP, are implemented in Analysis Services.
In SQL Server, indexed views play the role of materialized views: creating a unique clustered index on a view precomputes and materializes it. This technique is essential for optimizing performance, particularly in data warehouse environments.
Creating an indexed view requires that both the view and the base tables satisfy a number of conditions. The view definition must be deterministic, that is, all expressions in the SELECT, WHERE, and GROUP BY clauses must always yield the same result. For instance, the DATEADD function is deterministic, while GETDATE is not, since it returns a different result each time it is executed. In addition, indexed views must be created with the SCHEMABINDING option, which prevents modifications to the base tables that would invalidate the view definition. As an example, the following indexed view computes the total sales by employee from the Sales fact table of the Northwind data warehouse:
CREATE VIEW EmployeeSales WITH SCHEMABINDING AS (
SELECT EmployeeKey, SUM(UnitPrice * OrderQty * Discount)
AS TotalAmount, COUNT_BIG(*) AS SalesCount FROM Sales
GROUP BY EmployeeKey )
CREATE UNIQUE CLUSTERED INDEX CI_EmployeeSales ON
EmployeeSales(EmployeeKey)
An indexed view can be used in two ways: it can be referenced directly in a query, or the query optimizer can use it even when it is not explicitly mentioned, if this results in a cheaper query plan.
When a query references a view, the view definition is normally expanded until it refers only to base tables, a process known as view expansion. To prevent this, the NOEXPAND hint instructs the query optimizer to treat the view as an ordinary table with a clustered index, as in the following fragment:
FROM Employee, EmployeeSales WITH (NOEXPAND)
When an indexed view is not explicitly referenced in a query, the query optimizer determines whether it can be used anyway. In this way, existing applications can benefit from newly created indexed views without being modified. To determine whether an indexed view covers a query fully or partially, several conditions are checked; for example, the tables in the query's FROM clause must include all tables in the view's FROM clause, the join conditions in the query must include those in the view, and the aggregate columns in the query must be derivable from a subset of the aggregate columns in the view.
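For example, the optimizer may answer the following query directly from the indexed view EmployeeSales defined above, even though the view is not mentioned in the query; whether it actually does so depends on the SQL Server edition and on the estimated costs.

SELECT EmployeeKey, SUM(UnitPrice * OrderQty * Discount) AS TotalAmount
FROM Sales
GROUP BY EmployeeKey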
When indexed views are created on a partitioned table, SQL Server automatically partitions the indexed view using the same partition scheme as the table, producing what is called a partition-aligned indexed view. A key advantage of this approach is that the query processor automatically maintains the indexed view when new partitions of the table are added, so that the view does not have to be dropped and recreated. This feature significantly enhances the manageability of indexed views.
We show next how to create a partition-aligned indexed view on the Sales fact table of the Northwind data warehouse. To ease maintenance and improve efficiency, we partition the fact table by year.
To create a partition scheme, we need first to define the partition function.
We want to partition the table by year, from 2016 to 2018. The partitioning uses a function named PartByYear, which takes as argument an integer corresponding to the surrogate key values of the Date dimension.
CREATE PARTITION FUNCTION [PartByYear] (INT)
AS RANGE LEFT FOR VALUES (184, 549, 730);
The surrogate keys 184, 549, and 730 correspond to the dates 31/12/2016, 31/12/2017, and 31/12/2018 and define the boundaries of the partition intervals. With RANGE LEFT, records with values less than or equal to 184 are assigned to the first partition, those greater than 184 and up to 549 to the second partition, those greater than 549 and up to 730 to the third partition, and records with values greater than 730 to the fourth partition.
Once the partition function has been defined, the partition scheme is created as follows:
CREATE PARTITION SCHEME SalesPartScheme
AS PARTITION [PartByYear] ALL TO ( [PRIMARY] );
Here, PRIMARY means that the partitions will be stored in the primary filegroup, that is, the filegroup containing the startup information of the database. Filegroup names can also be given, and several filegroups can be specified. The keyword ALL indicates that all partitions will be stored in the primary filegroup.
The Sales fact table is created as a partitioned table as follows:
CREATE TABLE Sales (CustomerKey INT, EmployeeKey INT,
    OrderDateKey INT, ...) ON SalesPartScheme(OrderDateKey)
The clause ON SalesPartScheme(OrderDateKey) states that the table will be partitioned according to SalesPartScheme and that the partition function will take OrderDateKey as argument.
Now we create an indexed view over the Sales table, as explained in Sect. 8.9.1. We first create the view.
CREATE VIEW SalesByDateProdEmp WITH SCHEMABINDING AS (
SELECT OrderDateKey, ProductKey, EmployeeKey, COUNT_BIG(*) AS Cnt,
SUM(SalesAmount) AS SalesAmount FROM Sales
GROUP BY OrderDateKey, ProductKey, EmployeeKey )
Finally, we materialize the view.
CREATE UNIQUE CLUSTERED INDEX UCI_SalesByDateProdEmp
ON SalesByDateProdEmp (OrderDateKey, ProductKey, EmployeeKey)
Since the clustered index was created using the same partition scheme, this is a partition-aligned indexed view.
SQL Server provides column-store indexes, which store data column by column, similarly to vertical partitioning, and can significantly boost the performance of certain kinds of queries. The considerations made above about bitmap indexes and their use in star-join evaluation also apply to column-store indexes. A detailed study of column-store databases is given in Chapter 15.
As an example, a column-store index can be defined to speed up access to a materialized view Sales2012, which selects from the Sales fact table the data of the year 2012. Since the attributes DueDateKey, EmployeeKey, and SalesAmount are frequently requested together, a column-store index over them optimizes the corresponding queries on the Sales2012 view.
CREATE NONCLUSTERED COLUMNSTORE INDEX CSI_Sales2012
ON Sales2012 (DueDateKey, EmployeeKey, SalesAmount)
Column-store indexes have significant limitations; in particular, they cannot be defined on tables that are frequently updated. Consequently, a column-store index cannot be created on the original Sales fact table, which must be updated; instead, the index is defined on a view.
SQL Server does not support bitmap indexes. Instead, it uses bitmap filters, which are created at execution time by the query processor to filter the values of a table. Bitmap filters can be introduced in the query plan after optimization or dynamically by the query optimizer during query plan generation; the latter are called optimized bitmap filters. They can significantly improve the performance of data warehouse queries over star schemas by removing nonqualifying rows of the fact table early in the query plan. Note that this mechanism differs from the traditional bitmap indexes found in other database systems such as Oracle and Informix.
In Analysis Services, a partition is a container for a portion of the data of a measure group. Defining a partition requires specifying:
• Basic information, like name of the partition, the storage mode, and the processing mode.
• Slicing definition, which is an MDX expression specifying a tuple or a set.
• Aggregation design, which is a collection of aggregation definitions that can be shared across multiple partitions.
Query Performance in Analysis Services
We now briefly describe how query performance can be enhanced in Analysis Services through several techniques.
The first step must be to optimize cube and measure group design. For this, many of the issues studied in this book apply.
Fig 8.16 Template query that defines a partition
For example, it is suggested to define cascading attribute relationships, such as Day → Month → Quarter → Year, and to define user hierarchies of related attributes within each dimension. Such natural hierarchies are advantageous because they are materialized on disk and are automatically considered as aggregation candidates. To enhance query performance, redundant relationships between attributes should be removed, and the cube space should be reduced to the measure groups that are actually needed. Measures that are frequently queried together should be placed in the same measure group, since retrieving measures from multiple groups requires additional storage engine operations; conversely, large sets of measures that are not commonly queried together should be separated into distinct measure groups. Large parent-child hierarchies should also be avoided, since aggregations are created only for the key attribute and the top attribute, so that queries addressing intermediate levels are slower. Finally, many-to-many dimensions should be used with care, since answering queries over them requires run-time joins between the data measure group and the other dimensions involved.
To optimize query performance in Analysis Services, it is also crucial to define efficient aggregations, which reduce the number of records that the storage engine must scan. While aggregations can enhance query speed, the time required to create and refresh them must be weighed against their benefits. Excessive aggregations can actually degrade performance, particularly when a summary table is used infrequently, since it may displace more relevant data from the cache. Therefore, the number of aggregations should be kept limited.
The Analysis Services aggregation design algorithm does not automatically consider all attributes for aggregation, so it is necessary to review which attributes are included and to suggest additional candidates when needed. This is particularly important when user queries that cannot be answered from the cache are resolved through partition reads instead of aggregation reads. To influence this process, Analysis Services provides the Aggregation Usage property, which can be set to one of four values: full, none, unrestricted, or default. This property allows administrators to control which attributes are considered for aggregation.
Partitions in Analysis Services improve query performance by allowing smaller data sets to be accessed when the data cache or the aggregations cannot answer a query. To enhance efficiency, data should be partitioned according to the common query patterns, and configurations in which most queries must access many partitions should be avoided. The vendor recommends that each partition contain between 2 million and 20 million records and that each measure group have fewer than 2,000 partitions. In addition, a separate ROLAP partition must be defined for real-time data, and this partition must have its own measure group.
To enhance the performance of MDX queries, run-time checks in calculations should be avoided, since they can significantly slow down execution. Instead of frequently evaluated CASE and IF functions, the SCOPE function should be used to rewrite the query more efficiently. Specifying Non_Empty_Behavior allows the query engine to use the bulk evaluation mode, and using EXISTS for filtering, rather than relying on member properties, can further improve performance. Minimizing the use of subqueries and filtering sets before cross joins also help to reduce the cube space and improve overall query efficiency.
Finally, the cache of the query engine should be exploited. For this, the server must have enough memory to store query results for future use. Calculations should be defined in MDX scripts, since these have a global scope that allows the cache to be shared across sessions with the same security permissions. The cache should also be warmed by executing a set of predefined queries with any tool. The techniques for tuning the cache are similar to those used for relational databases and aim at optimizing memory and processor usage; we refer the reader to the Analysis Services documentation for details.
Summary
In this chapter, we studied the problem of physical data warehouse design, emphasizing three key techniques: view materialization, indexing, and partitioning. Regarding view materialization, we addressed the problem of incremental view maintenance, that is, how and when a view can be updated without recomputing it from scratch.
We introduced efficient algorithms for computing a data cube when all the views are materialized and presented methods for selecting a good set of views to materialize under constraints when full materialization is not feasible. We then studied two indexing schemes commonly used in data warehousing, bitmap and join indexes, and their role in query evaluation. We also explored partitioning and parallel processing techniques that improve data warehouse performance and manageability. The final sections showed how these concepts are applied in the physical design and query performance of SQL Server and Analysis Services.
Bibliographic Notes
A comprehensive resource on physical database design is [140], while SQL Server is covered in [56, 227]. Incremental view maintenance is explored in [95, 96], and the summary table algorithm was developed by Mumick et al. [159]. The PipeSort algorithm and other data cube computation techniques are detailed in [3]. The classic view selection algorithm was introduced by Harinarayan et al. [100]. Bitmap indexes were first presented in [173], while bitmap join indexes are discussed in [172]. A study of the combined use of indexing, partitioning, and view materialization in data warehouses is given in [25]. A book on indexing structures for data warehouses is [123], and index selection is studied in [85]. A survey on bitmap indexes for data warehouses is given in [223], which also covers the popular WAH (Word Align Hybrid) bitmap compression technique [266] and its more efficient variation PLWAH (Position List Word Align Hybrid) [223]. Rizzi and Saltarelli [198] compare view materialization and indexing in data warehouse design, while a survey of view selection methods is presented in [146].
Review Questions
8.1 What is the objective of physical data warehouse design? Specify different techniques that are used to achieve such objective.
8.2 Discuss advantages and disadvantages of using materialized views.
8.3 What is view maintenance? What is incremental view maintenance?
8.4 Discuss the kinds of algorithms for incremental view maintenance, that is, using full and partial information.
8.5 What are self-maintainable aggregate functions and views?
8.6 Explain briefly the main idea of the summary-delta algorithm for data cube maintenance.
8.7 How is data cube computation optimized? What are the kinds of optimizations that algorithms are based on?
8.8 Explain the idea of the PipeSort algorithm.
8.9 How can we estimate the size of a data cube?
8.10 Explain the algorithm for selecting a set of views to materialize. Discuss its limitations. How can they be overcome?
8.11 Compare B+-tree indexes, hash indexes, bitmap indexes, and join indexes with respect to their use in databases and data warehouses.
8.12 How do we use bitmap indexes for range queries?
8.14 Describe a typical indexing scheme for star and snowflake schemas.
8.15 How are bitmap indexes used during query processing?
8.16 How do join indexes work in query processing? Explain for which kinds of queries they are efficient.
8.17 Explain and discuss horizontal and vertical data partitioning.
8.18 Discuss different horizontal partitioning strategies When would you use each of them?
8.19 Explain two techniques for increasing query performance taking advantage of partitioning.
8.20 Explain how parallel query processing is implemented in PostgreSQL.
8.21 Discuss the characteristics of storage modes in Analysis Services.
8.22 How do indexed views compare with materialized views?
Exercises
Exercise 8.1. In the Northwind database, consider the relations
Employee(EmplID, FirstName, LastName, Title, ...)
Orders(OrderID, CustID, EmpID, OrderDate, ...)
and the view
EmpOrders(EmpID, Name, OrderID, OrderDate)
computed from the full outer join of tables Employee and Orders, where Name is obtained by concatenating FirstName and LastName.
Write the SQL command defining the view EmpOrders. Show, by means of examples, how the view is modified when employees are inserted into or deleted from the Employee table. Write the SQL command that computes the delta relation of the view from the delta relations of table Employee, and write an algorithm to update the view from this delta relation.
Exercise 8.2. Consider a relation Connected(CityFrom, CityTo, Distance), which indicates pairs of cities that are directly connected and the distance between them, and a view OneStop(CityFrom, CityTo), which computes the pairs of cities (c1, c2) such that c2 can be reached from c1 passing through exactly one intermediate stop. Answer the same questions as those of the previous exercise.
Exercise 8.3.Consider the following tables
Store(StoreID, City, State, Manager)
Order(OrderID, StoreID, Date)
OrderLine(OrderID, LineNo, ProductID, Quantity, Price)
Product(ProductID, ProductName, Category, Supplier)
Part(PartID, PartName, ProductID, Quantity) and the following views
• ParisManagers(Manager) that contains the managers of stores located in Paris.
• OrderProducts(OrderID, ProductCount) that contains the number of products for each order.
• OrderSuppliers(OrderID, SupplierCount) that contains the number of suppliers for each order.
• OrderAmount(OrderID, StoreID, Date, Amount) that adds to the table Order an additional column containing the total amount of each order.
• StoreOrders(StoreID, OrderCount) that contains the number of orders for each store.
• ProductPart(ProductID, ProductName, PartID, PartName) that is obtained from the full outer join of tables Product and Part.
For each of the above views, discuss whether it is self-maintainable with respect to insertions and with respect to deletions. Justify your answers by means of examples.
Exercise 8.4. Consider the tables
Professor(ProfNo, ProfName)
Supervision(ProfNo, StudNo)
PhDStudent(StudNo, StudName, Laboratory)
and a view ProfPhdStud(ProfNo, ProfName, StudNo, StudName) computed from the outer joins of these three relations.
Discuss whether the view is self-maintainable with respect to insertions into and deletions from the table Supervision. Write the SQL command that creates the view, and show, by means of examples, the delta tables resulting from insertions and deletions in Supervision and how the view is recomputed from them.
Exercise 8.5. By means of examples, explain the propagate and refresh algorithm for the aggregate functions AVG, MIN, and COUNT. For each aggregate function, write the SQL command that creates the summary-delta table from the tables containing the inserted and deleted tuples in the fact table, and write the algorithm that refreshes the view from the summary-delta table.
Exercise 8.6. Suppose that a cube Sales(A, B, C, D, Amount) has to be fully materialized. The cube contains 64 tuples. Sorting takes the typical n log(n) time. Every GROUP BY with k attributes has 4 × 2^k tuples.
a. Compute the cube using the PipeSort algorithm.
b. Compute the gain of applying PipeSort compared to the cost of computing all the views from scratch.
Exercise 8.7. Consider the graph in Fig. 8.19, where each node represents a view and the numbers are the costs of materializing the view. Assuming that the bottom of the lattice is materialized, determine using the View Selection Algorithm the five views to be materialized first.
Exercise 8.8. Consider the data cube lattice of a three-dimensional cube with dimensions A, B, and C. Extend the lattice to take into account the hierarchies A → A1 → All and B → B1 → B2 → All. Since the lattice is complex to draw, represent it by giving the list of nodes and the list of edges.
Exercise 8.9. Consider an n-dimensional cube with dimensions D_1, D_2, ..., D_n. Suppose that each dimension D_i has a hierarchy with n_i levels. Compute the number of nodes of the corresponding data cube lattice.
Exercise 8.10. Modify the algorithm for selecting views to materialize in order to take into account the probability that each view has to completely match a given query. In other words, consider that you know the distribution of the queries, so that view A has probability P(A) of matching a query, view B has probability P(B), etc.
a. How would you change the algorithm to take this knowledge into account?
b. Suppose that in the lattice of Fig. 8.9, the view ABC is already materialized. Apply the modified algorithm to select four views to be materialized given the following probabilities for the views: P(ABC) = 0.1, ..., P(C) = 0.1, and P(All) = 0.1.
c. Answer the same question as in (b) but now with the probabilities P(ABC) = 0.1, P(AB) = 0.05, P(AC) = 0.1, P(BC) = 0, P(A) = 0.2, P(B) = 0.1, P(C) = 0.05, and P(All) = 0.05. Compare the results.
Exercise 8.11. Given the Employee table below, show how a bitmap index on attribute Title would look. Compress the bitmap values using run-length encoding.
Employee
Name             Title  City
Peter Brown      Dr.    Brussels
James Martin     Mr.    Wavre
Ronald Ritchie   Mr.    Paris
Marco Benetti    Mr.    Versailles
Alexis Manoulis  Mr.    London
Maria Mortsel    Mrs.   Reading
Laura Spinotti   Mrs.   Brussels
John River       Mr.    Waterloo
Bert Jasper      Mr.    Paris
Claudia Brugman  Mrs.   Saint-Denis
Exercise 8.12. Given the Sales table below and the Employee table from Ex. 8.11, show how a join index on attribute EmployeeKey would look.
Exercise 8.13. Given the Department table below and the Employee table from Ex. 8.11, show how a bitmap join index on attribute DeptKey would look.
Department
DeptKey  Name             Location
d1       Management       Brussels
d2       Production       Paris
d3       Marketing        London
d4       Human Resources  Brussels
d5       Research         Paris
Exercise 8.14. Consider the tables Sales in Ex. 8.12, Employee in Ex. 8.11, and Department in Ex. 8.13.
a. Propose an indexing scheme for the tables, including any kind of index you consider necessary. Discuss possible alternatives according to several query scenarios, as well as the advantages and disadvantages of creating the indexes.
b. Consider the query:
( D.Location = 'Brussels' OR D.Location = 'Paris' )
Explain a possible query plan that uses the indexes defined in (a).
Extraction, transformation, and loading (ETL) processes are essential for managing data from the various sources of an organization, but their complexity and cost call for approaches that reduce their development and maintenance effort. Conceptual modeling of ETL processes can help achieve this, yet current ETL tools lack a standardized language for defining such processes. This chapter explores the design of ETL processes using the Business Process Model and Notation (BPMN), a widely accepted standard for business process specification. BPMN enables a conceptual, implementation-independent representation of ETL processes, allowing designers to concentrate on their essential characteristics while hiding technical details. The chapter also discusses how BPMN models can be translated into executable specifications for ETL tools and presents an alternative implementation based on an extended relational algebra, which can be translated into SQL and executed on any RDBMS. The chapter begins with an introduction to BPMN, followed by a detailed explanation of its use for the conceptual modeling of ETL processes, illustrated through a case study that loads the Northwind data warehouse. It concludes with an overview of Microsoft Integration Services and a demonstration of the relational algebra implementation of ETL processes, including its translation into SQL.
A business process consists of a set of interconnected activities or tasks performed within an organization to deliver a particular product or service. These tasks can be executed by software systems, by humans, or by a combination of both. Business process modeling consists in representing the business processes of an organization so that they can be analyzed and improved.
Various techniques have been proposed for modeling business processes, including Gantt charts, flowcharts, PERT diagrams, and data flow diagrams. However, these traditional methods often lack formal semantics. On the other hand, formal techniques such as Petri nets have well-defined semantics but can be difficult for business users to understand and may not adequately capture typical real-world scenarios. This gap led to the creation of BPMN (Business Process Model and Notation), which has become the de facto standard for modeling business processes; BPMN 2.0 is the current version.
BPMN provides a graphical notation for defining, understanding, and communicating the business processes of an organization. It aims to be readily usable by the business community while still supporting the complexity inherent to business process modeling. BPMN is defined using the Unified Modeling Language (UML) and includes precise semantics as well as execution semantics. To balance simplicity and expressiveness, BPMN organizes its graphical elements into categories, so that readers can easily identify the fundamental components and understand the diagrams; variations and detailed information can then be added without significantly changing the overall look and feel of a diagram.
BPMN provides four basic categories of elements, namely, flow objects, connecting objects, swimlanes, and artifacts. These are described next.