This chapter introduces the basic concepts of data warehouses. A data warehouse is a particular database targeted toward decision support. It takes data from various operational databases and other data sources and transforms it into new structures that better fit the task of business analysis. Data warehouses are based on a multidimensional model, where data are represented as hypercubes, with dimensions corresponding to the various business perspectives and cube cells containing the measures to be analyzed. In Sect. 3.1, we study the multidimensional model and present its main characteristics and components. Section 3.2 gives a detailed description of the most common operations for manipulating data cubes. In Sect. 3.3, we present the main characteristics of data warehouse systems and compare them against operational databases. The architecture of data warehouse systems is described in detail in Sect. 3.4. As we shall see, in addition to the data warehouse itself, data warehouse systems are composed of back-end tools, which extract data from the various sources to populate the warehouse, and front-end tools, which are used to extract the information from the warehouse and present it to users. We conclude in Sect. 3.5 by describing SQL Server, a representative suite of business intelligence tools.
An Overview of Data Warehousing
In the early 1990s, organizations recognized the need for advanced data analysis to enhance decision-making in a competitive and dynamic environment. Traditional operational databases, designed for daily business operations, fell short as they lacked historical data and struggled with complex queries involving multiple tables. Additionally, integrating data from various operational systems proved challenging due to inconsistencies in data definitions and content. To address these issues, data warehouses emerged as a viable solution to meet the increasing demands of decision-making users.
According to Inmon's classic definition, a data warehouse is a collection of subject-oriented, integrated, nonvolatile, and time-varying data designed to support management decisions. This definition highlights key features, particularly that a data warehouse focuses on specific subjects of analysis tailored to the analytical needs of managers at different decision-making levels. For instance, a retail company's data warehouse may include data for analyzing inventory and product sales.
Data warehousing involves the integration of data from various operational and external systems, resulting in a nonvolatile repository that retains data over extended periods. In a data warehouse, data modification and removal are restricted, allowing only the purging of obsolete information. Additionally, data warehouses are time-varying, enabling the tracking of data evolution, such as sales trends over months or years.

The design of operational databases typically follows four phases: requirements specification, conceptual design, logical design, and physical design. During the requirements specification phase, user needs across the organization are gathered to create a database schema that effectively addresses queries, utilizing conceptual models like entity-relationship diagrams.
The entity-relationship (ER) model outlines an application conceptually, independent of implementation details. This design is subsequently transformed into a logical model, which serves as a framework for database applications. Currently, the relational model is the most widely adopted logical model for databases. The final step involves physical design, which tailors the logical model to a specific implementation platform, resulting in a physical model.
Operational relational databases are highly normalized to maintain consistency under frequent updates and to minimize redundancy, but this can lead to slower query performance because data are partitioned across many tables. This approach is often unsuitable for data warehouses, which need to provide a comprehensive understanding of the data and efficient performance for complex analytical queries. As a result, a different design model, known as multidimensional modeling, has been adopted for data warehouse applications. This model represents data as a collection of facts connected to various dimensions, where facts, such as sales in stores, serve as the focus of analysis and include numeric measures for quantitative evaluation. Dimensions allow these measures to be analyzed from different perspectives, such as examining sales across various stores, over different time periods, or by geographical location. Additionally, dimensions often include hierarchical attributes, enabling users to explore data at multiple levels of detail, like month–quarter–year for time and city–state–country for location.
Data warehouses should be designed similarly to operational databases, following a four-step process: requirements specification, conceptual design, logical design, and physical design. However, the lack of a widely accepted conceptual model for data warehouse applications often leads to complex logical schemas that are not user-friendly. To address this issue, we advocate a conceptual model that operates above the logical level, enhancing the design of data warehouses. In this book, we use the MultiDim model, which effectively captures the intricate characteristics of data warehouses at a higher level of abstraction. A detailed exploration of conceptual modeling for data warehouses is presented in Chapter 4.
The multidimensional model is typically represented through relational tables structured as star schemas and snowflake schemas, which connect a central fact table to multiple dimension tables. Star schemas use a single, denormalized table for each dimension, while snowflake schemas employ normalized tables that reflect the hierarchies within dimensions. An OLAP server then constructs a data cube from this relational representation, offering a multidimensional perspective of the data warehouse. For a deeper treatment of logical modeling, refer to Chapter 5.
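To make the distinction concrete, the following SQL sketch shows a possible star schema for a hypothetical sales cube; the table and column names (SalesFact, DimDate, DimStore, and so on) are illustrative and are not part of the Northwind schema used later in the book.

CREATE TABLE DimDate (
  DateKey   INTEGER PRIMARY KEY,
  FullDate  DATE,
  Month     INTEGER,
  Quarter   INTEGER,
  Year      INTEGER );

CREATE TABLE DimStore (
  StoreKey  INTEGER PRIMARY KEY,
  StoreName VARCHAR(40),
  City      VARCHAR(40),   -- the City-State-Country hierarchy is kept
  State     VARCHAR(40),   -- denormalized in a single dimension table
  Country   VARCHAR(40) );

CREATE TABLE SalesFact (
  DateKey     INTEGER REFERENCES DimDate (DateKey),
  StoreKey    INTEGER REFERENCES DimStore (StoreKey),
  ProductKey  INTEGER,
  Quantity    INTEGER,        -- measure
  SalesAmount DECIMAL(10,2),  -- measure
  PRIMARY KEY (DateKey, StoreKey, ProductKey) );

In a snowflake schema, DimStore would instead be split into normalized Store, City, State, and Country tables linked by foreign keys.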
After implementing a data warehouse, analytical queries can be executed using MDX (MultiDimensional eXpressions), the standard language for multidimensional databases. Recently, Microsoft introduced Data Analysis Expressions (DAX) as an alternative querying language. A comparison of MDX, DAX, and SQL is explored in Chapters 6 and 7.
The physical level focuses on implementation issues, making physical design essential for ensuring adequate response times to complex ad hoc queries. To enhance system performance, three primary techniques are employed: materialized views, indexing, and data partitioning. Notably, bitmap indexes are preferred in data warehousing contexts, while operational databases typically utilize B-tree indexes. Extensive research on these techniques was conducted, especially in the late 1990s, and the findings have been integrated into both traditional and modern OLAP engines for big data. Chapter 8 provides a review and analysis of these advancements.
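As a simple illustration, a materialized view precomputes and stores an aggregate that many analytical queries need; the following sketch uses the PostgreSQL/Oracle syntax and the hypothetical SalesFact and DimDate tables introduced above.

CREATE MATERIALIZED VIEW SalesByMonth AS
  SELECT D.Year, D.Month, F.ProductKey, SUM(F.SalesAmount) AS TotalSales
  FROM SalesFact F JOIN DimDate D ON F.DateKey = D.DateKey
  GROUP BY D.Year, D.Month, F.ProductKey;

Queries asking for monthly sales can then read this much smaller view instead of scanning the whole fact table.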
A significant distinction between operational databases and data warehouses lies in the data integration process: data warehouses extract data from multiple source systems, and these data must be transformed to conform to the warehouse model before being loaded. This process, known as extraction, transformation, and loading (ETL), is critical for the success of data warehousing projects. Despite extensive research, there is still no consensus on a standardized ETL design methodology, and many challenges are addressed in an ad hoc manner, although various proposals exist regarding ETL conceptual design. We study the design and implementation of ETL processes in Chapter 9.
Data analysis involves leveraging the contents of a data warehouse to deliver crucial insights for decision-making. This process relies on three primary tools: querying, which employs the OLAP paradigm to extract relevant data and uncover valuable knowledge; key performance indicators (KPIs), which are measurable objectives that assess organizational performance; and dashboards, which are interactive reports that visually present warehouse data, including KPIs, to offer a comprehensive overview of an organization's performance for informed decision support. Further exploration of data analysis can be found in Chapters 6 and 7.
Designing a data warehouse is a complex process that involves multiple phases, each addressing specific considerations: requirements specification, conceptual design, logical design, and physical design. Requirements can be gathered through three different approaches: collecting input from users, analyzing the source systems, or a combination of both. The chosen approach significantly influences the subsequent conceptual design phase. A comprehensive methodology for data warehouse design is presented in Chapter 10.

Emerging Data Warehousing Technologies
At the start of the 21st century, data warehouse systems had established foundational concepts, yet the field continues to evolve. New data types and models, such as spatial data, have been integrated into both commercial and open-source systems. Additionally, innovative architectures are being developed to manage the vast amounts of data required for modern decision-support systems.
In data warehousing, a common assumption is that dimensions remain static, with only facts and their measures linked to specific time frames. However, this overlooks the reality that dimensions can evolve over time, such as changes in a product's price or category. To address this issue within relational databases, the widely adopted solution is the concept of slowly changing dimensions.
An innovative solution to managing time-varying information is the use of temporal databases, which offer specific structures and mechanisms for this purpose. By integrating temporal databases with data warehouses, we create what is known as temporal data warehouses, enhancing the capability to handle dynamic data effectively.
Current database and data warehouse systems offer limited capabilities for handling time-varying data, making SQL queries complex and often inefficient. Additionally, MDX lacks temporal support, highlighting the need to enhance traditional OLAP operators for better exploration of time-varying data, a concept known as temporal OLAP (TOLAP). Further insights into temporal data warehouses are discussed in Chapter 11.
In real-world scenarios, data warehouse schemas evolve over time to meet new application requirements, often necessitating modifications to the data. This process typically involves removing outdated data and incorporating new information. However, when such modifications are impractical, maintaining multiple schema versions leads to multiversion data warehouses. In these systems, current data is added according to the latest schema, while data from previous schemas is preserved for analysis. This allows users and applications to work with older schema versions while new users can access the most current version, a concept further explored in Chapter 11.
Spatial data has seen a significant rise in usage across diverse fields, including public administration, transportation, environmental systems, and public health. It encompasses both physical objects on the Earth's surface, like streets and cities, as well as geographic phenomena such as temperature and altitude. The volume of spatial data is rapidly increasing, driven by technological advancements in remote sensing and global navigation satellite systems (GNSS), including the Global Positioning System (GPS) and the Galileo system.
Spatial databases are designed for efficient storage and manipulation of spatial data, but they often fall short in supporting decision-making processes. To address this limitation, spatial data warehouses have been developed, integrating spatial database and data warehouse technologies. These warehouses enhance data analysis, visualization, and manipulation capabilities. The analysis performed within these systems is referred to as spatial OLAP (SOLAP), allowing users to explore spatial data similarly to traditional OLAP methods using tables and charts. Further insights into spatial data warehouses are explored in Chapter 12.
The analysis of mobility data, which involves tracking the positions of moving objects over time, has gained significant importance with the advent of advanced positioning devices. For instance, traffic data can now be collected through sequences of GPS signals emitted by vehicles during their journeys, enabling detailed insights into movement patterns. This process is commonly referred to as mobility analysis.
Emerging data warehousing technologies are evolving to accommodate the complexities of mobility data analysis. As moving objects generate extensive sequences of positional data, these sequences are typically segmented into manageable units known as trajectories. This segmentation is crucial for effectively analyzing movement data. Consequently, the development of mobility data warehouses is becoming increasingly important, allowing for more efficient handling of this dynamic information.
The interconnectedness of various domains such as web, transportation, communication, biological, and economic data is effectively modeled using graphs, leading to the emergence of graph databases and analytics, specifically graph data warehousing and OLAP. The property graph data model is pivotal for native graph databases, which store data as nodes and edges, making them particularly efficient for path traversal computations; this is explored in Chapter 13, which focuses on Neo4j, a leading graph database. Additionally, the semantic web serves as a dynamic source of multidimensional information, represented in a machine-processable format through the Resource Description Framework (RDF) and the Web Ontology Language (OWL), enabling domain ontologies to establish a common terminology. Semantic annotations enhance the description of unstructured and semistructured data, with numerous applications, particularly in fields like medicine, creating extensive repositories of semantically annotated data that improve decision-support systems. Consequently, data warehousing technologies must adapt to accommodate semantic web data, a topic examined in Chapter 14.
In the evolving landscape of big data, the rise of massive-scale data sources presents significant challenges for the data warehouse community. To address these issues, new database architectures such as distributed storage, NoSQL systems, column-store databases, and in-memory databases are emerging. Traditional ETL processes and data warehouse solutions struggle to handle the vast volume and variety of data, highlighting the need for integrated systems that can manage structured, unstructured, and real-time analytics. Innovations like NewSQL, HTAP paradigms, data lakes, Delta Lake, polyglot architectures, and cloud data warehouses are being developed in response to these demands from both academia and industry. Chapter 15 delves into these recent advancements in the field.
Review Questions
1.1 Why are traditional databases called operational or transactional? Why are these databases inappropriate for data analysis?
1.2 Discuss four main characteristics of data warehouses.
1.3 Describe the different components of a multidimensional model, that is, facts, measures, dimensions, and hierarchies.
1.4 What is the purpose of online analytical processing (OLAP) systems and how are they related to data warehouses?
1.5 Specify the different steps used for designing a database. What are the specific concerns addressed in each of these phases?
1.6 Explain the advantages of using a conceptual model when designing a data warehouse.
1.7 What is the difference between the star and the snowflake schemas?
1.8 Specify several techniques that can be used for improving performance in data warehouse systems.
1.9 What is the extraction, transformation, and loading (ETL) process?
1.10 What languages can be used for querying data warehouses?
1.11 Describe what is meant by the term data analytics. Give examples of techniques that are used for exploiting the content of data warehouses.
1.12 Why do we need a method for data warehouse design?
1.13 What is spatial data? What is mobility data? Give examples of applications for which such kinds of data are important.
1.14 Explain the differences between spatial databases and spatial data warehouses.
1.15 What is big data and how is it related to data warehousing? Give examples of technologies that are used in this context.
1.16 Give examples of applications where graph data models can be used.
1.17 Describe why it is necessary to take into account web data in the context of data warehousing. Motivate your answer by elaborating an example application scenario.
This chapter introduces fundamental database concepts, focusing on modeling, design, and implementation. It outlines a four-step design process: requirements specification, conceptual design, logical design, and physical design, which helps separate concerns and tailor applications to user needs and specific database technologies. The Northwind case study is introduced as a reference throughout the book. Additionally, the chapter reviews the entity-relationship model and the relational model, which are essential for database design. Physical design considerations are also discussed, providing foundational knowledge for understanding subsequent chapters while encouraging readers to explore further resources on the topic.
Database Design
Databases are fundamental building blocks of modern information systems. A database is a shared collection of logically related data, together with a description of that data, designed to fulfill the information requirements and support the operational activities of an organization. A database is managed by a database management system (DBMS), which is a software system used to define, create, manipulate, and administer a database.
Designing a database system is a complex undertaking typically divided into four phases, described next.
Requirements specification is essential for gathering user needs related to database systems. Various techniques developed by both academics and practitioners assist in identifying crucial system properties, standardizing requirements, and prioritizing them effectively.
Conceptual design focuses on creating a user-centered representation of a database, free from implementation details. This process uses a conceptual model to identify the key concepts relevant to the application. Among the various conceptual models, the entity-relationship model is one of the most commonly employed for database application design. Object-oriented modeling techniques based on the Unified Modeling Language (UML) can also be effectively applied.
Logical design focuses on converting the conceptual representation of a database into a logical model supported by database management systems (DBMSs). The relational model is the most widely used logical model today, although other models such as the object-relational, object-oriented, and semistructured models also exist. This book primarily emphasizes the relational model.
The physical design phase focuses on adapting the logical database representation into a physical model tailored to a specific DBMS platform. Popular DBMSs include SQL Server, Oracle, DB2, MySQL, and PostgreSQL.
A key goal of this separation into conceptual, logical, and physical levels is to achieve data independence, ensuring that higher-level schemas remain unaffected by changes in lower-level schemas. This concept is divided into two types: logical data independence and physical data independence. Logical data independence means that modifications to the structure of relational tables do not impact the conceptual schema, as long as application requirements remain constant. Physical data independence means that changes in the physical storage of data, such as sorting records on a disk, do not alter the conceptual or logical schemas, although users may notice differences in response times.
In the remainder of this chapter, we provide an overview of the entity-relationship model and the relational model, the most commonly used conceptual and logical frameworks, and we discuss important physical design considerations. The discussion is based on the Northwind relational database, which serves as a practical example for explaining database design concepts. In the subsequent chapter, we explore a data warehouse derived from the Northwind database, where we delve into data warehousing and OLAP concepts.
The Northwind Case Study
The Northwind Company exports various goods and requires a relational database for effective data management and storage. The key characteristics of the data to be stored are as follows:
• Customer data, which must include an identifier, the customer’s name, contact person’s name and title, full address, phone, and fax.
• Employee data, including identifiers, names, titles, courtesy titles, birth dates, hire dates, addresses, home phone numbers, phone extensions, and photographs. Photos are stored in the file system, and the database keeps the path to each photo. Additionally, employees are organized in a hierarchy, reporting to higher-level colleagues within the organization.
• Territory data: the company operates across various territories, which are grouped into specific regions. Employees can be assigned to multiple territories; however, territories are not exclusive to any one employee, as each territory can be associated with multiple employees.
• Shipper data, that is, information about the companies that Northwind hires to provide delivery services. For each of them, the company name and phone number must be kept.
• Supplier data, including the company name, contact name and title, full address, phone, fax, and home page.
• Product data: Northwind trades a variety of products, each identified by a unique identifier, name, quantity per unit, and unit price, along with a status indicating whether the product has been discontinued. To manage inventory effectively, the company tracks the number of units in stock, the units ordered but not yet delivered, and the reorder level, which triggers production or acquisition when reached. Products are grouped into categories, each with a name, description, and image, and each product is provided by a single supplier.
• Order data, including the order identifier, submission date, required delivery date, actual delivery date, the employee involved, the customer, the assigned shipper, the freight cost, and the complete destination address.
• An order can contain many products, and for each of them the unit price, the quantity, and the discount that may be given must be kept.
Conceptual Database Design
The entity-relationship (ER) model is a widely used conceptual framework for designing database applications. While there is consensus on the meanings of its various concepts, multiple visual notations have been developed to represent them. For a detailed overview of the notation adopted in this book, refer to Appendix A.
Figure 2.1 shows the ER model for the Northwind database. We next introduce the main ER concepts using this figure.
Fig. 2.1 Conceptual schema of the Northwind database
Entity types represent groups of real-world objects relevant to an application, such as Employees, Orders, and Customers. An individual object within an entity type is referred to as an entity or instance, and the collection of these instances is known as the population of the entity type.
From the application point of view, all entities of an entity type have the same characteristics.
Real-world objects are interconnected and do not exist in isolation; relationship types represent these associations. For instance, relationship types like Supplies, ReportsTo, and HasCategory illustrate the connections between different objects. Each association between objects within a relationship type is known as a relationship or an instance, while the complete set of these associations is referred to as the population of the relationship type.
The participation of an entity type in a relationship type is called a role and is represented by the line connecting them. Each role is annotated with cardinalities that indicate the minimum and maximum number of times an entity participates in the relationship. For instance, the role of Products in the relationship Supplies has cardinality (1,1), indicating that each product participates exactly once, that is, it has exactly one supplier. Conversely, the role of Suppliers in Supplies has cardinality (0,n), allowing a supplier to participate any number of times. Additionally, the (1,n) cardinality between Orders and OrderDetails signifies that each order participates at least once and possibly many times. Roles are categorized as optional or mandatory depending on whether their minimum cardinality is 0 or 1, and as monovalued or multivalued depending on whether their maximum cardinality is 1 or n.
A relationship type can connect two or more entity types; it is binary when it involves two entity types and n-ary when it involves more than two. Binary relationship types can be further categorized according to their maximum cardinalities into one-to-one, one-to-many, and many-to-many types. For instance, the Supplies relationship is one-to-many, since a product is linked to at most one supplier, but a supplier can provide multiple products. Conversely, the OrderDetails relationship is many-to-many, as an order can include several products, and a product can appear in multiple orders.
In certain relationship types, such as ReportsTo, the same entity type can appear multiple times, leading to what is known as a recursive relationship. To distinguish the roles the entity type plays in such relationships, role names are used. For instance, in the ReportsTo relationship, the roles are designated as Subordinate and Supervisor.
Objects and the relationships between them possess structural characteristics that define their nature. Attributes are used to record these characteristics of entity or relationship types.
For example, Address and HomePage are attributes of Suppliers, while UnitPrice, Quantity, and Discount are attributes of OrderDetails.
Attributes have cardinalities that specify the number of values they can hold for each instance. When the cardinality is (1,1), it is not shown in the diagrams; thus, each supplier has exactly one address and at most one home page, making the HomePage attribute optional, with cardinality (0,1). Attributes are mandatory when their minimum cardinality is 1, and monovalued or multivalued depending on whether they hold a single value or multiple values; in our example, all attributes are monovalued, although the Phone attribute of Customers could be labeled (1,n) if a customer may have several phone numbers. Additionally, attributes can be complex, such as the Name attribute of Employees, which is composed of FirstName and LastName, while simple attributes have no components. Finally, some attributes may be derived, like the NumberOrders attribute of Products, whose value is computed through a formula based on other schema elements, in this case the number of orders in which a product participates.
In real-world applications, certain attributes, known as identifiers, uniquely identify specific objects. For instance, in the Employees entity type, EmployeeID serves as the identifier, ensuring that each employee has a distinct value for this attribute. While the identifiers illustrated in the figure are simple, consisting of a single attribute, identifiers may also be composed of two or more attributes.
Fig. 2.2 Relationship type OrderDetails modeled as a weak entity type
Weak entity types lack an identifier of their own and are depicted with a double line around their name box, while strong entity types possess identifiers. There are no weak entity types in Fig. 2.1; however, the many-to-many relationship OrderDetails between Orders and Products could alternatively be modeled as a weak entity type, as shown in Fig. 2.2.
Logical Database Design
The Relational Model
Relational databases have been a reliable solution for information storage across various application domains for decades. Despite the emergence of alternative database technologies, the relational model remains the preferred method for managing the essential data that supports the daily operations of organizations.
The relational model, introduced by Codd in 1970, gained significant success due to its simplicity and intuitive design, grounded in a robust formal theory based on mathematical relations, set theory, and first-order predicate logic. This foundation facilitated the development of declarative query languages and various optimization techniques, resulting in efficient implementations. However, it wasn't until the early 1980s that the first commercial relational database management systems (RDBMSs) emerged. The model features a straightforward data structure: a relation, or table, is composed of one or more attributes, or columns, and a relational schema describes the structure of a set of relations.
Fig. 2.4 Relational schema of the Northwind database
The relational schema of the Northwind database, illustrated in Fig. 2.4, is derived from the conceptual schema shown in Fig. 2.1 by applying a set of translation rules to the ER schema. This schema includes various relations, such as Employees, Customers, and Products, each containing multiple attributes. For instance, the Employees relation has attributes EmployeeID, FirstName, and LastName, among others.
In what follows, we write R.A to denote the attribute A of relation R.
In the relational model, each attribute is defined over a domain, or data type, such as integer, float, date, or string, which specifies a set of values together with the operations that can be applied to them. The model requires attributes to be atomic and monovalued; thus, a complex attribute like Name of the entity type Employees must be decomposed into its simple components, FirstName and LastName. A relation R is described by a schema R(A1:D1, A2:D2, ..., An:Dn), where R is the name of the relation and each attribute Ai is defined over the domain Di. The relation R is associated with a set of tuples; each tuple (t1, t2, ..., tn) is an element of the Cartesian product D1 × D2 × ... × Dn, so the set of tuples is a subset of this product. This set is often referred to as the instance or extension of the relation. The degree, or arity, of a relation is the number of attributes in its relation schema.
The relational model allows several types of integrity constraints to be defined declaratively.
An attribute can be declared non-null, indicating that null values are not permitted. In Fig. 2.4, only the attributes annotated with cardinality (0,1) are allowed to contain null values.
In a relational database, a key is defined by one or more attributes that ensure the uniqueness of tuples within a relation, meaning that no two tuples can have identical values in the key columns. Keys are indicated by underlining the corresponding attributes; a key made up of multiple attributes is referred to as a composite key, while a single-attribute key is known as a simple key.
In the relational model, each table must have a primary key, such as the simple key EmployeeID in the Employees table, while other tables, like EmployeeTerritories, have composite keys, in this case consisting of both EmployeeID and TerritoryID. Additionally, the attributes that make up the primary key cannot contain null values, ensuring data integrity and uniqueness within the database.
Referential integrity establishes a connection between two tables, where a foreign key in one table references the primary key of another table. This ensures that the values of the foreign key also appear as values of the referenced primary key. For instance, the EmployeeID attribute in the Orders table references the primary key of the Employees table, guaranteeing that every employee listed in an order also appears in the Employees table. Referential integrity may also involve foreign and primary keys composed of several attributes.
A check constraint establishes a condition that must be satisfied when inserting or modifying a record in a table. For instance, in the Orders table, a check constraint can ensure that the OrderDate is less than or equal to the RequiredDate of each order. Note that many DBMSs restrict check constraints to a single record: references to data in other tables or in other tuples of the same table are not allowed. Therefore, check constraints can be used only to verify simple conditions.
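As an illustration, the following sketch shows how the Orders table of Fig. 2.4 could declare these constraints in SQL; the column list is abridged and the data types are indicative only.

CREATE TABLE Orders (
  OrderID      INTEGER NOT NULL,
  CustomerID   CHAR(5) NOT NULL,
  EmployeeID   INTEGER NOT NULL,
  OrderDate    DATE NOT NULL,
  RequiredDate DATE NOT NULL,
  ShippedDate  DATE,             -- cardinality (0,1): null values allowed
  Freight      DECIMAL(10,2),
  PRIMARY KEY (OrderID),
  FOREIGN KEY (CustomerID) REFERENCES Customers (CustomerID),
  FOREIGN KEY (EmployeeID) REFERENCES Employees (EmployeeID),
  CHECK (OrderDate <= RequiredDate) );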
Declarative integrity constraints alone are insufficient to capture the extensive range of constraints present in various application domains, necessitating the use of triggers for their implementation. A trigger is a named event-condition-action rule that is automatically activated upon modifications to a relation. Additionally, triggers can facilitate the computation of derived attributes, exemplified by the attribute NumberOrders in the Products table. Each time an insert, update, or delete occurs in the OrderDetails table, a trigger ensures that the value of the attribute is updated accordingly.
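As a minimal sketch of such a trigger, the following uses SQL Server's T-SQL syntax and assumes that a NumberOrders column has been added to the Products table; other DBMSs use a different trigger syntax.

CREATE TRIGGER UpdateNumberOrders
ON OrderDetails
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
  -- Recompute the derived attribute for the products affected by the modification
  UPDATE P
  SET NumberOrders = (SELECT COUNT(*) FROM OrderDetails D
                      WHERE D.ProductID = P.ProductID)
  FROM Products P
  WHERE P.ProductID IN (SELECT ProductID FROM inserted
                        UNION
                        SELECT ProductID FROM deleted);
END;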
The translation of a conceptual schema (written in the ER or any other conceptual model) to an equivalent relational schema is called a mapping.
This mapping is well known and has been implemented in most database design tools, which automatically translate conceptual schemas into logical ones, typically targeting the relational model and producing the table definitions for the various RDBMSs.
We now outline seven rules that are used to map an ER schema into a relational one.
Rule 1: A strong entity type E is mapped to a table T containing the simple monovalued attributes and the simple components of the monovalued complex attributes of E. The identifier of E defines the primary key of T. T also defines non-null constraints for the mandatory attributes. Note that additional attributes will be added to this table by subsequent rules. For example, the strong entity type Products in Fig. 2.1 is mapped to the table Products in Fig. 2.4, with key ProductID.
Rule 2: Consider a weak entity type W with owner (strong) entity type O. Assume that Wid is the partial identifier of W and Oid is the identifier of O. W is mapped in the same way as a strong entity type, that is, to a table T. In this case, T must also include Oid as an attribute, with a referential integrity constraint to attribute O.Oid. Moreover, the identifier of T is the union of Wid and Oid.
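For instance, applying Rule 2 to the weak entity type OrderDetails of Fig. 2.2 would produce a table along the following lines (a sketch; the data types are illustrative):

CREATE TABLE OrderDetails (
  OrderID   INTEGER NOT NULL REFERENCES Orders (OrderID),  -- identifier of the owner
  LineNo    INTEGER NOT NULL,                              -- partial identifier
  UnitPrice DECIMAL(10,2) NOT NULL,
  Quantity  INTEGER NOT NULL,
  Discount  DECIMAL(4,2) NOT NULL,
  PRIMARY KEY (OrderID, LineNo) );                         -- union of both identifiers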
Normalization
When evaluating a relational schema, it is essential to identify potential redundancies within the relations, as these can lead to anomalies during insertions, updates, and deletions.
Fig. 2.7 Examples of relations that are not normalized
In the relation OrderDetails of Fig. 2.7a, each product is associated with a discount percentage, leading to redundancy because this information is repeated in all the orders that contain the product. To eliminate this redundancy, the Discount attribute should be removed from the OrderDetails table and placed in the Products table, so that the discount of a product is stored only once.
The relation Products in Fig. 2.7b includes category information, such as name, description, and picture, which is repeated for each product of the same category. This redundancy can lead to inconsistencies when updating category descriptions; the category attributes should therefore be removed from the Products table and a separate Categories table created. Similarly, the relation EmployeeTerritories in Fig. 2.7c includes an additional attribute, KindOfWork, which repeats information for employees assigned to multiple territories. To address this issue, the KindOfWork attribute should be removed from the EmployeeTerritories table and a new table, EmpWork, created to associate employees with the kinds of work they perform.
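As a sketch of the first decomposition (column names follow the figure, data types are illustrative), the category attributes are moved to their own table and referenced from Products:

CREATE TABLE Categories (
  CategoryID   INTEGER PRIMARY KEY,
  CategoryName VARCHAR(30),
  Description  VARCHAR(200),
  Picture      BLOB );

CREATE TABLE Products (
  ProductID   INTEGER PRIMARY KEY,
  ProductName VARCHAR(40),
  -- ... other product attributes ...
  CategoryID  INTEGER REFERENCES Categories (CategoryID) );  -- category data stored once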
Dependencies and normal forms help identify redundancies in relational databases. A functional dependency is a constraint between two sets of attributes of a relation. For a relation R and attribute sets X and Y, a functional dependency X → Y holds if, in every instance of the relation, each value of X is associated with at most one value of Y.
In this case, we say that X determines Y. A key is a particular case of a functional dependency: the attributes composing the key functionally determine all the attributes of the relation.
The redundancies illustrated in Fig. 2.7a and 2.7b can be expressed as functional dependencies. In the OrderDetails relation of Fig. 2.7a, the functional dependency ProductID → Discount holds. Similarly, in the Products relation of Fig. 2.7b, the functional dependencies ProductID → CategoryName and CategoryName → Description hold.
The redundancy in the relation EmployeeTerritories of Fig. 2.7c is captured by a multivalued dependency. This occurs when a set of attributes X determines several values of another set Y, independently of the remaining attributes of the relation. In this case, the multivalued dependency EmployeeID →→ KindOfWork holds, and similarly TerritoryID →→ KindOfWork. It is important to note that functional dependencies are a particular case of multivalued dependencies, meaning that every functional dependency is also a multivalued dependency.
A normal form serves as an integrity constraint that ensures a relational schema adheres to specific properties. Since the inception of the relational model in the 1970s, various types of normal forms have been established to enhance data integrity and organization.
Normal forms are also defined for other models, including the entity-relationship model. For detailed definitions of these normal forms, readers are encouraged to consult database textbooks.
Relational Query Languages
Relational databases allow data to be queried through various formalisms, primarily categorized into two types of query languages. Procedural languages require users to specify the operations needed to obtain the desired results, while declarative languages enable users to simply state what they wish to retrieve, leaving it to the DBMS to determine a procedural query to be executed.
This section introduces relational algebra and SQL, two languages that will be used throughout this book. Relational algebra is a procedural query language, whereas SQL is a declarative query language.
Relational algebra consists of operations for manipulating relations, categorized into unary operations, which take one relation as input, and binary operations, which take two relations. The algebra is closed, meaning that all operations yield relations, which allows operations to be combined to form complex queries. Operations are further classified into basic operations, which cannot be derived from others, and derived operations, which simplify the expression of queries. The first unary operation is projection, denoted by π C1,...,Cn(R), which returns the columns C1, ..., Cn of the relation R. Thus, it can be seen as a vertical partition of R into two relations: one containing the columns mentioned in the expression, and the other containing the remaining columns.
For the database given in Fig. 2.4, an example of a projection is:
π FirstName,LastName,HireDate(Employees)
This operation returns the three specified attributes from the Employees table.
The selection operation, denoted by σϕ(R), retrieves the tuples from the relation R that satisfy the Boolean condition ϕ. This operation effectively partitions a table horizontally into two distinct sets: the tuples that satisfy the condition and those that do not, while maintaining the original structure of R in the output.
A selection operation over the database given in Fig. 2.4 is:
σ HireDate≥'01/01/2012' ∧ HireDate≤'31/12/2014'(Employees)
This operation returns the employees hired between 2012 and 2014.
In relational algebra, the result of an operation is a relation that can serve as input for subsequent operations. To enhance the readability of queries, it is useful to store intermediate results in temporary relations. We denote this by T ← Q, indicating that relation T holds the result of query Q.
Thus, combining the two previous examples, we can ask for the first name, last name, and hire date of all employees hired between 2012 and 2014. The query reads:
Temp1 ← σ HireDate≥'01/01/2012' ∧ HireDate≤'31/12/2014'(Employees)
Result ← π FirstName,LastName,HireDate(Temp1)
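For comparison, the same query can be written declaratively in SQL; the following sketch assumes ISO date strings are accepted by the DBMS.

SELECT FirstName, LastName, HireDate
FROM Employees
WHERE HireDate >= '2012-01-01' AND HireDate <= '2014-12-31';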
The rename operation, denoted by ρ A1→B1,...,Ak→Bk(R), returns a relation in which the attributes A1, ..., Ak of R are renamed to B1, ..., Bk, respectively. Therefore, the resulting relation has the same tuples as the relation R, although the schemas of the two relations are different.
We now discuss the binary operations inherited from classical set theory, which require their operands to be union compatible. Two relations R1(A1, ..., An) and R2(B1, ..., Bn) are union compatible if they have the same degree n and, for each i from 1 to n, the domains of Ai and Bi are the same.
The three operations union, intersection, and difference on two union-compatible relations R1 and R2 are defined as follows:
• The union operation, denoted by R1 ∪ R2, returns the tuples that are in R1, in R2, or in both, removing duplicates.
• The intersection operation, denoted by R1 ∩ R2, returns the tuples that are in both R1 and R2.
• The difference operation, denoted by R1 \ R2, returns the tuples that are in R1 but not in R2.
When the relations are union compatible but have different attribute names, the attribute names of the first relation are retained in the result. The union operation can be used to formulate queries such as "Retrieve the identifiers of the employees from the UK or those reported by a UK employee":
UKEmps ← σ Country='UK'(Employees)
Result1 ← π EmployeeID(UKEmps)
Result2 ← ρ ReportsTo→EmployeeID(π ReportsTo(UKEmps))
Result ← Result1 ∪ Result2
The relation UKEmps contains the employees based in the UK. Result1 projects their EmployeeID, while Result2 contains the values of the ReportsTo attribute of these employees, renamed to EmployeeID. The union of Result1 and Result2 yields the desired result.
The intersection allows for the formulation of queries such as "Identifiers of employees from the UK reported by another employee from the UK." This is achieved by replacing the previous expression with the following: Result ← Result1 ∩ Result2.
The difference can be used to formulate queries such as "Identify the employees from the UK who are not reported by any UK employee," achieved by substituting the previous expression with the following: Result ← Result1 \ Result2.
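The corresponding SQL formulations use the set operators UNION, INTERSECT, and EXCEPT. For instance, a sketch of the union query over the schema of Fig. 2.4 is:

SELECT EmployeeID
FROM Employees
WHERE Country = 'UK'
UNION
SELECT ReportsTo
FROM Employees
WHERE Country = 'UK' AND ReportsTo IS NOT NULL;

Replacing UNION with INTERSECT or EXCEPT yields the intersection and difference queries, respectively.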
The Cartesian product, denoted by R1 × R2, takes two relations and returns a new one whose schema is composed of all the attributes of R1 and R2 (renamed if necessary) and whose instance is obtained by concatenating each pair of tuples from R1 and R2. Consequently, the number of tuples in the result is the product of the cardinalities of the two relations.
The Cartesian product, while often lacking standalone meaning, becomes highly valuable when paired with a selection. For instance, to obtain the names of the products supplied by Brazilian suppliers, we use the Cartesian product to combine data from the Products and Suppliers tables. We keep only the essential attributes: ProductID, ProductName, and SupplierID from the Products table, along with SupplierID and Country from the Suppliers table. To avoid naming conflicts, we must rename the SupplierID attribute in one of the tables.
Temp1 ← π ProductID,ProductName,SupplierID(Products)
Temp2 ← ρ SupplierID→SupID(π SupplierID,Country(Suppliers))
Temp3 ← Temp1 × Temp2
The Cartesian product Temp3 combines each product with every supplier, but we are only interested in the tuples that relate a product to its own supplier. We therefore keep the matching tuples, select the suppliers from Brazil, and project the desired column, ProductName:
Temp4 ← σ SupplierID=SupID(Temp3)
Result ← π ProductName(σ Country='Brazil'(Temp4))
The join operation, denoted by R1 ⋈ϕ R2, combines two relations based on a condition ϕ over their attributes. It returns a relation whose schema contains all the attributes of R1 and R2 (renamed if necessary) and whose instance is composed of each pair of tuples from R1 and R2 that satisfies the condition ϕ. The operation is basically a combination of a Cartesian product and a selection.
Using the join operation, the query “Name of the products supplied by suppliers from Brazil” will read:
Result ← π ProductName(σ Country='Brazil'(Products ⋈ SupplierID=SupID Temp2))
The join thus combines the Cartesian product computing Temp3 and the selection computing Temp4 into a single, more concise expression. There are several variants of the join operation, such as the equijoin, in which the join condition involves only equalities, as in the example above.
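In SQL, the same query is typically written with an explicit join, as in the following sketch:

SELECT P.ProductName
FROM Products P JOIN Suppliers S ON P.SupplierID = S.SupplierID
WHERE S.Country = 'Brazil';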
Physical Database Design
The goal of physical database design is to define the storage, access, and relationships of database records to optimize the performance of database applications. This design process encompasses various aspects such as query processing, physical data organization, indexing, transaction processing, and concurrency management. In this section, we offer a concise overview of these issues, which will be explored in greater detail for data warehouses in Chapter 8.
Effective physical database design hinges on understanding the specific characteristics of the application, particularly the data properties and usage patterns. This involves analyzing the frequently executed transactions and queries that significantly affect performance, identifying the operations that are critical for the organization, and recognizing peak load periods with high database demand. Such insights are essential for pinpointing potential performance problems in the database.
Various factors can be used to measure the performance of database applications. Transaction throughput is the number of transactions that can be processed per unit of time; it must be high in systems that process heavy loads, such as electronic payment platforms. Response time, the time taken to complete a single transaction, is vital for user satisfaction and must be minimized. Finally, the disk storage required for the database files is another key consideration. Typically, a trade-off must be made among transaction throughput, response time, and disk storage requirements to achieve optimal performance.
The space-time trade-off highlights that improving operational speed often requires additional memory, and conversely, reducing memory consumption may lead to longer processing times. For example, a compression algorithm can effectively minimize the storage space required for a large file; however, this comes with the increased time needed for the decompression process.
The query-update trade-off highlights the balance between data accessibility and structural complexity. Imposing a structured format on the data can make retrieval more efficient, but more intricate structures require additional time to set up and to maintain as data changes. For instance, sorting records by a key field facilitates quicker searches, but introduces extra overhead during insertions to maintain the sorted order.
After implementing the initial physical design, it is essential to monitor the system and adjust it based on observed performance and evolving requirements. Many DBMSs offer tools to facilitate system monitoring and tuning.
The diverse functionalities of modern Database Management Systems (DBMSs) necessitate a thorough understanding of the specific data storage and retrieval techniques employed by the chosen DBMS, making it essential for effective physical design.
A database is structured in secondary storage as one or more files, where each file contains multiple records and each record comprises several fields. Typically, each tuple in a relation corresponds to a record in a file. When a user requests a particular tuple, the DBMS maps this logical record into a physical disk address and retrieves it into main memory using the file access routines of the operating system.
Data is stored on disk in units known as disk blocks, whose size is defined by the operating system during formatting. DBMSs store data in database blocks (or pages), and it is advisable to align disk block and database block sizes for efficient data retrieval. The choice of database block size is influenced by various factors, including the management of concurrent access through locking mechanisms. When a record is locked for modification by one transaction, other transactions are typically prevented from modifying it, although they can still read it. In many DBMSs, locking occurs at the page level, so larger page sizes increase the likelihood that several transactions access the same page. Additionally, to ensure optimal disk efficiency, the database block size should be equal to or a multiple of the disk block size.
A DBMS maintains a buffer in main memory that holds multiple database pages, enabling quick access to data during query processing without the need to read from disk. When a query is issued, the query processor first checks whether the required data records are in the buffer. If they are found, the data is retrieved or modified directly in the buffer, and any altered pages are marked for eventual writing back to disk. If the necessary pages are not in the buffer, the DBMS reads them from disk, potentially replacing existing pages using an algorithm such as least recently used (LRU). This caching mechanism significantly enhances query performance by minimizing disk access.
File organization refers to the arrangement of data in a file on secondary storage; the main types are heap, sequential, and hash organization. Heap files store records in insertion order, allowing efficient insertion but slower retrieval, since records must be read sequentially. Sequential files sort records on specific fields, enabling quick access when searching by those attributes, though insertion and deletion are more complicated because the order must be maintained. Hash files apply a hash function to determine the storage address (bucket) of each record, offering rapid access for retrieval, but can face performance issues due to collisions when buckets reach capacity. To enhance record retrieval, indexes can be used with all of these file organizations, providing efficient access based on the chosen indexing fields; several indexes on different fields of the same file can be created.
There are many different types of indexes We describe below some cate- gories of indexes according to various criteria.
• Indexes can be clustered or nonclustered, also known as primary and secondary indexes. A clustered index physically orders the records of the data file on the indexed field(s), whereas a nonclustered index does not affect the physical order of the data. Each file can have only one clustered index but may have multiple nonclustered indexes.
• Indexes can be single-column or multiple-column, depending on the number of fields they index. The order of the columns in a multiple-column index significantly affects data retrieval efficiency, so it is advisable to place the most restrictive column first to enhance performance.
• Indexes can be unique or nonunique: the former do not allow duplicate values, while nonunique indexes do.
• Indexes can be sparse or dense. A dense index contains an entry for every record in the data file, whereas a sparse index contains fewer entries than the number of records; a sparse index requires the data file to be ordered on the indexing key. Consequently, nonclustered indexes must be dense, since the file is not ordered on their key, whereas clustered indexes can be sparse.
• Indexes can be single-level or multilevel. Multilevel indexes speed up searches by splitting a large index file into smaller segments and building an index on those segments. Although this structure reduces the number of blocks accessed when searching for a record, insertion and deletion become more involved because all index levels are physically ordered. Dynamic multilevel indexes address this problem by leaving extra space in each block for new entries; they are commonly implemented with data structures such as B-trees and B+-trees, which are supported by most DBMSs.
A DBMS lets the designer create indexes on any field, improving access speed at the cost of additional storage for the indexes and some overhead during updates. Since indexed values are kept sorted, indexes efficiently support partial-match and range searches, and in relational systems they considerably speed up join operations on the indexed fields.
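As a brief illustration, the following SQL statements sketch how such indexes could be declared; the table and column names (Orders, CustomerID, OrderDate, Customers, Email) are hypothetical, and the exact syntax (e.g., for clustered indexes) varies across DBMSs.

-- Nonclustered index on a single column
CREATE INDEX IX_Orders_Customer ON Orders (CustomerID);
-- Composite index: the most restrictive column is placed first
CREATE INDEX IX_Orders_Customer_Date ON Orders (CustomerID, OrderDate);
-- Unique index: duplicate values of the indexed field are rejected
CREATE UNIQUE INDEX IX_Customers_Email ON Customers (Email);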
We will see in Chap. 8 that data warehouses require physical design solutions different from those of operational databases, which are tuned to support heavy transaction loads rather than complex analytical queries.
Summary
This chapter provided an overview of essential database concepts, including the steps for designing database systems: requirements specification, conceptual design, logical design, and physical design. The Northwind case study was used to illustrate these concepts, with the entity-relationship model as the conceptual framework. We examined the relational model as a logical representation and outlined the mapping rules for translating an entity-relationship schema into a relational schema. We also touched on normalization, which prevents redundancies and inconsistencies in relational databases. Finally, we introduced two languages for manipulating relational databases, relational algebra and SQL, and addressed several aspects of physical database design.
Bibliographic Notes
For a comprehensive coverage of the concepts discussed in this chapter, readers are encouraged to consult the textbooks [70, 79]. An overview of requirements engineering can be found in [59], while conceptual database design is explored in [171], which uses UML [37] rather than the traditional entity-relationship model. Logical database design is addressed in [230]. The components of the SQL:1999 standard are described in detail in [151, 153], and subsequent versions of the standard are covered in [133, 152, 157, 272]. Physical database design is thoroughly examined in [140].
Review Questions
2.1 What is a database? What is a database management system?
2.2 Describe the four phases used in database design.
2.3 Define the following terms: entity type, entity, relationship type, relationship, role, cardinality, and population.
2.4 Illustrate with an example each of the following kinds of relationship types: binary, n-ary, one-to-one, one-to-many, many-to-many, and recursive.
2.5 Discuss different kinds of attributes according to their cardinality and their composition. What are derived attributes?
2.6 What is an identifier? What is the difference between a strong and a weak entity type? Does a weak entity type always have an identifying relationship? What is an owner entity type?
2.7 Discuss the different characteristics of the generalization relationship.
2.8 Define the following terms: relation (or table), attribute (or column), tuple (or line), and domain.
2.9 Explain the various integrity constraints that can be described in the relational model.
2.10 Describe the rules for translating an entity-relationship schema into a relational schema. Give an example of a construct of the entity-relationship model that can be translated into the relational model in different ways.
2.11 Illustrate with examples the different types of redundancy that may occur in a relation. How can redundancy in a relation induce problems in the presence of insertions, updates, and deletions?
2.12 What is the purpose of functional and multivalued dependencies? What is the difference between them?
2.13 What are the operations of the relational algebra? Describe the different types of joins, such as inner and outer joins. How can joins be expressed in terms of other relational algebra operations, such as selection and Cartesian product?
2.14 What is SQL? What are the sublanguages of SQL?
2.15 What is the general structure of SQL queries? How can the semantics of an SQL query be expressed with the relational algebra?
2.16 Discuss the differences between the relational algebra and SQL. Why is relational algebra an operational language, whereas SQL is a declarative language?
2.17 Explain what duplicates are in SQL and how they are handled.
2.18 Describe the general structure of SQL queries with aggregation and sorting. State the basic aggregation operations provided by SQL.
2.19 What are subqueries in SQL? Give an example of a correlated subquery.
2.20 What are common table expressions in SQL? What are they needed for?
2.21 What is the objective of physical database design? Which factors are used to measure the performance of database applications? Discuss the trade-offs that must be considered, for example, between normalization and denormalization and between storage cost and performance.
2.22 Explain the different types of file organization. Discuss their respective advantages and disadvantages.
2.23 What is an index? Why are indexes needed? Explain the various types of indexes.
2.24 What is clustering? What is it used for?
Exercises
Exercise 2.1. A French horse race fan wants to set up a database to analyze the performance of the horses as well as the betting payoffs.
A racetrack is described by a name (e.g., Hippodrome de Chantilly), a location (e.g., Chantilly, Oise, France), an owner, a manager, a date opened, and a description. A racetrack hosts a series of horse races.
A horse race has a name (e.g., Prix Jean Prat), a category (i.e., Group 1,
2, or 3), a race type (e.g., thoroughbred flat racing), a distance (in meters), a track type (e.g., turf right-handed), qualification conditions (e.g., 3-year-old excluding geldings), and the first year it took place.
A meeting is held on a certain date at a racetrack and is composed of one or several races. For a meeting, the following information is kept: weather
(e.g., sunny, stormy), temperature, wind speed (in km per hour), and wind direction (N, S, E, W, NE, etc.).
Each race in a meeting has a number (unique within the meeting), a scheduled departure time, and a number of participating horses. The application keeps track of the distribution of the prize money among the first finishers (e.g., €228,000 for the first place and €88,000 for the second place), as well as the time of the fastest horse.
For each race on a given date, several types of bets are proposed (e.g., tiercé and quarté+), each with several bet options (e.g., "in order," "in any order," and a bonus for the quarté+). The payoffs are given for each bet type and option with respect to a base amount (e.g., quarté+ for €2), stating the win amount and the number of winners for each option.
A horse is described by its name, breed (e.g., thoroughbred), sex, and important dates such as foaling (birth), gelding (castration, for males only), and death. The lineage of a horse is given by its sire (father) and dam (mother), and its appearance is described by its coat color (e.g., bay or chestnut). The horse's owner, breeder, and trainer are also recorded.
In a race, each horse is assigned a number and carries a particular weight in order to balance the chances of the participants. The application keeps track of the finishing position and the victory margin of each horse. Design an ER schema for this application. Translate your ER schema into the relational model, indicating the keys of each relation as well as the referential integrity and non-null constraints.
Exercise 2.2. A Formula One fan club wants to set up a database to keep track of the results of all the seasons since the first Formula One World Championship in 1950.
A racing season spans a year, is defined by a start and an end date, and includes multiple races governed by specific regulations. Each race is identified by its order in the season and an official name (e.g., the 2013 Formula One Shell Belgian Grand Prix), and is described by its date, local and UTC times, weather conditions, pole position (driver name and time), and fastest lap (driver name, time, and lap number). Every race belongs to a Grand Prix, for which the active years are recorded (e.g., 1950–1956).
For example, the Belgian Grand Prix had hosted a total of 58 races as of 2013. Each race takes place on a circuit, such as the Circuit de Spa-Francorchamps located in Spa, Belgium, and is characterized by a type (e.g., race, road, or street), a number of laps, a circuit length, and a race distance (the last two in kilometers). In addition, a lap record is kept, giving the best time, the driver who set it, and the corresponding year.
Notice that the course of a circuit may be modified over the years. For example, the Spa-Francorchamps circuit was shortened from 14 to 7 km in 1979. Further, a Grand Prix may use several circuits over the years, as is the case for the Belgian Grand Prix.
A racing team, such as Scuderia Ferrari, is described by its name, location (e.g., Maranello, Italy), and key personnel (e.g., Stefano Domenicali). For each team, its history is recorded, including its debut Grand Prix, the number of races participated in, and the championships won by the constructor and by its drivers, as well as its highest race finish, total victories, pole positions, and fastest laps. In a given season, a team races under a full name that usually includes its current sponsor (e.g., Scuderia Ferrari Marlboro from 1997 to 2011), with a designated chassis (e.g., F138) and engine (e.g., Ferrari).
Each driver is described by her/his name, nationality, birth date and place, race entries, championships won, race victories, podium finishes, total career points, pole positions, fastest laps, highest race finish, and best grid position. A team employs two main drivers and may also have up to six test drivers, although the number of test drivers can vary. While main drivers are usually associated with a team for the whole season, they may not compete in every race. Each team is allocated two consecutive numbers for its main drivers, where the number 1 is assigned to the previous season's constructors' champion; the number 13 is rarely used, having appeared only once, in the 1963 Mexican Grand Prix.
In a Grand Prix, drivers must participate in a qualifying session that determines their starting order for the race. The qualifying results record, for each driver, her/his position and the times obtained in the three segments Q1, Q2, and Q3. For the race itself, the results recorded for each driver comprise the finishing position (optional), the number of laps completed, the total race time, the reason for retirement or disqualification (both optional), and the points awarded (only for the top eight finishers). Design an ER schema for this application, stating any unspecified requirements and integrity constraints and making the necessary assumptions for completeness. Translate your ER schema into the relational model, indicating the keys of each relation as well as the referential integrity and non-null constraints.
Exercise 2.3. Write in relational algebra and in SQL the following queries over the Northwind database:
(a) Name, address, city, and region of employees.
(b) Names of employees and of customers located in Brussels related through orders shipped by Speedy Express.
(c) Title and name of employees who have sold at least one product.
(d) Names and titles of employees and of their supervisors.
(e) Products sold or bought in London.
(f) Employees who live in the same city as some of their customers.
(g) Products that have not been ordered.
(h) Customers who bought all products.
(i) Categories of products together with the average price of the products in each category.
(j) Companies that supply more than three products.
(k) Total sales by employee, ordered by employee identifier.
(l) Employees who sell products of more than seven suppliers.
Multidimensional Model
Hierarchies
The granularity of a data cube is determined by the combination of the levels of its dimensions; in the cube of our example, these are the Category level of the Product dimension, the Quarter level of the Time dimension, and the City level of the Customer dimension.
To extract strategic knowledge from a cube, it is necessary to view its data at several levels of detail, for example, at the month level or aggregated by country. Hierarchies allow this by relating lower, detailed concepts (children) to higher, more general ones (parents), thus defining the schema of each dimension. A dimension instance comprises the members at all levels of a dimension. For example, in the Product dimension, products roll up to categories; the Time dimension goes from Day up to Month, Quarter, Semester, and Year; and the Customer dimension goes from Customer up to City, State, Country, and Continent, each hierarchy usually ending in a distinguished top level All.
Fig. 3.2 Hierarchies of the Product (Product → Category → All), Time (Day → Month → Quarter → Semester → Year → All), and Customer (Customer → City → State → Country → Continent → All) dimensions
Figure 3.3 shows part of the Product dimension at the instance level: each product in the hierarchy is linked to its corresponding category. All categories roll up to a member called all, the only member of the distinguished level All. This member is used to aggregate measures across the whole hierarchy, for example, to obtain the total sales for all products.
Real-world applications exhibit many different kinds of hierarchies. The hierarchy of Fig. 3.3 is balanced, since every product has the same number of levels up to the root of the hierarchy. Chapters 4 and 5 study these kinds of hierarchies in detail, covering both their conceptual representation and their implementation in current data warehouse and OLAP systems.
1 Note that, as indicated by the ellipses, not all nodes of the hierarchy are shown.
Fig 3.3 Members of a hierarchy Product → Category
Measures
In a data cube, each measure is associated with an aggregation function that combines several measure values into a single one. Aggregation takes place when the level of detail at which the data are visualized is changed by traversing the hierarchies of the dimensions. For example, moving along the Customer hierarchy from City to Country aggregates, typically with the SUM operation, the sales of all customers of the same country. At the extreme, visualizing the cube at the All level of every dimension hierarchy yields a single cell containing the overall sum of the Quantity measure for all products, customers, and time periods.
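As a rough illustration, if the cube were stored relationally in hypothetical Sales and Customer tables, rolling up from City to Country with the SUM function could be expressed in SQL as follows:

-- Aggregate the Quantity measure of all customers of the same country
SELECT C.Country, SUM(S.Quantity) AS TotalQuantity
FROM Sales S JOIN Customer C ON S.CustomerKey = C.CustomerKey
GROUP BY C.Country;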
Summarizability refers to the correct aggregation of cube measures along dimension hierarchies so that consistent results are obtained. To guarantee summarizability, conditions such as the following must hold.
• Disjointness of instances: the grouping of instances in a level with respect to their parent in the next level must result in disjoint subsets. For example, in the hierarchy of Fig. 3.3, a product cannot belong to two categories; otherwise, its sales would be counted twice, once for each category.
• Completeness: all instances must be included in the hierarchy, and each instance must be related to one parent in the next level. For example, in the Time hierarchy, every day of the period of interest must exist and be assigned to a month; otherwise, the sales on the missing dates would be excluded from the totals.
• Correctness: It refers to the correct use of the aggregation functions.
As explained next, measures can be of various types, and this determines the kind of aggregation function that can be applied to them.
According to the way in which they can be aggregated, measures can be classified as follows.
• Additive measures can be meaningfully summarized along all the dimensions, using addition. These are the most common type of measures. For example, the measure Quantity in the cube of Fig. 3.1 is additive: it can be summarized when the hierarchies in the Product, Time, and Customer dimensions are traversed.
• Semiadditive measures can be meaningfully summarized using addition along some, but not all, dimensions. A typical example is inventory quantities, which cannot be meaningfully added along the Time dimension, for instance, by adding the inventories of two different quarters.
• Nonadditive measures cannot be meaningfully summarized using addition across any dimension. Typical examples are item price, cost per unit, and exchange rate.
Defining the aggregation function for each measure is essential, especially for semiadditive and nonadditive measures. For instance, a semiadditive measure such as inventory can be averaged along the Time dimension and summed along the other dimensions. Similarly, nonadditive measures such as item price or exchange rate are often aggregated with the average, although other functions such as the minimum, the maximum, or the count may be appropriate, depending on the semantics of the application. To allow users to interactively explore the data cube at different granularities, optimization techniques based on aggregate precomputation are used. Incremental aggregation mechanisms avoid recomputing the whole aggregation from scratch each time the data warehouse is queried; however, whether this is feasible depends on the aggregate function used, which leads to a further classification of measures.
• Distributive measures are defined by an aggregation function that can be computed in a distributed way. Suppose the data are partitioned into subsets and the aggregate function is applied to each subset; the function is distributive if the result obtained by applying a (possibly different) function to the partial results equals the result of applying the original function to the whole data set. Count, sum, minimum, and maximum are distributive, whereas distinct count is not. For example, partitioning the measure values {3,3,4,5,8,4,7,3,8} into the subsets {3,3,4}, {5,8,4}, and {7,3,8} and summing the distinct counts of the subsets yields 8, whereas the distinct count of the whole set is only 5.
• Algebraic measures are defined by an aggregation function that can be expressed as a scalar function of distributive functions. A typical example is the average, which is computed by dividing the sum by the count, both of which are distributive functions (a sketch of this idea in SQL is given after this list).
• Holistic measures are measures that cannot be computed from other subaggregates. Typical examples include the median, the mode, and the rank. They are expensive to compute, especially when the data are modified, since they must be recomputed from scratch.
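As a rough illustration of how an algebraic measure can be derived from distributive subaggregates, suppose a hypothetical table PartialSales stores, for each partition of the data, the precomputed subaggregates SumQty and CntQty; the global average can then be obtained without rescanning the detailed data:

-- The average (algebraic) is a scalar function of sum and count (both distributive)
SELECT SUM(SumQty) * 1.0 / SUM(CntQty) AS AvgQuantity
FROM PartialSales;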
OLAP Operations
A distinctive feature of the multidimensional model is that it allows data to be analyzed from multiple perspectives and at several levels of detail. OLAP operations let users materialize these perspectives, providing an interactive environment for data analysis. These OLAP operations can be compared to the relational algebra operations introduced in Chap. 2.
Figure 3.4 illustrates the OLAP operations discussed in this section. The analysis starts from the cube in Fig. 3.4a, which shows quarterly sales quantities (in thousands) by product category and customer city for the year 2012.
The roll-up operation aggregates measures along a dimension hierarchy to obtain measures at a coarser granularity. The syntax for this operation is:
ROLLUP(CubeName, (Dimension → Level)*, AggFunction(Measure)*)
where Dimension → Level indicates the level of a dimension to which the roll-up is performed and AggFunction is the aggregation function applied to summarize the measure. An aggregation function must be specified for every measure to be kept in the resulting cube; measures for which no function is given are removed from the cube. Note that rolling up a dimension to the All level (Dimension → All) effectively removes that dimension from the cube.
In our example, the following roll-up operation computes the sales quantities by country:
ROLLUP(Sales, Customer → Country, SUM(Quantity))
The result is shown in Fig. 3.4b: the Customer dimension, which contained four cities, now contains two countries. The values for Paris and Lyon in a given quarter and category are aggregated into the value for France, and similarly for Germany. When querying a cube, a common need is to roll up a few dimensions to particular levels and to aggregate all the other dimensions to the All level. Such a sequence of n roll-up operations can be abbreviated with the ROLLUP* operation, whose syntax is:
ROLLUP*(CubeName, [(Dimension → Level)*], AggFunction(Measure)*)
For example, the total quantity by quarter can be obtained as follows:
ROLLUP*(Sales, Time → Quarter, SUM(Quantity))
Fig. 3.4 OLAP operations: (a) original cube; (b) roll-up to the Country level; (c) drill-down to the Month level; (d) sort products by name; (e) pivot; (f) slice on City = 'Paris'
Fig. 3.4 OLAP operations (continued): (g) dice on City = 'Paris' or 'Lyon' and Quarter = 'Q1' or 'Q2'; (h) dice on Quantity > 15; (i) cube for 2011; (j) drill-across; (k) percentage change; (l) total sales by quarter and city
Fig. 3.4 OLAP operations (continued): (m) maximum sales by quarter and city; (n) top two sales by quarter and city; (o) top 70% of sales by city and category, ordered by ascending quarter; (p) top 70% of sales by city and category, ordered by descending quantity; (q) rank of quarters by category and city according to descending quantity
Fig. 3.4 OLAP operations (continued): (r) three-month moving average; (s) year-to-date sum; (t) union of the original cube and a cube with sales data for Spain
In the expression above, the Time dimension is rolled up to the Quarter level, while the Customer and Product dimensions are rolled up to the All level. On the other hand, in the expression
ROLLUP*(Sales, SUM(Quantity))
all the dimensions of the cube are rolled up to the All level, yielding a single cell that contains the overall sum of the Quantity measure.
When rolling up a dimension, it is often necessary to count the members of the dimension being removed from the cube. For example, a roll-up of the Sales cube to the Quarter level can also compute the number of distinct products sold in each quarter, adding a new measure ProdCount to the cube. We will see below other ways to add measures to a cube.
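In a relational rendering of the cube (hypothetical Sales and TimeDim tables), such a roll-up with an additional product count could be sketched in SQL as follows:

-- Quantity is summed, while the removed Product dimension is counted
SELECT T.Quarter, SUM(S.Quantity) AS Quantity,
       COUNT(DISTINCT S.ProductKey) AS ProdCount
FROM Sales S JOIN TimeDim T ON S.TimeKey = T.TimeKey
GROUP BY T.Quarter;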
Recursive hierarchies, in which a level rolls up to itself, are common in real-world situations; a typical example is an employee supervision hierarchy. Unlike fixed hierarchies, the number of levels of a recursive hierarchy depends on its members. To aggregate measures over recursive hierarchies, the RECROLLUP operation is used, which iterates a roll-up over the hierarchy until the top level is reached. Its syntax is:
RECROLLUP(CubeName, Dimension → Level, Hierarchy, AggFunction(Measure)*)
The drill-down operation performs the inverse of roll-up: it moves from a coarser level to a finer level within a hierarchy. Its syntax is:
DRILLDOWN(CubeName, (Dimension → Level)*)
where Dimension → Level is the level of a dimension to which the drill-down is performed.
In the cube of Fig. 3.4b, sales of the Seafood category in France are notably higher in the first quarter than in the other quarters. To find out why, we can drill down along the Time dimension to the Month level and check whether this peak occurred in a particular month.
As shown in Fig.3.4c, we discover that, for some reason yet unknown, sales in January soared both in Paris and in Lyon.
The sort operation returns a cube where the members of a dimension have been sorted according to the value of an expression, in ascending (ASC) or descending (DESC) order. With the ASC and DESC options, which are the default, members are sorted within their parent in the hierarchy, whereas the BASC and BDESC options sort across all members, regardless of the hierarchy.
In our example, the expression
SORT(Sales, Product, ProductName)
sorts the members of the Product dimension in ascending order of their ProductName attribute. If the cube has only one dimension, its members can be sorted according to a measure. For example, if SalesByQuarter is obtained from the original cube by aggregating sales by quarter for all cities and categories, the expression
SORT(SalesByQuarter, Time, Quantity DESC)
sorts the Time members in descending order of the Quantity measure.
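In a relational rendering (hypothetical Sales and Product tables), this kind of sorting corresponds to an ORDER BY clause over the aggregated result, as sketched below:

SELECT P.ProductName, SUM(S.Quantity) AS Quantity
FROM Sales S JOIN Product P ON S.ProductKey = P.ProductKey
GROUP BY P.ProductName
ORDER BY P.ProductName ASC;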
The pivot operation rotates the axes of a cube to provide an alternative presentation of its data. The syntax of the operation is
PIVOT(CubeName, (Dimension → Axis)*)
where the axes are specified as {X, Y, Z, X1, Y1, Z1, ...}.
In our example, to see the cube with the Time dimension on the x axis, we can rotate the axes of the original cube as follows:
PIVOT(Sales, Time → X, Customer → Y, Product → Z)
The result is shown in Fig 3.4e.
The slice operation removes a dimension of a cube (i.e., a cube of n−1 dimensions is obtained from a cube of n dimensions) by selecting one instance in a dimension level. The syntax of this operation is:
SLICE(CubeName, Dimension, Level = Value)
where the dimension Dimension is dropped by fixing a single value Value in the level Level. The other dimensions remain unchanged.
In our example, to visualize the data only for Paris, we apply a slice operation as follows:
SLICE(Sales, Customer, City = 'Paris')
The result is the subcube in Fig. 3.4f, a two-dimensional matrix showing the evolution of sales quantities by quarter and category, that is, a collection of time series. The slice operation assumes that the cube is at the appropriate level of the dimension to be fixed (the City level in our example); hence, a roll-up or drill-down is often needed before slicing.
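Under the same hypothetical relational representation, the slice on City = 'Paris' corresponds to a selection applied before the aggregation, as sketched below:

-- The Customer dimension is dropped by fixing a single city
SELECT T.Quarter, P.CategoryName, SUM(S.Quantity) AS Quantity
FROM Sales S
  JOIN Customer C ON S.CustomerKey = C.CustomerKey
  JOIN TimeDim T ON S.TimeKey = T.TimeKey
  JOIN Product P ON S.ProductKey = P.ProductKey
WHERE C.City = 'Paris'
GROUP BY T.Quarter, P.CategoryName;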
The dice operation keeps the cells of a cube that satisfy a Boolean condition ϕ over dimension levels, attributes, and measures.
The union operation is also used to display different granularities on the same dimension. For example, if SalesCountry is the cube in Fig. 3.4b, then the following operation
UNION(Sales, SalesCountry) results in a cube with sales measures summarized by city and by country.
Given two cubes with the same schema, the difference operation removes from the first cube the cells that exist in the second one. The syntax of the operation is: DIFFERENCE(CubeName1, CubeName2).
In our example, if TopTwoSales denotes the cube containing the top two sales by quarter and city, the operation DIFFERENCE(Sales, TopTwoSales) removes these cells from the original cube.
This will result in the cube in Fig 3.4u.
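If both cubes were materialized as relations with identical schemas, the difference operation would correspond to the SQL EXCEPT operator, as in this sketch (Sales and TopTwoSales assumed to be such relations):

-- Cells of Sales that also appear in TopTwoSales are removed
SELECT * FROM Sales
EXCEPT
SELECT * FROM TopTwoSales;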
The drill-through operation allows one to move from the lowest level of data in the cube to the data in the operational systems from which the cube was derived. This is useful, for example, for examining the origin of outlier values in a data cube. Note, however, that drill-through is not, strictly speaking, an OLAP operation, since its result is not a multidimensional cube.
Table 3.1 summarizes the OLAP operations presented in this section. In addition to these basic operations, OLAP tools provide a large number of mathematical, statistical, and financial functions for computing ratios, variances, interest, depreciation, currency conversions, and so on.
Data Warehouses
A data warehouse is a repository of integrated data obtained from several sources for the specific purpose of multidimensional data analysis. It is characterized as a subject-oriented, integrated, nonvolatile, and time-varying collection of data that supports management decision making. We review these characteristics next.
• Subject-oriented means that data warehouses focus on the analytical needs of different areas of an organization. These areas vary depending on the activities performed by the organization; for example, in a retail company the analysis may focus on product sales and inventory management. In contrast, operational databases focus on a specific application, such as recording product sales or replenishing inventory.
Table 3.1 Summary of OLAP operations

Add measure: Adds new measures to a cube, computed from other measures or dimensions.
Aggregation operations: Aggregate the cells of a cube, possibly after performing a grouping of cells.
Dice: Keeps the cells of a cube that satisfy a Boolean condition over dimension levels, attributes, and measures.
Difference: Removes the cells of a cube that are in another cube. Both cubes must have the same schema.
Drill-across: Merges two cubes that have the same schema and instances using a join condition.
Drill-down: Disaggregates measures along a hierarchy to obtain data at a finer granularity. It is the opposite of the roll-up operation.
Drop measure: Removes measures from a cube.
Pivot: Rotates the axes of a cube to provide an alternative presentation of its data.
Recursive roll-up: Performs an iteration of roll-ups over a recursive hierarchy until the top level is reached.
Rename: Renames one or several schema elements of a cube.
Roll-up: Aggregates measures along a hierarchy to obtain data at a coarser granularity. It is the opposite of the drill-down operation.
Roll-up*: Shorthand notation for a sequence of roll-up operations.
Slice: Removes a dimension from a cube by selecting one instance in a dimension level.
Sort: Orders the members of a dimension according to an expression.
Union: Combines the cells of two cubes that have the same schema but disjoint members.
• Integrated means that data obtained from several operational and external systems must be reconciled. This involves addressing inconsistencies in data definitions, formats, and codification, dealing with synonyms (fields with different names but the same meaning) and homonyms (fields with the same name but different meanings), and resolving multiple occurrences of the same data. In operational databases, these problems are typically addressed at design time.
• Nonvolatile means that the durability of the data is ensured by disallowing data modification and removal, thus extending the lifespan of the data compared to typical operational systems. Indeed, data warehouses accumulate data over long periods, often 5 to 10 years or more, whereas operational databases typically keep data for only the 2 to 6 months needed for daily operations and may overwrite older information as required.
• Time-varying means that the warehouse keeps track of how its data have evolved over time; it can hold different values for the same information, together with the time at which the changes occurred. For example, a data warehouse in a bank may store the average monthly balance of clients' accounts over a period of several years; a sketch of such a query is given below. In contrast, operational databases often lack explicit temporal support, since it may be unnecessary for day-to-day operations and is also difficult to implement.
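For instance, over a hypothetical AccountBalance table recording daily balances, the average monthly balance kept in the warehouse could be derived along these lines (standard SQL EXTRACT; SQL Server would use the YEAR and MONTH functions):

-- Average balance per client, per year and month
SELECT ClientID,
       EXTRACT(YEAR FROM BalanceDate) AS BalanceYear,
       EXTRACT(MONTH FROM BalanceDate) AS BalanceMonth,
       AVG(Balance) AS AvgMonthlyBalance
FROM AccountBalance
GROUP BY ClientID, EXTRACT(YEAR FROM BalanceDate), EXTRACT(MONTH FROM BalanceDate);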
A data warehouse is aimed at analyzing the data of an entire organization.
Departments of an organization often require only a portion of the corporate data warehouse, specialized for their needs. For example, a sales department may require only sales data, whereas a human resources department may require demographic data and data about its employees. These specialized subsets of a data warehouse are called data marts. However, data marts are not necessarily private to a department; they may be shared with other interested parts of the organization.
A data warehouse can be seen as a collection of data marts. This view corresponds to a bottom-up approach, in which the smaller data marts are built first and then merged to obtain the data warehouse. This approach suits organizations that want quick results or are reluctant to commit the time and effort needed to build a large data warehouse. In the traditional, top-down view, data marts are instead derived from the data warehouse, often as logical views over it.
Table 3.2 shows several aspects that differentiate operational database (or OLTP) systems from data warehouse (or OLAP) systems. We next analyze some of these differences in detail.
Typically, the users of OLTP systems are operations and office employees, who perform predefined tasks through transactional applications such as payroll or ticket reservation systems. Data warehouse users, on the other hand, are usually located at higher levels of the organization and use interactive OLAP tools for data analysis. Consequently, OLTP systems require current, detailed data, whereas data analysis requires historical, summarized data. The difference in data organization (aspect 4 in Table 3.2) follows from the type of use of OLTP and OLAP systems.
Regarding data structures, OLTP systems are optimized for rather small and simple transactions that are executed frequently and both read and write data. In the Northwind database application, for example, users often insert, modify, or delete orders. Thus, a typical OLTP transaction accesses only a small number of records.
Table 3.2 Comparison between operational databases and data warehouses

Aspect: Operational databases / Data warehouses
1. User type: Operators, office employees / Managers, account executives
2. Usage: Predictable, repetitive / Ad hoc, nonstructured
3. Data content: Current, detailed data / Historical, summarized data
4. Data organization: According to operational needs / According to analysis needs
5. Data structures: Optimized for small transactions / Optimized for complex queries
6. Access frequency: High / From medium to low
7. Access type: Read, insert, update, delete / Read, append only
9. Response time: Short / Can be long
11. Lock utilization: Needed / Not needed
13. Data redundancy: Low (normalized tables) / High (denormalized tables)
In contrast, OLAP systems must support complex aggregation queries that access many records in several tables, resulting in long and complex SQL queries. Moreover, OLTP systems are accessed frequently, for example, each time a purchase order is entered, whereas OLAP systems are accessed less often, for example, when orders are analyzed. Data warehouse records are also typically accessed in read mode. Finally, while a properly indexed OLTP system yields short response times, OLAP queries usually take longer to execute because of their complexity.
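The contrast can be illustrated with two queries over Northwind-like tables (table names and the order number are assumed for illustration): a typical OLTP access touches a single order, whereas a typical OLAP query aggregates the entire order history.

-- OLTP: read one order and its detail lines
SELECT O.OrderID, O.OrderDate, D.ProductID, D.Quantity
FROM Orders O JOIN OrderDetails D ON O.OrderID = D.OrderID
WHERE O.OrderID = 10248;

-- OLAP: total sales amount by customer country over all orders
SELECT C.Country, SUM(D.UnitPrice * D.Quantity) AS TotalSales
FROM Orders O
  JOIN OrderDetails D ON O.OrderID = D.OrderID
  JOIN Customers C ON O.CustomerID = C.CustomerID
GROUP BY C.Country;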
OLTP systems normally serve a high number of concurrent accesses and therefore require locking or other concurrency management mechanisms to ensure safe transaction processing. In contrast, OLAP systems are read-only: although queries can be submitted and computed concurrently, the number of concurrent users is usually low. Finally, OLTP systems are typically modeled with UML or the ER model, whereas data warehouses are modeled with the multidimensional model.
Data Warehouse Architecture
Back-End Tier
The back-end tier is composed of extraction, transformation, and loading (ETL) tools, used to feed the data warehouse with data from operational databases and other internal or external sources, and a data staging area (sometimes called an operational data store), an intermediate database in which the extracted data undergo successive transformations before being loaded into the data warehouse.
The extraction, transformation, and loading process, as the name indicates, is a three-step process as follows:
• Extraction gathers data from multiple, heterogeneous sources, which may include operational databases as well as files in various formats; the sources may be internal to the organization or external to it.
In order to solve interoperability problems, data are extracted whenever possible using application programming interfaces (APIs) such as ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity).
• Transformation converts the data from their source format into the warehouse format. This comprises cleaning, which removes errors and inconsistencies; integration, which reconciles data from different sources at both the schema and the data level; and aggregation, which summarizes the data according to the level of detail, or granularity, of the data warehouse.
• Loading feeds the data warehouse with the transformed data and also refreshes it, that is, propagates updates from the sources at a specified frequency in order to provide up-to-date data for decision making. Depending on organizational policies, the refresh frequency may vary from monthly to several times a day, or even near real time; a simplified sketch of this step is given after this list.
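As a simplified sketch of the loading step, rows validated in a hypothetical staging table could be appended to a warehouse fact table as follows (all table and column names are assumptions):

-- Append transformed and validated rows from the staging area to the fact table
INSERT INTO SalesFact (CustomerKey, ProductKey, TimeKey, Quantity)
SELECT CustomerKey, ProductKey, TimeKey, Quantity
FROM StagingSales
WHERE IsValid = 1;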
Data Warehouse Tier
The data warehouse tier in Fig. 3.5 is composed of an enterprise data warehouse and/or several data marts, and a metadata repository storing information about the data warehouse and its contents.
An enterprise data warehouse is a centralized warehouse that covers the entire organization, while a data mart is a specialized warehouse targeted at a particular functional or departmental area; a data mart can thus be seen as a small, local data warehouse. The data in a data mart can be obtained either from the enterprise data warehouse or collected directly from the data sources.
Another component of the data warehouse tier is the metadata repository.
Metadata, often described as "data about data," is traditionally classified into technical and business metadata. Business metadata describes the meaning (or semantics) of the data as well as the organizational rules, policies, and constraints that govern it. Technical metadata describes how data are structured and stored in a computer system and the applications and processes that manipulate them.
In a data warehouse, technical metadata describes the data warehouse system, the source systems, and the ETL process. In particular, a metadata repository may contain information such as the following.
• Metadata describing the structure of the data warehouse and the data marts at the conceptual, logical, and physical levels, together with security information (user authorization and access control) and monitoring information (usage statistics, error reports, and audit trails).
• Metadata describing the data sources and their schemas at the conceptual, logical, and physical levels, including ownership, update frequencies, legal limitations, and access methods.
• Metadata describing the ETL process, including data lineage (i.e., tracing warehouse data back to their sources), data extraction, cleaning and transformation rules, default values, data refresh and purging rules, and the algorithms used for summarization.
OLAP Tier
The OLAP tier is composed of an OLAP server, which presents business users with multidimensional data, regardless of how the data are actually stored underneath. Although most database products provide OLAP extensions and related tools for defining and querying data cubes, there is no standard language for data cube manipulation, so the technologies differ between systems. Among the relevant languages, XMLA (XML for Analysis) aims at providing a common language for exchanging multidimensional data between client applications and OLAP servers, while MDX (MultiDimensional eXpressions) and DAX (Data Analysis eXpressions) are query languages for OLAP databases. MDX has become a de facto standard supported by many OLAP vendors, whereas DAX, introduced by Microsoft, is intended to be easier to use by business users. In addition, the SQL standard has been extended with analytical capabilities, known as SQL/OLAP. We study MDX, DAX, and SQL/OLAP in Chaps. 6 and 7.
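As a small example of the SQL/OLAP analytical extensions, a window function can compute an aggregate such as a three-month moving average; the MonthlySales table and its columns are hypothetical:

SELECT SalesMonth, SalesAmount,
       AVG(SalesAmount) OVER (ORDER BY SalesMonth
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS MovAvg3Months
FROM MonthlySales;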
Front-End Tier
The front-end tier in Fig 3.5 is used for data analysis and visualization.
It contains client tools that allow users to exploit the contents of the data warehouse. Typical client tools include the following:
• OLAP tools allow users to interactively explore and manipulate the warehouse data. They make it easy to formulate complex ad hoc queries that may involve large amounts of data, without requiring any knowledge of the underlying systems.
• Reporting tools enable the design, production, and delivery of reports in various formats, including paper-based, interactive, and web-based reports. Reports rely on predefined queries, which request specific information in a specific format and are executed on a regular basis. Modern reporting also relies on key performance indicators (KPIs) and dashboards.
• Statistical toolsare used to analyze and visualize the cube data using statistical methods.
• Data mining tools allow users to analyze the data in order to discover valuable knowledge, such as patterns and trends, and to make predictions on the basis of the existing data.
In Chap. 7, we show some of the tools used to exploit the data warehouse, like data analysis tools, key performance indicators, and dashboards.
Variations of the Architecture
Some of the components described above may be missing in a real environment. In some situations there may be only an enterprise data warehouse without data marts, or an enterprise data warehouse may not exist at all. Building an enterprise data warehouse is a complex and costly task, whereas data marts are typically easier to build; however, when several data marts are created independently, integrating them later into an enterprise data warehouse can be difficult.
In other situations, an OLAP server may not exist and client tools may access the data warehouse directly (the connection between the data warehouse tier and the front-end tier). This is exemplified in Chap. 7, where queries for the Northwind case study are expressed in MDX and DAX for an OLAP server, as well as in SQL over the data warehouse. In an extreme situation, neither a data warehouse nor an OLAP server exists; this is called a virtual data warehouse, which defines a set of views over the operational databases designed for efficient access (the arrow connecting the data sources to the front-end tier). Although a virtual data warehouse is easy to build, it does not contain historical data.
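A virtual data warehouse could be sketched as a set of views such as the following one, defined directly over Northwind-like operational tables (names assumed for illustration):

-- Analysis-oriented view computed on demand from the operational tables
CREATE VIEW SalesByEmployee AS
SELECT E.LastName, COUNT(DISTINCT O.OrderID) AS NumOrders,
       SUM(D.UnitPrice * D.Quantity) AS TotalSales
FROM Orders O
  JOIN OrderDetails D ON O.OrderID = D.OrderID
  JOIN Employees E ON O.EmployeeID = E.EmployeeID
GROUP BY E.LastName;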
Moreover, a virtual data warehouse does not contain centralized metadata and provides no ability to clean and transform the data; it can also negatively affect the performance of the operational databases.
Finally, a data staging area may not be needed if the data in the source systems closely match the data in the warehouse, which typically happens when there are few data sources of high quality. This is, however, rarely the case in real-world applications.
3.5 Overview of Microsoft SQL Server BI Tools
A wide variety of business intelligence tools are available on the market. Major database providers such as Microsoft, Oracle, IBM, and Teradata offer their own suites, as do companies such as SAP, MicroStrategy, Qlik, and Tableau, alongside popular open-source platforms such as Pentaho. In this book, we use the Microsoft SQL Server tools as a representative suite to illustrate the concepts. SQL Server provides an integrated platform for building analytical applications; it is composed of three main components, described briefly below. References to other prominent business intelligence tools are given in the bibliographic notes.
Analysis Services supports the definition, querying, updating, and management of analytical databases in two modes, multidimensional and tabular. The main difference between them lies in the underlying paradigm (multidimensional or relational, respectively) and the associated query language, MDX for the multidimensional mode and DAX for the tabular mode. We study both modes and their languages in Chaps. 5, 6, and 7, using the analytical database of the Northwind case study.
Integration Services supports ETL processes: data can be extracted from various sources and then combined, cleaned, and summarized before being loaded into the data warehouse. We explain Integration Services and the ETL process in Chap. 9 using the Northwind case study.
Reporting Services is used to define, generate, store, and manage reports from various data sources, such as data warehouses and OLAP cubes. Reports can be personalized and delivered in a variety of formats and accessed through various clients, including web browsers and mobile applications; users access the reports through the server component of Reporting Services. We use Reporting Services in Chap. 7 to build dashboards for the Northwind case study. Several tools are used for developing and managing these components.
Visual Studio is a development platform that supports Analysis Services, Integration Services, and Reporting Services projects. SQL Server Management Studio (SSMS) provides integrated management of all SQL Server components. Power BI enables business users to analyze data and produce reports and dashboards themselves, following a self-service BI approach that minimizes the need for IT involvement. Power Pivot is an Excel add-in for building and analyzing data models.
Summary
This chapter introduced the multidimensional model, which is the basis of data warehouse systems, and distinguished online analytical processing (OLAP) from online transaction processing (OLTP). We defined the notion of a data cube along with its dimensions, hierarchies, and measures, classified measures according to how they can be aggregated, and discussed aggregation and summarizability. We then described the main OLAP operations, such as roll-up and drill-down, used to interactively manipulate a data cube. We also compared data warehouse systems with traditional database systems, described the basic architecture of data warehouse systems and several of its variations, and concluded with an overview of the Microsoft SQL Server BI tools.
Bibliographic Notes
Foundational data warehouse concepts are covered in the classic books by Kimball and by Inmon, the latter being the source of the definition of a data warehouse used in this chapter. The notion of hypercube underlying the multidimensional model was studied in early work, in particular in connection with the SQL roll-up and cube operators. OLAP hierarchies have also been examined, and the summarizability of measures has been defined and analyzed in several works; other classifications of measures have been proposed as well. More details and references on these topics are given in Chap. 5.
There is currently no universally accepted definition of the OLAP operations, unlike the well-established operations of the relational algebra. Several OLAP algebras have been proposed in the literature, each defining its own set of operations. A comparison of these algebras is given in [202], which highlights the need for a reference algebra for OLAP. The operations defined in this chapter were inspired by [50].
Regarding SQL Server, dedicated books cover Analysis Services, Integration Services, and Reporting Services. The tabular model of Microsoft Analysis Services and DAX are also studied in depth in specialized references.
Review Questions
3.1 What is the meaning of the acronyms OLAP and OLTP?
3.2 Using an example of an application domain that you are familiar with, describe the various components of the multidimensional model, that is, facts, measures, dimensions, and hierarchies.
3.3 Why are hierarchies important in data warehouses? Give examples of various hierarchies.
3.4 Discuss the role of measure aggregation in a data warehouse How can measures be characterized?
3.5 Give an example of a problem that may occur when summarizability is not verified in a data warehouse.
3.6 Describe the various OLAP operations using the example you defined in Question 3.2.
3.7 What is an operational database system? What is a data warehouse system? Explain several aspects that differentiate these systems.
3.8 Give some essential characteristics of a data warehouse How do a data warehouse and a data mart differ? Describe two approaches for building a data warehouse and its associated data marts.
3.9 Describe the various components of a typical data warehouse architecture. Identify variants of this architecture and specify in what situations they are used.
3.10 Briefly describe the components of Microsoft SQL Server.
Exercises
Exercise 3.1. Consider the data warehouse of a telephone provider, whose dimensions include the caller customer, the callee customer, the date and time of the call, the type of call, and the call program, and whose measures include the duration of the call and the amount collected. Specify the OLAP operations needed to answer the following queries:
(a) Total amount collected by each call program in 2012.
(b) Total duration of calls made by customers from Brussels in 2012.
(c) Total number of weekend calls made by customers from Brussels to customers in Antwerp in 2012.
(d) Total duration of international calls started by customers in Belgium in 2012.
(e) Total amount collected from customers in Brussels who are enrolled in the corporate program, in 2012.
Exercise 3.2. Consider the data warehouse of a train company, which keeps track of trips composed of segments, the trains used, the departure and arrival stations, the number of passengers, and the kilometers traveled. Specify the OLAP operations needed to answer the following queries:
(a) Total number of kilometers made by Alstom trains during 2012, for trips departing from French or Belgian stations.
(b) Total duration of international trips during 2012, that is, trips departing from a station of a country and arriving at a station of another country.
(c) Total number of trips that either departed from or arrived at Paris during July 2012.
(d) Average duration of train segments in Belgium in 2012.
(e) For each trip, the average number of passengers per segment, obtained by adding up the passengers of all segments of the trip.
Exercise 3.3. Consider a university data warehouse describing teaching and research activities. Teaching activities are organized along the department, professor, course, and academic semester dimensions and are measured by the number of hours taught and the number of credits. Research activities are organized along the professor, funding agency, project, and time dimensions, the latter covering start and end dates; each professor is affiliated with a department, and the measures are the number of person months and the funding amount. Define the OLAP operations, together with the necessary dimension hierarchies, to answer the following queries:
(a) Total teaching hours by department during the academic year 2012–2013.
(b) Total research project funding by department during the calendar year 2012.
(c) Total number of professors involved in research projects by department during 2012.
(d) Total number of courses delivered by each professor during the academic year 2012–2013.
(e) Number of projects started in 2012 by department and funding agency.
Utilizing conceptual models for database design offers significant advantages, primarily by enhancing communication between users and designers without requiring technical knowledge of the implementation platform. These models can be mapped to various logical structures, including relational, object-oriented, and graph models, making it easier to adapt to technological changes. In addition, by prioritizing user requirements, conceptual models support database maintenance and evolution, ensuring that modifications to logical and physical schemas can be managed more efficiently.
This chapter focuses on conceptual modeling for data warehouses, using the MultiDim model to define data requirements for data warehouse and OLAP applications. It begins with an overview of the model in Section 4.1. Section 4.2 emphasizes the importance of hierarchies in exploiting the full functionality of data warehouse and OLAP systems, classifying the various types of hierarchies and illustrating them graphically. Advanced aspects of conceptual modeling are discussed in Section 4.3, and the chapter concludes in Section 4.4 by revisiting the OLAP operations through a set of queries over the Northwind data warehouse.
Conceptual Modeling of Data Warehouses
The conventional database design process develops database schemas at three levels: conceptual, logical, and physical. At the conceptual level, a schema provides a clear overview of the users' data requirements, focusing on what data is needed rather than how it will be implemented. This design is typically carried out with entity-relationship models, ensuring a structured approach to database creation.
The Entity-Relationship (ER) model and the Unified Modeling Language (UML) are commonly used for conceptual schemas, which are then translated into the relational model through well-defined mapping rules. However, a standardized conceptual model for multidimensional data is lacking, and data warehouse designs are therefore often carried out directly at the logical level. This approach typically relies on star and snowflake schemas, which can be complex and difficult for the average user to understand. Consequently, there is a need for a conceptual data warehousing model that abstracts from and clarifies the logical level.
In this chapter, we introduce the MultiDim model to represent the essential components of data warehouse and OLAP applications at the conceptual level. The graphical notation of the model is summarized in Appendix A. Throughout the chapter, we use the Northwind data warehouse schema depicted in Fig. 4.1, which we also refer to as the Northwind data cube.
A schema in the MultiDim model is composed of a set of dimensions and a set of facts.
A dimension consists of either a single level or one or more hierarchies, where each hierarchy is in turn composed of several levels. Dimensions have no graphical representation of their own; they are depicted through their constituent components.
A level is analogous to an entity type in the ER model: it represents a set of real-world concepts that share common characteristics from the application's point of view. The individual occurrences of a level are called members.
Figure 4.1 shows, for instance, the Product and Category levels, each characterized by a set of attributes that describe their members. Each level has one or more identifiers, composed of one or more attributes, such as CategoryID for the Category level. The attributes of a level take values from domains such as integer, real, and string; however, this type information is not shown in the graphical representation of our conceptual schemas.
A fact relates several levels, as illustrated by the Sales fact in Fig. 4.1, which relates the Employee, Customer, Supplier, Shipper, Order, Product, and Date levels. Note that the same level can participate several times in a fact, playing different roles.
Each role is identified by a name and is represented by a separate link between the fact and the level. For instance, the Date level participates in the Sales fact through the OrderDate, DueDate, and ShippedDate roles. Instances of a fact are called fact members. The cardinality of the relationship between a fact and a level defines the minimum and maximum number of fact members that can be related to level members. In our example, the Sales fact has a one-to-many relationship with the Product level: each sale refers to a single product, while a product can be related to many sales. In addition, the Sales fact has a one-to-one relationship with the Order level: each sale corresponds to a single order line and, conversely, each order line corresponds to a single sale.
Fig 4.1 Conceptual schema of the Northwind data warehouse
A fact typically contains attributes called measures, which are usually numeric values that are analyzed along the various dimensions. For instance, the Sales fact in Fig. 4.1 includes the measures Quantity, UnitPrice, and Discount. The identifiers of the levels participating in a fact determine the granularity of the measures, that is, the level of detail at which measures are represented.
In roll-up operations, measures are aggregated along dimensions; the SUM aggregation function is assumed by default unless another one is indicated. Measures are classified as additive, semiadditive, or nonadditive, and are assumed to be additive by default. Semiadditive and nonadditive measures are annotated with the symbols '+!' and '+', respectively. In our example, Quantity is an additive measure, while UnitPrice is semiadditive. Furthermore, measures and level attributes may be derived from other measures or attributes in the schema; they are annotated with the symbol '/'. For example, NetAmount is a derived measure.
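Anticipating the relational representation of Chapter 5, the following minimal SQL sketch illustrates the distinction in practice. The table layout and the NetAmount formula are illustrative assumptions, not the Northwind definitions:
CREATE TABLE SalesFact (
  ProductKey   INT,
  CustomerKey  INT,
  OrderDateKey INT,
  Quantity     INT,            -- additive measure
  UnitPrice    DECIMAL(10,2),  -- semiadditive measure
  Discount     DECIMAL(4,2)
);
-- Additive measure: SUM is meaningful along every dimension.
SELECT ProductKey, SUM(Quantity) AS TotalQuantity
FROM SalesFact GROUP BY ProductKey;
-- Semiadditive measure: use AVG (or MIN/MAX) rather than SUM across products.
SELECT ProductKey, AVG(UnitPrice) AS AvgUnitPrice
FROM SalesFact GROUP BY ProductKey;
-- Derived measure computed from other measures (the formula is an assumption).
SELECT ProductKey, SUM(Quantity * UnitPrice * (1 - Discount)) AS NetAmount
FROM SalesFact GROUP BY ProductKey;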
A hierarchy comprises several related levels, where the lower level is called the child and the higher level the parent. These relationships are called parent-child relationships, and their cardinalities indicate the minimum and maximum number of members at one level that can be related to a member at the other level. For example, in the hierarchy relating the child level Product to the parent level Category, the cardinality is one-to-many: each product belongs to a single category, while a category can group many products.
A dimension may contain several hierarchies, each one expressing a particular analysis criterion; hierarchy names are therefore included to differentiate them. For instance, the Employee dimension has two hierarchies, Territories and Supervision. When a hierarchy is not intended for aggregation, its attributes are represented in a single level, as is the case for the City, Region, and Country attributes of the Employee dimension.
The levels of a hierarchy allow data to be analyzed at different degrees of detail; for example, the Product level gives specific information about products, while the Category level provides a more general view in terms of product categories. The level containing the most detailed data is called the leaf level, and its name usually determines the name of the dimension, except when the same level participates several times in a fact, in which case the role name defines the dimension name; these are known as role-playing dimensions.
The level representing the most general data is called the root level and is often denoted by a distinguished level called All, which contains a single member, denoted all. Whether to include this level in multidimensional schemas is left to the designer; since the All level is essentially meaningless in conceptual schemas, we usually omit it and include it only when clarity requires it.
Hierarchies
Balanced Hierarchies
A balanced hierarchy has, at the schema level, a single mandatory path, exemplified by the Product→Category relationship. At the instance level, such a hierarchy forms a tree whose branches all have the same length: each parent member has at least one child member, and each child member belongs to exactly one parent member. In our example, every category contains at least one product, and each product is assigned to exactly one category.
Unbalanced Hierarchies
An unbalanced hierarchy has, at the schema level, at least one level that is not mandatory, so that at the instance level some parent members may have no associated child members. For instance, in a bank hierarchy, some branches may have no agencies and some agencies may have no ATMs, yielding an unbalanced tree whose branches have different lengths. As in balanced hierarchies, each child member is related to exactly one parent member, as is the case for agencies and branches. Unbalanced hierarchies are useful for accommodating facts at different granularities, so that different facts can be associated with different levels, such as ATMs and agencies.
Fig 4.2 An unbalanced hierarchy (a) Schema; (b) Examples of instances
Recursive hierarchies, also called parent-child hierarchies, are a special case of unbalanced hierarchies in which members of the same level are related through the two roles of a parent-child relationship; note the distinction between parent-child hierarchies and parent-child relationships.
The Supervision hierarchy of the Employee dimension in Fig. 4.1 is an example: it represents an organizational chart in terms of the employee-supervisor relationship, where the Subordinate and Supervisor roles are linked through a parent-child relationship over the Employee level. As Fig. 4.3 shows, this hierarchy is unbalanced, since employees with no subordinates have no descendants in the instance tree.
Fig 4.3 Instances of the parent-child hierarchy in the Northwind data warehouse
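Anticipating the relational representation of Chapter 5, the following minimal SQL sketch shows how such a parent-child hierarchy can be traversed with a recursive query; the table layout and column names are assumptions for illustration only:
CREATE TABLE Employee (
  EmployeeKey   INT PRIMARY KEY,
  EmployeeName  VARCHAR(50),
  SupervisorKey INT REFERENCES Employee(EmployeeKey)  -- NULL for the root of the hierarchy
);
CREATE TABLE Sales (
  EmployeeKey INT REFERENCES Employee(EmployeeKey),
  SalesAmount DECIMAL(12,2)
);
-- Total sales made by each employee together with all her direct and
-- indirect subordinates, obtained by traversing the parent-child links.
WITH RECURSIVE Subordinates(AncestorKey, EmployeeKey) AS (
  SELECT EmployeeKey, EmployeeKey FROM Employee
  UNION ALL
  SELECT s.AncestorKey, e.EmployeeKey
  FROM Subordinates s JOIN Employee e ON e.SupervisorKey = s.EmployeeKey
)
SELECT s.AncestorKey AS EmployeeKey, SUM(f.SalesAmount) AS TotalSales
FROM Subordinates s JOIN Sales f ON f.EmployeeKey = s.EmployeeKey
GROUP BY s.AncestorKey;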
Generalized Hierarchies
Generalized hierarchies arise when the members of a level are of different types; for example, customers may be either companies or persons. In an ER model this situation is captured by a generalization relationship. Measures pertaining to customers may require different aggregation paths depending on the type: for companies the path is Customer→Sector→Branch, whereas for persons it is Customer→Profession→Branch. The MultiDim model represents such hierarchies explicitly, showing both the common and the type-specific levels as well as the parent-child relationships among them.
At the schema level, a generalized hierarchy contains multiple exclusive paths that share at least the leaf level, as illustrated in Fig. 4.4a, which shows the two aggregation paths for the two customer types within the same hierarchy. At the instance level, each member belongs to exactly one of the paths, as shown in Fig. 4.4b, where the symbol '⊗' indicates that the paths are exclusive for every member. The levels at which the alternative paths split and join are called, respectively, the splitting and joining levels.
Distinguishing splitting and joining levels matters for correctly aggregating measures during roll-up operations, a property known as summarizability. As discussed in Chapter 3, generalized hierarchies are in general not summarizable; for instance, not all customers can be rolled up to the Profession level. Thus, the aggregation mechanism must be adapted when a splitting level is reached in a roll-up operation.
Fig 4.4 A generalized hierarchy (a) Schema; (b) Examples of instances
In generalized hierarchies, it is not necessary that splitting levels are joined.
The hierarchy depicted in Fig. 4.5, used to analyze international publications, covers journals, books, and conference proceedings. While conference proceedings can be aggregated at the conference level, there is no common joining level for all paths. Generalized hierarchies also include an important particular case known as ragged hierarchies, exemplified by the Geography hierarchy shown in Fig. 4.1.
Some countries, such as Belgium, are divided into regions, whereas others, such as Germany, are not; smaller countries, such as the Vatican, have neither regions nor states. A ragged hierarchy is thus a generalized hierarchy in which alternative paths are obtained by skipping one or more intermediate levels. At the instance level, each child member has a single parent member, although the distance from the leaves to that parent may differ for different members.
Fig 4.5 A generalized hierarchy without a joining level
Fig 4.6 Examples of instances of the ragged hierarchy in Fig 4.1
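As an illustration only (anticipating the relational alternatives of Chapter 5), a ragged Geography hierarchy could be stored in a single denormalized table in which missing intermediate levels are represented by NULLs; the table and column names below are assumptions:
CREATE TABLE Geography (
  CityKey     INT PRIMARY KEY,
  CityName    VARCHAR(50),
  StateName   VARCHAR(50),   -- NULL when the country has no states
  RegionName  VARCHAR(50),   -- NULL when the country has no regions
  CountryName VARCHAR(50) NOT NULL
);
-- Roll up each city to its closest existing ancestor: the state if present,
-- otherwise directly the country.
SELECT CityName, COALESCE(StateName, CountryName) AS ParentMember
FROM Geography;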
Alternative Hierarchies
Alternative hierarchies arise when, at the schema level, several nonexclusive hierarchies share the same leaf level. For instance, the Date dimension may contain two hierarchies corresponding to the grouping of months into calendar years and into fiscal years. At the instance level, this yields a graph, as shown in the example, since a child member can be related to several parent members belonging to different levels. Alternative hierarchies are needed to analyze measures from a single perspective, such as time, using several alternative aggregation paths.
Fig 4.7 An alternative hierarchy (a) Schema; (b) Examples of instances
Although generalized and alternative hierarchies may share some levels, they model different situations. In a generalized hierarchy, a child member belongs to only one of the paths, whereas in an alternative hierarchy a child member belongs to all paths, and the user must choose one of them for a particular analysis.
Parallel Hierarchies
When a dimension has several hierarchies associated with it (possibly of different kinds), accounting for different analysis criteria, we are in the presence of parallel hierarchies. Parallel hierarchies are classified as dependent or independent depending on whether their component hierarchies share levels. For instance, Fig. 4.8 shows a dimension with two parallel independent hierarchies. The hierarchy ProductGroups is used for grouping products according to categories or departments.
Fig 4.8 An example of parallel independent hierarchies (ProductGroups and DistributorLocation)
The other hierarchy, DistributorLocation, groups products according to the distributors' divisions or regions, as shown in Fig. 4.8. In contrast, the parallel dependent hierarchies in Fig. 4.9 are used for analyzing sales across stores located in several countries. The hierarchy StoreLocation organizes stores geographically, while SalesOrganization reflects the company's organizational structure. Both hierarchies share the State level, which plays different roles depending on the hierarchy used for analysis. Sharing levels in a conceptual schema reduces the number of its elements without losing semantics, thus improving readability. To unambiguously define the levels composing the various hierarchies, the hierarchy name must be included in the shared level for the hierarchies that continue beyond it, as is the case for StoreLocation and SalesOrganization at the State level.
Fig 4.9 An example of parallel dependent hierarchies
Alternative and parallel hierarchies must be clearly distinguished. An alternative hierarchy bears a single hierarchy name, indicating that it is not meaningful to combine levels from its different component paths. Parallel hierarchies, in contrast, bear several hierarchy names, and levels from different component hierarchies can be combined. For instance, over the schema in Fig. 4.9 a user can meaningfully ask for the sales figures for stores in city A that belong to sales district B.
Fig 4.10 Parallel dependent hierarchies leading to different parent members of the shared level
In parallel dependent hierarchies, unlike alternative hierarchies, a leaf member may roll up to different members of a shared level depending on the hierarchy that is traversed. For instance, consider sales employees and their living place and territory assignments: traversing the Lives and Territories hierarchies from the Employee level to the State level may lead to different states for an employee who lives in one state but is assigned to another. A consequence is that aggregated measure values can be reused across the component hierarchies of an alternative hierarchy, but not across parallel dependent hierarchies. For example, if employees E1, E2, and E3 make sales of $50, $100, and $150, respectively, and all three live in state A, aggregating their sales through the Lives hierarchy gives $300 for state A, while the Territories hierarchy gives only $150. Both results are correct, since they correspond to different analysis criteria.
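The figures above can be reproduced with the following illustrative SQL sketch, in which the two hierarchies are represented by hypothetical assignment tables; the data and table layout are assumptions chosen to match the example:
CREATE TABLE EmployeeSales (Employee VARCHAR(5), SalesAmount DECIMAL(10,2));
CREATE TABLE Lives       (Employee VARCHAR(5), State CHAR(1));   -- state of residence
CREATE TABLE Territories (Employee VARCHAR(5), State CHAR(1));   -- state of assignment
INSERT INTO EmployeeSales VALUES ('E1', 50), ('E2', 100), ('E3', 150);
INSERT INTO Lives VALUES ('E1', 'A'), ('E2', 'A'), ('E3', 'A');
INSERT INTO Territories VALUES ('E1', 'B'), ('E2', 'B'), ('E3', 'A');  -- assumed assignments
-- Roll-up through the Lives hierarchy: state A totals 300.
SELECT l.State, SUM(s.SalesAmount) AS Total
FROM EmployeeSales s JOIN Lives l ON s.Employee = l.Employee
GROUP BY l.State;
-- Roll-up through the Territories hierarchy: state A totals only 150.
SELECT t.State, SUM(s.SalesAmount) AS Total
FROM EmployeeSales s JOIN Territories t ON s.Employee = t.Employee
GROUP BY t.State;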
Nonstrict Hierarchies
The parent-child relationships studied so far have been one-to-many: each child has a single parent, while a parent can have several children. Real-life applications, however, frequently exhibit many-to-many parent-child relationships: a diagnosis may belong to several diagnosis groups, a week may span two months, or a product may be classified into several categories.
A hierarchy that contains at least one many-to-many parent-child relationship is called nonstrict; otherwise it is strict. Whether a hierarchy is strict or nonstrict is independent of its kind, so all the hierarchies discussed above can be either strict or nonstrict. In what follows, we discuss the issues that arise when aggregating over nonstrict hierarchies.
Fig 4.11 Examples of instances of the nonstrict hierarchy in Fig 4.1
The Territories hierarchy in Fig. 4.1 is nonstrict: an employee may be assigned to several cities located in different states, as is the case for Janet Leverling in Fig. 4.11. Since a child member may have several parent members, the instances form an acyclic graph rather than a tree. We nevertheless use the term nonstrict hierarchy rather than acyclic classification graph, both because it conveys that users need to analyze measures at several levels of detail and because the notion of hierarchy is well established among data warehouse practitioners and researchers.
Nonstrict hierarchies give rise to the problem of double-counting measures in roll-up operations over many-to-many relationships. Figure 4.12a shows the strict case: employee Janet Leverling is assigned only to Atlanta, and her total sales of 100 are correctly aggregated by city and by state. Figure 4.12b, in contrast, shows the nonstrict case in which Janet is assigned to Atlanta, Orlando, and Tampa, which complicates the aggregation of her sales.
Fig 4.12 Double-counting problem when aggregating a sales amount measure in Fig 4.11 (a) Strict hierarchy; (b) Nonstrict hierarchy
This approach causes incorrect aggregated results, since the employee’s sales are counted three times instead of only once.
One way to avoid double counting in many-to-many relationships is to transform the nonstrict hierarchy into a strict one by creating a new member for each set of parent members; for instance, a new member representing the three cities Atlanta, Orlando, and Tampa, together with a corresponding member at the state level representing the fact that these cities belong to Florida and Georgia. Another option is to ignore the many-to-many relationship by choosing one primary member, such as Atlanta, and discarding the other parent members. Neither solution corresponds to the users' analysis needs, however: the former introduces artificial categories, while the latter simply ignores meaningful analysis scenarios.
Fig 4.13 A nonstrict hierarchy with a distributing attribute
An alternative approach to the double-counting problem is to indicate how measures are distributed among the several parent members of a many-to-many relationship. This is illustrated in Fig. 4.13, which shows a nonstrict hierarchy where employees may work in several sections. The schema includes a measure representing an employee's overall salary, that is, the sum of the salaries paid in each section. Suppose that an attribute stores the percentage of time for which an employee works in each section. We annotate this attribute in the relationship with the additional symbol '÷', indicating that it is a distributing attribute that determines how measures are divided between the several parent members of a many-to-many relationship. Choosing an appropriate distributing attribute is important in order to avoid approximate results when aggregating measures. For example, suppose that in Fig. 4.13 the distributing attribute represents the percentage of time that an employee works in a specific section. If the employee holds a higher position in one section, she may earn a higher salary there even though she works less time in that section; applying the percentage of time as a distributing attribute for the overall salary measure would then not give an exact result. Note also that when the distributing attribute is unknown, it can be approximated by considering the total number of parent members with which the child member is associated. In the example of Fig. 4.12, since Janet Leverling is associated with three cities, one third of the value of the measure would be accounted for each city.
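A possible relational rendering of this idea (names and layout are assumptions, not the book's mapping) uses a bridge table carrying the distributing attribute; the second query shows the even-split approximation mentioned above:
CREATE TABLE EmployeeSalary (
  EmployeeKey   INT PRIMARY KEY,
  OverallSalary DECIMAL(12,2)
);
CREATE TABLE EmployeeSection (          -- bridge table for the many-to-many relationship
  EmployeeKey INT REFERENCES EmployeeSalary(EmployeeKey),
  SectionKey  INT,
  Percentage  DECIMAL(5,4)              -- distributing attribute; sums to 1 per employee
);
-- Salary attributed to each section according to the distributing attribute.
SELECT b.SectionKey, SUM(e.OverallSalary * b.Percentage) AS SectionSalary
FROM EmployeeSalary e JOIN EmployeeSection b ON e.EmployeeKey = b.EmployeeKey
GROUP BY b.SectionKey;
-- When the distributing attribute is unknown, an even split approximates it.
SELECT b.SectionKey, SUM(e.OverallSalary / c.NbSections) AS ApproxSectionSalary
FROM EmployeeSalary e
JOIN EmployeeSection b ON e.EmployeeKey = b.EmployeeKey
JOIN (SELECT EmployeeKey, COUNT(*) AS NbSections
      FROM EmployeeSection GROUP BY EmployeeKey) c ON c.EmployeeKey = e.EmployeeKey
GROUP BY b.SectionKey;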
Fig 4.14 Transforming a nonstrict hierarchy into a strict one with an additional dimension
Figure 4.14 shows yet another solution to the problem of Fig. 4.13: the nonstrict hierarchy is transformed into independent dimensions, and the focus shifts from the overall salary of an employee to the salary paid by section. This approach is only applicable when the exact distribution of the measure is known, that is, when the amounts paid for the different sections are available; it cannot be used for nonstrict hierarchies without a distributing attribute, such as the one in Fig. 4.11. Moreover, although the salary measure can now be correctly aggregated through the roll-up operation from the Section to the Division level, the problem of double-counting employees remains. For instance, using the schema in Fig. 4.14 to compute the number of employees by section or division requires counting the employee instances in the fact, as shown in Fig. 4.15a, where five employees are assigned to various sections.
The double-counting problem reappears when the number of employees is aggregated across sections. Although the count of employees within each section is correct, reusing these aggregated values to compute the count per division yields an incorrect result: employees E1 and E2 are counted twice, giving a total of 7 instead of the correct value of 5.
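The following illustrative SQL sketch reproduces the counting example (table layout and data are assumptions): the first query reuses the per-section counts and double-counts employees E1 and E2, while the second counts distinct employees directly.
CREATE TABLE Affiliation (Employee VARCHAR(5), Section VARCHAR(5), Division VARCHAR(5));
INSERT INTO Affiliation VALUES
  ('E1','S1','D1'), ('E1','S2','D1'),   -- E1 and E2 each work in two sections
  ('E2','S1','D1'), ('E2','S2','D1'),
  ('E3','S1','D1'), ('E4','S3','D1'), ('E5','S3','D1');
-- Incorrect: summing the per-section counts double-counts E1 and E2 (result: 7).
SELECT Division, SUM(NbEmployees) AS NbEmployees
FROM (SELECT Division, Section, COUNT(DISTINCT Employee) AS NbEmployees
      FROM Affiliation GROUP BY Division, Section) PerSection
GROUP BY Division;
-- Correct: count distinct employees directly at the division level (result: 5).
SELECT Division, COUNT(DISTINCT Employee) AS NbEmployees
FROM Affiliation GROUP BY Division;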
In summary, nonstrict hierarchies can be handled in several ways:
• Transforming a nonstrict hierarchy into a strict one:
– Creating a new parent member for each group of parent members participating in a many-to-many relationship.
– Choosing one parent member as the primary member and ignoring the other parent members.
– Replacing the nonstrict hierarchy by two independent dimensions.
• Calculating approximate values of a distributing attribute.
The designer must select the most appropriate solution according to the situation at hand and the users' requirements, since each alternative has its own advantages and disadvantages and requires its own aggregation procedures.
Advanced Modeling Aspects
Facts with Multiple Granularities
Measures may be captured at multiple granularities. In Fig. 4.16, for instance, sales for the USA are registered at the state level, while European sales are registered at the city level. Similarly, in a medical data warehouse the diagnosis dimension may contain the levels diagnosis, diagnosis family, and diagnosis group: a patient may be related to a diagnosis at the most detailed level, but may also be assigned only a diagnosis family or group, reflecting different degrees of precision in the data.
Fig 4.16 Multiple granularities for a Sales fact
This situation can be modeled using exclusive relationships between the fact and the various granularity levels. The main challenge is then to guarantee correct analysis results when fact data are registered at multiple granularities.
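One possible relational sketch of such exclusive links (an assumption for illustration, not the book's mapping) uses two optional foreign keys and a check constraint, together with a roll-up query that brings every row to the state level:
CREATE TABLE City  (CityKey INT PRIMARY KEY, CityName VARCHAR(50), StateKey INT);
CREATE TABLE State (StateKey INT PRIMARY KEY, StateName VARCHAR(50));
CREATE TABLE Sales (
  ProductKey  INT,
  DateKey     INT,
  CityKey     INT REFERENCES City(CityKey),    -- used when sales are registered by city
  StateKey    INT REFERENCES State(StateKey),  -- used when sales are registered by state
  SalesAmount DECIMAL(12,2),
  -- exactly one of the two exclusive links must be present
  CHECK ((CityKey IS NULL AND StateKey IS NOT NULL) OR
         (CityKey IS NOT NULL AND StateKey IS NULL))
);
-- Roll everything up to the State level: city-grained rows go through City,
-- state-grained rows are taken as they are.
SELECT COALESCE(c.StateKey, s.StateKey) AS StateKey, SUM(s.SalesAmount) AS Total
FROM Sales s LEFT JOIN City c ON s.CityKey = c.CityKey
GROUP BY COALESCE(c.StateKey, s.StateKey);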
Many-to-Many Dimensions
In a many-to-many dimension, several members of the dimension can be associated with the same fact member, as in the bank example of Fig. 4.17, where an account may be jointly owned by several clients. Aggregating the balance over such a dimension counts each account balance once per account holder. Suppose, for instance, that at date D1 accounts A1 and A2 have balances of 100 and 500, respectively, and that A1 is shared by clients C1, C2, and C3, while A2 is shared by C1 and C2. The total balance of the two accounts is 600; however, aggregating the fact over the Date or the Client dimension yields 1,300, since the balance of A1 is counted three times and that of A2 twice.
Fig 4.17 Multidimensional schema for the analysis of bank accounts
The double-counting problem can be analyzed through multidimensional normal forms (MNFs), which state the conditions required for correct measure aggregation in the presence of complex hierarchies. The first multidimensional normal form (1MNF) requires each measure to be uniquely identified by the set of associated leaf levels, and is the basis for correct schema design. To check whether the schema in Fig. 4.17 satisfies the 1MNF, we must determine the functional dependencies between the leaf levels and the measures. Since the balance depends on the account and on the time instant considered, but not on the client, the measure is not determined by all leaf levels; the schema therefore does not satisfy the 1MNF, and the fact must be decomposed.
Fig 4.18 An example of double-counting problem in a many-to-many dimension
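The example can be reproduced with the following illustrative SQL sketch of the decomposed data (table layout and data are assumptions matching the figures):
CREATE TABLE Balance        (DateKey VARCHAR(5), AccountNo VARCHAR(5), Balance DECIMAL(10,2));
CREATE TABLE AccountHolders (AccountNo VARCHAR(5), Client VARCHAR(5));
INSERT INTO Balance VALUES ('D1','A1',100), ('D1','A2',500);
INSERT INTO AccountHolders VALUES ('A1','C1'), ('A1','C2'), ('A1','C3'), ('A2','C1'), ('A2','C2');
-- Correct total balance at date D1: 600.
SELECT SUM(Balance) FROM Balance WHERE DateKey = 'D1';
-- Joining the clients into the fact repeats each balance once per holder,
-- so the same aggregation now yields 1,300.
SELECT SUM(b.Balance)
FROM Balance b JOIN AccountHolders h ON b.AccountNo = h.AccountNo
WHERE b.DateKey = 'D1';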
How the fact is decomposed depends on the multivalued dependencies that hold in it (see Chapter 2). The Balance fact of Fig. 4.17 can be decomposed in two ways. If a joint account may have different clients assigned to it at different times, then time and account together multidetermine the clients, which leads to the decomposition of Fig. 4.19a, where the original fact is split into two facts, AccountHolders and Balance. If, instead, the holders of a joint account do not change over time, the clients are multidetermined by the accounts alone, independently of the date.
In this case, the link relating the Date level and the AccountHolders fact can be eliminated. Alternatively, this situation can be modeled with a nonstrict hierarchy, as shown in Fig. 4.19b.
Fig 4.19 Two possible decompositions of the fact in Fig 4.17 (a) Creating two facts; (b) Using a nonstrict hierarchy
The solutions in Fig. 4.19 eliminate the double-counting problem, but queries about individual clients still require considerable programming effort: Fig. 4.19a calls for a drill-across operation between the two facts, while Fig. 4.19b requires special aggregation procedures for nonstrict hierarchies.
Moreover, in Fig. 4.19a the two facts have different granularities, which complicates drill-across queries, since conversions between the finer and the coarser granularity are needed; for example, clients may have to be grouped to find those holding an account with a given balance, or a balance may have to be distributed among the account holders. Finally, the schemas in Fig. 4.19 can also represent the percentage of an account owned by each customer, either as a measure of the AccountHolders fact in Fig. 4.19a or as a distributing attribute of the many-to-many relationship in Fig. 4.19b.
Fig 4.20 Alternative decomposition of the fact in Fig 4.17
Another solution is to add a level that groups the clients participating in joint accounts, as shown in Fig. 4.20. In the example of Fig. 4.18, two groups would be created: one composed of clients C1, C2, and C3 and another composed of clients C1 and C2. Note, however, that the schema of Fig. 4.20 still does not satisfy the 1MNF, since the measure Balance is determined by Date and Account alone rather than by all leaf levels. The schema must therefore be decomposed as in Fig. 4.19, with the Client level replaced by a nonstrict hierarchy composed of the ClientGroup and Client levels. Finally, many-to-many dimensions can be avoided altogether by designating one client as the primary account owner, which allows the schema of Fig. 4.17 to be used without double counting measures; however, this may not correspond to reality and excludes the other clients from the analysis of joint accounts.
In summary, when a multidimensional schema contains a many-to-many dimension, the solutions illustrated in Fig. 4.19 can be used to avoid it. The choice among the alternatives depends on the functional and multivalued dependencies holding in the fact, the kinds of hierarchies in the schema, and the complexity of the resulting implementation.
Links between Facts
In some situations two related facts must be explicitly linked, even when they share dimensions. Consider, for instance, the facts Order and Delivery: both include the Customer and Date dimensions, while Order also has the Employee dimension and Delivery the Shipper dimension. Several orders can be associated with a single delivery and, conversely, a single order containing several products may be fulfilled through several deliveries. Even though the facts share the Customer and Date dimensions, these dimensions do not suffice to keep track of how orders are delivered, so an explicit link between the two facts is needed.
Fig 4.21 An excerpt of a conceptual schema for analyzing orders and deliveries
Links between facts can have one-to-one, one-to-many, or many-to-many cardinalities. In the scenario above, the link is many-to-many. If the link between Delivery and Order were one-to-many, each order would be associated with a single delivery, while a delivery could comprise several orders.
Querying the Northwind Cube Using the OLAP Operations
To conclude this chapter, we show how the OLAP operations introduced in Chapter 3 can express queries over a conceptual schema, independently of the underlying implementation, using the Northwind cube of Fig. 4.1.
Query 4.1 Total sales amount per customer, year, and product category.
ROLLUP*(Sales, Customer → Customer, OrderDate → Year, Product → Category, SUM(SalesAmount))
The ROLLUP* operation specifies the levels to which the Customer, OrderDate, and Product dimensions are rolled up, while the remaining dimensions are rolled up to the All level. The SUM operation aggregates the SalesAmount measure; all other measures of the cube are dropped from the result.
Query 4.2 Yearly sales amount for each pair of customer and supplier countries.
ROLLUP*(Sales, OrderDate → Year, Customer → Country, Supplier → Country, SUM(SalesAmount))
As in the previous query, a roll-up to the specified levels is performed, while a SUM operation aggregates the measure SalesAmount.
Query 4.3 Monthly sales by customer state compared to those of the previous year.
Sales1 ← ROLLUP*(Sales, OrderDate → Month, Customer → State, SUM(SalesAmount))
Sales2 ← RENAME(Sales1, SalesAmount → PrevYearSalesAmount)
Result ← DRILLACROSS(Sales2, Sales1,
    Sales2.OrderDate.Month = Sales1.OrderDate.Month AND
    Sales2.OrderDate.Year+1 = Sales1.OrderDate.Year AND
    Sales2.Customer.State = Sales1.Customer.State)
Here, we first perform a roll-up to aggregate the SalesAmount measure. We then make a copy of the resulting cube in which the measure is renamed PrevYearSalesAmount, stored in the cube Sales2. The DRILLACROSS operation joins the two cubes, merging cells that correspond to the same month of consecutive years and to the same customer state. Although the join condition on the Customer dimension is stated explicitly, it is not mandatory, since an equijoin is assumed for every dimension not mentioned in the join condition. In the remainder of this section, we omit such equijoins from the conditions of the DRILLACROSS operations.
Query 4.4 Monthly sales growth per product, that is, total sales per product compared to those of the previous month.
Sales1 ← ROLLUP*(Sales, OrderDate → Month, Product → Product, SUM(SalesAmount))
Sales2 ← RENAME(Sales1, SalesAmount → PrevMonthSalesAmount)
Result ← DRILLACROSS(Sales2, Sales1,
    ( Sales1.OrderDate.Month > 1 AND
      Sales2.OrderDate.Month+1 = Sales1.OrderDate.Month AND
      Sales2.OrderDate.Year = Sales1.OrderDate.Year ) OR
    ( Sales1.OrderDate.Month = 1 AND Sales2.OrderDate.Month = 12 AND
      Sales2.OrderDate.Year+1 = Sales1.OrderDate.Year ) )
Result ← ADDMEASURE(Result, SalesGrowth = SalesAmount - PrevMonthSalesAmount)
Here, we first perform a roll-up and make a copy of the resulting cube. The DRILLACROSS operation then joins the two cubes, where the join condition distinguishes two cases: for months starting from February (Month > 1), cells of consecutive months of the same year are merged, while for January the cells are merged with those of December of the previous year. Finally, a new measure SalesGrowth is computed as the difference between the sales amounts of the corresponding months.
Query 4.5 Three best-selling employees.
Sales1 ← ROLLUP*(Sales, Employee → Employee, SUM(SalesAmount))
Result ← MAX(Sales1, SalesAmount, 3)
Here, all the dimensions of the cube except Employee are rolled up to the All level, while the SalesAmount measure is summed. Then, the MAX operation keeps in the result only the three highest values of the measure.
Query 4.6 Best-selling employee per product and year.
Sales1 ← ROLLUP*(Sales, Employee → Employee,
Product → Product, OrderDate → Year, SUM(SalesAmount))
Result ← MAX(Sales1, SalesAmount) BY Product, OrderDate
In this query, we roll up the dimensions of the cube as specified. Then, the MAX operation is applied after grouping by Product and OrderDate.
Query 4.7 Countries that account for top 50% of the sales amount.
Sales1 ← ROLLUP*(Sales, Customer → Country, SUM(SalesAmount))
Result ← TOPPERCENT(Sales1, Customer, 50) ORDER BY SalesAmount DESC
Here, the Customer dimension is rolled up to the Country level, while the other dimensions are rolled up to the All level. The TOPPERCENT operation then selects the countries that together account for the top 50% of the total sales amount.
Query 4.8 Total sales and average monthly sales by employee and year.
Sales1 ← ROLLUP*(Sales, Employee → Employee, OrderDate → Month, SUM(SalesAmount))
Result ← ROLLUP*(Sales1, Employee → Employee, OrderDate → Year, SUM(SalesAmount), AVG(SalesAmount) AS AvgMonthlySales)
We first aggregate to the Employee and Month levels by summing the SalesAmount measure. A second roll-up to the Year level then computes the total sales and the average monthly sales.
Query 4.9 Total sales amount and discount amount per product and month.
Sales1 ← ADDMEASURE(Sales, TotalDisc = Discount * Quantity * UnitPrice)
Result ← ROLLUP*(Sales1, Product → Product, OrderDate → Month, SUM(SalesAmount), SUM(TotalDisc))
Here, we first compute a new measure TotalDisc from three other measures. Then, we roll up the cube to the Product and Month levels.
Query 4.10 Monthly year-to-date sales for each product category.
Sales1 ← ROLLUP*(Sales, Product → Category, OrderDate → Month, SUM(SalesAmount))
Result ← ADDMEASURE(Sales1, YTD = SUM(SalesAmount) OVER
OrderDate BY Year ALL CELLS PRECEDING)
We first aggregate to the Category and Month levels. We then add a new measure that applies the SUM function over a window containing all preceding cells of the same year; recall that the members of the Date dimension are assumed to be ordered chronologically.
Query 4.11 Moving average over the last 3 months of the sales amount by product category.
Sales1 ← ROLLUP*(Sales, Product → Category, OrderDate → Month, SUM(SalesAmount))
Result ← ADDMEASURE(Sales1, MovAvg3M = AVG(SalesAmount) OVER OrderDate 2 CELLS PRECEDING)
In the roll-up, we aggregate the SalesAmount measure by category and month. We then compute the moving average over a window containing the current month and the two preceding ones.
Query 4.12 Personal sales amount made by an employee compared with the total sales amount made by herself and her subordinates during 2017.
Sales1 ← SLICE(Sales, OrderDate.Year = 2017)
Sales2 ← ROLLUP*(Sales1, Employee → Employee, SUM(SalesAmount))
Sales3 ← RENAME(Sales2, SalesAmount → PersonalSales)
Sales4 ← RECROLLUP(Sales2, Employee → Employee, Supervision, SUM(SalesAmount))
Result ← DRILLACROSS(Sales4, Sales3)
We first restrict the cube to the year 2017 and aggregate the sales amount by employee, independently of the supervision hierarchy. After renaming the measure, the recursive roll-up iterates over the Supervision hierarchy, aggregating children into parents until the top level is reached. Finally, the drill-across produces the cube containing both measures.
Query 4.13 Total sales amount, number of products, and sum of the quantities sold for each order.
ROLLUP*(Sales, Order → Order, SUM(SalesAmount),
COUNT(Product) AS ProductCount, SUM(Quantity))
Here, we roll up all the dimensions except Order to the All level, while summing the SalesAmount and Quantity measures and counting the number of products.
Query 4.14 For each month, total number of orders, total sales amount, and average sales amount by order.
Sales1 ← ROLLUP*(Sales, OrderDate → Month, Order → Order, SUM(SalesAmount))
Result ← ROLLUP*(Sales1, OrderDate → Month, SUM(SalesAmount),
AVG(SalesAmount) AS AvgSales, COUNT(Order) AS OrderCount)
Here, we first roll up to the Month and Order levels. Then, we roll up again to remove the Order dimension and obtain the requested measures.
Query 4.15 For each employee, total sales amount, number of cities, and number of states to which she is assigned.
ROLLUP*(Sales, Employee → State, SUM(SalesAmount), COUNT(DISTINCT City)
AS NoCities, COUNT(DISTINCT State) AS NoStates)
Recall that Territories is a nonstrict hierarchy in the Employee dimension.
Here, we roll up to the State level, summing the SalesAmount measure and counting the distinct cities and states. The ROLLUP* operation takes care of the nonstrict hierarchy so that no double counting occurs, as discussed in Section 4.2.6.
Summary
Conceptual modeling is as important for data warehouses as it is for databases: it allows user requirements to be represented without committing to implementation details. To illustrate conceptual multidimensional modeling, we used the MultiDim model, which is based on the entity-relationship model and provides an intuitive graphical notation; such graphical representations considerably ease the understanding of application requirements by users and designers.
We gave a detailed classification of hierarchies, distinguishing them at both the schema and the instance level. We described balanced, unbalanced, and generalized hierarchies, all of which account for a single analysis criterion, and presented recursive and ragged hierarchies as particular cases of unbalanced and generalized hierarchies, respectively. We then introduced alternative hierarchies, which are composed of several hierarchies defining different aggregation paths for the same criterion, and parallel hierarchies, which account for different analysis criteria; parallel hierarchies are dependent or independent depending on whether they share levels. All these hierarchies can in addition be strict or nonstrict, depending on whether they contain many-to-many parent-child relationships. We also discussed advanced modeling aspects, namely facts with multiple granularities and many-to-many dimensions, which are often overlooked in the data warehouse literature. The logical-level implementation of these concepts is studied in Chapter 5. Finally, we showed how the OLAP operations of Chapter 3 can be applied over the conceptual model, using a set of queries over the Northwind data cube.
Bibliographic Notes
Conceptual data warehouse design was first introduced by Golfarelli et al.
Numerous conceptual multidimensional models have been proposed in the literature. Some of them, like the MultiDim model, use graphical notations based on the ER model, others are based on UML, yet others introduce their own notation, and some provide no graphical representation at all. These models also differ considerably in the kinds of hierarchies they support; a detailed comparison of how multidimensional models cope with hierarchies is given in [185]. The inclusion of explicit links between cubes in multidimensional models was proposed in [206]. Multidimensional normal forms were defined in [136, 137]. A survey of multidimensional design is given in [203].
The Object Management Group (OMG) has proposed the Common Warehouse Metamodel (CWM)1 as a standard for representing data warehouse and OLAP systems. This model provides a framework for representing metadata about data sources, data targets, transformations, and analysis.
1 https://www.omg.org/spec/CWM/1.1/PDF
The CWM is organized in layers comprising several submodels. Among these, the resource layer provides models for data representation, including the relational model, while the analysis layer includes a metamodel for OLAP that defines concepts such as dimensions and hierarchies. The CWM is able to represent all the kinds of hierarchies discussed in this chapter.
Review Questions
4.1 Discuss the following concepts: dimension, level, attribute, identifier, fact, role, measure, hierarchy, parent-child relationship, cardinalities, root level, and leaf level.
4.2 Explain the difference, at the schema and at the instance level, between balanced and unbalanced hierarchies.
4.3 Give an example of a recursive hierarchy. Explain how to represent an unbalanced hierarchy with a recursive one.
4.4 Explain the usefulness of generalized hierarchies To which concept of the entity-relationship model are these hierarchies related?
4.5 What is a splitting level? What is a joining level? Does a generalized hierarchy always have a joining level?
4.6 Explain why ragged hierarchies are a particular case of generalized hierarchies.
4.7 Explain in what situations alternative hierarchies are used.
4.8 Describe the difference between parallel dependent and parallel independent hierarchies.
4.9 Illustrate with examples the difference between generalized, alternative, and parallel hierarchies.
4.10 What is the difference between strict and nonstrict hierarchies?
4.11 Illustrate with an example the problem of double counting of measures for nonstrict hierarchies. Describe different solutions to this problem.
4.12 What is a distributing attribute? Explain the importance of choosing an appropriate distributing attribute.
4.13 What does it mean to have a fact with multiple granularities?
4.14 Relate the problem of double counting to the functional and multivalued dependencies that hold in a fact.
4.15 Why must a fact be decomposed in the presence of dependencies? Show an example of a fact that can be decomposed differently according to the dependencies that hold on it.
4.16 Think of real-world examples where two fact tables must be related to each other.
Exercises
Exercise 4.1. Design a MultiDim schema for an application domain that you are familiar with. The schema should contain a fact with associated levels and measures, at least two hierarchies, one of them with an exclusive relationship, and a parent-child relationship with a distributing attribute.
Exercise 4.2. Design a MultiDim schema for the telephone provider application in Ex. 3.1.
Exercise 4.3. Design a MultiDim schema for the train application in Ex. 3.2.
Exercise 4.4. Design a MultiDim schema for the university application given in Ex. 3.3, taking into account the different granularities of the time dimension.
Exercise 4.5. Design a MultiDim schema for a French horse racing application showing statistics about the prizes won by owners, trainers, jockeys, breeders, horses, sires, and damsires, as well as statistics about the payoffs of bets by type, race, racetrack, and horse.
Exercise 4.6. In each of the dimensions of the multidimensional schema of Ex. 4.5, identify the hierarchies (if any) and determine their type.
Exercise 4.7. Design a MultiDim schema for a Formula One application showing statistics about races, in particular the prizes won by drivers, teams, circuits, Grand Prix events, and seasons.
Exercise 4.8. Consider a time dimension composed of two alternative hierarchies: (a) day, month, quarter, and year, and (b) day, month, bimonth, and year. Design the conceptual schema of this dimension and show examples of instances.
Exercise 4.9. Consider the Foodmart cube, whose conceptual schema is shown below. Write, using the OLAP operations, the following queries:
(a) All measures for stores.
(b) All measures for stores in the states of California and Washington, summarized at the state level.
2 The queries of this exercise are based on a document written by Carl Nolan entitled “Introduction to Multidimensional Expressions (MDX).”
Conceptual schema of the Foodmart cube, with dimensions such as Store, Date, Customer, Product, and Promotion, and measures Store Sales, Store Cost, Unit Sales, Sales Average, and Profit
(c) Sales and unit measures by state, city, and store type for the stores of California and Washington.
(d) Sales average, profit, and unit sales in 2017 by semester, quarter, and month.
(e) Unit sales by customer city, compared with the corresponding state and national averages.
(f) The effect of promotions on sales.
(g) Monthly growth of the sales profit and of the unit sales.
(h) Sales measures by product category and subcategory.
(i) The store cities with the highest number of sales.
(j) The number of female customers.
(k) The maximum monthly unit sales for each product subcategory.
(l) Unit sales by brand.
Conceptual models facilitate the design of database applications by improving communication among the stakeholders of a project. To implement such models in a database management system, they must be translated into logical models. This chapter studies how the conceptual multidimensional model can be represented in the relational model. It starts with an overview of the three logical models for data warehouses, namely relational OLAP (ROLAP), multidimensional OLAP (MOLAP), and hybrid OLAP (HOLAP), and then examines the relational representation of data warehouses, focusing on four common implementations: the star, snowflake, starflake, and constellation schemas.
Section 5.3 presents the rules for translating a conceptual multidimensional model, in our case the MultiDim model, into the relational model. Section 5.4 discusses the representation of the time dimension, while Sections 5.5 and 5.6 study how hierarchies, facts with multiple granularities, and many-to-many dimensions are implemented in the relational model. Section 5.7 addresses slowly changing dimensions, which arise when dimension data in a warehouse are updated. Section 5.8 shows how a data cube can be represented and queried in the relational model using SQL. Finally, Section 5.9 illustrates these concepts by implementing the Northwind data warehouse in Analysis Services, using both the multidimensional and the tabular model, referred to as Analysis Services Multidimensional and Analysis Services Tabular, respectively.
Logical Modeling of Data Warehouses
There are several approaches for implementing a multidimensional model, depending on how the data cube is stored These are described next.
Relational OLAP (ROLAP) systems store multidimensional data in relational databases and rely on SQL extensions and special access methods to efficiently implement the OLAP operations. To improve performance, aggregates are often precomputed and stored in relational tables, although such aggregates and the associated indexing structures may consume considerable database space. Relational databases provide standardization and large storage capacity, but expressing OLAP operations over relational tables often results in complex SQL queries.
Multidimensional OLAP (MOLAP) systems store data in specialized multidimensional data structures, such as arrays, combined with hashing and indexing techniques that make OLAP operations efficient. While MOLAP systems allow data to be manipulated simply and naturally, they typically offer less storage capacity than ROLAP systems, and their proprietary nature limits portability.
Hybrid OLAP (HOLAP) systems combine the strengths of both approaches, exploiting the storage capacity of ROLAP and the processing power of MOLAP. In a typical HOLAP architecture, large volumes of detailed data are kept in a relational database, while aggregations are maintained in a separate MOLAP store.
Modern OLAP tools combine these models and usually rely on a relational database management system for the underlying data warehouse. For this reason, in what follows we study the ROLAP implementation in detail.
Relational Data Warehouse Design
The star schema is a relational representation of the multidimensional model, consisting of a central fact table surrounded by a dimension table for each dimension. In the example of Fig. 5.1, the fact table (shown in gray) contains the foreign keys ProductKey, StoreKey, PromotionKey, and DateKey, together with the measures Amount and Quantity. Referential integrity constraints relate the fact table to each of its dimension tables.
In a star schema, dimension tables are usually not normalized, so they may contain redundant data, especially when hierarchies are involved. For instance, in the Product dimension, products belonging to the same category share the same values for the category and department attributes; similarly, the Store dimension repeats the attributes describing the city and the state.
On the other hand, fact tables are usually normalized: their key is the union of the foreign keys, since this union functionally determines all the measures, while there is no functional dependency between the foreign key attributes.
Fig 5.1 An example of a star schema
In Fig. 5.1, the fact table Sales is normalized and its key is composed of ProductKey, StoreKey, PromotionKey, and DateKey.
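A possible DDL sketch of this star schema follows; column lists are abbreviated and the table and column names are assumptions based on Fig. 5.1 (the Date dimension is named DateDim here only to avoid the reserved word):
CREATE TABLE Product (
  ProductKey     INT PRIMARY KEY,     -- surrogate key
  ProductName    VARCHAR(50),
  CategoryName   VARCHAR(50),         -- hierarchy attributes stored denormalized
  DepartmentName VARCHAR(50)
);
CREATE TABLE Store (
  StoreKey  INT PRIMARY KEY,
  StoreName VARCHAR(50),
  CityName  VARCHAR(50),
  StateName VARCHAR(50)
);
CREATE TABLE Promotion (PromotionKey INT PRIMARY KEY, PromotionName VARCHAR(50));
CREATE TABLE DateDim   (DateKey INT PRIMARY KEY, FullDate DATE, Season VARCHAR(20));
CREATE TABLE Sales (
  ProductKey   INT REFERENCES Product(ProductKey),
  StoreKey     INT REFERENCES Store(StoreKey),
  PromotionKey INT REFERENCES Promotion(PromotionKey),
  DateKey      INT REFERENCES DateDim(DateKey),
  Amount       DECIMAL(12,2),
  Quantity     INT,
  PRIMARY KEY (ProductKey, StoreKey, PromotionKey, DateKey)
);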
The snowflake schema avoids such redundancy by normalizing the dimension tables. A dimension is therefore represented by several tables related by referential integrity constraints, and, as in the star schema, referential integrity constraints also relate the fact table to the dimension tables at the finest level of detail.
In a snowflake schema, as illustrated in Fig 5.2, the fact table remains consistent with Fig 5.1, but the dimensions Product and Store are now represented by normalized tables In the Product dimension, category information has been separated into a dedicated Category table, with only the CategoryKey retained in the original table This structure allows the CategoryKey to be repeated for each product within the same category while storing category details only once in the Category table While normalized tables facilitate easier maintenance and optimize storage space, they can impact performance due to the necessity of additional joins for queries that navigate through hierarchies For instance, executing the query “Total sales by category” in this schema requires more complex SQL operations compared to a star schema.
GROUP BY CategoryName while in the snowflake schema in Fig.5.2we need an extra join, as follows:
DateKey Date WeekdayFlag WeekendFlag Season
ProductKey StoreKey PromotionKey DateKey Amount Quantity
CityKey CityName CityPopulation CityArea StateKey
StateKey StateName StatePopulation StateArea StateMajorActivity
Fig 5.2 An example of a snowflake schema
SELECT CategoryName, SUM(Amount)
FROM Product P, Category C, Sales S
WHERE P.ProductKey = S.ProductKey AND P.CategoryKey = C.CategoryKey
GROUP BY CategoryName
A starflake schema combines elements of star and snowflake schemas, featuring a mix of normalized and denormalized dimensions. For example, a starflake schema is obtained by replacing the Product, Category, and Department tables of a snowflake schema with the single Product dimension table of a star schema, while keeping the other dimension tables, such as Store, normalized.
A constellation schema features multiple fact tables that share dimension tables. For instance, as illustrated in Fig 5.3, the schema includes two fact tables, Sales and Purchases, which both use the Date and Product dimensions. Constellation schemas may include both normalized and denormalized dimension tables.
We will discuss star and snowflake schemas further when we study the logical representation of hierarchies later in this chapter.
Fig 5.3 An example of a constellation schema
5.3 Relational Representation of Data Warehouses
We outline a series of rules for converting a multidimensional conceptual schema into a relational schema, building upon the guidelines established in Section 2.4.1, which detail the translation of an ER schema into the relational model.
Rule 1: A level L, provided it is not related to a fact with a one-to-one relationship, is mapped to a table TL that contains all attributes of the level. A surrogate key may be added to the table; otherwise, the identifier of the level is the key of the table. Note that additional attributes will be added to this table when mapping relationships using Rule 3.
Rule 2: A fact F is mapped to a table TF that contains all measures of the fact. A surrogate key may be added to the table. As in the previous rule, additional attributes will be added to this table when mapping relationships using Rule 3.
Rule 3: A relationship between a fact F and a dimension level L, or between a parent level P and a child level C in a hierarchy, can be mapped in three different ways depending on its cardinalities:
Rule 3a: If the relationship is one-to-one, the table corresponding to the fact (TF) or to the child level (TC) is extended with all the attributes of the dimension level or the parent level, respectively.
Rule 3b: If the relationship is one-to-many, the table corresponding to the fact (TF) or to the child level (TC) is extended with the surrogate key of the table corresponding to the dimension level (TL) or the parent level (TP), respectively; that is, the fact or child table contains a foreign key referencing the related table.
Rule 3c: If the relationship is many-to-many, a bridge table TB is created containing the surrogate keys of the tables corresponding to the fact (TF) and the dimension level (TL), or to the parent (TP) and child (TC) levels, respectively. If the relationship has a distributing attribute, an additional attribute is added to the bridge table to store this information.
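As an illustration of Rule 3c, the bridge table for a many-to-many relationship with a distributing attribute could be declared as follows; this is only a sketch, and the table and column names (Sales, Section, SalesSection, Percentage) are illustrative rather than taken from a particular figure.

CREATE TABLE SalesSection (
  SalesKey   INT NOT NULL,      -- surrogate key of the fact table (Rule 2)
  SectionKey INT NOT NULL,      -- surrogate key of the level table (Rule 1)
  Percentage DECIMAL(5,2),      -- distributing attribute, if any
  PRIMARY KEY (SalesKey, SectionKey),
  FOREIGN KEY (SalesKey) REFERENCES Sales (SalesKey),
  FOREIGN KEY (SectionKey) REFERENCES Section (SectionKey)
);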
Surrogate keys are usually generated for each dimension level in a data warehouse to ensure independence from the keys of the source systems, which may evolve over time. This also improves efficiency, since surrogate keys are integers, whereas the keys of the source systems may be less efficient, for example strings. Nevertheless, the keys of the source systems are kept in the dimensions to be able to match the data in the warehouse with the data in the sources. Alternatively, the keys of the source systems can be reused in the data warehouse.
A fact table obtained by the mapping rules above contains the surrogate key of each level related to the fact with a one-to-many relationship, one for each role the level plays. The key of the table is composed of the surrogate keys of all the participating levels. If an additional surrogate key is included in the fact table, the combination of the surrogate keys of all the levels becomes an alternate key.
As we will see in Sect. 5.5, more specialized rules are needed for mapping the various kinds of hierarchies that we studied in Chap. 4.
Applying the above rules to the Northwind conceptual data cube yields a Sales fact table with eight foreign keys, one for each one-to-many relationship with the fact. As discussed in Chap. 4, role-playing dimensions such as the Date dimension are represented in the relational model by several foreign keys, in this case OrderDateKey, DueDateKey, and ShippedDateKey. In addition, the Order dimension is related to the fact through a one-to-one relationship, so its attributes are included directly in the fact table.
In the relational representation of the Northwind data warehouse, the Order dimension is a fact (degenerate) dimension: its attributes OrderNo and OrderLineNo are stored in the fact table. The Sales fact table contains five measures: UnitPrice, Quantity, Discount, SalesAmount, and Freight. In addition, the many-to-many relationship between Employee and City is represented by the bridge table Territories, which contains two foreign keys.
The Northwind data warehouse also illustrates the two approaches for defining the keys of dimension levels, namely, using surrogate keys and reusing the keys of the source database. For instance, the Customer dimension has a surrogate key, CustomerKey, in addition to its database key, CustomerID, whereas the Supplier dimension reuses its database key as SupplierKey. The choice between the two approaches is made during the ETL process, which will be discussed in Chap. 9.
Time Dimension
Time information is present in a data warehouse both as foreign keys in the fact table, which indicate when a fact occurred, and as a time dimension, which defines the aggregation levels used to summarize facts over time.
In OLTP database applications, temporal information is typically derived from attributes of type DATE using functions of the DBMS; details such as whether a day is a weekend or a holiday are not stored explicitly but computed on the fly. In contrast, a data warehouse stores this information as attributes of the time dimension, since OLAP queries must summarize data quickly. For instance, the query "Total sales during weekends" can be written over the star schema of Fig 5.1 as follows:
SELECT SUM(Amount)
FROM Sales S, Date D
WHERE D.DateKey = S.DateKey AND D.WeekendFlag = 1
The granularity of the time dimension is determined by the application requirements. For instance, if only monthly data is needed, the time dimension will have 60 tuples for a 5-year span. However, if data must be kept at the granularity of a second, the time dimension would contain 155,520,000 tuples for the same period. To manage such a large dimension table, it is advisable to split the time dimension into two tables: one at the granularity of day, covering all dates in the time span of the data warehouse, and another at the granularity of second, covering every second within a single day.
In this way, the first table contains 1,825 tuples for 5 years, and the second one contains 86,400 tuples for the 3,600 seconds of each of the 24 hours of a day. The fact table then includes two foreign keys, one to each of the two time dimension tables, which can be populated automatically. Note also that a time dimension can have several hierarchies, as in the calendar and fiscal year example. Further, when a single hierarchy is used, care must be taken to satisfy the summarizability conditions: while a day aggregates correctly into both the month and the year levels, a week may span two different months and thus cannot be aggregated into the month level of the hierarchy.
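A minimal sketch of this split is given below; the table and column names are illustrative and only a few attributes are shown.

CREATE TABLE Date (
  DateKey     INT PRIMARY KEY,
  Date        DATE NOT NULL,
  WeekdayFlag BIT,
  WeekendFlag BIT,
  Season      VARCHAR(10)
);

CREATE TABLE TimeOfDay (
  TimeKey INT PRIMARY KEY,      -- one tuple per second of a day: 86,400 rows
  Hour    SMALLINT,
  Minute  SMALLINT,
  Second  SMALLINT
);

CREATE TABLE Sales (
  DateKey INT NOT NULL REFERENCES Date (DateKey),
  TimeKey INT NOT NULL REFERENCES TimeOfDay (TimeKey),
  Amount  DECIMAL(10,2)         -- other foreign keys and measures omitted
);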
Logical Representation of Hierarchies
Balanced Hierarchies
In a conceptual multidimensional schema, the levels of a dimension hierarchy are represented independently and are linked by parent-child relationships. Applying the mapping rules of Sect. 5.3 to balanced hierarchies yields snowflake schemas, where each level is mapped to a table containing the key, the attributes of the level, and the foreign keys of the relationships. For instance, applying Rules 1 and 3b to the Categories hierarchy gives a snowflake structure with the tables Product and Category, as shown in Fig 5.5a.
If a star schema is required instead, the hierarchies must be represented as flat tables that contain the keys and attributes of all the levels of a hierarchy in a single table. This can be achieved by denormalizing the tables that represent the several levels of the hierarchy.
As an example, the Date dimension of Fig 4.1 can be represented in a single table containing all its attributes, as shown in Fig 5.5b.
Fig 5.5 Relations for a balanced hierarchy: (a) snowflake structure; (b) flat table
Snowflake schemas are more effective than star schemas in representing hierarchical structures, as they allow for clear differentiation between levels and enable the reuse of levels across various hierarchies.
In a snowflake schema, each attribute can be clearly assigned to the level it describes, as illustrated by the Product and Category tables in Fig 5.5a. However, this design can reduce query performance because of the joins needed to combine the data of the tables composing a hierarchy.
Star schemas facilitate query formulation, since fewer joins are needed owing to denormalization, and they enable better system performance for typical star queries. However, they have notable drawbacks for modeling hierarchies. For instance, in the Store dimension it is not clear which attributes belong to which hierarchy level, and associating attributes with their corresponding levels is difficult, which makes the hierarchy structure hard to understand.
Unbalanced Hierarchies
Unbalanced hierarchies can violate the summarizability conditions, since non-leaf members without children may be excluded from the analysis. For example, consider two fact tables linked to the same dimension, one at the ATM level and another at the Agency level. Measures can be aggregated from agencies with ATMs and from branches with agencies. However, disaggregating from Agency to ATM requires knowing which agencies actually have ATMs; otherwise, OLAP tools may not handle this drill-down correctly. In addition, in a star schema representation, defining ATM as part of the primary key is problematic because of the NULL values for agencies without ATMs.
To address these problems of unbalanced hierarchies, placeholders (PH1, PH2, ..., PHn) can be introduced at the missing levels so that the hierarchy becomes balanced, after which the logical mapping for balanced hierarchies can be applied. When a child member has two or more consecutive missing parent levels, the measure values must be repeated so that aggregation remains correct, as illustrated by branch 2 in the figure. A specialized interface is also needed to hide the placeholders from users. Finally, there are cases where factual data in the same fact table exist at different granularities; these are discussed in Sect. 5.6.
Recall from Sect. 4.2.2 that parent-child hierarchies are a special case of unbalanced hierarchies. Mapping these hierarchies to the relational model yields a table that contains all the attributes of the level together with a foreign key linking each child member to its parent. For instance, the Employee table, with its SupervisorKey attribute, gives the relational representation of a parent-child hierarchy. Operating over such a structure is more complex, since recursive queries are needed to traverse the hierarchy. SQL and MDX support recursive queries, but DAX does not, which requires flattening the parent-child structure into a regular hierarchy with a separate column for each level.
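For example, a recursive query in SQL can traverse such a hierarchy. The sketch below computes, for each employee, the name of the top supervisor and the depth in the hierarchy; it assumes an Employee table with columns EmployeeKey, EmployeeName, and SupervisorKey, where SupervisorKey is NULL for the top of the hierarchy.

WITH Supervision (EmployeeKey, EmployeeName, TopSupervisor, HierLevel) AS (
  -- anchor: employees without a supervisor are the roots of the hierarchy
  SELECT EmployeeKey, EmployeeName, EmployeeName, 1
  FROM Employee
  WHERE SupervisorKey IS NULL
  UNION ALL
  -- recursive step: attach each employee to the row of its supervisor
  SELECT E.EmployeeKey, E.EmployeeName, S.TopSupervisor, S.HierLevel + 1
  FROM Employee E JOIN Supervision S ON E.SupervisorKey = S.EmployeeKey )
SELECT EmployeeKey, EmployeeName, TopSupervisor, HierLevel
FROM Supervision;

The resulting HierLevel values can also be used to flatten the hierarchy into one column per level, which is the representation required by DAX.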
Generalized Hierarchies
Generalized hierarchies account for the situation where dimension members are of different types, each with its own aggregation path. For example, customers can be companies or persons: companies are aggregated along the path Customer → Sector → Branch, whereas persons are aggregated along the path Customer → Profession → Branch.
Generalized hierarchies can be represented at the logical level in two main ways: creating separate tables for each level, which leads to snowflake schemas, or consolidating all levels into a single table in which null values are used for the attributes that do not apply. A hybrid approach is also possible, where one table contains the common levels and another the specific ones. Yet another option is to use separate fact and dimension tables for each hierarchy path. Whichever approach is chosen, metadata describing which tables compose the various aggregation paths must be kept, and constraints must be enforced to prevent incorrect queries, such as grouping by both Sector and Profession.
Fig 5.7 Relations for the generalized hierarchy in Fig 4.4
Applying the mapping rules of Sect. 5.3 to the generalized hierarchy of Fig 4.4 yields the relations shown in Fig 5.7. Although this schema captures the hierarchical structure, it does not allow one to traverse only the common levels of the hierarchy, for example, to go from Customer directly to Branch.
To ensure this possibility, we must add the following mapping rule.
Rule 4: A table corresponding to a splitting level in a generalized hierarchy has an additional attribute that is a foreign key to the next joining level, provided it exists. The table may also include a discriminating attribute that indicates the specific aggregation path of each member.
Fig 5.8 Improved relational representation of the generalized hierarchy in Fig 4.4
Figure 5.8 shows the result of applying this rule to the hierarchy of Fig 4.4. The Customer table contains two kinds of foreign keys. The first kind, SectorKey and ProfessionKey, points to the next specialized level and is obtained by applying Rules 1 and 3b of Sect. 5.3. The second kind, BranchKey, points to the next joining level and is obtained by applying Rule 4. In addition, the attribute CustomerType, whose value is either Person or Company, indicates the aggregation path of each member and thereby facilitates aggregation. Check constraints must be specified to ensure that only one of the specialized foreign keys has a value, depending on the value of the CustomerType attribute, as follows:
ALTER TABLE Customer ADD CONSTRAINT CustomerTypeCK
  CHECK ( CustomerType IN ('Person', 'Company') )
ALTER TABLE Customer ADD CONSTRAINT CustomerPersonFK
  CHECK ( CustomerType <> 'Person' OR
    ( ProfessionKey IS NOT NULL AND SectorKey IS NULL ) )
ALTER TABLE Customer ADD CONSTRAINT CustomerCompanyFK
  CHECK ( CustomerType <> 'Company' OR
    ( ProfessionKey IS NULL AND SectorKey IS NOT NULL ) )
The schema in Fig 5.8 allows one to choose different analysis paths, for example, using a specific level such as Profession or Sector, or using only the levels common to all members, such as analyzing all customers through the Customer and Branch levels. As in a snowflake schema, this structure requires join operations between several tables, but it significantly widens the analysis possibilities.
The mapping above can also be applied to ragged hierarchies, which are a special case of generalized hierarchies, as shown in Fig 5.4, where the City level has two foreign keys, to the State and the Country levels. Since ragged hierarchies have a unique path where some levels may be skipped, another option is to embed the attributes of an optional level into the preceding level, as was done with the State level, which contains two optional attributes corresponding to the Region level. Yet another approach is to modify the hierarchy at the instance level by adding placeholders for the missing intermediate levels, as was done for unbalanced hierarchies in Sect. 5.5.2, thus transforming a ragged hierarchy into a balanced one.
Alternative Hierarchies
Alternative hierarchies are mapped to the relational model in the usual way, as shown in Fig 5.9, which corresponds to the conceptual schema of Fig 4.7. Note that although generalized and alternative hierarchies can be clearly distinguished at the conceptual level (compare Figs 4.4a and 4.7), this distinction is blurred at the logical level (compare Figs 5.7 and 5.9).
Fig 5.9 Relations for the alternative hierarchy in Fig 4.7
Parallel Hierarchies
Parallel hierarchies are a combination of several hierarchies, so their logical mapping consists in combining the mappings of the component hierarchy types. For instance, applying this mapping to the schema of Fig 4.9 yields the relations shown in Fig 5.10.
Fig 5.10 Relations for the parallel dependent hierarchies in Fig 4.10
In parallel dependent hierarchies, a level shared by several hierarchies is mapped to a single table, such as the State table in this example. Since such shared levels play different roles in each hierarchy, views can be defined to facilitate writing queries and to improve data visualization.
For example, in Fig 5.10, the table State contains all the states where an employee lives, works, or both. Therefore, aggregating along the path Employee → City → State does not distinguish between the two roles. To consider only the states where employees live, we can define a view StateLives containing only those states where at least one employee resides.
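A sketch of such a view is given below; it assumes that the Employee table has a foreign key CityLivesKey to the City table and that City has a foreign key StateKey to the State table (the actual column names in Fig 5.10 may differ). A similar view, say StateWorks, could be defined for the states where employees work.

CREATE VIEW StateLives AS
SELECT DISTINCT S.*
FROM State S, City C, Employee E
WHERE E.CityLivesKey = C.CityKey AND C.StateKey = S.StateKey;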
Finally, note that although alternative and parallel dependent hierarchies can be distinguished at the conceptual level (see Figs 4.7 and 4.10), their logical representations in Figs 5.9 and 5.10 look similar, despite the several differences discussed in Sect. 4.2.5.
Nonstrict Hierarchies
Applying the mapping rules to a nonstrict hierarchy creates the tables for the levels, the relationships between them, and a bridge table that represents the many-to-many relationship. For instance, in the hierarchy of Fig 4.13, the bridge table EmplSection represents the many-to-many relationship between employees and sections. If the parent-child relationship has a distributing attribute, as in Fig 4.13, this attribute is stored in the bridge table and holds the values needed to distribute the measures; aggregation then requires a special procedure that uses the distributing attribute.
Fig 5.11 Relations for the nonstrict hierarchy in Fig 4.13
Alternatively, a nonstrict hierarchy can be transformed into a strict one by adding an extra dimension to the fact, as shown in Fig 4.14, after which the mapping for strict hierarchies can be applied. Several factors, discussed next, influence the choice between the two solutions.
Bridge tables require less space than the additional-dimension solution, since the fact table is not enlarged when child members are related to several parent members, and no extra foreign keys are added to the fact table, which would further increase the space required. On the other hand, bridge tables store the information about the parent-child relationship, and the distributing attribute if any, separately from the fact table.
Bridge tables also require joins, computations, and programming effort to aggregate measures correctly, so they are best suited to applications with only a few nonstrict hierarchies, and they work well when the distribution of the measures does not change over time. In contrast, the additional-dimension solution is better at representing changes of the measure distribution over time and allows measures to be aggregated along the hierarchy more easily.
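For instance, a measure can be aggregated to the parent level by weighting it with the distributing attribute stored in the bridge table. The sketch below assumes a fact table Payroll(EmployeeKey, Salary), a bridge table EmplSection(EmployeeKey, SectionKey, Percentage) where Percentage is a fraction between 0 and 1, and a Section level; these names are illustrative.

SELECT B.SectionKey, SUM(F.Salary * B.Percentage) AS TotalSalary
FROM Payroll F, EmplSection B
WHERE F.EmployeeKey = B.EmployeeKey
GROUP BY B.SectionKey;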
An alternative approach is to convert the many-to-many relationship into a one-to-many relationship by designating one parent as the "primary" one. The hierarchy thus becomes strict, and the corresponding mapping for simple hierarchies can be applied, as discussed in Sect. 4.3.2.
Advanced Modeling Aspects
Facts with Multiple Granularities
There are two main approaches for the logical representation of facts with multiple granularities. The first one uses multiple foreign keys, one for each alternative granularity, in a way similar to the approach used for generalized hierarchies. The second one removes the granularity variation at the instance level by means of placeholders, in a way similar to the approach used for unbalanced hierarchies.
Consider the example of Fig 4.16, where measures are registered at multiple granularities. Figure 5.12 shows the relational schema resulting from the first solution: the Sales fact table is related to both the City and the State levels through referential integrity constraints. The attributes CityKey and StateKey are optional, so constraints must be specified to ensure that only one of the two foreign keys has a value in each tuple.
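Such a constraint can be expressed as a table check constraint over the Sales table of Fig 5.12; the constraint name below is illustrative.

ALTER TABLE Sales ADD CONSTRAINT SalesGranularityCK
  CHECK ( ( CityKey IS NOT NULL AND StateKey IS NULL ) OR
          ( CityKey IS NULL AND StateKey IS NOT NULL ) );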
Fig 5.12 Relations for the fact with multiple granularities in Fig 4.16
Figure 5.13 shows an example of instances for the second solution above, where placeholders are used for facts that refer to nonleaf levels. There are two possible cases. In the first case, a fact member refers to a nonleaf member that has children; here, placeholder PH1 represents all cities other than the existing children. In the second case, a fact member refers to a nonleaf member without children; in this case, placeholder PH2 represents all (unknown) cities of the state.
Fig 5.13 Using placeholders for the fact with multiple granularities in Fig 4.16
In both solutions, care is needed to ensure the correct summarization of measures. In the first solution, aggregating at the state level requires a union of two subqueries, one for each alternative path. In the second solution, aggregating at the city level returns the placeholders in the result.
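For the first solution, the total quantity by state could be computed with a union of the two paths; the sketch below uses the Sales table of Fig 5.12 and assumes that the City table has a foreign key StateKey to the State table.

SELECT T.StateKey, SUM(T.Quantity) AS TotalQuantity
FROM ( -- facts registered at the city level, rolled up to their state
       SELECT C.StateKey, S.Quantity
       FROM Sales S, City C
       WHERE S.CityKey = C.CityKey
       UNION ALL
       -- facts registered directly at the state level
       SELECT S.StateKey, S.Quantity
       FROM Sales S
       WHERE S.StateKey IS NOT NULL ) T
GROUP BY T.StateKey;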
Many-to-Many Dimensions
Mapping a many-to-many dimension to the relational model yields the tables representing the fact and the dimension levels, together with a bridge table representing the many-to-many relationship between the fact table and the dimension. In the example, the bridge table BalanceClient relates the fact table Balance with the dimension table Client. A surrogate key was added to the Balance fact table so that the bridge table can reference the facts.
Fig 5.14 Relations for the many-to-many dimension in Fig 4.17
In Sect. 4.3.2, we discussed several solutions for decomposing a many-to-many dimension based on the functional dependencies that hold in the fact table. After such a decomposition, the traditional mapping to the relational model can be applied to the resulting schema.
Links between Facts
The logical representation of links between facts depends on their cardinalities. For one-to-one or one-to-many cardinalities, the surrogate key of the fact with cardinality one is added as a foreign key to the other fact.
In the case of many-to-many cardinalities, a bridge table with foreign keys to the two facts is needed.
For example, for the many-to-many link between the Order and Delivery facts in Fig 4.21, the relational schema of Fig 5.15 uses surrogate keys for both facts and a bridge table OrderDelivery that connects them. Figure 5.16 shows instances of this relationship. If, instead, the link were one-to-many, with each order associated with a single delivery, the representation would differ, as explained next.
Fig 5.15 Relations for the schema with a link between facts in Fig 4.21
In the latter case, a bridge table is not needed; instead, the DeliveryKey is added directly as a foreign key to the Order fact table.
Linking fact tables makes it possible to combine data from different cubes through a join operation, for example, to combine order and delivery data in a single sales analysis. In such a drill-across, only one instance of the CustomerKey is kept, assuming it is the same for the order and its deliveries, while both DateKeys are kept and renamed OrderDateKey and DeliveryDateKey. Note, however, that because of the many-to-many relationship between the facts, such a join may lead to double counting of measures.
Fig 5.17 Drill-across of the facts through their link in Fig 5.16
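A sketch of such a drill-across join over the schema of Fig 5.15 is given below; it assumes that the two fact tables are named Orders and Delivery, with surrogate keys OrderKey and DeliveryKey referenced by the bridge table OrderDelivery.

SELECT O.OrderKey, O.CustomerKey, O.DateKey AS OrderDateKey, O.Amount,
       D.DeliveryKey, D.ShipperKey, D.DateKey AS DeliveryDateKey, D.Freight
FROM Orders O, OrderDelivery B, Delivery D
WHERE O.OrderKey = B.OrderKey AND B.DeliveryKey = D.DeliveryKey;

Since an order may join with several deliveries, summing the Amount measure over this result would count it once per delivery, which is the double-counting issue mentioned above.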
Slowly Changing Dimensions
In many real-world situations, dimensions are not static: they change both at the structural level and at the instance level. Structural changes occur, for example, when an attribute is deleted from the data sources and must therefore be removed from the corresponding dimension table. Instance-level changes arise in two situations: when a correction must be applied to fix erroneous data in a dimension, and when the contextual conditions of the analysis scenario change, requiring the contents of the dimension table to be updated. This section addresses these kinds of changes.
To illustrate, we use a simplified version of the Northwind data warehouse, consisting of a Sales fact table related to the dimensions Date, Employee, Customer, and Product, and a single measure SalesAmount. We assume a star schema with a denormalized Product table that embeds the category data. Examples of the Sales fact table and the Product dimension table are given next.
DateKey EmployeeKey CustomerKey ProductKey SalesAmount
t1 e1 c1 p1 100
t2 e2 c2 p1 100
t3 e1 c3 p3 100
t4 e2 c4 p4 100

ProductKey ProductName UnitPrice CategoryName Description
p1 prod1 10.00 cat1 desc1
p2 prod2 12.00 cat1 desc1
p3 prod3 13.50 cat2 desc2
p4 prod4 15.00 cat2 desc2
New tuples are appended to the Sales fact table as new sales occur, but updates to the dimension tables may also be necessary. For instance, when the company starts selling a new product, a tuple must be inserted into the Product table, and erroneous product data must be corrected to maintain data quality.
Dimensions whose data change in this way, typically at a much slower pace than the facts, are called slowly changing dimensions (SCDs). For example, a product may be reassigned to a new category as a result of a new commercial or administrative policy, and the corresponding tuples must then be updated.
In the scenario above, consider a query asking for the total sales per em- ployee and product category, expressed as follows:
SELECT E.EmployeeKey, P.CategoryName, SUM(SalesAmount)
FROM Sales S, Employee E, Product P
WHERE S.EmployeeKey = E.EmployeeKey AND S.ProductKey = P.ProductKey
GROUP BY E.EmployeeKey, P.CategoryName
This query would return the following table:
EmployeeKey CategoryName SalesAmount
e1 cat1 100
e2 cat1 100
e1 cat2 100
e2 cat2 100
Suppose now that prod1 is reclassified from category cat1 to cat2 after the last sale recorded in the fact table. Simply overwriting the category with cat2 loses the historical category information: all previous sales took place while prod1 belonged to cat1, but after the update the query above would attribute them to cat2, as shown next, which does not reflect the situation at the time the sales occurred.
EmployeeKey CategoryName SalesAmount
e1 cat2 200
e2 cat2 200
This result is incorrect, since the sales affected by the category change were made while prod1 was still in cat1. If, on the contrary, the new category is a correction of an error (i.e., the actual category of prod1 has always been cat2), then the result is valid. In the first case, we must ensure that the results obtained while prod1 was in cat1 are preserved and that new aggregations are computed with the new category cat2.
There are three main ways of handling slowly changing dimensions. The simplest one, called type 1, overwrites the old attribute value with the new one, and thus the history is lost. Type 1 is appropriate when the change is a correction of an error in the dimension data.
In the second solution, called type 2, the tuples in the dimension table are versioned, and a new tuple is inserted each time a change takes place.
To keep track of the reclassification, a new tuple for product prod1 with the new category cat2 is inserted into the Product table. In this way, all sales prior to the time of the change are aggregated to cat1, while the sales after that time are aggregated to cat2. For this solution, a surrogate key must be used in addition to the business key, so that the several versions of a dimension member have different surrogate keys but share the same business key. In our case, the business key is stored in the ProductID column and the surrogate key in the ProductKey column. The Product table is also extended with two attributes, From and To, that indicate the validity interval of each tuple, as shown next.
ProductKey ProductID ProductName UnitPrice CategoryName Description From To
k1 p1 prod1 10.00 cat1 desc1 2010-01-01 2011-12-31
k11 p1 prod1 10.00 cat2 desc2 2012-01-01 9999-12-31
k2 p2 prod2 12.00 cat1 desc1 2010-01-01 9999-12-31
The table above shows the two versions of product prod1, with surrogate keys k1 and k11; the date 9999-12-31 in the To attribute indicates that the tuple is still valid, a common convention in temporal databases. With this scheme, a product appears in the fact table with as many surrogate keys as there are versions of it, and the business key is needed to trace all the tuples that correspond to the same product, for example, to count how many distinct products the company sold in a given period. A drawback is that the dimension grows with every attribute change, which may degrade the performance of joins with the fact table; more sophisticated techniques have been proposed to alleviate these issues and will be discussed later.
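As a minimal sketch, a type 2 change such as the reclassification of prod1 on 2012-01-01 can be applied with two statements over the Product table above; the surrogate key is assumed to be generated automatically, and the From and To columns are quoted because they clash with SQL keywords.

UPDATE Product
SET "To" = '2011-12-31'                     -- close the current version
WHERE ProductID = 'p1' AND "To" = '9999-12-31';

INSERT INTO Product (ProductID, ProductName, UnitPrice, CategoryName,
                     Description, "From", "To")
VALUES ('p1', 'prod1', 10.00, 'cat2', 'desc2', '2012-01-01', '9999-12-31');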
In the type 2 approach, an additional attribute, often called RowStatus, is sometimes added to indicate which is the current version of a member, as shown next for product prod1.
ProductKey ProductID ProductName UnitPrice CategoryName Description From To RowStatus
k1 p1 prod1 10.00 cat1 desc1 2010-01-01 2011-12-31 Expired
k11 p1 prod1 10.00 cat2 desc2 2012-01-01 9999-12-31 Current
...
Let us consider a snowflake representation for the Product dimension, where the categories are represented in a table Category, as given next.
ProductKey ProductName UnitPrice CategoryKey
p1 prod1 10.00 c1
p2 prod2 12.00 c1
p3 prod3 13.50 c2
p4 prod4 15.00 c2

CategoryKey CategoryName Description
c1 cat1 desc1
c2 cat2 desc2
c3 cat3 desc3
c4 cat4 desc4
In a type 2 snowflake representation, both the Product and the Category tables are extended with a surrogate key and the two temporal attributes From and To. For instance, if product prod1 changes its category to cat2, the Product table keeps the history of the product's categories through versioned tuples, such as a tuple with key k1 for prod1 (priced at 10.00) valid from 2010-01-01 to 2011-12-31, and a tuple with key k2 for prod2 (priced at 12.00) valid from 2010-01-01 onward. Care must also be taken to keep the Category table consistent: changes made at a higher level of the hierarchy, such as a modification of a category description, must be propagated down to all the lower levels that reference it, as illustrated next.
CategoryKey CategoryID CategoryName Description From To
l1 c1 cat1 desc1 2010-01-01 2011-12-31
l11 c1 cat1 desc11 2012-01-01 9999-12-31
l2 c2 cat2 desc2 2010-01-01 9999-12-31
l3 c3 cat3 desc3 2010-01-01 9999-12-31
l4 c4 cat4 desc4 2010-01-01 9999-12-31
This change must be propagated to the Product table, so that tuples that were valid before the change keep referencing the old version of the category (key l1), while the new tuples reference the new version (key l11), as shown next.
ProductKey ProductID ProductName UnitPrice CategoryKey From To
k1 p1 prod1 10.00 l1 2010-01-01 2011-12-31
k11 p1 prod1 10.00 l11 2012-01-01 9999-12-31
k2 p2 prod2 12.00 l1 2010-01-01 2011-12-31
k21 p2 prod2 12.00 l11 2012-01-01 9999-12-31
k3 p3 prod3 13.50 l2 2010-01-01 9999-12-31
k4 p4 prod4 15.00 l2 2011-01-01 9999-12-31
The third solution, called type 3, adds an extra column for each attribute subject to change, in which the new value of the attribute is stored. For instance, when product prod1 changes its category from cat1 to cat2, its description also changes from desc1 to desc2; the table below tracks these changes for the attributes CategoryName and Description.
ProductKey ProductName UnitPrice CategoryName NewCategoryName Description NewDescription
p1 prod1 10.00 cat1 cat2 desc1 desc2
p2 prod2 12.00 cat1 desc1
p3 prod3 13.50 cat2 desc2
p4 prod4 15.00 cat2 desc2
Note that only the last two versions of an attribute can be represented in this solution, and that the validity interval of the values is not stored.
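A minimal sketch of a type 3 change over the denormalized Product table is given below; the column names NewCategoryName and NewDescription and their data types are illustrative.

ALTER TABLE Product ADD NewCategoryName VARCHAR(40);
ALTER TABLE Product ADD NewDescription VARCHAR(100);

UPDATE Product
SET NewCategoryName = 'cat2', NewDescription = 'desc2'   -- keep old values in place
WHERE ProductKey = 'p1';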
Finally, note that the three solutions above can be applied individually or in combination within the same dimension: for instance, correction (type 1), tuple versioning (type 2), and additional columns (type 3) can each be used for different attributes of a single dimension table.
Performing OLAP Queries with SQL
Defining the Northwind Data Warehouse in Analysis Services
Multidimensional Model
A data warehouse in Analysis Services gathers data from one or more data stores. A data source provides the connection information, such as the server location, the login credentials, the method used to retrieve the data, and the security permissions. Analysis Services supports several types of data sources; for relational databases, SQL is the default query language. In our example, a single data source connects to the Northwind data warehouse.
A data source view (DSV) defines the relational schema used by an Analysis Services database, derived from one or more data sources. Loading data into the warehouse often requires transformations, such as selecting specific columns, adding derived columns, filtering rows, or merging columns. These tasks can be accomplished in the DSV by replacing source tables with SQL-based named queries or by defining named calculations for derived columns. In addition, if the source systems do not specify primary keys and foreign key relationships, these can be defined in the DSV.
Analysis Services also allows user-friendly names to be assigned to tables and columns, which improves visibility and navigation in large data warehouses, and customizable views called diagrams can be created within a DSV to display only selected tables.
The DSV of the Northwind data warehouse, shown in Fig 5.23, contains the Sales fact table and its dimension tables. Note that the ragged geography hierarchy has been transformed into a regular one. The figure also shows several named calculations, indicated by a special icon next to the attribute names, which are used when defining and browsing the dimensions:
• In the Employee dimension table, the named calculation EmployeeName combines the first and last name with the expression
FirstName + ' ' + LastName
• In the Date dimension table, the named calculations FullMonth, FullQuarter, and FullSemester are defined, respectively, by the expressions
MonthName + ' ' + CONVERT(CHAR(4), Year)
'Q' + CONVERT(CHAR(1), Quarter) + ' ' + CONVERT(CHAR(4), Year)
'S' + CONVERT(CHAR(1), Semester) + ' ' + CONVERT(CHAR(4), Year)
These calculations combine the month, quarter, or semester with the year.
• In the Sales fact table, the named calculation OrderLineDesc combines the order number and the order line using the expression
CONVERT(CHAR(5),OrderNo) + ' - ' + CONVERT(CHAR(1),OrderLineNo)
Fig 5.23 The data source view for the Northwind data warehouse
Analysis Services supports several types of dimensions as follows:
• A regular dimension has a direct one-to-many link between a fact table and a dimension table. An example is the dimension Product.
• A reference dimension is connected to the fact table indirectly, through another dimension. An example is the Geography dimension, which is related to the Sales fact table through the Customer and Supplier dimensions. Reference dimensions can be chained; for instance, another reference dimension could be defined on top of the Geography dimension.
• In a role-playing dimension, a single fact table is related to a dimension table several times. Examples are the dimensions OrderDate, DueDate, and ShippedDate, which all refer to the same Date dimension. A role-playing dimension is stored once and used several times.
• A fact dimension, also referred to as a degenerate dimension, is similar to a regular dimension, but the dimension data are stored in the fact table. An example is the dimension Order.
• In a many-to-many dimension, a fact can be related to multiple dimension members and, conversely, a member can be related to multiple facts. In the Northwind data warehouse, there is a many-to-many relationship between Employee and City, which is represented by the bridge table Territories. This table must be defined as a fact table in Analysis Services, as we will see later.
Dimensions can be defined either from a DSV, which provides data for the dimension, or from preexisting templates provided by Analysis Services.
An example of the latter is the time dimension, which does not need to be defined from a data source. Dimensions can be built from one or more tables.
We explain next how hierarchies are handled in Analysis Services, which distinguishes two types of hierarchies.
Attribute hierarchies correspond to a single column in a dimension table, such as the attribute ProductName in the Product dimension. Multilevel hierarchies, in contrast, are derived from two or more attributes, each attribute corresponding to a level, such as Product and Category. An attribute can participate in more than one multilevel hierarchy, for example, in a hierarchy combining Product and Brand. Analysis Services supports three types of multilevel hierarchies, depending on the relationships between their members: balanced, ragged, and parent-child hierarchies, which are detailed later in this section.
We now show how the various kinds of dimensions of the Northwind data warehouse are defined in Analysis Services, starting with the Product dimension, shown in Fig 5.24. The right pane shows the tables of the DSV from which the dimension is derived, while the left pane lists the attributes of the dimension. The central pane shows the hierarchy Categories, composed of the Category and Product levels, which are defined using CategoryKey and ProductKey, respectively. To obtain user-friendly names when browsing the hierarchy, the NameColumn property of these attributes is set to CategoryName and ProductName, respectively.
Fig 5.24 Definition of the Product dimension
Figure 5.25 shows some members of the Product dimension as displayed in the dimension browser, with product names grouped by category. Note the Unknown member at the bottom: every dimension has such a member, which is used to handle key errors.
When a fact table is processed and a corresponding key is not found in the Product dimension, the fact value can be assigned to the Unknown member of that dimension. The visibility of the Unknown member is controlled with the UnknownMember property; when it is set to be visible, it appears in the results of MDX queries.
Fig 5.26 Definition of the Date dimension
To define the Date dimension, a hierarchy named Calendar is created with the attributes Year, Semester, Quarter, Month, and Date, where the last two attributes have been renamed. To be able to use the MDX time functions, the Type property of the dimension must be set to Time, and the attributes must be mapped to the standard time subdivisions: DayNbMonth, MonthNumber, Quarter, Semester, and Year are typed as DayOfMonth, MonthOfYear, QuarterOfYear, HalfYearOfYear, and Year, respectively. The attributes composing a hierarchy must be related to their parent attributes through one-to-many relationships so that roll-up operations are correct; for instance, a quarter must roll up to its semester. In Analysis Services, this is enforced by defining a unique key for each attribute of the hierarchy. In the Northwind data warehouse, the values of the MonthNumber attribute are not unique, so a composite key made of MonthNumber and Year is required, which is achieved by setting the KeyColumns property accordingly, as shown in Fig 5.27. In this case, the NameColumn property must also be set to the attribute displayed in the hierarchy, namely FullMonth, and the same must be done for the Quarter and Semester attributes.
Fig 5.27 Definition of the key for attribute MonthNumber in the Calendar hierarchy
Fig 5.28 Definition of the relationships in the Calendar hierarchy
The relationships between the attributes of the Date dimension, shown in Fig 5.28, correspond to functional dependencies. In Analysis Services, relationships are of two types: flexible relationships, which can change over time (e.g., a product can be reassigned to another category), and rigid relationships, which cannot (e.g., a month is always related to the same year).
All the relationships shown in Fig 5.28 are rigid, as indicated by the solid arrowhead. Figure 5.29 shows some members of the Calendar hierarchy. As can be seen, the named calculations FullSemester (e.g., S2 1997), FullQuarter (e.g., Q2 1997), and FullMonth (e.g., January 1997) are displayed.
Fig 5.29 Browsing the hierarchy in the Date dimension
Tabular Model
A tabular data model gathers data from several kinds of sources, including relational and multidimensional databases, data feeds, and text files. To establish a connection, the authentication details required by the chosen data source must be supplied.
When importing data from a relational database, one can either select specific tables and views or write queries that define the data to be imported. To streamline the import, unnecessary columns and rows should be filtered out. A best practice when defining a tabular model is to use database views rather than tables: views decouple the physical database structure from the tabular model and can consolidate data from several tables into a single entity. Creating such views, however, requires the appropriate access rights.
A key design decision in a tabular model is whether to keep a snowflake dimension from the source data as several tables or to denormalize the source tables into a single model table. The benefits of a single model table usually outweigh those of multiple model tables: performance is better, and since hierarchies in a tabular model cannot span several tables, denormalization is required to define them. However, denormalization increases storage size because of redundant data, especially for large dimension tables. The best approach therefore depends on the data volume and on usability requirements.
Merging a snowflake dimension into a single model table can be done either at the data source or within the tabular model. Typically, a dimension is denormalized in the relational data source by means of a view. For instance, in our example we create a view ProductStar in the relational data warehouse for the Product dimension, defined as follows (the category attributes included are one possible choice):
CREATE VIEW ProductStar AS
SELECT P.ProductKey, P.ProductName, P.QuantityPerUnit, P.UnitPrice, P.Discontinued,
  C.CategoryName, C.Description
FROM Product P, Category C
WHERE P.CategoryKey = C.CategoryKey;
The Geography reference dimension, with the levels City, State, Country, and Continent, is used by the Customer, Supplier, and Employee dimensions. In the data warehouse, we therefore define the views CustomerStar and SupplierStar as follows.
CREATE VIEW CustomerStar AS
SELECT U.CustomerKey, U.CustomerID, U.CompanyName, U.Address, U.PostalCode,
C.CityName AS City, S.StateName AS State, Y.CountryName AS Country, N.ContinentName AS Continent
FROM Customer U, City C, State S, Country Y, Continent N
WHERE U.CityKey = C.CityKey AND C.StateKey = S.StateKey AND
S.CountryKey = Y.CountryKey AND Y.ContinentKey = N.ContinentKey;
CREATE VIEW SupplierStar AS
SELECT U.SupplierKey, U.CompanyName, U.Address, U.PostalCode,
C.CityName AS City, S.StateName AS State, Y.CountryName AS Country, N.ContinentName AS Continent
FROM Supplier U, City C, State S, Country Y, Continent N
WHERE U.CityKey = C.CityKey AND C.StateKey = S.StateKey AND
S.CountryKey = Y.CountryKey AND Y.ContinentKey = N.ContinentKey;
Which attributes of the snowflake dimension are included in the views determines which attributes are available for analysis. In our example, we included CityName, StateName, CountryName, and ContinentName, so other attributes, such as StateCapital and Population, are not available in the tabular model and cannot be used for filtering.
The Employee dimension is related to the Geography reference dimension through a many-to-many relationship materialized in the Territories bridge table, so an additional view over the geography tables (named, say, CityStar) is needed, defined as follows.
CREATE VIEW CityStar AS
SELECT C.CityKey, C.CityName AS City, S.StateName AS State,
Y.CountryName AS Country, N.ContinentName AS Continent
FROM City C, State S, Country Y, Continent N
WHERE C.StateKey = S.StateKey AND S.CountryKey = Y.CountryKey AND
  Y.ContinentKey = N.ContinentKey;
In the tabular model, we import these views, renaming the first three as Product, Customer, and Supplier, together with the tables Employee, Territories, Sales, Date, and Shipper.
Fig 5.35 Definition of a relationship in Analysis Services Tabular
Analysis Services Tabular has two storage modes. The default one is an in-memory columnar database called VertiPaq, which stores a copy of the data retrieved from the data source each time the data model is refreshed.
In-memory means that all the data reside in RAM, while columnar means that the data are organized by column, so that each column is stored and compressed independently, reducing scan time and memory usage. The alternative, DirectQuery, creates a metadata layer over an external database and translates DAX queries into SQL queries, which reduces the latency between data updates and their availability for analysis. The choice between the two storage modes depends on the application requirements and on the available hardware.
When tables are imported from a relational database, the import wizard detects relationships from the foreign key constraints; however, this does not apply to the views created above, which are therefore not connected to the Sales table. To define these relationships, several characteristics must be specified, namely their cardinality, filter direction, and whether they are active. Figure 5.36 shows the tabular model of the Northwind data warehouse after the relationships have been defined.
Fig 5.36 The tabular model of the Northwind data warehouse in Analysis Services
Relationships in a tabular model are based on a single column. At least one of the two tables involved in a relationship must use a column that has a unique value in each row; this column is typically the primary key of the table, although it can be any candidate key.
When a relationship in the relational database is based on several columns supported by a composite foreign key, the columns must be consolidated into a single column to define the relationship in the tabular model; this can be done with a calculated column that concatenates the values of the contributing columns. The typical cardinality of a relationship in a tabular model is one-to-many, meaning that each row in the lookup table can correspond to zero, one, or many rows in the related table, as in the relationship between the Customer and Sales tables. The cardinality can also be many-to-one or one-to-one, the latter occurring when the columns on both sides of the relationship are candidate keys of their respective tables.
The uniqueness requirement identifies the lookup table of the relationship. Any value in the related table that has no corresponding value in the lookup table is assigned to a special blank row, which is automatically added to the lookup table to collect all unmatched values. For instance, such a blank row would collect all the rows of the Sales table without a matching customer; it is created only when there is at least one row in the Sales fact table that does not exist in the Customer table. In a one-to-many relationship, the functions RELATEDTABLE and RELATED can be used in row-related calculations on the lookup and related tables, respectively.
We then need to define three calculated columns, named Level 1, Level 2, and Level 3, for the levels of the hierarchy; they are obtained from a column EmployeePath that contains, for each employee, the path of supervisors computed with the DAX PATH function. The Level 1 column is defined as follows.
VAR LevelKey = PATHITEM ( Employee[EmployeePath], 1, INTEGER )
RETURN LOOKUPVALUE ( Employee[EmployeeName], Employee[EmployeeKey], LevelKey )
The PATHITEM function retrieves the employee key at a given position in the path, and the LOOKUPVALUE function retrieves the name of the employee whose key equals the LevelKey variable. The other two columns are defined similarly, changing the second argument of the PATHITEM function. These calculated columns, shown in Fig 5.40, should be hidden. Finally, a hierarchy named Supervision is created from them, with the Hide Members property set to Hide Blank Members so that the hierarchy is displayed properly in client tools.
Fig 5.40 Flattening the Supervision hierarchy in Analysis Services Tabular
Summary
This chapter explores the logical design of relational data warehouses, examining various schema alternatives such as star, snowflake, starflake, and constellation schemas. It emphasizes the translation of conceptual multidimensional schemas into logical schemas, with a focus on the representation of hierarchies and the management of slowly changing dimensions. The implementation of OLAP operations in the relational model using SQL is also discussed, highlighting advanced SQL features that support OLAP queries. The chapter concludes with a practical example of implementing the Northwind data warehouse in Microsoft Analysis Services, using both the multidimensional and the tabular models.
Bibliographic Notes
For a comprehensive understanding of data warehouse modeling, refer to the book by Kimball and Ross, which specifically addresses slowly changing dimensions. Hierarchies in data warehousing are explored by Jagadish et al., while complex hierarchies are analyzed in various studies. The classic paper by Lenz and Shoshani delves into the issue of summarizability, with additional insights found in related works. Efforts to establish normal forms for multidimensional databases have been inspired by Codd's relational model. SQL, analytics, and OLAP are thoroughly discussed in several key texts, and numerous books on Analysis Services provide detailed insights into its functionalities for both the multidimensional and tabular models. Finally, studies on data aggregation at different temporal granularities are highlighted in the literature.
Review Questions
5.1 Describe the differences between relational OLAP (ROLAP), multidimensional OLAP (MOLAP), and hybrid OLAP (HOLAP).
5.2 Describe the differences between star schemas, snowflake schemas, starflake schemas, and constellation schemas.
5.3 Discuss the mapping rules for translating a MultiDim schema into a relational schema. Are these rules similar to those used for translating an ER schema into a relational schema?
5.4 Explain how a balanced hierarchy can be mapped into either normalized or denormalized tables. Discuss the advantages and disadvantages of these alternative mappings.
5.5 How do you transform at the logical level an unbalanced hierarchy into a balanced one?
5.6 Describe different approaches for representing generalized hierarchies at the logical level.
5.7 Is it possible to distinguish between generalized, alternative, and parallel dependent hierarchies at the logical level?
5.8 Explain how a nonstrict hierarchy can be represented in the relational model.
5.9 Describe with an example the various types of slowly changing dimensions. Analyze and discuss the pros and cons of each type.
5.10 Define the kinds of SQL/OLAP window functions: partitioning, window ordering, and window framing. Write, in English, queries of each class over the Northwind data warehouse.
5.11 Identify the kind of hierarchies that can be directly represented in Analysis Services Multidimensional and in Analysis Services Tabular.
5.12 Discuss how snowflake schemas are represented in Analysis Services Multidimensional and in Analysis Services Tabular.
Exercises
Exercise 5.1. Consider the data warehouse of the telephone provider given in Ex. 3.1. Draw a star schema diagram for the data warehouse.
Exercise 5.2. For the star schema obtained in the previous exercise, write in SQL the queries given in Ex. 3.1.
Exercise 5.3. Consider the data warehouse of the train application given in Ex. 3.2. Draw a snowflake schema diagram for the data warehouse with hierarchies for the train and station dimensions.
Exercise 5.4. For the snowflake schema obtained in the previous exercise, write in SQL the queries given in Ex. 3.2.
Exercise 5.5. Consider the university data warehouse described in Ex. 3.3. Draw a constellation schema for the data warehouse taking into account the different granularities of the time dimension.
Exercise 5.6. For the constellation schema obtained in the previous exercise, write in SQL the queries given in Ex. 3.3.
Exercise 5.7. Translate the MultiDim schema obtained for the French horse race application in Ex. 4.5 into the relational model.
Exercise 5.8. Translate the MultiDim schema obtained for the Formula One application in Ex. 4.7 into the relational model.
Exercise 5.9. Implement in Analysis Services a multidimensional model for the Foodmart data warehouse given in Fig 5.41.
Exercise 5.10. Implement in Analysis Services a tabular model for the Foodmart data warehouse given in Fig 5.41.
Exercise 5.11. The Research and Innovative Technology Administration (RITA) coordinates the US Department of Transportation's research programs and gathers extensive statistics on various transportation modes, including monthly data about flight segments between airports. Users can download annual ZIP files containing CSV data from 1990 to the present, covering, among other things, scheduled and actual flight departures, seats sold, freight transported, and distances traveled. The website provides a description of all the data fields.
Construct an appropriate data warehouse schema for the above application. Analyze the input data and motivate the choice of your schema.
1 http://www.transtats.bts.gov/
StoreID StoreType RegionID StoreName StoreNumber StoreStreetAddress StoreCity
StoreState StorePostalCode StoreCountry StoreManager StorePhone StoreFax FirstOpenedDate LastRemodelDate LeaseSqft StoreSqft GrocerySqft FrozenSqft MeatSqft CoffeeBar VideoStore SaladBar PreparedFood Florist
PromotionID PromotionDistrictID PromotionName MediaType Cost StartDate EndDate
ProductID ProductClassID BrandName ProductName SKU SRP GrossWeight NetWeight RecyclablePackage LowFat
UnitsPerCase CasesPerPallet ShelfWidth ShelfHeight ShelfDepth
ProductClassID ProductSubcategory ProductCategory ProductDepartment ProductFamily
RegionID SalesCity SalesStateProvince SalesDistrict SalesRegion SalesCountry SalesDistrictID
SalesID ProductID DateID CustomerID PromotionID StoreID StoreSales StoreCost UnitSales
Fig 5.41 Relational schema of the Foodmart data warehouse
Data Analysis in Data Warehouses
This chapter emphasizes the importance of data analysis within data warehousing, highlighting how the utilization of collected data can enhance decision-making processes.
This article introduces two key languages for defining and querying data warehouses: MDX (MultiDimensional eXpressions) and DAX (Data Analysis eXpressions) MDX, developed by Microsoft, is essential for multidimensional data analysis.
In 1997, the introduction of OLAP tools led to the widespread adoption of a de facto standard, particularly with Microsoft's Analysis Services and MDX However, users found multidimensional cubes challenging to comprehend and use effectively To address these concerns, Microsoft launched the tabular model and its associated DAX language in 2012, which have since gained significant popularity among users.
From the user's perspective, the tabular model offers a simpler conceptual framework compared to the multidimensional model, making it easier for design, analysis, and reporting However, both models, along with their respective languages MDX and DAX, will continue to coexist in the business intelligence landscape, as they cater to different application requirements This chapter provides an introduction to both languages, with MDX discussed in Section 6.1 and DAX in Section 6.2.
We continue this chapter describing two essential tools for data analysis.
In Section 6.3, we explore key performance indicators (KPIs), which are quantifiable metrics that assess the effectiveness of an organization in meeting its primary goals Section 6.4 delves into the use of dashboards for presenting KPIs and other critical organizational data, enabling managers to make timely and informed decisions.
Chapter 7 will apply the discussed topics to the Northwind case study, highlighting their relevance This chapter is not meant to provide an exhaustive overview, as there are numerous books dedicated to each subject At the conclusion, we will reference popular literature in these fields for further exploration.
159 © Springer-Verlag GmbH Germany, part of Springer Nature 2022
A Vaisman, E Zimányi, Data Warehouse Systems, Data-Centric Systems and Applications, https://doi.org/10.1007/978-3-662-65167-4_6
Introduction to MDX
Tuples and Sets
Two fundamental concepts in MDX are tuples and sets We illustrate them using the example cube given in Fig 6.1.
Fig 6.1 A simple three-dimensional cube with one measure
A tuple represents a specific cell within a multidimensional cube by specifying one member from one or more dimensions For instance, the cell located in the top left corner, which has a value of 21, reflects the sales of beverages in Paris during the first quarter To pinpoint this cell, it is sufficient to provide the coordinates for each dimension involved.
(Product.Category.Beverages, Date.Quarter.Q1, Customer.City.Paris)
In the expression provided, each dimension's coordinates are represented in the format Dimension.Level.Member This highlights that there are multiple methods to identify a member within a dimension in MDX.
In particular, the order of the members is not significant, and the previous tuple can also be stated as follows:
(Date.Quarter.Q1, Product.Category.Beverages, Customer.City.Paris)
Since a tuple points to a single cell, then it follows that each member in the tuple must belong to a different dimension.
On the other hand, a set is a collection of tuples defined using the same dimensions For example, the following set
{ (Product.Category.Beverages, Date.Quarter.Q1, Customer.City.Paris)
(Product.Category.Beverages, Date.Quarter.Q1, Customer.City.Lyon) } points to the previous cell with value 21 and the one behind it with value 12.
It is worth noting that a set may have one or even zero tuples.
A tuple does not need to specify a member from every dimension Thus, the tuple
The data for Customer.City.Paris highlights the segment of the cube that includes the sixteen front cells, specifically representing the sales figures for various product categories in Paris.
In the context of sales analysis, the combination of Customer.City.Paris and Product.Category.Beverages highlights the sales of beverages specifically in Paris When a specific member for a dimension is not indicated, the default member is assumed, usually represented by the All member that aggregates the total values for that dimension It is important to note that the default member may also refer to the current member within the scope of a query.
In a data cube, tuples interact with hierarchies, particularly in the Customer dimension, which comprises levels such as Customer, City, State, and Country Understanding this relationship is essential for effective data analysis and representation.
In the first quarter, total beverage sales in France are represented by the member France at the country level, highlighting the sales performance for this category within the region.
In MDX, measures function similarly to dimensions, allowing for the inclusion of multiple metrics within a cube For instance, if a cube contains the measures UnitPrice, Discount, and SalesAmount, the Measures dimension will encompass these three members Consequently, we can easily specify the desired measure using a tuple format.
(Customer.Country.France, Product.Category.Beverages, Date.Quarter.Q1,
If a measure is not specified, then adefault measure will be used.
Basic Queries
The syntax of a typical MDX query is as follows:
In an MDX query, the axis specification outlines the axes and selected members, allowing for up to 128 axes Each axis is numbered, with 0 representing the x-axis (COLUMNS), 1 for the y-axis (ROWS), 2 for the z-axis, and so forth The initial axes have predefined names, including COLUMNS, ROWS, PAGES, CHAPTERS, and SECTIONS Axes can also be referenced using the AXIS(number) format, where AXIS(0) corresponds to COLUMNS and AXIS(1) to ROWS Importantly, query axes cannot be skipped; a query must include lower-numbered axes if it includes higher-numbered ones, meaning a ROWS axis cannot exist without a COLUMNS axis.
The slicer specification in the WHERE clause is optional; if it is not included, the query will default to the cube's standard measure However, most queries include a slicer specification unless the goal is to display the Measures dimension.
The simplest form of an axis specification consists in taking the members of the required dimension, including those of theMeasuresdimension, as follows:
SELECT [Measures].MEMBERS ON COLUMNS,
[Customer].[Country].MEMBERS ON ROWS
This query presents a summary of all customer measures at the country level In MDX, square brackets are generally optional, except when dealing with names that contain spaces, numbers, or MDX keywords, in which case they are necessary For clarity, we will exclude any unnecessary square brackets in the following examples.
The above query will show a row with only null values for countries that do not have customers The next query uses the NONEMPTY function to remove such values.
SELECT Measures.MEMBERS ON COLUMNS,
NONEMPTY(Customer.Country.MEMBERS) ON ROWS
Alternatively, the NON EMPTYkeyword can be used as shown next.
SELECT Measures.MEMBERS ON COLUMNS,
NON EMPTY Customer.Country.MEMBERS ON ROWS
While both the NONEMPTY function and the NON EMPTY keyword produce identical results in this instance, there are subtle distinctions between them that extend beyond this introductory overview of MDX.
The query presented reveals all the measures stored within the cube; however, derived measures like Net Amount, as defined in Section 5.9.1, do not show up in the results To include these derived measures, it is necessary to utilize the ALLMEMBERS keyword.
SELECT Measures.ALLMEMBERS ON COLUMNS,
Customer.Country.MEMBERS ON ROWS
TheADDCALCULATEDMEMBERSfunction can also be used for this purpose.
Slicing
Consider now the query below, which shows all measures by year.
SELECT Measures.MEMBERS ON COLUMNS,
[Order Date].Year.MEMBERS ON ROWS
To restrict the result to Belgium, we can write the following query.
SELECT Measures.MEMBERS ON COLUMNS,
[Order Date].Year.MEMBERS ON ROWS
The query is designed to retrieve all measure values across all years specifically for customers residing in Belgium, highlighting the distinct behavior of the WHERE clause compared to traditional SQL.
The WHERE clause can include multiple members from various hierarchies, allowing for more precise query restrictions For instance, the previous query can be refined to focus specifically on customers residing in Belgium who have purchased products from the beverages category.
SELECT Measures.MEMBERS ON COLUMNS,
[Order Date].Year.MEMBERS ON ROWS
WHERE (Customer.Country.Belgium, Product.Categories.Beverages)
To retrieve data for multiple members within the same hierarchy, it is essential to incorporate a set in the WHERE clause For instance, the query below displays the measures for all years concerning customers who purchased products in the beverage category and reside in Belgium or France.
SELECT Measures.MEMBERS ON COLUMNS,
[Order Date].Year.MEMBERS ON ROWS
WHERE ( { Customer.Country.Belgium, Customer.Country.France },
Using a set in theWHEREclause implicitly aggregates values for all members in the set In this case, the query shows aggregated values for Belgium and France in each cell.
Consider now the following query, which requests the sales amount of customers by country and by year.
SELECT [Order Date].Year.MEMBERS ON COLUMNS,
Customer.Country.MEMBERS ON ROWS
Here, we specified in the WHEREclause the measure to be displayed If no measure is stated, then the default measure is used.
The WHERE clause allows for the combination of measures and dimensions in queries, enabling users to filter results effectively; for instance, a query can be constructed to display figures exclusively for the beverages category, yielding results similar to previous outputs.
SELECT [Order Date].Year.MEMBERS ON COLUMNS,
Customer.Country.MEMBERS ON ROWS
WHERE (Measures.[Sales Amount], Product.Category.[Beverages])
When a dimension is present in a slicer, it cannot be utilized in any axis within the SELECT clause However, the FILTER function can later be employed to filter members of dimensions that are displayed on an axis.
Navigation
The query results include aggregated values for all years, encompassing the All column To display only the individual year values without the All member, you can utilize the CHILDREN function, as demonstrated in the following SQL command: SELECT [Order Date].Year.CHILDREN ON COLUMNS,
The attentive reader may wonder why the memberAlldoes not appear in the rows of the above result The reason is that the expression
Customer.Country.MEMBERS we used in the query is a shorthand notation for
The Customer dimension includes a geographic hierarchy that allows for the selection of members at the Country level The All member, which represents the highest level of the hierarchy, is situated above the Continent level and does not belong to the Country level, hence it is excluded from the results Each attribute within a dimension establishes its own attribute hierarchy, ensuring that an All member is present in every hierarchy.
6.1 Introduction to MDX 165 dimension, for both the user-defined hierarchies and the attribute hierarchies. Since the dimension Customer has an attribute hierarchy Company Name, if in the above query we use the expression
Customer.[Company Name].MEMBERS the result will contain the All member, in addition to the names of all the customers Using CHILDRENinstead will not show theAllmember.
It is also possible to select a single member or an enumeration of members of a dimension An example is given in the following query.
SELECT [Order Date].Year.MEMBERS ON COLUMNS,
{ Customer.Country.France,Customer.Country.Italy } ON ROWS
The query analyzes customer sales amounts by year for France and Italy It can be expressed using various terms such as Customer.France or Customer.Geography.Country.France, which utilize fully qualified names that specify the dimension, hierarchy, and level of the member While unique member names can suffice without full qualification, employing them is advisable to eliminate any potential ambiguities.
The functionCHILDRENmay be used to retrieve the states of the countries above as follows:
SELECT [Order Date].Year.MEMBERS ON COLUMNS,
{ Customer.France.CHILDREN, Customer.Italy.CHILDREN } ON ROWS FROM Sales
The MEMBERS and CHILDREN functions do not allow for drilling down into a hierarchy; however, the DESCENDANTS function can be utilized for this purpose.
SELECT [Order Date].Year.MEMBERS ON COLUMNS,
DESCENDANTS(Customer.Germany, Customer.City) ON ROWS
WHERE Measures.[Sales Amount] shows the sales amount for German cities
The DESCENDANTS function, by default, shows only the members at the level indicated by its second argument An optional third argument allows users to choose whether to include or exclude descendants or children before and after the specified level.
• SELF, which is the default, displays values for theCitylevel as above.
• BEFOREdisplays values from the state to theCountrylevels.
• SELF_AND_BEFOREdisplays values from theCityto theCountrylevels.
• AFTERdisplays values from theCustomerlevel, since it is only level after City.
• SELF_AND_AFTERdisplays values from theCityandCustomer levels.
• BEFORE_AND_AFTERdisplays values from theCountryto theCustomer levels, excluding the former.
• SELF_BEFORE_AFTERdisplays values from theCountryto theCustomer levels.
• LEAVESdisplays values from theCitylevel as above, since this is the only leaf level betweenCountryandCity On the other hand, ifLEAVESis used without specifying the level, as in the following query:
DESCENDANTS(Customer.Geography.Germany, ,LEAVES) then the leaf level, that is,Customer will be displayed.
The ASCENDANTS function retrieves a set that encompasses a specified member along with all its ancestors within a hierarchy For instance, a query can be executed to obtain the sales amount for the customer "Du monde entier" and all its ancestors in the Geography hierarchy, which includes levels such as City, State, Country, Continent, and All.
SELECT Measures.[Sales Amount] ON COLUMNS,
ASCENDANTS(Customer.Geography.[Du monde entier]) ON ROWS
The functionANCESTORcan be used to obtain the result for an ancestor at a specified level, as shown next.
SELECT Measures.[Sales Amount] ON COLUMNS,
ANCESTOR(Customer.Geography.[Du monde entier], Customer.Geography.State)
Cross Join
While MDX queries can showcase up to 128 axes, most OLAP tools are limited to displaying two-dimensional tables To effectively combine multiple dimensions into a single axis, the CROSSJOIN function is utilized For instance, to present sales amounts for product categories by country and quarter in a matrix format, it is essential to merge the customer and time dimensions into one axis.
SELECT Product.Category.MEMBERS ON COLUMNS,
[Order Date].Calendar.Quarter.MEMBERS) ON ROWS
Alternatively, we can use the cross join operator ‘*’ as shown next.
SELECT Product.Category.MEMBERS ON COLUMNS,
[Order Date].Calendar.Quarter.MEMBERS ON ROWS
More than two cross joins can be applied, as shown in the following query.
SELECT Product.Category.MEMBERS ON COLUMNS,
Customer.Country.MEMBERS * [Order Date].Calendar.Quarter.MEMBERS * Shipper.[Company Name].MEMBERS ON ROWS
Subqueries
The WHERE clause is essential for slicing the cube, allowing users to select specific measures or dimensions for display For instance, a query can be utilized to retrieve the sales amount specifically for the beverages and condiments categories.
SELECT Measures.[Sales Amount] ON COLUMNS,
[Order Date].Calendar.Quarter.MEMBERS ON ROWS
WHERE { Product.Category.Beverages, Product.Category.Condiments }
Instead of using a slicer in theWHEREclause of above query, we can define a subquery in theFROM clause as follows:
SELECT Measures.[Sales Amount] ON COLUMNS,
[Order Date].Calendar.Quarter.MEMBERS ON ROWS
FROM ( SELECT { Product.Category.Beverages,
Product.Category.Condiments } ON COLUMNS FROM Sales )
This query retrieves the sales figures for each quarter, focusing exclusively on the beverages and condiments product categories Notably, unlike SQL, the outer query allows for the inclusion of attributes that are not selected in the subquery.
There is a key distinction between utilizing the WHERE clause and employing subqueries in SQL When the product category hierarchy is incorporated in the WHERE clause, it is restricted from appearing on any axis; however, this limitation does not apply when using a subquery, as demonstrated in the following example.
SELECT Measures.[Sales Amount] ON COLUMNS,
[Order Date].Calendar.Quarter.MEMBERS * Product.Category.MEMBERS
FROM ( SELECT { Product.Category.Beverages, Product.Category.Condiments }
The subquery may include more than one dimension, as shown next.
SELECT Measures.[Sales Amount] ON COLUMNS,
[Order Date].Calendar.Quarter.MEMBERS * Product.Category.MEMBERS
FROM ( SELECT ( { Product.Category.Beverages, Product.Category.Condiments },
{ [Order Date].Calendar.[Q1 2017], [Order Date].Calendar.[Q2 2017] } ) ON COLUMNS FROM Sales )
Subquery expressions can be nested to perform intricate multistep filtering operations For example, a query can be constructed to retrieve the quarterly sales figures for the top two countries in the beverages and condiments product categories.
SELECT Measures.[Sales Amount] ON COLUMNS,
[Order Date].Calendar.[Quarter].Members ON ROWS
FROM ( SELECT TOPCOUNT(Customer.Country.MEMBERS, 2,
Measures.[Sales Amount]) ON COLUMNS FROM ( SELECT { Product.Category.Beverages,
Product.Category.Condiments } ON COLUMNS FROM Sales ) )
The TOPCOUNT function sorts a dataset in descending order based on a specified expression and retrieves a defined number of elements with the highest values While a single nesting could have been employed, the above expression offers greater clarity and ease of understanding.
Calculated Members and Named Sets
Calculated members in a dimension are newly defined members or measures created using the WITH clause preceding the SELECT statement For example, a calculated member can be defined as: WITH MEMBER Parent.MemberName AS h expression i, where Parent refers to the parent of the calculated member and MemberName is its designated name Likewise, named sets can be established using the format: WITH SET SetName AS h expression i to define new sets.
Calculated members and named sets defined with the WITH clause are limited to the scope of a single query To extend their visibility across an entire session or within a cube, a CREATE statement is required This article will primarily focus on examples of calculated members and named sets created within queries Since these elements are computed at runtime, they do not incur any processing penalties for the cube or affect the number of stored aggregations.
Calculated members are frequently utilized to create new measures that connect existing ones For instance, a measure for calculating the percentage profit from sales can be established effectively.
WITH MEMBER Measures.[Profit%] AS
(Measures.[Sales Amount] - Measures.[Freight]) / (Measures.[Sales Amount]), FORMAT_STRING = '#0.00%'
SELECT { [Sales Amount], Freight, [Profit%] } ON COLUMNS,
The FORMAT_STRING defines the display format for the new calculated member In the format expression, a ‘#’ represents a digit or nothing, whereas a ‘0’ indicates a digit or a zero Additionally, the percent symbol ‘%’ denotes that the calculation yields a percentage, which involves multiplying by a factor of 100.
We can also create a calculated member in a dimension, as shown next.
WITH MEMBER Product.Categories.[All].[Meat & Fish] AS
Product.Categories.[Meat/Poultry] + Product.Categories.[Seafood]
SELECT { Measures.[Unit Price], Measures.Quantity, Measures.Discount,
Measures.[Sales Amount] } ON COLUMNS,
The query generates a calculated member that represents the total of the Meat/Poultry and Seafood categories As a child of the All member within the hierarchy of the Product dimension, this calculated member is classified at the Category level.
In the following query, we define a named set Nordic Countriescomposed of the countries Denmark, Finland, Norway, and Sweden.
WITH SET [Nordic Countries] AS
{ Customer.Country.Denmark, Customer.Country.Finland,
Customer.Country.Norway, Customer.Country.Sweden }
SELECT Measures.MEMBERS ON COLUMNS,
A named set can be categorized as either static or dynamic A static named set is defined by explicitly listing its members, ensuring that its results remain unchanged despite updates to the cube or session scope In contrast, a dynamic named set is reevaluated whenever there are modifications to the scope For instance, a dynamic named set can be illustrated through a query that retrieves various measures for the top five selling products.
TOPCOUNT ( Product.Categories.Product.MEMBERS, 5,
SELECT { Measures.[Unit Price], Measures.Quantity, Measures.Discount,
Measures.[Sales Amount] } ON COLUMNS,
Relative Navigation
In hierarchical data analysis, it's essential to compare the value of one member against others within the hierarchy MDX offers various methods for navigating these hierarchies, with the most frequently used being PREVMEMBER, NEXTMEMBER, CURRENTMEMBER, PARENT, FIRSTCHILD, and LASTCHILD For instance, to calculate a member's sales within the Geography hierarchy as a percentage of its parent's sales, a specific query can be employed.
WITH MEMBER Measures.[Percentage Sales] AS
(Measures.[Sales Amount], Customer.Geography.CURRENTMEMBER) / (Measures.[Sales Amount], Customer.Geography.CURRENTMEMBER.PARENT), FORMAT_STRING = '#0.00%'
SELECT { Measures.[Sales Amount], Measures.[Percentage Sales] } ON COLUMNS,
NON EMPTY DESCENDANTS(Customer.Europe, Customer.Country,
SELF_AND_BEFORE) ON ROWS
The CURRENTMEMBER function within the WITH clause retrieves the current member along a dimension during an iteration, while the PARENT function identifies the parent of a member In the SELECT clause, measures specific to European countries are presented The expression used to define the calculated member can be simplified accordingly.
(Measures.[Sales Amount]) / (Measures.[Sales Amount],
The calculated measure for the Geography hierarchy functions effectively for all members at various levels, except for the All member, which lacks a parent To address this issue, a conditional expression must be incorporated into the measure's definition.
WITH MEMBER Measures.[Percentage Sales] AS
(Measures.[Sales Amount]) / (Measures.[Sales Amount],
The IIF function consists of three parameters: a Boolean condition, a value for when the condition is true, and a value for when it is false In scenarios where the All member lacks a parent, the sales amount for its parent defaults to 0, resulting in a percentage sales value of 1 The GENERATE function iterates through a set of members, utilizing a second set as a template for the output For instance, to show the sales amount by category for customers in Belgium and France, it is essential to structure the data appropriately.
6.1 Introduction to MDX 171 enumerating in the query all customers for each country, the GENERATE function can be used as follows:
SELECT Product.Category.MEMBERS ON COLUMNS,
GENERATE({Customer.Belgium, Customer.France},
DESCENDANTS(Customer.Geography.CURRENTMEMBER, Customer))
The PREVMEMBER function is utilized to illustrate growth over a specified time frame The subsequent query reveals the net amount along with the incremental change from the previous time member for each month in 2016.
WITH MEMBER Measures.[Net Amount Growth] AS
(Measures.[Net Amount], [Order Date].Calendar.PREVMEMBER),
SELECT { Measures.[Net Amount], Measures.[Net Amount Growth] } ON COLUMNS,
DESCENDANTS([Order Date].Calendar.[2016], [Order Date].Calendar.[Month])
The expression above specifies two formats: one for positive numbers and another for negative numbers By utilizing NEXTMEMBER, it displays the net amount for each month in relation to the subsequent month As the Northwind cube's sales data begins in July 2016, the growth for the initial month cannot be assessed, resulting in it being equivalent to the net amount Consequently, a value of zero is applied for the previous period that falls outside the cube's range.
Instead of using the PREVMEMBER function, the LAG(n) function can be utilized to retrieve a member located a specified number of positions before a specific member in the member dimension When a negative number is provided, it returns a subsequent member; if zero is given, it returns the current member Consequently, the functions PREV, NEXT, and CURRENT can be substituted with LAG(1), LAG(-1), and LAG(0), respectively Additionally, there is a similar function called LEAD, where LAG(n) is equivalent to LEAD(-n).
Time-Related Calculations
Time period analysis plays a crucial role in business intelligence applications, allowing businesses to compare sales data across different time frames, such as monthly or quarterly figures from the current year to those of the previous year MDX offers a robust array of time series functions specifically designed for this type of analysis Although these functions are primarily utilized with a time dimension, they can also be effectively applied to other dimensions, enhancing their versatility in data analysis.
The PARALLELPERIOD function allows users to compare values of a specified member with those of the same relative position in a prior period, such as comparing quarterly data to the same quarter from the previous year While the PREVMEMBER function calculates growth in relation to the previous month, the PARALLELPERIOD function is ideal for assessing growth compared to the same period in the prior year.
WITH MEMBER Measures.[Previous Year] AS
(Measures.[Net Amount], PARALLELPERIOD([Order Date].Calendar.Quarter, 4)), FORMAT_STRING = '$###,##0.00'
MEMBER Measures.[Net Amount Growth] AS
Measures.[Net Amount] - Measures.[Previous Year],
SELECT { [Net Amount], [Previous Year], [Net Amount Growth] } ON COLUMNS,
[Order Date].Calendar.Quarter ON ROWS
The PARALLELPERIOD function retrieves data from the member corresponding to the same quarter from the previous year Given that the Northwind cube's sales data begins in July 2016, the query will return null values for the Previous Year measure for the initial four quarters, while the measures for Net Amount and Net Amount Growth will display consistent values.
The OPENINGPERIOD and CLOSINGPERIOD functions identify the first and last siblings among a member's descendants at a specific level For instance, to calculate the difference in sales quantity between a month and the opening month of the quarter, these functions can be utilized effectively.
WITH MEMBER Measures.[Quantity Difference] AS
OPENINGPERIOD([Order Date].Calendar.Month,
[Order Date].Calendar.CURRENTMEMBER.PARENT))
SELECT { Measures.[Quantity], Measures.[Quantity Difference] } ON COLUMNS,
[Order Date].Calendar.[Month] ON ROWS
In deriving the calculated member Quantity Difference, the opening period at the month level is taken for the quarter to which the month corresponds.
If CLOSINGPERIODis used instead, the query will show sales based on the final month of the specified season,
The PERIODSTODATE function generates a series of periods from a chosen level, beginning with the initial period and concluding with a designated member For instance, the expression can be used to create a set that includes all months leading up to and including June 2017.
PERIODSTODATE([Order Date].Calendar.Year, [Order Date].Calendar.[June 2017])
To create a calculated member that shows year-to-date data, such as monthly year-to-date sales, we must utilize both the PERIODSTODATE function and the SUM function.
6.1 Introduction to MDX 173 which returns the sum of a numeric expression evaluated over a set For example, the sum of sales amount for Italy and Greece can be displayed with the following expression.
SUM({Customer.Country.Italy, Customer.Country.Greece}, Measures.[Sales Amount])
In the expression below, the measure to be displayed is the sum of the current time member over the year level.
SUM(PERIODSTODATE([Order Date].Calendar.Year,
[Order Date].Calendar.CURRENTMEMBER), Measures.[Sales Amount])
Similarly, by replacingYearbyQuarterin the above expression we can obtain quarter-to-date sales For example, the following query shows year-to-date and quarter-to-date sales.
WITH MEMBER Measures.YTDSales AS
SUM(PERIODSTODATE([Order Date].Calendar.Year,
[Order Date].Calendar.CURRENTMEMBER), Measures.[Sales Amount])
SUM(PERIODSTODATE([Order Date].Calendar.Quarter,
[Order Date].Calendar.CURRENTMEMBER), Measures.[Sales Amount])
SELECT { Measures.[Sales Amount], Measures.YTDSales, Measures.QTDSales }
ON COLUMNS, [Order Date].Calendar.Month.MEMBERS ON ROWS
The Northwind data warehouse includes sales data beginning in July 2016 For August 2016, both YTDSales and QTDSales are calculated by summing the Sales Amount for July and August In contrast, YTDSales for December 2016 is derived from the total Sales Amount from July through December, while QTDSales for December 2016 reflects the sum of Sales Amount from October to December.
The xTD functions, including YTD (Year-to-Date), QTD (Quarter-to-Date), MTD (Month-to-Date), and WTD (Week-to-Date), are specifically designed for time dimensions Unlike other functions, these are applicable only to date-related data Each xTD function corresponds to a specific period level, with YTD representing the yearly level, QTD for quarterly, and so forth For instance, the measure YTDSales can be expressed using the xTD functions to yield equivalent results.
SUM(YTD([Order Date].Calendar.CURRENTMEMBER), Measures.[Sales Amount])
Moving averages are essential tools for addressing common business challenges, particularly in analyzing time series data like financial indicators and stock market trends They help to smooth out rapid fluctuations in data, making it easier to identify overarching trends However, selecting the appropriate smoothing period is crucial; a period that is too long can result in a flat average that obscures trends, while a period that is too short may reveal excessive peaks and troughs, hindering the ability to discern significant patterns.
The LAG function, when used with the range operator ':', enables the calculation of moving averages in MDX This operator generates a set of members that includes two specified members and all members in between For instance, to calculate the 3-month moving average of order numbers, one can utilize a specific query.
WITH MEMBER Measures.MovAvg3M AS
AVG([Order Date].Calendar.CURRENTMEMBER.LAG(2):
Measures.[Order No]), FORMAT_STRING = '###,##0.00'
SELECT { Measures.[Order No], Measures.MovAvg3M } ON COLUMNS,
[Order Date].Calendar.Month.MEMBERS ON ROWS
The AVG function returns the average of an expression evaluated over a set The LAG(2) function obtains the month preceding the current one by
The range operator calculates the average number of orders over a three-month period For the Northwind cube, which contains sales data starting from July 2016, the average for July is based solely on the number of orders for that month due to the absence of prior data In contrast, the average for August 2016 is derived from the data of both July and August From September 2016 onward, the average will be determined using the current month along with the two preceding months.
Filtering
Filtering is a process used to limit the display of axis members, distinguishing it from slicing, which is defined in the WHERE clause While filtering reduces the number of visible axis members, slicing affects the values associated with those members without altering their selection.
In data analysis, the NON EMPTY clause is commonly used to remove members of an axis with no values Additionally, the FILTER function allows for filtering a dataset based on specific conditions For instance, if we want to display the sales amount for 2017 categorized by city and product, we can apply a filter to focus on top-performing cities, specifically those with sales exceeding $25,000.
SELECT Product.Category.MEMBERS ON COLUMNS,
FILTER(Customer.City.MEMBERS, (Measures.[Sales Amount],
[Order Date].Calendar.[2017]) > 25000) ON ROWS
WHERE (Measures.[Sales Amount], [Order Date].Calendar.[2017])
As another example, the following query shows customers that in 2017 had profit margin below the one of their city.
WITH MEMBER Measures.[Profit%] AS
(Measures.[Sales Amount] - Measures.[Freight]) / (Measures.[Sales Amount]), FORMAT_STRING = '#0.00%'
MEMBER Measures.[Profit%City] AS
(Measures.[Profit%], Customer.Geography.CURRENTMEMBER.PARENT), FORMAT_STRING = '#0.00%'
SELECT { Measures.[Sales Amount], Measures.[Freight], Measures.[Net Amount],
Measures.[Profit%], Measures.[Profit%City] } ON COLUMNS,
FILTER(NONEMPTY(Customer.Customer.MEMBERS),
(Measures.[Profit%]) < (Measures.[Profit%City])) ON ROWS
Profit% calculates the profit percentage for the current member, while Profit%City extends this calculation to the parent entity, reflecting the profit of the state associated with the city.
Sorting
In cube queries, all the members in a dimension have a hierarchical order. For example, consider the query below:
SELECT Measures.MEMBERS ON COLUMNS,
Customer.Geography.Country.MEMBERS ON ROWS
The countries are organized hierarchically based on their continent, starting with European countries followed by North American countries To sort the countries alphabetically by name, the ORDER function can be utilized, as outlined in the following syntax.
ORDER(Set, Expression [, ASC | DESC | BASC | BDESC])
The expression can be either numeric or a string, with the default sort order set to ASC The 'B' prefix signifies that the hierarchical order may be disregarded In hierarchical sorting, members are first organized based on their hierarchy position, followed by sorting within each level Conversely, non-hierarchical sorting arranges members independently of the hierarchy For instance, countries can be listed in a non-hierarchical manner, allowing for flexible ordering.
SELECT Measures.MEMBERS ON COLUMNS,
ORDER(Customer.Geography.Country.MEMBERS,
Customer.Geography.CURRENTMEMBER.NAME, BASC) ON ROWS
The property NAME retrieves the name of a level, dimension, member, or hierarchy, while the UNIQUENAME property provides the corresponding unique name This query will display countries in alphabetical order, including Argentina, Australia, and Austria.
It is often the case that the ordering is based on an actual measure To sort the query above based on the sales amount, we can proceed as follows:
SELECT Measures.MEMBERS ON COLUMNS,
ORDER(Customer.Geography.Country.MEMBERS,
Measures.[Sales Amount], BDESC) ON ROWS
Sorting data based on multiple criteria in MDX can be challenging, as it differs from SQL where the ORDER function permits a single sorting expression For example, if we want to analyze sales amounts categorized by continent and category, and we wish to sort the results first by continent name and then by category name, we must utilize the GENERATE function to achieve this.
SELECT Measures.[Sales Amount] ON COLUMNS,
ORDER( Customer.Geography.Continent.ALLMEMBERS,
Customer.Geography.CURRENTMEMBER.NAME, BASC ),
Product.Categories.CURRENTMEMBER.NAME, BASC ) ) ON ROWS
The GENERATE function first sorts continents alphabetically in ascending order Then, it performs a cross join with the categories, also sorted in ascending order by name, to combine the data effectively.
Top and Bottom Analysis
To identify the top three best-selling cities based on sales amount, the HEAD function can be utilized to retrieve the initial members of the dataset Conversely, the TAIL function allows for the selection of a subset from the end of the dataset Thus, the query for "Top three best-selling store cities" can be effectively executed using these functions.
SELECT Measures.MEMBERS ON COLUMNS,
HEAD(ORDER(Customer.Geography.City.MEMBERS,
Measures.[Sales Amount], BDESC), 3) ON ROWS
The functionTOPCOUNT can also be used to answer the previous query.
SELECT Measures.MEMBERS ON COLUMNS,
TOPCOUNT(Customer.Geography.City.MEMBERS, 3,
Measures.[Sales Amount]) ON ROWS
To showcase the top three cities by sales amount along with their total sales, we can also include the combined sales of all other cities.
WITH SET SetTop3Cities AS TOPCOUNT(
Customer.Geography.City.MEMBERS, 3, [Sales Amount])
MEMBER Customer.Geography.[Top 3 Cities] AS
MEMBER Customer.Geography.[Other Cities] AS
(Customer.[All]) - (Customer.[Top 3 Cities])
SELECT Measures.MEMBERS ON COLUMNS,
{ SetTop3Cities, [Top 3 Cities], [Other Cities], Customer.[All] } ON ROWS FROM Sales
The process begins by identifying the three best-selling cities, referred to as SetTop3Cities Two new members are then added to the Geography hierarchy: Top 3 Cities, which aggregates the measures from SetTop3Cities, and Other Cities, representing the difference between the overall customer measures and those of Top 3 Cities The AGGREGATE function is employed to calculate each measure, utilizing the average for Unit Price and Discount, while summing the other measures.
The TOPPERCENT and TOPSUM functions are essential for top filter processing, as they retrieve the highest elements based on a specified percentage or value For instance, a query can be used to display cities where their sales account for 30% of total sales.
SELECT Measures.[Sales Amount] ON COLUMNS,
{ TOPPERCENT(Customer.Geography.City.MEMBERS, 30,
Measures.[Sales Amount]), Customer.Geography.[All] } ON ROWS
The BOTTOM functions provide a way to retrieve the lowest items from a list, such as using the BOTTOMSUM function to identify cities with cumulative sales under $10,000.
Aggregation Functions
MDX offers a variety of aggregation functions, including SUM and AVG, as well as MEDIAN, MAX, MIN, VAR, and STDDEV, which calculate the median, maximum, minimum, variance, and standard deviation of numeric values in a dataset For instance, a query can be executed to analyze product categories, revealing total, maximum, minimum, and average sales amounts over a one-month period in 2017.
WITH MEMBER Measures.[Maximum Sales] AS
MAX(DESCENDANTS([Order Date].Calendar.Year.[2017],
[Order Date].Calendar.Month), Measures.[Sales Amount])
MEMBER Measures.[Minimum Sales] AS
MIN(DESCENDANTS([Order Date].Calendar.Year.[2017],
[Order Date].Calendar.Month), Measures.[Sales Amount])
MEMBER Measures.[Average Sales] AS
AVG(DESCENDANTS([Order Date].Calendar.Year.[2017],
[Order Date].Calendar.Month), Measures.[Sales Amount])
SELECT { [Sales Amount], [Maximum Sales], [Minimum Sales], [Average Sales] }
Product.Categories.Category.MEMBERS ON ROWS
Our next query computes the maximum sales by category as well as the month in which they occurred.
WITH MEMBER Measures.[Maximum Sales] AS
MAX(DESCENDANTS([Order Date].Calendar.Year.[2017],
[Order Date].Calendar.Month), Measures.[Sales Amount])
MEMBER Measures.[Maximum Period] AS
TOPCOUNT(DESCENDANTS([Order Date].Calendar.Year.[2017],
[Order Date].Calendar.Month), 1, Measures.[Sales Amount]).ITEM(0).NAME SELECT { [Maximum Sales], [Maximum Period] } ON COLUMNS,
Product.Categories.Category.MEMBERS ON ROWS
The TOPCOUNT function identifies the tuple with the highest sales amount, while the ITEM function extracts the first member from that tuple, and the NAME function retrieves the name of that member The COUNT function is utilized to tally the number of tuples in a dataset, with an optional parameter that allows users to include or exclude empty cells For instance, it can determine the number of customers who purchased a specific product category by counting tuples that combine sales amounts with customer names Excluding empty cells is crucial to ensure the count reflects only those customers who made purchases in the relevant product category.
WITH MEMBER Measures.[Customer Count] AS
[Customer].[Company Name].MEMBERS}, EXCLUDEEMPTY)
SELECT { Measures.[Sales Amount], Measures.[Customer Count] } ON COLUMNS,
Product.Category.MEMBERS ON ROWS
Introduction to DAX
Expressions
DAX is a strongly typed language that includes various data types such as integers, floating-point numbers, currency (a fixed decimal number with four digits of precision stored as an integer), datetimes, Boolean values, strings, and binary large objects (BLOBs) The type system in DAX determines the resulting type of an expression based on the components utilized within it.
Functions in DAX are essential for performing calculations within a data model, utilizing input arguments that can be either required or optional Upon execution, a function yields a value, which can be a single value or a table DAX encompasses a variety of functions for diverse calculations, including date and time, logical, statistical, mathematical, trigonometric, and financial functions Additionally, DAX features several types of operators such as arithmetic (+, -, *, /), comparison (=, , >), text concatenation (&), and logical operators (&& and ||) These operators are overloaded, meaning their behavior varies based on the provided arguments.
Expressions in data modeling are composed of various elements such as tables, columns, measures, functions, operators, and constants They play a crucial role in defining measures, calculated columns, calculated tables, and queries Specifically, an expression for a measure or calculated column must yield a scalar value, like a number or string, whereas an expression for a calculated table or query must produce a table This concept was previously illustrated through the use of DAX expressions in the context of the Northwind case study in Section 5.9.2.
In a data model, tables consist of both columns and measures, which can be referenced in expressions like 'Sales'[Quantity] The table name can be omitted if it doesn't start with a number, contain spaces, or use reserved words like Date or Sum Additionally, when the expression is in the same context as the table, the table name can also be skipped However, it is advisable to include the table name for column references, enhancing readability, while omitting it for measure references, such as [Sales Amount], due to their differing calculation semantics.
Measuresare used to aggregate values from multiple rows in a table An example is given next, which uses the SUMaggregate function.
Sales[Sales Amount] := SUM( Sales[SalesAmount] )
In DAX, a measure must be defined within a specific table, such as tableSales The provided expression serves as a simplified version of the iterator function SUMX, which takes two arguments: the table to be scanned and an expression that is evaluated for each row within that table.
Sales[Total Cost] := SUMX( Sales, Sales[Quantity] * Sales[UnitCost] )
A measure can reference other measures as shown next.
Sales[Margin] := [Sales Amount] - [Total Cost]
Sales[Margin %] := [Margin] / [Sales Amount]
Aggregation functions such as MIN, MAX, COUNT, AVERAGE, and STDDEV each have a corresponding iterator function with an "X" suffix Many of these functions take a table expression as their first argument, which can also include a table function For instance, the FILTER function processes rows from a table expression and returns a new table containing only the rows that meet the logical condition specified in its second argument.
SUMX ( FILTER ( Sales, Sales[UnitPrice] >= 10 ), Sales[Quantity] * Sales[UnitPrice] )
Calculated columns are derived by an expression and can be used like any other column An example is given next.
Employee[Age] = INT( YEARFRAC( Employee[BirthDate], TODAY() ) )
In DAX, functions like TODAY provide the current date, YEARFRAC calculates the year fraction between two dates, and INT rounds numbers down to the nearest integer Unlike SQL, where calculated columns are evaluated at query time and do not utilize memory, DAX computes these columns during database processing and stores them in the model This approach enhances user experience by allowing complex calculations to be performed at process time rather than at query time.
Variables can be used to avoid repeating the same subexpression The following example classifies customers according to their total sales.
VAR TotalSales = SUM( Sales[SalesAmount] )
Here, SWITCHreturns different results depending on the value of an expres- sion Variables can be of any scalar data type, including tables.
Calculated tables are created from expressions and are integrated into the model A common use case for a calculated table is to generate a Date table when it is absent from the data sources For instance, in the following example, we consider that the Sales table includes a column named OrderDate.
VAR MinYear = YEAR( MIN( Sales[OrderDate] ) )
VAR MaxYear = YEAR( MAX( Sales[OrderDate] ) )
YEAR( [Date] ) >= MinYear && YEAR( [Date] ) =
MINX( VALUES ( Customer[Country] ), [Sales Amount] ) ) )MEASURE Sales[Cumul Perc] = [Cumul Sales] / [Sales Amount]
Customer[Country], "Sales Amount", [Sales Amount],
"Perc Sales", [Perc Sales], "Cumul Sales", [Cumul Sales],
RANKX( Total, [Cumul Sales], , ASC ) 0 ) )
Employee[EmployeeName], 'Date'[Year], "Sales Amount", [Sales Amount],
"Avg Monthly Sales", [Avg Monthly Sales] )
ORDER BY [Full Name], [Year]
To calculate the Average Monthly Sales, we first utilize the SUMMARIZE function to determine monthly sales figures, applying the FILTER function to isolate months with sales activity We then count the eligible months using the COUNTROWS function Finally, the SUMMARIZECOLUMNS function aggregates these measures in the main query to present the results.
7.2 Querying the Tabular Model in DAX 215
Query 7.9 Total sales amount and discount amount per product and month.
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
SUMX( Sales, Sales[Quantity] * Sales[Discount] * Sales[UnitPrice] )
'Product'[ProductName], 'Date'[MonthName], 'Date'[Year], 'Date'[MonthNumber],
"Sales Amount", [Sales Amount], "Total Discount", [Total Discount] )
ORDER BY [ProductName], [Year], [MonthNumber]
We compute the Total Discount measure by multiplying the quantity, the discount, and the unit price This measure is then used in the SUMMA- RIZECOLUMNSfunction in the main query.
Query 7.10 Monthly year-to-date sales for each product category.
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[YTDSales] = TOTALYTD ( SUM( [SalesAmount] ), 'Date'[Date] ) EVALUATE
Product[CategoryName], 'Date'[MonthName], 'Date'[Year], 'Date'[MonthNumber],
"Sales Amount", [Sales Amount], "YTDSales", [YTDSales] )
ORDER BY [CategoryName], [Year], [MonthNumber]
In the measure YTDSales, we use theTOTALYTDfunction to aggregate the measureSales Amountfor all dates of the current year up to the current date.
Query 7.11 Moving average over the last 3 months of the sales amount by product category.
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
AVERAGEX( VALUES ( 'Date'[YearMonth] ), [Sales Amount] ),
DATESINPERIOD( 'Date'[Date], MAX( 'Date'[Date] ), -3, MONTH ) ) EVALUATE
Product[CategoryName], 'Date'[MonthName], 'Date'[Year], 'Date'[MonthNumber],
"Sales Amount", [Sales Amount], "MovAvg3M", [MovAvg3M] )
ORDER BY [CategoryName], [Year], [MonthNumber]
The DATESINPERIOD function is utilized to select all dates from the three months preceding the current date, providing essential context for the CALCULATE function Subsequently, the AVERAGEX function calculates the Sales Amount measure for each YearMonth value, averaging up to three values within the measure's context.
Query 7.12 Personal sales amount made by an employee compared with the total sales amount made by herself and her subordinates during 2017.
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
Employee[EmployeeName], FILTER( 'Date', 'Date'[Year] = 2017 ),
"Personal Amount", [Sales Amount], "Subordinates Amount", [Subordinates Sales] ) ORDER BY [EmployeeName]
In this analysis, we utilize the parent-child hierarchy of the Employee dimension, focusing on the derived column SupervisionPath, which contains a delimited text of all supervisors' keys for each employee The measure Subordinate Sales employs the SELECTEDVALUE function to retrieve the employee key based on the current filter context This key is then used with the PATHCONTAINS function to identify employees within the specified supervision hierarchy The main query aggregates these measures while filtering the results to the year 2017.
Query 7.13 Total sales amount, number of products, and sum of the quan- tities sold for each order.
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[NbProducts] = COUNTROWS( VALUES ( Sales[ProductKey] ) ) MEASURE Sales[Quantity] = SUM( Sales[Quantity] )
Sales[OrderNo], "Sales Amount", [Sales Amount], "NbProducts", [NbProducts],
This query focuses on the dimensionOrder, derived from the fact table Sales It utilizes the measure NbProducts, employing the COUNTROWS function to calculate the number of distinct products within each order Additionally, the measure Quantity aggregates the total quantity of products Ultimately, the main query presents the required measures for analysis.
Query 7.14 For each month, total number of orders, total sales amount, and average sales amount by order.
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[Nb Orders] = COUNTROWS( SUMMARIZE( Sales, Sales[OrderNo] ) )MEASURE Sales[AvgSales] = DIVIDE ( [Sales Amount], [Nb Orders] )
Querying the Relational Data Warehouse in SQL
'Date'[MonthName], 'Date'[Year], 'Date'[MonthNumber], "Nb Orders", [Nb Orders],
"Sales Amount", [Sales Amount], "AvgSales", [AvgSales] )
The measure Nb Orderscomputes the number of orders while the measure AvgSales divides the measureSales Amount by the previous measure.
Query 7.15 For each employee, total sales amount, number of cities, and number of states to which she is assigned.
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
CROSSFILTER( Territories[EmployeeKey], Employee[EmployeeKey], BOTH ) ) MEASURE Sales[NoStates] =
CALCULATE( COUNTROWS( VALUES( EmployeeGeography[State] ),
CROSSFILTER( Territories[EmployeeKey], Employee[EmployeeKey], BOTH ) ) ), EVALUATE
Employee[FirstName], Employee[LastName], "Sales Amount", [Sales Amount],
To effectively navigate the many-to-many relationship between employees and cities in the Territories table, it's essential to understand that the relationship between Territories and Employees is unidirectional, while the link between Territories and Employee Geography is bidirectional This distinction allows filters from Employee Geography to affect the Employee table, but not vice versa To accurately calculate the measures NoCities and NoStates, we utilize the CROSSFILTER function to create a bidirectional relationship during the query evaluation The NoCities measure employs the COUNTROWS function to tally the rows in the Territories table, while the NoStates measure counts distinct values in the State column Finally, in the main query, we aggregate results using the SUMMARIZECOLUMNS function and filter out any rows that contain only blanks, retaining only the total number of states.
7.3 Querying the Relational Data Warehouse in SQL
The Northwind data warehouse schema, illustrated in Fig 5.4, prompts a reevaluation of previous SQL queries This is crucial as many OLAP tools automatically convert MDX or DAX queries into SQL, subsequently forwarding them to a relational database server.
To simplify month-related computations, we create the next two views.
SELECT DISTINCT Year, MonthNumber, MonthName
LAG(Year * 100 + MonthNumber) OVER (ORDER BY Year * 100 + MonthNumber) AS PrevMonth
FROM YearMonth ) SELECT Year, MonthNumber, MonthName, PrevMonth / 100 AS PM_Year,
PrevMonth % 100 AS PM_MonthNumber FROM YearMonthPrevMonth
The YearMonth view encompasses all years and months from the Date dimension, while the YearMonthPM view links each year and month to the corresponding month of the previous year This is achieved using the LAG window function in the YearMonthPrevMonth table, which creates a numerical expression in the format YYYYMM to represent the previous month For instance, January 2017 is associated with the expression 201612 The main query subsequently separates this expression into distinct year and month values.
Query 7.1 Total sales amount per customer, year, and product category.
FROM Sales S, Customer C, Date D, Product P, Category A
WHERE S.CustomerKey = C.CustomerKey AND S.OrderDateKey = D.DateKey AND
S.ProductKey = P.ProductKey AND P.CategoryKey = A.CategoryKey GROUP BY C.CompanyName, D.Year, A.CategoryName
ORDER BY C.CompanyName, D.Year, A.CategoryName
Here, we join the fact table with the involved dimension tables and aggregate the results by company, year, and category.
Query 7.2 Yearly sales amount for each pair of customer country and sup- plier countries.
SELECT CO.CountryName AS Country, SO.CountryName AS Country,
D.Year, SUM(SalesAmount) AS SalesAmount
FROM Sales F, Customer C, City CC, State CS, Country CO,
Supplier S, City SC, State SS, Country SO, Date D
WHERE F.CustomerKey = C.CustomerKey AND C.CityKey = CC.CityKey AND
CC.StateKey = CS.StateKey AND CS.CountryKey = CO.CountryKey AND F.SupplierKey = S.SupplierKey AND S.CityKey = SC.CityKey AND
SC.StateKey = SS.StateKey AND SS.CountryKey = SO.CountryKey AND F.OrderDateKey = D.DateKey
GROUP BY CO.CountryName, SO.CountryName, D.Year
ORDER BY CO.CountryName, SO.CountryName, D.Year
7.3 Querying the Relational Data Warehouse in SQL 219
Here, the tables of the geography dimension are joined twice with the fact table to obtain the countries of the customer and the supplier.
Query 7.3 Monthly sales by customer state compared to those of the previ- ous year.
SELECT DISTINCT StateName, Year, MonthNumber, MonthName
FROM Customer C, City Y, State S, YearMonth M
WHERE C.CityKey = Y.CityKey AND Y.StateKey = S.StateKey ),
SUM(SalesAmount) AS SalesAmount FROM Sales F, Customer C, City Y, State S, Date D
C.CityKey = Y.CityKey AND Y.StateKey = S.StateKey AND F.OrderDateKey = D.DateKey
GROUP BY S.StateName, D.Year, D.MonthNumber )
SELECT S.StateName, S.MonthName, S.Year, M1.SalesAmount,
FROM StateMonth S LEFT OUTER JOIN SalesStateMonth M1 ON
LEFT OUTER JOIN SalesStateMonth M2 ON
WHERE M1.SalesAmount IS NOT NULL OR M2.SalesAmount IS NOT NULL
ORDER BY S.StateName, S.Year, S.MonthNumber
The query begins by creating a table called tableStateMonth, which generates the Cartesian product of all customer states and months from the viewYearMonth Next, the tableSalesStateMonth calculates the monthly sales figures for each state The main query then executes two left outer joins between tableStateMonth and tableSalesStateMonth to derive the final results.
Query 7.4 Monthly sales growth per product, that is, total sales per product compared to those of the previous month.
SELECT DISTINCT ProductName, Year, MonthNumber, MonthName,
PM_Year, PM_MonthNumber FROM Product, YearMonthPM ),
SUM(SalesAmount) AS SalesAmount FROM Sales S, Product P, Date D
WHERE S.ProductKey = P.ProductKey AND S.OrderDateKey = D.DateKey GROUP BY ProductName, Year, MonthNumber )
SELECT P.ProductName, P.MonthName, P.Year, S1.SalesAmount,
S2.SalesAmount AS SalesPrevMonth, COALESCE(S1.SalesAmount,0) -
FROM ProdYearMonthPM P LEFT OUTER JOIN SalesProdMonth S1 ON
P.MonthNumber = S1.MonthNumber LEFT OUTER JOIN SalesProdMonth S2
ORDER BY ProductName, P.Year, P.MonthNumber
The ProdYearMonthPM table generates a Cartesian product of the Product table and the YearMonthPM view Meanwhile, the SalesProdMonth table calculates monthly sales by product The primary query then executes two left outer joins between the ProdYearMonthPM and SalesProdMonth tables to derive the final results.
Query 7.5 Three best-selling employees.
SELECT TOP(3) E.FirstName + ' ' + E.LastName AS EmployeeName,
We group the sales data by employee and use the SUM function to aggregate the sales of each group. The results are then sorted in descending order of the total sales, and the TOP function retrieves the first three entries.
Query 7.6 Best-selling employee per product and year.
WITH SalesProdYearEmp AS (
   SELECT P.ProductName, D.Year, SUM(S.SalesAmount) AS SalesAmount,
          E.FirstName + ' ' + E.LastName AS EmployeeName
   FROM Sales S, Employee E, Date D, Product P
   WHERE S.EmployeeKey = E.EmployeeKey AND S.OrderDateKey = D.DateKey AND
         S.ProductKey = P.ProductKey
   GROUP BY P.ProductName, D.Year, E.FirstName, E.LastName )
SELECT ProductName, Year, EmployeeName AS TopEmployee, SalesAmount
FROM SalesProdYearEmp S1
WHERE S1.SalesAmount = (
   SELECT MAX(SalesAmount) FROM SalesProdYearEmp S2
   WHERE S1.ProductName = S2.ProductName AND S1.Year = S2.Year )
ORDER BY ProductName, Year
The table SalesProdYearEmp computes the yearly sales by product and employee. The main query then retrieves the entries whose total sales equal the highest total sales for the corresponding product and year.
Query 7.7 Countries that account for top 50% of the sales amount.
WITH TotalSales AS (
   SELECT SUM(SalesAmount) AS SalesAmount FROM Sales ),
SalesCountry AS (
   SELECT CountryName, SUM(SalesAmount) AS SalesAmount
   FROM Sales S, Customer C, City Y, State T, Country O
   WHERE S.CustomerKey = C.CustomerKey AND C.CityKey = Y.CityKey AND
         Y.StateKey = T.StateKey AND T.CountryKey = O.CountryKey
   GROUP BY CountryName ),
CumulSalesCountry AS (
   SELECT S.*, SUM(SalesAmount) OVER (ORDER BY SalesAmount DESC
          ROWS UNBOUNDED PRECEDING) AS CumulSales
   FROM SalesCountry S )
SELECT C.CountryName, C.SalesAmount, C.SalesAmount / T.SalesAmount AS
       PercSales, C.CumulSales, C.CumulSales / T.SalesAmount AS CumulPerc
FROM CumulSalesCountry C, TotalSales T
WHERE C.CumulSales <= (
   SELECT MIN(CumulSales) FROM CumulSalesCountry
   WHERE CumulSales >= (
      SELECT 0.5 * SUM(SalesAmount) FROM SalesCountry ) )
ORDER BY SalesAmount DESC
The SalesCountry table aggregates the sales by country, while the CumulSalesCountry table computes the cumulative sales by defining a window containing all rows sorted by descending sales amount; for each row, it sums the sales of the current row and of all preceding rows in the window. The main query then retrieves the countries of CumulSalesCountry whose cumulative sales amount is less than or equal to the minimum cumulative value that reaches 50% of the total sales.
Query 7.8 Total sales and average monthly sales by employee and year.
WITH MonthlySalesEmp AS (
   SELECT E.FirstName + ' ' + E.LastName AS EmployeeName,
          D.Year, D.MonthNumber, SUM(SalesAmount) AS SalesAmount
   FROM Sales S, Employee E, Date D
   WHERE S.EmployeeKey = E.EmployeeKey AND S.OrderDateKey = D.DateKey
   GROUP BY E.FirstName, E.LastName, D.Year, D.MonthNumber )
SELECT EmployeeName, Year, SUM(SalesAmount) AS SalesAmount,
       AVG(SalesAmount) AS AvgMonthlySales
FROM MonthlySalesEmp
GROUP BY EmployeeName, Year
ORDER BY EmployeeName, Year
The table MonthlySalesEmp computes the monthly sales of each employee. By grouping this table by employee and year in the main query, the SUM and AVG functions yield, respectively, the total yearly sales and the average monthly sales of each employee.
Query 7.9 Total sales amount and discount amount per product and month.
SELECT P.ProductName, D.Year, D.MonthNumber,
       SUM(S.SalesAmount) AS SalesAmount,
       SUM(S.UnitPrice * S.Quantity * S.Discount) AS TotalDisc
FROM Sales S, Date D, Product P
WHERE S.OrderDateKey = D.DateKey AND S.ProductKey = P.ProductKey
GROUP BY P.ProductName, D.Year, D.MonthNumber
ORDER BY P.ProductName, D.Year, D.MonthNumber
Here, we group the sales by product and month. Then, the SUM aggregation function is used to obtain the total sales amount and the total discount amount.
Query 7.10 Monthly year-to-date sales for each product category.
WITH SalesCategoryMonth AS (
   SELECT CategoryName, Year, MonthNumber,
          SUM(SalesAmount) AS SalesAmount
   FROM Sales S, Product P, Category C, Date D
   WHERE S.OrderDateKey = D.DateKey AND
         S.ProductKey = P.ProductKey AND P.CategoryKey = C.CategoryKey
   GROUP BY CategoryName, Year, MonthNumber ),
CategoryMonth AS (
   SELECT DISTINCT CategoryName, Year, MonthNumber, MonthName
   FROM Category, YearMonth )
SELECT C.CategoryName, C.MonthName, C.Year, SalesAmount, SUM(SalesAmount)
   OVER (PARTITION BY C.CategoryName, C.Year ORDER BY
         C.MonthNumber ROWS UNBOUNDED PRECEDING) AS YTDSalesAmount
FROM CategoryMonth C LEFT OUTER JOIN SalesCategoryMonth S ON
     C.CategoryName = S.CategoryName AND C.Year = S.Year AND
     C.MonthNumber = S.MonthNumber
ORDER BY CategoryName, Year, C.MonthNumber
The query computes the year-to-date sales by category and month, including months with no sales. The table SalesCategoryMonth aggregates the sales amounts by category and month, while the table CategoryMonth computes the Cartesian product of all categories with the YearMonth view, which contains all months of the Date dimension. In the main query, a left outer join is performed with SalesCategoryMonth and, for each resulting row, a window is defined containing all rows with the same category and year. The rows of the window are sorted by month, and the cumulative sum is computed over the current row and all preceding rows.
7.3 Querying the Relational Data Warehouse in SQL 223
Query 7.11 Moving average over the last 3 months of the sales amount by product category.
SELECT C.CategoryName, C.MonthName, C.Year, SalesAmount,
AVG(SalesAmount) OVER (PARTITION BY C.CategoryName ORDER BY C.Year, C.MonthNumber ROWS 2 PRECEDING) AS MovAvg3M
FROM CategoryMonth C LEFT OUTER JOIN SalesCategoryMonth S ON
C.CategoryName = S.CategoryName AND C.Year = S.Year AND C.MonthNumber = S.MonthNumber
ORDER BY CategoryName, Year, C.MonthNumber
This query builds upon the previous one and uses the same temporary tables. It performs a left outer join between the CategoryMonth and SalesCategoryMonth tables. For each resulting row, a window is defined containing all tuples with the same category, ordered by year and month, and the average of the current row and the two preceding ones is computed.
Query 7.12 Personal sales amount made by an employee compared with the total sales amount made by herself and her subordinates during 2017.
WHERE SupervisorKey IS NOT NULL
SELECT EmployeeKey, SUM(S.SalesAmount) AS PersonalSales
WHERE S.OrderDateKey = D.DateKey AND D.Year = 2017
SUM(S.SalesAmount) AS SubordinateSales FROM Sales S, Supervision U, Date D
S.OrderDateKey = D.DateKey AND D.Year = 2017 GROUP BY SupervisorKey )
SELECT FirstName + ' ' + LastName AS EmployeeName, S1.PersonalSales,
FROM Employee E JOIN SalesEmp2017 S1 ON E.EmployeeKey = S1.EmployeeKey
LEFT OUTER JOIN SalesSubord2017 S2 ON
The table Supervision uses a recursive query to compute the transitive closure of the supervision relationship. The table SalesEmp2017 computes the total sales of each employee, while the table SalesSubord2017 computes the total sales made by the subordinates of each employee. The main query performs a left outer join between these two tables and adds the personal and subordinate sales in the SELECT clause. The COALESCE function handles the case of employees without subordinates.
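Since Query 7.12 is only partially shown above, the following sketch illustrates how the Supervision table can be obtained with a recursive common table expression. It assumes, as in the fragment above, that the Employee dimension has a SupervisorKey attribute; it is an illustration rather than the exact formulation of the original query.
-- Transitive closure of the supervision relationship: each supervisor is paired
-- with all of her direct and indirect subordinates
WITH Supervision(SupervisorKey, EmployeeKey) AS (
   -- Direct supervision links
   SELECT SupervisorKey, EmployeeKey
   FROM Employee
   WHERE SupervisorKey IS NOT NULL
   UNION ALL
   -- Indirect links: subordinates of the employees already reached
   SELECT S.SupervisorKey, E.EmployeeKey
   FROM Supervision S JOIN Employee E ON S.EmployeeKey = E.SupervisorKey )
SELECT * FROM Supervision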
Query 7.13 Total sales amount, number of products, and sum of the quantities sold for each order.
SELECT OrderNo, SUM(SalesAmount) AS SalesAmount,
       MAX(OrderLineNo) AS NbProducts, SUM(Quantity) AS Quantity
FROM Sales
GROUP BY OrderNo
ORDER BY OrderNo
The Sales fact table includes the order number and the order line number, which form a fact dimension. In the query, we group the sales by order number and use the SUM and MAX aggregation functions to obtain the requested values.
Query 7.14 For each month, total number of orders, total sales amount, and average sales amount by order.
SELECT OrderNo, OrderDateKey, SUM(SalesAmount) AS SalesAmount FROM Sales
SELECT Year, MonthNumber, COUNT(OrderNo) AS NoOrders,
SUM(SalesAmount) AS SalesAmount, AVG(SalesAmount) AS AvgAmount FROM OrderAgg O, Date D
The table OrderAgg computes the total sales amount of each order, keeping the date key so that it can be joined with the Date dimension table. Grouping the result by year and month then yields the requested aggregate values.
Query 7.15 For each employee, total sales amount, number of cities, and number of states to which she is assigned.
SELECT FirstName + ' ' + LastName AS EmployeeName,
SUM(SalesAmount) / COUNT(DISTINCT CityName) AS TotalSales,
COUNT(DISTINCT CityName) AS NoCities,
COUNT(DISTINCT StateName) AS NoStates
FROM Sales F, Employee E, Territories T, City C, State S
WHERE F.EmployeeKey = E.EmployeeKey AND E.EmployeeKey = T.EmployeeKey AND
T.CityKey = C.CityKey AND C.StateKey = S.StateKey
GROUP BY FirstName, LastName
The Territories table captures the many-to-many relationship between employees and cities. The query joins five tables and aggregates the result by employee. In the SELECT clause, the total SalesAmount is divided by the number of distinct cities assigned to each employee in the Territories table, which solves the double-counting problem discussed in Section 4.2.6.
If an attribute Percentage in table Territories states the percentage of time an employee is assigned to each city, the query above reads as follows:
SELECT FirstName + ' ' + LastName AS EmployeeName,
SUM(SalesAmount * T.Percentage) AS TotalSales,
COUNT(DISTINCT CityName) AS NoCities,
COUNT(DISTINCT StateName) AS NoStates
FROM Sales F, Employee E, Territories T, City C, State S
WHERE F.EmployeeKey = E.EmployeeKey AND E.EmployeeKey = T.EmployeeKey AND
      T.CityKey = C.CityKey AND C.StateKey = S.StateKey
GROUP BY FirstName, LastName
We can see that each sales amount is multiplied by the corresponding percentage before being summed, which accounts for the double-counting problem.
7.4 Comparison of MDX, DAX, and SQL
In the preceding sections, we used MDX, DAX, and SQL for querying the Northwind data warehouse. In this section, we compare these languages.
At first glance, SQL, MDX, and DAX appear to have similar syntax and functionality, since they can express the same set of queries. However, there are key differences among the three languages, which we discuss next.
The primary distinction lies in MDX's ability to reference multiple dimensions, which makes it particularly suited for querying multidimensional data. Although SQL can be used to query cubes, it handles only two dimensions, namely columns and rows, and DAX, being based on a tabular model, also faces difficulties with multiple dimensions and hierarchies. In practice, however, most OLAP tools cannot display results with more than two dimensions. In MDX, the cross join operator allows measures to be analyzed across several dimensions, whereas SQL uses the SELECT clause to define the column layout; DAX does not use a SELECT clause at all.
In SQL, the WHERE clause filters the data returned by a query, while in MDX it specifies a slice of the data. Although both concepts aim at restricting the result, they differ significantly. SQL's WHERE clause can include an arbitrary list of conditions to narrow down the data retrieved, whereas MDX's slicing reduces the number of dimensions, and each member in the WHERE clause must belong to a different dimension. Moreover, MDX's WHERE clause does not filter what appears on the axes; functions such as FILTER, NONEMPTY, and TOPCOUNT are used for that purpose. Similarly, in DAX filtering is achieved through the FILTER function; there is no WHERE clause or slicing concept.
Comparing the queries over the Northwind data warehouse in MDX, DAX, and SQL reveals further differences. In SQL, joins between tables must be stated explicitly, while in MDX and DAX they are implicit. In addition, SQL's inner join eliminates empty combinations unless outer joins are used to display them, whereas MDX and DAX require NON EMPTY and ISBLANK, respectively, to remove such combinations.
In SQL, roll-up operations require explicit aggregation through the GROUP BY clause and aggregation functions in the SELECT clause. In contrast, MDX performs this automatically using the aggregation functions defined in the cube, while DAX computes them during measure evaluation using the MEASURE statement. Additionally, SQL requires the display format to be specified in the query, whereas in MDX and DAX it is defined in the cube and in the model definition, respectively.
When comparing measures of the current period with those of a previous period, such as the previous month or the same month of the previous year, the approaches differ as well. In MDX, calculated members are created with the WITH MEMBER clause, while SQL requires temporary tables defined in the WITH clause to perform the necessary aggregations, together with an outer join that combines the measures of the current and previous periods. Obtaining the previous month in SQL is further complicated by the need to distinguish months in the same year from those in the previous year. DAX offers a more concise formulation, closer to MDX, using the MEASURE statement to compute the sales of the previous period. Both DAX and MDX thus provide simpler and more flexible computations, although they may appear cryptic to nonexpert users because of their more involved semantics.
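To make the SQL side of this comparison concrete, the following sketch shows an alternative formulation of the previous-month comparison of Query 7.4 using the LAG window function instead of an outer join. It reuses the SalesProdMonth temporary table and is equivalent only under the assumption that every product has a row for every month, since LAG merely looks at the preceding row.
WITH SalesProdMonth AS (
   SELECT P.ProductName, D.Year, D.MonthNumber,
          SUM(S.SalesAmount) AS SalesAmount
   FROM Sales S, Product P, Date D
   WHERE S.ProductKey = P.ProductKey AND S.OrderDateKey = D.DateKey
   GROUP BY P.ProductName, D.Year, D.MonthNumber )
SELECT ProductName, Year, MonthNumber, SalesAmount,
       -- Value of the preceding row within each product, in chronological order
       LAG(SalesAmount) OVER (PARTITION BY ProductName
          ORDER BY Year, MonthNumber) AS SalesPrevMonth
FROM SalesProdMonth
ORDER BY ProductName, Year, MonthNumber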
Top and bottom performance analysis can be carried out in all three languages: MDX provides the TOPCOUNT function, DAX provides TOPN, and SQL provides TOP. While such queries are straightforward, the languages differ significantly when cumulative sums are needed. In MDX, the TOPPERCENT function sorts the data in descending order and returns the tuples with the highest values whose cumulative total reaches a given percentage. SQL's TOP(n) PERCENT, in contrast, merely returns a percentage of the total number of tuples, so running sums must be computed with window functions, which complicates the query. DAX faces additional difficulties, since it has no equivalent of TOPPERCENT and does not support window functions, requiring a more involved combination of measure computations and filtering.
Query 7.8 illustrates the manipulation of aggregates at different granularities. MDX uses the ASCENDANTS function to move from a finer to a coarser granularity, SQL defines a temporary table at the finer granularity and aggregates it in the main query, and DAX follows a similar approach through measure computations and the SUMMARIZECOLUMNS function. For period-to-date calculations and moving averages, MDX uses the PERIODSTODATE and LAG functions, while DAX uses TOTALYTD and AVERAGEX. These functions simplify query writing for users familiar with the languages, whereas SQL relies on window functions, leading to more ad hoc query structures.
Query 7.12 illustrates aggregation in parent-child hierarchies, showing the simplicity of the MDX expression compared with the recursive query needed in SQL. DAX yields a succinct query as well, but it requires the prior computation of a measure storing the supervision path to compensate for the limited support of hierarchies in the tabular model. The dimensional capabilities of MDX are thus fully exploited when addressing recursive queries.
Queries 7.13 and 7.14 illustrate the manipulation of fact dimensions, in this case the Order dimension. The MDX formulations are concise, although their evaluation is not immediately obvious, while the SQL queries are easier to write, and DAX also handles fact dimensions well. Query 7.15 shows the manipulation of many-to-many dimensions: the SQL version must explicitly solve the double-counting problem during aggregation, yet remains simple; the MDX query solves it naturally through the bridge table defined at cube-creation time; and DAX requires additional care because of the constraints on the direction of the relationship between employees and their assigned territories.
To conclude, Table 7.1 summarizes some of the advantages and disadvantages of the three languages.
Table 7.1 Comparison of MDX, DAX, and SQL
MDX advantages:
• Expressive multidimensional model comprising facts, dimensions, hierarchies, measure groups
• Simple navigation within time dimension and hierarchies
• Easy to express business-related requests
• Fast, due to the existence of aggregations
MDX disadvantages:
• Extra effort for designing a cube and setting up aggregations
• Steep learning curve: manipulating an n-dimensional space
• Hard-to-grasp concepts: current context, execution phases, etc.
• Some operations are difficult to express, such as ordering on multiple criteria
DAX advantages:
• Simple model with tables joined implicitly through relationships
• Similar syntax as Excel functions aiming at self-service BI
• Built-in time-related calculations
• Fast, due to in-memory, compressed columnar storage
DAX disadvantages:
• Hard-to-grasp concepts, e.g., evaluation contexts
• Single functionality can be achieved in multiple ways
• Limited model, e.g., hierarchies in a single table, at most one active relationship between two entities
• Limited functionality for handling hierarchies
SQL advantages:
• Simple model using two-dimensional tables
• Standardized language supported by multiple systems
• Easy-to-understand semantics of queries
• Various ways to relate tables: joins, derived tables, correlated queries, common table expressions, etc.
SQL disadvantages:
• Tables must be joined explicitly in queries
• No concept of row ordering and hierarchies: navigating dimensions may be complex
• Analytical queries and time-related computations may be difficult to express
• Analytical queries may be costly
KPIs for the Northwind Case Study
KPIs in Analysis Services Multidimensional
In a multidimensional model, a cube can contain a collection of key performance indicators (KPIs), of which only the metadata are stored in the cube; a set of MDX functions uses this metadata to retrieve the KPI values. Each KPI is defined by five properties, namely value, goal, status, trend, and weight, each given by an MDX expression that returns a numeric value. The status and trend expressions return values between −1 and 1, which are used to display graphical indicators for these properties, while the weight determines the contribution of the KPI to its parent KPI, if any. Analysis Services automatically creates hidden calculated members on the Measures dimension for each KPI property, so that they can be used in MDX expressions.
Consider a sales performance KPI whose goal is a 15% year-over-year growth, which requires monitoring the monthly sales figures against this target. Sales performance is deemed satisfactory when actual sales reach at least 95% of the goal, an alert is raised when sales fall between 85% and 95%, and immediate action is required when sales drop below 85% of the goal. Tracking the sales trend is also important: a sales amount exceeding the expectation by 20% is a positive indicator that should be highlighted, whereas a 20% shortfall calls for prompt intervention.
The MDX query that computes the goal of the KPI is given next:
WITH MEMBER Measures.SalesPerformanceGoal AS
WHEN ISEMPTY(PARALLELPERIOD([Order Date].Calendar.Month, 12, [Order Date].Calendar.CurrentMember))
PARALLELPERIOD([Order Date].Calendar.Month, 12, [Order Date].Calendar.CurrentMember) )
SELECT { [Sales Amount], SalesPerformanceGoal } ON COLUMNS,
[Order Date].Calendar.Month.MEMBERS ON ROWS
The CASE statement sets the sales goal of a month to the actual monthly sales whenever the corresponding month of the previous year falls outside the time frame of the cube. Since the Northwind data warehouse records sales starting in July 2016, the sales goal therefore coincides with the actual sales up to June 2017.
In Visual Studio, we can establish a Key Performance Indicator (KPI) called Sales Performance by supplying MDX expressions for each specified property.
• Value: The measure defining the KPI is [Measures].[Sales Amount].
• Goal: The goal of increasing last year's sales amount by 15% is given by FORMAT(CASE ... END, '$###,##0.00'), where the CASE expression is as in the query above.
• Status: We select the traffic light indicator for displaying the status of the KPI. Therefore, the corresponding MDX expression must return a value between −1 and 1. The KPI browser displays a red traffic light when the status is −1, a yellow one when the status is 0, and a green one when the status is 1. The MDX expression is given next:
CASE
   WHEN KpiValue("Sales Performance") / KpiGoal("Sales Performance") >= 0.95 THEN 1
   WHEN KpiValue("Sales Performance") / KpiGoal("Sales Performance") < 0.85 THEN -1
   ELSE 0
END
The functions KpiValue and KpiGoal retrieve, respectively, the actual value and the goal value of a KPI. The status is obtained by dividing the actual value by the goal: the result is 1 if the ratio is at least 95%, −1 if it is below 85%, and 0 otherwise.
• Trend: Among the available graphical indicators, we select the status arrow. The associated MDX expression is given next:
WHEN ( KpiValue("Sales Performance") - KpiGoal("Sales Performance") ) / KpiGoal("Sales Performance") > 0.2
WHEN ( KpiValue("Sales Performance") - KpiGoal("Sales Performance") ) / KpiGoal("Sales Performance") = 0.2, 1,
( [Sales Amount] - [Sales Target] ) / [Sales Target] < -0.2, -1,
"Sales Amount", FORMAT( [Sales Amount], "$###,##0.00" ),
"Sales Target", FORMAT( [Sales Target], "$###,##0.00" ),
"Sales Status", [Sales Status], "Sales Trend", [Sales Trend] )
The Sales Target measure uses the variable SalesAmountPY to obtain the previous year's sales amount; the target is the current sales amount if no previous value exists, and a 15% increase over the previous value otherwise. The Sales Status measure returns 1 when the ratio of the actual value to the target is at least 95%, −1 when it is below 85%, and 0 otherwise. The Sales Trend measure compares the actual and target values, returning 1 when the relative difference is at least 20%, −1 when it falls below −20%, and 0 otherwise. To visualize these KPIs in Power BI Desktop, the Tabular Editor must be installed, after which the Sales Amount measure can be turned into a KPI.
Fig 7.2 Definition of the Sales Performance KPI in Power BI Desktop
Fig 7.3 Display of the Sales Performance KPI in Power BI Desktop
Dashboards for the Northwind Case Study
Dashboards in Reporting Services
Reporting Services is a server-based platform for creating reports from various data sources, and it comprises three main components: the client, the report server, and the report server databases. Visual Studio typically serves as the client for authoring dashboards. The report server is responsible for authentication, data processing, report rendering, scheduling, and delivery. Reports obtain their data from one or more data sources, which can connect to several providers such as SQL Server and Oracle, while two databases, ReportServer and ReportServerTempDB, store the metadata and temporary data associated with the reports.
Reporting Services offers a variety of elements for building dashboards, including several chart types and report objects such as gauges for KPIs, images, maps, data bars, sparklines, and indicators. These components can be combined with tabular data to enhance visualization and insight.
Fig 7.4 Defining the Northwind dashboard in Reporting Services using Visual Studio
The Northwind dashboard is defined in Visual Studio as illustrated in Figure 7.4, with the Northwind data warehouse as data source. The dashboard comprises five datasets, each associated with an SQL query and with the collection of fields returned by that query. The SQL query displayed in the dialog box corresponds to the top left chart of the report. Figure 7.5 shows the resulting dashboard, whose components we describe next.
Fig 7.5 Dashboard of the Northwind case study in Reporting Services
The top left chart shows the monthly sales together with those of the same month of the previous year. The SQL query is given next.
WITH MonthlySales AS (
   SELECT Year(D.Date) AS Year, Month(D.Date) AS Month,
          SUBSTRING(DATENAME(month, D.Date), 1, 3) AS MonthName,
          SUM(S.SalesAmount) AS MonthlySales
   FROM Sales S, Date D
   WHERE S.OrderDateKey = D.DateKey
   GROUP BY Year(D.Date), Month(D.Date), DATENAME(month, D.Date) )
SELECT MS.Year, MS.Month, MS.MonthName, MS.MonthlySales,
       PYMS.MonthlySales AS PYMonthlySales
FROM MonthlySales MS, MonthlySales PYMS
WHERE PYMS.Year = MS.Year - 1 AND MS.Month = PYMS.Month
The table MonthlySales computes the monthly sales figures; the MonthName column keeps the first three letters of each month, which are used to label the x-axis of the chart. The main query then joins two instances of MonthlySales to produce the final result.
The top right gauge compares the sales of April 2018 with those of April 2017, the last order being recorded on May 5, 2018. The gauge uses a gradient scale from white to light blue covering the range from 0% to 115%, reflecting the KPI goal of a 15% increase in monthly sales over the same month of the previous year.
WITH MonthlySales AS (
   SELECT Year(D.Date) AS Year, Month(D.Date) AS Month,
          SUM(S.SalesAmount) AS MonthlySales
   FROM Sales S, Date D
   WHERE S.OrderDateKey = D.DateKey
   GROUP BY Year(D.Date), Month(D.Date) ),
LastMonth AS (
   SELECT Year(MAX(D.Date)) AS Year, Month(MAX(D.Date)) AS MonthNumber
   FROM Sales S, Date D
   WHERE S.OrderDateKey = D.DateKey )
SELECT MS.Year, MS.Month,
       (MS.MonthlySales - PYMS.MonthlySales) / PYMS.MonthlySales AS Percentage
FROM LastMonth L, YearMonthPM Y, MonthlySales MS, MonthlySales PYMS
WHERE L.Year = Y.Year AND L.MonthNumber = Y.MonthNumber AND
      Y.PM_Year = MS.Year AND Y.PM_MonthNumber = MS.Month AND
      PYMS.Year = MS.Year - 1 AND MS.Month = PYMS.Month
The MonthlySales table computes the total sales of each month, while the LastMonth table obtains the year and month of the most recent order. The main query joins LastMonth with the YearMonthPM view to obtain the month preceding the last order, and further joins retrieve the sales of that month as well as those of the same month of the previous year.
The query for the middle left chart, which shows the shipping costs with respect to the total sales by month, is given next.
SELECT Year(D.Date) AS Year, Month(D.Date) AS Month,
       SUBSTRING(DATENAME(mm, D.Date), 1, 3) AS MonthName,
       SUM(S.SalesAmount) AS TotalSales, SUM(S.Freight) AS TotalFreight,
       SUM(S.Freight) / SUM(S.SalesAmount) AS Percentage
FROM Sales S, Date D
WHERE S.OrderDateKey = D.DateKey
GROUP BY Year(D.Date), Month(D.Date), DATENAME(mm, D.Date)
ORDER BY Year, Month, DATENAME(mm, D.Date)
Here we compute the total sales and the total freight cost by month, as well as the percentage that the latter represents of the former.
The gauge displayed in the middle right of Fig 7.5 illustrates the year-to-date shipping costs to sales ratio, with a target KPI aimed at keeping shipping costs below 5% of total sales.
WITH LastMonth AS (
   SELECT Year(MAX(D.Date)) AS Year, Month(MAX(D.Date)) AS Month
   FROM Sales S, Date D
   WHERE S.OrderDateKey = D.DateKey )
SELECT SUM(S.SalesAmount) AS TotalSales, SUM(S.Freight) AS TotalFreight,
       SUM(S.Freight) / SUM(S.SalesAmount) AS Percentage
FROM LastMonth L, YearMonthPM Y, Sales S, Date D
WHERE L.Year = Y.Year AND L.Month = Y.MonthNumber AND
      S.OrderDateKey = D.DateKey AND Year(D.Date) = Y.PM_Year AND
      Month(D.Date) <= Y.PM_MonthNumber
As for the previous gauge, we use LastMonth and YearMonthPM to compute the requested value for the last complete month, that is, April 2018.
Finally, the query for the bottom table showing the three lowest-performing employees is given next:
WITH LastDay AS (
   SELECT Year(MAX(Date)) AS Year, DATEPART(dy, MAX(Date)) AS DayNbYear
   FROM Sales S, Date D
   WHERE S.OrderDateKey = D.DateKey ),
TgtSales AS (
   SELECT S.EmployeeKey, SUM(S.SalesAmount) * 1.05 AS TargetSales
   FROM Sales S, Date D, LastDay L
   WHERE S.OrderDateKey = D.DateKey AND Year(D.Date) = L.Year - 1
   GROUP BY S.EmployeeKey ),
ExpSales AS (
   SELECT S.EmployeeKey, SUM(S.SalesAmount) AS YTDSales,
          SUM(S.SalesAmount) * 365 / L.DayNbYear AS ExpectedSales
   FROM Sales S, Date D, LastDay L
   WHERE S.OrderDateKey = D.DateKey AND Year(D.Date) = L.Year
   GROUP BY S.EmployeeKey, L.DayNbYear )
SELECT TOP(3) FirstName + ' ' + LastName AS Name, YTDSales, ExpectedSales,
       TargetSales, ExpectedSales / TargetSales AS Percentage,
       RANK() OVER (ORDER BY ExpectedSales / TargetSales) AS Rank
FROM TgtSales T, ExpSales S, Employee E
WHERE T.EmployeeKey = S.EmployeeKey AND S.EmployeeKey = E.EmployeeKey
ORDER BY Percentage
The table LastDay computes the year and the day-of-the-year number of the last order, while the table TgtSales computes the target sales of each employee, namely a 5% increase over the previous year's sales. The table ExpSales computes the year-to-date sales and the expected sales at the end of the year, obtained by accounting for the remaining days of the year. The main query joins the last two tables with the Employee table, computes the ratio of expected to target sales, and retrieves the three lowest-performing employees.
Dashboards in Power BI
This section shows how the dashboard discussed above can be implemented in Power BI, as illustrated in Figure 7.6. We detail next the DAX measures used for the various components of the dashboard.
Fig 7.6 Dashboard of the Northwind case study in Power BI
The top left chart shows the monthly sales compared with those of the same month in the previous year. The required measures are given next.
[Sales Amount] = SUM ( [SalesAmount] )
[PY Sales] = CALCULATE( [Sales Amount], SAMEPERIODLASTYEAR( 'Date'[Date] ) )
The top right gauge shows the percentage change in sales between the last month and the same month in the previous year The required measures are given next.
[LastOrderDate] = CALCULATE( MAX( 'Date'[Date] ), FILTER( ALL( 'Sales' ), TRUE ) ) [LM Sales] =
VAR LastOrderEOMPM = CALCULATE ( MAX ( 'Date'[YearMonth] ),
FILTER ( 'Date', 'Date'[Date] = EOMONTH ( [LastOrderDate], -1 ) ) ) RETURN
CALCULATE( [Sales Amount], FILTER( ALL( 'Date' ), 'Date'[YearMonth] = LastOrderEOMPM ) )
VAR LastOrderEOMPY = CALCULATE ( MAX ( 'Date'[YearMonth] ),
FILTER ( 'Date', 'Date'[Date] = EOMONTH ( [LastOrderDate], -13 ) ) ) RETURN
CALCULATE( [Sales Amount], FILTER( ALL( 'Date' ), 'Date'[YearMonth] = LastOrderEOMPY ) )
[Perc Change Sales] = DIVIDE([LM Sales] - [LMPY Sales], [LMPY Sales], 0)
The LastOrderDate measure obtains the date of the last order, LM Sales computes the sales of the month preceding that date, and LMPY Sales computes the sales of the same month in the previous year. For this we use the EOMONTH function, which returns the last day of the month a given number of months before or after a specified date. Perc Change Sales then uses the previous two measures to compute the percentage change in sales. Finally, we also define the minimum, maximum, and target values for the gauge.
Next, we show the measures for the middle left chart, which displays the shipping costs with respect to the total sales by month.
[Total Freight to Sales Ratio] = DIVIDE( [Total Freight], [Sales Amount], 0 )
The middle right gauge shows the year-to-date shipping costs to sales ratio.
In addition to computing this measure, we need to set the minimum, maximum, and target values for the gauge, as given next.
[YTD Freight] = CALCULATE( [Total Freight],
FILTER( 'Date', 'Date'[Year] = MAX( 'Date'[Year] ) ) )
[YTD Sales] = CALCULATE( [Sales Amount],
FILTER( 'Date', 'Date'[Year] = MAX( 'Date'[Year] ) ) )
[YTD Freight to Sales Ratio] = DIVIDE( [YTD Freight], [YTD Sales], 0 )
[Freight to Sales Ratio Min Value] = 0.0
[Freight to Sales Ratio Max Value] = 0.2
[Freight to Sales Ratio Target Value] = 0.05
Finally, the measures for the bottom table, which shows the end-of-year forecast for the three lowest-performing employees, are given next:
[LastOrderDayNbYear] = CALCULATE( MAX ( 'Date'[DayNbYear] ),
FILTER ( 'Date', 'Date'[Year] = MAX ( 'Date'[Year] ) ) )
[Expected Sales] = [YTD Sales] * DIVIDE( 365, [LastOrderDayNbYear], 1 )
[Target Sales] = CALCULATE( [Sales Amount] * 1.05,
FILTER(ALL('Date'), 'Date'[Year] = MAX ('Date'[Year] ) - 1))
[Expected Quota] = DIVIDE( [Expected Sales], [Target Sales], 0 )
[Quota Rank] = RANKX( FILTER( ALLSELECTED( Employee ),
The first step computes the day-of-the-year number of the last order date, which is 124 for May 4, 2018. The expected sales are obtained by multiplying the year-to-date sales by a factor accounting for the remaining days of the year. The target sales correspond to a 5% increase over the previous year's sales, and the expected quota is the ratio of the expected sales to the target sales. Finally, employees are ranked by their expected quota, and the Power BI interface filters the display to show only three employees.
Summary
This chapter explored the use of MDX, DAX, and SQL for querying data warehouses through a series of queries addressed to the Northwind data warehouse. We compared the strengths and weaknesses of the three languages, emphasizing their expressiveness. We then defined KPIs for the Northwind case study in both Analysis Services Multidimensional and Tabular. The chapter concluded by showing how to build dashboards for the Northwind case study using Reporting Services and Power BI.
Review Questions
7.1 What are key performance indicators or KPIs? What are they used for? Detail the conditions a good KPI must satisfy.
7.2 Define a collection of KPIs using an example of an application domain that you are familiar with.
7.3 Explain the notion of dashboard. Compare the different definitions for dashboards.
7.4 What types of dashboards do you know? How would you use them?
7.5 Comment on the dashboard design guidelines.
7.6 Define a dashboard using an example of an application domain that you are familiar with.
Implementation and Deployment
Physical Modeling of Data Warehouses
This section provides an overview of three fundamental techniques for improving data warehouse performance: materialized views, indexing, and partitioning. We study these techniques in more detail in the rest of the chapter.
In the relational model, a view is essentially a stored query, defined over base tables or other views, that can be used like an ordinary table. A materialized view is a view that is physically stored in the database. Materialized views improve query performance by precomputing expensive operations such as joins and aggregations and storing their results, so that queries that only access materialized views execute faster, at the price of additional storage space.
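As a minimal sketch of how such a precomputed aggregate can be declared, SQL Server implements materialized views as indexed views: a schema-bound view whose result is stored by a unique clustered index. The view and index names below are illustrative, and we assume that the Sales fact table resides in the dbo schema and that SalesAmount is not nullable, as indexed views require.
-- Schema-bound aggregate view over the fact table (COUNT_BIG(*) is mandatory)
CREATE VIEW dbo.SalesByProductDate WITH SCHEMABINDING AS
   SELECT ProductKey, OrderDateKey,
          SUM(SalesAmount) AS SumSales, COUNT_BIG(*) AS NbRows
   FROM dbo.Sales
   GROUP BY ProductKey, OrderDateKey;
GO
-- Creating the unique clustered index materializes the view's result
CREATE UNIQUE CLUSTERED INDEX SalesByProductDate_Idx
   ON dbo.SalesByProductDate (ProductKey, OrderDateKey);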
A well-known problem of materialized views is updating them: any change to the underlying base tables must be propagated to the view. For efficiency, such updates should be performed incrementally, avoiding the recomputation of the whole view from scratch; this requires capturing the modifications to the base tables and determining how they affect the content of the view. View maintenance has been studied extensively, and this chapter covers the foundational approaches in this field.
In data warehousing, the number of aggregates grows exponentially with the number of dimensions and hierarchies, so it is impractical to precompute and materialize all possible aggregations. A critical problem in data warehouse design is therefore the selection of the materialized views that minimize the total query response time and the maintenance cost, given limited resources such as storage space and materialization time. Many algorithms have been proposed for this selection problem, and some commercial DBMSs provide tools that choose materialized views based on historical query workloads. Once the materialized views have been defined, queries must be rewritten to take advantage of them, a process known as query rewriting, whose goal is to use the materialized views as much as possible, even when they only partially satisfy the query conditions. Selecting the best rewriting, in particular for aggregation queries, is a complex problem, and the existing algorithms impose restrictions on the original query and on the candidate materialized views to make the rewriting feasible.
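As a sketch of query rewriting, a query asking for the total sales per product could be answered from the SalesByProductDate view of the previous example instead of from the fact table; depending on the system and edition, this rewriting may be applied automatically by the optimizer or requested explicitly.
-- Original formulation over the fact table
SELECT ProductKey, SUM(SalesAmount) AS SalesAmount
FROM Sales
GROUP BY ProductKey;

-- Rewriting over the materialized view, which is typically much smaller
SELECT ProductKey, SUM(SumSales) AS SalesAmount
FROM SalesByProductDate
GROUP BY ProductKey;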
One drawback of materialized views is that the queries to be materialized must be anticipated, while many data warehouse queries are ad hoc and unpredictable. Queries that have not been precomputed must be evaluated at run time, which calls for efficient indexing methods. Traditional indexing techniques used in OLTP systems are not well suited to multidimensional data: OLTP transactions typically access only a small number of tuples, whereas data warehouse queries often access a much larger portion of the data. Two indexing methods commonly used in data warehouses are bitmap indexes and join indexes. Bitmap indexes are particularly effective for columns with low cardinality, and various compression techniques enhance their applicability. Join indexes, in turn, materialize the join between two tables by storing the pairs of row identifiers that participate in the join, effectively linking dimension values to the rows of the fact table. For instance, a join index between a sales fact table and a client dimension keeps, for each client, the list of row identifiers of the sales related to that client. Join indexes can be combined with bitmap indexes for improved query performance.
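Neither index type is defined in the SQL standard; the following sketch uses Oracle-style syntax as an illustration (SQL Server offers related functionality through columnstore and conventional indexes), and the index names are invented for the example.
-- Bitmap index on a low-cardinality column of the Product dimension
CREATE BITMAP INDEX Product_Category_BIdx ON Product (CategoryKey);

-- Bitmap join index: fact rows of Sales are indexed by the category of the
-- product sold, materializing the join with the Product dimension
CREATE BITMAP INDEX Sales_Category_BJIdx
   ON Sales (Product.CategoryKey)
   FROM Sales, Product
   WHERE Sales.ProductKey = Product.ProductKey;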
Partitioning is a technique used in relational databases whereby the contents of a relation are divided into several files that can be processed more efficiently. For instance, a table can be split so that frequently accessed attributes are stored in one partition while less frequently used attributes are placed in another. In data warehouses, partitioning is often time-based, each partition containing the data of a given time period, such as a year or a range of months.
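A minimal sketch of time-based partitioning in SQL Server syntax follows; it assumes that the date key is an integer of the form YYYYMMDD, and the partition function, scheme, table, and boundary values are all illustrative.
-- Map order date keys to yearly ranges
CREATE PARTITION FUNCTION SalesByYearPF (INT)
   AS RANGE RIGHT FOR VALUES (20160101, 20170101, 20180101);

-- Place every partition on the primary filegroup
CREATE PARTITION SCHEME SalesByYearPS
   AS PARTITION SalesByYearPF ALL TO ([PRIMARY]);

-- Simplified fact table partitioned on the order date key
CREATE TABLE SalesPart (
   ProductKey INT, CustomerKey INT, OrderDateKey INT,
   SalesAmount DECIMAL(10,2) )
ON SalesByYearPS (OrderDateKey);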
In the following sections we study these techniques in detail.
Materialized Views
A view in SQL is a derived relation defined with the CREATE VIEW statement and recomputed every time it is accessed. A materialized view, in contrast, is physically stored in the database; it improves query performance by acting as a cache that can be accessed directly, without querying the base relations. When the base relations are updated, however, the materialized views must be updated as well, a process known as view maintenance. Incremental view maintenance updates the view from the changes to the underlying relations without recomputing it entirely, and is a crucial technique in data warehousing.
The view maintenance problem can be analyzed through four dimensions:
• Information: Refers to the information available for view maintenance,like integrity constraints, keys, access to base relations, and so on.
• Modification: Refers to the types of changes handled by the maintenance algorithm, namely insertions, deletions, and updates, where updates are typically processed as a deletion followed by an insertion.
• Language: Refers to the language used to define the view, most often SQL Aggregation and recursion are also issues in this dimension.
• Instance: Refers to whether or not the algorithm works for every instance of the database, or for a subset of all instances.
In a sales database, the relation Sales(ProductKey, CustomerKey, Quantity) tracks product orders, while the materialized view TopProducts identifies products that have received orders exceeding 150 units from at least one customer.
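A possible definition of this view, using only the relation and the condition mentioned in the text, is the following sketch.
-- Products ordered in a quantity greater than 150 by at least one customer
CREATE VIEW TopProducts AS
   SELECT DISTINCT ProductKey
   FROM Sales
   WHERE Quantity > 150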
Inserting the tuple (p2,c3,110) into the table Sales does not affect the view, since the tuple does not satisfy the view's condition. Inserting the tuple (p2,c3,160), on the other hand, may alter the view. An algorithm can update the view without accessing the base relation: it simply adds the product to the view if it is not already present.
To handle the deletion of the tuple (p2,c3,160) from the table Sales, we must first verify whether p2 has been ordered by some other customer in a quantity greater than 150 before removing it from the view. This verification requires a scan of the Sales relation.
In summary, insertions can sometimes be handled using only the materialized view, whereas deletions require additional information. As another example, consider the view FoodCustomers, which contains the customers who have ordered at least one product in the food category, using a simplified and denormalized Product dimension. The view is defined as the projection over CustomerKey of the join between Sales and the products whose category is 'Food', as sketched below.
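Under these assumptions, the view could be defined along the following lines; this is a sketch using the CategoryName attribute of the denormalized Product dimension.
CREATE VIEW FoodCustomers AS
   SELECT DISTINCT CustomerKey
   FROM Sales S, Product P
   WHERE S.ProductKey = P.ProductKey AND P.CategoryName = 'Food'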
When the tuple (p3,c4,170) is inserted into the table Sales, we cannot know whether c4 must appear in the FoodCustomers view unless we check in the base relations whether p3 belongs to the food category.
These examples show that view maintenance problems must be characterized by the kind of update and by the operations used in the view definition. In the database literature, two main classes of view maintenance algorithms have been studied:
• Algorithms using full information, which means the views and the base relations.
• Algorithms using partial information, namely, the materialized views and the key constraints.
Three types of views are addressed by these algorithms: nonrecursive views, outer-join views, and recursive views. We only discuss the first two, since recursive views are beyond the scope of this book.
The counting algorithm is the basic method for nonrecursive views, which may include join, union, negation, and aggregation. The algorithm counts the number of alternative derivations of each tuple in the view, which makes it possible to decide whether a tuple must be deleted from the view when a tuple of a base relation is deleted. To illustrate the idea, consider again the relation FoodCustomers introduced above, now extended with the number of derivations:
CREATE VIEW FoodCustomers(CustomerKey, Count) AS (
   SELECT CustomerKey, COUNT(*) AS Count
   FROM Sales S, Product P
   WHERE S.ProductKey = P.ProductKey AND P.CategoryName = 'Food'
   GROUP BY CustomerKey )
Figure 8.1a shows an instance of the relation Sales, and Figure 8.1b shows the corresponding view FoodCustomers, where the additional column Count indicates the number of derivations of each tuple. For example, the tuple (c2,2) states that customer c2 purchased two items of the food category. We assume that the only products in this category are p1 and p2, and that the tuples shown pertain exclusively to these products.
Fig 8.1 An example of the counting algorithm: (a) Instance of the Sales relation; (b) View FoodCustomers, including the number of possible derivations of each tuple; (c) View FoodCustomers after the deletion of (p1,c2,100)
Suppose now that the tuple (p1,c2,100) is deleted from Sales. Customer c2 remains in the view FoodCustomers, since it can still be derived from the tuple (p2,c2,50). The counting algorithm handles this by computing the relation ∆−(FoodCustomers), which contains the tuples that can be derived from (p1,c2,100) and are therefore affected by its deletion, associating a −1 with each of them; in this example, ∆−(FoodCustomers) contains a single tuple for customer c2. For insertions, ∆+(FoodCustomers) analogously associates a 1 with each tuple. The updated view, shown in Fig 8.1c, is obtained by joining ∆−(FoodCustomers) with the materialized view FoodCustomers on the CustomerKey attribute and subtracting ∆−(FoodCustomers).Count from FoodCustomers.Count. Since c2 has two derivations (Fig 8.1b), it remains in the view with one derivation removed. If the tuple (p2,c2,50) were later deleted, c2 would also disappear from the view, as would c1 upon the deletion of (p1,c1,20).
We now consider views defined with an outer join, using the two relations Product(ProdID, ProdName, ShipID) and Shipper(ShipID, ShipName) shown in Figures 8.2a and 8.2b. An example of such a view is given next.
CREATE VIEW Proj_Shipper AS (
   SELECT P.ProdID, P.ProdName, S.ShipID, S.ShipName
   FROM Product P FULL OUTER JOIN Shipper S ON P.ShipID = S.ShipID )
Figure 8.2d shows modifications to the Product relation, given as insertions ∆+(Product) and deletions ∆−(Product); updates are treated as deletions followed by insertions. To maintain the view, the full outer join is rewritten as a left or a right outer join, depending on whether the updates concern the left or the right table of the join, and the result is then merged with the view to be updated.
SELECT P.ProdID, P.ProdName, S.ShipID, S.ShipName
FROM ∆ ( Product ) P LEFT OUTER JOIN Shipper S ON P.ShipID = S.ShipID
SELECT P.ProdID, P.ProdName, S.ShipID, S.ShipName
FROM Product P RIGHT OUTER JOIN ∆ ( Shipper ) S ON P.ShipID = S.ShipID
Data Cube Maintenance
In data warehouses, summary tables, which contain aggregate functions, are essential for efficient query answering. The problem is to keep these tables up to date as the source data are modified, while accessing the base data as little as possible and keeping the summary tables available to users as long as possible. This can be done in two ways: either the summary tables are recomputed from scratch, or incremental view maintenance techniques are used to propagate the changes.
Fig 8.3 An example of self-maintenance of a full outer join view: (a) Proj_Shipper; (b) ∆+(Product); (c) ∆−(Product); (d) ∆+(Product) joined with Proj_Shipper; (e) Updated view Proj_Shipper
The summary-delta algorithm is the basic method for maintaining summary tables; its main goal is to reduce the time during which the summary tables are unavailable to data warehouse users while they are being updated. Although this algorithm is representative, many other techniques have been proposed in the literature, for example to manage multiple versions of the summary tables.
An aggregate function is self-maintainable if its new value can be computed solely from its previous value and from the changes to the base data. To be self-maintainable, an aggregate function must be distributive. The five classic SQL aggregate functions are self-maintainable with respect to insertions, but not with respect to deletions; in particular, MAX and MIN cannot be made self-maintainable with respect to deletions.
The summary-delta algorithm consists of two phases, called propagate and refresh. Its main advantage is that the propagate phase can be performed concurrently with normal data warehouse operation, while only the refresh phase requires taking the warehouse offline. During the propagate phase, a summary-delta table is created that records the net changes to the summary table induced by the modifications of the source data; these changes are then applied to the summary table during the refresh phase.
We will explain the algorithm with a simplified version of the Sales fact table in the Northwind case study, whose schema we show next.
Sales(ProductKey, CustomerKey, DateKey, Quantity)
Consider a view DailySalesSum defined as follows:
CREATE VIEW DailySalesSum(ProductKey, DateKey, SumQuantity, Count) AS (
   SELECT ProductKey, DateKey, SUM(Quantity) AS SumQuantity,
          COUNT(*) AS Count
   FROM Sales
   GROUP BY ProductKey, DateKey )
The Count attribute is introduced to keep the view maintainable in the presence of deletions. During the propagate phase, two tables, ∆+(Sales) and ∆−(Sales), store, respectively, the insertions into and the deletions from the fact table. In addition, a summary-delta table records the net changes to the summary table:
CREATE VIEW SD_DailySalesSum(ProductKey, DateKey,
   SD_SumQuantity, SD_Count) AS
WITH Temp AS (
   ( SELECT ProductKey, DateKey,
            Quantity AS _Quantity, 1 AS _Count
     FROM ∆+(Sales) )
   UNION ALL
   ( SELECT ProductKey, DateKey,
            -1 * Quantity AS _Quantity, -1 AS _Count
     FROM ∆−(Sales) ) )
SELECT ProductKey, DateKey, SUM(_Quantity), SUM(_Count)
FROM Temp
GROUP BY ProductKey, DateKey
In the temporary table Temp of the view definition, each tuple coming from ∆+(Sales) stores a 1 in the _Count attribute, while each tuple coming from ∆−(Sales) stores a −1; the Quantity values are likewise multiplied by 1 or −1 depending on their origin. The outer SELECT then aggregates these values: SD_SumQuantity contains the net change in the sum of the quantities for each combination of ProductKey and DateKey, and SD_Count contains the net change in the number of tuples for that combination.
In the refresh phase, the net changes from the summary-delta table are applied to the summary table The following outlines a general algorithm for refreshing data when using the SUM aggregate function.
INPUT: Summary-delta table SD_DailySalesSum
OUTPUT: Updated summary table DailySalesSum
For each tuple T in SD_DailySalesSum DO
   IF NOT EXISTS ( SELECT * FROM DailySalesSum D WHERE T.ProductKey = D.ProductKey AND
      T.DateKey = D.DateKey )
   THEN INSERT T INTO DailySalesSum
   ELSE IF EXISTS ( SELECT * FROM DailySalesSum D WHERE T.ProductKey = D.ProductKey AND
      T.DateKey = D.DateKey AND T.SD_Count + D.Count = 0 )
   THEN DELETE T FROM DailySalesSum
   ELSE UPDATE DailySalesSum
      SET SumQuantity = SumQuantity + T.SD_SumQuantity,
          Count = Count + T.SD_Count
      WHERE ProductKey = T.ProductKey AND DateKey = T.DateKey
The algorithm examines each tuple T of the summary-delta table and checks whether the corresponding combination is present in the view. If it is not, T is inserted; if it is present but all the tuples of the combination (ProductKey, DateKey) have been deleted (i.e., T.SD_Count + D.Count = 0), the tuple is deleted from the view; otherwise, the existing tuple of the view is updated with the new sum and count. For example, in Fig 8.4, the insertion of the tuple (p4,c2,t4,100) appears in the changes, the summary-delta table records the net effect for the combination (p4,t4), and the update produces the final tuple (p4,t4,200,16).
Figure 8.5 illustrates the case of the MAX aggregate function. Here, the view DailySalesMax needs an additional column that counts the number of tuples attaining the maximum value, rather than simply counting the tuples of each combination of ProductKey and DateKey as done for SUM.
CREATE VIEW DailySalesMax(ProductKey, DateKey, MaxQuantity, Count) AS (
   SELECT ProductKey, DateKey, MAX(Quantity), COUNT(*)
   FROM Sales S1
   WHERE Quantity = (
      SELECT MAX(Quantity) FROM Sales S2
      WHERE S1.ProductKey = S2.ProductKey AND S1.DateKey = S2.DateKey )
   GROUP BY ProductKey, DateKey )
The summary-delta table, shown in Fig 8.5b, needs a column storing the maximum value of the inserted or deleted tuples, together with a column counting the number of such insertions or deletions:
Fig 8.4 An example of the propagate and refresh algorithm with aggregate function SUM: (a) Original view DailySalesSum; (b) Summary-delta table SD_DailySalesSum; (c) Updated view DailySalesSum
In this example, positive values correspond to insertions and negative values to deletions: the first four entries of the changes record additions of sales, while the last three record removals.
CREATE VIEW SD_DailySalesMax(ProductKey, DateKey,
   SD_MaxQuantity, SD_Count) AS (
   ( SELECT ProductKey, DateKey, MAX(Quantity), COUNT(*)
     FROM ∆+(Sales) S1
     WHERE Quantity = (
        SELECT MAX(Quantity) FROM ∆+(Sales) S2
        WHERE S1.ProductKey = S2.ProductKey AND S1.DateKey = S2.DateKey )
     GROUP BY ProductKey, DateKey )
   UNION ALL
   ( SELECT ProductKey, DateKey, MAX(Quantity), -1 * COUNT(*)
     FROM ∆−(Sales) S1
     WHERE Quantity = (
        SELECT MAX(Quantity) FROM ∆−(Sales) S2
        WHERE S1.ProductKey = S2.ProductKey AND S1.DateKey = S2.DateKey )
     GROUP BY ProductKey, DateKey ) )
Fig 8.5 An example of the propagate and refresh algorithm with aggregate function MAX: (a) Original view DailySalesMax; (b) Summary-delta table SD_DailySalesMax; (c) Updated view DailySalesMax
After the refresh, depicted in Fig 8.5c, the view is modified according to the contents of the summary-delta table. The tuple for p1 is inserted into the view, since the view has no entry for it. The tuple for p2 is not propagated, because its maximum value is lower than the one in the view. The tuple for p3 has the same maximum value as the view, so the counter is incremented to 7. The tuple for p4 requires updating the view, since its maximum value exceeds the current maximum, yielding both a new maximum and a new counter.
In the summary-delta table, the tuple for p5 has a quantity value smaller than the maximum in the view, so the view remains unchanged. The tuple for p6, which corresponds to deletions, requires further analysis, since deletions may invalidate the maximum stored in the view. The tuple for p7 illustrates the limit case: when the maximum value and the counter in the summary-delta table cancel those in the view, that is, all tuples holding the maximum value have been deleted, two scenarios arise. If other tuples with the combination (p7,t7) remain in the base table, the new maximum value and count must be recomputed from the base tables; otherwise, the tuple is simply deleted from the view.
The algorithm for refreshing the view DailySalesMax from the summary-delta table SD_DailySalesMax is left as an exercise.
Computation of a Data Cube
In Chapter 5, we discussed how a data cube can be computed with SQL queries in which the all value is represented by a null value. This requires taking the union of 2^n GROUP BY queries, one for each possible aggregation over the n dimensions. Computing the whole data cube directly from the base fact and dimension tables in this way is impractical for real-world applications, so several optimization strategies have been proposed; we explore them next to illustrate the main ideas.
Optimization methods are based on the notion of a data cube lattice, in which each node represents a possible aggregation of the fact data and there is an edge from node i to node j if j can be computed from i and i has exactly one more grouping attribute than j. For example, from an aggregate view of the Sales table by CustomerKey and ProductKey, the total sales amount by customer can be computed directly, without accessing the base table.
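For instance, assuming an aggregate table SalesByCustProd materialized at the (CustomerKey, ProductKey) granularity (the table name is illustrative), the coarser aggregation by customer can be rolled up from it rather than from the fact table:
-- Materialize the aggregate at the (CustomerKey, ProductKey) granularity
SELECT CustomerKey, ProductKey, SUM(SalesAmount) AS SalesAmount
INTO SalesByCustProd
FROM Sales
GROUP BY CustomerKey, ProductKey;

-- Roll up to the (CustomerKey) granularity without touching the fact table
SELECT CustomerKey, SUM(SalesAmount) AS SalesAmount
FROM SalesByCustProd
GROUP BY CustomerKey;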
Consider the computation of a four-dimensional data cube, represented by the lattice of Fig 8.6 with dimensions A, B, C, and D. The lattice contains only direct edges, such as the one from ABC to AB, indicating that the summary table AB can be computed from ABC; transitive edges, such as one from ABCD to AB, are not included.
Several simple optimization methods are commonly combined when computing the cube:
• Smallest-parent: Computes each view from the smallest previously computed view. In the lattice of Fig 8.6, for instance, the view AB can be computed from ABC, ABD, or ABCD, and this method chooses the smallest of them.
• Cache-results: Caches in memory an aggregation from which other ones can be computed.
• Amortize-scans: Computes in memory as many aggregations as possible,reducing the amount of table scans.
• Share-sorts: Applies only to methods based on sorting and aims at sharing costs between several aggregations.
• Share-partitions: Applies to hashing-based algorithms. When the hash table does not fit in main memory, the data are partitioned into chunks that fit in memory and the aggregation is performed for each chunk, so that the partitioning cost is shared among several aggregations.
These optimization techniques can be contradictory; for example, share-sorts may favor deriving AB from ABC, while ABD could be its smallest parent. Advanced cube computation algorithms therefore aim at combining several of these simple optimization techniques into efficient query evaluation plans. We discuss next a sorting-based method; similar hashing-based approaches also exist. Note that most algorithms require estimating the sizes of the aggregate views in the lattice.
The PipeSort algorithm combines four of the optimization techniques above. By exploiting the cache-results and amortize-scans strategies, it evaluates the nodes that share a common prefix in a single scan, a method known as pipelined evaluation in database query optimization. This allows the simultaneous computation of, for example, ABCD, ABC, AB, and A, provided the attribute order of the views coincides with the sorting order of the file.
For example, if the file is sorted by A, B, C, and D, with a single scan of the first five tuples we can compute the aggregations (a1,b1,c1,200), (a1,b1,c2,500), (a1,b1,700), (a1,b2,400), and (a1,1100).
The input of the algorithm is a data cube lattice in which each edge e_ij, where node i is the parent of node j, is labeled with two costs, S(e_ij) and A(e_ij). S(e_ij) is the cost of computing j from i if i is not sorted, while A(e_ij) is the cost of computing j from i if i is already sorted. Thus, A(e_ij) ≤ S(e_ij).
In addition, we consider the lattice organized into levels, where each level k contains the views with exactly k attributes, starting from All, where k = 0. This data structure is called a search lattice.
The algorithm generates a subgraph of the search lattice in which each node has a single parent from which it is computed, either sorted or unsorted. If a node's attribute order is a prefix of its parent's order, it can be computed without sorting, at a cost A(e_ij); otherwise, the parent must be sorted, at a cost S(e_ij). In the output graph, each node can have at most one outgoing edge labeled A, while several outgoing edges can be labeled S. The algorithm aims at producing an execution plan that minimizes the sum of the edge costs.
To create the minimum cost output graph, the algorithm proceeds level by level, from level 0 to level N−1, where N is the number of levels in the search lattice. It determines the best way of computing the nodes at level k from the nodes at level k+1, transforming the problem into a weighted bipartite matching problem. For each pair of levels (k, k+1), the algorithm makes k additional copies of each node at level k+1; the edges of the original nodes carry a cost A(e_ij), while the edges of the replicated nodes carry a cost S(e_ij). This transformation yields a bipartite graph, with edges connecting nodes of different levels but not nodes of the same level. The algorithm then computes a minimum cost matching, in which each node j at level k is matched with a node i at level k+1.
When node j is matched to node i through an A() edge, j determines the attribute order in which i will be computed, so that no sorting is needed. Conversely, if j is matched to i through an S() edge, i must be sorted in order to compute j.
Fig 8.7 Minimum bipartite matching between two levels in the cube lattice
Figure 8.7 shows the graph corresponding to levels 1 and 2 of the lattice in Fig. 8.6, where solid lines represent edges of type A(e_ij) and dashed lines represent edges of type S(e_ij). Figure 8.7a includes a duplicate of each node at level 2. Figure 8.7b shows that the computation of all views incurs a cost of A(e_ij), where, for instance, A is computed from AC, B from BA, and so forth.
The matching process is executed N times, corresponding to the number of grouping attributes, and yields an evaluation plan. The underlying heuristic is that if the cost is minimized for every pair of levels, the total cost of the plan will also be small. The output lattice fixes a sorting order for computing each node, which leads to the PipeSort evaluation strategy: all aggregations in any chain in which a node at level k is a prefix of a node at level k+1 in the output graph can be computed in a pipelined fashion.
The general scheme of the PipeSort algorithm is given next.
INPUT: A search lattice with the A() and S() edge costs
OUTPUT: An evaluation plan to compute all nodes in the search lattice
For each level k = 0 to N − 1
    Transform level k + 1 as follows:
        Create k additional copies of each level k + 1 node;
        Connect each copy node to the same set of level k nodes as the original node;
        Assign a cost A(e_ij) to the edges e_ij of the original nodes and
            a cost S(e_ij) to the edges of the copy nodes;
    Find the minimum cost matching on the transformed level k + 1 with level k;
    For each node i in level k + 1
        Fix the sort order of i as the order of the level k node connected to i by an A() edge;
Figure 8.8 shows an evaluation plan for computing the cube lattice of Fig. 8.6 using the PipeSort algorithm.
Following the minimum cost sort plan, the base fact table is first sorted in CBAD order to compute the aggregations CBA, CB, C, and All in a pipelined fashion. The base fact table is then sorted in BADC order to compute the aggregates BAD, BA, and B.
The remaining aggregates are obtained analogously from the sort orders ACDB and DBCA, and the views at level 1 (A, B, C, and D) are computed from the views at level 2, as determined by the bipartite graph matching in Fig. 8.7.
Fig 8.8 Evaluation plan for computing the cube lattice in Fig 8.6
Indexes for Data Warehouses
In database management systems (DBMSs), ensuring rapid data access is a primary concern. To achieve this, relational DBMSs select the most efficient access path for each query. A widely used method to enhance data retrieval speed is indexing, which allows the relevant data to be located quickly. Nearly all queries that look for data satisfying a given condition rely on indexes to be answered efficiently.
As an example, consider the following SQL query:
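A minimal sketch of such a query, which looks up a single employee by its key, is shown below; the literal key value is illustrative.

SELECT *
FROM Employee
WHERE EmployeeKey = 1234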
Without an index on attribute EmployeeKey, we should perform a complete scan of table Employee (unless it is ordered), whereas with the help of an index over such attribute, a single disk block access will do the job since this attribute is a key for the relation.
While indexing enables quick data retrieval, it also requires the index to be updated with every change to an indexed attribute, which can hinder update performance. Therefore, database administrators must exercise caution when defining indexes, since excessive indexing can negatively impact performance. The B+-tree is the most widely used indexing method in relational databases, and all major database vendors support various forms of B+-tree indexes.
A B+-tree index is a hierarchical structure consisting of a root node and pointers to lower levels, with the leaves containing the record identifiers of the associated data. Typically, each node's size matches that of a disk block, allowing a large number of keys per node, which results in a tree with few levels and fast record retrieval. This indexing method is particularly effective when the indexed attribute is a key of the file or when there are few duplicate values.
Queries in OLAP systems differ significantly from those in OLTP systems, which has led to the development of new indexing strategies. The main indexing requirements of data warehouse systems are as follows:
• Symmetric partial match queries: OLAP queries often involve ranges over several dimensions of a data cube, such as "Total sales from January 2006 to December 2010." To efficiently support such queries, all dimensions must be indexed symmetrically so that they can be searched simultaneously.
• Indexing at multiple levels of aggregation: Summary tables can be very large, and queries often ask for specific values of aggregated data. Summary tables must therefore be indexed in the same way as base, nonaggregated tables.
• Efficient batch update: Since updates in OLAP systems are less critical than in OLTP systems, more columns can be indexed. However, the refresh time of the data warehouse must be considered when designing the indexing schema, since the time needed to rebuild the indexes after a refresh can increase the warehouse's downtime.
• Sparse data: Typically, only 20% of the cells in a data cube are nonempty. The indexing schema must thus be able to deal efficiently with sparse and nonsparse data.
Bitmap indexes and join indexes are commonly used in data warehouse systems to cope with these requirements. We study these indexes next.
Figure 8.10a shows a simplified Product table containing six products. Figures 8.10b,c show the bitmap indexes for the attributes QuantityPerUnit and UnitPrice, which have, respectively, four and five distinct values. For each attribute value, a bit vector is created whose length equals the number of rows in the Product table, in this case six. A '1' in position i of a vector indicates that the product in row i has the attribute value corresponding to that vector, while a '0' indicates that it does not.
For example, in the first row of the table in Fig. 8.10b, the '1' in the vector labeled 25 indicates that the corresponding product (p1) has the value 25 for QuantityPerUnit. For readability, the product key is included in the first column of the bitmap index, although this column is not actually part of the index.
ProductKey  ProductName  QuantityPerUnit  UnitPrice  Discontinued  CategoryKey
p1          prod1        25               60         No            c1
p2          prod2        45               60         Yes           c1
p3          prod3        50               75         No            c2
p4          prod4        50               100        Yes           c2
p5          prod5        50               120        No            c3
p6          prod6        70               110        Yes           c4
Fig 8.10 An example of bitmap indexes for a Product dimension table ( a ) Product dimension table; ( b ) Bitmap index for attribute QuantityPerUnit ; ( c ) Bitmap index for attribute UnitPrice
To find the products with a unit price of 75, the query processor uses the bitmap index on the UnitPrice attribute of the Product table. It identifies the bit vector corresponding to the value 75 and locates the positions containing a '1', which indicate the matching records, in this case the third row of the table.
Queries that specify a search range, such as "Products with between 45 and 55 pieces per unit and a unit price between 100 and 200," require additional steps. First, we consult the index over QuantityPerUnit and identify the bit vectors labeled 45 and 50; the products with between 45 and 55 pieces per unit are those having a '1' in at least one of these vectors.
As shown in Fig. 8.11, identifying the products with a quantity per unit between 45 and 55 and a unit price between 100 and 200 requires an OR operation over the relevant QuantityPerUnit vectors and another OR over the relevant UnitPrice vectors, followed by an AND between the two resulting vectors. For the UnitPrice attribute, we take the vectors labeled 100, 110, and 120. This yields two vectors, OR1 and OR2, and the final AND between them shows that products p4 and p5 satisfy both conditions and thus answer the query.
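In SQL, the range query evaluated above with the bitmap vectors would simply be written as follows; the choice of the bitwise operations is made by the query processor, not by the user.

SELECT ProductKey, ProductName
FROM Product
WHERE QuantityPerUnit BETWEEN 45 AND 55
  AND UnitPrice BETWEEN 100 AND 200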
Bitmap indexes speed up query evaluation because they exploit efficient bitwise operations such as AND, OR, and NOT. Since these rely on simple bit comparisons, the resulting bit vector is produced with very little CPU overhead.
Bitmap indexes are most effective for attributes with low cardinality, since in that case they occupy much less space than traditional B+-tree indexes. For instance, in a Product table with 100,000 rows, a bitmap index on the UnitPrice attribute would require only 0.075 MB, whereas a B+-tree index would need approximately 0.4 MB. This space efficiency makes bitmap indexes attractive in OLAP systems. They are, however, not well suited to OLTP environments, since frequent updates are expensive for bitmap indexes, and the page-level locking used in these systems can lead to concurrency problems, given that a locked page may block access to a large number of index entries.
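In systems that support bitmap indexes natively, such as Oracle, declaring one is analogous to declaring a B+-tree index; the following is a sketch (SQL Server, as discussed later in this chapter, does not provide this statement):

CREATE BITMAP INDEX Product_UnitPrice_BIX
ON Product (UnitPrice)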
Bitmap indexes are generally sparse, with a small number of '1's among many '0's, which makes them well suited to compression. Even without compression, bitmap indexes are more space efficient than B+-trees for low-cardinality attributes, and compression also makes it possible to handle high-cardinality attributes. A drawback of compression is the decompression overhead incurred during query evaluation. A well-known compression technique is run-length encoding (RLE), which is the basis of many advanced methods discussed in the bibliographic notes section of this chapter.
Evaluation of Star Queries
Star queries leverage the star schema structure by joining the fact table with the dimension tables. As an example over the Northwind data warehouse, the following star query asks for the total sales of discontinued products, by customer name and product name:
SELECT C.CustomerName, P.ProductName, SUM(S.SalesAmount)
FROM Sales S, Customer C, Product P
WHERE S.CustomerKey = C.CustomerKey AND
    S.ProductKey = P.ProductKey AND P.Discontinued = 'Yes'
GROUP BY C.CustomerName, P.ProductName
We will study now how this query is evaluated by an engine using the indexing strategies studied above.
To evaluate our example query, we assume a B+-tree index on the dimension keys CustomerKey and ProductKey, a bitmap index on the Discontinued attribute of the Product dimension table, and bitmap indexes on the foreign key columns of the Sales fact table. The Product and Customer dimension tables and the Sales fact table are shown in Fig. 8.13a, c, and d, while the corresponding bitmap indexes are shown in Fig. 8.13b, e, and f.
The OLAP engine evaluates the query by first identifying in the bitmap index the records satisfying the condition Discontinued = 'Yes', which correspond to the ProductKey values p2, p4, and p6. It then accesses the bitmap vectors of the Sales fact table labeled with these values to perform the join between Product and Sales. Only the vectors for p2 and p4 yield matches, since there is no fact record for product p6. The relevant rows of the fact table, namely, the third, fourth, and sixth rows, are identified by the presence of a '1' in the corresponding vectors. From these rows, the CustomerKey values c2 and c3 are obtained.
ProductKey  ProductName  Discontinued
p1          prod1        No
p2          prod2        Yes
p3          prod3        No
p4          prod4        Yes
p5          prod5        No
p6          prod6        Yes

CustomerKey  CustomerName  Address          PostalCode
c1           cust1         35 Main St       7373
c2           cust2         Av Roosevelt 50  1050
c3           cust3         Av Louise 233    1080
c4           cust4         Rue Gabrielle    1180
Figure 8.13 illustrates the evaluation of the star query using bitmap indexes: it shows the Product and Customer dimension tables and the Sales fact table, together with bitmap indexes on the attributes Discontinued, CustomerKey, and ProductKey. Then, using the B+-tree indexes over the dimension tables, the product and customer names satisfying the query are retrieved, which completes the join of the dimension tables with the fact table. The corresponding records are cust2, cust3, prod2, and prod4, and the final answer of the query is composed of the tuples (cust2, prod2, 200) and (cust3, prod4, 100).
Note that the last join with Customer would not be needed if the query had been of the following form:
SELECT S.CustomerKey, P.ProductKey, SUM(SalesAmount)
FROM Sales S, Product P
WHERE S.ProductKey = P.ProductKey AND P.Discontinued = 'Yes'
GROUP BY S.CustomerKey, P.ProductKey
The query above only mentions attributes of the fact table Sales. Thus, the only join that needs to be performed is the one between Product and Sales.
We illustrate now the evaluation of star queries using bitmap join indexes.
The idea is to create a bitmap index on the fact table based on an attribute of a dimension table; this requires precomputing the join between the two tables and building the bitmap index for the dimension attribute over the fact table. For instance, a bitmap join index between Sales and Product on the Discontinued attribute allows the sales facts corresponding to discontinued products to be retrieved directly: it suffices to locate the vector labeled 'Yes' and examine the bits set to '1'. This spares the first steps that were required when plain bitmap indexes were used, at the price of precomputing the index offline.
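As an illustration, in Oracle a bitmap join index corresponding to this example could be declared as sketched below; the index name is arbitrary, and the join column of the dimension table is assumed to be declared as its primary key, as such systems require.

CREATE BITMAP INDEX Sales_ProdDiscontinued_BJIX
ON Sales (Product.Discontinued)
FROM Sales, Product
WHERE Sales.ProductKey = Product.ProductKey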
This strategy can significantly lower evaluation costs, especially when the SELECT clause does not include dimension attributes, in which case no join with the dimension tables through the B+-tree indexes is needed. In that case, the alternative query above can be answered, even in the worst case, with a single scan of the Sales table.
Partitioning
Partitioning divides a large table into smaller, more manageable pieces called partitions, in order to improve data management and processing efficiency. From the application point of view, a partitioned table behaves like a nonpartitioned one, so SQL statements are not affected by partitioning. Both tables and indexes can be partitioned, in flexible configurations; for example, a partitioned index can be defined on an unpartitioned table. There are two primary ways of partitioning: horizontal and vertical. Horizontal partitioning divides a table into smaller tables with the same structure but fewer records; it is particularly useful for queries that focus on recent data, and it simplifies data warehouse refreshing, since only the most recent partition needs to be accessed. Vertical partitioning, on the other hand, splits the attributes of a table into groups that are stored independently; a key must be kept in all partitions so that the original tuples can be reconstructed, and frequently accessed attributes can be stored together for better performance.
Partitioning enhances performance because smaller partitions can be loaded into memory and processed more efficiently. Several horizontal partitioning strategies exist. Round-robin partitioning distributes tuples evenly across partitions, facilitating parallel access, but retrieving an individual tuple requires scanning the whole relation. Hash partitioning applies a hash function to distribute rows uniformly, so that exact-match queries can be directed to a single partition. Range partitioning assigns tuples to partitions according to ranges of attribute values, which makes it particularly effective for temporal data, such as partitioning by date.
For example, the partition for a given month contains only the rows whose dates fall within that month. Range partitioning copes with nonuniform data distributions and supports both exact-match and range queries. List partitioning assigns tuples to partitions based on enumerated attribute values, while composite partitioning combines methods, such as range and hash partitioning, for more complex data organizations. Partition pruning allows queries to access only the partitions containing relevant data, significantly reducing response times; for instance, in a Sales fact table partitioned by month, a query over a given period reads only the corresponding partitions. Partitioning also improves the performance of joins, in particular when both tables involved are partitioned on their join attributes or when the referenced table is partitioned on its primary key.
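For example, assuming the Sales fact table is range-partitioned by month on a date column OrderDate (an illustrative schema), a query such as the following reads only the partition of January 2021 thanks to partition pruning:

SELECT SUM(SalesAmount)
FROM Sales
WHERE OrderDate >= DATE '2021-01-01'
  AND OrderDate <  DATE '2021-02-01'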
In scenarios where large joins are involved, breaking them into smaller joins between partitions can lead to substantial performance improvements, especially when leveraging parallel execution.
Partitioning also simplifies database management, since tables and indexes are broken into smaller, more manageable pieces on which maintenance operations can be targeted. For instance, a database administrator can back up a single partition rather than the entire table. Partitioning is likewise beneficial for index maintenance, enabling updates to specific data segments without affecting the whole index. It also improves data availability: if some partitions become unavailable, the others can remain online, so that applications continue to function. Furthermore, because each partition can reside in a separate tablespace, backup and recovery tasks can be executed independently, allowing the most active data to be recovered sooner than would be possible with an unpartitioned table.
Parallel Processing
Parallel query processing converts queries into execution plans that can be efficiently executed in parallel on multiprocessor systems. Efficient parallel execution is essential for optimizing performance measures such as query response time and throughput. Achieving it requires both parallel execution techniques and query optimization methods that identify the most efficient plan among a set of equivalent ones.
Parallelism in query processing comes in two main forms: interquery parallelism and intraquery parallelism. Interquery parallelism processes multiple queries simultaneously, whereas intraquery parallelism executes different steps of a single query concurrently. The two forms can be combined.
Parallel algorithms for query processing rely on data partitioning and must balance the degree of parallelism against communication costs. Not all operations can be parallelized; for instance, highly sequential algorithms like Quicksort are less suitable for parallel execution. In contrast, the sort-merge algorithm is well suited to parallel processing and is widely used in shared-nothing architectures.
There are three main parallel join algorithms for partitioned relations, derived from their nonparallel counterparts: the parallel sort-merge join, the parallel nested-loop join, and the parallel hash join. The parallel sort-merge join sorts both relations on the join attribute using a parallel merge sort and then performs a merge-like join on a single node. The parallel nested-loop join computes the Cartesian product of the two relations in parallel, allowing arbitrary join predicates beyond the equijoin. Finally, the parallel hash join addresses only the equijoin; it partitions the two relations R and
S into the same number p of mutually exclusive fragments R_i and S_i such that the join between R and S is equal to the union of the joins between R_i and S_i. Each join between R_i and S_i is performed in parallel. Of course, the algorithms above are the basis of actual implementations, which exploit main memory and multicore processors.
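In symbols, assuming that R and S are partitioned on the join attribute with the same hash function into p fragments each, the property exploited by the parallel hash join is:

R ⋈ S = (R_1 ⋈ S_1) ∪ (R_2 ⋈ S_2) ∪ · · · ∪ (R_p ⋈ S_p)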
PostgreSQL implements interquery parallelism by serving multiple connections in a multiuser environment and by processing the same query over different partitions, which is essential for OLAP queries. It also supports intraquery parallelism, allowing a single query to be executed by several parallel Workers so as to exploit multicore processors.
Fig 8.14 A parallel query plan in PostgreSQL
In PostgreSQL, a parallel query plan involves three kinds of components: the Process, the Gather node, and the Workers. The Process manages overall query execution. When a query can be parallelized, a Gather node is introduced as the root of the parallelizable part of the plan, and a number of Workers (background worker processes that run in parallel) are allocated to it. The Process executes all the serial parts of the query and divides the blocks of the scanned relations among the Workers so that access remains sequential. The Workers communicate with the Process through shared memory and return their partial results to it for final assembly. PostgreSQL supports parallel sequential scans, where the blocks of a table are divided among the Workers; parallel index scans, where the Workers cooperate to scan the pages of a B-tree index; and parallel aggregation, which follows a divide-and-conquer approach in which PartialAggregate nodes send their outputs to a FinalizeAggregate node that computes the final result. Parallel merge joins are also supported, where each Worker executes the inner loop of the join and sends its results to the Gather node, which produces the final output.
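As an illustration, the following PostgreSQL sketch asks the optimizer for the plan of an aggregate query over the Sales fact table. Depending on the table size and on the configuration parameter shown, the resulting plan contains the Gather, PartialAggregate, and FinalizeAggregate nodes described above; the table and column names follow the Northwind warehouse used in this chapter.

-- allow up to four Workers per Gather node (illustrative setting)
SET max_parallel_workers_per_gather = 4;
-- inspect the parallel plan chosen by the optimizer
EXPLAIN
SELECT CustomerKey, SUM(SalesAmount)
FROM Sales
GROUP BY CustomerKey;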
PostgreSQL allows users to declaratively define partitions, which improves query performance and bulk loading by operating on smaller tables, and also enables a higher degree of parallelism. Tables are partitioned according to a partition key, a specified list of columns or expressions.
We explain next the various kinds of partitioning supported, using a table Employee(EmpNbr, Position, Salary) as an example.
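Since the examples below show only the partitions themselves, we sketch here how the partitioned parent table would be declared; the list-partitioned case is shown, and for the range and hash examples the final clause would be PARTITION BY RANGE (Salary) and PARTITION BY HASH (EmpNbr), respectively (the column types are assumptions).

CREATE TABLE Employee (
    EmpNbr   INTEGER,
    Position TEXT,
    Salary   NUMERIC
) PARTITION BY LIST (Position);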
A list partition contains the rows with specific values of the partition key, while a default partition accommodates the values that do not belong to any defined partition. In our example, each position is assigned its own partition; only one partition is shown next.
CREATE TABLE EmployeeManager PARTITION OF Employee
FOR VALUES IN ('Manager');
A range partition contains the rows whose partition key values fall within a specified range, defined by a minimum and a maximum value, where the minimum is inclusive and the maximum is exclusive. For example, we can create three partitions based on salary ranges:
CREATE TABLE EmployeeSalaryLow PARTITION OF Employee
FOR VALUES FROM (MINVALUE) TO (50000);
CREATE TABLE EmployeeSalaryMiddle PARTITION OF Employee
FOR VALUES FROM (50000) TO (100000);
CREATE TABLE EmployeeSalaryHigh PARTITION OF Employee
FOR VALUES FROM (100000) TO (MAXVALUE);
A hash partition is defined by specifying a modulus and a remainder: each partition contains the rows for which the hash value of the partition key, divided by the modulus, yields the given remainder, so that the data are evenly distributed across the partitions.
An example is as follows (only one partition is shown, using a modulus of 4 and remainder 0):
CREATE TABLE EmployeeHash1 PARTITION OF Employee
FOR VALUES WITH (MODULUS 4, REMAINDER 0);
Finally, a multilevel partition is created by partitioning an already partitioned table. For example, we can combine the range and list partitioning above as shown next:
CREATE TABLE EmployeeSalaryLow PARTITION OF Employee
FOR VALUES FROM (MINVALUE) TO (50000)
PARTITION BY LIST (Position);
CREATE TABLE EmployeeSalaryLow_Manager PARTITION OF EmployeeSalaryLow
FOR VALUES IN ('Manager');
Table partitions are used by PostgreSQL for query processing. Consider, for example, the following query:
The query plan devised by the query optimizer is as follows, according to the architecture shown in Fig.8.14.
Parallel Seq Scan on EmployeeSalaryHigh
Parallel Seq Scan on EmployeeSalaryMiddle
Parallel Seq Scan on EmployeeSalaryLow
The optimizer determines the number of Workers required for a query according to the available resources; for instance, even if eight cores are available, it may decide that four Workers suffice, since using more could degrade performance. The Parallel Append node distributes the Workers across the partitions: one partition may be scanned by two Workers while the others are handled by one each, and once a Worker finishes with a partition it is reassigned so that the Workers remain evenly distributed across the remaining partitions.
Physical Design in SQL Server and Analysis Services
This section explores how the concepts studied in this chapter are applied in Microsoft SQL Server. We start with the support for materialized views. We then introduce the column-store index, a particular kind of index provided by SQL Server. Following this, we consider partitioning and describe how the three types of multidimensional data representation, namely, ROLAP, MOLAP, and HOLAP, are implemented in Analysis Services.
In SQL Server, indexed views play the role of materialized views: creating a unique clustered index on a view precomputes and materializes it. This technique is essential for optimizing performance, particularly in data warehouse environments.
Creating an indexed view requires that both the view and the base tables satisfy a number of conditions. The view definition must be deterministic, that is, all expressions in the SELECT, WHERE, and GROUP BY clauses must always yield the same result. For instance, the DATEADD function is deterministic, while GETDATE is not, since it returns a different result each time it is executed. In addition, indexed views must be created with the SCHEMABINDING option, which prevents modifications to the base tables that would invalidate the view definition. As an example, the following indexed view computes the total sales by employee from the Sales fact table of the Northwind data warehouse:
CREATE VIEW EmployeeSales WITH SCHEMABINDING AS (
SELECT EmployeeKey, SUM(UnitPrice * OrderQty * Discount)
AS TotalAmount, COUNT_BIG(*) AS SalesCount FROM Sales
GROUP BY EmployeeKey )
CREATE UNIQUE CLUSTERED INDEX CI_EmployeeSales ON
EmployeeSales(EmployeeKey)
An indexed view can be used in two ways: it can be referenced directly in a query, or the query optimizer can use it even when it is not explicitly mentioned, if this results in a cheaper query plan.
When a query references a view, the view definition is normally expanded until it refers only to base tables, a process known as view expansion. To prevent this, the NOEXPAND hint instructs the query optimizer to treat the view as an ordinary table with a clustered index, as in the following fragment:
FROM Employee, EmployeeSales WITH (NOEXPAND)
When an indexed view is not explicitly referenced in a query, the query optimizer determines whether it can be used anyway. In this way, existing applications can benefit from newly created indexed views without being modified. To determine whether an indexed view covers a query fully or partially, several conditions are checked; for example, the tables in the query's FROM clause must include all tables in the view's FROM clause, the join conditions in the query must include those in the view, and the aggregate columns in the query must be derivable from a subset of the aggregate columns in the view.
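For example, the optimizer may answer the following query directly from the indexed view EmployeeSales defined above, even though the view is not mentioned in the query; whether it actually does so depends on the SQL Server edition and on the estimated costs.

SELECT EmployeeKey, SUM(UnitPrice * OrderQty * Discount) AS TotalAmount
FROM Sales
GROUP BY EmployeeKey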
When indexed views are created on a partitioned table, SQL Server automatically partitions the indexed view using the same partition scheme as the table, producing what is called a partition-aligned indexed view. A key advantage of this approach is that the query processor automatically maintains the indexed view when new partitions of the table are added, so that the view does not have to be dropped and recreated. This feature significantly enhances the manageability of indexed views.
We show next how to create a partition-aligned indexed view on the Sales fact table of the Northwind data warehouse. To ease maintenance and improve efficiency, we partition the fact table by year.
To create a partition scheme, we need first to define the partition function.
We want to partition the table by year, from 2016 to 2018. The partitioning uses a function named PartByYear, which takes as argument an integer corresponding to the surrogate key values of the Date dimension.
CREATE PARTITION FUNCTION [PartByYear] (INT)
AS RANGE LEFT FOR VALUES (184, 549, 730);
The surrogate keys 184, 549, and 730 correspond to the dates 31/12/2016, 31/12/2017, and 31/12/2018 and define the boundaries of the partition intervals. With RANGE LEFT, records with values less than or equal to 184 are assigned to the first partition, those greater than 184 and up to 549 to the second partition, those greater than 549 and up to 730 to the third partition, and records with values greater than 730 to the fourth partition.
Once the partition function has been defined, the partition scheme is created as follows:
CREATE PARTITION SCHEME SalesPartScheme
AS PARTITION [PartByYear] ALL TO ( [PRIMARY] );
Here, PRIMARY means that the partitions will be stored in the primary filegroup, that is, the filegroup containing the startup information of the database. Filegroup names can also be given, and several filegroups can be specified. The keyword ALL indicates that all partitions will be stored in the primary filegroup.
The Sales fact table is created as a partitioned table as follows:
CREATE TABLE Sales (CustomerKey INT, EmployeeKey INT,
    OrderDateKey INT, ...) ON SalesPartScheme(OrderDateKey)
The clause ON SalesPartScheme(OrderDateKey) states that the table will be partitioned according to SalesPartScheme and that the partition function will take OrderDateKey as argument.
Now we create an indexed view over the Sales table, as explained in Sect. 8.9.1. We first create the view.
CREATE VIEW SalesByDateProdEmp WITH SCHEMABINDING AS (
SELECT OrderDateKey, ProductKey, EmployeeKey, COUNT_BIG(*) AS Cnt,
SUM(SalesAmount) AS SalesAmount FROM Sales
GROUP BY OrderDateKey, ProductKey, EmployeeKey )
Finally, we materialize the view.
CREATE UNIQUE CLUSTERED INDEX UCI_SalesByDateProdEmp
ON SalesByDateProdEmp (OrderDateKey, ProductKey, EmployeeKey)
Since the clustered index was created using the same partition scheme, this is a partition-aligned indexed view.
SQL Server provides column-store indexes, which store data column by column, similarly to vertical partitioning, and can significantly boost the performance of certain kinds of queries. The considerations made above about bitmap indexes and their use in star-join evaluation also apply to column-store indexes. A detailed study of column-store databases is given in Chapter 15.
As an example, a column-store index can be defined to speed up access to a materialized view Sales2012, which selects from the Sales fact table the data of the year 2012. Since the attributes DueDateKey, EmployeeKey, and SalesAmount are frequently requested together, a column-store index over them optimizes the corresponding queries on the Sales2012 view.
CREATE NONCLUSTERED COLUMNSTORE INDEX CSI_Sales2012
ON Sales2012 (DueDateKey, EmployeeKey, SalesAmount)
Column-store indexes have significant limitations; in particular, they cannot be defined on tables that are frequently updated. Consequently, a column-store index cannot be created on the original Sales fact table, which must be updated; instead, the index is defined on a view.
SQL Server does not support bitmap indexes. Instead, it uses bitmap filters, which are created at execution time by the query processor to filter the values of a table. Bitmap filters can be introduced in the query plan after optimization or dynamically by the query optimizer during query plan generation; the latter are called optimized bitmap filters. They can significantly improve the performance of data warehouse queries over star schemas by removing nonqualifying rows of the fact table early in the query plan. Note that this mechanism differs from the traditional bitmap indexes found in other database systems such as Oracle and Informix.
In Analysis Services, a partition is a container for a portion of the data of a measure group. Defining a partition requires specifying:
• Basic information, like name of the partition, the storage mode, and the processing mode.
• Slicing definition, which is an MDX expression specifying a tuple or a set.
• Aggregation design, which is a collection of aggregation definitions that can be shared across multiple partitions.
Query Performance in Analysis Services
We now briefly describe how query performance can be enhanced in Analysis Services through several techniques.
The first step must be to optimize cube and measure group design. For this, many of the issues studied in this book apply.
Fig 8.16 Template query that defines a partition
For example, it is suggested to define cascading attribute relationships, such as Day → Month → Quarter → Year, and to define user hierarchies of related attributes within each dimension. Such natural hierarchies are advantageous because they are materialized on disk and are automatically considered as aggregation candidates. To enhance query performance, redundant relationships between attributes should be removed, and the cube space should be reduced to the measure groups that are actually needed. Measures that are frequently queried together should be placed in the same measure group, since retrieving measures from multiple groups requires additional storage engine operations; conversely, large sets of measures that are not commonly queried together should be separated into distinct measure groups. Large parent-child hierarchies should also be avoided, since aggregations are created only for the key attribute and the top attribute, so that queries addressing intermediate levels are slower. Finally, many-to-many dimensions should be used with care, since answering queries over them requires run-time joins between the data measure group and the other dimensions involved.
To optimize query performance in Analysis Services, it is also crucial to define efficient aggregations, which reduce the number of records that the storage engine must scan. While aggregations can enhance query speed, the time required to create and refresh them must be weighed against their benefits. Excessive aggregations can actually degrade performance, particularly when a summary table is used infrequently, since it may displace more relevant data from the cache. Therefore, the number of aggregations should be kept limited.
The Analysis Services aggregation design algorithm does not automatically consider all attributes for aggregation, so it is necessary to review which attributes are included and to suggest additional candidates when needed. This is particularly important when user queries that cannot be answered from the cache are resolved through partition reads instead of aggregation reads. To influence this process, Analysis Services provides the Aggregation Usage property, which can be set to one of four values: full, none, unrestricted, or default. This property allows administrators to control which attributes are considered for aggregation.
Partitions in Analysis Services improve query performance by allowing smaller data sets to be accessed when the data cache or the aggregations cannot answer a query. To enhance efficiency, data should be partitioned according to the common query patterns, and configurations in which most queries must access many partitions should be avoided. The vendor recommends that each partition contain between 2 million and 20 million records and that each measure group have fewer than 2,000 partitions. In addition, a separate ROLAP partition must be defined for real-time data, and this partition must have its own measure group.
To enhance the performance of MDX queries, run-time checks in calculations should be avoided, since they can significantly slow down execution. Instead of frequently evaluated CASE and IF functions, the SCOPE function should be used to rewrite the query more efficiently. Specifying Non_Empty_Behavior allows the query engine to use the bulk evaluation mode, and using EXISTS for filtering, rather than relying on member properties, can further improve performance. Minimizing the use of subqueries and filtering sets before cross joins also help to reduce the cube space and improve overall query efficiency.
Finally, the cache of the query engine should be exploited. For this, the server must have enough memory to store query results for future use. Calculations should be defined in MDX scripts, since these have a global scope that allows the cache to be shared across sessions with the same security permissions. The cache should also be warmed by executing a set of predefined queries with any tool. The techniques for tuning the cache are similar to those used for relational databases and aim at optimizing memory and processor usage; we refer the reader to the Analysis Services documentation for details.
Summary
In this chapter, we studied the problem of physical data warehouse design, emphasizing three key techniques: view materialization, indexing, and partitioning. Regarding view materialization, we addressed the problem of incremental view maintenance, that is, how and when a view can be updated without recomputing it from scratch.
We introduced efficient algorithms for computing a data cube when all the views are materialized and presented methods for selecting a good set of views to materialize under constraints when full materialization is not feasible. We then studied two indexing schemes commonly used in data warehousing, bitmap and join indexes, and their role in query evaluation. We also explored partitioning and parallel processing techniques that improve data warehouse performance and manageability. The final sections showed how these concepts are applied in the physical design and query performance of SQL Server and Analysis Services.
Bibliographic Notes
A comprehensive resource on physical database design is [140], while SQL Server is covered in [56, 227]. Incremental view maintenance is explored in [95, 96], and the summary table algorithm was developed by Mumick et al. [159]. The PipeSort algorithm and other data cube computation techniques are detailed in [3]. The classic view selection algorithm was introduced by Harinarayan et al. [100]. Bitmap indexes were first presented in [173], while bitmap join indexes are discussed in [172]. A study of the combined use of indexing, partitioning, and view materialization in data warehouses is given in [25]. A book on indexing structures for data warehouses is [123], and index selection is studied in [85]. A survey on bitmap indexes for data warehouses is given in [223], which also covers the popular WAH (Word Align Hybrid) bitmap compression technique [266] and its more efficient variation PLWAH (Position List Word Align Hybrid) [223]. Rizzi and Saltarelli [198] compare view materialization and indexing in data warehouse design, while a survey of view selection methods is presented in [146].
Review Questions
8.1 What is the objective of physical data warehouse design? Specify different techniques that are used to achieve such objective.
8.2 Discuss advantages and disadvantages of using materialized views.
8.3 What is view maintenance? What is incremental view maintenance?
8.4 Discuss the kinds of algorithms for incremental view maintenance, that is, using full and partial information.
8.5 What are self-maintainable aggregate functions and views?
8.6 Explain briefly the main idea of the summary-delta algorithm for data cube maintenance.
8.7 How is data cube computation optimized? What are the kinds of optimizations that algorithms are based on?
8.8 Explain the idea of the PipeSort algorithm.
8.9 How can we estimate the size of a data cube?
8.10 Explain the algorithm for selecting a set of views to materialize. Discuss its limitations. How can they be overcome?
8.11 Compare B+-tree indexes, hash indexes, bitmap indexes, and join indexes with respect to their use in databases and data warehouses.
8.12 How do we use bitmap indexes for range queries?
8.14 Describe a typical indexing scheme for star and snowflake schemas.
8.15 How are bitmap indexes used during query processing?
8.16 How do join indexes work in query processing? Explain for which kinds of queries they are efficient.
8.17 Explain and discuss horizontal and vertical data partitioning.
8.18 Discuss different horizontal partitioning strategies When would you use each of them?
8.19 Explain two techniques for increasing query performance taking advantage of partitioning.
8.20 Explain how parallel query processing is implemented in PostgreSQL.
8.21 Discuss the characteristics of storage modes in Analysis Services.
8.22 How do indexed views compare with materialized views?
Exercises
Exercise 8.1. In the Northwind database, consider the relations
Employee(EmplID, FirstName, LastName, Title, ...)
Orders(OrderID, CustID, EmpID, OrderDate, ...)
and the view
EmpOrders(EmpID, Name, OrderID, OrderDate)
computed from the full outer join of tables Employee and Orders, where Name is obtained by concatenating FirstName and LastName.
Write the SQL command defining the view EmpOrders. Show, by means of examples, how the view is modified when employees are inserted into or deleted from the Employee table. Write the SQL command that computes the delta relation of the view from the delta relations of table Employee, and write an algorithm to update the view from this delta relation.
Exercise 8.2. Consider a relation Connected(CityFrom, CityTo, Distance), which indicates pairs of cities that are directly connected and the distance between them, and a view OneStop(CityFrom, CityTo), which computes the pairs of cities (c1, c2) such that c2 can be reached from c1 passing through exactly one intermediate stop. Answer the same questions as those of the previous exercise.
Exercise 8.3.Consider the following tables
Store(StoreID, City, State, Manager)
Order(OrderID, StoreID, Date)
OrderLine(OrderID, LineNo, ProductID, Quantity, Price)
Product(ProductID, ProductName, Category, Supplier)
Part(PartID, PartName, ProductID, Quantity) and the following views
• ParisManagers(Manager) that contains the managers of stores located in Paris.
• OrderProducts(OrderID, ProductCount) that contains the number of products for each order.
• OrderSuppliers(OrderID, SupplierCount) that contains the number of suppliers for each order.
• OrderAmount(OrderID, StoreID, Date, Amount) that adds to the table Order an additional column containing the total amount of each order.
• StoreOrders(StoreID, OrderCount) that contains the number of orders for each store.
• ProductPart(ProductID, ProductName, PartID, PartName) that is obtained from the full outer join of tables Product and Part.
For each of the above views, discuss whether it is self-maintainable with respect to insertions and with respect to deletions. Justify your answers by means of examples.
Exercise 8.4. Consider the tables
Professor(ProfNo, ProfName)
Supervision(ProfNo, StudNo)
PhDStudent(StudNo, StudName, Laboratory)
and a view ProfPhdStud(ProfNo, ProfName, StudNo, StudName) computed from the outer joins of these three relations.
Discuss whether the view is self-maintainable with respect to insertions into and deletions from the table Supervision. Write the SQL command that creates the view, and show, by means of examples, the delta tables resulting from insertions and deletions in Supervision and how the view is recomputed from them.
Exercise 8.5. By means of examples, explain the propagate and refresh algorithm for the aggregate functions AVG, MIN, and COUNT. For each aggregate function, write the SQL command that creates the summary-delta table from the tables containing the inserted and deleted tuples in the fact table, and write the algorithm that refreshes the view from the summary-delta table.
Exercise 8.6. Suppose that a cube Sales(A, B, C, D, Amount) has to be fully materialized. The cube contains 64 tuples. Sorting takes the typical n log(n) time. Every GROUP BY with k attributes has 4 × 2^k tuples.
a. Compute the cube using the PipeSort algorithm.
b. Compute the gain of applying PipeSort compared to the cost of computing all the views from scratch.
Exercise 8.7. Consider the graph in Fig. 8.19, where each node represents a view and the numbers are the costs of materializing the view. Assuming that the bottom of the lattice is materialized, determine using the View Selection Algorithm the five views to be materialized first.
Exercise 8.8. Consider the data cube lattice of a three-dimensional cube with dimensions A, B, and C. Extend the lattice to take into account the hierarchies A → A1 → All and B → B1 → B2 → All. Since the lattice is complex to draw, represent it by giving the list of nodes and the list of edges.
Exercise 8.9. Consider an n-dimensional cube with dimensions D_1, D_2, ..., D_n. Suppose that each dimension D_i has a hierarchy with n_i levels. Compute the number of nodes of the corresponding data cube lattice.
Exercise 8.10. Modify the algorithm for selecting views to materialize in order to take into account the probability that each view has to completely match a given query. In other words, consider that you know the distribution of the queries, so that view A has probability P(A) of matching a query, view B has probability P(B), etc.
a. How would you change the algorithm to take this knowledge into account?
b. Suppose that in the lattice of Fig. 8.9, the view ABC is already materialized. Apply the modified algorithm to select four views to be materialized given the following probabilities for the views: P(ABC) = 0.1, ..., P(C) = 0.1, and P(All) = 0.1.
c. Answer the same question as in (b) but now with the probabilities P(ABC) = 0.1, P(AB) = 0.05, P(AC) = 0.1, P(BC) = 0, P(A) = 0.2, P(B) = 0.1, P(C) = 0.05, and P(All) = 0.05. Compare the results.
Exercise 8.11. Given the Employee table below, show how a bitmap index on attribute Title would look. Compress the bitmap values using run-length encoding.
Employee
Name             Title  City
Peter Brown      Dr.    Brussels
James Martin     Mr.    Wavre
Ronald Ritchie   Mr.    Paris
Marco Benetti    Mr.    Versailles
Alexis Manoulis  Mr.    London
Maria Mortsel    Mrs.   Reading
Laura Spinotti   Mrs.   Brussels
John River       Mr.    Waterloo
Bert Jasper      Mr.    Paris
Claudia Brugman  Mrs.   Saint-Denis
Exercise 8.12. Given the Sales table below and the Employee table from Ex. 8.11, show how a join index on attribute EmployeeKey would look.
Exercise 8.13. Given the Department table below and the Employee table from Ex. 8.11, show how a bitmap join index on attribute DeptKey would look.
Department
DeptKey  Name             Location
d1       Management       Brussels
d2       Production       Paris
d3       Marketing        London
d4       Human Resources  Brussels
d5       Research         Paris
Exercise 8.14. Consider the tables Sales in Ex. 8.12, Employee in Ex. 8.11, and Department in Ex. 8.13.
a. Propose an indexing scheme for the tables, including any kind of index you consider necessary. Discuss possible alternatives according to several query scenarios, as well as the advantages and disadvantages of creating the indexes.
b. Consider the query:
( D.Location = 'Brussels' OR D.Location = 'Paris' )
Explain a possible query plan that uses the indexes defined in (a).
Extraction, transformation, and loading (ETL) processes are essential for managing data from the various sources of an organization, but their complexity and cost call for approaches that reduce their development and maintenance effort. Conceptual modeling of ETL processes can help achieve this, yet current ETL tools lack a standardized language for defining such processes. This chapter explores the design of ETL processes using the Business Process Model and Notation (BPMN), a widely accepted standard for business process specification. BPMN enables a conceptual, implementation-independent representation of ETL processes, allowing designers to concentrate on their essential characteristics while hiding technical details. The chapter also discusses how BPMN models can be translated into executable specifications for ETL tools and presents an alternative implementation based on an extended relational algebra, which can be translated into SQL and executed on any RDBMS. The chapter begins with an introduction to BPMN, followed by a detailed explanation of its use for the conceptual modeling of ETL processes, illustrated through a case study that loads the Northwind data warehouse. It concludes with an overview of Microsoft Integration Services and a demonstration of the relational algebra implementation of ETL processes, including its translation into SQL.
A business process consists of a set of interconnected activities or tasks performed within an organization to deliver a particular product or service. These tasks can be executed by software systems, by humans, or by a combination of both. Business process modeling consists in representing the business processes of an organization so that they can be analyzed and improved.
Various techniques have been proposed for modeling business processes, including Gantt charts, flowcharts, PERT diagrams, and data flow diagrams. However, these traditional methods often lack formal semantics. On the other hand, formal techniques such as Petri nets have well-defined semantics but can be difficult for business users to understand and may not adequately capture typical real-world scenarios. This gap led to the creation of BPMN (Business Process Model and Notation), which has become the de facto standard for modeling business processes; BPMN 2.0 is the current version.
BPMN provides a graphical notation for defining, understanding, and communicating the business processes of an organization. It aims to be readily usable by the business community while still supporting the complexity inherent to business process modeling. BPMN is defined using the Unified Modeling Language (UML) and includes precise semantics as well as execution semantics. To balance simplicity and expressiveness, BPMN organizes its graphical elements into categories, so that readers can easily identify the fundamental components and understand the diagrams; variations and detailed information can then be added without significantly changing the overall look and feel of a diagram.
BPMN provides four basic categories of elements, namely, flow objects, connecting objects, swimlanes, and artifacts. These are described next.