This section presents the concept of entity clustering, which abstracts the ER schema to such a degree that the entire schema can appear on a single sheet of paper or a single computer screen. This helps the end user and the database designer develop a mutual understanding of the database contents and formally document the conceptual model.
An entity cluster is the result of a grouping operation on a collection of entities and relationships. Entity clustering is potentially useful for designing large databases. When the scale of a database or information structure is large and includes a large number of interconnections among its different components, it may be very difficult to understand the semantics of such a structure and to manage it, especially for the end users or managers. In an ER diagram with 1000 entities, the overall structure will probably not be very clear, even to a well-trained database analyst. Clustering is therefore important because it provides a method to organize a conceptual database schema into layers of abstraction, and it supports the different views of a variety of end users.
4.5.1 Clustering Concepts
One should think of grouping as an operation that combines entities and their relationships to form a higher-level construct. The result of a grouping operation on simple entities is called an entity cluster. A grouping operation on entity clusters, or on combinations of elementary entities and entity clusters, results in a higher-level entity cluster. The highest-level entity cluster, representing the entire database conceptual schema, is called the root entity cluster.
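The nesting described above can be sketched as a small recursive structure. This is only an illustration, not the book's notation: the class names `Entity` and `Cluster`, the rule that a cluster sits one level above its highest member, and the sample entities A to D are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Entity:
    """An elementary entity from the ER diagram."""
    name: str

    @property
    def level(self) -> int:
        return 1  # elementary entities sit at the lowest level


@dataclass
class Cluster:
    """Result of a grouping operation on entities and/or clusters."""
    name: str
    members: List[Union["Entity", "Cluster"]] = field(default_factory=list)

    @property
    def level(self) -> int:
        # Assumption: a cluster is one level above its highest member.
        return 1 + max(m.level for m in self.members)

    def elementary_entities(self) -> List[str]:
        """Flatten the cluster to the names of its elementary entities."""
        names: List[str] = []
        for m in self.members:
            if isinstance(m, Entity):
                names.append(m.name)
            else:
                names.extend(m.elementary_entities())
        return names


# Hypothetical entities A, B, C, D grouped in two rounds.
inner = Cluster("AB", [Entity("A"), Entity("B")])
root = Cluster("Root", [inner, Entity("C"), Entity("D")])
print(inner.level, root.level)
print(root.elementary_entities())
```

Here `root` plays the role of the root entity cluster: it mixes an existing cluster with elementary entities, and flattening it recovers every entity of the original schema.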
Figure 4.8(a) illustrates the concept of entity clustering in a simple case where (elementary) entities R-sec (report section), R-abbr (report abbreviation), and Author are naturally bound to (dominated by) the entity Report; and entities Department, Contractor, and Project are not dominated. (Note that to avoid unnecessary detail, we do not include the attributes of entities in the diagrams.) In Figure 4.8(b), the dark-bordered box around the entity Report and the entities it dominates defines the entity cluster Report. The dark-bordered box is called the EC box to represent the idea of an entity cluster. In general, the name of the entity cluster need not be the same as the name of any internal entity; however, when there is a single dominant entity, the names are often the same. The EC box number in the lower-right corner is a clustering-level number used to keep track of the sequence in which clustering is done. The number 2.1 signifies that the entity cluster Report is the first entity cluster at level 2. Note that all the original entities are considered to be at level 1.
FIGURE 4.8
Entity clustering concepts: (a) ER model before clustering, and (b) ER model after clustering.
The higher-level abstraction, the entity cluster, must maintain the same relationships between entities inside and outside the entity cluster as those that occur between the same entities in the lower-level diagram. Thus, the entity names inside the entity cluster should appear just outside the EC box along the path of their direct relationship to the appropriately related entities outside the box, maintaining consistent interfaces (relationships) as shown in Figure 4.8(b). For simplicity, we modify this rule slightly: If the relationship is between an external entity and the dominant internal entity (for which the entity cluster is named), the entity cluster name need not be repeated outside the EC box. Thus, in Figure 4.8(b), we could drop the name Report in both places it occurs outside the Report box, but we must retain the name Author, which is not the name of the entity cluster.
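The interface rule can be illustrated with a short sketch that hides relationships internal to a cluster and labels the ones that cross the EC box. The entity names come from the text, but the relationship names and their endpoints in the list below are assumed for illustration only, and the function `cluster_interfaces` is a hypothetical helper.

```python
from typing import List, Set, Tuple

# Relationship = (entity, relationship name, entity).
# Endpoints and names are assumed; they are not read off Figure 4.8.
Relationship = Tuple[str, str, str]

relationships: List[Relationship] = [
    ("Report", "has", "R-sec"),
    ("Report", "has", "R-abbr"),
    ("Author", "does", "Report"),
    ("Department", "has", "Author"),
    ("Project", "has", "Report"),
    ("Contractor", "does", "Report"),
]


def cluster_interfaces(
    rels: List[Relationship], members: Set[str], cluster_name: str
) -> List[Tuple[str, str, str]]:
    """Return (external entity, relationship, label outside the EC box).

    Relationships with both ends inside the cluster are hidden; those that
    cross the box are kept. The label is the internal entity's name, or an
    empty string when that entity is the one the cluster is named for.
    """
    interfaces = []
    for left, name, right in rels:
        inside = [e for e in (left, right) if e in members]
        outside = [e for e in (left, right) if e not in members]
        if len(inside) == 1:  # exactly one end inside: crosses the box
            internal, external = inside[0], outside[0]
            label = "" if internal == cluster_name else internal
            interfaces.append((external, name, label))
    return interfaces


report_cluster = {"Report", "R-sec", "R-abbr", "Author"}
interfaces = cluster_interfaces(relationships, report_cluster, "Report")
for row in interfaces:
    print(row)
```

Under these assumed relationships, the crossing relationship that ends at Author keeps the label Author, while the two that end at Report carry no label because Report is the entity the cluster is named for.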
4.5.2 Grouping Operations
Grouping operations are the fundamental components of the entity clustering technique. They define what collections of entities and relationships comprise higher-level objects, the entity clusters. The operations are heuristic in nature and include the following (see Figure 4.9).
FIGURE 4.9
Grouping operations: (a) dominance grouping, (b) abstraction grouping, (c) constraint grouping, and (d) relationship grouping.
■ Dominance grouping.
■ Abstraction grouping.
■ Constraint grouping.
■ Relationship grouping.
These grouping operations can be applied recursively or used in a variety of combinations to produce higher-level entity clusters, that is, clusters at any level of abstraction. An entity or entity cluster may be an object that is subject to combinations with other objects to form the next higher level. That is, entity clusters have the properties of entities and can have relationships with any other objects at any equal or lower level. The original relationships among entities are preserved after all grouping operations, as illustrated in Figure 4.8.
Dominant objects or entities normally become obvious from the ER diagram or the relationship definitions. Each dominant object is grouped with all its related nondominant objects to form a cluster. Weak entities can be attached to an entity to make a cluster. Multilevel data objects using abstractions such as generalization and aggregation can be grouped into an entity cluster. The supertype or aggregate entity name is used as the entity cluster name. Constraint-related objects that extend the ER model to incorporate integrity constraints, such as the exclusive-OR, can be grouped into an entity cluster. Additionally, ternary or higher-degree relationships potentially can be grouped into an entity cluster. The cluster represents the relationship as a whole.
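Of the four operations, dominance grouping is the simplest to sketch. The example below assumes the analyst has already decided which entity dominates which and supplies that as a list of pairs; the function name `dominance_grouping` and the choice to name each cluster after its dominant entity are assumptions for illustration. The sample data is the Report case from Figure 4.8.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def dominance_grouping(dominates: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group each dominant entity with the entities it dominates.

    `dominates` holds (dominant, dominated) pairs that an analyst reads
    off the ER diagram; deciding dominance is not automated here.
    """
    clusters: Dict[str, List[str]] = defaultdict(list)
    for dominant, dominated in dominates:
        clusters[dominant].append(dominated)
    # Assumption: the cluster takes the dominant entity's name and
    # includes that entity as its first member.
    return {name: [name] + deps for name, deps in clusters.items()}


pairs = [
    ("Report", "R-sec"),
    ("Report", "R-abbr"),
    ("Report", "Author"),
]
grouped = dominance_grouping(pairs)
print(grouped)
```

Entities that appear in no pair, such as Department, Contractor, and Project, stay ungrouped, which matches their status as nondominated entities in the example.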
4.5.3 Clustering Technique
The grouping operations and their order of precedence determine the individual activities needed for clustering. We can now learn how to build a root entity cluster from the elementary entities and relationships defined in the ER modeling process. This technique assumes that a top-down analysis has been performed as part of the database requirement analysis and that the analysis has been documented so that the major functional areas and subareas are identified. Functional areas are often defined by an enterprise's important organizational units, business activities, or, possibly, by dominant applications for processing information. As an example, recall Figure 4.3 (reconstructed in Figure 4.10), which can be thought of as having three major functional areas: company organization (Division, Department), project management (Project, Skill, Location, Employee), and employee data (Manager, Secretary, Engineer, Technician, Prof-assoc, Workstation, and Desktop). Note that the functional areas are allowed to overlap.
Figure 4.10 uses an ER diagram resulting from the database requirement analysis to show how clustering involves a series of bottom-up steps using the basic grouping operations. The following list explains these steps.
FIGURE 4.10
ER diagram: clustering technique. The diagram marks three functional areas: employee data, project management, and company organization.
1. Define points of grouping within functional areas. Locate the dominant entities in a functional area through natural relationships, local n-ary relationships, integrity constraints, abstractions, or just the central focus of many simple relationships. If such points of grouping do not exist within an area, consider a functional grouping of a whole area.
2. Form entity clusters. Use the basic grouping operations on elementary entities and their relationships to form higher-level objects, or entity clusters. Because entities may belong to several potential clusters, we need to have a set of priorities for forming entity clusters. The following rules, listed in priority order, are the ones most likely to preserve the clarity of the conceptual model:
a. Entities to be grouped into an entity cluster should exist within the same functional area; that is, the entire entity cluster should occur within the boundary of a functional area. For example, in Figure 4.10, the relationship between Department and Employee should not be clustered unless Employee is included in the company organization functional area with Department and Division. In another example, the relationship between the supertype Employee and its subtypes could be clustered within the employee data functional area.
b. If a conflict in choice between two or more potential entity clusters cannot be resolved (e.g., between two constraint groupings at the same level of precedence), leave these entity clusters ungrouped within their functional area. If that functional area remains cluttered with unresolved choices, define functional subareas in which to group unresolved entities, entity clusters, and their relationships.
3. Form higher-level entity clusters. Apply the grouping operations recursively to any combination of elementary entities and entity clusters to form new levels of entity clusters (higher-level objects). Resolve conflicts using the same set of priority rules given in step 2. Continue the grouping operations until all the entity representations fit on a single page without undue complexity. The root entity cluster is then defined.
4. Validate the cluster diagram. Check for consistency of the interfaces (relationships) between objects at each level of the diagram. Verify the meaning of each level with the end users.
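Rule 2a in the steps above lends itself to a mechanical check. The sketch below is a hedged illustration: the functional areas follow the lists given for Figure 4.10, except that Employee is also placed in the employee data area on the assumption that the areas overlap on it (as the supertype example in rule 2a suggests). The function `containing_area` is a hypothetical helper.

```python
from typing import Dict, Optional, Set

functional_areas: Dict[str, Set[str]] = {
    "company organization": {"Division", "Department"},
    "project management": {"Project", "Skill", "Location", "Employee"},
    # Employee is added here on the assumption that the areas overlap on it.
    "employee data": {
        "Employee", "Manager", "Secretary", "Engineer", "Technician",
        "Prof-assoc", "Workstation", "Desktop",
    },
}


def containing_area(
    candidate: Set[str], areas: Dict[str, Set[str]]
) -> Optional[str]:
    """Return the first functional area that contains every entity of the
    candidate cluster, or None if the cluster would cross a boundary."""
    for area, entities in areas.items():
        if candidate <= entities:  # subset test
            return area
    return None


crossing = containing_area({"Department", "Employee"}, functional_areas)
subtypes = containing_area(
    {"Employee", "Manager", "Secretary", "Engineer", "Technician"},
    functional_areas,
)
print(crossing)   # None: Department and Employee share no single area
print(subtypes)   # employee data
```

A candidate that returns `None` is left ungrouped at this level, in line with the Department and Employee example in rule 2a.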
The result of one round of clustering is shown in Figure 4.11, where each of the clusters is shown at level 2.