Data Modeling Essentials 2005, Part 4

are mutually exclusive, a more common situation than you might suspect. We can indicate this with an exclusivity arc (Figure 4.13).

We have previously warned against introducing too many additional conventions and symbols. However, the exclusivity arc is useful enough to justify the extra complexity, and it is even supported by some CASE tools.[8]

As well as highlighting opportunities to generalize relationships, the exclusivity arc can suggest potential entity class supertypes. In Figure 4.13, we are prompted to supertype Company, Individual, Partnership, and Government Body, perhaps to Taxpayer (Figure 4.14).

We find that we use exclusivity arcs quite frequently during the modeling process. In some cases, they do not make it from the whiteboard to the final conceptual model, being replaced with a single relationship to the supertype. Of course, if your CASE tool does not support the convention and you wish to retain the arc, rather than supertype, you will need to record the rule in supporting documentation.

140 ■ Chapter 4 Subtypes and Supertypes

[Figure 4.12: Generalization of one-to-many relationships. Several specific relationships between Person and Insurance Policy (be insured under / insure, be beneficiary of / nominate as beneficiary, be contact for / have as contact, hold as security / be assigned as security to) are generalized to a single relationship (be involved in / involve).]

[8] Notably Oracle Designer from Oracle Corporation. UML tools we have reviewed support arcs but apparently only between pairs of relationships.

Simsion-Witt_04 10/8/04 7:40 PM Page 140

4.14.3 Generalizing One-to-Many and Many-to-Many Relationships

Our final example involves many-to-many relationships, along with two one-to-many relationships (see Figure 4.15). The generalization should be fairly obvious, but you need to recognize that if you include the one-to-many relationships in the generalization, you will lose the rules that only one employee can fill a position or act in a position.
(Conversely, you will gain the ability to break those rules.)

[Figure 4.13: Diagramming convention for mutually exclusive relationships. Tax Assessment has a relationship (be for / be the subject of) with each of Company, Individual, Partnership, and Government Body; an exclusivity arc spans the four relationships.]

[Figure 4.14: Entity class generalization prompted by mutually exclusive relationships. Tax Assessment has a single relationship (be for / be the subject of) with Taxpayer.]

4.15 Theoretical Background

In 1977 Smith and Smith published an important paper entitled “Database Abstractions: Aggregation and Generalization,”[9] which recognized that the two key techniques in data modeling were aggregation/disaggregation and generalization/specialization. Aggregation means “assembling component parts,” and disaggregation means “breaking down into component parts.” In data modeling terms, examples of disaggregation include breaking up Order into Order Header and Ordered Item, or Customer into Name, Address, and Birth Date. This is quite different from specialization and generalization, which are about classifying rather than breaking down. It may be helpful to think of disaggregation as “widening” a model and specialization as “deepening” it.

Many texts and papers on data modeling focus on disaggregation, particularly through normalization. Decisions about the level of generalization are often hidden or dismissed as “common sense.” We should be very suspicious of this; before the rules of normalization were formalized, that process too was regarded as just a matter of common sense.[10]

[Figure 4.15: Generalizing one-to-many and many-to-many relationships. The relationships between Employee and Position (be eligible for, fill, be acting in, have applied for, have filled) are candidates for a single generalized relationship, shown in the figure as "????".]

[9] ACM Transactions on Database Systems, Vol. 2, No. 2 (1977).
[10] Research in progress by Simsion has shown that experienced modelers not only vary in the level of generalization that they choose for a particular problem, but also may show a bias toward higher or lower levels of generalization across different problems (see www.simsion.com.au).

In this book, and in day-to-day modeling, we try to give similar weight to the generalization/specialization and aggregation/disaggregation dimensions.

4.16 Summary

Subtypes and supertypes are used to represent different levels of entity class generalization. They facilitate a top-down approach to the development and presentation of data models and a concise documentation of business rules about data. They support creativity by allowing alternative data models to be explored and compared.

Subtypes and supertypes are not directly implemented by standard relational DBMSs. The logical and physical data models therefore need to be subtype-free.

By adopting the convention that subtypes are nonoverlapping and exhaustive, we can ensure that each level of generalization is a valid implementation option. The convention results in the loss of some representational power, but it is widely used in practice.

Chapter 5
Attributes and Columns

“There’s a sign on the wall but she wants to be sure
’Cause you know sometimes words have two meanings”
– Page/Plant: Stairway to Heaven, © Superhype Publishing Inc.

“Sometimes the detail wags the dog”
– Robert Venturi

5.1 Introduction

In the last two chapters, we focused on entity classes and relationships, which define the high-level structure of a data model. We now return to the “nuts and bolts” of data: attributes (in the conceptual model) and columns (in the logical and physical models).
The translation of attributes into columns is generally straightforward,[1] so in our discussion we will usually refer only to attributes unless it is necessary to make a distinction.

At the outset, we need to say that attribute definition does not always receive the attention it deserves from data modelers. One reason is the emphasis on diagrams as the primary means of presenting a model. While they are invaluable in communicating the overall shape, they hide the detail of attributes. Often many of the participants in the development and review of a model see only the diagrams and remain unaware of the underlying attributes.

A second reason is that data models are developed progressively; in some cases the full requirements for attributes become clear only toward the end of the modeling task. By this time the specialist data modeler may have departed, leaving the supposedly straightforward and noncreative job of attribute definition to database administrators, process modelers, and programmers. Many data modelers seem to believe that their job is finished when a reasonably stable framework of entity classes, relationships, and primary keys is in place.

On the contrary, the data modeler who remains involved in the development of a data model right through to implementation will be in a good position not only to ensure that attributes are soundly modeled as the need for them arises, but also to intercept “improvements” to the model before they become entrenched.

[1] We discuss the specifics of the translation of attributes (and relationships) into columns, together with the addition of supplementary columns, in Chapter 11.

In Chapter 2 we touched on some of the issues that arise in modeling attributes (albeit in the context of looking at columns in a logical model). In this chapter we look at these matters more closely.
We look first at what makes a sound attribute and definition, and then introduce a classification scheme for attributes, which enables us to discuss the different types of attributes in some detail. The classification scheme also provides a starting point for constructing attribute names. Naming of attributes is far more of an issue than naming of entity classes and relationships, if only because the number of attributes in a model is so much greater.

The chapter concludes with a discussion of the role of generalization in the context of attributes. As with entity-relationship modeling, we have some quite firm rules for aggregation, whereas generalization decisions often involve trade-offs among conflicting objectives. And, as always, there is room for choice and sometimes creativity.

5.2 Attribute Definition

Proper definitions are an essential starting point for detailed modeling of attributes. In the early stages of modeling, we propose and record attributes before even the entity classes are fully defined, but our final model must include an unambiguous definition of each attribute. If we fail to do this, we are likely to overlook the more subtle issues discussed in this chapter and run the risk that the resulting columns in the database will be used inappropriately by programmers or users. Poor attribute definitions have the same potential to compromise data quality as poor entity class definitions (see Section 3.4.3).

Definitions need not be long: a single line is often enough if the parent entity class is well defined. In essence, we need to know what the attribute is intended to record, and how to interpret the values that it may take. More formally, a good attribute definition will:

1. Complete the sentence: “Assignment of a value to the <attribute name> for an instance of <entity class name> is a record of . . .”; for example: “Assignment of a value to the Fee Exemption Minimum Balance for an instance of Account is a record of the minimum amount which must be held in this Account at all times to qualify for exemption from annual account keeping fees.” As in this example, the definition should refer to a single instance (e.g., “The date of birth of this Customer,” “The minimum amount of a transaction that can be made by a Customer against a Product of this type”).

2. Answer the questions “What does it mean to assign a value to this attribute?” and “What does each value that can be assigned to this attribute mean?”

It can be helpful to imagine that you are about to enter data into a data entry form or screen that will be loaded into an instance of the attribute. What information will you need in order to answer the following questions:
■ What fact about the entity instance are you providing information about?
■ What value should you enter to state that fact?

For a column to be completely defined in a logical data model, the following information is also required (although ideally your documentation tool will provide facilities for recording at least some of it in a more structured manner than writing it into the definition):
■ What type of column it is (e.g., character, numeric)
■ Whether it forms part of the primary key or identifier of the entity class
■ What constraints (business rules) it is subject to, in particular whether it is mandatory (must have a value for each entity instance), and the range or set of allowed values
■ Whether these constraints are to be managed by the system or externally
■ The likelihood that these constraints will change during the life of the system
■ (For some types of attribute) the internal and external representations (formats) that are to be used.
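The checklist above can be held as a structured record rather than as free text in the definition. The following Python sketch is illustrative only: the class name `ColumnDefinition` and every field name are hypothetical, chosen here to mirror the bullet points, and are not taken from any particular documentation or CASE tool.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ColumnDefinition:
    """Hypothetical record of the information listed for a logical-model column."""

    entity_class: str                 # parent entity class, e.g. "Account"
    name: str                         # attribute/column name
    records: str                      # completes "... is a record of ..."
    datatype: str                     # e.g. "character", "numeric"
    in_primary_key: bool = False      # part of the primary key or identifier?
    mandatory: bool = False           # must every instance have a value?
    allowed_values: Optional[Sequence[str]] = None  # None: no fixed set recorded
    system_enforced: bool = True      # False: constraints are managed externally
    constraints_likely_to_change: bool = False
    external_format: Optional[str] = None  # representation shown to users, if any

    def definition_sentence(self) -> str:
        """Build the definition using the sentence pattern given in the text."""
        return (
            f"Assignment of a value to the {self.name} for an instance of "
            f"{self.entity_class} is a record of {self.records}."
        )


# The Fee Exemption Minimum Balance example from the text; the datatype
# chosen here is an assumption made for illustration.
fee_exemption = ColumnDefinition(
    entity_class="Account",
    name="Fee Exemption Minimum Balance",
    records=(
        "the minimum amount which must be held in this Account at all times "
        "to qualify for exemption from annual account keeping fees"
    ),
    datatype="numeric",
)

print(fee_exemption.definition_sentence())
```

Keeping the constraint information in separate fields, rather than inside the definition text, makes it possible to query the documentation (for example, to list every column whose constraints are managed externally).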
In a conceptual data model, by contrast, we do not need to be so prescriptive, and we are also providing the business stakeholders a view of how their information requirements will be met rather than a detailed first cut database design, so we need to provide the following information for each attribute:
■ What type of attribute it is in business terms (see Section 5.4)
■ Any important business rules to which it is subject.

5.3 Attribute Disaggregation: One Fact per Attribute

In Chapter 2 we introduced the basic rule for attribute disaggregation: one fact per attribute. It is almost never technically difficult to achieve this, and it generally leads to simpler programming, greater reusability of data, and easier implementation of change. Normalization relies on this rule being observed; otherwise we may find “dependencies” that are really dependencies on only part of an attribute. For example, Bank Name may be determined by a three-part Bank-State-Branch Number, but closer examination might show that the dependency is only on the “Bank” part of the Number.

Why, then, is the rule so often broken in practice? Violations (sometimes referred to as overloaded attributes) may occur for a variety of reasons, including:

1. Failing to identify that an attribute can be decomposed into more fundamental attributes that are of value to the business
2. Attempting to achieve greater efficiency through data compression
3. Reflecting the fact that the compound attribute is more often used by the business than are its components
4. Relying on DBMS or programming facilities to perform “trivial” decomposition when required
5. Confusing the way data is presented with the way it is stored
6. Handling variable length and “semistructured” attributes (e.g., addresses)
7. Changing the definition of attributes after the database is implemented as an alternative to changing the database design
8. Complying with external standards or practices
9. Perpetuating past practices, which may have resulted originally from 1 through 8 above.

In our experience, most problems occur as a result of attribute definition being left to programmers or analysts with little knowledge of data modeling. In virtually all cases, a solution can be found that meets requirements without compromising the “one fact per attribute” rule. Compliance with external standards or user wishes is likely to require little more than a translation table or some simple data formatting and unpacking between screen and database.

However, as in most areas of data modeling, rigid adherence to the rule will occasionally compromise other objectives. For example, dividing a date attribute into components of Year, Month, and Day may make it difficult to use standard date manipulation routines. When conflicts arise, we need to go back to first principles and look at the total impact of each option.

The most common types of violation are discussed in the following sections.

5.3.1 Simple Aggregation

An example of simple aggregation is an attribute Quantity Ordered that includes both the numeric quantity and the unit of measure (e.g., “12 cases”). Quite obviously, this aggregation of two different facts restricts our ability to compare quantities and perform arithmetic without having to “unpack” the data. Of course, if the business was only interested in Quantity Ordered as, for example, text to print on a label, we would have an argument for treating it as a single attribute (but in this case we should surely review the attribute name, which implies that numeric quantity information is recorded).
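The cost of the aggregated Quantity Ordered can be made concrete with a small Python sketch. It is a minimal illustration under stated assumptions: the names `OrderedQuantity`, `unpack_quantity_ordered`, and `total_quantity` are hypothetical, and the aggregated value is assumed to have the form "number, space, unit" as in the “12 cases” example.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Iterable


@dataclass(frozen=True)
class OrderedQuantity:
    """Disaggregated form: one fact per attribute."""

    quantity: Decimal        # the numeric quantity
    unit_of_measure: str     # the unit, e.g. "cases"


def unpack_quantity_ordered(text: str) -> OrderedQuantity:
    """Split an aggregated value such as "12 cases" into its two facts.

    Assumes the number comes first and is separated from the unit by a space.
    """
    number, _, unit = text.strip().partition(" ")
    if not unit.strip():
        raise ValueError(f"no unit of measure found in {text!r}")
    return OrderedQuantity(Decimal(number), unit.strip())


def total_quantity(items: Iterable[OrderedQuantity], unit: str) -> Decimal:
    """Arithmetic is only meaningful once quantity and unit are separate."""
    return sum(
        (item.quantity for item in items if item.unit_of_measure == unit),
        Decimal(0),
    )


orders = [unpack_quantity_ordered(s) for s in ("12 cases", "3 cases", "40 bottles")]
print(total_quantity(orders, "cases"))  # prints 15
```

If the two facts are stored separately in the first place, the unpacking step disappears and only the arithmetic remains; with the aggregated attribute, every consumer of the data has to repeat the parsing and agree on its format.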
A good test as to whether an attribute is fully decomposed is to ask:
■ Does the attribute correspond to a single business fact? (The answer should be “Yes.”)
■ Can the attribute be further decomposed into attributes that themselves correspond to meaningful business facts? (The answer should be “No.”)
■ Are there business processes that update only part of the attribute? (The answer should be “No.”) We should also look at processes that read the attribute (e.g., for display or printing). However, if the reason for using only part of the attribute is merely to provide an abbreviation of the same fact as represented by the whole, there is little point in decomposing the attribute to reflect this.
■ Are there dependencies (potentially affecting normalization) that apply to only part of the attribute? (The answer should be “No.”)

Let’s look at a more complex example in this light. A Person Name attribute might be a concatenation of salutation (Prof.), family name (Deng), given names (Chan, Wei), and suffixes, qualifications, titles, and honorifics (e.g., Jr., MBA, DFC). Will the business want to treat given names individually (in which case we will regard them as forming a repeating group and normalize them out to a separate entity class)? Or will it be sufficient to separate First Given Name (and possibly Preferred Given Name, which cannot be automatically extracted) from Other Given Names? Should we separate the different qualifications? It depends on whether the business is genuinely interested in individual qualifications, or simply wants to address letters correctly.

To answer these questions, we need to consider the needs of all potential users of the database, and employ some judgment as to likely future requirements. Experienced data modelers are inclined to err on the side of disaggregation, even if familiar attributes are broken up in the process.
The situation has parallels with normalization, in which familiar concepts (e.g., Invoice) are broken into less obvious components (in this case Invoice Header, Invoice Item) to achieve a technically better structure.

But most of us would not split First Given Name into Initial and Remainder of Name, even if there was a need to deal with the initials separately. We can verify this decision by using the questions suggested earlier:
■ “Does First Given Name correspond to a single business fact?” Most people would agree that it does. This provides a strong argument that we are already at a “one fact per attribute” level.

[...]

... the float datatype can be used. The decimal datatype requires the number of digits after the decimal point to be specified. If the decimal datatype ...

Figure 5.4 Identifier capacities.

Length    Number accommodated
          (var)char          integer
1         36                 127
2         1,296              32,767
3         46,656
4         1,679,616          2,147,483,647
5         60,466,176
6         2,176,782,336
7         78,364,164,096
8         2.82×10^12

[...]

... Product Description

5.4.4 Column Datatype and Length Requirements

We now look at the translation of attribute types into column datatypes. If your DBMS does not support UDTs (user-defined datatypes), you should assign to each column the appropriate DBMS datatype (as indicated in Sections 5.4.4.1 through 5.4.4.4). If, however, you are using an SQL99-compliant DBMS that ...

[...]

... a data quality problem as a failure to get correct and complete data into the database. (Data quality is not only about getting the right data into the system; it is also about correctly interpreting the data in the system.) Indeed data quality can be compromised by any of the following:
■ Data-capture errors (not only invalid data getting into the database but also the failure of all required data to get into the database)
■ Data-interpretation errors (when users misinterpret data)
■ Data-processing errors (when developers misinterpret data processing requirements)

Thus, correct interpretation of data structures is essential by data entry personnel, data users, and developers. There are various views on how one might interpret the meaning of a data item in a database. Practitioners and ...

[...]

... Standardization of DBMS datatype usage

5.4.2 The Attribute Taxonomy in Detail

5.4.2.1 Identifiers

Identifiers may be system-generated, administrator-defined, or externally defined. Examples of system-generated identifiers are Customer Numbers,

[4] Tasker, D., Fourth Generation Data: A Guide to Data Analysis for New and Old Systems, Prentice-Hall, Australia (1989). This book is currently out of print.

[...]

... datatypes need to be recorded as being at 00:00 on the day after the actual date (but displayed correctly!).

Months should probably use the datatype suitable for dates and standardize the day to the 1st of the month.

7. Years should use the integer2 datatype.
8. Times of Day can use the datatype defined in the DBMS to record date and time together if there is no specific datatype ...

[...]

... in Section 5.4.5.

13. If there is a specific datatype in the DBMS to hold position data, it should be used for Locations. If not, the most common solution is to use a coordinate system (e.g., represent a point by two decimal columns holding the x and y coordinates, a line segment by the x and y coordinates of each end, a polygon by the x and y coordinates of each vertex, and so on).

5.4.4.4 Text Attributes

[...]

... using Y and N. Section 5.4.5 discusses conversion between external and internal representations.

5.4.4.3 Quantifiers

1. Counts should use the integer datatype. The length should be sufficient to accommodate the maximum value (e.g., if more than 32,767 use a 4-byte integer, otherwise if more than 127 use a 2-byte integer).
2. Dimensions, Factors, and Intervals should generally use a decimal datatype if available ...

[...]

... in the database using specialized currency or money datatypes, integer datatypes (holding cents), or decimal datatypes (holding dollars and two decimal places); the meaningfulness of comparisons between Price attributes and other attributes is quite independent of the DBMS datatypes we choose. Meaningfulness of comparison is therefore a property of the attributes that form part of the conceptual data ...

[...]

... other options

5.4.4.2 Categories

If a Category attribute is represented internally using the same character strings as are used externally, the char or varchar datatype should be used with a length sufficient to accommodate the longest character string. If (as is more usually the case) it is represented internally using a shorter code, the char or varchar datatype should ...
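The identifier capacities listed under Figure 5.4 and the length rule quoted for Counts can both be checked with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, and the 36-character alphabet (26 letters plus 10 digits) is inferred from the figure's values rather than stated in the excerpt.

```python
def char_identifier_capacity(length: int, alphabet_size: int = 36) -> int:
    """Number of distinct identifiers a (var)char column of this length can hold.

    alphabet_size=36 is an assumption (letters A-Z plus digits 0-9) that
    matches the figures 36, 1,296, 46,656, ... shown for Figure 5.4.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    return alphabet_size ** length


def integer_bytes_for(max_value: int) -> int:
    """Smallest signed integer length (1, 2, or 4 bytes) for a maximum value.

    Follows the rule quoted for Counts: more than 32,767 needs 4 bytes,
    otherwise more than 127 needs 2 bytes.
    """
    if max_value > 2_147_483_647:
        raise ValueError("value exceeds the 4-byte signed integer range")
    if max_value > 32_767:
        return 4
    if max_value > 127:
        return 2
    return 1


for length in range(1, 9):
    print(length, f"{char_identifier_capacity(length):,}")

print(integer_bytes_for(40_000))  # prints 4
```

The comparison the figure invites is that a 4-character alphanumeric identifier accommodates 1,679,616 values, while a 4-byte integer accommodates 2,147,483,647.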
