• Product quality focuses on the characteristics of the product itself. The approach is to carry out inspections of the finished product, look for defects, and correct them.

• Process quality focuses on the characteristics of the process used to build the product. The focus of process quality lies on defect prevention rather than detection, and it aims to reduce reliance on mass inspections as a way of achieving quality [8].

In the context of DBs, product quality relates to characteristics of the data model and the data itself (the product), while process quality relates to how data models are developed and how the data are collected and loaded (the process). This chapter focuses on product quality.

We refer to information quality in a wide sense as comprising DB system quality and data presentation quality (see Figure 14.2). In fact, it is important that data in the DB correctly reflect the real world, that is, that the data are accurate. It is also important for the data to be easy to understand. In DB system quality, three different aspects can be considered: DBMS quality, data model quality (both conceptual and logical), and data quality. This chapter deals with data model quality and data quality. To assess DBMS quality, we can use an international standard such as ISO 9126 [9] or one of the existing product comparative studies (e.g., [10] for ODBMS evaluation).

[Figure 14.2 Information and DB quality: information quality comprises database quality (DBMS quality, data model quality, and data quality) and presentation quality.]

Unfortunately, until a few years ago, quality issues focused on software quality [3, 9, 11-14], disregarding DB quality [15]. Even in traditional DB design, quality-related aspects have not been explicitly incorporated [16]. Although DB research and practice have not traditionally focused on quality-related subjects, many of the tools and techniques developed (integrity constraints, normalization theory, transaction management) have influenced data quality. It is time to consider information quality as a main goal to achieve, rather than a by-product of DB creation and development processes.

Most of the works on the evaluation of both data quality and data model quality propose only lists of criteria or desirable properties without providing any quantitative measures. The development of these properties is usually based on practical experience, intuitive analysis, and reviews of relevant literature. Quality criteria are not enough on their own to ensure quality in practice, because different people will generally have different interpretations of the same concept. According to the total quality management (TQM) literature, measurable criteria for assessing quality are necessary to avoid "arguments of style" [17]. Measurement is also fundamental to the application of statistical process control, one of the key techniques of the TQM approach [8]. The objective should be to replace intuitive notions of design quality with formal, quantitative measures in order to reduce subjectivity and bias in the evaluation process. However, defining reliable and objective measures of quality in software development is a difficult task.

This chapter is an overview of the main issues relating to the assessment of DB quality. It addresses data model quality and also considers data (values) quality.

14.2 Data Model Quality

A data model is a collection of concepts that can be used to describe a set of data and the operations to manipulate those data.
There are two types of data models: conceptual data models (e.g., the E/R model), which are used in DB design, and logical models (e.g., the relational, hierarchical, and network models), which are supported by DBMSs. Using conceptual models, one can build a description of reality that is easy to understand and interpret. Logical models support data descriptions that can be processed by a computer through a DBMS. In the design of DBs, we use conceptual models first to produce a high-level description of the reality; then we translate the conceptual model into a logical model.

Although the data modeling phase represents only a small portion of the overall development effort, its impact on the final result is probably greater than that of any other phase [18]. The data model forms the foundation for all later design work and is a major determinant of the quality of the overall system design [19, 20]. Improving the quality of the data model, therefore, is a major step toward improving the quality of the system being developed. The process of building quality data models begins with an understanding of the big picture of model quality and the role that data models play in the development of ISs.

There are no generally accepted guidelines for evaluating the quality of data models, and little agreement even among experts as to what makes a "good" data model [21]. As a result, the quality of data models produced in practice is almost entirely dependent on the competence of the data modeler. When systems analysts and users inspect different data models from the same universe of discourse, they often perceive that some models are, in some sense, better than others, but they may have difficulty explaining why. Therefore, an important concern is to clarify what is meant by a good data model, that is, a data model of high quality.

Quality in data modeling is frequently defined as a list of desirable properties for a data model [22-27]. By understanding each property and planning your modeling approach to address each one, you can significantly increase the likelihood that your data models will exhibit characteristics that render them useful for IS design. The quality factors are usually based on practical experience, intuitive analysis, and reviews of relevant literature. Although such lists provide a useful starting point for understanding and improving quality in data modeling, they are mostly unstructured, use imprecise definitions, often overlap, often confuse properties of models with properties of languages and methods, and often set goals that are unrealistic or even impossible to reach [28].

Expert data modelers intuitively know what makes a good data model, but such knowledge can generally be acquired only through experience. For data modeling to progress from a craft to an engineering discipline, the desirable qualities of data models need to be made explicit [22]. The conscious listing (or bringing to the surface) of those qualities helps to identify areas on which attention needs to be focused. This can act as a guide for improving the model and exploring alternatives. Not only is the definition of quality factors important for evaluating data models, but we also have to consider other elements that allow any two data models, no matter how different they may be, to be compared precisely, objectively, and comprehensively [29].
So, in this chapter, we propose and describe the following elements: quality factors, stakeholders, quality concepts, improvement strategies, quality metrics, and weightings.

14.2.1 Quality Factors

The literature on quality in data modeling offers many definitions of quality factors. We list here the most relevant ones:

• Completeness. Completeness is the ability of the data model to meet all user information and functional requirements.

• Correctness. Correctness indicates whether the model conforms to the rules of the data modeling technique in use.

• Minimality. A data model is minimal when every aspect of the requirements appears only once in the data model. In general, it is better to avoid redundancies.

• Normality. Normality comes from the theory of normalization associated with the relational data model; it aims at keeping the data in a clean, purified normal form.

• Flexibility. Flexibility is defined as the ease with which the data model can be adapted to changes in requirements.

• Understandability. Understandability is defined as the ease with which the concepts and structures in the data model can be understood by users of the model.

• Simplicity. Simplicity relates to the size and complexity of the data model. Simplicity depends not on whether the terms in which the model is expressed are well known or understandable but on the number of different constructs required.

While it is important to separate the various dimensions of value for the purposes of analysis, it is also important to bear in mind the interactions among qualities. In general, some objectives will interfere or conflict with each other; others will have common implications, or concur; and still others will not interact at all.

14.2.2 Stakeholders

Stakeholders are people involved in building or using the data model; therefore, they have an interest in its quality. Different stakeholders will generally be interested in different quality factors.

Different people will have different perspectives on the quality of a data model. An application developer may view quality as ease of implementation, whereas a user may view it as satisfaction of requirements. Both viewpoints are valid, but they need not coincide. Part of the confusion about which model is best and how models should be evaluated is caused by differences between such perspectives.

The design of effective systems depends on the participation and satisfaction of all relevant stakeholders in the design process. An important consideration, therefore, in developing a framework for evaluating data models is to consider the needs of all stakeholders. This requires identifying the stakeholders and then incorporating their perceptions of value for a data model into the framework. The following people are the key stakeholders in the data modeling process:

• Users. Users are involved in the process of developing the data model and verifying that it meets their requirements. Users are interested in the data model to the extent that it will meet their current and future requirements and that it represents value for money.

• DB designer. The DB designer is responsible for developing the data model and is concerned with satisfying the needs of all stakeholders while ensuring that the model conforms to the rules of good data modeling practice.

• Application developer. The application developer is responsible for implementing the data model once it is finished. Application developers will be primarily concerned with whether the model can be implemented given the time, budget, resource, and technology constraints.
• Data administrator. The data administrator is responsible for ensuring that the data model is integrated with the rest of the organization's data. The data administrator is primarily concerned with ensuring data shareability across the organization rather than with the needs of specific applications.

All these perspectives are valid and must be taken into consideration during the design process. The set of qualities defined as part of the framework should be developed by coalescing the interests and requirements of the various stakeholders involved. It is only from a combination of perspectives that a true picture of data model quality can be established.

14.2.3 Quality Concepts

It is useful to classify quality according to Krogstie's framework [30] (see Figure 14.3). The quality concepts are defined as follows:

• Syntactic quality is the adherence of a data model to the syntax rules of the modeling language.

• Semantic quality is the degree of correspondence between the data model and the universe of discourse.

• Perceived semantic quality is the correspondence between the stakeholders' knowledge and the stakeholders' interpretation.

• Pragmatic quality is the correspondence between a part of a data model and the relevant stakeholders' interpretation of it.

• Social quality has the goal of feasible agreement among stakeholders, where inconsistencies among the various stakeholders' interpretations of the data model are resolved. Relative agreement (stakeholders' interpretations may differ but remain consistent) is more realistic than absolute agreement (all stakeholders' interpretations are the same).

[Figure 14.3 Quality concepts: an example E/R diagram (a library schema) is related to the modeling language, the universe of discourse, the stakeholders' knowledge, and the stakeholders' interpretation; these relations define syntactic, semantic, perceived semantic, pragmatic, and social quality.]

Each quality concept has different goals that must be satisfied. If some of those goals are not attained, we can think about an improvement strategy.

14.2.4 Improvement Strategies

An improvement strategy is a process or activity that can be used to increase the value of a data model with respect to one or more quality factors. Strategies may involve the use of automated techniques as well as human judgment and insight. Rather than simply identifying what is wrong with a model or where it could be improved, we need to identify methods for improving the model. Of course, it is not possible to reduce the task of improving data models to a mechanical process, because that requires invention and insight, but it is useful to identify general techniques that can help improve the quality of data models. In general, an improvement strategy may improve a data model on more than one dimension. However, because of the interactions between qualities, increasing the value of a model on one dimension may decrease its value on other dimensions.
14.2.5 Quality Metrics

Quality metrics define ways of evaluating particular quality factors in numerical terms. Developing a set of qualities and metrics for data model evaluation is a difficult task. Subjective notions of design quality are not enough to ensure quality in practice, because different people will have different interpretations of the same concept (e.g., understandability). A metric is a way of measuring a quality factor in a consistent and objective manner, and it is necessary to establish metrics for assessing each quality factor.

Software engineers have proposed a plethora of metrics for software products, processes, and resources [31, 32]. Unfortunately, almost all the metrics proposed from McCabe's cyclomatic number [33] until now have focused on program characteristics, without paying special attention to DBs. Metrics can be used to build prediction systems for DB projects [34], to understand and improve software development and maintenance projects [35], to maintain the quality of systems [36], to highlight problematic areas [37], and to determine the best ways to help practitioners and researchers in their work [38].

Metrics applied to a product must be justified by a clear theory [39]. Rigorous measurement of software attributes can provide substantial help in the evaluation and improvement of software products and processes [40, 41]. Empirical validation is necessary, not only to prove a metric's validity but also to provide limits that can be useful to DB designers. However, as de Champeaux remarks, we must be conscious that "associating with numeric ranges the qualifications good and bad is the hard part" [37].

To illustrate the concept of quality metrics, this section presents some metrics that measure the quality factor of simplicity, as applied to E/R models. All the metrics shown here are based on the concept of closed-ended metrics [42]: they are bounded in the interval [0,1], which allows data modelers to compare different conceptual models on a numerical scale. These metrics are based on complexity theory, which defines the complexity of a system by the number of components in the system and the number of relationships among the components. Because the aim is to simplify the E/R model, the objective will be to minimize the value of these metrics. (A small computational sketch of these metrics is given after the list.)

• The RvsE metric measures the relation that exists between the number of relationships and the number of entities in an E/R model. It is based on the M_RPROP metric proposed by Lethbridge [42]. We define this metric as follows:

  RvsE = (N_R / (N_R + N_E))^2

  where N_R is the number of relationships in the E/R model, N_E is the number of entities in the E/R model, and N_R + N_E > 0. When we calculate the number of relationships (N_R), we also consider the IS_A relationships; in this case, we count one relationship for each child-parent pair in the IS_A hierarchy.

• The DA metric is the number of derived attributes in the E/R model, divided by the maximum number of derived attributes that may exist in an E/R model (all attributes in the E/R model except one). An attribute is derived when its value can be calculated or deduced from the values of other attributes. We define this metric as follows:

  DA = N_DA / (N_A − 1)

  where N_DA is the number of derived attributes in the E/R model, N_A is the number of attributes in the E/R model, and N_A > 1.
  When we calculate the number of attributes in the E/R model (N_A), in the case of composite attributes we count each of their simple attributes.

• The CA metric assesses the number of composite attributes compared with the number of attributes in an E/R model. A composite attribute is an attribute composed of a set of simple attributes. We define this metric as follows:

  CA = N_CA / N_A

  where N_CA is the number of composite attributes in the E/R model, N_A is the number of attributes in the E/R model, and N_A > 0. When we calculate the number of attributes in the E/R model (N_A), in the case of composite attributes we count each of their simple attributes.

• The RR metric is the number of redundant relationships in an E/R model, divided by the number of relationships in the E/R model minus one. Redundancy exists when one relationship R_1 between two entities has the same information content as a path of relationships R_2, R_3, …, R_n connecting exactly the same pairs of entity instances as R_1. Obviously, not all cycles of relationships are sources of redundancy; redundancy in cycles of relationships depends on meaning [22]. We define this metric as follows:

  RR = N_RR / (N_R − 1)

  where N_RR is the number of redundant relationships in the E/R model, N_R is the number of relationships in the E/R model, and N_R > 1. When we calculate the number of relationships (N_R), we also consider the IS_A relationships; in this case, we count one relationship for each child-parent pair in the IS_A hierarchy.

• The M:NRel metric measures the number of M:N relationships compared with the number of relationships in an E/R model. We define this metric as follows:

  M:NRel = N_M:NR / N_R

  where N_M:NR is the number of M:N relationships in the E/R model, N_R is the number of relationships in the E/R model, and N_R > 0. When we calculate the number of relationships (N_R), we also consider the IS_A relationships; in this case, we count one relationship for each child-parent pair in the IS_A hierarchy.

• The IS_ARel metric assesses the complexity of generalization/specialization (IS_A) hierarchies in an E/R model. It is based on the M_ISA metric defined by Lethbridge [42]. The IS_ARel metric combines two factors to measure the complexity of the inheritance hierarchy. The first factor is the fraction of entities that are leaves of the inheritance hierarchy. That measure, called Fleaf, is calculated as follows:

  Fleaf = N_Leaf / N_E

  where N_Leaf is the number of leaves in one generalization or specialization hierarchy, N_E is the number of entities in that hierarchy, and N_E > 0. Figure 14.4 shows several inheritance hierarchies along with their measures of Fleaf. Fleaf approaches 0.5 when the number of leaves is half the number of entities, as shown in Figure 14.4(c) and (d). It approaches 0 in the ridiculous case of a unary tree, as shown in Figure 14.4(c), and it approaches 1 if every entity is a subtype of the top entity, as shown in Figure 14.4(d). On its own, Fleaf has the undesirable property that, for a very shallow hierarchy (e.g., just two or three levels) with a high branching factor, it gives a measurement that is unreasonably high from a subjective standpoint; see Figure 14.4(a). To correct that problem with Fleaf, an additional factor is used in the calculation of the IS_ARel metric: the average number of direct and indirect supertypes per [...]
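To make the preceding definitions concrete, the following sketch (not part of the original chapter) computes the closed-ended simplicity metrics from the raw counts of an E/R model. The class name, function names, and the example model are illustrative assumptions, and the RvsE formula is read here as the squared proportion of relationships; only the metric definitions themselves come from the text above.

```python
# A minimal sketch (assumed, not from the original chapter): computing the
# closed-ended simplicity metrics for an E/R model from raw counts.
# All values lie in [0,1]; lower values indicate a simpler model.
from dataclasses import dataclass


@dataclass
class ERCounts:
    """Raw counts taken from an E/R diagram.

    n_relationships counts one relationship per child-parent pair of every
    IS_A hierarchy; n_attributes counts the simple attributes that make up
    each composite attribute, as the chapter prescribes.
    """
    n_entities: int         # N_E
    n_relationships: int    # N_R (IS_A child-parent pairs included)
    n_attributes: int       # N_A
    n_derived_attrs: int    # N_DA
    n_composite_attrs: int  # N_CA
    n_redundant_rels: int   # N_RR
    n_mn_rels: int          # N_M:NR


def rvse(c: ERCounts) -> float:
    """RvsE: relationships versus entities (assumed reading of the garbled
    formula as the squared proportion of relationships)."""
    return (c.n_relationships / (c.n_relationships + c.n_entities)) ** 2


def da(c: ERCounts) -> float:
    """DA = N_DA / (N_A - 1): derived attributes over the maximum possible."""
    return c.n_derived_attrs / (c.n_attributes - 1)


def ca(c: ERCounts) -> float:
    """CA = N_CA / N_A: composite attributes over all attributes."""
    return c.n_composite_attrs / c.n_attributes


def rr(c: ERCounts) -> float:
    """RR = N_RR / (N_R - 1): share of redundant relationships."""
    return c.n_redundant_rels / (c.n_relationships - 1)


def mn_rel(c: ERCounts) -> float:
    """M:NRel = N_M:NR / N_R: share of many-to-many relationships."""
    return c.n_mn_rels / c.n_relationships


def fleaf(n_leaves: int, n_entities: int) -> float:
    """Fleaf = N_Leaf / N_E for a single IS_A hierarchy."""
    return n_leaves / n_entities


if __name__ == "__main__":
    # Hypothetical model: 6 entities, 7 relationships (IS_A pairs included),
    # 20 attributes (2 derived, 1 composite), no redundant relationships,
    # and 3 M:N relationships.
    model = ERCounts(6, 7, 20, 2, 1, 0, 3)
    print(f"RvsE   = {rvse(model):.2f}")    # 0.29
    print(f"DA     = {da(model):.2f}")      # 0.11
    print(f"CA     = {ca(model):.2f}")      # 0.05
    print(f"RR     = {rr(model):.2f}")      # 0.00
    print(f"M:NRel = {mn_rel(model):.2f}")  # 0.43
    print(f"Fleaf  = {fleaf(3, 5):.2f}")    # 0.60 for a 5-entity hierarchy with 3 leaves
```

Because every value is bounded in the interval [0,1], two alternative conceptual models of the same universe of discourse can be compared metric by metric, with lower values indicating the simpler model.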
14.3 Data Quality

[...] data quality and have good marks in these measures: management responsibilities, operation and assurance costs, research and development, production, distribution, personnel management, and legal functions [46]. This section makes reference to only two of them: management and design issues.

14.3.1 Management Issues

Companies must, on the one hand, define [...]