DATA MODELING FUNDAMENTALS (P13)

CHAPTER 9 MODELING FOR DECISION-SUPPORT SYSTEMS

DATA MINING SYSTEMS

OLAP Versus Data Mining. From our earlier discussion on OLAP, you have a clear idea about the features of OLAP. With OLAP queries and analysis, users are able to obtain results and derive interesting patterns from the data. Data mining also enables users to uncover interesting patterns, but there is an essential difference in the way the results are obtained. Figure 9-31 points out the essential difference between the two approaches.

FIGURE 9-31 OLAP and data mining.

Although both OLAP and data mining are complex information delivery systems, the basic difference lies in the interaction of the users with the systems. OLAP is a user-driven methodology; data mining is a data-driven approach. Data mining is a fairly automatic knowledge discovery process.

Data Mining: Knowledge Discovery. The knowledge discovery in data mining technology may be broken down into the following basic steps:

- Define business objectives
- Prepare data
- Launch data mining tools
- Evaluate results
- Present knowledge discoveries
- Incorporate usage of discoveries

Figure 9-32 amplifies the knowledge discovery process and shows the relevant data repositories.

FIGURE 9-32 Knowledge discovery process.

Data Mining/Data Warehousing. How and where does data mining fit in a data warehousing environment? The data warehouse is a valuable and easily available data source for data mining operations. Data in the data warehouse is already cleansed and consolidated. Data for data mining may be extracted from the data warehouse. Figure 9-33 illustrates data mining in the data warehouse environment. Observe the movement of data for data mining operations.

FIGURE 9-33 Data mining in the data warehouse environment.

Data Mining Techniques

Although a discussion of major data mining techniques might be somewhat useful, our primary concentration is on the data and how to model the data for data mining applications.
Detailed study of data mining techniques and algorithms is, therefore, outside the scope of our study. These techniques and algorithms are complex and highly technical. However, we will just touch on major functions, application areas, and techniques.

Functions and Techniques. Refer to Figure 9-34 showing data mining functions and techniques. Look at the four columns in the figure and try to understand the connections. Review the following statements.

- Data mining algorithms are part of data mining techniques.
- Data mining techniques are used to carry out data mining functions. While performing specific data mining functions, you are applying data mining processes.
- A certain data mining function is generally suitable to a given application area.
- Each application area is a major area in business where data mining is actively used.

FIGURE 9-34 Data mining functions and techniques.

Applications. In order to appreciate the tremendous usefulness of data mining, let us list a few major applications of data mining in the business area.

Customer Segmentation. This is one of the most widespread applications. Businesses use data mining to understand their customers. Cluster detection algorithms discover clusters of customers sharing the same buying characteristics.

Market Basket Analysis. This is a very useful application for retail. Link analysis algorithms uncover affinities between products that are bought together. Other businesses such as upscale auction houses use these algorithms to find customers to whom they can sell higher-value items.

Risk Management. Insurance companies and mortgage businesses use data mining to discover risks associated with potential customers.

Delinquency Tracking. Loan companies use the technology to track customers who are likely to be delinquent on loan repayments.

Demand Prediction. Retail and other distribution businesses use data mining to match demand and supply trends to forecast demand for specific products.

Data Preparation and Modeling

The data for your mining operations depends on the business objectives: what you expect to get out of the data mining technique being adopted to work on the data. You will be able to come up with a set of data elements whose values are required as input into the data mining tool. To get the values, you need to determine the data sources.

Go back and revisit Figure 9-33, which shows data preparation from the enterprise data warehouse. You know the sources that feed the data warehouse. It is assumed that the data warehouse contains data that has been integrated and combined from several sources. The data warehouse is also expected to have clean data with all the impurities removed in the data staging area. If data mining algorithms are allowed to work on inconsistent data, the results could be totally useless.

In this subsection, we will concentrate on the box in the figure that indicates data selected, extracted, transformed, and prepared for data mining. We will discuss how the data selection and preparation are done. Once the data is prepared, you need to store the prepared data suitably for feeding the data mining applications. What are good methods for this data storage? How do you prepare the data model for this data repository?

Data Preprocessing

Depending on the particular data mining application, you may find that the needed data elements are not present in your data warehouse. Be prepared to look to other outside and internal sources for additional data. Further, incomplete, noisy, and inconsistent data is common in corporate databases and large data warehouses. Do not assume that the available data is correct and simply extract it from these sources to feed the data mining application.

Data preprocessing generally consists of the following processes:

- Selection of data
- Preparation of data
- Transformation of data

Let us discuss these briefly. That will give us an idea of the data content that should be reflected in the data model for the preprocessed source data for data mining.

Data Selection. Of course, what data is needed depends on the business objectives and the nature of the data mining application. Remember, data mining algorithms work on data at the lowest grain or level of detail. Based on a list of data elements, you need to identify the sources. Most of the data can probably be extracted from the data warehouse. Otherwise, determine the secondary sources.

Data mining algorithms work on data variables. Values of selected active variables are fed into the data mining system to perform the required operations. Active variables would be data attributes that may be found within the fact and dimension tables of the data warehouse repository.

Suppose your data mining application wants to perform market basket analysis, that is, to determine what a typical customer is likely to put in a market basket and take to the checkout counter of a supermarket. The active variables in this case would possibly be number of visits and variables to describe each basket such as household identification (from supermarket card), date of purchase, items purchased, basket value, quantities purchased, and promotion code.

Active variables generally fall into the following categories:

Nominal Variable. This has a limited number of values with no significance attached to the values in terms of ranking. Example: gender (male or female).

Ordinal Variable. This has a limited number of values with values signifying ranking. Example: customer education (high school or college or graduate school).

Continuous Measure Variable. The difference between values of the variable is measurable, and the values vary continuously. Examples: purchase price, number of items. Values for this variable are real numbers.

Discrete Measure Variable. The difference between values of the variable is measurable, and the values vary in discrete steps. Example: number of market basket items. Values for this variable are integers.

Data Preparation. This step basically entails cleansing the selected data. It begins with a general review of the structure of the data in question and the choice of a method to measure quality. Usually, data quality is measured by a combination of statistical methods and data visualization techniques.

The most common data problems are the following:

Missing Values. No values are recorded for many instances. The missing values need to be filled in before the variable is used. Several techniques are available to estimate and fill in the missing values.

Noisy Data. A few instances have values completely out of line. Example: daily wages exceeding a million dollars. Several smoothing techniques are available to deal with noisy data.

Inconsistent Data. Synonyms and homonyms in various source systems may produce incorrect and inconsistent data. Sources must be reviewed and inconsistencies removed.

Removal of data problems signals the end of the data preparation step. Once the selected data is cleansed, it is ready for transformation.

Data Transformation. The prepared data is now almost ready to be used as input to the data mining algorithm. The data transformation step converts the prepared data into a format suitable for data mining. You may say that data transformation changes the prepared data into a type of analytical model. The analytical model is simply an information structure representing integrated and time-dependent formatting of prepared data. For example, if a supermarket wants to analyze customer purchases, it must first be decided if the analysis will be done at the store level or at the level of individual purchases. The analytical model includes the variables and the levels of detail.

Following the identification of the analytical model, detailed data transformation takes place.
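As a rough illustration of the preparation step and of choosing the level of detail for the analytical model, consider the following minimal Python sketch. The record layout, the field names (store, household, basket_value), the plausibility threshold, and the use of the median as a fill-in value are all assumptions made for this example; the text does not prescribe any particular estimation or smoothing technique.

```python
from statistics import median

# Hypothetical market basket records (illustrative data only).
# None marks a missing value; 950000.00 is a noisy, out-of-line value.
baskets = [
    {"store": "S1", "household": "H1", "basket_value": 42.50},
    {"store": "S1", "household": "H2", "basket_value": None},
    {"store": "S2", "household": "H3", "basket_value": 18.00},
    {"store": "S2", "household": "H1", "basket_value": 950000.00},
    {"store": "S2", "household": "H4", "basket_value": 27.50},
]

# Assumed business rule: no single basket is worth more than this.
MAX_PLAUSIBLE_VALUE = 10_000.00


def prepare(records):
    """Data preparation: replace missing or implausible basket values
    with the median of the plausible ones (one simple choice among many)."""
    plausible = [
        r["basket_value"]
        for r in records
        if r["basket_value"] is not None
        and r["basket_value"] <= MAX_PLAUSIBLE_VALUE
    ]
    fill_value = median(plausible)
    cleaned = []
    for r in records:
        value = r["basket_value"]
        if value is None or value > MAX_PLAUSIBLE_VALUE:
            value = fill_value
        # Build a new record so the source data is left untouched.
        cleaned.append({**r, "basket_value": value})
    return cleaned


def store_level(records):
    """Analytical model at the store level: total basket value per store."""
    totals = {}
    for r in records:
        totals[r["store"]] = totals.get(r["store"], 0.0) + r["basket_value"]
    return totals


cleaned = prepare(baskets)

# Level of individual purchases: one row per basket, as cleaned.
purchase_level = cleaned

# Store level: the same data summarized to one row per store.
store_totals = store_level(cleaned)
print(store_totals)  # {'S1': 70.0, 'S2': 73.0}
```

The same cleansed records feed both levels of detail; which one becomes the analytical model depends on the business objective, and a store-level summary cannot later be broken back down into individual purchases.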
The objective is to transform the data to fit the exact requirements stipulated by the data mining algorithm. Data transformation may include a number of substeps such as:

- Data recoding
- Data format conversion
- Householding (linking of data of customers in the same household)
- Data reduction (by combining correlated variables)
- Scaling of parameter values to a range acceptable by data mining algorithms
- Discretization (conversion of quantitative variables into categorical variable groups)
- Conversion of categorical variables into a numeric representation

Data Modeling

Data modeling for data mining applications involves representations of the pertinent data repositories. In our discussions of data mining so far, we have been referring to the data requirements for data mining applications. We pointed out certain advantages of data extraction from the data warehouse. However, the data warehouse is not a required source; you may directly extract data from operational systems and other sources.

Figure 9-35 shows the data movements, the data preprocessing phase, and the data repositories. Study the figure carefully and note the following data repositories for which we need to create data models as suggested below:

FIGURE 9-35 Data mining: data movements and repositories.

DM Source Repository. This is a general data store for all possible data mining applications. Periodically, data is extracted from the data warehouse and stored in this repository.
Data Model. Normalized relational data model to represent low-level data content for all possible active variables available from the data warehouse.

Application Analytical Repository. This is a data store for a specific data mining application. Data is extracted from the above DM Source Repository and other sources and stored in this repository. Only the required active variables are selected.
Data Model. Normalized relational data model is recommended to represent data content at the desired data level for only those active variables relevant to the specific data mining application.

Data Mining Input Extract. This data store is meant to be used as input to the data mining algorithm. Data in the Application Analytical Repository is transformed and moved into this data store. This data store contains transformed values for only the required active variables.
Data Model. Flat file or normalized relational data model with two or three tables to represent data content to be used as direct input to the data mining algorithm.

CHAPTER SUMMARY

- Data modeling for decision-support systems is not the same as modeling for operational systems.
- Decision-support systems provide information for making strategic decisions. These are informational systems as opposed to operational systems needed to run day-to-day operations of an organization.
- Data warehousing is the most common of the decision-support systems widely used today. It is a blend of several technologies. Major components of a data warehouse are source data, data staging, data storage, and information delivery.
- Decision makers view business in terms of business dimensions for analysis. Therefore, data modeling for a data warehouse must take into account business dimensions and the business metrics. The dimensional modeling technique is used.
- A dimensional data model, known as a STAR schema, consists of several dimension entity types and a fact entity type in the middle. Each of the dimension entity types is in a one-to-many relationship with the common fact entity type. The STAR schema is not normalized. A snowflake schema, sometimes useful, is a normalized version. The data model for a given data warehouse usually consists of families of STARS.
- The conceptual data model in the form of a STAR schema is transformed into a logical model. If the data warehouse is implemented using a relational DBMS, the logical model in this case is a relational model.
- OLAP systems provide complex dimensional analysis. Data modeling for MOLAP: representation of multidimensional arrays suitable for the particular MDDBMS selected. Data modeling for ROLAP: E-R model of summarized data as required.
- Data mining is a fairly automatic knowledge discovery system. Data modeling for data mining systems consists of modeling for the data repositories: DM source repository, application analytical repository, and DM input extract.

REVIEW QUESTIONS

1. Match the column entries:

   1. Informational systems         A. Uses MDDBMS
   2. Data staging area             B. Semiadditive
   3. Dimension hierarchies         C. Normalized
   4. Fact entity type              D. Knowledge discovery
   5. Dimension table               E. Decision support
   6. Profit margin percentage      F. For drill-down analysis
   7. Snowflake schema              G. Data cleansed and transformed
   8. Hypercube                     H. Generally wide
   9. MOLAP                         I. Metrics as attributes
   10. Data mining                  J. Represents multiple dimensions

2. A data warehouse is a decision-support environment, not a product. Discuss.
3. What data does an information package contain? Give a simple example.
4. Explain why the E-R modeling technique is not completely suitable for the data warehouse. How is dimensional modeling different?
5. Describe the composition of the primary keys for the dimension and fact tables. Give simple examples.
6. Describe the nature of the columns in a dimension table transformed from the corresponding conceptual STAR schema. Give typical examples.
7. What is your understanding of a value chain and a value circle in terms of families of STARS? What are the implications for data modeling?
8. Describe the main features of a ROLAP system. Explain how data modeling is done for this.
9. Distinguish between OLAP and data mining with regard to data modeling.
10. Discuss data preprocessing for data mining. What data repositories are involved and how do you model these?

IV PRACTICAL APPROACH TO DATA MODELING

[...]

... aspect of data modeling ensuring quality of the model. Having gone through the multifarious facets of data modeling, it is just fitting to bring all that to a logical conclusion by stressing data model quality. In recent decades, organizational user groups and information technology professionals have realized the overwhelming significance of data modeling. A data modeling effort precedes every database...

... of data modeling. We have traveled quite far covering much ground. You have a strong grip on data modeling by now. You are an expert on the components of a data model. You know how to translate the information requirements of an organization into a suitable data model using model components. You have studied a number of examples of data models. In short, you now possess a thorough knowledge of what data modeling...

... the data model to the user groups and get their confirmation. You can do this correctly and effectively only if your data model is good and of high quality.

Good Database Blueprint. The database of an organization is built and implemented from the data model. Every component of the data model gets transformed into one or more parts of the database...

FIGURE 10-1 Purposes of a data model.

... database. If the entity types in a data model are erroneous or incomplete, the resulting database will be incorrect. If the data model does not establish the relationships correctly, the database will link the data structures incorrectly. For building and implementing the database of an organization accurately and completely, the data model should be of good quality. A good data model is a good blueprint;...
... look at various factors that contribute to high-quality data models. You will learn about data model quality dimensions and their implications. However, at the outset let us mention a few guiding principles for creating good data models. What is the best approach for good data modeling? We highlight in the following a few fundamentals of good data modeling practice.

Proper Mindset. Everyone on the project...

... output of the data modeling process. The data modelers have full responsibility to bear this in mind throughout the entire modeling process, from collection of information requirements to creating the final data model.

Practice of Agile Modeling. Recently, organizations have come to realize that agile software development principles and practice prove to be effective. Use and application of agile data modeling...

... cover quality in a data model in a similar fashion. First, you need to understand the meaning of quality as it applies to a data model. You need to explore the dimensions of data model quality. How to recognize a high-quality data model? What are its characteristics? What benefits accrue to the user groups, the data modelers, and other database practitioners from a good data model?

Meaning of Data Model Quality...

... process in the modeling effort; you will cover quality assurance in sufficient detail.

SIGNIFICANCE OF QUALITY

It is obvious that high quality in anything we create is essential. That goes without having to mention it specifically. Then is not that maxim true for data modeling as well? Why emphasize quality in data modeling separately? There are some special reasons. The concepts of data modeling are not...
... modeling are not that easy to comprehend. Data modeling is a specialized effort needing special skills. A data modeler must be a business analyst, draftsman, documentation expert, and a database specialist, all rolled into one. It takes skill and experience to gain a high degree of proficiency in data modeling. It is easy to overlook the essentials and produce bad data models. In a large organization, piecing...

... and logical data models. The data model is a complete blueprint. The data model can be broken down in cohesive parts for possible partial implementations. The connections or links between model components are easily defined. The business rules may easily be transposed from the conceptual data model to the logical data model. A bad data model does not possess the above characteristics. Moreover, a data model...

... overwhelming significance of data modeling. A data modeling effort precedes every database implementation. However, what we see in practice is a number of...

Data Modeling Fundamentals. By Paulraj...
