System conversions. Trace the evolution of order processing in any company. The company must have started with a file-oriented order entry system in the early 1970s; orders were entered into flat files or indexed files. There was not much stock verification or customer credit verification during the entry of the order. Reports and hard-copy printouts were used to continue with the process of executing the orders. Then this system must have been converted into an online order entry system with VSAM files and IBM’s CICS as the online processing monitor. The next conversion must have been to a hierarchical database system. Perhaps that is where your order processing system still remains, as a legacy application. Many companies have moved the system forward to a relational database application. In any case, what has happened to the order data through all these conversions? System conversions and migrations are prominent reasons for data pollution. Try to understand the conversions gone through by each of your source systems.

Data aging. We have already dealt with data aging when we reviewed how over the course of many years the values in the product code fields could have decayed. The older values lose their meaning and significance. If many of your source systems are old legacy systems, pay special attention to the possibility of aged data in those systems.

Heterogeneous system integration. The more heterogeneous and disparate your source systems are, the stronger the possibility of corrupted data. In such a scenario, data inconsistency is a common problem. Consider the sources for each of your dimension tables and the fact table. If the sources for one table are several heterogeneous systems, be cautious about the quality of data coming into the data warehouse from these systems.

Poor database design. Good database design based on sound principles reduces the introduction of errors. DBMSs provide for field editing.
RDBMSs enable verification of the conformance to business rules through triggers and stored procedures. Adhering to entity integrity and referential integrity rules prevents some kinds of data pollution.

Incomplete information at data entry. At the time of the initial data entry about an entity, if all the information is not available, two types of data pollution usually occur. First, some of the input fields are not completed at the time of initial data entry. The result is missing values. Second, if the unavailable data is mandatory at the time of the initial data entry, then the person entering the data tries to force generic values into the mandatory fields. Entering N/A for not available in the field for city is an example of this kind of data pollution. Similarly, entry of all nines in the Social Security number field is data pollution.

Input errors. In olden days when data entry clerks entered data into computer systems, there was a second step of data verification. After the data entry clerk finished a batch, the entries from the batch were independently verified by another person. Now, users who are also responsible for the business processes enter the data. Data entry is not their primary vocation. Data accuracy is supposed to be ensured by sight verification and data edits planted on the input screens. Erroneous entry of data is a major source of data corruption.

Internationalization/localization. Because of changing business conditions, the structure of the business gets expanded into the international arena. The company moves into wider geographic areas and newer cultures. As a company is internationalized, what happens to the data in the source systems? The existing data elements must adapt to newer and different values. Similarly, when a company wants to concentrate on a smaller area and localize its operations, some of the values for the data elements get discarded.
This change in the company structure and the resulting revisions in the source systems are sources of data pollution.

Fraud. Do not be surprised to learn that deliberate attempts to enter incorrect data are not uncommon. Here, the incorrect data entries are actually falsifications to commit fraud. Look out for monetary fields and fields containing units of products. Make sure that the source systems are fortified with tight edits for such fields.

Lack of policies. In any enterprise, data quality does not just materialize by itself. Prevention of entry of corrupt data and preservation of data quality in the source systems are deliberate activities. An enterprise without explicit policies on data quality cannot be expected to have adequate levels of data quality.

Validation of Names and Addresses

Almost every company suffers from the problem of duplication of names and addresses. For a single person, multiple records can exist among the various source systems. Even within a single source system, multiple records can exist for one person. But in the data warehouse, you need to consolidate all the activities of each person from the various duplicate records that exist for that person in the multiple source systems. This type of problem occurs whenever you deal with people, whether they are customers, employees, physicians, or suppliers.

Take the specific example of an auction company. Consider the different types of customers and the different purposes for which the customers seek the services of the auction company. Customers bring property items for sale, buy at auctions, subscribe to the catalogs for the various categories of auctions, and bring articles to be appraised by experts for insurance purposes and for estate dissolution. It is likely that there are different legacy systems at an auction house to service the customers in these different areas.
One customer may come for all of these services and a record gets created for the customer in each of the different systems. A customer usually comes for the same service many times. On some of these occasions, it is likely that duplicate records are created for the same customer in one system. Entry of customer data happens at different points of contact of the customer with the auction company. If it is an international auction company, entry of customer data happens at many auction sites worldwide. Can you imagine the possibility for duplication of customer records and the extent of this form of data corruption?

Name and address data is captured in two ways (see Figure 13-3). If the data entry is in the multiple field format, then it is easier to check for duplicates at the time of data entry. Here are a few inherent problems with entering names and addresses:

• No unique key
• Many names on one line
• One name on two lines
• Name and the address in a single line
• Personal and company names mixed
• Different addresses for the same person
• Different names and spellings for the same customer

Before attempting to deduplicate the customer records, you need to go through a preliminary step. First, you have to recast the name and address data into the multiple field format. This is not easy, considering the numerous variations in the way name and address are entered in free-form textual format. After this first step, you have to devise matching algorithms to match the customer records and find the duplicates. Fortunately, many good tools are available to assist you in the deduplication process.

Costs of Poor Data Quality

Cleansing the data and improving the quality of data takes money and effort. Although data cleansing is extremely important, you could justify the expenditure of money and effort by counting the costs of not having or using quality data. You can produce estimates with the help of the users.
They are the ones who can really do estimates because the estimates are based on forecasts of lost opportunities and possible bad decisions.

The following is a list of categories for which cost estimates can be made. These are broad categories. You will have to get into the details for estimating the risks and costs for each category.

• Bad decisions based on routine analysis
• Lost business opportunities because of unavailable or “dirty” data
• Strain and overhead on source systems because of corrupt data causing reruns
• Fines from governmental agencies for noncompliance or violation of regulations
• Resolution of audit problems
• Redundant data unnecessarily using up resources
• Inconsistent reports
• Time and effort for correcting data every time data corruption is discovered

Figure 13-3 Data entry: name and address formats.

SINGLE FIELD FORMAT
Name & Address: Dr. Jay A. Harreld, P.O. Box 999, 100 Main Street, Anytown, NX 12345, U.S.A.

MULTIPLE FIELD FORMAT
Title: Dr.
First Name: Jay
Middle Initial: A.
Last Name: Harreld
Street Address-1: P.O. Box 999
Street Address-2: 100 Main Street
City: Anytown
State: NX
Zip: 12345
Country Code: U.S.A.

DATA QUALITY TOOLS

Based on our discussions in this chapter so far, you are at a point where you are convinced about the seriousness of data quality in the data warehouse. Companies have begun to recognize dirty data as one of the most challenging problems in a data warehouse.

You would, therefore, imagine that companies must be investing heavily in data cleanup operations. But according to experts, data cleansing is still not a very high priority for companies. This attitude is changing as useful data quality tools arrive on the market. You may choose to apply these tools to the source systems, in the staging area before the load images are created, or to the load images themselves.

Categories of Data Cleansing Tools

Generally, data cleansing tools assist the project team in two ways.
Data error discovery tools work on the source data to identify inaccuracies and inconsistencies. Data correction tools help fix the corrupt data. These correction tools use a series of algorithms to parse, transform, match, consolidate, and correct the data.

Although data error discovery and data correction are two distinct parts of the data cleansing process, most of the tools on the market do a bit of both. The tools have features and functions that identify and discover errors. The same tools can also perform the cleaning up and correction of polluted data. In the following sections, we will examine the features of the two aspects of data cleansing as found in the available tools.

Error Discovery Features

Please study the following list of error discovery functions that data cleansing tools are capable of performing.

• Quickly and easily identify duplicate records
• Identify data items whose values are outside the range of legal domain values
• Find inconsistent data
• Check for range of allowable values
• Detect inconsistencies among data items from different sources
• Allow users to identify and quantify data quality problems
• Monitor trends in data quality over time
• Report to users on the quality of data used for analysis
• Reconcile problems of RDBMS referential integrity

Data Correction Features

The following list describes the typical error correction functions that data cleansing tools are capable of performing.

• Normalize inconsistent data
• Improve merging of data from dissimilar data sources
• Group and relate customer records belonging to the same household
• Provide measurements of data quality
• Validate for allowable values

The DBMS for Quality Control

The database management system itself is used as a tool for data quality control in many ways. Relational database management systems have many features beyond the database engine (see list below).
Later versions of RDBMS can easily prevent several types of errors from creeping into the data warehouse.

Domain integrity. Provide domain value edits. Prevent entry of data if the entered data value is outside the defined limits of value. You can define the edit checks while setting up the data dictionary entries.

Update security. Prevent unauthorized updates to the databases. This feature will stop unauthorized users from updating data in an incorrect way. Casual and untrained users can introduce inaccurate or incorrect data if they are given authorization to update.

Entity integrity checking. Ensure that duplicate records with the same primary key values are not entered. Also prevent duplicates based on values of other attributes.

Minimize missing values. Ensure that nulls are not allowed in mandatory fields.

Referential integrity checking. Ensure that relationships based on foreign keys are preserved. Prevent deletion of related parent rows.

Conformance to business rules. Use trigger programs and stored procedures to enforce business rules. These are special scripts compiled and stored in the database itself. Trigger programs are automatically fired when the designated data items are about to be updated or deleted. Stored procedures may be coded to ensure that the entered data conforms to specific business rules. Stored procedures may be called from application programs.

DATA QUALITY INITIATIVE

In spite of the enormous importance of data quality, it seems as though many companies still ask the question whether to pay special attention to it and cleanse the data or not. In many instances, the data for the missing values of attributes cannot be recreated. In quite a number of cases, the data values are so convoluted that the data cannot really be cleansed. A few other questions arise. Should the data be cleansed? If so, how much of it can really be cleansed? Which parts of the data deserve higher priority for applying data cleansing techniques?
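As an aside on the DBMS features listed above under The DBMS for Quality Control, the following is a minimal sketch of how several of them might look in practice. It uses Python's built-in sqlite3 module with an in-memory database purely for illustration; the table names, column names, the credit limit range, and the positive-quantity business rule are all hypothetical and do not come from the chapter.

```python
import sqlite3

# In-memory database; all table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE customer (
    customer_id  INTEGER PRIMARY KEY,   -- entity integrity: no duplicate keys
    name         TEXT NOT NULL,         -- mandatory field: nulls not allowed
    credit_limit REAL NOT NULL
        CHECK (credit_limit BETWEEN 0 AND 100000)  -- domain integrity (assumed range)
);
CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL
        REFERENCES customer(customer_id),          -- referential integrity
    quantity     INTEGER NOT NULL
);
-- Assumed business rule enforced by a trigger: quantity must be positive.
CREATE TRIGGER orders_quantity_check
BEFORE INSERT ON orders
WHEN NEW.quantity <= 0
BEGIN
    SELECT RAISE(ABORT, 'quantity must be positive');
END;
""")


def try_sql(sql, params=()):
    """Run one statement; return None on success, or the error text if the DBMS rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return None
    except sqlite3.DatabaseError as exc:
        return str(exc)
```

With this setup, a call such as try_sql("INSERT INTO orders VALUES (10, 99, 1)") returns an error message rather than None, because no customer 99 exists. The same happens for a duplicate primary key, a null name, an out-of-range credit limit, a non-positive quantity, or an attempt to delete a customer that still has orders.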
The indifference and the resistance to data cleansing emerge from a few valid factors:

• Data cleansing is tedious and time-consuming. The cleansing activity demands a combination of the usage of vendor tools, writing of in-house code, and arduous manual tasks of verification and examination. Many companies are unable to sustain the effort. This is not the kind of work many IT professionals enjoy.
• The metadata on many source systems may be missing or nonexistent. It will be difficult or even impossible to probe into dirty data without the documentation.
• The users who are asked to ensure data quality have many other business responsibilities. Data quality probably receives the least attention.
• Sometimes, the data cleansing activity appears to be so gigantic and overwhelming that companies are terrified of launching a data cleansing initiative.

Once your enterprise decides to institute a data cleansing initiative, you may consider one of two approaches. You may opt to let only clean data into your data warehouse. This means only data of 100% quality can be loaded into the data warehouse. Data that is in any way polluted must be cleansed before it can be loaded. This approach is ideal from the point of view of data quality, but it takes a while to detect incorrect data and even longer to fix it, so it will take a very long time before all data is cleaned up for data loading.

The second approach is a “clean as you go” method. In this method, you load all the data “as is” into the data warehouse and perform data cleansing operations in the data warehouse at a later time. Although you do not withhold data loads, the results of any query are suspect until the data gets cleansed. Questionable data quality at any time leads to losing user confidence that is extremely important for data warehouse success.
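The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: rows are plain dictionaries, is_clean is a hypothetical caller-supplied check, and the suspect flag in the second function is an addition made here to show how questionable rows might be tracked until they are cleansed; the chapter itself does not prescribe such a flag.

```python
def load_clean_only(rows, is_clean):
    """Approach 1: admit only rows that pass the check; hold the rest back for cleansing."""
    loaded, held = [], []
    for row in rows:
        (loaded if is_clean(row) else held).append(row)
    return loaded, held


def load_clean_as_you_go(rows, is_clean):
    """Approach 2: load every row as is, tagging rows that fail the check as suspect.

    The 'suspect' key is a hypothetical marker so later cleansing passes
    (and cautious queries) can find the rows that still need attention.
    """
    return [dict(row, suspect=not is_clean(row)) for row in rows]
```

In the first function nothing polluted reaches the warehouse, but the held rows delay the load. In the second, every row is available immediately, and any query that touches rows marked suspect inherits their uncertainty.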
Data Cleansing Decisions

Before embarking on a data cleansing initiative, the project team, including the users, has to make a number of basic decisions. Data cleansing is not as simple as deciding to cleanse all data and to cleanse it now. Realize that absolute data quality is unrealistic in the real world. Be practical and realistic. Go for the fitness-for-purpose principle. Determine what the data is being used for and find the purpose. If the data from the warehouse has to provide exact sales dollars of the top twenty-five customers, then the quality of this data must be very high. If customer demographics are to be used to select prospects for the next marketing campaign, the quality of this data may be at a lower level.

In the final analysis, when it comes to data cleansing, you are faced with a few fundamental questions. You have to make some basic decisions. In the following subsections, we present the basic questions that need to be asked and the basic decisions that need to be made.

Which Data to Cleanse. This is the root decision. First of all, you and your users must jointly work out the answer to this question. It must primarily be the users’ decision. IT will help the users make the decision. Decide on the types of questions the data warehouse is expected to answer. Find the source data needed for getting answers. Weigh the benefits of cleansing each piece of data. Determine how cleansing will help and how leaving the dirty data in will affect any analysis made by the users in the data warehouse.

The cost of cleaning up all data in the data warehouse is enormous. Users usually understand this. They do not expect to see 100% data quality and will usually settle for ignoring the cleansing of unimportant data as long as all the important data is cleaned up. But be sure of getting the definitions of what is important or unimportant from the users themselves.

Where to Cleanse.
Data for your warehouse originates in the source operational systems, and so does the data corruption. Then the extracted data moves into the staging area. From the staging area load images are loaded into the data warehouse. Therefore, theoretically, you may cleanse the data in any one of these areas. You may apply data cleansing techniques in the source systems, in the staging area, or perhaps even in the data warehouse. You may also adopt a method that splits the overall data cleansing effort into parts that can be applied in two of the areas, or even in all three areas.

You will find that cleansing the data after it has arrived in the data warehouse repository is impractical and results in undoing the effects of many of the processes for moving and loading the data. Typically, data is cleansed before it is stored in the data warehouse. So that leaves you with two areas where you can cleanse the data.

Cleansing the data in the staging area is comparatively easy. You have already resolved all the data extraction problems. By the time data is received in the staging area, you are fully aware of the structure, content, and nature of the data. Although this seems to be the best approach, there are a few drawbacks. Data pollution will keep flowing into the staging area from the source systems. The source systems will continue to suffer from the consequences of the data corruption. The costs of bad data in the source systems do not get reduced. Any reports produced from the same data from the source systems and from the data warehouse may not match and will cause confusion.

On the other hand, if you attempt to cleanse the data in the source systems, you are taking on a complex, expensive, and difficult task. Many legacy source systems do not have proper documentation. Some may not even have the source code for the production programs available for applying the corrections.

How to Cleanse. Here the question is about the usage of vendor tools.
Do you use vendor tools by themselves for all of the data cleansing effort? If not, how much in-house programming is needed for your environment? Many tools are available in the market for several types of data cleansing functions.

If you decide to cleanse the data in the source systems, then you have to find the appropriate tools that can be applied to source system files and formats. This may not be easy if most of your source systems are fairly old. In that case, you have to fall back on in-house programs.

How to Discover the Extent of Data Pollution. Before you can apply data cleansing techniques, you have to assess the extent of data pollution. This is a joint responsibility shared among the users of operational systems, the potential users of the data warehouse, and IT. IT staff, supporting both the source systems and the data warehouse, have a special role in the discovery of the extent of data pollution. IT is responsible for installing the data cleansing tools and training the users in using those tools. IT must augment the effort with in-house programs.

In an earlier section, we discussed the sources of data pollution. Reexamine these sources. Make a list that reflects the sources of pollution found in your environment, then determine the extent of the data pollution with regard to each source of pollution. For example, in your case, data aging could be a source of pollution. If so, make a list of all the old legacy systems that serve as sources of data for your data warehouse. For the data attributes that are extracted, examine the sets of values. Check if any of these values do not make sense and have decayed. Similarly, perform detailed analysis for each type of data pollution source.

Please look at Figure 13-4. In this figure, you find a few typical ways you can detect the possible presence and extent of data pollution. Use the list as a guide for your environment.
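Examining the sets of values for an extracted attribute is the kind of task an in-house program can handle. The following Python sketch profiles one column: it counts value frequencies, flags generic fillers such as N/A and all-nines entries (the examples of forced values given earlier in the chapter), and, if a list of legal domain values is supplied, flags values outside it. The function names, the set of filler strings, and the four-digit threshold for all-nines detection are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical filler values; extend this set for your own source systems.
PLACEHOLDERS = {"", "N/A", "NA", "UNKNOWN"}


def is_placeholder(value):
    """Return True for missing values, generic fillers, or all-nines entries."""
    if value is None:
        return True
    text = str(value).strip().upper()
    if text in PLACEHOLDERS:
        return True
    digits = text.replace("-", "")
    # Assumed rule: four or more characters, all of them the digit 9.
    return len(digits) >= 4 and set(digits) == {"9"}


def profile_column(values, legal_values=None):
    """Summarize one extracted attribute: frequencies, fillers, and illegal values."""
    counts = Counter(values)
    placeholders = {v: n for v, n in counts.items() if is_placeholder(v)}
    illegal = {}
    if legal_values is not None:
        illegal = {v: n for v, n in counts.items()
                   if v not in legal_values and not is_placeholder(v)}
    total = len(values)
    suspect = sum(placeholders.values()) + sum(illegal.values())
    return {
        "total": total,
        "distinct": len(counts),
        "placeholders": placeholders,
        "illegal": illegal,
        "suspect_ratio": suspect / total if total else 0.0,
    }
```

Running profile_column on each extracted attribute of a legacy source, and comparing the suspect_ratio values across sources, gives a rough first measure of the extent of pollution. Decayed codes that are still syntactically legal will not be caught this way and still need review by someone who knows the data.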
Setting Up a Data Quality Framework. You have to contend with so many types of data pollution. You need to make various decisions to embark on the cleansing of data. You must dig into the sources of possible data corruption and determine the pollution. Most companies serious about data quality pull all these factors together and establish a data quality framework. Essentially, the framework provides a basis for launching data quality initiatives. It embodies a systematic plan for action. The framework identifies the players, their roles, and responsibilities. In short, the framework guides the data quality improvement effort. Please refer to Figure 13-5. Notice the major functions carried out within the framework.

Figure 13-4 Discovering the extent of data pollution.

• Operational systems converted from older versions are prone to the perpetuation of errors.
• Operational systems brought in house from outsourcing companies converted from their proprietary software may have missing data.
• Data from outside sources that is not verified and audited may have potential problems.
• When applications are consolidated because of corporate mergers and acquisitions, these may be error-prone because of time pressures.
• When reports from old legacy systems are no longer used, that could be because of erroneous data reported.
• If users do not trust certain reports fully, there may be room for suspicion because of bad data.
• Whenever certain data elements or definitions are confusing to the users, these may be suspect.
• If each department has its own copies of standard data such as Customer or Product, it is likely corrupt data exists in these files.
• If reports containing the same data reformatted differently do not match, data quality is suspect.
• Wherever users perform too much manual reconciliation, it may be because of poor data quality.
• If production programs frequently fail on data exceptions, large parts of the data in those systems are likely to be corrupt.
• Wherever users are not able to get consolidated reports, it is possible that data is not integrated.

Who Should be Responsible?

Data quality or data corruption originates in the source systems. Therefore, should not the owners of the data in the source systems alone be responsible for data quality? If these data owners are responsible for the data, should they also bear the responsibility for any data pollution that happens in the source systems? If data quality in the source systems is high, the data quality in the data warehouse will also be high. But, as you well know, in operational systems, there are no clear roles and responsibilities for maintaining data quality. This is a serious problem. Owners of data in the operational systems are generally not directly involved in the data warehouse. They have little interest in keeping the data clean in the data warehouse.

Form a steering committee to establish the data quality framework discussed in the previous section. All the key players must be part of the steering committee. You must have representatives of the data owners of source systems, users of the data warehouse, and IT personnel responsible for the source systems and the data warehouse. The steering committee is charged with assignment of roles and responsibilities. Allocation of resources is also the steering committee’s responsibility. The steering committee also arranges data quality audits.

Figure 13-6 shows the participants in the data quality initiatives. These persons represent the user departments and IT. The participants serve on the data quality team in specific roles. Listed below are the suggested responsibilities for the roles:

Data Consumer. Uses the data warehouse for queries, reports, and analysis.
Establishes the acceptable levels of data quality.

Data Producer. Responsible for the quality of data input into the source systems.

Data Expert. Expert in the subject matter and the data itself of the source systems. Responsible for identifying pollution in the source systems.

Data Policy Administrator. Ultimately responsible for resolving data corruption as data is transformed and moved into the data warehouse.

Data Integrity Specialist. Responsible for ensuring that the data in the source systems conforms to the business rules.

Data Correction Authority. Responsible for actually applying the data cleansing techniques through the use of tools or in-house programs.

Data Consistency Expert. Responsible for ensuring that all data within the data warehouse (various data marts) are fully synchronized.

Figure 13-5 Data quality framework. The figure lists the following functions, grouped under initial data cleansing efforts and ongoing data cleansing efforts and carried out by IT professionals and user representatives:

• Identify the business functions affected most by bad data.
• Establish Data Quality Steering Committee.
• Agree on a suitable data quality framework.
• Institute data quality policy and standards.
• Define quality measurement parameters and benchmarks.
• Select high impact data elements and determine priorities.
• Plan and execute data cleansing for high impact data elements.
• Plan and execute data cleansing for other less severe elements.

The Purification Process

We all know that it is unrealistic to hold up the loading of the data warehouse until the quality of all data is at the 100% level. That level of data quality is extremely rare. If so, how much of the data should you attempt to cleanse? When do you stop the purification process?

Again, we come to the issues of who will use the data and for what purpose. Estimate the costs and risks of each piece of incorrect data. Users usually settle for some extent of errors, provided these errors result in no serious consequences.
But the users need to be kept informed of the extent of possible data corruption and exactly which parts of the data could be suspect.

How then could you proceed with the purification process? With the complete participation of your users, divide the data elements into priorities for the purpose of data cleansing. You may adopt a simple categorization by grouping the data elements into three priority categories: high, medium, and low. Achieving 100% data quality is critical for the high category. The medium-priority data requires as much cleansing as possible. Some errors may be tolerated when you strike a balance between the cost of correction and potential effect of bad data. The low-priority data may be cleansed if you have any time and re- [...]

Figure 13-6 Data quality: participants and roles. Data quality initiatives: Data Consumer (User Dept.), Data Expert (User Dept.), Data Producer (User Dept.), Data Policy Administrator (IT Dept.), Data Integrity Specialist (IT Dept.), Data Correction Authority (IT Dept.), Data Consistency Expert (IT Dept.).

[...] interact with the data warehouse in an informational approach, an analytical approach, or by using data mining techniques?

Informational Approach. In this approach, with query and reporting tools, the users retrieve historical or current data and perform some standard statistical analysis. The data may be lightly or heavily summarized. The result sets may take [...]

[...] checklist for evaluation and selection of these tools.

5. As a data warehouse consultant, a large bank with statewide branches has hired you to help the company set up a data quality initiative. List your major considerations. Produce an outline for a document describing the initiative, the policies, and the procedures.

Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals. Paulraj Ponniah [...]
[...] obtained from external sources. Pollution can also be introduced into the data warehouse through errors in external data. Surely, if you pay for the external data and do not capture it from the public domain, then you have every right to demand a warranty on data quality. In spite of what the vendor might profess about the quality of the data, for each set of external data, set up some kind of data quality [...] audit. If the external data fails the audit, be prepared to reject the corrupt data and demand a cleaner version.

Figure 13-7 illustrates the overall data purification process. Please observe the process as shown in the figure and go through the following summary:

• Establish the importance of data quality
• Form data quality steering committee
• Institute a data quality framework
• Assign roles and [...] responsibilities
• Select tools to assist in the data purification process
• Prepare in-house programs as needed
• Train the participants [...] in data cleansing techniques
• Review and confirm data standards
• Prioritize data into high, medium, and low categories
• Prepare schedule for data purification beginning with the high priority data
• Ensure that techniques are available to correct duplicate records and to audit external data
• Proceed with the purification process according to the defined schedule

Figure 13-7 Overall data purification process. Polluted data flows from the source systems through the data staging area, where data cleansing functions (vendor tools and in-house programs) produce cleansed data for the data warehouse, all within the data quality framework run by IT professionals and user representatives.

Practical Tips on Data Quality

Before [...]

[...] discovery features found in data cleansing tools. What is the “clean as you go” method? Is this a good approach for the data warehouse environment? Name any three types of participants on the data quality team. What are their functions?

EXERCISES

1. Match the columns: domain integrity, data aging, entity integrity, data consumer, poor quality data, data consistency expert, error discovery, data pollution [...]

[...] Ability to create reports Limited drilldown Routine analysis with definite results Usually work with summary data Special Normalized data architecture models including an exploration Detailed data, warehouse useful summarized used hardly ever Provision for large queries on Range of special huge volumes of data mining tools, detailed data statistical analysis tools, and data A variety of tools visualization [...]

[...] rules for future use
• Provide ability to the users to modify retrieved rules
• Select, manipulate, and transform data according to the business rules
• Have a set of data manipulation and transformation tools
• Correctly link to data storage to retrieve the selected data
• Be able to link with metadata
• Be capable of formatting and structuring output in a variety of ways, both textual and graphical
• Have the means [...]

[...] from information delivery from an operational system. If the kinds of strategic information made available in a data warehouse were readily available from the source systems, then we would not really need the warehouse. Data warehousing enables the users to make better strategic decisions by obtaining data from the source systems and keeping it in a format suitable for querying and analysis. Data Warehouse [...]