Handling Missing Values: Application to University Data Set

Dinesh J. Prajapati (1), Jagruti H. Prajapati (2)

(1) Department of Information Technology, A. D. Patel Institute of Technology, New V. V. Nagar-388121, India; Gujarat Technological University (GTU). dinesh249@yahoo.com
(2) Department of Information Technology, Charotar Institute of Technology, Changa-388421, India; Charotar University of Science and Technology (CHARUSAT). jagruti_eyetea@yahoo.com

Abstract

Data warehouses usually contain some missing values due to unavailable data, and these affect the number and the quality of the generated rules. The missing values could affect the coverage percentage and the number of reducts generated from a specific data set. Missing values make it difficult to extract useful information from a data set. Association rule algorithms typically only identify patterns that occur in their original form throughout the database. Handling Missing Values for Association Rule Mining allows data that approximately matches the pattern to contribute toward the overall support of the pattern. This approach is also useful in processing missing data, which probabilistically contributes to the support of possibly matching patterns. The actual data mining process deals significantly with prediction, estimation, classification, pattern recognition and the development of association rules. Therefore, the significance of the analysis depends heavily on the accuracy of the database and on the sample data chosen for model training and testing.

Keywords: Data cleansing, Missing values, Knowledge discovery, Preprocessing.

1. Introduction

Missing data are the absence of data items for a subject; they hide some information that may be important. In practice, missing data have been one major factor affecting data quality. The presence of missing data is a general and challenging problem in the data analysis field.
Fortunately, missing data imputation techniques can be used to improve data quality. Missing data imputation techniques refer to any strategy that fills in the missing values of a data set so that standard data analysis methods can be applied to the completed data set [9]. Generally, there are two types of imputation techniques. Single imputation techniques substitute a single value for each missing data item, as in mean imputation. Multiple imputation techniques impute the missing data m times, so that m complete data sets are formed. For each of the m complete data sets, standard complete-data analysis methods are used to generate m analysis results [4], which are then integrated into a final result for inference. Propensity Score and Markov Chain Monte Carlo are both widely used multiple imputation techniques. Compared with single imputation techniques, multiple imputation techniques are more complex and cost more to apply [3].

International Journal of emerging trends in engineering and development, Issue 1, Vol. 1 (August 2011), ISSN 2249-6149, Page 44

One limitation of many association rule mining algorithms, such as the Apriori algorithm, is that only database entries which exactly match a candidate pattern may contribute to the support of that pattern. This creates a problem for databases containing missing values [6]. The purpose of the work described in this paper is to compare the different methods of handling missing values. By doing so, the approximated values for missing data items can be incorporated in the ordinal association rules.

The rest of this paper is organized as follows. Section 2 presents the methods for detecting errors and handling missing values. Section 3 shows the results. Section 4 draws the conclusion.

2. Handling Missing Values

There are two forms of noise in the data, as described below [10-11].

1. Corrupted values: sometimes some of the values in the training set are altered from what they should have been. This may result in one or more tuples in the data set conflicting with the rules already established. The system may then regard these extreme values as noise and ignore them. The problem is that one never knows whether the extreme values are correct, and the challenge is how to handle such “weird” values in the best manner.

2. Missing attribute values: one or more of the attribute values may be missing, both for examples in the training set and for objects which are to be classified. Missing data might occur because the value is not relevant to a particular case, could not be recorded when the data was collected, or is ignored by users because of privacy concerns. If attribute values are missing in a training set, the system may ignore the object totally, try to take it into account by, for instance, finding the missing attribute's most probable value, or use the value “missing”, “unknown” or “NULL” as a separate value for the attribute.

Cleansing data of errors is an important processing step, particularly when integrating heterogeneous data sources. Dirty data files are prevalent in data warehouses because of incorrect or missing data values, inconsistent attribute naming conventions or incomplete information [12]. One important step in any data processing task is to verify the correctness of data values. Data cleaning, also called data cleansing or scrubbing, detects and removes errors and inconsistencies in data in order to improve its quality. Causes of data quality problems include misspellings during data entry, missing data, invalid or incomplete information, and other reasons such as inconsistent attribute naming conventions [8]. The methods for dealing with missing features at prediction time differ in their effect on prediction accuracy. The most common approaches for dealing with missing features involve imputation.
The main idea of imputation is that if an important feature is missing for a particular instance, it can be estimated from the data that are present [5]. The imputation model should be rich enough to preserve the associations or relationships among variables that will be the focus of later investigation. For example, suppose that a variable Y is imputed under a normal model that includes the variable X1. After imputation, the analyst then uses linear regression to predict Y from X1 and another variable X2 which was not in the imputation model. The estimated coefficient for X2 from this regression would tend to be biased toward zero, because Y has been imputed without regard for its possible relationship with X2 [7].

Filling a missing value can be done using any of the following methods [1].

2.1 Handling Missing values by ignoring the tuple

This method is usually applied when the class label is missing. It is not effective unless the tuple contains several attributes with missing values, and it is especially poor when the percentage of missing values per attribute varies considerably. For example, consider a database that contains the following transactions, where “?” represents a missing value.

i) A, B, C
ii) E, F, G
iii) ?, B, E
iv) A, B, F

Suppose the minimum support is 3. If we ignore the third transaction, the support of item B drops from 3 to 2, so any association rule containing B is missed completely. Item A also stays at a support of 2, even though the missing value may have been A.

2.2 Handling Missing values by filling the missing value manually

This approach is time-consuming and may not be feasible for a large data set with many missing values.

2.3 Handling Missing values by using global constant

Replace all the missing attribute values by the same constant, such as a label like “Unknown” or -∞.
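The two simplest strategies so far, ignoring the incomplete tuple and filling with a global constant, can be compared on the four transactions of Section 2.1. The following is only a minimal sketch: the helper names `fill_with_constant` and `item_support` are hypothetical and are not taken from the paper's implementation.

```python
from collections import Counter

MISSING = "?"  # marker for a missing value, as in the Section 2.1 example


def fill_with_constant(transactions, constant="Unknown"):
    """Replace every missing item with the same constant label."""
    return [
        [constant if item == MISSING else item for item in t]
        for t in transactions
    ]


def item_support(transactions):
    """Count in how many transactions each item occurs."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    return counts


transactions = [
    ["A", "B", "C"],
    ["E", "F", "G"],
    ["?", "B", "E"],
    ["A", "B", "F"],
]

# Strategy of Section 2.1: drop every transaction that has a missing item.
complete_only = [t for t in transactions if MISSING not in t]
support_ignored = item_support(complete_only)

# Strategy of Section 2.3: keep the transaction, label the gap "Unknown".
support_filled = item_support(fill_with_constant(transactions))

print("support of B when the tuple is ignored:", support_ignored["B"])
print("support of B with a global constant:   ", support_filled["B"])
print("support of the label 'Unknown':        ", support_filled["Unknown"])
```

With a minimum support of 3, item B only stays frequent when the third transaction is kept. The price is that the constant label is now counted like any other item.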
If missing values are replaced by “Unknown”, the mining program may mistakenly think that they form an interesting concept, since they all have a value in common. Hence, although this method is simple, it is not recommended.

2.4 Handling Missing values by using attribute mean

This method uses the mean (average) of an attribute to fill its missing values. For example, suppose that the average income of employees is 20,000; this value is then used to replace the missing values for income. The method cannot be used when the attribute with the missing value is not numeric (neither integer nor float): there is, for example, no such thing as the average name of an employee.

2.5 Handling Missing values by using attribute mean for all samples belonging to the same class

This method uses the attribute mean computed over all samples of the same class. For example, if classifying employees according to credit_risk, replace the missing value with the average income of the employees in the same credit risk category as the given tuple. As with the previous method, it cannot be used when the attribute contains non-integer or non-float values.

2.6 Handling Missing values by using the most probable value

This is a widely used method, and it overcomes the limitations of all the methods above. Each missing value is replaced by a probability distribution. This probability distribution represents the likelihood of the possible values for the missing data, calculated using frequency counts from the entries that do contain data for the corresponding field.

2.7 Handling Missing values by using the most probable value for all the samples belonging to the same class

This method uses the probabilities of the attribute's values computed over all samples of the same class.
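As a rough illustration of Sections 2.4 to 2.7, the sketch below computes the attribute mean, the class-wise mean, the frequency-based distribution and the most probable value. The employee records, the `dependents` attribute and all helper names are hypothetical and chosen only for this example; they do not come from the paper's implementation.

```python
from collections import Counter
from statistics import mean

# Hypothetical employee records; None marks a missing value.
employees = [
    {"credit_risk": "low", "income": 30000, "dependents": 0},
    {"credit_risk": "low", "income": 26000, "dependents": 0},
    {"credit_risk": "low", "income": None, "dependents": None},
    {"credit_risk": "high", "income": 12000, "dependents": 2},
    {"credit_risk": "high", "income": 12000, "dependents": 2},
    {"credit_risk": "high", "income": None, "dependents": 2},
]


def observed(rows, attr, cls=None):
    """Return the known values of attr, optionally restricted to one class."""
    return [
        r[attr]
        for r in rows
        if r[attr] is not None and (cls is None or r["credit_risk"] == cls)
    ]


def attribute_mean(rows, attr, cls=None):
    """Sections 2.4 and 2.5: mean of the known values (per class if cls is given)."""
    return mean(observed(rows, attr, cls))


def value_distribution(rows, attr, cls=None):
    """Sections 2.6 and 2.7: relative frequency of each known value."""
    counts = Counter(observed(rows, attr, cls))
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def most_probable(rows, attr, cls=None):
    """Pick the value with the highest relative frequency."""
    dist = value_distribution(rows, attr, cls)
    return max(dist, key=dist.get)


print(attribute_mean(employees, "income"))             # mean over all classes
print(attribute_mean(employees, "income", cls="low"))  # mean within class "low"
print(value_distribution(employees, "dependents"))     # frequency-based distribution
print(most_probable(employees, "dependents"))          # over all classes
print(most_probable(employees, "dependents", cls="low"))
```

In this toy data the overall estimates (an income of 20000 and 2 dependents) differ from the estimates restricted to the "low" class (28000 and 0), which is the motivation for the class-wise variants.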
For example, if classifying employees according to credit risk, replace the missing value with the most probable income value for the employees in the same credit risk category as the given tuple. As stated for the class-wise mean, this method cannot be used when the attribute contains non-integer or non-float values.

3. Results

For this experiment, we took a database with no missing values as a reference database and randomly introduced missing values for each attribute at a rate of 10%. We used the University database from the UCI Data Repository [13]. The University database has 165 records and 13 attributes. The database is in its original LISP-readable form, with a few relevant functions at the end of the data file. Each observation concerns one university. In some cases, more information is provided about the attribute (e.g., units or domain). Some duplicates may exist, and a single observation may have more than one value for a given attribute (especially academic emphasis). It appears that several attributes could serve as a distinguished class attribute for this database.

For the University data set, various algorithms were implemented for handling missing values, and Figure 1 presents the results, which demonstrate the accuracy of the implemented algorithms. It is clearly visible in the results that missing values are successfully filled with a low noise rate.

[Figure: Accuracy Versus Tolerance. x-axis: Tolerance (0% to 20%); y-axis: Accuracy (20 to 50); series: Average, Class, Probability, Probability + Class]
Fig. 1 Accuracy versus Tolerance with 10% Missing values filled

4. Conclusion

This paper discusses different methods to impute the missing values.
Missing values are replaced by probability distributions over the possible values for the missing feature, which allows the corresponding transaction to support all itemsets that could possibly match the data. Transactions which do not exactly match the candidate itemset may also contribute a partial amount of support. This behavior is beneficial for databases with many missing values or containing numeric data. Handling missing values using the most probable value for all the samples belonging to the same class gives better results than the other techniques because it is a hybrid of the class technique and the probability technique. This hybridization eliminates the disadvantages of each of the two basic techniques. It is also observed that missing values filled with better accuracy lead to better results.

References

[1] Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.
[2] Nayak, J., Cook, D. (2001), Approximate association rule mining, In Florida Artificial Intelligence Research Symposium.
[3] Azzam Sleit, Mousa Al-Akhras, Inas Juma, Marwah Alian, Applying Ordinal Association Rules for Cleansing Data With Missing Values, Marsland Press Journal of American Science 2009:5(3) 52-62.
[4] Jianhua Wu, Qinbao Song, Junyi Shen, An Novel Association Rule Mining Based Missing Nominal Data Imputation Method, Eighth ACIS International Conference.
[5] Chih-Hung Wu, Chian-Huei Wun, Hung-Ju Chou, “Using Association Rules for Completing Missing Data”, Fourth International Conference on Hybrid Intelligent Systems (HIS'04), 2004, pp. 236-241.
[6] Ragel, A. 1998, Preprocessing of Missing Values Using Robust Association Rules, In Proceedings of the Second Pacific-Asia Conference.
[7] Lakshminarayan, K., Harp, S., Goldman, R., and Samad, T. 1996, Imputation of missing data using machine learning techniques, In Proceedings of the Second International Conference on Knowledge Discovery in Databases and Data Mining.
[8] Ragel, A. and Cremilleux, B., “MVC - a preprocessing method to deal with missing values”, In Proceedings of Knowl Based Syst 1999, 285-291.
[9] Arnaud Ragel & Bruno Cremilleux, “Treatment of Missing Values for Association Rules”, In Proceedings of PAKDD 1998.
[10] Arnaud Ragel, Bruno Cremilleux & J. L. Bosson, “An Interactive and Understandable Method to Treat Missing Values: Application to a Medical Data Set”, In ACM Comput. Surv. 1985.
[11] Luai Al Shalabi, “A comparative study of techniques to deal with missing data in data sets”, In Proceedings of the 4th International Multiconference on Computer Science and Information Technology /CSIT 2006.
[12] A. Pujari, Data Mining Techniques, Universities Press, India, 2001.
[13] UCI Data Repository.
