Handling Missing Values: Application to
University Data Set
Dinesh J. Prajapati (1), Jagruti H. Prajapati (2)
(1) Department of Information Technology, A. D. Patel Institute of Technology, New V. V. Nagar-388121, India; Gujarat Technological University (GTU); dinesh249@yahoo.com
(2) Department of Information Technology, Charotar Institute of Technology, Changa-388421, India; Charotar University of Science and Technology (CHARUSAT); jagruti_eyetea@yahoo.com
Abstract
Data warehouses usually contain some missing values due to unavailable data, and these affect the
number and the quality of the generated rules. Missing values can also affect the coverage
percentage and the number of reducts generated from a specific data set, and they make it
difficult to extract useful information from the data set. Association rule algorithms typically
identify only patterns that occur in their original form throughout the database. Handling Missing
Values for Association Rule Mining allows data that approximately matches a pattern to
contribute toward the overall support of that pattern. This approach is also useful in processing
missing data, which contributes probabilistically to the support of possibly matching patterns.
The actual data mining process deals significantly with prediction, estimation, classification,
pattern recognition and the development of association rules. Therefore, the significance of the
analysis depends heavily on the accuracy of the database and on the sample data chosen
for model training and testing.
Keywords: Data cleansing, Missing values, Knowledge discovery, Preprocessing.
1. Introduction
Missing data are the absence of data items for a subject; they hide some information that may be
important. In practice, missing data have been one major factor affecting data quality. The
presence of missing data is a general and challenging problem in the data analysis field.
Fortunately, missing data imputation techniques can be used to improve data quality. Missing
data imputation techniques refer to any strategy that fills in missing values of a dataset so that
standard data analysis methods can be applied to analyze the completed dataset [9].
Generally, there are two types of techniques for imputing missing data. Single imputation
techniques substitute a single value for each missing data item, as in mean
imputation. Multiple imputation techniques impute the missing data m times, so that
m complete data sets are formed. For each of the m complete data sets, standard complete-data
analysis methods are used to generate m analysis results [4]. The m analysis results
are then integrated into a final result for inference. Propensity Score and Markov Chain Monte
Carlo are both widely used multiple imputation techniques. Compared with single imputation
International Journal of emerging trends in engineering and development Issue 1, Vol.1(August-2011) ISSN 2249-6149
techniques, multiple imputation techniques are more complex and incur a higher cost to impute
missing data [3].
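The multiple imputation workflow described above (impute m times, analyze each completed data set, then combine the m results) can be sketched as follows. This is only a minimal illustration and not the Propensity Score or Markov Chain Monte Carlo procedures named in the text: it assumes a random draw from the observed values as the imputation step, the mean as the analysis, and a plain average as the pooling rule. The function name `multiple_imputation_mean` and the sample incomes are hypothetical.

```python
import random
from statistics import mean


def multiple_imputation_mean(values, m=5, seed=0):
    """Pool the mean of `values` over m randomly imputed copies.

    Missing entries are marked with None. Each imputation fills every
    missing entry with a random draw from the observed entries
    (an assumed, deliberately simple imputation model).
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("at least one observed value is required")

    estimates = []
    for _ in range(m):
        # Build one completed data set.
        completed = [v if v is not None else rng.choice(observed) for v in values]
        # Run the complete-data analysis (here: the mean).
        estimates.append(mean(completed))

    # Integrate the m analysis results into one final result.
    return mean(estimates), estimates


incomes = [18000, None, 22000, 20000, None, 24000]
pooled, per_run = multiple_imputation_mean(incomes, m=5)
print(pooled, per_run)
```

The spread of the m per-run estimates gives a rough sense of how much uncertainty the missing entries add, which a single imputation cannot show.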
One limitation of many association rule mining algorithms, such as the Apriori algorithm, is that
only database entries which exactly match a candidate pattern may contribute to the support of
that pattern. This creates a problem for databases containing missing values [6]. The
purpose of the work described in this paper is to compare different methods of handling
missing values. By doing so, approximated values for missing data items can be incorporated
in the ordinal association rules. The rest of this paper is organized as follows. Section 2 presents
the methods for detecting errors and handling missing values. Section 3 shows the results.
Section 4 draws the conclusion.
2. Handling Missing Values
There are two forms of noise in the data as described below [10-11].
1. Corrupted values: sometimes some of the values in the training set are altered from what they
should have been. This may result in one or more tuples in the dataset conflicting with the rules
already established. The system may then regard these extreme values as noise, and ignore them.
The problem is that one never knows if the extreme values are correct or not, and the challenge is
how to handle “weird” values in the best manner.
2. Missing attribute values: one or more of the attribute values may be missing, both for examples
in the training set and for objects which are to be classified. Missing data might occur because
the value is not relevant to a particular case, could not be recorded when the data was collected,
or is withheld by users because of privacy concerns. If attributes are missing in a training set,
the system may ignore the object entirely, try to account for it (for instance, by finding the
missing attribute's most probable value), or use the value “missing”, “unknown” or “NULL” as a
separate value for the attribute.
Cleansing data of errors is an important processing step, particularly when integrating
heterogeneous data sources. Dirty data files are prevalent in data warehouses because of
incorrect or missing data values, inconsistent attribute naming conventions or incomplete
information [12]. One important step in any data processing task is to verify the correctness of
data values. Data cleaning, also called data cleansing or scrubbing, detects and removes errors
and inconsistencies in data in order to improve the quality of the data. Causes of data quality
problems include misspellings during data entry, missing data, invalid or incomplete information,
and other reasons such as inconsistent attribute naming conventions [8]. The methods for dealing
with missing features at prediction time differ in their effect on prediction accuracy. The most
common approaches for dealing with missing features involve imputation. The main idea of
imputation is that if an important feature is missing for a particular instance, it can be estimated
from the data that are present [5].
The imputation model should be rich enough to preserve the associations or relationships among
variables that will be the focus of later investigation. For example, suppose that a variable Y is
imputed under a normal model that includes the variable X1. After imputation, the analyst then
uses linear regression to predict Y from X1 and another variable X2 which was not in the
imputation model. The estimated coefficient for X2 from this regression would tend to be biased
toward zero, because Y has been imputed without regard for its possible relationship with X2
[7]. Filling a missing value can be done using any of the following methods [1].
2.1 Handling Missing values by ignoring the tuple
This method is usually used when the class label is missing. It is not effective
unless the tuple contains several attributes with missing values, and it is especially poor when
the percentage of missing values per attribute varies considerably. For example, consider a
database that contains the following transactions, where “?” represents a missing value.
i) A, B, C
ii) E, F, G
iii) ?, B, E
iv) A, B, F
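Suppose the minimum support is 3. Item A occurs in only two complete transactions, so it can
reach the minimum support only if the missing item in the third transaction is A. If we ignore the
third transaction, any association rule containing item A will be missed completely.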
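The effect of ignoring the tuple on the four transactions above can be checked with a short support count. This is a minimal sketch: the helper names `item_support` and `drop_incomplete` are hypothetical, and support is taken to mean the number of transactions containing an item.

```python
from collections import Counter

MISSING = "?"

transactions = [
    ["A", "B", "C"],
    ["E", "F", "G"],
    [MISSING, "B", "E"],
    ["A", "B", "F"],
]


def item_support(rows):
    """Count in how many transactions each known item occurs."""
    counts = Counter()
    for row in rows:
        counts.update({item for item in row if item != MISSING})
    return counts


def drop_incomplete(rows):
    """Ignore every tuple that contains a missing value."""
    return [row for row in rows if MISSING not in row]


full = item_support(transactions)
reduced = item_support(drop_incomplete(transactions))

print("A:", full["A"], "->", reduced["A"])
print("B:", full["B"], "->", reduced["B"])
```

Besides removing any chance for A to reach the minimum support of 3, dropping the third transaction also lowers the support of B from 3 to 2, so B falls below the threshold as well.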
2.2 Handling Missing values by filling the missing value manually
This approach is time-consuming and may not be feasible given a large dataset with many
missing values.
2.3 Handling Missing values by using global constant
Replace all the missing attribute values by the same constant, such as a label like “Unknown” or
−∞. If missing values are replaced by “Unknown”, then the mining program may mistakenly
think that they form an interesting concept, since they all have a value in common. Hence,
although this method is simple, it is not recommended.
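The risk described above can be made concrete with a few records. In this minimal sketch the records, the attribute meanings and the helper `fill_with_constant` are all hypothetical; `None` marks a missing value.

```python
from collections import Counter


def fill_with_constant(rows, constant="Unknown"):
    """Replace every missing entry (None) with the same constant label."""
    return [[constant if v is None else v for v in row] for row in rows]


# Hypothetical (occupation, city) records; None marks a missing value.
records = [
    ["clerk", None],
    [None, "CityA"],
    ["engineer", None],
    [None, None],
]

filled = fill_with_constant(records)
value_counts = Counter(v for row in filled for v in row)
print(value_counts.most_common(1))
```

After filling, the constant label is the most frequent value in the data, which is exactly the situation in which a mining program could mistake it for a meaningful pattern.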
2.4 Handling Missing values by using attribute mean
This method uses the mean (average) of the attribute to fill the missing value. For example,
suppose that the average income of Employee is 20,000. Use this value to replace the missing
value for income. This method cannot be used if the attribute with the missing value holds
non-numeric (non-integer, non-float) values. For example, there is no meaningful average name
of Employee.
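A minimal sketch of mean imputation follows, including the restriction to numeric attributes. The function name `fill_with_mean` and the income figures are hypothetical, chosen so that the observed mean is 20,000 as in the example above.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace None entries with the mean of the observed numeric entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot compute a mean without observed values")
    if not all(isinstance(v, (int, float)) for v in observed):
        raise TypeError("mean imputation needs a numeric attribute")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


incomes = [18000, 22000, None, 20000]
print(fill_with_mean(incomes))
```

Calling `fill_with_mean` on a list of names raises `TypeError`, mirroring the point that an attribute such as Employee name has no average.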
2.5 Handling Missing values by using attribute mean for all samples belonging to the same
class
This method uses the mean of the attribute over all samples of the same class. For example,
if classifying Employees according to credit_risk, replace the missing value with the average
income value for employees in the same credit-risk category as that of the given tuple. This
method cannot be used if the attribute with the missing value holds non-numeric (non-integer,
non-float) values.
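The class-wise variant can be sketched in the same way. The records, the attribute names `income` and `credit_risk` as dictionary keys, and the helper `fill_with_class_mean` are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean


def fill_with_class_mean(rows, attr, label):
    """Fill missing `attr` values with the mean of `attr` within each class.

    `rows` is a list of dicts; None marks a missing value. A class with no
    observed value for `attr` is left unfilled.
    """
    by_class = defaultdict(list)
    for row in rows:
        if row[attr] is not None:
            by_class[row[label]].append(row[attr])
    class_mean = {cls: mean(vals) for cls, vals in by_class.items()}

    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[attr] is None and new_row[label] in class_mean:
            new_row[attr] = class_mean[new_row[label]]
        filled.append(new_row)
    return filled


employees = [
    {"income": 10000, "credit_risk": "high"},
    {"income": 14000, "credit_risk": "high"},
    {"income": None, "credit_risk": "high"},
    {"income": 30000, "credit_risk": "low"},
    {"income": 34000, "credit_risk": "low"},
    {"income": None, "credit_risk": "low"},
]
result = fill_with_class_mean(employees, "income", "credit_risk")
print([row["income"] for row in result])
```

With these sample records the overall mean would be 22,000 for both missing entries, whereas the class means (12,000 and 32,000) stay close to the incomes actually seen in each credit-risk group.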
2.6 Handling Missing values by using the most probable value
This method is widely used, and it overcomes the limitations of all the methods above. Each
missing value is replaced by a probability distribution. This probability
distribution represents the likelihood of possible values for the missing data, calculated using
frequency counts from the entries that do contain data for the corresponding field.
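A minimal sketch of this idea follows. It assumes, for illustration only, that the first position of each transaction in Section 2.1 can be treated as one field, estimates the distribution from frequency counts, and lets the missing entry contribute its probability as partial support. The helper names are hypothetical, and exact fractions are used to avoid rounding.

```python
from collections import Counter
from fractions import Fraction


def value_distribution(observed):
    """Estimate P(value) from frequency counts of the observed entries."""
    counts = Counter(observed)
    total = sum(counts.values())
    return {value: Fraction(n, total) for value, n in counts.items()}


def most_probable(distribution):
    """Return the value with the highest estimated probability."""
    return max(distribution, key=distribution.get)


# First field of the four transactions from Section 2.1; None is the "?".
first_field = ["A", "E", None, "A"]
dist = value_distribution([v for v in first_field if v is not None])

# Partial support: an observed "A" counts 1, a missing entry counts P(A).
support_a = sum(
    1 if v == "A" else dist["A"] if v is None else 0 for v in first_field
)
print(dist, most_probable(dist), support_a)
```

Here the missing entry adds 2/3 to the support of A instead of being discarded, so the transaction still contributes to every pattern it could match.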
2.7 Handling Missing values by using the most probable value for all the samples belonging
to the same class
This method uses the probability distribution of the attribute over all samples of the same class.
For example, if classifying Employees according to credit risk, replace the missing value with the
most probable income for employees in the same credit-risk category as that of the given tuple.
This method cannot be used if the attribute with the missing value holds non-numeric
(non-integer, non-float) values.
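Combining the class and probability ideas can be sketched as below. This is an assumed, simplified reading in which the most probable value is the most frequent observed value within the same class; the records and the helper `fill_with_class_mode` are hypothetical.

```python
from collections import Counter, defaultdict


def fill_with_class_mode(rows, attr, label):
    """Fill missing `attr` values with the most frequent value in the same class."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[attr] is not None:
            counts[row[label]][row[attr]] += 1

    filled = []
    for row in rows:
        new_row = dict(row)
        # An empty Counter (no observed value in this class) leaves the gap as is.
        if new_row[attr] is None and counts[new_row[label]]:
            new_row[attr] = counts[new_row[label]].most_common(1)[0][0]
        filled.append(new_row)
    return filled


employees = [
    {"income": 12000, "credit_risk": "high"},
    {"income": 12000, "credit_risk": "high"},
    {"income": 15000, "credit_risk": "high"},
    {"income": None, "credit_risk": "high"},
    {"income": 30000, "credit_risk": "low"},
    {"income": 30000, "credit_risk": "low"},
    {"income": None, "credit_risk": "low"},
]
result = fill_with_class_mode(employees, "income", "credit_risk")
print([row["income"] for row in result])
```

Each gap is filled from the frequency counts of its own class only, so the "high" and "low" groups receive different values.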
3. Results
For this experiment, we took a database with no missing values to serve as a reference
database, and randomly introduced missing values for each attribute at a rate of 10%. We
used the University database from the UCI Data Repository [13]. The University database has
165 records and 13 attributes, and each observation concerns one university. The database is in
its original (LISP-readable) form, with a few relevant functions at the end of the data file. In some
cases, more information is provided about an attribute (e.g., units or domain). Some duplicates
may exist, and a single observation may have more than one value for a given attribute
(especially academic emphasis). It appears that several attributes could serve
as a distinguished class attribute for this database. For the University data set, various algorithms
were implemented for handling missing values, and Figure 1 presents the results, which show the
accuracy of the implemented algorithms. The results indicate that the missing values are
filled with a low noise rate.
[Figure 1: “Accuracy Versus Tolerance”. x-axis: Tolerance (0% to 20%); y-axis: Accuracy (axis ticks 20 to 50); series: Average, Class, Probability, Probability + Class.]
Fig. 1 Accuracy versus Tolerance with 10% Missing values filled
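The paper does not define how accuracy and tolerance are computed. One plausible reading, given purely as an assumption, is that an imputed numeric value counts as correct when it lies within a relative tolerance of the reference value that was hidden. Under that assumption the measure could be sketched as follows; the function name and the sample numbers are hypothetical and are not the experimental data.

```python
def accuracy_within_tolerance(actual_values, imputed_values, tolerance):
    """Percentage of imputed values within `tolerance` (relative) of the truth."""
    if len(actual_values) != len(imputed_values):
        raise ValueError("both lists must have the same length")
    if not actual_values:
        raise ValueError("no imputed values to score")
    hits = sum(
        abs(imputed - actual) <= tolerance * abs(actual)
        for actual, imputed in zip(actual_values, imputed_values)
    )
    return 100.0 * hits / len(actual_values)


# Hypothetical reference values that were hidden, and the values imputed for them.
hidden = [100, 200, 300, 400]
imputed = [100, 215, 250, 404]

for tol in (0.0, 0.05, 0.10, 0.15, 0.20):
    print(f"{tol:.0%}", accuracy_within_tolerance(hidden, imputed, tol))
```

Under this definition accuracy can only stay level or rise as the tolerance widens, which is the general shape one would expect in a plot of accuracy against tolerance.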
4. Conclusion
This paper discusses different methods of imputing missing values. Missing values are replaced
by probability distributions over the possible values of the missing feature, which allows the
corresponding transaction to support all itemsets that could possibly match the data. Transactions
which do not exactly match the candidate itemset may also contribute a partial amount of support.
This behavior is beneficial for databases with many missing values or containing numeric data.
Handling missing values using the most probable value for all the samples belonging to
the same class gives better results than the other techniques, because it is a
hybrid of the class technique and the probability technique. This hybridization eliminates
the respective disadvantages of the two basic techniques. We also observe that filling missing
values with better accuracy leads to better results.
References
[1] Jiawei Han, Micheline Kamber, Data Mining Concepts & Techniques, Morgan Kaufmann
Publishers.
[2] Nayak, J., Cook, D. (2001), Approximate association rule mining, In Florida Artificial
Intelligence Research Symposium.
[3] Azzam Sleit, Mousa Al-Akhras, Inas Juma, Marwah Alian, Applying Ordinal Association
Rules for Cleansing Data With Missing Values, Marsland Press Journal of American
Science 2009:5(3) 52-62.
[4] Jianhua Wu, Qinbao Song, Junyi Shen, A Novel Association Rule Mining Based Missing
Nominal Data Imputation Method, Eighth ACIS International Conference.
[5] Chih-Hung Wu, Chian-Huei Wun, Hung-Ju Chou, “Using Association Rules for Completing
Missing Data”, Fourth International Conference on Hybrid Intelligent Systems (HIS'04),
2004, pp. 236-241.
[6] Ragel, A. 1998, Preprocessing of Missing Values Using Robust Association Rules, In
Proceedings of the Second Pacific-Asia Conference.
[7] Lakshminarayan, K., Harp, S., Goldman, R., and Samad, T. 1996, Imputation of missing
data using machine learning techniques, In Proceedings of the Second International
Conference on Knowledge Discovery in Databases and Data Mining.
[8] Ragel, A. and Cremilleux, B., “MVC - a preprocessing method to deal with missing
values”, In Proceedings of Knowl Based Syst 1999, 285-291.
[9] Arnaud Ragel & Bruno Cremilleux, “Treatment of Missing Values for Association Rules”,
In Proceedings of PAKDD 1998.
[10] Arnaud Ragel, Bruno Cremilleux & J. L. Bosson, “An Interactive and Understandable
Method to Treat Missing Values: Application to a Medical Data Set”, In ACM Comput.
Surv. 1985.
[11] Luai Al Shalabi, “A comparative study of techniques to deal with missing data in data
sets”, In Proceedings of the 4th International Multiconference on Computer Science and
Information Technology /CSIT 2006.
[12] A. Pujari, Data Mining Techniques, Universities Press, India, 2001.
[13] UCI Data Repository.