Trịnh Tấn Đạt
Faculty of Information Technology (Khoa CNTT), Saigon University (Đại Học Sài Gòn)
Email: trinhtandat@sgu.edu.vn
Website: https://sites.google.com/site/ttdat88/

Contents
- Introduction
- Common Techniques in Data Classification
- Handling Different Data Types
- Variations on Data Classification

Introduction
Definition: Given a set of training data points along with associated training labels, determine the class label for an unlabeled test instance.
Classification algorithms contain two phases:
- Training phase: a model is constructed from the training instances.
- Testing phase: the model is used to assign a label to an unlabeled test instance.
The output for a test instance may be presented in one of two ways:
- Discrete label
- Numerical score

Introduction
Application domains:
- Customer target marketing
- Medical disease diagnosis
- Multimedia data analysis
- Document categorization and filtering
- ...

Introduction
Work in the data classification area falls into three groups:
- Technique-centered: numerous classes of techniques are studied, such as decision trees, neural networks, SVM methods, and probabilistic methods.
- Data-type centered: many different data types are produced by different applications, such as text, multimedia, uncertain data, time series, and discrete sequences.
- Variations on classification analysis: numerous variations on the standard classification problem exist, dealing with more challenging scenarios such as rare class learning, transfer learning, or semi-supervised learning.

Common Techniques in Data Classification
Feature Selection Methods
There are two broad kinds of feature selection methods:
- Filter models: a crisp criterion on a single feature, or a subset of features, is used to evaluate their suitability for classification. This method is independent of the specific classification algorithm being used.
- Wrapper models: the feature selection process is embedded into a classification algorithm, in order to make feature selection sensitive to that algorithm. This approach recognizes the fact that different algorithms may work better with different features.

Common Techniques in Data Classification
Probabilistic Methods
Probabilistic classification algorithms use statistical inference to find the best class for a given example. Usually, the posterior probability is used to determine class membership for each new instance.
Two ways to estimate posterior probabilities:
- Bayes classifier (generative model).
- Model the posterior probability directly, by learning a discriminative function that maps an input feature vector onto a class label (discriminative model). Logistic regression is a popular discriminative classifier.

Common Techniques in Data Classification
Decision Trees
Decision trees create a hierarchical partitioning of the data, which relates the different partitions at the leaf level to the different classes.
- Widely used learning method.
- Easy to interpret: can be re-represented as if-then-else rules.
- Does not require any prior knowledge of the data distribution and works well on noisy data.
Has been applied to:
- Classifying medical patients by disease.
- Classifying equipment malfunctions by cause.
Many algorithms: Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT.
[Figure: a simple decision tree for the "play golf" example.]
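To make the training and testing phases concrete, here is a minimal sketch in the spirit of the "play golf" example, using scikit-learn's DecisionTreeClassifier (a CART-style implementation rather than ID3 or C4.5); the tiny weather table below is invented purely for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training instances: (outlook, humidity, windy) -> play / don't play.
data = pd.DataFrame({
    "outlook":  ["sunny", "sunny", "overcast", "rain", "rain",   "overcast", "sunny"],
    "humidity": ["high",  "high",  "high",     "high", "normal", "normal",   "normal"],
    "windy":    [False,   True,    False,      False,  True,     True,       False],
    "play":     ["no",    "no",    "yes",      "yes",  "no",     "yes",      "yes"],
})

# Training phase: build the model from the training instances.
X = pd.get_dummies(data[["outlook", "humidity", "windy"]])
y = data["play"]
model = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Testing phase: assign a label to an unlabeled test instance.
test = pd.DataFrame({"outlook": ["rain"], "humidity": ["high"], "windy": [True]})
test_X = pd.get_dummies(test).reindex(columns=X.columns, fill_value=0)
print(model.predict(test_X))         # discrete label
print(model.predict_proba(test_X))   # numerical score per class
print(export_text(model, feature_names=list(X.columns)))  # if-then-else style rules
```

The printed tree illustrates the point above that a decision tree can be read back as a set of if-then-else rules.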
Common Techniques in Data Classification
Rule-Based Methods
A set of rules is mined from the training data in the first phase (the training phase). During the testing phase, it is determined which rules are relevant to the test instance, and the final result is based on a combination of the class values predicted by the different rules.
For example, a rule for recognizing the letter 'x' might be "two lines that cross over and lie at approximately 45 and 135 degrees".
Many algorithms: Classification Based on Associations (CBA), CN2, RIPPER.

Common Techniques in Data Classification
SVM Classifiers
A Support Vector Machine (SVM) finds an optimal separating hyperplane.
- SVMs maximize the margin around the separating hyperplane, focusing on the "difficult points" close to the decision boundary.
- The decision function is fully specified by a subset of the training samples, the support vectors.
- Solving an SVM is a quadratic programming problem.

Common Techniques in Data Classification
Neural Networks
Neural networks attempt to simulate biological systems, corresponding to the human brain: a set of nodes connected by directed, weighted edges.
A basic NN unit computes o = σ(Σ_{i=1..n} w_i x_i), where σ(y) = 1 / (1 + e^(−y)) is the sigmoid activation.
[Figure: a basic NN unit with inputs x1, x2, x3 and weights w1, w2, w3, and a more typical network with hidden nodes and output nodes.]

Handling Different Data Types
Large Scale Data: Big Data and Data Streams
Larger data sets allow the creation of more accurate and sophisticated models.
Challenges: computational cost and real-time processing.
1.1 Data Streams
The ability to continuously collect and process large volumes of data has led to the popularity of data streams. Two primary problems arise in the construction of training models:
- One-pass constraint: data streams have very large volume, so all processing algorithms need to perform their computations in a single pass over the data.
- Concept drift: data streams may change over time. It is crucial to adjust the model in an incremental way, so that it achieves high accuracy on current test instances.

Handling Different Data Types
1.2 The Big Data Framework
When the data is stored on disk on a single machine, it is desirable to scale up the approach with disk-efficient algorithms.
Effective methods for the analysis of large amounts of data:
- Decision trees: SLIQ, BOAT, RainForest, ...
- SVM methods: SVMLight, SVMPerf, ...
- Google's MapReduce framework

Handling Different Data Types
Text Classification
The main challenge with text classification is that the data is extremely high dimensional and sparse. Rule-based methods, the Bayes method, and SVM classifiers tend to be more popular than other classifiers.

Handling Different Data Types
Multimedia Classification
Multimedia data, such as images, video, and audio, has also become increasingly popular. Multimedia data poses unique challenges, both in terms of data representation and information fusion.

Handling Different Data Types
Time Series and Sequence Data Classification
Time series data is popular in many applications such as sensor networks and medical informatics. Two kinds of classification are possible with time-series data:
- Classifying specific time instants: these correspond to specific events that can be inferred at particular instants of the data stream, for example speaker detection and segmentation.
- Classifying part or all of a series: the class labels are associated with portions or all of the series, and these are used for classification. For example, an ECG time series will show characteristic shapes for specific diagnostic criteria for diseases.

Handling Different Data Types
Uncertain Data Classification
Many forms of data collection are uncertain in nature. For example, data collected with the use of sensors is often uncertain, and statistical methods are used in order to infer parts of the data.
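Tying together two of the points above (that text data is extremely high dimensional and sparse, and that SVM classifiers are a popular choice for it), here is a minimal sketch using scikit-learn's TfidfVectorizer and LinearSVC; the four toy documents and their labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_docs = [
    "the patient shows symptoms of fever and cough",
    "blood test results and diagnosis were recorded",
    "the team scored a late goal to win the match",
    "the championship final drew a record crowd",
]
train_labels = ["medical", "medical", "sports", "sports"]

# TfidfVectorizer produces a sparse, high-dimensional feature matrix;
# LinearSVC fits a maximum-margin separating hyperplane in that space.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_docs, train_labels)

print(model.predict(["the doctor reviewed the fever diagnosis"]))
```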
Variations on Data Classification
Rare Class Learning
Rare class learning is closely related to outlier analysis and can be considered a supervised variation of the outlier detection problem. In rare class learning, the distribution of the classes is highly imbalanced in the data, and it is typically more important to correctly determine the positive class.
Two effective approaches to solve this problem:
- Example weighting: the examples are weighted differently, depending upon their cost of misclassification.
- Example re-sampling: the examples are appropriately re-sampled, so that rare classes are over-sampled, whereas the normal classes are under-sampled.

Variations on Data Classification
Ensemble Learning
A meta-algorithm is a classification method that re-uses one or more existing classification algorithms, either by applying multiple models for robustness or by combining the results of the same algorithm on different parts of the data.
Some examples of popular meta-algorithms: Boosting (AdaBoost), Bagging, Random Forests, ...

Variations on Data Classification
Enhancing Classification Methods with Additional Data
3.1 Semi-Supervised Learning
Semi-supervised learning methods improve the effectiveness of learning with the use of unlabeled data when only a small amount of labeled data is available. Using both labeled and unlabeled data builds better learners than using either one alone.

Variations on Data Classification
3.2 Transfer Learning
Labeled data from a different domain is used to enhance the learning process. Transfer learning methods fall into one of the following four categories:
- Instance-based transfer: the feature spaces of the two domains are highly overlapping; even the class labels may be the same.
- Feature-based transfer: there may be some overlap among the features, but a significant portion of the feature space may be different.
- Parameter-based transfer: the motivation is that a good training model has typically learned a lot of structure; therefore, if two tasks are related, this structure can be transferred to learn the target task.
- Relational-transfer learning: if two domains are related, they may share some similarity relations among objects, and these similarity relations can be used for transfer learning across domains.
[Figure: learning process of traditional ML vs. transfer learning. In traditional ML, each learning system is trained from its own training items; in transfer learning, knowledge extracted by one learning system is passed to another.]

Variations on Data Classification
Evaluating Classification Algorithms
Methodology used for evaluation:
- Hold-out: a fixed percentage of the training examples are "held out" and not used in training; these examples are then used for evaluation.
- Bootstrapping: sampling with replacement is used for creating the training examples. The classification accuracy is then evaluated as a weighted combination of the accuracy on the unsampled (test) examples and the accuracy on the full labeled data.
- Cross-validation: the training data is divided into a set of k disjoint subsets. One of the k subsets is used for testing, whereas the other (k−1) subsets are used for training. This process is repeated using each of the k subsets as the test set, and the error is averaged over all possibilities.

Variations on Data Classification
Evaluating Classification Algorithms
Quantification of accuracy:
- Absolute classification accuracy, which directly computes the fraction of examples that are correctly classified.
- Confusion matrix.
- ROC curve.
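As a rough sketch of the evaluation methodology just described, the following snippet runs 5-fold cross-validation, a hold-out split, a confusion matrix, and an ROC AUC score with scikit-learn; the synthetic dataset from make_classification simply stands in for real labeled data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Cross-validation: 5 disjoint folds, each used once as the test set.
scores = cross_val_score(clf, X, y, cv=5)
print("mean CV accuracy:", scores.mean())

# Hold-out: a fixed fraction of the examples is kept aside for evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf.fit(X_tr, y_tr)
print(confusion_matrix(y_te, clf.predict(X_te)))           # counts per true/predicted class
print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))  # area under the ROC curve
```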
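Finally, returning to the rare class learning variation above, this sketch illustrates the two strategies mentioned there, example weighting (here via scikit-learn's class_weight option) and simple over-sampling of the rare class; the imbalanced synthetic data is generated purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Roughly 5% positive (rare) class.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Example weighting: misclassifying the rare class costs more.
weighted_clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

# Example re-sampling: over-sample the rare class to match the majority class.
X_rare, y_rare = X[y == 1], y[y == 1]
X_up, y_up = resample(X_rare, y_rare, replace=True,
                      n_samples=int((y == 0).sum()), random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
resampled_clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
```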