Trịnh Tấn Đạt
Faculty of Information Technology (Khoa CNTT), Saigon University (Đại Học Sài Gòn)
Email: trinhtandat@sgu.edu.vn
Website: https://sites.google.com/site/ttdat88/

Contents
- Introduction
- Common Techniques in Data Classification
- Handling Different Data Types
- Variations on Data Classification

Introduction
Definition: Given a set of training data points along with associated training labels, determine the class label for an unlabeled test instance.
Classification algorithms contain two phases:
- Training phase: a model is constructed from the training instances.
- Testing phase: the model is used to assign a label to an unlabeled test instance.
The output for a test instance may be presented in one of two ways:
- Discrete label
- Numerical score

Introduction
Application domains:
- Customer target marketing
- Medical disease diagnosis
- Multimedia data analysis
- Document categorization and filtering
- ...

Introduction
Work in the data classification area falls into three groups:
- Technique-centered: numerous classes of techniques are studied, such as decision trees, neural networks, SVM methods, and probabilistic methods.
- Data-type centered: many different data types are produced by different applications, such as text, multimedia, uncertain data, time series, and discrete sequences.
- Variations on classification analysis: numerous variations on the standard classification problem exist, dealing with more challenging scenarios such as rare class learning, transfer learning, or semi-supervised learning.

Common Techniques in Data Classification
Feature Selection Methods
There are two broad kinds of feature selection methods:
- Filter models: a crisp criterion on a single feature, or a subset of features, is used to evaluate their suitability for classification. This method is independent of the specific classification algorithm being used.
- Wrapper models: the feature selection process is embedded into a classification algorithm, in order to make feature selection sensitive to that algorithm. This approach recognizes the fact that different algorithms may work better with different features.

Common Techniques in Data Classification
Probabilistic Methods
Probabilistic classification algorithms use statistical inference to find the best class for a given example. Usually, the posterior probability is used to determine class membership for each new instance.
Two ways to estimate posterior probabilities:
- Bayes classifier (generative model).
- Model the posterior probability directly, by learning a discriminative function that maps an input feature vector onto a class label (discriminative model). Logistic regression is a popular discriminative classifier.

Common Techniques in Data Classification
Decision Trees
Decision trees create a hierarchical partitioning of the data, which relates the different partitions at the leaf level to the different classes.
- Widely used learning method.
- Easy to interpret: can be re-represented as if-then-else rules.
- Does not require any prior knowledge of the data distribution and works well on noisy data.
Has been applied to:
- Classifying medical patients by disease.
- Classifying equipment malfunctions by cause.
Many algorithms: Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT.
[Figure: a simple decision tree for the "play golf" example.]
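To make the training and testing phases concrete, here is a minimal sketch in the spirit of the "play golf" example, using scikit-learn's DecisionTreeClassifier (a CART-style implementation rather than ID3 or C4.5); the tiny weather table below is invented purely for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training instances: (outlook, humidity, windy) -> play / don't play.
data = pd.DataFrame({
    "outlook":  ["sunny", "sunny", "overcast", "rain", "rain",   "overcast", "sunny"],
    "humidity": ["high",  "high",  "high",     "high", "normal", "normal",   "normal"],
    "windy":    [False,   True,    False,      False,  True,     True,       False],
    "play":     ["no",    "no",    "yes",      "yes",  "no",     "yes",      "yes"],
})

# Training phase: build the model from the training instances.
X = pd.get_dummies(data[["outlook", "humidity", "windy"]])
y = data["play"]
model = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Testing phase: assign a label to an unlabeled test instance.
test = pd.DataFrame({"outlook": ["rain"], "humidity": ["high"], "windy": [True]})
test_X = pd.get_dummies(test).reindex(columns=X.columns, fill_value=0)
print(model.predict(test_X))         # discrete label
print(model.predict_proba(test_X))   # numerical score per class
print(export_text(model, feature_names=list(X.columns)))  # if-then-else style rules
```

The printed tree illustrates the point above that a decision tree can be read back as a set of if-then-else rules.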
Common Techniques in Data Classification
Rule-Based Methods
A set of rules is mined from the training data in the first phase (the training phase). During the testing phase, it is determined which rules are relevant to the test instance, and the final result is based on a combination of the class values predicted by the different rules.
For example, a rule for recognizing the letter 'x' might be "two lines that cross over and lie at approximately 45 and 135 degrees".
Many algorithms: Classification Based on Associations (CBA), CN2, RIPPER.

Common Techniques in Data Classification
SVM Classifiers
A Support Vector Machine (SVM) finds an optimal separating hyperplane.
- SVMs maximize the margin around the separating hyperplane, focusing on the "difficult points" close to the decision boundary.
- The decision function is fully specified by a subset of the training samples, the support vectors.
- Solving an SVM is a quadratic programming problem.

Common Techniques in Data Classification
Neural Networks
Neural networks attempt to simulate biological systems, corresponding to the human brain: a set of nodes connected by directed, weighted edges.
A basic NN unit computes o = σ(Σ_{i=1..n} w_i x_i), where σ(y) = 1 / (1 + e^(−y)) is the sigmoid activation.
[Figure: a basic NN unit with inputs x1, x2, x3 and weights w1, w2, w3, and a more typical network with hidden nodes and output nodes.]

Handling Different Data Types
Large Scale Data: Big Data and Data Streams
Larger data sets allow the creation of more accurate and sophisticated models.
Challenges: computational cost and real-time processing.
1.1 Data Streams
The ability to continuously collect and process large volumes of data has led to the popularity of data streams. Two primary problems arise in the construction of training models:
- One-pass constraint: data streams have very large volume, so all processing algorithms need to perform their computations in a single pass over the data.
- Concept drift: data streams may change over time. It is crucial to adjust the model in an incremental way, so that it achieves high accuracy on current test instances.

Handling Different Data Types
1.2 The Big Data Framework
When the data is stored on disk on a single machine, it is desirable to scale up the approach with disk-efficient algorithms.
Effective methods for the analysis of large amounts of data:
- Decision trees: SLIQ, BOAT, RainForest, ...
- SVM methods: SVMLight, SVMPerf, ...
- Google's MapReduce framework

Handling Different Data Types
Text Classification
The main challenge with text classification is that the data is extremely high dimensional and sparse. Rule-based methods, the Bayes method, and SVM classifiers tend to be more popular than other classifiers.

Handling Different Data Types
Multimedia Classification
Multimedia data, such as images, video, and audio, has also become increasingly popular. Multimedia data poses unique challenges, both in terms of data representation and information fusion.

Handling Different Data Types
Time Series and Sequence Data Classification
Time series data is popular in many applications such as sensor networks and medical informatics. Two kinds of classification are possible with time-series data:
- Classifying specific time instants: these correspond to specific events that can be inferred at particular instants of the data stream, for example speaker detection and segmentation.
- Classifying part or all of a series: the class labels are associated with portions or all of the series, and these are used for classification. For example, an ECG time series will show characteristic shapes for specific diagnostic criteria for diseases.

Handling Different Data Types
Uncertain Data Classification
Many forms of data collection are uncertain in nature. For example, data collected with the use of sensors is often uncertain, and statistical methods are used in order to infer parts of the data.
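Tying together two of the points above (that text data is extremely high dimensional and sparse, and that SVM classifiers are a popular choice for it), here is a minimal sketch using scikit-learn's TfidfVectorizer and LinearSVC; the four toy documents and their labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_docs = [
    "the patient shows symptoms of fever and cough",
    "blood test results and diagnosis were recorded",
    "the team scored a late goal to win the match",
    "the championship final drew a record crowd",
]
train_labels = ["medical", "medical", "sports", "sports"]

# TfidfVectorizer produces a sparse, high-dimensional feature matrix;
# LinearSVC fits a maximum-margin separating hyperplane in that space.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_docs, train_labels)

print(model.predict(["the doctor reviewed the fever diagnosis"]))
```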
Variations on Data Classification
Rare Class Learning
Rare class learning is closely related to outlier analysis and can be considered a supervised variation of the outlier detection problem. In rare class learning, the distribution of the classes is highly imbalanced in the data, and it is typically more important to correctly determine the positive class.
Two effective approaches to solve this problem:
- Example weighting: the examples are weighted differently, depending upon their cost of misclassification.
- Example re-sampling: the examples are appropriately re-sampled, so that rare classes are over-sampled, whereas the normal classes are under-sampled.

Variations on Data Classification
Ensemble Learning
A meta-algorithm is a classification method that re-uses one or more existing classification algorithms, either by applying multiple models for robustness or by combining the results of the same algorithm on different parts of the data.
Some examples of popular meta-algorithms: Boosting (AdaBoost), Bagging, Random Forests, ...

Variations on Data Classification
Enhancing Classification Methods with Additional Data
3.1 Semi-Supervised Learning
Semi-supervised learning methods improve the effectiveness of learning with the use of unlabeled data when only a small amount of labeled data is available. Using both labeled and unlabeled data builds better learners than using either one alone.

Variations on Data Classification
3.2 Transfer Learning
Labeled data from a different domain is used to enhance the learning process. Transfer learning methods fall into one of the following four categories:
- Instance-based transfer: the feature spaces of the two domains are highly overlapping; even the class labels may be the same.
- Feature-based transfer: there may be some overlap among the features, but a significant portion of the feature space may be different.
- Parameter-based transfer: the motivation is that a good training model has typically learned a lot of structure; therefore, if two tasks are related, this structure can be transferred to learn the target task.
- Relational-transfer learning: if two domains are related, they may share some similarity relations among objects, and these similarity relations can be used for transfer learning across domains.
[Figure: learning process of traditional ML vs. transfer learning. In traditional ML, each learning system is trained from its own training items; in transfer learning, knowledge extracted by one learning system is passed to another.]

Variations on Data Classification
Evaluating Classification Algorithms
Methodology used for evaluation:
- Hold-out: a fixed percentage of the training examples are "held out" and not used in training; these examples are then used for evaluation.
- Bootstrapping: sampling with replacement is used for creating the training examples. The classification accuracy is then evaluated as a weighted combination of the accuracy on the unsampled (test) examples and the accuracy on the full labeled data.
- Cross-validation: the training data is divided into a set of k disjoint subsets. One of the k subsets is used for testing, whereas the other (k−1) subsets are used for training. This process is repeated using each of the k subsets as the test set, and the error is averaged over all possibilities.

Variations on Data Classification
Evaluating Classification Algorithms
Quantification of accuracy:
- Absolute classification accuracy, which directly computes the fraction of examples that are correctly classified.
- Confusion matrix.
- ROC curve.
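As a rough sketch of the evaluation methodology just described, the following snippet runs 5-fold cross-validation, a hold-out split, a confusion matrix, and an ROC AUC score with scikit-learn; the synthetic dataset from make_classification simply stands in for real labeled data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Cross-validation: 5 disjoint folds, each used once as the test set.
scores = cross_val_score(clf, X, y, cv=5)
print("mean CV accuracy:", scores.mean())

# Hold-out: a fixed fraction of the examples is kept aside for evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf.fit(X_tr, y_tr)
print(confusion_matrix(y_te, clf.predict(X_te)))           # counts per true/predicted class
print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))  # area under the ROC curve
```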
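Finally, returning to the rare class learning variation above, this sketch illustrates the two strategies mentioned there, example weighting (here via scikit-learn's class_weight option) and simple over-sampling of the rare class; the imbalanced synthetic data is generated purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Roughly 5% positive (rare) class.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Example weighting: misclassifying the rare class costs more.
weighted_clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

# Example re-sampling: over-sample the rare class to match the majority class.
X_rare, y_rare = X[y == 1], y[y == 1]
X_up, y_up = resample(X_rare, y_rare, replace=True,
                      n_samples=int((y == 0).sum()), random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
resampled_clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
```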