Data Mining Classification: Alternative Techniques
Imbalanced Class Problem

Class Imbalance Problem
- Many classification problems have skewed classes (far more records from one class than from another):
  – Credit card fraud
  – Intrusion detection
  – Defective products on a manufacturing assembly line

Challenges
- Evaluation measures such as accuracy are not well suited to imbalanced classes.
- Detecting the rare class is like finding a needle in a haystack.

Confusion Matrix

                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL    Class=Yes   a           b
  CLASS     Class=No    c           d

  a: TP (true positive)
  b: FN (false negative)
  c: FP (false positive)
  d: TN (true negative)

Accuracy
- The most widely used metric:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Problem with Accuracy
- Consider a 2-class problem:
  – Number of Class 0 examples = 9990
  – Number of Class 1 examples = 10
- If a model predicts everything to be Class 0, its accuracy is 9990/10000 = 99.9%.
  – This is misleading because the model does not detect a single Class 1 example.
  – Detecting the rare class is usually what matters (e.g., frauds, intrusions, defects).

Alternative Measures

  Precision (p) = a / (a + c)
  Recall (r) = a / (a + b)
  F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

ROC (Receiver Operating Characteristic)
- A graphical approach for displaying the trade-off between detection rate and false alarm rate.
- Developed in the 1950s in signal detection theory to analyze noisy signals.
- An ROC curve plots TPR against FPR:
  – The performance of a model is represented as a point on the ROC curve.
  – Changing the threshold parameter of the classifier changes the location of the point.

ROC Curve
- Points in (TPR, FPR) space:
  – (0,0): declare everything to be the negative class
  – (1,1): declare everything to be the positive class
  – (1,0): ideal
- Diagonal line:
  – Random guessing
  – Below the diagonal line: the prediction is the opposite of the true class
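To make the confusion-matrix measures above concrete, here is a minimal Python sketch (standard library only; the function and variable names are illustrative, not from the slides) that computes accuracy, precision, recall, and F-measure, and reproduces the 9990/10 accuracy paradox from the "Problem with Accuracy" slide.

```python
def metrics(tp, fn, fp, tn):
    """Compute the slide's measures from confusion-matrix counts (a=TP, b=FN, c=FP, d=TN)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0            # a / (a + c)
    recall = tp / (tp + fn) if tp + fn else 0.0               # a / (a + b)
    f_measure = (2 * recall * precision / (recall + precision)
                 if recall + precision else 0.0)              # 2a / (2a + b + c)
    return accuracy, precision, recall, f_measure

# 9990 Class 0 records, 10 Class 1 records, and a model that predicts
# everything as Class 0. Treating the rare Class 1 as the positive class:
acc, p, r, f = metrics(tp=0, fn=10, fp=0, tn=9990)
print(f"accuracy={acc:.3f} precision={p:.3f} recall={r:.3f} F={f:.3f}")
# accuracy=0.999, yet precision, recall, and F on the rare class are all 0.
```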
ROC (Receiver Operating Characteristic)
- To draw an ROC curve, the classifier must produce a continuous-valued output:
  – The outputs are used to rank the test records, from the record most likely to be positive to the record least likely to be positive.
- Many classifiers produce only discrete outputs (i.e., the predicted class).
  – How to get continuous-valued outputs? Decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, and SVMs can all be adapted to produce them.

Example: Decision Trees
[Figure: a decision tree whose leaves give continuous-valued outputs.]

ROC Curve Example
[Figure: an example ROC curve.]

ROC Curve Example
- A 1-dimensional data set containing 2 classes (positive and negative).
- Any point located at x > t is classified as positive.
- At threshold t: TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88.

Using ROC for Model Comparison
- Neither model consistently outperforms the other:
  – M1 is better for small FPR.
  – M2 is better for large FPR.
- Area Under the ROC Curve (AUC):
  – Ideal: area = 1
  – Random guess: area = 0.5

How to Construct an ROC Curve
- Use a classifier that produces a continuous-valued output score(+|A) for each test instance A.
- Sort the instances according to score(+|A) in decreasing order.
- Apply a threshold at each unique value of score(+|A).
- Count the number of TP, FP, TN, FN at each threshold:
  – TPR = TP / (TP + FN)
  – FPR = FP / (FP + TN)
- A code sketch of this procedure follows the tables below.

  Instance   score(+|A)   True Class
  1          0.95         +
  2          0.93         +
  3          0.87         -
  4          0.85         -
  5          0.85         -
  6          0.85         +
  7          0.76         -
  8          0.53         +
  9          0.43         -
  10         0.25         +

How to Construct an ROC Curve (continued)

  Class           +     -     +     -     -     -     +     -     +     +
  Threshold >=  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
  TP              5     4     4     3     3     3     3     2     2     1     0
  FP              5     5     4     4     3     2     1     1     0     0     0
  TN              0     0     1     1     2     3     4     4     5     5     5
  FN              0     1     1     2     2     2     2     3     3     4     5
  TPR             1    0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2    0
  FPR             1     1    0.8   0.8   0.6   0.4   0.2   0.2    0     0     0

[Figure: the resulting ROC curve.]
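The construction procedure above is easy to express in code. Here is a minimal sketch (plain Python; the function name and the tie-handling choice are mine, not the slides') that applies a threshold at each unique score and reproduces the (FPR, TPR) points of the table.

```python
def roc_points(scores, labels, positive='+'):
    """Predict '+' whenever score >= t, for each unique threshold t.
    Returns (FPR, TPR) = (FP/(FP+TN), TP/(TP+FN)) pairs, as on the slide."""
    pos_total = sum(1 for y in labels if y == positive)      # TP + FN
    neg_total = len(labels) - pos_total                      # FP + TN
    points = []
    for t in sorted(set(scores)) + [max(scores) + 1.0]:      # final t predicts nothing as +
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == positive)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y != positive)
        points.append((fp / neg_total, tp / pos_total))
    return points

# The 10 test instances from the table, already sorted by score(+|A):
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ['+',  '+',  '-',  '-',  '-',  '+',  '-',  '+',  '-',  '+']

for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.1f}  TPR={tpr:.1f}")
# (1.0, 1.0), (1.0, 0.8), (0.8, 0.8), (0.8, 0.6), (0.6, 0.6), (0.2, 0.4),
# (0.0, 0.4), (0.0, 0.2), (0.0, 0.0) -- the tied 0.85 scores collapse into
# a single point here, where the slide's table lists them as separate columns.
```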
Handling the Class Imbalance Problem
- Class-based ordering (e.g., RIPPER):
  – Rules for the rare class have higher priority.
- Cost-sensitive classification:
  – Misclassifying the rare class as the majority class is more expensive than misclassifying the majority class as the rare class.
- Sampling-based approaches.

Cost Matrix
- f(i, j): number of class i examples predicted as class j (the counts of the confusion matrix).
- C(i, j): cost of misclassifying a class i example as class j.

                        PREDICTED CLASS
                        Class=Yes     Class=No
  ACTUAL    Class=Yes   C(Yes, Yes)   C(Yes, No)
  CLASS     Class=No    C(No, Yes)    C(No, No)

  Cost = Σ_i Σ_j C(i, j) × f(i, j)

Computing Cost of Classification
(A code sketch of this computation appears at the end of this section.)

  Cost matrix           PREDICTED CLASS
  C(i, j)               +      -
  ACTUAL      +        -1     100
  CLASS       -         1      0

  Model M1              PREDICTED CLASS
                        +      -
  ACTUAL      +        150     40
  CLASS       -         60    250

  Accuracy = 80%, Cost = 3910

  Model M2              PREDICTED CLASS
                        +      -
  ACTUAL      +        250     45
  CLASS       -          5    200

  Accuracy = 90%, Cost = 4255

Cost-Sensitive Classification
- Example: Bayesian classifier.
  – Given a test record x, compute p(i|x) for each class i.
  – Decision rule: classify x as class k if k = argmax_i p(i|x).
  – For 2 classes, classify x as + if p(+|x) > p(-|x).
  – This decision rule implicitly assumes that C(+,+) = C(-,-) = 0 and C(+,-) = C(-,+).

Cost-Sensitive Classification
- General decision rule:
  – Classify test record x as class k, where k = argmin_j Σ_i p(i|x) × C(i, j).
- 2-class case:
  – Cost(+) = p(+|x) C(+,+) + p(-|x) C(-,+)
  – Cost(-) = p(+|x) C(+,-) + p(-|x) C(-,-)
  – Decision rule: classify x as + if Cost(+) < Cost(-).
  – If C(+,+) = C(-,-) = 0, this reduces to: classify x as + if p(+|x) > C(-,+) / (C(-,+) + C(+,-)).

Sampling-Based Approaches
- Modify the distribution of the training data so that the rare class is well represented in the training set:
  – Undersample the majority class.
  – Oversample the rare class.
- Both approaches have advantages and disadvantages.
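As a worked check of the "Computing Cost of Classification" and cost-sensitive decision-rule slides, here is a small sketch (plain Python; the function name and matrix layout are illustrative) that evaluates Cost = Σ C(i, j) × f(i, j) for both models and derives the slide's decision threshold.

```python
# Matrix layout: [[(+,+), (+,-)], [(-,+), (-,-)]]  (rows = actual, cols = predicted).

def total_cost(cost_matrix, confusion):
    """Cost = sum over all cells of C(i, j) * f(i, j)."""
    return sum(c * f
               for c_row, f_row in zip(cost_matrix, confusion)
               for c, f in zip(c_row, f_row))

C = [[-1, 100], [1, 0]]         # C(+,+)=-1, C(+,-)=100, C(-,+)=1, C(-,-)=0
M1 = [[150, 40], [60, 250]]     # accuracy 80%
M2 = [[250, 45], [5, 200]]      # accuracy 90%

print(total_cost(C, M1))  # 3910 -- lower cost despite lower accuracy
print(total_cost(C, M2))  # 4255

# Cost-sensitive decision rule for the 2-class case with C(+,+) = C(-,-) = 0:
# classify x as + when p(+|x) > C(-,+) / (C(-,+) + C(+,-)).
# (The slide's formula assumes a zero diagonal; we drop the -1 reward on
# true positives here for illustration.)
threshold = C[1][0] / (C[1][0] + C[0][1])   # 1 / (1 + 100)
print(f"classify as + when p(+|x) > {threshold:.4f}")   # ~0.0099
```

Because false negatives are 100 times costlier than false positives under this matrix, the decision threshold drops far below 0.5, which is exactly why M1, catching fewer positives but at lower total cost, can beat the more accurate M2.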