
Chapter 5: Imbalanced Classes


Data Mining Classification: Alternative Techniques
Imbalanced Class Problem

Class Imbalance Problem
- Many classification problems have skewed classes (far more records from one class than from the other), for example:
  - Credit card fraud
  - Intrusion detection
  - Defective products on a manufacturing assembly line

Challenges
- Evaluation measures such as accuracy are not well suited to imbalanced classes.
- Detecting the rare class is like finding a needle in a haystack.

Confusion Matrix

                           PREDICTED CLASS
                           Class=Yes   Class=No
    ACTUAL     Class=Yes       a           b
    CLASS      Class=No        c           d

    a: TP (true positive)     b: FN (false negative)
    c: FP (false positive)    d: TN (true negative)

Accuracy
- The most widely used metric:

    Accuracy = (a + d) / (a + b + c + d)
             = (TP + TN) / (TP + TN + FP + FN)

Problem with Accuracy
- Consider a 2-class problem:
  - Number of Class 0 examples = 9990
  - Number of Class 1 examples = 10
- If a model predicts everything to be Class 0, its accuracy is 9990/10000 = 99.9%.
  - This is misleading, because the model does not detect a single Class 1 example.
  - Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects).

Alternative Measures

    Precision (p) = a / (a + c)
    Recall (r)    = a / (a + b)
    F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

ROC (Receiver Operating Characteristic)
- A graphical approach for displaying the trade-off between detection rate and false alarm rate.
- Developed in the 1950s in signal detection theory, to analyze noisy signals.
- An ROC curve plots TPR against FPR:
  - The performance of a model is represented as a point on the ROC curve.
  - Changing the threshold parameter of the classifier changes the location of the point.

ROC Curve
- Notable (TPR, FPR) points:
  - (0, 0): declare everything to be the negative class
  - (1, 1): declare everything to be the positive class
  - (1, 0): ideal
- Diagonal line:
  - Random guessing
  - Below the diagonal line, prediction is the opposite of the true class.
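The contrast between accuracy and the alternative measures is easy to see concretely. Below is a minimal sketch in plain Python (no libraries; the variables a, b, c, d follow the confusion-matrix cells above) that applies these formulas to the 9990-vs-10 example, treating the rare class as positive:

```python
# Confusion-matrix cells for a model that predicts everything as the
# majority (negative) class: the 10 rare-class records all become FNs.
a, b, c, d = 0, 10, 0, 9990  # a=TP, b=FN, c=FP, d=TN

accuracy = (a + d) / (a + b + c + d)
# Guard the ratios: with no positive predictions, precision is undefined.
precision = a / (a + c) if (a + c) > 0 else 0.0
recall = a / (a + b) if (a + b) > 0 else 0.0
f_measure = 2 * a / (2 * a + b + c) if (2 * a + b + c) > 0 else 0.0

print(f"accuracy  = {accuracy:.3f}")   # 0.999 -- looks excellent
print(f"precision = {precision:.3f}")  # 0.000
print(f"recall    = {recall:.3f}")     # 0.000 -- the rare class is never found
print(f"f-measure = {f_measure:.3f}")  # 0.000
```

Accuracy rewards the trivial majority-class predictor, while precision, recall, and the F-measure all expose it.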
ROC (Receiver Operating Characteristic), continued
- To draw an ROC curve, the classifier must produce continuous-valued output.
  - The outputs are used to rank the test records, from the one most likely to be positive to the one least likely to be positive.
- Many classifiers produce only discrete outputs (i.e., the predicted class).
  - How to get continuous-valued outputs? Decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, and SVMs can all be adapted to produce them.

Example: Decision Trees
(figure: a decision tree with continuous-valued outputs at its leaves)

ROC Curve Example
(figure: the ROC curve obtained from the decision tree above)

ROC Curve Example
- A 1-dimensional data set containing 2 classes (positive and negative).
- Any point located at x > t is classified as positive.
- At threshold t: TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88.

Using ROC for Model Comparison
- Neither model consistently outperforms the other:
  - M1 is better for small FPR.
  - M2 is better for large FPR.
- The area under the ROC curve (AUC) summarizes the comparison:
  - Ideal: area = 1
  - Random guessing: area = 0.5

How to Construct an ROC Curve
- Use a classifier that produces a continuous-valued output score(+|A) for each test instance A:

    Instance   score(+|A)   True Class
        1         0.95          +
        2         0.93          +
        3         0.87          -
        4         0.85          -
        5         0.85          -
        6         0.85          +
        7         0.76          -
        8         0.53          +
        9         0.43          -
       10         0.25          +

- Sort the instances according to score(+|A) in decreasing order.
- Apply a threshold at each unique value of score(+|A).
- Count the number of TP, FP, TN, FN at each threshold:
  - TPR = TP / (TP + FN)
  - FPR = FP / (FP + TN)

How to Construct an ROC Curve, continued

    Class          +     -     +     -     -     -     +     -     +     +
    Threshold >= 0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
    TP              5     4     4     3     3     3     3     2     2     1     0
    FP              5     5     4     4     3     2     1     1     0     0     0
    TN              0     0     1     1     2     3     4     4     5     5     5
    FN              0     1     1     2     2     2     2     3     3     4     5
    TPR             1   0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2     0
    FPR             1     1   0.8   0.8   0.6   0.4   0.2   0.2     0     0     0

(figure: the resulting ROC curve, plotting the (FPR, TPR) pairs above)
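The construction procedure is short enough to sketch directly in code. The following plain-Python sketch (no libraries; data taken from the 10-instance table above) sweeps the threshold down through the sorted scores and emits the (FPR, TPR) points; read right to left, the columns of the threshold table give exactly these points:

```python
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']

P = labels.count('+')  # 5 positives in total
N = labels.count('-')  # 5 negatives in total

# Sort by score, descending; break ties with '+' first, as the slide's
# table does when it steps through the three records scored 0.85.
records = sorted(zip(scores, labels), key=lambda r: (-r[0], r[1]))

tp = fp = 0
points = [(0.0, 0.0)]  # threshold above every score: everything negative
for score, label in records:
    # Lowering the threshold to this score makes this record positive.
    if label == '+':
        tp += 1
    else:
        fp += 1
    points.append((fp / N, tp / P))  # (FPR, TPR)

for fpr, tpr in points:
    print(f"FPR={fpr:.1f}  TPR={tpr:.1f}")
# The sweep ends at (1.0, 1.0), where everything is declared positive;
# plotting these points and connecting them yields the ROC curve.
```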
Handling the Class Imbalance Problem
- Class-based ordering (e.g., RIPPER):
  - Rules for the rare class have higher priority.
- Cost-sensitive classification:
  - Misclassifying a rare-class example as the majority class is more expensive than misclassifying a majority-class example as the rare class.
- Sampling-based approaches.

Cost Matrix
- f(i,j): number of class i examples predicted as class j (the confusion matrix).
- C(i,j): cost of misclassifying a class i example as class j.

    Confusion counts f(i,j):               Cost matrix C(i,j):
                 PREDICTED CLASS                        PREDICTED CLASS
                 Class=Yes    Class=No                  Class=Yes    Class=No
    ACTUAL  Yes  f(Yes,Yes)   f(Yes,No)    ACTUAL  Yes  C(Yes,Yes)   C(Yes,No)
    CLASS   No   f(No,Yes)    f(No,No)     CLASS   No   C(No,Yes)    C(No,No)

    Cost = Σ_{i,j} C(i,j) × f(i,j)

Computing the Cost of Classification

    Cost matrix C(i,j):   PREDICTED CLASS
                             +      -
    ACTUAL CLASS    +       -1    100
                    -        1      0

    Model M1:    PREDICTED CLASS      Model M2:    PREDICTED CLASS
                    +      -                          +      -
    ACTUAL   +     150     40         ACTUAL   +     250     45
    CLASS    -      60    250         CLASS    -       5    200

    M1: Accuracy = 80%, Cost = 3910
    M2: Accuracy = 90%, Cost = 4255

  (M2 has the higher accuracy, but M1 has the lower total cost.)

Cost-Sensitive Classification
- Example: Bayesian classifier.
  - Given a test record x, compute p(i|x) for each class i.
  - Decision rule: classify x as class k where

      k = argmax_i p(i|x)

  - For 2 classes, classify x as + if p(+|x) > p(-|x).
  - This decision rule implicitly assumes that C(+,+) = C(-,-) = 0 and C(+,-) = C(-,+).
- General decision rule: classify test record x as class k where

      k = argmin_j Σ_i p(i|x) × C(i,j)

- In the 2-class case:
  - Cost(+) = p(+|x) C(+,+) + p(-|x) C(-,+)
  - Cost(-) = p(+|x) C(+,-) + p(-|x) C(-,-)
  - Decision rule: classify x as + if Cost(+) < Cost(-).
  - If C(+,+) = C(-,-) = 0, this reduces to classifying x as + if

      p(+|x) > C(-,+) / (C(-,+) + C(+,-))

  A worked cost computation appears in the sketch after the next slide.

Sampling-Based Approaches
- Modify the distribution of the training data so that the rare class is well represented in the training set:
  - Undersample the majority class.
  - Oversample the rare class.
- Each approach has advantages and disadvantages.
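As a worked check of the cost formula, here is a minimal plain-Python sketch reproducing the M1/M2 comparison from the cost slides; the confusion and cost matrices are stored as dicts keyed by (actual, predicted). The final lines evaluate the reduced decision-rule threshold purely as an illustration: the reduced rule assumes C(+,+) = C(-,-) = 0, whereas this cost matrix rewards true positives with C(+,+) = -1.

```python
C = {('+', '+'): -1, ('+', '-'): 100,   # cost matrix C(i, j) from the slide
     ('-', '+'):  1, ('-', '-'):   0}

M1 = {('+', '+'): 150, ('+', '-'):  40,  # confusion counts f(i, j)
      ('-', '+'):  60, ('-', '-'): 250}
M2 = {('+', '+'): 250, ('+', '-'):  45,
      ('-', '+'):   5, ('-', '-'): 200}

def accuracy(f):
    return (f[('+', '+')] + f[('-', '-')]) / sum(f.values())

def cost(f):
    # Cost = sum over (i, j) of C(i, j) * f(i, j)
    return sum(C[ij] * n for ij, n in f.items())

for name, f in [("M1", M1), ("M2", M2)]:
    print(f"{name}: accuracy = {accuracy(f):.0%}, cost = {cost(f)}")
# M1: accuracy = 80%, cost = 3910
# M2: accuracy = 90%, cost = 4255  -- more accurate, yet more costly

# Illustrative only (the reduced rule assumes C(+,+) = C(-,-) = 0):
threshold = C[('-', '+')] / (C[('-', '+')] + C[('+', '-')])
print(f"classify as + when p(+|x) > {threshold:.4f}")  # ~0.0099
```

With C(-,+) = 1 and C(+,-) = 100, a record needs only about a 1% posterior probability of being positive before the cost-sensitive rule labels it positive, which is how cost-sensitivity favors the expensive-to-miss rare class.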
