An introductory document on data mining
Page 1
Data Mining Tutorial
D. A. Dickey
Page 2
April 2012
Page 3
Data Mining - What is it?
Page 4
• A “divisive” method (splits)
• Start with “root node” – all in one group
• Get splitting rules
• Response often binary
• Result is a “tree”
• Example: Loan Defaults
• Example: Framingham Heart Study
• Example: Automobile fatalities
Page 5
Pr{default} = 0.006
(figure: tree diagram with “No default” / “Default” outcomes)
Page 6
Some Actual Data
Page 8
How to make splits?
• Which variable to use?
• Where to split?
– Cholesterol > ______
– Systolic BP > ______
• Goal: Pure “leaves” or “terminal nodes”
• Ideal split: everyone with BP > x has problems, nobody with BP < x has problems
Page 9
(figure: 2 x 2 table of problem status (No / Yes) by blood-pressure group)
Page 10
Chi-Square (χ²) Test Statistic
• Expect 100(150/200) = 75 in the upper-left cell if independent (e.g. 100(50/200) = 25, etc.)
• Compare to chi-square tables – Significant!
WHERE IS HIGH BP CUTOFF???
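For reference, the test statistic behind this comparison is the usual contingency-table chi-square, with expected counts formed from the margins exactly as on the slide:
\[
\chi^2 = \sum_{\text{cells}} \frac{(O-E)^2}{E},
\qquad
E_{\text{upper left}} = \frac{(\text{row total})(\text{column total})}{\text{grand total}} = \frac{100 \times 150}{200} = 75 .
\]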
Page 11
Measuring “Worth” of a Split
• P-value is the probability of a chi-square as great as that observed if independence is true.
• P-values get extremely small, so work with logworth instead.
• The best (largest) chi-square corresponds to the maximum logworth.
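For reference (this formula is standard, not spelled out on the slide), logworth is just the negative base-10 logarithm of the split's p-value, so tiny p-values become large, easy-to-compare numbers:
\[
\text{logworth} = -\log_{10}(p\text{-value}),
\qquad p = 0.001 \;\Rightarrow\; \text{logworth} = 3 .
\]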
Page 12
Logworth for Age Splits
(plot of logworth for each candidate age split; Age 47 maximizes logworth)
Page 13
How to make splits?
• Which variable to use?
Page 14
Multiple testing
• 50 different BPs in data, 49 ways to split
• Sunday football highlights always look
good!
• If he shoots enough times, even a 95% free throw shooter will miss.
• Tried 49 splits, each has 5% chance of
declaring significance even if there’s no
relationship
Page 15
Multiple testing
α = Pr{ falsely reject hypothesis 1 }
  = Pr{ falsely reject hypothesis 2 }
Pr{ falsely reject one or the other } < 2α
Desired: probability 0.05 or less. Solution: use α = 0.05/2,
or – compare 2(p-value) to 0.05
Page 16
Multiple testing
• Bonferroni – original idea
• Kass – apply to data mining (trees)
• Stop splitting if minimum p-value is large.
• For m splits, logworth becomes -log10(m x p-value)
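A small worked version of that Kass/Bonferroni adjustment, using the 49 candidate blood-pressure splits mentioned on the earlier slide:
\[
\text{adjusted logworth} = -\log_{10}(m \cdot p) = \text{logworth} - \log_{10}(m),
\qquad m = 49 \;\Rightarrow\; \log_{10}(49) \approx 1.69 .
\]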
Page 17
Other Split Evaluations
• Gini Diversity Index (see the formula sketch after the next list)
Page 18
• Split if diversity in parent “node” > summed diversities in child nodes
• Observations should be
  – Homogeneous (not diverse) within leaves
  – Different between leaves (the leaves themselves should be diverse)
• Framingham tree used Gini for splits
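The Gini diversity index itself is not written out on the slides; the standard definition for a node with class proportions p_k is:
\[
G = 1 - \sum_{k} p_k^{2}
\qquad\text{(binary case: } G = 2p(1-p)\text{, zero for a pure node, largest at } p = 0.5\text{).}
\]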
Page 19
Validation
• Traditional stats – small dataset, need all observations to estimate parameters of interest
• Data mining – loads of data, can afford
“holdout sample”
• Variation: n-fold cross validation
– Randomly divide data into n sets
– Estimate on n-1, validate on 1
– Repeat n times, using each set as holdout.
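A minimal SAS sketch of assigning observations to folds for n-fold cross validation (the dataset name mydata, the seed, and the choice of 5 folds are assumptions, not from the slides):

data folds;
  set mydata;                         /* input data set (hypothetical name)   */
  fold = ceil(5 * ranuni(12345));     /* random fold label 1-5 for 5-fold CV  */
run;
/* fit on the four folds with fold ne k, score the held-out fold, repeat for k = 1 to 5 */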
Page 20
Pruning
• Grow bushy tree on the “fit data”
• Classify holdout data
• Likely farthest out branches do not
improve, possibly hurt fit on holdout data
• Prune non-helpful branches
• What is “helpful”? What is good
discriminator criterion?
Page 21
• Want diversity in parent “node” > summed diversities in child nodes
• Goal is to reduce diversity within leaves
• Goal is to maximize differences between leaves
• Use validation average squared error,
proportion correct decisions, etc.
• Costs (profits) may enter the picture for
splitting or pruning.
Page 22
Accounting for Costs
• Pardon me (sir, ma’am) can you spare
some change?
• Say “sir” to male +$2.00
• Say “ma’am” to female +$5.00
• Say “sir” to female -$1.00 (balm for
slapped face)
• Say “ma’am” to male -$10.00 (nose splint)
Page 23
Including Probabilities
(leaf with true gender probabilities: P(M) = 0.7, P(F) = 0.3)
Expected profit is 2(0.7) - 1(0.3) = $1.10 if I say “sir”
Expected profit is -7 + 1.5 = -$5.50 (a loss) if I say “ma'am”
Weight leaf profits by leaf size (# obsns.) and sum.
Prune (and split) to maximize profits.
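In general the leaf profit for a decision weights each payoff by the leaf's class probabilities; with the cost matrix from the previous slide and P(male) = 0.7 in this leaf:
\[
E[\text{profit}\mid d] = \sum_{c} P(c)\,\text{profit}(d,c),
\qquad
E[\text{profit}\mid \text{``sir''}] = 2(0.7) - 1(0.3) = \$1.10 .
\]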
Page 24
Additional Ideas
• Forests – Draw samples with replacement (bootstrap) and grow multiple trees
• Random Forests – Randomly sample the
“features” (predictors) and build multiple trees
• Classify new point in each tree then
average the probabilities, or take a
plurality vote from the trees
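One way to draw the bootstrap samples in SAS is PROC SURVEYSELECT with unrestricted (with-replacement) sampling; this is only a sketch, and the dataset name and the 100 replicates are assumptions:

proc surveyselect data=mydata out=bootsamp
     method=urs samprate=1 reps=100 outhits;   /* 100 with-replacement samples, each of size n */
run;
/* grow one tree per value of Replicate in bootsamp, then average probabilities or take a vote */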
Page 25
• Cumulative Lift Chart
  – Go from the leaf with the highest predicted response rate to the lowest, accumulating the responses captured (the cumulative lift).
Page 28
• Predict P_i in cell i
• Y_ij = jth response in cell i
• Split to minimize Σ_i Σ_j (Y_ij - P_i)²
Page 29
Real data example: Traffic accidents in Portugal*
Y = injury-induced “cost to society”
* Tree developed by Guilhermina Torrao (used with permission), NCSU Institute for Transportation Research & Education
Help - I ran into a “tree”
Page 30
Cool <-> Nerdy
“Analytics” <-> “Statistics”
“Predictive Modeling” <-> “Regression”
Another major tool:
Regression (OLS: ordinary least squares)
Page 31
“If the Life Line is long and deep, then this represents a long life full of vitality and health. A short line, if strong and deep, also shows great vitality in your life and the ability to overcome health problems. However, if the line is short and shallow, then your life may have the tendency to be controlled by others.”
http://www.ofesite.com/spirit/palm/lines/linelife.htm
Page 32
Wilson & Mather, JAMA 229 (1974): X = life line length, Y = age at death
Result: Predicted Age at Death = 79.24 – 1.367(lifeline)   (Is this “real”??? Is this repeatable???)

proc sgplot data=life;     /* dataset name taken from the later proc reg step */
  scatter y=age x=line;
  reg y=age x=line;
run;
Page 33
We Use LEAST SQUARES
Squared residuals sum to 9609
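Least squares means choosing the intercept and slope to minimize the sum of squared residuals; for this data that minimum is the 9609 quoted above:
\[
\min_{b_0,\,b_1}\; \sum_{i=1}^{n}\big(Y_i - b_0 - b_1 X_i\big)^2 \;=\; 9609 .
\]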
Page 34
Simulation: Age at Death = 67 + 0(life line) + e
Error e has a normal distribution with mean 0 and variance 200
Simulate 20 cases with n = 50 bodies each
NOTE: A regression equation like
Predicted Age at Death = 79.24 – 1.367(lifeline)
would NOT be unusual even if there is no true relationship.
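A minimal SAS sketch of this simulation (only the model Age = 67 + 0(line) + e with error variance 200 comes from the slide; the variable names, seed, and distribution of life line lengths are assumptions):

data sim;
  do case = 1 to 20;                               /* 20 simulated studies          */
    do body = 1 to 50;                             /* n = 50 bodies per study       */
      line = 8 + 3*rannor(123);                    /* hypothetical life line length */
      age  = 67 + 0*line + sqrt(200)*rannor(123);  /* true slope is 0, variance 200 */
      output;
    end;
  end;
run;
proc reg data=sim;
  by case;                 /* one fitted line per simulated study */
  model age = line;
run;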
Page 35
Traditionally p < 0.05 implies the hypothesized value is wrong;
p > 0.05 is inconclusive.
(figure: distribution of t under H0)
Page 36
proc reg data=life;
(figure: t distribution with tail areas of 0.19825 beyond t = -0.86 and t = +0.86, totaling p = 0.3965)
Page 37
H0: True slope is 0 (no association)
H1: True slope is not 0
P = 0.3965
Conclusion: insufficient evidence against the hypothesis of no linear relationship
Page 38
Simulation: Age at Death = 67 + 0(life line) + e
Error e has a normal distribution with mean 0 and variance 200. WHY variance 200?
Simulate 20 cases with n = 50 bodies each
Want an estimate of variability around the true line; the true variance is 200.
Use sums of squared residuals (SS):
Sum of squared residuals from the mean is “SS(total)” = 9755
Sum of squared residuals around the line is “SS(error)” = 9609
(1) SS(total) - SS(error) is SS(model) = 146
(2) Variance estimate is SS(error)/(degrees of freedom) = 200
(3) SS(model)/SS(total) is R², i.e. the proportion of variability “explained” by the model

Corrected Total    49    9755.22000
Root MSE     14.14854    R-Square    0.0150
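Worked out with the numbers above:
\[
R^2 = \frac{SS(\text{model})}{SS(\text{total})} = \frac{146}{9755} \approx 0.0150,
\qquad
\hat{\sigma}^2 = \frac{SS(\text{error})}{n-2} = \frac{9609}{48} \approx 200,
\qquad
\text{Root MSE} = \sqrt{200} \approx 14.1 .
\]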
Page 39
Those Mysterious “Degrees of Freedom” (DF)
The first Martian gives information about the average height, but 0 information about variation.
The 2nd Martian gives the first piece of information (DF) about the error variance around the mean.
n Martians -> n-1 DF for error (variation)
Page 40
(plot: Martian Weight vs. Martian Height)
2 points -> no information on variation of errors
n points -> n-2 error DF
Page 41
How Many Table Legs? (regress Y on X1, X2)
(figure: plane fitted over the X1, X2 axes, with error shown vertically)
Fit a plane -> n-3 (= 37) error DF (2 “model” DF, n-1 = 39 “total” DF)
Regress Y on X1 X2 … X7 -> n-8 error DF (7 “model” DF, n-1 “total” DF)

Source             DF    Sum of Squares    Mean Square
Model               2          32660996       16330498
Error              37           1683844          45509
Corrected Total    39          34344840

Three legs will all touch the floor;
the fourth leg gives the first chance to measure error (the first error DF).
Page 42
Extension: Multiple Regression
Issues:
(1) Testing joint importance versus individual significance
(2) Prediction versus modeling individual effects
(3) Collinearity (correlation among inputs)
Example: A hypothetical company's sales Y depend on TV advertising X1 and radio advertising X2:
Y = β0 + β1·X1 + β2·X2 + e
Jointly critical (can't omit both!!)
A two-engine plane can still fly if engine #1 fails; a two-engine plane can still fly if engine #2 fails.
Neither is critical individually.
Page 43
data sales;
  length sval $8;  length cval $8;
  input store TV radio sales;
  /* datalines with the 40 store records follow in the original; not shown in this extract */
Page 44
Conclusion: Can predict well with just TV, just radio, or both!
SAS code:
proc reg data=next; model sales = TV radio;

Analysis of Variance
                        Sum of         Mean
Source          DF     Squares       Square    F Value    Pr > F
Model            2    32660996     16330498     358.84    <.0001   (can't omit both)
TV               1     5.00435      5.01845       1.00    0.3251   (can omit TV)
radio            1     4.66752      4.94312       0.94    0.3512   (can omit radio)

Estimated Sales = 531 + 5.0 TV + 4.7 radio, with error variance 45509 (standard deviation 213)

TV is approximately equal to radio, so approximately
Estimated Sales = 531 + 9.7 TV, or
Estimated Sales = 531 + 9.7 radio
Page 48
Summary: Good predictions are given by
Sales = 531 + 5.0 x TV + 4.7 x Radio, or
Sales = 479 + 9.7 x TV, or
Sales = 612 + 9.6 x Radio, or (lots of others)
Why the confusion? The evil Multicollinearity!! (correlated X's)
Page 49
Multicollinearity can be diagnosed by looking at principal components (axes of variation).
Variance along the PC axes -> “eigenvalues” of the correlation matrix.
Directions the axes point -> “eigenvectors” of the correlation matrix.
(figure: scatterplot of Radio $ vs. TV $ with Principal Component Axis 1 and Axis 2 overlaid)

proc corr; var TV radio sales;
Pearson Correlation Coefficients, N = 40
Prob > |r| under H0: Rho=0
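A minimal sketch of getting those eigenvalues and eigenvectors directly in SAS (this particular call is not shown in the slides; the dataset name is taken from the earlier PROC REG step):

proc princomp data=next;   /* uses the correlation matrix by default */
  var TV radio;
run;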
Page 50
TEXT MINING
Hypothetical collection of e-mails (“corpus”) from analytics students:
John, message 1: There’s a good cook there
Susan, message 1: I have an analytics practicum then
Susan, message 2: I’ll be late from analytics
John, message 2: Shall we take the kids to a movie?
John, message 3: Later we can eat what I cooked yesterday
(etc.)
Compute word counts:
         analytics  cook_n  cook_v  kids  late  movie  practicum
John             0       1       1     1     1      1          0
Susan            2       0       0     0     1      0          1
Page 51
Text Mining Mini-Example: Word counts in 16 e-mails
Page 52
Eigenvalues of the Correlation Matrix
(columns: Eigenvalue, Difference, Proportion, Cumulative by dimension; numeric values not recovered in the extract)

Prin1 loadings:
Job          0.317700
Practicum    0.318654
Analytics    0.306205
Movie       -0.283351
Data         0.314980
SAS          0.279258
Kids        -0.309731
Miner        0.290127
Grocerylist -0.269651
Interview    0.261794
Late        -0.049560
Cook_v      -0.267515
Cook_n      -0.225621
Page 54
PROC CLUSTER (single linkage) agrees!
Page 56
Unsupervised Learning
• We have the “features” (predictors)
• We do NOT have the response even on a
training data set (UNsupervised)
Page 57
EM -> PROC FASTCLUS
• Step 1 – find (50) “seeds” as separated as possible
• Step 2 – cluster points to nearest seed
– Drift: As points are added, change seed
(centroid) to average of each coordinate
– Alternatively: Make full pass then recompute
seed and iterate
• Step 3 – aggregate clusters using Ward’s method
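A rough SAS sketch of that two-stage approach (the dataset and variable names are hypothetical):

proc fastclus data=points maxclusters=50 mean=seeds out=clus;  /* steps 1-2: seeds, then assign/drift */
  var x y;
run;
proc cluster data=seeds method=ward outtree=tree;              /* step 3: aggregate cluster means by Ward's method */
  var x y;
run;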
Page 58
Clusters as Created
Page 59
As Clustered – PROC FASTCLUS
Page 60
Cubic Clustering Criterion (to decide # of clusters)
• Divide random scatter of (X,Y) points into 4 quadrants
• Pooled within cluster variation much less than overall variation
• Large variance reduction
• Big R-square despite no real clusters
• CCC compares random scatter R-square
to what you got to decide #clusters
• 3 clusters for “macaroni” data.
Page 61
Grades vs IQ and Study Time
data tests; input IQ Study_Time Grade; IQ_S = IQ*Study_Time;  /* datalines not shown in the extract */
proc reg data=tests; model Grade = IQ;
proc reg data=tests; model Grade = IQ Study_Time;
Page 62
Contrast:
TV advertising loses significance when radio is added;
IQ gains significance when study time is added.
Model for Grades:
Predicted Grade = 0.74 + 0.47 x IQ + 2.10 x Study Time
Question: Does an extra hour of study really deliver 2.10 points for everyone, regardless of IQ? The current model only allows a single common slope.
Page 63
Grade = (72.21 - 15.86) + (6.47 - 4.11) x Study Time = 56.35 + 2.36 x Study Time
proc reg; model Grade = IQ Study_Time IQ_S;
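With the interaction term in the model, the payoff of an extra hour of study depends on IQ instead of being a single number:
\[
\text{Grade} = \beta_0 + \beta_1\,\text{IQ} + \beta_2\,\text{ST} + \beta_3\,(\text{IQ}\times\text{ST})
\;\Rightarrow\;
\frac{\partial\,\text{Grade}}{\partial\,\text{ST}} = \beta_2 + \beta_3\,\text{IQ},
\]
which is why different IQ levels show different study-time slopes (the 1.30 and 2.36 on the next slide's plot).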
Page 64
(1) Adding the interaction makes everything insignificant (individually)!
(2) Do we need to omit insignificant terms until only significant ones remain?
(3) Has an acquitted defendant proved his innocence?
(4) Common sense trumps statistics!
(figure: two fitted lines, Slope = 1.30 and Slope = 2.36)
Page 65
Classification Variables (dummy variables, indicator variables)
Predicted Accidents = 1181 + 2579 X11
X11 is 1 in November, 0 elsewhere
Interpretation:
In November, predict 1181 + 2579(1) = 3760
In any other month predict 1181 + 2579(0) = 1181
1181 is average of other months
2579 is added November effect (vs average of others)
Model for NC Crashes involving Deer:
Proc reg data=deer; model deer = X11;
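A minimal sketch of building the November indicator from a SAS date variable (the slides show only the regression; this construction step is an assumption):

data deer;
  set deer;
  X11 = (month(date) = 11);   /* 1 for November observations, 0 otherwise */
run;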
Page 67
Looks like December and October need dummies too!
proc reg data=deer; model deer = X10 X11 X12;
Average of Jan through Sept is 929 crashes per month.
Add 1391 in October, 2830 in November, 1377 in December.

date    x10  x11  x12
JAN03    0    0    0
FEB03    0    0    0
MAR03    0    0    0
APR03    0    0    0
MAY03    0    0    0
JUN03    0    0    0
JUL03    0    0    0
AUG03    0    0    0
SEP03    0    0    0
OCT03    1    0    0
NOV03    0    1    0
DEC03    0    0    1
JAN04    0    0    0
FEB04    0    0    0
MAR04    0    0    0
APR04    0    0    0
MAY04    0    0    0
JUN04    0    0    0
JUL04    0    0    0
AUG04    0    0    0
SEP04    0    0    0
OCT04    1    0    0
NOV04    0    1    0
DEC04    0    0    1
Page 69
What the heck – let's do all but one (need “average of rest,” so must leave out at least one)
proc reg data=deer; model deer = X1 X2 … X10 X11;
The “average of the rest” is just the December mean, 2307. Subtract 886 in January, add 1452 in November. October (X10) is not significantly different from December.
Page 72
Add date (days since Jan 1, 1960 in SAS) to capture trend:
proc reg data=deer; model deer = date X1 X2 … X10 X11;

Variable    DF    Estimate    Std Error    t Value    Pr > |t|
date         1     0.22341      0.03245       6.88      <.0001

The trend is 0.22 more accidents per day (about 1 per 5 days) and is significantly different from 0.
Page 78
Logistic Regression
• “Trees” seem to be the main tool.
• Logistic – another classifier
• Older – “tried & true” method
Page 80
Example: Seat Fabric Ignition
• Flame exposure time = X
  – Y = 0 at X = 3, 5, 9, 10, 13, 16
• p's are all different: p_i = exp(a + bX_i) / (1 + exp(a + bX_i))
• Find a,b to maximize Q(a,b)
Page 81
• Logistic idea: map p in (0,1) to L on the whole real line
• p(i) = e^(a+bX_i) / (1 + e^(a+bX_i))
• Write p(i) if response, 1 - p(i) if not
• Multiply all n of these together; find a, b to maximize
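Written out, the logit map and the quantity Q being maximized (the likelihood) are the standard logistic-regression forms, in the slide's notation:
\[
L_i = \ln\!\frac{p_i}{1-p_i} = a + bX_i,
\qquad
Q(a,b) = \prod_{i=1}^{n} p_i^{\,Y_i}\,(1-p_i)^{\,1-Y_i},
\qquad
p_i = \frac{e^{a+bX_i}}{1+e^{a+bX_i}} .
\]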
Page 82
DATA LIKELIHOOD;
ARRAY Y(14) Y1-Y14; ARRAY X(14) X1-X14;
DO I=1 TO 14; INPUT X(I) Y(I) @@; END;   /* reads the 14 (X,Y) pairs from datalines (not shown) */
DO A=-8 TO 2 BY 0.1; DO B=-0.2 TO 0.8 BY 0.02;   /* (a,b) grid; ranges assumed, not in the extract */
Q=1;
DO I=1 TO 14; P=EXP(A+B*X(I))/(1+EXP(A+B*X(I)));
IF Y(I)=1 THEN Q=Q*P; ELSE Q=Q*(1-P); END;
IF Q<0.0006 THEN Q=0.0006; OUTPUT;
END; END;
Page 83
Likelihood function (Q)
(figure: likelihood surface over (a, b), maximized near a = -2.6, b = 0.23)
Page 84
(figure: illustration of a concordant pair and a discordant pair)
Page 85
IGNITION DATA
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates

                              Standard        Wald
Parameter    DF    Estimate      Error    Chi-Square    Pr > ChiSq
Intercept     1     -2.5879     1.8469        1.9633        0.1612
TIME          1      0.2346     0.1502        2.4388        0.1184

Association of Predicted Probabilities and Observed Responses

Percent Concordant    79.2    Somers' D    0.583
Percent Discordant    20.8    Gamma        0.583
Percent Tied           0.0    Tau-a        0.308
Pairs                 48      c            0.792
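Checking the association measures against the concordant/discordant counts (standard relationships, not shown on the slide; the 48 pairs are the 6 observations with Y = 0 crossed with the 8 with Y = 1 among the 14):
\[
c = \frac{C + 0.5\,T}{\text{Pairs}} = \frac{38 + 0}{48} \approx 0.792,
\qquad
\text{Somers' } D = \frac{C - D}{\text{Pairs}} = \frac{38 - 10}{48} \approx 0.583 .
\]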
Page 87
Example: Shuttle Missions
• O-rings failed in Challenger disaster
• Low temperature
• Prior flights showed “erosion” and “blowby” in O-rings
• Feature: Temperature at liftoff
• Target: problem (1) - erosion or blowby vs no problem (0)
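A minimal sketch of fitting this logistic model in SAS (the dataset and variable names shuttle, problem, and temp are assumptions, not from the slides):

proc logistic data=shuttle;
  model problem(event='1') = temp;   /* P(problem at liftoff) as a function of temperature */
run;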