DATA REGRESSION. Graduate Program in Computer Science. Electronic course notes. Dr. Võ Thị Ngọc Châu

Faculty of Computer Science & Engineering, Ho Chi Minh City University of Technology
Chapter 3: Data Regression
Graduate Program in Computer Science, electronic course notes
Prepared by: Dr. Võ Thị Ngọc Châu (chauvtn@cse.hcmut.edu.vn)
Semester: 2011-2012

References
[1] Jiawei Han, Micheline Kamber, “Data Mining: Concepts and Techniques”, Second Edition, Morgan Kaufmann Publishers, 2006.
[2] David Hand, Heikki Mannila, Padhraic Smyth, “Principles of Data Mining”, MIT Press, 2001.
[3] David L. Olson, Dursun Delen, “Advanced Data Mining Techniques”, Springer-Verlag, 2008.
[4] Graham J. Williams, Simeon J. Simoff, “Data Mining: Theory, Methodology, Techniques, and Applications”, Springer-Verlag, 2006.
[5] Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, Vipin Kumar, “Next Generation of Data Mining”, Taylor & Francis Group, LLC, 2009.
[6] Daniel T. Larose, “Data Mining Methods and Models”, John Wiley & Sons, Inc., 2006.
[7] Ian H. Witten, Eibe Frank, “Data Mining: Practical Machine Learning Tools and Techniques”, Second Edition, Elsevier Inc., 2005.
[8] Florent Masseglia, Pascal Poncelet, Maguelonne Teisseire, “Successes and New Directions in Data Mining”, IGI Global, 2008.
[9] Oded Maimon, Lior Rokach, “Data Mining and Knowledge Discovery Handbook”, Second Edition, Springer Science + Business Media, LLC, 2005, 2010.

Contents
Chapter 1: Overview of data mining
Chapter 2: Data preprocessing issues
Chapter 3: Data regression
Chapter 4: Data classification
Chapter 5: Data clustering
Chapter 6: Association rules
Chapter 7: Data mining and database technology
Chapter 8: Applications of data mining
Chapter 9: Research topics in data mining
Chapter 10: Review

Chapter 3: Data Regression
3.1 Overview of regression
3.2 Linear regression
3.3 Nonlinear regression
3.4 Applications
3.5 Issues with regression
3.6 Summary

3.0 Scenarios
- What will the STB share price be tomorrow?
- [Figure: plot of y against x showing the line y = x + 1 and the marked values X1, Y1, and Y1'.] What model describes how the data y are distributed as a function of x?
- Market basket analysis → which items are associated with one another?
- Survey of the factors that influence the tendency to use online advertising in Vietnam (values as shown on the slide):
  - Perceived entertainment (+0.209)
  - Information quality (+0.261)
  - Perceived information quality (+0.199)
  - Perceived irritation (-0.175)
  - Perceived credibility
  - Attitude toward privacy
  - Interactivity (+0.373)
  - Subjective norm (+0.254)
  - Perceived behavioral control (+0.377)
- Regression
  - Predictive data mining: which of the scenarios?
  - Descriptive data mining: which of the scenarios?

3.1 Overview of regression
Definition of regression
- J. Han et al. (2001, 2006): regression is a statistical technique that predicts continuous (numeric) values.
- Wiki (2009): regression (regression analysis) is a statistical technique that estimates the relationships among variables.
- R. D. Snee (1977): regression (regression analysis) is a statistical technique in the field of data analysis and empirical model building; the resulting regression model can be used for prediction, control, or learning about the mechanism that generated the data.
  R. D. Snee, Validation of Regression Models: Methods and Examples, Technometrics, vol. 19 (Nov. 1977), pp. 415-428.

Logistic regression
- Logistic function (standard form, with intercept α): $\pi(x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$
- [Figure: logistic curve π(x), panel (a) β > 0, panel (b) β < 0.]
- The parameter β determines the rate of growth or decline of the curve:
  - the sign of β indicates whether the curve increases or decreases;
  - the magnitude of β determines the rate of that increase or decrease.
- β > 0: π(x) increases as x increases.
- β < 0: π(x) decreases as x increases.
- β → 0: the curve tends toward a horizontal straight line.
- When β = 0, Y is independent of X.

Logistic regression → logistic discriminant analysis
- Descriptive model.
- A very powerful tool for classification problems in discriminant analysis → tends to have higher accuracy than Naïve Bayes when training data are plentiful.
- Applied in many medical and clinical research studies.
- Can be viewed as a neural network model without hidden nodes, with a logistic activation function and a softmax output function.
- The
y_i are binary variables and thus not normally distributed.
- The distribution of y_i given x is assumed to follow a Bernoulli distribution: $p(y_i \mid x) = \pi(x)^{y_i} \, (1 - \pi(x))^{1 - y_i}$ → the logit $\ln \frac{\pi(x)}{1 - \pi(x)}$ is a linear function of x.

Logistic regression: estimating the β's
- The β's in π(x) = p(y = 1 | x) are estimated by maximum likelihood.
  → Find the smallest possible deviance between the observed and predicted values (much like finding the best-fitting line), using calculus (specifically, derivatives).
  → The procedure runs through successive iterations, trying different solutions until it reaches the smallest possible deviance, that is, the best fit.
  → Once it has found the best solution, it reports a final value for the deviance D, usually referred to as "negative two log likelihood", which can be treated as a chi-square value:
  $D = -2 \ln \left( \frac{\text{likelihood of the reduced model}}{\text{likelihood of the full model}} \right)$
- Likelihood of the reduced model = likelihood of the predicted values (π(x)).
- Likelihood of the full model = probabilities of the observed values (y = 1/0).

Logistic regression: example
- The parameter estimates for the five variables selected in the final model, with the corresponding Wald statistics. No variable appears to be non-significant at a significance level of 0.05. The variable Vdpflart indicates whether or not the price of the first purchase is paid in instalments; it is decisively estimated to be the variable most associated with the response variable.
  P. Giudici, Applied Data Mining – Statistical Methods for Business and Industry, John Wiley & Sons Ltd, 2003, p. 166.

Generalized additive models
- An extension of the generalized linear model.
- Replace the simple weighted sums of the predictor variables by weighted sums of transformed versions of the predictor variables: $g(\mu) = \alpha + \sum_{j=1}^{p} f_j(x_j)$
- The relationships between the response variable and the predictor variables are estimated nonparametrically.
- The right-hand side is sometimes termed the additive predictor.
  → greater flexibility
- When some of the functions are estimated from the data and some are determined by
the researcher, the generalized additive model is sometimes called “semiparametric.”
- The model retains the merits of linear and generalized linear models.
  - How g changes with any particular predictor variable does not depend on how the other predictor variables change.
  - Interpretation is eased.
  - This comes at the cost of assuming that such an additive form does provide a good approximation to the “true” surface.
- The model can be readily generalized by including multiple predictor variables within individual f components of the sum.
  - This relaxes the simple additive interpretation.
- The additive form also means that we can examine each smoothed predictor variable separately, to see how well it fits the data.

A GAM fitting algorithm ([9], pp. 218-219)
- Backfitting algorithm to estimate the functions f_j and the constant α. Proceed through the following steps:
  - Initialize: $\alpha = \operatorname{ave}(y_i)$, $f_j = f_j^0$, $j = 1, \ldots, p$
    - Each predictor is given an initial functional relationship to the response, such as a linear one.
    - The intercept is given an initial value of the mean of y.
  - Cycle: $j = 1, \ldots, p, 1, \ldots, p, \ldots$: $f_j = S_j\left( y - \alpha - \sum_{k \neq j} f_k \mid x_j \right)$
    - A single predictor is selected.
    - Fitted values are constructed using all of the other predictors.
    - These fitted values are subtracted from the response.
    - A smoother S_j is applied to the resulting “residuals,” taken to be a function of the single excluded predictor.
    - The smoother updates the function for that predictor.
    - Each of the other predictors is, in turn, subjected to the same process.
  - Continue until the individual functions do not change.
- These “adaptive” methods seem to be most useful when the data have a high signal-to-noise ratio, when the response function is highly nonlinear, and when the variability in the response function changes dramatically from location to location.
  → Experience to date suggests that data from the
engineering and physical sciences are most likely to meet these criteria.
  → Data from the social sciences are likely to be far too noisy.
- Neural networks are a special case of the generalized additive linear models.
  - Multilayer feedforward neural networks with one hidden layer, where m is the number of processing units in the hidden layer.
  - The family of functions that can be computed depends on the number of neurons in the hidden layer and on the activation function σ.
  - Note that a standard multilayer feedforward network with a smooth activation function σ can approximate any continuous function on a compact set to any degree of accuracy if and only if the network’s activation function σ is not a polynomial.

Projection pursuit regression
- The additive models essentially focus on individual variables (albeit transformed versions of these).
- The additive models can be extended so that each additive component involves several variables, but it is not clear how best to select such subsets.
- If the total number of available variables is large, then we may also be faced with a combinatorial explosion of possibilities.
- The basic projection pursuit regression model is a linear combination of (potentially nonlinear) transformations of linear combinations of the raw variables.
- The f functions are not constrained (as in neural networks) to take a particular form, but are usually found by smoothing, as in generalized additive models.
- The term projection pursuit arises from the viewpoint that one is projecting X in direction α_k and then seeking directions of projection that are optimal for some purpose.
  - Here, optimal means optimal as components in a predictive model.
- The model is fitted using standard iterative procedures to estimate the parameters in the α_k vectors.
- The projection pursuit regression model has obvious close similarities to the neural network model.
- Projection pursuit regression models can
be proven to have the same ability to estimate arbitrary functions as neural networks, but they are not as widely used.
  - They are a generalization of neural networks.
  - Estimating their parameters can have advantages over the neural network situation.
- Projection pursuit regression may not be practical for data sets that are massive (large n) and high-dimensional (large p).
  - The fitting process is rather complex from a computational viewpoint.

Summary
- Regression
- Linear models
- Generalized linear model
- Logistic models
- Feedforward neural networks
- Back-propagation neural networks
- Generalized additive models
- Projection pursuit regression
→ Linearity to nonlinearity
→ Descriptive vs. predictive

Further reading
- Predictive modeling for regression: [2], chapter 11, pp. 367-398.
- Regression modeling, multiple regression and model building, logistic regression: [6], chapters 2-4, pp. 33-203.
- Data mining within a regression framework: [9], chapter 11, pp. 209-230.
- Statistical methods for data mining: [9], chapter 25, pp. 523-540.
- Validation of regression models: methods and examples: Ronald D. Snee, Technometrics, vol. 19 (Nov. 1977), pp. 415-428.
- Choosing between logistic regression and discriminant analysis: S. James Press, Sandra Wilson, Journal of the American Statistical Association, vol. 73, no. 364 (Dec. 1978), pp. 699-705.
- Fitting curves to data using nonlinear regression: a practical and nonmathematical review: Harvey J. Motulsky, Lennart A. Ransnas, FASEB J. (1987), pp. 365-374.
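
Worked sketch: logistic regression by maximum likelihood
The estimation procedure described under "Logistic regression" (iterate until the deviance D is as small as possible) can be sketched in Python as below. This is only an illustration under stated assumptions, not the method used on the slides: it assumes a single predictor x with an intercept α, uses Newton-Raphson as the iterative scheme, and takes the likelihood of the full model to be 1 for individual 0/1 observations, so that D reduces to minus twice the log likelihood of the fitted π(x). The function names and the simulated data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def deviance(y, pi):
    """D = -2 ln(L_reduced / L_full), assuming L_full = 1 for 0/1 data."""
    eps = 1e-12                       # guard against log(0)
    pi = np.clip(pi, eps, 1 - eps)
    return -2.0 * np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

def fit_logistic(x, y, n_iter=25):
    """Newton-Raphson for pi(x) = sigmoid(alpha + beta * x).

    Returns theta = [alpha, beta] and the deviance before each update.
    """
    X = np.column_stack([np.ones_like(x), x])   # intercept column + predictor
    theta = np.zeros(2)
    history = []
    for _ in range(n_iter):
        pi = sigmoid(X @ theta)
        history.append(deviance(y, pi))
        grad = X.T @ (y - pi)                   # gradient of the log likelihood
        w = pi * (1 - pi)                       # Bernoulli variances
        hess = X.T @ (X * w[:, None])           # negative Hessian
        theta = theta + np.linalg.solve(hess, grad)
    return theta, history

# Simulated data (hypothetical true values alpha = -0.5, beta = 1.5).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = (rng.random(500) < sigmoid(-0.5 + 1.5 * x)).astype(float)

theta, history = fit_logistic(x, y)
print("alpha, beta:", theta)
print("deviance at start and at end:", history[0], history[-1])
```

The deviance at the starting point (all probabilities equal to 0.5) is the largest value in the run; each Newton step moves the estimates toward the point where the derivative of the log likelihood is zero, which is the "best fit" in the sense used above.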

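Worked sketch: backfitting for an additive model
The backfitting steps listed under "A GAM fitting algorithm" (initialize, cycle over the predictors, stop when the functions no longer change) can be sketched in Python as below. This is a minimal illustration under stated assumptions: the link g is the identity, the smoother S_j is a hypothetical cubic least-squares fit chosen only for brevity (any scatterplot smoother could be substituted), and the functions start at zero rather than at a linear fit. All names and the simulated data are illustrative.

```python
import numpy as np

def poly_smoother(x, r, degree=3):
    """Hypothetical smoother S_j: cubic least-squares fit of r on x."""
    coefs = np.polyfit(x, r, degree)
    return np.polyval(coefs, x)

def backfit(X, y, smoother=poly_smoother, max_iter=50, tol=1e-8):
    """Backfitting for y = alpha + sum_j f_j(x_j) with an identity link."""
    n, p = X.shape
    alpha = y.mean()                 # intercept starts at the mean of y
    f = np.zeros((n, p))             # assumed starting point f_j^0 = 0
    for _ in range(max_iter):
        f_old = f.copy()
        for j in range(p):
            # Response minus the fit from all the OTHER predictors.
            partial = y - alpha - f.sum(axis=1) + f[:, j]
            # Smooth these partial residuals against predictor j.
            f[:, j] = smoother(X[:, j], partial)
            # Centre f_j so the intercept remains identifiable.
            f[:, j] -= f[:, j].mean()
        # Stop once the individual functions no longer change.
        if np.max(np.abs(f - f_old)) < tol:
            break
    return alpha, f

# Simulated additive data (hypothetical): y = 1.5 + sin(x1) + 0.5 * x2^2 + noise.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = 1.5 + np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, 200)

alpha, f = backfit(X, y)
fitted = alpha + f.sum(axis=1)
print("alpha:", alpha)
print("mean squared error:", np.mean((y - fitted) ** 2))
```

Because each f_j is stored separately, the columns of f can be plotted against the corresponding predictor to examine each smoothed component on its own, as the additive form allows.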