Bayes correlation

BAYES: Correlation between two quantitative variables

Lê Đông Nhật Nam

Loading the data

    data = read.csv("http://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/VietNamH.csv")
    data = data[, c(8, 9)]                        # keep the two variables used here (lntotal, lnmed)
    data = na.omit(data)                          # drop rows with NA
    data = subset(data, lntotal > 0 & lnmed > 0)  # drop rows with non-positive values
    attach(data)

Source: Vietnam World Bank Living Standards Survey.

Note: the dataset is a survey of medical expenditure in 5999 Vietnamese households in 1997. We extract two variables: lnmed (the response variable), log-transformed medical expenditure, and lntotal, 12-month income (also log-transformed; the Ecdat documentation describes this variable as log household total expenditure). To reduce the sample size, all cases where the explanatory or response variable equals 0 or NA are removed. The final data set has 5004 cases.

The (hypothetical) question: what is the relationship between medical expenditure and average income?

The usual ways to solve this problem are: (1) Pearson's correlation coefficient r, and (2) simple linear regression.

Goal of this talk: replace both methods with a BAYES approach.

Visual exploration of the linear relationship Y ~ X

    library(ggplot2)

    # Theme that hides axes and background, used for the marginal boxplots
    transparent_theme = theme(
      axis.title.x = element_blank(), axis.title.y = element_blank(),
      axis.text.x = element_blank(), axis.text.y = element_blank(),
      axis.ticks = element_blank(), panel.grid = element_blank(),
      axis.line = element_blank(),
      panel.background = element_rect(fill = "transparent", colour = NA),
      plot.background = element_rect(fill = "transparent", colour = NA))

    xmin = min(lntotal); xmax = max(lntotal)
    ymin = min(lnmed);   ymax = max(lnmed)

    # Scatter plot with a least-squares line
    scatterPlot = ggplot(data, aes(lntotal, lnmed)) +
      geom_point(alpha = 0.2, color = "deeppink1") +
      geom_smooth(method = "lm", color = "deeppink4", fill = "gold", alpha = 0.6)

    # Marginal boxplots for X and Y
    px = ggplot(data, aes(factor(1), lntotal)) +
      geom_boxplot(width = 0.2, fill = "gold") + coord_flip() + transparent_theme
    py = ggplot(data, aes(factor(1), lnmed)) +
      geom_boxplot(width = 0.1, fill = "gold") + transparent_theme
    px_grob = ggplotGrob(px)
    py_grob = ggplotGrob(py)

    scatterPlot +
      annotation_custom(grob = px_grob, xmin = xmin, xmax = xmax,
                        ymin = ymin - 1.5, ymax = ymin + 1.5) +
      annotation_custom(grob = py_grob, xmin = xmin - 1.5, xmax = xmin + 1.5,
                        ymin = ymin, ymax = ymax) +
      theme_light(base_size = 20)

The obstacle is outliers, a great many outliers…

    ggplot(data, aes(x = lnmed)) +
      geom_density(fill = "deeppink1", alpha = 0.5) +
      theme_light(base_size = 20)

This is the distribution of the response variable, lnmed. It resembles a Student t distribution.

Initial predictions and remarks

+ There is certainly a linear correlation between lnmed and lntotal. Even if it is weak, the linear model and Pearson's r will be statistically significant (because the sample is large).
+ The classical methods (Pearson, least-squares regression) will be limited and not entirely accurate, because they assume that lnmed follows a Gaussian distribution, while the data contain many outliers.
+ A more accurate and comprehensive solution must rely on a Student t distribution for Y. This is equivalent to a GLM with family = Student t that estimates three parameters: mu, sigma and nu.

The three approaches:

+ Pearson, least-squares linear regression: the thing called the correlation coefficient r.
+ GLM with a Student t distribution.
+ Bayes, the most flexible and thorough solution:
  - not afraid of outliers, whatever the outliers and the distribution of Y;
  - built on the GLM, it describes all the parameters of the Student t distribution;
  - priors can be varied at will;
  - it yields posterior distributions, so inference does not rely on hypothesis testing.

Method 1: Looking for the thing called Pearson's correlation coefficient

    corr = cor.test(lntotal, lnmed)
    corr

            Pearson's product-moment correlation

    data:  lntotal and lnmed
    t = 17.784, df = 5002, p-value < 2.2e-16
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.2176193 0.2697427
    sample estimates:
          cor
    0.2438571

    (corr$estimate)^2
          cor
    0.0594663

Pearson's r is a special case of $R^2$: in a simple linear regression Y ~ X, $r^2 = R^2$. Here $0.2438571^2 = 0.0594663$, which matches the Multiple R-squared of 0.05947 reported by the regression model.

Based on the linear regression model Y ~ X

    mg = glm(lnmed ~ lntotal, family = gaussian)
    summary.lm(mg)

    Call:
    glm(formula = lnmed ~ lntotal, family = gaussian)

    Residuals:
        Min      1Q  Median      3Q     Max
    -6.4450 -0.9499  0.1034  1.0609  5.4595

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.91834    0.30414   3.019  0.00254 **
    lntotal      0.57573    0.03237  17.784  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 1.541 on 5002 degrees of freedom
    Multiple R-squared:  0.05947,  Adjusted R-squared:  0.05928
    F-statistic: 316.3 on 1 and 5002 DF,  p-value: < 2.2e-16

    AIC(mg)
    [1] 18529.9

Method 2: Describing the Student t distribution of Y at any cost…

As noted above, both Pearson's r and the glm model risk being affected by outliers. A more general solution is to use a Student t distribution for Y instead of a Gaussian. To build the Student t model we need the gamlss package.

    require(gamlss)
    mt = gamlss(data = data, formula = lnmed ~ lntotal,
                sigma.formula = ~lntotal, nu.formula = ~lntotal, family = TF)
    summary(mt)

    Family:  c("TF", "t Family")

    Call:  gamlss(formula = lnmed ~ lntotal, sigma.formula = ~lntotal,
        nu.formula = ~lntotal, family = TF, data = data)

    Fitting method: RS()

    Mu link function:  identity
    Mu Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  1.23862    0.29898   4.143 3.49e-05 ***
    lntotal      0.54169    0.03232  16.762  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Sigma link function:  log
    Sigma Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) -0.95404    0.16105  -5.924 3.35e-09 ***
    lntotal      0.14611    0.01701   8.592  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Nu link function:  log
    Nu Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   3.5218     8.6520   0.407    0.684
    lntotal       0.1420     0.9245   0.154    0.878

    No. of observations in the fit:  5004
    Degrees of Freedom for the fit:  6
          Residual Deg. of Freedom:  4998
                          at cycle:  12

    Global Deviance:     18434.87
                AIC:     18446.87
                SBC:     18485.97

The results show that not only mu but also sigma of Y depends on X, while nu is not affected. The AIC is lower than that of the Gaussian model (18446.87 versus 18529.9).

From this model we can compute an $R^2$ and a pseudo r, defined as the correlation between the predicted and the observed values of Y:

    data$pred = predict(mt, type = "response")
    cor.test(data$pred, lnmed)

            Pearson's product-moment correlation

    data:  data$pred and lnmed
    t = 17.784, df = 5002, p-value < 2.2e-16
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.2176193 0.2697427
    sample estimates:
          cor
    0.2438571

    (cor(data$pred, lnmed))^2
    [1] 0.0594663

The pseudo r is identical to Pearson's r from Method 1. This is expected: the predicted mu is a linear function of lntotal with a positive slope, so its correlation with lnmed is the same as that of lntotal itself.

Method 3: Boarding the Bayes train to follow an entirely different route

How Bayesian linear regression works:

The response variable Y is assumed to follow a Student t distribution with three parameters: mu, sigma and nu. A model is set up to predict mu, sigma and nu of Y from X. The priors of the parameters are:

+ mu = Intercept + b1 * X, where the Intercept and b1 each have a Gaussian prior;
+ nu has a Gamma or Exponential prior (when constant);
+ sigma has a Student t prior (if it depends on X) or a Uniform prior (when constant).

In symbols:

    $b_0 \sim \mathrm{Gaussian}(\mu_1, sd_1)$                  (Intercept)
    $b_1 \sim \mathrm{Gaussian}(\mu_2, sd_2)$                  (beta, slope of X)
    $\sigma \sim \mathrm{Student\ t}$ or $\mathrm{Uniform}$
    $\nu \sim \mathrm{Gamma}$ or $\mathrm{Exponential}$
    $\mu = b_0 + b_1 X$
    $Y \sim \mathrm{Student\ t}(\mu, \sigma, \nu)$

The posterior distributions are determined by Markov Chain Monte Carlo.

Using the brms package

    require(brms)
    require(coda)
    set.seed(123)
    rstan::rstan_options(auto_write = TRUE)      # cache compiled Stan models
    options(mc.cores = parallel::detectCores())  # run chains in parallel

brms is a convenient interface that connects the model to an MCMC sampler (here, STAN). Since version 0.9, brms can run in parallel on several CPU cores (the slide refers to Intel i5 and i7 PCs).

First, you can obtain default priors automatically with the get_prior function, or design your own priors. Then specify the model as follows. brms returns the posterior distributions of the parameters: the Intercept of Y, the beta of X, sigma of Y, nu of Y…

    prior1 = get_prior(data = data, formula = lnmed ~ lntotal, family = student)
    prior1

                    prior     class      coef group nlpar bound
                                  b
                                  b Intercept
                                  b   lntotal
                          Intercept
            gamma(2, 0.1)        nu
      student_t(3, 0, 10)     sigma
                              sigma     lnmed

    bayesm1 = brm(data = data, formula = lnmed ~ lntotal, family = student,
                  prior = prior1, chains = 4, warmup = 500, iter = 1500)
    plot(bayesm1)   # posterior distributions
    bayesm1$fit

    Inference for Stan model: student(identity) brms-model
    4 chains, each with iter=1500; warmup=500; thin=1;
    post-warmup draws per chain=1000, total post-warmup draws=4000

                    mean se_mean    sd     2.5%      25%      50%      75%    97.5%
    b_Intercept     0.88    0.01  0.30     0.29     0.68     0.88     1.08     1.45
    b_lntotal       0.58    0.00  0.03     0.52     0.56     0.58     0.60     0.64
    sigma_lnmed     1.49    0.00  0.02     1.45     1.48     1.49     1.51     1.53
    nu             36.07    0.31 12.35    19.35    27.23    33.62    42.23    66.55
    lp__        -6394.02    0.04  1.40 -6397.57 -6394.70 -6393.70 -6392.97 -6392.27

                n_eff Rhat
    b_Intercept  2202
    b_lntotal    2188
    sigma_lnmed  1866
    nu           1627
    lp__         1390

Posterior distributions

Testing hypothesis H1 with the Bayes Factor

    hypothesis(bayesm1, "lntotal>0.5", alpha = 0.01)

    Hypothesis Tests for class b:
                      Estimate Est.Error l-99% CI u-99% CI Evid.Ratio
    lntotal-(0.5) > 0     0.08      0.03     0.01      Inf     209.53 *
    ---
    '*': The expected value under the hypothesis lies outside the 99% CI

Bayes Factor

    hypothesis(bayesm1,"lntotal[...]

[...] each one-unit increase in log(annual income) raises log(medical expenditure) by 0.58 units on average. This rate is almost certainly higher than 0.5 (Bayes Factor = 209.5, p value [...]
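The evidence ratio reported for the one-sided hypothesis lntotal > 0.5 can be read as posterior odds: the share of posterior draws that satisfy the hypothesis divided by the share that do not. The sketch below illustrates that arithmetic in base R. The helper `evid_ratio` and both sets of draws are hypothetical stand-ins introduced for illustration; they are not brms functions and not the draws from `bayesm1`. A ratio of 209.53 is what one would get if 3981 of the 4000 post-warmup draws exceeded 0.5, since 3981 / 19 ≈ 209.53.

```r
# Posterior odds for a one-sided hypothesis, computed from a vector of draws.
# `evid_ratio` is a hypothetical helper written for this illustration.
evid_ratio <- function(draws, threshold) {
  p <- mean(draws > threshold)  # share of draws above the threshold
  p / (1 - p)                   # odds in favour of the hypothesis (Inf if all draws exceed it)
}

# Constructed example: 3981 of 4000 draws lie above 0.5
toy_draws <- c(rep(0.6, 3981), rep(0.4, 19))
er_toy <- evid_ratio(toy_draws, 0.5)
print(round(er_toy, 2))

# Simulated stand-in for the slope posterior
# (assumption: roughly normal with mean 0.58 and sd 0.03, as in the summary table)
set.seed(123)
sim_draws <- rnorm(4000, mean = 0.58, sd = 0.03)
er_sim <- evid_ratio(sim_draws, 0.5)
print(er_sim)
```

With a fitted model, the same calculation would be applied to the actual posterior draws of the lntotal coefficient rather than to simulated values. Because the ratio depends on how many draws fall in the small tail, it is sensitive to the number of post-warmup iterations.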
