Variable selection procedures in linear regression models


VARIABLE SELECTION PROCEDURES IN LINEAR REGRESSION MODELS

XIE YANXI
(B.Sc., National University of Singapore)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF STATISTICS AND APPLIED PROBABILITY
NATIONAL UNIVERSITY OF SINGAPORE
2013

ACKNOWLEDGEMENTS

First of all, I would like to take this opportunity to express my greatest gratitude to my supervisor, Associate Professor Leng Chenlei, who has continuously and consistently guided me into the field of statistical research. Without his patience and constant support, it would have been impossible for me to finish this thesis. I really appreciate his help and kindness whenever I encountered any problems or doubts. It is truly my honor to have him, a brilliant young professor, as my supervisor throughout my four years of study in the Department of Statistics and Applied Probability.

Special acknowledgement also goes to Associate Professor Zhang Jin-Ting, who kindly provided a useful real dataset to me. I could not have finished the real data application part of my thesis without his prompt replies and help, and I really appreciate that.

Special thanks also go to all the professors and staff in the Department of Statistics and Applied Probability. I have been in this nice department since my undergraduate studies at NUS. I have benefited a lot during these eight years, and I believe I will continue to benefit from this department throughout my life.

Furthermore, I would like to express my appreciation to all my colleagues and friends in the Department of Statistics and Applied Probability for supporting and encouraging me during my four years of PhD life. You have made my PhD life a pleasant and enjoyable one.

Last but not least, I would like to thank my parents for their understanding and support. I appreciate that they have provided a nice environment for me to pursue knowledge throughout my life.

CONTENTS

Acknowledgements
Summary
List of Tables
List of Figures

Chapter 1  Introduction
  1.1 Background
  1.2 Motivation
  1.3 Organization of thesis

Chapter 2  Consistency Property of Forward Regression and Orthogonal Matching Pursuit
  2.1 Introduction
  2.2 Literature Review
    2.2.1 Review of Penalized Approaches
    2.2.2 Review of Screening Approaches
  2.3 Screening Consistency of OMP
    2.3.1 Model Setup and Technical Conditions
    2.3.2 OMP Algorithm
    2.3.3 Main Result
  2.4 Selection Consistency of Forward Regression
    2.4.1 Model Setup and Technical Conditions
    2.4.2 FR Algorithm
    2.4.3 Main Result
  2.5 Numerical Analysis
    2.5.1 Simulation Setup
    2.5.2 Simulation Results for OMP Screening Consistent Property
    2.5.3 Simulation Results for FR Selection Consistent Property
  2.6 Conclusion

Chapter 3  H-Likelihood
  3.1 Introduction
  3.2 Literature Review
    3.2.1 Partial Linear Models
    3.2.2 H-likelihood
  3.3 Variable Selection via Penalized H-Likelihood
    3.3.1 Model Setup
    3.3.2 Estimation Procedure via Penalized H-likelihood
    3.3.3 Variable Selection via the Adaptive Lasso Penalty
    3.3.4 Computational Algorithm
  3.4 Simulation Studies
  3.5 Real Data Analysis
    3.5.1 Framingham Data
    3.5.2 MACS Data
  3.6 Conclusion

Chapter 4  Conclusion
  4.1 Conclusion and discussion
  4.2 Future research

Appendix A
  A.1 Proof of Lemmas
    A.1.1 Proof of Lemma 2.1
    A.1.2 Proof of Lemma 2.3
  A.2 Proof of Theorems
    A.2.1 Proof of Theorem 2.1
    A.2.2 Proof of Theorem 2.2
    A.2.3 Proof of Theorem 2.3

Bibliography

ABSTRACT

With the rapid development of the information technology industry, contemporary data from various fields such as finance and gene expression tend to be extremely large, where the number of variables or parameters $d$ can be much larger than the sample size $n$. For example, one may wish to associate protein concentrations with the expression of genes, or to predict survival time using gene expression data. To solve this kind of high-dimensionality problem, the challenge is to find the important variables out of thousands of predictors, with a number of observations usually in the tens or hundreds. In other words, it is becoming a major issue to investigate the existence of complex relationships and dependencies in data, with the aim of building a relevant model for inference.

In fact, there are two fundamental goals in statistical learning: identifying relevant predictors and ensuring high prediction accuracy. The first goal, pursued by means of variable selection, is of particular importance when the true underlying model has a sparse representation. Discovering the relevant predictors can enhance the prediction performance of the fitted model. Usually an estimate $\hat{\beta}$ is considered desirable if it is consistent in terms of both coefficient estimation and variable selection. Hence, before we try to estimate the regression coefficients $\beta$, it is preferable to have a set of useful predictors in hand.

The emphasis of our task in this thesis is to propose methods that identify relevant predictors so as to ensure selection consistency, or screening consistency, in variable selection. The primary interest is in Orthogonal Matching Pursuit (OMP) and Forward Regression (FR), whose theoretical aspects are investigated in detail in this thesis. Furthermore, we introduce a new penalized h-likelihood approach to identify the non-zero relevant fixed effects in the partial linear model setting. This penalized h-likelihood incorporates variable selection procedures into the modeling of the mean via h-likelihood. A few advantages of the newly proposed method are as follows. First of all, compared to the traditional marginal likelihood, the h-likelihood avoids the messy integration over the random effects and hence is convenient to use. In addition, the h-likelihood plays an important role in inference for models having unobservable or unobserved random variables. Last but not least, simulation studies demonstrate that the proposed penalty-based method is able to identify zero regression coefficients in modeling the mean structure and produces good fixed-effects estimation results.
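The greedy procedures studied in Chapter 2 operate on the standard sparse linear model $Y = X\beta + \varepsilon$ with $d \gg n$. The following minimal sketch (not the thesis code; the dimensions, the true coefficient vector, and the number of greedy steps are illustrative assumptions) shows how OMP-style selection proceeds on simulated data.

```python
# Minimal OMP-style greedy selection sketch (illustrative assumptions; not the thesis code).
import numpy as np

rng = np.random.default_rng(0)
n, d, n_steps = 100, 1000, 10           # assumed sizes: n observations, d predictors
X = rng.standard_normal((n, d))
beta = np.zeros(d)
beta[:5] = 3.0                          # sparse truth: only 5 non-zero coefficients
y = X @ beta + rng.standard_normal(n)

selected, residual = [], y.copy()
for _ in range(n_steps):
    # OMP picks the predictor most correlated with the current residual ...
    scores = np.abs(X.T @ residual)
    scores[selected] = -np.inf          # never re-select a variable
    selected.append(int(np.argmax(scores)))
    # ... and then refits by least squares on the whole active set, which is what
    # distinguishes OMP and forward-type procedures from plain matching pursuit.
    XS = X[:, selected]
    coef, *_ = np.linalg.lstsq(XS, y, rcond=None)
    residual = y - XS @ coef

print(sorted(selected))                 # should contain the true support {0, 1, 2, 3, 4}
```

Forward Regression differs mainly in its selection rule: rather than the largest absolute correlation with the residual, it adds the candidate variable whose inclusion yields the largest reduction in the residual sum of squares.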
List of Tables

Table 2.1  Simulation Summary for OMP with (n, d) = (100, 5000)
Table 2.2  Simulation Summary for FR with (n, d) = (100, 5000)
Table 3.1  Conjugate HGLMs
Table 3.2  Simulation Summary of PHSpline for six examples
Table 3.3  Simulation result for Example 1
Table 3.4  Simulation result for Example 2

A.2 Proof of Theorems

A.2.1 Proof of Theorem 2.1 (concluding part)

... $\to 0$, which implies that
$$\max_{j \in T} \max_{|M| \le m^*} \chi^2_1 \le 2(m^* + 2)\log d$$
with probability tending to one as $d \to \infty$. Then, in conjunction with (C4), we have
$$\max_{j \in T}\|X_j\|^2 \max_{j \in T}\max_{|M| \le m^*} \chi^2_1 \le n\tau_{\max} \times 2(m^* + 2)\log d \le n\tau_{\max} \times 3K n^{\xi_0 + 4\xi_{\min}} \times \nu n^{\xi} = n\tau_{\max} \times 3K\nu n^{\xi + \xi_0 + 4\xi_{\min}}, \qquad (A.11)$$
with probability tending to one. Combining (A.9) and (A.11) and putting them back into (A.5), we can show that
$$n^{-1}\Omega^{(k)} \ge \nu_\beta^4 \tau_{\min} \nu^{-1} C_\beta^{-2} \tau_{\max}^{-1} n^{-\xi_0 - 4\xi_{\min}} - 3K\nu n^{\xi + \xi_0 + 4\xi_{\min} - 2} = 2K^{-1} n^{-\xi_0 - 4\xi_{\min}} \times \{1 - 3K\nu\, \tau_{\max}\tau_{\min}^{-1} C_\beta^{2} \nu_\beta^{-4} n^{\xi + 2\xi_0 + 8\xi_{\min} - 2}\},$$
uniformly for every $k \le K n^{\xi_0 + 4\xi_{\min}}$. Recall that $K = 2\tau_{\max}\tau_{\min}^{-2} C_\beta^{2}\nu_\beta^{-4}\nu$. Under condition (C4), we then have
$$n^{-1}\|Y\|^2 \ge \sum_{k=1}^{[K n^{\xi_0 + 4\xi_{\min}}]} n^{-1}\Omega^{(k)} \ge [K n^{\xi_0 + 4\xi_{\min}}] \times 2K^{-1} n^{-\xi_0 - 4\xi_{\min}} \times \{1 - 3K\nu\, \tau_{\max}\tau_{\min}^{-1} C_\beta^{2} \nu_\beta^{-4} n^{\xi + 2\xi_0 + 8\xi_{\min} - 2}\} \to 2. \qquad (A.12)$$
In contrast, under the assumption $\mathrm{Var}(Y_i) = 1$, we have $n^{-1}\|Y\|^2 \to_p 1$. This contradicts the result of (A.12). Hence, it is impossible to have $S^{(k)} \cap T = \emptyset$ for every $1 \le k \le K n^{\xi_0 + 4\xi_{\min}}$. Consequently, with probability tending to one, all relevant predictors are recovered within a total of $K n^{\xi_0 + 4\xi_{\min}}$ steps. This completes the proof.

A.2.2 Proof of Theorem 2.2

Proof: It suffices to show that
$$P\Big(\min_{1 \le k < k_{\min}} \{\mathrm{BIC}(S^{(k)}) - \mathrm{BIC}(S^{(k+1)})\} > 0\Big) \to 1. \qquad (A.13)$$
We have
$$\mathrm{BIC}(S^{(k)}) - \mathrm{BIC}(S^{(k+1)}) \ge \log\hat\sigma^2(S^{(k)}) - \log\hat\sigma^2(S^{(k+1)}) - 3n^{-1}\log d. \qquad (A.14)$$
By definition, we have $\hat\sigma^2(S^{(k+1)}) \le n^{-1}\|Y\|^2 \to_p 1$. Then, with probability tending to one, the right-hand side of (A.14) is
$$\ge \log\Big[1 + \frac{\hat\sigma^2(S^{(k)}) - \hat\sigma^2(S^{(k+1)})}{2}\Big] - 3n^{-1}\log d = \log\Big[1 + \frac{\mathrm{RSS}(S^{(k)}) - \mathrm{RSS}(S^{(k+1)})}{2n}\Big] - 3n^{-1}\log d = \log\Big[1 + \frac{\Omega^{(k)}}{2n}\Big] - 3n^{-1}\log d, \qquad (A.15)$$
by the definition of $\Omega^{(k)} = \mathrm{RSS}(S^{(k)}) - \mathrm{RSS}(S^{(k+1)})$. In addition, we use the fact that $\log(1 + x) \ge \min\{\log 2, 0.5x\}$ for any $x > 0$.
Then the right-hand side of (A.15) is
$$\ge \min\Big\{\log 2, \frac{\Omega^{(k)}}{4n}\Big\} - 3n^{-1}\log d \ge \min\{\log 2,\, 4^{-1}K^{-1}n^{-\xi_0 - 4\xi_{\min}}\} - 3n^{-1}\log d, \qquad (A.16)$$
according to (A.12). Moreover, the right-hand side of (A.16) is independent of $k$; hence it is a uniform lower bound for $\mathrm{BIC}(S^{(k)}) - \mathrm{BIC}(S^{(k+1)})$ with $1 \le k < k_{\min}$. Thus, it suffices to show that the right-hand side of (A.16) is positive with probability tending to one. To this end, we first note that $3n^{-1}\log d \to 0$ under condition (C4), so that $\log 2 - 3n^{-1}\log d > 0$ eventually. Therefore, we can look at
$$4^{-1}K^{-1}n^{-\xi_0 - 4\xi_{\min}} - 3n^{-1}\log d \ge 4^{-1}K^{-1}n^{-\xi_0 - 4\xi_{\min}} - 3\nu n^{\xi - 1} = 4^{-1}K^{-1}n^{-\xi_0 - 4\xi_{\min}}\,(1 - 12\nu K n^{\xi + \xi_0 + 4\xi_{\min} - 1}) \ge 0,$$
with probability tending to one under condition (C4). This completes the proof.
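Theorem 2.2 says that, along the forward path, the BIC keeps decreasing until all relevant predictors have been included, so a BIC-minimizing stopping rule does not stop too early. The sketch below tracks such a criterion along a forward-regression path; the data dimensions and the particular penalty used here ($\log n + 2\log d$ per selected variable) are illustrative assumptions, not necessarily the exact definition adopted in the thesis.

```python
# Tracking a BIC-type criterion along a forward-regression path (illustrative sketch;
# the penalty form below is an assumption, not necessarily the thesis definition).
import numpy as np

rng = np.random.default_rng(0)
n, d, max_steps = 100, 500, 15
X = rng.standard_normal((n, d))
beta = np.zeros(d)
beta[:4] = 2.0                               # sparse truth with 4 relevant predictors
y = X @ beta + rng.standard_normal(n)

def rss(active):
    # Residual sum of squares after least-squares refitting on the active set.
    XS = X[:, active]
    coef, *_ = np.linalg.lstsq(XS, y, rcond=None)
    r = y - XS @ coef
    return float(r @ r)

active, bic_path = [], []
for _ in range(max_steps):
    # Forward regression: add the candidate giving the largest drop in RSS.
    candidates = [j for j in range(d) if j not in active]
    j_best = min(candidates, key=lambda j: rss(active + [j]))
    active.append(j_best)
    bic = np.log(rss(active) / n) + len(active) * (np.log(n) + 2 * np.log(d)) / n
    bic_path.append(bic)

k_hat = int(np.argmin(bic_path)) + 1         # stop where the criterion is smallest
print("selected model:", sorted(active[:k_hat]))   # ideally the true support {0, 1, 2, 3}
```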
A.2.3 Proof of Theorem 2.3

Step 1. Let $\mu(T) = \max_{j \notin T} \|(X_T^T X_T)^{-1} X_T^T x_j\|_1 < 1$. This condition ensures that the algorithm chooses a relevant variable at the first step, i.e. $S^{(1)} \subset T$. By the definition of $\mu(T)$, there exists $\nu = X_T^T X_T u \in \mathbb{R}^{|T|}$ such that
$$\mu(T) = \max_{j \notin T} \|(X_T^T X_T)^{-1} X_T^T x_j\|_1 = \max_{j \notin T} \frac{|\nu^T (X_T^T X_T)^{-1} X_T^T x_j|}{\|\nu\|_\infty} = \max_{j \notin T} \frac{|u^T X_T^T x_j|}{\|(X_T^T X_T)u\|_\infty} = \frac{\max_{j \notin T} |x_j^T X_T u|}{\max_{i \in T} |x_i^T X_T u|}.$$
Therefore, if $\mu(T) < 1$, we can find $u \in \mathbb{R}^{|T|}$ such that
$$\max_{j \notin T} |x_j^T X_T u| < \max_{i \in T} |x_i^T X_T u|.$$
Consider an arbitrary $\delta_n > 0$ and $\beta_T$ such that
$$\max_{j \notin T} |x_j^T X_T \beta_T| < \max_{i \in T} |x_i^T X_T \beta_T| - 2\delta_n. \qquad (A.17)$$
Moreover, with probability larger than $1 - \eta$,
$$\max_j |x_j^T (y - X_T \beta_T)| \le \delta_n = \sigma\sqrt{2n\ln(2d/\eta)}.$$
Therefore, (A.17) implies
$$\max_{j \notin T} |x_j^T y| \le \max_{j \notin T} |x_j^T X_T \beta_T| + \max_{j \notin T} |x_j^T (y - X_T \beta_T)| < \max_{i \in T} |x_i^T X_T \beta_T| - \max_{i \in T} |x_i^T (y - X_T \beta_T)| \le \max_{i \in T} |x_i^T y|.$$
Therefore, we have proven that $\max_{j \notin T} |x_j^T y| < \max_{i \in T} |x_i^T y|$, which guarantees that the algorithm chooses a relevant variable at the first step, i.e. $S^{(1)} \subset T$.

Step 2. We now proceed by induction on $k$ to show that $S^{(k+1)} \subset T$ before the process stops. Assume the claim is true after $k$ steps for $k \ge 2$. By the induction hypothesis, we have $S^{(k)} \subset T$ at the end of step $k$. Define
$$\Omega^{(k)} = \mathrm{RSS}(S^{(k)}) - \mathrm{RSS}(S^{(k+1)}) = \frac{|x_j^T \{I_n - H_{(S^{(k)})}\} Y|^2}{\|x_j^{(k)}\|^2},$$
where $H_{(S^{(k)})} = X_{(S^{(k)})}\{X_{(S^{(k)})}^T X_{(S^{(k)})}\}^{-1} X_{(S^{(k)})}^T$ is a projection matrix, $x_j^{(k)} = \{I_n - H_{(S^{(k)})}\} x_j$, and $\mathrm{RSS}(S^{(k)}) = Y^T\{I_n - H_{(S^{(k)})}\}Y$.

Aim: $\max_{j \in T} \Omega^{(k)} > \max_{j \notin T} \Omega^{(k)}$.

On the one hand,
$$\max_{j \in T} \Omega^{(k)} = \max_{j \in T} \frac{|x_j^T \{I_n - H_{(S^{(k)})}\} Y|^2}{\|x_j^{(k)}\|^2} \ge \frac{\max_{j \in T} |x_j^T \{I_n - H_{(S^{(k)})}\} Y|^2}{\max_{j \in T} \|x_j^{(k)}\|^2} \ge \frac{\max_{j \in T} |x_j^T \{I_n - H_{(S^{(k)})}\} Y|^2}{\max_{j \in T} \|x_j\|^2} = \max_{j \in T} |x_j^T \{I_n - H_{(S^{(k)})}\} Y|^2.$$
On the other hand,
$$\max_{j \notin T} \Omega^{(k)} = \max_{j \notin T} \frac{|x_j^T \{I_n - H_{(S^{(k)})}\} Y|^2}{\|x_j^{(k)}\|^2} \le \frac{\max_{j \notin T} |x_j^T \{I_n - H_{(S^{(k)})}\} Y|^2}{\min_{j \notin T} \|x_j^{(k)}\|^2} \le \frac{\max_{j \notin T} |x_j^T \{I_n - H_{(S^{(k)})}\} Y|^2}{1 - c},$$
because $\max_{j \notin T} \|(X_T^T X_T)^{-1} X_T^T x_j\|_1 < c < 1$ implies $\|x_j^{(k)}\|^2 > 1 - c$ for $j \notin T$.

From Lemma 2.3, it follows that
$$\min_{\alpha,\, i \in T} \|X\beta^{(S^{(k)})} + \alpha x_i - y\|^2 \le \|X\beta^{(S^{(k)})} - y\|^2 - \frac{\lambda_{\min}}{|T - S^{(k)}|}\|X\beta^{(S^{(k)})} - X\beta\|^2.$$
Therefore,
$$\max_{j \in T} |x_j^T \{I_n - H_{(S^{(k)})}\} Y|^2 \ge \Big(\max_{j \in T} |(X\beta^{(S^{(k)})} - y)^T x_j|\Big)^2 = \|X\beta^{(S^{(k)})} - y\|^2 - \min_{\alpha,\, j \in T}\|X\beta^{(S^{(k)})} + \alpha x_j - y\|^2 \ge \frac{\lambda_{\min}}{|T - S^{(k)}|}\|X\beta^{(S^{(k)})} - X\beta\|^2 \ge \frac{\lambda_{\min}^2}{|T - S^{(k)}|}\|\beta^{(S^{(k)})} - \beta\|^2 \ge \frac{\lambda_{\min}^2}{|T - S^{(k)}|}\|\beta_{T \setminus S^{(k)}}\|^2 > \lambda_{\min}^2 |\beta_{\min}|^2 > \lambda_{\min}^2 \times \frac{4\sigma^2 (n + 2\sqrt{n\log n})}{(1 - \mu)^2 \lambda_{\min}^2} = \frac{4\sigma^2 (n + 2\sqrt{n\log n})}{(1 - \mu)^2}. \qquad (A.18)$$
On the other hand,
$$\max_{j \notin T} |(X\beta^{(S^{(k)})} - y)^T x_j| = \max_{j \notin T} |(X\beta^{(S^{(k)})} - X\beta + X\beta - y)^T x_j| \le \max_{j \notin T} |(X\beta^{(S^{(k)})} - X\beta)^T x_j| + \max_{j \notin T} |(X\beta - y)^T x_j|. \qquad (A.19)$$
For the first term on the right-hand side of (A.19),
$$\max_{j \notin T} |(X\beta^{(S^{(k)})} - X\beta)^T x_j| \le \mu \max_{j \in T} |(X\beta^{(S^{(k)})} - X\beta)^T x_j| = \mu \max_{j \in T} |(X\beta^{(S^{(k)})} - y)^T x_j| \le \mu \max_{j \in T} |x_j^T \{I_n - H_{(S^{(k)})}\} Y| \le \mu \max_{j \in T} \|x_j\|\, \|X(\beta^{(S^{(k)})} - \beta)\| \le \mu\, \lambda_{\max}\, \|\beta^{(S^{(k)})} - \beta\| \le \mu\, \lambda_{\max}\, \sigma\sqrt{2\log(2d_0/\eta_0)}/\lambda_{\min},$$
with probability larger than $1 - \eta_0$ for $\eta_0 \ge 0$. For the second term on the right-hand side of (A.19),
$$\max_{j \notin T} |(X\beta - y)^T x_j| \le \sigma\sqrt{2(1 + \eta)\log d},$$
with probability larger than $1 - \frac{1}{2d^{\eta}\sqrt{\pi\log d}}$ for $\eta \ge 0$ (please refer to Cai, Xu and Zhang (2009); see the details in the appendix of that paper). Hence, with probability larger than $\min\big(1 - \frac{1}{2d^{\eta}\sqrt{\pi\log d}},\, 1 - \eta_0\big)$, we have
$$\max_{j \notin T} |x_j^T \{I_n - H_{(S^{(k)})}\} Y|^2 \le \big(\mu\,\lambda_{\max}\,\sigma\sqrt{2\log(2d_0/\eta_0)}/\lambda_{\min} + \sigma\sqrt{2(1 + \eta)\log d}\big)^2 < (1 - c)\big(2\sigma\sqrt{2(1 + \eta)\log d}\big)^2 = 4\sigma^2\, 2(1 + \eta)\log d\,(1 - c).$$
Now we have
$$\max_{j \notin T} \Omega^{(k)} \le 4\sigma^2\, 2(1 + \eta)\log d < \frac{4\sigma^2 (n + 2\sqrt{n\log n})}{(1 - \mu)^2} < \max_{j \in T} \Omega^{(k)},$$
with probability tending to one as $n \to \infty$. This completes the proof of the induction part. Therefore, the algorithm selects a relevant variable at each step until it stops.

Step 3. Stopping rule: $\sigma\sqrt{n + 2\sqrt{n\log n}}$. Consider the Gaussian error $\varepsilon \sim N(0, \sigma^2 I_n)$; it satisfies
$$P\big(\|\varepsilon\|_2 \le \sigma\sqrt{n + 2\sqrt{n\log n}}\big) \ge 1 - \frac{1}{\sqrt{\pi\log n}}.$$
Suppose $X = \|\varepsilon\|_2^2/\sigma^2$, which is a $\chi^2_n$ random variable. Then for any $\lambda > 0$,
$$P\big(X > (1 + \lambda)n\big) \le \frac{1}{\lambda\sqrt{\pi n}}\exp\Big\{-\frac{n}{2}\big(\lambda - \log(1 + \lambda)\big)\Big\}.$$
Please refer to the corresponding lemma in Cai (2002) for a detailed proof. Hence,
$$P\big(\|\varepsilon\|_2 \le \sigma\sqrt{n + 2\sqrt{n\log n}}\big) = 1 - P\big(X > (1 + \lambda)n\big) \ge 1 - \frac{1}{\lambda\sqrt{\pi n}}\exp\Big\{-\frac{n}{2}\big(\lambda - \log(1 + \lambda)\big)\Big\},$$
where $\lambda = 2\sqrt{n^{-1}\log n}$. Using the fact that $\log(1 + \lambda) \le \lambda - \frac{1}{2}\lambda^2 + \frac{1}{3}\lambda^3$, it follows that
$$P\big(\|\varepsilon\|_2 \le \sigma\sqrt{n + 2\sqrt{n\log n}}\big) \ge 1 - \frac{1}{2\sqrt{\pi\log n}}\,\frac{1}{n}\exp\Big\{\frac{4(\log n)^{3/2}}{3\sqrt{n}}\Big\} \ge 1 - \frac{1}{\sqrt{\pi\log n}},$$
since $\frac{1}{n}\exp\{\frac{4(\log n)^{3/2}}{3\sqrt{n}}\} \le 2$ for all $n \ge 2$.

Let $b_2 = \sigma\sqrt{n + 2\sqrt{n\log n}}$. We have $|\beta_i| > \frac{2 b_2}{(1 - \mu)\lambda_{\min}}$. Suppose the algorithm has run $k$ steps for some $k < d_0 = |T|$. We will verify that $\|r_k\|_2 > b_2$, where $\|r_k\|_2$ is the square root of the residual sum of squares RSS, and so Forward Regression does not stop at the current step. Again, let $X_{T \setminus S^{(k)}}$ denote the set of unselected but relevant variables and $\beta_{T \setminus S^{(k)}}$ the corresponding coefficients. Note that
$$\|r_k\|_2 = \|(I - H_{(S^{(k)})})X\beta + (I - H_{(S^{(k)})})\varepsilon\|_2 \ge \|(I - H_{(S^{(k)})})X\beta\|_2 - \|(I - H_{(S^{(k)})})\varepsilon\|_2 \ge \|(I - H_{(S^{(k)})})X_{T \setminus S^{(k)}}\beta_{T \setminus S^{(k)}}\|_2 - \|\varepsilon\|_2 \ge \lambda_{\min}\|\beta_{T \setminus S^{(k)}}\|_2 - \|\varepsilon\|_2 \ge \frac{2 b_2}{1 - \mu} - b_2 > b_2.$$
Therefore, by all three steps, the theorem is proved.
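Step 3 stops the algorithm once the residual norm falls to the noise level $\sigma\sqrt{n + 2\sqrt{n\log n}}$, relying on the $\chi^2_n$ tail bound above. A quick Monte Carlo check of that threshold is sketched below; the sample size, noise level, and number of replications are illustrative assumptions.

```python
# Monte Carlo check of the noise-level threshold in the stopping rule:
# for Gaussian noise, ||eps||^2 rarely exceeds sigma^2 * (n + 2*sqrt(n*log n)).
# All simulation sizes here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, sigma, reps = 200, 1.5, 20_000
threshold = sigma**2 * (n + 2 * np.sqrt(n * np.log(n)))

eps = sigma * rng.standard_normal((reps, n))
exceed_rate = np.mean(np.sum(eps**2, axis=1) > threshold)
print(exceed_rate)                       # small, comfortably below the theoretical bound
print(1 / np.sqrt(np.pi * np.log(n)))    # the (much cruder) bound stated in Step 3 above
```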
Bibliography

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory, 267-281.
[2] Antoniadis, A. (1997). Wavelets in statistics: a review (with discussion). Italian Journal of Statistics, 6, 97-144.
[3] Barron, A. and Cohen, A. (2008). Approximation and learning by greedy algorithms. The Annals of Statistics, 36, 64-94.
[4] Bondell, H., Krishna, A. and Ghosh, S. (2010). Joint variable selection for fixed and random effects in linear mixed-effects models. Biometrics, 66, 1069-1077.
[5] Breslow, N. and Clayton, D. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88, 9-25.
[6] Bunea, F., Tsybakov, A. and Wegkamp, M. (2007). Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics, 1, 169-194.
[7] Cai, T. and Wang, L. (2011). Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Transactions on Information Theory, 57(7), 4680-4688.
[8] Cai, T., Xu, G. and Zhang, J. (2009). On recovery of sparse signals via l1 minimization. IEEE Transactions on Information Theory, 55(7), 3388-3397.
[9] Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6), 2313-2351.
[10] Castelli, W.P., Garrison, R.J., Wilson, P.W., Abbott, R.D., Kalousdian, S. and Kannel, W.B. (1986). Incidence of coronary heart disease and lipoprotein cholesterol levels: The Framingham Study. Journal of the American Medical Association, 256(20), 2835-2838.
[11] Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3), 759-771.
[12] Chen, J. and Chen, Z. (2009). Tournament screening cum EBIC for feature selection with high dimensional feature spaces. Science in China, Series A, 52(6), 1327-1341.
[13] Chen, Z. and Dunson, D. (2003). Random effects selection in linear mixed models. Biometrics, 59, 762-769.
[14] Diggle, P. and Zeger, S. (1994). Analysis of Longitudinal Data. Clarendon Press, Oxford.
[15] Donoho, D.L. (2000). High-dimensional data analysis: the curses and blessings of dimensionality. Manuscript, www-stat.stanford.edu.
[16] Donoho, D.L., Elad, M. and Temlyakov, V.N. (2006). Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1), 6-18.
[17] Donoho, D.L. and Stodden, V. (2006). Breakdown point of model selection when the number of variables exceeds the number of observations. IEEE, 1916-1921.
[18] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2), 407-499.
[19] Fan, J. (1997). Comments on "Wavelets in statistics: a review" by A. Antoniadis. Journal of the Italian Statistical Association, 6, 131-138.
[20] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348-1360.
[21] Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32, 928-961.
[22] Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B, 70(5), 849-911.
[23] Frank, I. and Friedman, J. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35, 109-135.
[24] Ibrahim, J., Zhu, H., Garcia, R. and Guo, R. (2011). Fixed and random effects selection in mixed effects models. Biometrics, 67, 495-503.
[25] Knight, K. and Fu, W. (2000). Asymptotics for Lasso-type estimators. The Annals of Statistics, 28(5), 1356-1378.
[26] Lee, Y. and Nelder, J. (1996). Hierarchical generalized linear models (with discussion). Journal of the Royal Statistical Society: Series B, 58, 619-678.
[27] Lee, Y. and Nelder, J. (2001a). Hierarchical generalized linear models: a synthesis of generalized linear models, random effect models and structured dispersions. Biometrika, 88, 987-1006.
[28] Lee, Y. and Nelder, J. (2001b). Modelling and analysing correlated non-normal data. Statistical Modelling, 1, 3-16.
[29] Lee, Y. and Nelder, J. (2004). Conditional and marginal models: another view. Statistical Science, 19(2), 219-238.
[30] Lee, Y., Nelder, J. and Pawitan, Y. (2006). Generalized Linear Models with Random Effects: Unified Analysis via H-likelihood. Chapman and Hall.
[31] Lee, Y. and Nelder, J. (2006). Double hierarchical generalized linear models (with discussion). Applied Statistics, 55(2), 139-185.
[32] Leng, C., Lin, Y. and Wahba, G. (2006). A note on the Lasso and related procedures in model selection. Statistica Sinica, 16, 1273-1284.
[33] Liang, H., Wu, H. and Zou, G. (2008). A note on conditional AIC for linear mixed-effects models. Biometrika, 95, 773-778.
[34] Liang, K. and Zeger, S. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13-22.
[35] Lv, J. and Fan, Y. (2009). A unified approach to model selection and sparse recovery using regularized least squares. The Annals of Statistics, 37, 3498-3528.
[36] McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, 2nd edition. London: Chapman and Hall.
[37] Meinshausen, N. and Buhlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34(3), 1436-1462.
[38] Meinshausen, N. and Buhlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B, 72, 417-473.
[39] Nelder, J. and Lee, Y. (1992). Likelihood, quasi-likelihood and pseudo-likelihood: some comparisons. Journal of the Royal Statistical Society: Series B, 54, 273-284.
[40] Ni, X., Zhang, D. and Zhang, H. (2010). Variable selection for semiparametric mixed models in longitudinal studies. Biometrics, 66, 79-88.
[41] Osborne, M., Presnell, B. and Turlach, B. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3), 389-404.
[42] Osborne, M., Presnell, B. and Turlach, B. (2000). On the Lasso and its dual. Journal of Computational and Graphical Statistics, 9, 319-337.
[43] Pan, J. and Mackenzie, G. (2003). Model selection for joint mean-covariance structures in longitudinal studies. Biometrika, 90, 239-244.
[44] Qu, A. and Li, R. (2006). Nonparametric modelling and inference function for longitudinal data. Biometrics, 62, 379-391.
[45] Schall, R. (1991). Estimation in generalized linear models with random effects. Biometrika, 40, 917-927.
[46] Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461-464.
[47] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267-288.
[48] Tibshirani, R. (1997). The Lasso method for variable selection in the Cox model. Statistics in Medicine, 16, 385-395.
[49] Vaida, F. and Blanchard, S. (2005). Conditional Akaike information for mixed effects models. Biometrika, 92, 351-370.
[50] Verbeke, G. and Molenberghs, G. (2000). Linear Mixed Models for Longitudinal Data. Berlin: Springer.
[51] Wainwright, M. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5), 2183-2202.
[52] Wang, H., Li, R. and Tsai, C. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94, 553-568.
[53] Wang, H. (2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488), 1512-1524.
[54] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68, 49-67.
[55] Yuan, M. and Lin, Y. (2007). On the non-negative garrote estimator. Journal of the Royal Statistical Society: Series B, 69(2), 143-161.
[56] Zeger, S. and Diggle, P. (1994). Semiparametric models for longitudinal data with application to CD4 cell numbers in HIV seroconverters. Biometrics, 50, 689-699.
[57] Zhang, C. and Huang, J. (2008). The sparsity and bias of the Lasso selection in high-dimensional linear regression. The Annals of Statistics, 36(4), 1567-1594.
[58] Zhang, T. (2009). On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10, 555-568.
[59] Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research, 7, 2541-2563.
[60] Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418-1429.
[61] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67, 301-320.
[62] Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models (with discussion). The Annals of Statistics, 36, 1509-1566.
[63] Zou, H. and Zhang, H. (2009). On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 37(4), 1733-1751.

[...]

... improve the performance of learning models in terms of obtaining higher estimation accuracy of the model. In regression analysis, the linear model has been commonly used to link a response variable to explanatory variables for data analysis. The resulting ordinary least squares estimates (LSE) have a closed form, which is easy to compute. However, LSE fails when the number of linear predictors d is greater than n. Best subset selection is one of the standard techniques for improving the performance of LSE. Best subset selection, guided by criteria such as Akaike's information criterion (AIC) and the Bayesian information criterion (BIC), follows either forward or backward stepwise selection procedures to select variables. Among all the subset selection procedures aimed at selecting relevant variables, orthogonal matching pursuit (OMP), ...
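The excerpt above notes that the ordinary least squares estimate has a closed form but fails once the number of predictors d exceeds the sample size n: X'X becomes rank-deficient, so the textbook formula has no unique solution. The short sketch below illustrates this numerically; the dimensions are illustrative assumptions, not taken from the thesis.

```python
# Why ordinary least squares breaks down when d > n: X'X is rank-deficient,
# and many coefficient vectors fit the data perfectly. Dimensions are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                                  # more predictors than observations
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

XtX = X.T @ X                                  # d x d, but its rank is at most n < d
print(np.linalg.matrix_rank(XtX))              # prints 20, so (X'X)^{-1} does not exist

beta_a = np.linalg.lstsq(X, y, rcond=None)[0]  # one solution (minimum-norm least squares)
beta_b = beta_a + np.linalg.svd(X)[2][-1]      # add a direction from the null space of X
print(np.allclose(X @ beta_a, y), np.allclose(X @ beta_b, y))   # True True: both fit y exactly
```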
... consistent in terms of both coefficient estimation and variable selection. Hence, before we try to estimate the regression coefficients β, it is preferable that we have a set of useful predictors in hand. The emphasis of our task in this chapter is to propose methods with the aim of identifying relevant predictors to ensure selection consistency, or screening consistency, in variable selection. The primary interest ...

... thesis. Basically, we are dealing with high-dimensional data in the linear model setting. The aim of this thesis is to achieve variable selection accuracy before we do any prediction for the model. In Chapter 2, we show two main results of the thesis. Firstly, we show the screening property of orthogonal matching pursuit (OMP) in variable selection under proper conditions. In addition, we also show the ...

... also show the consistency property of Forward Regression (FR) in variable selection under proper conditions. In Chapter 3, we provide an extension to variable selection in modeling the mean of partial linear models by adding a penalty term to the h-likelihood. On top of that, some simulation studies are presented to show the performance of the proposed method. In the last chapter, we make some summary and ...

... described in Donoho (2000), our task is to find a needle in a haystack, teasing the relevant information out of a vast pile of glut. Statistically, the aim is to conduct variable selection, which is the technique of selecting a subset of relevant features for building robust learning models, under the small-n, large-d situation. By removing most irrelevant and redundant variables from the data, variable selection ...

... literature. All the above-mentioned variable selection procedures only consider the fixed effect estimates in linear models. However, in real life, a lot of existing data involve both fixed effects and random effects. For example, in clinical trials, several observations are taken over a period of time for one particular patient. After collecting the data needed for all the patients, ...

... main drawbacks at the same time. First of all, the Lasso cannot handle the collinearity problem: when the pairwise correlations among a group of variables are very high, the Lasso tends to select only one variable from the group and ignore the rest of the variables in that group. In addition, the Lasso is not suitable for general factor selection, since it can only select individual input variables ...
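The excerpt just above points out that when the pairwise correlations within a group of variables are very high, the Lasso tends to pick only one variable from the group and drop the rest. A minimal sketch of that behaviour is given below; the data-generating choices and the use of scikit-learn are assumptions for illustration and are not part of the thesis.

```python
# Illustration of the Lasso selecting one variable out of a highly correlated group.
# The simulated design and penalty level are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n = 200
z = rng.standard_normal(n)
# Three nearly identical predictors (pairwise correlation close to 1) plus five noise variables.
X = np.column_stack([z + 0.01 * rng.standard_normal(n) for _ in range(3)]
                    + [rng.standard_normal(n) for _ in range(5)])
y = 3.0 * z + rng.standard_normal(n)           # the signal is shared by the correlated group

fit = Lasso(alpha=0.5).fit(X, y)
print(np.round(fit.coef_, 2))
# Typically only one of the first three (correlated) coefficients is non-zero,
# even though all three carry essentially the same information.
```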
