Implementing machine learning methods in Stata

Austin Nichols
September 2018

Definitions

What are machine learning algorithms (MLA)? Methods to derive a rule from data, or to reduce the dimension of available information. They are also known as data mining, data science, statistical learning, or statistics (or econometrics, if you are in my tribe). A fundamental distinction: most MLA are designed to reproduce how a human would classify something, with all inherent biases. There is no pretension to deep structural parameters or causal inference, but this is changing.

Unsupervised MLA: no labels (no outcome data)

- Clustering: cluster kmeans, kmedians
- Principal component analysis: pca
- Latent class analysis: gsem in Stata 15
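A minimal sketch of how the unsupervised commands above are invoked, using the auto data shipped with Stata; the variables and the number of clusters are illustrative assumptions, not from the talk.

    * Unsupervised learning with built-in commands, on the auto data.
    sysuse auto, clear
    cluster kmeans price mpg weight, k(3) name(grp3)  // k-means, 3 groups (assumed)
    tabulate grp3                                     // cluster sizes
    pca price mpg weight                              // principal components
    predict pc1 pc2, score                            // first two component scores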

Supervised MLA: labels (outcome y)

- Regression or linear discriminants: regress, discrim lda
- Nonlinear discriminants: discrim knn
- Shrinkage: lasso, ridge regression (findit lassopack)
- Generalized additive models (findit gam), wavelets, splines (mkspline)
- Nonparametric regression, e.g. lpoly, npregress
- Support Vector Machines or kernel machines
- "Structural" Equation Models, e.g. sem, gsem, irt, fmm
- Tree builders such as ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993), CART (Breiman et al., 1984)
- Neural Networks (NN), Convolutional NN
- Boosting, e.g. AdaBoost
- Bagging, e.g. RandomForest
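Two of the official commands from this list in a runnable sketch, again on the auto data; the knot location is an illustrative assumption.

    * Splines and nonparametric regression on the auto data.
    sysuse auto, clear
    mkspline w1 3000 w2 = weight   // linear spline in weight, knot at 3000 (assumed)
    regress mpg w1 w2              // piecewise-linear fit
    lpoly mpg weight, nograph      // local-polynomial smooth
    npregress kernel mpg weight    // nonparametric regression (Stata 15+)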

The big …

These last are what are usually meant by Machine Learning. NN and Convolutional NN are widely used in parsing images, e.g. satellite photos (see also Nichols and Nisar 2017). Boosting and bagging are based on trees (CART), but Breiman (2001) showed bagging was consistent whereas boosting need not be. Hastie, Tibshirani, and Friedman (2009, sect. 10.7) outline some other advantages of bagging.

The Netflix Prize

The Netflix Prize was a competition to better predict user ratings for films, based on previous ratings of Netflix users. The best predictor to beat the existing Netflix algorithm (Cinematch) by more than 10 percent would win a million dollars. There were also annual progress prizes for major improvements over previous leaders (reductions in RMSE of one percent or more). The competition began on October 2, 2006, and days later one team had already beaten Cinematch. Over the second year of the competition, only three teams reached the leading position: BellKor, BigChaos, and BellKor in BigChaos, a joint team of the two other teams.
More exciting than the World Cup

On June 26, 2009, BellKor's Pragmatic Chaos, a merger of BellKor in BigChaos and Pragmatic Theory, achieved a 10.05 percent improvement over Cinematch, making them eligible for the $1m grand prize. On July 25, 2009, The Ensemble (a merger of Grand Prize Team and Opera Solutions and Vandelay United) achieved a 10.09 percent improvement over Cinematch. On July 26, 2009, the final standings showed two teams beating the minimum requirements for the Grand Prize: The Ensemble and BellKor's Pragmatic Chaos. On September 18, 2009, Netflix announced BellKor's Pragmatic Chaos as the winner. The Ensemble had in fact matched the performance of BellKor's Pragmatic Chaos, but because BellKor's Pragmatic Chaos had submitted their method 20 minutes earlier in the final round of submissions, the rules made them the winner.
kaggle competitions

There are many competitions of this type posted at kaggle.com at any given time, some with large cash prizes (active right now: Zillow home price prediction for $1.2m and Dept. of Homeland Security passenger screening for $1.5m). Virtually all of the development in this methods space is being done in R and Python (since Breiman passed away, there is less f77 code being written).
Discriminants

The linear discriminant method draws a line (hyperplane) between data points such that as many data points of one group as possible fall on one side and as many of the other group as possible fall on the other. For example, a company surveys 24 people in town as to whether they own lawnmowers or not, and wants to classify ownership based on the two variables shown. The line shown separates "optimally" among all possible lines (Fisher 1936). A similar approach can classify mushrooms as poisonous or not. Or we can use a semiparametric version averaging over the k nearest neighbors (both are subcommands of discrim).

[Figure: Predicting lawnmower ownership. Scatter of lot size against income, with owners and nonowners separated by the fitted linear discriminant.]

A punny example

From the Stata manual: the example in [MV] discrim knn classifies poisonous and edible mushrooms. Misclassifying poisonous mushrooms as edible is a big deal at dinnertime. You have invited some scientist friends over for dinner, including Mr. Mushroom, a real "fun guy."
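A hedged sketch of both discrim flavors; the lawnmower and mushroom data from the talk are not shipped with Stata, so the auto data and the foreign variable stand in for the data and the class label.

    * Linear and kNN discriminant analysis (auto data as a stand-in).
    sysuse auto, clear
    discrim lda mpg weight, group(foreign)       // Fisher's linear discriminant
    estat classtable                             // classification table
    discrim knn mpg weight, group(foreign) k(5)  // 5-nearest-neighbor version
    estat errorrate                              // misclassification rates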

Estimating out-of-sample

The ensemble of models can predict for any new observations that have the same set of variables defined. In McBride and Nichols (2016), it turns out that choosing the parametric model that performs best out of sample also dramatically improves prediction. It is really the holdout observations, and prioritizing out-of-sample performance, that drive the improvement. A k-fold cross-validation approach would also do this.

Causal Inference

Nichols and McBride (2017) make the point that prediction is exactly the target for a propensity score model (as in teffects ipw or teffects ipwra, etc.), though better predictions are not always better! In particular, if one estimates the probability of treatment as a function of excluded instruments, and not every confounder, a better predicted probability of treatment can lead to worse inference. Comparing across many of these methods, bagging (RandomForest) worked best, in the sense that it had the lowest MSE for the true treatment effect.
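For context, a minimal example of the propensity-score estimators named above, using the cattaneo2 teaching dataset that ships with Stata; the covariate list is an illustrative assumption, not the specification from the paper.

    * The model in the second set of parentheses is a pure prediction problem.
    webuse cattaneo2, clear
    teffects ipw (bweight) (mbsmoke mmarried mage fbaby medu), ate
    teffects ipwra (bweight mage) (mbsmoke mmarried mage fbaby medu), atet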
Direction

For the rest of this talk, we will focus on the winner in that prior work, but the goal is to implement a stochastic ensemble method from scratch, with an eye toward tweaks in the method that can improve causal inference. The code is not public yet, but email me if you would like to be a beta tester. It is currently called stens, for "stochastic ensemble" method.

Overview

The basic method uses CART: binary splits that minimize "impurity" (entropy/Gini/twoing). In a regression tree, the split is based on the sum of squared residuals, which is the default in stens. Note that the sum of squared residuals for a binary outcome is just the number of observations misclassified. Each leaf in a complete tree is captured by a single dummy built of interaction terms (see the sketch below), and the prediction is either a classification (predicted class) for that leaf or an average outcome ȳ for that leaf. Breiman et al. (1984) advocate pruning a complete tree and using cross-validation; pruning in such a system means combining dummies via an OR operation. Breiman (1996) instead advocates no pruning and bootstrap aggregation instead.
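A minimal sketch of a leaf as a dummy of interaction terms; the variables wage, hours, and y and both split points are hypothetical stand-ins for whatever a tree actually selects.

    * One leaf of a tree that split on wage >= 9.5, then hours < 40 (both assumed).
    generate byte leaf = (wage >= 9.5) * (hours < 40) if !missing(wage, hours)
    egen double yhat = mean(y), by(leaf)   // leaf prediction = mean outcome in leaf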
Outline of code

1. Bootstrap the data to create matrix d.
2. Randomly choose columns of d on which to split at the current node.
3. Compute optimal splits over all choices and pick the best, creating d0 and d1 submatrices (see the Mata sketch after this list).
4. Store the best choice (dummy syntax and predicted value) in a prediction matrix.
5. Repeat steps 2-4 in the submatrices until the stopping rule is met in all of them.
6. Collect results in the prediction matrix for this tree (number of leaves by 2).
7. Repeat steps 1-6 until the tree limit is met.

Along the way, compute proximities between each pair of observations as the fraction of the time they fall in the same leaf. Also compute "variable importance" by permuting each feature used in the tree in the out-of-bag sample and computing the difference in prediction error.
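Step 3 is the computational core. Below is a hedged Mata sketch of the best-split search on a single feature, minimizing the sum of squared residuals (the stens default); this is not the stens code itself, and the real program would repeat the search over each randomly chosen column.

    mata:
    // Best binary split of outcome y on one feature x, by sum of squared
    // residuals; returns (best cutpoint, attained SSR).
    real rowvector bestsplit(real colvector y, real colvector x)
    {
        real colvector cuts, yl, yr
        real scalar    i, ssr, bestssr, bestcut
        cuts = uniqrows(x)                  // candidate cutpoints
        bestssr = .
        bestcut = .
        for (i=1; i<rows(cuts); i++) {
            yl  = select(y, x :<= cuts[i])  // left child
            yr  = select(y, x :>  cuts[i])  // right child
            ssr = quadcrossdev(yl, mean(yl), yl, mean(yl)) +
                  quadcrossdev(yr, mean(yr), yr, mean(yr))
            if (ssr < bestssr) {            // missing bestssr compares high
                bestssr = ssr
                bestcut = cuts[i]
            }
        }
        return((bestcut, bestssr))
    }
    end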
Step 1: Bootstrap

Approximately 63.212 percent of observations are used for each tree: these are the "bag," in Breiman's parlance. The roughly 36.788 percent left out are the out-of-bag sample, a randomly selected sample that can be used to assess out-of-sample performance of the built tree. One can also draw clusters for each sample rather than observations.
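Where 63.212 percent comes from: an observation appears in a bootstrap sample of size n with probability 1 - (1 - 1/n)^n, which tends to 1 - exp(-1) ≈ 0.63212 as n grows. A quick simulation check; the sample size and seed are arbitrary.

    * Share of distinct observations landing in one bootstrap sample.
    clear
    set seed 20180901
    set obs 10000
    generate long id = _n
    bsample                      // resample _N observations with replacement
    duplicates drop id, force    // keep one copy of each bagged observation
    display _N/10000             // close to 1 - exp(-1) = .63212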

Step 2: Sample features

At each node, of the m features (predictor variables), randomly select k < m on which to search for the best split, each split yielding a leaf dummy such as (wage>=9.5)*(hours<…), with a leaf classified as 1 when its predicted probability exceeds 0.5. One can also construct ROC curves, and can also choose priors to weigh different classification errors differently.

Step 7: Repeat

Now bootstrap again and build a new stochastic tree. Keep doing this either until the maximum number of trees is reached (a user-specified limit with a default of 500) or until a "change in prediction" limit is reached (e.g. predictions have changed by less than c(epsdouble) over the last 10 new trees).

Caveats

Currently stens does not handle missing values except via a trick proposed by Breiman: impute randomly via hotdeck, run stens and predict proximity, impute using the nearest cases (using proximity and kNN), then rerun stens. It does not handle nonbinary splits except through repeated splits. It is currently a mix of ado and Mata code and needs to be made faster; there are several ways forward here.

Horizon

The big innovation to come: estimate a causal model in each tree. Predict (noisily) the probability of treatment in each tree, reweight within the tree for the ATE or ATT, then average over trees for an overall average impact estimate. But also, assess the dependence of the impact estimate on the out-of-sample prediction error rate, the distance from the average prediction, or permutations in one feature.

References

Breiman, Leo, Jerome Friedman, Richard A. Olshen, and Charles J. Stone. 1984. Classification and Regression Trees. Wadsworth, New York.
Breiman, Leo. 1996. "Bagging predictors." Machine Learning 24(2): 123–140.
Breiman, Leo. 2001. "Random forests." Machine Learning 45(1): 5–32.
Fisher, R. A. 1936. "The use of multiple measurements in taxonomic problems." Annals of Eugenics 7: 179–188.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer, New York.
McBride, Linden, and Austin Nichols. 2016. "Retooling Poverty Targeting Using Out-of-Sample Validation and Machine Learning." The World Bank Economic Review.
Nichols, Austin, and Linden McBride. 2017. "Propensity scores and causal inference using machine learning methods." https://www.stata.com/meeting/baltimore17/
Nichols, Austin, and Hiren Nisar. 2017. "Analyzing satellite data in Stata." https://www.stata.com/meeting/baltimore17/