Predicting Changes in Earnings: A Walk Through a Random Forest

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Business Administration with a concentration in Accounting

by

Joshua O'Donnell Sebastian Hunt
Louisiana Tech University, Bachelor of Science in Mathematics, 2007
Louisiana Tech University, Master of Arts in Teaching, 2011
University of Arkansas, Master of Accountancy, 2013
University of Arkansas, Master of Science in Statistics and Analytics, 2017

August 2018
University of Arkansas

This dissertation is approved for recommendation to the Graduate Council.

Vern Richardson, Ph.D., Dissertation Director
James Myers, Ph.D., Committee Member
Cory Cassell, Ph.D., Committee Member
David Douglass, Ph.D., Committee Member

Abstract

This paper investigates whether the accuracy of models used in accounting research to predict categorical dependent variables (classification) can be improved by using a data analytics approach. This topic is important because accounting research makes extensive use of classification in many different research streams that are likely to benefit from improved accuracy. Specifically, this paper investigates whether the out-of-sample accuracy of models used to predict future changes in earnings can be improved by considering whether the assumptions of the models are likely to be violated and whether alternative techniques have strengths that are likely to make them a better choice for the classification task. I begin my investigation using logistic regression to predict positive changes in earnings using a large set of independent variables. Next, I implement two separate modifications to the standard logistic regression model, stepwise logistic regression and elastic net, and examine whether these modifications improve the accuracy of the classification task. Lastly, I relax the logistic regression parametric assumption and examine whether random forest, a nonparametric machine learning technique, improves the accuracy of the classification task. I find little difference in the accuracy of the logistic regression-based models; however, I find that random forest has consistently higher out-of-sample accuracy than the other models. I also find that a hedge portfolio formed on predicted probabilities using random forest earns larger abnormal returns than hedge portfolios formed using the logistic regression-based models. In subsequent analysis, I consider whether the documented improvements exist in an alternative classification setting: financial misstatements. I find that random forest's out-of-sample area under the receiver operating characteristic curve (AUC) is significantly higher than that of the logistic-based models. Taken together, my findings suggest that the accuracy of classification models used in
accounting research can be improved by considering the strengths and weaknesses of different classification models and considering whether machine learning models are appropriate.

Acknowledgements

I would like to thank my mother, Catherine Hunt, who not only taught me how to read, but also instilled in me the importance of education and cultivated my love of learning from an early age.

Table of Contents

Introduction
Algorithms
    Logistic Regression
    Stepwise Logistic Regression
    Elastic Net
    Cross-Validation
    Random Forest
Data and Methods
Results
    Main Analyses
    Additional Analyses
    Additional Misstatement Analyses
Conclusion
References
Appendices
Tables
Figures

Introduction

The goal of this paper is to show that accounting researchers can improve the accuracy of classification (using models to predict categorical dependent variables) by considering whether the assumptions of a particular classification technique are likely to be violated and whether an alternative classification technique has strengths that are likely to make it a better choice for the classification task. Accounting research makes extensive use of classification in a variety of research streams. One of the most common classification techniques used in accounting research is logistic regression. However, logistic regression is not the only classification technique available, and each technique has its own set of assumptions and its own strengths and weaknesses. Using a data analytics approach, I investigate whether the out-of-sample accuracy of predicting changes in earnings can be improved by considering limitations found in a logistic regression model and addressing those limitations with alternative classification techniques.

I begin my investigation by predicting positive versus negative changes in earnings for several reasons. First, prior accounting research uses statistical approaches to predict changes in earnings that focus on methods rather than theory, providing an intuitive starting point for my investigation (Ou and Penman 1989a, 1989b; Holthausen and Larcker 1992). While data analytics has advanced since the time of these papers, the statistical nature of their approach fits in well with a data analytics approach. Data analytics tends to take a more statistical, results-driven approach to prediction tasks relative to traditional accounting research. Second, changes in earnings are a more balanced dataset in regard to the dependent variable relative to many of the other binary dependent variables that the accounting literature uses (e.g., the incidence of fraud, misstatements, going concerns, bankruptcy, etc.). Positive earnings changes range from 40 to 60 percent prevalence in a given year for my dataset. Logistic regression can achieve high accuracy in unbalanced datasets, but this accuracy may have little meaning because of the nature of the data. For example, in a dataset of 100 observations that has only five occurrences of a positive outcome, one can have high accuracy (95 percent in this example) without correctly classifying any of the positive outcomes.
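To illustrate this point, the short sketch below scores a classifier that always predicts the majority class on such a 95/5 split; it is a hypothetical example rather than code from the dissertation. Accuracy is 95 percent even though no positive case is identified.

```python
# Hypothetical illustration: accuracy can look strong on an unbalanced
# dataset even when no positive outcome is classified correctly.
from sklearn.metrics import accuracy_score, recall_score

y_true = [1] * 5 + [0] * 95   # 5 positive outcomes among 100 observations
y_pred = [0] * 100            # a "classifier" that always predicts negative

print(accuracy_score(y_true, y_pred))  # 0.95 -> 95 percent accuracy
print(recall_score(y_true, y_pred))    # 0.00 -> no positives detected
```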
Third, focusing on predicting changes in earnings allows me to use a large dataset which, in turn, allows me to use a large set of independent variables. Lastly, changes in earnings are also likely to be of interest to investors and regulators because of their relationship to abnormal returns (Ou and Penman 1989b; Abarbanell and Bushee 1998).

Logistic regression is the first algorithm I investigate because of its prevalent use in the accounting literature. Logistic regression uses a maximum likelihood estimator, an iterative process, to find the parameter estimates. Logistic regression has several assumptions. First, logistic regression requires a binary dependent variable. Second, logistic regression requires that the model be correctly specified, meaning that no important variables are excluded from the model and no extraneous variables are included in the model. Third, logistic regression is a parametric classification algorithm, meaning that the log odds of the dependent variable must be linear in the parameters.

Footnote: I only discuss a limited number of the assumptions for logistic regression here. More detail is provided on all of the assumptions in the logistic regression section.

I use a large number of independent variables chosen because of their use in prior literature. This makes it more likely that extraneous variables are included in the model, violating the second logistic regression assumption. To address this potential problem, I implement stepwise logistic regression, following prior literature (Ou and Penman 1989b; Holthausen and Larcker 1992; Dechow, Ge, Larson, and Sloan 2011). The model begins with all the input variables and each variable is dropped one at a time. The Akaike information criterion (AIC) is used to test whether dropping a variable results in an insignificant change in model fit, and if so, the variable is permanently deleted. This is repeated until the model contains only variables that change the model fit significantly when dropped.

Footnote: Ou and Penman (1989b) begin with 68 independent variables and Holthausen and Larcker (1992) use 60 independent variables. My independent variables are based on these independent variables as well as 11 from Abarbanell and Bushee (1998).

Footnote: This is an example of backward elimination. Stepwise logistic regression can also use forward elimination or a combination of backward and forward elimination. I use backward elimination because it is similar to what has been used in prior literature (Ou and Penman 1989b; Holthausen and Larcker 1992; Dechow et al. 2011).
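A minimal sketch of this backward elimination procedure follows, assuming a pandas DataFrame X of candidate predictors and a binary Series y (both hypothetical placeholders); it uses the AIC reported by statsmodels' logit estimator and is an illustration of the procedure described above rather than the dissertation's own code.

```python
# Backward stepwise logistic regression guided by AIC (illustrative sketch).
# X is a pandas DataFrame of candidate predictors; y is a binary Series.
import statsmodels.api as sm

def backward_stepwise_aic(X, y):
    kept = list(X.columns)
    best_aic = sm.Logit(y, sm.add_constant(X[kept])).fit(disp=0).aic
    improved = True
    while improved and len(kept) > 1:
        improved = False
        for var in list(kept):
            trial = [v for v in kept if v != var]
            aic = sm.Logit(y, sm.add_constant(X[trial])).fit(disp=0).aic
            if aic <= best_aic:  # dropping var does not worsen model fit
                best_aic, kept = aic, trial
                improved = True
                break
    return kept
```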
While stepwise logistic regression makes it less likely that extraneous variables are included in the model, it has several weaknesses. First, the stepwise procedure performs poorly in the presence of collinear variables (Judd and McClelland 1989). This can be a concern with a large set of independent variables. Second, the resulting coefficients are inflated, which may affect out-of-sample predictions (Tibshirani 1996). Third, the measures of overall fit, z-statistics, and confidence intervals are biased (Pope and Webster 1972; Wilkinson 1979; Whittingham, Stephens, Bradbury, and Freckleton 2001).

Footnote: Coefficients tend to be inflated because the stepwise procedure overfits the model to the data. The procedure attempts to ensure that only those variables that improve fit are included based on the current dataset, and this causes the coefficients to be larger than their true parameter estimates. Similarly, the model fit statistics are inflated. The z-statistics and confidence intervals tend to be incorrectly specified due to degrees of freedom errors and because these statistical tests are classical statistics that do not take into account prior runs of the model.

I implement elastic net to address the first two weaknesses of stepwise logistic regression (multicollinearity and inflated coefficients). Elastic net is a logistic regression with added constraints. Elastic net combines Least Absolute Shrinkage and Selection Operator (lasso) and ridge regression constraints. Lasso is an L1 penalty function that selects important variables by shrinking coefficients toward zero (Tibshirani 1996). Ridge regression also shrinks coefficients, but uses an L2 penalty function and does not zero out coefficients (Hoerl and Kennard 1970). Lasso performs poorly with collinear variables while ridge regression does not. Elastic net combines the L1 and L2 penalties, essentially performing ridge regression to overcome lasso's weaknesses and then lasso to eliminate irrelevant variables.

Footnote: An L1 penalty function penalizes the model for complexity based on the absolute value of the coefficients. An L2 penalty function penalizes the model for complexity based on the sum of the squared coefficients.

Logistic regression, stepwise logistic regression, and elastic net are all parametric models subject to the assumption that the independent variables are linearly related to the log odds of the dependent variable (the third logistic regression assumption). Given that increasing (decreasing) a particular financial ratio may not equate to a linear increase (decrease) in the log odds of a positive change in earnings, it is not clear that the relationship is linear. To address this potential weakness, I implement random forest, a nonparametric model. The basic idea of random forest was first introduced in 1995 by Ho (1995), and the algorithm now known as random forest was implemented in 2001 by Breiman (2001). Since then, it has been used in biomedical research, chemical research, genetic research, and many other fields (Díaz-Uriarte and De Andres 2006; Svetnik, Liaw, Tong, Culberson, Sheridan, and Feuston 2003; Palmer, O'Boyle, Glen, and Mitchell 2007; Bureau, Dupuis, Falls, Lunetta, Hayward, Keith, and Van Eerdewegh 2005).

Random forest is a decision tree-based algorithm that averages multiple decision trees. Decision trees are formed on random samples of the training dataset, and random subsets of the independent variables are used in forming the individual decision trees. Many decision trees are formed with different predictor variables, and these trees remain unpruned. Each tree is formed on a different bootstrapped sample of the training data. These procedures help ensure that the decision trees are not highly correlated and reduce variability. Highly correlated decision trees in the forest would make the estimation less reliable.

Footnote: A training dataset refers to the in-sample dataset used to form estimates that are then tested on the out-of-sample dataset. In my setting, I use rolling windows of prior years as the training set and test out-of-sample accuracy on the next year.

Footnote: Pruning a decision tree refers to removing branches that have little effect on overall accuracy. This helps reduce overfitting.
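The sketch below shows how the elastic net and random forest classifiers described above might be fit with scikit-learn; the mixing weight, the number of trees, and the train/test objects are hypothetical placeholders, and the dissertation's own analysis may have used different tooling entirely.

```python
# Illustrative sketch: elastic net logistic regression and random forest.
# X_train, y_train, X_test, y_test are hypothetical placeholders.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Elastic net: logistic regression with a mix of L1 (lasso) and L2 (ridge)
# penalties; l1_ratio sets the mix and C the overall penalty strength.
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=1.0, max_iter=5000)

# Random forest: many unpruned trees, each grown on a bootstrap sample of
# the training data using a random subset of predictors at each split.
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                bootstrap=True, random_state=0)

for name, model in [("elastic net", enet), ("random forest", forest)]:
    model.fit(X_train, y_train)
    print(name, "out-of-sample accuracy:", model.score(X_test, y_test))
```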
Tables

Table 1: Model accuracy, 1970-2014

Split   N      Logistic Accuracy  Stepwise Accuracy  Elastic Net Accuracy  Random Forest Accuracy
50/50   41094  0.57               0.57               0.577                 0.6
60/40   32881  0.588              0.586              0.589                 0.624
70/30   24663  0.601              0.601              0.598                 0.645
80/20   16445  0.621              0.623              0.612                 0.665
90/10   8223   0.659              0.671              0.636                 0.693
95/05   4113   0.697              0.714              0.661                 0.735

Table 1 shows the percentile splits, the corresponding sample size, and accuracy. The percentile splits are taken from ranking the raw probabilities formed from each respective model. Accuracy represents how correctly each model classifies a positive change in earnings. The bold numbers represent the largest accuracies.

Table 2: Five-year groups, 95/05 split

Years      Logistic Regression  Stepwise Logistic  Elastic Net  Random Forest
1970-1974  0.631                0.644              0.601        0.605
1975-1979  0.644                0.63               0.647        0.692
1980-1984  0.703                0.715              0.69         0.71
1985-1989  0.748                0.743              0.735        0.756
1990-1994  0.709                0.724              0.702        0.729
1995-1999  0.753                0.761              0.732        0.728
2000-2004  0.717                0.728              0.702        0.748
2005-2009  0.685                0.754              0.549        0.791
2010-2014  0.667                0.672              0.655        0.739

Table 2 presents five-year groups and the corresponding out-of-sample accuracy. The bolded numbers represent the largest accuracies.

Table 3: Top ten most important variables

Random Forest Variables  Freq   Elastic Net Variables  Freq   Stepwise Variables  Freq
ADJEPSFX                 45     Z_CAPX_AB              45     PIAT_OP             44
ETR_AB                   45     Z_CHG_SALEAT_OP        45     CHG_CURRENT_OP      42
CAPX_AB                  44     Z_PIAT_OP              45     OIADPAT_OP          42
CHG_DAYSAR_OP            43     Z_CHG_INVTAT_OP        44     INVTAT_OP           38
CHG_SALECHE_OP           43     Z_INVTAT_OP            43     CHG_QUICK_OP        37
LAG1_CHG_CAPXAT_OP       43     Z_CHG_INVT_OP          42     CHG_SALE_OP         37
CHG_SALERECT_OP          40     Z_DVPSX_OP             42     SALE_OP             35
LF_AB                    39     Z_OIADPAT_OP           42     CHG_PIAT_OP         34
CHG_INVTAT_OP            38     Z_ADJEPSFX             41     CHG_SALEAT_OP       34
DAYSAR_OP                38     Z_CHG_CURRENT_OP       41     CHG_INVT_OP         33

Table 3 shows the top ten most chosen independent variables over the sample period 1970-2014, inclusive. The numbers represent the corresponding number of times chosen, with 45 being the largest possible number. Variables are defined in the appendix.

Table 4: Abnormal returns

Panel A: Logistic Regression
Confusion Matrix  BHAR    P-Value  N
Hedge             0.144   0.000    4113
PP                0.077   0.000    2079
PN                -0.067  0.000    2034
TP                0.12    0.000    1520
TN                -0.153  0.000    1347
FP                -0.038  0.085    559
FN                0.103   0.000    687
Fit Metrics: Accuracy 0.697, Kappa 0.394, Sensitivity 0.689, Specificity 0.707, Prevalence 0.537, Detection Rate 0.370, Detection Prevalence 0.505

Panel B: Stepwise Logistic Regression
Confusion Matrix  BHAR    P-Value  N
Hedge             0.142   0.000    4113
PP                0.066   0.000    2079
PN                -0.077  0.000    2034
TP                0.12    0.000    1582
TN                -0.159  0.000    1356
FP                -0.106  0.000    497
FN                0.087   0.000    678
Fit Metrics: Accuracy 0.714, Kappa 0.428, Sensitivity 0.700, Specificity 0.732, Prevalence 0.549, Detection Rate 0.385, Detection Prevalence 0.505

Panel C: Elastic Net
Confusion Matrix  BHAR    P-Value  N
Hedge             0.051   0.005    4113
PP                0.054   0.000    2079
PN                0.002   0.848    2034
TP                0.083   0.000    1455
TN                -0.068  0.000    1265
FP                -0.016  0.454    624
FN                0.118   0.000    769
Fit Metrics: Accuracy 0.661, Kappa 0.322, Sensitivity 0.654, Specificity 0.670, Prevalence 0.541, Detection Rate 0.354, Detection Prevalence 0.505

Panel D: Random Forest
Confusion Matrix  BHAR    P-Value  N
Hedge             0.174   0.000    4113
PP                0.075   0.000    2079
PN                -0.099  0.000    2034
TP                0.123   0.000    1650
TN                -0.189  0.000    1371
FP                -0.112  0.000    429
FN                0.086   0.000    663
Fit Metrics: Accuracy 0.735, Kappa 0.468, Sensitivity 0.713, Specificity 0.762, Prevalence 0.562, Detection Rate 0.401, Detection Prevalence 0.505

Table 4 presents abnormal returns and supplemental fit data. The confusion matrix column represents data available in a confusion matrix for the 95/05 data split. PP represents those predicted to be a positive change, PN represents those predicted to be a negative change, TP represents the true positives, TN represents the true negatives, FP represents the false positives, and FN represents the false negatives. BHAR represents the 12-month size-adjusted abnormal returns, P-Value represents the significance of the abnormal returns, and N is the number of observations. The fit metrics are Accuracy; Kappa, which represents how well the classifier performs relative to random chance; Sensitivity, the true positive rate; Specificity, the true negative rate; Prevalence, the proportion of positive occurrences; Detection Rate, the number of true positives relative to the total; and Detection Prevalence, the number of predicted positives relative to the total.
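As an illustration of how the fit metrics in Table 4 follow from the confusion matrix cells, the sketch below recomputes them from the Panel A counts; it is a reader's reconstruction for exposition, not code from the dissertation.

```python
# Recompute the Table 4 fit metrics from confusion matrix counts
# (Panel A, logistic regression, is used as the example input).
tp, tn, fp, fn = 1520, 1347, 559, 687
n = tp + tn + fp + fn

accuracy             = (tp + tn) / n
sensitivity          = tp / (tp + fn)        # true positive rate
specificity          = tn / (tn + fp)        # true negative rate
prevalence           = (tp + fn) / n         # share of actual positives
detection_rate       = tp / n                # true positives over total
detection_prevalence = (tp + fp) / n         # predicted positives over total

# Cohen's kappa: accuracy relative to the agreement expected by chance.
expected = (prevalence * detection_prevalence
            + (1 - prevalence) * (1 - detection_prevalence))
kappa = (accuracy - expected) / (1 - expected)

print(round(accuracy, 3), round(kappa, 3),
      round(sensitivity, 3), round(specificity, 3))  # 0.697 0.394 0.689 0.707
```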
Table 5: Cross-validation vs. rolling window cross-validation by year

Year  In-sample CV  Out-of-sample CV  Difference CV  In-sample RWCV  Out-of-sample RWCV  Difference RWCV  CV compared to RWCV
1970  0.59   0.575  0.015   0.552  0.573  -0.021  Smaller
1971  0.595  0.535  0.06    0.593  0.557  0.037   Larger
1972  0.615  0.509  0.106   0.609  0.525  0.084   Larger
1973  0.644  0.586  0.058   0.715  0.581  0.134   Smaller
1974  0.647  0.543  0.104   0.624  0.573  0.05    Larger
1975  0.658  0.584  0.074   0.584  0.581  0.003   Larger
1976  0.685  0.572  0.113   0.691  0.572  0.119   Smaller
1977  0.682  0.591  0.09    0.646  0.6    0.045   Larger
1978  0.678  0.591  0.087   0.674  0.617  0.057   Larger
1979  0.665  0.57   0.095   0.652  0.594  0.058   Larger
1980  0.66   0.569  0.091   0.549  0.59   -0.041  Larger
1981  0.655  0.549  0.106   0.578  0.576  0.003   Larger
1982  0.633  0.612  0.021   0.536  0.605  -0.069  Smaller
1983  0.631  0.569  0.062   0.613  0.568  0.045   Larger
1984  0.628  0.604  0.024   0.567  0.603  -0.037  Smaller
1985  0.632  0.617  0.014   0.617  0.626  -0.009  Larger
1986  0.637  0.564  0.073   0.637  0.594  0.043   Larger
1987  0.64   0.585  0.055   0.605  0.572  0.033   Larger
1988  0.629  0.555  0.074   0.571  0.587  -0.016  Larger
1989  0.613  0.589  0.023   0.594  0.61   -0.016  Larger
1990  0.619  0.575  0.044   0.607  0.586  0.02    Larger
1991  0.607  0.554  0.053   0.573  0.579  -0.006  Larger
1992  0.6    0.596  0.005   0.589  0.586  0.003   Larger
1993  0.602  0.568  0.034   0.598  0.578  0.02    Larger
1994  0.605  0.575  0.03    0.568  0.575  -0.008  Larger
1995  0.61   0.59   0.02    0.6    0.599  0.001   Larger
1996  0.615  0.551  0.064   0.605  0.564  0.041   Larger
1997  0.61   0.57   0.04    0.58   0.582  -0.002  Larger
1998  0.604  0.578  0.026   0.564  0.615  -0.051  Smaller
1999  0.6    0.551  0.05    0.61   0.591  0.018   Larger
2000  0.602  0.579  0.023   0.587  0.616  -0.028  Smaller
2001  0.604  0.61   -0.006  0.604  0.622  -0.017  Smaller
2002  0.609  0.588  0.021   0.617  0.589  0.028   Smaller
2003  0.619  0.568  0.05    0.607  0.58   0.027   Larger
2004  0.623  0.553  0.07    0.588  0.571  0.017   Larger
2005  0.615  0.582  0.033   0.6    0.577  0.023   Larger
2006  0.618  0.582  0.037   0.634  0.587  0.047   Smaller
2007  0.631  0.563  0.067   0.633  0.571  0.062   Larger
2008  0.61   0.805  -0.195  0.493  0.639  -0.146  Larger
2009  0.617  0.597  0.021   0.624  0.583  0.04    Smaller
2010  0.624  0.591  0.033   0.609  0.582  0.027   Larger
2011  0.628  0.569  0.06    0.604  0.592  0.013   Larger
2012  0.63   0.566  0.064   0.601  0.611  -0.009  Larger
2013  0.643  0.636  0.007   0.613  0.59   0.023   Smaller
2014  0.643  0.591  0.053   0.611  0.609  0.001   Larger

Table 5 presents accuracy for the 50/50 split of the data. The cross-validation (CV) in-sample accuracy, out-of-sample accuracy, and difference are compared with the rolling window cross-validation (RWCV) method. The method that produces the smallest absolute difference performs the best in this setting. The last column indicates whether the difference from CV is larger than the difference from RWCV.
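A sketch of the rolling window evaluation compared in Table 5 follows; the window length, the DataFrame df, the predictor list, and the column names are hypothetical placeholders rather than the dissertation's actual configuration.

```python
# Rolling window cross-validation sketch: train on a window of prior years,
# then test on the following year. df, predictors, and the column names
# below are hypothetical placeholders.
from sklearn.ensemble import RandomForestClassifier

window = 5  # assumed training-window length in years
years = sorted(df["fyear"].unique())

for i in range(window, len(years)):
    train = df[df["fyear"].isin(years[i - window:i])]
    test = df[df["fyear"] == years[i]]

    model = RandomForestClassifier(n_estimators=500, random_state=0)
    model.fit(train[predictors], train["pos_chg_earnings"])
    accuracy = model.score(test[predictors], test["pos_chg_earnings"])
    print(years[i], round(accuracy, 3))
```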
Table 6: Out-of-sample accuracy

Years      Random Forest CV  Random Forest RWCV
1970-1974  0.605             0.592
1975-1979  0.692             0.687
1980-1984  0.71              0.698
1985-1989  0.756             0.735
1990-1994  0.729             0.762
1995-1999  0.728             0.732
2000-2004  0.748             0.731
2005-2009  0.791             0.779
2010-2014  0.739             0.716

Table 6 presents the accuracy by five-year group for the 95/05 split for random forest CV and random forest RWCV. The bold numbers represent the largest value.

Table 7: Model AUC, 2000-2014

Sampling  Logistic AUC  Stepwise AUC  Elastic Net AUC  Random Forest AUC
Original  0.6720        0.6722        0.6768           0.7175
Down      0.5917        0.5915        0.5702           0.6622
Up        0.5653        0.5977        0.5818           0.7480
SMOTE     0.5859        0.5837        0.5715           0.6736

Table 7 shows the sampling methods and AUC. The sampling methods are down-sampling, up-sampling, and SMOTE. AUC represents the out-of-sample area under the ROC curve. The bold numbers represent the largest AUC.

Table 8: Five-year groups by sampling method

2005-2009
Sampling  Logistic Regression  Stepwise Logistic  Elastic Net  Random Forest
Original  0.5775               0.5763             0.5791       0.6390
Down      0.5669               0.5741             0.5539       0.6261
Up        0.5486               0.5681             0.5663       0.6808
SMOTE     0.5577               0.5542             0.5537       0.6257

2010-2014
Sampling  Logistic Regression  Stepwise Logistic  Elastic Net  Random Forest
Original  0.6667               0.6667             0.6667       0.7409
Down      0.5974               0.5814             0.5667       0.7093
Up        0.6057               0.6099             0.5762       0.7644
SMOTE     0.6072               0.6023             0.5824       0.7149

Table 8 presents five-year groups by sampling method. The corresponding out-of-sample AUC is given. The bold numbers represent the largest AUC.

Table 9: Top ten most important variables (misstatements)

Random Forest Variables  Freq  Elastic Net Variables  Freq  Stepwise Variables  Freq
WC                       10    SOFTASS                10    PPENTAT             10
SALEEMP                  10    SALEAT                 10    SOFTASS
SALEAT                   10    RECTSALE               10    NCO
PPENTAT                  10    PPENTAT                10    LVLFIN
MKTVOLATILITY            10    POSACC                 10    FIN
HOLDRET                  10    NETSALE                10    CHGLIABB
GROSS                    10    NCO                    10    CHGASS
FIN                            ISSUE                  10    RECTSALE
SOFTASS                        LVLFIN                 10    POSACC
RECTAT                         GEOSALEGROW            10    PRCNTCHGSALE

Table 9 shows the top ten most chosen independent variables for misstatements over the sample period 2005-2014, inclusive. The numbers represent the corresponding number of times chosen, with 10 being the largest possible number. Variables are defined in the appendix.
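The down-sampling, up-sampling, and SMOTE rows in Tables 7 and 8 can be illustrated with the imbalanced-learn package as in the sketch below; this is one possible implementation under stated assumptions (X_train, y_train, X_test, y_test are hypothetical placeholders), not the dissertation's own pipeline.

```python
# Illustrative resampling of an imbalanced training set before fitting,
# with out-of-sample AUC as the evaluation metric.
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

samplers = {
    "Down": RandomUnderSampler(random_state=0),
    "Up": RandomOverSampler(random_state=0),
    "SMOTE": SMOTE(random_state=0),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    model = RandomForestClassifier(n_estimators=500, random_state=0)
    model.fit(X_res, y_res)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(name, round(auc, 4))
```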
Figures

Figure 1: Separation plot of raw probabilities of traditional cross-validation random forest.

Figure 2: Separation plot of raw probabilities of rolling window cross-validation random forest.

Figure 3: Separation plot of ranked probabilities of traditional cross-validation random forest.

Figure 4: Separation plot of ranked probabilities of rolling window cross-validation random forest.

Figure 5: Separation plot of ranked probabilities of logistic regression for the original sample (AUC = 0.6998).

Figure 6: Separation plot of ranked probabilities of random forest for the original sample (AUC = 0.7462).

Figure 7: Separation plot of ranked probabilities of the random forest up-sampling fit (AUC = 0.7458).

Footnote: Wharton Research Data Services (WRDS) was used in preparing this manuscript. This service and the data available thereon constitute valuable intellectual property and trade secrets of WRDS and/or its third-party suppliers.