International Journal of Management (IJM)
Volume 11, Issue 1, January 2020, pp. 119–137, Article ID: IJM_11_01_013
Available online at http://www.iaeme.com/ijm/issues.asp?JType=IJM&VType=11&IType=1
Journal Impact Factor (2019): 9.6780 (Calculated by GISI) www.jifactor.com
ISSN Print: 0976-6502 and ISSN Online: 0976-6510
© IAEME Publication, Scopus Indexed

A DATA ANALYTICS APPROACH TO PLAYER ASSESSMENT

Nitin Singh
Professor, Operations Management & Information Systems, IIM Ranchi, India

ABSTRACT
There is an abundance of data in the new digital age that can be harnessed to gather insights through the application of data analytics. This study develops a model to assess the performance, and thereby forecast the ranking, of players using data available in the public domain from the Fédération Internationale de Football Association. The study applies Principal Component Analysis followed by classification. The combination of these two approaches seeks to determine the ranking of players and to classify them into different categories. The results indicate that player ranks can be predicted and classified from their playing attributes, so that an appropriate selection decision can be made.

Keywords: data analytics; forecasting; sport management; machine learning; principal component regression

Cite this Article: Nitin Singh, A Data Analytics Approach to Player Assessment, International Journal of Management (IJM), 11 (1), 2020, pp. 119–137. http://www.iaeme.com/IJM/issues.asp?JType=IJM&VType=11&IType=1

INTRODUCTION
A variety of data-capture technologies exist in the new digital age. These technologies allow sport management firms to capture and collect data on games, players, playing styles, scores, and many other game and player attributes. This data can be harnessed to gather insights through the application of data analytics. There have also been several discussions around this issue in literary and business circles. The Fédération Internationale de Football Association (FIFA) has approved the use of
Electronic Performance and Tracking Systems (EPTS), which has made data on the physical performance of players available (FIFA 2019). The data collected during training sessions and live matches through EPTS can now be used to predict player and match performance. A data-driven approach to analysing player performance and ranking could be an interesting area to investigate. In this context, data analytics could be of immense value.

The relative performance of players may vary when playing against different members of an opposing team. At the same time, player performance could also vary when playing with different members of the same team. This phenomenon could be a function of numerous variables that are difficult to grapple with through intuitive thinking or basic calculations. Currently, players are ranked on the basis of different player attributes, such as physical fitness, agility, strength, and so on. This data is diverse, and multiple variables must be considered simultaneously in order to reach the ‘right’ decision. This is where data analytics can be applied.

We also feel that this paper may stimulate further research into the role of data analytics in examining escalation of commitment theory in the discipline of sport. This study may also accentuate opportunities for data analytics and empirical studies in examining escalation behaviour, while highlighting considerations for a more effective player assessment approach. Escalation of commitment theory must be discussed in more detail in order to do this.

In this study, we first analyse the characteristics that influence player ranking in football. We also examine the relationship between player characteristics and performance. The research methodology applied is Principal Component Analysis (PCA) followed by a classifier approach. The former (PCA) is used to understand the player attributes that play a major role in
determining the ranking of players. The latter approach (the classifier) attempts to classify the players on the basis of their playing attributes. A combination of these two approaches provides a useful methodology to determine the ranking of players beforehand and to classify them into different categories. In the second stage, we examine the role of escalation of commitment in player assessment. In the next stage, we discuss contributions to theory and managerial implications that could be relevant to sport managers, researchers, administrators, and coaches.

Escalation of commitment theory must be discussed in order to examine opportunities for data analytics and empirical studies of escalation behaviour in player assessment. The theory of escalation of commitment describes circumstances in which actors tend to maintain or even increase their commitment to a specific course of action despite impartial evidence of a negative or ambiguous outcome (Hutchinson, 2018; Sleesman, Conlon, McNamara & Miles 2012; Staw, 1976). Escalation behaviour begins when an actor allocates significant resources to a course of action to accomplish a planned goal, even though there is little or no evidence of the benefits of that goal. It has been found that this behaviour shows comparable characteristics even when the context differs (Brockner, 1992). There has been extensive research on escalation behaviour, but less research has examined applications of escalation of commitment to other disciplines, including those related to sport (Mähring, Keil, Mathiassen, & Pries-Heje, 2008).

We discuss relevant studies that relate to escalation of commitment, as well as a few relevant studies that have conducted an enquiry involving data analytics, though not necessarily into escalation of commitment. Berg & Hutchinson (2012) conducted an empirical enquiry into the role of politics in the escalation of commitment and also offered opportunities for research in the sport context in which
escalation of commitment applies. The bidding for and selection of a player by a sport organisation is an interest-driven behaviour, and the decision of the sport manager, based on an assessment of player performance, results in the selection of a player for a league or team (Friedman, Parent, & Mason, 2004). If the decision is objective and based on a data-driven analytical method, it will be beneficial to the league. By this logic, there is a need to develop, investigate, and validate such methods so that managers are equipped with objective ways to assess players.

Crowder, Dixon, Ledford, and Robinson (2002) studied betting by modelling 92 football teams in the English Football Association League over the years 1992-1997. Specifically, the researchers examined betting models in different leagues. Their objective was to create a dynamic model for predicting match outcomes, and they proposed a refinement of the Poisson model suggested by Dixon and Coles (1997). The model they developed could predict the probability of a match win, draw, or loss for the betting market.

Quenzel and Shea (2014) researched ways to predict the winner of ‘tied’ football matches based on different attributes of the match strategy employed. They concluded that, in such cases, the point spread is significantly predictive. They also found weak evidence that the chances of winning are reduced if more sacks are allowed. This study provides useful insights into match strategy, which enable football managers to design their strategies accordingly.

The selection of teams based on an optimal assessment of players is a critical component of success, or winning, in sport (Bharathan, Sundarraj, Abhijeet, & Ramakrishnan, 2015). That study examined the performance utility of cricket players through hypothesis testing. The performance of batsmen was evaluated using a two-sample t-test to determine whether there is a significant difference in strike
rate, run scores, and boundary hits among batsmen. Likewise, bowler performance was evaluated using a two-sample t-test to check whether there is a significant difference in strike rate among bowlers.

The relevance of big data in sport has also been studied, specifically through a tactical analysis of elite football (Rein & Memmert, 2016). That paper presented how big data and data analytics (in particular, modern machine-learning technologies) may help address tactical decisions in elite football and aid in developing a theoretical model for tactical decision-making in team sport.

A data analytics-based approach was also applied to bidding for sporting events, specifically bidding to host the Beijing 2022 Olympics (Liu, Hautbois, & Desbordes, 2017). The analysis measured the social impact of the bidding for the Winter Olympic Games and the attitudes of non-host residents towards the bidding process. In particular, the study sought to contribute by taking the perspectives of non-host communities into account. Additionally, it offered insights into the perceptions and attitudes of citizens from emerging markets towards event bidding and hosting.

Ruiz and Cruz (2015) developed a generative model for predicting outcomes in college basketball. The researchers showed that a classical model for football can also provide competitive results in predicting basketball outcomes. The model was modified in two ways. The first modification attempted to capture the specific behaviour of each National Collegiate Athletic Association (NCAA) conference. The second aimed to capture the different strategies used by each team and conference.

A comparative study of machine-learning methods was applied to predict cricket match outcomes using the opinions of crowds on social networks (Mustafa, Nawaz, Lali, Zia, & Mehmood, 2017). The researchers investigated the feasibility of applying collective information obtained from micro posts on Twitter to predict the winner of a cricket match
using classification algorithms. The results were found to be sufficiently promising to be used to forecast winning cricket teams. Furthermore, the effectiveness of supervised learning algorithms was evaluated, and the support vector machine was found to have an advantage over other classifiers.

It is observable from the aforementioned studies that data analytics has been applied increasingly in sport management. With sports becoming more competitive, researchers are turning to sport analytics for newer models to understand the relevance of data analytics in sports across different areas, including bidding, player performance, team performance, decision-making, entertainment, and attracting fans more effectively. It is also observed that there have been data analytics-based enquiries into escalation of commitment in the discipline of sport. A summary of these studies suggests that there is a combination of data-capturing technology and the adoption of newer data analytics models within the sport industry. The area of player assessment requires more such data-driven analytical models in order to arrive at a comprehensive and quantitative assessment. Such assessments have been found to have implications for the theory of escalation of commitment.

RESEARCH QUESTION
Technology can track how fast a player is running, how deft s/he is, how quick, and how much strength is exhibited across the multiple games that the player has played. In the past, this could not be measured, but now, with a variety of data-capture systems, technology can gather how efficient players are in diverse areas of the sport. It has been noted in managerial and research circles that, in the current competitive scenario, it is essential for teams to be able to leverage technology and to measure players’ performance by using the data that is being captured. This brings us to the research question: How can data
analytic methods be used to predict the performance, and thereby the ranking, of a player, especially through modelling that makes use of the available data on player attributes? It is worthwhile to develop and adopt data-driven methods with an analytical approach that allow managers to make an objective decision (through publicly available data) based on players’ strength, accuracy, deftness, speed, agility, etc. There is also a need to extend the models developed so far in research, so that researchers and managers can understand a data-driven approach to ranking players based on their past performance.

RESEARCH METHODOLOGY
This study uses three years of player rating data (2016, 2017, and 2018) to assess the performance of football players. To begin this study, we conducted a theoretical review of related papers published in this area. In doing so, we have documented and identified relevant articles in the area of sport analytics. Sport analytics has received significant research attention in the past few years, as demonstrated in the literature review, and studies have suggested that sport analytics could be used to a greater degree in sport management. The present study employs sport analytics to assess player performance in the sport of football.

3.1 Data and Materials
The objective is to build a model to assess the performance and ranking of football players. The existing open-source rating data from the Fédération Internationale de Football Association (FIFA) for the last three years was collected and analysed (FIFA, 2019). The rating data has various variables (playing attributes such as pace, dribbling, etc.)
which have been rated on a 100-point scale, while the players have been ranked from 1 to 50. The attributes include, for example, pace, dribbling, passing capability, physical strength, speed, shooting capacity and others. A snapshot of the data with variables is provided in Table 1.

Table 1 Snapshot of data
PNAME RANK TEAM
PAC: 90 89 92 82
DRI: 90 95 94 86
SHO: 93 90 84 90
DEF: 33 26 30 42
PAS: 82 86 79 79
PHY: 80 61 60 81
81 76 90 50 86 72 92 81 88 63 82 81 38 88 32 73 75 71 84 88 82 83 66 70
ATTACK: 50 75 75 75 75 50 75 75
FW: 100 100 100 100 50 100 100 125 75
10 79 83 87 25 70 74 75 50
Note. Player & Team names are suppressed. From Fédération Internationale de Football Association, 2019.

The player attributes (variables) presented in Table 1 are described below.
PNAME: Name of player
RANK: Rank of the player
TEAM: Name of the club/organization to which the player belongs
PAC: Pace
DRI: Dribbling
SHO: Shooting
DEF: Defense
PAS: Pass
PHY: Physical strength
FW: Footwork skill

There was missing data for some records, and the missing values were estimated by taking the average of the nearest neighbours, assuming that ratings on each attribute (PAC, DRI etc.)
of similar players would have similar values. We had to code some variables (footwork, position, reflexes, attack, handling), assigning numeric values to entries that were presented in textual format. Table 1 provides a snapshot of the data.

3.2 Method
Multiple regression analysis is a widely used technique for assessing the dependence of a dependent variable (here, rank) on several explanatory (or predictor) variables (Hair, Black, Babin, Anderson, & Tatham, 2006; Rawlings, Pantula, & Dickey, 2001). Several studies have used multivariate regression for assessments (Lehmann, Overton, & Leathwick, 2002; Montgomery, Peck, & Vining 2012; Salkever, 1976). However, the multiple regression approach cannot be used reliably when multi-collinearity is present among the independent variables (Dickey, 2001; Montgomery, Peck, and Vining, 2012). In the FIFA data under study, a few variables were found to exhibit high correlation and multi-collinearity (Tables 2 & 3).

Table 2 Correlation matrix
Variables    RANK    PAC     DRI     SHO     DEF     PAS     PHY     ATTACK  SKILL MOVES  FOOT WORK
RANK         1       -0.281  -0.309  -0.218  0.275   -0.100  0.094   0.008   -0.336       -0.152
PAC          -0.281  1       0.523   0.515   -0.529  0.112   -0.168  0.313   0.486        0.077
DRI          -0.309  0.523   1       0.782   -0.714  0.785   -0.561  0.401   0.851        0.258
SHO          -0.218  0.515   0.782   1       -0.799  0.570   -0.238  0.351   0.691        0.352
DEF          0.275   -0.529  -0.714  -0.799  1       -0.397  0.473   -0.175  -0.682       -0.152
PAS          -0.100  0.112   0.785   0.570   -0.397  1       -0.541  0.298   0.652        0.196
PHY          0.094   -0.168  -0.561  -0.238  0.473   -0.541  1       -0.140  -0.460       -0.012
ATTACK       0.008   0.313   0.401   0.351   -0.175  0.298   -0.140  1       0.359        0.131
SKILL MOVES  -0.336  0.486   0.851   0.691   -0.682  0.652   -0.460  0.359   1            0.237
FOOT WORK    -0.152  0.077   0.258   0.352   -0.152  0.196   -0.012  0.131   0.237        1

Table 3 Multi-collinearity statistics
Statistic  RANK   PAC    DRI     SHO    DEF    PAS    PHY    ATTACK  SKILL MOVES  FOOT WORK
R²         0.222  0.580  0.912   0.866  0.849  0.821  0.634  0.320   0.762        0.472
Tolerance  0.778  0.420  0.088   0.134  0.151  0.179  0.366  0.680   0.238        0.528
VIF        1.286  2.379  11.369  7.470  6.603  5.589  2.734  1.470   4.198        1.894

In order to handle this issue, we employed Principal Component Analysis (PCA) to examine inter-correlation among components. PCA is able to avoid the issue of multicollinearity, since running a PCA on the raw data produces uncorrelated components that are linear combinations of the original independent variables (Jolliffe, 2002). It is also able to reduce a large number of explanatory variables to a smaller number of components (Hair et al., 2006). This provides a regression equation for an underlying process by employing explanatory variables. In the literature, PCA is considered a suitable technique for identifying and listing the major factors affecting a dependent variable (Burns, Bush, & Sinha, 2014; Hair, Black, Babin, Anderson, & Tatham, 2006). Hence, in the first stage, PCA was applied to discover the components contributing to overall player performance.

In the next stage, Principal Components Regression (PCR) is applied to the components derived from the PCA. The basic idea behind PCR is to compute the components and then use some, or all, of these components as independent predictors in a linear regression model fitted by the least squares procedure (Jolliffe, 2002). The conceptual basis of PCR is very closely related to the one underlying PCA, and the technique is similar as well. In this study, a smaller number of components (four) are found to be sufficient to explain 92.71% of the variability in the data. To ensure statistical rigor, we also undertake tests of multi-collinearity, correlations and sample adequacy, as presented in the next section.

RESULTS AND DISCUSSION
4.1 Principal Component Analysis
The first objective of this study is to discover the major components in assessing player performance. To perform the PCA, a minimum of five cases or records must be present per variable (Hair et al., 2006). Data was insufficient for certain variables: diving, handling, reflexes, kicking and
position. We examined the goodness-of-fit for the variables, as the model could be affected by sparse data. A few variables, such as diving, handling, reflexes, kicking and position, had small coefficients and were therefore dropped (Jolliffe, 2002). The process was repeated until the fit improved and we were able to get clear components and variable loadings.

Two statistical tests are conducted in order to determine the suitability of PCA; they are presented in Tables 4 and 5. First, the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy (Table 4) was found to be above the recommended level of 0.50 for all the variables (Jolliffe, 2002). Second, Bartlett’s test of sphericity (Table 5) was found to be significant (Chi Square with p < 0.05), indicating that significant inter-correlations exist between the variables and thereby suggesting that the use of PCA is appropriate.

Table 4 Kaiser-Meyer-Olkin measure of sampling adequacy
PAC     0.700
DRI     0.780
SHO     0.676
DEF     0.669
PAS     0.625
PHY     0.582
ATTACK  0.817
KMO     0.689

Table 5 Bartlett's sphericity test
Chi-square (Observed value)  863.726
Chi-square (Critical value)  11.591
DF                           21
p-value (Two-tailed)         0.0001
alpha                        0.95
Note. Test interpretation: Ho: There is no correlation significantly different from 0 between the variables. Ha: At least one of the correlations between the variables is significantly different from 0. As the computed p-value is lower than the significance level alpha=0.95, we reject the null hypothesis Ho, and accept the alternative hypothesis Ha.

Third, we performed oblique rotation (oblimin, promax) and examined the component correlation matrix. We found no inter-correlations among the components and, in the fourth stage, we repeated the analysis with varimax rotation, thus maximizing the component loadings (Hair et al., 2006). The variables finally included in the PCA are pace, dribbling capacity, shooting, defence, passing capacity, physical strength &
attacking capacity. The rotated components with varimax rotation were found to provide clearer and more distinct components. Seven variables (diving, handling, reflexes, kicking, speed, footwork, and position) were not considered in the PCA due to their low contributions to the components. The statistical test results (KMO 0.689, Bartlett’s Test of Sphericity 863.72, with Significance
F 3.807 0.003

Table 11 Model parameters
Source     Value   Standard error  t       Pr > |t|  Lower bound (95%)  Upper bound (95%)
Intercept  25.500  1.130           22.562  < 0.000   23.266             27.734
F1         -2.272  0.544           -2.337  0.021     -2.348             -0.196
F2         -1.555  1.137           -1.367  0.044     -3.802             0.693
F3         4.148   1.383           3.000   0.003     1.415              6.881
F4         -1.876  1.564           -1.200  0.232     -4.968             1.215
F5         -2.091  1.859           -1.125  0.263     -5.766             1.584
Note. Components which are highly significant are shown in bold font.

In summary, we draw these conclusions:
a) Given the value of R², 61.2% of the variability of the dependent variable RANKING is explained by the explanatory components.
b) Given the p-value of the F statistic computed in the ANOVA table, and given the significance level of 5%, the information brought by the explanatory components is significantly better than what a basic mean would bring.

Using the coefficient estimates (Value column) in Table 11, the estimated regression equation for the model is:
\[ \mathit{Rank} = 25.500 - 2.272\,F_1 - 1.555\,F_2 + 4.148\,F_3 + \varepsilon \]
The p-values for these components (F1, F2, F3) are significant (