Mastering the Game of Go without Human Knowledge

David Silver*, Julian Schrittwieser*, Karen Simonyan*, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, Demis Hassabis
DeepMind, 5 New Street Square, London EC4A 3TW. *These authors contributed equally to this work.

A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here, we introduce an algorithm based solely on reinforcement learning, without human data, guidance, or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and also the winner of AlphaGo's games. This neural network improves the strength of tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo.

Much progress towards artificial intelligence has been made using supervised learning systems that are trained to replicate the decisions of human experts 1–4. However, expert data is often expensive, unreliable, or simply unavailable. Even when reliable data is available, it may impose a ceiling on the performance of systems trained in this manner. In contrast, reinforcement learning systems are trained from their own experience, in principle allowing them to exceed human capabilities, and to operate in domains where human expertise is lacking. Recently, there has been rapid progress towards this goal, using deep neural networks trained by reinforcement learning. These systems have outperformed humans in computer games such as Atari 6, and 3D virtual environments 8–10. However, the most challenging domains in terms of human intellect – such as the game of Go, widely viewed as a grand challenge for artificial intelligence 11 – require precise and sophisticated lookahead in vast search spaces. Fully general methods have not previously achieved human-level performance in these domains.

AlphaGo was the first program to achieve superhuman performance in Go. The published version 12, which we refer to as AlphaGo Fan, defeated the European champion Fan Hui in October 2015. AlphaGo Fan utilised two deep neural networks: a policy network that outputs move probabilities, and a value network that outputs a position evaluation. The policy network was trained initially by supervised learning to accurately predict human expert moves, and was subsequently refined by policy-gradient reinforcement learning. The value network was trained to predict the winner of games played by the policy network against itself. Once trained, these networks were combined with a Monte-Carlo Tree Search (MCTS) 13–15 to provide a lookahead search, using the policy network to narrow down the search to high-probability moves, and using the value network (in conjunction with Monte-Carlo rollouts using a fast rollout policy) to evaluate positions in the tree. A subsequent version, which we refer to as AlphaGo Lee, used a similar approach (see Methods), and defeated Lee Sedol, the winner of 18 international titles, in March 2016.

Our program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee 12 in several important aspects. First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data. Second, it only uses the black and white stones from the board as input features. Third, it uses a single neural network, rather than separate policy and value networks. Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte-Carlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning. Further technical differences in the search algorithm, training procedure and network architecture are described in Methods.

Reinforcement Learning in AlphaGo Zero

Our new method uses a deep neural network fθ with parameters θ. This neural network takes as an input the raw board representation s of the position and its history, and outputs both move probabilities and a value, (p, v) = fθ(s). The vector of move probabilities p represents the probability of selecting each move (including pass), pa = Pr(a|s). The value v is a scalar evaluation, estimating the probability of the current player winning from position s. This neural network combines the roles of both policy network and value network 12 into a single architecture. The neural network consists of many residual blocks of convolutional layers 16, 17 with batch normalisation 18 and rectifier non-linearities 19 (see Methods).

The neural network in AlphaGo Zero is trained from games of self-play by a novel reinforcement learning algorithm. In each position s, an MCTS search is executed, guided by the neural network fθ. The MCTS search outputs probabilities π of playing each move. These search probabilities usually select much stronger moves than the raw move probabilities p of the neural network fθ(s); MCTS may therefore be viewed as a powerful policy improvement operator 20, 21. Self-play with search – using the improved MCTS-based policy to select each move, then using the game winner z as a sample of the value – may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure 22, 23: the neural network's parameters are updated to make the move probabilities and value (p, v) = fθ(s) more closely match the improved search probabilities and self-play winner (π, z); these new parameters are used in the next iteration of self-play to make the search even stronger. Figure 1 illustrates the self-play training pipeline.
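The policy iteration loop described above can be summarised in a short Python sketch. This is an illustrative simplification, not the published implementation: the game-specific operations and the `mcts_search` and `train_network` callables are hypothetical stand-ins for the components described in the text, and resignation and data buffering are omitted.

```python
import random

def self_play_game(initial_state, legal_moves, play, is_terminal, outcome,
                   mcts_search, net):
    """One self-play game; returns (state, search_probs, winner) training samples.
    All game-specific operations are passed in as callables (assumptions)."""
    history, state, player = [], initial_state, +1
    while not is_terminal(state):
        pi = mcts_search(net, state)                    # improved policy from search
        history.append((state, pi, player))
        moves = legal_moves(state)
        move = random.choices(moves, weights=[pi[m] for m in moves])[0]
        state, player = play(state, move), -player
    z = outcome(state)                                  # +1 if the first player won, else -1
    # Store the winner from the perspective of the player to move at each step.
    return [(s, pi, z * p) for s, pi, p in history]

def policy_iteration(net, generate_games, train_network, num_iterations):
    """Repeatedly improve the network from its own MCTS-guided self-play."""
    for _ in range(num_iterations):
        data = generate_games(net)         # batches of (s, pi, z) from self_play_game
        net = train_network(net, data)     # gradient descent on Equation (1) below
    return net
```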
Figure 1: Self-play reinforcement learning in AlphaGo Zero. a The program plays a game s1, ..., sT against itself. In each position st, a Monte-Carlo tree search (MCTS) αθ is executed (see Figure 2) using the latest neural network fθ. Moves are selected according to the search probabilities computed by the MCTS, at ∼ πt. The terminal position sT is scored according to the rules of the game to compute the game winner z. b Neural network training in AlphaGo Zero. The neural network takes the raw board position st as its input, passes it through many convolutional layers with parameters θ, and outputs both a vector pt, representing a probability distribution over moves, and a scalar value vt, representing the probability of the current player winning in position st. The neural network parameters θ are updated so as to maximise the similarity of the policy vector pt to the search probabilities πt, and to minimise the error between the predicted winner vt and the game winner z (see Equation 1). The new parameters are used in the next iteration of self-play.

Figure 2: Monte-Carlo tree search in AlphaGo Zero. a Each simulation traverses the tree by selecting the edge with maximum action-value Q, plus an upper confidence bound U that depends on a stored prior probability P and visit count N for that edge (which is incremented once traversed). b The leaf node is expanded and the associated position s is evaluated by the neural network, (P(s, ·), V(s)) = fθ(s); the vector of P values is stored in the outgoing edges from s. c Action-values Q are updated to track the mean of all evaluations V in the subtree below that action. d Once the search is complete, search probabilities π are returned, proportional to N^(1/τ), where N is the visit count of each move from the root state and τ is a parameter controlling temperature.

The Monte-Carlo tree search uses the neural network fθ to guide its simulations (see Figure 2). Each edge (s, a) in the search tree stores a prior probability P(s, a), a visit count N(s, a), and an action-value Q(s, a). Each simulation starts from the root state and iteratively selects moves that maximise an upper confidence bound Q(s, a) + U(s, a), where U(s, a) ∝ P(s, a)/(1 + N(s, a)) 12, 24, until a leaf node s′ is encountered. This leaf position is expanded and evaluated just once by the network to generate both prior probabilities and evaluation, (P(s′, ·), V(s′)) = fθ(s′). Each edge (s, a) traversed in the simulation is updated to increment its visit count N(s, a), and to update its action-value to the mean evaluation over these simulations, Q(s, a) = 1/N(s, a) Σ_{s′ | s,a→s′} V(s′), where s, a → s′ indicates that a simulation eventually reached s′ after taking move a from position s. MCTS may be viewed as a self-play algorithm that, given neural network parameters θ and a root position s, computes a vector of search probabilities recommending moves to play, π = αθ(s), proportional to the exponentiated visit count for each move, πa ∝ N(s, a)^(1/τ), where τ is a temperature parameter.
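The select, expand and back-up steps, and the visit-count policy, can be sketched as follows. This is a hedged simplification under the form U(s, a) ∝ P(s, a)/(1 + N(s, a)) given in the text: the `Node` structure, the constant `c`, the `evaluate` and `play` callables are assumptions, and terminal positions and exploration noise are not handled.

```python
class Node:
    """Search-tree statistics for the moves available from one position s."""
    def __init__(self, priors):
        self.P = dict(priors)                 # prior probability P(s, a)
        self.N = {a: 0 for a in priors}       # visit count N(s, a)
        self.W = {a: 0.0 for a in priors}     # total value, so Q(s, a) = W / N
        self.children = {}                    # move a -> child Node

def select(node, c=1.0):
    """Pick the move maximising Q(s, a) + U(s, a), with U proportional to P / (1 + N)."""
    def score(a):
        q = node.W[a] / node.N[a] if node.N[a] > 0 else 0.0
        return q + c * node.P[a] / (1 + node.N[a])
    return max(node.P, key=score)

def simulate(root, state, evaluate, play):
    """One simulation: descend the tree, expand a leaf with the network, back up V."""
    node, path = root, []
    while True:
        a = select(node)
        path.append((node, a))
        state = play(state, a)
        if a not in node.children:            # leaf reached: expand and evaluate once
            priors, value = evaluate(state)   # (P(s', .), V(s')) = f_theta(s')
            node.children[a] = Node(priors)
            break
        node = node.children[a]
    # Back up: the leaf value is from the perspective of the player to move at the
    # leaf, so it is negated once per ply on the way back towards the root.
    v = value
    for node, a in reversed(path):
        v = -v
        node.N[a] += 1                        # increment visit count
        node.W[a] += v                        # Q tracks the mean of the evaluations V

def search_probabilities(root, tau=1.0):
    """pi_a proportional to N(s, a)^(1/tau)."""
    weights = {a: n ** (1.0 / tau) for a, n in root.N.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}
```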
The neural network is trained by a self-play reinforcement learning algorithm that uses MCTS to play each move. First, the neural network is initialised to random weights θ0. At each subsequent iteration i ≥ 1, games of self-play are generated (Figure 1a). At each time-step t, an MCTS search πt = αθi−1(st) is executed using the previous iteration of neural network fθi−1, and a move is played by sampling the search probabilities πt. A game terminates at step T when both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length; the game is then scored to give a final reward of rT ∈ {−1, +1} (see Methods for details). The data for each time-step t is stored as (st, πt, zt), where zt = ±rT is the game winner from the perspective of the current player at step t. In parallel (Figure 1b), new network parameters θi are trained from data (s, π, z) sampled uniformly among all time-steps of the last iteration(s) of self-play. The neural network (p, v) = fθi(s) is adjusted to minimise the error between the predicted value v and the self-play winner z, and to maximise the similarity of the neural network move probabilities p to the search probabilities π. Specifically, the parameters θ are adjusted by gradient descent on a loss function l that sums over mean-squared error and cross-entropy losses respectively,

(p, v) = fθ(s),    l = (z − v)^2 − π^T log p + c||θ||^2    (1)

where c is a parameter controlling the level of L2 weight regularisation (to prevent overfitting).
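For illustration only, the combined loss in Equation 1 can be written in a few lines of PyTorch. This is a sketch rather than the published training code; the tensor shapes and the weight c are assumptions.

```python
import torch
import torch.nn.functional as F

def alphago_zero_loss(p_logits, v, pi, z, params, c=1e-4):
    """Equation (1): (z - v)^2 - pi^T log p + c * ||theta||^2.

    p_logits: network move logits, shape (batch, moves)
    v:        network value, shape (batch,)
    pi:       MCTS search probabilities, shape (batch, moves)
    z:        game outcome in {-1, +1} as floats, shape (batch,)
    params:   list of network parameters, for the L2 term
    """
    value_loss = F.mse_loss(v, z)                                        # (z - v)^2
    policy_loss = -(pi * F.log_softmax(p_logits, dim=1)).sum(1).mean()   # -pi^T log p
    l2 = c * sum((w ** 2).sum() for w in params)                         # c * ||theta||^2
    return value_loss + policy_loss + l2
```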
Empirical Analysis of AlphaGo Zero Training

We applied our reinforcement learning pipeline to train our program AlphaGo Zero. Training started from completely random behaviour and continued without human intervention for approximately 3 days. Over the course of training, 4.9 million games of self-play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4 s thinking time per move. Parameters were updated from 700,000 mini-batches of 2,048 positions. The neural network contained 20 residual blocks (see Methods for further details).

Figure 3a shows the performance of AlphaGo Zero during self-play reinforcement learning, as a function of training time, on an Elo scale 25. Learning progressed smoothly throughout training, and did not suffer from the oscillations or catastrophic forgetting suggested in prior literature 26–28.

Figure 3: Empirical evaluation of AlphaGo Zero. a Performance of self-play reinforcement learning. The plot shows the performance of each MCTS player αθi from each iteration i of reinforcement learning in AlphaGo Zero. Elo ratings were computed from evaluation games between different players, using 0.4 seconds of thinking time per move (see Methods). For comparison, a similar player trained by supervised learning from human data, using the KGS data-set, is also shown. b Prediction accuracy on human professional moves. The plot shows the accuracy of the neural network fθi, at each iteration of self-play i, in predicting human professional moves from the GoKifu data-set. The accuracy measures the percentage of positions in which the neural network assigns the highest probability to the human move. The accuracy of a neural network trained by supervised learning is also shown. c Mean-squared error (MSE) on human professional game outcomes. The plot shows the MSE of the neural network fθi, at each iteration of self-play i, in predicting the outcome of human professional games from the GoKifu data-set. The MSE is between the actual outcome z ∈ {−1, +1} and the neural network value v, scaled by a factor of 1/4 to the range [0, 1]. The MSE of a neural network trained by supervised learning is also shown.

Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36 hours; for comparison, AlphaGo Lee was trained over several months. After 72 hours, we evaluated AlphaGo Zero against the exact version of AlphaGo Lee that defeated Lee Sedol, under the same 2 hour time controls and match conditions as were used in the man-machine match in Seoul (see Methods). AlphaGo Zero used a single machine with 4 Tensor Processing Units (TPUs) 29, while AlphaGo Lee was distributed over many machines and used 48 TPUs. AlphaGo Zero defeated AlphaGo Lee by 100 games to 0 (see Extended Data Figure 5 and Supplementary Information).

To assess the merits of self-play reinforcement learning, compared to learning from human data, we trained a second neural network (using the same architecture) to predict expert moves in the KGS data-set; this achieved state-of-the-art prediction accuracy compared to prior work 12, 30–33 (see Extended Data Tables 1 and 2 respectively). Supervised learning achieved better initial performance, and was better at predicting the outcome of human professional games (Figure 3). Notably, although supervised learning achieved higher move prediction accuracy, the self-learned player performed much better overall, defeating the human-trained player within the first 24 hours of training. This suggests that AlphaGo Zero may be learning a strategy that is qualitatively different to human play.

To separate the contributions of architecture and algorithm, we compared the performance of the neural network architecture in AlphaGo Zero with the previous neural network architecture used in AlphaGo Lee (see Figure 4). Four neural networks were created, using either separate policy and value networks, as in AlphaGo Lee, or combined policy and value networks, as in AlphaGo Zero; and using either the convolutional network architecture from AlphaGo Lee or the residual network architecture from AlphaGo Zero. Each network was trained to minimise the same loss function (Equation 1), using a fixed data-set of self-play games generated by AlphaGo Zero after 72 hours of self-play training. Using a residual network was more accurate, achieved lower error, and improved performance in AlphaGo by over 600 Elo. Combining policy and value together into a single network slightly reduced the move prediction accuracy, but reduced the value error and boosted playing performance in AlphaGo by around another 600 Elo. This is partly due to improved computational efficiency, but more importantly the dual objective regularises the network to a common representation that supports multiple use cases.

Figure 4: Comparison of neural network architectures in AlphaGo Zero and AlphaGo Lee. Comparison of neural network architectures using either separate ("sep") or combined ("dual") policy and value networks, and using either convolutional ("conv") or residual ("res") networks. The combinations "dual-res" and "sep-conv" correspond to the neural network architectures used in AlphaGo Zero and AlphaGo Lee respectively. Each network was trained on a fixed data-set generated by a previous run of AlphaGo Zero. a Each trained network was combined with AlphaGo Zero's search to obtain a different player. Elo ratings were computed from evaluation games between these different players, using 5 seconds of thinking time per move. b Prediction accuracy on human professional moves (from the GoKifu data-set) for each network architecture. c Mean-squared error on human professional game outcomes (from the GoKifu data-set) for each network architecture.
Knowledge Learned by AlphaGo Zero

AlphaGo Zero discovered a remarkable level of Go knowledge during its self-play training process. This included fundamental elements of human Go knowledge, and also non-standard strategies beyond the scope of traditional Go knowledge. Figure 5 shows a timeline indicating when professional joseki (corner sequences) were discovered (Figure 5a, Extended Data Figure 1); ultimately AlphaGo Zero preferred new joseki variants that were previously unknown (Figure 5b, Extended Data Figure 2). Figure 5c and the Supplementary Information show several fast self-play games played at different stages of training. Tournament length games played at regular intervals throughout training are shown in Extended Data Figure 3 and the Supplementary Information. AlphaGo Zero rapidly progressed from entirely random moves towards a sophisticated understanding of Go concepts including fuseki (opening), tesuji (tactics), life-and-death, ko (repeated board situations), yose (endgame), capturing races, sente (initiative), shape, influence and territory, all discovered from first principles. Surprisingly, shicho ("ladder" capture sequences that may span the whole board) – one of the first elements of Go knowledge learned by humans – were only understood by AlphaGo Zero much later in training.

Final Performance of AlphaGo Zero

We subsequently applied our reinforcement learning pipeline to a second instance of AlphaGo Zero, using a larger neural network and over a longer duration. Training again started from completely random behaviour and continued for approximately 40 days. Over the course of training, 29 million games of self-play were generated. Parameters were updated from 3.1 million mini-batches of 2,048 positions each. The neural network contained 40 residual blocks.

Neural Network Architecture

Each residual block applies the following modules, sequentially, to its input:
1. A convolution of 256 filters of kernel size 3 × 3 with stride 1
2. Batch normalisation
3. A rectifier non-linearity
4. A convolution of 256 filters of kernel size 3 × 3 with stride 1
5. Batch normalisation
6. A skip connection that adds the input to the block
7. A rectifier non-linearity

The output of the residual tower is passed into two separate "heads" for computing the policy and value respectively. The policy head applies the following modules:
1. A convolution of 2 filters of kernel size 1 × 1 with stride 1
2. Batch normalisation
3. A rectifier non-linearity
4. A fully connected linear layer that outputs a vector of size 19^2 + 1 = 362, corresponding to logit probabilities for all intersections and the pass move

The value head applies the following modules:
1. A convolution of 1 filter of kernel size 1 × 1 with stride 1
2. Batch normalisation
3. A rectifier non-linearity
4. A fully connected linear layer to a hidden layer of size 256
5. A rectifier non-linearity
6. A fully connected linear layer to a scalar
7. A tanh non-linearity outputting a scalar in the range [−1, 1]

The overall network depth, in the 20- or 40-block network, is 39 or 79 parameterised layers respectively for the residual tower, plus an additional 2 layers for the policy head and 3 layers for the value head. We note that a different variant of residual networks was simultaneously applied to computer Go 33 and achieved amateur dan-level performance; however this was restricted to a single-headed policy network trained solely by supervised learning.

Neural Network Architecture Comparison

Figure 4 shows the results of a comparison between network architectures. Specifically, we compared four different neural networks:

dual-res: The network contains a 20-block residual tower, as described above, followed by both a policy head and a value head. This is the architecture used in AlphaGo Zero.
sep-res: The network contains two 20-block residual towers. The first tower is followed by a policy head and the second tower is followed by a value head.
dual-conv: The network contains a non-residual tower of 12 convolutional blocks, followed by both a policy head and a value head.
sep-conv: The network contains two non-residual towers of 12 convolutional blocks. The first tower is followed by a policy head and the second tower is followed by a value head. This is the architecture used in AlphaGo Lee.

Each network was trained on a fixed data-set containing the final 2 million games of self-play data generated by a previous run of AlphaGo Zero, using stochastic gradient descent with the annealing rate, momentum, and regularisation hyperparameters described for the supervised learning experiment; however, cross-entropy and mean-squared error components were weighted equally, since more data was available.
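A compressed PyTorch sketch of the dual-headed residual architecture described above follows, showing one residual block plus the policy and value heads. Layer sizes follow the text; the input convolution, the number of input planes and all other details are assumptions rather than the published implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19                       # 19 x 19 board; 19^2 + 1 = 362 moves including pass

class ResidualBlock(nn.Module):
    """Conv 256 3x3 -> BN -> ReLU -> Conv 256 3x3 -> BN -> skip connection -> ReLU."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)

class DualResNet(nn.Module):
    """Residual tower followed by a policy head and a value head ("dual-res")."""
    def __init__(self, in_planes=17, channels=256, blocks=20):
        super().__init__()
        # Input convolution (an assumption; the extracted text starts at the blocks).
        self.stem = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, stride=1, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.tower = nn.Sequential(*[ResidualBlock(channels) for _ in range(blocks)])
        # Policy head: 2 filters of 1x1 -> BN -> ReLU -> linear to 19^2 + 1 = 362 logits.
        self.policy_conv = nn.Sequential(
            nn.Conv2d(channels, 2, 1, stride=1), nn.BatchNorm2d(2), nn.ReLU())
        self.policy_fc = nn.Linear(2 * BOARD * BOARD, BOARD * BOARD + 1)
        # Value head: 1 filter of 1x1 -> BN -> ReLU -> linear(256) -> ReLU -> scalar -> tanh.
        self.value_conv = nn.Sequential(
            nn.Conv2d(channels, 1, 1, stride=1), nn.BatchNorm2d(1), nn.ReLU())
        self.value_fc = nn.Sequential(
            nn.Linear(BOARD * BOARD, 256), nn.ReLU(), nn.Linear(256, 1), nn.Tanh())

    def forward(self, x):
        h = self.tower(self.stem(x))
        p = self.policy_fc(self.policy_conv(h).flatten(1))   # move logits, (batch, 362)
        v = self.value_fc(self.value_conv(h).flatten(1))     # value in [-1, 1]
        return p, v.squeeze(-1)
```

A forward pass on a batch of encoded positions returns move logits and a scalar value, matching (p, v) = fθ(s) in the main text.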
Evaluation

We evaluated the relative strength of AlphaGo Zero (Figures 3a and 6) by measuring the Elo rating of each player. We estimate the probability that player a will defeat player b by a logistic function, p(a defeats b) = 1 / (1 + exp(celo(e(b) − e(a)))), and estimate the ratings e(·) by Bayesian logistic regression, computed by the BayesElo program 25 using the standard constant celo = 1/400.

Elo ratings were computed from the results of a 5 second per move tournament between AlphaGo Zero, AlphaGo Master, AlphaGo Lee, and AlphaGo Fan. The raw neural network from AlphaGo Zero was also included in the tournament. The Elo ratings of AlphaGo Fan, Crazy Stone, Pachi and GnuGo were anchored to the tournament values from prior work 12, and correspond to the players reported in that work. The results of the matches of AlphaGo Fan against Fan Hui and AlphaGo Lee against Lee Sedol were also included to ground the scale to human references, as otherwise the Elo ratings of AlphaGo are unrealistically high due to self-play bias. The Elo ratings in Figures 3a, 4a and 6a were computed from the results of evaluation games between each iteration of player αθi during self-play training. Further evaluations were also performed against baseline players with Elo ratings anchored to the previously published values 12.

We measured the head-to-head performance of AlphaGo Zero against AlphaGo Lee, and of the 40-block instance of AlphaGo Zero against AlphaGo Master, using the same player and match conditions as were used against Lee Sedol in Seoul, 2016. Each player received 2 hours of thinking time plus 3 byoyomi periods of 60 seconds per move. All games were scored using Chinese rules with a komi of 7.5 points.

Data Availability

The datasets used for validation and testing are the GoKifu dataset (available from http://gokifu.com/) and the KGS dataset (available from https://u-go.net/gamerecords/).
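A small sketch of the Elo win-probability model used in the Evaluation section above; the example ratings are hypothetical and only illustrate the effect of the 1/400 scaling.

```python
import math

def p_defeats(e_a, e_b, c_elo=1.0 / 400):
    """p(a defeats b) = 1 / (1 + exp(c_elo * (e(b) - e(a)))), as in the text."""
    return 1.0 / (1.0 + math.exp(c_elo * (e_b - e_a)))

# Hypothetical ratings: a 200-point Elo gap gives about a 0.62 win probability
# under this model.
print(round(p_defeats(3200, 3000), 3))
```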
                                      KGS train   KGS test   GoKifu validation
Supervised learning (20 block)          62.0        60.4          54.3
Supervised learning (12 layer 12)       59.1        55.9           -
Reinforcement learning (20 block)        -           -            49.0
Reinforcement learning (40 block)        -           -            51.3

Extended Data Table 1: Move prediction accuracy. Percentage accuracies of move prediction for neural networks trained by reinforcement learning (i.e. AlphaGo Zero) or supervised learning respectively. For supervised learning, the network was trained for 3 days on KGS data (amateur games); comparative results are also shown from Silver et al. 12. For reinforcement learning, the 20-block network was trained for 3 days and the 40-block network was trained for 40 days. Networks were also evaluated on a validation set based on professional games from the GoKifu data set.

                                      KGS train   KGS test   GoKifu validation
Supervised learning (20 block)          0.177       0.185         0.207
Supervised learning (12 layer 12)       0.19        0.37           -
Reinforcement learning (20 block)        -           -            0.177
Reinforcement learning (40 block)        -           -            0.180

Extended Data Table 2: Game outcome prediction error. Mean squared error on game outcome predictions for neural networks trained by reinforcement learning (i.e. AlphaGo Zero) or supervised learning respectively. For supervised learning, the network was trained for 3 days on KGS data (amateur games); comparative results are also shown from Silver et al. 12. For reinforcement learning, the 20-block network was trained for 3 days and the 40-block network was trained for 40 days. Networks were also evaluated on a validation set based on professional games from the GoKifu data set.

Thousands of steps   Reinforcement learning   Supervised learning
0–200                       10^-2                   10^-1
200–400                     10^-2                   10^-2
400–600                     10^-3                   10^-3
600–700                     10^-4                   10^-4
700–800                     10^-4                   10^-5
>800                        10^-4                    -

Extended Data Table 3: Learning rate schedule. Learning rate used during reinforcement learning and supervised learning experiments, measured in thousands of steps (mini-batch updates).
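The reinforcement learning column of Extended Data Table 3 is a simple step function of the optimisation step; a direct transcription for illustration:

```python
def rl_learning_rate(step):
    """Learning rate for the reinforcement learning run, per Extended Data Table 3.
    `step` counts mini-batch updates (the table is given in thousands of steps)."""
    thousands = step / 1000
    if thousands < 400:
        return 1e-2
    if thousands < 600:
        return 1e-3
    return 1e-4
```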
Extended Data Figure 1: Frequency of occurrence over time during training, for each joseki from Figure 5a (corner sequences common in professional play that were discovered by AlphaGo Zero). The corresponding joseki are reproduced in the figure as insets.

Extended Data Figure 2: Frequency of occurrence over time during training, for each joseki from Figure 5b (corner sequences that AlphaGo Zero favoured for at least one iteration), and one additional variation. The corresponding joseki are reproduced in the figure as insets.

Extended Data Figure 3: AlphaGo Zero (20 block) self-play games. The 3 day training run was subdivided into 20 periods. The best player from each period (as selected by the evaluator) played a single game against itself, with 2 hour time controls. 100 moves are shown for each game; full games are provided in Supplementary Information.

Extended Data Figure 4: AlphaGo Zero (40 block) self-play games. The 40 day training run was subdivided into 20 periods. The best player from each period (as selected by the evaluator) played a single game against itself, with 2 hour time controls. 100 moves are shown for each game; full games are provided in Supplementary Information.

Extended Data Figure 5: Tournament games between AlphaGo Zero (20 block, 3 day) and AlphaGo Lee, using 2 hour time controls. 100 moves of the first 20 games are shown; full games are provided in Supplementary Information.

Extended Data Figure 6: Tournament games between AlphaGo Zero (40 block, 40 day) and AlphaGo Master, using 2 hour time controls. 100 moves of the first 20 games are shown; full games are provided in Supplementary Information.
