Tài liệu hạn chế xem trước, để xem đầy đủ mời bạn chọn Tải xuống
1
/ 672 trang
THÔNG TIN TÀI LIỆU
Thông tin cơ bản
Định dạng
Số trang
672
Dung lượng
15,43 MB
Nội dung
BayesianReasoningandMachineLearning David Barber c 2007,2008,2009,2010,2011,2012,2013 Notation List V a calligraphic symbol typically denotes a set of random variables dom(x) Domain of a variable x=x The variable x is in the state x p(x = tr) probability of event/variable x being in the state true p(x = fa) probability of event/variable x being in the state false p(x, y) probability of x and y p(x ∩ y) probability of x and y p(x ∪ y) probability of x or y p(x|y) The probability of x conditioned on y X ⊥⊥ Y| Z Variables X are independent of variables Y conditioned on variables Z 11 X Variables X are dependent on variables Y conditioned on variables Z 11 Y| Z x f (x) For continuous variables this is shorthand for ables means summation over the states of x, f (x)dx and for discrete varix f (x) 18 I [S] Indicator : has value if the statement S is true, otherwise 19 pa (x) The parents of node x 26 ch (x) The children of node x 26 ne (x) Neighbours of node x 26 dim (x) For a discrete variable x, this denotes the number of states x can take 34 f (x) The average of the function f (x) with respect to the distribution p(x) 158 p(x) δ(a, b) dim x Delta function For discrete a, b, this is the Kronecker delta, δa,b and for continuous a, b the Dirac delta function δ(a − b) 160 The dimension of the vector/matrix x 171 (x = s, y = t) The number of times x is in state s and y in state t simultaneously 197 x y The number of times variable x is in state y 278 D Dataset 291 n Data index 291 N Number of dataset training points 291 S Sample Covariance matrix 315 σ(x) The logistic sigmoid 1/(1 + exp(−x)) 353 erf(x) The (Gaussian) error function 353 xa:b xa , xa+1 , , xb 455 i∼j The set of unique neighbouring edges on a graph 585 Im The m × m identity matrix 605 II DRAFT February 25, 2014 Preface The data explosion We live in a world that is rich in data, ever increasing in scale This data comes from many different sources in science (bioinformatics, astronomy, physics, environmental monitoring) and commerce (customer databases, financial transactions, engine monitoring, speech recognition, surveillance, search) Possessing the knowledge as to how to process and extract value from such data is therefore a key and increasingly important skill Our society also expects ultimately to be able to engage with computers in a natural manner so that computers can ‘talk’ to humans, ‘understand’ what they say and ‘comprehend’ the visual world around them These are difficult large-scale information processing tasks and represent grand challenges for computer science and related fields Similarly, there is a desire to control increasingly complex systems, possibly containing many interacting parts, such as in robotics and autonomous navigation Successfully mastering such systems requires an understanding of the processes underlying their behaviour Processing and making sense of such large amounts of data from complex systems is therefore a pressing modern day concern and will likely remain so for the foreseeable future MachineLearningMachineLearning is the study of data-driven methods capable of mimicking, understanding and aiding human and biological information processing tasks In this pursuit, many related issues arise such as how to compress data, interpret and process it Often these methods are not necessarily directed to mimicking directly human processing but rather to enhance it, such as in predicting the stock market or retrieving information rapidly In this probability theory is key since inevitably our limited data and understanding of the problem forces us to address uncertainty In the broadest sense, MachineLearningand related fields aim to ‘learn something useful’ about the environment within which the agent operates MachineLearning is also closely allied with Artificial Intelligence, with MachineLearning placing more emphasis on using data to drive and adapt the model In the early stages of MachineLearningand related areas, similar techniques were discovered in relatively isolated research communities This book presents a unified treatment via graphical models, a marriage between graph and probability theory, facilitating the transference of MachineLearning concepts between different branches of the mathematical and computational sciences Whom this book is for The book is designed to appeal to students with only a modest mathematical background in undergraduate calculus and linear algebra No formal computer science or statistical background is required to follow the book, although a basic familiarity with probability, calculus and linear algebra would be useful The book should appeal to students from a variety of backgrounds, including Computer Science, Engineering, applied Statistics, Physics, and Bioinformatics that wish to gain an entry to probabilistic approaches in MachineLearning In order to engage with students, the book introduces fundamental concepts in inference using III only minimal reference to algebra and calculus More mathematical techniques are postponed until as and when required, always with the concept as primary and the mathematics secondary The concepts and algorithms are described with the aid of many worked examples The exercises and demonstrations, together with an accompanying MATLAB toolbox, enable the reader to experiment and more deeply understand the material The ultimate aim of the book is to enable the reader to construct novel algorithms The book therefore places an emphasis on skill learning, rather than being a collection of recipes This is a key aspect since modern applications are often so specialised as to require novel methods The approach taken throughout is to describe the problem as a graphical model, which is then translated into a mathematical framework, ultimately leading to an algorithmic implementation in the BRMLtoolbox The book is primarily aimed at final year undergraduates and graduates without significant experience in mathematics On completion, the reader should have a good understanding of the techniques, practicalities and philosophies of probabilistic aspects of MachineLearningand be well equipped to understand more advanced research level material The structure of the book The book begins with the basic concepts of graphical models and inference For the independent reader chapters 1,2,3,4,5,9,10,13,14,15,16,17,21 and 23 would form a good introduction to probabilistic reasoning, modelling andMachineLearning The material in chapters 19, 24, 25 and 28 is more advanced, with the remaining material being of more specialised interest Note that in each chapter the level of material is of varying difficulty, typically with the more challenging material placed towards the end of each chapter As an introduction to the area of probabilistic modelling, a course can be constructed from the material as indicated in the chart The material from parts I and II has been successfully used for courses on Graphical Models I have also taught an introduction to Probabilistic MachineLearning using material largely from part III, as indicated These two courses can be taught separately and a useful approach would be to teach first the Graphical Models course, followed by a separate Probabilistic MachineLearning course A short course on approximate inference can be constructed from introductory material in part I and the more advanced material in part V, as indicated The exact inference methods in part I can be covered relatively quickly with the material in part V considered in more in depth A timeseries course can be made by using primarily the material in part IV, possibly combined with material from part I for students that are unfamiliar with probabilistic modelling approaches Some of this material, particularly in chapter 25 is more advanced and can be deferred until the end of the course, or considered for a more advanced course The references are generally to works at a level consistent with the book material and which are in the most part readily available Accompanying code The BRMLtoolbox is provided to help readers see how mathematical models translate into actual MATLAB code There are a large number of demos that a lecturer may wish to use or adapt to help illustrate the material In addition many of the exercises make use of the code, helping the reader gain confidence in the concepts and their application Along with complete routines for many MachineLearning methods, the philosophy is to provide low level routines whose composition intuitively follows the mathematical description of the algorithm In this way students may easily match the mathematics with the corresponding algorithmic implementation IV DRAFT February 25, 2014 Part II: Learning in Probabilistic Models Part III: MachineLearning Part IV: Dynamical Models Part V: Approximate Inference 1: 2: 3: 4: 5: 6: 7: Probabilistic Modelling Course Time-series Short Course Approximate Inference Short Course Probabilistic MachineLearning Course Graphical Models Course Part I: Inference in Probabilistic Models Probabilistic Reasoning Basic Graph Concepts Belief Networks Graphical Models Efficient Inference in Trees The Junction Tree Algorithm Making Decisions 8: Statistics for MachineLearning 9: Learning as Inference 10: Naive Bayes 11: Learning with Hidden Variables 12: Bayesian Model Selection 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: MachineLearning Concepts Nearest Neighbour Classification Unsupervised Linear Dimension Reduction Supervised Linear Dimension Reduction Linear Models Bayesian Linear Models Gaussian Processes Mixture Models Latent Linear Models Latent Ability Models 23: 24: 25: 26: Discrete-State Markov Models Continuous-State Markov Models Switching Linear Dynamical Systems Distributed Computation 27: Sampling 28: Deterministic Approximate Inference Website The BRMLtoolbox along with an electronic version of the book is available from www.cs.ucl.ac.uk/staff/D.Barber/brml Instructors seeking solutions to the exercises can find information at the website, along with additional teaching materials DRAFT February 25, 2014 V Other books in this area The literature on MachineLearning is vast with much relevant literature also contained in statistics, engineering and other physical sciences A small list of more specialised books that may be referred to for deeper treatments of specific topics is: • Graphical models – Graphical models by S Lauritzen, Oxford University Press, 1996 – Bayesian Networks and Decision Graphs by F Jensen and T D Nielsen, Springer Verlag, 2007 – Probabilistic Networks and Expert Systems by R G Cowell, A P Dawid, S L Lauritzen and D J Spiegelhalter, Springer Verlag, 1999 – Probabilistic Reasoning in Intelligent Systems by J Pearl, Morgan Kaufmann, 1988 – Graphical Models in Applied Multivariate Statistics by J Whittaker, Wiley, 1990 – Probabilistic Graphical Models: Principles and Techniques by D Koller and N Friedman, MIT Press, 2009 • MachineLearningand Information Processing – Information Theory, Inference andLearning Algorithms by D J C MacKay, Cambridge University Press, 2003 – Pattern Recognition andMachineLearning by C M Bishop, Springer Verlag, 2006 – An Introduction To Support Vector Machines, N Cristianini and J Shawe-Taylor, Cambridge University Press, 2000 – Gaussian Processes for MachineLearning by C E Rasmussen and C K I Williams, MIT press, 2006 Acknowledgements Many people have helped this book along the way either in terms of reading, feedback, general insights, allowing me to present their work, or just plain motivation Amongst these I would like to thank Dan Cornford, Massimiliano Pontil, Mark Herbster, John Shawe-Taylor, Vladimir Kolmogorov, Yuri Boykov, Tom Minka, Simon Prince, Silvia Chiappa, Bertrand Mesot, Robert Cowell, Ali Taylan Cemgil, David Blei, Jeff Bilmes, David Cohn, David Page, Peter Sollich, Chris Williams, Marc Toussaint, Amos Storkey, Zakria Hussain, Le Chen, Seraf´ın Moral, Milan Studen´ y, Luc De Raedt, Tristan Fletcher, Chris Vryonides, Yannis ++ haralambous, Tom Furmston, Ed Challis and Chris Bracegirdle I would also like to thank the many students that have helped improve the material during lectures over the years I’m particularly grateful to Taylan Cemgil for allowing his GraphLayout package to be bundled with the BRMLtoolbox The staff at Cambridge University Press have been a delight to work with and I would especially like to thank Heather Bergman for her initial endeavors and the wonderful Diana Gillooly for her continued enthusiasm A heartfelt thankyou to my parents and sister – I hope this small token will make them proud I’m also fortunate to be able to acknowledge the support and generosity of friends throughout Finally, I’d like to thank Silvia who made it all worthwhile VI DRAFT February 25, 2014 BRMLtoolbox The BRMLtoolbox is a lightweight set of routines that enables the reader to experiment with concepts in graph theory, probability theory andMachineLearning The code contains basic routines for manipulating discrete variable distributions, along with more limited support for continuous variables In addition there are many hard-coded standard MachineLearning algorithms The website contains also a complete list of all the teaching demos and related exercise material BRMLTOOLKIT Graph Theory ancestors ancestralorder descendents children edges elimtri connectedComponents istree neigh noselfpath parents spantree triangulate triangulatePorder - Return the ancestors of nodes x in DAG A Return the ancestral order or the DAG A (oldest first) Return the descendents of nodes x in DAG A return the children of variable x given adjacency matrix A Return edge list from adjacency matrix A Return a variable elimination sequence for a triangulated graph Find the connected components of an adjacency matrix Check if graph is singly-connected Find the neighbours of vertex v on a graph with adjacency matrix G return a path excluding self transitions return the parents of variable x given adjacency matrix A Find a spanning tree from an edge list Triangulate adjacency matrix A Triangulate adjacency matrix A according to a partial ordering Potential manipulation condpot changevar dag deltapot disptable divpots drawFG drawID drawJTree drawNet evalpot exppot eyepot grouppot groupstate logpot markov maxpot maxsumpot multpots numstates - Return a potential conditioned on another variable Change variable names in a potential Return the adjacency matrix (zeros on diagonal) for a Belief Network A delta function potential Print the table of a potential Divide potential pota by potb Draw the Factor Graph A plot an Influence Diagram plot a Junction Tree plot network Evaluate the table of a potential when variables are set exponential of a potential Return a unit potential Form a potential based on grouping variables together Find the state of the group variables corresponding to a given ungrouped state logarithm of the potential Return a symmetric adjacency matrix of Markov Network in pot Maximise a potential over variables Maximise or Sum a potential over variables Multiply potentials into a single potential Number of states of the variables in a potential VII orderpot orderpotfields potsample potscontainingonly potvariables setevpot setpot setstate squeezepots sumpot sumpotID sumpots table ungrouppot uniquepots whichpot - Return potential with variables reordered according to order Order the fields of the potential, creating blank entries where necessary Draw sample from a single potential Returns those potential numbers that contain only the required variables Returns information about all variables in a set of potentials Sets variables in a potential into evidential states sets potential variables to specified states set a potential’s specified joint state to a specified value Eliminate redundant potentials (those contained wholly within another) Sum potential pot over variables Return the summed probability and utility tables from an ID Sum a set of potentials Return the potential table Form a potential based on ungrouping variables Eliminate redundant potentials (those contained wholly within another) Returns potentials that contain a set of variables Routines also extend the toolbox to deal with Gaussian potentials: multpotsGaussianMoment.m, sumpotGaussianCanonical.m, sumpotGaussianMoment.m, multpotsGaussianCanonical.m See demoSumprodGaussCanon.m, demoSumprodGaussCanonLDS.m, demoSumprodGaussMoment.m Inference absorb absorption absorptionID ancestralsample binaryMRFmap bucketelim condindep condindepEmp condindepPot condMI FactorConnectingVariable FactorGraph IDvars jtassignpot jtree jtreeID LoopyBP MaxFlow maxNpot maxNprodFG maxprodFG MDPemDeterministicPolicy MDPsolve MesstoFact metropolis mostprobablepath mostprobablepathmult sumprodFG - Update potentials in absorption message passing on a Junction Tree Perform full round of absorption on a Junction Tree Perform full round of absorption on an Influence Diagram Ancestral sampling from a Belief Network get the MAP assignment for a binary MRF with positive W Bucket Elimination on a set of potentials Conditional Independence check using graph of variable interactions Compute the empirical log Bayes Factor and MI for independence/dependence Numerical conditional independence measure conditional mutual information I(x,y|z) of a potential Factor nodes connecting to a set of variables Returns a Factor Graph adjacency matrix based on potentials probability and decision variables from a partial order Assign potentials to cliques in a Junction Tree Setup a Junction Tree based on a set of potentials Setup a Junction Tree based on an Influence Diagram loopy Belief Propagation using sum-product algorithm Ford Fulkerson max flow - cut algorithm (breadth first search) Find the N most probable values and states in a potential N-Max-Product algorithm on a Factor Graph (Returns the Nmax most probable States) Max-Product algorithm on a Factor Graph Solve MDP using EM with deterministic policy Solve a Markov Decision Process Returns the message numbers that connect into factor potential Metropolis sample Find the most probable path in a Markov Chain Find the all source all sink most probable paths in a Markov Chain Sum-Product algorithm on a Factor Graph represented by A - Learn AR coefficients using a Linear Dynamical System Fit autoregressive (AR) coefficients of order L to v Bayesian Linear Regression training using basis functions phi(x) Bayesian Logistic Regression with the Relevance Vector Machine Canonical Variates (no post rotation of variates) canonical correlation analysis Gamma Exponential Covariance Function train a Belief Network using Expectation Maximisation MDP deterministic policy solver Finds optimal actions EM marginal transition in MDP Returns term proportional to the q marginal for the utility term backward information needed to solve the MDP process using message passing MDP solver calculates the value function of the MDP with the current policy Factor Analysis Specific Models ARlds ARtrain BayesLinReg BayesLogRegressionRVM CanonVar cca covfnGE EMbeliefnet EMminimizeKL EMqTranMarginal EMqUtilMarginal EMTotalBetaMessage EMvalueTable FA VIII DRAFT February 25, 2014 GMMem GPclass GPreg HebbML HMMbackward HMMbackwardSAR HMMem HMMforward HMMforwardSAR HMMgamma HMMsmooth HMMsmoothSAR HMMviterbi kernel Kmeans LDSbackward LDSbackwardUpdate LDSforward LDSforwardUpdate LDSsmooth LDSsubspace LogReg MIXprodBern mixMarkov NaiveBayesDirichletTest NaiveBayesDirichletTrain NaiveBayesTest NaiveBayesTrain nearNeigh pca plsa plsaCond rbf SARlearn SLDSbackward SLDSforward SLDSmargGauss softloss svdm SVMtrain - Fit a mixture of Gaussian to the data X using EM Gaussian Process Binary Classification Gaussian Process Regression Learn a sequence for a Hopfield Network HMM Backward Pass Backward Pass (beta method) for the Switching Autoregressive HMM EM algorithm for HMM HMM Forward Pass Switching Autoregressive HMM with switches updated only every Tskip timesteps HMM Posterior smoothing using the Rauch-Tung-Striebel correction method Smoothing for a Hidden Markov Model (HMM) Switching Autoregressive HMM smoothing Viterbi most likely joint hidden state of a HMM A kernel evaluated at two points K-means clustering algorithm Full Backward Pass for a Latent Linear Dynamical System (RTS correction method) Single Backward update for a Latent Linear Dynamical System (RTS smoothing update) Full Forward Pass for a Latent Linear Dynamical System (Kalman Filter) Single Forward update for a Latent Linear Dynamical System (Kalman Filter) Linear Dynamical System : Filtering and Smoothing Subspace Method for identifying Linear Dynamical System Learning Logistic Linear Regression Using Gradient Ascent (BATCH VERSION) EM training of a Mixture of a product of Bernoulli distributions EM training for a mixture of Markov Models Naive Bayes prediction having used a Dirichlet prior for training Naive Bayes training using a Dirichlet prior Test Naive Bayes Bernoulli Distribution after Max Likelihood training Train Naive Bayes Bernoulli Distribution using Max Likelihood Nearest Neighbour classification Principal Components Analysis Probabilistic Latent Semantic Analysis Conditional PLSA (Probabilistic Latent Semantic Analysis) Radial Basis function output EM training of a Switching AR model Backward pass using a Mixture of Gaussians Switching Latent Linear Dynamical System Gaussian Sum forward pass compute the single Gaussian from a weighted SLDS mixture Soft loss function Singular Value Decomposition with missing values train a Support vector Machine - performs argmax returning the index and value Assigns values to variables p(x>y) for x~Beta(a,b), y~Beta(c,d) Plot a 3D bar plot of the matrix Z Average of a logistic sigmoid under a Gaussian Cap x at absolute value c inverse of the chi square cumulative density for a data matrix (each column is a datapoint), return the state counts Compute normalised p proportional to exp(logp); Make a conditional distribution from the matrix Samples from a Dirichlet distribution Place the field of a structure in a cell Return the mean and covariance of a conditioned Gaussian Plot a Hinton diagram Subscript vector from linear index True for member of sorted set Length of each cell entry Log determinant of a positive definite matrix computed in a numerically stable manner log(x+eps) unnormalised log of the Gauss-Gamma distribution Compute log(sum(exp(a).*b)) valid for large a Log Normalisation constant of a Dirichlet distribution with parameter u Return majority values in each column on a matrix Maximise a multi-dimensional array over a set of dimensions Find the highest values and states of an array over a set of dimensions General argmax assign betaXbiggerY bar3zcolor avsigmaGauss cap chi2test count condexp condp dirrnd field2cell GaussCond hinton ind2subv ismember_sorted lengthcell logdet logeps logGaussGamma logsumexp logZdirichlet majority maxarray maxNarray DRAFT February 25, 2014 IX mix2mix mvrandn mygamrnd mynanmean mynansum mynchoosek myones myrand myzeros normp randgen replace sigma sigmoid sqdist subv2ind sumlog - Fit a mixture of Gaussians with another mixture of Gaussians Samples from a multi-variate Normal(Gaussian) distribution Gamma random variate generator mean of values that are not nan sum of values that are not nan binomial coefficient v choose k same as ones(x), but if x is a scalar, interprets as ones([x 1]) same as rand(x) but if x is a scalar interprets as rand([x 1]) same as zeros(x) but if x is a scalar interprets as zeros([x 1]) Make a normalised distribution from an array Generates discrete random variables given the pdf Replace instances of a value with another value 1./(1+exp(-x)) 1./(1+exp(-beta*x)) Square distance between vectors in x and y Linear index from subscript vector sum(log(x)) with a cutoff at 10e-200 - Compatibility of object F being in position h for image v on grid Gx,Gy The logarithm of a specific non-Gaussian distribution Place the object F at position h in grid Gx,Gy return points for plotting an ellipse of a covariance unit variance contours of a 2D Gaussian with mean m and covariance S run me at initialisation checks for bugs in matlab and initialises path Returns if point is on a defined grid Miscellaneous compat logp placeobject plotCov pointsCov setup validgridposition X DRAFT February 25, 2014 BIBLIOGRAPHY BIBLIOGRAPHY [299] M Tsodyks, K Pawelzik, and H Markram Neural Networks with Dynamic Synapses Neural Computation, 10:821–835, 1998 [300] L van der Matten and G Hinton Visualizing Data using t-SNE Journal of MachineLearning Research, 9:2579–2605, 2008 [301] P Van Overschee and B De Moor Subspace Identification for Linear Systems; Theory, Implementations, Applications Kluwer, 1996 [302] V Vapnik The Nature of Statistical Learning Theory Springer, New York, 1995 [303] M Verhaegen and P Van Dooren Numerical Aspects of Different Kalman Filter Implementations IEEE Transactions of Automatic Control, 31(10):907–917, 1986 [304] T Verma and J Pearl Causal networks : Semantics and expressiveness In R D Schacter, T S Levitt, L N Kanal, and J.F Lemmer, editors, Uncertainty in Artificial Intelligence, volume 4, pages 69–76, Amsterdam, 1990 North-Holland [305] T O Virtanen, A T Cemgil, and S J Godsill Bayesian extensions to nonnegative matrix factorisation for audio signal modelling In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 1825–1828, 2008 [306] G Wahba Support Vector Machines, Repreducing Kernel Hilbert Spaces, and Randomized GACV, pages 69–88 MIT Press, 1999 [307] M J Wainwright and M I Jordan Graphical models, exponential families, and variational inference Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008 [308] H Wallach Efficient training of conditional random fields Master’s thesis, Division of Informatics, University of Edinburgh, 2002 [309] Y Wang, J Hodges, and B Tang Classification of Web Documents Using a Naive Bayes Method 15th IEEE International Conference on Tools with Artificial Intelligence, pages 560–564, 2003 [310] S Waterhouse, D Mackay, and T Robinson Bayesian methods for mixtures of experts In D S Touretzky, M Mozer, and M E Hasselmo, editors, Advances in Neural Information Processing Systems (NIPS), number 8, pages 351–357, Cambridge, MA, 1996 MIT Press [311] C Watkins and P Dayan Q-learning Machine Learning, 8:279–292, 1992 [312] Y Weiss and W T Freeman Correctness of Belief Propagation in Gaussian Graphical Models of Arbitrary Topology Neural Computation, 13(10):2173–2200, 2001 [313] M Welling, T P Minka, and Y W Teh Structured Region Graphs: Morphing EP into GBP In F Bacchus and T Jaakkola, editors, Uncertainty in Artificial Intelligence, number 21, pages 609–614, Corvallis, Oregon, USA, 2005 AUAI press [314] J Whittaker Graphical Models in Applied Multivariate Statistics John Wiley & Sons, 1990 [315] W Wiegerinck Variational approximations between mean field theory and the Junction Tree algorithm In C Boutilier and M Goldszmidt, editors, Uncertainty in Artificial Intelligence, number 16, pages 626–633, San Francisco, CA, 2000 Morgan Kaufmann [316] W Wiegerinck and T Heskes Fractional Belief Propagation In S Becker, S Thrun, and K Obermayer, editors, Advances in Neural Information Processing Systems (NIPS), number 15, pages 438–445, Cambridge, MA, 2003 MIT Press [317] C K I Williams Computing with infinite networks In M C Mozer, M I Jordan, and T Petsche, editors, Advances in Neural Information Processing Systems NIPS 9, pages 295–301, Cambridge, MA, 1997 MIT Press [318] C K I Williams and D Barber Bayesian classification with Gaussian processes IEEE Trans Pattern Analysis andMachine Intelligence, 20:1342–1351, 1998 [319] C Yanover and Y Weiss Finding the M Most Probable Configurations Using Loopy Belief Propagation In S Thrun, L Saul, and B Schă olkopf, editors, Advances in Neural Information Processing Systems (NIPS), number 16, pages 1457–1464, Cambridge, MA, 2004 MIT Press [320] J S Yedidia, W T Freeman, and Y Weiss Constructing free-energy approximations and generalized belief propagation algorithms Information Theory, IEEE Transactions on, 51(7):2282–2312, July 2005 634 DRAFT February 25, 2014 BIBLIOGRAPHY BIBLIOGRAPHY [321] S Young, D Kershaw, J Odell, D Ollason, V Valtchev, and P Woodland The HTK Book Version 3.0 Cambridge University Press, 2000 [322] A L Yuille and A Rangarajan The concave-convex procedure Neural Computation, 15(4):915–936, 2003 [323] J.-H Zhao, P L H Yu, and Q Jiang ML estimation for factor analysis: EM or non-EM? Statistics and Computing, 18(2):109–123, 2008 [324] O Zoeter Monitoring non-linear and switching dynamical systems PhD thesis, Radboud University Nijmegen, 2005 DRAFT February 25, 2014 635 BIBLIOGRAPHY 636 BIBLIOGRAPHY DRAFT February 25, 2014 Index − of − M coding, 233 N -max-product, 86 α-expansion, 595 α-recursion, 461 β-recursion, 462 γ-recursion, 463 maximum likelihood Hopfield network, 528 naive mean field, 576 absorbing state, 87 absorption, 99 influence diagram, 131 acceptance function, 553 active learning, 293 acyclic, 26 adjacency matrix, 28, 424 algebraic Riccati equation, 496 ancestor, 26 ancestral ordering, 548 ancestral sampling, 548 antifreeze, 256 approximate inference, 115, 419, 569 belief propagation, 586, 587 Bethe free energy, 585 double integration bound, 597 expectation propagation, 587 graph cut, 594 Laplace approximation, 569 switching linear dynamical system, 508 variational approach, 570 variational inference, 573 AR model, see auto-regressive model ARCH model, 488 Artificial Life, 532 asymmetry, 127 asynchronous updating, 575 auto-regressive model, 485 ARCH, 488 GARCH, 488 switching, 500 time-varying, 487 auxiliary variable sampling, 555 average, 158 backtracking, 83 bag of words, 234, 319 batch update, 356 Baum-Welch, 468 Bayes Information Criterion, 275 Bayes’ factor, 208, 267 model selection, 267 theorem, Bayes’ rule, see Bayes’ theorem Bayesian decision theory, 299 hypothesis testing, 267, 276 image denoising, 573 linear model, 367 mixture model, 412 model selection, 267 Occam’s razor, 270 outcome analysis, 276 Bayesian Dirichlet score, 209 Bayesian linear model, 386 BD score, 209 BDeu score, 209 BDeu score, 209 belief network chest clinic, 51 belief network asbestos-smoking-cancer, 202 cascade, 38 divorcing parents, 35 dynamic, 474 noisy AND gate, 35 noisy logic gate, 35 noisy OR gate, 35 sigmoid, 264 structure learning, 205 training Bayesian, 199 belief propagation, 78, 586 loopy, 584 belief revision, see max-product, 85 637 INDEX Bellman’s equation, 135 Bessel function, 392 beta distribution, 165 function, 165, 193 Bethe free energy, 586 bias, 174 unbiased estimator, 174 bigram, 468 binary entropy, 575 binomial coefficient, 163 binomial options pricing model, 141 bioinformatics, 475 black and white sampling, 566 black-box, 300 Black-Scholes option pricing, 142 Blahut-Arimoto algorithm, 583 Boltzmann machine, 59, 71, 214, 598 restricted, 71 bond propagation, 93 Bonferroni inequality, 22 Boolean network, 532 Bradley-Terry-Luce model, 447 bucket elimination, 90 burn in, 551 calculus, 610 canonical correlation analysis, 332 constrained factor analysis, 439 canonical variates, 339 causal consistency, 127 causality, 47, 127 calculus, 49 influence diagrams, 49 post intervention distribution, 49 CCA see canonical correlation analysis, 332 centering, 171 chain graph, 65 chain component, 65 chain rule, 612 chain structure, 84 changepoint model, 518 checkerboard, 566 chest clinic, 51 missing data, 262 with decisions, 147 children, see directed acyclic graph Cholesky, 340 chord, 26 chordal, 107 Chow-Liu tree, 211 classification, 292, 357, 396 Bayesian, 375 boundary, 231 error analysis, 276 638 INDEX linear parameter model, 352 multiple classes, 358 performance, 276 random guessing, 284 softmax, 358 clique, 26 decomposition, 423 graph, 98 matrix, 28, 423 cliquo, 28 Cluster Variation method, 587 clustering, 293 collaborative filtering, 324 collider, see directed acyclic graph commute, 605 compatibility function, 565 competition model Bradley-Terry-Luce, 447 Elo, 448 TrueSkill, 448 competition models, 447 concave function, 612 conditional entropy, 582 conditional likelihood, 227 conditional mutual information, 237 conditional probability, conditional random field, 221, 473 conditioning, 170 loop cut set, 92 conjugate distribution, 173 exponential family, 172 Gaussian, 178, 179 prior, 172 conjugate gradient, 357 conjugate gradients algorithm, 616 conjugate vector, 615 conjugate vectors algorithm, 616 connected components, 27 connected graph, 27 consistent, 103 consistent estimator, 174 control, 578 convex function, 612 correction smoother, 463, 494 correlation matrix, 160 cosine similarity, 321 coupled HMM, 474 covariance, 160 matrix, 160 covariance function, 387, 390 γ-exponential, 392 construction, 391 Gibbs, 393 isotropic, 392 DRAFT February 25, 2014 INDEX Mat´ern, 392 Mercer kernel, 394 neural network, 393 non-stationary, 393 Ornstein-Uhlenbeck, 392, 394 periodic, 392 rational quadratic, 392 smoothness, 394 squared exponential, 392 stationary, 391 CPT, see conditional probability table CRF, see conditional random field critical point, 613 cross-validation, 297 cumulant, 545 curse of dimensionality, 137, 350 cut set conditioning, 91 cycle, 26 D-map, see dependence map DAG, see directed acyclic graph data anomaly detection, 293 catagorical, 157 dyadic, 421 handwritten digits, 357 labelled, 291 monadic, 422 numerical, 157 ordinal, 157 unlabelled, 292 data compression, 415 vector quantisation, 415 decision boundary, 353 decision function, 294 decision theory, 121, 194, 299 decision tree, 122 decomposable, 107 degree, 29 degree of belief, delta function, see Dirac delta function Kronecker, 160 density estimation, 403 Parzen estimator, 414 dependence map, 69 descendant, 26 design matrix, 373, 386 determinant, 606 deterministic latent variable model, 532 differentiation, 611 digamma function, 183 digit data, 315 Dijkstra’s algorithm, 88 dimension reduction linear, 311, 314 DRAFT February 25, 2014 INDEX supervised, 337 dimensionality reduction linear, 316 non-linear, 317 Dirac delta function, 160, 161 directed acyclic graph, 38 ancestor, 26 ancestral order, 28 cascade, 38 children, 26 collider, 40 descendant, 26 family, 26 immorality, 46 moralisation, 63 parents, 26 direction bias, 473 directional derivative, 612 Dirichlet distribution, 167 Dirichlet process mixture models, 418 discount factor, 136 discriminative approach, 300 training, 300 discriminative approach, 299 discriminative training, 471 dissimilarity function, 305 distributed computation, 525 distribution, Bernoulli, 163, 407 beta, 165, 167, 182, 200, 269 binomial, 163 Categorical, 163 change of variables, 158, 181 conjugate, 178 continuous, density, Dirichlet, 167, 183, 202, 208, 418, 419 discrete, divergence, 161 double exponential, 166 empirical, 161, 295 average, 158 expectation, 158 exponential, 164 exponential family, 171 canonical, 171 gamma, 412 mode, 183 Gauss-gamma, 179, 184 Gauss-inverse-gamma, 178 Gaussian canonical exponential form, 172 conditioning, 170 639 INDEX conjugate, 178, 179 entropy, 171 mixture, 186 multivariate, 168 normalisation, 180, 181 partitioned, 170 propagation, 170 system reversal, 170 univariate, 166 inverse gamma, 165 inverse Wishart, 179 isotropic, 169 joint, kurtosis, 160 Laplace, 166 marginal, mode, 159 moment, 159 multinomial, 163 normal, 166 Poisson, 164, 185 Polya, 418 scaled mixture, 167 skewness, 160 Student’s t, 166 uniform, 162, 164 Wishart, 412 domain, double integration bound, 597 dual parameters, 351 dual representation, 351 dyadic data, 421 dynamic Bayesian network, 474 dynamic synapses, 536 dynamical system linear, 483 non-linear, 532 dynamics reversal, 494 early stopping, 357 edge list, 27 efficient IPF, 218 eigen decomposition, 169, 314, 609 equation, 313 function, 610 problem, 333 spectrum, 394 value, 314, 608 Elo model, 447 emission distribution, 460 emission matrix, 489 empirical independence, 207 empirical distribution, 161, 176, 295 empirical risk, 295 640 INDEX penalised, 296 empirical risk minimisation, 296 energy, 245 entropy, 110, 162, 186, 245, 582 differential, 162, 186 Gaussian, 171 EP, see expectation propagation error function, 353 estimator consistent, 174 evidence, see marginal likelihood hard, 35 likelihood, 37 soft, 35 uncertain, 35 virtual, 37 Evidence Procedure, 369 exact sampling, 548 expectation, see average expectation correction, 512, 513 expectation maximisation, 151, 244, 325, 370, 468, 498 algorithm, 244, 245 antifreeze, 256 belief networks, 248 E-step, 245 energy, 245 entropy, 245 failure case, 255 intractable energy, 254 M-step, 245 mixture model, 405 partial E-step, 253 partial M-step, 253 variational Bayes, 258 Viterbi training, 254 expectation propagation, 587 exponential family, 171 canonical form, 171 conjugate, 172 extended observability matrix, 499 face model, 436 factor analysis, 431 factor rotation, 430 probabilistic PCA, 438 training EM, 434 SVD, 432 factor graph, 67 factor loading, 429 family, see directed acyclic graph feature map, 330 filtering, 461 finance, 139 optimal investment, 142 DRAFT February 25, 2014 INDEX option pricing, 140 finite dimensional Gaussian Process, 387 Fisher information, 184 Fisher’s linear discriminant, 337 Floyd-Warshall-Roy algorithm, 89 forward sampling, 548 Forward-Backward, 462 forward-filtering-backward-sampling, 464 Forward-Sampling-Resampling, 564 gamma digamma, 183 distribution, 164 function, 166, 183 GARCH model, 488 Gaussian canonical representation, 168 distribution, 166 moment representation, 168, 492 sub, 160 super, 160 Gaussian mixture model, 409 Bayesian, 412 collapsing the mixture, 511 EM algorithm, 409 infinite problems, 412 k-means, 415 Parzen estimator, 414 symmetry breaking, 412 Gaussian process, 385 classification, 396 Laplace approximation, 397 multiple classes, 400 regression, 388 smoothness, 394 weight space view, 386 Gaussian sum filtering, 508 Gaussian sum smoothing, 512 generalisation, 291, 295 generalised pseudo Bayes, 516 generative approach, 299 model, 299 training, 299 generative approach, 299 Gibbs sampling, 549 Glicko, 448 GMM, see Gaussian mixture model Google, 457 gradient, 611 descent, 613 Gram matrix, 351, 387 Gram-Schmidt procedure, 615 graph, 25 acyclic, 26 adjacency matrix, 28 DRAFT February 25, 2014 INDEX chain, 65 chain structured, 84 chord, 26 chordal, 107 clique, 26, 28, 98 clique matrix, 28 cliquo, 28 connected, 27 cut, 594 cycle, 26 decomposable, 107 descendant, 28 directed, 25 disconnected, 111 edge list, 27 factor, 67 loop, 26 loopy, 27 multiply connected, 27, 105 neighbour, 26 path, 25 separation, 63 set chain, 115 singly connected, 27 skeleton, 46 spanning tree, 27 tree, 27 triangulated, 107 undirected, 25 vertex degree, 29 graph cut algorithm, 594 graph partitioning, 421 Gull-MacKay iteration, 372 Hamilton-Jacobi equation, 135 Hamiltonian dynamics, 556 Hammersley-Clifford theorem, 61 handwritten digits, 357 Hankel matrix, 499 harmonium, see restricted Boltzmann machine Heaviside step function, 423 Hebb, 526 Hebb rule, 526 hedge fund, 282 Hermitian, 606 Hessian, 357, 611 hidden Markov model, 110, 254, 460 α recursion, 461 β recursion, 462 coupled, 474 direction bias, 473 discriminative training, 471 duration model, 471 entropy, 110 filtering, 461 641 INDEX INDEX indicator function, 19 indicator model, 417 induced representation, 106 inference, 75 bond propagation, 93 bucket elimination, 90 causal, 47 cut set conditioning, 91 HMM, 461 linear dynamical system, 490 MAP, 93 marginal, 75 Markov decision process, 138 max-product, 83 message passing, 75 mixed, 89 MPM, 93 sum-product algorithm, 80 transfer matrix, 77 variable elimination, 75 influence diagram, 125 absorption, 131 asymmetry, 127 causal consistency, 127 chest clinic, 146 decision potential, 130 fundamental link, 127 information link, 125 junction tree, 130 I-map, see independence map no forgetting assumption, 126 ICA, 440 partial order, 126 identifiability, 497 probability potential, 130 identity matrix, 605 solving, 129 IID, see independent and identically distributed utility, 125 IM algorithm, see information-maximisation algorithm utility potential, 130 immorality, 46 information link, 125 importance information maximisation, 583 distribution, 561 information retrieval, 320, 457 sampling, 560 information-maximisation algorithm, 582 particle filter, 563 innovation noise, 485 resampling, 562 input-output HMM, 472 sequential, 562 inverse modus ponens, 15 weight, 561 IPF, see iterative proportional fitting, 215 incidence matrix, 28 efficient, 218 independence Ising model, see Markov network, 64, 93 Bayesian, 208 approximate inference, 573 conditional, 11, 32 isotropic, 169, 411 empirical, 207 isotropic covariance functions, 392 map, 68 item response theory, 446 Markov equivalent, 46 Iterated Conditional Modes, 591 mutual information, 163 iterative proportional fitting, 215 naive Bayes, 229 iterative scaling, 220 parameter, 199 Jeffrey’s rule, 36 perfect map, 69 Jensen’s inequality, 598, 613 independent and identically distributed, 173, 191 Joseph’s symmetrized update, 493 independent components analysis, 443 input-output, 472 likelihood, 463 most likely state (MAP), 464 pairwise marginal, 463 Rauch Tung Striebel smoother, 463 smoothing parallel, 462 sequential, 462 viterbi, 86 Viterbi algorithm, 464 hidden variables, 241 HMM, see hidden Markov model Hopfield network, 525 augmented, 534 capacity, 529 Hebb rule, 527 heteroassociative, 534 maximum likelihood, 528 perceptron, 529 pseudo inverse rule, 527 sequence learning, 526 hybrid Monte Carlo, 555 hyper Markov, 224 hyper tree, 109 hyperparameter, 172, 201, 369 hyperplane, 604 hypothesis testing, 267 Bayesian error analysis, 276 642 DRAFT February 25, 2014 INDEX INDEX jump Markov model, see switching linear dynamical active, 293 system anomaly detection, 293 junction tree, 101, 104 Bayesian, 192 absorption, 99 belief network, 196 algorithm, 97, 108 belief networks clique graph, 98 EM, 248 computational complexity, 110 Dirichlet prior, 202 conditional marginal, 112 inference, 191 consistent, 103 nearest neighbour, 305 hyper tree, 109 online, 293 influence diagram, 130 query, 293 marginal likelihood, 111 reinforcement, 293 most likely state, 113 semi-supervised, 294, 416 normalisation constant, 110 sequences, 468, 497, 526 potential, 98 sequential, 293 running intersection property, 102 structure, 205 separator, 98 supervised, 291 strong, 131 unsupervised, 292, 429 strong triangulation, 131 learning rate, 355 tree width, 109 likelihood, 111, 173, 463, 495, 511 triangulation, 106 bound, 244 marginal, 81, 173 k-means, 415 model, 173, 205 Kalman filter, 489 approximate, 274 Kalman gain, 493 pseudo, 224 KD-tree, 306 likelihood decomposable, 205 kernel, 351, see covariance function line search, 614, 615 classifier, 358 linear algebra, 603 kidnapped robot, 466 linear dimension reduction, 311, 314 Kikuchi, 587 canonical correlation analysis, 332 KL divergence, see Kullback-Leibler divergence latent semantic analysis, 319 KNN, see nearest neighbour non-negative matrix factorisation, 328 Kronecker delta, 160, 605 probabilistic latent semantic analysis, 325 Kullback-Leibler divergence, 161, 245, 571 supervised, 337 kurtosis, 160 unsupervised, 311 Linear Discriminant Analysis, 337 labelled data, 291 linear discriminant analysis, 337 Lagrange as regression, 342 dual, 619 penalised, 342 multiplier, 619 regularised, 342 Lagrangian, 619 linear dynamical system, 170, 489 Laplace approximation, 376, 397, 569 cross moment, 495 latent ability model, 445 dynamics reversal, 494 latent Dirichlet allocation, 419 filtering, 492 latent linear model, 429 identifiability, 497 latent semantic analysis, 319 inference, 490 latent topic, 320 learning, 497 latent variable, 241 likelihood, 495 deterministic, 532 most likely state, 496 model, 241 numerical stability, 491 lattice model, 63 Riccati equations, 496 LDA regularised, 343 smoothing, 494 LDS, see linear dynamical system subspace method, 499 leaky integrate and fire model, 537 switching, 507 Leapfrog discretisation, 557 symmetrising updates, 493 learning DRAFT February 25, 2014 643 INDEX INDEX approximation, 573 Markov chain, 75, 139, 553 absorbing state, 87 PageRank, 457 Markov chain Monte Carlo, 552 auxiliary variable, 555 hybrid Monte Carlo, 555 slice sampling, 559 Swendson-Wang, 557 Gibbs sampling, 550 Metropolis-Hastings, 553 proposal distribution, 553 structured Gibbs sampling, 551 Markov decision process, 133, 148 Bellman’s equation, 135 discount factor, 136 partially observable, 144 planning, 138 policy iteration, 137 reinforcement learning, 144 stationary deterministic policy, 138 temporally unbounded, 136 value iteration, 136 Markov equivalence, 46 Markov network, 58 Boltzmann machine, 59 continuous-state temporal, 483 discrete-state temporal, 455 Gibbs distribution, 58 Hammersley Clifford theorem, 61 Mahalanobis distance, 305, 335 pairwise, 58 manifold potential, 58 linear, 311 Markov random field, 60, 599, 600 low dimensional, 311 alpha-expansion, 595 MAP, see most probable a posteriori, see most probattractive binary, 593 able a posteriori graph cut, 594 MAR, see missing at random map, 590 margin, 359 Potts model, 595 soft, 360 matrix, 605 marginal, adjacency, 28, 424 generalised, 131 block inverse, 608 marginal likelihood, 81, 111, 173, 243, 369, 389 Cholesky, 340 approximate, 371, 376, 377, 399 clique, 28 marginalisation, Gram, 331 Markov Hankel, 499 blanket, 26 incidence, 28 chain, 263, 455 inversion, 607 first order, 484 inversion lemma, 607 stationary distribution, 456 orthogonal, 607 equivalent, 46 positive definite, 610 global, 60 pseudo inverse, 607 hyper, 224 rank, 607 local, 60 stochastic, 77 model, 455 matrix factorisation, 328 pairwise, 60 max-product, 83 random field, 220 linear Gaussian state space model, 489 linear model, 345 Bayesian, 367 classification, 352 factor analysis, 431 latent, 429 regression, 346 linear parameter model, 346 Bayesian, 368 linear perceptron, 354 linear separability, 354 linear transformation, 606 linearly independent, 603 linearly separable, 354 Linsker’s as-if-Gaussian approximation, 583 localisation, 465 logic Aristotle, 15 logistic regression, 353 logistic sigmoid, 353 logit, 353 loop, 26 loop cut set, 92 loopy, 27 loss function, 294, 295 squared, 295 zero-one, 294 loss matrix, 295 Luenberger expanding subspace theorem, 616 644 DRAFT February 25, 2014 INDEX N most probable states, 86 max-sum, 87 maximum cardinality checking, 108 maximum likelihood, 18, 173, 174, 176, 196, 355 belief network, 196 Chow-Liu tree, 212 counting, 196 empirical distribution, 176 factor analysis, 431 Gaussian, 176 gradient optimisation, 261 Markov network, 213 ML-II, 195 naive Bayes, 230 properties, 174 MCMC, see Markov chain Monte Carlo mean field theory, 576 asynchronous updating, 576 Mercer kernel, 394 message passing, 100 schedule, 80, 101 message passing, 75 Metropolis-Hastings acceptance function, 553 Metropolis-Hastings sampling, 553 minimum clique cover, 424 missing at random, 242 completely, 243 missing data, 241 mixed inference, 89 mixed membership model, 419 mixing matrix, 440 mixture Gaussian, 186 mixture model, 403 Bernoulli product, 407 Dirichlet process mixture, 418 expectation maximisation, 405 factor analysis, 435 Gaussian, 409 indicator approach, 417 Markov chain, 458 PCA, 435 mixture of experts, 416 MN, see Markov network mode, 159 model auto-regressive, 485 changepoint, 518 deterministic latent variable, 532 faces, 436 leaky integrate and fire, 537 linear, 345 mixed membership, 419 mixture, 403 DRAFT February 25, 2014 INDEX Rasch, 445 model selection, 267 approximate, 274 moment, 159 moment representation, 492 momentum, 614 monadic data, 422 money financial prediction, 282 loadsa, 282 moralisation, 63, 104 most probable a posteriori, 18, 173 most probable path multiple-source multiple-sink, 89 most probable state N most probable, 85 MRF, see Markov random field multiply connected, 27 multiply connected-distributions, 105 mutual information, 163, 212, 581 approximation, 581 conditional, 162 maximisation, 582 naive Bayes, 229, 407 Bayesian, 234 tree augmented, 236 naive mean field theory, 574 nearest neighbour, 305 probabilistic, 308 network flow, 599 network modelling, 330 neural computation, 525 neural network, 365, 535 depression, 536 dynamic synapses, 536 leaky integrate and fire, 537 Newton update, 357 Newton’s method, 617 no forgetting assumption, 126 node extremal, 85 simplicial, 85 non-negative matrix factorisation, 328 normal distribution, 166 normal equations, 347 normalised importance weights, 561 observed linear dynamical system, 483 Occam’s razor, 270 One of m encoding, 157 online learning, 293 optimisation, 198, 613 Broyden-Fletcher-Goldfarb-Shanno, 618 conjugate gradients algorithm, 616 conjugate vectors algorithm, 616 645 INDEX constrained optimisation, 619 critical point, 613 gradient descent, 613 Luenberger expanding subspace theorem, 616 Newton’s method, 617 quasi Newton method, 618 option pricing, 140 ordinary least squares, 345 orthogonal, 603 orthogonal least squares, 346 orthonormal, 603 outlier, 362 over-complete representation, 324 over-complete representations, 324 overcounting, 98 overfitting, 222, 272, 388 PageRank, 457 pairwise comparison models, 447 pairwise Markov network, 58 parents, see directed acyclic graph part-of-speech tagging, 475 Partial Least Squares, 440 partial order, 126 partially observable MDP, 144 particle filter, 562 partition function, 58, 598 partitioned matrix inversion, 181 Parzen estimator, 308, 414 path blocked, 43 PC algorithm, 206 PCA, see Principal Components Analysis perceptron, 354 logistic regression, 355 perfect elimination order, 107 perfect map, see independence perfect sampling, 548 planning, 138 plant monitoring, 293 plate, 192 Poisson distribution, 164, 185 policy, 136 iteration, 137 stationary deterministic, 138 Polya distribution, 418 POMDP, see partially observable MDP positive definite matrix, 387 parameterisation, 425 posterior, 173 Dirichlet, 203 potential, 58 Potts model, 595 precision, 168, 179, 368 646 INDEX prediction auto-regression, 485 financial, 282 non-parametric, 385 parameteric, 385 predictive variance, 368 predictor-corrector, 461 Principal Components Analysis, 311, 316 algorithm, 314 high dimensional data, 317 kernel, 330 latent semantic analysis, 319 missing data, 321 probabilistic, 438 principal directions, 314 printer nightmare, 226 missing data, 262 prior, 173 probabilistic latent semantic analysis, 325 conditional, 327 EM algorithm, 325 probabilistic PCA, 438 probability conditional, function, 198 density, 9, 158 frequentist, posterior, 173 potential, 130 prior, 173 subjective, probit, 353 probit regression, 353 projection, 604 proposal distribution, 553 Pseudo Inverse, 526 pseudo inverse Hopfield network, 527 pseudo likelihood, 224 quadratic form, 610 quadratic programming, 359 query learning, 293 questionnaire analysis, 445 radial basis functions, 350 Raleigh quotient, 340 Random Boolean networks, 532 Rasch model, 445 Bayesian, 446 Rauch-Tung-Striebel, 463 reabsorption, 114 region graphs, 587 regresion linear parameter model, 346 regression, 292, 386 DRAFT February 25, 2014 INDEX logisitic, 353 regularisation, 272, 296, 348, 357 reinforcement learning, 144, 293 rejection sampling, 545 relevance vector machine, 374, 381 reparameterisation, 97 representation dual, 351 over-complete, 324 sparse, 324 under-complete, 324 resampling, 561 reset model, 518 residuals, 346 responsibility, 411 restricted Boltzmann machine, 71 Riccati equation, 496 ridge regression, 349 risk, 294 robot arm, 578 robust classification, 362 Rose-Tarjan-Lueker elimination, 107 running intersection property, 102, 103 sample mean, 161 variance, 161 sampling, 419 ancestral, 548 forward-filtering-backward-sampling, 464, 495 Gibbs, 549 importance, 560 multi-variate, 546 particle filter, 563 rejection, 545 univariate, 544 Sampling Importance Resampling, 561 scalar product, 603 scaled mixture, 167 search engine, 457 self localisation and mapping, 467 semi-supervised learning, 294 lower dimensional representations, 301 separator, 98 sequential importance sampling, 562 sequential minimal optimisation, 362 set chain, 115 Shafer-Shenoy propagation, 113 shortest path, 87 shortest weighted path, 88 sigmoid logistic, 353 sigmoid belief network, 264 sigmoid function approximate average, 378 simple path, 88 DRAFT February 25, 2014 INDEX simplicial nodes, 107 Simpson’s Paradox, 47 singly connected, 27 singular, 607 Singular Value Decomposition, 318, 609 thin, 341, 609 skeleton, 46, 206 skewness, 160 slice sampling, 559 smoothing, 462 softmax function, 358, 416 spam filtering, 234 spanning tree, 27 sparse representation, 324 spectrogram, 487 speech recognition, 474 spike response model, 535 squared Euclidean distance, 305 squared loss, 295 standard deviation, 159 standard normal distribution, 166 stationary, 553 distribution, 78 stationary Markov chain, 456 stochastic matrix, 77 stop words, 420 strong Junction Tree, 130 strong triangulation, 131 structure learning, 205 Bayesian, 209 network scoring, 209 PC algorithm, 206 undirected, 224 structured Expectation Propagation, 590 subsampling, 551 subspace method, 499 sum-product, 584 sum-product algorithm, 80 supervised learning, 291 -semi, 294 classification, 292 regression, 292 support vector machine, 359 chunking, 362 training, 362 support vectors, 360 SVD, see Singular Value Decomposition SVM, see support vector machine Swendson-Wang sampling, 557 switching AR model, 500 switching Kalman filter, see switching linear dynamical system switching linear dynamical system, 507 changepoint model, 518 collapsing Gaussians, 511 647 INDEX expectation correction, 512 filtering, 508 Gaussian sum smoothing, 512 generalised Pseudo Bayes, 516 inference computational complexity, 508 likelihood, 511 smoothing, 514 symmetry breaking, 412 system reversal, 170 tagging, 475 tall matrix, 324 term-document matrix, 319 test set, 291 text analysis, 329, 419 latent semantic analysis, 319 latent topic, 320 probabilistic latent semantic analysis, 325 TF-IDF, 319 time-invariant LDS, 496 Tower of Hanoi, 137 trace-log formula, 609 train set, 291 training batch, 356 discriminative, 300 generative, 299 generative-discriminative, 300 HMM, 468 linear dynamical system, 497 online, 356 transfer matrix, 77 transition distribution, 460 transition matrix, 483, 489 tree, 27, 76 Chow-Liu, 211 tree augmented network, 236 tree width, 109 triangulation, 106, 107 check, 109 greedy elimination, 107 maximum cardinality, 108 strong, 131 variable elimination, 107 TrueSkill, 448 INDEX unlabelled data, 293 unsupervised learning, 292, 429 utility, 121, 294 matrix, 295 money, 121 potential, 130 zero-one loss, 294 validation, 297 cross, 297 value, 135 value iteration, 136 variable hidden, 244 missing, 244 visible, 244 variable elimination, 75 variance, 159 variational approximation factorised, 574 structured, 576, 577 variational Bayes, 256 expectation maximisation, 258 varimax, 430 vector algebra, 603 vector quantisation, 415 visualisation, 334 Viterbi, 86, 254 Viterbi algorithm, 464 Viterbi alignment, 460 Viterbi training, 254 volatility, 488 Voronoi tessellation, 305 web modelling, 330 website, 330 analysis, 457 whitening, 171, 186 Woodbury formula, 607 XOR function, 354 zero-one loss, 294, 362 uncertainty, 162 under-complete representation, 324 undirected graph, 25 undirected model learning hidden variable, 261 latent variable, 261 uniform distribution, 164 unit vector, 603 648 DRAFT February 25, 2014 ... concern and will likely remain so for the foreseeable future Machine Learning Machine Learning is the study of data-driven methods capable of mimicking, understanding and aiding human and biological... operates Machine Learning is also closely allied with Artificial Intelligence, with Machine Learning placing more emphasis on using data to drive and adapt the model In the early stages of Machine Learning. .. Statistics for Machine Learning 9: Learning as Inference 10: Naive Bayes 11: Learning with Hidden Variables 12: Bayesian Model Selection 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: Machine Learning Concepts