Mathematics for Machine Learning
Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong

This material is published by Cambridge University Press as Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view and download for personal use only. Not for re-distribution, re-sale, or use in derivative works. © by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com

Contents

Foreword

Part I Mathematical Foundations

1 Introduction and Motivation
  1.1 Finding Words for Intuitions
  1.2 Two Ways to Read This Book
  1.3 Exercises and Feedback

2 Linear Algebra
  2.1 Systems of Linear Equations
  2.2 Matrices
  2.3 Solving Systems of Linear Equations
  2.4 Vector Spaces
  2.5 Linear Independence
  2.6 Basis and Rank
  2.7 Linear Mappings
  2.8 Affine Spaces
  2.9 Further Reading
  Exercises

3 Analytic Geometry
  3.1 Norms
  3.2 Inner Products
  3.3 Lengths and Distances
  3.4 Angles and Orthogonality
  3.5 Orthonormal Basis
  3.6 Orthogonal Complement
  3.7 Inner Product of Functions
  3.8 Orthogonal Projections
  3.9 Rotations
  3.10 Further Reading
  Exercises

4 Matrix Decompositions
  4.1 Determinant and Trace
  4.2 Eigenvalues and Eigenvectors
  4.3 Cholesky Decomposition
  4.4 Eigendecomposition and Diagonalization
  4.5 Singular Value Decomposition
  4.6 Matrix Approximation
  4.7 Matrix Phylogeny
  4.8 Further Reading
  Exercises

5 Vector Calculus
  5.1 Differentiation of Univariate Functions
  5.2 Partial Differentiation and Gradients
  5.3 Gradients of Vector-Valued Functions
  5.4 Gradients of Matrices
  5.5 Useful Identities for Computing Gradients
  5.6 Backpropagation and Automatic Differentiation
  5.7 Higher-Order Derivatives
  5.8 Linearization and Multivariate Taylor Series
  5.9 Further Reading
  Exercises

6 Probability and Distributions
  6.1 Construction of a Probability Space
  6.2 Discrete and Continuous Probabilities
  6.3 Sum Rule, Product Rule, and Bayes' Theorem
  6.4 Summary Statistics and Independence
  6.5 Gaussian Distribution
  6.6 Conjugacy and the Exponential Family
  6.7 Change of Variables/Inverse Transform
  6.8 Further Reading
  Exercises

7 Continuous Optimization
  7.1 Optimization Using Gradient Descent
  7.2 Constrained Optimization and Lagrange Multipliers
  7.3 Convex Optimization
  7.4 Further Reading
  Exercises

Part II Central Machine Learning Problems

8 When Models Meet Data
  8.1 Data, Models, and Learning
  8.2 Empirical Risk Minimization
  8.3 Parameter Estimation
  8.4 Probabilistic Modeling and Inference
  8.5 Directed Graphical Models
  8.6 Model Selection

9 Linear Regression
  9.1 Problem Formulation
  9.2 Parameter Estimation
  9.3 Bayesian Linear Regression
  9.4 Maximum Likelihood as Orthogonal Projection
  9.5 Further Reading

10 Dimensionality Reduction with Principal Component Analysis
  10.1 Problem Setting
  10.2 Maximum Variance Perspective
  10.3 Projection Perspective
  10.4 Eigenvector Computation and Low-Rank Approximations
  10.5 PCA in High Dimensions
  10.6 Key Steps of PCA in Practice
  10.7 Latent Variable Perspective
  10.8 Further Reading

11 Density Estimation with Gaussian Mixture Models
  11.1 Gaussian Mixture Model
  11.2 Parameter Learning via Maximum Likelihood
  11.3 EM Algorithm
  11.4 Latent-Variable Perspective
  11.5 Further Reading

12 Classification with Support Vector Machines
  12.1 Separating Hyperplanes
  12.2 Primal Support Vector Machine
  12.3 Dual Support Vector Machine
  12.4 Kernels
  12.5 Numerical Solution
  12.6 Further Reading

References
Index

Foreword

"Math is linked in the popular mind with phobia and anxiety. You'd think we're discussing spiders." (Strogatz, 2014, page 281)

Machine learning is the latest in a long line of attempts to distill human knowledge and reasoning into a form that is suitable for constructing machines and engineering automated systems. As machine learning becomes more ubiquitous and its software packages become easier to use, it is natural and desirable that the low-level technical details are abstracted away and hidden from the practitioner. However, this brings with it the danger that a practitioner becomes unaware of the design decisions and, hence, the limits of machine learning algorithms.

The enthusiastic practitioner who is interested to learn more about the magic behind successful machine learning algorithms currently faces a daunting set of pre-requisite knowledge:

- Programming languages and data analysis tools
- Large-scale computation and the associated frameworks
- Mathematics and statistics and how machine learning builds on it

At universities, introductory courses on machine learning tend to spend early parts of the course covering some of these pre-requisites. For historical reasons, courses in machine learning tend to be taught in the computer science department, where students are often trained in the first two areas of knowledge, but not so much in mathematics and statistics.

Current machine learning textbooks primarily focus on machine learning algorithms and methodologies and assume that the reader is competent in mathematics and statistics. Therefore, these books only spend one or two chapters on background mathematics, either at the beginning of the book or as appendices. We have found many people who want to delve into the foundations of basic machine learning methods who struggle with the mathematical knowledge required to read a machine learning textbook. Having taught undergraduate and graduate courses at universities, we find that the gap between high school mathematics and the mathematics level required to read a standard machine learning textbook is too big for many people.

This book brings the mathematical foundations of basic machine learning concepts to the fore and collects the information in a single place so that this skills gap is narrowed or even closed.

Why Another Book on Machine Learning?
Machine learning builds upon the language of mathematics to express concepts that seem intuitively obvious but that are surprisingly difficult to formalize. Once formalized properly, we can gain insights into the task we want to solve. One common complaint of students of mathematics around the globe is that the topics covered seem to have little relevance to practical problems. We believe that machine learning is an obvious and direct motivation for people to learn mathematics.

This book is intended to be a guidebook to the vast mathematical literature that forms the foundations of modern machine learning. We motivate the need for mathematical concepts by directly pointing out their usefulness in the context of fundamental machine learning problems. In the interest of keeping the book short, many details and more advanced concepts have been left out. Equipped with the basic concepts presented here, and how they fit into the larger context of machine learning, the reader can find numerous resources for further study, which we provide at the end of the respective chapters. For readers with a mathematical background, this book provides a brief but precisely stated glimpse of machine learning.

In contrast to other books that focus on methods and models of machine learning (MacKay, 2003; Bishop, 2006; Alpaydin, 2010; Barber, 2012; Murphy, 2012; Shalev-Shwartz and Ben-David, 2014; Rogers and Girolami, 2016) or programmatic aspects of machine learning (Müller and Guido, 2016; Raschka and Mirjalili, 2017; Chollet and Allaire, 2018), we provide only four representative examples of machine learning algorithms. Instead, we focus on the mathematical concepts behind the models themselves. We hope that readers will be able to gain a deeper understanding of the basic questions in machine learning and connect practical questions arising from the use of machine learning with fundamental choices in the mathematical model. We do not aim to write a classical machine learning book. Instead, our intention is to provide the mathematical background, applied to four central machine learning problems, to make it easier to read other machine learning textbooks.

Who Is the Target Audience?
As applications of machine learning become widespread in society, we believe that everybody should have some understanding of its underlying principles. This book is written in an academic mathematical style, which enables us to be precise about the concepts behind machine learning. We encourage readers unfamiliar with this seemingly terse style to persevere and to keep the goals of each topic in mind. We sprinkle comments and remarks throughout the text, in the hope that it provides useful guidance with respect to the big picture.

The book assumes the reader to have mathematical knowledge commonly covered in high school mathematics and physics. For example, the reader should have seen derivatives and integrals before, and geometric vectors in two or three dimensions. Starting from there, we generalize these concepts. Therefore, the target audience of the book includes undergraduate university students, evening learners, and learners participating in online machine learning courses.

In analogy to music, there are three types of interaction that people have with machine learning:

Astute Listener. The democratization of machine learning by the provision of open-source software, online tutorials, and cloud-based tools allows users to not worry about the specifics of pipelines. Users can focus on extracting insights from data using off-the-shelf tools. This enables non-tech-savvy domain experts to benefit from machine learning. This is similar to listening to music; the user is able to choose and discern between different types of machine learning, and benefits from it. More experienced users are like music critics, asking important questions about the application of machine learning in society such as ethics, fairness, and privacy of the individual. We hope that this book provides a foundation for thinking about the certification and risk management of machine learning systems, and allows them to use their domain expertise to build better machine learning systems.

Experienced Artist. Skilled practitioners of machine learning can plug and play different tools and libraries into an analysis pipeline. The stereotypical practitioner would be a data scientist or engineer who understands machine learning interfaces and their use cases, and is able to perform wonderful feats of prediction from data. This is similar to a virtuoso playing music, where highly skilled practitioners can bring existing instruments to life and bring enjoyment to their audience. Using the mathematics presented here as a primer, practitioners would be able to understand the benefits and limits of their favorite method, and to extend and generalize existing machine learning algorithms. We hope that this book provides the impetus for more rigorous and principled development of machine learning methods.

Fledgling Composer. As machine learning is applied to new domains, developers of machine learning need to develop new methods and extend existing algorithms. They are often researchers who need to understand the mathematical basis of machine learning and uncover relationships between different tasks. This is similar to composers of music who, within the rules and structure of musical theory, create new and amazing pieces. We hope this book provides a high-level overview of other technical books for people who want to become composers of machine learning. There is a great need in society for new researchers who are able to propose and explore novel approaches for
attacking the many challenges of learning from data.

Acknowledgments

We are grateful to many people who looked at early drafts of the book and suffered through painful expositions of concepts. We tried to implement their ideas that we did not vehemently disagree with. We would like to especially acknowledge Christfried Webers for his careful reading of many parts of the book, and his detailed suggestions on structure and presentation. Many friends and colleagues have also been kind enough to provide their time and energy on different versions of each chapter. We have been lucky to benefit from the generosity of the online community, who have suggested improvements via https://github.com, which greatly improved the book.

The following people have found bugs, proposed clarifications, and suggested relevant literature, either via https://github.com or personal communication. Their names are sorted alphabetically.

Abdul-Ganiy Usman, Adam Gaier, Adele Jackson, Aditya Menon, Alasdair Tran, Aleksandar Krnjaic, Alexander Makrigiorgos, Alfredo Canziani, Ali Shafti, Amr Khalifa, Andrew Tanggara, Angus Gruen, Antal A. Buss, Antoine Toisoul Le Cann, Areg Sarvazyan, Artem Artemev, Artyom Stepanov, Bill Kromydas, Bob Williamson, Boon Ping Lim, Chao Qu, Cheng Li, Chris Sherlock, Christopher Gray, Daniel McNamara, Daniel Wood, Darren Siegel, David Johnston, Dawei Chen, Ellen Broad, Fengkuangtian Zhu, Fiona Condon, Georgios Theodorou, He Xin, Irene Raissa Kameni, Jakub Nabaglo, James Hensman, Jamie Liu, Jean Kaddour, Jean-Paul Ebejer, Jerry Qiang, Jitesh Sindhare, John Lloyd, Jonas Ngnawe, Jon Martin, Justin Hsi, Kai Arulkumaran, Kamil Dreczkowski, Lily Wang, Lionel Tondji Ngoupeyou, Lydia Knüfing, Mahmoud Aslan, Mark Hartenstein, Mark van der Wilk, Markus Hegland, Martin Hewing, Matthew Alger, Matthew Lee

7.1 Optimization Using Gradient Descent

7.1.2 Gradient Descent With Momentum

The idea is to have a gradient update with memory to implement a moving average. The momentum-based method remembers the update $\Delta x_i$ at each iteration $i$ and determines the next update as a linear combination of the current and previous gradients:

$$x_{i+1} = x_i - \gamma_i ((\nabla f)(x_i))^\top + \alpha \Delta x_i \,, \tag{7.11}$$
$$\Delta x_i = x_i - x_{i-1} = \alpha \Delta x_{i-1} - \gamma_{i-1} ((\nabla f)(x_{i-1}))^\top \,, \tag{7.12}$$

where $\alpha \in [0, 1]$. Sometimes we will only know the gradient approximately. In such cases, the momentum term is useful since it averages out different noisy estimates of the gradient. One particularly useful way to obtain an approximate gradient is by using a stochastic approximation, which we discuss next.
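To make the update rule (7.11)-(7.12) concrete, here is a minimal sketch (not from the book) of gradient descent with momentum on a toy quadratic; the objective, step size, and momentum value are illustrative choices.

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x) = 0.5 * x^T Q x for a fixed positive definite Q
    # (an illustrative ill-conditioned quadratic).
    Q = np.array([[2.0, 1.0], [1.0, 20.0]])
    return Q @ x

x = np.array([5.0, 5.0])      # initial iterate x_0
delta = np.zeros_like(x)      # previous update, Delta x_0 = 0
gamma, alpha = 0.08, 0.7      # step size gamma_i and momentum alpha

for i in range(100):
    # (7.12), shifted by one index: the new update is the momentum-weighted
    # previous update minus a gradient step.
    delta = -gamma * grad_f(x) + alpha * delta
    # (7.11): x_{i+1} = x_i - gamma_i * grad f(x_i) + alpha * Delta x_i.
    x = x + delta

print(x)  # approaches the minimizer [0, 0]
```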
7.1.3 Stochastic Gradient Descent

Computing the gradient can be very time consuming. However, often it is possible to find a "cheap" approximation of the gradient. Approximating the gradient is still useful as long as it points in roughly the same direction as the true gradient. Stochastic gradient descent (often shortened as SGD) is a stochastic approximation of the gradient descent method for minimizing an objective function that is written as a sum of differentiable functions. The word stochastic here refers to the fact that we acknowledge that we do not know the gradient precisely, but instead only know a noisy approximation to it. By constraining the probability distribution of the approximate gradients, we can still theoretically guarantee that SGD will converge.

In machine learning, given $n = 1, \ldots, N$ data points, we often consider objective functions that are the sum of the losses $L_n$ incurred by each example $n$. In mathematical notation, we have the form

$$L(\theta) = \sum_{n=1}^{N} L_n(\theta) \,, \tag{7.13}$$

where $\theta$ is the vector of parameters of interest, i.e., we want to find $\theta$ that minimizes $L$. An example from regression (Chapter 9) is the negative log-likelihood, which is expressed as a sum over log-likelihoods of individual examples so that

$$L(\theta) = -\sum_{n=1}^{N} \log p(y_n \,|\, x_n, \theta) \,, \tag{7.14}$$

where $x_n \in \mathbb{R}^D$ are the training inputs, $y_n$ are the training targets, and $\theta$ are the parameters of the regression model.

Standard gradient descent, as introduced previously, is a "batch" optimization method, i.e., optimization is performed using the full training set by updating the vector of parameters according to

$$\theta_{i+1} = \theta_i - \gamma_i (\nabla L(\theta_i))^\top = \theta_i - \gamma_i \sum_{n=1}^{N} (\nabla L_n(\theta_i))^\top \tag{7.15}$$

for a suitable step-size parameter $\gamma_i$. Evaluating the sum gradient may require expensive evaluations of the gradients from all individual functions $L_n$. When the training set is enormous and/or no simple formulas exist, evaluating the sums of gradients becomes very expensive.

Considering the term $\sum_{n=1}^{N} (\nabla L_n(\theta_i))$ in (7.15), we can reduce the amount of computation by taking a sum over a smaller set of $L_n$. In contrast to batch gradient descent, which uses all $L_n$ for $n = 1, \ldots, N$, we randomly choose a subset of $L_n$ for mini-batch gradient descent. In the extreme case, we randomly select only a single $L_n$ to estimate the gradient. The key insight about why taking a subset of data is sensible is to realize that for gradient descent to converge, we only require that the gradient is an unbiased estimate of the true gradient. In fact, the term $\sum_{n=1}^{N} (\nabla L_n(\theta_i))$ in (7.15) is an empirical estimate of the expected value (Section 6.4.1) of the gradient. Therefore, any other unbiased empirical estimate of the expected value, for example using any subsample of the data, would suffice for convergence of gradient descent.

Remark. When the learning rate decreases at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a local minimum (Bottou, 1998). ♦
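As an illustration, here is a minimal mini-batch SGD sketch (my own example, not from the book) for a least-squares regression loss; the synthetic data, batch size, and step size are all assumptions made for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative): y_n = x_n^T theta_true + noise.
N, D = 1000, 3
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)

def minibatch_grad(theta, idx):
    # Unbiased estimate of the mean gradient of the squared-error losses
    # L_n(theta) = 0.5 * (x_n^T theta - y_n)^2, using only the subset idx.
    return X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)

theta = np.zeros(D)
gamma, batch_size = 0.1, 32   # illustrative step size and mini-batch size

for i in range(1000):
    idx = rng.choice(N, size=batch_size, replace=False)  # random mini-batch
    theta = theta - gamma * minibatch_grad(theta, idx)

print(theta)  # approximately theta_true
```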
Why should one consider using an approximate gradient? A major reason is practical implementation constraints, such as the size of central processing unit (CPU)/graphics processing unit (GPU) memory or limits on computational time. We can think of the size of the subset used to estimate the gradient in the same way that we thought of the size of a sample when estimating empirical means (Section 6.4.1). Large mini-batch sizes will provide accurate estimates of the gradient, reducing the variance in the parameter update. Furthermore, large mini-batches take advantage of highly optimized matrix operations in vectorized implementations of the cost and gradient. The reduction in variance leads to more stable convergence, but each gradient calculation will be more expensive. In contrast, small mini-batches are quick to estimate. If we keep the mini-batch size small, the noise in our gradient estimate will allow us to get out of some bad local optima, which we may otherwise get stuck in.

In machine learning, optimization methods are used for training by minimizing an objective function on the training data, but the overall goal is to improve generalization performance (Chapter 8). Since the goal in machine learning does not necessarily need a precise estimate of the minimum of the objective function, approximate gradients using mini-batch approaches have been widely used. Stochastic gradient descent is very effective in large-scale machine learning problems (Bottou et al., 2018), such as training deep neural networks on millions of images (Dean et al., 2012), topic models (Hoffman et al., 2013), reinforcement learning (Mnih et al., 2015), or training of large-scale Gaussian process models (Hensman et al., 2013; Gal et al., 2014).

7.2 Constrained Optimization and Lagrange Multipliers

In the previous section, we considered the problem of solving for the minimum of a function

$$\min_{x} f(x) \,, \tag{7.16}$$

where $f : \mathbb{R}^D \to \mathbb{R}$.

In this section, we have additional constraints. That is, for real-valued functions $g_i : \mathbb{R}^D \to \mathbb{R}$ for $i = 1, \ldots, m$, we consider the constrained optimization problem (see Figure 7.4 for an illustration)

$$\begin{aligned} \min_{x} \quad & f(x) \\ \text{subject to} \quad & g_i(x) \leq 0 \quad \text{for all } i = 1, \ldots, m \,. \end{aligned} \tag{7.17}$$

[Figure 7.4: Illustration of constrained optimization. The unconstrained problem (indicated by the contour lines) has a minimum on the right side (indicated by the circle). The box constraints (−1 ≤ x ≤ 1 and −1 ≤ y ≤ 1) require that the optimal solution is within the box, resulting in an optimal value indicated by the star.]

It is worth pointing out that the functions $f$ and $g_i$ could be non-convex in general, and we will consider the convex case in the next section.
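For a concrete feel of problems in the form (7.17), here is a sketch that feeds a box-constrained problem like the one in Figure 7.4 to scipy's general-purpose solver. The quadratic objective is an assumed stand-in (the book does not specify one), and note that scipy's "ineq" convention means fun(x) ≥ 0, the opposite sign of (7.17), so we pass the negated constraints.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: solve min f(x) subject to g_i(x) <= 0, as in (7.17).
# The objective below is an assumed stand-in whose unconstrained minimum
# at (2, 1.5) lies outside the box, as in Figure 7.4.
def f(x):
    return (x[0] - 2.0) ** 2 + (x[1] - 1.5) ** 2

# Box constraints -1 <= x_k <= 1, written as four functions g_i(x) <= 0.
g = [lambda x, k=k, s=s: s * x[k] - 1.0 for k in range(2) for s in (1.0, -1.0)]

# scipy's 'ineq' constraints require fun(x) >= 0, so we negate each g_i.
constraints = [{"type": "ineq", "fun": lambda x, gi=gi: -gi(x)} for gi in g]

result = minimize(f, x0=np.zeros(2), constraints=constraints)
print(result.x)  # clipped to the box boundary, approximately [1, 1]
```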
One obvious, but not very practical, way of converting the constrained problem (7.17) into an unconstrained one is to use an indicator function

$$J(x) = f(x) + \sum_{i=1}^{m} \mathbf{1}(g_i(x)) \,, \tag{7.18}$$

where $\mathbf{1}(z)$ is an infinite step function

$$\mathbf{1}(z) = \begin{cases} 0 & \text{if } z \leq 0 \\ \infty & \text{otherwise} \,. \end{cases} \tag{7.19}$$

This gives infinite penalty if the constraint is not satisfied, and hence would provide the same solution. However, this infinite step function is equally difficult to optimize. We can overcome this difficulty by introducing Lagrange multipliers. The idea of Lagrange multipliers is to replace the step function with a linear function.

We associate to problem (7.17) the Lagrangian by introducing the Lagrange multipliers $\lambda_i \geq 0$ corresponding to each inequality constraint respectively (Boyd and Vandenberghe, 2004, chapter 4), so that

$$\begin{aligned} L(x, \lambda) &= f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) && (7.20a) \\ &= f(x) + \lambda^\top g(x) \,, && (7.20b) \end{aligned}$$

where in the last line we have concatenated all constraints $g_i(x)$ into a vector $g(x)$, and all the Lagrange multipliers into a vector $\lambda \in \mathbb{R}^m$.

We now introduce the idea of Lagrangian duality. In general, duality in optimization is the idea of converting an optimization problem in one set of variables $x$ (called the primal variables) into another optimization problem in a different set of variables $\lambda$ (called the dual variables). We introduce two different approaches to duality: in this section, we discuss Lagrangian duality; in Section 7.3.3, we discuss Legendre-Fenchel duality.

Definition 7.1. The problem in (7.17)

$$\begin{aligned} \min_{x} \quad & f(x) \\ \text{subject to} \quad & g_i(x) \leq 0 \quad \text{for all } i = 1, \ldots, m \end{aligned} \tag{7.21}$$

is known as the primal problem, corresponding to the primal variables $x$. The associated Lagrangian dual problem is given by

$$\begin{aligned} \max_{\lambda \in \mathbb{R}^m} \quad & D(\lambda) \\ \text{subject to} \quad & \lambda \geq 0 \,, \end{aligned} \tag{7.22}$$

where $\lambda$ are the dual variables and $D(\lambda) = \min_{x \in \mathbb{R}^d} L(x, \lambda)$.

Remark. In the discussion of Definition 7.1, we use two concepts that are also of independent interest (Boyd and Vandenberghe, 2004). First is the minimax inequality, which says that for any function with two arguments $\varphi(x, y)$, the maximin is less than the minimax, i.e.,

$$\max_{y} \min_{x} \varphi(x, y) \leq \min_{x} \max_{y} \varphi(x, y) \,. \tag{7.23}$$

This inequality can be proved by considering the inequality

$$\text{for all } x, y \quad \min_{x} \varphi(x, y) \leq \max_{y} \varphi(x, y) \,. \tag{7.24}$$

Note that taking the maximum over $y$ of the left-hand side of (7.24) maintains the inequality, since the inequality is true for all $y$. Similarly, we can take the minimum over $x$ of the right-hand side of (7.24) to obtain (7.23). The second concept is weak duality, which uses (7.23) to show that primal values are always greater than or equal to dual values. This is described in more detail in (7.27). ♦

Recall that the difference between $J(x)$ in (7.18) and the Lagrangian in (7.20b) is that we have relaxed the indicator function to a linear function. Therefore, when $\lambda \geq 0$, the Lagrangian $L(x, \lambda)$ is a lower bound of $J(x)$. Hence, the maximum of $L(x, \lambda)$ with respect to $\lambda$ is $J(x)$:

$$J(x) = \max_{\lambda \geq 0} L(x, \lambda) \,. \tag{7.25}$$

Recall that the original problem was minimizing $J(x)$,

$$\min_{x \in \mathbb{R}^d} \max_{\lambda \geq 0} L(x, \lambda) \,. \tag{7.26}$$

By the minimax inequality (7.23), it follows that swapping the order of the minimum and maximum results in a smaller value, i.e.,

$$\min_{x \in \mathbb{R}^d} \max_{\lambda \geq 0} L(x, \lambda) \geq \max_{\lambda \geq 0} \min_{x \in \mathbb{R}^d} L(x, \lambda) \,. \tag{7.27}$$

This is also known as weak duality. Note that the inner part of the right-hand side is the dual objective function $D(\lambda)$, and the definition follows.

In contrast to the original optimization problem, which has constraints, $\min_{x \in \mathbb{R}^d} L(x, \lambda)$ is an unconstrained optimization problem for a given value of $\lambda$. If solving $\min_{x \in \mathbb{R}^d} L(x, \lambda)$ is easy, then the overall problem is easy to solve. We can see this by observing from (7.20b) that $L(x, \lambda)$ is affine with respect to $\lambda$. Therefore, $\min_{x \in \mathbb{R}^d} L(x, \lambda)$ is a pointwise minimum of affine functions of $\lambda$, and hence $D(\lambda)$ is concave even though $f(\cdot)$ and $g_i(\cdot)$ may be nonconvex. The outer problem, maximization over $\lambda$, is the maximum of a concave function and can be efficiently computed.
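As a sanity check on these definitions, consider the toy problem (my own example, not from the book) of minimizing x² subject to 1 − x ≤ 0. The Lagrangian is L(x, λ) = x² + λ(1 − x); for fixed λ ≥ 0 the inner minimum is at x = λ/2, giving the concave dual D(λ) = λ − λ²/4. The sketch below confirms that the dual maximum matches the primal minimum (the problem is convex, so weak duality holds with equality here).

```python
import numpy as np

# Toy problem: min_x x^2 subject to g(x) = 1 - x <= 0 (primal optimum: x = 1).
def D(lam):
    # Dual function: L(x*, lam) with the inner minimizer x* = lam / 2,
    # which simplifies to lam - lam^2 / 4.
    x_star = lam / 2.0
    return x_star**2 + lam * (1.0 - x_star)

lams = np.linspace(0.0, 4.0, 4001)      # grid over lambda >= 0
dual_value = np.max(D(lams))            # attained at lam = 2
primal_value = 1.0**2                   # f(x) at the primal optimum x = 1

print(dual_value, primal_value)         # both 1.0, as weak duality promises
```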
Assuming $f(\cdot)$ and $g_i(\cdot)$ are differentiable, we find the Lagrange dual problem by differentiating the Lagrangian with respect to $x$, setting the differential to zero, and solving for the optimal value. We will discuss two concrete examples in Sections 7.3.1 and 7.3.2, where $f(\cdot)$ and $g_i(\cdot)$ are convex.

Remark (Equality Constraints). Consider (7.17) with additional equality constraints:

$$\begin{aligned} \min_{x} \quad & f(x) \\ \text{subject to} \quad & g_i(x) \leq 0 \quad \text{for all } i = 1, \ldots, m \\ & h_j(x) = 0 \quad \text{for all } j = 1, \ldots, n \,. \end{aligned} \tag{7.28}$$

We can model equality constraints by replacing them with two inequality constraints. That is, for each equality constraint $h_j(x) = 0$ we equivalently replace it by the two constraints $h_j(x) \leq 0$ and $h_j(x) \geq 0$. It turns out that the resulting Lagrange multipliers are then unconstrained. Therefore, we constrain the Lagrange multipliers corresponding to the inequality constraints in (7.28) to be non-negative, and leave the Lagrange multipliers corresponding to the equality constraints unconstrained. ♦

7.3 Convex Optimization

We focus our attention on a particularly useful class of optimization problems, where we can guarantee global optimality. When $f(\cdot)$ is a convex function, and when the constraints involving $g(\cdot)$ and $h(\cdot)$ are convex sets, this is called a convex optimization problem. In this setting, we have strong duality: the optimal solution of the dual problem is the same as the optimal solution of the primal problem. The distinction between convex functions and convex sets is often not strictly presented in machine learning literature, but one can often infer the implied meaning from context.

Definition 7.2. A set $C$ is a convex set if for any $x, y \in C$ and for any scalar $\theta$ with $0 \leq \theta \leq 1$, we have

$$\theta x + (1 - \theta) y \in C \,. \tag{7.29}$$

Convex sets are sets such that a straight line connecting any two elements of the set lies inside the set. Figures 7.5 and 7.6 illustrate convex and nonconvex sets, respectively.

[Figure 7.5: Example of a convex set.]
[Figure 7.6: Example of a nonconvex set.]

Convex functions are functions such that a straight line between any two points of the function lies above the function. Figure 7.2 shows a nonconvex function, and Figure 7.3 shows a convex function. Another convex function is shown in Figure 7.7.

[Figure 7.7: Example of a convex function, y = 3x² − 5x + 2.]

Definition 7.3. Let function $f : \mathbb{R}^D \to \mathbb{R}$ be a function whose domain is a convex set. The function $f$ is a convex function if for all $x, y$ in the domain of $f$, and for any scalar $\theta$ with $0 \leq \theta \leq 1$, we have

$$f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y) \,. \tag{7.30}$$

Remark. A concave function is the negative of a convex function. ♦

The constraints involving $g(\cdot)$ and $h(\cdot)$ in (7.28) truncate functions at a scalar value, resulting in sets. Another relation between convex functions and convex sets is to consider the set obtained by "filling in" a convex function. A convex function is a bowl-like object, and we imagine pouring water into it to fill it up. This resulting filled-in set, called the epigraph of the convex function, is a convex set.

If a function $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable, we can specify convexity in terms of its gradient $\nabla_x f(x)$ (Section 5.2). A function $f(x)$ is convex if and only if for any two points $x, y$ it holds that

$$f(y) \geq f(x) + \nabla_x f(x)^\top (y - x) \,. \tag{7.31}$$

If we further know that a function $f(x)$ is twice differentiable, that is, the Hessian (5.147) exists for all values in the domain of $x$, then the function $f(x)$ is convex if and only if $\nabla_x^2 f(x)$ is positive semidefinite (Boyd and Vandenberghe, 2004).
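The second-order condition is easy to test numerically. Below is a small sketch (my own illustration) that checks positive semidefiniteness of the second derivative on a grid, for a quadratic like the one in Figure 7.7 (the constant term is an assumption from the figure, but it does not affect the second derivative anyway).

```python
import numpy as np

def second_derivative(f, x, h=1e-5):
    # Central finite-difference estimate of f''(x); for a convex f this
    # should be nonnegative everywhere (second-order convexity condition).
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

f = lambda x: 3.0 * x**2 - 5.0 * x + 2.0   # f''(x) = 6 > 0, so f is convex

print(all(second_derivative(f, x) >= 0 for x in np.linspace(-3, 2, 51)))
# True: the Hessian (here a scalar) is positive semidefinite on the grid.
```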
Example 7.3. The negative entropy $f(x) = x \log_2 x$ is convex for $x > 0$. A visualization of the function is shown in Figure 7.8, and we can see that the function is convex. To illustrate the previous definitions of convexity, let us check the calculations for two points $x = 2$ and $x = 4$. Note that to prove convexity of $f(x)$ we would need to check for all points $x \in \mathbb{R}$.

Recall Definition 7.3. Consider a point midway between the two points (that is, $\theta = 0.5$); then the left-hand side is $f(0.5 \cdot 2 + 0.5 \cdot 4) = 3 \log_2 3 \approx 4.75$. The right-hand side is $0.5(2 \log_2 2) + 0.5(4 \log_2 4) = 1 + 4 = 5$. And therefore the definition is satisfied.

Since $f(x)$ is differentiable, we can alternatively use (7.31). Calculating the derivative of $f(x)$, we obtain

$$\nabla_x (x \log_2 x) = 1 \cdot \log_2 x + x \cdot \frac{1}{x \log_e 2} = \log_2 x + \frac{1}{\log_e 2} \,. \tag{7.32}$$

Using the same two test points $x = 2$ and $x = 4$, the left-hand side of (7.31) is given by $f(4) = 8$. The right-hand side is

$$\begin{aligned} f(x) + \nabla_x f(x)^\top (y - x) &= f(2) + \nabla f(2) \cdot (4 - 2) && (7.33a) \\ &= 2 + \left(1 + \frac{1}{\log_e 2}\right) \cdot 2 \approx 6.9 \,. && (7.33b) \end{aligned}$$

[Figure 7.8: The negative entropy function x log₂ x (which is convex) and its tangent at x = 2.]

We can check that a function or set is convex from first principles by recalling the definitions. In practice, we often rely on operations that preserve convexity to check that a particular function or set is convex. Although the details are vastly different, this is again the idea of closure that we introduced in Chapter 2 for vector spaces.

Example 7.4. A nonnegative weighted sum of convex functions is convex. Observe that if $f$ is a convex function, and $\alpha \geq 0$ is a nonnegative scalar, then the function $\alpha f$ is convex. We can see this by multiplying $\alpha$ to both sides of the equation in Definition 7.3, and recalling that multiplying by a nonnegative number does not change the inequality.

If $f_1$ and $f_2$ are convex functions, then we have by the definition

$$f_1(\theta x + (1 - \theta) y) \leq \theta f_1(x) + (1 - \theta) f_1(y) \,, \tag{7.34}$$
$$f_2(\theta x + (1 - \theta) y) \leq \theta f_2(x) + (1 - \theta) f_2(y) \,. \tag{7.35}$$

Summing up both sides gives us

$$f_1(\theta x + (1 - \theta) y) + f_2(\theta x + (1 - \theta) y) \leq \theta f_1(x) + (1 - \theta) f_1(y) + \theta f_2(x) + (1 - \theta) f_2(y) \,, \tag{7.36}$$

where the right-hand side can be rearranged to

$$\theta (f_1(x) + f_2(x)) + (1 - \theta)(f_1(y) + f_2(y)) \,, \tag{7.37}$$

completing the proof that the sum of convex functions is convex. Combining the preceding two facts, we see that $\alpha f_1(x) + \beta f_2(x)$ is convex for $\alpha, \beta \geq 0$. This closure property can be extended using a similar argument for nonnegative weighted sums of more than two convex functions.

Remark. The inequality in (7.30) is sometimes called Jensen's inequality. In fact, a whole class of inequalities for taking nonnegative weighted sums of convex functions are all called Jensen's inequality. ♦

In summary, a constrained optimization problem is called a convex optimization problem if

$$\begin{aligned} \min_{x} \quad & f(x) \\ \text{subject to} \quad & g_i(x) \leq 0 \quad \text{for all } i = 1, \ldots, m \\ & h_j(x) = 0 \quad \text{for all } j = 1, \ldots, n \,, \end{aligned} \tag{7.38}$$

where all the functions $f(x)$ and $g_i(x)$ are convex functions, and all $h_j(x) = 0$ define convex sets. In the following, we will describe two classes of convex optimization problems that are widely used and well understood.

7.3.1 Linear Programming

Consider the special case when all the preceding functions are linear, i.e.,
$$\begin{aligned} \min_{x \in \mathbb{R}^d} \quad & c^\top x \\ \text{subject to} \quad & Ax \leq b \,, \end{aligned} \tag{7.39}$$

where $A \in \mathbb{R}^{m \times d}$ and $b \in \mathbb{R}^m$. This is known as a linear program. It has $d$ variables and $m$ linear constraints. Linear programs are one of the most widely used approaches in industry. The Lagrangian is given by

$$L(x, \lambda) = c^\top x + \lambda^\top (Ax - b) \,, \tag{7.40}$$

where $\lambda \in \mathbb{R}^m$ is the vector of non-negative Lagrange multipliers. Rearranging the terms corresponding to $x$ yields

$$L(x, \lambda) = (c + A^\top \lambda)^\top x - \lambda^\top b \,. \tag{7.41}$$

Taking the derivative of $L(x, \lambda)$ with respect to $x$ and setting it to zero gives us

$$c + A^\top \lambda = 0 \,. \tag{7.42}$$

Therefore, the dual Lagrangian is $D(\lambda) = -\lambda^\top b$. Recall that we would like to maximize $D(\lambda)$. In addition to the constraint due to the derivative of $L(x, \lambda)$ being zero, we also have the fact that $\lambda \geq 0$, resulting in the following dual optimization problem:

$$\begin{aligned} \max_{\lambda \in \mathbb{R}^m} \quad & -b^\top \lambda \\ \text{subject to} \quad & c + A^\top \lambda = 0 \\ & \lambda \geq 0 \,. \end{aligned} \tag{7.43}$$

This is also a linear program, but with $m$ variables. We have the choice of solving the primal (7.39) or the dual (7.43) program depending on whether $m$ or $d$ is larger. Recall that $d$ is the number of variables and $m$ is the number of constraints in the primal linear program. (It is convention to minimize the primal and maximize the dual.)

Example 7.5 (Linear Program). Consider the linear program

$$\begin{aligned} \min_{x \in \mathbb{R}^2} \quad & -\begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \\ \text{subject to} \quad & \begin{bmatrix} 2 & 2 \\ 2 & -4 \\ -2 & 1 \\ 0 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \leq \begin{bmatrix} 33 \\ 8 \\ 5 \\ -1 \\ 8 \end{bmatrix} \end{aligned} \tag{7.44}$$

with two variables. This program is also shown in Figure 7.9. The objective function is linear, resulting in linear contour lines. The constraint set in standard form is translated into the legend. The optimal value must lie in the shaded (feasible) region, and is indicated by the star.

[Figure 7.9: Illustration of the linear program in (7.44). The unconstrained problem (indicated by the contour lines) has a minimum on the right side; the optimal value given the constraints is shown by the star.]

7.3.2 Quadratic Programming

Consider the case of a convex quadratic objective function, where the constraints are affine, i.e.,

$$\begin{aligned} \min_{x \in \mathbb{R}^d} \quad & \frac{1}{2} x^\top Q x + c^\top x \\ \text{subject to} \quad & Ax \leq b \,, \end{aligned} \tag{7.45}$$

where $A \in \mathbb{R}^{m \times d}$, $b \in \mathbb{R}^m$, and $c \in \mathbb{R}^d$. The square symmetric matrix $Q \in \mathbb{R}^{d \times d}$ is positive definite, and therefore the objective function is convex. This is known as a quadratic program. Observe that it has $d$ variables and $m$ linear constraints.

Example 7.6 (Quadratic Program). Consider the quadratic program

$$\min_{x \in \mathbb{R}^2} \quad \frac{1}{2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} 2 & 1 \\ 1 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \tag{7.46}$$

$$\text{subject to} \quad \begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \leq \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} \tag{7.47}$$

of two variables. The program is also illustrated in Figure 7.4. The objective function is quadratic with a positive semidefinite matrix $Q$, resulting in elliptical contour lines. The optimal value must lie in the shaded (feasible) region, and is indicated by the star.

The Lagrangian is given by

$$\begin{aligned} L(x, \lambda) &= \frac{1}{2} x^\top Q x + c^\top x + \lambda^\top (Ax - b) && (7.48a) \\ &= \frac{1}{2} x^\top Q x + (c + A^\top \lambda)^\top x - \lambda^\top b \,, && (7.48b) \end{aligned}$$

where again we have rearranged the terms. Taking the derivative of $L(x, \lambda)$ with respect to $x$ and setting it to zero gives

$$Qx + (c + A^\top \lambda) = 0 \,. \tag{7.49}$$

Assuming that $Q$ is invertible, we get

$$x = -Q^{-1}(c + A^\top \lambda) \,. \tag{7.50}$$

Substituting (7.50) into the primal Lagrangian $L(x, \lambda)$, we get the dual Lagrangian

$$D(\lambda) = -\frac{1}{2}(c + A^\top \lambda)^\top Q^{-1} (c + A^\top \lambda) - \lambda^\top b \,. \tag{7.51}$$

Therefore, the dual optimization problem is given by

$$\begin{aligned} \max_{\lambda \in \mathbb{R}^m} \quad & -\frac{1}{2}(c + A^\top \lambda)^\top Q^{-1} (c + A^\top \lambda) - \lambda^\top b \\ \text{subject to} \quad & \lambda \geq 0 \,. \end{aligned} \tag{7.52}$$
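To connect these programs to software, here is a sketch that feeds the linear program of Example 7.5 to scipy.optimize.linprog and the box-constrained quadratic program of Example 7.6 to a general-purpose solver. The coefficient matrices are my reconstruction of the garbled examples above and should be treated as illustrative.

```python
import numpy as np
from scipy.optimize import linprog, minimize

# Linear program (7.44), coefficients as reconstructed above (illustrative).
c = np.array([-5.0, -3.0])
A = np.array([[2.0, 2.0], [2.0, -4.0], [-2.0, 1.0], [0.0, -1.0], [0.0, 1.0]])
b = np.array([33.0, 8.0, 5.0, -1.0, 8.0])

# linprog minimizes c^T x subject to A_ub @ x <= b_ub; bounds=(None, None)
# removes its default nonnegativity bounds on x.
lp = linprog(c, A_ub=A, b_ub=b, bounds=(None, None))
print(lp.x)

# Quadratic program (7.46)-(7.47): the constraints are a box, so they can be
# passed directly as variable bounds to a general-purpose solver.
Q = np.array([[2.0, 1.0], [1.0, 4.0]])
cq = np.array([5.0, 3.0])
qp = minimize(lambda x: 0.5 * x @ Q @ x + cq @ x,
              x0=np.zeros(2),
              bounds=[(-1.0, 1.0), (-1.0, 1.0)])
print(qp.x)
```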
We will see an application of quadratic programming in machine learning in Chapter 12.

7.3.3 Legendre-Fenchel Transform and Convex Conjugate

Let us revisit the idea of duality from Section 7.2, without considering constraints. One useful fact about a convex set is that it can be equivalently described by its supporting hyperplanes. A hyperplane is called a supporting hyperplane of a convex set if it intersects the convex set, and the convex set is contained on just one side of it. Recall that we can fill up a convex function to obtain the epigraph, which is a convex set. Therefore, we can also describe convex functions in terms of their supporting hyperplanes. Furthermore, observe that the supporting hyperplane just touches the convex function, and is in fact the tangent to the function at that point. And recall that the tangent of a function $f(x)$ at a given point $x_0$ is the evaluation of the gradient of that function at that point, $\frac{df(x)}{dx}\big|_{x = x_0}$.

In summary, because convex sets can be equivalently described by their supporting hyperplanes, convex functions can be equivalently described by a function of their gradient. The Legendre transform formalizes this concept. (Physics students are often introduced to the Legendre transform as relating the Lagrangian and the Hamiltonian in classical mechanics.) We begin with the most general definition, which unfortunately has a counter-intuitive form, and look at special cases to relate the definition to the intuition described in the preceding paragraph.

The Legendre-Fenchel transform is a transformation (in the sense of a Fourier transform) from a convex differentiable function $f(x)$ to a function that depends on the tangents $s(x) = \nabla_x f(x)$. It is worth stressing that this is a transformation of the function $f(\cdot)$ and not the variable $x$ or the function evaluated at $x$. The Legendre-Fenchel transform is also known as the convex conjugate (for reasons we will see soon) and is closely related to duality (Hiriart-Urruty and Lemaréchal, 2001, chapter 5).

Definition 7.4. The convex conjugate of a function $f : \mathbb{R}^D \to \mathbb{R}$ is a function $f^*$ defined by

$$f^*(s) = \sup_{x \in \mathbb{R}^D} (\langle s, x \rangle - f(x)) \,. \tag{7.53}$$

Note that the preceding convex conjugate definition does not need the function $f$ to be convex nor differentiable. In Definition 7.4, we have used a general inner product (Section 3.2), but in the rest of this section we will consider the standard dot product.
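As a worked illustration of Definition 7.4 (my own example, not from the book): for f(x) = x², the supremum of sx − x² is attained at x = s/2, giving f*(s) = s²/4. The sketch below checks this numerically with a brute-force supremum over a grid.

```python
import numpy as np

# Numerical check of the convex conjugate (7.53) for f(x) = x^2.
# Analytically, sup_x (s*x - x^2) is attained at x = s/2, so f*(s) = s^2 / 4.
f = lambda x: x**2
xs = np.linspace(-10.0, 10.0, 100001)   # grid standing in for the sup over R

for s in [-3.0, 0.0, 1.0, 4.0]:
    conjugate = np.max(s * xs - f(xs))  # brute-force sup on the grid
    print(s, conjugate, s**2 / 4.0)     # the two values agree closely
```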