This book is accompanied by the above website. The website provides a variety of supplementary material, including exercises, lecture slides, corrections of mistakes, and other resources that should be useful to both readers and instructors.

Acknowledgments

This book would not have been possible without the contributions of many people. We would like to thank those who commented on our proposal for the book and helped plan its contents and organization: Guillaume Alain, Kyunghyun Cho, Çağlar Gülçehre, David Krueger, Hugo Larochelle, Razvan Pascanu and Thomas Rohée.

We would like to thank the people who offered feedback on the content of the book itself. Some offered feedback on many chapters: Martín Abadi, Guillaume Alain, Ion Androutsopoulos, Fred Bertsch, Olexa Bilaniuk, Ufuk Can Biçici, Matko Bošnjak, John Boersma, Greg Brockman, Alexandre de Brébisson, Pierre Luc Carrier, Sarath Chandar, Pawel Chilinski, Mark Daoust, Oleg Dashevskii, Laurent Dinh, Stephan Dreseitl, Jim Fan, Miao Fan, Meire Fortunato, Frédéric Francis, Nando de Freitas, Çağlar Gülçehre, Jurgen Van Gael, Javier Alonso García, Jonathan Hunt, Gopi Jeyaram, Chingiz Kabytayev, Lukasz Kaiser, Varun Kanade, Asifullah Khan, Akiel Khan, John King, Diederik P. Kingma, Yann LeCun, Rudolf Mathey, Matías Mattamala, Abhinav Maurya, Kevin Murphy, Oleg Mürk, Roman Novak, Augustus Q. Odena, Simon Pavlik, Karl Pichotta, Eddie Pierce, Kari Pulli, Roussel Rahman, Tapani Raiko, Anurag Ranjan, Johannes Roith, Mihaela Rosca, Halis Sak, César Salgado, Grigory Sapunov, Yoshinori Sasaki, Mike Schuster, Julian Serban, Nir Shabat, Ken Shirriff, Andre Simpelo, Scott Stanley, David Sussillo, Ilya Sutskever, Carles Gelada Sáez, Graham Taylor, Valentin Tolmer, Massimiliano Tomassoli, An Tran, Shubhendu Trivedi, Alexey Umnov, Vincent Vanhoucke, Marco Visentini-Scarzanella, Martin Vita, David Warde-Farley, Dustin Webb, Kelvin Xu, Wei Xue, Ke Yang, Li Yao, Zygmunt Zając and Ozan Çağlayan.

We would also like to thank those who provided us with useful feedback on individual chapters:

• Notation: Zhang Yuanhang.
• Chapter 1, Introduction: Yusuf Akgul, Sebastien Bratieres, Samira Ebrahimi, Charlie Gorichanaz, Brendan Loudermilk, Eric Morris, Cosmin Pârvulescu and Alfredo Solano.
• Chapter 2, Linear Algebra: Amjad Almahairi, Nikola Banić, Kevin Bennett, Philippe Castonguay, Oscar Chang, Eric Fosler-Lussier, Andrey Khalyavin, Sergey Oreshkov, István Petrás, Dennis Prangle, Thomas Rohée, Gitanjali Gulve Sehgal, Colby Toland, Alessandro Vitale and Bob Welland.
• Chapter 3, Probability and Information Theory: John Philip Anderson, Kai Arulkumaran, Vincent Dumoulin, Rui Fa, Stephan Gouws, Artem Oboturov, Antti Rasmus, Alexey Surkov and Volker Tresp.
• Chapter 4, Numerical Computation: Tran Lam An, Ian Fischer and Hu Yuhuang.
• Chapter 5, Machine Learning Basics: Dzmitry Bahdanau, Justin Domingue, Nikhil Garg, Makoto Otsuka, Bob Pepin, Philip Popien, Emmanuel Rayner, Peter Shepard, Kee-Bong Song, Zheng Sun and Andy Wu.
• Chapter 6, Deep Feedforward Networks: Uriel Berdugo, Fabrizio Bottarel, Elizabeth Burl, Ishan Durugkar, Jeff Hlywa, Jong Wook Kim, David Krueger and Aditya Kumar Praharaj.
• Chapter 7, Regularization for Deep Learning: Morten Kolbæk, Kshitij Lauria, Inkyu Lee, Sunil Mohan, Hai Phong Phan and Joshua Salisbury.
• Chapter 8, Optimization for Training Deep Models: Marcel Ackermann, Peter Armitage, Rowel Atienza, Andrew Brock, Tegan Maharaj, James Martens, Kashif Rasul, Klaus Strobl and Nicholas Turner.
• Chapter 9, Convolutional Networks: Martín Arjovsky, Eugene Brevdo, Konstantin Divilov, Eric Jensen, Mehdi Mirza, Alex Paino, Marjorie Sayer, Ryan Stout and Wentao Wu.
• Chapter 10, Sequence Modeling: Recurrent and Recursive Nets: Gökçen Eraslan, Steven Hickson, Razvan Pascanu, Lorenzo von Ritter, Rui Rodrigues, Dmitriy Serdyuk, Dongyu Shi and Kaiyu Yang.
• Chapter 11, Practical Methodology: Daniel Beckstein.
• Chapter 12, Applications: George Dahl, Vladimir Nekrasov and Ribana Roscher.
• Chapter 13, Linear Factor Models: Jayanth Koushik.
• Chapter 15, Representation Learning: Kunal Ghosh.
• Chapter 16, Structured Probabilistic Models for Deep Learning: Minh Lê and Anton Varfolom.
• Chapter 18, Confronting the Partition Function: Sam Bowman.
• Chapter 19, Approximate Inference: Yujia Bao.
• Chapter 20, Deep Generative Models: Nicolas Chapados, Daniel Galvez, Wenming Ma, Fady Medhat, Shakir Mohamed and Grégoire Montavon.
• Bibliography: Lukas Michelbacher and Leslie N. Smith.

We also want to thank those who allowed us to reproduce images, figures or data from their publications. We indicate their contributions in the figure captions throughout the text.

We would like to thank Lu Wang for writing pdf2htmlEX, which we used to make the web version of the book, and for offering support to improve the quality of the resulting HTML.

We would like to thank Ian's wife Daniela Flori Goodfellow for patiently supporting Ian during the writing of the book as well as for help with proofreading.

We would like to thank the Google Brain team for providing an intellectual environment where Ian could devote a tremendous amount of time to writing this book and receive feedback and guidance from colleagues. We would especially like to thank Ian's former manager, Greg Corrado, and his current manager, Samy Bengio, for their support of this project. Finally, we would like to thank Geoffrey Hinton for encouragement when writing was difficult.
Who Should Read This Book?
This book is designed for two primary audiences: university students, both undergraduate and graduate, who are studying machine learning and starting their careers in deep learning and artificial intelligence research, and software engineers lacking a background in machine learning or statistics who wish to quickly gain the knowledge necessary to implement deep learning in their products or platforms. Deep learning has already demonstrated its value across many applications.
Deep learning is a kind of representation learning, which itself falls under the broader category of machine learning, and is used in many approaches to artificial intelligence (AI). The Venn diagram illustrates the relationships between these concepts, with each section highlighting specific examples of AI technologies.
Figure 1.5 (flowchart labels): classic machine learning; representation learning; additional layers of more abstract features.
AI systems encompass various disciplines such as computer vision, speech and audio processing, natural language processing, robotics, bioinformatics, chemistry, video games, search engines, online advertising, and finance. Key components within these systems, highlighted in shaded boxes, are capable of learning from data, illustrating the interconnectedness of different AI fields.
This book is structured into three distinct parts to cater to a diverse readership. The first part introduces fundamental mathematical tools and core concepts of machine learning. The second part focuses on well-established deep learning algorithms that are considered effectively solved technologies. Finally, the third part explores speculative ideas that are anticipated to play a crucial role in the future of deep learning research.
Readers are encouraged to skip sections that do not align with their interests or expertise. Those with a background in linear algebra, probability, and basic machine learning can bypass the corresponding parts, while individuals focused solely on implementing a working system can stop after part II. To assist in selecting relevant chapters, figure 1.6 presents a flowchart outlining the book's high-level organization.
This book is tailored for readers with a computer science background, who are expected to have knowledge of programming, a fundamental grasp of computational performance and complexity theory, as well as introductory calculus and basic graph theory terminology.
Historical Trends in Deep Learning
It is easiest to understand deep learning with some historical context. Rather than providing a detailed history of deep learning, we identify a few key trends:
• Deep learning has had a long and rich history, but has gone by many names reflecting different philosophical viewpoints, and has waxed and waned in popularity.
• Deep learning has become more useful as the amount of available training data has increased.
• Deep learning models have grown in size over time as computer infrastructure (both hardware and software) for deep learning has improved.
• Deep learning has solved increasingly complicated applications with increasing accuracy over time.
Part I: Applied Math and Machine Learning Basics
Part II: Deep Networks: Modern Practices
Part III: Deep Learning Research
Figure 1.6: The high-level organization of the book. An arrow from one chapter to another indicates that the former chapter is prerequisite material for understanding the latter.
1.2.1 The Many Names and Changing Fortunes of Neural Networks
Deep learning, often perceived as a cutting-edge technology, actually has roots dating back to the 1940s. Its recent popularity may lead many to believe it is a new phenomenon; however, it has experienced periods of unpopularity and has been known by various names over the years. This evolution in terminology reflects the contributions of diverse researchers and perspectives within the field.
Deep learning has evolved through three significant waves: the first wave, known as cybernetics, occurred from the 1940s to the 1960s; the second wave, termed connectionism, took place during the 1980s and 1990s; and the current resurgence of deep learning began in 2006. Understanding this historical context is essential for grasping the development of deep learning technologies.
Early learning algorithms were designed as computational models of biological learning, leading to the name artificial neural networks (ANNs) for deep learning. These models are inspired by the biological brain, whether human or animal, but are not necessarily realistic representations of brain function. The neural perspective on deep learning is driven by two main ideas: first, that the brain exemplifies intelligent behavior, suggesting that reverse engineering its computational principles could lead to artificial intelligence; and second, that understanding the brain and the foundations of human intelligence can provide valuable insights, making machine learning models significant beyond their engineering applications.
Deep learning transcends the traditional neuroscientific view of machine learning models, embracing a broader principle of learning through multiple levels of composition. This approach can be integrated into various machine learning frameworks, extending beyond those inspired by neural networks.
Figure 1.7: The figure shows two of the three historical waves of artificial neural nets research, as measured by the frequency of the phrases "cybernetics" and "connectionism" or "neural networks" (the vertical axis of the original plot is the frequency of the word or phrase).
Neural networks have evolved through three distinct waves, as outlined by historical developments in the field. The first wave, spanning the 1940s to 1960s, focused on cybernetics and introduced theories of biological learning, exemplified by early models like the perceptron, which enabled the training of a single neuron. The second wave, occurring between 1980 and 1995, embraced the connectionist approach and used back-propagation to train neural networks with one or two hidden layers. The current third wave, known as deep learning, emerged around 2006 and began to appear in book form in 2016, following a similar trend in which earlier waves were documented in books long after their scientific inception.
The origins of modern deep learning can be traced back to simple linear models inspired by neuroscience. These models were designed to take a set of n input values x₁, …, xₙ and associate them with an output y.
These models would learn a set of weights w₁, …, wₙ and compute their output f(x, w) = x₁w₁ + ⋯ + xₙwₙ. This first wave of neural networks research was known as cybernetics, as illustrated in figure 1.7.
The McCulloch-Pitts neuron, introduced in 1943, was an early model of brain function capable of recognizing two distinct input categories by testing whether f(x, w) was positive or negative. For accurate categorization, the weights required adjustment by a human operator. In the 1950s, the perceptron emerged as the first model capable of learning these weights autonomously from examples of each category. Around the same period, the adaptive linear element (ADALINE) was developed, which predicted real numbers by returning the value of f(x, w) directly and could also learn from data to improve its predictions.
These simple learning algorithms have significantly influenced the current state of machine learning. The ADALINE training algorithm, which adapts the weights, is a special case of stochastic gradient descent. Today, slightly modified versions of this algorithm remain the dominant method for training deep learning models.
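To make this concrete, the following is a minimal NumPy sketch (not the book's code) of an ADALINE-style linear model trained with stochastic gradient descent on the squared error; the synthetic data, learning rate, and number of passes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y is a noisy linear function of x (illustrative only).
true_w = np.array([2.0, -3.0, 0.5])
X = rng.normal(size=(100, 3))
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)          # weights w_1, ..., w_n
learning_rate = 0.01

for epoch in range(50):
    for i in rng.permutation(len(X)):        # visit examples in random order
        prediction = X[i] @ w                # f(x, w) = x_1 w_1 + ... + x_n w_n
        error = prediction - y[i]
        w -= learning_rate * error * X[i]    # gradient step on the squared error

print(w)   # approaches [2.0, -3.0, 0.5]
```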
Linear models, such as those used by the perceptron and ADALINE, continue to be among the most widely used machine learning models. Despite their longstanding use, these models are often trained with methods that differ from the original training techniques.
Linear models have significant limitations, most famously their inability to learn the XOR function, which should output 1 for the inputs [0, 1] and [1, 0], but 0 for [1, 1] and [0, 0]. Critics who observed this fundamental flaw in linear models caused a backlash against biologically inspired learning in general (Minsky and Papert, 1969). This was the first major dip in the popularity of neural networks.
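The XOR limitation is easy to verify numerically. The following sketch (an illustration of the point above, not material from the book) fits the best possible linear model, including a bias term, to the four XOR cases by least squares; it predicts 0.5 for every input.

```python
import numpy as np

# XOR inputs and targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Append a constant feature so the model can also learn a bias term.
X_aug = np.hstack([X, np.ones((4, 1))])

# Best linear fit in the least-squares sense.
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(X_aug @ w)   # [0.5 0.5 0.5 0.5]: no linear model can represent XOR
```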
Today, neuroscience is regarded as an important source of inspiration for deep learning researchers, but it is no longer the predominant guide for the field.
The diminished influence of neuroscience in deep learning research stems from our limited understanding of the brain, as we lack sufficient data to use it as a guiding framework. To truly comprehend the algorithms employed by the brain, we would need to monitor the activity of thousands of interconnected neurons at once. Because we cannot yet do this, we are still far from understanding even the simplest and most well-studied regions of the brain.
Neuroscience has given us a reason to hope that a single deep learning algorithm can solve many different tasks. Neuroscientists have found that ferrets can learn to "see" with the auditory processing region of their brain if the brain is rewired to send visual signals to that area (Von Melchner et al., 2000).
This suggests that much of the mammalian brain may use a single algorithm to solve most of the different tasks it faces. Previously, machine learning research was more fragmented, with distinct communities focusing on areas like natural language processing, vision, motion planning, and speech recognition. Today, deep learning research groups often work across many, or even all, of these application domains simultaneously, reflecting a more integrated approach.
Neuroscience provides valuable insights into the development of artificial intelligence, particularly through the idea of many simple computational units that become intelligent via their interactions, mirroring the brain's structure. The Neocognitron, introduced by Fukushima in 1980, laid the groundwork for modern convolutional networks by mimicking the structure of the mammalian visual system. Current neural networks predominantly use the rectified linear unit as a model neuron, which evolved from the original Cognitron, a model deeply inspired by knowledge of brain function. Although neuroscience influences neural network design, it should not be viewed as a strict blueprint, as biological neurons compute functions quite different from these model units. Furthermore, while neuroscience has inspired various architectures, our limited understanding of biological learning restricts its usefulness for developing the learning algorithms used to train these architectures.
Deep learning is often compared to brain function in media discussions, but it should not be seen as an effort to replicate the brain. Unlike researchers in some other machine learning areas, deep learning experts frequently reference the brain as an influence. However, modern deep learning is primarily rooted in applied mathematics, including linear algebra, probability, information theory, and numerical optimization. While some researchers cite neuroscience as a source of inspiration, many others do not prioritize it in their work.
It is worth noting that the effort to understand how the brain works on an algorithmic level is alive and well. This endeavor is primarily known as
“computational neuroscience” and is a separate field of study from deep learning.
Researchers often move back and forth between deep learning and computational neuroscience. Deep learning focuses on building computer systems that effectively tackle tasks requiring intelligence, whereas computational neuroscience aims to develop accurate models of how the brain actually works.
Scalars, Vectors, Matrices and Tensors
The study of linear algebra involves several types of mathematical objects:
• Scalars: A scalar is just a single number, in contrast to most of the other objects studied in linear algebra, which are usually arrays of multiple numbers.
Scalars are typically written in italics and given lower-case variable names. When introducing a scalar, it is essential to specify its type, such as stating "Let s ∈ R be the slope of the line" for a real-valued scalar or "Let n ∈ N be the number of units" for a natural-number scalar.
A vector is an ordered array of numbers, where each number is identified by its index in that ordering. Vectors are typically given lowercase bold names, such as x, while their elements are written in italics with a subscript, for example x₁ for the first element and x₂ for the second. It is also essential to say what kind of numbers the vector contains: if each element is a real number and the vector has n elements, then the vector lies in the set Rⁿ, formed by taking the Cartesian product of R with itself n times. To present the elements of a vector explicitly, we can write them as a column enclosed in square brackets.
We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis.
In vector indexing, we often need to access a specific set of elements using a defined set of indices. For instance, to retrieve the elements x₁, x₃ and x₆, we can define the set S = {1, 3, 6} and write x_S. To denote the complement of a set, we use the minus sign; for example, x₋₁ refers to the vector containing all elements except x₁, while x₋S is the vector that omits x₁, x₃ and x₆.
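As a rough NumPy analogue of this notation (the array values are arbitrary, and NumPy indices start at 0 rather than 1):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

S = np.array([0, 2, 5])        # 0-indexed positions of x_1, x_3, x_6
x_S = x[S]                     # analogous to x_S = [x_1, x_3, x_6]

mask = np.ones(len(x), dtype=bool)
mask[S] = False
x_minus_S = x[mask]            # analogous to x_{-S}: everything except x_1, x_3, x_6

print(x_S)         # [10. 30. 60.]
print(x_minus_S)   # [20. 40. 50.]
```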
A matrix is a two-dimensional array of numbers, where each element is identified by two indices. Matrices are typically denoted by upper-case bold letters, such as A. If matrix A has height m and width n, we write A ∈ R^(m×n). The elements of a matrix are identified using its name in italic font, with indices separated by commas; for instance, A_{1,1} is the upper left entry, while A_{m,n} is the bottom right entry. The notation A_{i,:} refers to the i-th row of A, that is, all elements with vertical coordinate i, while A_{:,i} refers to the i-th column.
Sometimes we need to identify the elements of a matrix explicitly, in which case we write them as an array enclosed in square brackets. The transpose of a matrix, introduced below, can be visualized as a mirror image of this array across its main diagonal.
When indexing a matrix-valued expression that involves more than a single letter, we use subscripts after the expression without converting any letters to lowercase. For instance, f(A)_{i,j} denotes the element at position (i, j) of the matrix obtained by applying the function f to A.
• Tensors: In some cases we will need an array with more than two axes.
A tensor is an array of numbers arranged on a regular grid with a variable number of axes. We denote a tensor named "A" as A and refer to its element at coordinates (i, j, k) as A_{i,j,k}.
The transpose of a matrix is a crucial operation that reflects the matrix across its main diagonal, which extends from the upper left corner to the lower right.
See figure 2.1 for a graphical depiction of this operation. We denote the transpose of a matrix A as A⊤, and it is defined such that (A⊤)_{i,j} = A_{j,i}.
Vectors are essentially matrices with a single column, and the transpose of a vector is therefore a matrix with a single row. We often define a vector by writing out its elements inline as a row matrix and then using the transpose operator to turn it into a standard column vector; for example, x = [x₁, x₂, x₃]⊤.
A scalar can be thought of as a matrix with only a single entry. From this, we can see that a scalar is its own transpose: a = a⊤.
We can add matrices to each other, as long as they have the same shape, just by adding their corresponding elements: C = A + B, where C_{i,j} = A_{i,j} + B_{i,j}.
We can also add a scalar to a matrix or multiply a matrix by a scalar, just by performing that operation on each element of the matrix: D = a·B + c, where D_{i,j} = a·B_{i,j} + c.
In the context of deep learning, we also use some less conventional notation.
In particular, we allow the addition of a matrix and a vector, yielding another matrix: C = A + b, where C_{i,j} = A_{i,j} + b_j. In other words, the vector b is added to each row of the matrix A, which streamlines calculations by avoiding the need to explicitly replicate the vector in each row before adding. This implicit copying of b to many locations is called broadcasting.
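NumPy implements exactly this implicit copying and also calls it broadcasting; a small sketch with arbitrary values:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])       # shape (2, 3)
b = np.array([10.0, 20.0, 30.0])      # shape (3,)

C = A + b          # b is implicitly added to every row of A: C_ij = A_ij + b_j
print(C)
# [[11. 22. 33.]
#  [14. 25. 36.]]

# Equivalent explicit computation, replicating b into each row first:
C_explicit = A + np.tile(b, (A.shape[0], 1))
assert np.array_equal(C, C_explicit)
```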
Multiplying Matrices and Vectors
Matrix multiplication is a fundamental operation in linear algebra: the product of two matrices A and B is a third matrix C. For this product to be defined, the number of columns in matrix A must equal the number of rows in matrix B.
If A is of shape m×n and B is of shape n×p, then C is of shape m×p. We can write the matrix product just by placing two or more matrices together, e.g. C = AB.
The product operation is defined by C_{i,j} = Σ_k A_{i,k} B_{k,j}.
The standard product of two matrices is distinct from the element-wise product, also known as the Hadamard product, denoted A⊙B. The dot product of two vectors x and y of the same dimensionality is the matrix product x⊤y. The matrix product C = AB can therefore be viewed as computing each element C_{i,j} as the dot product between row i of A and column j of B.
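A short NumPy illustration of the three products just distinguished, using arbitrary example matrices:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

C = A @ B            # standard matrix product: C_ij = sum_k A_ik * B_kj
H = A * B            # element-wise (Hadamard) product: H_ij = A_ij * B_ij

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
d = x @ y            # dot product x^T y, a scalar

# Each entry of C is the dot product of a row of A with a column of B.
assert np.isclose(C[0, 1], A[0, :] @ B[:, 1])
print(C, H, d, sep="\n")
```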
Matrix product operations have many useful properties that make mathematical analysis of matrices more convenient. For example, matrix multiplication is distributive: A(B + C) = AB + AC.
Matrix multiplication is not commutative (the condition AB = BA does not always hold), unlike scalar multiplication. However, the dot product between two vectors is commutative: x⊤y = y⊤x. (2.8)
The transpose of a matrix product has a simple form: (AB)⊤ = B⊤A⊤. (2.9)
This allows us to demonstrate equation 2.8 by exploiting the fact that the value of such a product is a scalar and therefore equal to its own transpose: x⊤y = (x⊤y)⊤ = y⊤x. (2.10)
This textbook does not aim to provide an exhaustive list of the properties of the matrix product, as its primary focus is not linear algebra; however, readers should recognize that numerous additional properties are available.
We now know enough linear algebra notation to write down a system of linear equations:
The equation Ax = b (2.11) represents a system where A ∈ R^(m×n) is a known matrix, b ∈ R^m is a known vector, and x ∈ R^n is a vector of unknown variables we would like to solve for. Each element x_i of x is one of these unknown variables, and each row of A together with the corresponding element of b provides another constraint. We can rewrite equation 2.11 as:
A_{1,:} x = b_1 (2.12)
A_{2,:} x = b_2 (2.13)
…
A_{m,:} x = b_m (2.15)
or, even more explicitly, as:
A_{1,1} x_1 + A_{1,2} x_2 + ⋯ + A_{1,n} x_n = b_1 (2.16)
…
A_{m,1} x_1 + A_{m,2} x_2 + ⋯ + A_{m,n} x_n = b_m (2.18)
Figure 2.2: Example identity matrix: this is I₃.
Matrix-vector product notation provides a more compact representation for equations of this form.
Identity and Inverse Matrices
Linear algebra offers a powerful tool called matrix inversion that allows us to analytically solve equation 2.11 for many values of A.
Matrix inversion requires an understanding of the identity matrix, which is defined as a matrix that leaves any vector unchanged when we multiply that vector by it. The identity matrix for n-dimensional vectors is denoted Iₙ, with Iₙ ∈ R^(n×n) and ∀x ∈ Rⁿ, Iₙx = x.
The identity matrix has a straightforward structure: all the entries along the main diagonal are 1, while all other entries are 0. For a visual representation, refer to figure 2.2.
The matrix inverse of A is denoted as A⁻¹, and it is defined as the matrix such that A⁻¹A = Iₙ.
We can now solve equation 2.11 by the following steps:
Ax = b
A⁻¹Ax = A⁻¹b
Iₙx = A⁻¹b
x = A⁻¹b.
Of course, this process depends on it being possible to find A⁻¹. We discuss the conditions for the existence of A⁻¹ in the following section.
When A⁻¹ exists, several different algorithms can find it in closed form.
In theory, the inverse matrix A⁻¹ can be used to solve the equation for many different values of b; however, it is primarily a theoretical tool and is not practical for most software applications. Because a digital computer can represent A⁻¹ only with limited precision, algorithms that make use of the value of b can usually obtain more accurate estimates of x.
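A brief NumPy sketch of this point, using an arbitrary well-conditioned matrix: solving Ax = b directly is generally preferable to forming A⁻¹ explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
b = rng.normal(size=4)

x_solve = np.linalg.solve(A, b)       # solves Ax = b without forming A^-1 explicitly
x_inv = np.linalg.inv(A) @ b          # the textbook recipe x = A^-1 b

print(np.allclose(A @ x_solve, b))    # True
print(np.allclose(x_solve, x_inv))    # True here; solve() is typically more accurate
```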
Linear Dependence and Span
For A⁻¹ to exist, equation 2.11 must have exactly one solution for every value of b. The system may instead have no solutions or infinitely many solutions for some values of b. It is not possible, however, to have more than one but fewer than infinitely many solutions for a particular b; if both x and y are solutions, then z = αx + (1−α)y (2.26) is also a solution for any real α.
To determine the number of solutions to the equation, we can interpret the columns of A as specifying different directions we can travel from the origin (the point defined by the zero vector). Each element of x indicates how far to travel in its corresponding direction, with x_i denoting the movement along the i-th column of A. This perspective allows us to count the different ways of reaching the target vector b.
A linear combination of a set of vectors {v⁽¹⁾, v⁽²⁾, …, v⁽ⁿ⁾} is obtained by multiplying each vector v⁽ⁱ⁾ by a corresponding scalar coefficient c_i and adding the results: Σᵢ c_i v⁽ⁱ⁾.
The span of a set of vectors is the set of all points obtainable by linear combination of the original vectors.
To determine whether the equation Ax = b has a solution, we check whether the vector b lies within the span of the columns of A. This particular span is referred to as the column space, or range, of A.
For the system Ax = b to have a solution for every value of b in R^m, the column space of A must span all of R^m. If any point in R^m is not included in the column space, that point is a potential value of b for which no solution exists. This requirement implies that A must have at least m columns, that is, n ≥ m; otherwise, the dimensionality of the column space would be less than m.
For example, consider a 3×2 matrix: the target b is 3-D, but x has only two components, so modifying the value of x at best allows us to trace out a 2-D plane within R³. The equation has a solution if and only if b lies on that plane.
Having n ≥ m is only a necessary condition for every point to have a solution.
It is not a sufficient condition, because some of the columns may be redundant. A 2×2 matrix with identical columns illustrates this: it shares the same column space as a 2×1 matrix containing a single copy of the column. Consequently, the column space is still just a line and fails to span all of R², despite the matrix having two columns.
Linear dependence occurs when a set of vectors includes at least one vector that is a linear combination of the others; a set of vectors is linearly independent if no vector in the set is a linear combination of the others. Adding a vector that is a linear combination of existing vectors does not add any points to the span of the set. For the column space of a matrix to encompass all of R^m, the matrix must therefore contain at least one set of m linearly independent columns; this is both necessary and sufficient for equation 2.11 to have a solution for every value of b. Note that no set of m-dimensional vectors can contain more than m mutually linearly independent columns, but a matrix with more than m columns may contain more than one such set.
To ensure that a matrix has an inverse, we additionally need equation 2.11 to have at most one solution for each value of b. This requires the matrix to have at most m columns, as having more columns would give more than one way of parametrizing each solution.
Together, these requirements mean that the matrix must be square, that is, m = n, and that all of its columns must be linearly independent. A square matrix with linearly dependent columns is known as singular.
If a matrix is non-square, or square but singular, solving the equation may still be possible, but the method of matrix inversion cannot be used to find the solution.
So far we have discussed matrix inverses as being multiplied on the left. It is also possible to define an inverse that is multiplied on the right: AA⁻¹ = I.
For square matrices, the left inverse and right inverse are equal.
Norms
Sometimes we need to measure the size of a vector. In machine learning, we usually measure the size of vectors using a function called a norm. Formally, the L^p norm is given by ||x||_p = (Σᵢ |xᵢ|^p)^(1/p), (2.30) for p ∈ R, p ≥ 1.
Norms, including the L^p norm, are functions mapping vectors to non-negative values; intuitively, the norm of a vector x measures the distance from the origin to the point x. More rigorously, a norm is any function f that satisfies the following properties:
• f(x) = 0 ⇒ x = 0
• f(x + y) ≤ f(x) + f(y) (the triangle inequality)
• ∀α ∈ R, f(αx) = |α| f(x)
The L² norm, with p = 2, is known as the Euclidean norm. It is simply the Euclidean distance from the origin to the point identified by x, and it is used so frequently in machine learning that it is often denoted simply as ||x||, with the subscript 2 omitted. It is also common to measure the size of a vector using the squared L² norm, which can be calculated simply as x⊤x.
The squared L² norm is often preferred for its mathematical and computational convenience, as its derivatives with respect to each element of x depend solely on that element, unlike the L² norm, where the derivatives depend on the entire vector. However, the squared L² norm can be less desirable in certain contexts, particularly in machine learning, where distinguishing between exactly zero and small nonzero elements is crucial. In such situations, the L¹ norm is used instead, because it grows at the same rate in all locations while retaining mathematical simplicity.
The L¹ norm, which may be written as ||x||₁ = Σᵢ |xᵢ|, plays a crucial role in machine learning, especially when distinguishing between zero and nonzero elements is vital. Every time an element of x moves away from zero, the L¹ norm increases by the same amount.
The size of a vector can also be assessed by counting its nonzero elements, often informally referred to as the "L⁰ norm." However, this term is inaccurate, as the count of nonzero entries does not qualify as a norm, since scaling the vector by α does not alter this count. Instead, the L¹ norm is frequently used as a substitute for the number of nonzero entries in a vector.
The L^∞ norm, often referred to as the max norm in machine learning, is the absolute value of the element with the largest magnitude in the vector: ||x||_∞ = maxᵢ |xᵢ|.
Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure Frobenius norm ||A||_F = √(Σ_{i,j} A²_{i,j}), (2.33) which is analogous to the L² norm of a vector.
The dot product of two vectors can be rewritten in terms of norms. Specifically, x⊤y = ||x||₂ ||y||₂ cos θ, (2.34) where θ is the angle between x and y.
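The norms discussed above, computed with NumPy on arbitrary example vectors and an example matrix:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

l1 = np.abs(x).sum()                    # L1 norm: sum of absolute values -> 7.0
l2 = np.sqrt(x @ x)                     # L2 (Euclidean) norm -> 5.0
linf = np.abs(x).max()                  # max (L-infinity) norm -> 4.0
n_nonzero = np.count_nonzero(x)         # number of nonzero entries (not a true norm) -> 2

A = np.array([[1.0, 2.0], [3.0, 4.0]])
fro = np.sqrt((A ** 2).sum())           # Frobenius norm of a matrix

# Dot product in terms of norms and the angle between vectors (equation 2.34):
y = np.array([1.0, 1.0, 1.0])
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(l1, l2, linf, n_nonzero, fro, cos_theta)
```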
Special Kinds of Matrices and Vectors
Some special kinds of matrices and vectors are particularly useful.
Diagonal matrices have nonzero entries only along their main diagonal, with all other elements being zero. Formally, a matrix D is diagonal if and only if D_{i,j} = 0 for all i ≠ j. A well-known example of a diagonal matrix is the identity matrix, whose diagonal entries are all 1. We write diag(v) to denote a square diagonal matrix whose diagonal entries are given by the elements of the vector v.
Diagonal matrices are significant due to their computational efficiency in operations such as multiplication and inversion. When multiplying a vector x by a diagonal matrix diag(v), each element of x is simply scaled by the corresponding element of v: diag(v)x = v⊙x. Inversion of a square diagonal matrix is also efficient, provided all diagonal entries are nonzero, in which case diag(v)⁻¹ = diag([1/v₁, …, 1/vₙ]⊤). Often, while developing general machine learning algorithms with arbitrary matrices, restricting some matrices to diagonal form can lead to simpler and more efficient algorithms.
Diagonal matrices can be rectangular, not just square. While non-square diagonal matrices have no inverse, they still allow for efficient multiplication. When multiplying a non-square diagonal matrix D by a vector x, each element of x is scaled accordingly. If D is taller than it is wide, zeros are appended to the result; conversely, if D is wider than it is tall, some elements from the end of the vector are discarded.
A symmetric matrix is any matrix that is equal to its own transpose: A = A⊤.
Symmetric matrices frequently occur when their entries are derived from a function of two arguments that does not depend on the order of the arguments. For instance, if A is a matrix of distance measurements, the entry A_{i,j} gives the distance from point i to point j, and A_{i,j} = A_{j,i} because distance functions are symmetric.
A unit vector is a vector with unit norm: ||x||₂ = 1.
Vectors x and y are orthogonal if their dot product equals zero, x⊤y = 0. When both vectors have nonzero norms, this means they form a 90-degree angle with each other. In Rⁿ, at most n vectors with nonzero norm can be mutually orthogonal.
If the vectors are not only orthogonal but also have unit norm, we call them orthonormal.
An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal: A⊤A = AA⊤ = I.
Orthogonal matrices are of interest because their inverses are very cheap to compute: A⁻¹ = A⊤. It is important to note that the rows of orthogonal matrices are not merely orthogonal; they are fully orthonormal. There is no special term for matrices whose rows or columns are orthogonal but not orthonormal.
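A quick numerical check of these properties on an arbitrary orthogonal matrix (a 2-D rotation):

```python
import numpy as np

theta = 0.3                              # arbitrary rotation angle
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation matrices are orthogonal

print(np.allclose(A.T @ A, np.eye(2)))     # A^T A = I
print(np.allclose(A @ A.T, np.eye(2)))     # A A^T = I
print(np.allclose(np.linalg.inv(A), A.T))  # the inverse is just the transpose
```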
Eigendecomposition
Understanding mathematical objects can be enhanced by deconstructing them into their fundamental components or identifying universal properties that remain consistent regardless of their representation.
Integers can be understood through their prime factors: the number 12 can always be written as 2×2×3, regardless of whether we express it in base ten or in binary. This prime factorization reveals important characteristics, such as the fact that 12 is not divisible by 5 and that any multiple of 12 is divisible by 3.
Just as prime factorization reveals the fundamental nature of integers, matrix decomposition uncovers insights into their functional properties that may not be evident from their standard array representation.
One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.
An eigenvector of a square matrix A is a nonzero vector v such that multiplication by A alters only the scale of v: Av = λv.
The scalar λ is known as the eigenvalue corresponding to this eigenvector. (One can also find a left eigenvector such that v⊤A = λv⊤, but we are usually concerned with right eigenvectors.)
If v is an eigenvector of A, then any non-zero scalar multiple sv (with s ∈ R, s ≠ 0) is also an eigenvector of A with the same eigenvalue. For this reason, it is common practice to work only with unit eigenvectors.
Suppose that a matrix A has n linearly independent eigenvectors {v⁽¹⁾, …, v⁽ⁿ⁾} with corresponding eigenvalues {λ₁, …, λₙ}. We may concatenate all of the eigenvectors to form a matrix V with one eigenvector per column, and likewise concatenate the eigenvalues to form a vector λ = [λ₁, …, λₙ]⊤. The eigendecomposition of A is then given by A = V diag(λ) V⁻¹. (2.40)
Figure 2.3 illustrates the effect of eigenvectors and eigenvalues. It depicts a matrix A with two orthonormal eigenvectors, v⁽¹⁾ with eigenvalue λ₁ and v⁽²⁾ with eigenvalue λ₂. The left panel shows the set of all unit vectors u ∈ R² as a unit circle; the right panel shows the set of all points Au, revealing how A distorts the unit circle by scaling space in the direction of each eigenvector v⁽ⁱ⁾ by the corresponding eigenvalue λᵢ.
Constructing matrices with specific eigenvalues and eigenvectors enables us to stretch space in desired directions. More often, however, we want to go the other way: decomposing a matrix into its eigenvalues and eigenvectors helps us analyze key properties of the matrix, much as decomposing an integer into its prime factors helps us understand its behavior.
Not all matrices can be decomposed into eigenvalues and eigenvectors, and in some cases the decomposition exists but involves complex rather than real numbers. However, this book focuses on a particular class of matrices with a simple decomposition: every real symmetric matrix can be decomposed into an expression involving only real-valued eigenvectors and eigenvalues.
Specifically, A = QΛQ⊤, (2.41) where Q is an orthogonal matrix composed of the eigenvectors of A, and Λ is a diagonal matrix containing the corresponding eigenvalues. The eigenvalue Λ_{i,i} is associated with the eigenvector in the i-th column of Q, denoted Q_{:,i}. Because Q is orthogonal, we can think of A as scaling space by λᵢ in the direction of the eigenvector v⁽ⁱ⁾. For a visual representation, refer to figure 2.3.
A real symmetric matrix A always has an eigendecomposition, but this decomposition is not necessarily unique. When multiple eigenvectors share the same eigenvalue, any set of orthogonal vectors lying in their span can also serve as eigenvectors for that eigenvalue. By convention, the entries of the diagonal matrix Λ are arranged in descending order; under this convention, the eigendecomposition is unique only when all eigenvalues are distinct.
The eigendecomposition of a matrix reveals essential characteristics: in particular, the matrix is singular if and only if any of its eigenvalues are zero. For real symmetric matrices, the eigendecomposition can also be used to optimize quadratic expressions of the form f(x) = x⊤Ax subject to the constraint ||x||₂ = 1. Whenever x is an eigenvector of A, f takes on the value of the corresponding eigenvalue. Thus, within this constrained region, the maximum value of f is the maximum eigenvalue, and the minimum value is the minimum eigenvalue.
A matrix is positive definite if all its eigenvalues are positive, and positive semidefinite if its eigenvalues are all positive or zero. Conversely, a matrix is negative definite if all its eigenvalues are negative, and negative semidefinite if its eigenvalues are all negative or zero. Positive semidefinite matrices are significant because they guarantee that x⊤Ax ≥ 0 for all vectors x. Positive definite matrices additionally guarantee that x⊤Ax = 0 implies x = 0.
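A NumPy sketch of the eigendecomposition of a real symmetric matrix and of the quadratic-form and definiteness properties described above (the matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])              # real symmetric matrix

eigvals, Q = np.linalg.eigh(A)          # eigh: eigendecomposition for symmetric matrices
Lam = np.diag(eigvals)

print(np.allclose(A, Q @ Lam @ Q.T))    # A = Q Lambda Q^T

# For unit vectors x, x^T A x lies between the smallest and largest eigenvalues.
rng = np.random.default_rng(0)
x = rng.normal(size=2)
x /= np.linalg.norm(x)                  # constrain ||x||_2 = 1
quad = x @ A @ x
print(eigvals.min() <= quad <= eigvals.max())   # True

# All eigenvalues are positive, so A is positive definite.
print(np.all(eigvals > 0))
```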
Singular Value Decomposition
In section 2.7, we explored how to decompose a matrix into eigenvectors and eigenvalues. An alternative method, the singular value decomposition (SVD), factors a matrix into singular vectors and singular values, revealing much of the same information as the eigendecomposition. However, the SVD is more generally applicable: every real matrix has a singular value decomposition, whereas the eigenvalue decomposition is not defined for non-square matrices. Thus, when dealing with non-square matrices, the SVD must be used instead.
Recall that the eigendecomposition involves analyzing a matrix A to discover a matrix V of eigenvectors and a vector of eigenvalues λ such that we can rewrite A as A = V diag(λ) V⁻¹. (2.42)
The singular value decomposition is similar, except this time we will write A as a product of three matrices: A = UDV⊤. (2.43)
Suppose that A is an m×n matrix. Then U is defined to be an m×m matrix, D to be an m×n matrix, and V to be an n×n matrix.
The matrices U and V are orthogonal matrices, while the matrix D is defined as a diagonal matrix, which may not be square.
The diagonal elements of matrix D are referred to as the singular values of matrix A, while the columns of matrix U represent the left-singular vectors, and the columns of matrix V are identified as the right-singular vectors.
The singular value decomposition of a matrix A can be understood in terms of the eigendecomposition of functions of A. Specifically, the left-singular vectors of A are the eigenvectors of AA⊤, while the right-singular vectors are the eigenvectors of A⊤A. Additionally, the non-zero singular values of A are the square roots of the eigenvalues of A⊤A; the same is true of AA⊤.
One of the key advantages of Singular Value Decomposition (SVD) is its ability to extend the concept of matrix inversion to non-square matrices, which we will explore in the following section.
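A NumPy sketch of the SVD of a non-square matrix and of its relation to the eigendecomposition of A⊤A (the matrix is an arbitrary random example):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))                  # non-square, so no eigendecomposition

U, s, Vt = np.linalg.svd(A)                  # s holds the singular values
D = np.zeros_like(A)
np.fill_diagonal(D, s)

print(np.allclose(A, U @ D @ Vt))            # A = U D V^T

# Singular values are the square roots of the eigenvalues of A^T A.
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]  # sorted in decreasing order
print(np.allclose(s, np.sqrt(np.clip(eigvals, 0, None))))
```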
The Moore-Penrose Pseudoinverse
Matrix inversion is not defined for matrices that are not square. Suppose we want to make a left-inverse B of a matrix A, so that we can solve a linear equation
Ax = y (2.44) by left-multiplying each side to obtain x = By. (2.45)
Depending on the structure of the problem, it may not be possible to design a unique mapping from A to B.
If A is taller than it is wide, it is possible for the equation to have no solution; if A is wider than it is tall, there could be multiple solutions.
The Moore-Penrose pseudoinverse allows us to make some headway in these cases. The pseudoinverse of A is defined as the matrix A⁺ = lim_{α→0⁺} (A⊤A + αI)⁻¹ A⊤. (2.46)
Practical algorithms for computing the pseudoinverse are not based on this definition, but rather the formula
A⁺ = V D⁺ U⊤, (2.47) where U, D and V are the singular value decomposition of A, and the pseudoinverse
D⁺ of a diagonal matrix D is obtained by taking the reciprocal of its non-zero elements and then taking the transpose of the resulting matrix.
When a matrix A has more columns than rows, using the pseudoinverse to solve a linear equation yields one of the many possible solutions. Specifically, it provides the solution x = A⁺y with the smallest Euclidean norm ||x||₂ among all possible solutions.
When A has more rows than columns, it is possible for there to be no solution.
In this case, using the pseudoinverse gives us the x for which Ax is as close as possible to y in terms of the Euclidean norm ||Ax − y||₂.
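A NumPy illustration of both cases, using np.linalg.pinv and arbitrary random matrices as stand-ins for A and y:

```python
import numpy as np

rng = np.random.default_rng(0)

# Wide matrix: more columns than rows -> infinitely many solutions;
# the pseudoinverse picks the one with minimal ||x||_2.
A_wide = rng.normal(size=(2, 4))
y = rng.normal(size=2)
x = np.linalg.pinv(A_wide) @ y
print(np.allclose(A_wide @ x, y))        # x really is a solution

# Tall matrix: more rows than columns -> possibly no exact solution;
# the pseudoinverse gives the least-squares x minimizing ||Ax - y||_2.
A_tall = rng.normal(size=(4, 2))
y2 = rng.normal(size=4)
x2 = np.linalg.pinv(A_tall) @ y2
x2_lstsq, *_ = np.linalg.lstsq(A_tall, y2, rcond=None)
print(np.allclose(x2, x2_lstsq))         # matches the least-squares solution
```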
The Trace Operator
The trace operator gives the sum of all the diagonal entries of a matrix: Tr(A) = Σᵢ A_{i,i}. (2.48)
The trace operator is valuable for numerous applications, as it lets us write operations that would otherwise be awkward to express without summation notation, using matrix products and the trace operator instead. Notably, it offers an alternative way of writing the Frobenius norm of a matrix: ||A||_F = √Tr(AA⊤). (2.49)
Writing an expression in terms of the trace operator opens up opportunities to manipulate it using many useful identities. For example, the trace operator is invariant to the transpose operator: Tr(A) = Tr(A⊤). (2.50)
The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position, provided that the shapes of the corresponding matrices allow the resulting product to be defined:
Tr(ABC) = Tr(CAB) = Tr(BCA), (2.51) or more generally, Tr(∏ᵢ₌₁ⁿ F⁽ⁱ⁾) = Tr(F⁽ⁿ⁾ ∏ᵢ₌₁ⁿ⁻¹ F⁽ⁱ⁾). (2.52)
This invariance to cyclic permutation holds even if the resulting product has a different shape. For example, for A ∈ R^{m×n} and B ∈ R^{n×m}, we have
Tr(AB) = Tr(BA) (2.53) even though AB ∈ R^{m×m} and BA ∈ R^{n×n}.
Another useful fact to keep in mind is that a scalar is its own trace: a = Tr(a).
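A quick numerical check of these identities with arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))
B = rng.normal(size=(5, 3))

# Frobenius norm via the trace (equation 2.49).
print(np.isclose(np.linalg.norm(A, "fro"), np.sqrt(np.trace(A @ A.T))))

# Invariance to transposition and to cyclic permutation, even when
# AB and BA have different shapes (equation 2.53).
print(np.isclose(np.trace(A @ B), np.trace((A @ B).T)))
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))
```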
The Determinant
The determinant of a square matrix, denoted det(A), is a function that maps matrices to real numbers. It is equal to the product of the matrix's eigenvalues and can be thought of as a measure of how much multiplication by the matrix expands or contracts space. A determinant of 0 indicates that space is contracted completely along at least one dimension, causing it to lose all its volume, while a determinant of 1 signifies that the transformation preserves volume.
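A small numerical check of this relationship, using an arbitrary symmetric matrix so that the eigenvalues are real:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigvals = np.linalg.eigvalsh(A)
print(np.isclose(np.linalg.det(A), np.prod(eigvals)))   # det(A) = product of eigenvalues

# |det(A)| > 1 here, so multiplying by A expands volume in R^2.
print(np.linalg.det(A))   # 5.0
```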
Example: Principal Components Analysis
One simple machine learning algorithm, principal components analysis or PCA, can be derived using only knowledge of basic linear algebra.
In a scenario where we have a set of m points {x⁽¹⁾, …, x⁽ᵐ⁾} in Rⁿ, we aim to apply lossy compression. This technique allows us to store the points using less memory, albeit at the cost of some precision in the data representation.
We would like to lose as little precision as possible.
One way we can encode these points is to represent a lower-dimensional version of them. For each point x⁽ⁱ⁾ ∈ Rⁿ, we will find a corresponding code vector c⁽ⁱ⁾ ∈ Rˡ.
When the value of l is smaller than n, the memory required to store the code points is less than that required for the original data. Our goal is to identify an encoding function, f(x) = c, that produces the code for an input, and a decoding function that reconstructs the original input from its code, so that x ≈ g(f(x)).
PCA is defined by our choice of the decoding function. Specifically, to make the decoder very simple, we use matrix multiplication to map the code back into Rⁿ: g(c) = Dc, where D ∈ R^{n×l} is the matrix defining the decoding. Optimizing the code for an arbitrary decoder could be a difficult problem, so to keep the encoding problem easy, PCA constrains the columns of D to be orthogonal to each other. (Note that D is still not technically an orthogonal matrix unless l = n.)
With the problem as described so far, many solutions are possible, because we can decrease cᵢ proportionally for all points if we increase the scale of the corresponding column of D. To give the problem a unique solution, we constrain all the columns of D to have unit norm.
To develop an effective algorithm, we must first determine how to generate the optimal code point c* for each input point x. One way to do this is to minimize the distance between the input point x and its reconstruction g(c*), which we can measure using a norm. In the principal components algorithm, we use the L² norm: c* = arg min_c ||x − g(c)||₂.
Switching to the squared L² norm is advantageous, as both the L² norm and its squared version are minimized by the same value of c. This equivalence arises because the L² norm is always non-negative and the squaring function is monotonically increasing for non-negative arguments. Therefore, we can express the minimization problem as c* = arg min_c ||x − g(c)||₂².
The function being minimized simplifies to
(x − g(c))⊤ (x − g(c)) (2.56)
(by the definition of the L² norm, equation 2.30), which expands to
x⊤x − 2x⊤g(c) + g(c)⊤g(c) (2.58)
(because the scalar g(c)⊤x is equal to the transpose of itself).
We can now change the function being minimized again, to omit the first term, since this term does not depend on c:
c* = arg min_c −2x⊤g(c) + g(c)⊤g(c). (2.59)
To make further progress, we must substitute in the definition of g(c):
c* = arg min_c −2x⊤Dc + c⊤D⊤Dc (2.60)
= arg min_c −2x⊤Dc + c⊤c (2.62)
(by the orthogonality and unit norm constraints on D).
We can solve this optimization problem using vector calculus (see section 4.3 if you do not know how to do this):
∇_c (−2x⊤Dc + c⊤c) = 0
−2D⊤x + 2c = 0
c = D⊤x. (2.65)
This makes the algorithm efficient: we can optimally encode x using just a matrix-vector operation. To encode a vector, we apply the encoder function f(x) = D⊤x. (2.66)
Using a further matrix multiplication, we can also define the PCA reconstruction operation: r(x) = g(f(x)) = DD⊤x. (2.67)
Next, we need to choose the encoding matrix D. To do so, we again minimize the L² distance between the original inputs and their reconstructions. Since the same matrix D is used to decode all the data points, we can no longer consider the points in isolation; instead, we minimize the Frobenius norm of the matrix of errors computed over all dimensions and all points: D* = arg min_D √(Σ_{i,j} (x⁽ⁱ⁾_j − r(x⁽ⁱ⁾)_j)²) subject to D⊤D = I_l. (2.68)
To derive the algorithm for finding D*, we begin with the case where l = 1, so that D is just a single vector, d. Substituting equation 2.67 into equation 2.68, the problem reduces to
d* = arg min_d Σᵢ ||x⁽ⁱ⁾ − dd⊤x⁽ⁱ⁾||₂² subject to ||d||₂ = 1. (2.69)
This is the most direct way of performing the substitution, but it is not the most stylistically pleasing way to write the equation, since it places the scalar value d⊤x⁽ⁱ⁾ on the right of the vector d. Scalar coefficients are conventionally written on the left of the vectors they operate on, so it is more common to write such a formula as
d* = arg min_d Σᵢ ||x⁽ⁱ⁾ − d⊤x⁽ⁱ⁾ d||₂² subject to ||d||₂ = 1, (2.70)
or, exploiting the fact that a scalar is its own transpose, as
d* = arg min_d Σᵢ ||x⁽ⁱ⁾ − x⁽ⁱ⁾⊤d d||₂² subject to ||d||₂ = 1. (2.71)
The reader should aim to become familiar with such cosmetic rearrangements.
To streamline our analysis, we can rewrite the problem in terms of a single design matrix of examples, rather than summing over separate example vectors; this allows more compact notation. Let X ∈ R^{m×n} be the matrix defined by stacking all the vectors describing the points, such that X_{i,:} = x⁽ⁱ⁾⊤.
We can now rewrite the problem as d* = arg min_d ||X − Xdd⊤||²_F subject to d⊤d = 1. (2.72)
Disregarding the constraint for the moment, we can simplify the Frobenius norm portion as follows:
arg min_d ||X − Xdd⊤||²_F (2.73)
= arg min_d Tr((X − Xdd⊤)⊤ (X − Xdd⊤)) (2.74)
(by equation 2.49)
= arg min_d Tr(X⊤X − X⊤Xdd⊤ − dd⊤X⊤X + dd⊤X⊤Xdd⊤) (2.75)
= arg min_d Tr(X⊤X) − Tr(X⊤Xdd⊤) − Tr(dd⊤X⊤X) + Tr(dd⊤X⊤Xdd⊤) (2.76)
= arg min_d −Tr(X⊤Xdd⊤) − Tr(dd⊤X⊤X) + Tr(dd⊤X⊤Xdd⊤) (2.77)
(because terms not involving d do not affect the arg min)
= arg min_d −2 Tr(X⊤Xdd⊤) + Tr(dd⊤X⊤Xdd⊤) (2.78)
(because we can cycle the order of the matrices inside a trace, equation 2.52)
= arg min_d −2 Tr(X⊤Xdd⊤) + Tr(X⊤Xdd⊤dd⊤) (2.79)
(using the same property again)
At this point, we re-introduce the constraint:
arg min_d −2 Tr(X⊤Xdd⊤) + Tr(X⊤Xdd⊤dd⊤) subject to d⊤d = 1 (2.80)
= arg min_d −2 Tr(X⊤Xdd⊤) + Tr(X⊤Xdd⊤) subject to d⊤d = 1 (2.81)
(due to the constraint)
= arg min_d −Tr(X⊤Xdd⊤) subject to d⊤d = 1 (2.82)
= arg max_d Tr(X⊤Xdd⊤) subject to d⊤d = 1. (2.83)
This optimization problem may be solved using eigendecomposition. Specifically, the optimal d is given by the eigenvector of X⊤X corresponding to the largest eigenvalue.
This derivation focuses on the scenario where l = 1, resulting in the extraction of only the first principal component To obtain a complete set of principal components, the matrix D should consist of the l eigenvectors associated with the largest eigenvalues This concept can be demonstrated through proof by induction, which we suggest as an exercise for further understanding.
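The derivation above translates directly into a few lines of NumPy. The sketch below (not the book's code) uses synthetic, approximately low-rank data and recovers D as the top-l eigenvectors of X⊤X, then encodes with f(x) = D⊤x and reconstructs with DD⊤x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data matrix X with m = 200 points in R^5, built so that a few
# directions capture most of the variation (illustrative only).
m, n, l = 200, 5, 2
latent = rng.normal(size=(m, l))
mixing = rng.normal(size=(l, n))
X = latent @ mixing + 0.05 * rng.normal(size=(m, n))

# D = the l eigenvectors of X^T X with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)     # eigenvalues in ascending order
D = eigvecs[:, -l:]                            # columns: top-l eigenvectors (orthonormal)

codes = X @ D              # encoder f(x) = D^T x, applied to every row of X
X_rec = codes @ D.T        # reconstruction r(x) = D D^T x

print(D.shape, codes.shape)                    # (5, 2) (200, 2)
print(np.linalg.norm(X - X_rec, "fro") / np.linalg.norm(X, "fro"))  # small relative error
```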
Linear algebra is one of the fundamental branches of mathematics needed to understand deep learning. Another key area of mathematics ubiquitous in machine learning is probability theory.
In this chapter, we describe probability theory and information theory.
Probability theory serves as a mathematical framework for expressing uncertainty, enabling the quantification of uncertain statements through established axioms In artificial intelligence, it is utilized in two primary ways: first, to guide the reasoning processes of AI systems by designing algorithms that compute or approximate expressions based on probability; second, to theoretically analyze the behavior of proposed AI systems using probability and statistics.
Probability theory is essential across many scientific and engineering fields. This chapter aims to help readers, particularly those with a software engineering background and minimal knowledge of probability, grasp the concepts presented in this book.
While probability theory allows us to make uncertain statements and reason in the presence of uncertainty, information theory allows us to quantify the amount of uncertainty in a probability distribution.
If you have a background in probability and information theory, you may want to focus only on section 3.14 of this chapter, which covers the graphs used in structured probabilistic models for machine learning. However, if you are new to these concepts, this chapter provides enough information to conduct deep learning research projects, although we recommend consulting additional resources, such as Jaynes (2003), for further understanding.
Why Probability?
While many areas of computer science focus on deterministic and reliable systems, where hardware errors are infrequent and manageable, machine learning stands out by relying heavily on probability theory. This reliance on probabilistic models may seem unexpected, given the typically stable environments in which computer scientists and software engineers operate.
Machine learning constantly confronts uncertain and stochastic quantities, which can emerge from many sources. Since at least the 1980s, researchers have advocated using probability to quantify this uncertainty. Many of the key insights on this topic are due to the work of Pearl (1988).
Nearly all activities require some ability to reason in the presence of uncertainty.
It is challenging to identify any proposition or event that can be deemed absolutely true or guaranteed, aside from mathematical statements that are true by definition.
There are three possible sources of uncertainty:
1. Inherent stochasticity in the system being modeled. For example, the dynamics of subatomic particles are described as probabilistic in quantum mechanics, and a theoretical scenario such as a hypothetical card game may postulate that the cards are shuffled into a truly random order.

2. Incomplete observability. Deterministic systems can appear stochastic when we cannot observe all the variables that influence them. A prime example is the Monty Hall problem, where a contestant chooses one of three doors to win a prize. Although the outcome is predetermined given the contestant's choice, from the contestant's perspective the outcome is uncertain, highlighting the impact of incomplete information on decision-making.

3. Incomplete modeling. When a model discards some of the information that has been observed, the discarded information results in uncertainty in the model's predictions. For instance, a robot that accurately observes the location of surrounding objects may introduce uncertainty by discretizing space when predicting their future locations; the discretization means that each object could be anywhere within the discrete cell it occupies.
In many situations, opting for a simple but uncertain rule is more practical than a complex but certain one, even when a deterministic true rule exists and our modeling system has the fidelity to accommodate a complex rule. For example, the simple rule "Most birds fly" is cheap to develop and broadly applicable, whereas a rule of the form "Birds fly, except for very young birds that have not yet learned to fly, sick or injured birds that have lost the ability to fly, and flightless species including the cassowary, ostrich and kiwi" is expensive to develop, maintain and communicate, and after all this investment and effort it is still brittle and prone to failure.
Probability theory is essential for representing and reasoning about uncertainty in artificial intelligence, but it may not provide all the necessary tools. Originally designed to analyze event frequencies, probability theory applies naturally to repeatable events, such as drawing a hand of cards in poker. When we assign a probability p to an outcome, it implies that if the experiment were repeated infinitely many times, a proportion p of the outcomes would match. However, this reasoning is less applicable to unique situations, such as a doctor assessing a patient's likelihood of having the flu, where infinite replicas of the patient cannot be created. In this context, probability reflects a degree of belief, with 1 indicating certainty of having the flu and 0 indicating certainty of not having it. This distinction leads to two types of probability: frequentist probability, which focuses on the rates at which events occur, and Bayesian probability, which relates to qualitative levels of certainty.
To achieve common sense reasoning about uncertainty, Bayesian probabilities must be treated identically to frequentist probabilities For instance, calculating the probability of a poker player winning with specific cards uses the same formulas as determining the likelihood of a patient having a disease based on symptoms This equivalence arises from a set of common sense assumptions that necessitate the same axioms governing both types of probability, as discussed in Ramsey (1926).
Probability extends logical reasoning to address uncertainty, offering formal rules to assess the truth of propositions based on the assumed truth of others While logic focuses on definitive truths and falsehoods, probability theory quantifies the likelihood of a proposition being true, considering the probabilities of related propositions.
3.2 Random Variables
A random variable is a variable that can take on different values randomly. We typically denote the random variable itself with a lowercase letter in plain typeface and its possible values with lowercase script letters. For example, x1 and x2 are both possible values that the random variable x can take on. For vector-valued variables, we write the random variable as x and one of its values as x. On its own, a random variable is just a description of the possible states; it must be coupled with a probability distribution that specifies how likely each of those states is.
Random variables may be discrete or continuous. A discrete random variable has a finite or countably infinite number of states, which may include named categories that have no numerical value. A continuous random variable is associated with a real value.
3.3 Probability Distributions
A probability distribution describes how likely a random variable or set of random variables is to take on each of its possible states. The way we describe probability distributions depends on whether the variables are discrete or continuous.
3.3.1 Discrete Variables and Probability Mass Functions
A probability distribution over discrete variables may be described using a probability mass function (PMF), typically denoted with a capital P. Each random variable is usually associated with a different PMF, and the reader must infer which PMF to use based on the identity of the random variable rather than the name of the function:
P(x) is usually not the same as P(y).
The probability mass function maps a state of a random variable to the probability of that random variable taking on that state. The probability that x = x is denoted P(x), with a probability of 1 indicating that x = x is certain and a probability of 0 indicating that x = x is impossible. To make the PMF being used unambiguous, the random variable may be written explicitly, as in P(x = x). Alternatively, we may define a variable first and then use the ∼ notation to specify which distribution it follows later: x ∼ P(x).
Probability mass functions can act on many variables at the same time; such a distribution over many variables is known as a joint probability distribution. P(x = x, y = y) denotes the probability that x = x and y = y simultaneously, which may also be written P(x, y) for brevity.
To be a probability mass function on a random variable x, a function P must satisfy the following properties:
• The domain of P must be the set of all possible states of x.
• ∀x ∈ x, 0 ≤ P(x) ≤ 1. An impossible event has probability 0, and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring.
• ∑_{x∈x} P(x) = 1. We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring.
For example, consider a single discrete random variable x with k different states. We can place a uniform distribution on x, making each of its states equally likely, by setting its probability mass function to
P(x = x_i) = 1/k   (3.1)
for all i. We can see that this fits the requirements for a probability mass function: the value 1/k is positive because k is a positive integer, and
∑_i P(x = x_i) = ∑_i 1/k = k/k = 1,   (3.2)
so the distribution is properly normalized.
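As a small sketch (our own, not from the text, with k chosen arbitrarily), we can check these two requirements numerically for the uniform PMF using NumPy:

import numpy as np

k = 6                                # arbitrary number of discrete states
P = np.full(k, 1.0 / k)              # uniform PMF: P(x = x_i) = 1/k for each state

assert np.all((0 <= P) & (P <= 1))   # every probability lies in [0, 1]
assert np.isclose(P.sum(), 1.0)      # the probabilities sum to 1 (normalization)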
3.3.2 Continuous Variables and Probability Density Functions
When dealing with continuous random variables, we describe probability distributions using a probability density function (PDF) rather than a probability mass function. To be a probability density function, a function p must satisfy the following properties:
• The domain of p must be the set of all possible states of x.
• ∀x ∈ x, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.
• ∫ p(x) dx = 1.
A probability density function p(x) does not give the probability of a specific state directly; instead, the probability of landing inside an infinitesimal region with volume δx is given by p(x)δx.
We can integrate the density function to find the actual probability mass of a set of points: the probability that x lies in some set S is given by the integral of p(x) over that set. In the univariate case, the probability that x lies in the interval [a, b] is given by ∫_[a,b] p(x) dx.
For an example of a PDF, consider a uniform distribution over a continuous random variable on an interval of the real number line. We denote this by the function u(x; a, b), where a and b are the endpoints of the interval, with b > a.
The ";" notation means "parametrized by": x is the argument of the function, while a and b are parameters that define it. To ensure that there is no probability mass outside the interval, we say u(x; a, b) = 0 for all x not in [a, b]. Within [a, b], we have u(x; a, b) = 1/(b − a). This is nonnegative everywhere and integrates to 1. We often denote that x follows the uniform distribution on [a, b] by writing x ∼ U(a, b).
3.4 Marginal Probability
Sometimes we know the probability distribution over a set of variables and want to know the probability distribution over just a subset of them. The probability distribution over the subset is known as the marginal probability distribution.
For example, suppose we have discrete random variables x and y, and we know P(x, y). We can find P(x) with the sum rule:
∀x ∈ x, P(x = x) = ∑_y P(x = x, y = y).   (3.3)
The name "marginal probability" comes from the process of computing marginal probabilities on paper: when the values of P(x, y) are written in a grid with different values of x in rows and different values of y in columns, it is natural to sum across a row of the grid and then write P(x) in the margin of the paper just to the right of that row.
For continuous variables, we need to use integration instead of summation:
p(x) = ∫ p(x, y) dy.   (3.4)
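As an illustrative sketch (the joint table below is made up for this example), we can store a small discrete joint distribution P(x, y) as a NumPy array and recover both marginals with the sum rule of equation 3.3:

import numpy as np

# Hypothetical joint distribution P(x, y): rows index values of x, columns values of y.
P_xy = np.array([[0.10, 0.20],
                 [0.05, 0.15],
                 [0.30, 0.20]])

P_x = P_xy.sum(axis=1)   # sum over y: P(x = x) = sum_y P(x = x, y = y)
P_y = P_xy.sum(axis=0)   # sum over x gives the other marginal, P(y)

print(P_x)               # [0.3  0.2  0.5 ]
print(P_y)               # [0.45 0.55]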
3.5 Conditional Probability
In many cases, we are interested in the probability of some event given that some other event has happened; this is called a conditional probability. We denote the conditional probability that y = y given x = x as P(y = y | x = x), which can be computed with the formula
P(y = y | x = x) = P(y = y, x = x) / P(x = x).   (3.5)
The conditional probability is only defined when P(x = x) > 0; we cannot compute the conditional probability conditioned on an event that never happens.
It is important not to confuse conditional probability with computing what would happen if some action were taken. For instance, the conditional probability that a person is from Germany given that they speak German is quite high, but teaching a randomly selected person to speak German does not change their country of origin. Computing the consequences of an action is called making an intervention query, a concept from causal modeling that is beyond the scope of this book.
3.6 The Chain Rule of Conditional Probabilities
Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable:
P(x^(1), …, x^(n)) = P(x^(1)) ∏_{i=2}^{n} P(x^(i) | x^(1), …, x^(i−1)).   (3.6)
This observation is known as the chain rule, or product rule, of probability.
It follows immediately from the definition of conditional probability in equation 3.5.
For example, applying the definition twice, we get
P(a, b, c) = P(a | b, c) P(b, c),
P(b, c) = P(b | c) P(c),
P(a, b, c) = P(a | b, c) P(b | c) P(c).
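As a quick numerical sketch (our own construction), we can verify this factorization on an arbitrary joint distribution over three discrete variables stored as a NumPy array:

import numpy as np

rng = np.random.default_rng(0)
P = rng.random((2, 3, 2))
P /= P.sum()                           # an arbitrary joint distribution P(a, b, c)

P_a   = P.sum(axis=(1, 2))             # P(a)
P_ab  = P.sum(axis=2)                  # P(a, b)
P_b_given_a  = P_ab / P_a[:, None]     # P(b | a) = P(a, b) / P(a)
P_c_given_ab = P / P_ab[:, :, None]    # P(c | a, b) = P(a, b, c) / P(a, b)

# Chain rule: P(a, b, c) = P(a) P(b | a) P(c | a, b)
reconstructed = P_a[:, None, None] * P_b_given_a[:, :, None] * P_c_given_ab
assert np.allclose(reconstructed, P)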
3.7 Independence and Conditional Independence
Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and one involving only y:
∀x ∈ x, y ∈ y, p(x = x, y = y) = p(x = x) p(y = y).   (3.7)
Two random variables x and y are conditionally independent given a random variable z if the conditional probability distribution over x and y factorizes in this way for every value of z:
∀x ∈ x, y ∈ y, z ∈ z, p(x = x, y = y | z = z) = p(x = x | z = z) p(y = y | z = z).   (3.8)
We can denote independence and conditional independence with compact notation: x ⊥ y means that x and y are independent, while x ⊥ y | z means that x and y are conditionally independent given z.
3.8 Expectation, Variance and Covariance
The expectation, or expected value, of some function f(x) with respect to a probability distribution P(x) is the average value that f takes on when x is drawn from P. For discrete variables this can be computed with a summation,
E_{x∼P}[f(x)] = ∑_x P(x) f(x),   (3.9)
while for continuous variables, it is computed with an integral:
E_{x∼p}[f(x)] = ∫ p(x) f(x) dx.   (3.10)
When the identity of the distribution is clear from the context, we may simply write the name of the random variable that the expectation is over, as in E_x[f(x)].
If it is clear which random variable the expectation is over, we may omit the subscript entirely, as in E[f(x)]. By default, we can assume that E[·] averages over the values of all the random variables inside the brackets. Likewise, we may omit the square brackets when there is no ambiguity.
Expectations are linear; for example,
E_x[αf(x) + βg(x)] = αE_x[f(x)] + βE_x[g(x)],   (3.11)
when α and β are not dependent on x.
The variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution:
Var(f(x)) = E[(f(x) − E[f(x)])²].   (3.12)
When the variance is low, the values of f(x) cluster near their expected value. The square root of the variance is known as the standard deviation.
The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:
Cov(f(x), g(y)) = E[(f(x) − E[f(x)]) (g(y) − E[g(y)])].   (3.13)
High absolute values of the covariance mean that the values change a great deal and are both far from their respective means at the same time. A positive covariance means that both variables tend to take on relatively high values simultaneously, while a negative covariance means that one variable tends to take on a relatively high value when the other takes on a relatively low value, and vice versa. Unlike covariance, correlation normalizes the contribution of each variable in order to measure only how much the variables are related, rather than also being affected by their individual scales.
Covariance and dependence are related but distinct concepts. Independent variables have zero covariance, and variables with nonzero covariance are dependent. However, zero covariance does not imply independence: covariance measures only linear relationships, while independence also excludes nonlinear relationships, so independence is a stronger requirement. For example, suppose we first sample x from a uniform distribution over [−1, 1] and then sample a random sign s that is 1 or −1 with equal probability, setting y = sx. Then x and y are clearly dependent, because the magnitude of y is determined entirely by x, yet Cov(x, y) = 0. This illustrates that dependence can exist even when covariance is absent; see the sketch below.
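The following sketch (a Monte Carlo check of the example just described, our own code) shows that y = sx is dependent on x while having sample covariance close to zero:

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(-1.0, 1.0, size=n)      # x ~ U(-1, 1)
s = rng.choice([-1.0, 1.0], size=n)     # random sign, independent of x
y = s * x                               # y depends on x: |y| = |x| ...

print(np.cov(x, y)[0, 1])               # ... yet the sample covariance is near 0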
The covariance matrix of a random vector x ∈ R^n is an n×n matrix, such that
Cov(x)_{i,j} = Cov(x_i, x_j).   (3.14)
The diagonal elements of the covariance matrix give the variance:
Cov(x_i, x_i) = Var(x_i).   (3.15)
3.9 Common Probability Distributions
Several simple probability distributions are useful in many contexts in machine learning.
The Bernoulli distribution is a distribution over a single binary random variable.
It is controlled by a single parameter φ ∈ [0, 1], which gives the probability of the random variable being equal to 1. It has the following properties:
P(x = 1) = φ   (3.16)
P(x = 0) = 1 − φ   (3.17)
P(x = x) = φ^x (1 − φ)^(1−x)   (3.18)
E_x[x] = φ   (3.19)
Var_x(x) = φ(1 − φ)   (3.20)
The multinoulli, or categorical, distribution is a distribution over a single discrete variable with k different states, where k is finite.¹
¹ "Multinoulli" is a term that was recently coined by Gustavo Lacerda and popularized by Murphy (2012). The multinoulli distribution is a special case of the multinomial distribution.
A multinomial distribution is the distribution over vectors in {0, …, n}^k describing how many times each of the k categories is visited when n samples are drawn from a multinoulli distribution. Many texts use "multinomial" to refer to multinoulli distributions without clarifying that they refer only to the n = 1 case. The multinoulli distribution is parametrized by a vector p ∈ [0, 1]^(k−1), where p_i gives the probability of the i-th state; the probability of the final, k-th state is given by 1 − ∑_i p_i, and we must constrain ∑_i p_i ≤ 1. Multinoulli distributions are often used to refer to distributions over categories of objects, so we do not usually assume that the states have numerical values; for this reason, we do not usually need to compute the expectation or variance of multinoulli-distributed random variables.
The Bernoulli and multinoulli distributions are sufficient to describe any distribution over their domain. Their power comes not from complexity but from the fact that their domain is simple: they model discrete variables for which it is feasible to enumerate all of the states. Continuous variables, by contrast, have uncountably many states, so any distribution described by a small number of parameters must impose strict limits on its form.
The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:
N(x; µ, σ²) = sqrt(1 / (2πσ²)) exp(−(x − µ)² / (2σ²)).   (3.21)
See figure 3.1 for a plot of the density function.
The normal distribution is controlled by two parameters: µ ∈ R, which gives the mean of the distribution and the coordinate of its central peak, and σ ∈ (0, ∞), the standard deviation. The variance of the distribution is σ².
When we need to evaluate the PDF repeatedly with different parameter values, squaring and inverting σ each time is inefficient. A more efficient way of parametrizing the distribution is to use a parameter β ∈ (0, ∞) to control the precision, or inverse variance, of the distribution.
Normal distributions are a sensible default choice for many applications: when we lack prior knowledge about what form a distribution over real numbers should take, the normal distribution is a good default, for two main reasons.
Figure 3.1: The normal distribution N(x; µ, σ²) exhibits a classic "bell curve" shape, with the x-coordinate of its central peak given by µ and the width of the peak controlled by σ. In this example, we depict the standard normal distribution, with µ = 0 and σ = 1.
First, many distributions we wish to model are genuinely close to being normal distributions. The central limit theorem shows that the sum of many independent random variables is approximately normally distributed. In practice, this means that many complicated systems can be modeled successfully as normally distributed noise, even if the system can be decomposed into parts with more structured behavior.
Second, out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty, and thus inserts the least amount of prior knowledge into a model. Fully developing and justifying this idea is deferred to section 19.4.2.
The normal distribution generalizes to R^n, in which case it is known as the multivariate normal distribution. It may be parametrized with a positive definite symmetric matrix Σ:
N(x; µ, Σ) = sqrt(1 / ((2π)^n det(Σ))) exp(−½ (x − µ)⊤ Σ⁻¹ (x − µ)).   (3.23)
The parameter µ still gives the mean of the distribution, though now it is vector-valued. The parameter Σ gives the covariance matrix of the distribution.
As in the univariate case, when we wish to evaluate the PDF several times for many different values of the parameters, parametrizing the distribution by its covariance matrix is not computationally efficient, since we would need to invert Σ each time. A more efficient alternative is to parametrize the distribution with a precision matrix β.
We often fix the covariance matrix to be a diagonal matrix. An even simpler version is the isotropic Gaussian distribution, whose covariance matrix is a scalar times the identity matrix.
In the context of deep learning, we often want to have a probability distribution with a sharp point at x = 0. To accomplish this, we can use the exponential distribution:
p(x; λ) = λ 1_{x≥0} exp(−λx).   (3.25)
The exponential distribution uses the indicator function 1_{x≥0} to assign probability zero to all negative values of x.
A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point µ is the Laplace distribution,
Laplace(x; µ, γ) = (1 / (2γ)) exp(−|x − µ| / γ).   (3.26)
3.9.5 The Dirac Distribution and Empirical Distribution
In some cases, we wish to specify that all of the mass in a probability distribution clusters around a single point. This can be accomplished by defining a PDF using the Dirac delta function: p(x) = δ(x − µ).
The Dirac delta function is defined such that it is zero valued everywhere except at 0, yet it integrates to 1.
The Dirac delta function is not an ordinary function; it is a generalized function, defined in terms of its properties when integrated rather than by a conventional output for each input value. It can be thought of as the limit point of a series of functions that put less and less mass on all points other than zero.
By defining p(x) to be δ shifted by −µ, we obtain an infinitely narrow and infinitely high peak of probability mass where x = µ.
A common use of the Dirac delta distribution is as a component of an empirical distribution,
p̂(x) = (1/m) ∑_{i=1}^{m} δ(x − x^{(i)}).   (3.28)
The empirical distribution puts probability mass 1/m on each of the m points in a dataset, which is how the Dirac delta distribution makes it possible to define empirical distributions over continuous variables. For discrete variables, the situation is simpler: an empirical distribution can be conceptualized as a multinoulli distribution, with a probability for each possible input value that is simply equal to the empirical frequency of that value in the training set.
The empirical distribution formed from a training dataset specifies the distribution that we sample from when we train a model on that dataset. Another important perspective is that the empirical distribution is the probability density that maximizes the likelihood of the training data.
It is also common to define probability distributions by combining other, simpler probability distributions. One common way of doing so is to construct a mixture distribution. A mixture distribution is made up of several component distributions; on each trial, the choice of which component distribution generates the sample is determined by sampling a component identity from a multinoulli distribution:
P(x) = ∑_i P(c = i) P(x | c = i),   (3.29)
where P(c) is the multinoulli distribution over component identities.
We have already seen one example of a mixture distribution: the empirical distribution over real-valued variables is a mixture distribution with one Dirac component for each training example.
The mixture model is one simple strategy for combining probability distributions to create a richer distribution. In chapter 16, we explore the art of building complex probability distributions from simple ones in more detail; a small sampling sketch appears below.
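The following sketch (illustrative only, with arbitrarily chosen parameters) samples from a one-dimensional Gaussian mixture by first drawing a component identity from the multinoulli distribution P(c) and then drawing from the selected Gaussian component:

import numpy as np

rng = np.random.default_rng(0)

weights = np.array([0.2, 0.5, 0.3])    # multinoulli P(c) over component identities
means   = np.array([-2.0, 0.0, 3.0])   # component means (arbitrary example values)
stds    = np.array([0.5, 1.0, 0.3])    # component standard deviations

n = 10_000
c = rng.choice(len(weights), size=n, p=weights)   # which component generates each sample
x = rng.normal(loc=means[c], scale=stds[c])       # sample from the chosen Gaussian

print(x.mean(), x.std())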
3.10 Useful Properties of Common Functions
Certain functions arise often while working with probability distributions, especially the probability distributions used in deep learning models.
One of these functions is the logistic sigmoid:
σ(x) = 1 / (1 + exp(−x)).   (3.30)
The logistic sigmoid is commonly used to produce the φ parameter of a Bernoulli distribution because its range is (0, 1), which lies within the valid range of values for φ.
Figure 3.2: Samples from a Gaussian mixture model with three components. The first component has an isotropic covariance matrix, meaning it has the same amount of variance in each direction. The second has a diagonal covariance matrix, meaning it can control the variance separately along each axis-aligned direction; this component has more variance along the x2 axis than along the x1 axis. The third component has a full-rank covariance matrix, allowing it to control the variance separately along arbitrary directions.
The sigmoid function saturates when its argument is very positive or very negative, meaning that the function becomes very flat and insensitive to small changes in its input; see figure 3.3.
Another commonly encountered function is the softplus function (Dugas et al., 2001):
ζ(x) = log(1 + exp(x)).   (3.31)
The softplus function can be useful for producing the β or σ parameter of a normal distribution because its range is (0, ∞). It also arises commonly when manipulating expressions involving sigmoids. The name of the softplus function comes from the fact that it is a smoothed, or "softened," version of x⁺ = max(0, x).
See figure 3.4 for a graph of the softplus function.
The following properties are all useful enough that you may wish to memorize them:
Figure 3.3: The logistic sigmoid function.
Figure 3.4: The softplus function.
σ(x) = exp(x) / (exp(x) + exp(0))   (3.33)
d/dx σ(x) = σ(x)(1 − σ(x))   (3.34)
The function σ⁻¹(x) is called the logit in statistics, but this term is more rarely used in machine learning.
The softplus function is intended as a smoothed version of the positive part function, x⁺ = max{0, x}. The positive part function is the counterpart of the negative part function, x⁻ = max{0, −x}. To obtain a smooth function analogous to the negative part, one can use ζ(−x). Just as x can be recovered from its positive and negative parts via the identity x⁺ − x⁻ = x, it is also possible to recover x using the same relationship between ζ(x) and ζ(−x), namely ζ(x) − ζ(−x) = x (equation 3.41); the sketch below checks this identity numerically.
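A minimal numerical sketch (our own, using NumPy) of these properties; the softplus here is written with np.logaddexp for numerical stability, an implementation detail not prescribed by the text:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # zeta(x) = log(1 + exp(x)); np.logaddexp(0, x) computes log(exp(0) + exp(x)) stably
    return np.logaddexp(0.0, x)

x = np.linspace(-5.0, 5.0, 11)
assert np.allclose(1.0 - sigmoid(x), sigmoid(-x))     # 1 - sigma(x) = sigma(-x)
assert np.allclose(softplus(x) - softplus(-x), x)     # zeta(x) - zeta(-x) = x

# Check d/dx sigma(x) = sigma(x)(1 - sigma(x)) against a finite-difference estimate
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
assert np.allclose(numeric, sigmoid(x) * (1.0 - sigmoid(x)), atol=1e-6)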
3.11 Bayes’ Rule
We often find ourselves in a situation where we know P(y | x) and need to know P(x | y). Fortunately, if we also know P(x), we can compute the desired quantity using Bayes' rule:
P(x | y) = P(x) P(y | x) / P(y).   (3.42)
Note that while P(y) appears in the formula, it is usually feasible to compute P(y) = ∑_x P(y | x) P(x), so we do not need to begin with knowledge of P(y).
Bayes' rule is named after the Reverend Thomas Bayes, who first discovered a special case of the formula; it is straightforward to derive from the definition of conditional probability. The general version of the rule was discovered independently by Pierre-Simon Laplace.
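As a numerical sketch (the probabilities below are made up for illustration), Bayes' rule applied to a diagnosis scenario like the one mentioned earlier, with P(y) computed by marginalizing over x:

# Hypothetical numbers for illustration only.
P_disease = 0.01                       # prior P(x): 1% of patients have the disease
P_pos_given_disease = 0.95             # P(y | x): probability of a positive test if sick
P_pos_given_healthy = 0.05             # probability of a positive test if healthy

# P(y) = sum_x P(y | x) P(x)
P_pos = (P_pos_given_disease * P_disease
         + P_pos_given_healthy * (1 - P_disease))

# Bayes' rule: P(x | y) = P(x) P(y | x) / P(y)
P_disease_given_pos = P_disease * P_pos_given_disease / P_pos
print(P_disease_given_pos)             # approximately 0.16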
3.12 Technical Details of Continuous Variables
A proper formal understanding of continuous random variables and probability density functions requires developing probability theory in terms of measure theory, a specialized branch of mathematics. This textbook does not delve deeply into measure theory, but it is worth highlighting some of the issues that measure theory is designed to address.
In section 3.3.2, we saw that the probability of a continuous vector-valued variable x lying in some set S is given by integrating the probability density function p(x) over that set. However, certain choices of the set S can produce paradoxical results.
For example, it is possible to construct two sets S1 and S2 such that the sum of their probabilities exceeds one, even though S1 ∩ S2 = ∅. Such sets are generally constructed by making very heavy use of the infinite precision of real numbers, for instance by building fractal-shaped sets or sets defined by transforming the set of rational numbers; the Banach–Tarski theorem provides a well-known example of such paradoxical constructions. One of the key contributions of measure theory is to characterize the sets for which we can compute probabilities without encountering paradoxes. In this book, we only integrate over sets with relatively simple descriptions, so this aspect of measure theory never becomes a relevant concern.
Measure theory is also useful for describing theorems that apply to most points in R^n but not to certain corner cases. It provides a rigorous way of describing sets of points that are negligibly small: such a set is said to have measure zero, which intuitively means that it occupies no volume in the space we are measuring. For example, within R², a line has measure zero, while a filled polygon has positive measure; likewise, an individual point has measure zero. Any union of countably many sets that each have measure zero also has measure zero (so, for instance, the set of all rational numbers has measure zero).
Another useful term from measure theory is "almost everywhere." A property that holds almost everywhere holds throughout all of space except on a set of measure zero. Because the exceptions occupy a negligible amount of space, they can be safely ignored in many applications. Some important results in probability theory hold for all discrete values but hold only "almost everywhere" for continuous values.
Another technical detail of continuous variables relates to handling continuous random variables that are deterministic functions of one another. Suppose we have two random variables x and y such that y = g(x), where g is an invertible, continuous, differentiable transformation. One might expect that p_y(y) = p_x(g⁻¹(y)). This is actually not the case.
As an example, suppose we have scalar random variables x and y, with y = x/2 and x ∼ U(0, 1). If we use the rule p_y(y) = p_x(2y), then p_y will be 0 everywhere except on the interval [0, 1/2], where it will be 1. This means ∫ p_y(y) dy = 1/2, which violates the definition of a probability distribution.
This common mistake arises from failing to account for the distortion of space introduced by the function g. Recall that the probability of x lying in an infinitesimally small region with volume δx is given by p(x)δx. Since g can expand or contract space, the infinitesimal volume surrounding x in x-space may have a different volume in y-space.
To see how to correct the problem, we return to the scalar case. We need to preserve the property
|p_y(g(x)) dy| = |p_x(x) dx|.   (3.44)
Solving from this, we obtain
p_y(y) = p_x(g⁻¹(y)) |∂x/∂y|,
or equivalently
p_x(x) = p_y(g(x)) |∂g(x)/∂x|.
In higher dimensions, the derivative generalizes to the determinant of the Jacobian matrix, the matrix with J_{i,j} = ∂x_i/∂y_j. Thus, for real-valued vectors x and y,
p_x(x) = p_y(g(x)) |det(∂g(x)/∂x)|.
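A quick Monte Carlo sketch (our own) of the corrected rule for the y = x/2 example above: since |dx/dy| = 2, the density of y should be approximately 2 everywhere on [0, 1/2]:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)
y = x / 2.0                                        # y = g(x) = x / 2

# The naive rule p_y(y) = p_x(2y) would give density 1 on [0, 1/2] (integrating to 1/2).
# The corrected rule p_y(y) = p_x(g^{-1}(y)) |dx/dy| gives 1 * 2 = 2 on [0, 1/2].
hist, edges = np.histogram(y, bins=20, range=(0.0, 0.5), density=True)
print(hist.round(2))                               # each bin is approximately 2.0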
3.13 Information Theory
Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal. It was originally developed to study sending messages over noisy channels, such as communication via radio transmission, and it tells us how to design optimal coding schemes and how to calculate the expected length of messages sampled from specific probability distributions. In machine learning, we can also apply information theory to continuous variables, where some of these message-length interpretations do not apply. The field is fundamental to many areas of electrical engineering and computer science; in this book, we mostly use a few key ideas from it to characterize probability distributions or to quantify similarity between probability distributions. For more detail on information theory, see Cover and Thomas (2006) or MacKay (2003).
The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. A message saying "the sun rose this morning" is so uninformative as to be unnecessary to send, but a message saying "there was a solar eclipse this morning" is very informative.
We would like to quantify information in a way that formalizes this intuition. Specifically,
• Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.
• Less likely events should have higher information content.
• Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.
In order to satisfy all three of these properties, we define the self-information of an event x = x to be
I(x) = −log P(x).   (3.48)
In this book, we always use log to mean the natural logarithm, with base e, so our unit of self-information is the nat. One nat is the amount of information gained by observing an event of probability 1/e. Other texts use base-2 logarithms and units called bits or shannons; information measured in bits is just a rescaling of information measured in nats.
When x is continuous, we use the same definition of information by analogy, but some of the properties from the discrete case are lost. For example, an event with unit density still has zero information, despite not being an event that is guaranteed to occur.
Self-information deals only with a single outcome. We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy:
H(x) = E_{x∼P}[I(x)] = −E_{x∼P}[log P(x)],   (3.49)
also denoted H(P). In other words, the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the average number of bits (if the logarithm is base 2; otherwise the units differ) needed to encode symbols drawn from the distribution P.
Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. When x is continuous, the Shannon entropy is known as the differential entropy.
If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the Kullback–Leibler (KL) divergence:
D_KL(P‖Q) = E_{x∼P}[log(P(x)/Q(x))] = E_{x∼P}[log P(x) − log Q(x)].   (3.50)
In the case of discrete variables, the KL divergence is the amount of extra information (measured in bits if we use the base-2 logarithm, or in nats if we use the natural logarithm) needed to send a message containing symbols drawn from probability distribution P, when we use a code that was designed to minimize the length of messages drawn from probability distribution Q.
The KL divergence is nonnegative, and it is zero if and only if P and Q are the same distribution (for discrete variables) or equal "almost everywhere" (for continuous variables). Because it is nonnegative and measures the difference between two distributions, it is often conceptualized as a kind of distance between them; however, it is not a true distance measure because it is not symmetric: D_KL(P‖Q) ≠ D_KL(Q‖P) for some P and Q.
Figure 3.5: This plot shows how distributions that are closer to deterministic have low Shannon entropy while distributions that are close to uniform have high Shannon entropy.
The horizontal axis is p, the probability of a binary random variable being equal to 1; the entropy is given by (p − 1) log(1 − p) − p log p. When p is near 0 or 1, the distribution is nearly deterministic, because the random variable is nearly always 0 or nearly always 1, so the entropy is low. When p = 0.5, the entropy is maximal, because the distribution is uniform over the two outcomes. Because the KL divergence is not symmetric, there are important consequences to the choice of whether to use D_KL(P‖Q) or D_KL(Q‖P); see figure 3.6.
A quantity that is closely related to the KL divergence is the cross-entropy H(P, Q) = H(P) + D_KL(P‖Q), which is similar to the KL divergence but lacks the term on the left:
H(P, Q) = −E_{x∼P}[log Q(x)].
Minimizing the cross-entropy with respect to Q is equivalent to minimizing the
KL divergence, because Q does not participate in the omitted term.
When computing many of these quantities, it is common to encounter expressions of the form 0 log 0. By convention, in the context of information theory, we treat these expressions as lim_{x→0} x log x = 0. The sketch below computes the KL divergence and cross-entropy for small discrete distributions using this convention.
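A minimal sketch (our own helper functions, not a prescribed API) for discrete distributions represented as probability vectors, measuring everything in nats:

import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) [log P(x) - log Q(x)], with 0 log 0 treated as 0
    mask = p > 0
    return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))

def cross_entropy(p, q):
    # H(P, Q) = H(P) + D_KL(P || Q) = -sum_x P(x) log Q(x)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

p = np.array([0.7, 0.2, 0.1])
q = np.array([1/3, 1/3, 1/3])
print(kl_divergence(p, q), kl_divergence(q, p))   # the two values differ: KL is asymmetric
print(cross_entropy(p, q))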
3.14 Structured Probabilistic Models
Machine learning algorithms often involve probability distributions over a very large number of random variables. Often, these probability distributions involve direct interactions between relatively few variables. Using a single function to describe the entire joint probability distribution can be very inefficient (both computationally and statistically).

Figure 3.6: The KL divergence is asymmetric. Suppose we have a distribution p(x) and wish to approximate it with another distribution q(x). We have the choice of minimizing either D_KL(p‖q) or D_KL(q‖p), and the choice matters; here, p is a mixture of two Gaussians and q is a single Gaussian. (Left) the effect of minimizing D_KL(p‖q): we select a q that has high probability where p has high probability, so when p has multiple modes, q tends to blur the modes together in order to put probability mass on all of them. (Right) the effect of minimizing D_KL(q‖p): we select a q that has low probability where p has low probability, so when p has multiple modes that are sufficiently widely separated, q typically chooses a single mode, in order to avoid putting probability mass in the low-probability regions between the modes. Which direction of the KL divergence to use is problem dependent: some applications require an approximation that places high probability anywhere the true distribution places high probability, while others require an approximation that rarely places high probability anywhere the true distribution places low probability. Note that when the modes are not separated strongly enough, this direction of the KL divergence can still choose to blur the modes.
Instead of using a single function to represent a probability distribution, we can split it into many factors that we multiply together. For example, suppose we have three random variables a, b and c, where a influences the value of b, b influences the value of c, and a and c are independent given b. We can then represent the probability distribution over all three variables as a product of probability distributions over two variables: p(a, b, c) = p(a) p(b | a) p(c | b).
These factorizations can greatly reduce the number of parameters needed to describe a distribution. Each factor uses a number of parameters that is exponential in the number of variables in the factor, so if we can find a factorization into distributions over fewer variables, we greatly reduce the cost of representing the distribution.
We can describe these kinds of factorizations using graphs. Here we use the word "graph" in the sense of graph theory: a set of vertices that may be connected to each other with edges. When we represent the factorization of a probability distribution with a graph, we call it a structured probabilistic model, or graphical model.
There are two main kinds of structured probabilistic models: directed and undirected. Both kinds of graphical models use a graph G in which each node corresponds to a random variable, and an edge connecting two random variables means that the probability distribution is able to represent direct interactions between those two random variables.
Directed models use graphs with directed edges, and they represent factorizations into conditional probability distributions. Specifically, a directed model contains one factor for every random variable x_i in the distribution, and that factor consists of the conditional distribution over x_i given the parents of x_i in the graph, denoted Pa_G(x_i):
p(x) = ∏_i p(x_i | Pa_G(x_i)).   (3.53)
See figure 3.7 for an example of a directed graph and the factorization of probability distributions it represents.
Undirected models use graphs with undirected edges, and they represent factorizations into a set of functions; unlike in the directed case, these functions are usually not probability distributions of any kind.

Figure 3.7: A directed graphical model over random variables a, b, c, d and e. This graph corresponds to probability distributions that can be factored as
p(a, b, c, d, e) = p(a) p(b | a) p(c | a, b) p(d | b) p(e | c).   (3.54)
This graph allows us to quickly see some properties of the distribution: for example, a and c interact directly, but a and e interact only indirectly, via c.

Any set of nodes that are all connected to each other in G is called a clique. Each clique C^(i) in an undirected model is associated with a factor φ^(i)(C^(i)). These factors are just functions, not probability distributions: the output of each factor must be nonnegative, but there is no constraint that the factors sum or integrate to one like a probability distribution.
The probability of a configuration of random variables is proportional to the product of all these factors: configurations with larger factor values are more likely. Of course, there is no guarantee that this product will sum to 1. We therefore divide by a normalizing constant Z, defined to be the sum or integral over all states of the product of the φ functions, in order to obtain a normalized probability distribution.
See figure 3.8 for an example of an undirected graph and the factorization of probability distributions it represents.
Keep in mind that these graphical representations of factorizations are a language for describing probability distributions; they are not mutually exclusive families of distributions. Being directed or undirected is not a property of a probability distribution itself, but rather a property of a particular description of the distribution.
Figure 3.8: An undirected graphical model over random variables a, b, c, d and e. This graph corresponds to probability distributions that can be factored as
p(a, b, c, d, e) = (1/Z) φ^(1)(a, b, c) φ^(2)(b, d) φ^(3)(c, e).   (3.55)
This graph allows us to quickly see some properties of the distribution: for example, a and c interact directly, but a and e interact only indirectly, via c. Any probability distribution may be described in either of these two ways.

In Parts I and II of this book, structured probabilistic models serve merely as a language to describe the direct probabilistic relationships that various machine learning algorithms choose to represent. A deeper understanding of these models is not required until Part III, where we delve into structured probabilistic models in greater detail as a research topic in their own right.
This chapter has reviewed the basic concepts of probability theory that are most relevant to deep learning One more set of fundamental mathematical tools remains: numerical methods.
Machine learning algorithms usually require a high amount of numerical computation, typically using iterative methods that refine estimates of a solution rather than analytically deriving a formula. Common operations include optimization (finding the value of an argument that minimizes or maximizes a function) and solving systems of linear equations. Even just evaluating a mathematical function on a digital computer can be difficult when the function involves real numbers, which cannot be represented precisely with a finite amount of memory.
4.1 Overflow and Underflow
The fundamental difficulty in performing continuous mathematics on a digital computer is that we need to represent infinitely many real numbers with a finite number of bit patterns. Consequently, almost all real numbers incur some approximation error when represented. In many cases this is just rounding error, but rounding error can become significant when it compounds across many operations; algorithms that work in theory can fail in practice if they are not designed to minimize the accumulation of rounding error.
One form of rounding error that is particularly devastating is underflow.
Underflow occurs when numbers near zero are rounded to zero, which can qualitatively change the behavior of functions. For example, we usually want to avoid division by zero (some software environments raise an exception when this occurs, others return a placeholder not-a-number value) or taking the logarithm of zero (usually treated as −∞, which then becomes not-a-number if used in further arithmetic).
Another damaging form of numerical error is overflow, which occurs when numbers with large magnitude are approximated as ∞ or −∞. Further arithmetic usually turns these infinite values into not-a-number values. One example of a function that must be stabilized against both underflow and overflow is the softmax function, which is often used to predict the probabilities associated with a multinoulli distribution. The softmax function is defined as
softmax(x)_i = exp(x_i) / ∑_{j=1}^{n} exp(x_j).   (4.1)
Consider what happens when all of the x_i are equal to some constant c. Analytically, all of the outputs should equal 1/n. Numerically, this may not occur when c has large magnitude: if c is very negative, exp(c) underflows, the denominator becomes 0, and the result is undefined; if c is very large and positive, exp(c) overflows and the expression is again undefined. Both of these difficulties can be resolved by instead evaluating softmax(z), where z = x − max_i x_i. Subtracting max_i x_i ensures that the largest argument to exp is 0, ruling out overflow, and that at least one term in the denominator is 1, ruling out underflow in the denominator and division by zero.
There is still one small problem: underflow in the numerator can cause the expression as a whole to evaluate to zero, so naively implementing log softmax(x) by first running the softmax subroutine and then taking the log would erroneously yield −∞. Instead, we must implement a separate function that computes log softmax in a numerically stable way, using the same stabilization trick as for the softmax function; a sketch of both appears below.
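A minimal NumPy sketch (our own, not a library API) of the stabilization trick described above:

import numpy as np

def softmax(x):
    z = x - np.max(x)                      # subtract max_i x_i: largest exponent is 0
    e = np.exp(z)
    return e / e.sum()

def log_softmax(x):
    z = x - np.max(x)
    return z - np.log(np.sum(np.exp(z)))   # never takes the log of an underflowed 0

x = np.array([1000.0, 1000.0, 1000.0])     # naive exp(x) would overflow to inf
print(softmax(x))                          # [1/3, 1/3, 1/3]
print(log_softmax(x))                      # three finite values, each equal to log(1/3)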
For the most part, this book does not dwell on all of the numerical considerations involved in implementing the various algorithms it describes, but developers of low-level libraries should keep these issues in mind when implementing deep learning algorithms. Most readers can simply rely on low-level libraries that provide stable implementations. In some cases, it is possible to implement a new algorithm and have the implementation stabilized automatically: Theano, for example, is a software package that detects and stabilizes many common numerically unstable expressions that arise in deep learning.
4.2 Poor Conditioning
Conditioning refers to how rapidly a function changes with respect to small changes in its inputs. Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation, because rounding errors in the inputs can result in large changes in the output.
Consider the function f(x) = A⁻¹x. When A ∈ R^{n×n} has an eigenvalue decomposition, its condition number is
max_{i,j} |λ_i / λ_j|.   (4.2)
This is the ratio of the magnitude of the largest and smallest eigenvalue. When this number is large, matrix inversion is particularly sensitive to error in the input.
This sensitivity is an intrinsic property of the matrix itself, not the result of rounding error during matrix inversion. Poorly conditioned matrices amplify pre-existing errors when we multiply by the true matrix inverse, and in practice the error is compounded further by numerical errors in the inversion process itself; a small numerical illustration appears below.
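A small sketch (our own, with an arbitrarily chosen matrix) computing the condition number of equation 4.2 and showing how a poorly conditioned system amplifies a tiny change in the input:

import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1e-6]])                  # eigenvalues 1 and 1e-6

eigvals = np.linalg.eigvals(A)
cond = np.max(np.abs(eigvals)) / np.min(np.abs(eigvals))
print(cond)                                  # condition number: 1e6

b = np.array([1.0, 1.0])
db = np.array([0.0, 1e-4])                   # a tiny perturbation of the input
x1 = np.linalg.solve(A, b)                   # solves A x = b, i.e. computes A^{-1} b
x2 = np.linalg.solve(A, b + db)
print(np.linalg.norm(x2 - x1))               # the output moves by about 100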
4.3 Gradient-Based Optimization
Most deep learning algorithms involve optimization of some sort: the task of either minimizing or maximizing some function f(x) by altering x. We usually phrase optimization problems in terms of minimizing f(x); maximization may then be accomplished via a minimization algorithm by minimizing −f(x).
The function we want to minimize or maximize is called the objective function, or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. In this book, these terms are used interchangeably, though some machine learning publications assign special meaning to some of them.
We often denote the value that minimizes or maximizes a function with a superscript *. For example, we might say x* = arg min f(x).
Figure 4.1: An illustration of how the gradient descent algorithm uses the derivatives of a function to follow the function downhill to a minimum. Here f(x) = ½x², so f′(x) = x. For x < 0, we have f′(x) < 0, so we can decrease f by moving rightward; for x > 0, we have f′(x) > 0, so we can decrease f by moving leftward. Where f′(x) = 0, gradient descent halts.
We assume the reader is already familiar with calculus, but provide a brief review of how calculus concepts relate to optimization here.
Suppose we have a function y = f(x), where both x and y are real numbers. The derivative of this function, denoted f′(x) or dy/dx, gives the slope of f(x) at the point x. In other words, it specifies how to scale a small change in the input to obtain the corresponding change in the output: f(x + ε) ≈ f(x) + εf′(x).
The derivative is therefore useful for minimizing a function, because it tells us how to change x in order to make a small improvement in y. Specifically, f(x − ε sign(f′(x))) is less than f(x) for small enough ε, so we can reduce f(x) by moving x in small steps with the opposite sign of the derivative. This technique is called gradient descent (Cauchy, 1847).
When f′(x) = 0, the derivative provides no information about which direction to move; points where f′(x) = 0 are known as critical points, or stationary points. A local minimum is a point where f(x) is lower than at all neighboring points, so it is no longer possible to decrease f(x) by making infinitesimal steps.
A local maximum is a point where f(x) is higher than at all neighboring points, so it is not possible to increase f(x) by making infinitesimal steps. Some critical points are neither maxima nor minima; these are known as saddle points.
Figure 4.2: Examples of each of the three types of critical points in one dimension. A critical point is a point with zero slope. Such a point can either be a local minimum, which is lower than the neighboring points; a local maximum, which is higher than the neighboring points; or a saddle point, which has neighbors that are both higher and lower than the point itself.
See figure 4.2 for examples of each type of critical point.
A point that obtains the absolute lowest value of f(x) is a global minimum.
It is possible for there to be a single global minimum or multiple global minima, and there may also be local minima that are not globally optimal. In the context of deep learning, we optimize functions that may have many local minima that are not optimal and many saddle points surrounded by very flat regions, all of which makes optimization difficult, especially when the input to the function is multidimensional. We therefore usually settle for finding a value of f that is very low but not necessarily minimal in any formal sense.
We often minimize functions that have multiple inputs: f : R n → R For the concept of “minimization” to make sense, there must still be only one (scalar) output.
For functions with multiple inputs, we must make use of partial derivatives. The partial derivative ∂f(x)/∂x_i measures how f changes as only the variable x_i increases at point x. The gradient generalizes the notion of derivative to the case where the derivative is taken with respect to a vector: the gradient of f is the vector containing all of the partial derivatives, denoted ∇_x f(x). Element i of the gradient is the partial derivative of f with respect to x_i.
Figure 4.3: Optimization algorithms may fail to find a global minimum when there are multiple local minima or plateaus present. In the context of deep learning, we generally accept such solutions even though they are not truly minimal, so long as they correspond to significantly low values of the cost function. Ideally, we would like to arrive at the global minimum, but this might not be possible; a local minimum that performs nearly as well as the global one is an acceptable halting point, while a local minimum that performs poorly should be avoided.
In the context of multiple inputs, critical points are points where every element of the gradient is equal to zero. The directional derivative in a direction u (a unit vector) is the slope of the function f in direction u; in other words, it is the derivative of the function f(x + αu) with respect to α, evaluated at α = 0. Using the chain rule, this derivative evaluates to u⊤∇_x f(x) when α = 0.
To minimize f, we would like to find the direction in which f decreases the fastest. We can do this using the directional derivative:
min_{u, u⊤u=1} u⊤∇_x f(x).   (4.3)
Substituting in ‖u‖₂ ‖∇_x f(x)‖₂ cos θ, where θ is the angle between u and the gradient, and ignoring factors that do not depend on u, this simplifies to min_u cos θ. This is minimized when u points in the opposite direction of the gradient. In other words, the gradient points directly uphill and the negative gradient points directly downhill, so we can decrease f by moving in the direction of the negative gradient. This is known as the method of steepest descent, or gradient descent.
Steepest descent proposes a new point
x′ = x − ε∇_x f(x),   (4.5)
where ε is the learning rate, a positive scalar determining the size of the step.
We can choose ε in several ways. A popular approach is to set ε to a small constant. Sometimes, we can solve for the step size that makes the directional derivative vanish. Another approach, known as a line search, is to evaluate f(x − ε∇_x f(x)) for several values of ε and choose the one that results in the smallest objective function value.
Steepest descent converges when every element of the gradient is zero (or, in practice, very close to zero). In some cases, we may be able to avoid running this iterative algorithm and jump directly to the critical point by solving the equation ∇_x f(x) = 0 for x.
Although gradient descent is limited to optimization in continuous spaces, the general concept of repeatedly making a small move toward better configurations can be generalized to discrete spaces. Ascending an objective function of discrete parameters is called hill climbing.
4.3.1 Beyond the Gradient: Jacobian and Hessian Matrices
Sometimes we need to find all of the partial derivatives of a function whose input and output are both vectors; the matrix containing all such partial derivatives is known as the Jacobian matrix. Specifically, for a function f: R^m → R^n, the Jacobian matrix J ∈ R^{n×m} of f is defined such that J_{i,j} = ∂f(x)_i/∂x_j.
We are also sometimes interested in a derivative of a derivative; this is known as a second derivative. For a function f: R^n → R, the derivative with respect to x_i of the derivative of f with respect to x_j is denoted ∂²f/∂x_i∂x_j. The second derivative tells us how the first derivative changes as we vary the input.
4.4 Constrained Optimization
Sometimes we wish not only to find the maximal or minimal value of a function f(x) over all possible values of x, but to find it within some set S. This is known as constrained optimization. Points x that lie within the set S are called feasible points.
We often wish to find a solution that is small in some sense; a common approach in such situations is to impose a norm constraint, such as ‖x‖ ≤ 1. One simple approach to constrained optimization is to modify gradient descent to take the constraint into account. If we use a small constant step size ε, we can take gradient descent steps and then project the result back into S. If we use a line search, we can search only over step sizes that yield new x points that are feasible, or we can project each point on the line back into the constraint region. When possible, this method can be made more efficient by projecting the gradient into the tangent space of the feasible region before taking the step or beginning the line search (Rosen, 1960).
A more sophisticated approach is to design a different, unconstrained optimization problem whose solution can be converted into a solution to the original, constrained problem. For example, to minimize f(x) for x ∈ R² with x constrained to have exactly unit L² norm, we can instead minimize g(θ) = f([cos θ, sin θ]⊤) with respect to θ, then return [cos θ, sin θ] as the solution to the original problem. This approach requires creativity: the transformation between optimization problems must be designed specifically for each case we encounter.
The Karush–Kuhn–Tucker (KKT) approach provides a very general solution to constrained optimization. With the KKT approach, we introduce a new function called the generalized Lagrangian, or generalized Lagrange function.
To define the Lagrangian, we first describe S in terms of equations and inequalities. We want a description of S in terms of m functions g^(i)(x) and n functions h^(j)(x) such that S = {x | ∀i, g^(i)(x) = 0 and ∀j, h^(j)(x) ≤ 0}. The equations involving g^(i) are called the equality constraints, and the inequalities involving h^(j) are called the inequality constraints.
We introduce new variables λ_i and α_j for each constraint; these are called the KKT multipliers. The generalized Lagrangian is then defined as
L(x, λ, α) = f(x) + ∑_i λ_i g^(i)(x) + ∑_j α_j h^(j)(x).   (4.14)
We can now solve a constrained minimization problem using unconstrained optimization of the generalized Lagrangian. As long as at least one feasible point exists and f(x) is not permitted to have the value ∞, the problem
min_x max_λ max_{α, α≥0} L(x, λ, α)
has the same optimal objective function value and set of optimal points x as the constrained problem min_{x∈S} f(x).
This follows because any time the constraints are satisfied,
max_λ max_{α, α≥0} L(x, λ, α) = f(x),   (4.17)
while any time a constraint is violated,
max_λ max_{α, α≥0} L(x, λ, α) = ∞.   (4.18)
¹ The KKT approach generalizes the method of Lagrange multipliers, which allows equality constraints but not inequality constraints.
These properties guarantee that no infeasible point can be optimal, and that the optimum within the feasible points is unchanged.
To perform constrained maximization, we can construct the generalized Lagrange function of −f(x), which leads to this optimization problem:
min_x max_λ max_{α, α≥0} −f(x) + ∑_i λ_i g^(i)(x) + ∑_j α_j h^(j)(x).   (4.19)
We may also convert this to a problem with maximization in the outer loop:
max_x min_λ min_{α, α≥0} f(x) + ∑_i λ_i g^(i)(x) − ∑_j α_j h^(j)(x).   (4.20)
The sign of the term for the equality constraints does not matter; we may define it with addition or subtraction as we wish, because the optimization is free to choose any sign for each λ_i.
The inequality constraints are particularly interesting. We say that a constraint h^(i)(x) is active if h^(i)(x*) = 0 at the solution x*. If a constraint is not active, then the solution found using that constraint remains at least a local solution if the constraint is removed; an inactive constraint may, however, exclude other solutions. For example, in a convex problem with an entire region of globally optimal points, a constraint could exclude a subset of that region, and in a non-convex problem a constraint could exclude better local stationary points, yet the point found at convergence remains a stationary point whether or not the inactive constraints are included. Because an inactive h^(i) has negative value, the solution will have α_i = 0. We can thus observe that, at the solution, at least one of the constraints α_i ≥ 0 or h^(i)(x) ≤ 0 must be active. In other words, either the solution lies on the boundary imposed by the inequality and its KKT multiplier influences the solution, or the inequality has no influence on the solution and its KKT multiplier is zero.
A simple set of properties describes the optimal points of constrained optimization problems; these are called the Karush–Kuhn–Tucker (KKT) conditions (Karush, 1939; Kuhn and Tucker, 1951). They are necessary conditions, but not always sufficient conditions, for a point to be optimal:
• The gradient of the generalized Lagrangian is zero.
• All constraints on both x and the KKT multipliers are satisfied.
• The inequality constraints exhibit “complementary slackness”: \( \alpha \odot h(x) = 0 \). For more information about the KKT approach, see Nocedal and Wright (2006).
Example: Linear Least Squares
Suppose we want to find the value of \( x \) that minimizes
\( f(x) = \frac{1}{2} \| A x - b \|_2^2. \)
Specialized linear algebra algorithms can efficiently address this problem, but we can also demonstrate the solution through gradient-based optimization, providing a straightforward example of the application of these techniques.
First, we need to obtain the gradient:
\( \nabla_x f(x) = A^\top (A x - b) = A^\top A x - A^\top b. \)
We can then follow this gradient downhill, taking small steps. See algorithm 4.1 for details.
Algorithm 4.1 An algorithm to minimize \( f(x) = \frac{1}{2} \| A x - b \|_2^2 \) with respect to \( x \) using gradient descent, starting from an arbitrary value of \( x \).

Set the step size (\( \epsilon \)) and tolerance (\( \delta \)) to small, positive numbers.
while \( \| A^\top A x - A^\top b \|_2 > \delta \) do
  \( x \leftarrow x - \epsilon \left( A^\top A x - A^\top b \right) \)
end while
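As an illustration, algorithm 4.1 can be written in a few lines of NumPy. This is a minimal sketch, not an optimized solver; the matrix A, vector b, step size, and tolerance below are arbitrary placeholder choices.

import numpy as np

def least_squares_gd(A, b, epsilon=0.01, delta=1e-6, max_iters=100000):
    """Minimize 0.5 * ||Ax - b||^2 by gradient descent (algorithm 4.1)."""
    x = np.zeros(A.shape[1])            # arbitrary starting point
    AtA, Atb = A.T @ A, A.T @ b         # precompute for the gradient A^T A x - A^T b
    for _ in range(max_iters):
        grad = AtA @ x - Atb
        if np.linalg.norm(grad) <= delta:   # stopping criterion from the algorithm
            break
        x = x - epsilon * grad              # gradient step
    return x

# Example usage on a small random problem
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
x_gd = least_squares_gd(A, b)
print(np.allclose(x_gd, np.linalg.lstsq(A, b, rcond=None)[0], atol=1e-3))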
Newton’s method can effectively solve this problem, as it utilizes a quadratic approximation that is exact for the true quadratic function, allowing the algorithm to converge to the global minimum in just one step.
Now suppose we wish to minimize the same function, but subject to the constraint \( x^\top x \leq 1 \). To do so, we introduce the Lagrangian
\( L(x, \lambda) = f(x) + \lambda \left( x^\top x - 1 \right). \)
We can now solve the problem
\( \min_x \max_{\lambda, \lambda \geq 0} L(x, \lambda). \)  (4.24)
The smallest-norm solution to the unconstrained least squares problem can be found using the Moore-Penrose pseudoinverse: \( x = A^+ b \). If this point is feasible, then it is also the solution to the constrained problem. Otherwise, we must find a solution where the constraint is active. By differentiating the Lagrangian with respect to \( x \) and setting the result to zero, we obtain the equation
\( A^\top A x - A^\top b + 2 \lambda x = 0. \)  (4.25)
This tells us that the solution will take the form
\( x = \left( A^\top A + 2 \lambda I \right)^{-1} A^\top b. \)  (4.26)
The magnitude of \( \lambda \) must be chosen such that the result obeys the constraint. We can find this value by performing gradient ascent on \( \lambda \). To do so, observe
\( \frac{\partial}{\partial \lambda} L(x, \lambda) = x^\top x - 1. \)  (4.27)
When the norm of \( x \) exceeds 1, this derivative is positive, so following the derivative uphill increases \( \lambda \) and thus the Lagrangian. Because the coefficient on the \( x^\top x \) penalty has increased, solving the linear equation for \( x \) now yields a solution with a smaller norm. The process of solving the linear equation and adjusting \( \lambda \) continues until \( x \) has the correct norm and the derivative with respect to \( \lambda \) is 0.
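A small sketch of this alternating procedure is shown below, reusing a random problem like the one from the earlier example. The update rule for lambda (a plain gradient-ascent step with a hand-picked rate) and the stopping tolerance are illustrative choices, not prescribed by the text.

import numpy as np

def constrained_least_squares(A, b, lr=0.1, tol=1e-6, max_iters=10000):
    """Minimize 0.5*||Ax - b||^2 subject to x^T x <= 1 via the KKT multiplier."""
    n = A.shape[1]
    x = np.linalg.pinv(A) @ b                 # unconstrained minimum-norm solution
    if x @ x <= 1:                            # feasible: constraint inactive, multiplier = 0
        return x
    lam = 0.0
    for _ in range(max_iters):
        # Solve (A^T A + 2*lambda*I) x = A^T b, as in equation 4.26
        x = np.linalg.solve(A.T @ A + 2 * lam * np.eye(n), A.T @ b)
        grad_lam = x @ x - 1                  # derivative of the Lagrangian w.r.t. lambda
        if abs(grad_lam) < tol:
            break
        lam = max(0.0, lam + lr * grad_lam)   # gradient ascent, keeping lambda >= 0
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = 5 * rng.normal(size=20)                   # scaled so the unconstrained solution is infeasible
x_c = constrained_least_squares(A, b)
print(x_c @ x_c)                              # approximately 1 when the constraint is active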
This concludes the mathematical preliminaries that we use to develop machine learning algorithms. We are now ready to build and analyze some full-fledged learning systems.
Machine Learning Basics

Deep learning is a specific kind of machine learning, and using it well requires a strong grasp of fundamental machine learning principles. This chapter serves as a concise introduction to the key concepts that will be applied throughout the rest of the book. Beginners, or readers seeking a broader understanding, are advised to consult machine learning textbooks with more comprehensive coverage of the basics, such as those by Murphy.
If you are already familiar with machine learning basics, you may want to proceed directly to section 5.11. That section explores key traditional machine learning techniques that have significantly shaped the evolution of deep learning algorithms.
We begin by defining what a learning algorithm is, using linear regression as an example, and by distinguishing between fitting the training data and finding patterns that generalize to new data. Most machine learning algorithms have settings called hyperparameters, which must be determined externally to the learning algorithm; we discuss how to set these using additional data. Machine learning is essentially a form of applied statistics, with greater emphasis on computational estimation of complicated functions and less emphasis on establishing confidence intervals; we therefore introduce the two central statistical approaches, frequentist estimators and Bayesian inference. Most machine learning algorithms can be divided into the categories of supervised and unsupervised learning, and we give examples of each. We also explain the significance of stochastic gradient descent in deep learning, and how to combine various components (an optimization algorithm, a cost function, a model, and a dataset) into a cohesive machine learning algorithm. Finally, we address the limitations of traditional machine learning in generalization and the subsequent rise of deep learning algorithms to address these challenges.
Learning Algorithms
A machine learning algorithm is an algorithm that is able to learn from data, improving its performance with experience. According to Mitchell (1997), a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. This definition admits a diverse range of experiences, tasks, and performance measures.
In this book, we do not offer formal definitions for entities such as tasks (T), performance measures (P), and experiences (E) Instead, we focus on providing intuitive descriptions and examples of various tasks and performance measures that can aid in the development of machine learning algorithms.
Machine learning enables us to address complex tasks that traditional human-written programs cannot solve From both scientific and philosophical perspectives, it is crucial to explore machine learning, as doing so enhances our comprehension of the fundamental principles of intelligence.
In a formal definition of "task," the act of learning is not the task itself; rather, learning serves as the method to acquire the skills necessary to complete the task For instance, if the objective is to enable a robot to walk, then walking is identified as the task.
We could program the robot to learn to walk, or we could attempt to directly write a program that specifies how to walk manually.
Machine learning tasks involve the processing of examples, which are collections of quantitatively measured features from objects or events These examples are typically represented as vectors, where each entry corresponds to a specific feature For instance, in the case of images, the features are represented by the pixel values.
Many kinds of tasks can be solved with machine learning Some of the most common machine learning tasks include the following:
Classification tasks involve determining which of k categories a given input belongs to. To solve this task, the learning algorithm is usually asked to produce a function \( f: \mathbb{R}^n \rightarrow \{1, \ldots, k\} \); when \( y = f(x) \), the model assigns an input described by the vector \( x \) to a category identified by the numeric code \( y \). In some variants of the classification task, \( f \) instead outputs a probability distribution over classes.
Object recognition is a classification task where an image's pixel brightness values are analyzed to output a numeric code that identifies the object For instance, the Willow Garage PR2 robot exemplifies this by functioning as a waiter capable of recognizing various drinks and serving them on command Modern advancements in object recognition are primarily driven by deep learning techniques This technology also enables computers to recognize faces, facilitating automatic tagging in photo collections and promoting more natural interactions between computers and users.
Classifying data with missing inputs presents a significant challenge, as it requires the learning algorithm to develop multiple classification functions for various subsets of available inputs This situation often arises in medical diagnosis, where tests can be costly or invasive To address this, an efficient approach involves learning a probability distribution over all relevant variables, allowing the algorithm to classify by marginalizing out the missing inputs Instead of creating 2^n different functions for each possible input combination, the algorithm only needs to learn a single function that describes the joint probability distribution This method is exemplified in Goodfellow et al (2013b), showcasing a deep probabilistic model adept at handling such tasks Classifying with missing inputs is just one of many applications of machine learning in diverse fields.
Regression tasks involve predicting a numerical value based on input data, requiring the learning algorithm to output a function \( f: \mathbb{R}^n \rightarrow \mathbb{R} \). This type of task is similar to classification, except that the format of the output differs. Examples include predicting the expected claim amount for setting insurance premiums and forecasting future security prices for algorithmic trading.
Transcription involves using machine learning systems to convert unstructured data into a structured textual format For instance, optical character recognition (OCR) allows a computer program to analyze a photograph of text and output the characters in formats like ASCII or Unicode A practical application of this is seen in Google Street View, which utilizes deep learning to interpret address numbers Additionally, speech recognition technology processes audio waveforms to generate corresponding text or words.
Deep learning plays a vital role in contemporary speech recognition systems utilized by leading companies such as Microsoft, IBM, and Google These systems rely on ID codes that accurately represent the spoken words in audio recordings, enhancing their efficiency and effectiveness in processing and understanding human speech.
Machine translation involves converting a sequence of symbols from one language into another, such as translating English to French This task has seen significant advancements due to deep learning technologies, which have enhanced the accuracy and efficiency of translations (Sutskever et al., 2014; Bahdanau et al., 2015).
Structured output tasks involve generating complex data structures that showcase important relationships among various elements This category encompasses transcription and translation, as well as tasks like parsing, where a natural language sentence is transformed into a grammatical tree with tagged nodes such as verbs and nouns Deep learning applications, such as those demonstrated by Collobert (2011), illustrate parsing capabilities Another example is pixel-wise image segmentation, where each pixel is categorized, as seen in Mnih and Hinton's (2010) work on annotating road locations in aerial images Additionally, in image captioning, a program analyzes an image and produces a coherent natural language sentence, as explored by Kiros et al (2014) and others These structured output tasks are defined by their requirement for interrelated outputs, ensuring that, for instance, the words generated by an image captioning system form a grammatically correct sentence.
Anomaly detection involves a computer program analyzing a set of events or objects to identify and flag those that are unusual or atypical A prime example of this is credit card fraud detection, where companies model purchasing habits to spot misuse When a thief uses stolen credit card information, their purchases typically deviate from the legitimate cardholder's usual spending patterns By recognizing these discrepancies, credit card companies can promptly place a hold on accounts to prevent further fraud For a comprehensive overview of anomaly detection methods, refer to Chandola et al (2009).
Synthesis and sampling in machine learning involve generating new examples that resemble training data, which is particularly beneficial in media applications where manual content creation is time-consuming For instance, video games can utilize algorithms to automatically create textures for large objects or landscapes, alleviating the need for artists to label each pixel Additionally, in structured output tasks like speech synthesis, a written sentence can prompt the program to produce an audio waveform of the spoken text This process allows for significant variation in the output, enhancing its naturalness and realism.
Capacity, Overfitting and Underfitting
The primary challenge in machine learning lies in achieving strong performance on new, unseen data, rather than solely on the training set This capability to effectively handle previously unobserved inputs is known as generalization.
In machine learning, we begin with a training set and compute a training error, which we reduce through optimization. What separates machine learning from pure optimization is that we also want the generalization error, or test error, to be low. The generalization error is defined as the expected error on a new input, where the expectation is taken across the distribution of inputs we anticipate the model will face in real-world applications.
We typically estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were collected separately from the training set.
In our linear regression example, we trained the model by minimizing the training error,
\( \frac{1}{m^{(\text{train})}} \| X^{(\text{train})} w - y^{(\text{train})} \|_2^2. \)
In statistical learning theory, what we actually care about is the test error, \( \frac{1}{m^{(\text{test})}} \| X^{(\text{test})} w - y^{(\text{test})} \|_2^2 \). Since we only get to observe the training set, how can we influence performance on the test set? If the training and test sets are collected arbitrarily, there is indeed little we can do. If we are allowed to make some assumptions about how the training and test sets are collected, however, then we can make progress.
The train and test data are created through a probability distribution known as the data generating process, based on a set of assumptions called i.i.d assumptions These assumptions state that the examples within each dataset are independent and that both the train and test sets are identically distributed, originating from the same probability distribution This framework enables us to represent the data generating process with a probability distribution for individual examples, which is consistently applied to generate all train and test examples.
The shared underlying distribution, referred to as the data generating distribution (p data), enables a probabilistic framework that, along with i.i.d assumptions, facilitates a mathematical analysis of the connection between training error and test error.
The relationship between training and test error reveals that the expected training error of a randomly selected model is equal to its expected test error When we sample from a probability distribution p(x, y) to create both the training and test sets, the expected errors for both sets remain identical for a fixed value w This equivalence arises because both errors are derived from the same sampling process; the only distinction lies in the designation of the dataset.
When utilizing a machine learning algorithm, parameters are not predetermined; instead, the training set is sampled first to optimize these parameters for minimizing training error, followed by sampling the test set Consequently, the expected test error will always be equal to or greater than the expected training error The performance of a machine learning algorithm is influenced by its capability to effectively generalize from the training data to unseen data.
1. Make the training error small.
2. Make the gap between training and test error small.
In machine learning, two key challenges are underfitting and overfitting Underfitting happens when a model fails to achieve a low error rate on the training data, indicating it is too simplistic Conversely, overfitting occurs when there is a significant disparity between training and test errors, suggesting the model is overly complex and tailored to the training data.
To manage a model's tendency to overfit or underfit, we can adjust its capacity, which refers to its capability to accommodate diverse functions Models with low capacity often find it challenging to accurately represent the training data, while those with high capacity risk overfitting by memorizing specific details of the training set that may not be beneficial for performance on the test set.
To manage the capacity of a learning algorithm, it is essential to select its hypothesis space, which defines the functions the algorithm can use as potential solutions For instance, the hypothesis space of the linear regression algorithm consists of all linear functions of its input By expanding this hypothesis space to incorporate polynomial functions, we can enhance the model's capacity beyond just linear relationships.
A polynomial of degree one gives us the linear regression model with which we are already familiar, with prediction
\( \hat{y} = b + w x. \)  (5.15)
By introducing \( x^2 \) as another feature provided to the linear regression model, we can learn a model that is quadratic as a function of \( x \):
\( \hat{y} = b + w_1 x + w_2 x^2. \)  (5.16)
This model is a quadratic function of its input, but it is still linear in its parameters, so the normal equations can still be used for closed-form training. By incorporating higher powers of \( x \) as additional features, we can extend the model further, for example to a polynomial of degree 9:
\( \hat{y} = b + \sum_{i=1}^{9} w_i x^i. \)
Machine learning algorithms achieve optimal performance when their capacity aligns with the complexity of the task and the available training data Insufficiently capable models struggle with complex tasks, while models with excessive capacity can overfit, despite being able to handle intricate challenges.
Figure 5.2 illustrates the comparison between linear, quadratic, and degree-9 predictors in fitting a quadratic function The linear predictor fails to capture the curvature, leading to underfitting, while the degree-9 predictor can fit the training data perfectly but risks overfitting due to its complexity and the abundance of parameters In contrast, the quadratic model aligns precisely with the true function, enabling it to generalize effectively to new data.
In this example, three models were fitted to a synthetic training set generated by sampling x values at random and applying a quadratic function to determine the y values. The left model, a linear function, demonstrates underfitting, as it fails to capture the curvature inherent in the data. The center model, a quadratic function, generalizes well to unseen data without significant overfitting or underfitting. The right model, a polynomial of degree 9, exhibits overfitting: it passes through all training points but does not capture the underlying structure of the data. The Moore-Penrose pseudoinverse was employed to solve the underdetermined normal equations in this fit.
The degree-9 model exhibits a deep gap between two training points that is not present in the true underlying function; it also increases sharply on the left side of the data, whereas the true function decreases in this region.
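To make this concrete, the sketch below fits polynomials of degree 1, 2, and 9 to a small synthetic quadratic dataset using the Moore-Penrose pseudoinverse, in the spirit of figure 5.2. The particular data, noise level, and function names are illustrative choices, not part of the original example.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y is a quadratic function of x plus a little noise
x_train = rng.uniform(-1, 1, size=10)
y_train = 1.5 * x_train**2 - x_train + 0.3 + 0.05 * rng.normal(size=10)
x_test = rng.uniform(-1, 1, size=100)
y_test = 1.5 * x_test**2 - x_test + 0.3 + 0.05 * rng.normal(size=100)

def fit_polynomial(x, y, degree):
    """Fit y = b + w_1 x + ... + w_d x^d with the Moore-Penrose pseudoinverse."""
    X = np.vander(x, degree + 1, increasing=True)   # columns 1, x, x^2, ..., x^d
    return np.linalg.pinv(X) @ y                    # minimum-norm least squares solution

def predict(w, x):
    return np.vander(x, len(w), increasing=True) @ w

for degree in (1, 2, 9):
    w = fit_polynomial(x_train, y_train, degree)
    train_mse = np.mean((predict(w, x_train) - y_train) ** 2)
    test_mse = np.mean((predict(w, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

With this setup, the degree-1 fit typically shows high training error (underfitting), the degree-9 fit drives training error to nearly zero while test error grows (overfitting), and the degree-2 fit balances the two.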
Hyperparameters and Validation Sets
Machine learning algorithms rely on specific settings, known as hyperparameters, to dictate their behavior. Unlike the algorithm's internal parameters, hyperparameters are not automatically adjusted during the learning process, although it is possible to design a nested learning procedure in which one algorithm learns the hyperparameters of another.
In the polynomial regression example illustrated in Figure 5.2, the degree of the polynomial serves as a key hyperparameter that determines the model's capacity Additionally, the λ value, which regulates the strength of weight decay, represents another important hyperparameter in this context.
In machine learning, certain settings are designated as hyperparameters because they are challenging to optimize or inappropriate to learn from the training set This is particularly true for hyperparameters that influence model capacity, as learning them on the training data can lead to overfitting by favoring maximum model capacity For instance, a higher degree polynomial with no weight decay will always fit the training set better than a lower degree polynomial with a positive weight decay, demonstrating the importance of carefully selecting hyperparameters to avoid overfitting.
To solve this problem, we need a validation set of examples that the training algorithm does not observe.
To accurately estimate a learner's generalization error, it is crucial to use a held-out test set that is composed of examples from the same distribution as the training set, ensuring that these test examples are never involved in model selection or hyperparameter tuning Consequently, the validation set is constructed solely from the training data by splitting it into two disjoint subsets: one for learning model parameters and the other for validation purposes Typically, around 80% of the training data is allocated for training, while 20% is reserved for validation This validation set helps in estimating the generalization error during or after training, although it tends to underestimate this error, typically more so than the training error Once hyperparameter optimization is finalized, the generalization error can then be accurately estimated using the test set.
Repeated use of the same test set to evaluate various algorithms over the years can lead to overly optimistic performance assessments, particularly as the scientific community strives to surpass existing state-of-the-art results Consequently, benchmarks may become outdated and fail to accurately represent the real-world performance of trained systems Fortunately, the community often transitions to newer, larger, and more ambitious benchmark datasets to address this issue.
Dividing a dataset into a fixed training set and a small test set can be problematic, because a limited test set introduces statistical uncertainty in the estimated average test error, making it difficult to claim confidently that algorithm A works better than algorithm B on the given task.
When the dataset contains hundreds of thousands of examples, estimating the mean test error is straightforward. With smaller datasets, alternative methods can be employed at a higher computational cost; these methods repeatedly train and test on different randomly chosen subsets of the original dataset. The most widely used technique is k-fold cross-validation, in which the dataset is divided into k non-overlapping subsets and the test error is estimated as the average across k trials, using one subset as the test set and the remaining data as the training set in each trial. One problem is that unbiased estimators of the variance of such average error estimators do not exist, but approximations are commonly used.
Estimators, Bias and Variance
Statistics provides essential tools for machine learning, enabling solutions that extend beyond the training set to ensure effective generalization Key concepts like parameter estimation, bias, and variance play a crucial role in formally defining the principles of generalization, as well as the challenges of underfitting and overfitting.
Point estimation aims to deliver the most accurate single prediction of a specific quantity of interest, which can be a single parameter or a vector of parameters within a parametric model, like the weights in linear regression Additionally, point estimation can extend to encompass an entire function.
In order to distinguish estimates of parameters from their true value, our convention is to denote a point estimate of a parameter \( \theta \) by \( \hat{\theta} \).
Let \( \{x^{(1)}, \ldots, x^{(m)}\} \) be a set of m independent and identically distributed
The k-fold cross-validation algorithm (Algorithm 5.1) is essential for accurately estimating the generalization error of a learning algorithm A, especially when the dataset D is too small for reliable train/test splits This method mitigates high variance in the mean loss L derived from small test sets In supervised learning, the dataset D consists of (input, target) pairs, while in unsupervised learning, it comprises inputs only The algorithm produces a vector of errors e for each example in D, with the mean representing the estimated generalization error Although the confidence intervals computed from these errors may not be robust after cross-validation, they are frequently used to compare algorithms, asserting that algorithm A outperforms algorithm B if A's error interval lies entirely below B's.
Require: D, the given dataset, with elements \( z^{(i)} \)
Require: A, the learning algorithm, seen as a function that takes a dataset as input and outputs a learned function
Require: L, the loss function, seen as a function from a learned function f and an example \( z^{(i)} \in D \) to a scalar in \( \mathbb{R} \)
Require: k, the number of folds
Split D into k mutually exclusive subsets \( D_i \), whose union is D.
for i from 1 to k do
  \( f_i = A(D \setminus D_i) \)
  for \( z^{(j)} \) in \( D_i \) do
    \( e_j = L(f_i, z^{(j)}) \)
  end for
end for
Return e
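The following is a minimal Python sketch of algorithm 5.1. The function names, the toy "learning algorithm" (predicting the training mean), and the squared loss are illustrative assumptions chosen only to make the example self-contained.

import numpy as np

def k_fold_cross_validation(D, A, L, k):
    """Return the vector of per-example losses e, as in algorithm 5.1.

    D : sequence of examples z^(i)
    A : learning algorithm, A(dataset) -> learned function f
    L : loss function, L(f, z) -> scalar
    k : number of folds
    """
    D = list(D)
    folds = np.array_split(np.arange(len(D)), k)      # k mutually exclusive subsets
    e = np.empty(len(D))
    for fold in folds:
        held_out = set(fold.tolist())
        train = [D[j] for j in range(len(D)) if j not in held_out]
        f = A(train)                                   # train on D \ D_i
        for j in fold:
            e[j] = L(f, D[j])                          # evaluate on the held-out examples
    return e

# Illustrative usage: estimating the mean of scalar data with squared loss
data = list(np.random.default_rng(0).normal(loc=2.0, size=50))
A = lambda train: (lambda z: np.mean(train))           # learned "function" predicts the training mean
L = lambda f, z: (f(z) - z) ** 2
errors = k_fold_cross_validation(data, A, L, k=5)
print(errors.mean())                                   # estimated generalization error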
(i.i.d.) data points. A point estimator or statistic is any function of the data:
\( \hat{\theta}_m = g\left( x^{(1)}, \ldots, x^{(m)} \right). \)  (5.19)
The definition does not necessitate that the function g produces a value near the true θ, nor does it require that the range of g aligns with the permissible values of θ.
A point estimator is a versatile tool that offers significant flexibility in its design While many functions can serve as estimators, an effective one is characterized by its ability to produce outputs that closely approximate the true underlying parameter θ that generated the training data.
In the frequentist approach to statistics, we consider the true parameter value θ to be fixed yet unknown, while the point estimate θˆ is derived from the data Given that the data originates from a random process, any function of this data, including θˆ, is also random, making θˆ a random variable.
Point estimation can also refer to the estimation of the relationship between input and target variables We refer to these types of point estimates as function estimators.
Function estimation, also known as function approximation, involves predicting a variable \( y \) based on an input vector \( x \) by assuming a relationship defined by a function \( f(x) \) This relationship can be expressed as \( y = f(x) + \epsilon \), where \( \epsilon \) represents the unpredictable component of \( y \) The goal of function estimation is to approximate the function \( f \) with an estimate \( \hat{f} \), making it analogous to estimating a parameter \( \theta \) In this context, the function estimator \( \hat{f} \) serves as a point estimator within function space Examples such as linear regression and polynomial regression illustrate scenarios where the task can be viewed as either estimating a parameter or mapping a function from \( x \) to \( y \).
We now review the most commonly studied properties of point estimators and discuss what they tell us about these estimators.
The bias of an estimator is defined as \( \text{bias}(\hat{\theta}_m) = \mathbb{E}(\hat{\theta}_m) - \theta \), the difference between the expected value of the estimator and the true parameter value \( \theta \). An estimator is unbiased if its bias equals zero, meaning that \( \mathbb{E}(\hat{\theta}_m) = \theta \). An estimator is asymptotically unbiased if its bias approaches zero as the sample size increases, meaning that \( \lim_{m \to \infty} \mathbb{E}(\hat{\theta}_m) = \theta \).
Example: Bernoulli Distribution. Consider a set of samples \( \{x^{(1)}, \ldots, x^{(m)}\} \) that are independently and identically distributed according to a Bernoulli distribution with mean \( \theta \): \( P(x^{(i)}; \theta) = \theta^{x^{(i)}} (1 - \theta)^{(1 - x^{(i)})} \).
A common estimator for the \( \theta \) parameter of this distribution is the mean of the training samples:
\( \hat{\theta}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}. \)  (5.22)
To determine whether this estimator is biased, we can substitute equation 5.22 into equation 5.20:
\( \text{bias}(\hat{\theta}_m) = \mathbb{E}[\hat{\theta}_m] - \theta = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\left[ x^{(i)} \right] - \theta = \theta - \theta = 0. \)  (5.23)
Since \( \text{bias}(\hat{\theta}_m) = 0 \), we say that our estimator \( \hat{\theta}_m \) is unbiased.
Example: Gaussian Distribution Estimator of the Mean. Now consider a set of samples \( \{x^{(1)}, \ldots, x^{(m)}\} \) that are independently and identically distributed according to a Gaussian distribution \( p(x^{(i)}) = \mathcal{N}(x^{(i)}; \mu, \sigma^2) \), where \( i \in \{1, \ldots, m\} \).
Recall that the Gaussian probability density function is given by
\( p(x^{(i)}; \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{(x^{(i)} - \mu)^2}{2 \sigma^2} \right). \)
A common estimator of the Gaussian mean parameter is known as the sample mean:
\( \hat{\mu}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}. \)
To determine the bias of the sample mean, we are again interested in calculating its expectation:
\( \text{bias}(\hat{\mu}_m) = \mathbb{E}[\hat{\mu}_m] - \mu = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\left[ x^{(i)} \right] - \mu = \mu - \mu = 0. \)  (5.31)
Thus we find that the sample mean is an unbiased estimator of the Gaussian mean parameter.
Example: Estimators of the Variance of a Gaussian Distribution. Next, we compare two different estimators of the variance parameter \( \sigma^2 \) of a Gaussian distribution and ask whether either is biased. The first estimator we consider is the sample variance,
\( \hat{\sigma}^2_m = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} - \hat{\mu}_m \right)^2, \)  (5.36)
where \( \hat{\mu}_m \) is the sample mean, defined above. More formally, we are interested in computing
\( \text{bias}(\hat{\sigma}^2_m) = \mathbb{E}[\hat{\sigma}^2_m] - \sigma^2. \)  (5.37)
We begin by evaluating the term \( \mathbb{E}[\hat{\sigma}^2_m] \), which works out to \( \frac{m-1}{m} \sigma^2 \). Returning to equation 5.37, we conclude that the bias of \( \hat{\sigma}^2_m \) is \( -\sigma^2 / m \); the sample variance is therefore a biased estimator.
The unbiased sample variance estimator,
\( \tilde{\sigma}^2_m = \frac{1}{m-1} \sum_{i=1}^{m} \left( x^{(i)} - \hat{\mu}_m \right)^2, \)  (5.40)
provides an alternative approach. As the name suggests, this estimator is unbiased; that is, we find that \( \mathbb{E}[\tilde{\sigma}^2_m] = \sigma^2 \).
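The bias of the sample variance is easy to observe numerically. The sketch below (an illustration, not part of the text) repeatedly draws small Gaussian samples and averages the two estimators; the particular variance, sample size, and number of trials are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
sigma2_true = 4.0   # true variance of the Gaussian
m = 5               # a small sample size makes the bias easy to see

biased, unbiased = [], []
for _ in range(200000):
    x = rng.normal(loc=0.0, scale=np.sqrt(sigma2_true), size=m)
    mu_hat = x.mean()
    biased.append(np.sum((x - mu_hat) ** 2) / m)         # sample variance, divides by m
    unbiased.append(np.sum((x - mu_hat) ** 2) / (m - 1)) # divides by m - 1

print(np.mean(biased))    # close to (m-1)/m * sigma^2 = 3.2
print(np.mean(unbiased))  # close to sigma^2 = 4.0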
In statistical analysis, we differentiate between biased and unbiased estimators, with unbiased estimators generally being preferred However, biased estimators can still be valuable due to their possession of other significant properties that may enhance their overall effectiveness in certain scenarios.
Another property of an estimator that we might want to consider is how much we expect it to vary as a function of the data sample. Just as we computed the expectation of an estimator to determine its bias, we can compute its variance. The variance of an estimator is simply the variance
\( \text{Var}(\hat{\theta}), \)  (5.45)
where the random variable is the training set. Alternately, the square root of the variance is called the standard error, denoted \( \text{SE}(\hat{\theta}) \).
The variance or standard error of an estimator indicates the expected variability of our computed estimate when resampling the dataset from the underlying data generating process Similar to our preference for an estimator with low bias, it is also desirable for the estimator to have low variance.
When estimating a true underlying parameter from a finite number of samples, there is inherent uncertainty due to the possibility of obtaining different samples from the same distribution, leading to variations in their statistics This expected variation in any estimator represents a source of error that we aim to quantify.
The standard error of the mean is given by
\( \text{SE}(\hat{\mu}_m) = \sqrt{\text{Var}\left[ \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \right]} = \frac{\sigma}{\sqrt{m}}, \)
where \( \sigma^2 \) is the true variance of the samples \( x^{(i)} \).
The standard error is typically estimated using an approximation of the true variance (σ²) from sample data (xᵢ) However, both the square root of the sample variance and the unbiased estimator of variance fall short as unbiased estimates of the standard deviation, often leading to underestimations Despite this limitation, these methods remain prevalent in practice, with the square root of the unbiased estimator providing a slightly less pronounced underestimate This approximation becomes increasingly accurate as the sample size (m) grows larger.
The standard error of the mean is very useful in machine learning experiments.
To estimate the generalization error, we calculate the sample mean of the error on the test set, with the accuracy of this estimate influenced by the number of examples in the test set Utilizing the central limit theorem, which indicates that the mean is approximately normally distributed, we can apply the standard error to determine the probability that the true expectation lies within a specified interval For instance, we can construct a 95% confidence interval centered around the mean.
\( \left( \hat{\mu}_m - 1.96\, \text{SE}(\hat{\mu}_m),\; \hat{\mu}_m + 1.96\, \text{SE}(\hat{\mu}_m) \right), \)  (5.47)
under the normal distribution with mean \( \hat{\mu}_m \) and variance \( \text{SE}(\hat{\mu}_m)^2 \). In machine learning experiments, it is common to say that algorithm A is better than algorithm B if the upper bound of the 95% confidence interval for the error of algorithm A is less than the lower bound of the 95% confidence interval for the error of algorithm B.
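This comparison rule is simple to implement. The sketch below computes the interval in equation 5.47 from per-example losses; the two arrays of hypothetical 0-1 losses are synthetic placeholders.

import numpy as np

def error_confidence_interval(per_example_errors):
    """95% confidence interval for the expected error, using the standard error of the mean."""
    e = np.asarray(per_example_errors, dtype=float)
    mean = e.mean()
    se = e.std(ddof=1) / np.sqrt(len(e))      # estimated SE(mu_hat), using the unbiased variance
    return mean - 1.96 * se, mean + 1.96 * se

# Hypothetical per-example 0-1 losses of two classifiers on the same test set
rng = np.random.default_rng(0)
errors_a = rng.binomial(1, 0.12, size=2000)
errors_b = rng.binomial(1, 0.18, size=2000)
lo_a, hi_a = error_confidence_interval(errors_a)
lo_b, hi_b = error_confidence_interval(errors_b)
print((lo_a, hi_a), (lo_b, hi_b))
print("A better than B:", hi_a < lo_b)        # the comparison rule described above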
Example: Bernoulli Distribution. We once again consider a set of samples \( \{x^{(1)}, \ldots, x^{(m)}\} \) drawn independently and identically from a Bernoulli distribution (recall \( P(x^{(i)}; \theta) = \theta^{x^{(i)}} (1 - \theta)^{(1 - x^{(i)})} \)). This time we are interested in computing the variance of the estimator \( \hat{\theta}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \). Because the samples are independent, the variance of this estimator decreases as the number of samples grows: \( \text{Var}(\hat{\theta}_m) = \frac{1}{m} \theta (1 - \theta) \).
Maximum Likelihood Estimation
In our previous discussions, we explored various definitions of common estimators and examined their properties However, it is essential to understand the origins of these estimators Instead of randomly selecting functions to serve as estimators and subsequently evaluating their bias and variance, we aim to establish a foundational principle that allows us to derive effective estimators tailored to different models.
The most common such principle is the maximum likelihood principle.
Consider a set of m examples \( \mathbb{X} = \{x^{(1)}, \ldots, x^{(m)}\} \) drawn independently from the true but unknown data generating distribution \( p_{\text{data}}(x) \).
Let \( p_{\text{model}}(x; \theta) \) be a parametric family of probability distributions over the same space indexed by \( \theta \). In other words, \( p_{\text{model}}(x; \theta) \) maps any configuration \( x \) to a real number estimating the true probability \( p_{\text{data}}(x) \).
The maximum likelihood estimator for \( \theta \) is then defined as
\( \theta_{\text{ML}} = \arg\max_{\theta} p_{\text{model}}(\mathbb{X}; \theta) = \arg\max_{\theta} \prod_{i=1}^{m} p_{\text{model}}\left( x^{(i)}; \theta \right). \)  (5.56)
This product over many probabilities can be inconvenient, particularly because it is prone to numerical underflow. By taking the logarithm of the likelihood, however, we obtain an equivalent but more manageable optimization problem: the arg max is preserved while the product is converted into a sum,
\( \theta_{\text{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\text{model}}\left( x^{(i)}; \theta \right). \)
Because the arg max does not change when we rescale the cost function, we can divide by m to obtain a version of the criterion expressed as an expectation with respect to the empirical distribution \( \hat{p}_{\text{data}} \) defined by the training data:
\( \theta_{\text{ML}} = \arg\max_{\theta} \mathbb{E}_{x \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(x; \theta). \)  (5.59)
Maximum likelihood estimation can be understood as the process of minimizing the dissimilarity between the empirical distribution (pˆdata) derived from the training set and the model distribution This dissimilarity is quantified using the Kullback-Leibler (KL) divergence, which serves as a measure of how one probability distribution diverges from a second expected probability distribution.
\( D_{\text{KL}}\left( \hat{p}_{\text{data}} \,\|\, p_{\text{model}} \right) = \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[ \log \hat{p}_{\text{data}}(x) - \log p_{\text{model}}(x) \right]. \)  (5.60)
The left term solely depends on the data generating process rather than the model itself Consequently, when training the model to reduce KL divergence, our focus should be exclusively on minimizing this term.
\( -\mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[ \log p_{\text{model}}(x) \right], \)  (5.61)
which is of course the same as the maximization in equation 5.59.
Minimizing this KL divergence corresponds exactly to minimizing the cross-entropy between the distributions. Many authors use the term "cross-entropy" to refer specifically to the negative log-likelihood of a Bernoulli or softmax distribution, but that usage is too narrow: any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model. For instance, mean squared error is the cross-entropy between the empirical distribution and a Gaussian model.
Maximum likelihood estimation aims to align the model distribution with the empirical distribution, denoted as pˆdata While the goal is to match the true data-generating distribution, p data, direct access to this distribution is often unavailable.
The optimal parameter θ remains consistent whether maximizing likelihood or minimizing KL divergence, although the values of the objective functions differ In software applications, both concepts are typically framed as minimizing a cost function Therefore, maximizing likelihood translates to minimizing the negative log-likelihood (NLL) or cross-entropy Viewing maximum likelihood through the lens of minimum KL divergence is beneficial, as KL divergence has a minimum value of zero It's important to note that the negative log-likelihood can take on negative values when x is real-valued.
5.5.1 Conditional Log-Likelihood and Mean Squared Error
The maximum likelihood estimator can readily be generalized to estimate a conditional probability \( P(y \mid x; \theta) \), in order to predict \( y \) given \( x \). This is actually the most common situation, because it forms the basis for most supervised learning. If \( X \) represents all our inputs and \( Y \) all our observed targets, then the conditional maximum likelihood estimator is
\( \theta_{\text{ML}} = \arg\max_{\theta} P(Y \mid X; \theta). \)
If the examples are assumed to be i.i.d., then this can be decomposed into
\( \theta_{\text{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log P\left( y^{(i)} \mid x^{(i)}; \theta \right). \)
Linear regression, introduced earlier as an algorithm that learns to map an input \( x \) to an output value \( \hat{y} \) by minimizing mean squared error, can also be justified as a maximum likelihood procedure. From this point of view, the model produces a conditional distribution \( p(y \mid x) \). With a sufficiently large training set we may see several training examples with the same \( x \) but different values of \( y \); the goal of the learning algorithm is then to fit the distribution \( p(y \mid x) \) to all these different \( y \) values that are compatible with \( x \). To derive the same linear regression algorithm we obtained before, we define \( p(y \mid x) = \mathcal{N}(y; \hat{y}(x; w), \sigma^2) \), where \( \hat{y}(x; w) \) gives the prediction of the mean of the Gaussian and the variance \( \sigma^2 \) is fixed to some constant chosen by the user. This choice of functional form for \( p(y \mid x) \) causes the maximum likelihood estimation procedure to yield the same learning algorithm we developed before. Since the examples are assumed to be i.i.d., the conditional log-likelihood is
\( \sum_{i=1}^{m} \log p\left( y^{(i)} \mid x^{(i)}; \theta \right) = -m \log \sigma - \frac{m}{2} \log (2\pi) - \sum_{i=1}^{m} \frac{\| \hat{y}^{(i)} - y^{(i)} \|^2}{2 \sigma^2}, \)  (5.65)
where \( \hat{y}^{(i)} \) is the output of the linear regression on the i-th input \( x^{(i)} \) and m is the number of training examples. Comparing the log-likelihood with the mean squared error,
\( \text{MSE}_{\text{train}} = \frac{1}{m} \sum_{i=1}^{m} \| \hat{y}^{(i)} - y^{(i)} \|^2, \)
we immediately see that maximizing the log-likelihood with respect to \( w \) yields the same estimate of the parameters \( w \) as minimizing the mean squared error. The two criteria have different values but the same location of the optimum, which justifies the use of mean squared error (MSE) as a maximum likelihood estimation procedure. As we will see, the maximum likelihood estimator has several desirable properties.
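The equivalence of the two optima is easy to check numerically. The sketch below evaluates both criteria for a one-parameter model \( \hat{y} = w x \) on synthetic data; the data, the fixed value of \( \sigma \), and the grid of candidate slopes are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 0.1 * rng.normal(size=100)
sigma = 0.1                                    # fixed Gaussian variance, as in the text

w_grid = np.linspace(0.0, 4.0, 2001)           # candidate slopes for y_hat = w * x
mse = np.array([np.mean((w * x - y) ** 2) for w in w_grid])
log_lik = np.array([
    -len(x) * np.log(sigma) - len(x) / 2 * np.log(2 * np.pi)
    - np.sum((w * x - y) ** 2) / (2 * sigma ** 2)
    for w in w_grid
])

# The two criteria have different values but the same optimal w
print(w_grid[np.argmin(mse)], w_grid[np.argmax(log_lik)])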
The maximum likelihood estimator is highly regarded for its asymptotic properties, demonstrating that it becomes the most efficient estimator as the sample size approaches infinity Its rate of convergence improves with an increasing number of examples, making it a preferred choice in statistical estimation.
Under suitable conditions, the maximum likelihood estimator exhibits consistency, indicating that as the number of training examples increases indefinitely, the maximum likelihood estimate of a parameter aligns with its true value.
• The true distribution \( p_{\text{data}} \) must lie within the model family \( p_{\text{model}}(\cdot; \theta) \).
Otherwise, no estimator can recover p data.
The true distribution \( p \) of data must align with a single value of \( \theta \); otherwise, while maximum likelihood can identify the correct \( p \), it cannot ascertain the specific \( \theta \) used in the data generation process Beyond maximum likelihood estimators, there are various inductive principles that also function as consistent estimators However, these consistent estimators can vary in statistical efficiency, meaning that one may achieve a lower generalization error with a fixed number of samples, or conversely, require fewer samples to reach a predetermined level of generalization error.
Statistical efficiency is often analyzed in the context of parametric models, such as linear regression, where the focus is on estimating a specific parameter rather than a function The expected mean squared error serves as a key metric to gauge the accuracy of our parameter estimates, calculated by measuring the squared difference between the estimated and true parameter values across training samples from the data generating distribution As the sample size increases, the parametric mean squared error tends to decrease, and for large samples, the Cramér-Rao lower bound indicates that no consistent estimator can achieve a mean squared error lower than that of the maximum likelihood estimator.
Bayesian Statistics
In our previous discussions, we explored frequentist statistics, which focuses on estimating a single value of θ for making predictions Alternatively, Bayesian statistics takes a different approach by considering all possible values of θ when generating predictions.
The frequentist perspective posits that the true parameter value θ is fixed yet unknown, while the point estimate θˆ is considered a random variable due to its dependence on the dataset, which is treated as random.
The Bayesian approach to statistics fundamentally differs from traditional methods by utilizing probability to express varying degrees of certainty regarding knowledge states In this perspective, the dataset is observed directly and is not considered random, while the true parameter θ remains unknown or uncertain, and is therefore treated as a random variable.
Before analyzing the data, we define our understanding of θ through the prior probability distribution, p(θ) Typically, machine learning practitioners choose a broad prior distribution, indicating significant uncertainty about θ's value before data observation For instance, one might assume θ exists within a specific range, represented by a uniform distribution Additionally, many priors favor "simpler" solutions, such as smaller coefficients or functions that are nearly constant.
To understand how the data changes our beliefs about the parameter \( \theta \), we apply Bayes' rule to combine the data likelihood \( p(x^{(1)}, \ldots, x^{(m)} \mid \theta) \) with our prior belief \( p(\theta) \):
\( p\left( \theta \mid x^{(1)}, \ldots, x^{(m)} \right) = \frac{p\left( x^{(1)}, \ldots, x^{(m)} \mid \theta \right) p(\theta)}{p\left( x^{(1)}, \ldots, x^{(m)} \right)}. \)
Bayesian estimation often starts with a uniform or Gaussian prior distribution characterized by high entropy As data is observed, the posterior distribution becomes more concentrated, losing entropy and focusing on a limited number of highly probable parameter values.
Bayesian estimation differs from maximum likelihood estimation in two significant ways. First, unlike the maximum likelihood approach, which makes predictions using a point estimate of \( \theta \), the Bayesian approach makes predictions using a full distribution over \( \theta \). For instance, after observing m examples, the predicted distribution over the next data sample \( x^{(m+1)} \) is
\( p\left( x^{(m+1)} \mid x^{(1)}, \ldots, x^{(m)} \right) = \int p\left( x^{(m+1)} \mid \theta \right) p\left( \theta \mid x^{(1)}, \ldots, x^{(m)} \right) d\theta. \)
Each value of θ, associated with a positive probability density, plays a role in predicting the next example, with its influence determined by the posterior density When we have observed the data set {x (1), , x (m)} and still face significant uncertainty regarding θ, this uncertainty is directly reflected in our predictions.
In section 5.4, we explored how the frequentist approach manages uncertainty in point estimates of θ by evaluating variance, which indicates how estimates may vary with different data samples Conversely, the Bayesian method addresses this uncertainty by integrating over the estimator, effectively minimizing the risk of overfitting This integration aligns with probability laws, providing a straightforward justification for the Bayesian approach, while the frequentist method relies on the arbitrary choice of summarizing all data knowledge into a single point estimate.
The Bayesian approach to estimation differs significantly from the maximum likelihood method due to the influence of the prior distribution, which shifts probability mass density towards preferred regions of the parameter space This prior often favors simpler or smoother models However, critics argue that the incorporation of the prior introduces subjective human judgment, potentially affecting the accuracy of predictions.
Bayesian methods typically generalize much better when limited training data is available, but typically suffer from high computational cost when the number of training examples is large.
Bayesian Linear Regression utilizes a Bayesian estimation approach to determine the parameters of linear regression This method involves learning a linear relationship between an input vector \( x \in \mathbb{R}^n \) and a scalar output \( y \in \mathbb{R} \) The prediction is represented by the equation \( \hat{y} = w^T x \), where \( w \in \mathbb{R}^n \) is the parameter vector.
Given a set of m training samples \( (X^{(\text{train})}, y^{(\text{train})}) \), we can express the prediction of \( y \) over the entire training set as:
\( \hat{y}^{(\text{train})} = X^{(\text{train})} w. \)  (5.70)
Expressed as a Gaussian conditional distribution on \( y^{(\text{train})} \), we have
\( p\left( y^{(\text{train})} \mid X^{(\text{train})}, w \right) = \mathcal{N}\left( y^{(\text{train})}; X^{(\text{train})} w, I \right), \)  (5.71)
where we follow the standard MSE formulation in assuming that the Gaussian variance on \( y \) is one. In what follows, to reduce the notational burden, we refer to the training data simply as \( (X, y) \).
To determine the posterior distribution over the model parameter vector \( w \), we first need to specify a prior distribution that expresses our initial beliefs about these parameters. Although articulating these prior beliefs in terms of the model's parameters can be difficult, we typically assume a fairly broad distribution expressing a high degree of uncertainty about \( \theta \). For real-valued parameters it is common to use a Gaussian as a prior distribution,
\( p(w) = \mathcal{N}(w; \mu_0, \Lambda_0) \propto \exp\left( -\frac{1}{2} (w - \mu_0)^\top \Lambda_0^{-1} (w - \mu_0) \right), \)  (5.73)
where \( \mu_0 \) and \( \Lambda_0 \) are the prior distribution mean vector and covariance matrix respectively.
With the prior thus specified, we can now proceed in determining the posterior distribution over the model parameters:
\( p(w \mid X, y) \propto p(y \mid X, w)\, p(w). \)  (5.74)
We now define \( \Lambda_m = \left( X^\top X + \Lambda_0^{-1} \right)^{-1} \) and \( \mu_m = \Lambda_m \left( X^\top y + \Lambda_0^{-1} \mu_0 \right) \). Using these new variables, we find that the posterior may be rewritten as a Gaussian distribution:
\( p(w \mid X, y) \propto \exp\left( -\frac{1}{2} (w - \mu_m)^\top \Lambda_m^{-1} (w - \mu_m) \right). \)
In the context of multivariate Gaussian distributions, all terms excluding the parameter vector w have been omitted for simplicity, as they are inherently implied by the requirement that the distribution must be normalized to integrate to 1 Equation 3.23 illustrates the process of normalizing a multivariate Gaussian distribution effectively.
Examining the posterior distribution allows us to gain some intuition for the effect of Bayesian inference. In most situations, we set \( \mu_0 \) to 0. If we set \( \Lambda_0 = \frac{1}{\alpha} I \), then \( \mu_m \) gives the same estimate of \( w \) as frequentist linear regression with a weight decay penalty of \( \alpha w^\top w \). One difference is that the Bayesian estimate is undefined if \( \alpha \) is set to zero: we are not allowed to begin the Bayesian learning process with an infinitely wide prior on \( w \). The more important difference is that the Bayesian estimate provides a covariance matrix, showing how likely all the different values of \( w \) are, rather than providing only the estimate \( \mu_m \).
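A minimal sketch of this posterior computation is given below, assuming unit observation variance and a zero-mean Gaussian prior as described above. The function name, the synthetic data, and the particular value of \( \alpha \) are illustrative choices.

import numpy as np

def bayesian_linear_regression_posterior(X, y, Lambda0, mu0=None):
    """Posterior mean and covariance of w under a Gaussian prior N(mu0, Lambda0)
    and unit observation variance (a sketch of the Gaussian posterior derived above)."""
    n = X.shape[1]
    mu0 = np.zeros(n) if mu0 is None else mu0
    Lambda0_inv = np.linalg.inv(Lambda0)
    Lambda_m = np.linalg.inv(X.T @ X + Lambda0_inv)          # posterior covariance
    mu_m = Lambda_m @ (X.T @ y + Lambda0_inv @ mu0)          # posterior mean
    return mu_m, Lambda_m

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(size=50)

alpha = 1.0
mu_m, Lambda_m = bayesian_linear_regression_posterior(X, y, Lambda0=np.eye(3) / alpha)
print(mu_m)                 # comparable to frequentist regression with weight decay alpha
print(np.diag(Lambda_m))    # posterior uncertainty about each weight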
While the most principled approach is to make predictions using the full Bayesian posterior distribution over the parameter \( \theta \), it is still often desirable to have a single point estimate of the parameters.
Supervised Learning Algorithms
Supervised learning algorithms learn to associate some input with some output, given a training set of examples of inputs and outputs. In many cases the outputs are difficult to collect automatically and must be provided by a human "supervisor," but the term still applies even when the training set targets were collected automatically.
Most supervised learning algorithms discussed in this book focus on estimating the probability distribution p(y | x) This can be achieved through maximum likelihood estimation, which identifies the optimal parameter vector θ for a given parametric family of distributions p(y | x; θ).
We have already seen that linear regression corresponds to the family
\( p(y \mid x; \theta) = \mathcal{N}\left( y; \theta^\top x, I \right). \)  (5.80)
Linear regression can be adapted for classification by utilizing a distinct set of probability distributions In a binary classification scenario with two classes, class 0 and class 1, it suffices to define the probability of one class Specifically, the probability of class 1 inherently establishes the probability of class 0, as the total probability must equal 1.
In linear regression, the normal distribution is defined by a mean that can take on any value. With a binary variable, however, the mean must be constrained between 0 and 1. One way to solve this problem is to use the logistic sigmoid function to squash the output of the linear function into the interval (0, 1) and interpret that value as a probability:
\( p(y = 1 \mid x; \theta) = \sigma\left( \theta^\top x \right). \)
This approach is known as logistic regression (a somewhat strange name, since we use the model for classification rather than regression).
Logistic regression poses a greater challenge than linear regression, because it lacks a closed-form solution for the optimal weights. Instead, we must search for them by maximizing the log-likelihood, which we can do by minimizing the negative log-likelihood (NLL) using gradient descent.
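The sketch below minimizes the average negative log-likelihood of a logistic regression model by plain gradient descent. The toy data, the appended bias column, the learning rate, and the number of steps are illustrative assumptions.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_regression(X, y, lr=0.1, n_steps=5000):
    """Minimize the NLL of p(y=1|x) = sigmoid(theta^T x) by gradient descent."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(n_steps):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) / m       # gradient of the average negative log-likelihood
        theta -= lr * grad
    return theta

# Toy binary classification problem (bias feature appended as a column of ones)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = np.hstack([X, np.ones((200, 1))])
y = (X[:, 0] - 0.5 * X[:, 1] > 0).astype(float)
theta = fit_logistic_regression(X, y)
print(np.mean((sigmoid(X @ theta) > 0.5) == y))   # training accuracy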
This strategy is applicable to various supervised learning problems by defining a parametric family of conditional probability distributions tailored to the appropriate input and output variables.
The support vector machine (SVM) is one of the most influential approaches to supervised learning, introduced by Boser et al. in 1992 and further developed by Cortes and Vapnik in 1995. Like logistic regression, the SVM is based on the linear function \( w^\top x + b \). Unlike logistic regression, however, the SVM does not provide probabilities; it only outputs a class identity: it predicts the positive class when \( w^\top x + b \) is positive and the negative class when \( w^\top x + b \) is negative.
One key innovation associated with support vector machines is the kernel trick, which rests on the observation that many machine learning algorithms can be written exclusively in terms of dot products between examples. In particular, the linear function used by the support vector machine can be rewritten as
\( w^\top x + b = b + \sum_{i=1}^{m} \alpha_i\, x^\top x^{(i)}, \)
where \( x^{(i)} \) is a training example and \( \alpha \) is a vector of coefficients.
Rewriting the learning algorithm this way allows us to replace the training example \( x \) with the output of a feature function \( \phi(x) \), and the dot product with a kernel function \( k(x, x^{(i)}) = \phi(x) \cdot \phi(x^{(i)}) \). The \( \cdot \) operator represents an inner product analogous to \( \phi(x)^\top \phi(x^{(i)}) \). For some feature spaces, particularly certain infinite-dimensional spaces, other kinds of inner products may be necessary, such as inner products based on integration rather than summation; a complete development of these inner products is beyond the scope of this book.
After replacing dot products with kernel evaluations, we can make predictions using the function
\( f(x) = b + \sum_i \alpha_i\, k\left( x, x^{(i)} \right). \)  (5.83)
The function exhibits a nonlinear relationship with respect to x, while the connections between φ(x) and f(x), as well as α and f(x), remain linear Utilizing the kernel-based function effectively transforms the data through φ(x), enabling the learning of a linear model in this new space The kernel trick is advantageous for two main reasons: it allows for the efficient learning of nonlinear models using convex optimization techniques, as it treats φ as fixed while optimizing α, and it often provides a more computationally efficient implementation compared to directly calculating the dot product of two φ(x) vectors.
In certain scenarios, the feature mapping φ(x) can be infinite dimensional, leading to substantial computational costs when using a naive, explicit approach However, many times the kernel function k(x, x') remains a nonlinear and tractable function of x, even if φ(x) is intractable For instance, we can illustrate an infinite-dimensional feature space with a tractable kernel by defining a feature mapping φ(x) for non-negative integers x, where this mapping produces a vector filled with x ones followed by an infinite number of zeros.
We can write a kernel function k(x, x ( ) i ) = min(x, x ( ) i ) that is exactly equivalent to the corresponding infinite-dimensional dot product.
The Gaussian kernel, expressed as k(u, v) = N(u−v; 0, σ²I), is the most commonly utilized kernel in machine learning Also referred to as the radial basis function (RBF) kernel, its value diminishes along radial lines in the v space emanating from u Notably, the Gaussian kernel represents a dot product in an infinite-dimensional space, although deriving this space is more complex compared to the min kernel over integers.
The Gaussian kernel functions as a template matching mechanism, where each training example \( x \) is linked to its corresponding label \( y \) When a test point \( x' \) is close to \( x \) in terms of Euclidean distance, the Gaussian kernel generates a strong response, suggesting a high similarity between \( x' \) and the template \( x \) Consequently, the model assigns significant weight to the related training label \( y \).
Overall, the prediction will combine many such training labels weighted by the similarity of the corresponding training examples.
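The sketch below illustrates prediction with a Gaussian kernel machine of the form in equation 5.83. To keep the example self-contained, the coefficients alpha are obtained with kernel ridge regression, which is an assumed, simple choice and not the SVM training procedure; the data, kernel width, and regularization strength are likewise illustrative.

import numpy as np

def gaussian_kernel(U, V, sigma=1.0):
    """Gaussian (RBF) kernel between the rows of U and the rows of V."""
    sq_dists = np.sum(U**2, axis=1)[:, None] + np.sum(V**2, axis=1)[None, :] - 2 * U @ V.T
    return np.exp(-sq_dists / (2 * sigma**2))

def kernel_predict(X_train, alpha, b, X_test, sigma=1.0):
    """f(x) = b + sum_i alpha_i k(x, x^(i)), the kernelized linear function."""
    return b + gaussian_kernel(X_test, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(30, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=30)

# Kernel ridge regression: alpha = (K + lambda I)^{-1} y (an illustrative way to pick alpha)
K = gaussian_kernel(X_train, X_train)
alpha = np.linalg.solve(K + 0.1 * np.eye(30), y_train)

X_test = np.linspace(-3, 3, 5)[:, None]
print(kernel_predict(X_train, alpha, 0.0, X_test))   # predictions near sin(x) at the test points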
Kernel methods, also known as kernel machines, extend beyond support vector machines, allowing various linear models to be enhanced through the kernel trick This technique, as highlighted by Williams and Rasmussen (1996) and Schölkopf et al (1999), enables improved performance across a range of algorithms.
Kernel machines face a significant limitation as the cost of evaluating the decision function increases linearly with the number of training examples, since each example contributes a term to the function However, support vector machines address this issue by learning an α vector that predominantly consists of zeros Consequently, classifying a new example necessitates evaluating the kernel function solely for those training examples with non-zero α i, referred to as support vectors.
Unsupervised Learning Algorithms
Unsupervised algorithms operate by analyzing "features" without any supervisory signals, making the distinction between supervised and unsupervised methods somewhat ambiguous Unlike supervised learning, which relies on labeled data, unsupervised learning focuses on extracting information from data distributions without requiring human annotation This approach is commonly linked to tasks such as density estimation, sampling from distributions, data denoising, identifying data manifolds, and clustering related examples into groups.
In unsupervised learning, a key objective is to identify the optimal representation of data The term "optimal" can have various interpretations, but it typically refers to a representation that retains as much information about the original data (x) as possible This process often involves applying certain penalties or constraints to ensure that the resulting representation is simpler and more manageable than the original data itself.
Simpler representations can be defined in several ways, with three common types being lower-dimensional, sparse, and independent representations Low-dimensional representations aim to condense as much information as possible about a dataset into a smaller format Sparse representations, as discussed by Barlow (1989), Olshausen and Field (1996), and others, involve embedding the dataset in a space where most entries are zero, necessitating a higher dimensionality to preserve information This approach typically leads to a structure that distributes data along the axes of the representation space Meanwhile, independent representations focus on disentangling the sources of variation in the data, ensuring that the dimensions of the representation are statistically independent.
These criteria are not mutually exclusive. Low-dimensional representations often have fewer or weaker dependencies than the original high-dimensional data, because reducing the size of the representation requires finding and removing redundancies; identifying and removing more redundancy allows the dimensionality reduction algorithm to achieve greater compression while discarding less information.
Representation is a fundamental concept in deep learning and a key focus of this book This section presents straightforward examples of representation learning algorithms that illustrate how to implement the three criteria discussed Subsequent chapters will explore additional representation learning algorithms, expanding on these criteria in various ways or introducing new ones.
In section 2.12, we explored how the principal components analysis (PCA) algorithm compresses data effectively. PCA serves as an unsupervised learning technique that develops a simplified representation of data, adhering to key criteria for effective data representation.
PCA identifies a linear projection that aligns the directions of greatest variance with the axes of a new space, transforming the original data into a lower-dimensional representation. In this new space, the data exhibits the greatest variance along the first axis, \( z_1 \), and the second-most variance along \( z_2 \). PCA also ensures that the elements of this representation are uncorrelated, a first step toward statistical independence among the variables. Achieving full independence requires further representation learning techniques that remove the nonlinear relationships present in the data.
PCA performs an orthogonal linear transformation, projecting the input \( x \) into a new representation \( z \). As discussed in section 2.12, PCA can learn a one-dimensional representation that best reconstructs the original data in the sense of mean squared error, and that representation corresponds to the first principal component. This reduces dimensionality while retaining as much information as possible (as quantified by least-squares reconstruction error). PCA also decorrelates the original data representation \( X \).
In this section, we consider the \( m \times n \) design matrix \( X \), assuming that the data has a mean of zero, \( \mathbb{E}[x] = 0 \). If this is not the case, the data can be centered by subtracting the mean from all examples in a preprocessing step. The unbiased sample covariance matrix associated with \( X \) is then given by
\[
\mathrm{Var}[x] = \frac{1}{m-1} X^\top X.
\]
PCA finds a representation (through a linear transformation) \( z = x^\top W \) where \( \mathrm{Var}[z] \) is diagonal.
In section 2.12, we saw that the principal components of a design matrix \( X \) are given by the eigenvectors of \( X^\top X \). From this view,
\[
X^\top X = W \Lambda W^\top.
\]
In this section, we derive the principal components in an alternative way, through the singular value decomposition (SVD). Specifically, the principal components are the right singular vectors of the data matrix \( X \). Taking \( W \) to be the right singular vectors in the decomposition \( X = U \Sigma W^\top \), we can rewrite the original eigenvector equation with \( W \) as the basis of eigenvectors:
\[
X^\top X = \left(U \Sigma W^\top\right)^\top U \Sigma W^\top = W \Sigma^2 W^\top.
\]
The SVD is helpful for showing that PCA results in a diagonal \( \mathrm{Var}[z] \). Using the SVD of \( X \), we can express the variance of \( x \) as
\[
\mathrm{Var}[x] = \frac{1}{m-1} X^\top X
= \frac{1}{m-1} \left(U \Sigma W^\top\right)^\top U \Sigma W^\top
= \frac{1}{m-1} W \Sigma^\top U^\top U \Sigma W^\top
= \frac{1}{m-1} W \Sigma^2 W^\top,
\]
where we use the fact that \( U^\top U = I \), because the \( U \) matrix of the singular value decomposition is defined to be orthogonal. This shows that if we take \( z = x^\top W \), we can ensure that the covariance of \( z \) is diagonal, as required:
\[
\mathrm{Var}[z] = \frac{1}{m-1} Z^\top Z
= \frac{1}{m-1} W^\top X^\top X W
= \frac{1}{m-1} W^\top W \Sigma^2 W^\top W
= \frac{1}{m-1} \Sigma^2, \tag{5.95}
\]
where this time we use the fact that \( W^\top W = I \), again from the definition of the SVD.
The analysis above shows that when we project the data \( x \) to \( z \) through the linear transformation \( W \), the resulting representation has a diagonal covariance matrix, given by \( \Sigma^2/(m-1) \), which immediately implies that the individual elements of \( z \) are mutually uncorrelated.
PCA's key feature is its ability to transform data into a representation where the elements are uncorrelated, a simple form of disentangling the underlying factors of variation. This is accomplished by finding a rotation of the input space, represented by \( W \), that aligns the principal axes of variance with the basis of the new representation space associated with \( z \).
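The derivation above is easy to check numerically. The following NumPy sketch centers some toy data, takes the SVD \( X = U\Sigma W^\top \), and verifies that the projected representation \( z = XW \) has a (numerically) diagonal sample covariance; the data and variable names are placeholders for illustration only.

```python
import numpy as np

# Center the data so that E[x] = 0, then take the SVD X = U Sigma W^T.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
X = X - X.mean(axis=0)

U, s, Wt = np.linalg.svd(X, full_matrices=False)
W = Wt.T                      # columns are the principal directions
z = X @ W                     # the PCA representation z = x^T W for each row

# Var[z] is diagonal: off-diagonal entries of the sample covariance vanish.
cov_z = (z.T @ z) / (X.shape[0] - 1)
print(np.round(cov_z, 6))
```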
Correlation is an important category of dependency between elements of the data, but capturing more complex forms of feature dependencies requires more than what a simple linear transformation can provide.
K-means clustering is a straightforward representation learning algorithm that divides the training set into k distinct clusters of nearby examples. The algorithm thus provides a k-dimensional one-hot code vector \( h \) representing each input \( x \): if \( x \) is assigned to cluster \( i \), the corresponding entry \( h_i \) equals 1, while all other entries in the vector are zero.
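As a rough illustration, the sketch below runs a bare-bones k-means loop (fixed iteration count, no convergence check) and converts the resulting cluster assignments into the one-hot code \( h \) described above; it is a toy version, not a production implementation.

```python
import numpy as np

def kmeans_one_hot(X, k=3, n_iters=100, seed=0):
    # Minimal k-means: returns a one-hot code h for each example, where
    # h_i = 1 if the example is assigned to cluster i, and 0 otherwise.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each example to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned examples.
        for i in range(k):
            if np.any(assignments == i):
                centroids[i] = X[assignments == i].mean(axis=0)
    H = np.zeros((len(X), k))
    H[np.arange(len(X)), assignments] = 1.0
    return H, centroids

# Toy data drawn from three well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(50, 2)) + c for c in ([0, 0], [5, 5], [0, 5])])
H, centroids = kmeans_one_hot(X, k=3)
print(H[:5])
```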
Stochastic Gradient Descent
Stochastic gradient descent (SGD) is a crucial algorithm that drives most of deep learning, serving as an extension of the traditional gradient descent method.
A recurring problem in machine learning is that large training sets are necessary for good generalization, but large training sets are also more computationally expensive.
In machine learning, the cost function typically decomposes into a sum of per-example loss functions. For instance, the negative conditional log-likelihood of the training data can be written as
\[
J(\theta) = \mathbb{E}_{x,y\sim \hat p_{\text{data}}} L(x, y, \theta) = \frac{1}{m}\sum_{i=1}^{m} L\!\left(x^{(i)}, y^{(i)}, \theta\right), \tag{5.96}
\]
where \( L \) is the per-example loss \( L(x, y, \theta) = -\log p(y \mid x; \theta) \).
For these additive cost functions, gradient descent requires computing
\[
\nabla_\theta J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta L\!\left(x^{(i)}, y^{(i)}, \theta\right). \tag{5.97}
\]
The computational cost of this operation is \( O(m) \). As the training set size grows to billions of examples, the time to take a single gradient step becomes prohibitively long.
Stochastic gradient descent operates on the principle that the gradient is an expectation, which can be approximately estimated using a small sample. On each step of the algorithm, a minibatch of examples is sampled to form this estimate.
The minibatch \( \mathbb{B} = \{x^{(1)}, \ldots, x^{(m')}\} \) is drawn uniformly from the training set. The minibatch size \( m' \) is typically chosen to be a relatively small number of examples, ranging from one to a few hundred. Crucially, \( m' \) is usually held fixed as the training set size \( m \) grows, allowing us to fit a training set containing billions of examples while using updates computed from only a hundred examples.
The estimate of the gradient is formed as
\[
g = \frac{1}{m'} \nabla_\theta \sum_{i=1}^{m'} L\!\left(x^{(i)}, y^{(i)}, \theta\right) \tag{5.98}
\]
using examples from the minibatch \( \mathbb{B} \). The stochastic gradient descent algorithm then follows the estimated gradient downhill:
\[
\theta \leftarrow \theta - \epsilon\, g, \tag{5.99}
\]
where \( \epsilon \) is the learning rate.
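A schematic NumPy implementation of these update equations might look as follows. The linear-regression gradient used in the usage example is only a stand-in for whatever per-example loss a real model would define, and the function and variable names are made up for this sketch.

```python
import numpy as np

def sgd(grad_fn, params, X, y, lr=0.01, batch_size=100, n_steps=1000, seed=0):
    # Each step estimates the gradient from a minibatch of size m' = batch_size,
    # so the cost per update does not depend on the full training set size m.
    rng = np.random.default_rng(seed)
    m = len(X)
    for _ in range(n_steps):
        idx = rng.choice(m, size=batch_size, replace=False)
        g = grad_fn(params, X[idx], y[idx])   # gradient averaged over the minibatch
        params = params - lr * g              # theta <- theta - epsilon * g
    return params

# Toy usage: linear regression trained with a squared-error gradient.
def linreg_grad(w, Xb, yb):
    return 2.0 / len(Xb) * Xb.T @ (Xb @ w - yb)

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=10_000)
print(sgd(linreg_grad, np.zeros(3), X, y))   # should approach true_w
```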
Gradient descent has traditionally been viewed as a slow or unreliable method, particularly when applied to non-convex optimization problems. However, contemporary understanding reveals that machine learning models can be effectively trained using gradient descent. While this optimization algorithm may not consistently reach a local minimum in a timely manner, it frequently finds a sufficiently low value of the cost function quickly enough to be practical for applications.
Stochastic gradient descent (SGD) is crucial not only in deep learning but also for training large linear models on extensive datasets. The cost of each SGD update remains constant regardless of the training set size, allowing for scalability. While larger training sets typically require more updates for convergence, the model can achieve its best possible test error without needing to sample every training example. Consequently, as the training set size approaches infinity, the asymptotic cost of training with SGD can be considered O(1) in relation to the dataset size.
Before deep learning emerged, the main way to learn nonlinear models was to use the kernel trick in combination with a linear model, which requires constructing an \( m \times m \) matrix \( G_{i,j} = k\!\left(x^{(i)}, x^{(j)}\right) \) at a computational cost of \( O(m^2) \). This approach is impractical for datasets with billions of examples. Beginning in 2006, deep learning gained traction in academia for its superior generalization on medium-sized datasets, typically containing tens of thousands of examples. Its ability to train nonlinear models efficiently on large datasets soon attracted significant interest from industry.
Stochastic gradient descent and many enhancements to it are described further in chapter 8.
Building a Machine Learning Algorithm
Nearly all deep learning algorithms can be described as particular instances of a fairly simple recipe: combine a specification of a dataset, a cost function, an optimization procedure and a model.
For example, the linear regression algorithm combines a dataset consisting of \( X \) and \( y \), the cost function
\[
J(w, b) = -\mathbb{E}_{x,y\sim\hat p_{\text{data}}} \log p_{\text{model}}(y \mid x), \tag{5.100}
\]
the model specification \( p_{\text{model}}(y \mid x) = \mathcal{N}\!\left(y;\, x^\top w + b,\, 1\right) \), and, in most cases, an optimization algorithm defined by solving for where the gradient of the cost is zero using the normal equations.
By realizing that we can replace any of these components mostly independently from the others, we can obtain a very wide variety of algorithms.
The cost function typically includes at least one term that causes the learning process to perform statistical estimation. The most common cost function is the negative log-likelihood, so that minimizing the cost function performs maximum likelihood estimation.
The cost function may also include additional terms, such as regularization terms. For example, we can add weight decay to the linear regression cost function to obtain
\[
J(w, b) = \lambda \|w\|_2^2 - \mathbb{E}_{x,y\sim\hat p_{\text{data}}} \log p_{\text{model}}(y \mid x). \tag{5.101}
\]
This still allows closed-form optimization.
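As an illustration of the closed-form optimization that weight decay still permits, the sketch below solves a regularized least-squares problem directly. It uses a mean-squared-error data term rather than the exact negative log-likelihood of equation 5.101, so the scaling of \( \lambda \) here is a convention of this sketch only.

```python
import numpy as np

def ridge_regression(X, y, lam=0.1):
    """Closed-form minimizer of (1/m) * ||X w - y||^2 + lam * ||w||_2^2."""
    m, n = X.shape
    # Setting the gradient to zero gives (X^T X + m*lam*I) w = X^T y.
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + 0.05 * rng.normal(size=100)
print(ridge_regression(X, y, lam=0.01))
```

Larger values of `lam` shrink the solution toward zero, trading training error for smaller weights.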
Switching to a nonlinear model complicates the optimization of most cost functions, making closed-form solutions impractical Consequently, it becomes necessary to adopt iterative numerical optimization methods, like gradient descent, for effective optimization.
The same recipe of combining a model, a cost, and an optimization method supports both supervised and unsupervised learning. The linear regression example shows how to support supervised learning; unsupervised learning can be supported by defining a dataset that contains only the input features \( X \) and providing an appropriate unsupervised cost function and model. For example, we can obtain the first PCA vector by specifying that our loss function is
\[
J(w) = \mathbb{E}_{x\sim\hat p_{\text{data}}} \|x - r(x; w)\|_2^2, \tag{5.102}
\]
while our model is defined to have \( w \) with norm one and reconstruction function \( r(x) = w^\top x \, w \).
In certain situations, evaluating the cost function directly may be computationally infeasible However, we can still achieve approximate minimization through iterative numerical optimization, provided we have a method to estimate its gradients.
Most machine learning algorithms adhere to this common framework, though it may not be immediately apparent. Even algorithms that appear unique often rely on special-case optimizers. For instance, models like decision trees and k-means require special-case optimizers because their cost functions have flat regions that make them unsuitable for gradient-based minimization. Recognizing that most machine learning algorithms can be described within this framework allows us to view them as part of a broader taxonomy of methods that share related principles and perform related tasks, instead of treating them as an unrelated collection of algorithms.
Challenges Motivating Deep Learning
While the straightforward machine learning algorithms discussed in this chapter effectively address numerous significant issues, they fall short in tackling key challenges in artificial intelligence, including speech recognition and object identification.
The development of deep learning was motivated in part by the failure of traditional algorithms to generalize well on such AI tasks.
Generalizing to new examples in high-dimensional data presents significant challenges, as traditional machine learning methods struggle to learn complex functions in these spaces Additionally, high-dimensional environments often incur substantial computational costs Deep learning has been developed to address these issues and enhance generalization capabilities in such intricate scenarios.
High-dimensional data poses significant challenges in machine learning, a difficulty often referred to as the curse of dimensionality. This issue arises because the number of possible distinct configurations of a set of variables grows exponentially with the number of variables.
As the number of relevant dimensions of the data increases, the number of configurations of interest may grow exponentially. In a one-dimensional scenario, we might wish to distinguish only 10 regions of interest; when there are enough examples falling within each region, learning algorithms can generalize easily.
One way to generalize is to estimate the value of the target function within each region and interpolate between neighboring regions. In two dimensions, distinguishing 10 different values of each variable is harder: we must keep track of up to \( 10 \times 10 = 100 \) regions, and we need at least as many examples to cover them all. In three dimensions this grows to \( 10^3 = 1{,}000 \) regions, and in general, for \( d \) dimensions with \( v \) values along each axis, we require \( O(v^d) \) regions and examples. This is an instance of the curse of dimensionality.
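The exponential growth in the number of regions is easy to see numerically; this short snippet simply tabulates \( v^d \) for a few values of \( d \).

```python
# Number of grid cells (and hence, at minimum, examples needed) grows as v**d.
v = 10                      # distinguishable values per dimension
for d in (1, 2, 3, 10):
    print(f"d={d:2d}: {v**d:,} regions")
# d= 1: 10;  d= 2: 100;  d= 3: 1,000;  d=10: 10,000,000,000
```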
The curse of dimensionality arises in many places in computer science, and especially so in machine learning.
One challenge posed by the curse of dimensionality is a statistical challenge.
As illustrated in figure 5.9, a statistical challenge arises because the number of possible configurations of x is much larger than the number of training examples.
To grasp the issue, picture the input space organized into a grid. In low dimensions, we can describe the space with a small number of grid cells, most of which are occupied by data. When generalizing to a new data point, we can usually tell what to do simply by inspecting the training examples that lie in the same grid cell as the new input. For example, to estimate the probability density at some point \( x \), we can return the number of training examples falling in the same unit-volume cell as \( x \), divided by the total number of training examples.
To classify an example, we can return the most common class among the training examples in the same cell; for regression, we can average the target values of the examples in that cell. In high-dimensional spaces, however, most grid cells contain no training examples at all, making it difficult to say anything meaningful about these new configurations. Many traditional machine learning algorithms simply assume that the output at a new point should be approximately the same as the output at the nearest training point, effectively relying on proximity-based interpolation.
5.11.2 Local Constancy and Smoothness Regularization
To ensure effective generalization, machine learning algorithms must be guided by prior beliefs regarding the functions they are intended to learn. These priors can be explicitly represented as probability distributions over model parameters, or they can influence the function directly and affect the parameters only indirectly. Furthermore, prior beliefs may be implicitly expressed through the selection of algorithms that favor certain classes of functions, even when these biases cannot be articulated in terms of a probability distribution reflecting our beliefs about various functions.
One of the most commonly utilized implicit priors in machine learning is the smoothness prior, also known as the local constancy prior. This principle asserts that the function we aim to learn should exhibit minimal variation within small, localized regions.
Many simpler algorithms rely exclusively on this prior to generalize well, which limits their ability to address the complexities of AI-level tasks. This book explores how deep learning incorporates additional explicit and implicit priors to reduce generalization error on such sophisticated tasks. Here, we also discuss why the smoothness prior alone is insufficient for these tasks.
There are various methods to express the belief that a learned function should be smooth or locally constant, both implicitly and explicitly. These approaches aim to guide the learning process toward a function \( f^* \) that satisfies \( f^*(x) \approx f^*(x + \epsilon) \) for most configurations \( x \) and small changes \( \epsilon \). Essentially, if we have a reliable output for an input \( x \), such as a labeled training example, it is likely that this output will also be valid in the vicinity of \( x \). When multiple reliable outputs exist within a neighborhood, they can be combined through averaging or interpolation to produce a consensus answer that aligns closely with these outputs.
The k-nearest neighbors (k-NN) family of algorithms exemplifies the local constancy approach in machine learning, where predictions remain constant within regions defined by the same k nearest neighbors from the training data. Specifically, when k equals 1, the number of distinguishable regions cannot exceed the total number of training examples available.
The k-nearest neighbors algorithm copies the outputs of nearby training examples, while kernel machines interpolate between the outputs associated with nearby training data. An important family of kernels, the local kernels, take a large value when the two inputs are identical and decay as the inputs grow farther apart. These local kernels can be viewed as similarity measures that perform template matching, assessing how closely a test example resembles each training example. Much of the modern motivation for deep learning comes from studying the limitations of local template matching and the ways deep models succeed where it fails (Bengio et al., 2006b).
Decision trees also face the limitations of exclusively smoothness-based learning, because they partition the input space into as many regions as there are leaves and assign separate parameters to each region. To accurately represent a target function that requires a tree with at least n leaves, at least n training examples are needed; a multiple of n is needed to achieve some level of statistical confidence in the predictions.
In general, to distinguish O(k) regions of the input space, these methods require O(k) examples. Typically there are O(k) parameters, with O(1) parameters associated with each of the O(k) regions. The nearest neighbor scenario, in which each training example defines at most one region, is depicted in figure 5.10.
Example: Learning XOR
To make the idea of a feedforward network more concrete, we begin with an example of a fully functioning feedforward network on a very simple task: learning the XOR function.
The XOR function, or "exclusive or," operates on two binary values, \( x_1 \) and \( x_2 \), returning 1 when exactly one of the values is 1, and 0 otherwise. This function defines the target function \( y = f^*(x) \) that we aim to learn. Our model provides a function \( y = f(x; \theta) \), and the learning algorithm adjusts the parameters \( \theta \) to make \( f \) as close as possible to \( f^* \).
In this simple example, we will not be concerned with statistical generalization.
We want our network to perform correctly on the four points \( \mathbb{X} = \{[0,0]^\top, [0,1]^\top, [1,0]^\top, [1,1]^\top\} \). We will train the network on all four of these points. The only challenge is to fit the training set.
This issue can be approached as a regression problem, using a mean squared error (MSE) loss function for simplicity in this example. However, it is important to note that MSE is generally not suitable for modeling binary data in practical scenarios; more suitable approaches are described in section 6.2.2.2.
Evaluated on our whole training set, the MSE loss function is
\[
J(\theta) = \frac{1}{4} \sum_{x \in \mathbb{X}} \left(f^*(x) - f(x; \theta)\right)^2. \tag{6.1}
\]
Now we must choose the form of our model, \( f(x; \theta) \). Suppose that we choose a linear model, with \( \theta \) consisting of \( w \) and \( b \). Our model is defined to be
\[
f(x; w, b) = x^\top w + b. \tag{6.2}
\]
We can minimize \( J(\theta) \) in closed form with respect to \( w \) and \( b \) using the normal equations.
After solving the normal equations, we find that the linear model yields \( w = 0 \) and \( b = \frac{1}{2} \), so the model simply outputs 0.5 everywhere. This happens because a linear model cannot represent the XOR function, as illustrated in figure 6.1. One way to solve this problem is to use a model that learns a different feature space in which a linear model is able to represent the solution.
Here we introduce a simple feedforward network with one hidden layer containing two hidden units, as illustrated in figure 6.2. The vector of hidden units \( h \) is computed by a function \( f^{(1)}(x; W, c) \). The values of these hidden units are then used as input for the output layer, which is a linear regression model applied to \( h \) rather than to the original input \( x \). The network thus contains two functions chained together, \( h = f^{(1)}(x; W, c) \) and \( y = f^{(2)}(h; w, b) \), with the complete model being \( f(x; W, c, w, b) = f^{(2)}\!\left(f^{(1)}(x)\right) \).
The function \( f^{(1)} \) must be nonlinear, despite the success of linear models elsewhere; if it were linear, the entire feedforward network would remain a linear function of its input. If we took \( f^{(1)}(x) = W^\top x \) and \( f^{(2)}(h) = h^\top w \), then \( f(x) = w^\top W^\top x \), which we could represent as \( f(x) = x^\top w' \) with \( w' = W w \). This is why the network needs a nonlinearity.
To obtain a nonlinear description of the features, most neural networks apply an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called the activation function. Here we define \( h = g\!\left(W^\top x + c\right) \), where \( W \) provides the weights of the linear transformation and \( c \) the biases. This extends the affine transformation used in linear regression from a vector-to-scalar mapping to a vector-to-vector mapping.
Figure 6.1: Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point.
A linear model applied directly to the original input cannot implement the XOR function: the coefficient on \( x_2 \) is fixed, so the model cannot have its output increase with \( x_2 \) when \( x_1 = 0 \) and decrease with \( x_2 \) when \( x_1 = 1 \). In the transformed space of features extracted by a neural network, however, the problem becomes solvable: the two points that must produce output 1 have been collapsed into a single point in feature space, so a linear model can describe the function as increasing in \( h_1 \) and decreasing in \( h_2 \).
In this context, the primary goal of learning the feature space is to enhance the model's capacity to accurately fit the training data. However, in practical scenarios, learned representations can also significantly improve the model's ability to generalize beyond the training set.
Figure 6.2 illustrates a feedforward network designed to address the XOR problem, featuring a single hidden layer with two units. The left side of the figure presents this style of drawing by representing each unit as a distinct node within the graph.
The explicit style of drawing each unit as a node can become space-consuming for larger networks. A more compact approach is to draw one node for each vector representing a layer's activations. Edges in the graph may be annotated with the parameters that define the relationships between layers, such as a matrix \( W \) describing the mapping from the input \( x \) to the hidden layer \( h \), and a vector \( w \) describing the mapping from \( h \) to the output \( y \), while intercept parameters are typically omitted. In this context, an affine transformation from an input vector \( x \) to an output vector \( h \) requires an entire vector of bias parameters. The activation function \( g \) is generally applied element-wise, with \( h_i = g\!\left(x^\top W_{:,i} + c_i\right) \). Modern neural networks commonly use the rectified linear unit, or ReLU, as the activation function, defined as \( g(z) = \max\{0, z\} \).
We can now specify our complete network as
\[
f(x; W, c, w, b) = w^\top \max\{0,\, W^\top x + c\} + b. \tag{6.3}
\]
We can now specify a solution to the XOR problem. Let
\[
W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \qquad
c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \qquad
w = \begin{bmatrix} 1 \\ -2 \end{bmatrix},
\]
and \( b = 0 \).
The rectified linear activation function is the recommended default for most feedforward neural networks. Applying it to the output of a linear transformation yields a nonlinear transformation, yet the resulting function remains very close to linear, in the sense that it is piecewise linear with two pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods and that help linear models generalize well. A general principle in computer science is that complicated systems can be built from minimal components; much as a Turing machine's memory needs only to store 0 or 1 states, rectified linear functions can serve as the building blocks of a universal function approximator.
The model processes a batch of inputs by utilizing a design matrix, denoted as X, which encompasses all four points within the binary input space, with each example represented in a separate row.
The first step in the neural network is to multiply the input matrix by the first layer’s weight matrix:
Next, we add the bias vector \( c \), to obtain
In this space, all the examples lie along a line with slope 1. As we move along this line, the output needs to begin at 0, rise to 1, and then drop back down to 0.
A linear model cannot implement such a function. To finish computing the value of \( h \) for each example, we apply the rectified linear transformation:
This transformation has changed the relationship among the examples: they no longer lie on a single line, as illustrated in figure 6.1. In the transformed space, a linear model can now solve the problem.
We finish by multiplying by the weight vector w:
The neural network has obtained the correct answer for every example in the batch.
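The forward pass described above is easy to verify numerically. The following NumPy sketch evaluates \( f(x; W, c, w, b) = w^\top \max\{0, W^\top x + c\} + b \) on the design matrix of all four binary inputs, using the hand-picked parameter values given earlier.

```python
import numpy as np

# Forward pass of the XOR network using the hand-picked solution above.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # one example per row
W = np.array([[1, 1], [1, 1]], dtype=float)
c = np.array([0, -1], dtype=float)
w = np.array([1, -2], dtype=float)
b = 0.0

H = np.maximum(0, X @ W + c)   # hidden representation: rectified affine transformation
y_hat = H @ w + b
print(y_hat)                   # [0. 1. 1. 0.] -- the XOR of each row of X
```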
In real-world scenarios involving billions of model parameters and training examples, simply guessing a solution is impractical. Instead, gradient-based optimization algorithms are essential for identifying parameters that minimize error effectively. The solution to the XOR problem we described is a global minimum of the loss function, so gradient descent could converge to this point. However, gradient descent could also find other equivalent solutions, and its convergence depends heavily on the initial parameter values. Typically, the outcomes of gradient descent are not as straightforward or easily interpretable as the clean, integer-valued solution presented in this example.
Gradient-Based Learning
Designing and training a neural network closely resembles the process of training other machine learning models with gradient descent. As outlined in section 5.10, creating a machine learning algorithm involves specifying an optimization procedure, a cost function, and a model family.
The key distinction between linear models and neural networks lies in the nonlinearity of the latter, which results in most loss functions being non-convex. Consequently, neural networks are typically trained using iterative, gradient-based optimizers that drive the cost function to a low value, unlike linear regression models, which use linear equation solvers or convex optimization algorithms with guaranteed global convergence. While convex optimization can converge from any initial parameters, stochastic gradient descent applied to non-convex loss functions lacks such guarantees and is sensitive to the initial parameter values. For feedforward neural networks, it is important to initialize all weights to small random values, with biases set to zero or to small positive values. The training algorithms for these networks, primarily based on gradient descent, will be explored in detail in later chapters, with particular attention to parameter initialization and refinements of the stochastic gradient descent algorithm.
Training models like linear regression and support vector machines with gradient descent is common, especially with large datasets. Training a neural network is similar to training other models, though computing the gradient is more complex. However, it can still be done efficiently and accurately. Section 6.5 explains how to obtain the gradient using the back-propagation algorithm and its modern generalizations.
To effectively implement gradient-based learning in machine learning models, it is essential to select an appropriate cost function and determine the representation of the model's output. In this context, we focus on the specific design considerations pertinent to neural networks.
The selection of the cost function is a crucial element in designing deep neural networks. Fortunately, the cost functions used for neural networks closely resemble those used in other parametric models, including linear models.
Our parametric model typically defines a distribution \( p(y \mid x; \theta) \), and we apply the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model's predictions as the cost function.
In some cases, we opt for a more straightforward approach and predict only some statistic of \( y \) conditioned on \( x \), rather than the entire probability distribution of \( y \). Specialized loss functions enable us to train a model to produce such estimates.
The total cost function for training neural networks typically combines a primary cost function with a regularization term. Regularization methods such as weight decay, illustrated earlier for linear models, are also effective for deep neural networks and are widely used. More sophisticated regularization strategies for neural networks are explored in chapter 7.
6.2.1.1 Learning Conditional Distributions with Maximum Likelihood
Most contemporary neural networks are trained using maximum likelihood, so the cost function is simply the negative log-likelihood, which can equivalently be described as the cross-entropy between the training data and the model's predicted distribution. This cost function is
\[
J(\theta) = -\mathbb{E}_{x,y\sim\hat p_{\text{data}}} \log p_{\text{model}}(y \mid x).
\]
The specific form of the cost function changes from model to model, depending on the form of \( \log p_{\text{model}} \). The expansion of this equation typically yields some terms that do not depend on the model parameters and may be discarded. For example, as we saw in section 5.5.1, if \( p_{\text{model}}(y \mid x) = \mathcal{N}\!\left(y;\, f(x; \theta),\, I\right) \), then we recover the mean squared error cost,
\[
J(\theta) = \frac{1}{2}\,\mathbb{E}_{x,y\sim\hat p_{\text{data}}} \left\|y - f(x; \theta)\right\|^2 + \text{const}.
\]
This highlights a correspondence between maximum likelihood estimation and mean squared error minimization for linear models, but the equivalence extends beyond linear models: it holds for any function \( f(x; \theta) \) used to predict the mean of a Gaussian distribution. The discarded constant term is based on the variance of the Gaussian distribution, which in this case we chose not to parametrize.
Deriving the cost function from maximum likelihood simplifies the modeling process: specifying a model \( p(y \mid x; \theta) \) automatically determines the cost function \( -\log p(y \mid x; \theta) \). A key aspect of neural network design is ensuring that the gradient of the cost function is sufficiently large and predictable to guide the learning algorithm. Functions that saturate (become very flat) undermine this objective by producing very small gradients, which often happens because the activation functions of hidden or output units saturate. The negative log-likelihood mitigates this problem for many models: several output units involve an exponential function that saturates when its argument is very negative, and the log function in the negative log-likelihood cost undoes that exponential. The relationship between the cost function and the choice of output unit is explored further in section 6.2.2.
The cross-entropy cost used for maximum likelihood estimation usually does not have a minimum value when applied to the models commonly used in practice. For discrete output variables, most models are parametrized so that they cannot represent probabilities of exactly zero or one, but can come arbitrarily close; logistic regression is an example of such a model. For real-valued output variables, if the model can control the density of the output distribution, for example by learning a Gaussian variance parameter, it can assign extremely high density to the correct training outputs, driving the cross-entropy toward negative infinity. The regularization techniques described in chapter 7 provide several ways of modifying the learning problem so that the model cannot reap unlimited reward in this way.
Instead of learning a full probability distribution \( p(y \mid x; \theta) \), we often want to learn just one conditional statistic of \( y \) given \( x \).
For example, we may have a predictor \( f(x; \theta) \) that we wish to use to predict the mean of \( y \).
A sufficiently powerful neural network can represent any function from a wide class of functions, limited mainly by properties such as continuity and boundedness rather than by a specific parametric form. From this point of view, we can regard the cost function as a functional, a mapping from functions to real numbers. Learning can then be understood as choosing a function, rather than merely adjusting a set of parameters.
We can design our cost functional so that its minimum occurs at a specific desired function. For instance, we can configure the cost functional so that its minimum corresponds to the function that maps \( x \) to the expected value of \( y \) given \( x \).
Hidden Units
In our previous discussions, we explored design choices applicable to most parametric machine learning models trained with gradient-based optimization. Now we turn to an issue that is specific to feedforward neural networks: selecting the type of hidden unit to use in the model's hidden layers.
The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles.
Rectified linear units (ReLUs) are an excellent default choice of hidden unit, although many other types exist. Choosing the right hidden unit can be challenging, but ReLUs typically perform well. Understanding the intuitions behind the various hidden units can help guide experimentation. Since it is usually impossible to predict in advance which unit will work best, the design process relies on trial and error: hypothesize that a kind of hidden unit may work well, train a network with it, and evaluate its performance on a validation set.
Although some hidden units are not differentiable at all input points, they can still be used effectively in practice. The rectified linear function, for example, is not differentiable at \( z = 0 \), but gradient-based training still performs well, in part because neural network training algorithms do not usually arrive at a local minimum of the cost function but merely reduce its value significantly. Hidden units that are not differentiable are usually non-differentiable at only a small number of points, and software implementations typically return one of the one-sided derivatives rather than reporting that the derivative is undefined. As a result, the non-differentiability of these hidden unit activation functions can safely be disregarded in practice.
Most hidden units accept a vector of inputs \( x \), compute an affine transformation \( z = W^\top x + b \), and then apply an element-wise nonlinear activation function \( g(z) \). Hidden units are distinguished from one another primarily by the choice of the activation function \( g(z) \).
6.3.1 Rectified Linear Units and Their Generalizations
Rectified linear units use the activation function \( g(z) = \max\{0, z\} \).
Rectified linear units are easy to optimize because they are so similar to linear units, differing only in that they output zero across half their domain. This makes the derivatives through a rectified linear unit remain large whenever the unit is active. The second derivative of the rectifying operation is zero almost everywhere, and the derivative is one everywhere the unit is active, so the gradient direction is far more useful for learning than it would be with activation functions that introduce second-order effects.
Rectified linear units are typically used on top of an affine transformation:
\[
h = g\!\left(W^\top x + b\right). \tag{6.36}
\]
When initializing the parameters of the affine transformation, it can be a good practice to set all elements of \( b \) to a small positive value, such as 0.1, so that the rectified linear units are likely to be initially active for most inputs in the training set and derivatives can pass through.
Several generalizations of rectified linear units exist. Most of these generalizations perform comparably to rectified linear units and occasionally perform better.
Rectified linear units have one drawback: they cannot learn via gradient-based methods on examples for which their activation is zero. However, various generalizations of rectified linear units guarantee that they receive a gradient everywhere.
One way to generalize rectified linear units is to use a nonzero slope \( \alpha_i \) for negative inputs, giving \( h_i = g(z)_i = \max(0, z_i) + \alpha_i \min(0, z_i) \). Absolute value rectification fixes \( \alpha_i = -1 \) to obtain \( g(z) = |z| \). It is particularly useful in object recognition from images (Jarrett et al., 2009), since it extracts features that are invariant under a polarity reversal of the input illumination. Other generalizations of the rectified linear unit are more broadly applicable: a leaky ReLU (Maas et al., 2013) fixes \( \alpha_i \) to a small value like 0.01, while a parametric ReLU, or PReLU, treats \( \alpha_i \) as a learnable parameter (He et al., 2015).
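Below is a small sketch of this family of activation functions, with \( \alpha \) treated as a fixed hyperparameter (as in absolute value rectification and the leaky ReLU); a PReLU would instead learn \( \alpha \) by gradient descent. The function name is ours, not from any particular library.

```python
import numpy as np

def generalized_relu(z, alpha):
    # h_i = max(0, z_i) + alpha_i * min(0, z_i)
    return np.maximum(0, z) + alpha * np.minimum(0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(generalized_relu(z, alpha=0.0))    # standard ReLU
print(generalized_relu(z, alpha=0.01))   # leaky ReLU (small fixed negative slope)
print(generalized_relu(z, alpha=-1.0))   # absolute value rectification: |z|
```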
Maxout units (Goodfellow et al., 2013a) generalize rectified linear units further by grouping the pre-activations into sets of \( k \) values. Each maxout unit outputs the maximum element of one of these groups, \( g(z)_i = \max_{j \in \mathbb{G}^{(i)}} z_j \), where \( \mathbb{G}^{(i)} \) is the set of indices of the inputs for group \( i \). This provides a way of learning a piecewise linear function that responds to multiple directions in the input space, enhancing model flexibility.
Maxout units are capable of learning piecewise linear, convex functions with up to k segments, allowing them to effectively learn the activation function itself rather than merely the relationships between units With a sufficiently large k, a maxout unit can approximate any convex function with high accuracy Specifically, a maxout layer with two pieces can replicate functions similar to those produced by traditional layers using rectified linear, absolute value rectification, or leaky and parametric ReLU activation functions, while also having the flexibility to learn entirely different functions Although a maxout layer may implement the same function as other layer types, it is parametrized differently, resulting in distinct learning dynamics.
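The following toy sketch shows the basic maxout computation for a single example, assuming for simplicity that the pre-activations are grouped into consecutive blocks of \( k \) values; real implementations operate on batches and learn the \( k \) weight vectors that produce \( z \).

```python
import numpy as np

def maxout(z, k):
    # Divide the pre-activations z into groups of k values and take the max
    # within each group: g(z)_i = max over the i-th group of z.
    z = z.reshape(-1, k)
    return z.max(axis=1)

z = np.array([0.3, -1.2, 2.0, 0.7, -0.1, 0.4])   # 6 pre-activations
print(maxout(z, k=2))                             # 3 maxout units: [0.3, 2.0, 0.4]
```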
Maxout units are characterized by k weight vectors, requiring more regularization compared to rectified linear units However, they can perform effectively without regularization when trained on large datasets and maintaining a low number of pieces per unit (Cai et al., 2013).
Maxout units offer several advantages, including statistical and computational benefits by reducing the number of parameters required When the features from multiple linear filters can be effectively summarized by taking the maximum over groups of features, the subsequent layer can operate with significantly fewer weights, specifically k times fewer.
Maxout units, which are driven by multiple filters, exhibit redundancy that aids in mitigating catastrophic forgetting—a phenomenon where neural networks lose the ability to perform previously learned tasks (Goodfellow et al., 2014a).
Rectified linear units and their generalizations enhance model optimization by promoting linear behavior This principle is not limited to deep linear networks; it also applies to recurrent networks, which learn from sequences and generate a series of states and outputs Training these networks requires the propagation of information over multiple time steps, a process made simpler through linear computations with directional derivatives close to 1 The Long Short-Term Memory (LSTM) architecture exemplifies this by utilizing summation to propagate information through time, representing a straightforward form of linear activation.
6.3.2 Logistic Sigmoid and Hyperbolic Tangent
Prior to the introduction of rectified linear units, most neural networks used the logistic sigmoid activation function
\[
g(z) = \sigma(z) \tag{6.38}
\]
or the hyperbolic tangent activation function
\[
g(z) = \tanh(z). \tag{6.39}
\]
These activation functions are closely related because \( \tanh(z) = 2\sigma(2z) - 1 \).
Sigmoid units are commonly used as output units to predict the probability that a binary variable is 1. Unlike piecewise linear units, however, they saturate across most of their domain: they become insensitive to inputs that are strongly positive or strongly negative and respond strongly only when the input is near zero. This widespread saturation can make gradient-based learning difficult, and for this reason their use as hidden units in feedforward networks is now discouraged. They can still be used as output units, provided the cost function is chosen to undo the saturation of the sigmoid in the output layer.
Architecture Design
A crucial aspect of designing neural networks is establishing the architecture, which encompasses the network's overall structure, including the number of units and their interconnections.
Neural networks consist of layers, groups of units structured in a chain, with each layer acting on the output of the layer before it. The first layer is given by
\[
h^{(1)} = g^{(1)}\!\left(W^{(1)\top} x + b^{(1)}\right), \tag{6.40}
\]
the second layer is given by
\[
h^{(2)} = g^{(2)}\!\left(W^{(2)\top} h^{(1)} + b^{(2)}\right), \tag{6.41}
\]
and so on.
In these chain-based architectures, the main architectural considerations are choosing the depth of the network and the width of each layer. A network with even one hidden layer is sufficient to fit the training set, but deeper networks often use far fewer units per layer and far fewer parameters, and frequently generalize better to the test set; they are also often harder to optimize. The ideal architecture for a specific task must therefore be found via experimentation guided by monitoring the validation set error.
6.4.1 Universal Approximation Properties and Depth
A linear model, mapping from features to outputs via matrix multiplication, can by definition represent only linear functions. It has the advantage of being easy to train, because many loss functions result in convex optimization problems when applied to linear models. Unfortunately, we often want to learn nonlinear functions.
Learning a nonlinear function does not require designing a specialized model family, because feedforward networks with hidden layers provide a universal approximation framework. The universal approximation theorem states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function can approximate any Borel measurable function from one finite-dimensional space to another to any desired degree of accuracy, provided the network has enough hidden units. The derivatives of the feedforward network can also approximate the derivatives of the function arbitrarily well. Any continuous function on a closed and bounded subset of \( \mathbb{R}^n \) is Borel measurable and can therefore be approximated by a neural network. The theorem has also been extended to a broader class of activation functions, which includes the now commonly used rectified linear unit.
The universal approximation theorem says that a sufficiently large multi-layer perceptron (MLP) will be able to represent any function, but it does not guarantee that the training algorithm will be able to learn that function. Learning can fail because the optimization algorithm may not be able to find the parameter values that correspond to the desired function, or because the training algorithm might choose the wrong function as a result of overfitting. Moreover, the "no free lunch" theorem shows that there is no universally superior machine learning algorithm; feedforward networks provide a universal system for representing functions, but there is no universal procedure for examining a training set of specific examples and choosing a function that will generalize to points not in the training set.
The universal approximation theorem asserts that a sufficiently large network can achieve any desired degree of accuracy, but it does not specify how large the network must be. Barron (1993) provides bounds on the size of a single-layer network needed to approximate a broad class of functions. Unfortunately, in the worst case, an exponential number of hidden units may be required. This is easiest to see in the binary case: the number of possible binary functions on vectors \( v \in \{0,1\}^n \) is \( 2^{2^n} \), so selecting one such function requires \( 2^n \) bits, which in general demands \( O(2^n) \) degrees of freedom.
A single-layer feedforward network can theoretically represent any function; however, it may require an impractically large number of units and may fail to learn and generalize correctly. Utilizing deeper models often leads to a more efficient representation of the desired function and can significantly decrease generalization error.
There exist families of functions that can be approximated efficiently by architectures deeper than some value \( d \), but that require a much larger model when the depth is restricted to be at most \( d \); in many cases, the shallow model requires an exponential number of hidden units. Such results were first proved for models unlike the continuous, differentiable neural networks used in machine learning today: first for circuits of logic gates (Håstad, 1986), then for linear threshold units with nonnegative weights (Håstad and Goldmann, 1991; Hajnal et al., 1993), and later for networks with continuous-valued activations (Maass, 1992; Maass et al., 1994). Modern neural networks frequently use rectified linear units (ReLU); Leshno et al. (1993) showed that shallow networks with a broad family of non-polynomial activation functions, including rectified linear units, have universal approximation properties, but these results do not address the questions of depth or efficiency, only indicating that a sufficiently wide rectifier network can represent any function.
Research from 2014 demonstrated that functions representable with a deep rectifier network can require an exponential number of hidden units when represented with a shallow network having only one hidden layer. Specifically, it was shown that piecewise linear networks, which use rectifier nonlinearities or maxout units, can represent functions with a number of regions that is exponential in the network's depth. An illustration shows how networks using absolute value rectification create mirror images of the function computed on top of some hidden unit, effectively folding the input space to generate mirrored responses around the nonlinearity. By composing these folding operations, the networks can produce an exponentially vast number of piecewise linear regions, enabling them to capture a wide array of regular patterns.
Montufar et al. (2014) provide a geometric interpretation of the exponential advantages of deeper rectifier networks. The left illustration depicts an absolute value rectification unit, which produces identical outputs for mirrored input pairs, defined by a hyperplane of weights and bias. The resulting function, represented by the green decision surface, mirrors a simpler pattern across the symmetry axis. In the center, the function is visualized as a folded space around this axis of symmetry. On the right, an additional repeating pattern is overlaid, facilitated by another downstream unit, resulting in a new symmetry that is replicated four times across two hidden layers.
The main theorem of Montufar et al. (2014) states that the number of linear regions carved out by a deep rectifier network with \( d \) inputs, depth \( l \), and \( n \) units per hidden layer is
\[
O\!\left(\binom{n}{d}^{d(l-1)} n^{d}\right), \tag{6.42}
\]
i.e., exponential in the depth \( l \). In the case of maxout networks with \( k \) filters per unit, the number of linear regions is
\[
O\!\left(k^{(l-1)+d}\right). \tag{6.43}
\]
Of course, there is no guarantee that the kinds of functions we want to learn in applications of machine learning (and in particular for AI) share such a property.
Choosing a deep model encodes a general belief that the desired function involves the composition of several simpler functions; equivalently, the learning problem consists of discovering a set of underlying factors of variation. The computation can also be viewed as a multi-step program in which each step builds on the output of the previous one, giving the network a way to organize its processing internally. Empirically, greater depth does seem to lead to better generalization across a wide variety of tasks.
Research by Farabet et al. (2013), Couprie et al. (2013), Kahou et al. (2013), Goodfellow et al. (2013), and Szegedy et al. (2014) demonstrates that deep architectures effectively capture useful priors in the function space learned by models. Empirical results illustrated in figures 6.6 and 6.7 support this finding, highlighting the advantages of depth.
So far we have described neural networks as being simple chains of layers, with the main considerations being the depth of the network and the width of each layer.
In practice, neural networks show considerably more diversity.
Numerous neural network architectures have been designed for particular tasks, including convolutional networks specialized for computer vision, as discussed in Chapter 9 Additionally, feedforward networks can be adapted into recurrent neural networks for sequence processing, which are explored in Chapter 10 and involve unique architectural considerations.
Back-Propagation and Other Differentiation Algorithms
In a feedforward neural network, the input \( x \) is processed to produce an output \( \hat{y} \) through a process known as forward propagation, in which information flows through the hidden units of each layer. During training, forward propagation continues until it produces a scalar cost \( J(\theta) \). The back-propagation algorithm, introduced by Rumelhart et al. (1986), allows the information from the cost to flow backward through the network in order to compute the gradient needed to update the model.
Calculating the gradient analytically is simple; however, its numerical evaluation can be costly in terms of computation The back-propagation algorithm efficiently addresses this challenge through a straightforward and cost-effective method.
Back-propagation is commonly misunderstood as being the entire learning algorithm for multi-layer neural networks; in fact, it refers only to the method for computing the gradient. Another algorithm, such as stochastic gradient descent, uses this gradient to perform learning. Furthermore, back-propagation is not specific to multi-layer networks: it can compute derivatives of any function, although some functions may have undefined derivatives. This section describes how to compute the gradient.
The expression \( \nabla_x f(x, y) \) denotes the gradient of an arbitrary function \( f \), where \( x \) is the set of variables whose derivatives are desired and \( y \) is an additional set of input variables whose derivatives are not required. In learning algorithms, the gradient we most often need is the gradient of the cost function with respect to the parameters, \( \nabla_\theta J(\theta) \).
Many machine learning tasks involve computing other derivatives, either as part of the learning process or to analyze the learned model. The back-propagation algorithm can be applied to these tasks as well and is not restricted to computing the gradient of the cost function with respect to the parameters. The idea of computing derivatives by propagating information backward through a network is very general and can be used, for example, to compute the Jacobian of a function with multiple outputs; here we restrict our description to the most common case, in which the function has a single output.
So far we have discussed neural networks with a relatively informal graph language.
To describe the back-propagation algorithm more precisely, it is helpful to have a more precise computational graph language.
Many ways of formalizing computation as graphs are possible.
Here, we use each node in the graph to indicate a variable The variable may be a scalar, vector, matrix, tensor, or even a variable of another type.
To formalize our graphs, we also need to introduce the idea of an operation.
An operation refers to a fundamental function involving one or more variables, and our graph language is supported by a defined set of permissible operations. More complex functions can be represented by combining multiple operations from this set.
Without loss of generality, we define an operation to return a single output variable, which may itself have multiple entries, such as a vector. While software implementations of back-propagation typically support operations with multiple outputs, we avoid this case in our description because it introduces bookkeeping details that distract from the conceptual understanding.
In a directed graph, when a variable y is derived from a variable x through a specific operation, we represent this relationship with a directed edge from x to y. Occasionally, we label the output node with the name of the operation performed, although this annotation may be omitted if the operation is evident from the surrounding context.
Examples of computational graphs are shown in figure 6.8.
The chain rule of calculus is essential for finding the derivatives of composite functions, utilizing known derivatives of individual functions. Back-propagation serves as an efficient algorithm that implements the chain rule with a specific sequence of operations.
In calculus, when we have two functions \( f \) and \( g \) mapping real numbers to real numbers, the chain rule gives the derivative of their composition. If \( y = g(x) \) and \( z = f(g(x)) \), then
\[
\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}.
\]
That is, the rate of change of \( z \) with respect to \( x \) is the rate of change of \( z \) with respect to \( y \) multiplied by the rate of change of \( y \) with respect to \( x \).
Figure 6.8: Examples of computational graphs. (a) The graph using the \( \times \) operation to compute \( z = xy \). (b) The graph for the logistic regression prediction \( \hat y = \sigma\!\left(x^\top w + b\right) \). Some of the intermediate variables do not have names in the algebraic expression but need names in the graph; we simply name the \( i \)-th such variable \( u^{(i)} \). (c) The computational graph for the expression \( H = \max\{0, XW + b\} \), which computes a design matrix of rectified linear unit activations \( H \) given a minibatch of inputs \( X \). (d) Examples a–c apply at most one operation to each variable, but it is also possible to apply more than one operation, as in a computational graph that applies several operations to the weights \( w \) of a linear regression model; the weights are used to make both the prediction \( \hat y \) and the weight decay penalty \( \lambda \sum_i w_i^2 \).

We can generalize the chain rule beyond the scalar case. Suppose that \( x \in \mathbb{R}^m \), \( y \in \mathbb{R}^n \), \( g \) maps from \( \mathbb{R}^m \) to \( \mathbb{R}^n \), and \( f \) maps from \( \mathbb{R}^n \) to \( \mathbb{R} \). If \( y = g(x) \) and \( z = f(y) \), then
\[
\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}. \tag{6.45}
\]
In vector notation, this may be equivalently written as
\[
\nabla_x z = \left(\frac{\partial y}{\partial x}\right)^{\!\top} \nabla_y z, \tag{6.46}
\]
where \( \frac{\partial y}{\partial x} \) is the \( n \times m \) Jacobian matrix of \( g \).
From this we see that the gradient of a variable \( x \) can be obtained by multiplying the Jacobian matrix \( \frac{\partial y}{\partial x} \) by the gradient \( \nabla_y z \). The back-propagation algorithm consists of performing such a Jacobian-gradient product for each operation in the computational graph.
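As a concrete, self-contained illustration of such a Jacobian-gradient product, the sketch below back-propagates through a toy composition \( z = f(g(x)) \) and checks the result against finite differences; the particular functions are arbitrary choices made for this example.

```python
import numpy as np

# Back-propagate through y = g(x) = tanh(W x) and z = f(y) = sum(y**2),
# using the Jacobian-gradient product: grad_x z = (dy/dx)^T grad_y z.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)

y = np.tanh(W @ x)                      # forward pass
z = np.sum(y ** 2)

grad_y = 2 * y                          # dz/dy
jacobian = (1 - y ** 2)[:, None] * W    # dy/dx, an n x m matrix
grad_x = jacobian.T @ grad_y            # chain rule, as in equation 6.46

# Check against a finite-difference approximation of dz/dx.
eps = 1e-6
numeric = np.array([(np.sum(np.tanh(W @ (x + eps * np.eye(4)[i])) ** 2) - z) / eps
                    for i in range(4)])
print(np.allclose(grad_x, numeric, atol=1e-4))   # True
```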
The back-propagation algorithm is typically applied to tensors of arbitrary dimensionality rather than just vectors. Conceptually, this process mirrors back-propagation with vectors, with the only difference being how the numbers are arranged in a grid to form a tensor. One approach is to flatten each tensor into a vector before running back-propagation, compute a vector-valued gradient, and then reshape the gradient back into a tensor. In this rearranged view, back-propagation is still just multiplying Jacobians by gradients.
To denote the gradient of a value \( z \) with respect to a tensor \( X \), we write \( \nabla_X z \), just as if \( X \) were a vector. The indices into a tensor such as a 3-D tensor are given by several coordinates, but we can abstract this away by using a single variable \( i \) to represent the complete tuple of indices. For every index tuple \( i \), \( (\nabla_X z)_i \) gives \( \frac{\partial z}{\partial X_i} \), exactly as for vectors, where \( (\nabla_x z)_i \) gives \( \frac{\partial z}{\partial x_i} \). Using this notation, the chain rule as it applies to tensors is: if \( Y = g(X) \) and \( z = f(Y) \), then
\[
\nabla_X z = \sum_j \left(\nabla_X Y_j\right) \frac{\partial z}{\partial Y_j}.
\]
6.5.3 Recursively Applying the Chain Rule to Obtain Backprop
Using the chain rule, it is straightforward to write an algebraic expression for the gradient of a scalar with respect to any node in the computational graph that produced that scalar. However, actually evaluating that expression on a computer introduces some extra considerations.
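The sketch below, a toy illustration rather than the book's algorithm, applies the chain rule recursively over a tiny computational graph. For simplicity it propagates a gradient contribution along every path from the output, whereas an efficient implementation would visit each node once in reverse topological order:

# Each node stores its value, its parent nodes, and the local derivatives of its
# value with respect to each parent. backprop() walks the graph from the output,
# propagating gradient contributions edge by edge via the chain rule.

class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents
        self.local_grads = local_grads
        self.grad = 0.0

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def backprop(output):
    pending = [(output, 1.0)]
    while pending:
        node, upstream = pending.pop()
        node.grad += upstream                     # accumulate contributions
        for parent, local in zip(node.parents, node.local_grads):
            pending.append((parent, local * upstream))

x, y = Node(3.0), Node(4.0)
z = add(mul(x, y), x)                             # z = x*y + x
backprop(z)
print(x.grad, y.grad)                             # 5.0 3.0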
Historical Notes
Feedforward networks can be seen as efficient nonlinear function approximators based on using gradient descent to minimize the error in a function approximation.
From this point of view, the modern feedforward network is the culmination of centuries of progress on the general function approximation task.
The chain rule, on which the back-propagation algorithm is founded, was developed in the 17th century by mathematicians such as Leibniz and L'Hôpital. While calculus and algebra have long been used to solve optimization problems in closed form, gradient descent, which iteratively approximates solutions to such problems, was not introduced until the 19th century, notably by Cauchy in 1847.
Beginning in the 1940s, these function approximation techniques laid the groundwork for machine learning models such as the perceptron. Initially, these models relied on linear frameworks, but critics, including Marvin Minsky, pointed out significant limitations, such as the inability to learn the XOR function. This criticism sparked a backlash against the neural network approach as a whole.
The development of multilayer perceptrons and gradient computation was essential for learning nonlinear functions, with early applications of the chain rule emerging in the 1960s and 1970s, primarily for control and sensitivity analysis. In 1981, Werbos proposed applying these techniques to training artificial neural networks, and the idea was later independently rediscovered by researchers including LeCun and Parker. The influential book "Parallel Distributed Processing" presented early successful experiments with back-propagation, significantly popularizing the method and spurring research into multi-layer neural networks. Beyond back-propagation, the ideas presented by Rumelhart and Hinton emphasized the computational aspects of cognition and learning, leading to the establishment of "connectionism," which focuses on the connections between neurons as the foundation of learning and memory.
In particular, these ideas include the notion of distributed representation (Hinton et al., 1986).
The popularity of neural network research surged after the success of back-propagation, peaking in the early 1990s. Afterward, interest shifted towards other machine learning techniques until the resurgence of deep learning began in 2006.
Modern feedforward networks have kept their foundational concepts from the 1980s, relying on the same back-propagation algorithm and gradient descent methods. The significant improvements in neural network performance between 1986 and 2015 can largely be attributed to two factors: the availability of larger datasets, which reduced the difficulty of statistical generalization, and the increased size of neural networks, made possible by more powerful computers and better software infrastructure. In addition, a small number of algorithmic changes have led to noticeable improvements in performance.
The transition from mean squared error to cross-entropy loss functions marked a significant shift in algorithmic practice. While mean squared error was widely used in the 1980s and 1990s, it was gradually supplanted by cross-entropy losses and the principle of maximum likelihood, reflecting growing cross-fertilization between the statistics and machine learning communities. This shift greatly improved the performance of models with sigmoid and softmax outputs, which had previously suffered from saturation and slow learning under mean squared error.
The performance of feedforward networks also improved significantly with the shift from sigmoid hidden units to piecewise linear hidden units, particularly rectified linear units (ReLUs). Rectification with the max{0, z} function has its origins in early neural network models such as the Cognitron and Neocognitron, although these models did not use rectified linear units themselves. Despite this early history, rectification was largely overshadowed by sigmoids in the 1980s, perhaps because sigmoids performed better in very small neural networks. By the early 2000s, practitioners hesitated to use ReLUs out of concern over their non-differentiable points. This perception began to shift around 2009, when Jarrett et al. observed that using a rectifying nonlinearity is a crucial factor in improving the performance of a recognition system.
Jarrett et al. (2009) found that, for small datasets, using rectifying nonlinearities is even more important than learning the weights of the hidden layers. They showed that random weights are sufficient to propagate useful information through a rectified linear network, enabling the classifier layer at the top to learn how to map different feature vectors to their class identities.
When more data is available, learning begins to extract enough useful knowledge to exceed the performance of randomly chosen parameters. Research by Glorot et al. (2011a) showed that learning is far easier in deep rectified linear networks than in deep networks whose activation functions have curvature or saturate on both sides.
Rectified linear units are also of historical interest because they show the continuing influence of neuroscience on the development of deep learning algorithms. As Glorot et al. (2011a) highlight, ReLUs are motivated by biological neuron behavior, which exhibits three key characteristics: first, for some inputs, biological neurons are completely inactive; second, for other inputs, their output is proportional to the input; and third, most of the time, biological neurons operate in the regime where they are inactive, leading to sparse activations.
When the resurgence of deep learning began in 2006, feedforward networks were still seen as ineffective without the support of other models, such as probabilistic frameworks. From 2012 onward, however, it became clear that with the right resources and engineering practices, feedforward networks perform very well. Today, gradient-based learning in feedforward networks is used as a tool to develop advanced probabilistic models such as variational autoencoders and generative adversarial networks. This shift has transformed the perception of feedforward networks from an unreliable technology into a robust one applicable across many machine learning tasks, highlighting a trend in which supervised learning increasingly supports unsupervised learning methodologies.
Feedforward networks continue to have unfulfilled potential, and we expect them to be applied to an ever-broader range of tasks in the future. Advances in optimization algorithms and model design are expected to improve their performance further. This chapter has focused on the neural network family of models; the upcoming chapters turn to how to use these models in practice, including strategies for regularization and training.
A central challenge in machine learning is creating algorithms that perform well not only on the training data but also on new, unseen inputs. Many strategies are used to reduce test error, often at the cost of higher training error; these strategies are known collectively as regularization. A great many forms of regularization are available to the deep learning practitioner, and developing better regularization strategies has become a major focus of research in the field.
Chapter 5 introduced the basic concepts of generalization, underfitting, overfitting, bias, variance, and regularization. If you are not already familiar with these notions, please review that chapter before continuing with this one.
This chapter delves into regularization techniques, emphasizing strategies tailored for deep models and their components, which can serve as foundational elements in constructing deep learning architectures.
Some sections of this chapter deal with standard concepts in machine learning.
Parameter Norm Penalties
Regularization has been used for decades, long predating the advent of deep learning. Linear models such as linear regression and logistic regression allow simple, straightforward, and effective regularization strategies.
Many regularization approaches limit the capacity of models such as neural networks, linear regression, and logistic regression by adding a parameter norm penalty Ω(θ) to the objective function J. We denote the regularized objective function by J̃.
The regularized objective is

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ),    (7.1)

where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term Ω relative to the standard objective J. Setting α to 0 results in no regularization, while larger values of α correspond to more regularization.
When the training algorithm minimizes the regularized objective function J̃, it decreases both the original objective J on the training data and some measure of the size of the parameters θ. Different choices for the parameter norm Ω result in different solutions being preferred, which is why the choice of norm matters during training.
In this section, we discuss the effects of the various norms when used as penalties on the model parameters.
In neural networks, it is common to apply a parameter norm penalty Ω that penalizes only the weights of the affine transformation at each layer, leaving the biases unregularized. This is justified because biases typically require less data to fit accurately than weights: each weight specifies how two variables interact and requires observing both variables under a variety of conditions, whereas each bias controls only a single variable, so leaving it unregularized does not introduce much variance. Moreover, regularizing the bias parameters can introduce a significant amount of underfitting. We therefore use the vector w to indicate the weights affected by the norm penalty, while the vector θ denotes all of the parameters, including both w and the unregularized biases.
In neural networks, it can be useful to use a separate penalty coefficient for each layer; however, searching for the correct value of multiple hyperparameters can be expensive. To reduce the size of the search space, it is often practical to use the same weight decay coefficient at all layers.
The L2 parameter norm penalty, commonly known as weight decay, is a fundamental regularization technique already discussed in section 5.2.2. This strategy drives the weights closer to the origin by adding a regularization term Ω(θ) = ½‖w‖²₂ to the objective function. In other academic communities, L2 regularization is also known as ridge regression.
Studying the gradient of the regularized objective function provides insight into the behavior of weight decay regularization. For simplicity, we assume no bias parameter, so θ is just w. The total objective function for such a model can then be written as follows.
J̃(w; X, y) = (α/2) wᵀw + J(w; X, y),    (7.2)

with the corresponding parameter gradient

∇_w J̃(w; X, y) = αw + ∇_w J(w; X, y).    (7.3)
To take a single gradient step to update the weights, we perform the update

w ← w − ε(αw + ∇_w J(w; X, y)).    (7.4)

Written another way, the update is

w ← (1 − εα)w − ε∇_w J(w; X, y).    (7.5)
The addition of the weight decay term modifies the learning rule to multiplicatively shrink the weight vector by a constant factor on each step, just before performing the usual gradient update. This describes what happens in a single step; the natural next question is what happens over the entire course of training.
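As a minimal sketch of this update, assuming a learning rate ε (epsilon) and weight decay coefficient α (alpha); the gradient of the unregularized objective is a placeholder value:

import numpy as np

def weight_decay_step(w, grad_J, epsilon=0.1, alpha=0.01):
    # w <- (1 - epsilon * alpha) * w - epsilon * grad_J   (shrink, then descend)
    return (1.0 - epsilon * alpha) * w - epsilon * grad_J

w = np.array([1.0, -2.0, 0.5])
grad_J = np.array([0.2, -0.1, 0.3])       # stand-in gradient of the unregularized J
w = weight_decay_step(w, grad_J)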
To simplify the analysis, we make a quadratic approximation of the objective function in the neighborhood of the weight values that minimize the unregularized training cost, w* = arg min_w J(w). If the objective function is truly quadratic, as in fitting a linear regression model with mean squared error, the approximation is exact.
More generally, we could regularize the parameters to be near any specific point in space and still obtain a regularization effect, though better results are obtained when that point is closer to the true value. Zero is a sensible default when we do not know whether the correct value should be positive or negative, and regularizing the model parameters towards zero is by far the most common case.
The quadratic approximation to J is

Ĵ(w) = J(w*) + ½(w − w*)ᵀ H (w − w*),    (7.6)

where H is the Hessian matrix of J evaluated at w*. There is no first-order term in this approximation because w* is defined to be a minimum, where the gradient vanishes. Likewise, because w* is the location of a minimum of J, we can conclude that H is positive semidefinite.
The minimum of Ĵ occurs where its gradient

∇_w Ĵ(w) = H(w − w*)    (7.7)

is equal to 0.
To study the effect of weight decay, we modify equation 7.7 by adding the weight decay gradient, which lets us solve for the minimum of the regularized version of Ĵ. We denote the location of this minimum by w̃:

αw̃ + H(w̃ − w*) = 0.    (7.8)
As α approaches 0, the regularized solution w̃ approaches w*. But what happens as α grows? Because H is real and symmetric, we can decompose it into a diagonal matrix Λ and an orthonormal basis of eigenvectors Q, such that
H = QΛQᵀ. Applying the decomposition to equation 7.10, we obtain

w̃ = (QΛQᵀ + αI)⁻¹ QΛQᵀ w*    (7.11)
  = [Q(Λ + αI)Qᵀ]⁻¹ QΛQᵀ w*    (7.12)
  = Q(Λ + αI)⁻¹ Λ Qᵀ w*.    (7.13)
We see that the effect of weight decay is to rescale w* along the axes defined by the eigenvectors of H. Specifically, the component of w* that is aligned with the i-th eigenvector of H is rescaled by a factor of λ_i/(λ_i + α). (See figure 2.3 for a reminder of how this kind of scaling works.)
Along directions where the eigenvalues of H are relatively large, for example where λ_i ≫ α, the effect of regularization is relatively small. Components with λ_i ≪ α, however, are shrunk to have nearly zero magnitude. This effect is illustrated in figure 7.1.
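A small numerical check of this scaling, using an arbitrary positive definite matrix as the Hessian (an illustrative sketch, not from the book):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
H = A.T @ A + np.eye(3)                   # a positive definite "Hessian"
w_star = rng.normal(size=3)               # minimizer of the unregularized objective
alpha = 0.5

w_tilde = np.linalg.solve(H + alpha * np.eye(3), H @ w_star)   # regularized minimizer

lam, Q = np.linalg.eigh(H)
rescaled = Q @ np.diag(lam / (lam + alpha)) @ Q.T @ w_star     # factors lambda_i / (lambda_i + alpha)
print(np.allclose(w_tilde, rescaled))                          # True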
Norm Penalties as Constrained Optimization
Consider the cost function regularized by a parameter norm penalty:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ).    (7.25)
To minimize a function subject to constraints, we can construct a generalized Lagrange function, consisting of the original objective function plus a set of penalties. Each penalty is a product between a Karush–Kuhn–Tucker (KKT) multiplier and a function indicating whether the constraint is satisfied. If we want to constrain Ω(θ) to be less than some constant k, we can construct the following generalized Lagrange function.
L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k).    (7.26)

The solution to the constrained problem is given by

θ* = arg min_θ max_{α, α≥0} L(θ, α).    (7.27)
As described in section 4.4, solving this problem requires modifying both θ and α. Section 4.5 provides a worked example of linear regression with an L2 constraint. Many different procedures are possible; some may use gradient descent, while others may use analytical solutions for where the gradient is zero. In all procedures, α must increase whenever Ω(θ) exceeds k and decrease whenever Ω(θ) falls below k. All positive values of α encourage Ω(θ) to shrink, and the optimal value α* encourages Ω(θ) to shrink, but not so strongly as to make Ω(θ) become less than k.
To gain some insight into the effect of the constraint, we can fix α* and view the problem as just a function of θ:

θ* = arg min_θ L(θ, α*) = arg min_θ J(θ; X, y) + α*Ω(θ).    (7.28)
This is exactly the same as the regularized training problem of minimizing J̃.
We can thus think of a parameter norm penalty as imposing a constraint on the weights: the L2 norm confines the weights to lie within an L2 ball, while the L1 norm confines them to a region of limited L1 norm. Usually we do not know the size of the constraint region imposed by weight decay with coefficient α*, because the value of α* does not directly tell us the value of k. In principle one can solve for k, but the relationship between k and α* depends on the form of J. Nevertheless, we can control the size of the constraint region by adjusting α: a larger α results in a smaller constraint region, while a smaller α results in a larger one.
In some situations, it is preferable to use explicit constraints rather than penalties. We can modify an algorithm such as stochastic gradient descent to first take a step downhill on the objective J(θ) and then project θ back to the nearest point that satisfies Ω(θ) < k. This is useful when we have a clear idea of what value of k is appropriate and do not want to spend time searching for the value of α that corresponds to this k.
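A minimal sketch of this projected-step approach, assuming an L2 constraint region of radius k (the gradient and constants are placeholders):

import numpy as np

def project_l2_ball(theta, k):
    norm = np.linalg.norm(theta)
    return theta if norm <= k else theta * (k / norm)   # nearest point with ||theta|| <= k

def constrained_step(theta, grad_J, epsilon=0.1, k=1.0):
    theta = theta - epsilon * grad_J                    # ordinary descent step on J
    return project_l2_ball(theta, k)                    # then reproject onto the constraint set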
Another reason to use explicit constraints and reprojection rather than penalties is that penalties can cause non-convex optimization procedures to get stuck in local minima corresponding to small θ. When training neural networks, this often results in suboptimal performance.
"Dead units" refer to components in a neural network that have minimal influence on the function's behavior, as indicated by their negligible incoming and outgoing weights During training, especially with weight norm penalties, these configurations may achieve local optimality, despite the potential for significant weight reduction.
Explicit constraints implemented by re-projection work much better in these cases, because they do not encourage the weights to approach the origin, allowing them to remain impactful. Explicit constraints take effect only when the weights become large and attempt to leave the constraint region.
Explicit constraints with reprojection can also improve the stability of optimization with high learning rates by preventing a positive feedback loop in which large weights induce large gradients, which in turn induce large updates to the weights. If these updates consistently increase the size of the weights, θ rapidly moves away from the origin until numerical overflow occurs. Explicit constraints with reprojection prevent this feedback loop from continuing to increase the magnitude of the weights without bound. Hinton et al. (2012c) recommend combining such constraints with a high learning rate to allow rapid exploration of parameter space while maintaining stability.
In particular, Hinton et al. (2012c) recommend the strategy introduced by Srebro and Shraibman (2005) of constraining the norm of each column of a neural network's weight matrix, rather than constraining the Frobenius norm of the entire matrix. Constraining the norm of each column separately prevents any single hidden unit from acquiring very large weights. If we converted this constraint into a penalty in a Lagrange function, it would resemble L2 weight decay but with a separate KKT multiplier for the weights of each hidden unit. Each of these multipliers would be dynamically updated to keep each hidden unit within its constraint. In practice, the column norm limitation is always implemented as an explicit constraint with reprojection.
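A hedged sketch of that per-column reprojection, assuming the weight matrix stores the incoming weights of each hidden unit in its columns and that c is the chosen norm cap:

import numpy as np

def max_norm_columns(W, c=3.0):
    norms = np.linalg.norm(W, axis=0, keepdims=True)        # one L2 norm per column
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))   # shrink only the violating columns
    return W * scale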
Regularization and Under-Constrained Problems
In some cases, regularization is necessary for machine learning problems to be properly defined. Many linear models, including linear regression and PCA, depend on inverting the matrix XᵀX, which becomes problematic under specific conditions.
The matrix XᵀX can be singular whenever the data generating distribution truly has no variance in some direction, or when no variance is observed in some direction because there are fewer examples (rows of X) than input features (columns of X). In this case, many forms of regularization correspond to inverting XᵀX + αI instead. This regularized matrix is guaranteed to be invertible.
These linear problems have closed form solutions when the relevant matrix is invertible. It is also possible for a problem with no closed form solution to be underdetermined. An example is logistic regression applied to a problem where the classes are linearly separable: if a weight vector w is able to achieve perfect classification, then 2w will also achieve perfect classification and higher likelihood.
An iterative optimization procedure like stochastic gradient descent will therefore continually increase the magnitude of w and, in theory, will never halt. In practice, a numerical implementation of gradient descent will eventually reach weights large enough to cause numerical overflow, at which point its behavior depends on how the programmer has decided to handle values that are not real numbers.
Most forms of regularization can guarantee the convergence of iterative methods applied to underdetermined problems. For example, weight decay will cause gradient descent to stop increasing the magnitude of the weights when the slope of the likelihood is equal to the weight decay coefficient.
The idea of using regularization to solve underdetermined problems extends beyond machine learning The same idea is useful for several basic linear algebra problems.
As we saw in section 2.9, we can solve underdetermined linear equations using the Moore–Penrose pseudoinverse. Recall that one definition of the pseudoinverse X⁺ of a matrix X is

X⁺ = lim_{α→0} (XᵀX + αI)⁻¹ Xᵀ.    (7.29)
We can now recognize equation 7.29 as performing linear regression with weight decay; specifically, it is the limit of equation 7.17 as the regularization coefficient shrinks to zero. We can thus interpret the pseudoinverse as stabilizing underdetermined problems using regularization.
Dataset Augmentation
The best way to make a machine learning model generalize better is to train it on more data. In practice, the amount of data available is limited, so one way around this problem is to create fake data and add it to the training set. For some machine learning tasks, creating this artificial data is reasonably straightforward.
This approach is easiest for classification. A classifier needs to take a complicated, high dimensional input x and summarize it with a single category identity y.
This means a main task a classifier faces is to be invariant to a wide variety of transformations. We can generate new (x, y) pairs easily just by applying such transformations to the x inputs in our training set.
This approach is not as readily applicable to many other tasks. For example, it is difficult to generate new fake data for a density estimation task unless we have already solved the density estimation problem itself.
Dataset augmentation has been a particularly effective technique for object recognition in classification tasks. Images are high dimensional and include an enormous variety of factors of variation, many of which can be easily simulated. Operations like translating the training images a few pixels in each direction can often greatly improve generalization, even if the model has already been designed to be partially translation invariant through convolution and pooling. Other operations such as rotating or scaling the image have also proved quite effective.
One must be careful not to apply transformations that would change the correct class. For example, in optical character recognition tasks, horizontal flips and 180° rotations are unsuitable, because they would confuse characters like 'b' and 'd' or '6' and '9'.
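A simple illustrative augmentation sketch along these lines, assuming a 2-D array of pixel intensities; the flip is gated off by default since it is unsafe for character classes:

import numpy as np

def translate(image, dy, dx):
    out = np.zeros_like(image)
    h, w = image.shape
    ys_dst = slice(max(dy, 0), min(h, h + dy))
    xs_dst = slice(max(dx, 0), min(w, w + dx))
    ys_src = slice(max(-dy, 0), min(h, h - dy))
    xs_src = slice(max(-dx, 0), min(w, w - dx))
    out[ys_dst, xs_dst] = image[ys_src, xs_src]
    return out

def augment(image, label, allow_flip=False):
    samples = [(translate(image, 2, -1), label)]   # small shift; label unchanged
    if allow_flip:                                 # unsafe for characters like 'b'/'d' or '6'/'9'
        samples.append((np.fliplr(image), label))
    return samples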
Classifiers should ideally be invariant to certain transformations, such as out-of-plane rotation, which cannot be easily achieved through simple geometric operations on input pixels.
Dataset augmentation is effective for speech recognition tasks as well (Jaitly and Hinton, 2013).
Injecting noise into the input of a neural network can also be seen as a form of data augmentation; for many classification and regression tasks, the task should still be solvable even if small random noise is added to the input. Neural networks prove not to be very robust to noise, however, and one way to improve their robustness is simply to train them with random noise applied to their inputs. Input noise injection is part of some unsupervised learning methods, such as the denoising autoencoder. Noise injection also works when the noise is applied to the hidden units, which can be seen as dataset augmentation at multiple levels of abstraction. With carefully tuned noise magnitude, this approach can be highly effective. Dropout, a powerful regularization strategy described later, can be seen as a process of constructing new inputs by multiplying by noise, further aiding generalization.
When comparing machine learning benchmark results, it is important to take the effect of dataset augmentation into account, since a well-designed augmentation scheme can dramatically reduce the generalization error. To compare the performance of two algorithms, it is necessary to perform controlled experiments in which both are evaluated under the same dataset augmentation conditions. For instance, if algorithm A performs poorly without augmentation while algorithm B performs well when combined with numerous synthetic transformations of the input, the improvement may be due to the augmentation rather than the algorithm itself. Sometimes deciding whether an experiment has been properly controlled requires subjective judgment, for example in distinguishing between a general operation like adding Gaussian noise to the input, which is considered part of the learning algorithm, and an application-specific pre-processing step, like randomly cropping images.
Noise Robustness
Section 7.4 has motivated the use of noise applied to the inputs as a dataset augmentation strategy, and for some models the addition of noise with infinitesimal variance at the input is equivalent to imposing a penalty on the norm of the weights. In general, noise injection can be much more powerful than simply shrinking the parameters, especially when the noise is added to the hidden units. This topic is important enough to warrant its own discussion; the dropout algorithm, described in section 7.12, is the main development of that approach.
Another way that noise has been used to regularize models is by adding it to the weights, a technique used primarily in the context of recurrent neural networks (Jim et al., 1996; Graves, 2011). This can be interpreted as a stochastic implementation of Bayesian inference over the weights: the Bayesian treatment of learning considers the model weights to be uncertain and representable by a probability distribution that reflects this uncertainty, and adding noise to the weights is a practical, stochastic way to express it.
Noise applied to the weights can also be interpreted as a more traditional form of regularization, encouraging stability of the function to be learned. Consider a regression setting, where we wish to train a function ŷ(x) that maps a set of features x to a scalar, using the least-squares cost function between the model predictions ŷ(x) and the true values y.
The training set consists of m labeled examples {(x (1) , y (1) ), ,(x ( ) m , y ( ) m )}.
We now assume that with each input presentation we also include a random perturbation of the network weights, ε_W ∼ N(ε; 0, ηI), in a standard l-layer MLP. We denote the perturbed model as ŷ_{ε_W}(x). Despite the injection of noise, we are still interested in minimizing the squared error of the output of the network, so the objective function is redefined accordingly.
For small η, the minimization of J with added weight noise (with covariance ηI) is equivalent to minimization of J with an additional regularization term: ηE_{p(x,y)}[‖∇_W ŷ(x)‖²].
This form of regularization encourages the parameters to go to regions of parameter space where small perturbations of the weights have a relatively small influence on the output. In other words, it pushes the model to find not merely minima, but minima surrounded by flat regions, which enhances stability (Hochreiter and Schmidhuber, 1995). In the simplified case of linear regression, this regularization term collapses into ηE_{p(x)}[‖x‖²], which is not a function of the parameters and therefore does not contribute to the gradient of J̃_W with respect to the model parameters.
7.5.1 Injecting Noise at the Output Targets
Most datasets have some number of mistakes in the y labels, and it can be harmful to maximize log p(y | x) when y is a mistake. One way to address this is to model the label noise explicitly, assuming for some small constant ε that the training label y is correct with probability 1 − ε, and that otherwise any of the other labels might be correct. This assumption can be incorporated into the cost function analytically, without drawing noise samples. For example, label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with targets of ε/(k − 1) and 1 − ε, respectively. The standard cross-entropy loss may then be used with these soft targets. Maximum likelihood learning with hard targets may never converge, because the softmax can never predict a probability of exactly 0 or 1; label smoothing prevents the pursuit of such extreme predictions while still encouraging correct classification. This technique has been in use since the 1980s and remains a key component of modern neural network architectures.
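A minimal sketch of label smoothing targets for a k-class softmax, assuming hard integer labels (the value of eps is illustrative):

import numpy as np

def smooth_targets(labels, k, eps=0.1):
    targets = np.full((len(labels), k), eps / (k - 1))      # eps/(k-1) on the wrong classes
    targets[np.arange(len(labels)), labels] = 1.0 - eps     # 1 - eps on the correct class
    return targets

soft = smooth_targets(np.array([2, 0]), k=4)    # rows sum to 1; used with cross-entropy loss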
Semi-Supervised Learning
In the paradigm of semi-supervised learning, both unlabeled examples from P(x) and labeled examples from P(x, y) are used to estimate P(y | x) or predict y from x.
In the context of deep learning, semi-supervised learning usually refers to learning a representation h = f(x), with the goal that examples from the same class have similar representations. Unsupervised learning can provide useful cues for how to group examples in representation space: examples that cluster tightly in the input space should be mapped to similar representations. A linear classifier in the new space may then achieve better generalization in many cases (Belkin and Niyogi, 2002; Chapelle et al., 2003). A long-standing variant of this approach is the application of principal components analysis as a pre-processing step before applying a classifier on the projected data.
Instead of having separate unsupervised and supervised components in the model, one can construct models in which a generative model of either P(x) or P(x, y) shares parameters with a discriminative model of P(y | x). One can then trade off the supervised criterion −log P(y | x) with the unsupervised or generative one (such as −log P(x) or −log P(x, y)). The generative criterion expresses a particular form of prior belief about the supervised learning problem, namely that the structure of P(x) is connected to the structure of P(y | x) in a way that is captured by the shared parametrization. By controlling how much of the generative criterion is included in the total criterion, one can find a better trade-off than with a purely generative or a purely discriminative training criterion.
Salakhutdinov and Hinton (2008) describe a method for learning the kernel function of a kernel machine used for regression, in which the use of unlabeled examples for modeling P(x) improves P(y | x) quite significantly.
See Chapelle et al. (2006) for more information about semi-supervised learning.
Multi-Task Learning
Multi-task learning (Caruana, 1993) is a way to improve generalization by pooling the examples arising from several tasks, which can be seen as imposing soft constraints on the model parameters. In the same way that additional training examples push the parameters towards values that generalize well, when part of a model is shared across tasks, that part is more constrained towards good values, often yielding better generalization, provided that the sharing is justified.
Figure 7.2 illustrates a very common form of multi-task learning, in which different supervised tasks predict y_i from a shared input x and a common intermediate-level representation h that captures a common pool of factors. The model can generally be divided into two kinds of parts, each with its associated parameters, as follows.
1. Task-specific parameters, which only benefit from the examples of their own task to achieve good generalization. These are the upper layers of the neural network in figure 7.2.
2. Generic parameters, shared across all tasks, which benefit from the pooled data of all the tasks. These are the lower layers of the neural network in figure 7.2, where shared representations allow the model to leverage information from multiple sources.
In this common form of multi-task learning, different supervised tasks share the same input x and predict distinct target variables. Lower layers of the network can be shared across all tasks, while task-specific parameters are learned on top for each output. The underlying assumption is that a common pool of factors explains the variations in the input, while each task is associated with a subset of these factors. Top-level hidden units are specialized to their respective tasks, while an intermediate representation is shared across all tasks. In the unsupervised learning context, some of the top-level factors may be associated with none of the output tasks, serving to explain some of the input variations without being relevant for predicting any of the targets.
Improved generalization and generalization error bounds (Baxter, 1995) can be achieved because of the shared parameters, for which statistical strength can be greatly improved in proportion to the increased number of examples available for them, compared with the single-task setting. This improvement happens only when some assumptions about the statistical relationship between the different tasks are valid, meaning that something is shared across at least some of the tasks.

Figure 7.3: Learning curves showing how the negative log-likelihood loss changes over time (indicated as the number of training iterations over the dataset, or epochs). In this example, a maxout network is trained on MNIST. The training set loss decreases consistently over time, but the average loss on the validation set eventually begins to increase again, forming an asymmetric U-shaped curve.
From the point of view of deep learning, the underlying prior belief is that, among the factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks.
Early Stopping
When training large models with sufficient representational capacity, we often observe that training error decreases steadily over time, while validation set error begins to rise again after a certain point. This behavior occurs very reliably, as illustrated in figure 7.3.
This means we can obtain a model with better validation set error by returning to the parameter setting at the point in time with the lowest validation set error. Every time the error on the validation set improves, we store a copy of the model parameters. When the training algorithm terminates, we return these saved parameters rather than the latest ones. The algorithm terminates when no parameters have improved over the best recorded validation error for some pre-specified number of iterations, as outlined in algorithm 7.1.
Algorithm 7.1 specifies the early stopping meta-algorithm for determining the best amount of time to train. This meta-algorithm is a general strategy that works well with a variety of training algorithms and ways of quantifying error on the validation set.
Let n be the number of steps between evaluations.
Let p be the “patience,” the number of times to observe worsening validation set error before giving up.
Let θ_o be the initial parameters.
θ ← θ_o
i ← 0
j ← 0
v ← ∞
θ* ← θ
i* ← i
while j < p do
    Update θ by running the training algorithm for n steps.
    i ← i + n
    v' ← ValidationSetError(θ)
    if v' < v then
        j ← 0
        θ* ← θ
        i* ← i
        v ← v'
    else
        j ← j + 1
    end if
end while
Best parameters are θ*, best number of training steps is i*.
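A Python sketch of this meta-algorithm, in which train_n_steps and validation_error are placeholders for the user's own training and evaluation routines:

import copy

def early_stopping(theta, train_n_steps, validation_error, n=100, patience=5):
    best_theta, best_i = copy.deepcopy(theta), 0
    best_v, i, j = float("inf"), 0, 0
    while j < patience:
        theta = train_n_steps(theta, n)          # run the training algorithm for n steps
        i += n
        v = validation_error(theta)
        if v < best_v:
            best_v, best_theta, best_i, j = v, copy.deepcopy(theta), i, 0
        else:
            j += 1
    return best_theta, best_i                    # best parameters, best number of steps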
This strategy is known as early stopping. It is probably the most commonly used form of regularization in deep learning, favored for both its effectiveness and its simplicity: it prevents overfitting by halting training once performance on a validation set begins to degrade.
One way to think of early stopping is as a very efficient hyperparameter selection algorithm. In this view, the number of training steps is just another hyperparameter.
This hyperparameter has a U-shaped validation set performance curve, as do most hyperparameters that control model capacity. Early stopping controls the effective capacity of the model by limiting the number of training steps. Most hyperparameters must be chosen using an expensive guess-and-check process, where we set a hyperparameter at the start of training and then run training for several steps to see its effect.
The "training time" hyperparameter is distinct because it evaluates multiple values during a single training run The primary cost of automatically selecting this hyperparameter through early stopping is the periodic validation set evaluations conducted during training Ideally, these evaluations should occur in parallel on separate hardware resources, such as a different machine, CPU, or GPU If such resources are unavailable, the cost can be mitigated by using a smaller validation set relative to the training set or by reducing the frequency of evaluations to obtain a lower resolution estimate of the optimal training time.
An additional, minor cost of early stopping is the need to maintain a copy of the best parameters. This cost is generally negligible, because the best parameters can be stored in a slower and larger form of memory, such as host memory or a disk drive, while training continues in GPU memory. Since the best parameters are written infrequently and never read during training, these occasional slow writes have little effect on total training time.
Early stopping is a very unobtrusive form of regularization, in that it requires almost no change to the underlying training procedure, the objective function, or the set of allowable parameter values. This means it is easy to use without damaging the learning dynamics, in contrast with weight decay, where one must be careful not to use too much weight decay and trap the network in a bad local minimum corresponding to pathologically small weights.
Early stopping may be used either alone or in conjunction with other regularization strategies. Even when using regularization strategies that modify the objective function to encourage better generalization, it is rare for the best generalization to occur at a local minimum of the training objective.
Early stopping requires a validation set, which means some training data is not fed to the model. To best exploit this extra data, one can perform additional training after the initial training with early stopping has completed. In this second training step, all of the training data is included. There are two basic strategies for this second training procedure.
One strategy is to initialize the model again and retrain on all of the data. In this second training pass, we train for the optimal number of steps determined by the early stopping procedure in the first pass. There are some subtleties associated with this procedure: for example, there is no good way of knowing whether to retrain for the same number of parameter updates or the same number of passes through the dataset, since on the second round of training each pass requires more parameter updates because the training set is bigger.
Algorithm 7.2 A meta-algorithm for using early stopping to determine how long to train, then retraining on all the data.
Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into (X^(subtrain), X^(valid)) and (y^(subtrain), y^(valid)) respectively.
Run early stopping (algorithm 7.1) starting from random θ, using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This returns i*, the optimal number of steps.
Set θ to random values again.
Train on X^(train) and y^(train) for i* steps.
Another strategy for using all of the data is to keep the parameters obtained from the first round of training and then continue training, now using the complete dataset. At this stage, we no longer have a guide for when to stop in terms of a number of steps. Instead, we monitor the average loss function on the validation set and continue training until it falls below the value of the training set objective at which the early stopping procedure halted. This strategy avoids the high cost of retraining the model from scratch, but it is not as well behaved: there is no guarantee that the objective on the validation set will ever reach the target value, so this strategy is not even guaranteed to terminate. The procedure is presented more formally in algorithm 7.3.
Early stopping is also useful because it reduces the computational cost of the training procedure. Besides the obvious reduction in cost from limiting the number of training iterations, it provides regularization without requiring the addition of penalty terms to the cost function or the computation of the gradients of such terms.
Early stopping acts as a regularizer by preventing overfitting. This is evidenced by learning curves in which the validation set error follows a U-shaped pattern, indicating that training the model beyond a certain point increases error on unseen data. By halting training near the bottom of that curve, early stopping limits overfitting and improves generalization.
Algorithm 7.3 Meta-algorithm using early stopping to determine at what objective value we start to overfit, then continue training until that value is reached.
Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into (X^(subtrain), X^(valid)) and (y^(subtrain), y^(valid)) respectively.
Run early stopping (algorithm 7.1) starting from random θ, using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This updates θ.
ε ← J(θ, X^(subtrain), y^(subtrain))
while J(θ, X^(valid), y^(valid)) > ε do
    Train on X^(train) and y^(train) for n steps.
end while
Early stopping regularizes the model by restricting the optimization procedure to a relatively small region of parameter space in the neighborhood of the initial value θ_o (Bishop, 1995a; Sjöberg and Ljung, 1995). Imagine taking τ optimization steps with learning rate ε; the product ετ can then be viewed as a measure of effective capacity. Restricting both the number of iterations and the learning rate limits the volume of parameter space reachable from θ_o, so ετ behaves as if it were the reciprocal of the coefficient used for weight decay, effectively controlling the model's complexity.
Indeed, we can show how—in the case of a simple linear model with a quadratic error function and simple gradient descent—early stopping is equivalent to L 2 regularization.
Parameter Tying and Parameter Sharing
So far in this chapter, when we have discussed adding constraints or penalties to the parameters, we have always done so with respect to a fixed point, as in L2 regularization, which penalizes parameters for deviating from zero. Sometimes, however, we need other ways to express our prior knowledge about suitable parameter values: we may not know precisely what values the parameters should take, but our understanding of the domain and model architecture tells us that there should be some dependencies between the parameters.
A common type of dependency we often want to express is that certain parameters should be close to one another. Consider the following scenario: we have two models performing the same classification task, with the same set of classes, but with somewhat different input distributions. Formally, we have model A with parameters w^(A) and model B with parameters w^(B). The two models map the input to two different, but related, outputs: ŷ^(A) = f(w^(A), x) and ŷ^(B) = g(w^(B), x).
Because the tasks are similar enough, with similar input and output distributions, we believe the model parameters should be close to each other. We can leverage this similarity through regularization; specifically, we can use a parameter norm penalty of the form Ω(w^(A), w^(B)) = ‖w^(A) − w^(B)‖²₂. Here we used an L2 penalty, but other choices are also possible.
This kind of approach was proposed by Lasserre et al. (2006), who regularized the parameters of one model, trained as a supervised classifier, to be close to the parameters of another model, trained in an unsupervised paradigm to capture the distribution of the observed input data. The architectures were constructed so that many of the parameters in the classifier model could be paired to corresponding parameters in the unsupervised model.
While a parameter norm penalty is one way to regularize parameters to be close to one another, the more popular approach is to use constraints: forcing sets of parameters to be equal. This method of regularization is referred to as parameter sharing, because the various models or model components are interpreted as sharing a single set of parameters. A significant advantage of parameter sharing over regularizing parameters to be close via a norm penalty is that only the shared subset of the parameters needs to be stored in memory. In models like convolutional neural networks, this can lead to a substantial reduction in the memory footprint of the model.
Convolutional Neural Networks. By far the most popular and extensive use of parameter sharing occurs in convolutional neural networks (CNNs) applied to computer vision.
Natural images have many statistical properties that are invariant to translation: a photo of a cat remains a photo of a cat if it is translated one pixel to the right. CNNs take this property into account by sharing parameters across multiple image locations, so that the same feature is computed at different positions in the input. This means a cat detector can find a cat whether the cat appears at column i or column i + 1 in the image.
Parameter sharing has allowed CNNs to dramatically lower the number of unique model parameters and to significantly increase network sizes without requiring a corresponding increase in training data. It remains one of the best examples of how to effectively incorporate domain knowledge into the network architecture. CNNs are discussed in more detail in chapter 9.
Sparse Representations
Weight decay acts by placing a penalty directly on the model parameters. An alternative strategy is to place a penalty on the activations of the units in a neural network, encouraging their activations to be sparse. This indirectly imposes a complicated penalty on the model parameters.
L1 penalization induces a sparse parametrization, meaning that many of the parameters become zero or close to zero. Representational sparsity, in contrast, describes a representation in which many of the elements are zero or close to zero. The distinction can be illustrated in the context of linear regression.
A sparsely parametrized linear regression model has many weights equal to zero or close to zero. Linear regression with a sparse representation, by contrast, uses a representation h that is itself sparse: h is a function of x that, in some sense, represents the information present in x, but does so with a sparse vector.
Representational regularization is accomplished by the same sorts of mechanisms that we have used in parameter regularization.
Norm penalty regularization of representations is performed by adding to the loss function J a norm penalty on the representation. This penalty is denoted Ω(h). As before, we denote the regularized loss function by J̃:

J̃(θ; X, y) = J(θ; X, y) + αΩ(h),    (7.48)

where α ∈ [0, ∞) weights the relative contribution of the norm penalty term, with larger values of α corresponding to more regularization.
Just as an L1 penalty on the parameters induces parameter sparsity, an L1 penalty on the elements of the representation induces representational sparsity: Ω(h) = ‖h‖₁ = Σᵢ |hᵢ|. The L1 penalty is only one choice that can result in a sparse representation; others include the penalty derived from a Student-t prior on the representation (Olshausen and Field, 1996; Bergstra, 2011) and KL divergence penalties (Larochelle and Bengio, 2008), which are especially useful for representations with elements constrained to lie on the unit interval. Lee et al. (2008) and Goodfellow et al. (2009) provide examples of strategies based on regularizing the average activation across several examples to be near some target value, such as a vector with .01 for each entry.
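A small sketch of a norm penalty on the representation rather than the parameters, assuming a rectified linear layer h = max{0, Wx} (the names are illustrative):

import numpy as np

def hidden_rep(W, x):
    return np.maximum(0.0, W @ x)           # h = relu(W x), the representation

def representation_penalty(h, alpha=1e-3):
    return alpha * np.sum(np.abs(h))        # alpha * ||h||_1 added to the loss

def penalty_grad_wrt_h(h, alpha=1e-3):
    return alpha * np.sign(h)               # subgradient, back-propagated into W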
Other approaches obtain representational sparsity with a hard constraint on the activation values. For example, orthogonal matching pursuit (Pati et al., 1993) encodes an input x with the representation h that solves a constrained optimization problem: it minimizes the difference between the reconstruction Wh and the input x, subject to a limit on the number of non-zero entries in h. This problem can be solved efficiently when W is constrained to be orthogonal. The method is often called OMP-k, with the value of k specifying the maximum number of non-zero features allowed; Coates and Ng (2011) showed that it can be a very effective feature extractor for deep architectures.
Essentially any model that has hidden units can be made sparse Throughout this book, we will see many examples of sparsity regularization used in a variety of contexts.
Bagging and Other Ensemble Methods
Bagging, short for bootstrap aggregating, is a technique for reducing generalization error by combining several models (Breiman, 1994). The idea is to train several different models separately and then have all of the models vote on the output for test examples. This is an example of a general strategy in machine learning called model averaging, and techniques employing this strategy are known as ensemble methods.
The reason that model averaging works is that different models will usually not make all the same errors on the test set.
As an example, consider a set of k regression models, where each model makes an error ε_i on each example. Suppose the errors are drawn from a zero-mean multivariate normal distribution with variances E[ε_i²] = v and covariances E[ε_i ε_j] = c. We can then analyze the error made by the average prediction of the ensemble of models.
The error made by the average prediction of the ensemble is (1/k) Σ_i ε_i. The expected squared error of the ensemble predictor is

E[((1/k) Σ_i ε_i)²] = (1/k²) E[Σ_i (ε_i² + Σ_{j≠i} ε_i ε_j)] = (1/k)v + ((k − 1)/k)c.
In the case where the errors are perfectly correlated and c = v, the mean squared error reduces to v, so model averaging does not help at all. In the case where the errors are perfectly uncorrelated and c = 0, the expected squared error of the ensemble is only (1/k)v: the expected squared error decreases linearly with the ensemble size. In other words, on average, the ensemble will perform at least as well as any of its members, and if the members make independent errors, the ensemble will perform significantly better. Different ensemble methods construct the combination of models in different ways; for example, each member of the ensemble could be trained completely independently.
Bagging is an ensemble method that improves robustness by training multiple instances of the same kind of model on resampled datasets created through bootstrapping. For example, when training an 8 detector, one resampled dataset might repeat the 8 and omit the 9, so that the resulting detector learns that a loop on top of the digit corresponds to an 8, while another might repeat the 8 and omit the 6, so that its detector learns that a loop on the bottom corresponds to an 8. Each of these individual classification rules is brittle, but averaging their outputs makes the detector robust, achieving high confidence only when both loops of the 8 are present. This approach allows the same model, training algorithm, and objective function to be reused, making bagging a powerful method in machine learning.
Specifically, bagging involves constructing k different datasets, each with the same number of examples as the original dataset. Each dataset is constructed by sampling with replacement from the original dataset, so with high probability each dataset is missing some of the original examples and contains several duplicates (on average around two-thirds of the original examples appear in each resampled training set of the same size). Model i is then trained on dataset i, and the differences between which examples are included in each dataset result in differences between the trained models.
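A minimal bagging sketch along these lines, where fit and predict stand in for any base learner's training and prediction routines:

import numpy as np

def bagging_fit(X, y, fit, k=5, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap: sample with replacement
        models.append(fit(X[idx], y[idx]))
    return models

def bagging_predict(models, predict, X):
    return np.mean([predict(m, X) for m in models], axis=0)   # average the members' outputs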
Neural networks reach a wide enough variety of solution points that they can often benefit from model averaging even when all of the models are trained on the same dataset. Differences in random initialization, random selection of minibatches, hyperparameters, or outcomes of non-deterministic implementations are often enough to cause different members of the ensemble to make partially independent errors.
Model averaging is an extremely powerful and reliable method for reducing generalization error. Its use is usually discouraged when benchmarking algorithms for scientific papers, because nearly any machine learning algorithm can benefit substantially from model averaging, at the price of increased computation and memory. For this reason, benchmark comparisons are usually made using a single model.
Machine learning contests are usually won by methods using model averaging over dozens of models. A recent prominent example is the Netflix Grand Prize (Koren, 2009).
Not all techniques for constructing ensembles are designed to make the ensemble more regularized than the individual models. One notable method, boosting, constructs an ensemble with higher capacity than its constituent models. Boosting has been applied to build ensembles of neural networks by incrementally adding networks to the ensemble, and it has also been applied by interpreting a single neural network as an ensemble and incrementally adding hidden units to it.
Dropout
Dropout (Srivastava et al., 2014) provides a computationally inexpensive but powerful method of regularizing a broad family of models, particularly large neural networks. To a first approximation, it can be seen as a practical alternative to bagging, which requires training and evaluating multiple models for each test example, a process that is prohibitively expensive in runtime and memory when each model is a large neural network. It is common to use ensembles of five to ten neural networks (Szegedy et al. (2014a) used six to win the ILSVRC), but scaling beyond this rapidly becomes unwieldy. Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks.
Specifically, dropout trains the ensemble consisting of all sub-networks that can be formed by removing non-output units from an underlying base network, as shown in figure 7.6. In most modern neural networks, based on a series of affine transformations and nonlinearities, we can effectively remove a unit from the network by multiplying its output value by zero. This procedure requires slight modification for models such as radial basis function networks, which take the difference between the unit's state and some reference value. For simplicity, we present the dropout algorithm in terms of multiplication by zero, but it can easily be adapted to other operations that remove a unit from the network.
Recall that to learn with bagging, we define k different models, construct k different datasets by sampling from the training set with replacement, and then train model i on dataset i. Dropout aims to approximate this process, but with an exponentially large number of neural networks. Specifically, to train with dropout, we use a minibatch-based learning algorithm that makes small steps, such as stochastic gradient descent. Each time we load an example into a minibatch, we randomly sample a different binary mask to apply to all of the input and hidden units in the network. The mask for each unit is sampled independently from all of the others, and the probability of including a unit is a hyperparameter fixed before training begins; typically an input unit is included with probability 0.8 and a hidden unit with probability 0.5. We then run forward propagation, back-propagation, and the learning update as usual.
More formally, suppose that a mask vector μ specifies which units to include, and J(θ, μ) defines the cost of the model defined by parameters θ and mask μ.
Then dropout training consists in minimizing E_μ J(θ, μ). The expectation contains exponentially many terms, but we can obtain an unbiased estimate of its gradient by sampling values of μ.
Dropout training is not quite the same as bagging training, primarily in how the models are defined and trained. While bagging trains independent models, each to convergence on its own training set, dropout shares parameters among the models, with each model inheriting a different subset of parameters from the parent neural network. This parameter sharing makes it possible to represent an exponential number of models with a tractable amount of memory. Because the number of possible sub-networks is so large, only a tiny fraction of them are ever trained, and each for only a brief period; the shared parameters allow the remaining sub-networks to arrive at good parameter settings. Beyond these differences, dropout follows the bagging algorithm: the training set encountered by each sub-network is indeed a subset of the original training set sampled with replacement.
Figure 7.6: Dropout trains an ensemble of sub-networks formed by removing non-output units from a base network. In this example, we start with a base network with two visible and two hidden units, yielding sixteen possible sub-networks. Many of these sub-networks have no input units or no path connecting the inputs to the output. This problem becomes insignificant as the layers widen, because the probability of dropping all possible paths from inputs to outputs becomes smaller.
Figure 7.7: An example of forward propagation through a feedforward network using dropout, with two input units, a hidden layer of two units, and one output unit. To perform forward propagation, a binary mask vector is randomly sampled, with one entry for each input and hidden unit; entries are sampled independently, typically with probability 0.5 of being 1 for hidden units and 0.8 for input units. Each unit's output is multiplied by the corresponding mask entry, and forward propagation then proceeds through the network as usual. This is equivalent to randomly selecting one of the sub-networks of figure 7.6 and running forward propagation through it.
To make a prediction, a bagged ensemble aggregates votes from its individual members, a process we refer to here as inference. While our descriptions of bagging and dropout so far have not required the model to be probabilistic, we now assume that the model's role is to output a probability distribution. In bagging, each model i outputs a probability distribution p^(i)(y | x), and the ensemble's prediction is the arithmetic mean of these distributions.
In the case of dropout, each sub-model defined by a mask vector μ defines a probability distribution p(y | x, μ). The arithmetic mean over all masks is Σ_μ p(μ) p(y | x, μ), where p(μ) is the probability distribution that was used to sample μ at training time.
Evaluating this sum is intractable for deep neural networks, because it contains an exponential number of terms, and there is no known simplification that allows it to be computed exactly. Instead, we can approximate the inference with sampling, averaging the outputs from multiple masks; typically 10 to 20 masks are sufficient for good performance.
An even better approach allows a good approximation to the ensemble prediction with only a single forward propagation: use the geometric mean rather than the arithmetic mean of the sub-models' predicted distributions. Warde-Farley et al. (2014) provide both theoretical arguments and empirical evidence supporting the effectiveness of the geometric mean in this context.
The geometric mean of multiple probability distributions is not guaranteed to be a probability distribution. To ensure the result is a probability distribution, we require that none of the sub-models assigns probability zero to any event, and we renormalize the resulting distribution. The unnormalized probability distribution defined by the geometric mean is

p̃_ensemble(y | x) = (∏_μ p(y | x, μ))^(1/2^d),

where d is the number of units that may be dropped. Here a uniform distribution over μ is used for simplicity, but non-uniform distributions are also possible. To make predictions, we must renormalize the ensemble:

p_ensemble(y | x) = p̃_ensemble(y | x) / Σ_{y'} p̃_ensemble(y' | x).

A key insight from Hinton et al. (2012) involved in dropout is that we can approximate p_ensemble with a single model: the model with all units, but with the weights going out of each unit multiplied by the probability of including that unit. The motivation for this modification is to capture the correct expected value of the output from that unit. We call this approach the weight scaling inference rule.
There is not yet any theoretical argument for the accuracy of this approximate inference rule in deep nonlinear networks, but empirically it performs very well.
Because we typically use an inclusion probability of 1/2, the weight scaling rule usually amounts to dividing the weights by 2 at the end of training; an equivalent alternative is to multiply the states of the units by 2 during training. Either way, the goal is to make sure that the expected total input to a unit at test time is roughly the same as the expected total input to that unit at training time, even though, on average, half of the units are absent during training.
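A minimal sketch of the weight scaling inference rule, continuing the hypothetical two-layer network used above (the shapes and names are illustrative assumptions): every unit is kept, and each weight is scaled by the inclusion probability of the unit it leaves.

```python
import numpy as np

def predict_weight_scaled(x, W1, b1, W2, b2, p_input=0.8, p_hidden=0.5):
    """Approximate the dropout ensemble with a single forward pass: keep all units,
    but multiply the weights leaving each unit by that unit's inclusion probability,
    so the expected total input to each unit matches what was seen during training."""
    h = np.tanh(x @ (W1 * p_input) + b1)
    return h @ (W2 * p_hidden) + b2
```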
In many classes of models without nonlinear hidden units, the weight scaling inference rule is exact. For a simple example, consider a softmax regression classifier with n input variables represented by the vector v:

P(y = y | v) = softmax(Wᵀv + b)_y.

We can index into the family of sub-models by element-wise multiplication of the input with a binary vector d:

P(y = y | v; d) = softmax(Wᵀ(d ⊙ v) + b)_y.

The ensemble predictor is defined by re-normalizing the geometric mean over all ensemble members' predictions:

P_ensemble(y = y | v) = P̃_ensemble(y = y | v) / Σ_{y'} P̃_ensemble(y = y' | v),   (7.58)

where P̃_ensemble(y = y | v) = (∏_{d∈{0,1}^n} P(y = y | v; d))^(1/2^n).

To see that the weight scaling rule is exact, we can simplify P̃_ensemble:

P̃_ensemble(y = y | v) = (∏_{d∈{0,1}^n} softmax(Wᵀ(d ⊙ v) + b)_y)^(1/2^n).

Because P̃_ensemble will be normalized, we can safely ignore multiplication by factors that are constant with respect to y:

P̃_ensemble(y = y | v) ∝ (∏_{d∈{0,1}^n} exp(W_{y,:}(d ⊙ v) + b_y))^(1/2^n) = exp(½ W_{y,:}v + b_y).

Substituting this back into equation 7.58, we obtain a softmax classifier with weights ½W.
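The exactness claim is easy to verify numerically by brute force for small n. The sketch below, under the same assumptions as the derivation (dropout masks applied only to the inputs, uniform distribution over masks; all values randomly generated for illustration), enumerates all 2^n masks, renormalizes the geometric mean, and compares it with a single softmax classifier using weights W/2.

```python
import numpy as np
from itertools import product

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
n, k = 4, 3                                # n input variables, k classes
W, b = rng.standard_normal((n, k)), rng.standard_normal(k)
v = rng.standard_normal(n)

# Geometric mean of the sub-model predictions over all 2^n input masks, renormalized.
log_probs = np.zeros(k)
for d in product([0.0, 1.0], repeat=n):
    log_probs += np.log(softmax(W.T @ (np.array(d) * v) + b))
geo = np.exp(log_probs / 2 ** n)
p_ensemble = geo / geo.sum()

# Weight scaling rule: a single softmax classifier with weights W / 2.
p_scaled = softmax((W / 2).T @ v + b)

print(np.allclose(p_ensemble, p_scaled))   # True
```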
The weight scaling rule is also exact in other settings, including regression networks with conditionally normal outputs and deep networks whose hidden layers lack nonlinearities. However, it is only an approximation for deep models that do have nonlinearities; the approximation has not been characterized theoretically, but it often performs well in practice. Goodfellow et al. (2013) found experimentally that the weight scaling approximation can outperform Monte Carlo approximations to the ensemble predictor, even when the Monte Carlo approximation is allowed to sample up to 1,000 sub-networks. Gal and Ghahramani (2015), on the other hand, found that some models obtain better classification accuracy using twenty samples and the Monte Carlo approximation. It appears that the best choice of inference approximation is problem-dependent.
Adversarial Training
Neural networks have begun to reach human-level performance on i.i.d. test sets, which raises the question of whether they truly understand the underlying tasks. To probe a network's level of understanding, we can examine the examples it misclassifies. Szegedy et al. (2014b) found that even high-accuracy neural networks can exhibit nearly 100% error rates on examples specifically crafted through optimization to yield a model output very different from that of a nearby input. Often, these adversarial examples are so similar to the original that a human observer cannot tell the difference, yet the networks make highly different predictions.
Adversarial examples have significant implications, for example in computer security, that extend beyond the scope of this discussion. They are relevant to regularization because adversarial training—training on adversarially perturbed examples from the training set—can reduce the error rate on the original i.i.d. test set (Szegedy et al., 2014b; Goodfellow et al., 2014b).
Goodfellow et al. (2014b) identified excessive linearity as a key cause of adversarial examples. Because neural networks are built primarily out of linear building blocks, the overall function they implement proves to be highly linear in many experiments. Such linear functions are easy to perturb, which makes the networks susceptible to adversarial attacks.
Figure 7.8: A demonstration of adversarial example generation applied to GoogLeNet on ImageNet. By adding an imperceptibly small vector proportional to sign(∇_x J(θ, x, y)), the sign of the gradient of the cost function with respect to the input, we can change the network's classification: an image x classified as “panda” with 57.7% confidence, plus the perturbation (itself classified as “nematode” with 8.2% confidence), is classified as “gibbon” with 99.3% confidence. This illustrates the sensitivity of linear functions with numerous inputs, where many small coordinated changes can produce a large change in output. To mitigate this issue, adversarial training encourages local constancy in the neighborhood of the training data, effectively introducing a local constancy prior into supervised neural networks.
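The fast gradient sign construction mentioned above, x + ε·sign(∇_x J(θ, x, y)), can be sketched in a few lines. The example below uses a logistic regression classifier so that the input gradient has a simple closed form; the model, parameter values, and ε are illustrative assumptions, not values from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_example(x, y, w, b, eps=0.1):
    """Fast gradient sign perturbation for a logistic-regression classifier.
    For cross-entropy loss, grad_x J = (sigmoid(w.x + b) - y) * w, so the
    adversarial input is x + eps * sign(grad_x J)."""
    grad_x = (sigmoid(w @ x + b) - y) * w
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
w, b = rng.standard_normal(10), 0.0
x, y = rng.standard_normal(10), 1.0
x_adv = fgsm_example(x, y, w, b)
print(sigmoid(w @ x + b), sigmoid(w @ x_adv + b))   # confidence in the true class drops
```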
Adversarial training demonstrates the power of using a large function family in combination with strong regularization. Purely linear models, such as logistic regression, cannot resist adversarial examples because they are forced to be linear. Neural networks can represent functions ranging from nearly linear to nearly locally constant, which allows them to capture linear trends in the training data while still learning to resist local perturbations.
Adversarial examples can also be leveraged for semi-supervised learning, by using the label that the model itself assigns at an unlabeled point. When a high-quality model assigns a label, that label has a high probability of being the true one, and it can be used to generate virtual adversarial examples: inputs perturbed so as to cause the classifier to output a different label. The classifier is then trained to assign the same label to both the original and the perturbed input. This encourages the classifier to learn a function that is robust to small changes, under the assumption that different classes lie on disconnected manifolds and a small perturbation cannot jump from one manifold to another.
Tangent Distance, Tangent Prop, and Manifold Tangent Classifier
Many machine learning algorithms aim to overcome the curse of dimensionality by assuming that the data lies near a low-dimensional manifold, as described in section 5.11.3.
The tangent distance algorithm (Simard et al., 1993) is a non-parametric nearest-neighbor method in which the metric is derived from the manifolds near which probability concentrates, rather than the generic Euclidean distance. The approach assumes that examples on the same manifold belong to the same category, so the classifier should be invariant to local variations along the manifold. To compute the nearest-neighbor distance between points on different manifolds, each manifold is approximated by its tangent plane at the respective point, and the distance between these tangents (or between a point and a tangent plane) is measured. While this computation can be complex, it can be carried out by solving a low-dimensional linear system; it does, however, require the tangent vectors to be specified.
In a related spirit, the tangent prop algorithm (Simard et al., 1992) trains a neural network classifier with an extra penalty that makes the output f(x) locally invariant to known factors of variation. These factors correspond to movements along the manifold near which examples of the same class concentrate. Local invariance is achieved by requiring the gradient ∇_x f(x) to be orthogonal to the known manifold tangent vectors v^(i) at x, or equivalently by minimizing the directional derivative of f at x in the directions v^(i) through an added regularization penalty Ω:

Ω(f) = Σ_i ((∇_x f(x))ᵀ v^(i))².
This regularizer can of course be scaled by an appropriate hyperparameter, and, for most neural networks, we would need to sum over many outputs rather than the single output f(x) described here for simplicity. As with the tangent distance algorithm, the tangent vectors are derived a priori, usually from the formal knowledge of the effect of transformations such as translation, rotation, and scaling in images. Tangent propagation has been used both for supervised learning (Simard et al., 1992) and for reinforcement learning (Thrun, 1995).
Tangent propagation and dataset augmentation are closely related: in both cases, the user encodes prior knowledge by specifying transformations that should not affect the network's output. The difference is that dataset augmentation trains the network to classify distinct inputs created by applying more than an infinitesimal amount of these transformations, whereas tangent propagation analytically regularizes the model to resist perturbation in the specified directions without explicitly visiting new input points. Tangent propagation has two notable limitations: it only regularizes against infinitesimal perturbations, whereas dataset augmentation confers resistance to larger variations; and it is difficult to use with models based on rectified linear units, which can only shrink their derivatives by turning units off or shrinking their weights, unlike sigmoid or tanh units, which can saturate at large values. Dataset augmentation works well with rectified linear units because different subsets of units can activate for different transformed versions of each original input.
Tangent propagation is also related to double backprop (Drucker and LeCun, 1992) and to adversarial training.
Double backpropagation regularizes the Jacobian to be small, while adversarial training finds inputs near the originals and trains the model to produce the same output on these as on the original inputs. Both require the model to be invariant to small changes of the input; just as dataset augmentation is the non-infinitesimal version of tangent propagation, adversarial training is the non-infinitesimal version of double backpropagation.
The manifold tangent classifier (Rifai et al., 2011c) eliminates the need to know the tangent vectors a priori. As we will see in chapter 14, autoencoders can estimate the manifold tangent vectors.
Figure 7.9: Illustration of the main idea of the tangent prop algorithm (Simard et al., 1992) and the manifold tangent classifier (Rifai et al., 2011c), both of which regularize the output function f(x) of the classifier. Each curve represents the manifold for a distinct class, visualized here as a one-dimensional manifold embedded in a two-dimensional space.
At one point on a curve we draw both a tangent vector and a normal vector to the class manifold; in multiple dimensions, there can be many tangent and many normal directions. We expect the classification function to change rapidly as it moves in the normal direction and to remain nearly constant as it moves along the class manifold. Both tangent propagation and the manifold tangent classifier regularize f(x) so that it changes very little as x traverses the manifold. Tangent propagation requires user-defined functions for computing the tangent directions, while the manifold tangent classifier estimates these directions with autoencoders; the estimated tangents go beyond the classical geometric invariants and include object-specific factors that must be learned from data. The proposed algorithm is therefore simple: use an autoencoder to learn the manifold structure by unsupervised learning, then use these tangents to regularize a neural network classifier as in tangent propagation.
This chapter has outlined most of the general strategies used to regularize neural networks, a central theme of machine learning that will be revisited in subsequent chapters. Another central theme, optimization, is discussed next.
Optimization for Training Deep Models
Deep learning algorithms involve optimization in many contexts; for example, performing inference in models such as PCA requires solving an optimization problem, and analytical optimization plays a crucial role in writing proofs and designing algorithms.
Neural network training stands out as one of the most challenging optimization problems in deep learning, often requiring significant investments of time and computational resources: training a single neural network can take anywhere from days to months, even when utilizing hundreds of machines. Because the problem is so important and so expensive, specialized optimization techniques have been developed to tackle it. This chapter presents these techniques for optimizing neural network training.
If you are unfamiliar with the basic principles of gradient-based optimization, we suggest reviewing chapter 4, which includes a brief overview of numerical optimization in general.
This chapter focuses on one particular case of optimization: finding the parameters θ of a neural network that substantially reduce a cost function J(θ). The cost function typically includes a performance measure evaluated across the entire training set, along with additional regularization terms that prevent overfitting and promote generalization.
This chapter explores the distinction between optimization used as a training algorithm for machine learning and traditional optimization. It highlights the specific challenges that make optimizing neural networks difficult, such as non-convexity. The discussion includes practical optimization algorithms and strategies for parameter initialization, as well as more advanced techniques that adapt learning rates during training or exploit second derivatives of the cost function. The chapter concludes by reviewing optimization strategies that combine simple algorithms into higher-level procedures.
How Learning Differs from Pure Optimization
Optimization algorithms used for training deep models differ from traditional optimization algorithms in several ways. Machine learning usually acts indirectly.
In most machine learning scenarios, we care about some performance measure P, which is defined with respect to the test set and is often intractable.
We therefore optimize P only indirectly: we reduce a different cost function J(θ) in the hope that doing so will improve P. This is in contrast to pure optimization, where minimizing J is a goal in itself. Optimization algorithms for training deep models also typically include some specialization to the specific structure of machine learning objective functions.
Typically, the cost function can be written as an average over the training set, such as

J(θ) = E_{(x,y)∼p̂_data} L(f(x; θ), y),   (8.1)
where L is the per-example loss function, f(x; θ) is the predicted output for input x, p̂_data is the empirical distribution, and, in the supervised learning case, y is the target output. Throughout this chapter, we develop the unregularized supervised case, in which the arguments to L are f(x; θ) and y. The framework is easily extended to include, for example, θ or x as arguments, or to exclude y, in order to develop various forms of regularization or unsupervised learning.
Equation 8.1 defines an objective function with respect to the training set. We would usually prefer to minimize the corresponding objective function in which the expectation is taken across the data generating distribution p_data rather than over just the finite training set.
The goal of a machine learning algorithm is to reduce this expected generalization error, known as the risk, where the expectation is taken over the true underlying distribution of the data. If that distribution were known, risk minimization would be an ordinary optimization task; because it is unknown and we only have a finite training set of samples, we have a machine learning problem instead.
The simplest way to convert a machine learning problem back into an optimization problem is to minimize the expected loss on the training set. This involves substituting the true distribution p(x, y) with the empirical distribution p̂(x, y) derived from the training data. Consequently, we minimize the empirical risk:
E_{x,y∼p̂_data(x,y)}[L(f(x; θ), y)] = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i)),   (8.3)

where m is the number of training examples.
This training process based on minimizing the average training error is known as empirical risk minimization. It closely resembles traditional optimization, in that we optimize the empirical risk rather than the true risk directly, hoping that the true risk decreases significantly as well; a variety of theoretical results establish conditions under which such a decrease can be expected.
Empirical risk minimization is prone to overfitting, because high-capacity models can simply memorize the training set instead of generalizing. Moreover, the most effective modern optimization algorithms are based on gradient descent, and many useful loss functions, such as 0-1 loss, have no useful derivatives. For these reasons, empirical risk minimization is rarely used in deep learning; instead, we use a slightly modified approach in which the quantity we actually optimize differs from the quantity we truly want to optimize.
8.1.2 Surrogate Loss Functions and Early Stopping
Sometimes the loss function we actually care about, such as classification error, cannot be optimized efficiently; for example, exactly minimizing the expected 0-1 loss is typically intractable (exponential in the input dimension). In such situations, one typically optimizes a surrogate loss function instead. A common choice is the negative log-likelihood of the correct class, which allows the model to estimate the conditional probabilities of the classes given the input; if it does that well, it can pick the classes that minimize classification error in expectation.
In some cases, a surrogate loss function actually allows the model to learn more. For instance, when using the log-likelihood surrogate, the test set 0-1 loss often continues to decrease for a long time after the training set 0-1 loss has reached zero: even when the expected 0-1 loss is zero, the classifier can become more robust by further separating the classes, extracting more information from the training data than would be possible by simply minimizing the average 0-1 loss on the training set.
A very important difference between optimization in general and optimization as used in training algorithms is that training algorithms do not usually halt at a local minimum. Instead, a machine learning algorithm minimizes a surrogate loss function and halts when an early stopping criterion is satisfied; typically this criterion is based on the true underlying loss function, such as 0-1 loss measured on a validation set, and is designed to cause the algorithm to halt whenever overfitting begins. Training therefore often halts while the surrogate loss function still has large derivatives, which is very different from the pure optimization setting, where convergence is defined by a very small gradient.
Machine learning algorithms also differ from general optimization algorithms in that their objective functions usually decompose as a sum over the training examples. Optimization algorithms for machine learning typically compute each update to the parameters based on an expected value of the cost function estimated using only a subset of the terms of the full cost function. For example, maximum likelihood estimation problems, when viewed in log space, decompose into a sum over individual examples:

θ_ML = arg max_θ Σ_{i=1}^{m} log p_model(x^(i), y^(i); θ).   (8.4)
Maximizing this sum is equivalent to maximizing the expectation over the empirical distribution defined by the training set:
J(θ) = E_{x,y∼p̂_data} log p_model(x, y; θ).   (8.5)

Most of the properties of the objective function J used by optimization algorithms are also expectations over the training set; the most commonly used property is the gradient:

∇_θ J(θ) = E_{x,y∼p̂_data} ∇_θ log p_model(x, y; θ).
Computing this expectation exactly is costly, because it requires evaluating the model on every example in the entire dataset. In practice, we can compute these expectations by randomly sampling a small number of examples from the dataset and averaging over only those examples.
Recall that the standard error of a mean estimated from n samples is σ/√n, where σ is the true standard deviation of the samples, so there are less than linear returns to using more examples to estimate the gradient. Compare an estimate based on 100 examples with one based on 10,000: the latter requires 100 times more computation but reduces the standard error only by a factor of 10. Consequently, most optimization algorithms converge much faster (in terms of total computation) when they are allowed to rapidly compute approximate estimates of the gradient rather than slowly computing the exact gradient.
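The σ/√n scaling is easy to check empirically. The short sketch below (an illustrative experiment, with unit-normal values standing in for per-example gradient components) estimates the standard error of the minibatch mean for two minibatch sizes.

```python
import numpy as np

# Empirical check that the standard error of a minibatch mean falls as 1/sqrt(n):
# going from n = 100 to n = 10,000 costs 100x more compute per estimate
# but only shrinks the error by a factor of 10.
rng = np.random.default_rng(0)
for n in (100, 10_000):
    estimates = [rng.standard_normal(n).mean() for _ in range(2_000)]
    print(n, np.std(estimates))   # roughly sigma / sqrt(n) = 0.1 and 0.01
```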
Another consideration motivating statistical estimation of the gradient from a small number of samples is redundancy in the training set. In the worst case, all samples in the training set could be identical; a sampling-based estimate of the gradient could then compute the correct gradient from a single sample, using far less computation. This extreme situation is rare in practice, but it is common to find large numbers of examples that all make very similar contributions to the gradient, which makes sampling-based gradient estimation efficient.
Challenges in Neural Network Optimization
Optimization in general is an extremely difficult task, and training neural networks in particular involves non-convex problems. Traditional machine learning has avoided much of this difficulty by carefully designing convex objective functions and constraints; deep learning must confront the general non-convex case. Even convex optimization is not without its complications. In this section, we summarize several of the most prominent challenges involved in optimizing deep models.
Some challenges arise even when optimizing convex functions; of these, the most prominent is ill-conditioning of the Hessian matrix H. This is a very general problem in numerical optimization, convex or otherwise, and is described in more detail in section 4.3.1.
The ill-conditioning problem is generally believed to be present in neural network training problems. Ill-conditioning can manifest by causing SGD to get
“stuck” in the sense that even very small steps increase the cost function.
Recall from equation 4.9 that a second-order Taylor series expansion of the cost function predicts that a gradient descent step of −εg will add

(1/2) ε² gᵀHg − ε gᵀg

to the cost. Ill-conditioning of the gradient becomes a problem when (1/2) ε² gᵀHg exceeds ε gᵀg. To determine whether ill-conditioning is detrimental to a neural network training task, one can monitor the squared gradient norm gᵀg and the gᵀHg term throughout training.
In the context of training a convolutional network for object detection, figure 8.1 shows that gradient descent does not necessarily arrive at a critical point. The scatterplot of individual gradient norms over time shows that the running average of the gradient norm increases rather than decreases during training. This indicates a lack of convergence to a critical point, yet training remains effective, as evidenced by the decreasing validation set classification error. Even though the gradient norm does not shrink significantly, the gᵀHg term can grow substantially, so learning becomes slow despite the strong gradient because the learning rate must be shrunk to compensate for even stronger curvature. This example illustrates that a strong gradient does not guarantee rapid learning.
Ill-conditioning is present in other settings besides neural network training, but many of the techniques used to combat it elsewhere are less applicable to neural networks. For example, Newton's method is an excellent tool for minimizing convex functions with poorly conditioned Hessian matrices, but it requires substantial modification before it can be applied to neural network training.
Convex optimization problems can be reduced to finding a local minimum, which is guaranteed to be a global minimum. Some convex functions have a flat region at the bottom rather than a single global minimum point, but any point within such a flat region is an acceptable solution. When optimizing a convex function, reaching any kind of critical point means a satisfactory solution has been found.
Non-convex functions, like those arising in neural networks, can have many local minima, and deep models essentially always have a very large number of them. As we will see, however, this is not necessarily a major problem.
Neural networks, and more generally models with multiple equivalently parameterized latent variables, have multiple local minima because of the model identifiability problem. A model is said to be identifiable if a sufficiently large training set can rule out all but one setting of its parameters. Models with latent variables are often not identifiable, because equivalent models can be obtained by exchanging latent variables with each other: in a neural network, for instance, one can swap the incoming and outgoing weight vectors of two units within a layer. With m layers of n units each, there are n!^m ways of arranging the hidden units; this kind of non-identifiability is known as weight space symmetry.
Beyond weight space symmetry, many neural networks have additional causes of non-identifiability. For instance, in rectified linear or maxout networks, scaling all incoming weights and biases of a unit by a factor of α while scaling its outgoing weights by 1/α yields an equivalent network. Consequently, if the cost function lacks terms such as weight decay that depend directly on the weights rather than on the model's outputs, every local minimum of such a network lies on an (m × n)-dimensional hyperbola of equivalent local minima.
These model identifiability issues mean that a neural network cost function can have an extremely large, or even uncountably infinite, number of local minima. However, all of the local minima arising from non-identifiability have the same cost function value, so they are not a problematic form of non-convexity.
Local minima can be problematic when their cost is high in comparison to the global minimum. It is possible to construct small neural networks, even without hidden units, that have local minima with higher cost than the global minimum (Sontag and Sussmann, 1989; Gori and Tesi, 1992). If high-cost local minima were common, this could pose a serious problem for gradient-based optimization algorithms.
It remains an open question whether high-cost local minima are prevalent in networks of practical interest and whether optimization algorithms actually encounter them. For many years, most practitioners believed that local minima were a common problem plaguing neural network optimization. Recent research suggests that, for sufficiently large networks, most local minima have a low cost function value, and that it is not important to find a true global minimum rather than a point in parameter space with low, but not minimal, cost.
Many practitioners still attribute nearly all difficulty with neural network optimization to local minima, but it is important to test for the specific problem. One test that can rule out local minima is to plot the norm of the gradient over time: if the norm does not shrink to a negligible size, the problem is neither local minima nor any other kind of critical point. This kind of negative test can rule out local minima; in high-dimensional spaces, positively establishing that local minima are the problem is difficult, because many structures other than local minima also have small gradients.
8.2.3 Plateaus, Saddle Points and Other Flat Regions
For many high-dimensional non-convex functions, saddle points are far more common than local minima and maxima. At a saddle point, the Hessian matrix has both positive and negative eigenvalues: points lying along eigenvectors associated with positive eigenvalues have greater cost than the saddle point, while points lying along eigenvectors associated with negative eigenvalues have lower cost. A saddle point can thus be viewed as a local minimum along one cross-section of the cost function and a local maximum along another.
In low-dimensional spaces, local minima of random functions are common; in higher-dimensional spaces, local minima become rare and saddle points become much more frequent, with the expected ratio of saddle points to local minima growing exponentially with the dimensionality. This behavior can be understood by examining the Hessian: at a local minimum all eigenvalues are positive, whereas at a saddle point they are a mixture of positive and negative values. Imagine that the sign of each eigenvalue is determined by a coin flip: in a single dimension it is easy to obtain a local minimum, but in an n-dimensional space the probability that all n eigenvalues are positive becomes exponentially small. See Dauphin et al. (2014) for a review of the relevant theoretical work.
Basic Algorithms
We have previously introduced the gradient descent algorithm (section 4.3), which follows the gradient of the entire training set downhill. As discussed in sections 5.9 and 8.1.3, this can be accelerated considerably by using stochastic gradient descent, which follows the gradient of randomly selected minibatches downhill.
Stochastic gradient descent (SGD) and its variants are probably the most widely used optimization algorithms in machine learning, and in deep learning in particular. As outlined in section 8.1.3, an unbiased estimate of the gradient can be obtained by averaging the gradients over a minibatch of m examples drawn i.i.d. from the data generating distribution.
Algorithm 8.1 shows how to follow this estimate of the gradient downhill.
Algorithm 8.1 Stochastic gradient descent (SGD) update at training iteration k
Require: Learning rate ε_k
Require: Initial parameter θ
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), …, x^(m)} with corresponding targets y^(i).
  Compute gradient estimate: ĝ ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Apply update: θ ← θ − ε ĝ
end while
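Algorithm 8.1 maps directly onto a small Python loop. The sketch below is a minimal illustration, not a production implementation; the names, the fixed learning rate, and the toy mean-estimation example are assumptions made for the demonstration.

```python
import numpy as np

def sgd(grad_fn, theta, data, lr=0.01, batch_size=32, epochs=10, rng=None):
    """Minimal minibatch SGD loop in the spirit of algorithm 8.1.  grad_fn(theta, batch)
    must return the average gradient of the per-example loss over the minibatch."""
    rng = rng or np.random.default_rng(0)
    for _ in range(epochs):
        idx = rng.permutation(len(data))
        for start in range(0, len(data), batch_size):
            batch = data[idx[start:start + batch_size]]
            theta = theta - lr * grad_fn(theta, batch)   # apply update: theta <- theta - eps * g_hat
    return theta

# Example: estimate the mean of the data by minimizing (x - theta)^2 / 2,
# whose per-example gradient is (theta - x).
data = np.random.default_rng(1).normal(loc=3.0, size=(1000, 1))
theta = sgd(lambda th, b: np.mean(th - b), np.zeros(1), data, lr=0.1)
print(theta)   # close to 3.0
```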
A crucial parameter of the SGD algorithm is the learning rate. So far we have described SGD as using a fixed learning rate ε, but in practice it is necessary to gradually decrease the learning rate over time, so we denote the learning rate at iteration k as ε_k.
This is necessary because the SGD gradient estimator introduces a source of noise—the random sampling of training examples—that does not vanish even at a minimum. By comparison, the true gradient of the total cost function becomes small and then zero as we approach a minimum with batch gradient descent, so batch gradient descent can use a fixed learning rate. A sufficient condition to guarantee convergence of SGD is that

Σ_{k=1}^{∞} ε_k = ∞   and   Σ_{k=1}^{∞} ε_k² < ∞.
In practice, it is common to decay the learning rate linearly until iteration τ:

ε_k = (1 − α) ε_0 + α ε_τ,   (8.14)

with α = k/τ. After iteration τ, it is common to leave ε constant.
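Equation 8.14 is a one-line schedule in code. The sketch below is only an illustration; the default values of ε_0, τ, and the 1% final-rate rule of thumb are the ones suggested in the surrounding text, not required choices.

```python
def linear_lr_schedule(k, eps0=0.1, tau=1000, frac=0.01):
    """Linear decay of equation 8.14: eps_k = (1 - alpha) * eps0 + alpha * eps_tau,
    with alpha = k / tau and eps_tau = frac * eps0; after iteration tau the rate stays constant."""
    eps_tau = frac * eps0
    alpha = min(k / tau, 1.0)
    return (1.0 - alpha) * eps0 + alpha * eps_tau

print(linear_lr_schedule(0), linear_lr_schedule(500), linear_lr_schedule(5000))
# 0.1  0.0505  0.001
```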
The learning rate may be chosen by trial and error, but it is usually best chosen by monitoring learning curves that plot the objective function as a function of time. This is more of an art than a science, and most guidance on this subject should be regarded with some skepticism. When using the linear schedule, the parameters to choose are the initial learning rate ε_0, the final learning rate ε_τ, and the duration τ, which is typically set to the number of iterations required to make a few hundred passes through the training set.
ε_τ should be set to roughly 1% of the value of ε_0. The main question is how to set ε_0.
If the learning rate is too large, the learning curve shows violent oscillations, with the cost function often increasing significantly. Gentle oscillations are fine, especially when training with a stochastic cost function such as one arising from the use of dropout. If the learning rate is too low, learning proceeds slowly, and if the initial learning rate is too low, learning may become stuck at a high cost value. Typically, the optimal initial learning rate, in terms of total training time and final cost, is higher than the learning rate that performs best after the first 100 or so iterations; it is therefore best to monitor the first several iterations and choose a learning rate that is higher than the best-performing one at that stage, but not so high as to cause severe instability.
The most important property of SGD, and of related minibatch or online gradient-based optimization, is that computation time per update does not grow with the number of training examples. This allows convergence even when the number of training examples becomes very large; indeed, for a large enough dataset, SGD may converge to within some fixed tolerance of its final test set error before it has processed the entire training set.
To study the convergence rate of an optimization algorithm, it is common to measure the excess error J(θ) − min_θ J(θ), the amount by which the current cost function exceeds its minimum. When SGD is applied to a convex problem, the excess error is O(1/√k) after k iterations, and in the strongly convex case it is O(1/k); these bounds cannot be improved unless extra assumptions are made. Batch gradient descent enjoys better convergence rates than SGD in theory, but the Cramér–Rao bound (Cramér, 1946; Rao, 1945)
states that generalization error cannot decrease faster than O(1/k). Bottou and Bousquet (2008) therefore argue that it may not be worthwhile for machine learning to pursue optimization algorithms that converge faster than O(1/k), since faster convergence presumably corresponds to overfitting. Moreover, this asymptotic analysis obscures many of the advantages of stochastic gradient descent, especially in the early stages of training with large datasets: SGD's ability to make rapid initial progress while evaluating the gradient on only a few examples outweighs its slow asymptotic convergence. Many of the algorithms described in the remainder of this chapter achieve benefits that matter in practice but are lost in the constant factors obscured by the O(1/k) analysis. One can also trade off the benefits of batch and stochastic gradient descent by gradually increasing the minibatch size during the course of learning.
For more information on SGD, see Bottou (1998).
While stochastic gradient descent remains a very popular optimization strategy, learning with it can sometimes be slow. The method of momentum (Polyak, 1964) is designed to accelerate learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients. The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction. The effect of momentum is illustrated in figure 8.5.
Formally, the momentum algorithm introduces a variable v that plays the role of velocity: it is the direction and speed at which the parameters move through parameter space, and it is set to an exponentially decaying average of the negative gradient. The name momentum derives from a physical analogy in which the negative gradient is a force moving a particle through parameter space, according to Newton's laws of motion; momentum in physics is mass times velocity.
In the momentum learning algorithm, we assume unit mass, so the velocity vector v may also be regarded as the momentum of the particle. A hyperparameter α ∈ [0, 1) determines how quickly the contributions of previous gradients decay exponentially. The update rule is given by

v ← αv − ε ∇_θ ((1/m) Σ_i L(f(x^(i); θ), y^(i))),
θ ← θ + v.
The velocity v accumulates the gradient elements ∇_θ ((1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))). The larger α is relative to ε, the more previous gradients affect the current direction.
The SGD algorithm with momentum is given in algorithm 8.2.
Momentum aims primarily to solve two problems: poor conditioning of the Hessian matrix and variance in the stochastic gradient. Figure 8.5 illustrates how momentum overcomes the first of these: the contour lines depict a quadratic loss function with a poorly conditioned Hessian, and the red path cutting across the contours indicates the trajectory followed by the momentum learning rule as it minimizes this function. The arrows show the steps that gradient descent would take at each point: while the poorly conditioned quadratic objective looks like a long, narrow valley with steep sides, momentum traverses the canyon lengthwise, whereas gradient descent wastes time oscillating back and forth across the narrow axis. Compare also figure 4.6, which shows the behavior of gradient descent without momentum.
Previously, the size of the step was simply the norm of the gradient multiplied by the learning rate; now it also depends on how large and how aligned a sequence of gradients is. The step size is largest when many successive gradients point in exactly the same direction. If the momentum algorithm always observes gradient g, it accelerates in the direction of −g until reaching a terminal velocity where the size of each step is ε‖g‖/(1 − α).
It is thus helpful to think of the momentum hyperparameter in terms of 1/(1 − α). For example, α = 0.9 corresponds to multiplying the maximum speed by 10 relative to the gradient descent algorithm.
Common values of α used in practice are 0.5, 0.9, and 0.99. Like the learning rate, α may also be adapted over time, typically starting with a small value that is later raised; adapting α over time is less important than shrinking ε over time.
Algorithm 8.2 Stochastic gradient descent (SGD) with momentum
Require: Learning rate ε, momentum parameter α
Require: Initial parameter θ, initial velocity v
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), …, x^(m)} with corresponding targets y^(i).
  Compute gradient estimate: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Compute velocity update: v ← αv − εg
  Apply update: θ ← θ + v
end while
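A minimal sketch of one step of algorithm 8.2, with a small experiment illustrating the terminal velocity ε‖g‖/(1 − α) discussed above (the learning rate, α, and the constant gradient are illustrative choices):

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, lr=0.01, alpha=0.9):
    """One update of SGD with momentum (algorithm 8.2): the velocity accumulates an
    exponentially decaying average of past gradients and the parameters follow it."""
    v = alpha * v - lr * grad          # v <- alpha * v - eps * g
    theta = theta + v                  # theta <- theta + v
    return theta, v

# On a constant gradient g, the step approaches the terminal velocity lr * g / (1 - alpha).
theta, v = np.zeros(2), np.zeros(2)
g = np.array([1.0, -2.0])
for _ in range(200):
    theta, v = sgd_momentum_step(theta, v, g)
print(v, -0.01 * g / (1 - 0.9))        # the two vectors are nearly identical
```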
Parameter Initialization Strategies
Some optimization algorithms are not iterative by nature and simply solve for a solution point; others are iterative but, for the right class of problems, converge to acceptable solutions quickly regardless of initialization. Deep learning training algorithms usually have neither of these luxuries: they are iterative and require an initial point from which to begin. The choice of this initial point is crucial, as it can determine whether the algorithm converges at all, with some initializations leading to numerical difficulties and outright failure. Even when convergence occurs, the initial point affects the speed of convergence and the cost of the final solution, and solutions of comparable cost can have wildly different generalization error.
Modern initialization strategies for neural networks are largely heuristic and simple, and improving them is difficult because neural network optimization is not yet well understood. Most strategies aim to establish favorable properties when the network is initialized, but we have little understanding of which of these properties are preserved as learning proceeds. A further difficulty is that some initial points may be beneficial from the viewpoint of optimization but detrimental from the viewpoint of generalization, and our understanding of how the initial point affects generalization is especially limited, providing little guidance for selecting it.
Perhaps the only property known with certainty is that the initial parameters must "break symmetry" between different units. If two hidden units with the same activation function are connected to the same inputs and have the same initial parameters, a deterministic learning algorithm will update them identically forever. Initializing each unit to compute a distinct function also helps avoid losing input patterns in the forward-propagation null space and gradient patterns in the back-propagation null space. Explicitly searching for a large set of mutually different basis functions would be computationally expensive, so in practice we simply initialize randomly from a high-entropy distribution over a high-dimensional space, which makes it very unlikely that any units compute the same function as each other.
In most cases, the biases for each unit are set to heuristically chosen constants, and only the weights are initialized randomly. Extra parameters, such as those encoding the conditional variance of a prediction, are usually set to heuristically chosen constants much like the biases.
We almost always initialize the weights to values drawn randomly from a Gaussian or uniform distribution, and the specific choice between the two does not seem to matter much. The scale of the initial distribution, however, has a large effect both on the outcome of the optimization procedure and on the ability of the network to generalize.
Larger initial weights yield a stronger symmetry-breaking effect, helping to avoid redundant units and to avoid losing signal during forward and back-propagation, because larger values in a matrix yield larger outputs of matrix multiplication. Initial weights that are too large, however, may cause values to explode during forward or back-propagation, and in recurrent networks can even result in chaos.
Chaos here means such extreme sensitivity to small perturbations of the input that the deterministic forward propagation procedure appears to behave randomly. The exploding gradient problem can be partially mitigated by gradient clipping, that is, thresholding the values of the gradients before performing a gradient descent step. Large weights may also produce extreme values that cause the activation function to saturate, resulting in a complete loss of gradient through saturated units. These competing factors determine the ideal initial scale of the weights.
The perspectives of regularization and optimization can give very different insights into how we should initialize a network. The optimization perspective suggests that weights should be large enough to propagate information successfully, while regularization concerns encourage making them smaller. Stochastic gradient descent typically makes only small incremental changes to the weights and tends to halt in a region near the initial parameters, expressing a prior belief that the final weights should not stray far from their initial values. Gradient descent with early stopping is equivalent to weight decay for some models; in the general case it is not, but it provides a loose analogy for thinking about the effect of initialization. Initializing the parameters near zero is thus similar to imposing a Gaussian prior expressing that units are more likely not to interact with each other, whereas larger initial values express a prior that specific units should interact and in which way.
Some heuristics are available for choosing the initial scale of the weights. One heuristic is to initialize the weights of a fully connected layer with m inputs and n outputs by sampling each weight from the uniform distribution U(−1/√m, 1/√m), while Glorot and Bengio
(2010) suggest using the normalized initialization

W_{i,j} ∼ U(−√(6/(m + n)), √(6/(m + n))).
This latter heuristic is designed to compromise between the goal of initializing all layers to have the same activation variance and the goal of initializing all layers to have the same gradient variance. The formula is derived under the assumption that the network consists only of a chain of matrix multiplications with no nonlinearities. Real neural networks obviously violate this assumption, but many strategies designed for the linear model perform reasonably well on its nonlinear counterparts.
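A minimal sketch of the normalized initialization, assuming a dense layer represented as an m × n weight matrix (the layer sizes in the example are illustrative):

```python
import numpy as np

def normalized_init(m, n, rng=None):
    """Glorot and Bengio (2010) normalized initialization for a fully connected layer
    with m inputs and n outputs: W_ij ~ U(-sqrt(6/(m+n)), sqrt(6/(m+n)))."""
    rng = rng or np.random.default_rng(0)
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))

W = normalized_init(784, 256)
print(W.shape, W.std(), np.sqrt(2.0 / (784 + 256)))   # empirical std matches sqrt(2/(m+n))
```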
Saxe et al. (2013) recommend initializing to random orthogonal matrices, with a carefully chosen scaling or gain factor g that accounts for the nonlinearity applied at each layer, and they derive specific values of the scaling factor for different types of nonlinear activation functions. This initialization scheme is also motivated by a model of a deep network as a sequence of matrix multiplications without nonlinearities; under such a model, it guarantees that the total number of training iterations required to reach convergence is independent of depth.
Increasing the scaling factor g pushes the network toward the regime where activations increase in norm as they propagate forward through the network and gradients increase in norm as they propagate backward. Sussillo (2014) showed that setting the gain factor correctly is sufficient to train networks with considerable depth.
Indeed, with a correctly set gain, networks as deep as 1,000 layers can be trained without orthogonal initializations. A key insight of that approach is that, in feedforward networks, activations and gradients can grow or shrink on each step of forward or back-propagation, following a random-walk behavior that arises because a different weight matrix is used at each layer; by tuning this random walk to preserve norms, feedforward networks can mostly avoid the vanishing and exploding gradients problem that arises when the same weight matrix is used at every step. Unfortunately, optimal criteria for the initial weights often do not lead to optimal performance, for several reasons: preserving the norm of a signal throughout the network may not actually be beneficial; the properties imposed at initialization may not persist once training has begun; and the criterion might improve the speed of optimization while inadvertently increasing generalization error. Consequently, the scale of the weights usually needs to be treated as a hyperparameter whose optimal value lies somewhere roughly near, but not exactly at, the theoretical predictions.
One drawback of scaling rules that set all of the initial weights to have the same standard deviation, such as 1/√m, is that every individual weight becomes extremely small when the layers are large. To address this, Martens (2010) proposed sparse initialization, in which each unit is initialized to have a fixed number of non-zero weights, keeping the total amount of input to the unit independent of the number of inputs. This helps achieve more diversity among the units at initialization, but it also imposes a very strong prior on the weights that are chosen to have large Gaussian values: because gradient descent takes a long time to shrink "incorrect" large values, this scheme can cause problems for units such as maxout units, which have several filters that must be carefully coordinated with one another.
When computational resources allow it, it is usually a good idea to treat the initial scale of the weights for each layer as a hyperparameter and to choose these scales with a hyperparameter search algorithm such as random search. The choice between dense and sparse initialization can also be made a hyperparameter, and it is also possible to search manually for the best initial scales. A good rule of thumb for choosing the initial scales is to look at the range or standard deviation of activations or gradients on a single minibatch of data: if the weights are too small, the range of activations shrinks as they propagate forward through the network.
Algorithms with Adaptive Learning Rates
Neural network researchers have long realized that the learning rate is reliably one of the most difficult hyperparameters to set, because it significantly affects model performance. As discussed in sections 4.3 and 8.2, the cost function is often highly sensitive to some directions in parameter space and insensitive to others. The momentum algorithm can mitigate these issues somewhat, but it does so at the expense of introducing another hyperparameter. Is there another way? If we believe that the directions of sensitivity are somewhat axis-aligned, it can make sense to use a separate learning rate for each parameter and automatically adapt these learning rates throughout the course of learning.
The delta-bar-delta algorithm (Jacobs, 1988) is an early heuristic approach to adapting individual learning rates for model parameters during training. It is based on a simple idea: if the partial derivative of the loss with respect to a given model parameter keeps the same sign, the learning rate should increase; if that partial derivative changes sign, the learning rate should decrease. A rule of this kind can only be applied to full batch optimization.
Recently, several incremental or mini-batch-based methods have emerged that adjust the learning rates of model parameters This section will provide a brief overview of some of these algorithms.
The AdaGrad algorithm individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all of their historical squared values (Duchi et al., 2011). Parameters with the largest partial derivatives of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease. The net effect is greater progress in the more gently sloped directions of parameter space.
In the context of convex optimization, the AdaGrad algorithm enjoys some desirable theoretical properties. Empirically, however, when training deep neural networks, the accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the effective learning rate. AdaGrad performs well for some but not all deep learning models.
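The per-parameter update can be sketched in a few lines. This is only an illustrative implementation of a single AdaGrad step; the default learning rate and δ are common choices, not prescriptions from the text.

```python
import numpy as np

def adagrad_step(theta, r, grad, lr=0.01, delta=1e-7):
    """One AdaGrad update: accumulate the squared gradient and scale each
    parameter's effective learning rate by the inverse square root of that history."""
    r = r + grad * grad                               # r <- r + g (elementwise) g
    theta = theta - lr * grad / (delta + np.sqrt(r))  # elementwise division
    return theta, r
```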
The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average. AdaGrad is designed to converge rapidly when applied to a convex function; when applied to a non-convex function, as in training a neural network, the learning trajectory may pass through many different structures and eventually arrive at a region that is a locally convex bowl. AdaGrad shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at such a convex structure.
Algorithm 8.4 The AdaGrad algorithm
Require: Global learning rate ε
Require: Initial parameter θ
Require: Small constant δ, perhaps 10^−7, for numerical stability
Initialize gradient accumulation variable r = 0
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), …, x^(m)} with corresponding targets y^(i).
  Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Accumulate squared gradient: r ← r + g ⊙ g
  Compute update: ∆θ ← −(ε / (δ + √r)) ⊙ g  (division and square root applied element-wise)
  Apply update: θ ← θ + ∆θ
end while
RMSProp uses an exponentially decaying average to discard history from the extreme past, so that it can converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.
RMSProp is presented in its standard form in algorithm 8.5 and combined with Nesterov momentum in algorithm 8.6. Compared to AdaGrad, the use of the moving average introduces a new hyperparameter, ρ, that controls the length scale of the moving average.
Empirically, RMSProp has been shown to be an effective and practical optimization algorithm for deep neural networks, and it is currently one of the go-to optimization methods employed routinely by deep learning practitioners.
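As a sketch (illustrative defaults, not prescribed values), a single RMSProp step differs from the AdaGrad step above only in how the squared gradients are accumulated:

```python
import numpy as np

def rmsprop_step(theta, r, grad, lr=0.001, rho=0.9, delta=1e-6):
    """One RMSProp update: unlike AdaGrad, the squared-gradient accumulator is an
    exponentially decaying average, so old history is eventually forgotten."""
    r = rho * r + (1.0 - rho) * grad * grad       # r <- rho*r + (1-rho) g (elementwise) g
    theta = theta - lr * grad / np.sqrt(delta + r)
    return theta, r
```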
Adam (Kingma and Ba, 2014) is yet another adaptive learning rate optimization algorithm, presented in algorithm 8.7. The name derives from the phrase "adaptive moments." It is perhaps best seen as a variant on the combination of RMSProp and momentum, with a few important distinctions: in Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient, whereas the most straightforward way to add momentum to RMSProp is to apply momentum to the rescaled gradients, which is not as well motivated theoretically.
Algorithm 8.5 The RMSProp algorithm
Require: Global learning rate ε, decay rate ρ
Require: Small constant δ, usually 10^−6, used to stabilize division by small numbers
Require: Initial parameter θ
Initialize accumulation variable r = 0
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), …, x^(m)} with corresponding targets y^(i).
  Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Accumulate squared gradient: r ← ρr + (1 − ρ) g ⊙ g
  Compute parameter update: ∆θ = −(ε / √(δ + r)) ⊙ g  (1/√(δ + r) applied element-wise)
  Apply update: θ ← θ + ∆θ
end while

Adam also includes bias corrections to the estimates of both the first-order moments (the momentum term) and the uncentered second-order moments, to account for their initialization at the origin. RMSProp incorporates an estimate of the uncentered second-order moment as well, but it lacks the correction factor, so its estimate may have high bias early in training. Adam is generally regarded as fairly robust to the choice of hyperparameters, although the learning rate sometimes needs to be changed from the suggested default.
8.5.4 Choosing the Right Optimization Algorithm
In this section, we discussed a family of algorithms that adapt the learning rate of each model parameter. This raises an important question: which algorithm should one choose?
There is currently no consensus on this point. A comprehensive comparison by Schaul et al. (2014) evaluated a large number of optimization algorithms across a wide range of learning tasks. The results indicate that adaptive learning rate algorithms, such as RMSProp and AdaDelta, perform fairly robustly, but no single best algorithm has emerged.
The most widely used optimization algorithms today are SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta, and Adam. The choice among them seems to depend largely on the user's familiarity with the algorithm (for ease of hyperparameter tuning) and on the specific use case.
Algorithm 8.6 RMSProp algorithm with Nesterov momentum
Require: Global learning rate ε, decay rate ρ, momentum coefficient α
Require: Initial parameter θ, initial velocity v
  Initialize accumulation variable r = 0
  while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^{(1)}, ..., x^{(m)}} with corresponding targets y^{(i)}.
    Compute interim update: θ̃ ← θ + αv
    Compute gradient: g ← (1/m) ∇_θ̃ Σ_i L(f(x^{(i)}; θ̃), y^{(i)})
    Accumulate gradient: r ← ρr + (1 − ρ) g ⊙ g
    Compute velocity update: v ← αv − (ε / √r) ⊙ g  (1/√r applied element-wise)
    Apply update: θ ← θ + v
  end while
Approximate Second-Order Methods
This section explores the use of second-order methods for training deep networks; see LeCun et al. (1998a) for an early treatment. For simplicity of exposition, the only objective function we examine is the empirical risk.
However, the methods we discuss here extend readily to more general objective functions that, for instance, include parameter regularization terms such as those discussed in chapter 7.
In section 4.3, we introduced second-order gradient methods, which use second derivatives to improve optimization over first-order methods. The most widely used second-order method is Newton's method, which we now examine more closely, with emphasis on its application to neural network training.
Algorithm 8.7 The Adam algorithm
Require: Step size ε (Suggested default: 0.001)
Require: Exponential decay rates for moment estimates, ρ₁ and ρ₂ in [0, 1). (Suggested defaults: 0.9 and 0.999 respectively)
Require: Small constant δ used for numerical stabilization (Suggested default: 10^{-8})
Require: Initial parameters θ
  Initialize 1st and 2nd moment variables s = 0, r = 0
  Initialize time step t = 0
  while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^{(1)}, ..., x^{(m)}} with corresponding targets y^{(i)}.
    Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^{(i)}; θ), y^{(i)})
    t ← t + 1
    Update biased first moment estimate: s ← ρ₁s + (1 − ρ₁) g
    Update biased second moment estimate: r ← ρ₂r + (1 − ρ₂) g ⊙ g
    Correct bias in first moment: ŝ ← s / (1 − ρ₁^t)
    Correct bias in second moment: r̂ ← r / (1 − ρ₂^t)
    Compute update: Δθ = −ε ŝ / (√r̂ + δ)  (operations applied element-wise)
    Apply update: θ ← θ + Δθ
  end while
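A minimal NumPy sketch of a single Adam step follows; the helper name adam_step and the loop around it are illustrative assumptions, not part of the text.

```python
import numpy as np

def adam_step(theta, grad, s, r, t, lr=1e-3, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update: biased first and second moment estimates with bias correction."""
    t += 1
    s = rho1 * s + (1.0 - rho1) * grad               # biased first moment estimate
    r = rho2 * r + (1.0 - rho2) * grad * grad        # biased second moment estimate
    s_hat = s / (1.0 - rho1 ** t)                    # bias-corrected first moment
    r_hat = r / (1.0 - rho2 ** t)                    # bias-corrected second moment
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r, t

# usage sketch: minimize f(theta) = 0.5 * ||theta||^2
theta = np.ones(3)
s, r, t = np.zeros(3), np.zeros(3), 0
for _ in range(200):
    grad = theta                                     # gradient of 0.5 * ||theta||^2
    theta, s, r, t = adam_step(theta, grad, s, r, t)
```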
Newton’s method is an optimization scheme based on using a second-order Taylor series expansion to approximate J(θ) near some point θ₀, ignoring derivatives of higher order:

J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ ∇_θ J(θ₀) + ½ (θ − θ₀)ᵀ H (θ − θ₀),   (8.26)

where H is the Hessian of J with respect to θ evaluated at θ₀. If we then solve for the critical point of this function, we obtain the Newton parameter update rule:

θ* = θ₀ − H⁻¹ ∇_θ J(θ₀).   (8.27)
For a locally quadratic function with a positive definite Hessian, rescaling the gradient by the inverse Hessian lets Newton's method jump directly to the minimum. If the objective function is convex but not quadratic (so there are higher-order terms), this update can be iterated, yielding the training algorithm associated with Newton's method, given in algorithm 8.8.
Algorithm 8.8 Newton’s method with objective J(θ) = (1/m) Σ_{i=1}^{m} L(f(x^{(i)}; θ), y^{(i)})
Require: Initial parameter θ₀
Require: Training set of m examples
  while stopping criterion not met do
    Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^{(i)}; θ), y^{(i)})
    Compute Hessian: H ← (1/m) ∇²_θ Σ_i L(f(x^{(i)}; θ), y^{(i)})
    Compute Hessian inverse: H⁻¹
    Compute update: Δθ = −H⁻¹ g
    Apply update: θ = θ + Δθ
  end while
For surfaces that are not quadratic, Newton's method can be applied iteratively as long as the Hessian remains positive definite. This implies a two-step procedure: first, update or compute the inverse Hessian (by updating the quadratic approximation); second, update the parameters according to equation 8.27.
In section 8.2.3, we highlighted that Newton's method is appropriate only when the Hessian is positive definite. In deep learning, however, the objective function is typically non-convex, with features such as saddle points that are problematic for Newton's method. If the eigenvalues of the Hessian are not all positive, for example near a saddle point, updates may move in the wrong direction. This situation can be avoided by regularizing the Hessian, commonly by adding a constant α along its diagonal, which yields the regularized update

θ* = θ₀ − [H(f(θ₀)) + αI]⁻¹ ∇_θ f(θ₀).   (8.28)
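A minimal sketch of this regularized (damped) Newton step is shown below, assuming the user supplies functions that return the gradient and Hessian; the names regularized_newton_step, grad_fn, hess_fn and the value of alpha are illustrative.

```python
import numpy as np

def regularized_newton_step(theta, grad_fn, hess_fn, alpha=1e-2):
    """One damped Newton update theta <- theta - (H + alpha*I)^{-1} g."""
    g = grad_fn(theta)
    H = hess_fn(theta)
    H_reg = H + alpha * np.eye(len(theta))          # shift all eigenvalues up by alpha
    return theta - np.linalg.solve(H_reg, g)        # solve a linear system, no explicit inverse

# usage sketch on a simple quadratic J(theta) = 0.5 * theta^T A theta
A = np.array([[3.0, 1.0], [1.0, 2.0]])
theta = np.array([5.0, -3.0])
theta = regularized_newton_step(theta,
                                grad_fn=lambda th: A @ th,
                                hess_fn=lambda th: A)
```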
This regularization strategy is used in approximations to Newton's method, such as the Levenberg–Marquardt algorithm, and works fairly well as long as the negative eigenvalues of the Hessian are close to zero. In scenarios with stronger negative curvature, however, α must be made large enough to offset the negative eigenvalues. As α increases, the Hessian becomes dominated by the αI diagonal, and the direction chosen by Newton's method converges to the standard gradient divided by α. When strong negative curvature is present, α may need to be so large that Newton's method takes smaller steps than gradient descent with an appropriately chosen learning rate.
Beyond the challenges posed by features of the objective function such as saddle points, Newton's method is limited by its significant computational burden for training large neural networks: it requires inverting a Hessian whose number of entries scales with the square of the number of parameters. For a network with k parameters (possibly millions), inversion costs O(k³), and because the parameters change at every update, the inverse Hessian must be recomputed at every training iteration. In practice, only networks with a very small number of parameters can be trained this way. In the remainder of this section, we discuss alternatives that attempt to gain some of the advantages of Newton's method while avoiding the computational hurdles.
The method of conjugate gradients efficiently avoids computing the inverse Hessian by iteratively descending along conjugate directions. The inspiration for this approach comes from a careful study of a weakness of the method of steepest descent, which performs line searches along the direction of the gradient. As illustrated in figure 8.6, when applied in a quadratic bowl, steepest descent progresses in an ineffective zig-zag pattern, because each line search direction, given by the gradient, is orthogonal to the previous one.
Suppose the previous search direction is d_{t−1}. At the minimum found by the line search, the directional derivative is zero in that direction: ∇_θ J(θ) · d_{t−1} = 0. Since the gradient at this point defines the new search direction d_t = ∇_θ J(θ), the new direction is orthogonal to d_{t−1}. This orthogonality, illustrated in figure 8.6 over multiple iterations of steepest descent, means that choosing orthogonal descent directions fails to preserve the minimum along the previous search directions, producing the zig-zag pattern of progress: by following the gradient at the end of each line search, we in a sense undo some of the progress made in the previous direction. The method of conjugate gradients seeks to address this problem.
In the conjugate gradient method, we seek a search direction that is conjugate to the previous line search direction, so that it does not undo the progress made in that direction. At training iteration t, the next search direction takes the form

d_t = ∇_θ J(θ) + β_t d_{t−1},   (8.29)

where β_t is a coefficient whose magnitude controls how much of the previous direction d_{t−1} should be added back to the current search direction. (Figure 8.6 depicts the method of steepest descent applied to a quadratic cost surface: each step jumps to the point of lowest cost along the line defined by the gradient at the starting point. This resolves some of the problems of a fixed learning rate, but even with the optimal step size the algorithm makes back-and-forth progress toward the optimum, because at the minimum of the objective along a given direction the gradient at that point is orthogonal to that direction.)
Two directions, d_t and d_{t−1}, are defined as conjugate if d_tᵀ H d_{t−1} = 0, where H is the Hessian matrix.
The straightforward way to impose conjugacy would involve computing the eigenvectors of H to choose β_t, which would not satisfy our goal of developing a method that is computationally cheaper than Newton's method for large problems. Fortunately, the conjugate directions can be computed without these calculations.
Two popular methods for computing β_t are:

1. Fletcher–Reeves: β_t = (g_tᵀ g_t) / (g_{t−1}ᵀ g_{t−1})   (8.30)

2. Polak–Ribière: β_t = ((g_t − g_{t−1})ᵀ g_t) / (g_{t−1}ᵀ g_{t−1})   (8.31)

For a quadratic surface, the conjugate directions ensure that the gradient along the previous direction does not increase in magnitude, so we stay at the minimum along the previous directions. Consequently, in a k-dimensional parameter space, the conjugate gradient method requires at most k line searches to reach the minimum. The conjugate gradient algorithm is given in algorithm 8.9.
Algorithm 8.9 The conjugate gradient method
Require: Training set of m examples
Require: Initial parameters θ₀
  Initialize ρ₀ = 0, g₀ = 0, t = 1
  while stopping criterion not met do
    Compute gradient: g_t ← (1/m) ∇_θ Σ_i L(f(x^{(i)}; θ), y^{(i)})
    Compute β_t = ((g_t − g_{t−1})ᵀ g_t) / (g_{t−1}ᵀ g_{t−1})  (Polak–Ribière)
    (Nonlinear conjugate gradient: optionally reset β_t to zero, for example if t is a multiple of some constant k, such as k = 5)
    Compute search direction: ρ_t = −g_t + β_t ρ_{t−1}
    Perform line search to find: ε* = argmin_ε (1/m) Σ_{i=1}^{m} L(f(x^{(i)}; θ_t + ερ_t), y^{(i)})
    (On a truly quadratic cost function, analytically solve for ε* rather than explicitly searching for it)
    Apply update: θ_{t+1} = θ_t + ε* ρ_t
    t ← t + 1
  end while
Nonlinear conjugate gradients extend the conjugate gradient method, originally designed for quadratic objective functions, to the non-quadratic objectives that arise when training neural networks and other deep models. Without the quadratic assumption, the conjugate directions are no longer guaranteed to remain at the minimum of the objective along the previous directions, so the algorithm is modified: the nonlinear conjugate gradients approach includes occasional resets, in which the method restarts with a line search along the unaltered gradient.
Practitioners report reasonable results when applying the nonlinear conjugate gradients algorithm to train neural networks, though it is often beneficial to begin with a few iterations of stochastic gradient descent before switching to nonlinear conjugate gradients. Although traditionally regarded as a batch method, minibatch versions of conjugate gradients have also been used successfully for training neural networks (Le et al., 2011). Adaptations of conjugate gradients specifically for neural networks were proposed earlier, such as the scaled conjugate gradients algorithm (Moller, 1993).
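The following is a minimal NumPy sketch of the nonlinear conjugate gradient procedure just described (Polak–Ribière β, periodic resets, and a deliberately crude grid-based line search); the function names and the grid of candidate step sizes are illustrative assumptions, not the book's implementation.

```python
import numpy as np

def nonlinear_cg(f, grad_fn, theta, n_steps=50, restart_every=5):
    """Nonlinear conjugate gradient sketch: Polak-Ribiere beta, periodic restarts,
    and a crude grid search in place of a proper line search."""
    grid = np.linspace(0.0, 1.0, 200)[1:]                   # candidate step sizes
    g_prev = grad_fn(theta)
    d = -g_prev                                             # first direction: steepest descent
    for t in range(1, n_steps + 1):
        eps = grid[int(np.argmin([f(theta + e * d) for e in grid]))]  # line search along d
        theta = theta + eps * d
        g = grad_fn(theta)
        beta = g @ (g - g_prev) / (g_prev @ g_prev)         # Polak-Ribiere coefficient
        if t % restart_every == 0:
            beta = 0.0                                      # periodic reset to steepest descent
        d = -g + beta * d
        g_prev = g
    return theta

# usage sketch on a quadratic with its minimum at the origin
A = np.diag([1.0, 10.0])
theta = nonlinear_cg(lambda th: 0.5 * th @ A @ th, lambda th: A @ th,
                     theta=np.array([3.0, -2.0]))
```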
Optimization Strategies and Meta-Algorithms
Many optimization techniques are not exactly algorithms, but rather general templates that can be specialized to yield algorithms, or subroutines that can be incorporated into many different algorithms.
Batch normalization (Ioffe and Szegedy, 2015) is one of the most important recent innovations for training deep neural networks. It is not actually an optimization algorithm; rather, it is a method of adaptive reparametrization, motivated by the difficulty of training very deep models.
Deep models involve compositions of many layers, and the gradient tells us how to update each parameter under the assumption that the other layers do not change. In practice, all layers are updated simultaneously, which can produce unexpected results because the simultaneous changes were computed under the assumption that the other functions remain constant. As a simple example, consider a deep neural network with only one unit per layer and no activation functions, so the output ŷ = x w₁ w₂ ⋯ w_l is a linear function of the input x but a nonlinear function of the weights w_i. Suppose we wish to decrease ŷ slightly. The back-propagation algorithm computes a gradient g = ∇_w ŷ, and we apply the update w ← w − εg. The first-order Taylor series approximation predicts that ŷ will decrease by ε gᵀg, but the actual update also involves higher-order effects, which complicate the learning process.
The new value of ŷ is given by

ŷ = x (w₁ − εg₁)(w₂ − εg₂) ⋯ (w_l − εg_l).   (8.34)
One second-order term arising from this expansion is ε² g₁ g₂ Π_{i=3}^{l} w_i. This term may be negligible if Π_{i=3}^{l} w_i is small, or exponentially large if the weights of layers 3 through l are greater than 1, which makes the learning rate very difficult to choose: the effect of an update to one layer's parameters depends strongly on all of the other layers. Second-order optimization algorithms attempt to mitigate this by accounting for second-order interactions, but in very deep networks even higher-order interactions can be significant. Moreover, second-order methods are expensive and rely on approximations that prevent them from capturing all significant interactions, so building an n-th order optimization algorithm for n > 2 appears impractical.
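A small numerical experiment illustrates the point. The sketch below, under illustrative assumptions (a ten-layer chain of scalar weights, a step size ε of 0.1, and invented variable names), compares the decrease in ŷ predicted by the first-order approximation with the decrease actually obtained when all weights are updated at once.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1.0
w = rng.uniform(0.9, 1.6, size=10)           # weights of a 10-layer linear chain
eps = 0.1

y = x * np.prod(w)
g = np.array([x * np.prod(np.delete(w, i)) for i in range(len(w))])  # d y-hat / d w_i

predicted_drop = eps * g @ g                  # first-order Taylor prediction
y_new = x * np.prod(w - eps * g)              # all layers updated simultaneously
actual_drop = y - y_new

print(predicted_drop, actual_drop)            # higher-order terms make these differ substantially
```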
Batch normalization provides an elegant way of reparametrizing almost any deep network. The reparametrization significantly reduces the problem of coordinating updates across many layers, and it can be applied to any input or hidden layer. Let H be a minibatch of activations of the layer to normalize, arranged as a design matrix with the activations for each example appearing in a row.
To normalize H, we replace it with

H′ = (H − μ) / σ,

where μ is a vector containing the mean of each unit and σ is a vector containing the standard deviation of each unit. The arithmetic here is based on broadcasting the vectors μ and σ to apply to every row of the matrix H: each element H_{i,j} is normalized by subtracting μ_j and dividing by σ_j. The rest of the network then operates on H′ in exactly the same way that the original network operated on H.
Batch normalization adds a small positive value δ (for example, 10⁻⁸) so that the gradient remains defined even when the variance is zero, and back-propagation is applied through the computation of the mean and standard deviation used to normalize H. This means the gradient will never propose an operation that acts simply to increase the mean or standard deviation of a unit; the normalization removes the effect of such an action and zeroes out its component in the gradient. Earlier approaches either added penalties to the cost function to encourage normalized activation statistics or renormalized the statistics after every gradient descent step, which left the normalization imperfect or wasted time undoing gradient steps. Batch normalization instead reparametrizes the model so that certain units are always standardized by definition, resolving both problems.
At test time, μ and σ may be replaced by running averages collected during training, allowing the model to be evaluated on a single example without needing definitions of μ and σ that depend on an entire minibatch.
Batch normalization addresses the difficulty of the linear-chain example by normalizing the output of the lower layers to have zero mean and unit variance. If the input x is drawn from a unit Gaussian, the normalized output of the chain remains Gaussian (the transformation is linear), so the final layer can learn the output as a simple linear function, with little influence from the lower layers. In this simplified setting the lower layers become essentially useless, because their effect on the normalized statistics is removed. In a deep network with nonlinear activation functions, however, the lower layers can still perform meaningful transformations. Batch normalization stabilizes learning by standardizing only the mean and variance of each unit, while allowing the relationships between units and the nonlinear statistics of a single unit to change.
Removing all linear relationships between units within a layer can be even more beneficial, as suggested by Desjardins et al. (2015), whose work inspired batch normalization. Unfortunately, eliminating all linear interactions is much more expensive than standardizing the mean and standard deviation of each individual unit, so batch normalization remains the most practical approach for current applications.
Normalizing the mean and standard deviation of a unit can reduce the expressive power of the network. To maintain this expressive power, it is common to replace the batch of normalized hidden unit activations H′ with γH′ + β rather than simply H′, where γ and β are learned parameters that allow the new variable to have any mean and standard deviation. At first glance it may seem counterintuitive to set the mean to 0 and then introduce a parameter β that can set it back to any value, but the new parametrization represents the same family of functions as the old one while having much better learning dynamics. In the old parametrization, the mean of H was determined by a complicated interaction among the parameters of all the layers below; in the new parametrization, the mean of γH′ + β is determined solely by β, so it is much easier to learn with gradient descent.
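A minimal NumPy sketch of the forward pass just described appears below; the function name batch_norm_forward, the minibatch shape, and the handling of running averages are illustrative assumptions rather than the reference implementation.

```python
import numpy as np

def batch_norm_forward(H, gamma, beta, delta=1e-8):
    """Batch-normalize a design matrix H (rows = examples, columns = units),
    then apply the learned rescaling gamma * H' + beta."""
    mu = H.mean(axis=0)                                 # per-unit mean over the minibatch
    sigma = np.sqrt(delta + H.var(axis=0))              # per-unit std; delta avoids division by zero
    H_norm = (H - mu) / sigma                           # broadcasting over rows
    return gamma * H_norm + beta, mu, sigma

# usage sketch: a minibatch of 64 examples with 10 hidden units
H = np.random.randn(64, 10) * 3.0 + 5.0
gamma, beta = np.ones(10), np.zeros(10)
out, mu, sigma = batch_norm_forward(H, gamma, beta)
# at test time, mu and sigma would be replaced by running averages collected during training
```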
Neural network layers typically have the form φ(XW + b), where φ is a fixed nonlinear activation function such as the rectified linear transformation. This raises the question of whether the normalization should be applied to the input X or to the transformed value XW + b.
Ioffe and Szegedy (2015) recommend normalizing XW + b and omitting the bias term b, which becomes redundant with the β parameter introduced by the batch normalization reparametrization. The input to a layer is usually the output of a nonlinear activation function, such as the rectified linear function, in a previous layer; its statistics are therefore more non-Gaussian and less amenable to standardization by linear operations.
In convolutional networks, it is important to apply the same normalization at every spatial location within a feature map, so that the statistics of the feature map remain the same regardless of spatial position.
In some cases, it is possible to solve an optimization problem quickly by breaking it into separate pieces. If we minimize f(x) with respect to a single variable x_i, then minimize it with respect to another variable x_j, and so on, repeatedly cycling through all the variables, we are guaranteed to arrive at a (local) minimum. This practice is known as coordinate descent, because we optimize one coordinate at a time. More generally, block coordinate descent refers to minimizing with respect to a subset of the variables simultaneously. The term "coordinate descent" is often used to refer to block coordinate descent as well as the strictly individual coordinate descent.
Coordinate descent makes the most sense when the variables in the optimization problem can be clearly separated into groups that play relatively isolated roles, or when optimizing one group of variables is significantly more efficient than optimizing all of the variables jointly.
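A small sketch of (exact) coordinate descent on a convex quadratic is shown below; the problem instance and the helper name coordinate_descent are illustrative assumptions used only to show the cycling structure.

```python
import numpy as np

def coordinate_descent(A, b, x, n_sweeps=20):
    """Minimize f(x) = 0.5 * x^T A x - b^T x (A positive definite) by exactly
    minimizing over one coordinate at a time, cycling through all coordinates."""
    for _ in range(n_sweeps):
        for i in range(len(x)):
            # with the other coordinates fixed, the optimum over x_i is closed form
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = coordinate_descent(A, b, x=np.zeros(2))     # converges toward A^{-1} b
```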
An example where coordinate descent works well is the sparse coding learning problem, in which the goal is to find a weight matrix W that can linearly decode a matrix of activation values H to reconstruct the training set X. The objective is not convex jointly in H and W, but it is convex in each of them separately, so block coordinate descent that alternates between optimizing H and optimizing W is an effective strategy.
The Convolution Operation
In its most general form, convolution is an operation on two functions of a real-valued argument. To motivate the definition of convolution, we begin with examples of two functions we might use.
Suppose we are tracking the location of a spaceship with a laser sensor. The sensor provides a single real-valued output x(t), the position of the spaceship at time t, so we obtain a different reading from the sensor at each instant in time.
Now suppose the laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship's position, we would like to average several measurements, with more recent measurements weighted more heavily so that they contribute more to the final estimate.
We can do this with a weighting function w(a), where a is the age of a measurement.
If we apply such a weighted average operation at every moment, we obtain a new function s providing a smoothed estimate of the position of the spaceship:

s(t) = ∫ x(a) w(t − a) da.   (9.1)

This operation is called convolution. The convolution operation is typically denoted with an asterisk:

s(t) = (x ∗ w)(t).   (9.2)
In our example, w needs to be a valid probability density function, or the output will not be a weighted average, and w must be zero for all negative arguments, or it will look into the future, which is beyond our capabilities. These limitations are particular to our example, though; in general, convolution is defined for any functions for which the integral is defined, and it may be used for purposes other than taking weighted averages.
In convolutional network terminology, the first argument of the convolution (here, the function x) is referred to as the input, and the second argument (here, the function w) as the kernel. The output is commonly called the feature map.
In practical applications, a laser sensor that provides measurements at every instant in time is unrealistic; data is typically collected at discrete intervals. A more realistic assumption is that the laser provides a measurement once per second, so the time index t takes on only integer values. If we assume that x and w are defined only on integer t, we can define the discrete convolution:

s(t) = (x ∗ w)(t) = Σ_{a=−∞}^{∞} x(a) w(t − a).   (9.3)

In machine learning applications, the input is usually a multidimensional array of data and the kernel is usually a multidimensional array of parameters adapted by the learning algorithm; we refer to these multidimensional arrays as tensors. Because each element of the input and kernel must be stored explicitly, we assume that these functions are zero everywhere except at the finite set of points for which we store values, so in practice the infinite summation can be implemented as a finite summation over the relevant array elements.
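The sketch below implements the finite form of equation 9.3 directly and checks it against NumPy's built-in routine; the function name discrete_convolution_1d and the example arrays are illustrative.

```python
import numpy as np

def discrete_convolution_1d(x, w):
    """Direct implementation of s[t] = sum_a x[a] * w[t - a] for finite arrays,
    returning the 'full' output of length len(x) + len(w) - 1."""
    s = np.zeros(len(x) + len(w) - 1)
    for t in range(len(s)):
        for a in range(len(x)):
            if 0 <= t - a < len(w):
                s[t] += x[a] * w[t - a]
    return s

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.25, 0.25])                 # a small smoothing kernel
print(discrete_convolution_1d(x, w))
print(np.convolve(x, w))                        # matches NumPy's convolution
```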
We often use convolutions over more than one axis at a time. For example, if we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n).   (9.4)

Convolution is commutative, meaning we can equivalently write:

S(i, j) = (K ∗ I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n).   (9.5)

Usually the latter formula is more straightforward to implement in a machine learning library, because there is less variation in the range of valid values of m and n.
The commutative property arises because we have flipped the kernel relative to the input: as the index into the input increases, the index into the kernel decreases. The only reason to flip the kernel is to obtain the commutative property, which is useful for writing proofs but usually not important in a neural network implementation. Instead, many neural network libraries implement a related function called cross-correlation, which is the same as convolution but without flipping the kernel.
Many machine learning libraries implement cross-correlation but call it convolution.
In this book, we follow the convention of calling both operations convolution and specifying whether the kernel is flipped in contexts where the distinction is relevant. In machine learning, the learning algorithm will learn the appropriate values of the kernel in the appropriate place, so an algorithm based on convolution with kernel flipping will simply learn a kernel that is flipped relative to the kernel learned without flipping. Moreover, convolution is rarely used alone in machine learning; it is typically used simultaneously with other functions, and the combination of these functions does not commute regardless of whether the convolution operation flips its kernel.
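The following sketch makes the distinction concrete: a "valid" 2-D cross-correlation (convolution without flipping, as most libraries implement it) and true convolution obtained by flipping the kernel first. The function names and the example image are illustrative assumptions.

```python
import numpy as np

def cross_correlate_2d(image, kernel):
    """'Valid' 2-D cross-correlation: slide the kernel over the image without flipping it."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve_2d(image, kernel):
    """True convolution: flip the kernel along both axes, then cross-correlate."""
    return cross_correlate_2d(image, kernel[::-1, ::-1])

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
print(cross_correlate_2d(image, kernel))
print(convolve_2d(image, kernel))               # differs unless the kernel is symmetric
```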
See figure 9.1 for an example of convolution (without kernel flipping) applied to a 2-D tensor.
Discrete convolution can be viewed as multiplication by a matrix in which several entries are constrained to be equal to other entries: a Toeplitz matrix in the univariate case, and a doubly block circulant matrix in two dimensions. The matrix is also typically very sparse, because the kernel is usually much smaller than the input image. Any neural network algorithm that works with matrix multiplication and does not depend on specific properties of the matrix structure should therefore work with convolution without further changes, though convolutional networks in practice use additional specializations to handle large inputs efficiently.
Figure 9.1 shows an example of 2-D convolution without kernel flipping. The output is restricted to positions where the kernel lies entirely within the image, sometimes called "valid" convolution. Boxes with arrows indicate how the upper-left element of the output tensor is formed by applying the kernel to the corresponding upper-left region of the input tensor.
Motivation
Convolution leverages three important ideas that can improve a machine learning system: sparse interactions, parameter sharing, and equivariant representations. Moreover, convolution provides a means of working with inputs of variable size. We now describe each of these ideas in turn.
Traditional neural network layers use matrix multiplication with a separate parameter describing the interaction between each input unit and each output unit, so every output unit interacts with every input unit. Convolutional networks, by contrast, have sparse interactions, accomplished by making the kernel smaller than the input; this allows them to detect small, meaningful features such as edges with only tens or hundreds of parameters, which reduces memory requirements, improves statistical efficiency, and requires fewer operations to compute the output. If there are m inputs and n outputs, matrix multiplication requires m × n parameters and O(m × n) runtime per example, whereas limiting each output to k connections requires only k × n parameters and O(k × n) runtime, and good performance is often possible with k several orders of magnitude smaller than m. In a deep convolutional network, units in the deeper layers can still indirectly interact with a larger portion of the input, allowing the network to efficiently describe complicated interactions between many variables by building them from simple, sparse interactions.
Parameter sharing refers to using the same parameter for more than one function in a model. In a traditional neural network, each element of the weight matrix is used exactly once when computing the output of a layer; in a convolutional network, each member of the kernel is used at many positions of the input. Parameter sharing is sometimes described as the network having tied weights, because the value of the weight applied to one input is tied to the value of a weight applied elsewhere. Sharing the kernel across positions means the same features can be extracted everywhere in the input without a separate set of parameters for every location, making learning more efficient and further reducing the number of parameters.
Figure 9.2 illustrates sparse connectivity viewed from below: we highlight one input unit, x₃, and the output units in s that it affects. When s is formed by convolution with a kernel of width 3, only three output units are affected by x₃.
(Bottom) When s is formed by matrix multiplication, connectivity is no longer sparse, so all of the outputs are affected by x₃.
Figure 9.3 illustrates sparse connectivity viewed from above: we highlight one output unit, s₃, and the input units in x that affect it; these units are known as the receptive field of s₃. When s is formed by convolution with a kernel of width 3, only three input units affect s₃. When s is formed by matrix multiplication, connectivity is no longer sparse, so all of the inputs affect s₃.
The receptive field of the units in the deeper layers of a convolutional network is larger than the receptive field of the units in the shallow layers, an effect that is strengthened by architectural features such as strided convolution and pooling. This means that even though direct connections in a convolutional network are very sparse, units in the deeper layers can be indirectly connected to all or most of the input image.
As illustrated by the black arrows in figure 9.5, parameter sharing means that a single parameter is used at all input locations, whereas in a fully connected model each parameter is used only once, which results in higher storage demands. Parameter sharing does not change the runtime of forward propagation, which is still O(k × n), but it reduces the storage requirements of the model to k parameters. Since k is usually several orders of magnitude smaller than m, and m and n are usually of roughly the same size, k is practically insignificant compared to m × n. Convolution is thus dramatically more efficient than dense matrix multiplication in terms of both memory requirements and statistical efficiency. See figure 9.5 for a graphical depiction of how parameter sharing works.
Figure 9.6 illustrates the significant efficiency gains achieved in edge detection within an image through the application of sparse connectivity and parameter sharing principles.
In the case of convolution, the particular form of parameter sharing causes the layer to have a property called equivariance to translation. To say a function is equivariant means that if the input changes, the output changes in the same way; formally, a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)).
Convolution is equivariant to any function that translates the input. For example, if g shifts an image I one unit to the right, then applying convolution to I and then shifting the result gives the same output as shifting I first and then applying convolution. With time-series data, this means that if we move an event later in time, the exact same representation of it will appear in the output, just later. Similarly with images, convolution creates a 2-D map of where certain features appear in the input, so a feature detector such as an edge detector works consistently across the entire image. In some cases, however, we may not wish to share parameters across the entire input; for instance, when processing images cropped to be centered on a face, we probably want to extract different features at different locations, with the part of the network processing the top of the face looking for eyebrows and the part processing the bottom looking for a chin.
Convolution is not naturally equivariant to some other transformations, such as changes in the scale or rotation of an image; other mechanisms are necessary for handling these kinds of transformations.
Finally, some kinds of data cannot be processed by neural networks defined by matrix multiplication with a fixed-shape matrix. Convolution enables processing of some of these kinds of data; we discuss this further in section 9.7.
Pooling
A typical layer of a convolutional network consists of three stages (see figure 9.7).
In the first stage, the layer performs several convolutions in parallel to produce a set of linear activations. In the second stage, each linear activation is run through a nonlinear activation function, such as the rectified linear activation function; this stage is sometimes called the detector stage. In the third stage, a pooling function is used to modify the output of the layer further.
A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs. For example, the max pooling operation (Zhou and Chellappa, 1988) reports the maximum output within a rectangular neighborhood.
Figure 9.6 illustrates the efficiency of edge detection: the output image is formed by subtracting the value of each pixel's left neighbor, highlighting vertically oriented edges that are useful for object detection. The input image is 320 pixels wide and 280 pixels tall, while the output image is 319 pixels wide. This transformation can be described by a convolution kernel containing two elements and requires 267,960 floating point operations, whereas describing the same transformation with matrix multiplication would require over eight billion matrix entries and about sixteen billion floating point operations, making convolution roughly 60,000 times more efficient; even a sparse matrix storing only the nonzero entries would be far larger. Convolution is thus an extremely efficient way of describing transformations that apply the same linear transformation of a small local region across the entire input.
Figure 9.7: the components of a typical convolutional network layer (convolution stage, detector stage with a nonlinearity such as the rectified linear function, and pooling stage), labeled using both the complex-layer and the simple-layer terminologies.
A typical convolutional neural network layer consists of several components, described using two common terminologies One approach views the network as having a few complex layers, each containing multiple "stages." In this framework, there is a direct correspondence between kernel tensors and network layers, which is the terminology predominantly used in this book.
In the other terminology, the network is viewed as a larger number of simple layers, with every step of processing regarded as a layer in its own right; in this view, not every layer has parameters. Among pooling functions, the maximum within a rectangular neighborhood is commonly used, along with alternatives such as the average, the L² norm, and a weighted average based on the distance from the central pixel.
In all cases, pooling helps make the representation approximately invariant to small translations of the input: if the input is shifted by a small amount, the values of most of the pooled outputs do not change. Invariance to local translation is useful when we care more about whether a feature is present than exactly where it is. For example, when determining whether an image contains a face, we need not know the locations of the eyes with pixel-perfect accuracy; we just need to know that there is an eye on each side of the face. In other contexts, preserving the location of a feature is crucial; for example, to find a corner defined by two edges meeting at a specific orientation, we must preserve the locations of the edges well enough to test whether they meet.
The use of pooling can be viewed as adding an infinitely strong prior that the function the layer learns must be invariant to small translations. When this assumption is correct, it can greatly improve the statistical efficiency of the network.
Pooling across spatial regions enhances translation invariance; however, pooling the outputs of independently parameterized convolutions allows features to adaptively learn which transformations to become invariant to.
Because pooling summarizes the responses over a whole neighborhood, it is possible to use fewer pooling units than detector units by reporting summary statistics for pooling regions spaced k pixels apart rather than one pixel apart. This improves the computational efficiency of the network, because the next layer has roughly k times fewer inputs to process. When the number of parameters in the next layer is a function of its input size (for example, when the next layer is fully connected and based on matrix multiplication), this reduction in input size can also improve statistical efficiency and reduce memory requirements for storing the parameters.
For many tasks, pooling is essential for handling inputs of varying size. For example, to classify images of variable size, the input to the classification layer must have a fixed size. This is usually accomplished by varying the size of the offset between pooling regions so that the classification layer always receives the same number of summary statistics regardless of the input size. For instance, the final pooling layer of the network may be defined to output four sets of summary statistics, one for each quadrant of an image, regardless of the image dimensions.
Max pooling illustrates how invariance arises: when the outputs of a convolutional layer's nonlinearity are pooled with a stride of one pixel between pooling regions and a pooling width of three pixels, shifting the input by one pixel changes only half of the pooled outputs, because max pooling reports only the strongest activation in each neighborhood rather than its exact location.
Figure: a pooling unit gives a large response whenever any of its detector units (for example, detector unit 1 or detector unit 3) gives a large response.
A pooling unit that pools over multiple features learned with separate parameters can learn to be invariant to transformations of the input. For example, a set of three learned filters may each detect a handwritten 5 at a different orientation. When a 5 appears in the input, the corresponding filter matches it and causes a large activation in its detector unit; the max pooling unit then has a large activation regardless of which detector unit was activated. In this way, different inputs have a similar effect on the pooling unit, a principle leveraged in maxout networks (Goodfellow et al., 2013a) and other convolutional networks. Max pooling over spatial positions is naturally invariant to translation; this multi-channel approach is only necessary for learning other transformations.
Max-pooling with a pool width of three and a stride between pools of two reduces the representation size by a factor of two, which reduces the computational and statistical burden on the next layer. Note that the rightmost pooling region has a smaller size but must be included if we do not want to ignore some of the detector units.
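A minimal 1-D sketch of this downsampling pooling is shown below; the function name max_pool_1d and the example array are illustrative assumptions.

```python
import numpy as np

def max_pool_1d(x, width=3, stride=2):
    """Max pooling over a 1-D array of detector-unit outputs; the last pooling
    region may be smaller so that no detector units are ignored."""
    out = []
    for start in range(0, len(x), stride):
        out.append(np.max(x[start:start + width]))
    return np.array(out)

x = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.3, 0.9, 0.4])
print(max_pool_1d(x))            # roughly halves the representation size
```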
Some theoretical work gives guidance as to which kinds of pooling to use in various situations (Boureau et al., 2010). It is also possible to pool features dynamically, for example by running a clustering algorithm on the locations of interesting features, yielding a different set of pooling regions for each image (Boureau et al., 2011), or to learn a single pooling structure that is then applied to all images (Jia et al., 2012).
Pooling can complicate some kinds of neural network architectures that use top-down information, such as Boltzmann machines and autoencoders. These issues are discussed further in part III of this book: pooling in convolutional Boltzmann machines is covered in section 20.6, and the inverse-like operations on pooling units needed in some differentiable networks are covered in section 20.10.6.
Some examples of complete convolutional network architectures for classification using convolution and pooling are shown in figure 9.11.
Convolution and Pooling as an Infinitely Strong Prior
Recall the concept of a prior probability distribution from section 5.2: a probability distribution over the parameters of a model that encodes our beliefs about what models are reasonable before we have seen any data.
Priors can be considered weak or strong depending on how concentrated their probability density is. A weak prior has high entropy, such as a Gaussian distribution with high variance, and allows the data to move the parameters more or less freely. A strong prior has very low entropy, such as a Gaussian distribution with low variance, and plays a more active role in determining where the parameters end up.
An infinitely strong prior assigns a zero probability to certain parameters, deeming them completely forbidden, irrespective of the data's support for those values.
A convolutional network can be imagined as a fully connected network with an infinitely strong prior over its weights. This prior says that the weights for one hidden unit must be identical to the weights of its neighbor, but shifted in space, and that the weights must be zero outside the small, spatially contiguous receptive field assigned to that hidden unit. Overall, convolution thus introduces an infinitely strong prior probability distribution over the parameters of a layer.
Figure 9.11 annotates each architecture with the output of every operation, such as pooling with stride four, pooling to a 3×3 grid (3×3×64), reshaping to a vector, and matrix multiplication producing 1,000 units.
Figure 9.11 shows examples of architectures for classification with convolutional networks. The specific strides and depths depicted are not recommended for real use; they are deliberately shallow to fit onto the page. Real convolutional networks also often involve a significant amount of branching, unlike the simplified chain structures shown. On the left, the figure presents a convolutional network that processes images of a fixed size.
After a few alternating layers of convolution and pooling, the tensor holding the convolutional feature map is reshaped to flatten out its spatial dimensions, and the rest of the network is an ordinary feedforward classifier, as described in chapter 6.
Another of the networks processes a variable-sized image while retaining a fully connected section: it uses a pooling operation with variably sized pools but a fixed number of pools, in order to provide a consistent vector of 576 units to the fully connected layer. The remaining network has no fully connected weight layers; instead, the last convolutional layer outputs one feature map per class, indicating how likely each class is to occur at each spatial location, and averaging each feature map down to a single value provides the argument to the softmax classifier. Returning to the prior imposed by convolution: it says that the function the layer should learn contains only local interactions and is equivariant to translation, and likewise the use of pooling is an infinitely strong prior that each unit should be invariant to small translations.
While treating a convolutional neural network as a fully connected network with an infinitely strong prior may seem computationally inefficient, this perspective can provide valuable insights into the functioning of convolutional networks.
Convolution and pooling can cause underfitting when the assumptions behind them are not accurate. If a task relies on preserving precise spatial information, using pooling on all features can increase the training error. Some convolutional network architectures (Szegedy et al., 2014) therefore apply pooling only to selected channels, in order to obtain both highly invariant features and features that will not underfit when the translation invariance assumption fails. Likewise, when a task requires incorporating information from very distant locations in the input, the prior imposed by convolution may be inappropriate.
Another key insight is that convolutional models should only be compared with other convolutional models in benchmarks of statistical learning performance. Models that do not use convolution can learn even if all the pixels in the image are permuted. For many image datasets, there are separate benchmarks for models that are permutation invariant and must discover spatial relationships via learning, and for models that have the knowledge of spatial relationships hard-coded by their designers.
Variants of the Basic Convolution Function
When discussing convolution in the context of neural networks, we do not usually refer exactly to the standard discrete convolution operation as defined in the mathematical literature. This section describes how the functions used in practice differ slightly, and highlights some useful properties of these functions.
First, when we refer to convolution in neural networks, we usually mean an operation that consists of many applications of convolution in parallel: convolution with a single kernel can extract only one kind of feature, albeit at many spatial locations, and we usually want each layer of the network to extract many kinds of features at many locations.
Additionally, the input is usually not just a grid of real values but a grid of vector-valued observations, such as the red, green, and blue intensities at each pixel of a color image. In a multilayer convolutional network, the input to the second layer is the output of the first layer, which typically has the output of many different convolutions at each position. When working with images, we usually think of the input and output of the convolution as 3-D tensors, with one index into the different channels and two indices into the spatial coordinates of each channel. Software implementations usually work in batch mode, using 4-D tensors with the fourth axis indexing different examples in the batch, but we omit the batch axis here for simplicity.
Because convolutional networks usually use multi-channel convolution, the linear operations they are based on are not guaranteed to be commutative, even with kernel-flipping. These multi-channel operations are commutative only when each operation has the same number of output channels as input channels.
Assume we have a 4-D kernel tensor K with element K_{i,j,k,l} giving the connection strength between a unit in channel i of the output and a unit in channel j of the input, with an offset of k rows and l columns between the output unit and the input unit. Assume our input consists of observed data V with element V_{i,j,k} giving the value of the input unit within channel i at row j and column k, and that our output Z has the same format as V. If Z is produced by convolving K across V without flipping K, then

Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} K_{i,l,m,n},   (9.7)

where the summation over l, m and n is over all values for which the tensor indexing operations inside the summation are valid. In linear algebra notation, indices start at 1, which explains the −1 in the formula; in programming languages such as C and Python, which use zero-based indexing, the expression is even simpler.
We may want to skip over some positions of the kernel in order to reduce the computational cost, at the expense of not extracting our features as finely. We can think of this as downsampling the output of the full convolution function. If we want to sample only every s pixels in each direction in the output, we can define a downsampled convolution function c such that

Z_{i,j,k} = c(K, V, s)_{i,j,k} = Σ_{l,m,n} V_{l, (j−1)×s+m, (k−1)×s+n} K_{i,l,m,n}.   (9.8)
We refer to s as the stride of this downsampled convolution. It is also possible to define a separate stride for each direction of motion. See figure 9.12 for an illustration, and the sketch below for a direct implementation.
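The following sketch implements multi-channel convolution without kernel flipping, with a stride, in the zero-based-index form of equations 9.7–9.8; the function name multichannel_conv and the tensor shapes are illustrative assumptions.

```python
import numpy as np

def multichannel_conv(V, K, stride=1):
    """Multi-channel 'valid' convolution without kernel flipping (cross-correlation).
    V: input tensor (in_channels, rows, cols)
    K: kernel tensor (out_channels, in_channels, k_rows, k_cols)"""
    in_ch, R, C = V.shape
    out_ch, _, kr, kc = K.shape
    out_rows = (R - kr) // stride + 1
    out_cols = (C - kc) // stride + 1
    Z = np.zeros((out_ch, out_rows, out_cols))
    for i in range(out_ch):
        for j in range(out_rows):
            for k in range(out_cols):
                patch = V[:, j * stride:j * stride + kr, k * stride:k * stride + kc]
                Z[i, j, k] = np.sum(patch * K[i])      # sum over input channels and offsets
    return Z

V = np.random.randn(3, 8, 8)                 # e.g., an RGB image
K = np.random.randn(4, 3, 3, 3)              # 4 output channels, 3x3 kernels
print(multichannel_conv(V, K, stride=2).shape)   # (4, 3, 3)
```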
One essential feature of any convolutional network implementation is the ability to implicitly zero-pad the input V in order to make it wider. Without this feature, the width of the representation shrinks by one pixel less than the kernel width at each layer. Zero padding allows us to control the kernel width and the size of the output independently; without it, we are forced to choose between shrinking the spatial extent of the network rapidly and using small kernels, both of which significantly limit the expressive power of the network.
Three special cases of the zero-padding setting are worth mentioning. The first is valid convolution, in which no zero-padding is used and the kernel is only allowed to visit positions where it is contained entirely within the image; if the input has width m and the kernel has width k, the output has width m − k + 1, a shrinkage that occurs at every layer and so limits the number of convolutional layers the network can contain. The second is same convolution, in which just enough zero-padding is added to keep the output the same size as the input; the network can then contain as many convolutional layers as the hardware can support, but the input pixels near the border influence fewer output pixels than those near the center, so the border pixels are somewhat underrepresented in the model. The third is full convolution, in which enough zeros are added for every pixel to be visited k times in each direction, yielding an output of width m + k − 1; here the output pixels near the border are a function of fewer input pixels, which can make it difficult to learn a single kernel that performs well at all positions of the feature map. Usually the optimal amount of zero padding, in terms of test set classification accuracy, lies somewhere between valid and same convolution.
Figure 9.12: Convolution with a stride. In this example, we use a stride of two.
Convolution with a stride length of two can be implemented in a single operation, which is more efficient than the mathematically equivalent two-step approach of performing convolution with a unit stride followed by downsampling, since the two-step approach computes many values that are ultimately discarded.
Consider a convolutional network with a kernel of width six at every layer and no pooling. Without implicit zero padding, the representation shrinks by five pixels at each layer: starting from an input of sixteen pixels, only three convolutional layers are possible, and only two of them function effectively as convolutional layers. Using smaller kernels reduces the rate of shrinking but compromises expressiveness. By contrast, adding five implicit zeros at each layer prevents the representation from shrinking with depth, enabling an arbitrarily deep convolutional network.
In some cases, we do not actually want to use convolution but rather locally connected layers (LeCun, 1986, 1989). In this case, the adjacency matrix of the MLP is the same, but every connection has its own weight, specified by a 6-D tensor W. The indices into W are, respectively: i, the output channel; j, the output row; k, the output column; l, the input channel; m, the row offset within the input; and n, the column offset within the input. The linear part of a locally connected layer is then given by

Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} w_{i,j,k,l,m,n}.   (9.9)
This is sometimes also called unshared convolution, because it is a similar operation to discrete convolution with a small kernel but without sharing parameters across locations. Figure 9.14 compares local connections, convolution, and full connections.
Locally connected layers are useful when we know that each feature should be a function of a small part of space, but there is no reason to think the same feature should occur across all of space. For example, to determine whether an image is a picture of a face, we only need to look for the mouth in the bottom half of the image.
It can also be useful to make versions of convolutional or locally connected layers in which the connectivity is further restricted, for example by constraining each output channel to be a function of only a subset of the input channels: the first m output channels may connect only to the first n input channels, the second m output channels to the next n input channels, and so on. This approach, illustrated in figure 9.15, allows the model to capture interactions among only a few channels, which reduces the number of parameters, lowering memory consumption and improving statistical efficiency, and also reduces the amount of computation needed for forward and back-propagation, all without reducing the number of hidden units.
Structured Outputs
Convolutional networks can be used to output a high-dimensional, structured object, rather than just predicting a class label for a classification task or a real value for a regression task. Typically this object is simply a tensor emitted by a standard convolutional layer, such as a tensor S.
Here S_{i,j,k} is the probability that pixel (j, k) of the input image belongs to class i. This allows the model to label every pixel in an image and draw precise masks that follow the outlines of individual objects.
One issue that often comes up is that the output plane can be smaller than the input plane.
Figure 9.17 shows a recurrent convolutional network for pixel labeling. The input is an image tensor, with axes corresponding to image rows, columns, and color channels (red, green, blue). The goal is to output a tensor of labels with a probability distribution over labels for each pixel. Rather than outputting the labels in a single pass, the recurrent network iteratively refines its estimate by using the previous estimate as input for creating a new estimate, with the same parameters shared across every step. One convolution kernel tensor is used to compute the hidden representations, and another kernel tensor is used to produce the label estimates from the hidden values; on the first step, the input to the hidden layer is taken to be zero. To keep the output close to the size of the input, pooling layers with large strides are typically avoided, or a lower-resolution grid of labels is emitted instead.
One strategy for pixel-wise labeling of images is thus to produce an initial guess of the image labels and then refine this guess using the interactions between neighboring pixels. Repeating this refinement step corresponds to using the same convolutions at each stage, sharing weights between the last layers of the deep network (Jain et al., 2007), which makes the successive convolutional layers a particular kind of recurrent network (Pinheiro and Collobert, 2014, 2015). Figure 9.17 shows the architecture of such a recurrent convolutional network.
Once a prediction for each pixel is made, various methods can be used to further process these predictions in order to obtain a segmentation of the image into regions. The general idea is that large groups of contiguous pixels tend to share the same label. Graphical models can describe these probabilistic relationships between neighboring pixels; alternatively, the convolutional network can be trained to maximize an approximation of the training objective of such a graphical model.
Data Types
The data used with a convolutional network usually consist of several channels, each channel being the observation of a different quantity at some point in space or time. See table 9.1 for examples of data types with different dimensionalities and numbers of channels.
For an example of convolutional networks applied to video, see Chen et al.
One advantage of convolutional networks is that they can process inputs with varying spatial extents, something that traditional matrix-multiplication-based neural networks cannot do. This makes convolutional networks attractive even when computational cost and overfitting are not significant concerns.
For example, consider a collection of images in which each image has a different width and height; it is unclear how to model such inputs with a weight matrix of fixed size. Convolution is straightforward to apply: the kernel is simply applied a different number of times depending on the size of the input, and the output of the convolution scales accordingly. Convolution may be viewed as matrix multiplication, with the same convolution kernel inducing a different doubly block circulant matrix for each input size. Sometimes the output of the network is allowed to have variable size as well, for example when assigning a class label to each pixel of the input, in which case no additional design work is necessary. In other cases, the network must produce a fixed-size output, for example a single class label for the entire image; then some extra design steps are needed, such as inserting a pooling layer whose pooling regions scale with the size of the input, in order to maintain a fixed number of pooled outputs.
1-D
  Single channel — Audio waveform: the axis we convolve over corresponds to time. We discretize time and measure the amplitude of the waveform once per time step.
  Multi-channel — Skeleton animation data: animations of 3-D computer-rendered characters are generated by altering the pose of a virtual skeleton over time. At each point in time, the pose is described by the angles of the joints in the character's skeleton, so each channel represents the angle of one joint about one axis.

2-D
  Single channel — Audio data that has been preprocessed with a Fourier transform: the waveform becomes a 2-D tensor with rows corresponding to different frequencies and columns corresponding to different points in time. Convolution over the time axis makes the model invariant to shifts in time, while convolution over the frequency axis makes it invariant to frequency, so the same melody played in a different octave produces the same representation, just at a different height in the network's output.
  Multi-channel — Color image data: one channel contains the red pixels, one the green pixels, and one the blue pixels. The convolution kernel moves over both the horizontal and the vertical axes of the image, conferring translation equivariance in both directions.

3-D
  Single channel — Volumetric data: a common source of this kind of data is medical imaging technology, such as CT scans.
  Multi-channel — Color video data: one axis corresponds to time, one to the height of the video frame, and one to the width of the video frame.

Table 9.1: Examples of different formats of data that can be used with convolutional networks.
Convolution is effective for processing variable-sized inputs only when those inputs contain varying amounts of the same kind of observation, such as recordings of different durations over time or observations over varying widths in space. However, it is not appropriate to use convolution for inputs that have variable size because they contain different kinds of observations. For instance, if the features of a college application include both grades and standardized test scores, but not every applicant took the standardized test, it would not make sense to convolve the same weights over both the grade features and the test score features.
Efficient Convolution Algorithms
Modern convolutional networks frequently involve architectures with over one million units, so robust implementations that exploit parallel computing resources are essential. Additionally, it is often possible to speed up convolution by selecting an appropriate convolution algorithm.
Convolution can be performed efficiently by transforming both the input and the kernel into the frequency domain with a Fourier transform, performing point-wise multiplication of the two resulting signals, and converting back to the time domain with an inverse Fourier transform. For some problem sizes, this can be faster than the naive implementation of discrete convolution.
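A minimal sketch of the idea, using NumPy's FFT routines for 1-D signals (the function name and padding choices are ours; real implementations handle boundaries and multi-dimensional kernels more carefully):

import numpy as np

# Fourier-domain convolution: transform, multiply point-wise, invert, trim.
def fft_convolve(signal, kernel):
    n = len(signal) + len(kernel) - 1          # length of the full linear convolution
    n_fft = int(2 ** np.ceil(np.log2(n)))      # pad to a power of two for speed
    s = np.fft.rfft(signal, n_fft)             # input in the frequency domain
    k = np.fft.rfft(kernel, n_fft)             # kernel in the frequency domain
    return np.fft.irfft(s * k, n_fft)[:n]      # inverse transform back to time domain

x = np.random.randn(4096)
w = np.random.randn(64)
assert np.allclose(fft_convolve(x, w), np.convolve(x, w))  # matches direct convolution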
A separable d-dimensional kernel can be expressed as the outer product of d vectors, one per dimension. In this case, naive convolution is inefficient: performing d separate one-dimensional convolutions with these vectors is significantly faster than executing a single d-dimensional convolution with their outer product. Representing the kernel as vectors also requires fewer parameters.
When the kernel has w elements in each dimension, naive multidimensional convolution requires O(w^d) runtime and parameter storage, whereas separable convolution requires only O(w × d) runtime and storage. Of course, not every convolution can be represented in this way.
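The following sketch illustrates the separable case in two dimensions, applying two 1-D passes whose combined effect equals convolution with the rank-1 kernel formed by their outer product (an illustrative example, not a production implementation):

import numpy as np

# Separable convolution: a rank-1 2-D kernel k = outer(k_col, k_row) can be
# applied as two 1-D passes, one down the columns and one across the rows.
def separable_conv2d(image, k_row, k_col):
    # convolve every column with k_col, then every row of the result with k_row
    out = np.apply_along_axis(lambda c: np.convolve(c, k_col, mode="same"), 0, image)
    return np.apply_along_axis(lambda r: np.convolve(r, k_row, mode="same"), 1, out)

image = np.random.randn(64, 64)
k_col = np.array([1.0, 2.0, 1.0])    # vertical component of the kernel
k_row = np.array([-1.0, 0.0, 1.0])   # horizontal component of the kernel
result = separable_conv2d(image, k_row, k_col)
# The equivalent non-separated 2-D kernel would be np.outer(k_col, k_row).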
Devising faster ways of performing convolution, or approximate convolution that does not harm the accuracy of the model, is an active area of research. Even techniques that improve only the efficiency of forward propagation are valuable, because commercial applications often allocate more resources to deploying a network than to training it.
Random or Unsupervised Features
The most costly aspect of training convolutional networks is learning the features; the output layer is relatively inexpensive because of the small number of input features it receives after pooling. During supervised training with gradient descent, each gradient step requires a complete forward and backward propagation through the entire network. One way to reduce the cost of training is to use features that are not trained in a supervised fashion.
There are three main strategies for obtaining convolution kernels without supervised training: initializing them randomly, designing them by hand to detect specific features like edges, or learning them with an unsupervised criterion. For instance, Coates et al. (2011) applied k-means clustering to small image patches and used each learned centroid as a convolution kernel. This unsupervised learning approach allows the features to be learned separately from the classifier layer and then used to extract features for the whole training set, effectively constructing a new training set for the final layer. Learning the last layer then typically becomes a convex optimization problem, for example when the last layer is logistic regression or an SVM.
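A rough sketch of this kind of unsupervised kernel learning, loosely in the spirit of Coates et al. (2011), is shown below; the patch counts and iteration counts are arbitrary, whitening and the classifier are omitted, and all names are our own:

import numpy as np

# Learn convolution kernels as k-means centroids of random image patches.
def kmeans_kernels(images, patch_size=6, n_kernels=16, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    patches = []
    for _ in range(2000):                                  # sample random patches
        img = images[rng.integers(len(images))]
        i = rng.integers(img.shape[0] - patch_size)
        j = rng.integers(img.shape[1] - patch_size)
        patches.append(img[i:i + patch_size, j:j + patch_size].ravel())
    X = np.array(patches)
    X -= X.mean(axis=1, keepdims=True)                     # remove per-patch means
    centers = X[rng.choice(len(X), n_kernels, replace=False)]
    for _ in range(n_iter):                                # plain Lloyd iterations
        assign = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for k in range(n_kernels):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)
    # Each centroid becomes one convolution kernel.
    return centers.reshape(n_kernels, patch_size, patch_size)

kernels = kmeans_kernels(np.random.randn(10, 32, 32))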
Random filters often work surprisingly well in convolutional networks (Jarrett et al., 2009; Saxe et al., 2011; Pinto et al., 2011; Cox and Pinto, 2011). Saxe et al. (2011) showed that convolutional layers followed by pooling become frequency selective and translation invariant even when assigned random weights. They argue that this provides a cost-effective method for selecting convolutional network architectures: first train only the final layer of several candidate architectures to assess their performance, then take the best-performing architecture and train it fully with a more expensive approach.
An intermediate approach is to learn the features with methods that do not require full forward and back-propagation at every gradient step. As with multilayer perceptrons, greedy layer-wise pretraining can be used: the first layer is trained in isolation, its features are extracted once, and the second layer is then trained on those features, continuing in the same way for subsequent layers. Chapter 8 describes supervised greedy layer-wise pretraining, and part III extends this by incorporating an unsupervised criterion for each layer. The canonical example of this technique applied to convolutional models is the convolutional deep belief network (Lee et al., 2009). Convolutional networks also allow the pretraining strategy to go one step further than is possible with multilayer perceptrons: instead of training an entire convolutional layer at a time, we can train a model of small patches, as Coates et al. (2011) do with k-means clustering.
The parameters of such a patch-based model can then be used to define the kernels of a convolutional layer, making it possible to train a convolutional network with an unsupervised criterion without ever performing convolution during the training phase. This method allows the training of very large models while incurring a high computational cost only at inference time.
From roughly 2007 to 2013, when labeled datasets were small and computational power was limited, this approach to training convolutional networks was popular. Currently, most convolutional networks are trained in a fully supervised fashion, with complete forward and back-propagation through the entire network during each training iteration.
As with other forms of unsupervised pretraining, it remains difficult to identify the specific reasons behind its benefits. Unsupervised pretraining may offer a regularization advantage relative to supervised training, or it may simply enable the training of significantly larger models because of the lower computational cost of the learning rule.
The Neuroscientific Basis for Convolutional Networks
Convolutional networks stand out as a remarkable achievement in the realm of biologically inspired artificial intelligence. While their development has been influenced by many other disciplines, some of the foundational design principles of these neural networks are drawn directly from neuroscience.
The history of convolutional networks is rooted in neuroscientific research, particularly the pioneering work of neurophysiologists David Hubel and Torsten Wiesel on the mammalian vision system. In experiments that eventually earned a Nobel Prize, they recorded the activity of individual neurons in cats to understand how those neurons respond to visual stimuli. They discovered that neurons in the early visual system were highly responsive to specific patterns of light, such as precisely oriented bars, while showing minimal response to other patterns. This foundational knowledge has significantly influenced modern deep learning models.
Their research significantly advanced our understanding of many aspects of brain function that are beyond the scope of this book. For the purposes of deep learning, we adopt a simplified, cartoon view of brain function.
The primary visual cortex, known as V1, is the first area of the brain that performs substantial processing of visual input. Light enters the eye and stimulates the retina, a light-sensitive tissue that performs some simple preprocessing of the image without altering it significantly. The visual signal then travels through the optic nerve and the lateral geniculate nucleus, whose main role is to transmit the information to V1, located at the back of the head.
A convolutional network layer is designed to capture three properties of V1:
1. V1 is organized in a two-dimensional spatial map that reflects the structure of the image on the retina; for example, light impacting the lower half of the retina affects only the corresponding half of V1. Convolutional networks capture this property by defining their features in terms of two-dimensional maps.

2. V1 contains numerous simple cells, each of whose activity can to some extent be characterized by a linear function of the image within a small, localized receptive field. The detector units of a convolutional network are designed to replicate these characteristics of simple cells.

3. V1 also contains numerous complex cells, which respond to features similar to those detected by simple cells but remain invariant to small shifts in the position of the feature. This invariance has inspired the pooling units of convolutional networks. Complex cells are also invariant to some changes in lighting that cannot be captured by pooling over spatial locations alone, which has inspired various cross-channel pooling strategies, such as maxout units (Goodfellow et al., 2013a).
Research indicates that the principles observed in V1 likely extend to other regions of the visual system. The basic strategy of detection followed by pooling is applied repeatedly as information travels deeper into the brain. As we pass through multiple anatomical layers, we eventually find cells that respond to specific concepts while remaining invariant to many transformations of the input. These cells are often nicknamed "grandmother cells" because they are imagined to activate in response to a familiar image, such as a person's grandmother, regardless of the image's orientation, lighting, or distance.
Such cells have indeed been shown to exist in the human brain, specifically within the medial temporal lobe (Quiroga et al., 2005). These neurons respond to images of well-known individuals; a famous example is the "Halle Berry neuron," which activates upon seeing her photograph, a drawing of her, or even text containing her name. This phenomenon is not exclusive to Halle Berry: other neurons respond similarly to other famous figures, such as Bill Clinton and Jennifer Aniston.
Medial temporal lobe neurons are somewhat more general than modern convolutional networks, which cannot identify a person or object merely from seeing its name written out. The closest analog to the final layer of features in a convolutional network is the inferotemporal cortex (IT). Object recognition begins with information flowing from the retina through several brain areas, including the LGN, V1, V2 and V4, reaching IT within roughly the first 100 milliseconds. With prolonged observation, the brain uses top-down feedback to refine activations in lower-level areas. However, if gaze is interrupted and only the initial 100 milliseconds of mostly feedforward activation are analyzed, IT firing rates closely resemble the features of a convolutional network. Convolutional networks can predict IT firing rates well and also perform comparably to humans on time-limited object recognition tasks.
While convolutional networks and the mammalian vision system share similarities, significant differences exist between them. Some of these distinctions are well known to computational neuroscientists but fall beyond the scope of this discussion, and many fundamental questions about how the mammalian vision system works remain unresolved. In brief outline:
• The human eye has low resolution except for a small patch known as the fovea, which perceives an area only about the size of a thumbnail held at arm's length. This creates the illusion of high-resolution vision, because the brain subconsciously combines multiple small glimpses of a scene. While convolutional networks typically process large, full-resolution images, the human brain uses saccades to point the fovea at the most important visual elements. Current research is exploring the integration of attention mechanisms, which have proven effective in natural language processing, into deep visual models. Several visual models incorporating foveation mechanisms have been developed, but they have not yet become the dominant approach.

• The human visual system is integrated with many other senses, such as hearing, and with factors like our moods and thoughts. Convolutional networks so far are purely visual.

• The human visual system does much more than simple object recognition: it enables us to comprehend complex scenes containing many objects and their interrelationships, and it processes rich 3-D geometric information that is essential for our interaction with the environment. Convolutional networks have been applied to some of these problems, but these applications are still in their infancy.

• Even basic brain regions, such as V1, are significantly influenced by feedback from higher cognitive levels. While feedback mechanisms have been studied extensively in neural network models, compelling evidence that such feedback offers substantial improvements is still lacking.

• Feedforward IT firing rates capture much of the same information as convolutional network features, but it is not clear how similar the intermediate computations are. The brain probably uses different activation and pooling functions, and a single linear filter response is likely not an adequate description of an individual neuron's activation. A recent model of V1 involves multiple quadratic filters for each neuron (Rust et al., 2005). This calls into question the traditional distinction between "simple cells" and "complex cells," suggesting that they may in fact be the same kind of cell, with their parameters mediating a continuum of behaviors ranging from "simple" to "complex."
Convolutional Networks and the History of Deep Learning
Convolutional networks have played a prominent role in the history of deep learning, showcasing a successful application of insights from the study of the brain to machine learning. They were among the first deep models to achieve notable success, paving the way for the viability of deep architectures in general. They were also among the first networks to solve important commercial applications: in the 1990s, a research group at AT&T developed a convolutional network for reading checks, which later led to NEC systems processing over 10% of the checks in the US. Microsoft also deployed several OCR and handwriting recognition systems based on convolutional networks. For a more detailed account of these applications and of the evolution of convolutional networks, see chapter 12 and LeCun et al. (2010).
Convolutional networks were also used to win many contests. The current intensity of commercial interest in deep learning began when Krizhevsky et al. (2012) won the ImageNet object recognition challenge, but convolutional networks had been used to win other machine learning and computer vision contests with less impact for years earlier.
Convolutional networks were among the first successful deep networks trained using back-propagation, possibly because their computational efficiency compared to fully connected networks made experimentation and hyperparameter tuning easier. Larger networks also tend to be easier to train, and with modern hardware, fully connected networks now perform well on various tasks, even with datasets and activation functions that were once considered challenging. The initial skepticism surrounding neural networks may also have hindered their adoption, as practitioners were reluctant to invest effort in their development. Regardless, the early success of convolutional networks significantly advanced the field of deep learning and fostered broader acceptance of neural networks.
Convolutional networks specialize neural networks for grid-structured data, excelling in particular at two-dimensional image processing and scaling to large models. For one-dimensional sequential data, recurrent neural networks, described next, offer another powerful specialization of the neural networks framework.
Sequence Modeling: Recurrent and Recursive Nets
Recurrent neural networks (RNNs), introduced by Rumelhart et al. (1986), are specialized neural networks designed for processing sequential data. Unlike convolutional networks, which handle grid-like structures such as images, RNNs operate on sequences of values, enabling them to process much longer and variable-length sequences. This capability allows RNNs to scale efficiently, making them suitable for tasks that require understanding temporal dependencies in data.
Transitioning from multi-layer networks to recurrent networks relies on the idea of parameter sharing, one of the early ideas found in the machine learning and statistical models of the 1980s. Sharing parameters makes it possible to apply the model to examples of different forms, such as differing sequence lengths, and to generalize across them. Without parameter sharing, the model could not handle sequence lengths not seen during training, and it would lack the statistical strength to operate effectively across different positions in time. Such sharing is particularly important when a specific piece of information can appear at multiple positions within a sequence, as illustrated by the sentences "I went to Nepal in 2009" and "In 2009, I traveled to Nepal."
If we ask a machine learning model to extract a specific piece of information, such as the year 2009 from a sentence like "I went to Nepal in 2009," the importance of this design choice becomes clear. A traditional fully connected feedforward network would require distinct parameters for each input feature, necessitating that the rules of the language be learned independently at each position in the sentence. In contrast, a recurrent neural network shares the same weights across multiple time steps, which makes it far better at recognizing the relevant information wherever it appears in the sentence.
The use of convolution across a 1-D temporal sequence forms the foundation of time-delay neural networks (Lang and Hinton, 1988; Waibel et al., 1989; Lang et al., 1990). This convolutional method shares parameters across time, but it is shallow: each member of the output sequence is a function of only a small number of neighboring members of the input, produced by applying the same convolution kernel at each time step. Recurrent networks share parameters in a different way, in which each member of the output is a function of the previous members of the output, produced using the same update rule, resulting in a much deeper computational graph.
Recurrent neural networks process sequences of vectors x(t) with the time step index t ranging from 1 to τ. In practice, RNNs usually operate on minibatches of such sequences, each with a different length; we omit the minibatch indices to simplify notation. The time step index need not refer literally to the passage of time in the real world, but only to the position in the sequence. RNNs can also be applied to two-dimensional spatial data such as images, and when the entire sequence is available before processing, the network may include connections that go backward in time.
This chapter extends the idea of a computational graph to include cycles, which represent the influence of the present value of a variable on its own value at a future time step. Such graphs allow us to define recurrent neural networks. We then describe many different ways to construct, train, and use RNNs.
For more information on recurrent neural networks than is available in this chapter, we refer the reader to the textbook of Graves (2012).
Unfolding Computational Graphs
A computational graph formalizes the relationships between inputs, parameters, outputs, and loss in a set of computations. This section explains how to unfold a recursive or recurrent computation into a repetitive computational graph that corresponds to a chain of events. Unfolding this graph results in the sharing of parameters across a deep network structure.
For example, consider the classical form of a dynamical system: s(t) = f(s(t−1); θ), (10.1) where s(t) is called the state of the system.
Equation 10.1 is recurrent because the definition of s at time t refers back to the same definition at time t−1.
For a finite number of time steps τ, the graph can be unfolded by applying the definition τ−1 times. For example, if we unfold equation 10.1 for τ = 3 time steps, we obtain s(3) = f(s(2); θ) (10.2) = f(f(s(1); θ); θ). (10.3)
By repeatedly applying the definition in this way, we obtain an expression that does not involve recurrence and can therefore be represented by a traditional directed acyclic computational graph. The unfolded computational graphs of equations 10.1 and 10.3 are illustrated in figure 10.1.
Figure 10.1 illustrates the classical dynamical system of equation 10.1 as an unfolded computational graph, where each node represents the state at some time t. The function f maps the state at time t to the state at t + 1, and the same parameters (the same value of θ) are used for all time steps.
As another example, let us consider a dynamical system driven by an external signal x(t): s(t) = f(s(t−1), x(t); θ), (10.4) where we see that the state now contains information about the whole past sequence.
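The unfolding described by equations 10.1 through 10.4 is just repeated application of the same function with the same parameters. The small sketch below (assuming, purely for illustration, that f is an affine map followed by tanh) makes this explicit:

import numpy as np

# Unfolding the recurrence s(t) = f(s(t-1), x(t); theta): the same function f
# with the same parameters theta is applied once per time step.
def f(s_prev, x_t, theta):
    W, U = theta
    return np.tanh(W @ s_prev + U @ x_t)

rng = np.random.default_rng(0)
theta = (rng.standard_normal((3, 3)) * 0.5, rng.standard_normal((3, 2)) * 0.5)
xs = rng.standard_normal((5, 2))      # an input sequence of length tau = 5
s = np.zeros(3)                       # initial state s(0)
for x_t in xs:                        # the unfolded graph: one node per time step
    s = f(s, x_t, theta)
print(s)                              # final state s(tau)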
Recurrent neural networks can be constructed in many different ways, because they encompass a very wide range of functions. Much as almost any function can be represented by a feedforward neural network, essentially any function involving recurrence can be considered a recurrent neural network.
Many recurrent neural networks use an equation of the form of equation 10.5 to define the values of their hidden units. Introducing the variable h to represent the state of the hidden units, we can write h(t) = f(h(t−1), x(t); θ). (10.5) As illustrated in figure 10.2, typical RNN architectures add extra features, such as output layers that read information out of the state h to make predictions.
When a recurrent network is trained to predict the future from the past, it learns to use the vector h(t) as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs. This summary is necessarily lossy, since it condenses an arbitrary-length sequence into a fixed-length representation. Depending on the training objective, the network may prioritize certain aspects of the input sequence over others. For instance, in statistical language modeling, the RNN typically needs to retain only enough information from the previous words to predict the rest of the sentence, rather than storing the entire input sequence.
The most demanding situation is when we ask h(t) to be rich enough to allow one to approximately recover the input sequence, as in autoencoder frameworks (chapter 14).
A recurrent network without outputs simply processes the input information by incorporating it into a state that is carried forward over time. The circuit diagram view illustrates this with a black square representing a one-time-step delay. The same network can also be visualized as an unfolded computational graph, where each node corresponds to one particular time instance.
Equation 10.5 can be illustrated in two distinct ways: as a circuit diagram operating in real time, much like a physical or biological neural network, with one node per physical component, or as an unfolded computational graph, where each component is represented by many variables, one per time step. In the circuit diagram, a black square signifies an interaction that occurs with a one-time-step delay, from the state at time t to the state at time t + 1. Unfolding transforms the circuit representation into a computational graph with one node per time step, so the size of the graph depends on the sequence length.
We can represent the unfolded recurrence after t steps with a function g(t): h(t) = g(t)(x(t), x(t−1), x(t−2), . . . , x(2), x(1)). (10.6)
The function g(t) takes the entire past sequence (x(t), x(t−1), x(t−2), . . . , x(2), x(1)) as input to produce the current state. However, the unfolded recurrent structure allows us to factorize g(t) into repeated application of a single function f. This unfolding process provides two significant advantages:
1. Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of a transition from one state to another, rather than in terms of a variable-length history of states.

2. It is possible to use the same transition function f with the same parameters at every time step.
These two factors make it possible to learn a single model, f, that operates on all time steps and all sequence lengths, eliminating the need for a separate model g(t) for each possible time step. This enables generalization to sequence lengths not present in the training data and, through parameter sharing, greatly reduces the number of training examples required.
The recurrent graph and the unrolled graph each have their uses: the recurrent graph is succinct, while the unrolled graph provides an explicit description of the computations to perform. The unrolled graph also illustrates the flow of information through time, both forward (computing outputs and losses) and backward (computing gradients), by explicitly showing the paths along which this information flows.
Recurrent Neural Networks
Armed with the graph unrolling and parameter sharing ideas of section 10.1, we can design a wide variety of recurrent neural networks.
Figure 10.3: The computational graph to compute the training loss of a recurrent network that maps an input sequence of x values to a corresponding sequence of output o values.
The loss L measures how far each output o is from the corresponding training target y. When using softmax outputs, we assume o gives the unnormalized log probabilities; the loss L internally computes ŷ = softmax(o) and compares this to the target y. The RNN has input-to-hidden connections parametrized by a weight matrix U, hidden-to-hidden recurrent connections parametrized by a weight matrix W, and hidden-to-output connections parametrized by a weight matrix V. Forward propagation in this model is defined by equation 10.8.
(Left) The RNN and its loss drawn with recurrent connections. (Right) The same seen as a time-unfolded computational graph, where each node is now associated with one particular time instance.
Some examples of important design patterns for recurrent neural networks include the following:
• Recurrent networks that produce an output at each time step and have recurrent connections between hidden units, illustrated in figure 10.3.
• Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step, illustrated in figure 10.4.
• Recurrent networks with recurrent connections between hidden units, that read an entire sequence and then produce a single output, illustrated in figure 10.5.
The network of figure 10.3 is a reasonably representative example that we return to throughout most of the chapter.
The recurrent neural network illustrated in figure 10.3 and described by equation 10.8 is universal: any function computable by a Turing machine can be computed by such a network of finite size. The output can be read from the RNN after a number of time steps that scales linearly with both the number of time steps used by the Turing machine and the length of the input (Siegelmann and Sontag, 1991; Siegelmann, 1995; Siegelmann and Sontag, 1995; Hyotyniemi, 1996). The functions computable by a Turing machine are discrete, so these results concern exact implementations of functions rather than approximations. When used as a Turing machine, the RNN accepts a binary sequence as input, and its outputs must be discretized to yield a binary result. Remarkably, a single specific RNN of finite size (Siegelmann and Sontag (1995) use 886 units) can compute all functions in this setting. The input of the Turing machine specifies the function to be computed, so the same RNN that simulates the Turing machine is sufficient for all problems. The theoretical RNN used in the proof can simulate an unbounded stack by representing its activations and weights with rational numbers of unbounded precision.
We now develop the forward propagation equations for the RNN depicted in figure 10.3, assuming the hyperbolic tangent activation function for the hidden units. The figure does not specify the output and loss functions; here we assume that the output is discrete, as when the RNN is used to predict words or characters. A natural way to represent discrete variables is to regard the output o as giving the unnormalized log probabilities of each possible value of the discrete variable. We can then apply the softmax operation to obtain a vector ŷ of normalized probabilities over the output. Forward propagation begins with a specification of the initial state h(0); then, for each time step from t = 1 to t = τ, we apply the following update equations: a(t) = b + W h(t−1) + U x(t) (10.8), h(t) = tanh(a(t)) (10.9), o(t) = c + V h(t) (10.10), ŷ(t) = softmax(o(t)) (10.11), where the parameters are the bias vectors b and c together with the weight matrices U, V and W.
Figure 10.4 shows an RNN whose only recurrence is a feedback connection from the output to the hidden layer, with inputs at each time step denoted x(t), hidden layer activations h(t), outputs o(t), targets y(t) and loss L(t). Unlike the more powerful RNN of figure 10.3, which can place any information it chooses about the past into its hidden representation h, this RNN can only send the output o to future time steps, since there are no direct forward-in-time connections from h. Consequently, unless o is very high-dimensional and rich, it will usually lack important information from the past, making this RNN less powerful but potentially easier to train: each time step can be trained in isolation, allowing greater parallelization during training, as described in section 10.2.1.
Here the weight matrices U, V and W parametrize the input-to-hidden, hidden-to-output and hidden-to-hidden connections, respectively. This is an example of a recurrent network that maps an input sequence to an output sequence of the same length. The total loss for a given sequence of x values paired with a sequence of y values is the sum of the losses over all the time steps. For example, if L(t) is the negative log-likelihood of y(t) given x(1), . . . , x(t), then the total loss is obtained by summing these terms over t.
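To make the forward pass and the summed loss concrete, here is a minimal NumPy sketch following equations 10.8 to 10.11; the shapes and initialization are illustrative, not prescribed by the text:

import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

# Forward propagation for the hidden-to-hidden recurrent network of figure 10.3,
# accumulating the negative log-likelihood over all time steps.
def rnn_forward(xs, ys, W, U, V, b, c):
    h = np.zeros(W.shape[0])          # h(0)
    loss, hs = 0.0, []
    for x_t, y_t in zip(xs, ys):
        a = b + W @ h + U @ x_t       # a(t) = b + W h(t-1) + U x(t)
        h = np.tanh(a)                # h(t)
        o = c + V @ h                 # o(t): unnormalized log probabilities
        y_hat = softmax(o)            # yhat(t)
        loss -= np.log(y_hat[y_t])    # negative log-likelihood of target class y(t)
        hs.append(h)
    return loss, hs

rng = np.random.default_rng(1)
n_in, n_hidden, n_out, tau = 4, 8, 3, 6
params = (rng.standard_normal((n_hidden, n_hidden)) * 0.1,
          rng.standard_normal((n_hidden, n_in)) * 0.1,
          rng.standard_normal((n_out, n_hidden)) * 0.1,
          np.zeros(n_hidden), np.zeros(n_out))
xs = rng.standard_normal((tau, n_in))
ys = rng.integers(n_out, size=tau)    # integer class targets y(t)
total_loss, _ = rnn_forward(xs, ys, *params)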
Here p(y(t) | x(1), . . . , x(t)) is obtained by reading the entry for y(t) from the output vector ŷ(t). Computing the gradient of this loss with respect to the parameters is an expensive operation: it requires a forward propagation pass through the unrolled graph, followed by a backward pass, giving a runtime of O(τ) that cannot be reduced by parallelization because the forward computation is inherently sequential. The memory cost is also O(τ), since the states computed in the forward pass must be retained until they are reused in the backward pass. This procedure, known as back-propagation through time (BPTT), makes the recurrent network with hidden-to-hidden recurrence both powerful and expensive to train, raising the question of whether more efficient alternatives exist.
10.2.1 Teacher Forcing and Networks with Output Recurrence
The network with recurrent connections only from the output at one time step to the hidden units at the next time step is strictly less powerful, because it lacks hidden-to-hidden recurrent connections and therefore cannot simulate a universal Turing machine. In this design, the output units must capture all the information about the past that the network will need to predict the future, which is difficult unless the user knows how to describe the full state of the system and supplies it as part of the training set targets. The advantage of eliminating hidden-to-hidden recurrence, however, is that the time steps become decoupled: training can be parallelized, with the gradient for each step computed in isolation, because the training set provides the ideal value of the previous output directly, so there is no need to compute the model's own earlier outputs first.
A time-unfolded recurrent neural network with a single output at the end of the sequence can be used to summarize the input sequence and produce a fixed-size representation used as input for further processing. With this architecture, there might be a target output at the end of the sequence, or the gradient on the output can be back-propagated from further downstream modules.
Models that have recurrent connections from their outputs leading back into the model may be trained with teacher forcing, a procedure that emerges from the maximum likelihood criterion. In this approach, during training the model receives the ground-truth output y(t) as input at time t + 1. We can see this by examining a sequence over two time steps and writing out the relationship between the model's outputs and inputs across these two steps.
Teacher forcing thus trains the recurrent network by feeding the correct output from the training set back into the model's hidden state computation, so that at each time step the model conditions on the true previous output. When the model is later deployed, the true output is generally not known; in that case, the model's own previous output is fed back as input for the next prediction. The training criterion is the conditional likelihood of the outputs given the inputs.
At time t = 2, the model is trained to maximize the conditional probability of y(2) given both the x sequence so far and the previous y value from the training set. Maximum likelihood training thus specifies that the feedback connections should be fed with the target values rather than the model's own outputs, ensuring that the connections carry the correct expected results, as illustrated in figure 10.6.
We originally motivated teacher forcing as a way to avoid back-propagation through time (BPTT) in models that lack hidden-to-hidden connections. Teacher forcing can also be applied to models that do have such connections, provided there are links from the output at one time step to the computations of the next. However, as soon as the hidden units depend on earlier time steps, BPTT becomes necessary, so some models are trained with both teacher forcing and BPTT.
The disadvantage of strict teacher forcing arises when the network is later used in open-loop mode, with its own outputs fed back as inputs, because the inputs seen during training can then differ significantly from those encountered at test time. One way to mitigate this problem is to train with a combination of teacher-forced inputs and free-running inputs, for example by predicting targets several steps into the future through unfolded recurrent output-to-input paths, so that the network learns to take into account input conditions it generates itself and to map back toward states that produce correct outputs. Another approach, proposed by Bengio et al. (2015), is to randomly choose between generated values and actual data values as inputs, using a curriculum learning strategy to gradually increase the proportion of generated values used as input.
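The sketch below contrasts the two regimes for a network with output-to-hidden feedback: during training the true target is fed back (teacher forcing), while in open-loop use the model's own prediction is fed back. The architecture and parameter names are hypothetical simplifications, not the book's exact model:

import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def step(y_prev, x_t, h_prev, params):
    W, U, R, V, b, c = params                    # R carries the output-to-hidden feedback
    h = np.tanh(b + W @ h_prev + U @ x_t + R @ y_prev)
    return h, softmax(c + V @ h)

# Teacher forcing (training): the ground-truth output y(t-1) is fed back.
def run_teacher_forced(xs, ys_onehot, params, n_hidden):
    h, y_prev, loss = np.zeros(n_hidden), np.zeros(len(ys_onehot[0])), 0.0
    for x_t, y_t in zip(xs, ys_onehot):
        h, y_hat = step(y_prev, x_t, h, params)
        loss -= np.log(y_hat @ y_t)              # cross-entropy against the true target
        y_prev = y_t                             # teacher forcing: feed back the *true* output
    return loss

# Open-loop use (test time): the model's own prediction is fed back instead.
def run_free_running(xs, params, n_hidden, n_out):
    h, y_prev, outputs = np.zeros(n_hidden), np.zeros(n_out), []
    for x_t in xs:
        h, y_hat = step(y_prev, x_t, h, params)
        y_prev = np.eye(n_out)[np.argmax(y_hat)]  # feed back the model's own output
        outputs.append(y_hat)
    return outputs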
10.2.2 Computing the Gradient in a Recurrent Neural Network
Computing the gradient through a recurrent neural network is straightforward: one simply applies the generalized back-propagation algorithm to the unrolled computational graph. No specialized algorithms are necessary, and the gradients obtained by back-propagation may then be used with any general-purpose gradient-based technique to train the RNN.
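As an illustration that BPTT is just ordinary back-propagation applied to the unrolled graph, the sketch below computes the gradient of a squared-error loss on the final state of a tiny RNN with respect to the recurrent weight matrix, and checks one entry against a finite difference (a toy example of ours, not code from the text):

import numpy as np

# Forward pass storing states, then backward pass through the unrolled graph.
def loss_and_grad_W(W, U, xs, target):
    hs = [np.zeros(W.shape[0])]
    for x_t in xs:                                   # forward: h(t) = tanh(W h(t-1) + U x(t))
        hs.append(np.tanh(W @ hs[-1] + U @ x_t))
    err = hs[-1] - target
    loss = 0.5 * err @ err
    grad_W = np.zeros_like(W)
    dh = err                                         # dL/dh(tau)
    for t in range(len(xs), 0, -1):                  # backward pass through time
        da = dh * (1.0 - hs[t] ** 2)                 # back through tanh
        grad_W += np.outer(da, hs[t - 1])            # dL/dW accumulates over time steps
        dh = W.T @ da                                # propagate the gradient to h(t-1)
    return loss, grad_W

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3)) * 0.5
U = rng.standard_normal((3, 2)) * 0.5
xs = rng.standard_normal((4, 2))
target = rng.standard_normal(3)
loss, g = loss_and_grad_W(W, U, xs, target)
eps = 1e-6
W2 = W.copy()
W2[0, 1] += eps                                      # finite-difference check of one entry
assert abs((loss_and_grad_W(W2, U, xs, target)[0] - loss) / eps - g[0, 1]) < 1e-3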
Bidirectional RNNs
All the recurrent networks considered so far have a causal structure, meaning that the state at time t is influenced only by past inputs x(1), . . . , x(t−1) and the current input x(t). Some of the models discussed also allow past output values y to affect the current state when those values are available.
In many applications, however, we need to output a prediction that may depend on the entire input sequence. For instance, in speech recognition, the correct interpretation of the current sound as a phoneme may depend on the next few phonemes because of co-articulation, and may even depend on the next few words because of linguistic dependencies. When there are several acoustically plausible interpretations of the current word, we may have to look both far into the past and far into the future to disambiguate them. The same is true of handwriting recognition and many other sequence-to-sequence learning tasks.
Bidirectional recurrent neural networks were introduced by Schuster and Paliwal (1997) to address exactly this need. They have been extremely successful in applications where such a need arises, including handwriting recognition (Graves et al., 2008; Graves and Schmidhuber, 2009), speech recognition (Graves and Schmidhuber, 2005; Graves et al., 2013) and bioinformatics (Baldi et al., 1999).
As the name suggests, bidirectional RNNs combine two RNNs that process the data in opposite directions: one moving forward through time from the beginning of the sequence, and one moving backward through time from the end of the sequence. In the architecture depicted in figure 10.11, h(t) denotes the state of the forward-moving sub-RNN and g(t) denotes the state of the backward-moving sub-RNN. This design enables the output units o(t) to compute representations that depend on both the past and the future, while being most sensitive to the input values around time t, without having to specify a fixed-size window around t (as one would have to do with a feedforward network, a convolutional network, or a regular RNN with a fixed-size look-ahead buffer).
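A minimal sketch of this idea: one recurrence summarizes the past, a second recurrence run in reverse summarizes the future, and each output reads both. All shapes and names are illustrative assumptions:

import numpy as np

# A bidirectional RNN: h(t) summarizes the past, g(t) summarizes the future,
# and o(t) depends on both summaries around time t.
def birnn(xs, Wf, Uf, Wb, Ub, V):
    n_hidden = Wf.shape[0]
    h, forward = np.zeros(n_hidden), []
    for x_t in xs:                          # forward pass over the sequence
        h = np.tanh(Wf @ h + Uf @ x_t)
        forward.append(h)
    g, backward = np.zeros(n_hidden), [None] * len(xs)
    for t in range(len(xs) - 1, -1, -1):    # backward pass over the sequence
        g = np.tanh(Wb @ g + Ub @ xs[t])
        backward[t] = g
    return [V @ np.concatenate([hf, gb]) for hf, gb in zip(forward, backward)]

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, tau = 3, 5, 2, 7
outs = birnn(rng.standard_normal((tau, n_in)),
             rng.standard_normal((n_hidden, n_hidden)) * 0.3,
             rng.standard_normal((n_hidden, n_in)) * 0.3,
             rng.standard_normal((n_hidden, n_hidden)) * 0.3,
             rng.standard_normal((n_hidden, n_in)) * 0.3,
             rng.standard_normal((n_out, 2 * n_hidden)) * 0.3)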
The idea can be extended to 2-dimensional inputs, such as images, by using four RNNs, each processing the data in one of the four directions: up, down, left and right. At each point (i, j) of the grid, an output O_{i,j} can then compute a representation that captures mostly local information while also potentially depending on long-range inputs, if the RNN is able to learn to carry that information. Compared with convolutional networks, RNNs applied to images are typically more expensive computationally, but they allow for long-range lateral interactions between features in the same feature map. Indeed, the forward propagation equations for such RNNs can be written in a form showing that they use a convolution to compute the bottom-up input to each layer, prior to the recurrent propagation across the feature map that incorporates the lateral interactions.
Encoder-Decoder Sequence-to-Sequence Architectures
We have seen how an RNN can map an input sequence to a fixed-size vector, as illustrated in figure 10.5. We have also seen how an RNN can map a fixed-size vector to a sequence, as demonstrated in figure 10.9. Finally, figures 10.3, 10.4, 10.10 and 10.11 show how an RNN can map an input sequence to an output sequence of the same length.
The encoder-decoder, or sequence-to-sequence, architecture is designed for generating an output sequence conditioned on an input sequence. It consists of an encoder RNN that processes the input sequence and a decoder RNN that produces the output sequence (or computes the probability of a given output sequence). The final hidden state of the encoder RNN is used to compute a fixed-size context variable C, which represents a semantic summary of the input sequence and is provided to the decoder RNN.
Here we discuss how an RNN can be trained to map an input sequence to an output sequence that is not necessarily of the same length. This comes up in many applications, such as speech recognition, machine translation and question answering, where the input and output sequences in the training set are generally not of the same length, although their lengths may be related.
We often call the input to the RNN the "context." Our objective is to produce a representation of this context, C. The context C can be a vector or a sequence of vectors that summarize the input sequence X = (x(1), . . . , x(nx)).
The encoder-decoder or sequence-to-sequence architecture was proposed independently by Cho et al. (2014) and Sutskever et al. (2014) and quickly transformed the mapping of variable-length sequences in machine translation. Cho's system scores proposals produced by another translation system, while Sutskever's uses a standalone recurrent network to generate the translations directly. The architecture consists of two main components: an encoder RNN that processes the input sequence and produces a context C from its final hidden state, and a decoder RNN that is conditioned on that fixed-length vector to generate the output sequence.
The innovation of this architecture is that the lengths of the input and output sequences, n_x and n_y, can vary from each other, whereas previous architectures required them to be equal. In the sequence-to-sequence framework, the two RNNs are trained jointly to maximize the average log probability of the output sequence given the input sequence. The last state h(nx) of the encoder RNN serves as the representation C of the input sequence that is provided as input to the decoder RNN.
Given the vector C, the decoder RNN operates as a vector-to-sequence RNN as described in section 10.2.4. There are at least two ways for a vector-to-sequence RNN to receive input: the vector can be provided as the initial state of the RNN, or it can be connected to the hidden units at each time step. These two options can also be combined.
There is no constraint that the encoder must have the same size of hidden layer as the decoder.
One clear limitation of this architecture arises when the context C output by the encoder RNN has a dimension that is too small to properly summarize a long sequence. This issue was highlighted by Bahdanau et al. (2015) in the context of machine translation. They proposed making C a variable-length sequence rather than a fixed-size vector, and they additionally introduced an attention mechanism that learns to associate elements of the context sequence C with elements of the output sequence. See section 12.4.5.1 for more detail.
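The following sketch shows the basic encoder-decoder flow without attention: the encoder's final state becomes the context C, which here initializes a decoder that emits an output sequence of a different length. Everything about the parametrization (names, greedy feedback of the previous output, sizes) is an illustrative assumption:

import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def encode(xs, W_enc, U_enc):
    h = np.zeros(W_enc.shape[0])
    for x_t in xs:
        h = np.tanh(W_enc @ h + U_enc @ x_t)
    return h                                   # context C = h(n_x)

def decode(C, n_y, W_dec, R_dec, V_dec, n_out):
    s = np.tanh(C)                             # use C to set the decoder's initial state
    y_prev, ys = np.zeros(n_out), []
    for _ in range(n_y):                       # output length n_y need not equal n_x
        s = np.tanh(W_dec @ s + R_dec @ y_prev)
        y_hat = softmax(V_dec @ s)
        y_prev = np.eye(n_out)[np.argmax(y_hat)]  # feed back the previous output
        ys.append(np.argmax(y_hat))
    return ys

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 6, 5
xs = rng.standard_normal((9, n_in))            # input length n_x = 9
out_seq = decode(encode(xs,
                        rng.standard_normal((n_hid, n_hid)) * 0.3,
                        rng.standard_normal((n_hid, n_in)) * 0.3),
                 n_y=5,                        # output length n_y = 5
                 W_dec=rng.standard_normal((n_hid, n_hid)) * 0.3,
                 R_dec=rng.standard_normal((n_hid, n_out)) * 0.3,
                 V_dec=rng.standard_normal((n_out, n_hid)) * 0.3,
                 n_out=n_out)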
Deep Recurrent Networks
The computation in most RNNs can be decomposed into three blocks of parameters and associated transformations:
1. from the input to the hidden state,
2. from the previous hidden state to the next hidden state, and
3. from the hidden state to the output.
With the RNN architecture illustrated in figure 10.3, each of these three blocks is associated with a single weight matrix, so that when the network is unfolded, each corresponds to a shallow transformation. By a shallow transformation we mean one that would correspond to a single layer within a deep multilayer perceptron (MLP), typically a learned affine transformation followed by a fixed nonlinearity.
Would it be advantageous to introduce depth in each of these operations? Experimental evidence from Graves et al. (2013) and Pascanu et al. (2014a) strongly suggests so, indicating that sufficient depth is needed to perform the required mappings. This is in agreement with earlier work on deep RNNs by Schmidhuber (1992), El Hihi and Bengio (1996) and Jaeger (2007a).
Graves et al. (2013) demonstrated significant benefits from decomposing the state of an RNN into multiple layers, so that the lower layers transform the raw input into a representation more appropriate for the higher levels of the hidden state. Building on this, Pascanu et al. (2014a) proposed using a separate MLP (possibly deep) for each of the three blocks enumerated above. While increasing the representational capacity of each step is desirable, the added depth can also make optimization harder, since shallower architectures are generally easier to optimize. The added depth lengthens the shortest path between variables at different time steps; for example, using an MLP with a single hidden layer for the state-to-state transition doubles the length of the shortest path compared with an ordinary RNN.
A recurrent neural network can be made deep in many ways (Pascanu et al., 2014a). The hidden recurrent state can be broken down into groups organized hierarchically. Deeper computation, such as an MLP, can be introduced in the input-to-hidden, hidden-to-hidden and hidden-to-output connections, which may lengthen the shortest path linking different time steps. However, this path-lengthening effect can be mitigated by introducing skip connections in the hidden-to-hidden path.
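One simple instance of a deep RNN, stacking two recurrent layers so that the lower layer's state feeds the upper layer at every time step, is sketched below (an illustrative construction, not the specific architectures of the cited papers):

import numpy as np

# A two-layer stacked recurrent network: the lower layer reads the input,
# the upper layer reads the lower layer's state at every time step.
def deep_rnn(xs, W1, U1, W2, U2):
    h1 = np.zeros(W1.shape[0])
    h2 = np.zeros(W2.shape[0])
    states = []
    for x_t in xs:
        h1 = np.tanh(W1 @ h1 + U1 @ x_t)   # lower recurrent layer
        h2 = np.tanh(W2 @ h2 + U2 @ h1)    # upper recurrent layer
        states.append(h2)
    return states

rng = np.random.default_rng(0)
states = deep_rnn(rng.standard_normal((6, 3)),
                  rng.standard_normal((4, 4)) * 0.3, rng.standard_normal((4, 3)) * 0.3,
                  rng.standard_normal((4, 4)) * 0.3, rng.standard_normal((4, 4)) * 0.3)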
Recursive Neural Networks
A recursive network generalizes the computational graph of a recurrent network from a chain to a tree. This architecture makes it possible to map a variable-size sequence x(1), x(2), . . . , x(t) to a fixed-size representation (the output o) with a fixed set of parameters (the weight matrices U, V, W). The figure illustrates a supervised learning case in which a target y is provided that is associated with the whole sequence.
Recursive neural networks represent yet another generalization of recurrent networks, with a computational graph structured as a deep tree rather than the chain-like structure of RNNs. This tree-structured architecture allows for more complex data representations and relationships, as shown in the accompanying illustration.
2 We suggest not abbreviating "recursive neural network" as "RNN," to avoid confusion with recurrent neural networks.
Recursive neural networks were introduced by Pollack (1990), and their potential use for learning to reason was described by Bottou (2011). Recursive networks have been successfully applied to processing data structures as input to neural networks (Frasconi et al., 1997, 1998), in natural language processing (Socher et al., 2011a,c, 2013a) as well as in computer vision (Socher et al., 2011b).
One clear advantage of recursive networks over recurrent networks is that, for a sequence of the same length τ, the depth can be drastically reduced from τ to O(log τ), which may help deal with long-term dependencies. A key open question is how best to structure the tree. One approach is to use a fixed tree structure, such as a balanced binary tree, that does not depend on the data. In some domains, external methods can inform the tree structure; for instance, when processing natural language sentences, the recursive network can adopt the structure of the parse tree provided by a natural language parser. Ideally, the learner itself would discover and infer the tree structure appropriate for any given input.
Many variants of the recursive net idea are possible. For example, Frasconi et al. (1997, 1998) associate the data with a tree structure, assigning inputs and targets to individual nodes of the tree. The computation performed at each node need not be the traditional artificial neuron computation (an affine transformation followed by a monotone nonlinearity). For instance, Socher et al. (2013a) propose using tensor operations and bilinear forms, which have previously been found useful for modeling relationships between concepts represented by continuous vector embeddings (Weston et al., 2010; Bordes et al., 2012).
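A small sketch of a recursive network over a fixed binary tree, where the same parameters combine two child representations into a parent representation at every node (an illustrative example; the tree structure and composition function are our own assumptions):

import numpy as np

# Recursive network: the same weights are reused at every internal node of the tree.
def compose(left, right, W_left, W_right, b):
    return np.tanh(W_left @ left + W_right @ right + b)

def recursive_rep(node, params):
    # node is either a leaf vector or a (left_subtree, right_subtree) pair
    if isinstance(node, np.ndarray):
        return node
    left = recursive_rep(node[0], params)
    right = recursive_rep(node[1], params)
    return compose(left, right, *params)

rng = np.random.default_rng(0)
d = 4
params = (rng.standard_normal((d, d)) * 0.3, rng.standard_normal((d, d)) * 0.3, np.zeros(d))
leaves = [rng.standard_normal(d) for _ in range(4)]
tree = ((leaves[0], leaves[1]), (leaves[2], leaves[3]))   # a balanced binary tree
fixed_size_rep = recursive_rep(tree, params)              # one vector for the whole input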
The Challenge of Long-Term Dependencies
The mathematical challenge of learning long-term dependencies in recurrent networks stems primarily from vanishing and exploding gradients. Even when the parameters are such that the recurrent network is stable, the difficulty is that gradients of long-term interactions, which involve the multiplication of many Jacobians, have exponentially smaller magnitude than gradients of short-term interactions. For a more comprehensive understanding, readers can refer to the deeper analyses of Hochreiter (1991), Doya (1993), Bengio et al. (1994) and Pascanu et al. (2013).
Figure 10.15: When composing many nonlinear functions, such as the linear-tanh layer shown here, the result is highly nonlinear, characterized by a predominance of values with small derivatives, occasional large derivatives, and many alternations between increasing and decreasing. The plot shows a linear projection of a 100-dimensional hidden state down to a single dimension on the y-axis, while the x-axis is the coordinate of the initial state along a random direction in the 100-dimensional space. The plot can thus be viewed as a linear cross-section of a high-dimensional function, showing the function's behavior after each time step, that is, after each composition of the transition function.
In this section, we describe the problem in more detail. The remaining sections describe approaches to overcoming it.
Recurrent networks involve the composition of the same function multiple times, once per time step. These compositions can result in extremely nonlinear behavior, as illustrated in figure 10.15.
In particular, the function composition employed by recurrent neural networks somewhat resembles matrix multiplication. The recurrence relation h(t) = W h(t−1) can be viewed as a very simple recurrent network lacking a nonlinear activation function and lacking external inputs. This recurrence can be understood as an application of the power method, as detailed in section 8.2.5, and simplifies to h(t) = W^t h(0). (10.37) If W admits an eigendecomposition of the form W = Q Λ Q^T (10.38) with orthogonal Q, the recurrence may be simplified further to h(t) = Q Λ^t Q^T h(0). (10.39)
The eigenvalues are raised to the power of t, causing eigenvalues with magnitude less than one to decay toward zero and eigenvalues with magnitude greater than one to grow exponentially. Consequently, any component of h(0) that is not aligned with the largest eigenvector will eventually be discarded.
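The following toy computation illustrates this effect numerically: with W = Q Λ Q^T, iterating h ← W h amplifies the component of h(0) along the eigenvector whose eigenvalue exceeds one in magnitude and suppresses all the others (an illustrative example with made-up eigenvalues):

import numpy as np

# Repeatedly applying W amplifies components along eigenvalues with |lambda| > 1
# and shrinks those with |lambda| < 1, the root of exploding/vanishing gradients.
Q, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((4, 4)))  # orthogonal Q
lam = np.array([1.1, 0.9, 0.5, 0.2])
W = Q @ np.diag(lam) @ Q.T            # W = Q Lambda Q^T
h = np.ones(4)                        # h(0)
for _ in range(50):                   # h(t) = W^t h(0)
    h = W @ h
print(np.abs(Q.T @ h))                # the 1.1 component grows (~1.1**50); the rest shrink toward 0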
This problem is particular to recurrent networks and concerns the behavior of the weights as they are reused over time. In the scalar case, repeatedly multiplying a weight w by itself leads to either a vanishing or an exploding product, depending on the weight's magnitude. In a non-recurrent network, by contrast, a distinct weight w(t) can be used at each time step, which alters the dynamics and avoids this particular difficulty.
Consider a very deep feedforward network with such per-step weights and no nonlinearity: if the initial state is 1, then the state at time t is given by the product of the w(t) values up to time t. Suppose the w(t) values are generated randomly and independently of one another, with zero mean and variance v. The variance of the product is then O(v^n). To achieve a desired variance v*, the individual weights can be chosen with variance v = (v*)^(1/n), the n-th root of v*. Very deep feedforward networks with carefully chosen scaling can thus avoid the vanishing and exploding gradient problem, as argued by Sussillo (2014).
The vanishing and exploding gradient problem for RNNs was identified by Hochreiter (1991) and Bengio et al. (1993, 1994). One may hope to avoid the problem simply by staying in a region of parameter space where the gradients do not vanish or explode; unfortunately, in order to store memories in a way that is robust to small perturbations, the RNN must enter a region of parameter space where gradients vanish.
Specifically, Bengio et al. (1994) showed that whenever the model is able to represent long-term dependencies, the gradients associated with these interactions are exponentially smaller in magnitude than those for short-term interactions. This does not make learning impossible, but it does mean that learning long-term dependencies may take a very long time, because the signal about these dependencies tends to be hidden by the smallest fluctuations arising from short-term dependencies. Their experiments showed that as the span of the dependencies to be captured increases, gradient-based optimization becomes increasingly difficult, with the probability of successfully training a traditional RNN via SGD rapidly reaching zero for sequences of only length 10 to 20.
For a deeper treatment of recurrent networks as dynamical systems, see Doya (1993), Bengio et al. (1993, 1994) and Siegelmann and Sontag (1995), with a review in Pascanu et al. (2013). Various approaches have been proposed to make it easier to learn long-term dependencies, in many cases allowing an RNN to learn dependencies spanning hundreds of steps, but the difficulty of learning long-term dependencies remains one of the main challenges in deep learning.
Echo State Networks
The recurrent weights mapping from h(t−1) to h(t) and the input weights mapping from x(t) to h(t) are among the most difficult parameters to learn in a recurrent network. One proposed approach (Jaeger, 2003; Maass et al., 2002) to avoiding this difficulty is to set the recurrent weights so that the recurrent hidden units capture the history of past inputs well, and to train only the output weights. This is the idea behind echo state networks (ESNs) and liquid state machines, which use continuous-valued and spiking neurons, respectively. Both approaches fall under the umbrella term of reservoir computing, reflecting the fact that the hidden units act as a reservoir of temporal features that may capture different aspects of the history of the inputs.
One way to think about reservoir computing recurrent networks is that they are similar to kernel machines: they map an arbitrary-length sequence of inputs into a fixed-length vector, the recurrent state h(t), on which a linear predictor (typically linear regression) can be applied to solve the problem of interest. The training criterion can then easily be designed to be convex as a function of the output weights, so it can be solved reliably with simple learning algorithms. For example, if the output consists of linear regression from the hidden units to the output targets and the training criterion is mean squared error, then the criterion is convex and training is straightforward (Jaeger, 2003).
The important question is therefore how to set the input and recurrent weights so that a rich set of histories can be represented in the recurrent network state. The answer proposed in the reservoir computing literature is to view the recurrent net as a dynamical system and to set the weights so that the system is near the edge of stability.
The original idea was to make the eigenvalues of the Jacobian of the state-to-state transition function be close to 1. As described in section 8.2.5, an important characteristic of a recurrent network is the eigenvalue spectrum of the Jacobians J(t) = ∂s(t)/∂s(t−1). Of particular importance is the spectral radius of J(t), defined to be the maximum of the absolute values of its eigenvalues.
To understand the effect of the spectral radius in back-propagation, consider the simple case of a Jacobian matrix J that does not change with t, as in a purely linear network. Suppose that J has an eigenvector v with corresponding eigenvalue λ. Back-propagating a gradient vector g yields Jg after one step and J^n g after n steps. Back-propagating a perturbed vector, g + δv, instead yields J(g + δv) after one step and J^n(g + δv) after n steps. The two back-propagation paths therefore diverge by δ J^n v after n steps. If v is a unit eigenvector of J, multiplication by the Jacobian simply scales the difference at each step, so the two executions differ by δ|λ|^n. When v corresponds to the largest value of |λ|, this perturbation achieves the widest possible separation from an initial perturbation of size δ.
When |λ| > 1, the deviation size δ|λ|^n grows exponentially large. When |λ| < 1, the deviation size becomes exponentially small.
Of course, this example assumed a recurrent network without nonlinearity, in which the Jacobian remains constant at each time step. When a nonlinearity is present, its derivative approaches zero on many time steps, which helps to prevent the explosion resulting from a large spectral radius. Indeed, recent work on echo state networks advocates using a spectral radius much larger than one (Yildiz et al., 2012; Jaeger, 2012).
Everything we have said about back-propagation via repeated matrix multiplication applies equally to forward propagation in a network with no nonlinearity, where the state evolves as h(t+1) = h(t) W.
When a linear map W always shrinks h as measured by the L2 norm, we say that the map is contractive. When the spectral radius is less than one, the mapping from h(t) to h(t+1) is contractive, so small changes become smaller after each time step. This necessarily makes the network forget information about the past when we use a finite level of precision (such as 32-bit integers) to store the state vector.
The Jacobian matrix describes how a small change in h(t) propagates one step forward, and equivalently how the gradient on h(t+1) propagates one step backward during back-propagation. Although W and J are square and real, they need not be symmetric, so they can have complex-valued eigenvalues and eigenvectors, which can indicate oscillatory behavior when the Jacobian is applied repeatedly. While h(t) and small variations of it are real-valued, they can be expressed in such a complex-valued basis. What matters is the behavior of the magnitude of the complex-valued basis coefficients when the matrix is applied to the vector: an eigenvalue with magnitude greater than one corresponds to magnification (exponential growth under repeated application), while a magnitude less than one corresponds to shrinking (exponential decay).
With a nonlinear map, the Jacobian can change at each step, so the dynamics are more complicated. It remains true, however, that a small initial variation can become a large variation after several steps. Unlike the purely linear case, squashing nonlinearities such as tanh can bound the recurrent dynamics. Note that back-propagation can still exhibit unbounded dynamics even when forward propagation is bounded, for example when a sequence of tanh units all operate in their linear regime and are connected by weight matrices with spectral radius greater than one. It is, however, uncommon for all tanh units to be simultaneously at their linear activation point.
Echo state networks fix the weights to have some specific spectral radius, chosen so that information is carried forward through time without becoming unstable. Stability is provided by the saturating nonlinearities, such as tanh, which bound the flow of information and prevent explosive growth.
More recent work has shown that the weight-setting techniques used in echo state networks can be used to initialize the weights in a fully trainable recurrent network, helping it to learn long-term dependencies, as evidenced by Sutskever et al. (2012, 2013). In this setting, an initial spectral radius of 1.2 combined with a sparse initialization scheme performs well.
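To make this concrete, here is a minimal NumPy sketch of such an initialization; the function name, sparsity level and dimensions are illustrative assumptions rather than details taken from the source.

```python
import numpy as np

def esn_style_recurrent_init(n_hidden, spectral_radius=1.2, sparsity=0.9, seed=0):
    """Draw a sparse random recurrent weight matrix and rescale it so that
    its spectral radius (largest |eigenvalue|) equals the requested value."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_hidden, n_hidden))
    W[rng.random((n_hidden, n_hidden)) < sparsity] = 0.0   # sparse initialization
    radius = np.max(np.abs(np.linalg.eigvals(W)))           # current spectral radius
    return W * (spectral_radius / radius)

W_rec = esn_style_recurrent_init(200)
print(np.max(np.abs(np.linalg.eigvals(W_rec))))  # approximately 1.2
```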
Optimization for Long-Term Dependencies
Figure 1.11: The number of neurons (on a logarithmic scale) in artificial neural networks over time, with reference points for biological networks ranging from ant and bee to frog, octopus and human. Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years. Biological neural network sizes are from Wikipedia (2015). The networks plotted include:
2. Adaptive linear element (Widrow and Hoff, 1960)
4. Early back-propagation network (Rumelhart et al., 1986b)
5. Recurrent neural network for speech recognition (Robinson and Fallside, 1991)
6. Multilayer perceptron for speech recognition (Bengio et al., 1991)
7. Mean field sigmoid belief network (Saul et al., 1996)
9. Echo state network (Jaeger and Haas, 2004)
10. Deep belief network (Hinton et al., 2006)
11. GPU-accelerated convolutional network (Chellapilla et al., 2006)
12. Deep Boltzmann machine (Salakhutdinov and Hinton, 2009a)
13. GPU-accelerated deep belief network (Raina et al., 2009)
14. Unsupervised convolutional network (Jarrett et al., 2009)
15. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
16. OMP-1 network (Coates and Ng, 2011)
17. Distributed autoencoder (Le et al., 2012)
18. Multi-GPU convolutional network (Krizhevsky et al., 2012)
19. COTS HPC unsupervised convolutional network (Coates et al., 2013)
Figure 1.12: ILSVRC classification error rate over time. Since deep networks reached the scale required for the ImageNet Large Scale Visual Recognition Challenge, they have won the challenge every year, with progressively lower error rates each time. Data from Russakovsky et al. (2014) and He et al. (2015).
Machine Learning Basics

These basic mathematical tools allow us to define functions of many variables, find the highest and lowest points on these functions and quantify degrees of belief.
The primary objectives of machine learning involve defining a model that encapsulates specific beliefs, creating a cost function to evaluate the alignment of these beliefs with actual outcomes, and employing a training algorithm to reduce the value of this cost function effectively.
This foundational framework underpins a wide range of machine learning algorithms, extending well beyond deep learning. In later parts of the book, we explore how deep learning algorithms are built upon this framework.
Linear algebra, a crucial branch of mathematics, plays a significant role in many scientific and engineering fields. Because it is a form of continuous rather than discrete mathematics, however, many computer scientists have little experience with it.
A solid grasp of linear algebra is essential for understanding and working with many machine learning algorithms, especially in deep learning. We therefore first present the key linear algebra concepts needed before moving on to deep learning topics.
If you are already familiar with linear algebra, feel free to skip this chapter. If you need a detailed reference, we recommend The Matrix Cookbook (Petersen and Pedersen, 2006). If you are new to linear algebra, this chapter provides the basics necessary to understand this book, but we recommend consulting an additional resource such as Shilov (1977) for a more complete treatment. Note that this chapter omits many important linear algebra topics that are not critical for deep learning.
2.1 Scalars, Vectors, Matrices and Tensors
The study of linear algebra involves several types of mathematical objects:
• Scalars: A scalar is just a single number, in contrast to most of the other objects studied in linear algebra, which are usually arrays of multiple numbers.
In mathematical notation, scalars are written in italics and are typically given lower-case variable names. When we introduce a scalar, we specify what kind of number it is; for instance, we might say "Let s ∈ R be the slope of the line" to define a real-valued scalar, or "Let n ∈ N be the number of units" to define a natural-number scalar.
Vectors are ordered arrays of numbers, where each number is identified by its index. They are typically written with lowercase bold letters, such as **x**, and their elements are written in italics with subscripts, such as x₁ for the first element and x₂ for the second. We also specify what kind of numbers the vector contains; for instance, if all elements belong to the set of real numbers R and the vector has n elements, then the vector lies in Rⁿ, the Cartesian product of R taken n times. To explicitly list the elements of a vector, we write them as a column enclosed in square brackets.
We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis.
To index a set of elements of a vector, we define a set of indices. For instance, to access the elements x₁, x₃, and x₆, we define the set S = {1, 3, 6} and write xₛ. We use the minus sign to denote the complement of a set; for example, x₋₁ is the vector containing all elements of x except x₁, while x₋ₛ contains all elements of x except x₁, x₃ and x₆.
A matrix is a two-dimensional array of numbers, where each element is identified by two indices. Matrices are typically written with upper-case bold letters, such as A. If A has height m and width n, we write A ∈ R^(m × n). Elements are identified with italicized names and comma-separated indices; for instance, A_(1,1) is the upper left entry and A_(m,n) is the bottom right entry. We write A_(i,:) for the i-th row of A and A_(:,i) for its i-th column.
The transpose of a matrix is its mirror image across the main diagonal, so the columns of the original matrix become the rows of the transpose. To explicitly write out the elements of a matrix, we display them as an array enclosed in square brackets.
When indexing a matrix-valued expression, we use subscripts after the expression without converting anything to lowercase. For instance, f(A)_(i,j) denotes the element at position (i, j) of the matrix obtained by applying the function f to A.
• Tensors: In some cases we will need an array with more than two axes.
A tensor is an array of numbers arranged on a regular grid with a variable number of axes. We denote a tensor with the symbol "A", and the element of tensor A at coordinates (i, j, k) is written A_(i,j,k).
The transpose of a matrix is a crucial operation that creates a mirror image of the matrix across its main diagonal, which extends from the upper left corner to the lower right.
See figure 2.1 for a graphical depiction of this operation. We denote the transpose of a matrix A as \( A^{\top} \), and it is defined such that \( (A^{\top})_{i,j} = A_{j,i} \).
Vectors can be viewed as matrices with a single column, so the transpose of a vector is a matrix with a single row. A vector can also be written inline as a row matrix followed by the transpose operator to turn it into a standard column vector, for example x = [x₁, x₂, x₃]ᵀ.
A scalar can be thought of as a matrix with a single entry. From this, we can see that a scalar is its own transpose: \( a = a^{\top} \).
We can add matrices to each other, as long as they have the same shape, just by adding their corresponding elements: C = A + B, where \( C_{i,j} = A_{i,j} + B_{i,j} \).
We can also add a scalar to a matrix or multiply a matrix by a scalar, just by performing that operation on each element of the matrix: \( D = a \cdot B + c \), where \( D_{i,j} = a \cdot B_{i,j} + c \).
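As a quick illustration of these conventions, here is a minimal NumPy sketch; the particular arrays are arbitrary examples, not taken from the source.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # A in R^(3x2); A[0, 0] is the upper-left entry
B = np.ones((3, 2))

A_T = A.T                            # transpose: (A^T)_{i,j} = A_{j,i}, shape (2, 3)
C = A + B                            # element-wise matrix addition, C_{i,j} = A_{i,j} + B_{i,j}
D = 2.0 * B + 0.5                    # scalar multiply and scalar add, applied element-wise
x = np.array([1.0, 3.0, 6.0])        # a vector; x[[0, 2]] indexes the set of elements {x_1, x_3}
```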
In the context of deep learning, we also use some less conventional notation.
Explicit Memory
The model defines \( p(y^{(t)} \mid \{x^{(1)}, \ldots, x^{(t)}\}) \), with predictions read off the output vector \( \hat{y}^{(t)} \). However, computing the gradient of the loss function with respect to the model parameters is expensive: it requires a forward propagation pass through the unrolled graph followed by a backward propagation pass, giving a runtime of \( O(\tau) \). Because forward propagation is inherently sequential, it cannot be parallelized, and the memory cost is also \( O(\tau) \), since the states computed in the forward pass must be stored until they are reused in the backward pass. The back-propagation algorithm applied to the unrolled graph, called back-propagation through time (BPTT), is discussed in section 10.2.2. Recurrent networks are thus very powerful but also expensive to train. Is there a more efficient alternative?
10.2.1 Teacher Forcing and Networks with Output Recurrence
The network with recurrent connections only from the output at one time step to the hidden units at the next is strictly less powerful because it lacks hidden-to-hidden connections and therefore cannot simulate a universal Turing machine. Its output units must capture all of the information about the past that is needed to predict the future, which is difficult unless the user knows how to describe the full state of the system and provides it as part of the training set. The advantage of eliminating hidden-to-hidden recurrence is that training can be decoupled across time steps and parallelized, since the gradient at each step can be computed in isolation without first computing the output of the previous step.
A time-unfolded recurrent neural network (RNN) with a single output at the end of the sequence can be used to summarize a sequence and produce a fixed-size representation used as input for further processing. The network can have a target provided directly at the end, or the gradient on the output can be obtained by back-propagating from downstream modules.
Recurrent models with output-to-hidden connections can be trained with a technique called teacher forcing, which arises from the maximum likelihood criterion: during training, the model receives the actual output y(t) as input at the next time step, t + 1. Examining a sequence over two time steps makes the conditional relationships between inputs and outputs in this architecture clearer.
Teacher forcing is a training procedure for recurrent neural networks (RNNs) in which the correct output from the training set, y(t), is fed back as an input to the computation of the hidden state h(t+1) during training. At deployment time the true output is generally not known, so the model instead approximates it by feeding back its own output o(t). The procedure follows from the conditional maximum likelihood criterion, \( \log p(y^{(1)}, y^{(2)} \mid x^{(1)}, x^{(2)}) \), applied to a sequence of two time steps.
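For a sequence of two time steps, this criterion decomposes as follows (a standard factorization consistent with the description above):

\[
\log p\big(y^{(1)}, y^{(2)} \mid x^{(1)}, x^{(2)}\big)
= \log p\big(y^{(2)} \mid y^{(1)}, x^{(1)}, x^{(2)}\big)
+ \log p\big(y^{(1)} \mid x^{(1)}, x^{(2)}\big).
\]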
At time t = 2, the model is trained to maximize the conditional probability of y(2) given both the x sequence so far and the previous y value from the training set. Maximum likelihood therefore specifies that, during training, the feedback connections should receive the target values rather than the model's own outputs, ensuring that the model learns the correct output. This is illustrated in figure 10.6.
Teacher forcing was originally motivated by the desire to avoid back-propagation through time (BPTT) in models that lack hidden-to-hidden connections. It can also be applied to models that have such connections, as long as they also have connections from the output at one time step to values computed at the next. However, as soon as the hidden units depend on earlier time steps, BPTT becomes necessary, so some models are trained with both teacher forcing and BPTT.
Strict teacher forcing can cause problems when the network is later used in open-loop mode, because the inputs it sees during training can differ significantly from the inputs it will see at test time. One way to mitigate this is to train with both teacher-forced and free-running inputs, for example by predicting targets several steps ahead through the recurrent output-to-input paths; the network can then learn to cope with input conditions it generates itself and to map its state back toward one that produces correct outputs. Another approach, proposed by Bengio et al. (2015), is to randomly choose between generated values and actual data values as input, using a curriculum learning strategy that gradually increases the fraction of generated values used as input.
10.2.2 Computing the Gradient in a Recurrent Neural Network
Computing the gradient through a recurrent neural network (RNN) is straightforward: one simply applies the generalized back-propagation algorithm to the unrolled computational graph. No specialized algorithm is required, and the gradients obtained by back-propagation may then be used with any general-purpose gradient-based technique to train the RNN.
To gain some intuition for how the BPTT algorithm behaves, we show how to compute the gradients for the relevant RNN equations. The nodes of the computational graph include the parameters U, V, W, b and c, as well as the sequence of nodes indexed by t for the inputs x(t), hidden states h(t), outputs o(t) and losses L(t). For each node N we compute the gradient ∇_N L recursively, based on the gradients computed at the nodes that follow it in the graph. We start the recursion with the nodes immediately preceding the final loss.
In this derivation, we assume that the outputs \( o^{(t)} \) are used as the argument to the softmax function to obtain the vector \( \hat{y}^{(t)} \) of probabilities over the possible outputs, and that the loss is the negative log-likelihood of the true target \( y^{(t)} \) given the input so far. The gradient \( \nabla_{o^{(t)}} L \) on the outputs at time step t is computed for all indices i and t.
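Assuming the softmax output and negative log-likelihood loss just described, this gradient takes the standard form

\[
(\nabla_{o^{(t)}} L)_i = \frac{\partial L}{\partial o_i^{(t)}} = \hat{y}_i^{(t)} - \mathbf{1}_{i = y^{(t)}}.
\]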
We work our way backward, starting from the end of the sequence. At the final time step τ, h^{(τ)} has only o^{(τ)} as a descendent, so its gradient is simple: \( \nabla_{h^{(\tau)}} L = V^{\top} \nabla_{o^{(\tau)}} L \).
To back-propagate gradients through time, we then iterate backward from \( t = \tau - 1 \) down to \( t = 1 \). Note that for \( t < \tau \), \( h^{(t)} \) has both \( o^{(t)} \) and \( h^{(t+1)} \) as descendents, so its gradient combines contributions from both.
Here \( \mathrm{diag}\big(1 - (h^{(t+1)})^{2}\big) \) indicates the diagonal matrix containing the elements \( 1 - (h_i^{(t+1)})^{2} \). This is the Jacobian of the hyperbolic tangent associated with the hidden unit i at time t + 1.
Once the gradients on the internal nodes of the computational graph are obtained, we can compute the gradients on the parameter nodes. Because the parameters are shared across many time steps, we must take some care with calculus operations involving these variables. The equations we wish to implement use the bprop method of section 6.5.6, which computes the contribution of a single edge in the computational graph to the gradient. The \( \nabla_{W} f \) operator used in calculus, however, takes into account the contribution of W to the value of f through all edges in the computational graph. To resolve this ambiguity, we introduce dummy variables \( W^{(t)} \) that are defined to be copies of W, with each \( W^{(t)} \) used only at time step t. We may then use \( \nabla_{W^{(t)}} \) to denote the contribution of the weights at time step t to the gradient.
Using this notation, the gradient on the remaining parameters is given by:
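The equations themselves are elided here, but assuming the standard tanh recurrent network with hidden update \( h^{(t)} = \tanh(b + W h^{(t-1)} + U x^{(t)}) \) and output \( o^{(t)} = c + V h^{(t)} \), the parameter gradients take the following standard form:

\[
\nabla_{c} L = \sum_t \nabla_{o^{(t)}} L,
\qquad
\nabla_{b} L = \sum_t \mathrm{diag}\big(1 - (h^{(t)})^{2}\big)\, \nabla_{h^{(t)}} L,
\]
\[
\nabla_{V} L = \sum_t \big(\nabla_{o^{(t)}} L\big)\, h^{(t)\top},
\qquad
\nabla_{W} L = \sum_t \mathrm{diag}\big(1 - (h^{(t)})^{2}\big)\, \big(\nabla_{h^{(t)}} L\big)\, h^{(t-1)\top},
\]
\[
\nabla_{U} L = \sum_t \mathrm{diag}\big(1 - (h^{(t)})^{2}\big)\, \big(\nabla_{h^{(t)}} L\big)\, x^{(t)\top}.
\]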
We do not need to compute the gradient with respect to \( x^{(t)} \) for training because it does not have any parameters as ancestors in the computational graph defining the loss.
10.2.3 Recurrent Networks as Directed Graphical Models
In the recurrent network developed so far, the losses \( L^{(t)} \) were cross-entropies between the training targets \( y^{(t)} \) and the outputs \( o^{(t)} \). As with feedforward networks, it is in principle possible to use almost any loss with a recurrent network, chosen according to the task. As usual, when the output of the RNN is interpreted as a probability distribution, we typically use the cross-entropy associated with that distribution to define the loss; mean squared error, for example, is the cross-entropy loss associated with an output distribution that is a unit Gaussian, exactly as in the feedforward case.
When we use predictive log-likelihood training, we train the RNN to estimate the conditional distribution of the next sequence element \( y^{(t)} \) given the past inputs. This means we maximize the log-likelihood \( \log p(y^{(t)} \mid x^{(1)}, \ldots, x^{(t)}) \), or, if the model includes connections from the output at one time step to the next time step, \( \log p(y^{(t)} \mid x^{(1)}, \ldots, x^{(t)}, y^{(1)}, \ldots, y^{(t-1)}) \).
Performance Metrics
Establishing your goals in terms of an appropriate error metric is a crucial first step, because this metric will guide all subsequent decisions. It is equally important to have a clear sense of the level of performance you expect to achieve.
Keep in mind that for most applications, absolute zero error is impossible. The Bayes error defines the minimum error rate you can hope to achieve, even with infinite training data and knowledge of the true probability distribution, because the input features may not contain complete information about the output variable or because the system may be intrinsically stochastic. In practice you are further limited by having a finite amount of training data.
The amount of training data can be limited for a variety of reasons. When the goal is to build the best possible real-world product or service, it is often possible to collect more data, but one must weigh the cost of doing so, whether in time, money, or ethical terms (for example, if data collection requires invasive medical procedures). When the goal is instead to evaluate an algorithm against a fixed benchmark, the training set is typically fixed and no additional data can be collected.
To set a reasonable performance expectation, one can consult error rates from previously published benchmark results in the academic setting. In a real-world application, what matters is the error rate necessary for the system to be safe, cost-effective, or appealing to consumers. Once a realistic desired error rate is identified, it can guide all subsequent design decisions.
Beyond the target value of a performance metric, another important consideration is which metric to use. Several different performance metrics may be applicable to any application that includes a machine learning component, and these metrics usually differ from the cost function used to train the model. Accuracy and error rate are the most common choices, as discussed in section 5.1.2, but many applications require more advanced metrics to capture performance fully.
In email spam detection, for example, the two kinds of mistakes have very different costs: misclassifying a legitimate email as spam and allowing a spam email to reach the inbox. Blocking a legitimate message is far more harmful than letting a questionable one through. Rather than measuring the raw error rate of a spam classifier, it is therefore better to measure a total cost in which blocking legitimate messages is weighted more heavily than allowing spam messages to pass.
Training a binary classifier to detect a rare event, such as a disease affecting one in a million people, raises a different issue. A classifier can trivially achieve 99.9999% accuracy by always predicting that no one has the disease, so accuracy alone is not a meaningful performance measure. Precision and recall provide a more informative assessment: precision is the fraction of reported detections that are correct, and recall is the fraction of true events that are detected. A detector that never reports the disease has perfect precision but zero recall, while one that reports every case has perfect recall but very low precision. These metrics are often visualized with a PR curve, which plots precision against recall. A feedforward network can estimate the probability that the disease is present and produce a score that is converted to a detection by comparing it against a threshold; adjusting this threshold trades precision against recall. To summarize the performance of the classifier with a single number, precision p and recall r can be combined, for example into the F-score, F = 2pr / (p + r).
Another option is to report the total area lying beneath the PR curve.
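A minimal sketch of these metrics and of sweeping a detection threshold to trace out points on a PR curve follows; the scores and labels are made-up illustrative values.

```python
import numpy as np

def precision_recall(scores, labels, threshold):
    """scores: predicted probability of the rare event; labels: 0/1 ground truth."""
    detected = scores >= threshold
    true_pos = np.sum(detected & (labels == 1))
    precision = true_pos / max(detected.sum(), 1)     # fraction of detections that are correct
    recall = true_pos / max((labels == 1).sum(), 1)   # fraction of true events detected
    return precision, recall

def f_score(p, r):
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

scores = np.array([0.10, 0.40, 0.35, 0.80, 0.95])
labels = np.array([0, 0, 1, 1, 1])
for t in (0.2, 0.5, 0.9):          # each threshold gives one point on the PR curve
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f} F={f_score(p, r):.2f}")
```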
In some applications, it is possible for the machine learning system to refuse to make a decision. This is useful when the algorithm can estimate how confident it should be about a decision, especially if a wrong decision can be harmful and a human operator can occasionally take over. The Street View transcription system is an example: its task is to transcribe the address number from a photograph so the location can be placed accurately on the map. Because the value of the map degrades considerably if it is inaccurate, an address should be added only when the transcription is correct. If the machine learning system judges that it is less likely than a human to produce a correct transcription, the best course of action is to let a human operator transcribe the photo instead.
The machine learning system is only worthwhile, of course, if it dramatically reduces the number of photos that human operators must process. A natural performance metric in this setting is coverage: the fraction of examples for which the machine learning system is able to produce a response.
In the Street View project, the goal was to reach human-level transcription accuracy of 98% while maintaining 95% coverage. It is always possible to achieve 100% accuracy by refusing to process any example, but that reduces coverage to zero, so accuracy and coverage must be traded off against each other.
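One way to make this trade-off concrete is to compute accuracy and coverage as a function of the confidence threshold; the sketch below is hypothetical and is not the actual Street View code.

```python
import numpy as np

def accuracy_and_coverage(confidence, is_correct, threshold):
    """confidence: model confidence per example; is_correct: whether the model's
    answer would be right; the system refuses to answer below the threshold."""
    answered = confidence >= threshold
    coverage = answered.mean()                           # fraction of examples the system responds to
    accuracy = is_correct[answered].mean() if answered.any() else 1.0
    return accuracy, coverage

# Raising the threshold pushes accuracy toward the 98% target at the cost of coverage.
```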
Many other metrics are possible, such as click-through rates or user satisfaction surveys, and many specialized application areas have their own criteria as well.
What matters is to determine in advance which performance metric to improve, and then concentrate on improving it. Without clearly defined goals, it is difficult to tell whether changes to a machine learning system make meaningful progress.
Default Baseline Models
After choosing performance metrics and goals, the next step in any practical application is to establish a reasonable end-to-end system as quickly as possible. This section offers guidance on which algorithms to use as the first baseline in various situations. Bear in mind that deep learning research moves quickly, so better default algorithms may well become available soon after this writing.
Depending on the complexity of the problem, it may even be best to begin without deep learning. If the problem can be solved by correctly choosing a few linear weights, a simpler statistical model such as logistic regression may be the quicker and more efficient place to start.
For challenges categorized as "AI-complete," such as object recognition, speech recognition, and machine translation, starting with a suitable deep learning model is likely to yield successful outcomes.
When choosing a model, begin with the general category your data falls into. For supervised learning with fixed-size vector inputs, use a feedforward network with fully connected layers. If the input has known topological structure, such as an image, use a convolutional network. In either case, a good default is a piecewise linear activation function such as ReLU or one of its variants (Leaky ReLU, PReLU, or maxout). If the input or output is a sequence, use a gated recurrent network such as an LSTM or GRU to capture the dependencies in the data.
A reasonable default optimization algorithm is SGD with momentum and a decaying learning rate; popular decay schemes include decaying linearly until reaching a fixed minimum rate, decaying exponentially, or decreasing the rate by a factor of 2-10 whenever validation error plateaus. Adam is another reasonable choice. Batch normalization can have a dramatic effect on optimization, especially for convolutional networks and networks with sigmoidal activations; while it is reasonable to omit it from the very first baseline, it should be introduced promptly if optimization appears problematic.
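For example, a linear decay schedule to a fixed minimum rate might look like the following sketch (the constants are illustrative, not recommendations from the source):

```python
def linear_decay_learning_rate(step, lr_initial=0.1, lr_min=0.001, decay_steps=10000):
    """Decay the learning rate linearly from lr_initial to lr_min over
    decay_steps updates, then hold it fixed at lr_min."""
    fraction = min(step / decay_steps, 1.0)
    return (1.0 - fraction) * lr_initial + fraction * lr_min
```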
Unless the training set contains tens of millions of examples or more, include some mild form of regularization from the start. Early stopping should be used almost universally. Dropout is an excellent regularizer that is easy to implement and compatible with many models and training algorithms. Batch normalization also reduces generalization error in some cases and can make dropout unnecessary, thanks to the noise introduced by estimating the normalization statistics.
If the task is similar to one that has already been studied extensively, it is usually best to begin by copying the model and algorithm that are known to perform best on that task. It may even make sense to adopt a model trained on that task; for instance, using features from a convolutional network trained on ImageNet is a common way to tackle other computer vision problems (Girshick et al., 2015).
Whether to begin with unsupervised learning depends on the domain. Some domains, such as natural language processing, benefit enormously from unsupervised techniques like learning word embeddings. In others, such as computer vision, current unsupervised methods do not bring a benefit unless the setting is semi-supervised with very few labeled examples. If the application is one where unsupervised learning is known to be important, include it in the first end-to-end baseline. Otherwise, reserve it for later iterations, for example if the initial baseline overfits.
Determining Whether to Gather More Data
After the first end-to-end system is established, the next step is to measure the performance of the algorithm and determine how to improve it. Many machine learning novices are tempted to try out many different algorithms, but it is often much better to gather more data than to keep tweaking the learning algorithm.
To decide whether to gather more data, first measure performance on the training set. If training set performance is poor, the learning algorithm is not using the data it already has effectively, so gathering more data will not help. Instead, try increasing the size of the model by adding layers or hidden units, and try improving the learning algorithm, for example by tuning hyperparameters such as the learning rate. If these adjustments do not help, the problem may be the quality of the training data, which may be too noisy or may not include the right inputs; in that case, it may be necessary to start over and collect cleaner data or a richer set of features.
If training set performance is acceptable, measure performance on the test set. If test set performance is much worse than training set performance, gathering more data is usually one of the most effective remedies. The key considerations are the cost and feasibility of gathering more data versus the cost and feasibility of reducing test error by other means, and the amount of data likely needed to improve test set performance significantly. At large internet companies with massive user bases, gathering large datasets is often feasible and can be cheaper than the alternatives; the creation of large labeled datasets has been one of the most important factors behind progress in areas such as object recognition. In other contexts, such as medical applications, gathering more data may be costly or infeasible. A simple alternative is to reduce the size of the model or to improve regularization, by tuning hyperparameters such as the weight decay coefficient or by adding strategies such as dropout. If the gap between training and test performance remains large even after tuning regularization, gathering more data is advisable.
When deciding how much data to gather, it helps to plot the relationship between training set size and generalization error; from such curves one can extrapolate how much additional training data is needed to reach a desired level of performance. Adding only a small fraction more examples usually has no noticeable impact on generalization error, so it is best to experiment with training set sizes on a logarithmic scale, for example doubling the number of examples between consecutive experiments.
If gathering much more data is not feasible, the only remaining way to improve generalization error is to improve the learning algorithm itself. This becomes a matter of research rather than advice for applied practitioners.
Selecting Hyperparameters
Deep learning algorithms come with many hyperparameters that control many aspects of their behavior. Some hyperparameters affect the time and memory cost of running the algorithm, while others affect the quality of the trained model and its ability to make accurate predictions on new inputs.
There are two basic approaches to choosing hyperparameters: manual selection and automatic selection. Selecting them manually requires understanding what each hyperparameter does and how machine learning models achieve good generalization. Automatic hyperparameter selection algorithms reduce the need for this knowledge, but they are usually much more computationally expensive.
To set hyperparameters manually, one must understand the relationship between hyperparameters, training error, generalization error, and computational resources such as memory and runtime. This understanding builds on the fundamental ideas concerning the effective capacity of a learning algorithm.
The goal of manual hyperparameter search is usually to find the lowest generalization error subject to some runtime and memory budget. We do not discuss here how to determine the runtime and memory impact of various hyperparameters, because this is highly platform-dependent.
The primary goal of manual hyperparameter search is to adjust the effective capacity of the model to match the complexity of the task. Effective capacity is constrained by three factors: the representational capacity of the model, the ability of the learning algorithm to minimize the cost function used to train the model, and the degree to which the cost function and training procedure regularize the model. A model with more layers and more hidden units has greater representational capacity and can represent more complicated functions, but it cannot necessarily learn all of these functions if the training algorithm fails to discover functions that minimize the training cost, or if regularization terms such as weight decay forbid some of them.
The generalization error typically follows a U-shaped curve when plotted as a function of a hyperparameter's value. At one extreme, the hyperparameter corresponds to low capacity, and generalization error is high because training error is high; this is the underfitting regime. At the other extreme, the hyperparameter corresponds to high capacity, and generalization error is high because the gap between training and test error is large; this is the overfitting regime. The optimal model capacity lies somewhere in the middle, achieving the lowest generalization error by balancing a moderate generalization gap against a moderate training error.
For some hyperparameters, overfitting occurs when the value is large; for others, when it is small. For instance, increasing the number of hidden units in a layer increases the capacity of the model, so overfitting becomes possible when this number is large. Conversely, the weight decay coefficient increases effective capacity as it is decreased, with a value of zero giving the learning algorithm its maximum effective capacity, so overfitting becomes possible when the coefficient is small. Careful tuning of hyperparameters is therefore needed to balance model capacity and prevent overfitting.
Not every hyperparameter can sweep out the entire U-shaped performance curve. Many hyperparameters are discrete, such as the number of units in a layer, so only a few points along the curve can be explored; binary hyperparameters act as switches for optional components and can visit only two points. Other hyperparameters have a minimum or maximum value that cuts off part of the curve; for example, the minimum weight decay coefficient is zero, so if the model underfits with zero weight decay, no setting of that coefficient can move it into the overfitting region. In other words, some hyperparameters can only subtract capacity, not add it.
The learning rate is the most important hyperparameter to tune because it controls the effective capacity of the model in a complicated way: effective capacity is highest when the learning rate is correct for the optimization problem, not when it is merely large or small. A learning rate that is too large can cause gradient descent to increase rather than decrease the training error, while a rate that is too small not only slows training but can leave the model permanently stuck with a high training error; this effect is poorly understood and does not occur with a convex loss function. When tuning hyperparameters other than the learning rate, monitor both training and test error to determine whether the model is overfitting or underfitting, and adjust its capacity accordingly.
If the error on your training set is higher than your target error rate, you have little choice but to increase the model's capacity. If you are not using regularization and you are confident that your optimization algorithm is working correctly, add more layers or more hidden units to the network, bearing in mind that this increases the computational cost of the model.
If your error on the test set is higher than your target error rate, you can act on either the training error itself or on the gap between training and test error.
Figure 11.1: Training error as a function of the learning rate. A learning rate above the optimal value causes training error to rise sharply within a fixed training time, whereas a learning rate below it merely slows training down.
Generalization error can also be influenced by the regularization effects associated with the learning rate. Achieving optimal test error involves balancing training error against the gap between training and test error; neural networks typically perform best when training error is low and the gap is kept small, which is achieved by adjusting regularization hyperparameters such as dropout or weight decay. Typically, the best performance is attained with a large model that is effectively regularized.
Most hyperparameters can be set by reasoning about whether they increase or decrease model capacity. Some examples are given in Table 11.1.
While tuning hyperparameters, keep sight of the end goal: strong performance on the test set. Adding regularization is only one way to achieve this; if training error is low, generalization error can also be reduced by collecting more training data. A brute-force way to practically guarantee success is to keep increasing both model capacity and training set size until the task is solved, although this increases the computational cost of training and inference and is therefore only feasible given sufficient resources.
Table 11.1: The effect of various hyperparameters on model capacity.

| Hyperparameter | Increases capacity when... | Reason | Caveats |
| --- | --- | --- | --- |
| Number of hidden units | increased | Increasing the number of hidden units increases the representational capacity of the model. | Increasing the number of hidden units increases both the time and memory cost of essentially every operation on the model. |
| Learning rate | tuned optimally | An improper learning rate, whether too high or too low, results in optimization failure and a model with low effective capacity. | |
| Convolution kernel width | increased | Increasing the kernel width increases the number of parameters in the model. | A wider kernel results in a narrower output dimension, reducing model capacity unless implicit zero padding is used to offset this effect. Wider kernels require more memory for parameter storage and increase runtime, although the narrower output reduces memory cost. |
| Implicit zero padding | increased | Adding implicit zeros before convolution keeps the representation size large. | Increases the time and memory cost of most operations. |
| Weight decay coefficient | decreased | Decreasing the weight decay coefficient frees the model parameters to become larger. | |
| Dropout rate | decreased | Dropping units less often gives the units more opportunities to "conspire" with each other to fit the training set. | |

Table 11.1 summarizes how individual hyperparameters affect model capacity. The brute-force strategy of increasing capacity and data can fail because of optimization difficulties, but for many problems optimization does not seem to be a significant barrier, provided an appropriate model is selected.
Debugging Strategies
When a machine learning system performs poorly, it is usually hard to tell whether the poor performance is intrinsic to the algorithm or caused by a bug in its implementation. Machine learning systems are difficult to debug for several reasons.
In most cases, we cannot determine in advance how the algorithm is supposed to behave. Indeed, the whole point of machine learning is that it discovers useful behavior we could not specify ourselves. If a neural network trained on a new classification task achieves 5% test error, we have no immediate way of knowing whether this is the expected behavior or suboptimal.
A further difficulty is that most machine learning models have multiple adaptive parts, so an error in one part can be masked by the others. Suppose, for instance, that we train a network with several layers defined by weights and biases, but implement the gradient descent rule for the biases incorrectly, say with an update that ignores the gradient and simply decrements the biases at every step, driving them to increasingly negative values. Such a mistake clearly undermines learning, yet the model's overall performance can still look acceptable, because the weights may adapt to compensate for the faulty biases, making the underlying defect hard to notice.
Effective debugging strategies for neural networks typically focus on addressing two main challenges: simplifying the case to allow for predictable outcomes or creating tests that isolate and evaluate specific components of the neural network implementation.
Some important debugging tests include:
Visualize the model in action. When training a model to detect objects in images, view some images with the model's proposed detections overlaid on them; when training a generative model of speech, listen to some of the samples it produces. This may seem obvious, but it is easy to fall into the habit of looking only at quantitative measures such as accuracy or log-likelihood, which can be misleading. Directly observing the model performing its task helps estimate whether the quantitative numbers are realistic and can expose evaluation bugs that make a broken system appear to perform well.
Visualize the worst mistakes. Most models output some sort of confidence measure for their classifications; softmax output layers, for example, assign a probability to each class. Although these probabilities typically overestimate the likelihood of a correct prediction, they can still be used to rank training examples by how likely the model is to label them correctly. Viewing the hardest-to-classify examples often reveals problems in how the data were preprocessed or labeled. The Street View transcription system, for instance, originally cropped some images so tightly that digits of the address were omitted, and the transcription network assigned very low confidence to the correct answer on these examples. Examining these worst mistakes revealed the systematic cropping problem; widening the crop regions used by the detection system substantially improved overall performance, even though the transcription network then had to handle greater variability in the position and scale of the digits.
Reason about software correctness using train and test error. If training error is low but test error is high, it is likely that the training procedure works correctly and that the model overfits for fundamental algorithmic reasons; an alternative possibility is that the test error is measured incorrectly, for example because of a problem with saving and reloading the model, or because the test data were prepared differently from the training data. If both train and test errors are high, it is hard to tell whether there is a software defect or whether the model underfits for fundamental algorithmic reasons. This scenario requires the additional tests described next.
Fit a tiny dataset. If error on the training set is high, first determine whether the cause is genuine underfitting or a software defect. Even a small model can usually be guaranteed to fit a sufficiently small dataset; a classification dataset with a single example, for instance, can be fit just by setting the biases of the output layer correctly. If a classifier cannot correctly label a single example, if an autoencoder cannot reproduce it with high fidelity, or if a generative model cannot consistently emit samples resembling it, there is a software defect preventing successful optimization. The same test can be extended to small datasets with a few examples.
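A sketch of this test is shown below; `train_step` is a hypothetical helper that performs one optimization step on a single example, and the tolerance is arbitrary.

```python
import numpy as np

def can_fit_one_example(train_step, params, x, y, steps=2000, tol=1e-3):
    """Repeatedly optimize on a single (x, y) pair; if the loss cannot be driven
    near zero, suspect a software defect in the gradients, updates, or loss."""
    loss = np.inf
    for _ in range(steps):
        params, loss = train_step(params, x, y)
    return loss < tol
```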
Compare back-propagated derivatives to numerical derivatives. If you implement your own gradient computations in a software framework, or add a new operation to a differentiation library, a common source of error is implementing the gradient expression incorrectly. One way to verify that your automatic differentiation is correct is to compare the derivatives it computes with derivatives obtained by finite differences.
Because the derivative is defined as \( f'(x) = \lim_{\epsilon \to 0} \frac{f(x+\epsilon) - f(x)}{\epsilon} \), we can approximate it by using a small, finite \( \epsilon \): \( f'(x) \approx \frac{f(x+\epsilon) - f(x)}{\epsilon} \).
We can improve the accuracy of the approximation by using the centered difference: \( f'(x) \approx \frac{f(x + \frac{1}{2}\epsilon) - f(x - \frac{1}{2}\epsilon)}{\epsilon} \).
The perturbation size \( \epsilon \) must be chosen large enough to ensure that the perturbation is not rounded away by finite-precision numerical computations.
To test the gradient or Jacobian of a vector-valued function g: R^m → R^n, we face the limitation that finite differencing evaluates only one derivative at a time. We can either run finite differencing mn times to evaluate all the partial derivatives of g, or apply the test to a new function that uses random projections at both the input and the output of g. For example, we can test the derivative implementation on f(x) = uᵀg(vx), where u and v are randomly chosen vectors; computing f'(x) correctly requires back-propagating through g correctly, yet it is efficient to check with finite differences because f has a single input and a single output. It is usually a good idea to repeat this test with several values of u and v, to reduce the chance that the test misses mistakes that happen to be orthogonal to the random projection.
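The sketch below implements this check with centered differences and random projections; `g` is the function under test and `jacobian_g` is assumed to return the Jacobian claimed by your back-propagation code (both names are illustrative).

```python
import numpy as np

def centered_difference(f, x, eps=1e-5):
    """Centered-difference approximation to f'(x) for a scalar function f."""
    return (f(x + 0.5 * eps) - f(x - 0.5 * eps)) / eps

def check_jacobian(g, jacobian_g, m, n, trials=5, tol=1e-4, seed=0):
    """g: R^m -> R^n.  Each trial projects the input and output with random
    vectors v and u, so only one scalar finite difference is needed."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        u, v = rng.normal(size=n), rng.normal(size=m)
        f = lambda x, u=u, v=v: u @ g(x * v)          # f: R -> R
        analytic = u @ jacobian_g(0.5 * v) @ v        # f'(0.5) via the claimed Jacobian
        numeric = centered_difference(f, 0.5)
        if abs(analytic - numeric) > tol * max(1.0, abs(numeric)):
            return False
    return True
```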
If one has access to numerical computation on complex numbers, there is a very efficient way to estimate the gradient numerically using complex values as input to the function (Squire and Trapp, 1998). The method is based on the expansion f(x + iε) = f(x) + iεf'(x) + O(ε²): the real part of f(x + iε) recovers f(x) and the imaginary part, divided by ε, recovers f'(x), each with an error of only O(ε²).
Unlike in the real-valued case, no difference of function values at different points is ever computed, so there are no cancellation effects, and extremely small values such as ε = 10⁻¹⁵⁰ can be used; the O(ε²) error then becomes negligible for all practical purposes.
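A small sketch of the complex-step estimate, assuming the function accepts complex inputs:

```python
import numpy as np

def complex_step_derivative(f, x, eps=1e-150):
    """Estimate f'(x) as imag(f(x + i*eps)) / eps.  No subtraction of nearby
    function values occurs, so there is no cancellation error."""
    return np.imag(f(x + 1j * eps)) / eps

print(complex_step_derivative(np.sin, 1.0))   # approximately cos(1.0)
```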
Monitor histograms of activations and gradients. It is often useful to visualize statistics of neural network activations and gradients collected over a large number of training iterations. These histograms show how saturated the hidden units are: for rectifier units, how often they are inactive; for tanh units, the average absolute value of the pre-activations indicates how saturated they are. In deep networks, propagated gradients can grow or shrink rapidly, hindering optimization. It is also instructive to compare the magnitude of the parameter gradients to the magnitude of the parameters themselves.
As suggested by Bottou (2015), the magnitude of the parameter updates over a minibatch should be roughly 1% of the magnitude of the parameter, not 50% or 0.001%, either of which indicates a poorly chosen learning rate. With sparse data, such as in natural language processing, some groups of parameters may be updated effectively while others are rarely updated at all, so it is important to track the progress of each group separately.
Finally, many deep learning algorithms provide some sort of guarantee about the results produced at each step, and these guarantees can be used for debugging. For example, some approximate inference algorithms use algebraic solutions to optimization problems, and the guarantees they offer can be checked directly. Typical guarantees include that the objective function will not increase after a step of the algorithm, that the gradient with respect to some subset of variables will be zero after each step, and that the gradient with respect to all variables will be zero at convergence. Because of rounding error in digital computation, these conditions will not hold exactly, so debugging tests should include a tolerance parameter.
Example: Multi-Digit Number Recognition
This section illustrates how the design methodology above was applied in the Street View transcription system, focusing on the development of its deep learning components. Other components, such as the Street View vehicles and the underlying database infrastructure, were of course also of paramount importance.
From the machine learning point of view, the process began with data collection: the cars gathered the raw data and human operators provided labels. Before transcription, a substantial amount of dataset curation took place, including the use of other machine learning techniques to detect the house numbers prior to transcribing them.
The transcription project began by choosing performance metrics and desired values for those metrics, guided by the overall business goals. Because the map is useful only if it is highly accurate, the project set a human-level accuracy requirement of 98%; the challenge was to reach this standard while also maintaining adequate coverage. Coverage therefore became the main performance metric, with accuracy held fixed at 98%. As the convolutional network improved, the confidence threshold for accepting a transcription could be lowered, and coverage ultimately exceeded the 95% target.
After the quantitative goals were set, the next step was to establish a sensible baseline system quickly; for vision tasks, this means a convolutional network with rectified linear units. The transcription project began with such a model, although at the time it was not common for convolutional networks to output a sequence of predictions. To begin with the simplest possible baseline, the first implementation of the output layer consisted of multiple softmax units predicting a sequence of characters, with each softmax unit trained completely independently, just as for a classification task.
The methodology then recommends iteratively refining the baseline and testing whether each change is an improvement. The first change to the Street View system was motivated by a theoretical understanding of the coverage metric and of the structure of the data: the network refuses to classify an input x whenever the probability of the output sequence, p(y | x), falls below some threshold t. Initially the definition of p(y | x) was ad hoc, based simply on multiplying the softmax outputs together. This motivated the development of a specialized output layer and cost function that computed a principled log-likelihood, which made the example rejection mechanism work far more effectively.
At this point coverage was still below 90%, yet there were no obvious theoretical problems with the approach. The methodology therefore suggested comparing train and test set performance; because the two were similar, the difficulty was either underfitting or a problem with the training data. The project had tens of millions of labeled examples available, which made progress smooth. To diagnose the model's worst errors, the team visualized the incorrect training set transcriptions to which the model assigned the highest confidence; many of these turned out to be examples where the image had been cropped too tightly, cutting off part of the address. Rather than spending weeks improving the address number detection system that determines the crop, the team simply widened the crop region, which alone increased the transcription system's coverage by ten percentage points.
The final percentage points of performance came from adjusting hyperparameters, chiefly by making the model larger while keeping its computational cost within bounds. Because train and test errors remained similar, it was clear that any remaining deficit was due to underfitting, along with a few remaining problems in the dataset itself. Overall, the transcription project was a great success, allowing hundreds of millions of addresses to be transcribed both faster and at lower cost than would have been possible with human effort.
We hope that the design principles described in this chapter will lead to many other similar successes.
This chapter describes how deep learning is applied in fields such as computer vision, speech recognition, and natural language processing. We begin with the large-scale neural network implementations required for most serious AI applications, then review the specific application areas to which deep learning has contributed. Although one goal of deep learning is to design algorithms that work across a wide variety of tasks, some specialization remains necessary; for example, vision tasks require processing a huge number of input features (pixels), while language tasks require modeling a huge number of possible values (words in the vocabulary).
Large-Scale Deep Learning
Deep learning is rooted in connectionism: although an individual neuron or feature is not intelligent, a large population of neurons or features acting together can exhibit intelligent behavior. The number of neurons matters: the dramatic increase in network size over the past three decades has been crucial to improving the accuracy of neural networks and the complexity of the tasks they can solve, yet even today's artificial neural networks are only about as large as the nervous systems of insects.
Because the size of neural networks is of paramount importance, deep learning requires high performance hardware and software infrastructure.
Traditionally, neural networks were trained on the CPU of a single machine; today this approach is generally considered inadequate, and most practitioners use GPU computing or the combined processing power of many machines networked together. Before moving to these expensive setups, researchers worked hard to demonstrate that CPUs could not manage the high computational workload that neural networks require.
Implementing efficient numerical CPU code is important, because careful specialization for a specific CPU family can yield large improvements. For example, Vanhoucke et al. (2011) showed that a carefully tuned fixed-point implementation achieved a threefold speedup over a strong floating-point baseline on neural network workloads. Each new CPU model has different performance characteristics, however, so floating-point implementations are sometimes faster, and both options should be considered. Beyond the choice of arithmetic, optimizing data structures to avoid cache misses and using vector instructions are also essential. Neglecting these implementation details can limit the size of the model that can be trained, and hence its accuracy.
Most modern neural network implementations rely on graphics processing units (GPUs), specialized hardware originally developed for graphics applications. The demand for high-performance graphics in the video gaming market drove the development of GPU hardware, and the performance characteristics needed for gaming turn out to be beneficial for neural networks as well.
Video game rendering requires performing many operations in parallel very quickly. Character and environment models are specified as lists of 3-D vertex coordinates, which the graphics card converts to 2-D on-screen coordinates using many matrix multiplications and divisions in parallel; the color of each pixel is then also computed in parallel. These computations are simpler and involve less branching than typical CPU workloads; for example, all the vertices of a rigid object are multiplied by the same matrix, with no need to evaluate a conditional per vertex. The computations are also largely independent of one another, so they are easy to parallelize, and they involve large memory buffers holding the texture bitmaps describing each object to be rendered. As a result, graphics cards are designed for high parallelism and high memory bandwidth, at the cost of lower clock speeds and weaker branching capability relative to traditional CPUs.
Neural network algorithms require performance characteristics similar to those of real-time graphics: they involve large buffers of parameters, activation values, and gradient values, all of which must be updated during every training step. These buffers are often too large to fit in the cache of a traditional desktop computer, so memory bandwidth becomes the critical factor; GPUs offer a decisive advantage over CPUs because of their much higher memory bandwidth. Because neural network training usually involves little branching or sophisticated control flow, it is well suited to GPU hardware, and because the individual "neurons" within a layer can be processed independently of one another, neural networks benefit readily from the parallelism of GPU computing.
GPU hardware was originally so specialized that it could be used only for graphics tasks. Over time it became more flexible, allowing custom subroutines to be used for transforming vertex coordinates and assigning colors to pixels. In principle these pixel values need not correspond to any actual rendering task, so GPUs could be used for scientific computing by writing the output of a computation into a buffer of pixel values.
In 2005, researchers implemented a two-layer fully connected neural network on a GPU in this way and reported a threefold speedup over their CPU-based baseline. Shortly afterward, Chellapilla et al. (2006) demonstrated that the same technique could be used to accelerate supervised convolutional networks.
The rise of general-purpose GPUs (GP-GPUs) significantly boosted the use of graphics cards for neural network training, as these GPUs could run arbitrary code beyond just rendering tasks NVIDIA's CUDA programming language made it easier to write this code in a C-like syntax With their user-friendly programming model, substantial parallel processing capabilities, and high memory bandwidth, GP-GPUs have become the preferred platform for neural network programming, quickly embraced by deep learning researchers following their introduction.
Writing efficient code for GPUs presents unique challenges that require specialized knowledge, as the performance optimization techniques differ significantly from those used for CPUs. Unlike CPUs, which rely heavily on cache memory, GPUs often benefit from recalculating values rather than retrieving them from memory, due to the lack of caching for writable memory locations. Additionally, GPU programming is inherently multi-threaded, necessitating careful coordination among threads to optimize memory operations through coalescing. Coalesced memory transactions occur when multiple threads read or write values simultaneously, enhancing performance. Effective memory access patterns typically involve threads accessing memory addresses that are multiples of a power of 2. Furthermore, GPU threads are organized into groups called warps, where all threads in a warp must execute the same instruction at the same time, complicating branching and requiring sequential execution of different code paths within the same warp.
To simplify the process of writing high-performance GPU code, researchers should design their workflow to minimize the need for new GPU coding when testing models or algorithms. This can be achieved by creating a software library that includes efficient operations, such as convolution and matrix multiplication, and then defining models through calls to this library. For instance, the machine learning library Pylearn2 utilizes Theano and cuda-convnet for its high-performance operations. This modular approach not only streamlines the coding process but also enhances compatibility with various hardware, allowing Theano programs to run on both CPU and GPU without modifications. Other libraries, including TensorFlow and Torch, offer similar functionalities.
In many cases, the computational resources available on a single machine are insufficient. We therefore want to distribute the workload of training and inference across many machines.
Distributing inference is simple, because each input example we want to process can be run by a separate machine. This is known as data parallelism.
Model parallelism enables multiple machines to collaborate on a single data point, with each machine executing a distinct segment of the model. This approach is applicable for both training and inference processes.
Data parallelism in training poses challenges, as increasing the minibatch size for a single SGD step often yields diminishing returns in optimization performance. A more effective approach is to enable multiple machines to compute multiple gradient descent steps concurrently. However, traditional gradient descent is inherently sequential, with each step dependent on the parameters from the previous one. This limitation can be addressed through asynchronous stochastic gradient descent, where multiple processor cores share memory and compute gradients without locking. Although this method may lead to some overlap in progress among cores, it accelerates the overall learning process. Dean et al. (2012) advanced this concept by implementing a lock-free gradient descent scheme with parameters managed by a parameter server, which has since become the predominant method for training large deep networks, widely adopted by leading deep learning organizations in the industry.
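To make the asynchronous scheme concrete, here is a minimal sketch of lock-free asynchronous SGD on a shared parameter vector, using a toy least-squares problem and Python threads. The learning rate, batch size, and thread count are illustrative, and in CPython the GIL serializes the threads, so the point is the lock-free update pattern rather than a true parallel speedup.

    import numpy as np
    from threading import Thread

    # Toy problem: linear regression y = X w_true + noise.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    w_true = rng.normal(size=10)
    y = X @ w_true + 0.01 * rng.normal(size=1000)

    w = np.zeros(10)   # shared parameters, read and written by all workers without locks
    lr = 0.01

    def worker(seed, n_steps=2000, batch=32):
        local_rng = np.random.default_rng(seed)
        for _ in range(n_steps):
            idx = local_rng.integers(0, len(X), size=batch)
            # Read the (possibly stale) shared parameters, compute a minibatch
            # gradient, and write the update back without any locking.
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
            w[:] = w - lr * grad

    threads = [Thread(target=worker, args=(s,)) for s in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("parameter error:", np.linalg.norm(w - w_true))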
In 2015, it was noted that academic deep learning researchers often lack access to large-scale distributed learning systems However, some studies, such as Coates et al (2013), have explored methods to create distributed networks using affordable hardware that is accessible within university environments.
In commercial applications, minimizing the time and memory costs of running inference in machine learning models is often more crucial than reducing those costs during training For non-personalized applications, a model can be trained once and deployed for use by billions of users Typically, end users face greater resource constraints than developers; for instance, a speech recognition model may be trained on a powerful computer cluster but ultimately deployed on mobile devices.
Computer Vision
Computer vision has long been a prominent research focus in deep learning due to its inherent difficulty for computers, despite being a simple task for humans and animals (Ballard et al., 1983) A significant portion of standard benchmark tasks for deep learning algorithms revolves around object recognition and optical character recognition.
Computer vision is a diverse field focused on image processing and has a wide range of applications It aims to replicate human visual capabilities, such as facial recognition, while also developing innovative visual functionalities A notable recent advancement in this area includes the ability to interpret sound waves by analyzing the vibrations they cause in visible objects within a video.
Most deep learning research in computer vision has concentrated on core AI objectives that mimic human capabilities, primarily focusing on object recognition and detection This includes identifying objects in images, annotating them with bounding boxes, transcribing symbols, and pixel-level labeling Additionally, generative modeling has played a significant role in deep learning, leading to substantial advancements in image synthesis Although creating images from scratch isn't typically classified as a computer vision task, models developed for this purpose are valuable for image restoration, which involves repairing image defects and removing unwanted objects.
Sophisticated preprocessing is essential in various application areas due to the challenging nature of original input formats for many deep learning architectures. In contrast, computer vision typically demands minimal preprocessing, primarily focusing on standardizing image pixel values to a consistent range, such as [0, 1] or [-1, 1]. Mixing images from different ranges, like [0, 1] and [0, 255], can lead to processing failures. The most critical preprocessing step is ensuring that images are scaled uniformly. Additionally, many computer vision models necessitate images of a specific size, requiring cropping or scaling; however, some convolutional models can handle variable-sized inputs and adapt their pooling regions accordingly to maintain a consistent output size.
Various convolutional models, like those developed by Hadsell et al (2007), produce variable-sized outputs that dynamically adjust according to the input, enabling tasks such as image denoising and pixel labeling.
Dataset augmentation serves as an effective preprocessing technique for training sets, significantly reducing the generalization error in computer vision models By presenting the model with various versions of the same input, such as images cropped from different locations, we can implement an ensemble approach during testing This method allows different instantiations of the model to vote on the output, further enhancing accuracy and minimizing generalization error.
Preprocessing techniques are applied to both training and test sets to standardize examples, minimizing variation that models must address By reducing data variability, generalization error decreases, and smaller models can effectively handle simpler tasks, leading to better generalization Such preprocessing typically targets easily describable input variability that is deemed irrelevant to the task by human designers However, with large datasets and models, this preprocessing may be unnecessary, allowing the model to autonomously learn which variations to ignore For instance, the AlexNet system for ImageNet classification employs a single preprocessing step: subtracting the mean pixel values from training examples (Krizhevsky et al., 2012).
One significant source of variation that can be eliminated for various tasks is the contrast level in an image Contrast indicates the difference in intensity between bright and dark pixels There are multiple methods to quantify image contrast, particularly in deep learning, where it is often defined by the standard deviation of pixel values within an image or a specific region.
Suppose we have an image X \in \mathbb{R}^{r \times c \times 3}, with X_{i,j,1} being the red intensity at row i and column j, X_{i,j,2} giving the green intensity and X_{i,j,3} giving the blue intensity. Then the contrast of the entire image is given by

\mathrm{contrast} = \sqrt{\frac{1}{3rc}\sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{3}\left(X_{i,j,k}-\bar{X}\right)^{2}},   (12.1)

where \bar{X} is the mean intensity of the entire image:

\bar{X} = \frac{1}{3rc}\sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{3} X_{i,j,k}.   (12.2)
Global contrast normalization (GCN) addresses the issue of varying image contrasts by subtracting the mean from each image and rescaling it so that the standard deviation across its pixels equals some constant s. This method encounters challenges with zero-contrast images, where all pixels have equal intensity, since no rescaling can alter their contrast. Additionally, images with very low but non-zero contrast often contain little information, and dividing by the true standard deviation can amplify sensor noise or compression artifacts. To mitigate these issues, a small, positive regularization parameter λ is introduced to bias the estimate of the standard deviation, or alternatively, the denominator can be constrained to a minimum value ε. Given an input image X, GCN therefore produces an output image X' defined by

X'_{i,j,k} = s\,\frac{X_{i,j,k}-\bar{X}}{\max\left(\epsilon,\ \sqrt{\lambda+\frac{1}{3rc}\sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{3}\left(X_{i,j,k}-\bar{X}\right)^{2}}\right)}.   (12.3)
Datasets featuring large images cropped to focus on interesting objects typically do not include images with nearly constant intensity. Therefore, it is generally acceptable to disregard the small-denominator issue by setting λ to 0, while preventing division by zero in rare instances by assigning a very low value to ε.
Goodfellow et al. (2013a) applied GCN to the CIFAR-10 dataset. Small, randomly cropped images are more likely to have nearly constant intensity, which makes aggressive regularization particularly beneficial; Coates et al. (2011) used ε = 0 and λ = 10 on small, randomly selected patches drawn from CIFAR-10.
The scale parameter \( s \) is typically set to a fixed value, as demonstrated by Coates et al (2011), or it can be adjusted to ensure that the standard deviation of each individual pixel across examples is approximately 1, as illustrated by Goodfellow et al (2013a).
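A minimal sketch of GCN as described by equation 12.3, assuming a single image stored as an (r, c, 3) array; the function name and defaults are illustrative, not a reference implementation.

    import numpy as np

    def global_contrast_normalize(X, s=1.0, lam=0.0, eps=1e-8):
        """Global contrast normalization of one image X with shape (r, c, 3).

        Subtract the mean intensity, then rescale so that the (regularized)
        standard deviation of the pixel values equals s.
        """
        X = X.astype(float)
        X = X - X.mean()                        # remove the mean intensity
        contrast = np.sqrt(lam + np.mean(X ** 2))
        return s * X / max(contrast, eps)       # clamp the denominator to avoid division by zero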
The standard deviation, as defined in equation 12.3, is simply a rescaling of the L2 norm of an image after its mean has been removed. Defining global contrast normalization (GCN) in terms of the standard deviation is advantageous because it normalizes by the number of pixels, allowing the same s to be used across different image sizes. Because the L2 norm is proportional to the standard deviation, it is helpful to understand GCN as mapping examples onto a spherical shell. This property is beneficial since neural networks typically respond better to directions in space than to specific locations. Responding to varying distances in a single direction requires hidden units with collinear weight vectors and distinct biases, which is a challenge for learning algorithms. Moreover, shallow graphical models struggle to represent multiple separated modes along the same axis, a limitation that GCN sidesteps by reducing each example to a direction alone, rather than a combination of direction and distance.
Sphering is a preprocessing operation distinct from GCN. Rather than making the data lie on a spherical shell, sphering rescales the principal components of the data so that they have equal variance; sphering is more commonly known as whitening.
GCN maps examples onto a sphere. Raw input data may have widely varying norms; applying GCN with λ = 0, s = 1 and ε = 10^{-8} maps every non-zero example exactly onto a sphere. Because this form of GCN normalizes the standard deviation rather than the L2 norm, the resulting sphere is not the unit sphere. With λ > 0, regularized GCN draws examples toward the sphere without completely discarding the variation in their norms.
Speech Recognition
Speech recognition involves converting spoken language into the sequence of words intended by the speaker. The acoustic input is typically represented as a sequence X = (x^{(1)}, x^{(2)}, \ldots, x^{(T)}), processed in frames of roughly 20 ms. While many systems utilize hand-crafted features for preprocessing, some deep learning approaches, as noted by Jaitly and Hinton (2011), learn features directly from raw audio input. The goal of automatic speech recognition (ASR) is to develop a function f^*_{ASR} that identifies the most probable linguistic sequence y = (y_1, y_2, \ldots, y_N) given the acoustic sequence X:

f^*_{ASR}(X) = \arg\max_{y} P^*(y \mid X = X),   (12.4)

where P^* is the true conditional distribution relating the inputs X to the targets y.
From the 1980s until around 2009-2012, advanced speech recognition systems primarily utilized hidden Markov models (HMMs) in conjunction with Gaussian mixture models (GMMs), where GMMs linked acoustic features to phonemes and HMMs represented phoneme sequences The GMM-HMM framework generated acoustic waveforms by first producing a phoneme sequence through HMMs, followed by GMMs converting these symbols into audio segments Despite GMM-HMM systems' dominance, neural networks were among the first technologies applied in automatic speech recognition (ASR) during the late 1980s and early 1990s, achieving performance comparable to GMM-HMM systems The TIMIT corpus became a benchmark for phoneme recognition, similar to MNIST for object recognition However, the complexity of engineering GMM-HMM systems delayed the industry's shift to neural networks, leading to a focus on enhancing GMM-HMM systems with neural nets until the late 2000s With advancements in deep learning and larger datasets, neural networks began to replace GMMs, significantly improving recognition accuracy Starting in 2009, researchers adopted deep learning techniques, specifically using restricted Boltzmann machines (RBMs) for unsupervised learning in speech recognition.
Unsupervised pretraining was employed to develop deep feedforward networks for speech recognition tasks, initializing each layer with RBM training These networks process spectral acoustic representations within a fixed-size input window to predict the conditional probabilities of HMM states for the central frame This deep learning approach significantly enhanced recognition rates on the TIMIT dataset, reducing the phoneme error rate from approximately 26% to 20.7% Further improvements included the integration of speaker-adaptive features, which lowered the error rate even more Research quickly progressed from phoneme recognition to large-vocabulary speech recognition, expanding capabilities to recognize word sequences The focus of deep networks shifted from pretraining and Boltzmann machines to methods like rectified linear units and dropout By this time, major industry players began collaborating with academic researchers to explore deep learning, leading to significant breakthroughs now implemented in products like mobile phones.
As researchers delved into increasingly expansive labeled datasets and refined their techniques for initializing, training, and structuring deep neural networks, they discovered that the phase of unsupervised pretraining was often superfluous or failed to yield substantial enhancements in performance.
Recent advancements in speech recognition have led to an unprecedented 30% improvement in word error rates, marking a significant shift from the stagnant progress seen over the previous decade with traditional GMM-HMM technology Despite the increasing size of training datasets, error rates remained largely unchanged until the introduction of deep learning techniques Within just two years, deep neural networks became integral to most industrial speech recognition products, igniting a renewed wave of research into deep learning algorithms and architectures for automatic speech recognition (ASR), a trend that continues to evolve today.
One of these innovations was the use of convolutional networks (Sainath et al., 2013) that replicate weights across both time and frequency, improving on the earlier time-delay neural networks that replicated weights only across time. These models treat the input spectrogram as an image, with one axis corresponding to time and the other to the frequency of spectral components, leading to improved processing and analysis.
Recent advancements in end-to-end deep learning speech recognition systems have eliminated the need for hidden Markov models (HMMs). A significant milestone was achieved by Graves et al. (2013), who developed a deep long short-term memory (LSTM) recurrent neural network (RNN) utilizing MAP inference for frame-to-phoneme alignment, building on earlier work by LeCun et al. (1998b) and the connectionist temporal classification (CTC) framework (Graves et al., 2006; Graves, 2012). This deep RNN features state variables from multiple layers at each time step, resulting in both ordinary depth from the stacked layers and temporal depth from unfolding over time. This approach reduced the phoneme error rate on the TIMIT dataset to an unprecedented 17.7%. Further developments in deep RNNs have been explored by Pascanu et al. (2014a) and Chung et al. (2014) in various applications.
A significant advancement in end-to-end deep learning automatic speech recognition (ASR) is enabling systems to learn the alignment between acoustic-level and phonetic-level information (Chorowski et al., 2014; Lu et al., 2015).
Natural Language Processing
Natural Language Processing (NLP) enables computers to understand and generate human languages like English and French Unlike specialized programming languages, natural languages are often ambiguous and complex, posing challenges for formal interpretation NLP encompasses various applications, including machine translation, where a system translates sentences from one language to another Many NLP solutions rely on language models that establish a probability distribution over sequences of words, characters, or bytes, facilitating effective communication in natural languages.
Generic neural network techniques can be effectively utilized in natural language processing; however, achieving optimal performance and scalability in large applications requires domain-specific strategies To create an efficient natural language model, it is essential to employ techniques tailored for sequential data processing Typically, natural language is viewed as a sequence of words instead of individual characters or bytes Given the vast number of potential words, word-based language models must navigate a high-dimensional and sparse discrete space Consequently, various strategies have been developed to enhance the efficiency of modeling this complex space, both computationally and statistically.
A language model establishes a probability distribution for sequences of tokens in natural language, where tokens can represent words, characters, or bytes. These tokens are distinct entities, and the foundational language models utilized fixed-length sequences known as n-grams, which are defined as sequences of n tokens.
N-gram models establish the conditional probability of the n-th token based on the preceding n−1 tokens. By utilizing products of these conditional distributions, the model effectively defines the probability distribution over longer sequences.
This decomposition is justified by the chain rule of probability. The probability distribution over the initial sequence P(x_1, \ldots, x_{n-1}) may be modeled by a different model with a smaller value of n.
Training n-gram models is simple, as the maximum likelihood estimate is derived by counting the occurrences of each n-gram in the training data. For decades, n-gram-based models have served as foundational elements in statistical language modeling (Jelinek and Mercer, 1980; Katz, 1987; Chen and Goodman, 1999).
In natural language processing, models are categorized based on the value of n, with specific names assigned to small values: "unigram" for n=1, "bigram" for n=2, and "trigram" for n=3. These terms originate from Latin prefixes that correspond to the numbers, combined with the Greek suffix "-gram," which signifies something that is written.
Usually we train both an n-gram model and an n−1-gram model simultaneously. This makes it easy to compute

P(x_t \mid x_{t-n+1}, \ldots, x_{t-1}) = \frac{P_n(x_{t-n+1}, \ldots, x_t)}{P_{n-1}(x_{t-n+1}, \ldots, x_{t-1})}   (12.6)

simply by looking up two stored probabilities. For this to exactly reproduce inference in P_n, we must omit the final character from each sequence when we train P_{n-1}.
In a trigram model, the probability of the sentence “THE DOG RAN AWAY” is calculated by first addressing the initial words, which require the marginal probability because there is no preceding context; we therefore evaluate P_3(THE DOG RAN). For the final word, we apply the conditional distribution P(AWAY | DOG RAN). Combining these elements gives the probability of the entire sentence:
P(THE DOG RAN AWAY) = P_3(THE DOG RAN) \, P_3(DOG RAN AWAY) / P_2(DOG RAN).
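A small sketch of how these quantities can be estimated by maximum-likelihood counting on a toy corpus; the corpus and function names are made up for illustration.

    from collections import Counter

    corpus = "the dog ran away the dog ran home the cat ran away".split()

    trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bigrams  = Counter(zip(corpus, corpus[1:]))

    def p3(w1, w2, w3):
        # Maximum-likelihood trigram probability P(w3 | w1, w2).
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

    def p3_marginal(w1, w2, w3):
        # Marginal probability of the first three tokens, estimated by
        # the relative frequency of that trigram in the corpus.
        return trigrams[(w1, w2, w3)] / sum(trigrams.values())

    # P(the dog ran away) = P3(the dog ran) * P(away | dog ran)
    print(p3_marginal("the", "dog", "ran") * p3("dog", "ran", "away"))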
A key limitation of maximum likelihood estimation in n-gram models is that P_n, estimated from training set counts, is very likely to be zero in many cases, even when the tuple (x_{t-n+1}, \ldots, x_t) appears in the test set. This causes two kinds of failure. First, when P_{n-1} is zero, the ratio is undefined, so the model cannot produce a meaningful output at all. Second, P_{n-1} may be non-zero while P_n is zero.
When the probability P_n is zero, the test log-likelihood approaches negative infinity, an undesirable outcome. To mitigate this issue, n-gram models typically implement smoothing techniques that redistribute probability mass from observed tuples to similar unobserved ones. A fundamental approach assigns non-zero probability mass to all possible next symbols, which can be justified as Bayesian inference with a uniform or Dirichlet prior over the count parameters. Another common strategy is to form a mixture model combining higher-order and lower-order n-gram models; the higher-order models provide capacity while the lower-order models help prevent zero counts. Back-off methods look up the lower-order n-grams when the frequency of the context is too low for the higher-order model to be reliable, progressively estimating the distribution over x_t using contexts of decreasing length until a dependable estimate is found.
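The following sketch illustrates two of the smoothing strategies mentioned above, additive (add-delta) smoothing and interpolation with a lower-order model, under the assumption that raw counts are already available; the function names and the mixing weight are illustrative.

    def p_add_delta(count_ngram, count_context, vocab_size, delta=1.0):
        """Additive (add-delta) smoothing: every possible next symbol receives
        non-zero probability mass, which corresponds to Bayesian estimation
        with a Dirichlet prior over the count parameters."""
        return (count_ngram + delta) / (count_context + delta * vocab_size)

    def p_interpolated(p_high, p_low, mix=0.7):
        """Mixture of a higher-order and a lower-order n-gram estimate:
        the high-order term adds capacity, the low-order term avoids zeros."""
        return mix * p_high + (1.0 - mix) * p_low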
Classical n-gram models face significant challenges due to the curse of dimensionality, as the number of possible n-grams grows exponentially with the size of the vocabulary (|V|) Despite utilizing large training datasets, many n-grams, especially with modest values of n, remain unobserved in the training data, leading to inefficiencies in language modeling.
A classical n-gram model can be understood as a nearest-neighbor lookup, functioning similarly to k-nearest neighbors as a local non-parametric predictor However, it faces significant statistical challenges, particularly in language modeling, where different words maintain equal distances in one-hot vector space This uniformity complicates the extraction of useful information from neighboring words, as only training examples that replicate the exact context contribute to local generalization To address these issues, a language model needs to effectively share knowledge among semantically similar words.
To enhance the statistical efficiency of n-gram models, class-based language models utilize word categories to share statistical strength among words within the same category By employing a clustering algorithm, words are grouped based on their co-occurrence frequencies, allowing the model to utilize word class IDs instead of individual word IDs for contextual representation Additionally, composite models can merge word-based and class-based approaches through mixing or back-off techniques While word classes facilitate generalization by substituting words of the same class, this representation may result in the loss of significant information.
Neural language models (NLMs) effectively address the curse of dimensionality in natural language processing by utilizing distributed word representations, enabling them to recognize similarities between words while maintaining their distinct identities Unlike traditional class-based n-gram models, NLMs leverage statistical relationships between words and their contexts, allowing for shared representations of semantically related terms For instance, if "dog" and "cat" share common attributes in their representations, the model can use information from sentences containing "cat" to enhance predictions for those with "dog." This capability allows for significant generalization, linking each training sentence to an exponentially large number of related sentences, thus efficiently countering the challenges posed by the curse of dimensionality.
Word embeddings are representations of words that map raw symbols into a lower-dimensional feature space, simplifying their relationships. In this model, each word is initially represented as a one-hot vector in a high-dimensional space corresponding to the vocabulary size, where every pair of distinct words has a Euclidean distance of √2. By embedding these points into a more compact feature space, we capture the semantic similarities between words more effectively.
In the embedding space, words that frequently appear in similar contexts are positioned closely together, leading to semantically similar words being neighbors. This phenomenon is illustrated in Figure 12.3, which highlights specific areas of a learned word embedding space, demonstrating how words with similar meanings are represented in proximity to one another.
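A tiny numerical illustration of the contrast between one-hot representations, where every pair of distinct words is at distance √2, and a made-up two-dimensional embedding in which related words end up close together; the vocabulary and embedding values are purely illustrative.

    import numpy as np

    vocab = ["dog", "cat", "car"]
    one_hot = np.eye(len(vocab))
    # Every pair of distinct one-hot vectors is at Euclidean distance sqrt(2).
    print(np.linalg.norm(one_hot[0] - one_hot[1]))        # 1.414...

    # Hypothetical 2-D embeddings: semantically similar words lie near each other.
    embed = {"dog": np.array([0.9, 0.1]),
             "cat": np.array([0.8, 0.2]),
             "car": np.array([-0.7, 0.9])}
    print(np.linalg.norm(embed["dog"] - embed["cat"]))    # small
    print(np.linalg.norm(embed["dog"] - embed["car"]))    # large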
Other Applications
This section explores various applications of deep learning beyond traditional tasks such as object recognition, speech recognition, and natural language processing. In Part III of this book, we will delve deeper into additional tasks that are primarily focused on research.
Machine learning plays a crucial role in the information technology sector, particularly in item recommendations and online advertising These applications focus on predicting user-item associations to forecast actions like product purchases or anticipated gains from advertisements The internet's economy heavily relies on online advertising, with major companies like Amazon and eBay leveraging machine learning, including deep learning, for effective product recommendations Additionally, machine learning extends beyond product sales, influencing areas such as social media content selection, movie recommendations, joke suggestions, expert advice, video game matchmaking, and dating services.
The association problem is frequently approached as a supervised learning challenge, where the goal is to predict user interactions based on item and user data This involves estimating outcomes such as ad clicks, ratings, likes, purchases, monetary spending, or time spent on a product page Typically, this leads to either a regression problem, which focuses on predicting expected values, or a probabilistic classification problem, which aims to determine the likelihood of specific discrete events.
Early recommender systems utilized minimal inputs, primarily user IDs and item IDs, to make predictions based on the similarity of preferences between users or items. For instance, if users 1 and 2 both enjoy items A, B, and C, it suggests they share similar tastes; if user 1 likes item D, user 2 probably does too. This principle is the foundation of collaborative filtering, which includes both non-parametric methods, like nearest-neighbor approaches, and parametric methods that learn distributed representations (embeddings) for users and items. A notable parametric method is bilinear prediction, which computes a rating as the dot product of a user embedding and an item embedding, adjusted by user- and item-specific biases. With a user embedding matrix A, an item embedding matrix B, and bias vectors b and c, the matrix of predictions \hat{R} has entries

\hat{R}_{u,i} = b_u + c_i + \sum_j A_{u,j} B_{i,j}.
Typically one wants to minimize the squared error between the predicted ratings \hat{R} and the actual ratings R on the observed entries.
User and item embeddings can be effectively visualized in low dimensions or compared similarly to word embeddings. These embeddings can be derived from a singular value decomposition (SVD) of the actual ratings matrix R, which factorizes R into lower-rank matrices. However, SVD has limitations, as it arbitrarily assigns zero values to missing entries, which can skew results. To address this, minimizing the sum of squared errors on observed ratings through gradient-based optimization offers a more accurate approach. Both SVD and bilinear prediction methods showed strong performance in the Netflix prize competition.
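A minimal sketch of the bilinear model with user and item biases, trained by gradient descent on the squared error over observed entries only; the sizes, learning rate, and synthetic ratings are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n_users, n_items, d = 50, 40, 5

    # Synthetic ratings in {1, ..., 5}, with only ~30% of entries observed.
    R = rng.integers(1, 6, size=(n_users, n_items)).astype(float)
    observed = rng.random((n_users, n_items)) < 0.3

    A = 0.1 * rng.normal(size=(n_users, d))   # user embeddings
    B = 0.1 * rng.normal(size=(n_items, d))   # item embeddings
    b = np.zeros(n_users)                     # user biases
    c = np.zeros(n_items)                     # item biases
    lr = 0.01

    for step in range(500):
        R_hat = A @ B.T + b[:, None] + c[None, :]
        err = np.where(observed, R_hat - R, 0.0)   # error on observed entries only
        A_grad, B_grad = err @ B, err.T @ A
        A -= lr * A_grad
        B -= lr * B_grad
        b -= lr * err.sum(axis=1)
        c -= lr * err.sum(axis=0)

    print("squared error on observed ratings:", (err ** 2).sum() / observed.sum())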
Between 2006 and 2009, a competition aimed at predicting film ratings based on previous ratings from a large pool of anonymous users significantly advanced research in recommender systems. Many machine learning experts participated, leading to notable improvements in the field. Although not a standalone winner, the simple bilinear prediction method and singular value decomposition (SVD) were integral components of the ensemble models used by many competitors, including the eventual winners.
One of the early applications of neural networks in collaborative filtering involves the use of restricted Boltzmann machines (RBMs), as highlighted by Salakhutdinov et al. (2007). RBMs played a crucial role in the ensemble of techniques that achieved success in the Netflix competition (Töscher et al., 2009; Koren, 2009). Additionally, the neural networks community has investigated more advanced variations of factorizing the ratings matrix (Salakhutdinov and Mnih, 2008).
Collaborative filtering systems face a significant challenge known as the cold-start problem, which arises when new items or users lack rating history, making it difficult to assess their similarities with existing entities To address this issue, additional information about users and items can be incorporated, such as user profiles or item characteristics These enhanced systems are referred to as content-based recommender systems By utilizing deep learning architectures, a comprehensive mapping from diverse user and item features to embeddings can be effectively learned, improving recommendation accuracy (Huang et al., 2013; Elkahky et al., 2015).
Specialized deep learning architectures, particularly convolutional networks, are utilized to extract features from complex content like musical audio tracks for music recommendation purposes In this approach, acoustic features serve as input to the convolutional network, which generates an embedding for each song The prediction of whether a user will listen to a song is made by calculating the dot product between the song embedding and the user's embedding.
When making recommendations, we encounter challenges that extend beyond traditional supervised learning into reinforcement learning, particularly in the context of contextual bandits. Recommendation systems often provide a biased and incomplete understanding of user preferences, as we only observe users' responses to recommended items, neglecting their reactions to unshown alternatives. This lack of information can lead to a scenario where we fail to learn about the correct choices, especially if initial recommendations had low probabilities of success. In reinforcement learning, only the reward from the chosen action is visible, complicating the learning process. The bandit problem simplifies this by associating each action with a specific reward, whereas general reinforcement learning involves tracking the influence of past actions on current rewards. Contextual bandits enhance this by allowing actions to be informed by input variables, such as user identity, creating a feedback loop that is crucial for effective learning and decision-making.
Reinforcement learning involves a critical trade-off between exploration and exploitation. Exploitation focuses on leveraging the best-known actions from the current policy to secure high rewards, while exploration aims to gather more training data by trying new actions. For instance, if a specific action yields a known reward, it may be tempting to continue exploiting it. However, exploring alternative actions could lead to discovering even better rewards, despite the risk of lower outcomes. Ultimately, both strategies contribute to enhancing our understanding of the environment.
Exploration can be approached in various ways, including random actions that aim to cover all possible options and model-based strategies that select actions based on predicted rewards and the uncertainty surrounding those rewards.
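As a concrete example of the trade-off, here is a minimal epsilon-greedy sketch for a three-armed bandit, in which a fixed fraction of actions explore at random and the rest exploit the current value estimates; the reward probabilities and the value of epsilon are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    true_reward_prob = np.array([0.3, 0.5, 0.7])   # unknown to the agent
    counts = np.zeros(3)
    values = np.zeros(3)    # running estimate of each arm's expected reward
    epsilon = 0.1           # fraction of purely exploratory actions

    for t in range(10_000):
        if rng.random() < epsilon:
            a = int(rng.integers(3))         # explore: pick a random action
        else:
            a = int(np.argmax(values))       # exploit: pick the best action so far
        r = float(rng.random() < true_reward_prob[a])   # only the chosen arm's reward is seen
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]        # incremental mean update

    print(values)   # approaches the true reward probabilities of the arms it tried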
The balance between exploration and exploitation is influenced by various factors, with time scale being a key determinant When an agent has limited time to gather rewards, a focus on exploitation is favored Conversely, with a longer time frame, initial exploration is prioritized to enhance future decision-making through increased knowledge As time advances and the agent's policy becomes more refined, the strategy gradually shifts towards greater exploitation.
Supervised learning eliminates the trade-off between exploration and exploitation, as the supervision signal clearly indicates the correct output for each input This approach removes the necessity of testing various outputs to identify a superior option, since the label provided always represents the optimal output.
In reinforcement learning, a significant challenge beyond the exploration-exploitation trade-off is the evaluation and comparison of different policies The interaction between the learner and the environment creates a feedback loop that complicates performance assessment, as the policy dictates the inputs encountered Techniques for evaluating contextual bandits, as discussed by Dudik et al (2011), provide insights into addressing this issue.
12.5.2 Knowledge Representation, Reasoning and Question Answering
Probabilistic PCA and Factor Analysis
Probabilistic PCA, factor analysis, and various other linear factor models are specific instances of equations 13.1 and 13.2. The primary differences among these models lie in the selection of the noise distribution and the prior over the latent variables h before the observation of data x.
In factor analysis, as outlined by Bartholomew (1987) and Basilevsky (1994), the latent variable prior is a unit-variance Gaussian, h ∼ N(h; 0, I), and the observed variables x_i are assumed to be conditionally independent given the latent variable h. The noise is assumed to follow a diagonal-covariance Gaussian distribution, with covariance matrix ψ = diag(σ²), where σ² = [σ²_1, σ²_2, \ldots, σ²_n]^⊤ is a vector containing one variance per variable.
Latent variables thus play a crucial role in capturing the dependencies among the different observed variables x_i. Indeed, it can be shown that x is a multivariate normal random variable, with x ∼ N(x; b, WW^⊤ + ψ).
To frame principal component analysis (PCA) within a probabilistic context, we can modify the factor analysis model by setting the conditional variances equal to each other. This adjustment leads to the covariance of x being WW^⊤ + σ²I, with σ² now a scalar. Consequently, the conditional distribution is x ∼ N(x; b, WW^⊤ + σ²I), or equivalently x = Wh + b + σz, where z ∼ N(z; 0, I) is Gaussian noise. Tipping and Bishop (1999) introduced an iterative EM algorithm for estimating the parameters W and σ².
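A small sketch of the probabilistic PCA generative process and a check that the empirical covariance of the samples approaches WW^⊤ + σ²I; the dimensions and parameter values are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    D, d, n = 5, 2, 100_000
    W = rng.normal(size=(D, d))
    b = rng.normal(size=D)
    sigma = 0.1

    h = rng.normal(size=(n, d))                           # latent prior h ~ N(0, I)
    x = h @ W.T + b + sigma * rng.normal(size=(n, D))     # probabilistic PCA noise model

    # The empirical covariance of x should approach W W^T + sigma^2 I.
    emp_cov = np.cov(x, rowvar=False)
    print(np.allclose(emp_cov, W @ W.T + sigma ** 2 * np.eye(D), atol=0.05))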
The probabilistic PCA model leverages the insight that the majority of data variations can be explained by latent variables, with a minimal residual reconstruction error denoted as σ² According to Tipping and Bishop (1999), as σ approaches zero, probabilistic PCA converges to traditional PCA In this scenario, the conditional expected value of the latent variables given the data x is equivalent to the orthogonal projection of x minus the mean b onto the subspace defined by the d columns of the weight matrix W, mirroring the principles of PCA.
As the parameter σ approaches zero, the density model established by probabilistic PCA becomes highly concentrated around the d dimensions represented by the columns of matrix W Consequently, this sharpness can lead the model to assign significantly low likelihood to data points that do not cluster near a hyperplane.
Independent Component Analysis (ICA)
Independent component analysis (ICA) is among the oldest representation learning algorithms (Herault and Ans, 1984; Jutten and Herault, 1991; Comon, 1994; Hyvärinen, 1999; Hyvärinen et al., 2001a; Hinton et al., 2001; Teh et al., 2003).
This approach to modeling linear factors aims to decompose an observed signal into multiple independent underlying signals, which are then scaled and combined to recreate the original data The goal is to ensure that these signals are fully independent rather than just decorrelated.
Independent Component Analysis (ICA) encompasses various specific methodologies, with one variant closely resembling other generative models (Pham et al., 1992) This approach involves training a fully parametric generative model, where the prior distribution over the underlying factors, p(h), is predetermined by the user Consequently, the model deterministically generates data using the equation x = W h.
In section 3.8, we explore the distinction between uncorrelated and independent variables. To determine p(x), we apply a nonlinear change of variables as outlined in equation 3.47. The model learning process then continues in the standard manner, utilizing maximum likelihood estimation.
This approach leverages the independence of p(h) to recover underlying factors that closely resemble independent signals. It is primarily utilized to extract low-level signals that have been combined, rather than to identify abstract causal factors. In this context, each training example represents a moment in time, where x_i denotes a sensor's observation of mixed signals and h_i signifies an estimate of one original signal. For instance, in a scenario where n individuals speak simultaneously, using n strategically placed microphones can allow independent component analysis (ICA) to discern volume variations among speakers, effectively isolating each voice. This technique is particularly valuable in neuroscience, especially in electroencephalography, which records electrical activity from the brain. Multiple electrodes on a subject's scalp capture various electrical signals, but the experimenter aims to focus solely on brain signals, often complicated by stronger signals from the heart and eyes. ICA is essential to disentangle these mixed signals, enabling clearer analysis of brain activity and interactions among different brain regions.
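A standard blind-source-separation sketch using scikit-learn's FastICA (one widely used ICA implementation, not necessarily the exact maximum likelihood variant described above); the two sources and the mixing matrix are synthetic, and the recovered signals are only identified up to permutation and scaling.

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(0)
    t = np.linspace(0, 8, 2000)

    # Two independent, non-Gaussian source signals ("speakers").
    s1 = np.sign(np.sin(3 * t))          # square wave
    s2 = np.sin(7 * t) ** 3              # distorted sinusoid
    S = np.c_[s1, s2]

    A = np.array([[1.0, 0.5],            # unknown mixing matrix ("microphones")
                  [0.4, 1.0]])
    X = S @ A.T                          # observed mixtures

    ica = FastICA(n_components=2, random_state=0)
    S_hat = ica.fit_transform(X)         # recovered sources, up to scale and order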
Various variants of independent component analysis (ICA) exist, with some incorporating noise during the generation of x instead of relying on a deterministic decoder. Unlike traditional methods that utilize the maximum likelihood criterion, many approaches instead focus on ensuring the independence of the elements of h = W^{-1}x. Numerous criteria can achieve this independence, though equation 3.47 involves calculating the determinant of W, a process that can be computationally intensive and prone to numerical instability. To mitigate this issue, certain ICA variants constrain W to maintain orthogonality.
All variants of ICA require that p(h) be non-Gaussian. This is because if p(h) is an independent prior with Gaussian components, then W is not identifiable.
Many values of W can yield the same distribution over p(x), contrasting with other linear factor models like probabilistic PCA and factor analysis, which often necessitate a Gaussian p(h) for closed-form solutions. In maximum likelihood approaches, users typically define the distribution, a common choice being p(h_i) = \frac{d}{dh_i}\sigma(h_i). These non-Gaussian distributions frequently exhibit sharper peaks near 0 than the Gaussian, so most implementations of ICA can be seen as learning sparse features.
Many variants of independent component analysis (ICA) are not considered generative models, as they do not represent the probability distribution p(x) or generate samples from it. Instead, these ICA variants focus on transforming data between x and h without explicitly defining p(h) or imposing a distribution over p(x). For instance, while some ICA methods aim to increase the sample kurtosis of h = W^{-1}x to indicate that p(h) is non-Gaussian, they do so without directly representing p(h). Consequently, ICA is primarily utilized as an analytical tool for signal separation rather than for data generation or density estimation.
Nonlinear Independent Component Analysis (ICA) extends traditional ICA by utilizing a nonlinear generative model to produce observed data, as initially explored by Hyvärinen and Pajunen in 1999 This approach has been effectively applied in ensemble learning, as demonstrated by Roberts and Everson in 2001, and Lappalainen et al in 2000 A notable nonlinear extension is the Nonlinear Independent Components Estimation (NICE) method introduced by Dinh et al in 2014, which employs a series of invertible transformations with efficiently computable Jacobian determinants This allows for precise likelihood calculations and aims to transform data into a space with a factorized marginal distribution, leveraging the advantages of a nonlinear encoder The encoder's association with a perfect inverse decoder facilitates straightforward sample generation from the model by sampling from p(h) and applying the decoder.
Independent Component Analysis (ICA) can be generalized to learn groups of features, allowing statistical dependence within groups while discouraging it between them This concept is known as independent subspace analysis when the groups of related units are non-overlapping Additionally, by assigning spatial coordinates to hidden units, overlapping groups of spatially neighboring units can be formed, promoting the learning of similar features among nearby units In the context of natural images, this topographic ICA approach effectively learns Gabor filters, ensuring that neighboring features exhibit similar orientation, location, or frequency The presence of various phase offsets of similar Gabor functions within each region enables translation invariance through pooling over small areas.
Slow Feature Analysis
Slow feature analysis (SFA) is a linear factor model that uses information from time signals to learn invariant features (Wiskott and Sejnowski, 2002).
Slow feature analysis is grounded in the slowness principle, which posits that significant characteristics of scenes evolve at a much slower pace than the rapid fluctuations of individual measurements In computer vision, for instance, pixel values can fluctuate quickly; as a zebra moves across an image, a pixel may swiftly alternate between black and white due to the zebra's stripes In contrast, features that indicate the presence of a zebra remain constant, while those describing its position change gradually Thus, it is beneficial to regularize our model to prioritize learning features that exhibit slow changes over time.
The slowness principle predates slow feature analysis and has been applied to a variety of models (Hinton, 1989; Földiák, 1989; Mobahi et al., 2009; Bergstra and Bengio, 2009). In general, it can be applied to any differentiable model that is trained using gradient descent. It is typically incorporated by adding a term to the cost function of the form

\lambda \sum_t L\left(f(x^{(t+1)}), f(x^{(t)})\right),   (13.7)

where λ is a hyperparameter controlling the strength of the slowness regularization term, t is the index into a sequence of time-based examples, f is the feature extractor to be regularized, and L is a loss measuring the distance between f(x^{(t)}) and f(x^{(t+1)}). A common choice for L is the mean squared difference.
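A minimal sketch of the regularizer in equation 13.7 with the mean squared difference as L and a linear feature extractor; the data, the weight matrix, and the value of λ are illustrative, and the term would be added to whatever task loss is being minimized.

    import numpy as np

    def slowness_penalty(features, lam=0.1):
        """Mean squared difference between consecutive feature vectors.

        `features` has shape (T, d): one feature vector per time step.
        """
        diffs = features[1:] - features[:-1]
        return lam * np.mean(np.sum(diffs ** 2, axis=1))

    rng = np.random.default_rng(0)
    X = np.cumsum(rng.normal(size=(100, 10)), axis=0)   # a slowly drifting signal
    W = rng.normal(size=(10, 3))                        # linear feature extractor f(x) = W^T x
    print(slowness_penalty(X @ W))                      # added to the task loss during training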
Slow feature analysis (SFA) is an efficient application of the slowness principle, particularly when utilized with a linear feature extractor, allowing for closed-form training While similar to some variants of independent component analysis (ICA), SFA is not a generative model; it establishes a linear mapping between input space and feature space without defining a prior distribution over the feature space, thus not imposing a distribution p(x) on the input space.
The SFA algorithm (Wiskott and Sejnowski, 2002) consists of defining f(x; θ) to be a linear transformation and solving the optimization problem

\min_{\theta} \; \mathbb{E}_t\left[\left(f(x^{(t+1)})_i - f(x^{(t)})_i\right)^{2}\right],

subject to the constraints discussed below.
To ensure a unique solution in the learned features, it is crucial to impose a zero-mean constraint; without it, adding a constant to all feature values could yield different solutions that achieve the same slowness objective. Additionally, enforcing unit variance is essential to avoid pathological solutions where all features collapse to zero. Similar to principal component analysis (PCA), the features derived from slow feature analysis (SFA) are ordered, with the first feature representing the slowest variation. To effectively learn multiple features, it is necessary to add a further constraint.
The learned features must be linearly decorrelated from one another; otherwise, every feature would simply capture the slowest signal. Although other mechanisms, such as minimizing reconstruction error, could encourage feature diversity, the linearity of SFA features makes decorrelation a straightforward solution. The SFA problem can then be solved in closed form using a linear algebra package.
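A rough sketch of the closed-form linear SFA solution, assuming the usual recipe of whitening the centered data and then keeping the directions in which the whitened signal's temporal differences have the smallest variance; this illustrates the idea and is not a production implementation.

    import numpy as np

    def linear_sfa(X, n_features):
        """X has shape (T, D): a time series of D-dimensional observations.
        Returns a projection whose outputs have zero mean, unit variance,
        are decorrelated, and vary as slowly as possible (assumes full-rank data)."""
        X = X - X.mean(axis=0)                      # zero mean
        # Whiten: rotate and rescale so the covariance becomes the identity.
        cov = np.cov(X, rowvar=False)
        vals, vecs = np.linalg.eigh(cov)
        whiten = vecs / np.sqrt(vals)
        Z = X @ whiten
        # Slowest directions: smallest eigenvalues of the covariance of the
        # temporal differences of the whitened signal.
        dZ = np.diff(Z, axis=0)
        dvals, dvecs = np.linalg.eigh(np.cov(dZ, rowvar=False))
        return whiten @ dvecs[:, :n_features]       # eigh sorts eigenvalues in ascending order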
SFA is commonly employed to capture nonlinear features by first applying a nonlinear basis expansion to the input data, x A typical approach involves substituting x with a quadratic basis expansion, which generates a vector comprising elements x_i x_j for all combinations of i and j.
SFA modules can be structured to develop deep nonlinear slow feature extractors by iteratively training a linear SFA feature extractor, enhancing its output with a nonlinear basis expansion, and subsequently training an additional linear SFA feature extractor on the expanded data.
When trained on small spatial patches of natural scene videos, SFA with quadratic basis expansions develops features akin to those of complex cells in the V1 cortex Additionally, when exposed to videos of random motion in 3-D computer-generated environments, deep SFA acquires features resembling those utilized by navigation neurons in rat brains Consequently, SFA emerges as a biologically plausible model for understanding neural feature representation.
A significant benefit of Slow Feature Analysis (SFA) is its ability to theoretically predict the features it will learn, even in complex, nonlinear environments To achieve these predictions, one must understand the dynamics of the environment concerning its configuration space, such as the probability distribution of the camera's position and velocity in a 3-D rendered setting With insights into how the underlying factors evolve, it becomes feasible to analytically determine the optimal functions that represent these factors Empirical studies using deep SFA on simulated data have demonstrated a successful recovery of these theoretically anticipated functions.
Unlike other learning algorithms that heavily rely on specific pixel values in their cost functions, this approach simplifies the process of feature determination for the model, making it easier to understand which features will be learned.
Deep SFA has been utilized for object recognition and pose estimation, but the slowness principle has not yet been leveraged in leading applications The limitations of its performance remain unclear, leading to speculation that the slowness prior may be overly restrictive Instead of enforcing a prior that features should remain nearly constant, it may be more effective to prioritize features that are easily predictable across time steps Notably, an object's position is a valuable feature, irrespective of its velocity; however, the slowness principle tends to overlook the positions of fast-moving objects.
Sparse Coding
Sparse coding, introduced by Olshausen and Field in 1996, is a linear factor model extensively researched for unsupervised feature learning and extraction While "sparse coding" specifically pertains to inferring values within the model, and "sparse modeling" relates to designing and learning the model, the terms are frequently used interchangeably in the literature.
Sparse coding models employ a linear decoder combined with noise to reconstruct data, as outlined in equation 13.2. More precisely, these models assume that the linear factors are combined with isotropic Gaussian noise of precision β:

p(x \mid h) = N\!\left(x;\ Wh + b,\ \tfrac{1}{\beta} I\right).   (13.12)
The distribution p(h) is selected for its sharp peak near zero (Olshausen and Field, 1996). Popular options include factorized Laplace, Cauchy, and factorized Student-t distributions. For instance, the Laplace prior, parametrized in terms of the sparsity penalty coefficient λ, is

p(h_i) = \mathrm{Laplace}\!\left(h_i;\ 0,\ \tfrac{2}{\lambda}\right) = \frac{\lambda}{4} e^{-\frac{1}{2}\lambda |h_i|},   (13.13)

and the Student-t prior is

p(h_i) \propto \frac{1}{\left(1 + \frac{h_i^{2}}{\nu}\right)^{\frac{\nu+1}{2}}}.   (13.14)
Training sparse coding using maximum likelihood is impractical; therefore, the training process alternates between encoding the data and improving the decoder to better reconstruct the data given the encoding. This approach will be justified as a principled approximation to maximum likelihood in section 19.3.
In contrast to parametric encoder functions like those used in PCA, which consist only of multiplication by a weight matrix, the encoder used in sparse coding is not parametric. Instead, the encoder is an optimization algorithm that solves an optimization problem: finding the single most likely code value,

h^* = \arg\max_h \; p(h \mid x).
When combined with equation 13.13 and equation 13.12, this yields the following optimization problem:

\arg\max_h \; p(h \mid x)   (13.16)
= \arg\max_h \; \log p(h \mid x)   (13.17)
= \arg\min_h \; \lambda \lVert h \rVert_1 + \beta \lVert x - Wh \rVert_2^2,   (13.18)

where we have dropped terms not depending on h and divided by positive scaling factors to simplify the equation.
Due to the imposition of an L1 norm on h, this procedure yields a sparse h^* (see section 7.1.2).
To train the model, we alternate between minimization with respect to h and minimization with respect to W, treating β as a hyperparameter typically set to 1, since its role is shared with λ. While β could be treated as a learnable parameter, our formulation omits certain terms dependent on β that would be needed to learn it; without including these terms, β would collapse to 0.
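A rough sketch of this alternating scheme, assuming ISTA (proximal gradient descent with soft thresholding) for the minimization over h and a plain gradient step with column renormalization for W; all sizes, step counts, and learning rates are illustrative.

    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def infer_codes(X, W, lam=0.1, beta=1.0, n_steps=50):
        """ISTA for the objective lam*||h||_1 + beta*||x - W h||_2^2,
        applied column-wise: each column of H is the code for one column of X."""
        H = np.zeros((W.shape[1], X.shape[1]))
        step = 1.0 / (2 * beta * np.linalg.norm(W, 2) ** 2)   # 1 / Lipschitz constant
        for _ in range(n_steps):
            grad = 2 * beta * W.T @ (W @ H - X)
            H = soft_threshold(H - step * grad, step * lam)
        return H

    def train_dictionary(X, n_codes=32, n_epochs=30, lr=0.01):
        rng = np.random.default_rng(0)
        W = rng.normal(size=(X.shape[0], n_codes))
        W /= np.linalg.norm(W, axis=0)           # normalize dictionary columns
        for _ in range(n_epochs):
            H = infer_codes(X, W)                # minimize over h with W fixed
            W -= lr * (W @ H - X) @ H.T          # gradient step on W with h fixed
            W /= np.linalg.norm(W, axis=0)
        return W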
Sparse coding methods do not always explicitly define the probability distributions p(h) and p(x | h) Instead, the primary goal is often to learn a dictionary of features, where the activation values frequently result in zeros during the inference process.
Sampling h from a Laplace prior results in a zero probability event for any element of h being zero While the generative model is not particularly sparse, the feature extractor is Goodfellow et al (2013) discuss approximate inference in the spike and slab sparse coding model, where samples from the prior often include true zeros.
The sparse coding approach, when paired with a non-parametric encoder, can in principle minimize the combination of reconstruction error and log-prior better than a parametric encoder. Unlike parametric encoders, which must generalize the mapping from inputs to hidden representations, non-parametric encoders avoid this source of generalization error, ensuring accurate reconstructions even for atypical inputs. While optimization in sparse coding models typically yields optimal codes due to the convexity of the problem, some generalization error may still occur in the decoder weights, affecting performance on unfamiliar data. However, the non-generalizing nature of sparse coding's encoding process can enhance its effectiveness as a feature extractor for classifiers, as evidenced by Coates and Ng (2011), who found that sparse coding features outperform those from parametric encoder models like the linear-sigmoid autoencoder in object recognition tasks.
Goodfellow et al. (2013d) showed that a variant of sparse coding generalizes better than other feature extractors in the regime where extremely few labels are available (twenty or fewer labels per class).
The non-parametric encoder presents a significant drawback due to its increased computational time for calculating h given x, as it relies on an iterative algorithm In contrast, the parametric autoencoder, introduced in chapter 14, typically utilizes a fixed number of layers, often just one Additionally, back-propagating through the non-parametric encoder is not straightforward, complicating the pretraining of sparse coding models with unsupervised criteria before fine-tuning them with supervised methods While modified versions of sparse coding that allow for approximate derivatives exist, they remain underutilized in practice (Bagnell and Bradley 2009).
Sparse coding, similar to other linear factor models, frequently yields subpar samples despite effective data reconstruction and feature utility for classifiers This issue arises because, while individual features may be accurately learned, the factorial prior on the hidden code leads to the model incorporating random subsets of features in each generated sample Consequently, there is a need for the advancement of deeper models that can enforce a more structured representation.
The spike and slab sparse coding model trained on the MNIST dataset illustrates a contrast between model samples and training examples, suggesting initial poor fit However, the model's weight vectors successfully capture essential features like penstrokes and entire digits This indicates the model has learned useful representations, despite the challenge posed by the factorial prior over features, which leads to random combinations of subsets Consequently, few of these combinations effectively represent recognizable MNIST digits, highlighting the need for generative models with enhanced distributions over latent codes, as well as the exploration of more advanced shallow models.
Manifold Interpretation of PCA
Linear factor models, such as PCA and factor analysis, can be understood as learning a manifold (Hinton et al., 1997) Probabilistic PCA defines a narrow Gaussian distribution resembling a thin pancake, where the distribution is flat along some axes and elongated along others This concept illustrates how PCA aligns this pancake-like distribution with a linear manifold in a higher-dimensional space This interpretation is applicable not only to traditional PCA but also to any linear autoencoder that aims to minimize the reconstruction error of input data.
The encoder computes a low-dimensional representation

h = f(x) = W^\top (x - b).

With the autoencoder view, we have a decoder computing the reconstruction

\hat{x} = g(h) = b + Vh.   (13.20)
The figure illustrates a flat Gaussian distribution that captures probability concentration near a low-dimensional manifold. It depicts the upper half of a "pancake" shape above the central "manifold plane." The variance in the direction orthogonal to the manifold is minimal, resembling "noise," while the variances within the plane are significant, representing the "signal." This visual representation aids in understanding the coordinate system for reduced-dimension data.
The choices of linear encoder and decoder that minimize reconstruction error

E\left[\lVert x - \hat{x} \rVert^2\right]   (13.21)

correspond to V = W, \mu = b = E[x], and the columns of W forming an orthonormal basis which spans the same subspace as the principal eigenvectors of the covariance matrix

C = E\left[(x - \mu)(x - \mu)^\top\right].
In principal component analysis (PCA), the columns of W are these eigenvectors, ordered by the magnitude of their corresponding non-negative eigenvalues. Each eigenvalue λ_i of the covariance matrix C represents the variance of x in the direction of its associated eigenvector v_i. For a data vector x \in \mathbb{R}^{D} and a reduced representation h \in \mathbb{R}^{d} (with d < D), the optimal reconstruction error, obtained with the choices of V and W above, is

\min \; E\left[\lVert x - \hat{x}\rVert^{2}\right] = \sum_{i=d+1}^{D} \lambda_i.
Hence, if the covariance has rank d, the eigenvalues λ_{d+1} to λ_D are 0 and the reconstruction error is 0.
Furthermore, one can also show that the above solution can be obtained by maximizing the variances of the elements of h, under orthogonal W, instead of minimizing reconstruction error.
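As a quick numerical illustration of the claim above, the following sketch (not from the text; the data and all names are hypothetical) checks that the optimal rank-d linear reconstruction leaves an expected squared error equal to the sum of the discarded covariance eigenvalues.

    import numpy as np

    # Illustrative check: optimal rank-d reconstruction error equals the
    # sum of the discarded covariance eigenvalues.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 6)) @ rng.normal(size=(6, 6))
    X -= X.mean(axis=0)                    # center the data

    d = 3
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False, bias=True))
    W = eigvecs[:, -d:]                    # top-d principal eigenvectors
    X_hat = (X @ W) @ W.T                  # encode then decode
    empirical_error = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    print(empirical_error, eigvals[:-d].sum())   # the two numbers match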
Linear factor models are among the simplest generative models that learn representations of data. Just as linear classifiers and linear regression can be extended into deep feedforward networks, linear factor models can be extended into autoencoder networks and deep probabilistic models, offering greater power and flexibility.
An autoencoder is a type of neural network designed to copy its input to its output. It features a hidden layer, denoted h, which encodes the input data into a compact representation. The architecture comprises two main components: the encoder function h = f(x), which transforms the input, and the decoder function r = g(h), which reconstructs the input from the encoded representation.
The architecture illustrated in figure 14.1 highlights that an autoencoder's goal is not to achieve perfect reconstruction of the input, that is, to learn g(f(x)) = x everywhere. Instead, autoencoders are intentionally designed with constraints that prevent them from copying inputs exactly. This limitation forces the model to copy only approximately, and only for inputs that resemble the training data. Because the model must prioritize which aspects of the input to copy, it often learns useful properties of the data as a result.
Modern autoencoders have generalized the idea of an encoder and a decoder beyond deterministic functions to stochastic mappings p_encoder(h | x) and p_decoder(x | h).
The idea of autoencoders has been part of the historical landscape of neural networks for decades (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994).
Autoencoders, traditionally used for dimensionality reduction and feature learning, have gained prominence in generative modeling due to their theoretical connections with latent variable models. They can be viewed as a specific type of feedforward network, trained using the same techniques, such as minibatch gradient descent and back-propagation. Additionally, autoencoders can be trained via recirculation, a learning algorithm that compares network activations on the original input to activations on the reconstructed input. Although recirculation is considered more biologically plausible than back-propagation, it is rarely applied in machine learning contexts.
An autoencoder consists of two main components: the encoder, which transforms the input x into an internal representation or code h, and the decoder, which reconstructs the output r from this code. This structure maps inputs to their reconstructions.
Undercomplete Autoencoders
While it may seem pointless to copy the input to the output, the primary interest is not in the decoder's output itself. Instead, the goal of training the autoencoder on this input-copying task is for it to acquire useful properties as a side effect.
One way to obtain useful features from an autoencoder is to constrain the code h to have smaller dimension than the input x. An autoencoder whose code dimension is less than the input dimension is called undercomplete. This constraint forces the autoencoder to capture the most salient features of the training data.
The learning process is described simply as minimizing a loss function
L(x, g(f(x))), (14.1)

where L is a loss function penalizing g(f(x)) for being dissimilar from x, such as the mean squared error.
An undercomplete autoencoder with a linear decoder and mean squared error loss learns to span the same subspace as PCA. In this case, an autoencoder trained to perform the copying task learns the principal subspace of the training data as a side effect.
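The following is a minimal sketch, under illustrative assumptions about the data and layer sizes (none of which come from the text), of this equivalence: an undercomplete linear autoencoder trained by gradient descent on mean squared error ends up spanning the principal subspace of the data.

    import numpy as np

    # Illustrative sketch: a linear autoencoder with code size d, trained on
    # mean squared error, learns to span the top-d principal subspace.
    rng = np.random.default_rng(0)
    Z = rng.normal(size=(500, 3))                      # latent factors
    A = rng.normal(size=(3, 10))                       # mixing matrix
    X = Z @ A + 0.01 * rng.normal(size=(500, 10))      # data near a 3-D subspace
    X -= X.mean(axis=0)                                # center the data

    D, d = X.shape[1], 3
    W_enc = 0.1 * rng.normal(size=(D, d))              # encoder: h = x W_enc
    W_dec = 0.1 * rng.normal(size=(d, D))              # decoder: x_hat = h W_dec

    lr = 1e-2
    for _ in range(2000):
        H = X @ W_enc
        err = H @ W_dec - X                            # gradient of 0.5*||x_hat - x||^2
        W_dec -= lr * H.T @ err / len(X)
        W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)

    # The rows of W_dec should lie close to the principal subspace of X.
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    proj = eigvecs[:, -d:] @ eigvecs[:, -d:].T         # projector onto top-d subspace
    print(np.linalg.norm(W_dec - W_dec @ proj) / np.linalg.norm(W_dec))  # small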
Autoencoders with nonlinear encoder and decoder functions can learn a more powerful nonlinear generalization of PCA. However, excessive capacity in the encoder and decoder may lead the autoencoder to merely copy the input without extracting useful information about the data distribution. In theory, an autoencoder with a one-dimensional code and a very powerful nonlinear encoder could learn to represent each training example with an index. While this scenario rarely occurs in practice, it illustrates that an autoencoder with too much capacity can fail to learn anything useful about the dataset, succeeding only at the copying task.
Regularized Autoencoders
Undercomplete autoencoders, with a code dimension smaller than the input dimension, can learn the most salient features of the data distribution. However, if the encoder and decoder are given too much capacity, even these autoencoders fail to extract useful information.
A similar problem occurs when the hidden code's dimension matches or exceeds that of the input. In such overcomplete cases, even linear encoders and decoders can simply copy the input to the output without learning anything useful about the data distribution.
Regularized autoencoders allow a wide range of architectures to be trained successfully, with the code dimension and the capacities of the encoder and decoder chosen according to the complexity of the data distribution. Rather than limiting capacity through shallow structures and small code sizes, regularized autoencoders use a loss function that encourages additional desirable properties. These include sparsity of the representation, smallness of the derivatives of the representation, and robustness to noise or to missing inputs. Consequently, even nonlinear and overcomplete regularized autoencoders can learn something useful about the data distribution, rather than a trivial identity function.
Generative models with latent variables and an inference procedure can also be interpreted as autoencoders, including descendants of the Helmholtz machine such as variational autoencoders and generative stochastic networks. These models naturally learn high-capacity, overcomplete encodings of the input that are useful without requiring regularization. Their usefulness stems from the models being trained to maximize the probability of the training data, rather than merely copying the input to the output.
A sparse autoencoder is simply an autoencoder whose training criterion involves a sparsity penalty Ω(h) on the code layer h, in addition to the reconstruction error:
L(x, g(f(x))) + Ω(h), (14.2)

where g(h) is the decoder output and typically we have h = f(x), the encoder output.
Sparse autoencoders are typically used to learn features for another task, such as classification. An autoencoder regularized to be sparse must respond to distinctive statistical features of the dataset it has been trained on, rather than simply copying the input. As a result, training on the copying task with a sparsity penalty yields a model that learns useful features as a byproduct.
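A minimal sketch of the criterion in equation 14.2, assuming a one-layer encoder and decoder, an L¹ penalty for Ω(h), and illustrative sizes and hyperparameters (none of which come from the text):

    import torch

    # Illustrative sparse autoencoder: reconstruction error plus an L1
    # sparsity penalty Omega(h) = lam * sum_i |h_i| on the code layer.
    torch.manual_seed(0)
    x_dim, h_dim, lam = 20, 50, 1e-3          # overcomplete code (h_dim > x_dim)

    encoder = torch.nn.Linear(x_dim, h_dim)
    decoder = torch.nn.Linear(h_dim, x_dim)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

    X = torch.randn(256, x_dim)               # stand-in for a minibatch of data

    for step in range(100):
        h = torch.relu(encoder(X))            # f(x): code layer
        x_hat = decoder(h)                    # g(h): reconstruction
        recon = torch.nn.functional.mse_loss(x_hat, X)
        sparsity = lam * h.abs().sum(dim=1).mean()   # Omega(h)
        loss = recon + sparsity               # L(x, g(f(x))) + Omega(h)
        opt.zero_grad()
        loss.backward()
        opt.step()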
The penalty Ω(h) can be viewed as a regularizer added to a feedforward network whose primary task is to copy the input to the output (an unsupervised objective), and which may also perform a supervised task that depends on the sparse features. Unlike other regularizers such as weight decay, however, Ω(h) does not have a straightforward Bayesian interpretation. Regularized maximum likelihood is usually interpreted as a MAP approximation to Bayesian inference, in which the regularizer corresponds to a prior distribution over the model parameters, balancing the data log-likelihood against a preference for particular parameter values. Regularized autoencoders defy this interpretation because their regularization terms depend on the data and therefore are not true priors, yet they still express implicit preferences over functions.
Instead of viewing the sparsity penalty merely as a regularizer for the copying task, we can think of the entire sparse autoencoder framework as approximating maximum likelihood training of a generative model with latent variables. In this model, we have visible variables \( x \) and latent variables \( h \), with an explicit joint distribution \( p_{\text{model}}(x, h) = p_{\text{model}}(h)\, p_{\text{model}}(x \mid h) \). Here, \( p_{\text{model}}(h) \) is the prior distribution over the latent variables, representing the model's beliefs prior to seeing \( x \). This differs from the more common use of the word "prior," which refers to the distribution \( p(\theta) \) encoding beliefs about the model parameters before seeing the training data. The log-likelihood can be decomposed as \( \log p_{\text{model}}(x) = \log \sum_h p_{\text{model}}(h, x) \).
We can think of the autoencoder as approximating this sum with a point estimate for just one highly likely value of h, much as in the sparse coding generative model of section 13.4, but with h given by the output of a parametric encoder rather than by an optimization that infers the most likely h. With this chosen h, we maximize

log p_model(h, x) = log p_model(h) + log p_model(x | h). (14.4)

The log p_model(h) term can be sparsity-inducing. For example, the Laplace prior,

p_model(h_i) = (λ/2) e^{−λ|h_i|}, (14.5)

corresponds to an absolute value sparsity penalty. Expressing the negative log-prior as an absolute value penalty, we obtain

Ω(h) = λ Σ_i |h_i|, (14.6)

−log p_model(h) = Σ_i (λ|h_i| − log (λ/2)) = Ω(h) + const, (14.7)

where the constant term depends only on the hyperparameter λ and not on h. We typically treat λ as fixed and can therefore discard the constant when learning the parameters. Other priors, such as the Student-t prior, can also induce sparsity. From this point of view, in which sparsity arises as a consequence of approximate maximum likelihood learning of the model's latent variables, the sparsity penalty is not a regularization term at all but a natural outcome of the model's distribution over latent variables. This view gives a different motivation for training an autoencoder: it is a way of approximately training a generative model, and the features the autoencoder learns are useful because they describe the latent variables that explain the input.
Early work on sparse autoencoders by Ranzato et al. (2007, 2008) investigated a connection between the sparsity penalty and the log Z term that arises in undirected probabilistic models. The idea is that minimizing log Z prevents a probabilistic model from having high probability everywhere, much as the sparsity penalty prevents an autoencoder from achieving low reconstruction error everywhere. In the directed-model view, however, the sparsity penalty has a simpler mathematical interpretation: it corresponds to log p_model(h). One way to obtain actual zeros in the representation of a sparse autoencoder is to use rectified linear units in the code layer, as introduced by Glorot et al. (2011), together with a prior that allows indirect control over the average number of zeros in the representation.
Rather than adding a penalty Ω to the cost function, we can obtain an autoencoder that learns something useful by changing the reconstruction error term of the cost function.
Traditionally, autoencoders minimize some function
The loss function L(x, g(f(x))) penalizes the dissimilarity between g(f(x)) and x, for example with the L² norm of their difference. This encourages g ∘ f to learn to be merely an identity function if it has the capacity to do so.
A denoising autoencoder or DAE instead minimizes
L(x, g(f(x̃))), (14.9)

where x̃ is a copy of x that has been corrupted by some form of noise. Denoising autoencoders must therefore undo this corruption rather than simply copying their input.
Denoising training forces the model to implicitly learn the structure of the data distribution, as shown by Alain and Bengio (2013) and Bengio et al. (2013). Denoising autoencoders thus provide one example of how useful properties can emerge as a byproduct of minimizing reconstruction error. They also illustrate how overcomplete, high-capacity models may be used as autoencoders, provided care is taken to prevent them from learning the identity function. Denoising autoencoders are presented in more detail in section 14.5.
Another strategy for regularizing an autoencoder is to use a penalty Ω as in sparse autoencoders,
L(x, g(f(x))) + Ω(h, x), (14.10)

but with a different form of Ω:

Ω(h, x) = λ Σ_i ||∇_x h_i||². (14.11)
This forces the model to learn a function that does not change much when x changes slightly. Because this penalty is applied only at training examples, it forces the autoencoder to learn features that capture information about the training distribution.
An autoencoder regularized in this way is called a contractive autoencoder, or CAE. This approach has theoretical connections to denoising autoencoders, manifold learning, and probabilistic modeling. The CAE is described in more detail in section 14.7.
Representational Power, Layer Size and Depth
Autoencoders are often trained with only a single-layer encoder and a single-layer decoder. However, this is not a requirement; in fact, using deep encoders and decoders offers many advantages.
Many of the advantages of depth in feedforward networks also apply to autoencoders, because the encoder and decoder are themselves feedforward networks, so each of these components can independently benefit from depth.
One major advantage of non-trivial depth is the universal approximator theorem, which guarantees that a feedforward network with at least one hidden layer can approximate any function arbitrarily well, provided it has enough hidden units. This means that an autoencoder with a single hidden layer can represent the identity function along the data domain arbitrarily well. However, such a shallow mapping from input to code limits the constraints we can impose, such as requiring the code to be sparse. In contrast, a deep autoencoder, with at least one additional hidden layer inside the encoder itself, can approximate any mapping from input to code arbitrarily well, given enough hidden units.
Increasing the depth of neural networks can significantly lower the computational costs associated with representing certain functions Additionally, deeper networks can dramatically reduce the amount of training data required to effectively learn these functions For a comprehensive overview of the benefits of depth in feedforward networks, refer to section 6.4.1.
Experimentally, deep autoencoders yield much better compression than corresponding shallow or linear autoencoders (Hinton and Salakhutdinov, 2006).
A common strategy for training a deep autoencoder is to greedily pretrain the deep architecture by training a stack of shallow autoencoders. This technique is frequently used even when the ultimate goal is to train a single deep autoencoder.
Stochastic Encoders and Decoders
Autoencoders are just feedforward networks The same loss functions and output unit types that can be used for traditional feedforward networks are also used for autoencoders.
A key strategy for designing the output units and the loss function of a feedforward network is to define an output distribution p(y | x) and minimize the negative log-likelihood −log p(y | x). In that setting, y is a vector of targets, such as class labels.
In an autoencoder, x is both the input and the target, so we can use the same machinery. Given a code h, we may think of the decoder as providing a conditional distribution \( p_{\text{decoder}}(x \mid h) \), and we train the autoencoder by minimizing \( -\log p_{\text{decoder}}(x \mid h) \). The exact form of this loss function depends on the form of the decoder. For real-valued x, we typically use linear output units to parametrize the mean of a Gaussian distribution, in which case the negative log-likelihood yields the mean squared error criterion. Binary x corresponds to a Bernoulli distribution whose parameters are given by a sigmoid output unit, while discrete x corresponds to a softmax distribution.
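A hedged sketch of this correspondence between decoder distributions and loss functions; the shapes and layers below are illustrative assumptions, not anything prescribed by the text:

    import torch

    # The reconstruction loss is the negative log-likelihood of the decoder
    # distribution p_decoder(x | h): a Gaussian decoder with linear outputs
    # gives mean squared error, a Bernoulli decoder with sigmoid outputs gives
    # binary cross-entropy.
    torch.manual_seed(0)
    h = torch.randn(8, 16)                        # a batch of codes
    x_real = torch.randn(8, 10)                   # real-valued targets
    x_bin = torch.randint(0, 2, (8, 10)).float()  # binary targets

    linear_out = torch.nn.Linear(16, 10)

    # Gaussian p_decoder(x | h) with unit variance: -log p is MSE up to a constant.
    mean = linear_out(h)
    nll_gaussian = torch.nn.functional.mse_loss(mean, x_real)

    # Bernoulli p_decoder(x | h): -log p is the cross-entropy on sigmoid outputs.
    logits = linear_out(h)
    nll_bernoulli = torch.nn.functional.binary_cross_entropy_with_logits(logits, x_bin)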
In many models, output variables are assumed to be conditionally independent given a hidden variable h, which makes their probability distribution inexpensive to evaluate. However, certain methods, such as mixture density outputs, enable tractable modeling of correlated outputs. In this framework, the encoder models the conditional probability of the hidden variable given the input, and the decoder models the conditional probability of the output given the hidden variable.
A stochastic autoencoder consists of an encoder and decoder that incorporate noise injection, allowing their outputs to be interpreted as samples from specific distributions, denoted as p_encoder(h | x) for the encoder and p_decoder(x | h) for the decoder.
To significantly diverge from traditional feedforward networks, we can extend the concept of an encoding function f(x) to an encoding distribution p_encoder(h|x), as depicted in Figure 14.2.
Any latent variable model p_model(h, x) defines a stochastic encoder

p_encoder(h | x) = p_model(h | x) (14.12)

and a stochastic decoder

p_decoder(x | h) = p_model(x | h). (14.13)
In general, the encoder and decoder distributions are not necessarily conditional distributions compatible with a unique joint distribution p_model(x, h). Alain et al.
(2015) showed that training the encoder and decoder as a denoising autoencoder will tend to make them compatible asymptotically (with enough capacity and examples).
Denoising Autoencoders
The denoising autoencoder (DAE) is an autoencoder that receives a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output.
The DAE training procedure is illustrated in figure 14.3. We introduce a corruption process C(x̃ | x), which represents a conditional distribution over corrupted samples x̃, given a data sample x.
The computational graph of a denoising autoencoder illustrates how the cost is computed: the network is trained to reconstruct the clean data point x from its corrupted version x̃ by minimizing the loss \( L = - \log p_{\text{decoder}}(x \mid h = f(\tilde{x})) \), where x̃ is obtained from x through the corruption process \( C(\tilde{x} \mid x) \). Typically the decoder defines a factorial distribution whose mean parameters are produced by a feedforward network. The autoencoder learns a reconstruction distribution \( p_{\text{reconstruct}}(x \mid \tilde{x}) \) from training pairs of original and corrupted data, as follows:
1. Sample a training example x from the training data.
2. Sample a corrupted version x̃ from C(x̃ | x).
3. Use (x, x̃) as a training example for estimating the autoencoder reconstruction distribution p_reconstruct(x | x̃) = p_decoder(x | h), with h the output of the encoder f(x̃) and p_decoder typically defined by a decoder g(h).
Typically we can simply perform gradient-based approximate minimization (such as minibatch gradient descent) on the negative log-likelihood −log p_decoder(x | h).
So long as the encoder is deterministic, the denoising autoencoder is a feedforward network and may be trained with exactly the same techniques as any other feedforward network.
We can therefore view the DAE as performing stochastic gradient descent on the following expectation:
−E_{x ∼ p̂_data(x)} E_{x̃ ∼ C(x̃ | x)} log p_decoder(x | h = f(x̃)), (14.14)

where p̂_data(x) is the training distribution.
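A minimal training-loop sketch of the DAE procedure just described, assuming Gaussian corruption and a squared-error (conditionally Gaussian) reconstruction loss; the sizes, noise level, and optimizer settings are illustrative assumptions:

    import torch

    # Illustrative DAE: corrupt x with isotropic Gaussian noise, then train
    # g(f(x_tilde)) to reconstruct the *clean* x.
    torch.manual_seed(0)
    x_dim, h_dim, sigma = 20, 40, 0.5

    encoder = torch.nn.Sequential(torch.nn.Linear(x_dim, h_dim), torch.nn.ReLU())
    decoder = torch.nn.Linear(h_dim, x_dim)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

    X = torch.randn(256, x_dim)                       # stand-in minibatch

    for step in range(100):
        x_tilde = X + sigma * torch.randn_like(X)     # sample from C(x_tilde | x)
        x_hat = decoder(encoder(x_tilde))             # g(f(x_tilde))
        loss = torch.nn.functional.mse_loss(x_hat, X) # compare to the clean x
        opt.zero_grad()
        loss.backward()
        opt.step()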
A denoising autoencoder is trained to map corrupted data points x̃ back to the original data points x, shown as red crosses lying near a low-dimensional manifold. The corruption process is depicted by a gray circle of equiprobable corruptions, and a gray arrow shows how a training example is transformed into a sample from this corruption process. The training objective is to minimize the average squared error between the reconstruction and the original data point.
The reconstruction g(f(x̃)) estimates E[x | x̃] under the joint distribution of x and x̃. The vector g(f(x̃)) − x̃ points approximately towards the nearest point on the manifold, since g(f(x̃)) estimates the center of mass of the clean points x that could have given rise to x̃. Consequently, the autoencoder learns a vector field g(f(x)) − x, shown by green arrows, which estimates the score ∇_x log p_data(x) up to a multiplicative factor given by the average root mean square reconstruction error.
Score matching, introduced by Hyvärinen (2005), is an alternative to maximum likelihood estimation. It provides a consistent estimator of probability distributions by encouraging the model to have the same score as the data distribution at every training point x. In this context, the score is a particular gradient field: ∇_x log p(x).
Score matching is discussed further in section 18.4. For the present discussion of autoencoders, it is enough to understand that learning the gradient field of the log data density is one way to learn the structure of the data density itself.
A very important property of DAEs is that their training criterion (with a conditionally Gaussian p(x | h)) makes the autoencoder learn a vector field (g(f(x)) − x) that estimates the score of the data distribution, as illustrated in figure 14.4.
Denoising training of a specific kind of autoencoder (sigmoidal hidden units, linear reconstruction units) with Gaussian noise and mean squared error as the reconstruction cost is equivalent to training a restricted Boltzmann machine (RBM) with Gaussian visible units. This model, described in detail in section 20.5.1, provides an explicit probability model p(x; θ). When the RBM is trained using denoising score matching (Kingma and LeCun, 2010), its learning algorithm is equivalent to denoising training in the corresponding autoencoder. With a fixed noise level, regularized score matching is not a consistent estimator; it instead recovers a blurred version of the distribution. If the noise level is chosen to approach zero as the number of examples approaches infinity, however, consistency is recovered. Denoising score matching is discussed in more detail in section 18.5.
Autoencoders and Restricted Boltzmann Machines (RBMs) share several connections, particularly in their training methodologies Score matching for RBMs results in a cost function that aligns with reconstruction error, incorporating a regularization term akin to the contractive penalty found in Contractive Autoencoders (CAEs) (Swersky et al., 2011) Additionally, Bengio and Delalleau (2009) demonstrated that the gradient derived from an autoencoder can serve as an approximation for the contrastive divergence training process used in RBMs.
The denoising criterion with Gaussian corruption and reconstruction distribution yields an estimator of the score for continuous-valued x that is applicable to general encoder and decoder parametrizations. This means a generic encoder-decoder architecture can be made to estimate the score by training with the squared error criterion and corruption
C(x̃ = x̃ | x) = N(x̃; μ = x, Σ = σ² I), (14.17)

with noise variance σ². See figure 14.5 for an illustration of how this works.
A denoising autoencoder learns a vector field around a 1-D curved manifold in 2-D space near which the data concentrate. Each arrow represents the difference between the reconstruction and the input, pointing towards higher probability according to the implicitly estimated probability distribution. The vector field has zeros at both the maxima and the minima of the estimated density function; the local maxima form a connected spiral-arm manifold, and local minima lie between the arms. When the reconstruction error is large (a long arrow), probability can be increased significantly by moving in the direction of the arrow, which is mostly the case in places of low probability. The autoencoder maps these low-probability points to higher-probability reconstructions; where probability is maximal, the arrows shrink because the reconstruction becomes more accurate.
In general, there is no guarantee that the reconstruction g(f(x)) minus the input x corresponds to the gradient of any function, let alone to the score. The early results of Vincent (2011) were specific to particular parametrizations in which g(f(x)) − x could be obtained by taking the derivative of another function. Kamyshanska and Memisevic (2015) generalized Vincent's results by identifying a family of shallow autoencoders for which g(f(x)) − x corresponds to a score for all members of the family.
The denoising autoencoder not only learns to represent a probability distribution but can also function as a generative model, allowing for the sampling of data from this distribution Further details on this process will be provided in section 20.11.
The use of MLPs for denoising dates back to LeCun (1987) and Gallinari et al. (1987), and Behnke (2001) also used recurrent networks for image denoising. Denoising autoencoders are, in some sense, just MLPs trained to denoise, but the name refers to models intended not merely to remove noise from their input but to learn a good internal representation as a side effect of denoising, an idea introduced by Vincent et al. (2008, 2010). The learned representations can then be used to pretrain a deeper unsupervised network or a supervised network. Like sparse autoencoders, sparse coding, and contractive autoencoders, the goal of the DAE is to allow the learning of a very high-capacity encoder while preventing the encoder and decoder from learning a useless identity function.
Before the advent of the modern Denoising Autoencoder (DAE), Inayoshi and Kurita (2005) aimed to achieve similar objectives using comparable methods Their approach focused on minimizing reconstruction error alongside a supervised objective by incorporating noise into the hidden layer of a supervised Multi-Layer Perceptron (MLP), ultimately enhancing generalization through this noise and reconstruction error However, their technique relied on a linear encoder, which limited its ability to learn function families as effectively as contemporary DAEs.
Learning Manifolds with Autoencoders
Autoencoders, similar to various machine learning algorithms, leverage the concept that data tends to cluster around low-dimensional manifolds While some algorithms focus on learning functions that perform well on these manifolds, they may exhibit unpredictable behavior when encountering inputs that fall outside of them.
Autoencoders take this idea further and aim to learn the structure of the manifold.
To understand how autoencoders do this, we must present some important characteristics of manifolds.
An important characterization of a manifold is the set of its tangent planes.
At a specific point x on a d-dimensional manifold, the tangent plane is represented by d basis vectors that define the local directions of variation permitted on the manifold These local directions indicate the infinitesimal changes that can be made to x while remaining within the manifold.
All autoencoder training procedures involve a compromise between two forces:
An autoencoder learns a representation \( h \) of a training example \( x \), enabling the approximate recovery of \( x \) through a decoder The significance of \( x \) being drawn from the training data lies in the autoencoder's focus on reconstructing likely inputs, rather than attempting to accurately reconstruct all possible inputs outside the data generating distribution.
To meet the constraint or regularization penalty, one can implement architectural limitations on the autoencoder's capacity or incorporate a regularization term into the reconstruction cost These methods typically favor solutions that exhibit reduced sensitivity to input variations.
Together, the two forces (copying the input to the output, and satisfying the constraint or regularization penalty) are useful because they force the hidden representation to capture information about the structure of the data-generating distribution. The autoencoder can afford to represent only the variations that are needed to reconstruct training examples. If the data-generating distribution concentrates near a low-dimensional manifold, this yields representations that implicitly capture a local coordinate system for this manifold: only the variations tangent to the manifold around x need to correspond to changes in the representation. Hence the encoder learns a mapping from the input space to a representation space that is sensitive only to changes along the manifold directions, and insensitive to changes orthogonal to the manifold.
Figure 14.7 illustrates a one-dimensional example demonstrating that by designing the reconstruction function to be insensitive to input perturbations near data points, the autoencoder effectively recovers the underlying manifold structure.
Autoencoders are valuable for manifold learning as they provide a unique way to represent data points situated on or near a manifold, distinguishing them from other methods By effectively capturing the underlying structure of the data, autoencoders facilitate a deeper understanding of the manifold's characteristics.
This article illustrates the concept of a tangent hyperplane through a one-dimensional manifold in a 784-dimensional space, using an MNIST image composed of 784 pixels By applying vertical translations to the image, we define coordinates along the manifold, which traces a curved path in image space The plot displays several points along this manifold, projected into two-dimensional space using PCA for visualization It is important to note that an n-dimensional manifold possesses an n-dimensional tangent plane at each point, which touches the manifold precisely at that point and aligns parallel to its surface.
The concept of a one-dimensional manifold is explored by defining the possible directions of movement while remaining on the manifold itself Each manifold has a unique tangent line at any given point, illustrated with an example in image space In this representation, gray pixels signify unchanged pixels along the tangent line, while white pixels represent areas that brighten and black pixels indicate regions that darken.
An autoencoder that learns a reconstruction function that is invariant to small perturbations near the data points captures the manifold structure of the data; in the example here, the manifold structure is a collection of 0-dimensional manifolds. The optimal reconstruction function crosses the identity function at each data point, and the reconstruction direction vector points towards the nearest manifold. Denoising autoencoders make the derivative of the reconstruction function small around data points, while contractive autoencoders do the same for the encoder. Although the derivative is small near the data points, it can be large between them, which allows the reconstruction function to map corrupted points back onto the manifold. The representation associated with a point on the manifold is called an embedding; it is typically given by a low-dimensional vector, with fewer dimensions than the ambient space of which the manifold is a low-dimensional subset. Some algorithms learn an embedding for each training example, while others learn a general mapping, or representation function, that maps any point in the ambient space to its embedding.
Manifold learning primarily emphasizes unsupervised learning techniques aimed at capturing complex data structures known as manifolds Early research in machine learning related to nonlinear manifolds predominantly utilized non-parametric methods centered around the nearest-neighbor graph, which consists of a node for each training example and edges linking nearby neighbors Key contributions to this field include works by Schölkopf et al (1998), Roweis and Saul (2000), Tenenbaum et al (2000), and Brand (2003), as well as Belkin et al.
Non-parametric manifold learning methods construct a nearest neighbor graph where nodes represent training examples and directed edges indicate nearest neighbor relationships These methods can derive the tangent plane for a neighborhood within the graph and create a coordinate system that links each training example to a real-valued vector position or embedding This representation can be extended to new examples through interpolation, provided there are enough training examples to adequately capture the manifold's curvature and twists Each node is associated with a tangent plane that reflects the variation directions based on the difference vectors between an example and its neighbors, as demonstrated in Figure 14.8.
A global coordinate system can be established by optimizing or solving a linear system This process is visually represented in Figure 14.9, which demonstrates how a manifold can be covered with numerous locally linear Gaussian-like patches, referred to as "pancakes," due to their flatness in the tangent directions.
Local non-parametric approaches to manifold learning face a significant challenge, as highlighted by Bengio and Monperrus (2005) When manifolds exhibit high complexity with numerous peaks, troughs, and twists, a substantial number of training examples may be required to adequately represent each variation.
If the tangent planes at each location are identified, they can be combined to create a comprehensive global coordinate system or density function Each local segment can be perceived as a local Euclidean coordinate system or a locally flat Gaussian.
Contractive Autoencoders
The contractive autoencoder (Rifai et al., 2011a,b) introduces an explicit regularizer on the code h = f(x), encouraging the derivatives of f to be as small as possible:

Ω(h) = λ ‖ ∂f(x)/∂x ‖²_F. (14.18)
The penalty Ω(h) is the squared Frobenius norm (sum of squared elements) of the Jacobian matrix of partial derivatives associated with the encoder function.
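For a single sigmoid encoder layer this penalty has a simple closed form, since row i of the Jacobian is h_i(1 − h_i) W_i. The sketch below (illustrative sizes and penalty weight, none taken from the text) adds it to the reconstruction cost:

    import torch

    # Illustrative contractive autoencoder with a sigmoid encoder h = sigmoid(W x + b):
    # ||dh/dx||_F^2 per example equals sum_i (h_i (1 - h_i))^2 * ||W_i||^2.
    torch.manual_seed(0)
    x_dim, h_dim, lam = 20, 40, 0.1

    W = torch.nn.Parameter(0.1 * torch.randn(h_dim, x_dim))
    b = torch.nn.Parameter(torch.zeros(h_dim))
    V = torch.nn.Parameter(0.1 * torch.randn(x_dim, h_dim))   # decoder weights
    c = torch.nn.Parameter(torch.zeros(x_dim))
    opt = torch.optim.Adam([W, b, V, c], lr=1e-3)

    X = torch.randn(256, x_dim)

    for step in range(100):
        h = torch.sigmoid(X @ W.T + b)                    # encoder f(x)
        x_hat = h @ V.T + c                               # decoder g(h)
        recon = torch.nn.functional.mse_loss(x_hat, X)
        contractive = ((h * (1 - h)) ** 2 @ (W ** 2).sum(dim=1)).mean()
        loss = recon + lam * contractive                  # reconstruction + Omega(h)
        opt.zero_grad()
        loss.backward()
        opt.step()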
There is a connection between the denoising autoencoder and the contractive autoencoder: Alain and Bengio (2013) showed that in the limit of small Gaussian input noise, the denoising reconstruction error is equivalent to a contractive penalty on the reconstruction function that maps x to r = g(f(x)). In other words, denoising autoencoders make the reconstruction function resist small but finite-sized perturbations of the input, while contractive autoencoders make the feature extraction function resist infinitesimal perturbations of the input. When using the contractive penalty to pretrain features f(x) for a classifier, the best classification accuracy usually results from applying the penalty to f(x) rather than to g(f(x)). A contractive penalty on f(x) also has close connections to score matching.
The term "contractive" refers to the CAE's ability to distort space, as it is designed to minimize the effects of input variations This characteristic leads the CAE to transform a cluster of input points into a more compact set of output points, enhancing its stability and robustness.
We can think of this as contracting the input neighborhood to a smaller output neighborhood.
The Contractive Autoencoder (CAE) operates locally by mapping perturbations of a training point \( x \) close to \( f(x) \) However, it is important to note that globally, distinct points \( x \) and \( x' \) can be transformed into outputs \( f(x) \) and \( f(x') \) that may be further apart than the original points.
Globally, f may expand the space between or beyond the data manifolds, as in the 1-D toy example of figure 14.7. When the Ω(h) penalty is applied to sigmoidal units, one easy way to shrink the Jacobian is to make the sigmoid units saturate to 0 or 1. This encourages the CAE to encode input points with extreme values of the sigmoid, which may be interpreted as a binary code, and it also ensures that the CAE spreads its code values throughout most of the hypercube that its sigmoidal hidden units can span.
We can think of the Jacobian matrix J at a point x as approximating the nonlinear encoder f(x) as being a linear operator. This allows us to use the word "contractive" in a more formal sense.
In the context of linear operators, a linear operator is classified as contractive when the norm of J x is less than or equal to 1 for all unit-norm vectors x This means that J effectively reduces the size of the unit sphere, demonstrating its contractive nature.
The CAE penalizes the Frobenius norm of the local linear approximation of f(x) at each training point x This approach encourages the local linear operators to behave as contractions, promoting stability and improved learning during the training process.
As described in section 14.6, regularized autoencoders learn manifolds by balancing two opposing forces. In the case of the CAE, these two forces are the reconstruction error and the contractive penalty Ω(h). Reconstruction error alone would encourage the CAE to learn an identity function, while the contractive penalty alone would encourage the CAE to learn features that are constant with respect to x.
The compromise between these two forces yields an autoencoder whose derivatives ∂f(x)/∂x are mostly tiny. Only a small number of hidden units, corresponding to a small number of directions in the input, may have significant derivatives.
The goal of the CAE is to learn the manifold structure of the data by identifying the directions in which x must change to produce large changes in f(x). Rifai et al. (2011a) observed that, as a result of training, most singular values of the Jacobian drop below 1, making it contractive, while a few singular values remain large because the reconstruction error encourages the CAE to keep directions with substantial local variance. The directions corresponding to the largest singular values are the tangent directions the model has learned, and ideally they correspond to real variations in the data. For example, a CAE applied to images should learn tangent vectors that show how an image changes as an object gradually changes pose. Visualizations of the experimentally obtained singular vectors do indeed correspond to meaningful transformations of the input image.
The CAE regularization criterion is computationally efficient for single hidden layer autoencoders but becomes costly with deeper architectures To address this, Rifai et al (2011a) proposed training multiple single-layer autoencoders, each tasked with reconstructing the hidden layer of the preceding autoencoder This method effectively constructs a deep autoencoder, ensuring that each layer is locally contractive, which results in an overall contractive deep autoencoder While this approach differs from jointly training the entire model with a Jacobian penalty, it still retains many desirable qualitative features.
Another practical issue is that the contraction penalty can obtain useless results if some sort of scale is not imposed on the encoder and decoder.
Figure 14.10 compares the tangent vectors of a manifold estimated by local PCA and by a contractive autoencoder, using an image of a dog from the CIFAR-10 dataset as the reference point. Both methods can capture local tangents, but the CAE obtains more accurate estimates from limited training data because it shares parameters across locations whose active hidden units overlap. The CAE tangent directions typically correspond to moving or changing parts of the object, such as the head or legs. Furthermore, to prevent a degenerate solution in which the encoder learns nothing about the distribution while still allowing perfect reconstruction, Rifai et al. (2011a) tie the weights of the encoder and decoder, making the decoder weight matrix the transpose of the encoder weight matrix.
Predictive Sparse Decomposition
Predictive sparse decomposition (PSD) is a model that is a hybrid of sparse coding and parametric autoencoders, introduced by Kavukcuoglu et al. (2008). A parametric encoder is trained to predict the output of iterative inference. PSD has been applied to unsupervised feature learning for object recognition in images and video (Kavukcuoglu et al., 2009, 2010; Jarrett et al., 2009), as well as to audio (Henaff et al., 2011). The model consists of a parametric encoder f(x) and a decoder g(h); during training, h is controlled by the optimization algorithm, and training proceeds by minimizing

‖x − g(h)‖² + λ|h|₁ + γ‖h − f(x)‖². (14.19)
In sparse coding, the training algorithm alternates between minimizing the cost function related to the model parameters and optimizing the variable h This approach is efficient, as the function f(x) offers a strong initial estimate for h, ensuring that the optimization keeps h close to f(x) By employing simple gradient descent, reasonable values for h can be achieved in as few as ten iterations.
The PSD training procedure uniquely differs from traditional methods by simultaneously optimizing the sparse coding model and the function f(x) to enhance the prediction of sparse coding features This approach effectively regularizes the decoder, ensuring that the parameters utilized allow f(x) to accurately infer optimal code values.
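A hedged sketch of this training procedure, assuming the objective ‖x − g(h)‖² + λ|h|₁ + γ‖h − f(x)‖² given above, a small number of inner gradient steps on h initialized at f(x), and illustrative sizes and hyperparameters:

    import torch

    # Illustrative PSD training: alternate a few gradient steps on h with a
    # parameter update for the encoder f and decoder g, keeping h near f(x).
    torch.manual_seed(0)
    x_dim, h_dim, lam, gamma = 20, 40, 0.1, 1.0

    f = torch.nn.Linear(x_dim, h_dim)                 # parametric encoder
    g = torch.nn.Linear(h_dim, x_dim)                 # decoder
    opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

    X = torch.randn(128, x_dim)

    def psd_cost(h, x):
        return ((x - g(h)) ** 2).sum(dim=1).mean() \
             + lam * h.abs().sum(dim=1).mean() \
             + gamma * ((h - f(x)) ** 2).sum(dim=1).mean()

    for step in range(50):
        # Inner loop: a few gradient steps on h, initialized at f(x).
        h = f(X).detach().clone().requires_grad_(True)
        h_opt = torch.optim.SGD([h], lr=0.1)
        for _ in range(10):
            h_opt.zero_grad()
            psd_cost(h, X).backward()
            h_opt.step()
        # Outer step: update encoder and decoder parameters with h held fixed.
        opt.zero_grad()
        psd_cost(h.detach(), X).backward()
        opt.step()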
Predictive sparse coding is an example of learned approximate inference.
In section 19.5, the discussion expands on the tools introduced in chapter 19, illustrating that PSD can be understood as the process of training a directed sparse coding probabilistic model This is achieved by maximizing a lower bound on the model's log-likelihood.
In practical applications of PSD, the iterative optimization is used only during training. Once the model has been deployed, the parametric encoder f is used to compute already-learned features, so evaluating f is computationally inexpensive compared with inferring h via gradient descent. Moreover, because f is a differentiable parametric function, PSD models may be stacked and used to initialize a deep network to be trained with another criterion.
Applications of Autoencoders
Autoencoders have been applied successfully to dimensionality reduction and information retrieval, which were among the first applications of representation learning and deep learning. In pioneering work, Hinton and Salakhutdinov (2006) trained a stack of RBMs and used their weights to initialize a deep autoencoder with progressively smaller hidden layers, culminating in a bottleneck of 30 units. The resulting code yielded lower reconstruction error than PCA, and the learned representation was easier to interpret, with the different categories manifesting as well-separated clusters. Lower-dimensional representations can improve performance on many tasks, such as classification, because models of smaller spaces consume less memory and runtime.
Many forms of dimensionality reduction place semantically related examples near each other, as observed by Salakhutdinov and Hinton (2007b) and Torralba et al. (2008). The hints provided by the mapping to the lower-dimensional space aid generalization.
One task that benefits even more than usual from dimensionality reduction is information retrieval, the task of finding entries in a database that resemble a query. Beyond the usual efficiency gains, search can become extremely efficient in certain kinds of low-dimensional spaces: if we train the dimensionality reduction algorithm to produce a code that is low-dimensional and binary, we can store all database entries in a hash table that maps binary codes to entries. Retrieving matching entries then amounts to returning all entries with the same binary code as the query, and slightly less similar entries can be searched efficiently by flipping individual bits of the query's encoding. This approach, known as semantic hashing, has been applied to both textual and image data.
To produce binary codes for semantic hashing, one typically uses an encoding function with sigmoids on the final layer. The sigmoid units must be trained so that they saturate to nearly 0 or nearly 1 for all input values. One trick that accomplishes this is to inject additive noise just before the sigmoid nonlinearity during training, with the magnitude of the noise increasing over time. To fight the noise and preserve as much information as possible, the network must increase the magnitude of the inputs to the sigmoid function until saturation occurs.
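A minimal sketch of this saturation trick, with an illustrative architecture, noise schedule, and code size (all assumptions, not taken from the text):

    import torch

    # Illustrative semantic-hashing encoder: add growing noise before the
    # final sigmoid during training, then threshold to get binary hash bits.
    torch.manual_seed(0)
    x_dim, code_bits = 100, 16

    encoder = torch.nn.Linear(x_dim, code_bits)
    decoder = torch.nn.Linear(code_bits, x_dim)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

    X = torch.randn(256, x_dim)

    for step in range(200):
        noise_scale = step / 200.0                       # noise grows over time
        pre = encoder(X)
        code = torch.sigmoid(pre + noise_scale * torch.randn_like(pre))
        x_hat = decoder(code)
        loss = torch.nn.functional.mse_loss(x_hat, X)
        opt.zero_grad()
        loss.backward()
        opt.step()

    hash_bits = (torch.sigmoid(encoder(X)) > 0.5).int()  # binary codes for the hash table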
The exploration of hashing functions has expanded to include training representations that optimize a loss function closely related to the task of identifying nearby examples within a hash table.
This chapter explores the concept of learning representations and their significance in designing deep architectures It highlights how learning algorithms leverage statistical strengths across various tasks, utilizing information from unsupervised tasks to enhance supervised learning Shared representations facilitate the handling of multiple modalities and domains, enabling the transfer of learned knowledge to tasks with limited or no examples but existing task representations The discussion concludes by examining the reasons behind the success of representation learning, emphasizing the theoretical benefits of distributed and deep representations, as well as the assumptions regarding the data generating process and the underlying causes of observed data.
The ease or difficulty of information processing tasks is largely influenced by how the information is represented, a principle that applies to daily life, computer science, and machine learning For instance, dividing 210 by 6 is simple using long division, but becomes complex when using Roman numerals, prompting most people to convert to Arabic numerals first Additionally, the efficiency of operations can vary significantly based on representation; inserting a number into a sorted list is an O(n) operation with a linked list, but only O(logn) with a red-black tree.
In machine learning, the effectiveness of a representation is determined by its ability to simplify subsequent learning tasks A superior representation enhances the performance of the chosen learning task, making it easier to achieve desired outcomes Ultimately, the optimal representation is closely tied to the specific learning task at hand.
Feedforward networks trained through supervised learning engage in representation learning, where the final layer typically functions as a linear classifier, such as softmax regression The preceding layers are designed to generate representations that facilitate the classification task This training approach enhances the properties of representations in the hidden layers, particularly towards the top, allowing for previously non-linearly separable classes in the input features to become linearly separable in the last hidden layer Additionally, the final layer could employ alternative models, like a nearest neighbor classifier, with the penultimate layer's features adapting based on the type of classifier used.
Supervised training of feedforward networks does not impose conditions on learned intermediate features, unlike some representation learning algorithms that are specifically designed to shape representations For instance, to facilitate density estimation, one could create an objective function that promotes independence among elements in the representation vector Similar to supervised networks, unsupervised deep learning algorithms primarily focus on a training objective while also learning representations as a byproduct Regardless of the method used to obtain a representation, it can be applied to different tasks, and multiple tasks—both supervised and unsupervised—can be learned concurrently using a shared internal representation.
Most representation learning problems face a tradeoff between preserving as much information about the input as possible and attaining nice properties (such as independence).
Representation learning is a valuable approach for unsupervised and semi-supervised learning, especially when dealing with large volumes of unlabeled data and limited labeled data Supervised learning on the labeled subset can lead to significant overfitting, but semi-supervised learning addresses this issue by leveraging both labeled and unlabeled data By learning effective representations from the unlabeled data, we can enhance the performance of supervised learning tasks.
Humans and animals can learn effectively from minimal labeled examples, though the mechanisms behind this capability remain unclear Possible explanations for enhanced human performance include the brain's use of extensive classifier ensembles or Bayesian inference methods A prominent theory suggests that the brain utilizes unsupervised or semi-supervised learning techniques This chapter emphasizes the potential of leveraging unlabeled data to develop a robust representation.
Greedy Layer-Wise Unsupervised Pretraining
Unsupervised learning significantly contributed to the resurgence of deep neural networks by allowing researchers to train deep supervised networks without the need for specialized architectures such as convolutional or recurrent layers This method, known as greedy layer-wise unsupervised pretraining, exemplifies how a representation learned from unsupervised learning—focused on understanding the input distribution—can be beneficial for supervised learning tasks within the same input domain.
Greedy layer-wise unsupervised pretraining utilizes single-layer representation learning algorithms, including RBMs, single-layer autoencoders, and sparse coding models, to learn latent representations Each layer undergoes unsupervised pretraining, leveraging the output from the previous layer to generate a new, simplified representation of the data, ideally enhancing its distribution or relationships with predictive variables For a formal description, refer to Algorithm 15.1.
Greedy layer-wise training procedures based on unsupervised criteria have long been used to sidestep the difficulty of jointly training the layers of a deep neural network for a supervised task, with origins going back at least to the Neocognitron in 1975. The deep learning renaissance of 2006 began with the discovery that this greedy learning procedure could be used to find a good initialization for a joint learning procedure over all the layers, and that this approach could successfully train even fully connected architectures. Before this discovery, only convolutional deep networks or recurrent networks were considered feasible to train. We now know that greedy layer-wise pretraining is not required to train fully connected deep architectures, but it was the first method to succeed at doing so.
Greedy layer-wise pretraining is called greedy because it optimizes each piece of the solution independently, one layer at a time, rather than jointly optimizing all pieces: the k-th layer is trained while the previously trained layers are kept fixed, so the lower layers (trained first) are not adapted after the upper layers are introduced. It is called unsupervised because each layer is trained with an unsupervised representation learning algorithm. It is also called pretraining because it is only a first step before a joint training algorithm is applied to fine-tune all the layers together. In the context of a supervised learning task, it can be viewed as a regularizer (in some experiments, pretraining decreases test error without decreasing training error) and as a form of parameter initialization.
The term "pretraining" often encompasses both the pretraining stage and the subsequent supervised learning phase in a two-phase protocol This supervised phase may involve training a simple classifier using the features obtained during pretraining or fine-tuning the entire network developed in that stage Regardless of the unsupervised learning algorithm or model type used, the overall training approach remains largely consistent Although the specific unsupervised learning algorithm chosen can influence the details, most applications of unsupervised pretraining adhere to this fundamental structure.
Greedy layer-wise unsupervised pretraining can serve as a valuable initialization technique for various unsupervised learning algorithms, including deep autoencoders and probabilistic models featuring multiple layers of latent variables Notably, this approach can be applied to deep belief networks and deep Boltzmann machines, which are both prominent examples of deep generative models that will be explored in further detail in chapter 20.
Greedy layer-wise supervised pretraining, as outlined in section 8.7.4, is based on the idea that training a shallow network is simpler than training a deep network, a concept supported by various studies (Erhan et al., 2010).
15.1.1 When and Why Does Unsupervised Pretraining Work?
Greedy layer-wise unsupervised pretraining can yield significant improvements in test accuracy on classification tasks, and it was responsible for sparking the renewal of interest in deep neural networks beginning in 2006 (Hinton et al., 2006).
Algorithm 15.1 Greedy layer-wise unsupervised pretraining protocol.
Given: an unsupervised feature learning algorithm L, which takes a training set of examples and returns an encoder or feature function f. The raw input data is X, with one example per row, and f(1)(X) denotes the output of the first-stage encoder on X. If fine-tuning is performed, we use a learner
T, which takes an initial function f and input examples X (and, in the supervised fine-tuning case, associated targets Y), and returns a tuned function. The number of stages is m.

f ← Identity function
X̃ ← X
for k = 1, …, m do
  f(k) ← L(X̃)
  f ← f(k) ∘ f
  X̃ ← f(k)(X̃)
end for
if fine-tuning then
  f ← T(f, X, Y)
end if
Return f
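A hedged Python rendering of the protocol in algorithm 15.1, using one-layer autoencoders as the unsupervised learner L; the layer sizes, optimizer settings, and data below are illustrative assumptions:

    import torch

    # Illustrative greedy layer-wise pretraining with shallow autoencoders.
    def train_autoencoder_layer(X, h_dim, steps=200, lr=1e-3):
        enc = torch.nn.Sequential(torch.nn.Linear(X.shape[1], h_dim), torch.nn.ReLU())
        dec = torch.nn.Linear(h_dim, X.shape[1])
        opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
        for _ in range(steps):
            loss = torch.nn.functional.mse_loss(dec(enc(X)), X)
            opt.zero_grad(); loss.backward(); opt.step()
        return enc                                    # this stage's feature function f(k)

    torch.manual_seed(0)
    X = torch.randn(512, 100)                         # unlabeled training data
    layer_sizes = [64, 32]                            # m = 2 stages

    stages, X_tilde = [], X
    for h_dim in layer_sizes:
        f_k = train_autoencoder_layer(X_tilde, h_dim)
        stages.append(f_k)
        X_tilde = f_k(X_tilde).detach()               # representation for the next stage

    pretrained = torch.nn.Sequential(*stages)         # f = f(2) o f(1)
    # Fine-tuning would now train `pretrained` (plus an output layer) on labels Y.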
Unsupervised pretraining can be beneficial for certain tasks, but it may also lead to detrimental effects in others Research by Ma et al (2015) on machine learning models for chemical activity prediction revealed that while pretraining was generally slightly harmful, it proved significantly advantageous for various tasks Understanding the conditions under which unsupervised pretraining is effective or harmful is crucial for assessing its applicability to specific tasks.
This discussion primarily focuses on greedy unsupervised pretraining, while acknowledging the existence of alternative paradigms for semi-supervised learning with neural networks, such as virtual adversarial training Additionally, it is feasible to simultaneously train an autoencoder or generative model alongside a supervised model Notable examples of this single-stage approach include the discriminative RBM and the ladder network, where the overall objective is a clear sum of both supervised and unsupervised components.
Unsupervised pretraining integrates two key concepts: the significant impact of initial parameters on the regularization of deep neural networks, which can also enhance optimization, and the broader principle that understanding the input distribution aids in learning the relationship between inputs and outputs.
Both of these ideas involve many complicated interactions between several parts of the machine learning algorithm that are not entirely understood.
The first idea, that the choice of initial parameters for a deep neural network can have a strong regularizing effect on its performance, is the least well understood.
Pretraining, once viewed as a means to guide neural networks toward favorable local minima, is now recognized as less critical for optimization, as standard training often bypasses critical points altogether While pretraining may position models in advantageous regions of the cost function landscape, challenges in gradient estimation and Hessian conditioning persist Consequently, contemporary methods favor simultaneous unsupervised and supervised learning over sequential approaches A practical alternative is to freeze feature extractor parameters and apply supervised learning solely to develop a classifier atop the extracted features.
Learning algorithms can enhance performance in supervised learning by leveraging information gained during the unsupervised phase Features beneficial for unsupervised tasks, like recognizing wheels in images of vehicles, may also aid supervised learning However, the mathematical foundations of this relationship are not fully understood, making it difficult to predict which tasks will benefit The effectiveness of this approach often relies on the specific models used; for instance, pretrained features must enable linear separability for a linear classifier to be effective Consequently, simultaneous supervised and unsupervised learning is often advantageous, as it integrates output layer constraints from the beginning.
Unsupervised pretraining is most helpful when the initial representation is poor, as with word embeddings. One-hot vectors carry no useful distance information, since every pair of distinct words is equally far apart; learned word embeddings, by contrast, encode semantic similarity through their spatial relationships in vector space. This makes unsupervised pretraining especially valuable for natural language processing. It is less useful for image processing, perhaps because images already lie in a rich vector space in which distances provide only a low-quality similarity metric.
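A small numpy illustration of this contrast (the embedding values below are invented purely for illustration): one-hot codes place every pair of distinct words at the same distance, while learned embeddings can place related words close together.

```python
import numpy as np

# One-hot codes: every pair of distinct words is equally far apart.
cat, dog, refrigerator = np.eye(3)
print(np.linalg.norm(cat - dog))            # sqrt(2)
print(np.linalg.norm(cat - refrigerator))   # also sqrt(2): no notion of similarity

# Learned embeddings (values made up for illustration): related words end up nearby.
emb = {"cat": np.array([0.9, 0.1]),
       "dog": np.array([0.8, 0.2]),
       "refrigerator": np.array([-0.7, 0.9])}
print(np.linalg.norm(emb["cat"] - emb["dog"]))            # small distance
print(np.linalg.norm(emb["cat"] - emb["refrigerator"]))   # much larger distance
```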
Transfer Learning and Domain Adaptation
Transfer learning and domain adaptation involve leveraging knowledge gained in one setting (distribution P1) to improve performance in a different setting (distribution P2). This generalizes the previous discussion of transferring representations between an unsupervised learning task and a supervised learning task, highlighting the potential for improved generalization across varying data distributions.
In transfer learning, the learner must perform two or more different tasks, under the assumption that many of the factors explaining the variations in the first task, P1, are relevant to learning the second task, P2. This is typically applied in a supervised learning context, where the input is the same but the target differs in nature. For instance, one might first learn to identify one set of visual categories, such as cats and dogs, and then a different set, such as ants and wasps. If the first task has significantly more data, learning it can help build representations that allow quick generalization from only a few examples of P2. Many visual categories share low-level notions such as edges and shapes, as well as common responses to geometric and lighting changes. In general, transfer learning, multi-task learning and domain adaptation can all be achieved via representation learning when features useful across the different tasks or settings exist, corresponding to shared lower layers and task-specific upper layers.
In some scenarios, the shared semantics among the various tasks concern the output rather than the input. For instance, a speech recognition system must produce coherent sentences at the output layer, while the earlier layers may need to recognize very different versions of the same phonemes depending on the speaker. In such cases, it is more effective to share the upper layers of the neural network and have task-specific preprocessing, as demonstrated in figure 15.2.
In such a multi-task or transfer learning architecture, the output variable y has the same semantics for all tasks, while the input variable x has a different meaning, and possibly a different dimension, for each task or user, labeled x(1), x(2) and x(3). The lower levels of the architecture are task-specific and map their distinct inputs into a common set of features, while the upper levels are shared among tasks. This design allows effective learning and feature extraction tailored to each task's requirements.
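A minimal numpy sketch of this kind of architecture, assuming made-up layer sizes and random untrained weights: each task has its own lower transformation into a common feature space, while the upper layers and the output (with shared y semantics) are reused by every task.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# Task-specific lower layers: each input x^(k) may have its own dimensionality,
# but every task maps into the same shared 8-dimensional feature space.
input_dims = {"task1": 5, "task2": 12, "task3": 7}
W_lower = {t: rng.normal(size=(8, d)) * 0.1 for t, d in input_dims.items()}

# Shared upper layers: identical parameters reused by every task.
W_shared = rng.normal(size=(4, 8)) * 0.1
W_out = rng.normal(size=(1, 4)) * 0.1

def predict(task, x):
    h_task = relu(W_lower[task] @ x)     # task-specific transformation
    h_shared = relu(W_shared @ h_task)   # shared representation
    return W_out @ h_shared              # shared output layer (same y semantics)

print(predict("task1", rng.normal(size=5)))
print(predict("task2", rng.normal(size=12)))
```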
In domain adaptation, the task and the optimal input-to-output mapping remain the same, but the input distribution differs slightly. For example, consider sentiment analysis, which determines whether a comment expresses positive or negative sentiment; a domain adaptation scenario arises when a sentiment predictor trained on media reviews is applied to comments about consumer electronics. Although an underlying function exists that classifies statements as positive, neutral or negative, the vocabulary and style vary across domains, which makes generalization harder. Simple unsupervised pretraining methods, such as denoising autoencoders, have been shown to be very successful for sentiment analysis in domain adaptation settings.
Concept drift refers to gradual changes in the data distribution over time and can be viewed as a form of transfer learning. Both concept drift and transfer learning can in turn be seen as particular forms of multi-task learning. Although "multi-task learning" usually refers to supervised learning tasks, the broader notion of transfer learning also applies to unsupervised learning and reinforcement learning.
In all of these cases, the goal is to take advantage of data from one setting to improve learning and prediction in another. The core idea behind using representation learning to achieve this is that the same representation may be useful in both settings, allowing the model to benefit from the combined training data of both.
Unsupervised deep learning for transfer learning has seen success in several machine learning competitions, showing that participants can learn a feature space from data drawn from the first setting (distribution P1) and then use this learned transformation to train a linear classifier that generalizes well from very few labeled examples in the transfer setting (distribution P2). Remarkably, deeper architectures that exploit unsupervised representations from the initial dataset yield markedly better learning curves on the new categories of the transfer setting, so that fewer labeled examples are needed to approach asymptotic generalization performance.
One-shot learning and zero-shot learning are two extreme forms of transfer learning. In one-shot learning, only a single labeled example of the transfer task is given, whereas zero-shot learning uses no labeled examples at all, relying instead on prior knowledge to make predictions.
One-shot learning, as described by Fei-Fei et al. (2006), is possible because the representation learned in the first phase already cleanly separates the underlying classes. In the transfer learning phase, a single labeled example then suffices to infer the labels of many test examples that cluster around it in representation space. This works to the extent that the factors of variation relevant to the task are cleanly separated from the irrelevant ones in the learned representation space, allowing accurate discrimination among object categories.
As an example of a zero-shot learning setting, consider the problem of having a learner read a large collection of text and then solve object recognition problems.
It is possible to recognize a specific object class based only on a textual description, without ever having seen an image of that object. For example, having read that a cat has four legs and pointy ears, a learner may be able to guess that an unseen image depicts a cat.
Zero-data learning and zero-shot learning are possible only because additional information has been exploited during training. We can think of zero-data learning as involving three random variables: the traditional inputs x, the traditional outputs or targets y, and an additional random variable describing the task, T. The model is trained to estimate the conditional distribution p(y | x, T), which lets it make predictions even when no labeled examples of a task are available.
In the example of recognizing cats after reading about them, the output is a binary variable indicating whether an image contains a cat (y = 1 for "yes" and y = 0 for "no"), and the task variable T represents questions such as "Is there a cat in this image?" If we have a training set containing unsupervised examples of objects that live in the same space as T, we may be able to infer the meaning of previously unseen instances of T.
To recognize cats without prior images, it is crucial to utilize unlabeled text data that includes descriptive sentences like "cats have four legs" and "cats have pointy ears."
Zero-shot learning requires T to be represented in a way that allows some sort of generalization. Instead of a one-hot code for each object category, Socher et al. (2013b) use a distributed representation: the learned word embedding of the word associated with each category.
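A toy sketch of this idea, assuming (hypothetically) that a visual model has already been trained to map images into the same space as the word embeddings, in the spirit of Socher et al. (2013b); the embedding values and category names below are invented.

```python
import numpy as np

# Hypothetical learned word embeddings for category names (values invented).
word_emb = {"cat":   np.array([0.9, 0.1, 0.0]),
            "dog":   np.array([0.7, 0.3, 0.1]),
            "zebra": np.array([-0.2, 0.8, 0.5])}   # no labeled zebra images seen

def predict_category(image_repr, embeddings):
    """Assign the category whose word embedding is closest (by cosine similarity)
    to the image representation, allowing classes with zero labeled images."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(embeddings, key=lambda w: cosine(image_repr, embeddings[w]))

# An image representation that a (hypothetical) visual model mapped near "zebra".
x_repr = np.array([-0.1, 0.7, 0.6])
print(predict_category(x_repr, word_emb))   # -> "zebra"
```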
Semi-Supervised Disentangling of Causal Factors
A key question in representation learning is what makes one representation better than another. One prevailing hypothesis is that an ideal representation is one whose features correspond to the underlying causes of the observed data, with the representation disentangling these causes in feature space. Such a representation is a good representation for p(x) and may also make it easier to compute p(y | x) when y is among the most salient causes of x. This idea has guided deep learning research since at least the 1990s. For additional discussion of when semi-supervised learning can outperform purely supervised learning, see section 1.2 of Chapelle et al. (2006).
In representation learning, we would like a model that is both easy to work with and that separates the underlying causal factors. These properties do not always coincide, but the hypothesis motivating semi-supervised learning is that they often do: once we uncover the fundamental explanations for the observed data, it typically becomes easier to isolate individual attributes. In particular, if a representation captures the most salient causes of the observed data, predicting the outputs from that representation becomes straightforward.
Semi-supervised learning can fail when the unsupervised learning of p(x) is of no help in learning p(y | x). For example, if p(x) is uniformly distributed and we wish to learn the function f(x) = E[y | x], a training set of x values alone provides no information about p(y | x). Without additional information, the relationship between the input x and the output y cannot be discerned.
Figure 15.4 illustrates a density over x that is a mixture of three components. The identity of the component is an important underlying explanatory factor, which we call y. Because the mixture components (such as natural object classes in image data) are statistically salient, modeling p(x) in an unsupervised way, without any labeled examples, already reveals the underlying factor y.
Semi-supervised learning can therefore succeed when the data are drawn from a mixture, with one distinct component per class label. When the mixture components are well separated, accurately modeling p(x) reveals precisely where each component lies, and a single labeled example per class is then enough to learn p(y | x) perfectly. The question remains what might tie p(y | x) and p(x) together in more complex situations.
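A minimal sketch of this scenario using scikit-learn's GaussianMixture (assuming scikit-learn is available): p(x) is modeled without labels, and one labeled example per class then suffices to name each component and hence determine p(y | x) everywhere.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic data drawn from three well-separated classes (the mixture components).
means = [-6.0, 0.0, 6.0]
X = np.concatenate([rng.normal(m, 0.5, size=200) for m in means]).reshape(-1, 1)

# Unsupervised step: model p(x) as a mixture of three Gaussians.
gm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Semi-supervised step: one labeled example per class names each component,
# which then determines the predicted label for every other point.
labeled_x = np.array([[-6.0], [0.0], [6.0]])
labeled_y = ["class A", "class B", "class C"]
component_to_label = {gm.predict(x.reshape(1, -1))[0]: y
                      for x, y in zip(labeled_x, labeled_y)}

x_new = np.array([[5.4]])
print(component_to_label[gm.predict(x_new)[0]])   # expected: "class C"
```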
If y is closely tied to one of the causal factors of x, then p(x) and p(y | x) are strongly linked. This suggests that unsupervised representation learning that tries to disentangle the underlying factors of variation is likely to be useful as a semi-supervised learning strategy.
Consider the assumption that y is one of the causal factors of x, and let h represent all of those factors. The true generative process can then be described as a directed graphical model with h as the parent of x: p(h, x) = p(x | h) p(h).
As a consequence, the marginal probability of the data is p(x) = E_h[ p(x | h) ]. From this we see that the best possible model of x, in terms of generalization, is the one that uncovers the above "true" structure, with h as a latent variable that explains the observed variations in x.
The "ideal" representation learning discussed above should therefore recover these latent factors, making it easy to predict quantities such as y from them. By Bayes' rule, the conditional distribution of y given x is tied to the same components: p(y | x) = p(x | y) p(y) / p(x).
Thus the marginal p(x) is intimately tied to the conditional p(y | x), and knowledge of the structure of the former should help with learning the latter. Under these assumptions, semi-supervised learning should improve performance.
An important research problem is that most observations are formed by an extremely large number of underlying causes. Suppose y = h_i for some specific h_i, but the unsupervised learner does not know which one. A brute force solution is for the unsupervised learner to learn a representation that captures all of the reasonably salient generative factors h_j and disentangles them from one another, making it easy to predict y regardless of which h_i it is associated with.
In practice, the brute force approach of capturing every factor influencing the observations is not feasible; it is impossible to encode every detail, such as every small background object in a visual scene. Research shows that humans routinely fail to perceive changes in their environment that are not directly relevant to the task they are performing, which highlights the importance of selective encoding. An important frontier of semi-supervised learning is therefore determining what to encode in each situation. Currently, two main strategies are used: either combine a supervised learning signal with the unsupervised one, so that the model chooses to capture the most relevant factors of variation, or use much larger representations when relying purely on unsupervised learning.
An emerging strategy in unsupervised learning is to modify the definition of which underlying causes are most salient. Historically, autoencoders and generative models have been trained to optimize a fixed criterion, often similar to mean squared error, and this criterion determines which causes are considered salient. For instance, mean squared error applied to image pixels implicitly specifies that an underlying cause is salient only if it significantly changes the brightness of a large number of pixels. This can be limiting, particularly for tasks that require interacting with small objects.
For example, an autoencoder trained with mean squared error for a robotics task failed to reconstruct a ping pong ball, even though the ball's existence and spatial coordinates were highly relevant to generating images for the task. The robot was able to interact successfully with larger objects, such as baseballs, but the autoencoder, lacking the capacity to encode everything, did not find the ping pong ball salient enough to encode. Salience can be defined in other ways; for instance, a group of pixels that forms a highly recognizable pattern, regardless of brightness, can be considered extremely salient. A promising way to implement such a definition of salience is to use generative adversarial networks (Goodfellow et al., 2014c).
In this approach, a generative model is trained to fool a feedforward classifier, which tries to recognize samples from the generative model as fake and samples from the training set as real. In this framework, any structured pattern that the feedforward network can recognize is highly salient. Generative adversarial networks are described in more detail in section 20.10.4; for now it is sufficient to note that they learn how to determine what is salient. Lotter et al. (2015) showed that models trained to generate images of human heads often neglect the ears when trained with mean squared error, but include them when trained with the adversarial framework. Because the ears are not extremely bright or dark compared with the surrounding skin, they are not especially salient under mean squared error loss.
Distributed Representation
Distributed representations of concepts, representations composed of many elements that can be set separately from each other, are among the most important tools in representation learning. Their power lies in the ability to use n features with k values each to describe k^n different concepts. Throughout this book, we have seen that both neural networks with many hidden units and probabilistic models with many latent variables make use of the strategy of distributed representation. Moreover, many deep learning algorithms are motivated by the assumption that the hidden units can learn to represent the underlying causal factors that explain the data. Distributed representations are natural for this approach, because each direction in representation space can correspond to the value of a different underlying configuration variable.
A distributed representation based on a vector of n binary features can take 2^n configurations, each potentially corresponding to a different region in input space. In contrast, a symbolic representation associates the input with a single symbol or category; with n symbols there are only n configurations, carving n distinct regions in input space. Such a symbolic representation is also called a one-hot representation, since it can be captured by a binary vector with n mutually exclusive bits, only one of which can be active at a time. Symbolic representations are a specific example of the broader class of non-distributed representations, which may contain many entries but without significant meaningful independent control over each entry.
Examples of learning algorithms based on non-distributed representations include:
• Clustering methods, including the k-means algorithm: each input point is assigned to exactly one cluster.
• k-nearest neighbors algorithms: one or a few templates or prototype examples are associated with a given input. In the case of k > 1, multiple values describe each input, but they cannot be controlled separately from one another, so this does not quite qualify as a true distributed representation.

Figure 15.7 illustrates how a learning algorithm based on a distributed representation partitions the input space into regions using binary features h1, h2 and h3, each obtained by thresholding the output of a learned linear transformation. Each feature divides R² into a pair of half-planes, with h+i and h−i denoting the sets of input points for which the feature takes the value 1 or 0, respectively. The combined representation takes a unique value at each intersection of these half-planes; for example, the region h+1 ∩ h+2 ∩ h+3 corresponds to the code [1, 1, 1]. In general, a distributed representation in d dimensions that intersects half-spaces in this way can assign unique codes to O(n^d) regions, far more than the n regions a nearest-neighbor-style non-distributed representation can distinguish with n examples. Not every h value is attainable, however, and a linear classifier on top of the distributed representation cannot assign a different class identity to every neighboring region; even a deep linear-threshold network has a limited VC dimension. The combination of a powerful representation layer and a weaker classifier can act as a good regularizer: a classifier trying to learn the concept of "person" versus "not a person" does not need to assign a different class to an input represented as "man without glasses" than to one represented with glasses. This capacity constraint encourages each classifier to focus on a few features and encourages h to represent the classes in a linearly separable way.
• Decision trees: only one leaf (and the nodes on the path from root to leaf) is activated when an input is given.
• Gaussian mixtures and mixtures of experts: the templates (cluster centers) or experts are each associated with a degree of activation. As with the k-nearest neighbors algorithm, each input is described with multiple values, but those values cannot readily be controlled separately from each other.
• Kernel machines with a Gaussian kernel (or other similarly local kernels): although the degree of activation of each "support vector" or template example is continuous-valued, the same issue arises as with Gaussian mixtures.
• Language or translation models based on n-grams: the set of contexts (sequences of symbols) is partitioned according to a tree structure of suffixes. A leaf may correspond to the last two words being w1 and w2, for example. Separate parameters are estimated for each leaf of the tree (with some sharing being possible).
For some of these non-distributed algorithms, the output is not piecewise constant but instead interpolates between neighboring regions. What matters is that the relationship between the number of parameters (or examples) and the number of regions they can define remains linear.
Distributed representations differ from symbolic ones in that generalization arises from shared attributes between different concepts. As pure symbols, "cat" and "dog" are as distant from each other as any other two symbols, but a meaningful distributed representation allows common traits, such as "has_fur" or "number_of_legs", to apply to both. This is why neural language models that operate on distributed representations of words generalize much better than models that operate directly on one-hot representations. Distributed representations induce a rich similarity space in which semantically related concepts lie close together, a property that purely symbolic representations lack.
When and why can there be a statistical advantage from using a distributed representation as part of a learning algorithm? Distributed representations can have a statistical advantage when an apparently complicated structure can be compactly represented using a small number of parameters.
The nearest neighbor algorithm exemplifies a non-distributed approach to learning: it partitions the input space into regions, each governed by its own parameters. While this makes it straightforward to fit the training data without a difficult optimization problem, generalization is only local, so it is hard to model complicated functions with many more variations than the number of training examples. Non-distributed algorithms typically rely on the smoothness assumption, which states that similar inputs map to similar outputs, but in high-dimensional spaces satisfying this assumption can require a very large number of examples. Each region can be viewed as its own category or symbol, allowing a unique mapping from symbol to value, but this framework provides no way to generalize to new regions or new symbols.
Of course, the target function may have regularities of other kinds. For example, a convolutional network with max-pooling can recognize an object regardless of its position in the image, even though spatial translation of the object does not correspond to a smooth transformation in the input space.
Let us examine a specific family of distributed representation learning algorithms: those that extract binary features by thresholding linear functions of the input. Each binary feature divides R^d into a pair of half-spaces, as illustrated in figure 15.7. What determines how many regions this representation can distinguish is the exponentially large number of intersections of the corresponding n half-spaces. The number of regions generated by an arrangement of n hyperplanes in R^d follows a general result on hyperplane intersections (Zaslavsky, 1975): it is at most the sum of binomial coefficients, Σ_{j=0}^{d} C(n, j), which grows as O(n^d) (Pascanu et al., 2014b).
Therefore, we see a growth that is exponential in the input size and polynomial in the number of hidden units.
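The following sketch, with arbitrarily chosen n and d, enumerates the sign patterns (region codes) produced by randomly placed hyperplanes on a dense probe of the plane and compares the count against the Zaslavsky bound mentioned above.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n, d = 6, 2                     # n hyperplanes (binary features) in d dimensions

W = rng.normal(size=(n, d))     # each row defines one learned linear feature
b = rng.normal(size=n)

# Probe the plane densely and record which sign pattern (region code) each point gets.
grid = np.stack(np.meshgrid(np.linspace(-4, 4, 800),
                            np.linspace(-4, 4, 800)), axis=-1).reshape(-1, d)
codes = (grid @ W.T + b > 0)
n_regions_seen = len({tuple(c) for c in codes})

# Upper bound on the number of regions n hyperplanes can carve out of R^d
# (Zaslavsky, 1975): sum_{j=0}^{d} C(n, j), which grows as O(n^d).
bound = sum(comb(n, j) for j in range(d + 1))
print(n_regions_seen, bound)    # n_regions_seen is at most the bound (here 22)
```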
In a d-dimensional space where each dimension distinguishes at least two values, a function may behave differently in each of exponentially many regions. To learn a function that varies across 2^d different regions, one would in general need O(2^d) training examples.
This provides a geometric argument for the generalization power of distributed representations: with O(nd) parameters (n linear-threshold features in R^d), we can distinctly represent O(n^d) regions of input space. A representation that instead assigned a unique symbol to each region, with separate parameters per symbol, would need O(n^d) examples to specify them. The argument extends to nonlinear feature extractors: a parametric transformation with k parameters can learn about r regions of input space, with k much smaller than r, and can potentially generalize far better than a non-distributed approach that would need O(r) examples. Using fewer parameters to represent the model means there are fewer parameters to fit, and thus far fewer training examples are needed to generalize well.
A further reason why models based on distributed representations generalize well is that their capacity remains limited even though they can distinctly encode so many regions. For example, the VC dimension of a neural network of linear threshold units is only O(w log w), where w is the number of weights. Although many unique codes can be assigned in representation space, we cannot use all of the code space, nor can we learn arbitrary functions from the representation to the output with a linear classifier. Using a distributed representation combined with a linear classifier therefore expresses a prior belief that the classes to be recognized are linearly separable as a function of the underlying causal factors. We typically want to learn categories such as all images of green objects or all images of cars, not categories that require nonlinear logic, such as separating red cars and green trucks from green cars and red trucks.
Research by Zhou et al. (2015) shows that hidden units in deep convolutional networks trained on the ImageNet and Places benchmark datasets often learn interpretable features, frequently aligning with labels that humans would intuitively assign. This suggests a tangible link between the abstract notion of disentangled features and their experimental validation in deep learning.
Exponential Gains from Depth
We have seen that multilayer perceptrons are universal approximators and that some functions can be represented by exponentially smaller deep networks than by shallow ones. This decrease in model size leads to improved statistical efficiency. In this section, we describe how similar results apply more generally to other kinds of models with distributed hidden representations.
In section 15.4 we saw an example of a generative model that learned about the underlying explanatory factors of facial images, such as gender and whether the person is wearing glasses. That model was based on a deep neural network; a shallow network, such as a linear model, cannot capture the complicated relationship between these abstract factors and the image pixels. These high-level factors, although nearly independent of one another, are related to the input in intricate, nonlinear ways, which motivates the need for deep distributed representations in which higher-level features and generative causes are obtained through the composition of many nonlinear transformations.
Organizing computation through the composition of nonlinearities and a hierarchy of reused features can give an exponential boost to statistical efficiency, on top of the boost given by distributed representations. Many kinds of networks, for example those with saturating nonlinearities, Boolean gates, or radial basis function (RBF) units, can be shown to be universal approximators with a single hidden layer: such models can approximate a large class of functions, including all continuous functions, to any non-zero tolerance, given enough hidden units, although that number may be very large. Theoretical results further show that there are families of functions that can be represented efficiently by a deep architecture, but that would require an exponential number of hidden units (with respect to the input size) with insufficient depth (such as depth 2 or depth k − 1).
Deterministic feedforward networks are universal approximators of functions, as discussed in section 6.4.1. Many structured probabilistic models with a single hidden layer of latent variables, including restricted Boltzmann machines and deep belief networks, are likewise universal approximators of probability distributions (Le Roux and Bengio, 2008, 2010; Montúfar and Ay, 2011; Montúfar, 2014; Krause et al.).
In section 6.4.1 we also saw that a sufficiently deep feedforward network can have an exponential advantage over a network that is too shallow. Such results apply to probabilistic models as well. One example is the sum-product network (SPN), which uses polynomial circuits to compute probability distributions over a set of random variables. Delalleau and Bengio (2011) showed that there exist probability distributions for which a minimum depth of SPN is required to avoid needing an exponentially large model. Later, Martens and Medabalimi (2014) showed that there are significant differences between every two finite depths of SPN, and that some of the constraints used to make SPNs tractable may limit their representational power.
Another interesting development concerns the expressive power of families of deep circuits related to convolutional networks: these exhibit an exponential advantage over shallow circuits, even when the shallow circuits are allowed only to approximate the functions computed by their deeper counterparts, as shown in work from 2015. By comparison, previous theoretical work made claims regarding only the case where the shallow circuit must exactly replicate particular functions.
Providing Clues to Discover Underlying Causes
In conclusion, the effectiveness of a representation is determined by its ability to disentangle the underlying causal factors that generate the data, particularly those relevant to our applications. Most representation learning strategies introduce clues that help the learner identify these factors; supervised learning provides a very strong clue, since labels directly specify at least one factor of variation. To make use of abundant unlabeled data, representation learning also exploits implicit prior beliefs imposed by the algorithm's designers to guide the learning process. Regularization strategies of this kind are necessary to obtain good generalization, as the no free lunch theorem makes clear. Although a universally superior regularization strategy is impossible, one goal of deep learning is to find a set of fairly generic strategies applicable to a wide variety of AI tasks, similar to the tasks solved by humans and animals.
Here we provide a list of such generic regularization strategies aimed at guiding learning algorithms toward features that correspond to underlying factors. The list is not exhaustive, but it offers concrete examples of these methods. It was originally introduced in section 3.1 of Bengio et al. (2013d) and has been partially expanded here.
Smoothness is the assumption that f(x + δd) ≈ f(x) for small δ and unit vector d. This assumption allows the learner to generalize from training examples to nearby points in input space. Many machine learning algorithms leverage this idea, but it alone is not enough to overcome the curse of dimensionality.
Many learning algorithms also assume linearity, which enables predictions even very far from the observed data, though this can sometimes lead to overly extreme predictions. Most simple machine learning algorithms that assume linearity do not necessarily assume smoothness; these are in fact distinct assumptions, since linear functions with large weights applied to high-dimensional spaces may not be very smooth. See Goodfellow et al. (2014b) for a further discussion of the limitations of the linearity assumption.
Many representation learning algorithms are motivated by the belief that the data is generated by multiple underlying explanatory factors, and that most tasks become easy to solve once those factors are identified. This view supports semi-supervised learning via representation learning, because learning the structure of p(x) requires learning some of the same features that are useful for modeling p(y | x), since both refer to the same underlying factors. It also encourages the use of distributed representations, with distinct directions in representation space corresponding to distinct factors of variation.
The model can be constructed so that the factors of variation described by the learned representation are treated as the causes of the observed data, rather than the other way around. This is advantageous for semi-supervised learning and makes the model more robust when the distribution over the underlying causes changes or when the model is applied to new tasks.
Depth, or a hierarchical organization of explanatory factors, means that high-level, abstract concepts are defined in terms of simpler ones, forming a structured hierarchy. A deep architecture also expresses the belief that the task should be accomplished via a multi-step program, with each step building on the outputs produced by the preceding steps.
When there are many tasks, corresponding to different variables y_i sharing the same input x, it is assumed that each y_i is associated with a different subset of relevant factors h from a common pool. Because these subsets overlap, learning all of the P(y_i | x) via a shared intermediate representation P(h | x) allows statistical strength to be shared across the tasks.
Manifolds are regions where probability mass concentrates; such regions are locally connected and occupy a tiny volume. For continuous data, these regions can be approximated by low-dimensional manifolds with far fewer dimensions than the original space in which the data live. Many machine learning algorithms behave sensibly only on such manifolds; some, in particular autoencoders, explicitly attempt to learn the structure of the manifold (Goodfellow et al., 2014b).
Natural clustering is the assumption that each connected manifold in input space may be assigned to a single class: the data may lie on many disconnected manifolds, but the class remains constant within each of them. This assumption underlies several learning algorithms, including tangent propagation, double backpropagation, the manifold tangent classifier and adversarial training.
Slow feature analysis and related algorithms rest on the assumption that the most important explanatory factors change slowly over time, which makes it easier to predict these underlying factors than to predict raw observations such as pixel values. See section 13.3 for a more detailed description of this approach.
Sparsity reflects the fact that most features are presumably not relevant to describing most inputs; there is no need for a feature that detects elephant trunks when representing an image of a cat. It is therefore reasonable to impose a prior that any feature that can be interpreted as "present" or "absent" should, most of the time, be absent.
Good high-level representations are also expected to exhibit simple dependencies between factors. The simplest possibility is marginal independence, P(h) = Π_i P(h_i), but linear dependencies or those capturable by a shallow autoencoder are also reasonable assumptions. This can be seen in many physical laws, and it is assumed whenever a linear predictor or a factorized prior is combined with a learned representation.
The concept of representation learning ties together many forms of deep learning: feedforward networks, recurrent networks, autoencoders and deep probabilistic models all learn and exploit representations. Learning the best possible representation remains an exciting and promising avenue of research.
Structured Probabilistic Models for Deep Learning
Deep learning draws upon many modeling formalisms that researchers use to guide their design efforts and describe their algorithms, and structured probabilistic models are one of the most important of these. They were briefly introduced in section 3.14; this chapter describes them in greater depth, since they are a key ingredient of many of the research topics in deep learning discussed in part III. The chapter is designed to be self-contained, so readers can engage with the material without revisiting the earlier section.
The Challenge of Unstructured Modeling
The goal of deep learning is to scale machine learning to the kinds of challenges needed for artificial intelligence, which means being able to understand high-dimensional data with rich structure. For example, we would like AI algorithms to understand natural images, audio waveforms representing speech, and documents containing multiple words and punctuation characters.
Classification algorithms can take such high-dimensional input and summarize it with a categorical label: the object in an image, the word spoken in a recording, the topic of a document. The process of classification discards most of the information in the input and produces a single output, or a probability distribution over that output. A classifier is also often able to ignore many parts of the input; for example, when recognizing an object in a photo, it can usually ignore the background.
Probabilistic models can be used for many other, more complex tasks besides classification. These tasks are often more expensive: some require producing many output values, and most require a complete understanding of the entire structure of the input.
A natural image is one that can be captured by a camera in a typical environment, distinguishing it from synthetic images or web page screenshots.
• Density estimation: given an input x, the machine learning system returns an estimate of the true density p(x) under the data generating distribution. This requires only a single output, but it also demands a complete understanding of the entire input vector; if even one element of the vector is anomalous, the system must assign it a low probability.
• Denoising: given a damaged or incorrectly observed input, the machine learning system returns an estimate of the original or correct version of x. For example, the system might be asked to remove dust or scratches from an old photograph. This requires multiple outputs, one for every element of the cleaned example x, as well as an understanding of the entire input, since even a single damaged area affects the quality of the final estimate.
• Missing value imputation: given the observed elements of x, the model is asked to return estimates of, or a probability distribution over, some or all of the unobserved elements. This requires multiple outputs, and because the model could be asked to restore any of the elements of x, it must understand the entire input.
• Sampling: the model generates new samples from the distribution p(x).
Applications include speech synthesis, in which new waveforms are generated that sound like natural human speech. This requires multiple output values and a good model of the entire input; if the samples have even one element drawn from the wrong distribution, the sampling process is wrong.
For an example of a sampling task using small natural images, see figure 16.1.
Modeling a rich distribution over thousands or millions of random variables is a challenging task, both computationally and statistically. Even the simplest case, modeling only binary variables, is daunting: for a small 32 × 32 pixel color (RGB) image, the number of possible binary images is 2^3072, more than 10^800 times the estimated number of atoms in the universe.
In general, to model a distribution over a random vector x containing n discrete variables, each capable of taking on k values, the naive approach of representing P(x) by storing a lookup table with one probability value per possible outcome requires k^n parameters.
This is not feasible for several reasons:
Figure 16.1 illustrates probabilistic modeling of natural images. The top shows 32 × 32 pixel color images from the CIFAR-10 dataset (Krizhevsky and Hinton, 2009); the bottom shows samples drawn from a structured probabilistic model trained on this dataset, with each sample placed in the grid position of its closest training example in Euclidean distance. This comparison shows that the model truly synthesizes new images rather than memorizing the training data; the contrast of both sets of images has been adjusted for display. Figure reproduced with permission from Courville et al. (2011).
• Memory, the cost of storing the representation: for all but very small values of n and k, representing the distribution as a table will require too many values to store.
• Statistical efficiency: as the number of parameters in a model increases, so does the amount of training data needed to estimate those parameters accurately. Because the table-based model has an astronomically large number of parameters, it would require an astronomically large training set to fit accurately. Without additional assumptions tying the different entries of the table together (as in back-off or smoothed n-gram models), any such model will overfit the training set very badly.
• Runtime, the cost of inference: suppose we want to perform an inference task, such as computing the marginal distribution P(x1) or the conditional distribution P(x2 | x1). Computing these distributions requires summing across the entire joint distribution P(x), so the runtime of such operations is as intractably large as the memory cost of storing the model.
• Runtime, the cost of sampling: suppose we want to draw a sample from the model. A naive way to do this is to sample some value u ~ U(0, 1), then iterate through the probability table, adding up the entries until the cumulative sum exceeds u and returning the corresponding outcome. In the worst case this requires reading through the entire table, so it has the same exponential cost as the other operations.
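A minimal sketch of this naive procedure for a small number of variables; the joint table here is random and only serves to show that the scan is linear in the 2^n table entries.

```python
import numpy as np

rng = np.random.default_rng(0)

# A full joint table over n binary variables has 2**n entries; sampling from it
# naively means scanning entries until the cumulative probability exceeds u.
n = 16
p = rng.random(2 ** n)
p /= p.sum()                          # normalized joint table p(x)

def naive_sample(p):
    u = rng.random()                  # u ~ U(0, 1)
    cumulative = 0.0
    for index, prob in enumerate(p):  # worst case: scan the whole table
        cumulative += prob
        if cumulative > u:
            return index              # index encodes one joint configuration of x
    return len(p) - 1

print(naive_sample(p))
```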
The table-based approach models every possible kind of interaction among every possible subset of variables, but the probability distributions we encounter in real tasks are much simpler than this. In practice, most variables influence one another only indirectly rather than through direct interactions.
Using Graphs to Describe Model Structure
Structured probabilistic models use graphs (in the graph theory sense of "nodes" or "vertices" connected by edges) to represent interactions between random variables. Each node represents a random variable, and each edge represents a direct interaction. These direct interactions imply other, indirect interactions, but only the direct interactions need to be explicitly modeled.
There is more than one way to use a graph to describe the interactions in a probability distribution; graphical models can be largely divided into two categories, models based on directed acyclic graphs and models based on undirected graphs. In the following sections we describe some of the most popular and useful of these approaches.
One kind of structured probabilistic model is the directed graphical model, also known as the belief network or Bayesian network. These models are called "directed" because their edges are directed, pointing from one variable to another and illustrating the relationships between them.
Judea Pearl proposed the term "Bayesian network" to emphasize the judgmental aspect of the computed values, indicating that these values typically reflect degrees of belief rather than mere event frequencies.
In a directed graphical model, an arrow from one variable to another indicates that the distribution over the second variable is defined in terms of a conditional distribution, with the first variable appearing on the right side of the conditioning bar. In the relay race example, Alice's finishing time influences Bob's finishing time, because Bob does not begin running until Alice finishes, and Carol's finishing time in turn depends on Bob's, since Carol starts only after Bob finishes.
Continuing with the relay race example from section 16.1, suppose we name Alice's finishing time t0, Bob's finishing time t1, and Carol's finishing time t2.
Our estimate of t1 depends on t0, and t2 depends directly on t1 but only indirectly on t0. We can draw this relationship in a directed graphical model, illustrated in figure 16.2.
Formally, a directed graphical model defined on variables x consists of a directed acyclic graph G whose vertices are the random variables in the model, together with a set of local conditional probability distributions p(x_i | Pa_G(x_i)), where Pa_G(x_i) gives the parents of x_i in G. The probability distribution over x is given by

p(x) = Π_i p(x_i | Pa_G(x_i)).     (16.1)
In our relay race example, this means that, using the graph drawn in figure 16.2,

p(t0, t1, t2) = p(t0) p(t1 | t0) p(t2 | t1).     (16.2)
This is our first encounter with a structured probabilistic model in action. By examining the cost of using it, we can see concretely the many advantages structured modeling offers over unstructured modeling.
Suppose we represent time by discretizing the range from minute 0 to minute 10 into 6-second chunks, so that t0, t1 and t2 each have 100 possible values. If we attempted to represent p(t0, t1, t2) with a single table, it would need to store 999,999 values: 100 × 100 × 100 = 1,000,000 entries, minus one because the probabilities are constrained to sum to one.
If instead we make one table for p(t0) and separate tables for the conditional distributions p(t1 | t0) and p(t2 | t1), then we need 99 values for p(t0) and 100 × 99 = 9,900 values for each conditional table, or 19,899 values in total. The directed graphical model thus reduces the number of parameters by more than a factor of 50, highlighting its effectiveness in simplifying complex probability structures.
In general, modeling n discrete variables each with k values costs O(k^n) with the single-table approach, whereas a directed graphical model costs O(k^m), where m is the maximum number of variables appearing in a single conditional probability distribution. As long as the model can be designed so that m is much smaller than n, the savings are enormous.
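The relay race numbers above can be checked with a few lines of arithmetic:

```python
# Parameter counts for the relay race example with k = 100 discretized times.
k = 100

single_table = k ** 3 - 1                # one entry per (t0, t1, t2), minus one
p_t0 = k - 1                             # p(t0)
p_t1_given_t0 = k * (k - 1)              # one (k - 1)-entry table per value of t0
p_t2_given_t1 = k * (k - 1)              # likewise for p(t2 | t1)
factorized = p_t0 + p_t1_given_t0 + p_t2_given_t1

print(single_table)                      # 999999
print(factorized)                        # 19899
print(single_table / factorized)         # roughly a 50x reduction
```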
In other words, as long as each variable has few parents in the graph, the distribution can be represented with very few parameters. Some additional restrictions on the graph structure, such as requiring it to be a tree, can also guarantee that operations like computing marginal or conditional distributions over subsets of variables are efficient.
It is important to realize what information a graph does and does not encode: the graph encodes only simplifying assumptions about which variables are conditionally independent of one another. Other kinds of simplifying assumptions are also possible. For example, suppose Bob always runs in the same amount of time regardless of how Alice performed; then our model of t1 needs only O(k) parameters rather than O(k^2). However, t0 and t1 remain directly dependent, because t1 represents the absolute time at which Bob finishes, not the duration of his run, so the arrow from t0 to t1 must stay in the graph. The assumption that Bob's running time is independent of everything else cannot be encoded in the graph itself; instead it is encoded in the definition of the conditional distribution, which becomes a formula with k − 1 parameters rather than a full table indexed by both t0 and t1. The directed graphical model syntax places no constraint on how the conditional distributions are defined; it only requires them to use the right set of variables as arguments.
Directed graphical models give us one language for describing structured probabilistic models. Another popular language is that of undirected models, also known as Markov random fields (MRFs) or Markov networks. As their name implies, undirected models use graphs whose edges are undirected, offering an alternative representation of probabilistic relationships.
Directed models are best suited to scenarios where causality is clearly defined and flows in a single direction. The relay race is one such scenario: the performance of earlier runners directly affects the finishing times of later runners, while later runners do not influence the results of earlier ones.
Sampling from Graphical Models
Graphical models also facilitate the task of drawing samples from a model.
One advantage of directed graphical models is that a simple and efficient procedure called ancestral sampling can produce a sample from the joint distribution represented by the model.
The basic idea is to sort the variables in the graph into a topological ordering, so that if x_i is a parent of x_j, then j is greater than i. The variables can then be sampled in this order: we first sample x_1 from P(x_1), then from P(x_2 | Pa_G(x_2)), and continue in this way until we sample from P(x_n | Pa_G(x_n)).
As long as each conditional distribution p(x_i | Pa_G(x_i)) is easy to sample from, the whole model is easy to sample from. The topological sorting guarantees that we read through and sample the conditional distributions in an order that never requires sampling a variable before its parents have been sampled.
For some graphs, more than one topological ordering is possible Ancestral sampling may be used with any of these topological orderings.
Ancestral sampling is generally very fast (assuming sampling from each condi- tional is easy) and convenient.
Ancestral sampling has two drawbacks: it applies only to directed graphical models, and it does not support every conditional sampling operation. When we wish to sample from a subset of the variables in a directed graphical model, given some other variables, we often require that all of the conditioning variables come earlier than the variables to be sampled in the ordered graph.
In that case, we can sample directly from the local conditional probability distributions specified by the model. Otherwise, the conditional distributions we need to sample from are the posterior distributions given the observed variables; these posteriors are usually not explicitly specified or parameterized in the model, and inferring them can be costly. In models where this is the case, ancestral sampling is no longer efficient.
Moreover, ancestral sampling applies only to directed models. Undirected models can be sampled by first converting them into directed models, but this often requires solving intractable inference problems or introducing so many edges that the resulting directed model becomes intractable. Sampling from an undirected model without converting it requires resolving cyclical dependencies: every variable interacts with every other variable, so there is no clear beginning point for the sampling process. Sampling is therefore an expensive, multi-pass process. The conceptually simplest approach is Gibbs sampling, in which we iterate over the variables and sample each one conditioned on the current values of all the others. Unfortunately, after one complete pass through the model, the samples still do not come from the correct distribution, so we must repeatedly resample all of the variables using their neighbors' updated values until the process converges. It can be difficult to determine when the samples have become a sufficiently accurate approximation of the desired distribution; advanced techniques for sampling from undirected models are covered in chapter 17.
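A sketch of Gibbs sampling for a small, hypothetical pairwise undirected model over binary units (the energy function and parameter values below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# A small pairwise undirected model over binary units x_i in {0, 1}:
# p(x) proportional to exp(x^T J x + b^T x), with symmetric couplings J.
J = rng.normal(scale=0.5, size=(n, n))
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)
b = rng.normal(scale=0.5, size=n)

def gibbs_sweep(x):
    # Resample each variable in turn, conditioned on the current values
    # of all the others.
    for i in range(n):
        local_field = b[i] + 2.0 * J[i] @ x   # log-odds of x_i = 1 vs x_i = 0
        p_one = 1.0 / (1.0 + np.exp(-local_field))
        x[i] = rng.random() < p_one
    return x

x = rng.integers(0, 2, size=n).astype(float)
for _ in range(1000):          # many sweeps are needed before the samples are
    x = gibbs_sweep(x)         # representative of the model's distribution
print(x)
```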
Advantages of Structured Modeling
The primary advantage of structured probabilistic models is that they dramatically reduce the cost of representing probability distributions, as well as of learning and inference. Sampling is also accelerated in the case of directed models, though for undirected models the situation can be complicated. The mechanism that allows all of these operations to use less runtime and memory is the deliberate choice to leave some interactions out. Graphical models convey information by leaving edges out: wherever an edge is absent, there is no direct interaction that needs to be modeled.
A less quantifiable but equally important benefit of structured probabilistic models is that they explicitly separate the representation of knowledge from the learning of knowledge and from inference. This separation makes models easier to develop and debug: we can design, analyze and evaluate learning and inference algorithms that apply to broad classes of graphs, and independently design models that capture the important relationships in our data. We can then mix and match these different algorithms and structures, obtaining a combinatorial number of possibilities, far more easily than we could design end-to-end algorithms for every possible scenario.
Learning about Dependencies
A good generative model needs to accurately capture the distribution over the observed, or "visible," variables v, whose elements are often highly dependent on one another. In deep learning, the approach most commonly used to model these dependencies is to introduce several latent or "hidden" variables, h. The model can then capture dependencies between any pair of variables v_i and v_j indirectly, via direct dependencies between v_i and h, and direct dependencies between h and v_j.
A good model of v that did not contain any latent variables would need a very large number of parents per node in a Bayesian network, or very large cliques in a Markov network. Representing these higher-order interactions is costly, both computationally (the number of parameters that must be stored in memory scales exponentially with the size of such groups) and statistically, because this enormous number of parameters requires an enormous amount of data to estimate accurately.
When a model is intended to capture dependencies between visible variables with direct connections, it is usually infeasible to connect all variables, so the graph must be designed to connect those variables that are tightly coupled and to omit edges between the others. An entire field of machine learning called structure learning is devoted to this problem (Koller and Friedman, 2009). Most structure learning techniques are a form of greedy search: a structure is proposed, a model with that structure is trained and given a score that rewards training set accuracy and penalizes model complexity, and candidate structures with a small number of edges added or removed are then proposed as the next step, in the hope of improving the score.
Using latent variables instead of adaptive structure avoids the need for discrete searches and multiple rounds of training. A fixed structure over visible and hidden variables can use direct interactions between visible and hidden units to impose indirect interactions between visible units. Using simple parameter learning techniques, we can then learn a model with a fixed structure that imputes the right structure on the marginal p(v).
Latent variables have advantages beyond their role in efficiently capturing p(v).
The latent variables h also provide an alternative representation of v. As discussed in section 3.9.6, the mixture of Gaussians model learns a latent variable that indicates which category the input example was drawn from, and this latent variable can therefore be used for classification. In chapter 14 we saw how simple probabilistic models like sparse coding learn latent variables that can be used as input features for a classifier or as coordinates along a manifold. Other models can be used in the same way, but deeper models and models with different kinds of interactions can create even richer descriptions of the input. Many approaches accomplish feature learning by learning latent variables; often, E[h | v] or argmax_h p(h | v) turns out to be a good feature mapping for v.
Inference and Approximate Inference
One of the main ways we use a probabilistic model is to ask how variables are related to each other. Given a set of medical tests, for example, we can ask what disease a patient might have. In a latent variable model, we might want to extract features describing the observed variables v. Sometimes we need to solve such problems in order to perform other tasks; for instance, we often train our models using the principle of maximum likelihood.
Because of this, we frequently want to compute quantities such as p(h | v) in order to implement a learning rule. These are all examples of inference problems, in which we must predict the value of some variables given other variables, or predict the probability distribution over some variables given the values of others.
Unfortunately, most deep models pose intractable inference problems, even when structured graphical models are used to simplify them. The graph structure allows us to represent complicated, high-dimensional distributions with a reasonable number of parameters, but the graphs used in deep learning are usually not restrictive enough to also allow efficient inference.
Computing the marginal probability of a general graphical model is #P hard. The complexity class #P is a generalization of the complexity class NP: problems in NP require determining only whether a solution exists, whereas problems in #P require counting the total number of solutions. To construct a worst-case graphical model, imagine defining a graphical model over the binary variables in a 3-SAT problem: we impose a uniform distribution over these variables and then add one binary latent variable per clause indicating whether that clause is satisfied.
We can introduce an additional latent variable to indicate if all clauses are satisfied, utilizing a reduction tree of latent variables that avoids creating a large clique Each node in this tree reflects the satisfaction status of two other variables, with the leaves representing the variables for each clause The root of the tree indicates whether the overall problem is satisfied, and the uniform distribution over the literals allows the marginal distribution at the root to reveal the proportion of assignments that meet the problem's criteria Although this serves as a theoretical worst-case scenario, NP-hard graphs frequently occur in practical applications.
Approximate inference is essential in deep learning, particularly through variational inference, where we aim to approximate the true distribution p(h | v) by finding a distribution q(h | v) that closely resembles it This topic, along with other related techniques, is thoroughly explored in chapter 19.
16.7 The Deep Learning Approach to Structured Probabilistic Models
Deep learning practitioners generally use the same basic computational tools as other machine learning practitioners who work with structured probabilistic models. However, in deep learning these tools are combined in different ways, resulting in algorithms and models that have a very different flavor from traditional graphical models.

Deep learning does not always involve especially deep graphical models; instead, the depth of a model can be defined in terms of its graphical representation. A latent variable h_i is at depth j if the shortest path from h_i to an observed variable is j steps, and the depth of the model is the greatest such depth of any latent variable. This kind of depth is different from depth defined by the computational graph. Many generative models used in deep learning have few or no latent variables, but use deep computational graphs to define the conditional distributions within the model.
Deep learning fundamentally relies on the idea of distributed representations; even the shallow models used in deep learning typically have a single large layer of latent variables. These models usually have more latent variables than observed variables, and complex nonlinear interactions between variables are accomplished via indirect connections that pass through multiple latent variables.

Traditional graphical models, by contrast, contain mostly variables that are at least occasionally observed, even if some variables are randomly missing from particular training examples. They typically use higher-order terms and structure learning to capture complicated nonlinear interactions between variables, and when latent variables are present, they are usually few in number.

Latent variables in deep learning are not designed with any specific semantics in mind; the training algorithm is free to invent whatever concepts it needs to model a particular dataset, which often makes the latent variables difficult for a human to interpret, though visualization techniques can offer some insight. In contrast, traditional graphical models often assign latent variables specific meanings in advance, such as a document's topic or a patient's disease, making them easier to interpret and giving them more theoretical guarantees. However, such models typically struggle to scale to complex problems and are not reusable in as many contexts as deep models.
Deep learning approaches also differ from traditional graphical models in connectivity: deep models have large groups of units densely connected to other groups, so that the interactions between two groups can be described by a single matrix. Traditional graphical models have very few connections, and the choice of connections for each variable may be individually designed so that exact inference remains tractable. When exact inference is impractical, traditional models often rely on loopy belief propagation, which works well with sparsely connected graphs. Deep models, however, connect each visible unit to very many hidden units in order to obtain a distributed representation, and this density makes traditional inference techniques unsuitable. Consequently, loopy belief propagation is almost never used for deep learning, which instead favors algorithms such as Gibbs sampling or variational inference that remain efficient with dense connectivity. The presence of many latent variables also makes efficient numerical implementation essential, which further motivates organizing the units into layers so that the interaction between two layers can be implemented with efficient matrix operations, or with specialized techniques such as block diagonal matrix products or convolutions.
Finally, the deep learning approach to graphical modeling is characterized by a marked tolerance for the unknown: rather than simplifying a model until exact computations become feasible, we increase its power and accept that not everything can be computed exactly. We often use models whose marginal distributions cannot be computed and are satisfied simply with drawing approximate samples from them. We may train models with intractable objective functions that we cannot even approximate efficiently, provided that we can efficiently obtain an approximate estimate of the gradient of such a function. In short, the deep learning strategy is to identify the minimum amount of information we absolutely need and then to obtain a reasonable approximation of that information as quickly as possible.
16.7.1 Example: The Restricted Boltzmann Machine
The restricted Boltzmann machine (RBM), introduced by Smolensky in 1986, is the quintessential example of how graphical models are used for deep learning. The RBM is not itself a deep model; it has a single layer of latent variables that may be used to learn a representation for the input. Here we highlight the characteristics of the RBM that are shared by many deep graphical models: its units are organized into layers, the connectivity between layers is dense and described by a matrix, and the model is designed to allow efficient Gibbs sampling. Moreover, the design of the model frees the training algorithm to learn latent variables whose semantics were not specified in advance. RBMs are explored in more detail in section 20.2.
The canonical RBM is an energy-based model with binary visible and hidden units. Its energy function is

E(v, h) = −b⊤v − c⊤h − v⊤W h,  (16.10)

where b, c and W are learnable parameters. The model is divided into two groups of units, the visible units v and the hidden units h, whose interaction is described by the matrix W. As shown in figure 16.14, there are no direct interactions between visible units or between hidden units; this restriction is what distinguishes the "restricted" Boltzmann machine from general Boltzmann machines, which may have arbitrary connections.

The restrictions on the RBM structure yield the nice properties

p(h | v) = ∏_i p(h_i | v)  (16.11)

and

p(v | h) = ∏_i p(v_i | h).  (16.12)

Figure 16.14: An RBM drawn as a Markov network.

The individual conditionals are simple to compute as well. For the binary RBM we obtain

p(h_i = 1 | v) = σ(c_i + v⊤W_{:,i}),
p(h_i = 0 | v) = 1 − σ(c_i + v⊤W_{:,i}).
Together these properties allow for efficient block Gibbs sampling, which alternates between sampling all of h simultaneously and sampling all of v simultaneously. Samples generated by Gibbs sampling from an RBM model are shown in figure 16.15.
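As a concrete illustration of this procedure, the following is a minimal NumPy sketch of block Gibbs sampling for a binary RBM with the energy function above. The function and variable names (block_gibbs_step, n_v, n_h) and the toy random parameters are illustrative assumptions, not part of any reference implementation.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def block_gibbs_step(v, W, b, c, rng):
    # One block Gibbs step for a binary RBM with E(v, h) = -b^T v - c^T h - v^T W h.
    # Sample all hidden units at once: p(h_i = 1 | v) = sigmoid(c_i + v^T W_{:,i}).
    p_h = sigmoid(c + v @ W)
    h = (rng.random(p_h.shape) < p_h).astype(np.float64)
    # Sample all visible units at once: p(v_j = 1 | h) = sigmoid(b_j + W_{j,:} h).
    p_v = sigmoid(b + h @ W.T)
    v_new = (rng.random(p_v.shape) < p_v).astype(np.float64)
    return v_new, h

# Toy example: 6 visible units, 4 hidden units, random parameters.
n_v, n_h = 6, 4
W = 0.1 * rng.standard_normal((n_v, n_h))
b = np.zeros(n_v)
c = np.zeros(n_h)

v = rng.integers(0, 2, size=n_v).astype(np.float64)
for _ in range(100):   # run the chain for a while (burn in)
    v, h = block_gibbs_step(v, W, b, c, rng)
print(v)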
Since the energy function itself is just a linear function of the parameters, it is easy to take its derivatives. For example,

∂E(v, h) / ∂W_{i,j} = −v_i h_j.
These two properties, efficient Gibbs sampling and efficient derivatives, make training undirected models convenient. In chapter 18, we will see how such derivatives can be computed on samples from the model to carry out training.
Training the model induces a representation h of the data v. We can often use E_{h∼p(h|v)}[h] as a set of features to describe v.
The RBM exemplifies the standard deep learning methodology applied to graphical models, utilizing representation learning through multiple layers of latent variables, while facilitating efficient interactions between these layers, which are parameterized by matrices.
Graphical models offer a clear and flexible framework for articulating probabilistic models In the upcoming chapters, we will utilize this language, along with various perspectives, to explore a diverse range of deep probabilistic models.
Figure 16.15 shows samples from a trained RBM, drawn by Gibbs sampling on the MNIST dataset. Each column corresponds to a separate Gibbs sampling process, and each row shows the output after another 1,000 sampling steps; successive samples are highly correlated with one another. The accompanying weight vectors show that the RBM's prior distribution p(h) is not constrained to be factorial, so it can learn which features tend to co-occur during sampling. In contrast, the RBM's posterior distribution p(h | v) is factorial, while the posterior of the sparse coding model is not, which may make sparse coding more effective for feature extraction. Other models are able to have both a non-factorial prior p(h) and a non-factorial posterior p(h | v).
Randomized algorithms fall into two rough categories: Las Vegas algorithms and Monte Carlo algorithms. Las Vegas algorithms always return a precisely correct answer (or report that they failed) but may require a random amount of resources, such as time or memory. Monte Carlo algorithms, in contrast, return answers with a random amount of error, which can typically be reduced by expending more resources, such as running time and memory. For any fixed computational budget, a Monte Carlo algorithm can provide an approximate answer.
Sampling and Monte Carlo Methods
Key technologies in machine learning rely on sampling from probability distributions to generate Monte Carlo estimates for various desired quantities.
There are many reasons we may wish to draw samples from a probability distribution. Sampling provides a flexible and inexpensive way to approximate many sums and integrals. Sometimes we use it to speed up an otherwise costly but tractable computation, as when we subsample the full training cost with minibatches. In other cases, our learning algorithm requires us to approximate an intractable sum or integral, such as the gradient of the log partition function of an undirected model. In many other cases, sampling is actually our goal: we want to train a model that can sample from the training distribution.
17.1.2 Basics of Monte Carlo Sampling
When a sum or an integral cannot be computed exactly, for example when it has an exponential number of terms and no known exact simplification, it is often possible to approximate it using Monte Carlo sampling. The idea is to view the sum or integral as an expectation under some distribution and to approximate the expectation by a corresponding average. Specifically, we write the quantity to estimate as

s = Σ_x p(x) f(x) = E_p[f(x)]

for a sum, or

s = ∫ p(x) f(x) dx = E_p[f(x)]

for an integral, where p is a probability distribution (for the sum) or a probability density (for the integral) over the random variable x.

We can approximate s by drawing n samples x^(1), ..., x^(n) from p and then forming the empirical average

ŝ_n = (1/n) Σ_{i=1}^n f(x^(i)).
This approximation is justified by a few different properties. The first trivial observation is that the estimator ŝ_n is unbiased, since

E[ŝ_n] = (1/n) Σ_{i=1}^n E[f(x^(i))] = (1/n) Σ_{i=1}^n s = s.
The law of large numbers states that for independent and identically distributed samples \( x_i \), the average converges almost surely to the expected value as the sample size approaches infinity, expressed as \( \lim_{n \to \infty} \hat{s}_n = s \), given that the variance of the individual terms, \( \text{Var}[f(x_i)] \), is bounded As the sample size \( n \) increases, the variance \( \text{Var}[\hat{s}_n] \) decreases and approaches zero, provided that \( \text{Var}[f(x_i)] < \infty \).
This suggests how to estimate the uncertainty in a Monte Carlo average: compute the empirical average and the empirical variance of the values f(x^(i)), then divide the estimated variance by the number of samples n to obtain an estimator of Var[ŝ_n] (the unbiased estimator of the variance divides the sum of squared differences by n − 1 rather than by n). The central limit theorem tells us that the distribution of the average converges to a normal distribution, which allows us to estimate confidence intervals around ŝ_n using the cumulative distribution of the normal density. All of this relies on our ability to easily sample from the base distribution p(x), which is not always feasible. When sampling from p is not tractable, alternatives include importance sampling, discussed below, and Markov chain Monte Carlo methods, which form a general family of approaches for approximately converging towards the desired distribution.
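The following short sketch illustrates this recipe numerically: a Monte Carlo estimate of an expectation together with a central-limit-theorem confidence interval. The choice of p, f, and the sample size are arbitrary assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Estimate s = E_{x~p}[f(x)] with p = Uniform(0, 1) and f(x) = x**2 (true value 1/3).
def f(x):
    return x ** 2

n = 10_000
x = rng.random(n)                  # samples from p
fx = f(x)

s_hat = fx.mean()                  # Monte Carlo estimate of s
var_hat = fx.var(ddof=1) / n       # estimated variance of s_hat (unbiased, divides by n - 1)
half_width = 1.96 * np.sqrt(var_hat)   # 95% confidence interval from the central limit theorem
print(f"estimate {s_hat:.4f} +/- {half_width:.4f}  (true value {1/3:.4f})")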
Importance Sampling
An important step in the Monte Carlo method is deciding which part of the integrand should play the role of the probability p(x) and which part should play the role of the quantity f(x) whose expected value (under that probability distribution) is to be estimated. This decomposition is not unique, because p(x)f(x) can always be rewritten as

p(x) f(x) = q(x) (p(x) f(x)) / q(x),  (17.8)

in which case we sample from q and average p f / q. In many cases the problem specifies a given p and f, and it is natural to use that decomposition.

However, the original specification of the problem is not necessarily the optimal choice in terms of the number of samples required to obtain a given level of accuracy. Fortunately, the optimal choice q* can be derived easily; it corresponds to what is called optimal importance sampling.
Because of the identity shown in equation 17.8, any Monte Carlo estimator

ŝ_p = (1/n) Σ_{i=1, x^(i)∼p}^{n} f(x^(i))  (17.9)

can be transformed into an importance sampling estimator

ŝ_q = (1/n) Σ_{i=1, x^(i)∼q}^{n} p(x^(i)) f(x^(i)) / q(x^(i)).  (17.10)

We see readily that the expected value of the estimator does not depend on q:

E_q[ŝ_q] = E_q[ŝ_p] = s.  (17.11)

However, the variance of an importance sampling estimator can be greatly sensitive to the choice of q. The variance is given by

Var[ŝ_q] = Var[p(x) f(x) / q(x)] / n.  (17.12)

The minimum variance occurs when q is

q*(x) = p(x) |f(x)| / Z,  (17.13)

where Z is the normalization constant, chosen so that q*(x) sums or integrates to 1 as appropriate.
Better importance sampling distributions put more weight where the integrand is larger. In fact, when f(x) does not change sign, the variance of the estimator under q* is zero, meaning that a single sample is sufficient when the optimal distribution is used.
The computation of q ∗ effectively resolves the original problem; however, relying on this method to draw a single sample from the optimal distribution is often impractical.
Any choice of sampling distribution q is valid, in the sense of yielding the correct expected value, and q* is the optimal one, yielding minimum variance. Sampling from q* is usually infeasible, but other choices of q can be feasible while still reducing the variance somewhat.
Another approach, biased importance sampling, has the advantage of not requiring normalized distributions p or q. For discrete variables, the biased importance sampling estimator is given by

ŝ_BIS = [ Σ_{i=1}^n (p̃(x^(i)) / q̃(x^(i))) f(x^(i)) ] / [ Σ_{i=1}^n p̃(x^(i)) / q̃(x^(i)) ],  (17.14)

where p̃ and q̃ are the unnormalized forms of p and q, and the x^(i) are samples drawn from q.
The estimator represented by \( \hat{s}_{BIS} \) is biased, as the expected value \( E[\hat{s}_{BIS}] \) does not equal \( s \), except in the asymptotic case when \( n \) approaches infinity, causing the denominator in equation 17.14 to converge to 1 Therefore, this estimator is classified as asymptotically unbiased.
Choosing an appropriate proposal distribution q is crucial for the efficiency of Monte Carlo estimation, and a poor choice can degrade it severely. Examining the ratio p(x)f(x)/q(x) makes it clear that if q(x) is very small where p(x) and f(x) are substantial, the variance of the estimator can blow up. Typically q is chosen to be a simple distribution that is easy to sample from, but in high-dimensional spaces such a q rarely matches p or p|f| well. When q(x^(i)) ≫ p(x^(i))|f(x^(i))|, importance sampling collects nearly useless samples that contribute almost nothing to the sum, while the rarer samples for which q(x^(i)) ≪ p(x^(i))|f(x^(i))| yield ratios that are disproportionately large. The result is typically an underestimate of the overall sum, occasionally compensated by a gross overestimate; these difficulties reflect the vast dynamic range of joint probabilities in high dimensions.
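The following is a minimal sketch of both estimators on a toy rare-event problem, where a proposal concentrated near the rare region drastically reduces variance compared with naive sampling from p. The choice of target, proposal, and sample size are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Target: s = E_{x~p}[f(x)] with p = N(0, 1) and f(x) = 1{x > 3}, a rare-event
# probability (true value about 1.35e-3). Proposal q = N(4, 1) covers the rare region.
def f(x):
    return (x > 3.0).astype(np.float64)

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

n = 100_000
x_q = rng.normal(4.0, 1.0, size=n)               # samples from q
log_w = log_normal_pdf(x_q, 0.0, 1.0) - log_normal_pdf(x_q, 4.0, 1.0)
w = np.exp(log_w)                                 # importance weights p(x)/q(x)

s_is = np.mean(w * f(x_q))                        # unbiased importance sampling (eq. 17.10)
s_bis = np.sum(w * f(x_q)) / np.sum(w)            # biased, self-normalized estimator (eq. 17.14)
s_naive = np.mean(f(rng.normal(0.0, 1.0, size=n)))  # naive Monte Carlo from p
print(s_is, s_bis, s_naive)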
Importance sampling and its variants play a crucial role in many machine learning algorithms, including several deep learning algorithms. It is used to accelerate the training of neural language models with large vocabularies and of other neural networks with large numbers of outputs. It is also used to estimate the partition function (the normalization constant of a probability distribution) and to estimate log-likelihoods in deep directed models such as the variational autoencoder. Importance sampling can furthermore be used to improve the estimate of the gradient of the cost function in stochastic gradient descent, particularly for classifiers in which most of the cost comes from a small number of misclassified examples; sampling such difficult examples more frequently can reduce the variance of the gradient.
Markov Chain Monte Carlo Methods
In many cases, we wish to use a Monte Carlo technique but there is no tractable method for drawing exact samples from the distribution p_model(x) or from a good (low-variance) importance sampling distribution q(x). In deep learning, this most often happens when p_model(x) is represented by an undirected model. In these cases, we introduce a mathematical tool called a Markov chain to approximately sample from p_model(x). The family of algorithms that use Markov chains to perform Monte Carlo estimates is called Markov chain Monte Carlo (MCMC) methods. MCMC techniques are described at more length in Koller and Friedman (2009), but their most standard, generic guarantees apply only when the model does not assign zero probability to any state. It is therefore most convenient to present these techniques as sampling from an energy-based model (EBM) p(x) ∝ exp(−E(x)), in which every state is guaranteed to have nonzero probability. MCMC methods can in fact be used with many other probability distributions, including some that contain zero-probability states, but the theoretical guarantees concerning their behavior must then be proven on a case-by-case basis for different families of distributions. In deep learning, it is most common to rely on the general guarantees that apply to all energy-based models.
To see why drawing samples from an energy-based model is difficult, consider an EBM over just two variables: to sample one variable we must condition on the other, and vice versa, which creates an intractable chicken-and-egg problem. Directed models avoid this because their graph is directed and acyclic: ancestral sampling simply visits the variables in topological order, conditioning each on its parents, which are guaranteed to have been sampled already, so a sample is obtained in a single efficient pass.

In an energy-based model (EBM), the chicken-and-egg problem can be avoided by sampling with a Markov chain. The core idea of a Markov chain is to begin with an arbitrary state x and repeatedly update it in a random way; eventually x becomes (very nearly) a fair sample from p(x). Formally, a Markov chain is defined by a random state x and a transition distribution T(x′ | x), specifying the probability that a random update will move to state x′ when starting from state x. Running the Markov chain means repeatedly updating the state x to a value x′ sampled from T(x′ | x).
To understand the theoretical underpinnings of MCMC methods, it helps to reparametrize the problem and consider first the case in which the random variable has countably many states. We can then represent the state simply as a positive integer x, with each integer value mapping back to a state in the original problem.

Consider what happens when we run infinitely many Markov chains in parallel. The states of the different chains are drawn from some distribution q^(t)(x), where t indicates the number of time steps that have elapsed. At t = 0, q^(0) is the distribution used to initialize each chain; later, q^(t) is influenced by all the Markov chain steps run so far. Our goal is for q^(t)(x) to converge to p(x).
Because we have reparametrized the problem in terms of a positive integer x, we can describe the probability distribution q using a vector v, with

q(x = i) = v_i.  (17.17)

Consider what happens when we update a single Markov chain's state x to a new state x′. The probability of a single state landing in state x′ is given by

q^(t+1)(x′) = Σ_x q^(t)(x) T(x′ | x).  (17.18)

Using our integer parametrization, we can represent the effect of the transition operator T with a matrix A, defined so that

A_{i,j} = T(x′ = i | x = j).

Using this definition, we can rewrite equation 17.18 to describe how the entire distribution over many parallel Markov chains evolves in a single step. Rather than writing the update in terms of q and T, we write it in terms of v and A:

v^(t) = A v^(t−1).

Applying the Markov chain update repeatedly corresponds to multiplying by the matrix A repeatedly. In other words, we can think of the process as exponentiating the matrix A:

v^(t) = A^t v^(0).  (17.21)
The matrix A has special structure because each of its columns represents a probability distribution; such matrices are called stochastic matrices. If there is a nonzero probability of transitioning from any state x to any other state x′ for some power t, then the Perron-Frobenius theorem guarantees that the largest eigenvalue is real and equal to 1. Over time, we can think of all the eigenvalues as being exponentiated:

v^(t) = (V diag(λ) V^(−1))^t v^(0) = V diag(λ)^t V^(−1) v^(0).

This process causes all the eigenvalues that are not equal to 1 to decay to zero. Under some additional mild conditions, A is guaranteed to have only one eigenvector with eigenvalue 1. The process therefore converges to a stationary distribution, sometimes also called the equilibrium distribution. At convergence,

v′ = A v = v,

and v is an eigenvector of A with eigenvalue 1. This condition guarantees that once the stationary distribution is reached, applying the transition sampling procedure repeatedly does not change the distribution over the states of the various Markov chains, even though the transition operator does change each individual state.
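The following is a small numerical sketch of this view: a hand-picked stochastic matrix A (an illustrative assumption, with columns A[:, j] = T(x′ | x = j)), whose stationary distribution is obtained both by iterating v^(t) = A v^(t−1) and by finding the eigenvector with eigenvalue 1.

import numpy as np

# A small stochastic matrix A: A[i, j] = T(x' = i | x = j), so each column sums to 1.
A = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.30],
              [0.05, 0.10, 0.60]])

v = np.array([1.0, 0.0, 0.0])     # initial distribution q^(0), all mass on state 0

# Repeatedly applying v^(t) = A v^(t-1) converges to the stationary distribution.
for _ in range(200):
    v = A @ v
print("by iteration:", v)

# The same distribution is the eigenvector of A with eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(A)
i = np.argmin(np.abs(eigvals - 1.0))
stationary = np.real(eigvecs[:, i])
stationary /= stationary.sum()
print("by eigendecomposition:", stationary)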
If we have chosen T correctly, then the stationary distribution q will be equal to the distribution p we wish to sample from. We will describe how to choose T shortly, in section 17.4.
Markov chains with countable states can be generalized to continuous variables. Some authors call such a generalized chain a Harris chain, but we use the term Markov chain to describe both cases.

In general, a Markov chain with transition operator T will, under mild conditions, converge to a fixed point described by the equation

q′(x′) = E_{x∼q} T(x′ | x),

which in the discrete case is a sum and in the continuous case an integral.
Whether the state is continuous or discrete, all Markov chain methods consist of repeatedly applying stochastic updates until the state begins to yield samples from the equilibrium distribution. Running the chain until it reaches this point is known as burning it in. Once equilibrium is reached, an infinite sequence of samples may be drawn, but successive samples are highly correlated, so a finite sequence may not be very representative of the equilibrium distribution. One way to mitigate this problem is to return only every n-th sample, reducing the bias introduced by the correlation between successive samples. Markov chains are thus expensive to use because of the time required to burn in and the time required to obtain decorrelated samples afterward. If truly independent samples are desired, one can run multiple Markov chains in parallel, using extra computation to reduce latency. Using a single chain to produce all samples and using one chain per desired sample are the two extremes; deep learning practitioners usually use a number of chains comparable to the minibatch size, with 100 chains being a common choice.

Another difficulty is that we do not know in advance how many steps a Markov chain needs before reaching its equilibrium distribution; this length of time is called the mixing time. Testing whether a chain has reached equilibrium is also difficult: we lack a precise enough theory for guiding this question, knowing only that convergence will occur eventually. If we analyze the chain from the point of view of a matrix A acting on a probability vector v, then the chain has mixed when A^t has effectively lost all the eigenvalues of A besides the unique eigenvalue of 1, and the magnitude of the second-largest eigenvalue determines the mixing time. In practice, however, we cannot represent the Markov chain as a matrix, because the number of states a probabilistic model can visit is exponentially large in the number of variables. We therefore usually do not know whether a Markov chain has mixed; instead, we simply run it for an amount of time we roughly estimate to be sufficient and use heuristic methods, such as manually inspecting samples or measuring correlations between successive samples, to judge whether it has mixed.
Gibbs Sampling
So far we have described how to draw samples from a distribution q(x) by repeatedly updating x ← x′ ∼ T(x′ | x), but we have not described how to ensure that q(x) is a useful distribution. Two basic approaches are considered here. The first is to derive T from a learned p_model; this is the approach described below for the case of sampling from EBMs. The second is to directly parametrize T and learn it, so that its stationary distribution implicitly defines the p_model of interest; examples of this second approach appear in sections 20.12 and 20.13.

In deep learning, Markov chains are commonly used to draw samples from an energy-based model defining a distribution p_model(x). In this case, we want the distribution q(x) of the Markov chain to equal p_model(x), and to achieve this we must choose an appropriate transition distribution T(x′ | x).
A conceptually simple and effective approach to building a Markov chain that samples from p_model(x) is to use Gibbs sampling, in which sampling from T(x′ | x) is accomplished by selecting one variable x_i and sampling it from p_model conditioned on its neighbors in the undirected graph G that defines the structure of the energy-based model. It is also possible to sample several variables at the same time, as long as they are conditionally independent given all their neighbors.

As shown in the RBM example of section 16.7.1, all the hidden units of an RBM may be sampled simultaneously because they are conditionally independent of each other given all the visible units. Likewise, all the visible units may be sampled simultaneously because they are conditionally independent of each other given all the hidden units. Gibbs sampling approaches that update many variables simultaneously in this way are called block Gibbs sampling.

Alternative approaches to designing Markov chains that sample from p_model are possible; for example, the Metropolis-Hastings algorithm is widely used in other disciplines. In the context of deep learning and undirected modeling, however, it is rare to use any approach other than Gibbs sampling. Improved sampling techniques remain a promising research frontier.
The Challenge of Mixing between Separated Modes
The core difficulty with MCMC methods is that they tend to mix poorly: successive samples from a Markov chain intended to sample from p(x) are highly correlated with one another and fail to explore different regions of x-space, especially in high-dimensional cases. This phenomenon, known as slow mixing or failure to mix, can be understood by viewing the chain as performing noisy gradient descent on the energy function, or equivalently noisy hill climbing on the probability, with respect to the state of the chain. The chain tends to take small steps from one configuration to a nearby one, which limits how efficiently it can sample.

The energy E(x^(t)) is generally lower than or approximately equal to the energy E(x^(t−1)), with a preference for moves toward lower energy. When started from a rather improbable (high-energy) configuration, the chain tends to gradually reduce the energy of the state, only occasionally moving to a different mode. Once the chain has found a region of low energy, called a mode, it tends to perform a random walk around that mode. Once in a while it will step out of the mode, but usually it returns; if it does find an escape route, it moves to another mode. The problem is that successful escape routes are rare for many interesting distributions, so the Markov chain stays in the same mode longer than it should.
This is very clear when we consider the Gibbs sampling algorithm (section 17.4).

The probability of the chain moving between two nearby modes depends on the shape of the energy barrier between them; transitions between modes separated by a high energy barrier (a region of low probability) are much less likely. The problem is worst when there are several modes with high probability separated by regions of low probability, especially when each Gibbs sampling step updates only a small subset of variables whose values are largely determined by the other variables.

As a simple example, consider an energy-based model over two binary variables a and b, each taking values in {−1, 1}. If E(a, b) = −w a b for some large positive number w, the model expresses a strong belief that a and b have the same sign. Consider updating b with a Gibbs sampling step while a = 1. The conditional distribution over b is given by

P(b = 1 | a = 1) = σ(w).

If w is large, the sigmoid saturates, and the probability of also assigning b the value 1 is close to 1. Likewise, if a = −1, the probability of assigning b the value −1 is close to 1. According to P_model(a, b), both signs of both variables are equally likely.
Figure 17.1: Paths followed by Gibbs sampling for three distributions, with the Markov chain initialized at the mode in each case. (Left) A multivariate normal distribution with two independent variables; Gibbs sampling mixes well because the variables are independent. (Center) A multivariate normal distribution with highly correlated variables; the correlation makes it difficult for the Markov chain to mix, because the update for each variable must be conditioned on the other variable, so the chain moves away from its starting point only slowly. (Right) A mixture of Gaussians with widely separated modes that are not axis-aligned; Gibbs sampling mixes very slowly because it is difficult to change modes while altering only one variable at a time.
According to P_model(a | b), however, both variables should have the same sign. This means that Gibbs sampling will only very rarely flip the signs of these variables.
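A minimal sketch of this two-variable example follows, counting how often the Gibbs chain flips signs for different values of w; the conditional P(b = 1 | a) = σ(2wa) follows directly from the energy function, while the function name and step counts are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sign_flips(w, n_steps, rng):
    # Gibbs sampling for E(a, b) = -w*a*b with a, b in {-1, 1}.
    # The conditional is p(b = 1 | a) = sigmoid(2*w*a), and symmetrically for a given b.
    a, b = 1, 1
    flips = 0
    for _ in range(n_steps):
        b = 1 if rng.random() < sigmoid(2 * w * a) else -1
        a_new = 1 if rng.random() < sigmoid(2 * w * b) else -1
        flips += (a_new != a)
        a = a_new
    return flips

for w in [0.5, 2.0, 5.0, 10.0]:
    print(w, gibbs_sign_flips(w, 10_000, rng))   # sign flips become rare as w grows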
In practical applications, the challenge is even greater, because we care not only about making transitions between two modes but about all the many modes a real model contains. When mixing between modes is difficult, it becomes costly to obtain a reliable set of samples covering most of the modes, and convergence of the chain to its stationary distribution is slow.

Sometimes this problem can be resolved by finding groups of highly dependent units and updating all of them simultaneously in a block. Unfortunately, when the dependencies are complicated, drawing a sample from the group can be computationally intractable; after all, the Markov chain was originally introduced precisely to solve the problem of sampling from large groups of variables.

In models with latent variables that define a joint distribution p_model(x, h), we frequently draw samples of x by alternating between sampling from p_model(x | h) and sampling from p_model(h | x). Whether such a chain mixes rapidly depends on the properties of p_model(h | x), as discussed below and in the caption of figure 17.2.
Figure 17.2: An illustration of the slow mixing problem in deep probabilistic models. Each panel should be read left to right, top to bottom. (Left) Consecutive samples from Gibbs sampling applied to a deep Boltzmann machine trained on the MNIST dataset. Consecutive samples are very similar to one another; because the Gibbs sampling is performed in a deep model, this similarity is based on semantic rather than raw visual features, but it remains difficult for the Gibbs chain to transition between modes of the distribution, for example by changing the digit identity. (Right) Consecutive ancestral samples from a generative adversarial network; because each sample is generated independently of the others, there is no mixing problem.

For mixing to be rapid, we would like p_model(h | x) to have high entropy. However, for h to be a useful representation, it must encode enough information about x to reconstruct x well, which requires high mutual information between h and x. These two goals are at odds with each other. The conflict is especially apparent in Boltzmann machines: the sharper the distribution the model learns, the harder it is for the Markov chain over the model to mix well.
For all these reasons, MCMC methods are less effective for distributions with a manifold structure in which each class is associated with a separate manifold: such distributions have many modes separated by wide regions of high energy. This situation is typical of classification problems and can make MCMC methods converge very slowly because of poor mixing between modes.
17.5.1 Tempering to Mix between Modes
When a distribution has sharp peaks of high probability surrounded by regions of low probability, it is difficult to mix between the different modes of the distribution. Several techniques for mixing faster are based on constructing alternative versions of the target distribution in which the peaks are not as high and the surrounding valleys are not as deep. Energy-based models provide a particularly simple way to do so. So far we have described an energy-based model as defining a probability distribution p(x) ∝ exp(−E(x)); such a model can be augmented with an extra parameter β controlling how sharply peaked the distribution is:

p_β(x) ∝ exp(−β E(x)).

The β parameter is often described as the reciprocal of the temperature, reflecting the origin of energy-based models in statistical physics. When the temperature falls to zero and β rises to infinity, the energy-based model becomes deterministic. When the temperature rises to infinity and β falls to zero, the distribution (for discrete x) becomes uniform.

Typically a model is trained to be evaluated at β = 1, but we can make use of other temperatures, particularly those with β < 1, for sampling. Tempering is a general strategy for mixing rapidly between modes of p_1 by drawing samples with β < 1. Markov chains based on tempered transitions temporarily sample from higher-temperature distributions in order to mix between modes, then resume sampling from the unit-temperature distribution; such techniques have been applied to models such as RBMs. Another approach, parallel tempering, simulates many different states in parallel at different temperatures: the higher-temperature states mix rapidly, aiding exploration, while the β = 1 states provide the samples we actually want from the model.
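As a small illustration of why lower β helps, the following sketch revisits the earlier two-variable toy model and samples from p_β(a, b) ∝ exp(−β E(a, b)) by Gibbs sampling; the conditional σ(2βwa) follows from the tempered energy, and the specific values of w, β, and step counts are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tempered_gibbs_flips(w, beta, n_steps, rng):
    # Gibbs sampling from p_beta(a, b) proportional to exp(-beta * E(a, b)),
    # with E(a, b) = -w*a*b, so p_beta(b = 1 | a) = sigmoid(2*beta*w*a).
    a, b = 1, 1
    flips = 0
    for _ in range(n_steps):
        b = 1 if rng.random() < sigmoid(2 * beta * w * a) else -1
        a_new = 1 if rng.random() < sigmoid(2 * beta * w * b) else -1
        flips += (a_new != a)
        a = a_new
    return flips

w = 10.0
for beta in [1.0, 0.5, 0.1]:
    print(beta, tempered_gibbs_flips(w, beta, 10_000, rng))  # more mode switches at lower beta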
The transition operator in parallel tempering includes stochastically swapping states between two different temperature levels, so that a sufficiently high-probability sample from a high-temperature slot can jump into a lower-temperature slot. This approach has been applied to RBMs (Desjardins et al., 2010; Cho et al., 2010). Despite its promise, tempering has not yet dramatically advanced the state of the art in sampling from complex EBMs. One possible reason is the existence of critical temperatures around which the temperature transition must be made very slowly for tempering to be effective.

Another consideration concerns latent variable models. When sampling from a latent variable model by alternating between p(h | x) and p(x | h), mixing is poor if p(h | x) encodes x too well, because then x changes very little between steps. One solution is to make h a deep representation, encoding x into h in such a way that a Markov chain in the space of h can mix more easily. Many representation learning algorithms, such as autoencoders and RBMs, tend to yield a marginal distribution over h that is more uniform and more unimodal than the original data distribution over x. This arises in part because such algorithms try to minimize reconstruction error, which is easier to achieve when the training examples are well separated in h-space.
The Log-Likelihood Gradient
Learning undirected models by maximum likelihood is difficult because the partition function depends on the parameters. The gradient of the log-likelihood with respect to the parameters therefore has a term corresponding to the gradient of the partition function:

∇_θ log p(x; θ) = ∇_θ log p̃(x; θ) − ∇_θ log Z(θ).
This is a well-known decomposition into the positive phase and negative phase of learning.
For most undirected models of interest, the negative phase is difficult. Models with no latent variables, or with only simple interactions between latent variables, typically have a tractable positive phase. The quintessential example, the RBM, has hidden units that are conditionally independent of each other given the visible units, giving it a straightforward positive phase and a difficult negative phase. Models with complicated interactions between latent variables, for which the positive phase is also intractable, are the topic of chapter 19; this chapter focuses on the difficulties of the negative phase.
Let us look more closely at the gradient of logZ:
∇_θ log Z
 = ∇_θ Z / Z
 = (∇_θ Σ_x p̃(x)) / Z
 = (Σ_x ∇_θ p̃(x)) / Z.

For models that guarantee p(x) > 0 for all x, we can substitute exp(log p̃(x)) for p̃(x):

 (Σ_x ∇_θ exp(log p̃(x))) / Z
 = (Σ_x exp(log p̃(x)) ∇_θ log p̃(x)) / Z
 = (Σ_x p̃(x) ∇_θ log p̃(x)) / Z
 = Σ_x p(x) ∇_θ log p̃(x)
 = E_{x∼p(x)} ∇_θ log p̃(x).
This derivation used summation over discrete values of x, but a similar result applies with integration over continuous x. In the continuous version of the derivation, we use Leibniz's rule for differentiation under the integral sign to obtain the identity

∇_θ ∫ p̃(x) dx = ∫ ∇_θ p̃(x) dx.
This identity is applicable only under certain regularity conditions on p̃ and ∇_θ p̃(x). In measure-theoretic terms, the conditions are: (i) the unnormalized distribution p̃ must be a Lebesgue-integrable function of x for every value of θ; (ii) the gradient ∇_θ p̃(x) must exist for all θ and almost all x; (iii) there must exist an integrable function R(x) that bounds ∇_θ p̃(x), in the sense that max_i |∂p̃(x)/∂θ_i| ≤ R(x) for all θ and almost all x. Fortunately, most machine learning models of interest have these properties.
The identity

∇_θ log Z = E_{x∼p(x)} ∇_θ log p̃(x)  (18.15)

is the basis for a variety of Monte Carlo methods for approximately maximizing the likelihood of models with intractable partition functions.
The Monte Carlo approach to learning undirected models provides an intuitive framework encompassing both the positive and the negative phase. In the positive phase, we increase log p̃(x) for x drawn from the data; in the negative phase, we decrease the partition function by decreasing log p̃(x) for x drawn from the model distribution.
In deep learning, log probabilities are often parameterized through an energy function, allowing us to view the positive phase as reducing the energy of training examples while the negative phase increases the energy of samples generated by the model, as demonstrated in figure 18.1.
Stochastic Maximum Likelihood and Contrastive Divergence
A naive way of implementing equation 18.15 is to burn in a set of Markov chains from a random initialization every time the gradient is needed. When learning is performed with stochastic gradient descent, this means the chains must be burned in once per gradient step, as outlined in algorithm 18.1. The high cost of burning in the Markov chains in this inner loop makes the procedure computationally infeasible, but it is the starting point that other, more practical algorithms aim to approximate.
Algorithm 18.1 A naive MCMC algorithm for maximizing the log-likelihood with an intractable partition function using gradient ascent.
Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow burn in. Perhaps 100 to train an RBM on a small image patch.
while not converged do
    Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set.
    g ← (1/m) Σ_{i=1}^m ∇_θ log p̃(x^(i); θ).
    Initialize a set of m samples {x̃^(1), ..., x̃^(m)} to random values (e.g., from a uniform or normal distribution, or possibly a distribution with marginals matched to the model's marginals).
    for i = 1 to k do
        for j = 1 to m do
            x̃^(j) ← gibbs_update(x̃^(j)).
        end for
    end for
    g ← g − (1/m) Σ_{i=1}^m ∇_θ log p̃(x̃^(i); θ).
    θ ← θ + εg.
end while
The MCMC approach to maximum likelihood can be viewed as balancing two opposing forces: one pushing up on the model distribution where the data occurs, and another pushing down on the model distribution where the model samples occur. Figure 18.1 illustrates this process; the two forces correspond to maximizing log p̃ and minimizing log Z. Several approximations to the negative phase are possible, each making the negative phase computationally cheaper at the risk of pushing down in the wrong locations.

Because the negative phase involves drawing samples from the model's distribution, we can think of it as finding points that the model believes in strongly. Since the negative phase acts to reduce the probability of those points, they are generally considered to represent the model's incorrect beliefs about the world; they are often referred to in the literature as "hallucinations" or "fantasy particles." By reducing the probability of these points, the negative phase refines the model's fit to reality, and it has even been proposed as a possible explanation for dreaming in humans and animals (Crick and Mitchison, 1983).
Figure 18.1: The view of algorithm 18.1 as having a "positive phase" and a "negative phase." (Left) The positive phase, in which points are sampled from p_data(x). (Right) The negative phase, in which points are sampled from p_model(x).
In the positive phase of training, points from the data distribution are sampled, increasing their unnormalized probability, thereby emphasizing likely data points Conversely, the negative phase samples from the model distribution, reducing their unnormalized probability to balance the positive phase's tendency to inflate probabilities uniformly When both distributions align, the chances of increasing or decreasing probabilities equalize, resulting in no gradient and signaling the end of training This concept parallels the theory of dreaming in humans and animals, where the brain refines its probabilistic model of the world during wakefulness and sleep Although this analogy helps explain the dual-phase algorithms, it remains unverified by neuroscientific studies In machine learning, both phases are often utilized concurrently, as seen in various algorithms that sample from model distributions for different objectives, potentially shedding light on the purpose of dream sleep.
To devise a cheaper alternative to algorithm 18.1, we can exploit this understanding of the positive and negative phases of learning. The main cost of the naive MCMC algorithm is the time required to burn in the Markov chains from a random initialization at each step. A natural solution is to initialize the Markov chains from a distribution that is already close to the model distribution, so that the burn-in operation does not take as many steps.

The contrastive divergence algorithm (CD, or CD-k to indicate CD with k Gibbs steps) initializes the Markov chain at each step with samples from the data distribution; this is outlined in algorithm 18.2. Obtaining samples from the data distribution is free, because they are already available in the dataset. Initially, the data distribution is not close to the model distribution, so the negative phase is not very accurate; fortunately, the positive phase can still act to increase the model's probability of the data. After the positive phase has had some time to act, the model distribution is closer to the data distribution, and the negative phase becomes increasingly accurate.
Algorithm 18.2 The contrastive divergence algorithm, using gradient ascent as the optimization procedure.
Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow a Markov chain sampling from p(x; θ) to mix when initialized from p_data. Perhaps 1 to 20 to train an RBM on a small image patch.
while not converged do
    Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set.
    g ← (1/m) Σ_{i=1}^m ∇_θ log p̃(x^(i); θ).
    for i = 1 to m do
        x̃^(i) ← x^(i).
    end for
    for i = 1 to k do
        for j = 1 to m do
            x̃^(j) ← gibbs_update(x̃^(j)).
        end for
    end for
    g ← g − (1/m) Σ_{i=1}^m ∇_θ log p̃(x̃^(i); θ).
    θ ← θ + εg.
end while
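The following is a minimal NumPy sketch of one CD-k update for a binary RBM, following the structure of algorithm 18.2 but specialized to the RBM's closed-form conditionals. The function name cd_k_update, the learning rate, and the toy data are illustrative assumptions, not a reference implementation.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(V, W, b, c, k, lr, rng):
    # One contrastive-divergence (CD-k) update for a binary RBM.
    # V is a minibatch of visible vectors, shape (m, n_v).
    ph_data = sigmoid(c + V @ W)          # positive-phase statistics p(h | v) for the data
    V_neg = V.copy()                      # negative phase: chains initialized at the data
    for _ in range(k):
        h = (rng.random(ph_data.shape) < sigmoid(c + V_neg @ W)).astype(float)
        V_neg = (rng.random(V.shape) < sigmoid(b + h @ W.T)).astype(float)
    ph_neg = sigmoid(c + V_neg @ W)
    m = V.shape[0]
    grad_W = (V.T @ ph_data - V_neg.T @ ph_neg) / m   # estimate of the log-likelihood gradient
    grad_b = (V - V_neg).mean(axis=0)
    grad_c = (ph_data - ph_neg).mean(axis=0)
    return W + lr * grad_W, b + lr * grad_b, c + lr * grad_c

# Toy usage on random binary data.
n_v, n_h, m = 8, 4, 16
W = 0.01 * rng.standard_normal((n_v, n_h))
b, c = np.zeros(n_v), np.zeros(n_h)
V = rng.integers(0, 2, size=(m, n_v)).astype(float)
for _ in range(100):
    W, b, c = cd_k_update(V, W, b, c, k=1, lr=0.1, rng=rng)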
While CD qualitatively approximates the correct negative phase, it fails to suppress regions of high probability under the model that are far from any actual training example. Such regions, which have high probability under the model but low probability under the data-generating distribution, are called spurious modes.
Figure 18.2 illustrates why this happens. Essentially, modes of the model distribution that are far from the data distribution will not be visited by Markov chains initialized at training points, unless k is very large.

Figure 18.2: An illustration of how the negative phase of contrastive divergence can fail to suppress spurious modes, that is, modes present in the model distribution but absent from the data distribution. Because contrastive divergence initializes its Markov chains from data points and runs them for only a few steps, it is unlikely to visit modes of the model that lie far from the data points. This means that when sampling from the model, some samples will not resemble the data; the probability mass wasted on these spurious modes prevents the model from assigning high probability to the correct modes. For the purpose of visualization, this figure uses a somewhat simplified notion of distance: the spurious mode is far from the correct mode along a number line, corresponding to a Markov chain over ℝ that makes local moves with a single variable. For most deep probabilistic models, the Markov chains are based on Gibbs sampling and can make non-local moves of individual variables but cannot move all the variables simultaneously. For such problems, it is usually better to consider the edit distance between modes rather than the Euclidean distance, but edit distance in a high-dimensional space is difficult to depict in a 2-D plot.
Carreira-Perpiñán and Hinton (2005) showed that the CD estimator is biased for RBMs and fully visible Boltzmann machines, in the sense that it converges to different points than the maximum likelihood estimator. They argue that because the bias is small, CD can still be used as an inexpensive way to initialize a model that is later fine-tuned with more expensive MCMC methods. Bengio and Delalleau (2009) showed that CD can be interpreted as discarding the smallest terms of the correct MCMC update gradient, which explains the bias.

CD is useful for training shallow models such as RBMs, which can in turn be stacked to initialize deeper models such as DBNs or DBMs. It does not, however, provide much help for training deeper models directly, because it is difficult to obtain samples of the hidden units given the visible units. Since the hidden units are not included in the data, initializing from training points does not resolve the problem: even if we initialize the visible units from the data, we still need to burn in a Markov chain for the distribution over the hidden units conditioned on those visible samples.

The CD algorithm can be thought of as penalizing the model for having a Markov chain that changes the input rapidly when the chain is initialized from the data; in this sense it somewhat resembles autoencoder training. Even though CD is more biased than some other training methods, it can be useful for pretraining shallow models that will later be stacked, because pretraining encourages the earliest models in the stack to copy more information up to their latent variables, making that information available to later models. This should be regarded as an often-exploitable side effect of CD training rather than a principled design advantage.

Sutskever and Tieleman (2010) showed that the CD update direction is not the gradient of any function. This allows for situations in which CD could cycle forever, but in practice this is not a serious problem.
Pseudolikelihood
Monte Carlo methods confront the partition function head on by directly approximating it or its gradient. Other approaches sidestep the issue by training the model without computing the partition function at all. Most of these approaches are based on the observation that it is easy to compute ratios of probabilities in an undirected probabilistic model, because the partition function appears in both the numerator and the denominator of the ratio and cancels out.

The pseudolikelihood is based on the observation that conditional probabilities take this ratio-based form and thus can be computed without knowledge of the partition function. Suppose that we partition x into three components: a, containing the variables we want to find the conditional distribution over; b, containing the variables we want to condition on; and c, containing the remaining variables. Then

p(a | b) = p(a, b) / p(b),

a ratio in which the partition function cancels, so it can be computed from the unnormalized p̃ alone.

This quantity requires marginalizing out variables, which can be very efficient provided that a and c contain few variables. In the simplest scenario, a is a single variable and c is empty, so the operation requires only as many evaluations of p̃ as there are values of that single random variable.

Unfortunately, in order to compute the log-likelihood we must marginalize out large sets of variables: if there are n variables in total, we must marginalize a set of size n − 1. By the chain rule of probability,

log p(x) = log p(x_1) + log p(x_2 | x_1) + ⋯ + log p(x_n | x_{1:n−1}).

In this case, a is maximally small, but c can be as large as x_{2:n}. What if we simply move c into b to reduce the computational cost? This yields the pseudolikelihood objective function (Besag, 1975), based on predicting the value of feature x_i given all the other features x_{−i}:

Σ_{i=1}^n log p(x_i | x_{−i}).

If each random variable has k different values, this requires only k × n evaluations of p̃ to compute, as opposed to the k^n evaluations needed to compute the partition function.
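The following is a minimal sketch of evaluating the pseudolikelihood objective for a small fully visible Boltzmann machine over binary variables, where each conditional p(x_i | x_{−i}) reduces to a sigmoid of the unit's total input. The model family, the parameter values, and the function names are assumptions made for illustration.

import numpy as np

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)

def pseudolikelihood(x, W, b):
    # Pseudolikelihood sum_i log p(x_i | x_{-i}) for a fully visible Boltzmann machine
    # over binary x in {0, 1}^n with p̃(x) proportional to exp(0.5 * x^T W x + b^T x),
    # where W is symmetric with zero diagonal. Then p(x_i = 1 | x_{-i}) = sigmoid(b_i + W[i] @ x).
    total = 0.0
    for i in range(len(x)):
        a_i = b[i] + W[i] @ x            # total input to unit i (W[i, i] = 0, so x_i is ignored)
        total += log_sigmoid(a_i) if x[i] == 1 else log_sigmoid(-a_i)
    return total

# Toy example with 4 units.
rng = np.random.default_rng(0)
n = 4
W = rng.standard_normal((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
b = rng.standard_normal(n)
x = np.array([1, 0, 1, 1], dtype=float)
print(pseudolikelihood(x, W, b))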
This may look like an unprincipled hack, but it can be proven that estimation by maximizing the pseudolikelihood is asymptotically consistent (Mase, 1995).
Of course, in the case of datasets that do not approach the large sample limit, pseudolikelihood may display different behavior from the maximum likelihood estimator.
It is possible to trade computational complexity for deviation from maximum likelihood behavior by using the generalized pseudolikelihood estimator (Huang and Ogata, 2002). The generalized pseudolikelihood estimator uses m different sets S^(i), for i = 1, ..., m, of indices of variables that appear together on the left side of the conditioning bar. In the case where m = 1 and S^(1) = {1, ..., n}, the generalized pseudolikelihood recovers the log-likelihood. In the case where m = n and S^(i) = {i}, the generalized pseudolikelihood recovers the pseudolikelihood. The generalized pseudolikelihood objective function is given by

Σ_{i=1}^m log p(x_{S^(i)} | x_{−S^(i)}).
Pseudolikelihood-based methods are significantly influenced by their intended application, often struggling with tasks that necessitate an accurate representation of the full joint distribution p(x), particularly in density estimation and sampling scenarios.
Generalized pseudolikelihood techniques can outperform maximum likelihood on tasks that involve only the conditional distributions used during training, such as filling in small amounts of missing values. They are especially effective when the data has a regular structure that allows the S^(i) index sets to be designed to capture the most important correlations while leaving out groups of variables with only negligible correlation. For example, in natural images, pixels that are widely separated tend to be only weakly correlated, so generalized pseudolikelihood can be applied with each S set being a small, spatially localized window.
One weakness of the pseudolikelihood estimator is that it cannot be used in combination with other approximations, such as variational inference, that provide only a lower bound on p̃(x), because p̃ appears in the denominator: a lower bound on the denominator yields only an upper bound on the whole expression, and maximizing an upper bound is not useful. This makes it difficult to apply pseudolikelihood approaches to deep models such as deep Boltzmann machines, since variational methods are one of the dominant ways of approximately marginalizing out the many layers of hidden variables that interact with each other. Nevertheless, pseudolikelihood remains useful in deep learning, because it can be used to train single-layer models or deep models using approximate inference methods that are not based on lower bounds.
Pseudolikelihood has a much greater cost per gradient step than SML, because it explicitly computes all the conditionals. However, generalized pseudolikelihood and related criteria can still perform well if only one randomly selected conditional is computed per example, which brings the computational cost down to match that of SML (Goodfellow et al., 2013b).

Though the pseudolikelihood estimator does not explicitly minimize log Z, it can still be thought of as having something resembling a negative phase. The denominators of the conditional distributions cause the learning algorithm to suppress the probability of all states that differ from a training example in only one variable.
See Marlin and de Freitas (2011) for a theoretical analysis of the asymptotic efficiency of pseudolikelihood.
Score Matching and Ratio Matching
Score matching (Hyvärinen, 2005) provides a consistent means of training a model without estimating Z or its derivatives. The name derives from the use of the term "score" for the derivatives of a log density with respect to its argument, ∇_x log p(x). The strategy is to minimize the expected squared difference between the derivatives of the model's log density with respect to the input and the derivatives of the data's log density with respect to the input:

L(x, θ) = (1/2) ||∇_x log p_model(x; θ) − ∇_x log p_data(x)||²_2,
J(θ) = (1/2) E_{p_data(x)} L(x, θ),
θ* = argmin_θ J(θ).
This objective function avoids the difficulties associated with differentiating the partition function Z, because Z is not a function of x and therefore ∇_x Z = 0. At first, score matching appears to introduce a new difficulty: computing the score of the data distribution requires knowledge of the true distribution p_data generating the training data. Fortunately, minimizing the expected value of L(x, θ) is equivalent to minimizing the expected value of

L̃(x, θ) = Σ_{j=1}^n ( ∂²/∂x_j² log p_model(x; θ) + (1/2) ( ∂/∂x_j log p_model(x; θ) )² ),  (18.25)

where n is the dimensionality of x.
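To make the objective concrete, the following is a minimal sketch of evaluating equation 18.25 for a one-dimensional Gaussian model, where the score and its derivative are available in closed form; the grid search, parameter ranges, and synthetic data are assumptions made purely for illustration. The minimizer lands near the true mean and variance without the partition function ever being computed.

import numpy as np

rng = np.random.default_rng(0)

# Data from an unknown Gaussian; we fit a Gaussian model by score matching.
data = rng.normal(2.0, 1.5, size=5_000)

def score_matching_objective(mu, sigma2, x):
    # Equation 18.25 for a 1-D Gaussian model with log p̃(x) = -(x - mu)^2 / (2 sigma2):
    # the per-example term is d(score)/dx + 0.5 * score^2, averaged over the data.
    score = -(x - mu) / sigma2            # d/dx log p̃(x)
    dscore_dx = -1.0 / sigma2             # d^2/dx^2 log p̃(x)
    return np.mean(dscore_dx + 0.5 * score ** 2)

# Simple grid search over parameters to illustrate where the objective is minimized.
mus = np.linspace(0.0, 4.0, 81)
sig2s = np.linspace(0.5, 5.0, 91)
best = min(((score_matching_objective(m, s, data), m, s)
            for m in mus for s in sig2s))
print("best (objective, mu, sigma^2):", best)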
Because score matching requires taking derivatives with respect to x, it is not applicable to models of discrete data. However, the latent variables in the model may be discrete.
Score matching is effective only when direct evaluation of log ˜p(x) and its derivatives is possible, making it incompatible with methods that provide only a lower bound on log ˜p(x) Since score matching requires both first and second derivatives of log ˜p(x), lower bounds fail to provide necessary derivative information Consequently, score matching is unsuitable for estimating models with complex interactions among hidden units, such as sparse coding models or deep Boltzmann machines Although it can be used to pretrain the first hidden layer of a larger model, score matching has not been employed as a pretraining strategy for deeper layers, likely due to the presence of discrete variables in those hidden layers.
Score matching can be interpreted as a version of contrastive divergence that uses a specific kind of Markov chain: rather than Gibbs sampling, the chain makes local moves guided by the gradient. Score matching is equivalent to contrastive divergence with this type of Markov chain when the size of the local moves approaches zero.
Lyu (2009) generalized score matching to the discrete case (but made an error in the derivation that was corrected by Marlin et al. (2010)). Marlin et al. (2010) found that generalized score matching (GSM) does not work in high-dimensional discrete spaces where the observed probability of many events is 0.
Ratio matching (Hyvärinen, 2007) provides a more successful way of extending the basic ideas of score matching to discrete data, and applies specifically to binary data. Ratio matching consists of minimizing the average over examples of an objective defined on each training example.
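That per-example objective is not written out above; assuming the standard formulation from Hyvärinen (2007), it presumably has the form
\[
L^{(RM)}(x, \theta) = \sum_{j=1}^{n} \left( \frac{1}{1 + \frac{p_{\text{model}}(x;\, \theta)}{p_{\text{model}}(f(x, j);\, \theta)}} \right)^{2}.
\]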
Here f(x, j) denotes x with the bit at position j flipped. Ratio matching avoids the partition function using the same trick as the pseudolikelihood estimator: the partition function cancels in ratios of probabilities. Marlin et al. (2010) found that models trained with ratio matching outperform those trained with SML, pseudolikelihood, and GSM at denoising test-set images. Like the pseudolikelihood estimator, however, ratio matching requires n evaluations of p̃ per data point, making its computational cost per update roughly n times higher than that of SML.
Like the pseudolikelihood estimator, ratio matching can be thought of as pushing down the probability of all fantasy states that differ from a training example by only one variable. Because ratio matching applies specifically to binary data, this means acting on all fantasy states within Hamming distance 1 of the data.
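Here is a minimal sketch (illustrative only) using the same toy binary pairwise model as in the pseudolikelihood example: the partition function cancels inside each probability ratio, but every example requires one extra evaluation of log p̃ per variable, one for each Hamming-distance-1 fantasy state.

```python
# A minimal sketch of the ratio matching objective for binary data, using the
# toy model log p~(x) = 0.5 * x^T W x + b^T x.  Illustrative only.
import numpy as np

def log_p_tilde(X, W, b):
    return 0.5 * np.einsum('bi,ij,bj->b', X, W, X) + X @ b

def ratio_matching_objective(X, W, b):
    """Average over examples of sum_j sigmoid(log p~(f(x,j)) - log p~(x))^2,
    where f(x, j) is x with bit j flipped (a Hamming-distance-1 fantasy state)."""
    m, n = X.shape
    base = log_p_tilde(X, W, b)
    total = np.zeros(m)
    for j in range(n):                              # n extra evaluations of log p~ per example
        flipped = X.copy()
        flipped[:, j] = 1.0 - flipped[:, j]         # the fantasy state f(x, j)
        delta = log_p_tilde(flipped, W, b) - base   # log ratio: the partition function cancels
        sig = 1.0 / (1.0 + np.exp(-delta))          # 1 / (1 + p(x) / p(f(x, j)))
        total += sig ** 2
    return total.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 8
    W = rng.normal(scale=0.1, size=(n, n))
    W = (W + W.T) / 2
    np.fill_diagonal(W, 0.0)
    b = rng.normal(scale=0.1, size=n)
    X = rng.integers(0, 2, size=(64, n)).astype(float)
    print(ratio_matching_objective(X, W, b))
```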
Ratio matching is also useful for dealing with high-dimensional sparse data, such as word count vectors. This kind of data poses a challenge for MCMC-based methods because the data are expensive to represent in dense format, and the MCMC sampler does not yield sparse values until the model has learned to represent the sparsity in the data distribution. Dauphin and Bengio (2013) overcame this issue by designing an unbiased stochastic approximation to ratio matching; the approximation evaluates only a randomly selected subset of the terms of the objective and does not require the model to generate complete fantasy samples.
See Marlin and de Freitas (2011) for a theoretical analysis of the asymptotic efficiency of ratio matching.
Denoising Score Matching
In some cases we may wish to regularize score matching by fitting a distribution
\[
p_{\text{smoothed}}(x) = \int p_{\text{data}}(y)\, q(x \mid y)\, dy \qquad (18.27)
\]
rather than the true p_data. The distribution q(x | y) is a corruption process, usually one that forms x by adding a small amount of noise to y.
Denoising score matching is especially useful because, in practice, we usually do not have access to the true data distribution but only to an empirical distribution defined by its samples. Given enough capacity, any consistent estimator will turn p_model into a set of Dirac distributions centered on the training points. Smoothing by q mitigates this issue, albeit at the expense of the asymptotic consistency property. Kingma and LeCun (2010) introduced a procedure for regularized score matching in which the smoothing distribution q is normally distributed noise.
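The following sketch is illustrative only and follows the standard denoising construction with Gaussian corruption rather than anything stated above. It reuses the toy zero-mean Gaussian model from the earlier score matching example; for q(x | y) = N(x; y, σ²I), the score of the corruption kernel is (y − x)/σ², and the objective favors the score of the smoothed distribution, whose covariance here is the data covariance plus σ²I.

```python
# A minimal sketch of denoising score matching with Gaussian corruption
# q(x | y) = N(x; y, sigma^2 I), for a zero-mean Gaussian model with precision
# Lam (score: -Lam x).  Illustrative only.
import numpy as np

def denoising_score_matching(X, Y, Lam, sigma):
    """Average of 0.5 * || model_score(x) - grad_x log q(x | y) ||^2 over pairs
    of clean points y and their corrupted copies x."""
    model_score = -(X @ Lam)                 # grad_x log p~_model(x)
    target_score = (Y - X) / sigma ** 2      # grad_x log q(x | y)
    return 0.5 * np.mean(np.sum((model_score - target_score) ** 2, axis=1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_cov = np.array([[1.0, 0.6], [0.6, 2.0]])
    Y = rng.multivariate_normal(np.zeros(2), true_cov, size=50000)  # "training points"
    sigma = 0.5
    X = Y + sigma * rng.normal(size=Y.shape)                        # x ~ q(x | y)
    smoothed = np.linalg.inv(true_cov + sigma ** 2 * np.eye(2))     # precision of p_smoothed
    print(denoising_score_matching(X, Y, smoothed, sigma),
          denoising_score_matching(X, Y, np.eye(2), sigma))
    # Lower is better: the objective favors the precision of the smoothed
    # distribution rather than that of the original data distribution.
```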
Several autoencoder training algorithms are equivalent to score matching or denoising score matching, as discussed in section 14.5.1. These autoencoder training algorithms are therefore another way of overcoming the partition function problem.
Noise-Contrastive Estimation
Most methods for estimating models with intractable partition functions do not provide an estimate of the partition function itself. Stochastic maximum likelihood (SML) and contrastive divergence (CD) estimate only the gradient of the log partition function rather than the partition function. Score matching and pseudolikelihood avoid computing quantities related to the partition function altogether.
Noise-contrastive estimation (NCE) takes a different approach: the probability distribution estimated by the model is represented explicitly as log p_model(x) = log p̃_model(x; θ) + c, where c is an approximation of −log Z(θ). Rather than estimating only θ, NCE treats c as just another parameter and estimates θ and c simultaneously, using the same algorithm for both. The resulting log p_model(x) may not correspond exactly to a valid probability distribution, but it comes closer and closer to being valid as the estimate of c improves.
Using maximum likelihood as the estimator criterion is inadequate, as it would result in setting the parameter c to an arbitrarily high value instead of ensuring that c creates a valid probability distribution.
NCE works by reducing the unsupervised learning problem of estimating p(x) to a supervised learning problem: training a probabilistic binary classifier in which one of the categories corresponds to the data generated by the model, so that maximum likelihood estimation can be applied in the supervised setting.
NCE is applicable to problems with a tractable partition function, where there is no need to introduce the extra parameter c, but it is most interesting for its ability to estimate models with intractable partition functions. This supervised learning problem is constructed in such a way that maximum likelihood learning defines an asymptotically consistent estimator of the original problem.
Specifically, we introduce a new noise distribution, p_noise(x), that is easy to evaluate and to sample from. We can now construct a joint model over both x and a new binary class variable y. In the new joint model, we set
\[
p_{\text{joint}}(y = 1) = \tfrac{1}{2}, \qquad (18.29)
\]
\[
p_{\text{joint}}(x \mid y = 1) = p_{\text{model}}(x), \qquad (18.30)
\]
and
\[
p_{\text{joint}}(x \mid y = 0) = p_{\text{noise}}(x). \qquad (18.31)
\]
In other words, y is a switch variable that determines whether we will generate x from the model or from the noise distribution.
We can construct a similar joint model of the training data, in which a switch variable determines whether x is drawn from the data or from the noise distribution. Formally, p_train(y = 1) = 1/2, p_train(x | y = 1) = p_data(x), and p_train(x | y = 0) = p_noise(x).
We can now just use standard maximum likelihood learning on the supervised learning problem of fitting p_joint to p_train:
\[
\theta, c = \arg\max_{\theta, c} \; \mathbb{E}_{x, y \sim p_{\text{train}}} \log p_{\text{joint}}(y \mid x). \qquad (18.32)
\]
The distribution p_joint is essentially a logistic regression model applied to the difference in log probabilities of the model and the noise distribution:
\[
p_{\text{joint}}(y = 1 \mid x) = \frac{p_{\text{model}}(x)}{p_{\text{model}}(x) + p_{\text{noise}}(x)}. \qquad (18.33)
\]
NCE is thus simple to apply as long as back-propagation through log p̃_model is straightforward, p_noise is easy to evaluate (in order to evaluate p_joint), and p_noise is easy to sample from (to generate the training data).
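As a concrete sketch (illustrative only, not a procedure from any paper discussed here), the following fits a hypothetical 1-D unnormalized Gaussian model p̃(x; μ, τ) = exp(−τ(x − μ)²/2) with NCE, using a fixed Gaussian noise distribution. The offset c is learned along with μ and log τ by gradient ascent on the log-likelihood of the binary classification task, and it should approach −log Z of the fitted model.

```python
# A toy sketch of NCE for a 1-D unnormalized Gaussian model.  The classifier
# logit is s = log p~_model(x) + c - log p_noise(x), and we ascend the
# log-likelihood of labels y (1 = data, 0 = noise).  Illustrative only.
import numpy as np

def log_p_noise(x, sigma_noise=2.0):
    # Fully normalized Gaussian noise density: easy to evaluate and to sample.
    return -0.5 * (x / sigma_noise) ** 2 - 0.5 * np.log(2 * np.pi * sigma_noise ** 2)

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=1.0, size=5000)   # samples from p_data
mu, log_tau, c = 0.0, 0.0, 0.0                     # model parameters and the -log Z estimate
lr = 0.05

for step in range(5000):
    noise = rng.normal(scale=2.0, size=data.size)  # samples from p_noise
    x = np.concatenate([data, noise])
    y = np.concatenate([np.ones(data.size), np.zeros(noise.size)])

    tau = np.exp(log_tau)
    log_p_tilde = -0.5 * tau * (x - mu) ** 2
    s = log_p_tilde + c - log_p_noise(x)           # logit of p_joint(y = 1 | x)
    p1 = 1.0 / (1.0 + np.exp(-s))
    err = y - p1                                   # gradient of the log-likelihood w.r.t. s

    mu += lr * np.mean(err * tau * (x - mu))                     # ds/dmu = tau (x - mu)
    log_tau += lr * np.mean(err * (-0.5 * tau * (x - mu) ** 2))  # ds/dlog_tau
    c += lr * np.mean(err)                                       # ds/dc = 1

tau = np.exp(log_tau)
print("mu:", mu, "tau:", tau, "c:", c,
      "-log Z of fitted model:", -0.5 * np.log(2 * np.pi / tau))
```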
NCE is most successful when applied to problems with few random variables, but it can work well even if those random variables can take on a large number of values. For example, it has been successfully applied to modeling the conditional distribution of a word given its context (Mnih and Kavukcuoglu, 2013). Though the word may be drawn from a large vocabulary, there is only one word.
When NCE is applied to problems with many random variables, it becomes less efficient. The logistic regression classifier can reject a noise sample by identifying any single variable whose value is unlikely, so learning slows down greatly after the model has learned the basic marginal statistics. Imagine learning a model of images of faces, using unstructured Gaussian noise as p_noise: once the model learns about eyes, it can reject almost all unstructured noise samples without learning anything about other facial features, such as mouths.
The constraint that p_noise must be easy to evaluate and easy to sample from can be overly restrictive. When p_noise is simple, most samples are likely to be too obviously distinct from the data to force p_model to improve noticeably.
Like score matching and pseudolikelihood, NCE does not work if only a lower bound on p̃ is available. Such a lower bound could be used to construct a lower bound on p_joint(y = 1 | x), but it yields only an upper bound on p_joint(y = 0 | x), which appears in half the terms of the NCE objective. Likewise, a lower bound on p_noise is not useful, because it provides only an upper bound on p_joint(y = 1 | x).
When a new noise distribution, equal to a copy of the model distribution, is defined before each gradient step, NCE becomes a procedure called self-contrastive estimation, whose expected gradient is equivalent to the expected gradient of maximum likelihood (Goodfellow, 2014). This special case, in which the noise samples are those generated by the model itself, suggests that maximum likelihood can be interpreted as a procedure that forces a model to constantly learn to distinguish reality from its own evolving beliefs, while NCE achieves a reduced computational cost by only forcing the model to distinguish reality from a fixed baseline (the noise model).
The supervised task of differentiating between training samples and generated samples, utilizing the model energy function to define the classifier, has been previously introduced in various forms by researchers such as Welling et al.
Noise-contrastive estimation is based on the idea that a good generative model should be able to distinguish data from noise. A closely related idea is that a good generative model should be able to generate samples that no classifier can distinguish from data; this idea yields generative adversarial networks.
Estimating the Partition Function
This section explores various techniques for directly estimating the intractable partition function Z(θ) related to undirected graphical models, complementing the methods that aim to circumvent its computation.
Estimating the partition function is crucial for computing the normalized likelihood of data, which is essential for evaluating models, monitoring training performance, and facilitating comparisons between different models.
For example, suppose we have two models: model \(M_A\), defining a probability distribution \(p_A(x; \theta_A) = \frac{1}{Z_A}\tilde{p}_A(x; \theta_A)\), and model \(M_B\), defining a probability distribution \(p_B(x; \theta_B) = \frac{1}{Z_B}\tilde{p}_B(x; \theta_B)\). A common way to compare the models is to evaluate and compare the likelihood that each model assigns to an i.i.d. test dataset. Suppose the test set consists of \(m\) examples \(\{x^{(1)}, \dots, x^{(m)}\}\).
If the inequality \( \sum_i \log p_A(x^{(i)}; \theta_A) - \sum_i \log p_B(x^{(i)}; \theta_B) > 0 \) holds, we say that \(M_A\) is a better model than \(M_B\) in terms of test log-likelihood. Unfortunately, testing whether this condition holds seems to require knowledge of the partition functions, because evaluating the log probability that each model assigns to a data point requires evaluating the partition function. We can simplify the situation, however, by rearranging the expression into a form that depends only on the ratio of the two models' partition functions.
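Expanding each log probability as log p = log p̃ − log Z and summing over the m test examples gives the rearranged form (presumably the quantity referred to later as equation 18.39):
\[
\sum_i \log p_A(x^{(i)}; \theta_A) - \sum_i \log p_B(x^{(i)}; \theta_B)
= \sum_i \left( \log \tilde{p}_A(x^{(i)}; \theta_A) - \log \tilde{p}_B(x^{(i)}; \theta_B) \right)
- m \log \frac{Z(\theta_A)}{Z(\theta_B)},
\]
so the sign of the left-hand side depends only on the unnormalized terms and on the ratio of the two partition functions.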
Thus we can determine whether \(M_A\) is a better model than \(M_B\) without knowing the partition function of either, using only their ratio. This ratio can be estimated using importance sampling, provided the two models are similar to each other.
If, however, we wanted to compute the actual probability of the test data under either \(M_A\) or \(M_B\), we would need to compute the actual values of the partition functions themselves. That said, if we knew the ratio of the two partition functions, \(r = Z(\theta_B) / Z(\theta_A)\), and we knew the actual value of just one of the two, say \(Z(\theta_A)\), we could compute the value of the other: \(Z(\theta_B) = r\, Z(\theta_A)\).
A simple way to estimate the partition function is to use a Monte Carlo method such as simple importance sampling. We present the approach in terms of continuous variables using integrals, but it is easily adapted to discrete variables by replacing the integrals with summations. We use a proposal distribution \(p_0(x) = \frac{1}{Z_0}\tilde{p}_0(x)\), which supports tractable sampling and tractable evaluation of both the partition function \(Z_0\) and the unnormalized distribution \(\tilde{p}_0(x)\).
This yields a Monte Carlo estimator, \(\hat{Z}_1\), of the integral defining \(Z_1\): we draw samples from the proposal distribution \(p_0(x)\) and weight each sample by the ratio of the unnormalized distributions, \(\tilde{p}_1(x^{(k)}) / \tilde{p}_0(x^{(k)})\).
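A brief sketch of the underlying identity, in the notation above (the resulting estimator is presumably the quantity referred to below as equation 18.44):
\[
Z_1 = \int \tilde{p}_1(x)\, dx
    = \int \frac{p_0(x)}{p_0(x)}\, \tilde{p}_1(x)\, dx
    = Z_0 \int p_0(x)\, \frac{\tilde{p}_1(x)}{\tilde{p}_0(x)}\, dx,
\qquad
\hat{Z}_1 = \frac{Z_0}{K} \sum_{k=1}^{K} \frac{\tilde{p}_1(x^{(k)})}{\tilde{p}_0(x^{(k)})},
\quad x^{(k)} \sim p_0.
\]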
We see also that this approach allows us to estimate the ratio between the partition functions as
\[
\frac{Z_1}{Z_0} \approx \frac{1}{K} \sum_{k=1}^{K} \frac{\tilde{p}_1(x^{(k)})}{\tilde{p}_0(x^{(k)})}, \qquad x^{(k)} \sim p_0. \qquad (18.45)
\]
This value can then be used directly to compare two models, as described in equation 18.39.
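As a minimal sketch (illustrative only), the estimator in equation 18.45 can be checked on two unnormalized 1-D Gaussians, for which the true partition functions, and hence the true ratio, are known in closed form:

```python
# A minimal sketch of simple importance sampling for the ratio Z_1 / Z_0,
# using p~_i(x) = exp(-x^2 / (2 sigma_i^2)), for which Z_i = sqrt(2 pi) * sigma_i.
import numpy as np

def log_p_tilde(x, sigma):
    return -0.5 * (x / sigma) ** 2

rng = np.random.default_rng(0)
sigma0, sigma1 = 1.0, 1.3
K = 100000

x = rng.normal(scale=sigma0, size=K)                              # x^(k) ~ p_0
weights = np.exp(log_p_tilde(x, sigma1) - log_p_tilde(x, sigma0)) # p~_1 / p~_0 at each sample

print("estimated Z1/Z0:", weights.mean(), "   true Z1/Z0:", sigma1 / sigma0)
# The spread of the importance weights indicates the quality of the estimate;
# it grows quickly as p_0 and p_1 become less similar.
print("weight std / weight mean:", weights.std() / weights.mean())
```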
If the distribution \( p_0 \) is close to \( p_1 \), equation 18.44 can be an effective way of estimating the partition function (Minka, 2005). Unfortunately, most of the time \( p_1 \) is both complicated (usually multimodal) and defined over a high-dimensional space, making it difficult to find a tractable \( p_0 \) that is simple enough to evaluate while still being close enough to \( p_1 \) to yield a high-quality approximation. If \( p_0 \) and \( p_1 \) are not close, most samples from \( p_0 \) will have low probability under \( p_1 \) and therefore make a (relatively) negligible contribution to the sum in equation 18.44. Having few samples with significant weights in this sum results in a poor-quality estimator with high variance, which can be understood quantitatively through an estimate of the variance of our estimate \( \hat{Z}_1 \).
This quantity is largest when there is significant deviation in the values of the importance weights \( \tilde{p}_1(x^{(k)}) / \tilde{p}_0(x^{(k)}) \).
We now turn to two related strategies developed to cope with the challenge of estimating partition functions for complex distributions over high-dimensional spaces: annealed importance sampling and bridge sampling. Both start with the simple importance sampling strategy introduced above, and both attempt to overcome the problem of the proposal \( p_0 \) being too far from \( p_1 \) by introducing intermediate distributions that bridge the gap between \( p_0 \) and \( p_1 \).
In situations where the Kullback-Leibler divergence \( D_{KL}(p_0 \| p_1) \) is large, that is, where there is little overlap between \( p_0 \) and \( p_1 \), annealed importance sampling (AIS) attempts to bridge the gap by introducing intermediate distributions (Neal, 2001). Consider a sequence of distributions \( p_{\eta_0}, \dots, p_{\eta_n} \), with \( 0 = \eta_0 < \eta_1 \)