The Matrix Calculus You Need For Deep Learning
Terence Parr and Jeremy Howard
July 3, 2018
(We teach in University of San Francisco’s MS in Data Science program and have other nefarious projects underway.)
Generalization of the Jacobian
In this section, we generalize the Jacobian by gathering multiple parameters into a single vector argument: f(x, y, z) becomes f(**x**). Note that the mathematical literature typically writes vectors in bold lowercase, such as **x**, and scalars in italics, such as *x*. Each element of a vector, x_i, is a scalar and is therefore italicized. By default, all vectors are vertical and of size n×1.
Just as we can gather multiple parameters into a single vector, we can gather multiple scalar-valued functions into a vector of functions. Let \( \mathbf{y} = \mathbf{f}(\mathbf{x}) \) be a vector of \( m \) scalar-valued functions, each of which takes a vector \( \mathbf{x} \) of length \( n = |\mathbf{x}| \), where \( |\mathbf{x}| \) is the number of elements in \( \mathbf{x} \). Each function \( f_i \) within \( \mathbf{f} \) returns a scalar, as in \( y_1 = f_1(\mathbf{x}) \) and \( y_2 = f_2(\mathbf{x}) \).
For instance, we’d represent \( f(x,y) = 3x^2y \) and \( g(x,y) = 2x + y^8 \) from the last section as

\( y_1 = f_1(\mathbf{x}) = 3x_1^2 x_2 \)   (substituting \( x_1 \) for \( x \), \( x_2 \) for \( y \))
\( y_2 = f_2(\mathbf{x}) = 2x_1 + x_2^8 \)
It’s very often the case that \( m = n \) because we will have a scalar function result for each element of the \( \mathbf{x} \) vector. For example, consider the identity function \( \mathbf{y} = \mathbf{f}(\mathbf{x}) = \mathbf{x} \):

\( y_1 = f_1(\mathbf{x}) = x_1 \)
\( y_2 = f_2(\mathbf{x}) = x_2 \)
The Jacobian matrix collects all \( m \times n \) possible partial derivatives, where \( m \) is the number of functions and \( n \) is the number of parameters. It is effectively a stack of \( m \) gradients with respect to \( \mathbf{x} \), one per function, describing how each function changes with respect to each parameter.
The partial derivative of a scalar-valued function \( f_i(\mathbf{x}) \) with respect to a vector \( \mathbf{x} \) of length \( n \) is a horizontal vector. The Jacobian has \( n \) columns because there are \( n \) parameters that can influence each function's value, and \( m \) rows because there are \( m \) equations. Visualizing the Jacobian's structure helps in keeping its shape and dimensions straight.
The Jacobian of the identity function \( \mathbf{f}(\mathbf{x}) = \mathbf{x} \), with \( f_i(\mathbf{x}) = x_i \), has \( n \) functions and each function has \( n \) parameters held in a single vector \( \mathbf{x} \). The Jacobian is, therefore, a square matrix since \( m = n \):
\( \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = I \)   (\( I \) is the identity matrix with ones down the diagonal)
Before proceeding, make sure that you can derive each step above. If you get stuck, work on each element of the matrix individually and apply the usual scalar derivative rules. That is a generally useful trick: reduce vector expressions down to a set of scalar expressions, take all of the partial derivatives, then combine the results into vectors and matrices appropriately.
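One way to check your work is numerically. Here is a small sketch (assuming PyTorch is available; `torch.autograd.functional.jacobian` builds the full \( m \times n \) matrix of partials) that computes the Jacobian of the two-function example above and of the identity function:

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # y1 = 3 * x1^2 * x2,  y2 = 2 * x1 + x2^8  (the example above)
    return torch.stack([3 * x[0]**2 * x[1], 2 * x[0] + x[1]**8])

x = torch.tensor([1.0, 2.0])
print(jacobian(f, x))            # 2x2 Jacobian: [[6*x1*x2, 3*x1^2], [2, 8*x2^7]]
print(jacobian(lambda v: v, x))  # identity function f(x) = x gives I
```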
Also make sure that you distinguish between a vertical vector \( \mathbf{x} \) and its transpose \( \mathbf{x}^T \), and between a scalar-valued function \( y = f(\mathbf{x}) \) and a vector-valued function \( \mathbf{y} = \mathbf{f}(\mathbf{x}) \). Keeping these distinctions straight is crucial for accurate analysis.
Derivatives of vector element-wise binary operators
Element-wise binary operations on vectors, such as vector addition \( \mathbf{w} + \mathbf{x} \), are important because we can express many common vector operations, such as the multiplication of a vector by a scalar, as element-wise binary operations. By element-wise binary operations we simply mean applying an operator to the corresponding elements of two vectors to get an output vector whose elements come from the respective input elements. This is how all of the basic math operators are applied by default in libraries such as NumPy and TensorFlow. Examples that often crop up in deep learning are \( \max(\mathbf{w}, \mathbf{x}) \) and \( \mathbf{w} > \mathbf{x} \) (which returns a vector of ones and zeros).
We can generalize element-wise binary operations with the notation \( \mathbf{y} = \mathbf{f}(\mathbf{w}) \bigcirc \mathbf{g}(\mathbf{x}) \), where \( |\mathbf{y}| = |\mathbf{w}| = |\mathbf{x}| \). (Reminder: \( |\mathbf{x}| \) is the number of items in \( \mathbf{x} \).) The symbol \( \bigcirc \) represents any element-wise operator, such as \( + \), and not the function composition operator \( \circ \). The structure of the equation \( \mathbf{y} = \mathbf{f}(\mathbf{w}) \bigcirc \mathbf{g}(\mathbf{x}) \) becomes clearer when we zoom in on the scalar equations:
where we write \( n \) (not \( m \)) equations vertically to emphasize the fact that the result of element-wise operators gives \( m = n \) sized vector results.
Using the ideas from the last section, we can see that the general case for the Jacobian with respect to \( \mathbf{w} \) is the square matrix:
and the Jacobian with respect to \( \mathbf{x} \) is:
Fortunately, the Jacobian is very often a diagonal matrix, a matrix that is zero everywhere but the diagonal. Because this greatly simplifies things, let's examine in detail when the Jacobian of an element-wise operation reduces to a diagonal matrix.
In a diagonal Jacobian, all elements off the diagonal are zero, \( \frac{\partial}{\partial w_j}\big(f_i(\mathbf{w}) \bigcirc g_i(\mathbf{x})\big) = 0 \) for \( j \neq i \). This holds whenever \( f_i \) and \( g_i \) are constants with respect to \( w_j \):
\( \frac{\partial}{\partial w_j} f_i(\mathbf{w}) = \frac{\partial}{\partial w_j} g_i(\mathbf{x}) = 0 \)

Regardless of the operator, if those partial derivatives go to zero, the operation goes to zero, \( 0 \bigcirc 0 = 0 \), no matter what, and the partial derivative of a constant is zero.
Those partial derivatives are zero when \( f_i \) and \( g_i \) are not functions of \( w_j \), which happens when \( f_i \) is purely a function of \( w_i \) and \( g_i \) is purely a function of \( x_i \). For example, \( \mathbf{w} + \mathbf{x} \) sums \( w_i + x_i \). Consequently, \( f_i(\mathbf{w}) \bigcirc g_i(\mathbf{x}) \) reduces to \( f_i(w_i) \bigcirc g_i(x_i) \), which streamlines the analysis.
The partial derivatives of \( f_i(w_i) \) and \( g_i(x_i) \) with respect to \( w_j \) are zero when \( j \neq i \), so the functions behave like constants in that case. Note that the notation \( f_i(w_i) \) is a bit sloppy, because \( f_i \) and \( g_i \) are functions of vectors, not individual elements. Something like \( \hat{f}_i(w_i) = f_i(\mathbf{w}) \) would be more precise, but we stick with the simpler notation, which should feel natural to programmers used to overloading functions.
We’ll take advantage of this simplification later and refer to the constraint that \( f_i(\mathbf{w}) \) and \( g_i(\mathbf{x}) \) access at most \( w_i \) and \( x_i \), respectively, as the element-wise diagonal condition.
Under this condition, the elements along the diagonal of the Jacobian are \( \frac{\partial}{\partial w_i}\big(f_i(w_i) \bigcirc g_i(x_i)\big) \):
(The large “0”s are a shorthand indicating that all of the off-diagonal elements are 0.)
More succinctly, we can write:
\( \frac{\partial \mathbf{y}}{\partial \mathbf{w}} = diag\left( \frac{\partial}{\partial w_1}\big(f_1(w_1) \bigcirc g_1(x_1)\big),\ \frac{\partial}{\partial w_2}\big(f_2(w_2) \bigcirc g_2(x_2)\big),\ \ldots,\ \frac{\partial}{\partial w_n}\big(f_n(w_n) \bigcirc g_n(x_n)\big) \right) \)

where \( diag(\mathbf{x}) \) constructs a matrix whose diagonal elements are taken from vector \( \mathbf{x} \).
Because we do lots of simple vector arithmetic, the general function \( \mathbf{f}(\mathbf{w}) \) in a binary element-wise operation is often just the vector \( \mathbf{w} \); that is, \( f_i(\mathbf{w}) = w_i \). For example, vector addition \( \mathbf{w} + \mathbf{x} \) fits our element-wise diagonal condition because the equations \( y_i = f_i(\mathbf{w}) + g_i(\mathbf{x}) \) reduce to \( y_i = w_i + x_i \), with corresponding partial derivatives:
This gives the identity matrix for both Jacobians, \( \frac{\partial(\mathbf{w}+\mathbf{x})}{\partial \mathbf{w}} = I \) and \( \frac{\partial(\mathbf{w}+\mathbf{x})}{\partial \mathbf{x}} = I \): the square identity matrix of appropriate dimensions, with ones down the diagonal and zeros elsewhere.
Given the simplicity of this special case, \( f_i(\mathbf{w}) \) reducing to \( f_i(w_i) \), you should be able to derive the Jacobians for the common element-wise binary operations on vectors:
Op   Partial with respect to \( \mathbf{w} \)   Partial with respect to \( \mathbf{x} \)
\( + \)   \( \frac{\partial(\mathbf{w}+\mathbf{x})}{\partial \mathbf{w}} = I \)   \( \frac{\partial(\mathbf{w}+\mathbf{x})}{\partial \mathbf{x}} = I \)
\( - \)   \( \frac{\partial(\mathbf{w}-\mathbf{x})}{\partial \mathbf{w}} = I \)   \( \frac{\partial(\mathbf{w}-\mathbf{x})}{\partial \mathbf{x}} = -I \)
\( \otimes \)   \( \frac{\partial(\mathbf{w}\otimes\mathbf{x})}{\partial \mathbf{w}} = diag(\mathbf{x}) \)   \( \frac{\partial(\mathbf{w}\otimes\mathbf{x})}{\partial \mathbf{x}} = diag(\mathbf{w}) \)
\( \oslash \)   \( \frac{\partial(\mathbf{w}\oslash\mathbf{x})}{\partial \mathbf{w}} = diag(\ldots \frac{1}{x_i} \ldots) \)   \( \frac{\partial(\mathbf{w}\oslash\mathbf{x})}{\partial \mathbf{x}} = diag(\ldots \frac{-w_i}{x_i^2} \ldots) \)
The \( \otimes \) operator is element-wise multiplication, often called the Hadamard product, and \( \oslash \) is element-wise division. There is no universally accepted notation for these operations, so we use symbols consistent with our general binary operation notation.
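These diagonal Jacobians are easy to confirm numerically. A sketch assuming PyTorch:

```python
import torch
from torch.autograd.functional import jacobian

w = torch.tensor([1.0, 2.0, 3.0])
x = torch.tensor([4.0, 5.0, 6.0])

print(jacobian(lambda w: w + x, w))  # d(w + x)/dw = I
print(jacobian(lambda w: w * x, w))  # d(w (x) x)/dw = diag(x)
print(jacobian(lambda x: w * x, x))  # d(w (x) x)/dx = diag(w)
```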
Derivatives involving scalar expansion
When we add or multiply a scalar and a vector, we are implicitly expanding the scalar to a vector and then performing an element-wise operation. For instance, adding scalar \( z \) to vector \( \mathbf{x} \), \( \mathbf{y} = \mathbf{x} + z \), is really \( \mathbf{y} = \mathbf{f}(\mathbf{x}) + \mathbf{g}(z) \), where \( \mathbf{f}(\mathbf{x}) = \mathbf{x} \) and \( \mathbf{g}(z) = \vec{1}z \), with \( \vec{1} \) a vector of ones of the same length as \( \mathbf{x} \). The fact that \( z \) does not depend on \( \mathbf{x} \) will simplify the partial derivatives. Similarly, multiplying vector \( \mathbf{x} \) by scalar \( z \), \( \mathbf{y} = \mathbf{x} z \), is really \( \mathbf{y} = \mathbf{f}(\mathbf{x}) \otimes \mathbf{g}(z) = \mathbf{x} \otimes \vec{1}z \), where \( \otimes \) denotes element-wise multiplication (the Hadamard product).
The partial derivatives of vector-scalar addition and multiplication with respect to vector \( \mathbf{x} \) use our element-wise rule:
This follows because the functions \( \mathbf{f}(\mathbf{x}) = \mathbf{x} \) and \( \mathbf{g}(z) = \vec{1}z \) clearly satisfy our element-wise diagonal condition for the Jacobian (that \( f_i(\mathbf{x}) \) refers at most to \( x_i \) and \( g_i(z) \) refers to the \( i \)-th value of the \( \vec{1}z \) vector).
Using the usual rules for scalar partial derivatives, we arrive at the following diagonal elements of the Jacobian for vector-scalar addition:
Computing the partial derivative with respect to the scalar parameter \( z \), however, results in a vertical vector, not a diagonal matrix. The elements of the vector are:
The diagonal elements of the Jacobian for vector-scalar multiplication involve the product rule for scalar derivatives:
The partial derivative with respect to scalar parameter \( z \) is a vertical vector whose elements are:

\( \frac{\partial}{\partial z}(x_i z) = x_i \frac{\partial z}{\partial z} = x_i \)

so \( \frac{\partial \mathbf{y}}{\partial z} = \mathbf{x} \).
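A quick numeric sanity check of the four vector-scalar results, \( \frac{\partial(\mathbf{x}+z)}{\partial \mathbf{x}} = I \), \( \frac{\partial(\mathbf{x}+z)}{\partial z} = \vec{1} \), \( \frac{\partial(\mathbf{x}z)}{\partial \mathbf{x}} = diag(z, \ldots, z) \), and \( \frac{\partial(\mathbf{x}z)}{\partial z} = \mathbf{x} \) (a sketch assuming PyTorch):

```python
import torch
from torch.autograd.functional import jacobian

x = torch.tensor([1.0, 2.0, 3.0])
z = torch.tensor(5.0)

print(jacobian(lambda x: x + z, x))  # I
print(jacobian(lambda z: x + z, z))  # vector of ones
print(jacobian(lambda x: x * z, x))  # diag(z, z, z)
print(jacobian(lambda z: x * z, z))  # x
```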
Vector sum reduction
Summing up the elements of a vector is an important operation in deep learning, such as when computing the network loss function. It is also how we will, for example, express the derivative of the vector dot product and other operations that reduce vectors to scalars.
Let \( y = sum(\mathbf{f}(\mathbf{x})) = \sum_{i=1}^{n} f_i(\mathbf{x}) \). Notice we were careful to leave the parameter as a vector \( \mathbf{x} \) because each function \( f_i \) could use all values in the vector, not just \( x_i \). The sum is over the results of the function, not the parameter. The gradient (\( 1 \times n \) Jacobian) of vector summation is:
(The summation inside the gradient elements can be tricky so make sure to keep your notation consistent.)
Let’s look at the gradient of the simple \( y = sum(\mathbf{x}) \). The function inside the summation is just \( f_i(\mathbf{x}) = x_i \), and the gradient is then:
Because \( \frac{\partial}{\partial x_j} x_i = 0 \) for \( j \neq i \), we can simplify to:
Notice that the result is a horizontal vector full of 1s, not a vertical vector, and so the gradient is \( \nabla y = \vec{1}^T \).
(The \( T \) exponent of \( \vec{1}^T \) represents the transpose, flipping the vertical vector of ones into a horizontal one.) It's very important to keep the shape of all of your vectors and matrices in order; otherwise it is impossible to compute the derivatives of complex functions.
As another example, let’s sum the result of multiplying a vector by a constant scalar. If \( y = sum(\mathbf{x} z) \), then \( f_i(\mathbf{x}, z) = x_i z \). The gradient is:
\( \frac{\partial y}{\partial \mathbf{x}} = \left[ \frac{\partial}{\partial x_1} x_1 z,\ \frac{\partial}{\partial x_2} x_2 z,\ \ldots,\ \frac{\partial}{\partial x_n} x_n z \right] = [z, z, \ldots, z] \)

The derivative with respect to scalar variable \( z \) is \( 1 \times 1 \):

\( \frac{\partial y}{\partial z} = \frac{\partial}{\partial z} \sum_{i=1}^{n} x_i z = \sum_{i=1}^{n} \frac{\partial}{\partial z} x_i z = \sum_{i=1}^{n} x_i = sum(\mathbf{x}) \)
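These two reductions are easy to verify with autograd (a sketch assuming PyTorch):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
z = torch.tensor(4.0, requires_grad=True)

y = torch.sum(x * z)   # y = sum(x z)
y.backward()
print(x.grad)          # [z, z, z] = [4., 4., 4.]
print(z.grad)          # sum(x) = 6.
```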
The Chain Rules
Single-variable chain rule
Consider the nested expression \( \frac{d}{dx}\sin(x^2) = 2x\cos(x^2) \), which follows from the scalar rules \( \frac{d}{dx}x^2 = 2x \) and \( \frac{d}{du}\sin(u) = \cos(u) \). The solution multiplies the derivative of the outer function by the derivative of the inner function, which is exactly the chain rule. In this section we'll explore the general principle at work and give a process that works for highly nested expressions of a single variable.
The chain rule is commonly expressed through nested functions, such as y = f(g(x)), or using function composition notation (f ◦ g)(x) While some sources may use shorthand notation like y' = f'(g(x))g'(x), this can obscure the introduction of an intermediate variable, u = g(x) To avoid confusion and ensure clarity, it is advisable to explicitly define the single-variable chain rule as dy/dx = dy/du * du/dx This formulation helps prevent errors in differentiating with respect to the wrong variable.
To deploy the single-variable chain rule, follow these steps:
1 Introduce intermediate variables for nested subexpressions and for subexpressions of both binary and unary operators; e.g., \( \times \) is binary, and \( \sin(x) \) and other trigonometric functions are usually unary because there is a single operand. This step normalizes all equations to single operators or function applications.
2 Compute derivatives of the intermediate variables with respect to their parameters.
3 Combine all derivatives of intermediate variables by multiplying them together to get the overall result.
4 Substitute intermediate variables back in if any are referenced in the derivative equation.
The third step of the chain rule emphasizes the connection between intermediate results, highlighting how these results are linked together A key aspect of all variations of the chain rule is the multiplication of these intermediate derivatives, which serves as a fundamental principle in its application.
Let’s try this process on \( y = f(g(x)) = \sin(x^2) \):
1 Introduce intermediate variables. Let \( u = x^2 \) represent subexpression \( x^2 \) (shorthand for \( u(x) = x^2 \)). This gives us:

\( u = x^2 \)   (relative to definition \( f(g(x)) \), \( g(x) = x^2 \))
\( y = \sin(u) \)   (\( y = f(u) = \sin(u) \))
The sequence of subexpressions does not influence the final result; however, it is advisable to follow the reverse order of operations, starting from the innermost expression to the outermost This approach ensures that expressions and derivatives rely on previously calculated elements.
2 Compute derivatives.

\( \frac{du}{dx} = 2x \)   (take the derivative with respect to \( x \))
\( \frac{dy}{du} = \cos(u) \)   (take the derivative with respect to \( u \), not \( x \))
3 Combine.

\( \frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx} = \cos(u) \cdot 2x \)
4 Substitute.

\( \frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx} = \cos(x^2) \cdot 2x = 2x\cos(x^2) \)
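This four-step decomposition is exactly what an autograd system does for us. A quick check (a sketch assuming PyTorch):

```python
import math
import torch

x = torch.tensor(1.5, requires_grad=True)
u = x ** 2          # step 1: intermediate variable u = x^2
y = torch.sin(u)    # y = sin(u)
y.backward()        # steps 2 and 3: compute and combine dy/du and du/dx

print(x.grad.item())                 # autograd's dy/dx
print(2 * 1.5 * math.cos(1.5 ** 2))  # hand-derived 2x cos(x^2) at x = 1.5
```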
Notice how easy it is to compute the derivatives of the intermediate variables in isolation. The chain rule says it's legal to do that and tells us how to combine the intermediate results to get \( 2x\cos(x^2) \).
You can think of the combining step of the chain rule in terms of units canceling. If we let \( y \) be miles, \( x \) be the gas tank, and \( u \) be gallons, then \( \frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx} \) reads as miles/tank = miles/gallon × gallons/tank; the gallon units in the numerator and denominator cancel.
The single-variable chain rule can be conceptualized as a dataflow diagram or a chain of operations, resembling an abstract syntax tree familiar to those in compiler design.
Modifications to the function parameter x propagate through a squaring operation and subsequently through a sine operation, ultimately affecting the result y This process can be conceptualized as du/dx representing the transfer of changes from x to u, while dy/du illustrates the relationship between changes in u and the resulting y.
Getting from \( x \) to \( y \) requires an intermediate hop through \( u \). The chain rule is, by convention, written from the output variable down to the parameter(s): \( \frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx} \). The \( x \)-to-\( y \) data flow would be clearer if we reversed the order and wrote the equivalent \( \frac{dy}{dx} = \frac{du}{dx}\frac{dy}{du} \).
The single-variable chain rule applies when there is a direct dataflow path from \( x \) to the root \( y \), meaning changes in \( x \) influence \( y \) in only one way A simpler condition to remember is that none of the intermediate functions, \( u(x) \) and \( y(u) \), should have more than one parameter For example, in the case of \( y(x) = x + x^2 \), introducing an intermediate variable \( u \) transforms it to \( y(x, u) = x + u \), which creates multiple paths from \( x \) to \( y \) In such scenarios, the single-variable total-derivative chain rule will be utilized to address the complexity.
Automatic differentiation comes in two flavors, forward differentiation and backward differentiation, terms you'll encounter in papers and library documentation. Forward differentiation follows the normal data flow, tracking how a change in each parameter propagates to the function output. Backward differentiation works from the output back toward the parameters, computing how the output changes with respect to every parameter at once, which makes it more efficient for functions with many parameters. The following shows the order in which the partial derivatives are computed for the two techniques.
Forward differentiation from \( x \) to \( y \): \( \frac{dy}{dx} = \frac{du}{dx}\frac{dy}{du} \)
Backward differentiation from \( y \) to \( x \): \( \frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx} \)
Automatic differentiation is beyond the scope of this article, but we’re setting the stage for a future article.
Many readers can compute the derivative \( \frac{d}{dx}\sin(x^2) \) in their heads, but our goal is a process that works for much more complicated expressions. This process is also how automatic differentiation works in libraries like PyTorch, so by practicing manual derivative calculations you are also learning what happens inside the functions you define for custom neural networks in PyTorch.
When dealing with deeply nested expressions, it helps to apply the chain rule the way a compiler unravels nested function calls such as \( f_4(f_3(f_2(f_1(x)))) \) into a sequence of calls. The result of each function call is stored in a temporary variable, a register, which is then used as a parameter to the next function in the sequence. Let's illustrate the process on a more complicated equation: \( y = f(x) = \ln(\sin(x^3)^2) \).
1 Introduce intermediate variables.

\( u_1 = f_1(x) = x^3 \)
\( u_2 = f_2(u_1) = \sin(u_1) \)
\( u_3 = f_3(u_2) = u_2^2 \)
\( u_4 = f_4(u_3) = \ln(u_3) \)   (\( y = u_4 \))

2 Compute derivatives.

\( \frac{du_1}{dx} = 3x^2 \),  \( \frac{du_2}{du_1} = \cos(u_1) \),  \( \frac{du_3}{du_2} = 2u_2 \),  \( \frac{du_4}{du_3} = \frac{1}{u_3} \)
3 Combine the four intermediate values.

\( \frac{dy}{dx} = \frac{du_4}{dx} = \frac{du_4}{du_3}\frac{du_3}{du_2}\frac{du_2}{du_1}\frac{du_1}{dx} = \frac{1}{u_3} \cdot 2u_2 \cdot \cos(u_1) \cdot 3x^2 = \frac{6 u_2 x^2 \cos(u_1)}{u_3} \)
4 Substitute.

\( \frac{dy}{dx} = \frac{6\sin(u_1) x^2 \cos(x^3)}{u_2^2} = \frac{6\sin(x^3) x^2 \cos(x^3)}{\sin(u_1)^2} = \frac{6\sin(x^3) x^2 \cos(x^3)}{\sin(x^3)^2} = \frac{6 x^2 \cos(x^3)}{\sin(x^3)} \)

Here is a visualization of the data flow through the chain of operations from \( x \) to \( y \):
At this point, we can handle the derivative of nested expressions of a single variable, \( x \), using the chain rule, but only if \( x \) can affect \( y \) through a single data flow path. To handle more complicated expressions, we need to extend our technique.
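Before moving on, here is a check of the result we just derived, comparing autograd against \( \frac{6x^2\cos(x^3)}{\sin(x^3)} \) (a sketch assuming PyTorch):

```python
import math
import torch

x = torch.tensor(0.7, requires_grad=True)
u1 = x ** 3            # f1
u2 = torch.sin(u1)     # f2
u3 = u2 ** 2           # f3
y = torch.log(u3)      # f4: y = ln(sin(x^3)^2)
y.backward()

print(x.grad.item())                                     # autograd's dy/dx
print(6 * 0.7**2 * math.cos(0.7**3) / math.sin(0.7**3))  # 6x^2 cos(x^3) / sin(x^3)
```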
Single-variable total-derivative chain rule
The single-variable chain rule has limited use since it requires all intermediate variables to be functions of a single variable Nonetheless, it effectively illustrates the fundamental principle of the chain rule, which involves multiplying the derivatives of intermediate subexpressions To address more complex expressions like y = f(x) = x + x², it is essential to enhance the basic chain rule.
Applying the scalar addition derivative rule gives \( \frac{dy}{dx} = \frac{d}{dx}x + \frac{d}{dx}x^2 = 1 + 2x \), but that is not a chain-rule computation. Naively applying the single-variable chain rule gives the wrong answer, because the derivative operator \( \frac{d}{dx} \) does not apply to multivariate functions such as \( u_2 \), which depends on the intermediate variable \( u_1 \). Here we have \( u_1(x) = x^2 \) and \( u_2(x, u_1) = x + u_1 \), with \( y = f(x) = u_2(x, u_1) \).
Let’s try it anyway to see what happens. If we pretend that \( \frac{du_2}{du_1} = 0 + 1 = 1 \) and \( \frac{du_1}{dx} = 2x \), then \( \frac{dy}{dx} = \frac{du_2}{dx} = \frac{du_2}{du_1}\frac{du_1}{dx} = 2x \), instead of the right answer \( 1 + 2x \).
Because \( u_2(x, u_1) = x + u_1 \) has multiple parameters, partial derivatives come into play. Let's blindly apply the partial derivative operator to all of our equations and see what we get:
\( \frac{\partial u_1(x)}{\partial x} = \frac{\partial}{\partial x} x^2 = 2x \)
\( \frac{\partial u_2(x, u_1)}{\partial x} = \frac{\partial}{\partial x}(x + u_1) = 1 + 0 = 1 \)   (something's not quite right here!)
The calculation \( \frac{\partial u_2(x, u_1)}{\partial x} = 1 \) is incorrect because it violates a key assumption of partial derivatives: that the other variables are held constant while one variable changes. Here, \( u_1(x) = x^2 \) depends on \( x \), so it cannot be treated as a constant; \( \frac{\partial u_1(x)}{\partial x} \) is not zero. A look at the data flow diagram for \( y = u_2(x, u_1) \) shows multiple paths from \( x \) to \( y \), so we must account for both the direct dependence on \( x \) and the indirect dependence through \( u_1(x) \).
A change in \( x \) affects \( y \) both as an operand of the addition and as the operand of the square operator. Here's an equation that describes how tweaks to \( x \) affect the output:

\( \hat{y} = (x + \Delta x) + (x + \Delta x)^2 \)
Then \( \Delta y = \hat{y} - y \), which we can read as “the change in \( y \) is the difference between \( y \) at the tweaked \( x \) and the original \( y \).”
If we let \( x = 1 \), then \( y = 1 + 1^2 = 2 \). If we bump \( x \) by 1, \( \Delta x = 1 \), then \( \hat{y} = (1+1) + (1+1)^2 = 2 + 4 = 6 \). The change in \( y \) is not 1, as \( \frac{\partial u_2(x, u_1)}{\partial x} = 1 \) would lead us to believe, but \( 6 - 2 = 4 \)!
The law of total derivatives says that, to compute \( \frac{dy}{dx} \), we need to sum up all possible contributions from changes in \( x \) to the change in \( y \). The total derivative with respect to \( x \) assumes all variables, such as \( u_1 \) here, are functions of \( x \) and potentially vary as \( x \) varies. The total derivative of \( f(x) = u_2(x, u_1) \), which depends on \( x \) directly and indirectly via the intermediate variable \( u_1(x) \), is:

\( \frac{dy}{dx} = \frac{\partial f(x, u_1)}{\partial x} = \frac{\partial u_2}{\partial x}\frac{\partial x}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x} = \frac{\partial u_2}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x} \)

Using this formula, we get the proper answer:

\( \frac{dy}{dx} = \frac{\partial f(x, u_1)}{\partial x} = \frac{\partial u_2}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x} = 1 + 1 \times 2x = 1 + 2x \)

That is an application of what we can call the single-variable total-derivative chain rule:

\( \frac{dy}{dx} = \frac{\partial f(x, u_1, \ldots, u_n)}{\partial x} = \frac{\partial f}{\partial x} + \sum_{i=1}^{n} \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x} \)
The total derivative assumes all variables are potentially codependent, whereas the partial derivative assumes all variables but \( x \) are constants.
The notation here uses partial derivatives for \( f \) and the \( u_i \) because they are functions of multiple variables. This follows MathWorld's notation; Wikipedia instead writes the left-hand side as \( df(x, u_1, \ldots, u_n)/dx \), which emphasizes that it is a total derivative. We'll stick with the partial-derivative notation so that it is consistent with our description of the vector chain rule below.
The total derivative with respect to \( x \) assumes that the other variables might also be functions of \( x \), and so it includes their contributions. The left side of the equation looks like a typical partial derivative, but the right side is actually the total derivative. In practice, the temporary variables are often functions of a single parameter, in which case the single-variable total-derivative chain rule degenerates to the single-variable chain rule.
Let’s look at a nested subexpression, such as \( f(x) = \sin(x + x^2) \). We introduce three intermediate variables:

\( u_1(x) = x^2 \)
\( u_2(x, u_1) = x + u_1 \)
\( u_3(u_2) = \sin(u_2) \)   (\( y = f(x) = u_3(u_2) \))

and partials:
\( \frac{\partial u_1}{\partial x} = 2x \)
\( \frac{\partial u_2}{\partial x} = 1 + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x} = 1 + 2x \)
\( \frac{\partial f(x)}{\partial x} = \frac{\partial u_3}{\partial x} + \frac{\partial u_3}{\partial u_2}\frac{\partial u_2}{\partial x} = 0 + \cos(u_2)\frac{\partial u_2}{\partial x} = \cos(x + x^2)(1 + 2x) \)

where both \( \frac{\partial u_2}{\partial x} \) and \( \frac{\partial f(x)}{\partial x} \) have \( \frac{\partial u_i}{\partial x} \) terms that take into account the total derivative.
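A numeric check of this total derivative (a sketch assuming PyTorch):

```python
import math
import torch

x = torch.tensor(1.3, requires_grad=True)
y = torch.sin(x + x ** 2)     # y = sin(x + x^2)
y.backward()

print(x.grad.item())                           # autograd's dy/dx
print(math.cos(1.3 + 1.3**2) * (1 + 2 * 1.3))  # cos(x + x^2)(1 + 2x) at x = 1.3
```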
Also notice that the total derivative formula always sums, rather than, say, multiplies, the \( \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x} \) terms.
It's tempting to think that summing terms makes sense only when the expression involves addition, but the total derivative is a weighted sum of all contributions of \( x \) to the change in \( y \), whatever the operation. For example, for \( y = x \times x^2 \) instead of \( y = x + x^2 \), the total-derivative chain rule still adds partial derivative terms. (\( x \times x^2 \) simplifies to \( x^3 \), but for this demonstration let's not combine the terms.) The intermediate variables are \( u_1(x) = x^2 \) and \( u_2(x, u_1) = x u_1 \), with \( y = f(x) = u_2(x, u_1) \). The form of the total derivative remains the same, however:

\( \frac{dy}{dx} = \frac{\partial u_2}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x} = u_1 + x \cdot 2x = x^2 + 2x^2 = 3x^2 \)
It’s the partials (weights) that change, not the formula, when the intermediate variable operators change.
Introducing intermediate variables for non-nested subexpressions, like x² in x + x², serves three key purposes: it simplifies the computation of derivatives for these subexpressions, enhances the application of the chain rule, and reflects the methodology used in automatic differentiation within neural network libraries.
To simplify the single-variable total-derivative chain rule, we can utilize intermediate variables more aggressively, aiming to eliminate the conspicuous ∂f/∂x term This approach will help streamline the expression and enhance clarity in our final formulation.
We can achieve that by simply introducing a new temporary variable as an alias for \( x \): \( u_{n+1} = x \). Then the formula reduces to our final form:

\( \frac{dy}{dx} = \frac{\partial f(u_1, \ldots, u_{n+1})}{\partial x} = \sum_{i=1}^{n+1} \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x} \)
The chain rule based on the total derivative simplifies to the single-variable chain rule when all intermediate variables depend on a single variable This more general formula encompasses both scenarios, resembling a vector dot product or multiplication It's important to clarify terminology, as the term "multivariable chain rule" can be misleading; it refers only to intermediate variables as multivariate functions, while the overall function remains a scalar dependent on a single variable To avoid confusion, we refer to it as the "single-variable total-derivative chain rule," distinguishing it from the simpler single-variable chain rule.
Vector chain rule
Having mastered the total-derivative chain rule, we are now prepared to explore the chain rule for vector functions and vector variables Interestingly, this broader chain rule mirrors the simplicity of the single-variable chain rule for scalars Instead of merely presenting the vector chain rule, we will rediscover it through the process of calculating the derivative of a sample vector function with respect to a scalar, y = f(x), enabling us to derive a general formula.
Let’s introduce two intermediate variables, \( g_1 \) and \( g_2 \), one for each \( f_i \), so that \( \mathbf{y} \) looks more like \( \mathbf{y} = \mathbf{f}(\mathbf{g}(x)) \):

\( g_1(x) = x^2 \)     \( f_1(\mathbf{g}) = \ln(g_1) \)
\( g_2(x) = 3x \)     \( f_2(\mathbf{g}) = \sin(g_2) \)
The derivative of vector y with respect to scalar x is a vertical vector with elements computed using the single-variable total-derivative chain rule:
We have derived the answer using the scalar rules, albeit with the derivatives grouped into a vector. Now let's try to abstract from that result what the answer looks like in vector form. The goal is to convert the vector of scalar operations into a single vector operation.
If we split the terms, isolating the \( \frac{\partial g_j}{\partial x} \) terms into a vector, we get a matrix-by-vector multiplication:
That means that the Jacobian is the multiplication of two other Jacobians, which is kinda cool. Let’s check our results:
The vector chain rule for functions of a single parameter yields results consistent with the scalar approach, confirming its accuracy and reflecting the principles of the single-variable chain rule.
Compare this vector rule, \( \frac{\partial}{\partial x}\mathbf{f}(\mathbf{g}(x)) = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial x} \), with the single-variable chain rule: \( \frac{d}{dx}f(g(x)) = \frac{df}{dg}\frac{dg}{dx} \)
To adapt this formula to multiple parameters, i.e. a vector \( \mathbf{x} \), we simply replace \( x \) with \( \mathbf{x} \) in the equation. Then \( \frac{\partial \mathbf{g}}{\partial \mathbf{x}} \), and therefore the overall Jacobian \( \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \), become matrices instead of vertical vectors. This gives us the complete vector chain rule:
\( \frac{\partial}{\partial \mathbf{x}}\mathbf{f}(\mathbf{g}(\mathbf{x})) = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \mathbf{x}} \)   (Note: matrix multiplication doesn't commute; the order of \( \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \mathbf{x}} \) matters.)
The beauty of the vector formula over the single-variable chain rule is that it automatically takes into consideration the total derivative while maintaining the same notational simplicity. The Jacobian contains all possible combinations of \( f_i \) with respect to \( g_j \) and \( g_i \) with respect to \( x_j \):

\( \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial g_1} & \cdots & \frac{\partial f_1}{\partial g_k} \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial g_1} & \cdots & \frac{\partial f_m}{\partial g_k} \end{bmatrix} \begin{bmatrix} \frac{\partial g_1}{\partial x_1} & \cdots & \frac{\partial g_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial g_k}{\partial x_1} & \cdots & \frac{\partial g_k}{\partial x_n} \end{bmatrix} \)
where \( m = |\mathbf{f}| \), \( n = |\mathbf{x}| \), and \( k = |\mathbf{g}| \). The resulting Jacobian is \( m \times n \) (an \( m \times k \) matrix multiplied by a \( k \times n \) matrix).
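Here is a sketch (assuming PyTorch) that checks the chain rule as a product of Jacobians for the example above, \( \mathbf{g}(x) = [x^2,\ 3x] \) and \( \mathbf{f}(\mathbf{g}) = [\ln(g_1),\ \sin(g_2)] \):

```python
import torch
from torch.autograd.functional import jacobian

def g(x):
    return torch.stack([x ** 2, 3 * x])  # g(x) = [x^2, 3x]

def f(g_vec):
    # f(g) = [ln(g1), sin(g2)]
    return torch.stack([torch.log(g_vec[0]), torch.sin(g_vec[1])])

x = torch.tensor(2.0)
df_dg = jacobian(f, g(x))              # 2x2 Jacobian of f with respect to g
dg_dx = jacobian(g, x)                 # dg/dx, a 2-vector here since x is a scalar
print(df_dg @ dg_dx)                   # product of the two Jacobians
print(jacobian(lambda x: f(g(x)), x))  # Jacobian of the composition, computed directly
```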
The formula ∂f/∂g ∂g/∂x can be simplified for many applications where the Jacobians are square (m=n) and the off-diagonal entries are zero In neural networks, the mathematics focuses on functions of vectors rather than vectors of functions For instance, the neuron affine function includes the term sum(w⊗x), while the activation function is represented as max(0,x) Derivatives of these functions will be explored in the following section.
As we saw in a previous section, element-wise operations on vectors \( \mathbf{w} \) and \( \mathbf{x} \) yield diagonal matrices with elements \( \frac{\partial w_i}{\partial x_i} \) because \( w_i \) is a function purely of \( x_i \) and not of \( x_j \) for \( j \neq i \). The same thing happens here when \( f_i \) is purely a function of \( g_i \) and \( g_i \) is purely a function of \( x_i \).
In this situation, the vector chain rule simplifies to:
\( \frac{\partial}{\partial \mathbf{x}}\mathbf{f}(\mathbf{g}(\mathbf{x})) = diag\left( \frac{\partial f_i}{\partial g_i}\frac{\partial g_i}{\partial x_i} \right) \)

Therefore, the Jacobian reduces to a diagonal matrix whose elements are the single-variable chain-rule values.
After slogging through all of that mathematics, here is the payoff: all you need is the vector chain rule, because the single-variable formulas are special cases of it. The following summarizes the appropriate components to multiply in order to get the Jacobian.
\( \frac{\partial}{\partial \mathbf{x}}\mathbf{f}(\mathbf{g}(\mathbf{x})) = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \mathbf{x}} \)

Whether \( x \) and the intermediate \( \mathbf{u} \) are scalars or vectors, the Jacobian is always the product \( \frac{\partial f}{\partial u}\frac{\partial u}{\partial x} \); only the shapes of the two factors change.
5 The gradient of neuron activation
We now have all the pieces needed to compute the derivative of a typical neuron activation for a single neural-network computation unit with respect to the model parameters, \( \mathbf{w} \) and \( b \):

\( activation(\mathbf{x}) = \max(0, \mathbf{w} \cdot \mathbf{x} + b) \)

This derivative tells us how the neuron's output changes as we tweak the model parameters, which is what we need in order to train the network.
This neuron uses fully connected weights and a rectified linear unit (ReLU) activation function. There are other affine functions, such as convolution, and other activation functions, such as exponential linear units, but they follow similar principles.
To optimize the weights and bias, we need to compute \( \frac{\partial}{\partial \mathbf{w}}(\mathbf{w} \cdot \mathbf{x} + b) \) and \( \frac{\partial}{\partial b}(\mathbf{w} \cdot \mathbf{x} + b) \). We haven't discussed the derivative of the dot product \( y = \mathbf{f}(\mathbf{w}) \cdot \mathbf{g}(\mathbf{x}) \), but we can use the chain rule to avoid having to memorize yet another rule. (Note that \( y \) is a scalar here, not a vector.)
The dot product \( \mathbf{w} \cdot \mathbf{x} \) is just the summation of the element-wise multiplication of the elements:
\( \mathbf{w} \cdot \mathbf{x} = \sum_{i=1}^{n}(w_i x_i) = sum(\mathbf{w} \otimes \mathbf{x}) \), which is the same as \( \mathbf{w}^T \mathbf{x} \) in linear algebra notation. We know how to compute the partial derivatives of \( sum(\mathbf{x}) \) and \( \mathbf{w} \otimes \mathbf{x} \), but not of their combination \( sum(\mathbf{w} \otimes \mathbf{x}) \); for that we need the chain rule. We introduce an intermediate vector variable \( \mathbf{u} = \mathbf{w} \otimes \mathbf{x} \) and rephrase \( y \) as \( y = sum(\mathbf{u}) \).
Once we’ve rephrased y, we recognize two subexpressions for which we already know the partial derivatives:
The vector chain rule says to multiply the partials:
To check our results, we can grind the dot product down into a pure scalar function: \( y = \mathbf{w} \cdot \mathbf{x} = \sum_{i=1}^{n} w_i x_i \)
Hooray! Our scalar results match the vector chain rule results.
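A quick check of the dot-product gradients (a sketch assuming PyTorch):

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
x = torch.tensor([4.0, 5.0, 6.0], requires_grad=True)

y = torch.dot(w, x)   # y = w . x = sum(w (x) x)
y.backward()
print(w.grad)         # dy/dw = x = [4., 5., 6.]
print(x.grad)         # dy/dx = w = [1., 2., 3.]
```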
Now let \( y = \mathbf{w} \cdot \mathbf{x} + b \), the full expression inside the \( \max \) activation function call. We have two different partials to compute, but we don't need the chain rule:
Let's move on to the neuron activation itself, \( \max(0, \mathbf{w} \cdot \mathbf{x} + b) \), which clips all negative values of \( z = \mathbf{w} \cdot \mathbf{x} + b \) to zero. The derivative of \( \max(0, z) \) with respect to scalar \( z \) is a piecewise function: 0 when \( z \leq 0 \) (the output is the constant 0) and \( \frac{dz}{dz} = 1 \) when \( z > 0 \) (the output is \( z \) itself).
Broadcasting functions across scalars involves applying a single-variable function, such as max(0,x), to each element when one or both of the max arguments are vectors This process exemplifies the use of an element-wise unary operator, allowing for efficient computation across multiple values.
For the derivative of the broadcast version then, we get a vector of zeros and ones where:
\( \frac{\partial}{\partial x_i}\max(0, x_i) = \begin{cases} 0 & x_i \leq 0 \\ \frac{dx_i}{dx_i} = 1 & x_i > 0 \end{cases} \)
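In code, this broadcast derivative is just a comparison cast to zeros and ones (a sketch assuming PyTorch):

```python
import torch

x = torch.tensor([-2.0, 0.0, 3.0, 5.0])
relu_grad = (x > 0).float()   # 0 where x_i <= 0, 1 where x_i > 0
print(relu_grad)              # tensor([0., 0., 1., 1.])
```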
To compute the derivative of the activation function, we need the chain rule because of the nested subexpression \( \mathbf{w} \cdot \mathbf{x} + b \). Following our process, let's introduce an intermediate scalar variable \( z \) to represent the affine function: \( z(\mathbf{w}, b, \mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b \), so that \( activation(z) = \max(0, z) \).
The vector chain rule tells us:
\( \frac{\partial activation}{\partial \mathbf{w}} = \frac{\partial activation}{\partial z}\frac{\partial z}{\partial \mathbf{w}} \)

which we can rewrite as follows:
\( \frac{\partial activation}{\partial \mathbf{w}} = \begin{cases} 0 \cdot \frac{\partial z}{\partial \mathbf{w}} = \vec{0}^T & z \leq 0 \\ 1 \cdot \frac{\partial z}{\partial \mathbf{w}} = \frac{\partial z}{\partial \mathbf{w}} = \mathbf{x}^T & z > 0 \end{cases} \)   (we computed \( \frac{\partial z}{\partial \mathbf{w}} = \mathbf{x}^T \) previously)

and then substitute \( z = \mathbf{w} \cdot \mathbf{x} + b \) back in:

\( \frac{\partial activation}{\partial \mathbf{w}} = \begin{cases} \vec{0}^T & \mathbf{w} \cdot \mathbf{x} + b \leq 0 \\ \mathbf{x}^T & \mathbf{w} \cdot \mathbf{x} + b > 0 \end{cases} \)
The gradient with respect to the weights
\( \frac{\partial v}{\partial \mathbf{w}} = \begin{cases} \vec{0}^T & \mathbf{w} \cdot \mathbf{x} + b \leq 0 \\ -\mathbf{x}^T & \mathbf{w} \cdot \mathbf{x} + b > 0 \end{cases} \)

Then, for the overall gradient, we get:
To interpret that equation, we can substitute an error term \( e_i = \mathbf{w} \cdot \mathbf{x}_i + b - y_i \), yielding:
\( \frac{\partial C}{\partial \mathbf{w}} = \frac{2}{N}\sum_{i=1}^{N} e_i \mathbf{x}_i^T \)   (for the nonzero activation case)
The gradient is a weighted average of all the input vectors, where the weights are the error terms, the differences between the target output and the actual neuron output for each \( \mathbf{x}_i \) input. The resulting gradient points, on average, in the direction of higher cost or loss because larger error terms contribute more heavily to their associated inputs. For a single input vector, the gradient is just \( 2e_1\mathbf{x}_1^T \). If the error is 0, the gradient is zero and we have reached the minimum loss. If \( e_1 \) is some small positive difference, the gradient is a small step in the direction of \( \mathbf{x}_1 \); if \( e_1 \) is large, the gradient is a large step in that direction. If \( e_1 \) is negative, the gradient is reversed, meaning the highest cost lies in the negative direction.
Because the gradient indicates the direction of higher cost, we want to update \( \mathbf{w} \) in the opposite direction. To minimize the loss, the gradient-descent recurrence relation nudges the current position by the negative of the gradient, scaled by a scalar learning rate \( \eta \):

\( \mathbf{w}_{t+1} = \mathbf{w}_t - \eta \frac{\partial C}{\partial \mathbf{w}} \)
The derivative with respect to the bias
To optimize the bias, \( b \), we also need the partial with respect to \( b \). Here are the intermediate variables again:

\( u(\mathbf{w}, b, \mathbf{x}) = \max(0, \mathbf{w} \cdot \mathbf{x} + b) \)
\( v(y, u) = y - u \)
We computed the partial with respect to the bias for equation u(w, b,x) previously:
\( \frac{\partial u}{\partial b} = \begin{cases} 0 & \mathbf{w} \cdot \mathbf{x} + b \leq 0 \\ 1 & \mathbf{w} \cdot \mathbf{x} + b > 0 \end{cases} \)

so

\( \frac{\partial v}{\partial b} = \begin{cases} 0 & \mathbf{w} \cdot \mathbf{x} + b \leq 0 \\ -1 & \mathbf{w} \cdot \mathbf{x} + b > 0 \end{cases} \)

And for the partial of the cost function itself we get:
As before, we can substitute an error term:
\( \frac{\partial C}{\partial b} = \frac{2}{N}\sum_{i=1}^{N} e_i \)   (for the nonzero activation case)
The partial derivative is then the average error, or zero, according to the activation level. To update the neuron bias, we nudge it in the opposite direction of increased cost:

\( b_{t+1} = b_t - \eta \frac{\partial C}{\partial b} \)
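Putting the two update rules together, here is a minimal gradient-descent sketch for a single ReLU neuron that uses the gradients derived above (assuming PyTorch; the data, learning rate, and loop length are illustrative, not from the article):

```python
import torch

# illustrative data: N input vectors (rows of X) and nonnegative scalar targets y
N, d = 100, 3
X = torch.randn(N, d)
y = torch.rand(N)

w = torch.zeros(d)
b = torch.tensor(0.5)                    # start with the neuron active
lr = 0.1                                 # learning rate eta

for step in range(200):
    z = X @ w + b                        # affine part w . x + b for every input
    e = z - y                            # error term e_i = w . x_i + b - y_i
    active = (z > 0).float()             # 1 where the piecewise derivative is nonzero
    grad_w = 2 / N * (active * e) @ X    # (2/N) sum_i e_i x_i^T over active inputs
    grad_b = 2 / N * (active * e).sum()  # (2/N) sum_i e_i over active inputs
    w -= lr * grad_w                     # w_{t+1} = w_t - eta * dC/dw
    b -= lr * grad_b                     # b_{t+1} = b_t - eta * dC/db
```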
In practice, it is convenient to fold the bias \( b \) into the weight vector by defining \( \hat{\mathbf{w}} = [\mathbf{w}^T, b]^T \) and appending a 1 to the input vector, \( \hat{\mathbf{x}} = [\mathbf{x}^T, 1]^T \). Then \( \mathbf{w}^T \mathbf{x} + b \) simplifies to \( \hat{\mathbf{w}}^T \hat{\mathbf{x}} \), so we only have one vector parameter and one partial to deal with.
This finishes off the optimization of the neural network loss function because we have the two partials necessary to perform a gradient descent.
Congratulations on reaching this stage in your journey to mastering matrix calculus! To aid your understanding, we’ve provided a summary of all the key rules discussed in this article in the following section Additionally, be sure to explore the annotated resource link below for further insights.
Your next step would be to learn about the partial derivatives of matrices, not just vectors. For example, you can take a look at the matrix differentiation section of Matrix Calculus.
We would like to express our gratitude to Yannet Interian, a faculty member in the MS Data Science program at the University of San Francisco, and David Uminsky, the faculty director of the MS Data Science program, for their invaluable assistance with the notation used in this work.
Gradients and Jacobians
The gradient of a function of two variables is a horizontal 2-vector:

\( \nabla f(x, y) = \left[ \frac{\partial f(x,y)}{\partial x},\ \frac{\partial f(x,y)}{\partial y} \right] \)
The Jacobian of a vector-valued function that is a function of a vector is an \( m \times n \) (\( m = |\mathbf{f}| \) and \( n = |\mathbf{x}| \)) matrix containing all possible scalar partial derivatives:
The Jacobian of the identity function \( \mathbf{f}(\mathbf{x}) = \mathbf{x} \) is \( I \).
Element-wise operations on vectors
Define generic element-wise operations on vectors \( \mathbf{w} \) and \( \mathbf{x} \) using operator \( \bigcirc \), such as \( + \):
The Jacobian with respect to \( \mathbf{w} \) (similar for \( \mathbf{x} \)) is:
Given the constraint (element-wise diagonal condition) that \( f_i(\mathbf{w}) \) and \( g_i(\mathbf{x}) \) access at most \( w_i \) and \( x_i \), respectively, the Jacobian simplifies to a diagonal matrix:
Here are some sample element-wise operators:
Op   Partial with respect to \( \mathbf{w} \)   Partial with respect to \( \mathbf{x} \)
\( + \)   \( \frac{\partial(\mathbf{w}+\mathbf{x})}{\partial \mathbf{w}} = I \)   \( \frac{\partial(\mathbf{w}+\mathbf{x})}{\partial \mathbf{x}} = I \)
\( - \)   \( \frac{\partial(\mathbf{w}-\mathbf{x})}{\partial \mathbf{w}} = I \)   \( \frac{\partial(\mathbf{w}-\mathbf{x})}{\partial \mathbf{x}} = -I \)
\( \otimes \)   \( \frac{\partial(\mathbf{w}\otimes\mathbf{x})}{\partial \mathbf{w}} = diag(\mathbf{x}) \)   \( \frac{\partial(\mathbf{w}\otimes\mathbf{x})}{\partial \mathbf{x}} = diag(\mathbf{w}) \)
\( \oslash \)   \( \frac{\partial(\mathbf{w}\oslash\mathbf{x})}{\partial \mathbf{w}} = diag(\ldots \frac{1}{x_i} \ldots) \)   \( \frac{\partial(\mathbf{w}\oslash\mathbf{x})}{\partial \mathbf{x}} = diag(\ldots \frac{-w_i}{x_i^2} \ldots) \)
Scalar expansion
Adding scalar \( z \) to vector \( \mathbf{x} \), \( \mathbf{y} = \mathbf{x} + z \), is really \( \mathbf{y} = \mathbf{f}(\mathbf{x}) + \mathbf{g}(z) \) where \( \mathbf{f}(\mathbf{x}) = \mathbf{x} \) and \( \mathbf{g}(z) = \vec{1}z \).
Vector reductions
The partial derivative of a vector sum with respect to one of the vectors is:
For \( y = sum(\mathbf{x} z) \) and \( n = |\mathbf{x}| \), we get:
Vector dot product: \( y = \mathbf{f}(\mathbf{w}) \cdot \mathbf{g}(\mathbf{x}) = \sum_{i=1}^{n}(w_i x_i) = sum(\mathbf{w} \otimes \mathbf{x}) \). Substituting \( \mathbf{u} = \mathbf{w} \otimes \mathbf{x} \) and applying the vector chain rule gives \( \frac{\partial \mathbf{u}}{\partial \mathbf{x}} = \frac{\partial}{\partial \mathbf{x}}(\mathbf{w} \otimes \mathbf{x}) = diag(\mathbf{w}) \) and \( \frac{\partial y}{\partial \mathbf{u}} = \frac{\partial}{\partial \mathbf{u}} sum(\mathbf{u}) = \vec{1}^T \), so \( \frac{\partial y}{\partial \mathbf{x}} = \frac{\partial y}{\partial \mathbf{u}} \times \frac{\partial \mathbf{u}}{\partial \mathbf{x}} = \vec{1}^T \times diag(\mathbf{w}) = \mathbf{w}^T \).
Chain rules
The vector chain rule is the general form, as it degenerates to the other chain rules. When \( f \) is a function of a single variable \( x \) and all intermediate variables \( u \) are functions of a single variable, the single-variable chain rule applies. When some or all of the intermediate variables are functions of multiple variables, the single-variable total-derivative chain rule applies. In all other cases, the vector chain rule applies.
Single-variable rule: \( \frac{df}{dx} = \frac{df}{du}\frac{du}{dx} \)
Single-variable total-derivative rule: \( \frac{\partial f(u_1, \ldots, u_n)}{\partial x} = \frac{\partial f}{\partial \mathbf{u}}\frac{\partial \mathbf{u}}{\partial x} \)
Vector rule: \( \frac{\partial}{\partial \mathbf{x}}\mathbf{f}(\mathbf{g}(\mathbf{x})) = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \mathbf{x}} \)
Lowercase letters in bold font such as **x** are vectors and those in italics font like *x* are scalars. *x*ᵢ is the i-th element of vector **x** and is in italics because a single vector element is a scalar. |**x**| means the length of vector **x**, that is, its number of elements.
The \( T \) exponent of \( \mathbf{x}^T \) represents the transpose of the indicated vector.
\( \sum_{i=a}^{b} x_i \) is just a for-loop that iterates \( i \) from \( a \) to \( b \), summing all the \( x_i \).
Notation \( f(x) \) refers to a function called \( f \) with an argument of \( x \).
I represents the square “identity matrix” of appropriate dimensions that is zero everywhere but the diagonal, which contains all ones. diag(x) constructs a matrix whose diagonal elements are taken from vector x.
The dot product \( \mathbf{w} \cdot \mathbf{x} \) is the summation of the element-wise multiplication of the elements: \( \sum_{i=1}^{n}(w_i x_i) = sum(\mathbf{w} \otimes \mathbf{x}) \). Or, you can look at it as \( \mathbf{w}^T \mathbf{x} \).
Differentiation, denoted as \( \frac{d}{dx} \), is an operator that transforms a function of one variable into its derivative Specifically, applying \( \frac{d}{dx} \) to \( f(x) \) yields the derivative of \( f(x) \) with respect to \( x \), represented as \( \frac{df(x)}{dx} \) Furthermore, if \( y = f(x) \), then the relationship can be expressed as \( \frac{dy}{dx} = \frac{df(x)}{dx} = \frac{d}{dx} f(x) \).
The partial derivative of the function with respect to \( x \), \( \frac{\partial}{\partial x} f(x) \), performs the usual scalar derivative holding all other variables constant.
The gradient of f with respect to vector x, ∇f(x), organizes all of the partial derivatives for a specific scalar function.
The Jacobian organizes the gradients of multiple functions into a matrix by stacking them:
The following notation means that \( y \) has the value \( a \) upon \( condition_1 \) and value \( b \) upon \( condition_2 \):

\( y = \begin{cases} a & condition_1 \\ b & condition_2 \end{cases} \)
Wolfram Alpha can do symbolic matrix algebra and there is also a cool dedicated matrix calculus differentiator.
When looking for resources on the web, search for “matrix calculus” not “vector calculus.” Here are some comments on the top links that come up from a Google search:
The Wikipedia article on matrix calculus provides a comprehensive overview of various layout conventions, particularly highlighting the numerator layout where variables are arranged horizontally and functions vertically in the Jacobian It also offers a detailed explanation of total derivatives, although it's important to note that the notation used differs slightly from standard practices, as the article employs the ∂x notation instead of dx.
The Matrix Reference Manual provides a comprehensive section on matrix calculus, detailing essential notations and identities crucial for understanding matrix differentiation It covers differentials for linear, quadratic, and cubic products, as well as the derivatives with respect to both real and complex matrices Additionally, the manual explains the Jacobian and Hessian matrices, highlighting their significance in optimization and integration processes This resource serves as a valuable guide for anyone seeking to deepen their knowledge of matrix calculus and its applications.