
All important cheat sheets for data science, ML, RL, and AI




DOCUMENT INFORMATION

Pages: 33
File size: 11.53 MB

Contents


All important cheat sheets, 25/03/2021. COMPILED BY ABHISHEK PRASAD. Follow me on LinkedIn: www.linkedin.com/in/abhishek-prasad-ap

DATA 1010 · Brown University · Samuel S. Watson

Sets and Functions: Sets

1. A set is an unordered collection of objects. The objects in a set are called elements.
2. The cardinality of a set is the number of elements it contains. The empty set ∅ is the set with no elements.
3. If every element of A is also an element of B, then we say A is a subset of B and write A ⊂ B. If A ⊂ B and B ⊂ A, then we say that A = B.
4. Set operations: (i) an element is in the union A ∪ B of two sets A and B if it is in A or B; (ii) an element is in the intersection A ∩ B if it is in A and B; (iii) an element is in the set difference A \ B if it is in A but not B; (iv) given a set Ω and a set A ⊂ Ω, the complement of A with respect to Ω is Aᶜ = Ω \ A.
5. Two sets A and B are disjoint if A ∩ B = ∅ (in other words, if they have no elements in common). A partition of a set is a collection of nonempty disjoint subsets whose union is the whole set.
6. The Cartesian product of A and B is A × B = {(a, b) : a ∈ A and b ∈ B}.
7. De Morgan's laws: if A, B ⊂ Ω, then (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ and (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ.
8. A list is an ordered collection of finitely many objects. Lists differ from sets in that (i) order matters, (ii) repetition matters, and (iii) the cardinality is restricted to be finite.

Sets and Functions: Functions

1. If A and B are sets, then a function f : A → B is an assignment of some element of B to each element of A. The set A is called the domain of f and B is called the codomain of f.
2. Given a subset A′ of A, we define the image of f, denoted f(A′), to be the set of elements which are mapped to from some element in A′. The range of f is the image of the domain of f.
3. The identity function on a set A is the function f : A → A which maps each element to itself.
4. The composition of two functions f : A → B and g : B → C is the function g ∘ f which maps a ∈ A to g(f(a)) ∈ C.
5. A function f is injective if no two elements in the domain map to the same element in the codomain. A function f is surjective if the range of f is equal to the codomain of f. A function f is bijective if it is both injective and surjective.
6. If f is bijective, then the function from B to A that maps b ∈ B to the element a ∈ A satisfying f(a) = b is called the inverse of f.
7. If f : A → B is bijective, then f⁻¹ ∘ f is equal to the identity function on A, and f ∘ f⁻¹ is the identity function on B.

Programming in Julia

1. A value is a fundamental entity that may be manipulated by a program. Values have types; for example, 1 is an Int and "Hello world!" is a String.
2. A variable is a name used to refer to a value. We can assign a value to a variable x using x = 3.
3. A function performs a particular task. You prompt a function to perform its task by calling it. Values supplied to a function are called arguments; for example, in the function call print(1,2), 1 and 2 are arguments.
4. An operator is a function that can be called in a special way. For example, * is an operator, since we can call the multiplication function with the syntax 3 * 5.
5. A statement is an instruction to be executed (like x = -3). An expression is a combination of values, variables, operators, and function calls that a language interprets and evaluates to a value.
6. A numerical value can be either an integer or a float. The basic operations are +, -, *, /, ^, and expressions are evaluated according to the order of operations.
7. Textual data is represented using strings. length(s) returns the number of characters in s. The * operator concatenates strings.
8. A boolean is a value which is either true or false. Booleans can be combined with the operators && (and), || (or), and ! (not). Numbers can be compared using <, >, ==, ≤, or ≥.
9. Code blocks can be executed conditionally:

    if x > 0
        "x is positive"
    elseif x == 0
        "x is zero"
    else
        "x is negative"
    end

10. Functions may be defined using the familiar math notation, f(x,y) = 3x + 2y, or using a function block (shift is a keyword argument):

    function f(x, y; shift=0)
        3x + 2y + shift
    end

11. The scope of a variable is the region in the program where it is accessible. Variables defined in the body of a function are not accessible outside the body of the function.
12. Array is a compound data type for storing lists of objects. Entries of an array may be accessed with square bracket syntax using an index or using a range object a:b:

    A = [-5, 3, 2, 1]
    A[2]
    A[3:end]

13. An array comprehension can be used to generate new arrays: [k^2 for k=1:10 if mod(k,2) == 0].
14. A dictionary encodes a discrete function by storing input-output pairs and looking up input values when indexed. This expression returns [0,0,1.0]:

    Dict("blue" => [0,0,1.0], "red" => [1.0,0,0])["blue"]

15. A while loop takes a conditional expression and a body and evaluates them alternatingly until the conditional expression returns false. A for loop evaluates its body once for each entry in a given iterator (for example, a range, array, or dictionary). Each value in the iterator is assigned to a loop variable which can be referenced in the body of the loop:

    while x > 0
        x -= 1
    end

    for i = 1:10
        print(i)
    end
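A minimal Julia sketch tying the constructs above together (the function name even_squares and the sample data are illustrative, not from the original):

    # Functions, comprehensions, dictionaries, and loops in one example.
    function even_squares(n)
        return [k^2 for k in 1:n if mod(k, 2) == 0]   # array comprehension
    end

    bounds = Dict("small" => 4, "large" => 10)        # input-output pairs
    for (name, n) in bounds                           # iterate over a dictionary
        println(name, " => ", even_squares(n))
    end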
Linear Algebra: Vector Spaces

1. A vector in Rⁿ is a column of n real numbers, also written as [v₁, …, vₙ]. A vector may be depicted as an arrow from the origin in n-dimensional space. The norm of a vector v is the length √(v₁² + ⋯ + vₙ²) of its arrow.
2. The fundamental vector space operations are vector addition and scalar multiplication.
3. A vector space is a nonempty set of vectors which is closed under the vector space operations.
4. A linear combination of a list of vectors v₁, …, vₖ is an expression of the form c₁v₁ + c₂v₂ + ⋯ + cₖvₖ, where c₁, …, cₖ are real numbers. The c's are called the weights of the linear combination.
5. The span of a list L of vectors is the set of all vectors which can be written as a linear combination of the vectors in L.
6. A list of vectors is linearly independent if and only if the only linear combination which yields the zero vector is the one with all weights zero.
7. A list of vectors in a vector space is a spanning list of that vector space if every vector in the vector space can be written as a linear combination of the vectors in that list.
8. A linearly independent spanning list of a vector space is called a basis of that vector space. The number of vectors in a basis of a vector space is called the dimension of the space.
9. A linear transformation L is a function from a vector space V to a vector space W which satisfies L(cu + v) = cL(u) + L(v) for all c ∈ R and u, v ∈ V. These are "flat maps": equally spaced lines are mapped to equally spaced lines or points. Examples: scaling, rotation, projection, reflection.
10. Given two vector spaces V and W, a basis {v₁, …, vₙ} of V, and a list {w₁, …, wₙ} of vectors in W, there exists one and only one linear transformation which maps v₁ to w₁, v₂ to w₂, and so on.
11. The rank of a linear transformation from one vector space to another is the dimension of its range.
12. The null space of a linear transformation is the set of vectors which are mapped to the zero vector by the linear transformation.
13. The rank of a transformation plus the dimension of its null space is equal to the dimension of its domain (the rank-nullity theorem).

Linear Algebra: Matrix Algebra

1. The matrix-vector product Ax is the linear combination of the columns of A with weights given by the entries of x.
2. Linear transformations from Rⁿ to Rᵐ are in one-to-one correspondence with m × n matrices. An m × n matrix is full rank if its rank is equal to min(m, n).
3. Ax = b has a solution x if and only if b is in the span of the columns of A. If Ax = b does have a solution, then the solution is unique if and only if the columns of A are linearly independent. If Ax = b does not have a solution, then there is a unique vector x which minimizes |Ax − b|².
4. If the columns of a square matrix A are linearly independent, then it has a unique inverse matrix A⁻¹ with the property that Ax = b implies x = A⁻¹b for all x and b. Matrix inversion satisfies (AB)⁻¹ = B⁻¹A⁻¹ if A and B are both invertible.
5. The identity transformation corresponds to the identity matrix, which has entries of 1 along the diagonal and zero entries elsewhere.
6. Matrix multiplication corresponds to composition of the corresponding linear transformations: AB is the matrix for which (AB)(x) = A(Bx) for all x.
7. The transpose Aᵀ of a matrix A is defined so that the rows of Aᵀ are the columns of A (and vice versa).
8. The transpose is a linear operator: (cA + B)ᵀ = cAᵀ + Bᵀ if c is a constant and A and B are matrices.
9. The transpose distributes across matrix multiplication but with an order reversal: (AB)ᵀ = BᵀAᵀ if A and B are matrices for which AB is defined.
10. A matrix A is symmetric if Aᵀ = A.
11. A linear transformation T from Rⁿ to Rⁿ scales all n-dimensional volumes by the same factor: the (absolute value of the) determinant of T. The sign of the determinant tells us whether T reverses orientations.
12. det AB = det A det B, and det A⁻¹ = (det A)⁻¹.
13. A square matrix is invertible if and only if its determinant is nonzero.
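A quick numerical check of several of these identities in Julia (the matrix values are arbitrary illustrations):

    using LinearAlgebra

    A = [1.0 2.0; 3.0 4.0]
    B = [0.0 1.0; 1.0 1.0]

    @assert transpose(A * B) ≈ transpose(B) * transpose(A)   # (AB)ᵀ = BᵀAᵀ
    @assert inv(A * B) ≈ inv(B) * inv(A)                     # (AB)⁻¹ = B⁻¹A⁻¹
    @assert det(A * B) ≈ det(A) * det(B)                     # det AB = det A det B

    # When Ax = b has no solution, A \ b returns the least-squares x
    # minimizing |Ax − b|² (here A is a tall 3×2 matrix):
    A_tall = [1.0 0.0; 0.0 1.0; 1.0 1.0]
    b = [1.0, 2.0, 0.0]
    x = A_tall \ b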
Linear Algebra: Orthogonality

1. The dot product of two vectors in Rⁿ is defined by x · y = x₁y₁ + x₂y₂ + ⋯ + xₙyₙ.
2. x · y = |x||y| cos θ, where x, y ∈ Rⁿ and θ is the angle between the vectors. x · y = 0 if and only if x and y are orthogonal.
3. The dot product is linear: x · (cy + z) = c x · y + x · z.
4. The orthogonal complement of a subspace V ⊂ Rⁿ is the set of vectors which are orthogonal to every vector in V.
5. The orthogonal complement of the span of the columns of a matrix A is equal to the null space of Aᵀ.
6. rank A = rank AᵀA for any matrix A.
7. A list of vectors satisfying vᵢ · vⱼ = 0 for i ≠ j is orthogonal. An orthogonal list of unit vectors is orthonormal.
8. Every orthogonal list is linearly independent.
9. A matrix U has orthonormal columns if and only if UᵀU = I. A square matrix with orthonormal columns is called orthogonal. An orthogonal matrix and its transpose are inverses.
10. Orthogonal matrices represent rigid transformations (ones which preserve lengths and angles).
11. If U has orthonormal columns, then UUᵀ is the matrix which represents projection onto the span of the columns of U.
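The projection fact in item 11 can be verified in Julia, using QR factorization to produce orthonormal columns (the vectors are arbitrary examples):

    using LinearAlgebra

    A = [1.0 0.0; 1.0 1.0; 0.0 1.0]
    U = Matrix(qr(A).Q)                  # 3×2 matrix with orthonormal columns

    @assert U' * U ≈ I(2)                # UᵀU = I
    P = U * U'                           # projection onto the span of A's columns

    v = [1.0, 2.0, 3.0]
    @assert P * (P * v) ≈ P * v          # projecting twice changes nothing
    @assert abs(dot(v - P * v, U[:, 1])) < 1e-10   # residual is orthogonal to the span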
Linear Algebra: Spectral Analysis

1. An eigenvector v of an n × n matrix A is a nonzero vector with the property that Av = λv for some λ ∈ R. We call λ an eigenvalue. If v is an eigenvector of A, then A maps the line span({v}) to itself.
2. Eigenvectors of A with distinct eigenvalues are linearly independent.
3. Not every n × n matrix A has n linearly independent eigenvectors. If A does have n linearly independent eigenvectors, we can make a matrix V with these eigenvectors as columns and get AV = VΛ ⟹ A = VΛV⁻¹ (the diagonalization of A), where Λ is a diagonal matrix of eigenvalues.
4. If A = VΛV⁻¹, then Aⁿ = VΛⁿV⁻¹.
5. If A is a symmetric matrix, then A is orthogonally diagonalizable: A = VΛVᵀ, where V is an orthogonal matrix (the spectral theorem).
6. A symmetric matrix is positive semidefinite if its eigenvalues are all nonnegative. We define the square root of a positive semidefinite matrix A = VΛVᵀ to be √A = V√ΛVᵀ, where √Λ is obtained by applying the square root function elementwise.

Linear Algebra: SVD

1. The Gram matrix AᵀA of any m × n matrix A is positive semidefinite. Furthermore, |√(AᵀA)x| = |Ax| for all x ∈ Rⁿ.
2. The singular value decomposition is the factorization of any rectangular m × n matrix A as UΣVᵀ, where U and V are orthogonal and Σ is an m × n diagonal matrix (with diagonal entries in decreasing order). [Figure: a worked 2 × 2 example shown geometrically, with Vᵀ a −73.2° turn, Σ = diag(2.303, 1.303), and U a 16.8° turn.]
3. The diagonal entries of Σ are the singular values of A, and the columns of U and V are called left singular vectors and right singular vectors, respectively. A maps each right singular vector vᵢ to the corresponding left singular vector uᵢ scaled by σᵢ.
4. The vectors in Rⁿ stretched the most by A are the ones which run in the direction of the column or columns of V corresponding to the greatest singular value. Same for least.
5. For k ≥ 1, the k-dimensional vector space with minimal sum of squared distances to the columns of A (interpreted as points in Rᵐ) is the span of the first k columns of U.
6. The absolute value of the determinant of a square matrix is equal to the product of its singular values.
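These properties are easy to check numerically; a short Julia sketch (the matrices are arbitrary examples):

    using LinearAlgebra

    A = [2.0 1.0; 1.0 3.0; 0.0 1.0]
    F = svd(A)                          # F.U, F.S (singular values), F.V

    @assert F.U * Diagonal(F.S) * F.V' ≈ A          # A = UΣVᵀ
    @assert A * F.V[:, 1] ≈ F.S[1] * F.U[:, 1]      # A v₁ = σ₁ u₁

    # For a square matrix, |det A| equals the product of the singular values.
    B = [1.0 2.0; 3.0 4.0]
    @assert abs(det(B)) ≈ prod(svd(B).S)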
Multivariable Calculus

1. A sequence of real numbers (xₙ) = x₁, x₂, … converges to a number x ∈ R if the distance from xₙ to x on the number line can be made as small as desired by choosing n sufficiently large. We say limₙ→∞ xₙ = x, or xₙ → x.
2. (Squeeze theorem) If aₙ ≤ bₙ ≤ cₙ for all n ≥ 1 and if limₙ→∞ aₙ = limₙ→∞ cₙ = b, then bₙ → b as n → ∞.
3. (Comparison test) If ∑ bₙ converges and |aₙ| ≤ bₙ for all n, then ∑ aₙ converges. Conversely, if ∑ bₙ does not converge and 0 ≤ bₙ ≤ aₙ, then ∑ aₙ also does not converge.
4. The series ∑ nᵖ converges if and only if p < −1. The series ∑ aⁿ converges if and only if −1 < a < 1.
5. The Taylor series, centered at c, of an infinitely differentiable function f is defined to be
   f(c) + f′(c)(x − c) + f″(c)/2! (x − c)² + f‴(c)/3! (x − c)³ + ⋯
6. We can multiply or add Taylor series term-by-term, we can integrate or differentiate a Taylor series term-by-term, and we can substitute one Taylor series into another to obtain a Taylor series for the composition.
7. The partial derivative ∂f/∂x(x₀, y₀) of a function f(x, y) at a point (x₀, y₀) is the slope of the graph of f in the x-direction at the point (x₀, y₀).
8. Given f : Rⁿ → Rᵐ, we define ∂f/∂x to be the matrix whose (i, j)th entry is ∂fᵢ/∂xⱼ. Then (i) ∂/∂x (Ax) = A, (ii) ∂/∂x (xᵀA) = Aᵀ, and (iii) ∂/∂x (uᵀv) = uᵀ ∂v/∂x + vᵀ ∂u/∂x.
9. A function of two variables is differentiable at a point if its graph looks like a plane when you zoom in sufficiently around the point. More generally, a function f : Rⁿ → Rᵐ is differentiable at x if it is well-approximated by its derivative near x:
   lim over ∆x → 0 of |f(x + ∆x) − f(x) − ∂f/∂x(x)∆x| / |∆x| = 0.
10. The Hessian H of f : Rⁿ → R is the matrix of its second-order derivatives: Hᵢⱼ(x) = ∂²f/∂xᵢ∂xⱼ. The quadratic approximation of f at the origin is f(0) + ∂f/∂x(0) x + ½ xᵀ H(0) x.
11. Suppose that f is a continuous function defined on a closed and bounded subset D of Rⁿ. Then:
   (i) f realizes an absolute maximum and absolute minimum on D (the extreme value theorem);
   (ii) any point where f realizes an extremum is either a critical point (meaning that ∇f = 0 or f is non-differentiable at that point) or a point on the boundary;
   (iii) (Lagrange multipliers) if f realizes an extremum at a point on a portion of the boundary which is the level set of a differentiable function g with non-vanishing gradient ∇g, then either f is non-differentiable at that point or the equation ∇f = λ∇g is satisfied at that point, for some λ ∈ R.
12. If r : R¹ → R² and f : R² → R¹, then d/dt (f ∘ r) = ∂f/∂r(r(t)) dr/dt (the chain rule).
13. Integrating a function is a way of totaling up its values: ∬_D f(x, y) dx dy can be interpreted as the mass of an object occupying the region D and having mass density f(x, y) at each point (x, y).
14. Double integration over D: the bounds for the outer integral are the smallest and largest values of y for any point in D, and the bounds for the inner integral are the smallest and largest values of x for any point in a given "y = constant" slice of the region.
15. Polar integration over D: the outer integral bounds are the least and greatest values of θ for a point in D, and the inner integral bounds are the least and greatest values of r for any point in D along each given "θ = constant" ray. The area element is dA = r dr dθ.
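As a sanity check of dA = r dr dθ, here is a small Julia sketch approximating the area of the unit disk with a midpoint Riemann sum in polar coordinates (the grid sizes are arbitrary):

    # ∬ 1 dA over the unit disk, in polar coordinates: ∫₀^{2π} ∫₀^1 r dr dθ = π.
    function disk_area(nr, nθ)
        dr, dθ = 1 / nr, 2π / nθ
        total = 0.0
        for i in 1:nr, j in 1:nθ
            r = (i - 0.5) * dr           # midpoint of the r-cell
            total += r * dr * dθ         # area element r dr dθ
        end
        return total
    end

    disk_area(1000, 1000)   # ≈ 3.14159...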
Numerical Computation: Machine Arithmetic

1. Computers store numerical values as sequences of bits. The type of a numeric value specifies how to interpret the underlying sequence of bits as a number.
2. The Int64 type uses 64 bits to represent the integers from −2⁶³ to 2⁶³ − 1. For 0 ≤ n ≤ 2⁶³ − 1, we represent n using its binary representation, and for 1 ≤ n ≤ 2⁶³, we represent −n using the binary representation of 2⁶⁴ − n. Int64 arithmetic is performed modulo 2⁶⁴.
3. The Float64 type uses 64 bits to represent real numbers. We call the first bit σ, the next 11 bits (interpreted as a binary integer) e ∈ [0, 2047], and the final 52 bits f ∈ [0, 2⁵² − 1]. If e ∉ {0, 2047}, then the number represented by (σ, e, f) is x = (−1)^σ 2^(e−1023) (1 + f/2⁵²).
4. The representable numbers between consecutive powers of 2 are the ones obtained by 52 recursive iterations of binary subdivision. The value of e indicates the powers of 2 that x is between, and the value of f indicates the position of x between those powers of 2. [Figure: a number line showing 2⁵² representable values between each pair of consecutive powers of two, from 2¹⁰¹⁹ up to the largest finite representable value just below 2¹⁰²⁴.]
5. The Float64 exponent value e = 2047 is reserved for Inf and NaN, while e = 0 is reserved for the subnormal numbers: (σ, 0, f) represents (−1)^σ f/2¹⁰⁷⁴.
6. The BigInt and BigFloat types use an arbitrary number of bits and can handle very large numbers or very high precision. Computations are much slower than for 64-bit types.
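These representations can be inspected directly in Julia; a brief sketch:

    # Bit-level representation of Float64 and Int64 values.
    bitstring(1.0)        # sign bit, 11 exponent bits (01111111111 = 1023), 52 zero fraction bits
    bitstring(-1.0)       # same, but with the sign bit set

    eps(1.0)              # spacing of representable numbers near 1.0: 2⁻⁵²
    nextfloat(1.0) - 1.0  # equals eps(1.0)

    typemax(Int64)        # 2⁶³ − 1
    typemax(Int64) + 1    # wraps around to −2⁶³: arithmetic is modulo 2⁶⁴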
Numerical Computation: Error

1. Roundoff error comes from rounding numbers to fit them into a floating point representation. Truncation error comes from using approximate mathematical formulas or algorithms. Statistical error arises from using randomness in an approximation.
2. If Â is an approximation for A, then the relative error is |Â − A| / |A|.
3. The condition number of a function measures how it stretches or compresses relative error. The condition number of a problem is the condition number of the map from the problem's initial data a to its solution S(a): κ(a) = |a| |dS/da(a)| / |S(a)|.
4. A problem is well-conditioned if its condition number is modest and ill-conditioned if the condition number is large.
5. The condition number of a ↦ aⁿ is κ(a) = n, and the condition number of a ↦ a − b is |a| / |a − b| (so subtracting b is ill-conditioned near b; this is called catastrophic cancellation).
6. The relative roundoff error between a non-extreme real number and the nearest T-representable value is no more than the machine epsilon (ε_mach) of the floating point type T.
7. An algorithm which solves a problem with error much greater than κ ε_mach is unstable. An algorithm is unstable if at least one of the steps it performs is ill-conditioned; if every step of an algorithm is well-conditioned, then the algorithm is stable.
8. The condition number of a matrix A is defined to be the maximum condition number of the function x ↦ Ax over its domain. It is equal to the ratio of the largest to the smallest singular value of A.

Numerical Computation: PRNGs

1. A pseudorandom number generator (PRNG) is an algorithm for generating a deterministic sequence of numbers which is intended to share properties with a sequence of random numbers. The PRNG's initial value is called its seed.
2. The linear congruential generator: fix positive integers M, a, and c, and consider a seed X₀ ∈ {0, 1, …, M − 1}. We return the sequence X₀, X₁, X₂, …, where Xₙ = mod(aXₙ₋₁ + c, M) for n ≥ 1.
3. The period of a PRNG is the minimum length of a repeating block. A long period is a desirable property of a PRNG, and a very short period is typically unacceptable.
4. Frequency tests check whether blocks of terms appear with the appropriate frequency (for example, we can check whether a₂ₙ > a₂ₙ₋₁ for roughly half of the values of n).

Numerical Computation: Optimization

1. Gradient descent seeks to minimize f : Rⁿ → R by repeatedly stepping in f's direction of maximum decrease. We begin with a value x₀ ∈ Rⁿ and repeatedly update using the rule xₙ₊₁ = xₙ − ε∇f(xₙ), where ε is the learning rate. We fix a small number τ > 0 and stop when |∇f(xₙ)| < τ.
2. A function is convex if its Hessian is positive semidefinite everywhere, and strictly convex if its Hessian is positive definite everywhere. A strictly convex function has at most one local minimum, and any local minimum is also a global minimum.
3. Gradient descent will find the global minimum for a convex function, but for non-convex functions it can get stuck in a local minimum.
4. Algorithms similar to gradient descent but with usually faster convergence: conjugate gradient, BFGS, L-BFGS.

Numerical Computation: Automatic Differentiation

1. A dual number is an object that can be substituted into a function f to yield both the value of the function and its derivative at a point x. If f is a function which can act on matrices, then the matrix [x 1; 0 x] represents a dual number at x, since
   f([x 1; 0 x]) = [f(x) f′(x); 0 f(x)].
   (This identity is true for any function f which can be defined as a limit of polynomial functions, since it can be checked to hold for f + g and fg whenever it holds for f and g, and it holds for the identity function.)
2. To find the derivative of f with automatic differentiation, every step in the computation of f must be dual-number-aware. See the packages ForwardDiff (for Julia) and autograd (for Python).
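The 2 × 2 matrix representation of dual numbers can be tested directly in Julia; a minimal sketch (the polynomial f is an arbitrary example with no constant term, so it applies to matrices unchanged):

    f(x) = x^3 + 2x              # f′(x) = 3x² + 2

    x = 1.5
    D = [x 1.0; 0.0 x]           # matrix representation of the dual number at x
    F = f(D)                     # = [f(x) f′(x); 0 f(x)]

    @assert F[1, 1] ≈ f(x)       # the value of f at x
    @assert F[1, 2] ≈ 3x^2 + 2   # the derivative, read off the off-diagonal entry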
Probability: Counting

1. Fundamental principle of counting: if one experiment has m possible outcomes, and if a second experiment has n possible outcomes for each of the outcomes in the first experiment, then there are mn possible outcomes for the pair of experiments.
2. The number of ways to arrange n objects in order is n! = 1 · 2 · ⋯ · n (read "n factorial").
3. Permutations: if S is a set with n elements, then there are n!/(n − r)! ordered r-tuples of distinct elements of S.
4. Combinations: the number of r-element subsets of an n-element set is (n choose r) = n!/(r!(n − r)!).

Probability: Probability Spaces

1. Given a random experiment, the set of possible outcomes is called the sample space Ω, like {(H,H), (H,T), (T,H), (T,T)}.
2. We associate with each outcome ω ∈ Ω a probability mass, denoted m(ω). For example, m((H,T)) = 1/4.
3. In a random experiment, an event is a predicate that can be determined based on the outcome of the experiment (like "first flip turned up heads"). Mathematically, an event is a subset of Ω (like {(H,H), (H,T)}).
4. Basic set operations ∪, ∩, and ᶜ correspond to disjunction, conjunction, and negation of events: (i) the event that E happens or F happens is E ∪ F; (ii) the event that E happens and F happens is E ∩ F; (iii) the event that E does not happen is Eᶜ.
5. If E and F cannot both occur (that is, E ∩ F = ∅), we say that E and F are mutually exclusive or disjoint. If E's occurrence implies F's occurrence, then E ⊂ F.
6. The probability P(E) of an event E is the sum of the probability masses of the outcomes in that event. The domain of P is 2^Ω, the set of all subsets of Ω.
7. The pair (Ω, P) is called a probability space. The fundamental probability space properties are:
   (i) P(Ω) = 1 ("something has to happen");
   (ii) P(E) ≥ 0 ("probabilities are non-negative");
   (iii) P(E ∪ F) = P(E) + P(F) if E and F are mutually exclusive ("probability is additive").
8. Other properties which follow from the fundamental ones:
   (i) P(∅) = 0;
   (ii) P(Eᶜ) = 1 − P(E);
   (iii) E ⊂ F ⟹ P(E) ≤ P(F) (monotonicity);
   (iv) P(E ∪ F) = P(E) + P(F) − P(E ∩ F) (the principle of inclusion-exclusion).

Probability: Conditional Probability

1. Given a probability space Ω and an event E ⊂ Ω, the conditional probability measure given E is an updated probability measure on Ω which accounts for the information that the result ω of the random experiment falls in E:
   P(F | E) = P(F ∩ E) / P(E).
2. Bayes' theorem tells us how to update beliefs in light of new evidence. It relates the conditional probabilities P(A | E) and P(E | A):
   P(A | E) = P(E | A)P(A) / P(E) = P(E | A)P(A) / (P(E | A)P(A) + P(E | Aᶜ)P(Aᶜ)).
3. Two events E and F are independent if P(E ∩ F) = P(E)P(F). Two random variables X and Y are independent if every pair of events of the form {X ∈ A} and {Y ∈ B} is independent, where A ⊂ R and B ⊂ R.
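A small Julia simulation of the two-coin-flip sample space above, checking a conditional probability by counting (the estimate should be near P(second heads | first heads) = 1/2):

    # Monte Carlo check of P(F | E) = P(F ∩ E) / P(E), with E = "first flip heads"
    # and F = "second flip heads".
    function conditional_estimate(n)
        both, first = 0, 0
        for _ in 1:n
            flips = rand(Bool, 2)
            if flips[1]
                first += 1
                both += flips[2]       # counts outcomes in F ∩ E
            end
        end
        return both / first
    end

    conditional_estimate(10^6)   # ≈ 0.5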
Probability: Random Variables

1. A random variable is a number which depends on the result of a random experiment (one's lottery winnings, for example). Mathematically, a random variable is a function X from the sample space Ω to R.
2. The distribution of a random variable X is the probability measure on R which maps each set A ⊂ R to P(X ∈ A). The probability mass function (PMF) of the distribution of X may be obtained by pushing forward the probability mass from each ω ∈ Ω.
3. The cumulative distribution function (CDF) of a random variable X is the function F_X(x) = P(X ≤ x).
4. The joint distribution of two random variables X and Y is the probability measure on R² which maps A ⊂ R² to P((X,Y) ∈ A). The probability mass function of the joint distribution is m_(X,Y)(x, y) = P(X = x and Y = y).
5. The conditional probability mass function of Y given {X = x} is m_(Y|X=x)(y) = m_(X,Y)(x, y)/m_X(x).
6. The PMF of the joint distribution of a pair of independent random variables factors as m_(X,Y)(x, y) = m_X(x) m_Y(y).

Probability: Expectation and Variance

1. The expectation E[X] (or mean µ_X) of a random variable X is the probability-weighted average of X: E[X] = ∑ over ω ∈ Ω of X(ω) m(ω). The expectation is the center of mass of the distribution of X.
2. The expectation E[X] may be thought of as the value of a random game with payout X, or as the long-run average of X over many independent runs of the underlying experiment. The Monte Carlo approximation of E[X] is obtained by simulating the experiment many times and averaging the value of X.
3. The expectation of a function of a discrete random variable (or two random variables) may be expressed in terms of the PMF m_X of the distribution of X (or the PMF m_(X,Y) of the joint distribution of X and Y):
   E[g(X)] = ∑ over x of g(x) m_X(x), and E[g(X,Y)] = ∑ over (x,y) of g(x, y) m_(X,Y)(x, y).
4. Expectation is linear: if c ∈ R and X and Y are random variables defined on the same probability space, then E[cX + Y] = cE[X] + E[Y].
5. The variance of a random variable is its average squared deviation from its mean. The variance measures how spread out the distribution of X is. The standard deviation σ(X) is the square root of the variance.
6. If X and Y are independent random variables and a ∈ R, then Var(aX) = a² Var X and Var(X + Y) = Var(X) + Var(Y).
7. The covariance of two random variables X and Y is the expected product of their deviations from their respective means µ_X = E[X] and µ_Y = E[Y]:
   Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] = E[XY] − E[X]E[Y].
8. The covariance of two independent random variables is zero, but zero covariance does not imply independence.
9. The correlation of two random variables is their normalized covariance: Corr(X, Y) = Cov(X, Y)/(σ(X)σ(Y)) ∈ [−1, 1].
10. The covariance matrix of a vector X = [X₁, …, Xₙ] of random variables defined on the same probability space is the matrix Σ whose (i, j)th entry is Cov(Xᵢ, Xⱼ). If E[X] = 0, then Σ = E[XXᵀ].

Probability: Continuous Distributions

1. If Ω ⊂ Rⁿ and P(A) = ∫_A f, where f ≥ 0 and ∫ f = 1, then we call (Ω, P) a continuous probability space. [Figure: a density curve f(x), with P([a, b]) shown as the area under the curve between a and b.]
2. The function f is called a density, because it measures the amount of probability mass per unit volume at each point (2D volume = area, 1D volume = length).
3. If (X, Y) is a pair of random variables whose joint distribution has density f_(X,Y) : R² → R, then the conditional distribution of Y given the event {X = x} has density f_(Y|X=x) defined by f_(Y|X=x)(y) = f_(X,Y)(x, y)/f_X(x), where f_X(x) = ∫ f(x, y) dy is the PDF of X.
4. If a random variable X has density f_X on R, then E[g(X)] = ∫ g(x) f_X(x) dx.
5. CDF sampling: F⁻¹(U) has CDF F if f_U = 1_[0,1]. (A Julia sketch follows this section.)

Probability: Conditional Expectation

1. The conditional expectation of a random variable given an event is the expectation of the random variable calculated with respect to the conditional probability measure given that event: if (X, Y) has PMF m_(X,Y), then E[Y | X = x] = ∑ over y of y m_(Y|X=x)(y), where m_(Y|X=x)(y) = m_(X,Y)(x, y)/m_X(x). If (X, Y) has PDF f_(X,Y), then E[Y | X = x] = ∫ y f_(Y|X=x)(y) dy.
2. The conditional expectation of a random variable Y given another random variable X is obtained by substituting X for x in the expression for the conditional expectation of Y given X = x. Thus E[Y | X] is a random variable.
3. If X and Y are independent, then E[Y | X] = E[Y]. If Z is a function of X, then E[ZY | X] = Z E[Y | X]. The law of iterated expectation: E[E[Y | X]] = E[Y].
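The CDF sampling fact lends itself to a quick Julia experiment: for the exponential distribution, F(x) = 1 − e^(−λx), so F⁻¹(u) = −log(1 − u)/λ (a standard inversion, stated here as an illustration):

    # Inverse-CDF sampling of Exp(λ): if U ~ Unif([0,1]), then F⁻¹(U) ~ Exp(λ).
    λ = 2.0
    Finv(u) = -log(1 - u) / λ

    samples = [Finv(rand()) for _ in 1:10^6]
    sum(samples) / length(samples)    # ≈ 1/λ = 0.5, the exponential mean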
Probability: Common Distributions

1. Bernoulli (Ber(p)): a weighted coin flip. m(1) = p and m(0) = 1 − p; µ = p, σ² = p(1 − p).
2. Binomial (Bin(n, p)): a sum of n independent Ber(p)'s. m(k) = (n choose k) p^k (1 − p)^(n−k); µ = np, σ² = np(1 − p).
3. Geometric (Geom(p)): time to first success (1) in a sequence of independent Ber(p)'s. m(k) = p(1 − p)^(k−1); µ = 1/p, σ² = (1 − p)/p².
4. Poisson distribution (Poiss(λ)): limit as n → ∞ of Bin(n, λ/n). m(k) = λ^k e^(−λ)/k!; µ = λ, σ² = λ.
5. Exponential distribution (Exp(λ)): limit as n → ∞ of the distribution of 1/n times a Geometric(λ/n). f(x) = λe^(−λx); µ = 1/λ, σ² = 1/λ².
6. Normal distribution (N(µ, σ²)): arises as the large-n limit of standardized sums of i.i.d. finite-variance random variables (see the central limit theorem below). Its density is f(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)). [Figure: the bell curve centered at µ, with width governed by σ.]
7. Multivariate normal distribution (N(0, Σ)): if Z = (Z₁, Z₂, …, Zₙ) is a vector of independent N(0,1)'s, A is an m × n matrix of constants, and µ ∈ Rᵐ, then the vector X = AZ + µ is multivariate normal. The covariance matrix of X is Σ = AAᵀ.

Probability: Central Limit Theorem

1. A sequence of random variables X₁, X₂, … converges in probability to X if P(|Xₙ − X| > ε) → 0 as n → ∞, for any ε > 0.
2. A sequence ν₁, ν₂, … of probability measures on Rⁿ converges to a probability measure ν if νₙ(A) → ν(A) for every set A ⊂ Rⁿ with the property that ν(∂A) = 0 (intuitively, two measures are close if they put approximately the same amount of mass in approximately the same places). We say Xₙ converges in distribution to ν if the distribution of Xₙ converges to ν.
3. Chebyshev's inequality: if X is a random variable with variance σ² < ∞, then X differs from its mean by more than k standard deviations with probability at most k⁻²: P(|X − E[X]| > kσ) ≤ 1/k².
4. Law of large numbers: if X₁, X₂, … is a sequence of independent samples from a finite-variance distribution with mean µ, then the sequence's running average converges in probability to µ: for all ε > 0,
   P((X₁ + ⋯ + Xₙ)/n ∉ [µ − ε, µ + ε]) → 0 as n → ∞.
5. The PDF of a sum of n independent samples from a finite-variance distribution looks increasingly bell-shaped as n increases, regardless of the distribution being sampled from. [Figure: PMFs P(Sₙ = k) for sums of n fair coin flips, for n = 1, 4, 7, 10, 13, 16.]
6. We define the standardized running sum of X₁, X₂, … to have zero mean and unit variance for all n ≥ 1:
   Sₙ* = (X₁ + X₂ + ⋯ + Xₙ − nµ) / (σ√n).
7. Central limit theorem: the sequence of standardized sums of an i.i.d. sequence of finite-variance random variables converges in distribution to N(0,1): for any interval [a, b], we have
   P(Sₙ* ∈ [a, b]) → ∫ from a to b of (1/√(2π)) e^(−t²/2) dt, as n → ∞.
8. Multivariate central limit theorem: if X₁, X₂, … is a sequence of independent random vectors whose common distribution has mean µ and covariance matrix Σ, then (X₁ + X₂ + ⋯ + Xₙ − nµ)/√n converges in distribution to N(0, Σ).
9. The central limit theorem explains the ubiquity of the normal distribution in statistics: many random quantities may be realized as a sum of a multitude of independent contributions.
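A compact Julia illustration of the central limit theorem, using fair coin flips (µ = 1/2, σ = 1/2; the sample sizes are arbitrary):

    # Standardized sums Sₙ* of n fair coin flips should behave like N(0,1).
    function standardized_sum(n)
        s = sum(rand(Bool, n))               # X₁ + ⋯ + Xₙ
        return (s - n/2) / (0.5 * sqrt(n))   # (Sₙ − nµ)/(σ√n)
    end

    samples = [standardized_sum(1000) for _ in 1:10^5]
    sum(samples) / length(samples)                     # ≈ 0
    count(s -> -1.96 ≤ s ≤ 1.96, samples) / 10^5       # ≈ 0.95, as for N(0,1)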
Statistical Learning: Theory

1. Statistical learning: given some samples from a probability space with an unknown probability measure, we seek to draw conclusions about the measure.
2. Supervised learning: (X, Y) is drawn from an unknown probability measure P on a product space 𝒳 × 𝒴, and we aim to predict Y given X, based on an i.i.d. collection of samples from P (the training data).
3. We call the components of X features, predictors, or input variables, and we call Y the response variable or output variable.
4. Example: X = [X₁, X₂], where X₁ is the color of a banana, X₂ is the weight of the banana, and Y is a measure of deliciousness. Values of X₁, X₂, and Y are recorded for many bananas, and they are used to predict Y for other bananas whose X values are known.
5. A supervised learning problem is a regression problem if Y is quantitative (𝒴 ⊂ R) and a classification problem if 𝒴 is a set of labels.
6. To choose a prediction function h : 𝒳 → 𝒴, we specify (i) a space H of candidate functions, and (ii) a loss (or risk) functional L from H to R. The target function is argmin over h ∈ H of L(h).
7. If the loss functional for a regression problem is L(h) = E[(h(X) − Y)²] and H contains r(x) = E[Y | X = x], then r is the target function.
8. If the loss functional for a classification problem is L(h) = E[1 when h(X) ≠ Y] and H contains G(x) = argmax over c of P(Y = c | X = x), then G is the target function.
9. Since P is unknown, we must approximate the target function with a function h whose values can be computed from the training data. A learner is a function which takes a set of training data and returns a prediction function h.
10. The empirical probability measure on 𝒳 × 𝒴 is the measure which assigns a probability mass of 1/n to the location of each training sample (X₁, Y₁), (X₂, Y₂), …, (Xₙ, Yₙ).
11. The empirical risk of a candidate function h is the risk functional evaluated with respect to the empirical measure of the training data. The empirical risk minimizer (ERM) is the function which minimizes empirical risk.
12. Generalization error (or test error) is the difference between empirical risk and the actual value of the risk functional.
13. Example: if H is the space of polynomials and no two training samples have the same x values, then there are functions in H which have zero empirical risk.
14. The ERM can overfit, meaning that test error and L(h) are large despite small empirical risk. [Figure: training points with the wiggly empirical risk minimizer overlaid on the smooth risk minimizer.]
15. Mitigate overfitting with inductive bias: (i) use a restrictive class H of candidate functions; (ii) regularize: add a term to the loss functional which penalizes complexity.
16. Inductive bias can lead to underfitting: relevant relations are missed, so both training and test error are larger than necessary. The tension between the costs of high inductive bias and the costs of low inductive bias is called the bias-complexity (or bias-variance) tradeoff.
17. No-free-lunch theorem: all learners are equal on average (over all possible problems), so inductive bias appropriate to a given type of problem is essential to have an effective learner for that type of problem.

Statistical Learning: Kernel Density Estimation

1. Given n samples X₁, …, Xₙ from a distribution with density f on R, we can estimate the PDF of the distribution by placing 1/n units of probability mass in a small pile around each sample.
2. We choose a kernel function D (with total mass 1) for the shape of each pile, for example the tricube kernel D(u) = (70/81)(1 − |u|³)³ for |u| ≤ 1 (and 0 otherwise). The width of each pile is specified by a bandwidth λ: D_λ(u) = (1/λ) D(u/λ). [Figure: a kernel pile of width 2λ.]
3. The kernel density estimator with bandwidth λ is the sum of the piles at each sample:
   f̂_λ(x) = (1/n) ∑ from i = 1 to n of D_λ(x − Xᵢ).
4. To choose a suitable bandwidth, we seek to minimize the integrated squared error (ISE) L(f̂) = ∫ (f̂ − f)². We approximate the minimizer of L with the minimizer of the cross-validation loss estimator
   J(f̂) = ∫ f̂_λ² − (2/n) ∑ from i = 1 to n of f̂_λ⁽⁻ⁱ⁾(Xᵢ),
   where f̂_λ⁽⁻ⁱ⁾ is the KDE with the ith sample omitted.
5. If f is a density on R², then we use the KDE f̂_λ(x, y) = (1/n) ∑ from i = 1 to n of D_λ(x − Xᵢ) D_λ(y − Yᵢ).
6. Stone's theorem says that the ratio of the CV ISE to the optimal-λ ISE converges to 1 in probability as n → ∞. Also, the optimal λ goes to 0 like n^(−1/5), and the minimal ISE goes to 0 like n^(−4/5).
7. The Nadaraya-Watson nonparametric regression estimator r̂(x) computes E[Y | X = x] with respect to the estimated density f̂_λ. Equivalently, we average the Yᵢ's, weighted according to horizontal distance from x:
   r̂(x) = (∑ from i = 1 to n of Yᵢ D(x − Xᵢ)) / (∑ from i = 1 to n of D(x − Xᵢ)).
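A from-scratch Julia sketch of the kernel density estimator above, using the tricube kernel (the sample data are simulated; the names tricube and kde are illustrative):

    # KDE: f̂_λ(x) = (1/n) Σᵢ D_λ(x − Xᵢ), with the tricube kernel.
    tricube(u) = abs(u) ≤ 1 ? (70/81) * (1 - abs(u)^3)^3 : 0.0
    D(u, λ) = tricube(u / λ) / λ

    kde(x, samples, λ) = sum(D(x - Xi, λ) for Xi in samples) / length(samples)

    X = randn(500)            # samples from N(0,1)
    kde(0.0, X, 0.5)          # ≈ 1/√(2π) ≈ 0.399, the standard normal density at 0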
Statistical Learning: Parametric Regression

1. Parametric regression uses a family H of candidate functions which is indexed by finitely many parameters.
2. Linear regression uses the set of affine functions: H = {x ↦ β₀ + [β₁, …, β_p] · x : β₀, …, β_p ∈ R}.
3. We choose the parameters to minimize a risk function, customarily the residual sum of squares:
   RSS(β) = ∑ᵢ (yᵢ − β₀ − β · xᵢ)² = |y − Xβ|²,
   where y = [y₁, …, yₙ], β = [β₀, …, β_p], and X is an n × (p + 1) matrix whose ith row is a 1 followed by the components of xᵢ.
4. The RSS minimizer is β̂ = (XᵀX)⁻¹Xᵀy.
5. We can use the linear regression framework to do polynomial regression, since a polynomial is linear in its coefficients: we supplement the list of regressors with products of the original regressors.

Statistical Learning: Optimal Classification

1. Consider a classification problem with feature set 𝒳 and class set 𝒴. For each y ∈ 𝒴, we define p_y = P(Y = y) and let f_y be the conditional PMF or PDF of X given {Y = y} (y's class conditional distribution).
2. Given a prediction function (or classifier) h and an enumeration of the elements of 𝒴 as {y₁, y₂, …}, we define the (normalized) confusion matrix of h to be the |𝒴| × |𝒴| matrix whose (i, j)th entry is P(h(X) = yᵢ | Y = yⱼ).
3. If 𝒴 = {−1, +1}, the conditional probability of correct classification given a positive sample is the detection rate (DR), while the conditional probability of incorrect classification given a negative sample is the false alarm rate (FAR).
4. The precision of a classifier is the conditional probability that a sample is positive given that the classifier predicts positive, and recall is a synonym of detection rate.
5. The Bayes classifier G(x) = argmax over y of p_y f_y(x) minimizes the misclassification probability but gives equal weight to both types of misclassification.
6. The likelihood ratio test generalizes the Bayes classifier by allowing a variable tradeoff between false alarm rate and detection rate: given t > 0, we say h_t(x) = −1 if f₊(x)/f₋(x) < t and h_t(x) = +1 otherwise. The Neyman-Pearson lemma says that no classifier does better on both false alarm rate and detection rate than h_t.
7. The receiver operating characteristic (ROC) of h_t is the curve {(FAR(h_t), DR(h_t)) : t ∈ [0, ∞]}. The AUROC (area under the ROC) is close to 1 for an excellent classifier and close to 1/2 for a worthless one. The Neyman-Pearson lemma says that no classifier is above the ROC. We choose a point on the ROC curve based on context-specific considerations. [Figure: detection rate versus false alarm rate, with better classifiers toward the upper left and the AUROC shaded under the curve.]
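A minimal Julia sketch of the Bayes classifier for two 1-D Gaussian classes, estimating DR and FAR by simulation (the class means 0 and 2, unit variances, and equal priors are arbitrary choices for illustration):

    # Bayes classifier G(x) = argmax over y of p_y f_y(x), two Gaussian classes.
    f(x, µ, σ) = exp(-(x - µ)^2 / (2σ^2)) / (σ * sqrt(2π))
    ppos, pneg = 0.5, 0.5
    classify(x) = ppos * f(x, 2.0, 1.0) > pneg * f(x, 0.0, 1.0) ? +1 : -1

    xpos = 2.0 .+ randn(10^5)             # samples from the + class
    xneg = randn(10^5)                    # samples from the − class
    DR  = count(x -> classify(x) == +1, xpos) / length(xpos)   # ≈ 0.84
    FAR = count(x -> classify(x) == +1, xneg) / length(xneg)   # ≈ 0.16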
Statistical Learning: QDA, LDA, Naive Bayes

1. Quadratic discriminant analysis (QDA) is a classification algorithm which uses the training data to estimate the mean µ̂_y and covariance matrix Σ̂_y of each class conditional distribution:
   µ̂_y = mean({xᵢ : yᵢ = y}) and Σ̂_y = mean({(xᵢ − µ̂_y)(xᵢ − µ̂_y)ᵀ : yᵢ = y}).
   Each class conditional distribution is assumed to be multivariate normal (N(µ̂_y, Σ̂_y)), and the classifier h(x) = argmax over y of p̂_y f̂_y(x) is proposed (where the p̂_y, for y ∈ 𝒴, are the class proportions from the training data).
2. Linear discriminant analysis (LDA) is the same as QDA except the class covariance matrices are assumed to be equal and are estimated using all of the data, not just class-specific samples.
3. QDA and LDA are so named because they yield class prediction boundaries which are quadric surfaces and hyperplanes, respectively.
4. A Naive Bayes classifier assumes that the features are conditionally independent given Y: f_y(x₁, …, x_p) = f_(y,1)(x₁) ⋯ f_(y,p)(x_p), for some f_(y,1), …, f_(y,p). [Figure: example assumption-satisfying data sets for QDA, LDA, and Naive Bayes.]

Statistical Learning: Support Vector Machines

1. A support vector machine (SVM) chooses a hyperplane H ⊂ Rᵖ and predicts classification (𝒴 = {−1, +1}) based on which side of H the feature vector x lies on: H = {x ∈ Rᵖ : β · x − α = 0}, and x ↦ sgn(β · x − α) is the prediction function.
2. We train the SVM with the risk
   L(β, α) = λ|β|² + (1/n) ∑ᵢ [1 − yᵢ(β · xᵢ − α)]₊,
   where [u]₊ denotes max(0, u), the positive part of u.
3. The parameters β and α encode both H and a distance (called the margin) from H to a parallel hyperplane where we begin penalizing for lack of decisively correct classification: the hyperplanes β · x − α = 1 and β · x − α = −1. The margin is 1/|β| (and can be adjusted without changing H by scaling β and α). [Figure: two classes separated by the hyperplane β · x − α = 0, flanked by the margin hyperplanes β · x − α = ±1.]
4. If λ is small, then the optimization prioritizes the correctness term and uses a small margin if necessary. If λ is large, the optimization must minimize a large-margin incorrectness penalty. A value for λ may be chosen by cross-validation.
5. Kernelization: mapping the feature vectors to a higher-dimensional space allows us to find nonlinear separating surfaces in the original feature space.

Statistical Learning: Logistic Regression

1. Logistic regression for binary classification estimates r(x) = P(Y = 1 | X = x) as a logistic function of a linear function of x: r(x) = σ(α + β · x), where σ(x) = 1/(1 + e⁻ˣ).
2. We choose α and β₁, …, β_p to minimize the risk
   L(r) = ∑ᵢ [yᵢ log(1/r(xᵢ)) + (1 − yᵢ) log(1/(1 − r(xᵢ)))],
   which applies a large penalty if yᵢ = 1 and r(xᵢ) is close to zero, or if yᵢ = 0 and r(xᵢ) is close to 1.
3. The cost L is convex, so it can be reliably minimized using numerical optimization algorithms.
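A compact Julia sketch fitting the logistic regression above by gradient descent (the data are simulated, and the learning rate and iteration count are arbitrary choices; the gradient ∑(r(xᵢ) − yᵢ) · (1, xᵢ) is the standard one for this loss):

    σ(t) = 1 / (1 + exp(-t))

    # Simulated data: P(Y = 1 | X = x) = σ(-1 + 2x).
    x = randn(1000)
    y = [rand() < σ(-1 + 2xi) ? 1.0 : 0.0 for xi in x]

    function fit_logistic(x, y; ε = 0.1, iters = 20_000)
        α, β = 0.0, 0.0
        n = length(x)
        for _ in 1:iters
            r = σ.(α .+ β .* x)                  # predicted probabilities
            α -= ε * sum(r .- y) / n             # gradient step in α
            β -= ε * sum((r .- y) .* x) / n      # gradient step in β
        end
        return α, β
    end

    fit_logistic(x, y)    # approximately recovers (-1, 2)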
Statistical Learning: Neural Networks

1. A neural network function N : Rᵖ → Rᑫ is a composition of affine transformations and componentwise applications of a function K : R → R.
   (i) We call K the activation function. Common choices: (a) u ↦ max(0, u) (the rectifier, or ReLU); (b) u ↦ 1/(1 + e⁻ᵘ) (logistic).
   (ii) Componentwise application of K on Rᵗ refers to the function K.(x₁, …, xₜ) = (K(x₁), …, K(xₜ)).
   (iii) An affine transformation from Rᵗ to Rˢ is a map of the form A(x) = Wx + b, where W is an s × t matrix and b ∈ Rˢ. Entries of W are called weights and entries of b are called biases.
2. The architecture of a neural network is the sequence of dimensions of the domains and codomains of its affine maps. For example, a neural net with W₁ ∈ R⁵ˣ³, W₂ ∈ R⁴ˣ⁵, and W₃ ∈ R¹ˣ⁴ has architecture [3, 5, 4, 1]. [Figure: a computation graph alternating affine maps A₁ = (W₁, b₁), A₂ = (W₂, b₂), A₃ = (W₃, b₃) with componentwise activations K, from the input xᵢ ∈ Rᵖ through to the output N(xᵢ) and the cost Cᵢ against the desired output yᵢ.]
3. Given training samples {(xᵢ, yᵢ)}, we obtain a neural net regression function by minimizing L(N) = (1/n) ∑ᵢ C(N(xᵢ), yᵢ), where C(y, yᵢ) = |y − yᵢ|².
4. When the weight matrices are large, they have many parameters to tune. We use a custom optimization scheme:
   (i) Start with random weights and a training input xᵢ.
   (ii) Forward propagation: apply each successive map and store the vector at each node. The vectors stored after each activation are called activations.
   (iii) Backpropagation: starting with the last node and working left, compute the change in cost per small change in the vector at each node. By the chain rule, each such gradient is equal to the gradient computed at the right-adjacent node times the derivative of the map between the two nodes. The derivative of Aⱼ is Wⱼ, and the derivative of the componentwise application of K at u is the diagonal matrix diag(K′.(u)).
   (iv) Compute the change in cost per small change in the weights and biases of each affine map: since ∂(Wx + b)/∂b = I, the gradient of the cost with respect to b equals the gradient v stored at the output of the affine map, and the gradient with respect to W is vxᵀ, where x is the input to the affine map.
   (v) Stochastic gradient descent: repeat (ii)–(iv) for each sample in a randomly chosen subset of the training set and determine the average desired change in weights and biases to reduce the cost function. Update the weights and biases accordingly and iterate to convergence.
5. For classification, we (i) let yᵢ = [0, …, 0, 1, 0, 0] ∈ R^|𝒴|, with the location of the nonzero entry indicating class (this is called one-hot encoding), (ii) replace the identity map at the output with the softmax function u ↦ [e^(uⱼ) / ∑ₖ e^(uₖ)] for j = 1, …, |𝒴|, and (iii) replace the cost function with C(y, yᵢ) = −log(y · yᵢ).

Statistical Learning: Dimension Reduction

1. The goal of dimension reduction is to map a set of n points in Rᵖ to a lower-dimensional space Rᵏ while retaining as much of the data's structure as possible. Dimension reduction can be used as a visualization aid or as a feature pre-processing step in a machine learning model.
2. Structure may be taken to mean variation about the center, in which case we use principal component analysis (PCA): (i) store the points' components in an n × p matrix, (ii) de-mean each column, (iii) compute the SVD UΣVᵀ of the resulting matrix, and (iv) let W be the first k columns of V. Then WWᵀ : Rᵖ → Rᵖ is the rank-k projection matrix which minimizes the sum of squared projection distances of the points, and Wᵀ : Rᵖ → Rᵏ maps each point to its coordinates in that k-dimensional subspace (with respect to the columns of W). [Figure: MNIST handwritten digit images plotted by their first two principal components.]
3. Structure may be taken to mean pairwise proximity of points, which stochastic neighbor embedding attempts to preserve. Given the data points x₁, …, xₙ and a parameter ρ called the perplexity of the model, we define
   P(i,j)(σ) = e^(−|xᵢ−xⱼ|²/(2σ²)) / ∑ over k ≠ j of e^(−|xₖ−xⱼ|²/(2σ²)),
   and for each j we define σⱼ to be the solution σ of the equation −∑ over i ≠ j of P(i,j)(σ) log₂ P(i,j)(σ) = log₂ ρ (so that the perplexity, 2 raised to the entropy, equals ρ).
4. We define p(i,j) = (P(i,j)(σⱼ) + P(j,i)(σᵢ))/(2n), which describes the similarity of xᵢ and xⱼ. Given y₁, …, yₙ in Rᵏ, we define similarities
   q(i,j) = (1 + |yᵢ − yⱼ|²)⁻¹ / ∑ over k ≠ j of (1 + |yₖ − yⱼ|²)⁻¹.
   We choose y₁, …, yₙ to minimize C(y₁, …, yₙ) = ∑ over i ≠ j of p(i,j) log₂(p(i,j)/q(i,j)). [Figure: a two-dimensional stochastic neighbor embedding of MNIST handwritten digit images.]
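The PCA recipe above is a few lines in Julia; a minimal sketch on simulated data:

    using LinearAlgebra, Statistics

    # PCA: de-mean the columns, take the SVD, keep the first k right singular vectors.
    Xdata = randn(200, 5) * randn(5, 5)          # 200 points in R⁵ with correlated columns
    A = Xdata .- mean(Xdata, dims=1)             # step (ii): de-mean each column
    F = svd(A)                                   # step (iii): A = UΣVᵀ

    k = 2
    W = F.V[:, 1:k]                              # step (iv): first k columns of V
    coords = A * W                               # Wᵀ applied to each point: n × k coordinates
    projected = coords * W'                      # rank-k projection of the de-meaned points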
Statistics: Point Estimation

1. The central problem of statistics is to make inferences about a population or data-generating process based on the information in a finite sample drawn from the population.
2. Parametric estimation involves an assumption that the distribution of the data-generating process comes from a family of distributions parameterized by finitely many real numbers, while nonparametric estimation does not. Examples: QDA is a parametric density estimator, while kernel density estimation is nonparametric.
3. Point estimation is the inference of a single real-valued feature of the distribution of the data-generating process (such as its mean, variance, or median).
4. The empirical measure ν̂ of X₁, …, Xₙ is the probability measure which assigns mass 1/n to each sample's location.
5. A statistical functional is any function T from the set of distributions to [−∞, ∞]. The plug-in estimator of θ = T(ν) is obtained by applying T to the empirical measure: θ̂ = T(ν̂).
6. An estimator θ̂ is a random variable defined in terms of n i.i.d. random variables, the purpose of which is to approximate some statistical functional of the random variables' common distribution. Example: suppose that T(ν) = the mean of ν, and that θ̂ = (X₁ + ⋯ + Xₙ)/n.
7. The bias of an estimator of θ is the difference between the estimator's expected value and θ. Example: the expectation of the sample mean θ̂ = (X₁ + ⋯ + Xₙ)/n is E[(X₁ + ⋯ + Xₙ)/n] = E[ν], so the bias of the sample mean is zero.
8. An estimator is consistent if θ̂ → θ in probability as n → ∞.
9. The standard error se(θ̂) of an estimator θ̂ is its standard deviation.
10. The mean squared error of an estimator is defined to be MSE(θ̂) = E[(θ̂ − θ)²]. MSE is equal to variance plus squared bias; therefore, MSE converges to zero as the number of samples goes to ∞ if and only if variance and bias both converge to zero.

Statistics: Confidence Intervals

1. Consider an unknown probability distribution ν from which we get n independent samples X₁, …, Xₙ, and suppose that θ is the value of some statistical functional of ν. A confidence interval for θ is an interval-valued function of the sample data X₁, …, Xₙ. A confidence interval has confidence level 1 − α if it contains θ with probability at least 1 − α.
2. If θ̂ is unbiased, then [θ̂ − k se(θ̂), θ̂ + k se(θ̂)] is a 1 − 1/k² confidence interval, by Chebyshev's inequality.
3. If θ̂ is unbiased and approximately normally distributed, then [θ̂ − 1.96 se(θ̂), θ̂ + 1.96 se(θ̂)] is an approximate 95% confidence interval, since 95% of the mass of the standard normal distribution is in the interval [−1.96, 1.96].
4. Let I ⊂ R, and suppose that T is a function from the set of distributions to the set of real-valued functions on I. A 1 − α confidence band for T(ν) is a pair of random functions y_min and y_max from I to R, defined in terms of n independent samples from ν and having y_min ≤ T(ν) ≤ y_max everywhere on I with probability at least 1 − α.

Statistics: Empirical CDF Convergence

1. Statistics is predicated on the idea that a distribution is well-approximated by independent samples therefrom. The Glivenko-Cantelli theorem is one formalization of this idea: if F is the CDF of a distribution ν and F̂ₙ is the CDF of the empirical distribution ν̂ₙ of n samples from ν, then F̂ₙ converges to F along the whole number line:
   max over x ∈ R of |F(x) − F̂ₙ(x)| → 0 as n → ∞, in probability.
2. The Dvoretzky-Kiefer-Wolfowitz (DKW) inequality says that the graph of F̂ₙ lies in the ε-band around the graph of F with probability at least 1 − 2e^(−2nε²). [Figure: the Unif([0,1]) CDF F, the empirical CDF F̂₂₀ of 20 samples, and an ε-band around F.]

Statistics: Bootstrapping

1. Bootstrapping is the use of simulation to approximate the value of the plug-in estimator of a statistical functional T which is expressed in terms of independent samples from ν. Example: if θ = T(ν) is the variance of the median of n independent samples from ν, then the bootstrap estimate of θ is obtained as a Monte Carlo approximation of T(ν̂): we sample n times (with replacement) from {X₁, …, Xₙ}, record the median, repeat B times for B large, and take the sample variance of the resulting list of B numbers.
2. The bootstrap approximation of T(ν̂) may be made as close to T(ν̂) as desired by choosing B large enough. The difference between T(ν̂) and T(ν) is likely to be small if n is large (that is, if many samples from ν are available).
3. The bootstrap is useful for computing standard errors, since the standard error of an estimator is often infeasible to compute analytically but conducive to Monte Carlo approximation.
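A short Julia sketch of the bootstrap standard error of the sample median (B and the data are illustrative choices):

    using Statistics

    # Bootstrap estimate of the standard error of the median.
    function bootstrap_se_median(X; B = 10_000)
        n = length(X)
        medians = [median(rand(X, n)) for _ in 1:B]   # resample with replacement
        return std(medians)                           # sample sd of the B medians
    end

    X = randn(100)             # n = 100 samples from ν = N(0,1)
    bootstrap_se_median(X)     # ≈ 0.125, near the asymptotic value √(π/2)/√n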
Statistics: Maximum Likelihood Estimation

1. Maximum likelihood estimation is a general approach for proposing an estimator. Consider a parametric family {f_θ(x) : θ ∈ Rᵈ} of PDFs or PMFs. Given x ∈ Rⁿ, the likelihood L_x : Rᵈ → R is defined by L_x(θ) = f_θ(x₁) f_θ(x₂) ⋯ f_θ(xₙ).
2. If X is a vector of n independent samples drawn from f_θ(x), then L_X(θ) is small or zero when θ is not in accordance with the observed data. Example: suppose x ↦ f(x; θ) is the density of a uniform random variable on [0, θ]. We observe four samples drawn from this distribution: 1.41, 2.45, 6.12, and 4.9. Then the likelihood of any θ < 6.12 is zero, and the likelihood of θ = 10⁶ is very small.
3. The maximum likelihood estimator is θ̂_MLE = argmax over θ ∈ Rᵈ of L_X(θ). Equivalently, θ̂_MLE = argmax over θ ∈ Rᵈ of ℓ_X(θ), where ℓ_X(θ) denotes the logarithm of L_X(θ).
4. Example: suppose that x ↦ f(x; µ, σ²) is the normal density with mean µ and variance σ². Then the maximum likelihood estimator is the maximizer of the log-likelihood
   −(n/2) log 2π − n log σ − (X₁ − µ)²/(2σ²) − ⋯ − (Xₙ − µ)²/(2σ²).
   Setting the derivatives with respect to µ and σ² equal to zero, we find µ̂ = X̄ = (X₁ + ⋯ + Xₙ)/n and σ̂² = ((X₁ − X̄)² + ⋯ + (Xₙ − X̄)²)/n. So the maximum likelihood estimators agree with the plug-in estimators.
5. The MLE enjoys several nice properties: under certain regularity conditions, we have (stated for θ ∈ R¹):
   (i) Consistency: E[(θ̂ − θ)²] → 0 as the number of samples goes to ∞.
   (ii) Asymptotic normality: (θ̂ − θ)/√(Var θ̂) converges in distribution to N(0,1) as the number of samples goes to ∞.
   (iii) Asymptotic optimality: the MSE of the MLE converges to 0 approximately as fast as the MSE of any other consistent estimator.
6. Potential difficulties with the MLE:
   (i) Computational challenges: it might be hard to work out where the maximum of the likelihood occurs, either analytically or numerically.
   (ii) Misspecification: the MLE may be inaccurate if the distribution of the samples is not in the specified parametric family.
   (iii) Unbounded likelihood: if the likelihood function is not bounded, then θ̂ is not well-defined.
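The uniform example can be checked numerically in Julia; a minimal sketch (the four sample values are the ones from the text):

    # Likelihood for Unif([0, θ]): each sample contributes 1/θ if 0 ≤ x ≤ θ, else 0.
    likelihood(θ, X) = all(0 .≤ X .≤ θ) ? (1/θ)^length(X) : 0.0

    X = [1.41, 2.45, 6.12, 4.9]
    likelihood(5.0, X)        # 0.0: a sample exceeds θ
    likelihood(6.12, X)       # the maximum: θ̂_MLE = maximum(X)
    likelihood(1e6, X)        # positive but vanishingly small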
Statistics: Hypothesis Testing

1. Hypothesis testing is a disciplined framework for adjudicating whether observed data do not support a given hypothesis. Consider an unknown distribution from which we will observe n samples X₁, …, Xₙ:
   (i) We state a hypothesis H₀ (called the null hypothesis) about the distribution.
   (ii) We come up with a test statistic T, a function of the data X₁, …, Xₙ, for which we can evaluate the distribution of T assuming the null hypothesis.
   (iii) We give an alternative hypothesis Hₐ under which T is expected to be significantly different from its value under H₀.
   (iv) We give a significance level α (like 5% or 1%), and based on Hₐ we determine a set of values for T (called the critical region) which T would be in with probability at most α under the null hypothesis.
   (v) After setting H₀, Hₐ, α, T, and the critical region, we run the experiment, evaluate T on the samples we get, and record the result as t_obs.
   (vi) If t_obs falls in the critical region, we reject the null hypothesis. The corresponding p-value is defined to be the minimum α-value which would have resulted in rejecting the null hypothesis, with the critical region chosen in the same way.
2. Example: Muriel Bristol claims that she can tell by taste whether the tea or the milk was poured into the cup first. She is given eight cups of tea, four poured milk-first and four poured tea-first. We posit a null hypothesis that she isn't able to discern the pouring method, under which the number of milk-first cups identified correctly is 4 with probability 1/(8 choose 4) = 1/70 ≈ 1.4%, and at least 3 with probability 17/70 ≈ 24%. Therefore, at the 5% significance level, only a correct identification of all the cups would give us grounds to reject the null hypothesis. The p-value in that case would be 1.4%.
3. Failure to reject the null hypothesis is not necessarily evidence for the null hypothesis. The power of a hypothesis test is the conditional probability of rejecting the null hypothesis given that the alternative hypothesis is true. A p-value may be high either because the null hypothesis is true or because the test has low power.
4. The Wald test is based on the normal approximation. Consider a null hypothesis θ = 0 and the alternative hypothesis θ ≠ 0, and suppose that θ̂ is approximately normally distributed. The Wald test rejects the null hypothesis at the 5% significance level if |θ̂| > 1.96 se(θ̂).
5. The random permutation test is applicable when the null hypothesis is that the mean of a given random variable is equal for two populations: (i) we compute the difference between the sample means for the two groups; (ii) we randomly re-assign the group labels and compute the resulting sample mean differences, repeating many times; (iii) we check where the original difference falls in the sorted list of re-sampled differences. (A Julia sketch appears at the end of this section.)
6. Example: suppose the heights of the Romero sons are 72, 69, 68, and 66 inches, and the heights of the Larsen sons are 70, 65, and 64 inches. Consider the null hypothesis that the expected heights are the same for the two families, and the alternative hypothesis that the Romero sons are taller on average (with α = 5%). We find that the sample mean difference of about 2.4 inches is larger than 88.5% of the mean differences obtained by resampling many times. Since 88.5% < 95%, we retain the null hypothesis.
7. If we conduct many hypothesis tests, then the probability of obtaining some false rejections is high (xkcd.com/882). This is called the multiple testing problem. The Bonferroni method is to reject the null hypothesis only for those tests whose p-values are less than α divided by the number of hypothesis tests being run. This ensures that the probability of having even one false rejection is less than α, so it is very conservative.

Statistics: dplyr and ggplot2

1. dplyr is an R package for manipulating data frames. The following functions filter rows, sort rows, select columns, add columns, group, and aggregate the columns of a grouped data frame:

    flights %>%
      filter(month == 1, day < 5) %>%
      arrange(day, distance) %>%
      select(month, day, distance, air_time) %>%
      mutate(speed = distance / air_time * 60) %>%
      group_by(day) %>%
      summarise(avgspeed = mean(speed, na.rm = TRUE))

2. ggplot2 is an R package for data visualization. Graphics are built as a sum of layers, each of which consists of a data frame, a geom, a stat, and a mapping from the data to the geom's aesthetics (like x, y, color, or size). The appearance of the plot can be customized with scales, coords, and themes.
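Referring back to the random permutation test above, a minimal Julia sketch using the Romero/Larsen heights from the text (the exact percentile will vary slightly from run to run):

    using Random, Statistics

    romero = [72, 69, 68, 66]
    larsen = [70, 65, 64]
    observed = mean(romero) - mean(larsen)        # ≈ 2.4 inches

    pooled = vcat(romero, larsen)
    function resampled_diff(pooled, n1)
        p = shuffle(pooled)                       # randomly re-assign group labels
        return mean(p[1:n1]) - mean(p[n1+1:end])
    end

    diffs = [resampled_diff(pooled, length(romero)) for _ in 1:100_000]
    count(d -> d < observed, diffs) / length(diffs)   # ≈ 0.885, as quoted above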
Data Science Cheatsheet 2.0 Last Updated February 13, 2021 Statistics Model Evaluation Logistic Regression Discrete Distributions Prediction Error = Bias2 + Variance + Irreducible Noise Bias - wrong assumptions when training → can’t capture underlying patterns → underfit Variance - sensitive to fluctuations when training→ can’t generalize on unseen data → overfit Predicts probability that Y belongs to a binary class (1 or 0) Fits a logistic (sigmoid) function to the data that maximizes the likelihood that the observations follow the curve Regularization can be added in the exponent P (Y = 1) = + e−(β0 +βx) Odds - output probability can be transformed using P (Y =1) Odds(Y = 1) = 1−P (Y =1) , where P ( 13 ) = 1:2 odds Assumptions – Linear relationship between X and log-odds of Y – Independent observations – Low multicollinearity Binomial Bin(n, p) - number of successes in n events, each with p probability If n = 1, this is the Bernoulli distribution Geometric Geom(p) - number of failures before success Negative Binomial NBin(r, p) - number of failures before r successes Hypergeometric HGeom(N, k, n) - number of k successes in a size N population with n draws, without replacement Poisson Pois(λ) - number of successes in a fixed time interval, where successes occur independently at an average rate λ Continuous Distributions Normal/Gaussian N (µ, σ), Standard Normal Z ∼ N (0, 1) Central Limit Theorem - sample mean of i.i.d data approaches normal distribution Exponential Exp(p) - time between independent events occurring at an average rate λ Gamma Gamma(p) - time until n independent events occurring at an average rate λ Hypothesis Testing Significance Level α - probability of Type error p-value - probability of getting results at least as extreme as the current test If p-value < α, or if test statistic > critical value, then reject the null Type I Error (False Positive) - null true, but reject Type II Error (False Negative) - null false, but fail to reject Power - probability of avoiding a Type II Error, and rejecting the null when it is indeed false Z-Test - tests whether population means/proportions are different Assumes test statistic is normally distributed and is used when n is large and variances are known If not, then use a t-test Paired tests compare the mean at different points in time, and two-sample tests compare means for two groups ANOVA - analysis of variance, used to compare 3+ samples with a single test Chi-Square Test - checks relationship between categorical variables (age vs income) Or, can check goodness-of-fit between observed data and expected population distribution Concepts Learning – Supervised - labeled data – Unsupervised - unlabeled data – Reinforcement - actions, states, and rewards Cross Validation - estimate test error with a portion of training data to validate accuracy and model parameters – k-fold - divide data into k groups, and use one to validate – leave-p-out - use p samples to validate and the rest to train Parametric - assume data follows a function form with a fixed number of parameters Non-parametric - no assumptions on the data and an unbounded number of parameters Regression n =n Mean Squared Error (MSE) = (yi − yˆ)2 Mean Absolute Error (MAE) |(yi − yˆ)| Residual Sum of Squares = (yi − yˆ)2 Total Sum of Squares = (yi − y¯)2 SSres R2 = − SS tot Decision Trees Classification Actual Yes Actual No – Precision = Predict Yes True Positives (TP) False Positives (FP) TP T P +F P Predict No False Negatives (FN) True Negatives (TN) , percent correct when 
predict positive P – Recall, Sensitivity = T PT+F , percent of actual positives N identified correctly (True Positive Rate) N – Specificity = T NT+F , percent of actual negatives identified P correctly, also - FPR (True Negative Rate) precision·recall – F1 = precision+recall , useful when classes are imbalanced Classification and Regression Tree CART for regression minimizes SSE by splitting data into sub-regions and predicting the average value at leaf nodes Trees are prone to high variance, so tune through CV Hyperparameters – Complexity parameter, to only keep splits that improve SSE by at least cp (most influential, small cp → deep tree) – Minimum number of samples at a leaf node – Minimum number of samples to consider a split ROC Curve - plots TPR vs FPR for every threshold α AUC measures how likely the model differentiates positives and negatives Perfect AUC = 1, Baseline = 0.5 Precision-Recall Curve - focuses on the correct prediction of class 1, useful when data or FP/FN costs are imbalanced Linear Regression Models linear relationships between a continuous response and explanatory variables ˆ + by Ordinary Least Squares - find βˆ for yˆ = βˆ0 + βX solving βˆ = (X T X)−1 X T Y which minimizes the SSE Assumptions – Linear relationship and independent observations – Homoscedasticity - error terms have constant variance – Errors are uncorrelated and normally distributed – Low multicollinearity Regularization Add a penalty for large coefficients to reduce overfitting ˆ = λ(number of non−zero variables) Subset (L0): λ||β|| – Computationally slow, need to fit 2k models – Alternatives: forward and backward stepwise selection ˆ = λ |β| ˆ LASSO (L1): λ||β|| – Coefficients shrunk to zero ˆ = λ (β) ˆ2 Ridge (L2): λ||β|| – Reduces effects of multicollinearity Combining LASSO and Ridge gives Elastic Net In all cases, as λ grows, bias increases and variance decreases Regularization can also be applied to many other algorithms Aaron Wang CART for classification minimizes the sum of region impurity, where pˆi is the probability of a sample being in category i Possible measures, each with a max impurity of 0.5 – Gini impurity = i (pˆi )2 – Entropy = i −(pˆi )log2 (pˆi ) At each leaf node, CART predicts the most frequent category, assuming false negative and false positive costs are the same Random Forest Trains an ensemble of trees that vote for the final prediction Bootstrapping - sampling with replacement (will contain duplicates), until the sample is as large as the training set Bagging - training independent models on different subsets of the data, which reduces variance Each tree is trained on ∼63% of the data, so the out-of-bag 37% can estimate prediction error without resorting to CV Additional Hyperparameters (no cp): – Number of trees to build – Number of variables considered at each split Deep trees increase accuracy, but at a high computational cost Model bias is always equal to one of its individual trees Variable Importance - RF ranks variables by their ability to minimize error when split upon, averaged across all trees .Naive Bayes Classifies data using the label with the highest conditional probability, given data a and classes c Naive because it assumes variables are independent P (a|ci )P (ci ) Bayes’ Theorem P (ci |a) = P (a) Gaussian Naive Bayes - calculates conditional probability for continuous data by assuming a normal distribution Support Vector Machines Separates data between two classes by maximizing the margin between the hyperplane and the nearest data points of any 
class Relies on the following: Clustering Dimension Unsupervised, non-parametric methods that groups similar data points together based on distance Principal Component Analysis k-Means Randomly place k centroids across normalized data, and assig observations to the nearest centroid Recalculate centroids as the mean of assignments and repeat until convergence Using the median or medoid (actual data point) may be more robust to noise and outliers k-means++ - improves selection of initial clusters Pick the first center randomly Compute distance between points and the nearest center Choose new center using a weighted probability distribution proportional to distance Repeat until k centers are chosen Evaluating the number of clusters and performance: Silhouette Value - measures how similar a data point is to its own cluster compared to other clusters, and ranges from (best) to -1 (worst) Davies-Bouldin Index - ratio of within cluster scatter to between cluster separation, where lower values are Hierarchical Clustering Support Vector Classifiers - account for outliers by allowing misclassifications on the support vectors (points in or on the margin) Kernel Functions - solve nonlinear problems by computing the similarity between points a, b and mapping the data to a higher dimension Common functions: – Polynomial (ab + r)d – Radial e−γ(a−b) Hinge Loss - max(0, − yi (wT xi − b)), where w is the margin width, b is the offset bias, and classes are labeled ±1 Note, even a correct prediction inside the margin gives loss > Clusters data into groups using a predominant hierarchy Agglomerative Approach Each observation starts in its own cluster Iteratively combine the most similar cluster pairs Continue until all points are in the same cluster Divisive Approach - all points start in one cluster and splits are performed recursively down the hierarchy Linkage Metrics - measure dissimilarity between clusters and combines them using the minimum linkage value over all pairwise points in different clusters by comparing: – Single - the distance between the closest pair of points – Complete - the distance between the farthest pair of points – Ward’s - the increase in within-cluster SSE if two clusters were to be combined Reduction Projects data onto orthogonal vectors that maximize variance Remember, given an n × n matrix A, a nonzero vector x, and a scaler λ, if Ax = λx then x and λ are an eigenvector and eigenvalue of A In PCA, the eigenvectors are uncorrelated and represent principal components Start with the covariance matrix of standardized data Calculate eigenvalues and eigenvectors using SVD or eigendecomposition Rank the principal components by their proportion of variance explained = λiλ For a p-dimensional data, there will be p principal components Sparse PCA - constrains the number of non-zero values in each component, reducing susceptibility to noise and improving interpretability Linear Discriminant Analysis Maximizes separation between classes and minimizes variance within classes for a labeled dataset Compute the mean and variance of each independent variable for every class ) and between-class (σ ) Calculate the within-class (σw b variance −1 Find the matrix W = (σw ) (σb ) that maximizes Fisher’s signal-to-noise ratio Rank the discriminant components by their signal-to-noise ratio λ Assumptions – Independent variables are normally distributed – Homoscedasticity - constant variance of error – Low multicollinearity Factor Analysis Describes data using a linear combination of k latent factors Given 
a normalized matrix X, it follows the form X = Lf + , with factor loadings L and hidden factors f Dendrogram - plots the full hierarchy of clusters, where the height of a node indicates the dissimilarity between its children k-Nearest Neighbors Non-parametric method that calculates yˆ using the average value or most common class of its k-nearest points For high-dimensional data, information is lost through equidistant vectors, so dimension reduction is often applied prior to k-NN Minkowski Distance = ( |ai − bi |p )1/p – p = gives Manhattan distance – p = gives Euclidean distance Assumptions – E(X) = E(f ) = E( ) = – Cov(f ) = I → uncorrelated factors – Cov(f, ) = Since Cov(X) = Cov(Lf ) + Cov( ), then Cov(Lf ) = LL |ai − bi | (ai − bi )2 Scree Plot - graphs the eigenvalues of factors (or principal components) and is used to determine the number of factors to retain The ’elbow’ where values level off is often used as the cutoff Hamming Distance - count of the differences between two vectors, often used to compare categorical variables Aaron Wang .Natural Language Processing Transforms human language into machine-usable code Processing Techniques Neural Convolutional Network Feeds inputs through different hidden layers and relies on weights and nonlinear functions to reach an output – Tokenization - splitting text into individual words (tokens) – Lemmatization - reduces words to its base form based on dictionary definition (am, are, is → be) – Stemming - reduces words to its base form without context (ended → end) – Stop words - remove common and irrelevant words (the, is) Markov Chain - stochastic and memoryless process that predicts future events based only on the current state n-gram - predicts the next term in a sequence of n terms based on Markov chains Bag-of-words - represents text using word frequencies, without context or order tf-idf - measures word importance for a document in a collection (corpus), by multiplying the term frequency (occurrences of a term in a document) with the inverse document frequency (penalizes common terms across a corpus) Cosine Similarity - measures similarity between vectors, A·B calculated as cos(θ) = ||A||||B|| , which ranges from o to Perceptron - the foundation of a neural network that multiplies inputs by weights, adds bias, and feeds the result z to an activation function Activation Function - defines a node’s output Sigmoid ReLU Tanh 1+e−z max(0, z) ez −e−z ez +e−z – Continuous bag-of-words (CBOW) - predicts the word given its context – skip-gram - predicts the context given a word GloVe - combines both global and local word co-occurence data to learn word similarity BERT - accounts for word order and trains on subwords, and unlike word2vec and GloVe, BERT outputs different vectors for different uses of words (cell phone vs blood cell) Sentiment Analysis Extracts the attitudes and emotions from text Polarity - measures positive, negative, or neutral opinions – Valence shifters - capture amplifiers or negators such as ’really fun’ or ’hardly fun’ Sentiment - measures emotional states such as happy or sad Subject-Object Identification - classifies sentences as either subjective or objective Topic Modelling Captures the underlying themes that appear in documents Latent Dirichlet Allocation (LDA) - generates k topics by first assigning each word to a random topic, then iteratively updating assignments based on parameters α, the mix of topics per document, and β, the distribution of words per topic Latent Semantic Analysis (LSA) - identifies patterns 
using tf-idf scores and reduces data to k dimensions through SVD Pooling - downsamples convolution layers to reduce dimensionality and maintain spatial invariance, allowing detection of features even if they have shifted slightly Common techniques return the max or average value in the pooling window The general CNN architecture is as follows: Perform a series of convolution, ReLU, and pooling operations, extracting important features from the data Feed output into a fully-connected layer for classification, object detection, or other structural analyses Word Embedding Maps words and phrases to numerical vectors word2vec - trains iteratively over local word context windows, places similar words close together, and embeds sub-relationships directly into vectors, such that king − man + woman ≈ queen Relies on one of the following: Neural Network Analyzes structural or visual data by extracting local features Convolutional Layers - iterate over windows of the image, applying weights, bias, and an activation function to create feature maps Different weights lead to different features maps Recurrent Neural Network Since a system of linear activation functions can be simplified to a single perceptron, nonlinear functions are commonly used for more accurate tuning and meaningful gradients Predicts sequential data using a temporally connected system that captures both new inputs and previous outputs using hidden states Loss Function - measures prediction error using functions such as MSE for regression and binary cross-entropy for probability-based classification Gradient Descent - minimizes the average loss by moving iteratively in the direction of steepest descent, controlled by the learning rate γ (step size) Note, γ can be updated adaptively for better performance For neural networks, finding the best set of weights involves: Initialize weights W randomly with near-zero values Loop until convergence: – Calculate the average network loss J(W ) – Backpropagation - iterate backwards from the last ∂J(W ) layer, computing the gradient ∂W and updating the weight W ← W − ∂J(W ) γ ∂W Return the minimum loss weight matrix W To prevent overfitting, regularization can be applied by: – Stopping training when validation performance drops – Dropout - randomly drop some nodes during training to prevent over-reliance on a single node – Embedding weight penalties into the objective function Stochastic Gradient Descent - only uses a single point to compute gradients, leading to smoother convergence and faster compute speeds Alternatively, mini-batch gradient descent trains on small subsets of the data, striking a balance between the approaches Aaron Wang RNNs can model various input-output scenarios, such as many-to-one, one-to-many, and many-to-many Relies on parameter (weight) sharing for efficiency To avoid redundant calculations during backpropagation, downstream gradients are found by chaining previous gradients However, repeatedly multiplying values greater than or less than leads to: – Exploding gradients - model instability and overflows – Vanishing gradients - loss of learning ability This can be solved using: – Gradient clipping - cap the maximum value of gradients – ReLU - its derivative prevents gradient shrinkage for x > – Gated cells - regulate the flow of information Long Short-Term Memory - learns long-term dependencies using gated cells and maintains a separate cell state from what is outputted Gates in LSTM perform the following: Forget and filter out irrelevant info from previous layers Store 
relevant info from current input Update the current cell state Output the hidden state, a filtered version of the cell state LSTMs can be stacked to improve performance CS 230 – Deep Learning Shervine Amidi & Afshine Amidi Super VIP Cheatsheet: Deep Learning Afshine Amidi and Shervine Amidi 1.1 Convolutional Neural Networks Overview November 25, 2018 ❒ Architecture of a traditional CNN – Convolutional neural networks, also known as CNNs, are a specific type of neural networks that are generally composed of the following layers: Contents Convolutional Neural Networks 1.1 Overview 1.2 Types of layer 1.3 Filter hyperparameters 1.4 Tuning hyperparameters 1.5 Commonly used activation functions 1.6 Object detection 1.6.1 Face verification and recognition 1.6.2 Neural style transfer 1.6.3 Architectures using computational tricks 2 2 3 The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections Recurrent Neural Networks 2.1 Overview 2.2 Handling long term dependencies 2.3 Learning word representation 2.3.1 Motivation and notations 2.3.2 Word embeddings 2.4 Comparing words 2.5 Language model 2.6 Machine translation 2.7 Attention 7 9 9 10 10 10 Deep Learning Tips and Tricks 3.1 Data processing 3.2 Training a neural network 3.2.1 Definitions 3.2.2 Finding optimal weights 3.3 Parameter tuning 3.3.1 Weights initialization 3.3.2 Optimizing convergence 3.4 Regularization 3.5 Good practices 11 11 12 12 12 12 12 12 13 13 Stanford University 1.2 Types of layer ❒ Convolutional layer (CONV) – The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions Its hyperparameters include the filter size F and stride S The resulting output O is called feature map or activation map Remark: the convolution step can be generalized to the 1D and 3D cases as well ❒ Pooling (POOL) – The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which does some spatial invariance In particular, max and average pooling are special kinds of pooling where the maximum and average value is taken, respectively Winter 2019 CS 230 – Deep Learning Purpose Shervine Amidi & Afshine Amidi Max pooling Average pooling Each pooling operation selects the maximum value of the current view Each pooling operation averages the values of the current view ❒ Zero-padding – Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input This value can either be manually specified or automatically set through one of the three modes detailed below: Valid Illustration Same Pstart = Value P =0 Pend = Comments - Preserves detected features - Most commonly used S I S −I+F −S Pstart ∈ [[0,F − 1]] Pend = F − - Downsamples feature map - Used in LeNet Illustration ❒ Fully Connected (FC) – The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores - Padding such that feature - No padding Purpose 1.4 1.3 Full I −I+F −S S S map size has size - Drops last convolution if dimensions not match I S - Output size is mathematically convenient - Also called ’half’ padding - Maximum padding such that end convolutions are applied on the limits of the input - Filter ’sees’ the input end-to-end Tuning hyperparameters ❒ Parameter compatibility in convolution layer – By noting I 
the length of the input volume size, F the length of the filter, P the amount of zero padding, S the stride, then the output size O of the feature map along that dimension is given by: Filter hyperparameters The convolution layer contains filters for which it is important to know the meaning behind its hyperparameters O= ❒ Dimensions of a filter – A filter of size F × F applied to an input containing C channels is a F × F × C volume that performs convolutions on an input of size I × I × C and produces an output feature map (also called activation map) of size O × O × I − F + Pstart + Pend +1 S Remark: the application of K filters of size F × F results in an output feature map of size O × O × K Remark: often times, Pstart = Pend the formula above ❒ Stride – For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation Stanford University P , in which case we can replace Pstart + Pend by 2P in Winter 2019 CS 230 – Deep Learning Shervine Amidi & Afshine Amidi ❒ Understanding the complexity of the model – In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have In a given layer of a convolutional neural network, it is done as follows: CONV POOL FC Input size I ×I ×C I ×I ×C Nin Output size O×O×K O×O×C Nout Number of parameters (F × F × C + 1) · K (Nin + 1) × Nout Remarks - One bias parameter per filter - In most cases, S < F - A common choice for K is 2C ReLU Leaky ReLU ELU g(z) = max(0,z) g(z) = max( z,z) with g(z) = max(α(ez − 1),z) with α Non-linearity complexities biologically interpretable Addresses dying ReLU issue for negative values Differentiable everywhere Illustration - Pooling operation done channel-wise - In most cases, S = F ❒ Softmax – The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x ∈ Rn and outputs a vector of output probability p ∈ Rn through a softmax function at the end of the architecture It is defined as follows: - Input is flattened - One bias parameter per neuron - The number of FC neurons is free of structural constraints p= p1 pn where pi = e xi n e xj j=1 ❒ Receptive field – The receptive field at layer k is the area denoted Rk × Rk of the input that each pixel of the k-th activation map can ’see’ By calling Fj the filter size of layer j and Si the stride value of layer i and with the convention S0 = 1, the receptive field at layer k can be computed with the formula: (Fj − 1) j=1 Object detection ❒ Types of models – There are main types of object recognition algorithms, for which the nature of what is predicted is different They are described in the table below: j−1 k Rk = + 1.6 Si Classification w localization Detection - Predicts probability of object - Detects object in a picture - Predicts probability of object and where it is located - Detects up to several objects in a picture - Predicts probabilities of objects and where they are located Traditional CNN Simplified YOLO, R-CNN YOLO, R-CNN Image classification i=0 In the example below, we have F1 = F2 = and S1 = S2 = 1, which gives R2 = 1+2 · 1+2 · = - Classifies a picture 1.5 Commonly used activation functions ❒ Rectified Linear Unit – The rectified linear unit layer (ReLU) is an activation function g that is used on all elements of the volume It aims at introducing non-linearities to the network Its variants are summarized in the table below: Stanford University ❒ Detection – In the context of object 
detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image The two main ones are summed up in the table below: Winter 2019 CS 230 – Deep Learning Shervine Amidi & Afshine Amidi Bounding box detection Landmark detection Detects the part of the image where the object is located - Detects a shape or characteristics of an object (e.g eyes) - More granular ❒ YOLO – You Only Look Once (YOLO) is an object detection algorithm that performs the following steps: • Step 1: Divide the input image into a G ì G grid ã Step 2: For each grid cell, run a CNN that predicts y of the following form: Box of center (bx ,by ), height bh and width bw Reference points (l1x ,l1y ), ,(lnx ,lny ) y = pc ,bx ,by ,bh ,bw ,c1 ,c2 , ,cp , T ∈ RG×G×k×(5+p) repeated k times ❒ Intersection over Union – Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding box Ba It is defined as: IoU(Bp ,Ba ) = where pc is the probability of detecting an object, bx ,by ,bh ,bw are the properties of the detected bouding box, c1 , ,cp is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes Bp ∩ Ba Bp ∪ Ba • Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes Remark: when pc = 0, then the network does not detect any object In that case, the corresponding predictions bx , , cp have to be ignored Remark: we always have IoU ∈ [0,1] By convention, a predicted bounding box Bp is considered as being reasonably good if IoU(Bp ,Ba ) 0.5 ❒ R-CNN – Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potential relevant bounding boxes and then run the detection algorithm to find most probable objects in those bounding boxes ❒ Anchor boxes – Anchor boxing is a technique used to predict overlapping bounding boxes In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form ❒ Non-max suppression – The non-max suppression technique aims at removing duplicate overlapping bounding boxes of a same object by selecting the most representative ones After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining: • Step 1: Pick the box with the largest prediction probability • Step 2: Discard any box having an IoU Stanford University Remark: although the original algorithm is computationally expensive and slow, newer architectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN 0.5 with the previous box Winter 2019 CS 230 – Deep Learning 1.6.1 Shervine Amidi & Afshine Amidi Face verification and recognition ❒ Types of models – Two main types of model are summed up in table below: Face verification - Is this the correct person? - One-to-one lookup Face recognition - Is this one of the K persons in the database? 
- One-to-many lookup ❒ Activation – In a given layer l, the activation is noted a[l] and is of dimensions nH × nw × nc ❒ Content cost function – The content cost function Jcontent (C,G) is used to determine how the generated image G differs from the original content image C It is defined as follows: Jcontent (C,G) = [l](C) ||a − a[l](G) ||2 ❒ Style matrix – The style matrix G[l] of a given layer l is a Gram matrix where each of its [l] elements Gkk quantifies how correlated the channels k and k are It is defined with respect to activations a[l] as follows: ❒ One Shot Learning – One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are The similarity function applied to two images is often noted d(image 1, image 2) [l] nH n[l] w [l] Gkk ❒ Siamese Network – Siamese Networks aim at learning how to encode images to then quantify how different two images are For a given input image x(i) , the encoded output is often noted as f (x(i) ) [l] = [l] aijk aijk i=1 j=1 Remark: the style matrix for the style image and the generated image are noted G[l](S) and G[l](G) respectively ❒ Triplet loss – The triplet loss is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative) The anchor and the positive example belong to a same class, while the negative example to another one By calling α ∈ R+ the margin parameter, this loss is defined as follows: ❒ Style cost function – The style cost function Jstyle (S,G) is used to determine how the generated image G differs from the style S It is defined as follows: (A,P,N ) = max (d(A,P ) − d(A,N ) + α,0) [l] Jstyle (S,G) = 1 ||G[l](S) − G[l](G) ||2F = (2nH nw nc )2 (2nH nw nc )2 nc [l](S) Gkk [l](G) − Gkk k,k =1 ❒ Overall cost function – The overall cost function is defined as being a combination of the content and style cost functions, weighted by parameters α,β, as follows: J(G) = αJcontent (C,G) + βJstyle (S,G) Remark: a higher value of α will make the model care more about the content while a higher value of β will make it care more about the style 1.6.3 1.6.2 Neural style transfer ❒ Generative Adversarial Network – Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output that will be fed into the discriminative which aims at differentiating the generated and true image ❒ Motivation – The goal of neural style transfer is to generate an image G based on a given content C and a given style S Stanford University Architectures using computational tricks Winter 2019 CS 230 – Deep Learning Shervine Amidi & Afshine Amidi 2.1 Recurrent Neural Networks Overview ❒ Architecture of a traditional RNN – Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states They are typically as follows: Remark: use cases using variants of GANs include text to image, music generation and synthesis ❒ ResNet – The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error The residual block has the following characterizing equation: a[l+2] = g(a[l] + z [l+2] ) For each timestep t, the activation a and the output y are expressed as follows: ❒ Inception Network – This architecture uses inception modules and aims at giving a try at 
different convolutions in order to increase its performance In particular, it uses the × convolution trick to lower the burden of computation a = g1 (Waa a + Wax x + ba ) and y = g2 (Wya a + by ) where Wax , Waa , Wya , ba , by are coefficients that are shared temporally and g1 , g2 activation functions The pros and cons of a typical RNN architecture are summed up in the table below: Advantages Drawbacks - Possibility of processing input of any length - Model size not increasing with size of input - Computation takes into account historical information - Weights are shared across time - Computation being slow - Difficulty of accessing information from a long time ago - Cannot consider any future input for the current state ❒ Applications of RNNs – RNN models are mostly used in the fields of natural language processing and speech recognition The different applications are summed up in the table below: Stanford University Winter 2019 CS 230 – Deep Learning Type of RNN Shervine Amidi & Afshine Amidi Illustration Example T ∂L(T ) = ∂W One-to-one ∂L(T ) ∂W t=1 (t) Traditional neural network Tx = Ty = 2.2 Handling long term dependencies ❒ Commonly used activation functions – The most common activation functions used in RNN modules are described below: One-to-many Music generation Tx = 1, Ty > Sigmoid g(z) = 1 + e−z Tanh g(z) = ez RELU e−z − ez + e−z g(z) = max(0,z) Many-to-one Sentiment classification Tx > 1, Ty = Many-to-many Name entity recognition Tx = Ty ❒ Vanishing/exploding gradient – The vanishing and exploding gradient phenomena are often encountered in the context of RNNs The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers Many-to-many ❒ Gradient clipping – It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation By capping the maximum value for the gradient, this phenomenon is controlled in practice Machine translation Tx = Ty ❒ Loss function – In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows: Ty L(y,y) = ❒ Types of gates – In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose They are usually noted Γ and are equal to: L(y ,y ) t=1 Γ = σ(W x + U a + b) ❒ Backpropagation through time – Backpropagation is done at each point in time At timestep T , the derivative of the loss L with respect to weight matrix W is expressed as follows: Stanford University where W, U, b are coefficients specific to the gate and σ is the sigmoid function The main ones are summed up in the table below: Winter 2019 CS 230 – Deep Learning Shervine Amidi & Afshine Amidi 2.3 Learning word representation Type of gate Role Used in Update gate Γu How much past should matter now? GRU, LSTM Relevance gate Γr Drop previous information? GRU, LSTM Forget gate Γf Erase a cell or not? LSTM 2.3.1 Output gate Γo How much to reveal of a cell? 
LSTM ❒ Representation techniques – The two main ways of representing words are summed up in the table below: In this section, we note V the vocabulary and |V | its size ❒ GRU/LSTM – Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU Below is a table summing up the characterizing equations of each architecture: Gated Recurrent Unit (GRU) c˜ c tanh(Wc [Γr Γu a ,x ] + bc ) c˜ + (1 − Γu ) a c Motivation and notations 1-hot representation Word embedding - Noted ow - Naive approach, no similarity information - Noted ew - Takes into account words similarity Long Short-Term Memory (LSTM) tanh(Wc [Γr Γu a ,x ] + bc ) c˜ + Γf Γo c c c Dependencies ❒ Embedding matrix – For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows: ew = Eow Remark: learning the embedding matrix can be done using target/context likelihood models Remark: the sign denotes the element-wise multiplication between two vectors ❒ Variants of RNNs – The table below sums up the other commonly used RNN architectures: Bidirectional (BRNN) 2.3.2 Word embeddings ❒ Word2vec – Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words Popular models include skip-gram, negative sampling and CBOW Deep (DRNN) ❒ Skip-gram – The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c By noting θt a parameter associated with t, the probability P (t|c) is given by: Stanford University Winter 2019 CS 230 – Deep Learning Shervine Amidi & Afshine Amidi P (t|c) = exp(θtT ec ) |V | exp(θjT ec ) j=1 Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive CBOW is another word2vec model using the surrounding words to predict a given word ❒ Negative sampling – It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and positive example Given a context word c and a target word t, the prediction is expressed by: 2.5 Language model ❒ Overview – A language model aims at estimating the probability of a sentence P (y) ❒ n-gram model – This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data P (y = 1|c,t) = σ(θtT ec ) ❒ Perplexity – Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T The perplexity is such that the lower, the better and is defined as follows: Remark: this method is less computationally expensive than the skip-gram model ❒ GloVe – The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j Its cost function J is as follows: T J(θ) = |V | PP = f (Xij )(θiT ej + bi + bj − log(Xij ))2 t=1 T |V | j=1 (t) (t) yj · yj i,j=1 Remark: PP is commonly used in t-SNE here f is a weighting function such that Xi,j = =⇒ f (Xi,j ) = Given the symmetry that 
e and θ play in this model, the final word embedding by: (final) ew = (final) ew is given 2.6 e w + θw Machine translation ❒ Overview – A machine translation model is similar to a language model except it has an encoder network placed before For this reason, it is sometimes referred as a conditional language model The goal is to find a sentence y such that: Remark: the individual components of the learned word embeddings are not necessarily interpretable y= arg max P (y , ,y |x) y , ,y 2.4 ❒ Beam search – It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x Comparing words ❒ Cosine similarity – The cosine similarity between words w1 and w2 is expressed as follows: w1 · w2 similarity = = cos(θ) ||w1 || ||w2 || • Step 1: Find top B likely words y • Step 2: Compute conditional probabilities y |x,y , ,y • Step 3: Keep top B combinations x,y , ,y Remark: θ is the angle between words w1 and w2 ❒ t-SNE – t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space In practice, it is commonly used to visualize word vectors in the 2D space Stanford University Winter 2019 CS 230 – Deep Learning Shervine Amidi & Afshine Amidi Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search Remark: the attention scores are commonly used in image captioning and machine translation ❒ Beam width – The beam width B is a parameter for beam search Large values of B yield to better result but with slower performance and increased memory Small values of B lead to worse results but is less computationally intensive A standard value for B is around 10 ❒ Length normalization – In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as: Objective = Tyα Ty log p(y |x,y , , y ) t=1 Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and ❒ Error analysis – When obtaining a predicted translation y that is bad, one can wonder why we did not get a good translation y ∗ by performing the following error analysis: Case P (y ∗ |x) > P (y|x) Root cause Beam search faulty RNN faulty Increase beam width - Try different architecture - Regularize - Get more data Remedies P (y ∗ |x) ❒ Attention weight – The amount of attention that the output y should pay to the activation a is given by α computed as follows: α = exp(e) Tx exp(e ) t =1 Remark: computation complexity is quadratic with respect to Tx ❒ Bleu score – The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision It is defined as follows: bleu score = exp n n pk k=1 where pn is the bleu score on n-gram only defined as follows: countclip (n-gram) pn = n-gram∈y count(n-gram) n-gram∈y Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score 2.7 Attention ❒ Attention model – This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have: c = α a with α =1 t 10 Winter 2019 CS 230 – Deep Learning Shervine Amidi & Afshine Amidi 3.2 Deep Learning Tips and 
Tricks 3.1 3.2.1 Data processing Flip Rotation Definitions ❒ Epoch – In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights ❒ Data augmentation – Deep learning models usually need a lot of data to be properly trained It is often useful to get more data from the existing ones using data augmentation techniques The main ones are summed up in the table below More precisely, given the following input image, here are the techniques that we can apply: Original Training a neural network ❒ Mini-batch gradient descent – During the training phase, updating weights is usually not based on the whole training set at once due to computation complexities or one data point due to noise issues Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune Random crop ❒ Loss function – In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z ❒ Cross-entropy loss – In the context of binary classification in neural networks, the crossentropy loss L(z,y) is commonly used and is defined as follows: L(z,y) = − y log(z) + (1 − y) log(1 − z) - Image without any modification Color shift - Flipped with respect to an axis for which the meaning of the image is preserved Noise addition - Rotation with a slight angle - Simulates incorrect horizon calibration Information loss - Random focus on one part of the image - Several random crops can be done in a row 3.2.2 Finding optimal weights ❒ Backpropagation – Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output The derivative with respect to each weight w is computed using the chain rule Contrast change Using this method, each weight is updated with the rule: - Nuances of RGB is slightly changed - Captures noise that can occur with light exposure w ←− w − α - Addition of noise - More tolerance to quality variation of inputs - Parts of image ignored - Mimics potential loss of parts of image - Luminosity changes - Controls difference in exposition due to time of day ∂L(z,y) ∂w ❒ Updating weights – In a neural network, weights are updated as follows: • Step 1: Take a batch of training data and perform forward propagation to compute the loss • Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight ❒ Batch normalization – It is a step of hyperparameter γ, β that normalizes the batch {xi } the mean and variance of that we want to correct to the batch, it is done as By noting µB , σB follows: xi ←− γ xi àB + B ã Step 3: Use the gradients to update the weights of the network +β It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization Stanford University 11 Winter 2019 CS 230 – Deep Learning 3.3 3.3.1 Shervine Amidi & Afshine Amidi Parameter tuning Method Momentum Weights initialization ❒ Xavier initialization – Instead of initializing the weights in a purely random manner, Xavier initialization enables to have initial weights that take into account characteristics that are unique to the architecture RMSprop ❒ Transfer learning – Training a deep learning model requires a lot of data and more importantly a lot of time It is often useful to take advantage 
of pre-trained weights on huge datasets that took days/weeks to train, and leverage it towards our use case Depending on how much data we have at hand, here are the different ways to leverage this: Adam Explanation - Dampens oscillations - Improvement to SGD - parameters to tune - Root Mean Square propagation - Speeds up learning algorithm by controlling oscillations - Adaptive Moment estimation - Most popular method - parameters to tune Update of w Update of b w − αvdw b − αvdb dw w − α√ sdw db b ←− b − α √ sdb w − α√ vdw sdw + b ←− b − α √ vdb sdb + Remark: other methods include Adadelta, Adagrad and SGD Training size Illustration Small Medium Explanation 3.4 Regularization ❒ Dropout – Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p > It forces the model to avoid relying too much on particular sets of features Freezes all layers, trains weights on softmax Freezes most layers, trains weights on last layers and softmax Remark: most deep learning frameworks parametrize dropout through the ’keep’ parameter 1−p Large 3.3.2 Trains weights on layers and softmax by initializing weights on pre-trained ones ❒ Weight regularization – In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights The main ones are summed up in the table below: LASSO Ridge Elastic Net - Shrinks coefficients to - Good for variable selection Makes coefficients smaller Tradeoff between variable selection and small coefficients + λ||θ||1 λ∈R + λ||θ||22 λ∈R Optimizing convergence ❒ Learning rate – The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated It can be fixed or adaptively changed The current most popular method is called Adam, which is a method that adapts the learning rate ❒ Adaptive learning rates – Letting the learning rate vary when training a model can reduce the training time and improve the numerical optimal solution While Adam optimizer is the most commonly used technique, others can also be useful They are summed up in the table below: Stanford University 12 + λ (1 − α)||θ||1 + α||θ||22 λ ∈ R,α ∈ [0,1] Winter 2019 CS 230 – Deep Learning Shervine Amidi & Afshine Amidi ❒ Early stopping – This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase 3.5 Good practices ❒ Overfitting small batch – When debugging a model, it is often useful to make quick tests to see if there is any major issue with the architecture of the model itself In particular, in order to make sure that the model can be properly trained, a mini-batch is passed inside the network to see if it can overfit on it If it cannot, it means that the model is either too complex or not complex enough to even overfit on a small batch, let alone a normal-sized training set ❒ Gradient checking – Gradient checking is a method used during the implementation of the backward pass of a neural network It compares the value of the analytical gradient to the numerical gradient at given points and plays the role of a sanity-check for correctness Numerical gradient Analytical gradient Formula df f (x + h) − f (x − h) (x) ≈ dx 2h df (x) = f (x) dx Comments - Expensive; loss has to be computed two times per dimension - Used to verify correctness of analytical implementation -Trade-off in choosing h not too small (numerical instability) nor too large (poor 
gradient approximation). The analytical gradient, by contrast, gives the 'exact' result by direct computation and is the one used in the final implementation.

Reinforcement Learning Cheat Sheet

Agent-Environment Interface
At each step t the agent receives a representation of the environment's state, St ∈ S, and selects an action At ∈ A(s). Then, as a consequence of its action, the agent receives a reward Rt+1 ∈ R ⊂ ℝ.

Policy
A policy is a mapping from a state to an action,

    πt(a|s)    (1)

that is, the probability of selecting an action At = a if St = s.

Reward
The total reward (return) is expressed as

    Gt = Σ_{k=0}^{H} γ^k R_{t+k+1}    (2)

where γ is the discount factor and H is the horizon, which can be infinite.

Markov Decision Process
A Markov Decision Process (MDP) is a 5-tuple (S, A, P, R, γ), where:
- S is a finite set of states, s ∈ S
- A is a finite set of actions, a ∈ A
- P encodes the state transition probabilities, p(s′|s, a) = Pr{St+1 = s′ | St = s, At = a}
- R encodes the expected reward for each state-action-next-state triple, r(s′, s, a) = E[Rt+1 | St+1 = s′, St = s, At = a]
- γ is the discount factor

Value Function
The value function describes how good it is to be in a specific state s under a certain policy π. For an MDP:

    vπ(s) = Eπ[Gt | St = s] = Eπ[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | St = s ]    (4)

Informally, vπ(s) is the expected return (expected cumulative discounted reward) when starting from s and following π.

Action-Value (Q) Function
We can also denote the expected return for state-action pairs:

    qπ(s, a) = Eπ[Gt | St = s, At = a]    (6)

Optimal Value Functions
The optimal value function and the optimal action-value function are

    v∗(s) = max_π vπ(s)    (5)
    q∗(s, a) = max_π qπ(s, a)    (7)

Using q∗ from (7), we can restate (5) as

    v∗(s) = max_{a ∈ A(s)} qπ∗(s, a)    (8)

Intuitively, this equation expresses the fact that the value of a state under the optimal policy must be equal to the expected return from the best action available in that state.

Bellman Equation
An important recursive property emerges for both the value (4) and Q (6) functions if we expand them:

    vπ(s) = Eπ[Gt | St = s]
          = Eπ[ Rt+1 + γ Σ_{k=0}^{∞} γ^k R_{t+k+2} | St = s ]
          = Σ_a π(a|s) Σ_{s′} Σ_r p(s′, r|s, a) [ r + γ Eπ( Σ_{k=0}^{∞} γ^k R_{t+k+2} | St+1 = s′ ) ]
          = Σ_a π(a|s) Σ_{s′,r} p(s′, r|s, a) [ r + γ vπ(s′) ]    (9)

Here the sum over s′ and r runs over all possible next states and rewards, and the bracketed term is the expected reward from st+1 onward. The corresponding Bellman optimality equation for v∗ is

    v∗(s) = max_a Σ_{s′,r} p(s′, r|s, a) [ r + γ v∗(s′) ]    (10)

Dynamic Programming
Taking advantage of the subproblem structure of the V and Q functions, we can find the optimal policy by just planning.

Policy Iteration
1. Initialisation: V(s) ∈ ℝ (e.g. V(s) = 0) and π(s) ∈ A, arbitrarily for all s ∈ S
2. Policy Evaluation:
       repeat
           Δ ← 0
           foreach s ∈ S:
               v ← V(s)
               V(s) ← Σ_a π(a|s) Σ_{s′,r} p(s′, r|s, a) [ r + γ V(s′) ]
               Δ ← max(Δ, |v − V(s)|)
       until Δ < θ (a small positive number)
3. Policy Improvement:
       policy-stable ← true
       foreach s ∈ S:
           old-action ← π(s)
           π(s) ← argmax_a Σ_{s′,r} p(s′, r|s, a) [ r + γ V(s′) ]
           if old-action ≠ π(s): policy-stable ← false
       if policy-stable, return V ≈ v∗ and π ≈ π∗; else go to 2
Algorithm 1: Policy Iteration

Value Iteration
We can avoid waiting until V(s) has fully converged and instead merge the policy improvement and a truncated policy evaluation step into one operation:

    Initialise V(s) ∈ ℝ, e.g. V(s) = 0, for all s ∈ S
    repeat
        Δ ← 0
        foreach s ∈ S:
            v ← V(s)
            V(s) ← max_a Σ_{s′,r} p(s′, r|s, a) [ r + γ V(s′) ]
            Δ ← max(Δ, |v − V(s)|)
    until Δ < θ (a small positive number)
Algorithm 2: Value Iteration
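Below is a compact NumPy sketch of Algorithm 2, run on a made-up random MDP; the arrays P[a][s][s′] (transition probabilities) and R[a][s] (expected rewards), the sizes, and all constants are illustrative assumptions, not from the cheat sheet.

import numpy as np

n_states, n_actions, gamma, theta = 4, 2, 0.9, 1e-8
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # random stochastic transitions
R = rng.uniform(0, 1, size=(n_actions, n_states))                 # random expected rewards

V = np.zeros(n_states)
while True:
    # Q[a, s] = r(s, a) + gamma * sum_s' p(s'|s, a) V(s')
    Q = R + gamma * P @ V
    V_new = Q.max(axis=0)                 # Bellman optimality backup (10)
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < theta:
        break

policy = Q.argmax(axis=0)                 # greedy policy with respect to the converged V
print(V, policy)

The whole sweep over states is vectorized into one matrix product, which is why the loop body has no explicit foreach over s.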
Monte Carlo Methods
Monte Carlo (MC) is a model-free method: it does not require complete knowledge of the environment. It is based on averaging sample returns for each state-action pair. The following algorithm gives the basic first-visit implementation:

    Initialise, for all s ∈ S, a ∈ A(s):
        Q(s, a) ← arbitrary; π(s) ← arbitrary; Returns(s, a) ← empty list
    loop forever:
        Choose S0 ∈ S and A0 ∈ A(S0) such that all pairs have probability > 0
        Generate an episode starting at S0, A0 following π
        foreach pair (s, a) appearing in the episode:
            G ← return following the first occurrence of (s, a)
            Append G to Returns(s, a)
            Q(s, a) ← average(Returns(s, a))
        foreach s in the episode:
            π(s) ← argmax_a Q(s, a)
Algorithm 3: Monte Carlo first-visit

For non-stationary problems, the Monte Carlo estimate for, e.g., V is

    V(St) ← V(St) + α [ Gt − V(St) ]    (11)

where α is the learning rate: how much we want to forget about past experiences.

Temporal Difference - Q Learning
Temporal Difference (TD) methods learn directly from raw experience without a model of the environment's dynamics. TD substitutes the expected discounted return Gt from the episode with an estimate:

    V(St) ← V(St) + α [ Rt+1 + γ V(St+1) − V(St) ]    (12)

Sarsa
Sarsa (state-action-reward-state-action) is an on-policy TD control method. The update rule:

    Q(st, at) ← Q(st, at) + α [ rt + γ Q(st+1, at+1) − Q(st, at) ]

    Initialise Q(s, a) arbitrarily and Q(terminal-state, ·) = 0
    foreach episode:
        Choose a from s using the policy derived from Q (e.g., ε-greedy)
        while s is not terminal:
            Take action a, observe r, s′
            Choose a′ from s′ using the policy derived from Q (e.g., ε-greedy)
            Q(s, a) ← Q(s, a) + α [ r + γ Q(s′, a′) − Q(s, a) ]
            s ← s′; a ← a′
Algorithm 4: Sarsa

n-step Sarsa
Define the n-step Q-return

    q_t^(n) = R_{t+1} + γ R_{t+2} + ··· + γ^{n−1} R_{t+n} + γ^n Q(S_{t+n})

n-step Sarsa updates Q(St, at) towards the n-step Q-return:

    Q(st, at) ← Q(st, at) + α [ q_t^(n) − Q(st, at) ]

Forward View Sarsa(λ)

    q_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} q_t^(n)

Forward-view Sarsa(λ):

    Q(st, at) ← Q(st, at) + α [ q_t^λ − Q(st, at) ]

Q-Learning is an off-policy TD control method; its update target uses the maximum over next actions rather than the action actually taken:

    Initialise Q(s, a) arbitrarily and Q(terminal-state, ·) = 0
    foreach episode:
        while s is not terminal:
            Choose a from s using the policy derived from Q (e.g., ε-greedy)
            Take action a, observe r, s′
            Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
            s ← s′
Algorithm 5: Q-Learning

Deep Q Learning
Created by DeepMind, Deep Q Learning (DQL) substitutes the Q function with a deep neural network called a Q-network. It also keeps track of observations in a replay memory in order to use them to train the network. The loss at iteration i is

    L_i(θ_i) = E_{(s,a,r,s′) ∼ U(D)} [ ( r + γ max_{a′} Q(s′, a′; θ_{i−1}) − Q(s, a; θ_i) )² ]    (13)

where the first term inside the square is the target and the second the prediction, θ are the weights of the network, and U(D) denotes uniform sampling from the experience replay history D.

    Initialise replay memory D with capacity N
    Initialise Q(s, a) arbitrarily
    foreach episode:
        while s is not terminal:
            With probability ε select a random action a ∈ A(s),
                otherwise select a = argmax_a Q(s, a; θ)
            Take action a, observe r, s′
            Store transition (s, a, r, s′) in D
            Sample a random minibatch of transitions (sj, aj, rj, s′j) from D
            Set yj ← rj for terminal s′j,
                or yj ← rj + γ max_{a′} Q(s′j, a′; θ) for non-terminal s′j
            Perform a gradient descent step on (yj − Q(sj, aj; θ))²
            s ← s′
Algorithm 6: Deep Q Learning

Double Deep Q Learning

Copyright © 2018 Francesco Saverio Zuppichini, https://github.com/FrancescoSaverioZuppichini/ReinforcementLearning-Cheat-Sheet

Naive Bayes, advantages:
- Probabilities can be updated as data becomes available
- Can be used to predict problems like human resource allocation in companies
- If the naive independence assumption holds, it can converge quicker than other models
- Can be used on smaller training data
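Here is a minimal tabular implementation of Algorithm 5 in Python, run on a made-up five-state chain environment (the environment, the constants α, γ, ε, and the reward scheme are illustrative assumptions, not part of the cheat sheet).

import numpy as np

# Chain environment: states 0..4, actions 0 (left) / 1 (right), reward 1 for reaching state 4.
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.95, 0.1
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r, s2 == n_states - 1      # next state, reward, terminal flag

for episode in range(2000):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection from the current Q table
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s2, r, done = step(s, a)
        # Q-learning update: the off-policy TD target uses the max over next actions
        Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s2].max()) - Q[s, a])
        s = s2

print(Q.argmax(axis=1))                   # should learn to always move right

Swapping the max in the target for the Q-value of the action actually chosen next turns this same loop into Sarsa (Algorithm 4).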
Parametric vs non-parametric models:
- Parametric (or linear): simplify the mapping to a known linear combination form and learn its coefficients.
- Non-parametric (or nonlinear): free to learn any functional form from the training data, while maintaining some ability to generalize.
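A short sketch contrasting the two: a parametric straight-line fit assumes the functional form up front and learns only two coefficients, while a non-parametric k-nearest-neighbors average lets the data dictate the shape. The data, seed, and k below are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 80))
y = np.sin(x) + rng.normal(0, 0.2, size=x.shape)   # nonlinear truth plus noise

# Parametric: assume y = b0 + b1*x and learn the two coefficients by least squares
X = np.column_stack([np.ones_like(x), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]

def knn_predict(x0, k=7):
    idx = np.argsort(np.abs(x - x0))[:k]           # k nearest training points
    return y[idx].mean()

for x0 in np.linspace(0, 2 * np.pi, 5):
    print(f"x={x0:.2f}  line={b[0] + b[1] * x0: .3f}  knn={knn_predict(x0): .3f}  truth={np.sin(x0): .3f}")

On data like this, the line misses the curvature everywhere by roughly the same amount, while the k-NN average tracks the sine shape but is noisier, which is the bias/variance trade-off between the two model families.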
