Linear Algebra and Optimization for Machine Learning: A Textbook

Charu C. Aggarwal
Distinguished Research Staff Member
IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

ISBN 978-3-030-40343-0    ISBN 978-3-030-40344-7 (eBook)
https://doi.org/10.1007/978-3-030-40344-7

© Springer Nature Switzerland AG 2020. This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

To my wife Lata, my daughter Sayani, and all my mathematics teachers.

Contents

1 Linear Algebra and Optimization: An Introduction
  1.1 Introduction
  1.2 Scalars, Vectors, and Matrices
    1.2.1 Basic Operations with Scalars and Vectors
    1.2.2 Basic Operations with Vectors and Matrices
    1.2.3 Special Classes of Matrices
    1.2.4 Matrix Powers, Polynomials, and the Inverse
    1.2.5 The Matrix Inversion Lemma: Inverting the Sum of Matrices
    1.2.6 Frobenius Norm, Trace, and Energy
  1.3 Matrix Multiplication as a Decomposable Operator
    1.3.1 Matrix Multiplication as Decomposable Row and Column Operators
    1.3.2 Matrix Multiplication as Decomposable Geometric Operators
  1.4 Basic Problems in Machine Learning
    1.4.1 Matrix Factorization
    1.4.2 Clustering
    1.4.3 Classification and Regression Modeling
    1.4.4 Outlier Detection
  1.5 Optimization for Machine Learning
    1.5.1 The Taylor Expansion for Function Simplification
    1.5.2 Example of Optimization in Machine Learning
    1.5.3 Optimization in Computational Graphs
  1.6 Summary
  1.7 Further Reading
  1.8 Exercises

2 Linear Transformations and Linear Systems
  2.1 Introduction
    2.1.1 What Is a Linear Transform?
  2.2 The Geometry of Matrix Multiplication
  2.3 Vector Spaces and Their Geometry
    2.3.1 Coordinates in a Basis System
    2.3.2 Coordinate Transformations Between Basis Sets
    2.3.3 Span of a Set of Vectors
    2.3.4 Machine Learning Example: Discrete Wavelet Transform
    2.3.5 Relationships Among Subspaces of a Vector Space
  2.4 The Linear Algebra of Matrix Rows and Columns
  2.5 The Row Echelon Form of a Matrix
    2.5.1 LU Decomposition
    2.5.2 Application: Finding a Basis Set
    2.5.3 Application: Matrix Inversion
    2.5.4 Application: Solving a System of Linear Equations
  2.6 The Notion of Matrix Rank
    2.6.1 Effect of Matrix Operations on Rank
  2.7 Generating Orthogonal Basis Sets
    2.7.1 Gram-Schmidt Orthogonalization and QR Decomposition
    2.7.2 QR Decomposition
    2.7.3 The Discrete Cosine Transform
  2.8 An Optimization-Centric View of Linear Systems
    2.8.1 Moore-Penrose Pseudoinverse
    2.8.2 The Projection Matrix
  2.9 Ill-Conditioned Matrices and Systems
  2.10 Inner Products: A Geometric View
  2.11 Complex Vector Spaces
    2.11.1 The Discrete Fourier Transform
  2.12 Summary
  2.13 Further Reading
  2.14 Exercises

3 Eigenvectors and Diagonalizable Matrices
  3.1 Introduction
  3.2 Determinants
  3.3 Diagonalizable Transformations and Eigenvectors
    3.3.1 Complex Eigenvalues
    3.3.2 Left Eigenvectors and Right Eigenvectors
    3.3.3 Existence and Uniqueness of Diagonalization
    3.3.4 Existence and Uniqueness of Triangulization
    3.3.5 Similar Matrix Families Sharing Eigenvalues
    3.3.6 Diagonalizable Matrix Families Sharing Eigenvectors
    3.3.7 Symmetric Matrices
    3.3.8 Positive Semidefinite Matrices
    3.3.9 Cholesky Factorization: Symmetric LU Decomposition
  3.4 Machine Learning and Optimization Applications
    3.4.1 Fast Matrix Operations in Machine Learning
    3.4.2 Examples of Diagonalizable Matrices in Machine Learning
    3.4.3 Symmetric Matrices in Quadratic Optimization
    3.4.4 Diagonalization Application: Variable Separation for Optimization
    3.4.5 Eigenvectors in Norm-Constrained Quadratic Programming
  3.5 Numerical Algorithms for Finding Eigenvectors
    3.5.1 The QR Method via Schur Decomposition
    3.5.2 The Power Method for Finding Dominant Eigenvectors
  3.6 Summary
  3.7 Further Reading
  3.8 Exercises

4 Optimization Basics: A Machine Learning View
  4.1 Introduction
  4.2 The Basics of Optimization
    4.2.1 Univariate Optimization
      4.2.1.1 Why We Need Gradient Descent
      4.2.1.2 Convergence of Gradient Descent
      4.2.1.3 The Divergence Problem
    4.2.2 Bivariate Optimization
    4.2.3 Multivariate Optimization
  4.3 Convex Objective Functions
  4.4 The Minutiae of Gradient Descent
    4.4.1 Checking Gradient Correctness with Finite Differences
    4.4.2 Learning Rate Decay and Bold Driver
    4.4.3 Line Search
      4.4.3.1 Binary Search
      4.4.3.2 Golden-Section Search
      4.4.3.3 Armijo Rule
    4.4.4 Initialization
  4.5 Properties of Optimization in Machine Learning
    4.5.1 Typical Objective Functions and Additive Separability
    4.5.2 Stochastic Gradient Descent
    4.5.3 How Optimization in Machine Learning Is Different
    4.5.4 Tuning Hyperparameters
    4.5.5 The Importance of Feature Preprocessing
  4.6 Computing Derivatives with Respect to Vectors
    4.6.1 Matrix Calculus Notation
    4.6.2 Useful Matrix Calculus Identities
      4.6.2.1 Application: Unconstrained Quadratic Programming
      4.6.2.2 Application: Derivative of Squared Norm
    4.6.3 The Chain Rule of Calculus for Vectored Derivatives
      4.6.3.1 Useful Examples of Vectored Derivatives
  4.7 Linear Regression: Optimization with Numerical Targets
    4.7.1 Tikhonov Regularization
      4.7.1.1 Pseudoinverse and Connections to Regularization
    4.7.2 Stochastic Gradient Descent
    4.7.3 The Use of Bias
      4.7.3.1 Heuristic Initialization
  4.8 Optimization Models for Binary Targets
    4.8.1 Least-Squares Classification: Regression on Binary Targets
      4.8.1.1 Why Least-Squares Classification Loss Needs Repair
    4.8.2 The Support Vector Machine
      4.8.2.1 Computing Gradients
      4.8.2.2 Stochastic Gradient Descent
    4.8.3 Logistic Regression
      4.8.3.1 Computing Gradients
      4.8.3.2 Stochastic Gradient Descent
    4.8.4 How Linear Regression Is a Parent Problem in Machine Learning
  4.9 Optimization Models for the Multiclass Setting
    4.9.1 Weston-Watkins Support Vector Machine
      4.9.1.1 Computing Gradients
    4.9.2 Multinomial Logistic Regression
      4.9.2.1 Computing Gradients
      4.9.2.2 Stochastic Gradient Descent
  4.10 Coordinate Descent
    4.10.1 Linear Regression with Coordinate Descent
    4.10.2 Block Coordinate Descent
    4.10.3 K-Means as Block Coordinate Descent
  4.11 Summary
  4.12 Further Reading
  4.13 Exercises

5 Advanced Optimization Solutions
  5.1 Introduction
  5.2 Challenges in Gradient-Based Optimization
    5.2.1 Local Optima and Flat Regions
    5.2.2 Differential Curvature
      5.2.2.1 Revisiting Feature Normalization
    5.2.3 Examples of Difficult Topologies: Cliffs and Valleys
  5.3 Adjusting First-Order Derivatives for Descent
    5.3.1 Momentum-Based Learning
    5.3.2 AdaGrad
    5.3.3 RMSProp
    5.3.4 Adam
  5.4 The Newton Method
    5.4.1 The Basic Form of the Newton Method
    5.4.2 Importance of Line Search for Non-quadratic Functions
    5.4.3 Example: Newton Method in the Quadratic Bowl
    5.4.4 Example: Newton Method in a Non-quadratic Function
  5.5 Newton Methods in Machine Learning
    5.5.1 Newton Method for Linear Regression
    5.5.2 Newton Method for Support-Vector Machines
    5.5.3 Newton Method for Logistic Regression
    5.5.4 Connections Among Different Models and Unified Framework
  5.6 Newton Method: Challenges and Solutions
    5.6.1 Singular and Indefinite Hessian
    5.6.2 The Saddle-Point Problem
    5.6.3 Convergence Problems and Solutions with Non-quadratic Functions
      5.6.3.1 Trust Region Method
  5.7 Computationally Efficient Variations of Newton Method
    5.7.1 Conjugate Gradient Method
    5.7.2 Quasi-Newton Methods and BFGS
  5.8 Non-differentiable Optimization Functions
    5.8.1 The Subgradient Method
      5.8.1.1 Application: L1-Regularization
      5.8.1.2 Combining Subgradients with Coordinate Descent
    5.8.2 Proximal Gradient Method
      5.8.2.1 Application: Alternative for L1-Regularized Regression
    5.8.3 Designing Surrogate Loss Functions for Combinatorial Optimization
      5.8.3.1 Application: Ranking Support Vector Machine
    5.8.4 Dynamic Programming for Optimizing Sequential Decisions
      5.8.4.1 Application: Fast Matrix Multiplication
  5.9 Summary
  5.10 Further Reading
  5.11 Exercises

6 Constrained Optimization and Duality
  6.1 Introduction
  6.2 Primal Gradient Descent Methods
    6.2.1 Linear Equality Constraints
      6.2.1.1 Convex Quadratic Program with Equality Constraints
      6.2.1.2 Application: Linear Regression with Equality Constraints
      6.2.1.3 Application: Newton Method with Equality Constraints
    6.2.2 Linear Inequality Constraints
      6.2.2.1 The Special Case of Box Constraints
      6.2.2.2 General Conditions for Projected Gradient Descent to Work
      6.2.2.3 Sequential Linear Programming
    6.2.3 Sequential Quadratic Programming
  6.3 Primal Coordinate Descent
    6.3.1 Coordinate Descent for Convex Optimization Over Convex Set
    6.3.2 Machine Learning Application: Box Regression
  6.4 Lagrangian Relaxation and Duality
    6.4.1 Kuhn-Tucker Optimality Conditions
    6.4.2 General Procedure for Using Duality
      6.4.2.1 Inferring the Optimal Primal Solution from Optimal Dual Solution
    6.4.3 Application: Formulating the SVM Dual
      6.4.3.1 Inferring the Optimal Primal Solution from Optimal Dual Solution

Chapter 11: Optimization in Computational Graphs
11.8 Exercises

9. The notations U, V, and W are matrices of sizes $k \times m$, $m \times d$, and $m \times m$, respectively. The vector $h_0$ is set to the zero vector. Start by drawing a (vectored) computational graph for this system. Show that node-to-node backpropagation uses the following recurrence:
$$\frac{\partial o}{\partial h_t} = U^T, \qquad \frac{\partial o}{\partial h_{p-1}} = W^T \Delta_{p-1} \frac{\partial o}{\partial h_p} \quad \forall p \in \{2, \ldots, t\}$$
Here, $\Delta_p$ is a diagonal matrix in which the diagonal entries contain the components of the vector $1 - h_p \odot h_p$ (elementwise product). What you have just derived contains the node-to-node backpropagation equations of a recurrent neural network. What is the size of each matrix $\frac{\partial o}{\partial h_p}$?

10. Show that if we use the loss function $L(o)$ in Exercise 9, then the loss-to-node gradient can be computed for the final layer $h_t$ as follows:
$$\frac{\partial L(o)}{\partial h_t} = U^T \frac{\partial L(o)}{\partial o}$$
The updates in earlier layers remain similar to Exercise 9, except that each $o$ is replaced by $L(o)$. What is the size of each matrix $\frac{\partial L(o)}{\partial h_p}$?

11. Suppose that the output structure of the neural network in Exercise 9 is changed so that there are $k$-dimensional outputs $o_1 \ldots o_t$ in each layer, and the overall loss is $L = \sum_{i=1}^{t} L(o_i)$. The output recurrence is $o_p = U h_p$. All other recurrences remain the same. Show that the backpropagation recurrence of the hidden layers changes as follows:
$$\frac{\partial L}{\partial h_t} = U^T \frac{\partial L(o_t)}{\partial o_t}, \qquad \frac{\partial L}{\partial h_{p-1}} = W^T \Delta_{p-1} \frac{\partial L}{\partial h_p} + U^T \frac{\partial L(o_{p-1})}{\partial o_{p-1}} \quad \forall p \in \{2, \ldots, t\}$$

12. For Exercise 11, show the following loss-to-weight derivatives:
$$\frac{\partial L}{\partial U} = \sum_{p=1}^{t} \frac{\partial L(o_p)}{\partial o_p} h_p^T, \qquad \frac{\partial L}{\partial W} = \sum_{p=2}^{t} \Delta_{p-1} \frac{\partial L}{\partial h_p} h_{p-1}^T, \qquad \frac{\partial L}{\partial V} = \sum_{p=1}^{t} \Delta_p \frac{\partial L}{\partial h_p} x_p^T$$
What are the sizes and ranks of these matrices?

13. Consider a neural network in which a vectored node $v$ feeds into two distinct vectored nodes $h_1$ and $h_2$ computing different functions. The functions computed at the nodes are $h_1 = \text{ReLU}(W_1 v)$ and $h_2 = \text{sigmoid}(W_2 v)$. We do not know anything about the values of the variables in other parts of the network, but we know that $h_1 = [2, -1, 3]^T$ and $h_2 = [0.2, 0.5, 0.3]^T$, and that these nodes are connected to the node $v = [2, 3, 5, 1]^T$. Furthermore, the loss gradients are $\frac{\partial L}{\partial h_1} = [-2, 1, 4]^T$ and $\frac{\partial L}{\partial h_2} = [1, 3, -2]^T$, respectively. Show that the backpropagated loss gradient $\frac{\partial L}{\partial v}$ can be computed in terms of $W_1$ and $W_2$ as follows:
$$\frac{\partial L}{\partial v} = W_1^T \begin{bmatrix} -2 \\ 0 \\ 4 \end{bmatrix} + W_2^T \begin{bmatrix} 0.16 \\ 0.75 \\ -0.42 \end{bmatrix}$$
What are the sizes of $W_1$, $W_2$, and $\frac{\partial L}{\partial v}$?
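The arithmetic behind Exercise 13 can be checked numerically. The NumPy sketch below is not part of the book; it simply multiplies the upstream gradients by the local activation derivatives and confirms the two bracketed vectors, under the assumption that the reported values of $h_1$ are used only to decide which ReLU units are active. The random $W_1$ and $W_2$ are placeholders, since the exercise leaves them symbolic.

```python
import numpy as np

# Values given in Exercise 13 (node values and upstream gradients).
h1 = np.array([2.0, -1.0, 3.0])        # ReLU node
h2 = np.array([0.2, 0.5, 0.3])         # sigmoid node
v = np.array([2.0, 3.0, 5.0, 1.0])
dL_dh1 = np.array([-2.0, 1.0, 4.0])
dL_dh2 = np.array([1.0, 3.0, -2.0])

# Local derivatives of the activations, evaluated from the node values:
# ReLU'(z) = 1 where the output is positive, 0 otherwise;
# sigmoid'(z) = h2 * (1 - h2) when h2 = sigmoid(z).
relu_mask = (h1 > 0).astype(float)     # [1, 0, 1]
sig_deriv = h2 * (1.0 - h2)            # [0.16, 0.25, 0.21]

g1 = dL_dh1 * relu_mask                # [-2, 0, 4]
g2 = dL_dh2 * sig_deriv                # [0.16, 0.75, -0.42]
print(g1, g2)

# The backpropagated gradient is dL/dv = W1^T g1 + W2^T g2. The true W1 and
# W2 (both 3 x 4) are unknown here; random placeholders show the shapes.
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
dL_dv = W1.T @ g1 + W2.T @ g2
print(dL_dv.shape)                     # (4,), matching the size of v
```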
14. Forward Mode Differentiation: The backpropagation algorithm needs to compute node-to-node derivatives of output nodes with respect to all other nodes, and therefore computing gradients in the backwards direction makes sense. Consequently, the pseudocode on page 460 propagates gradients in the backward direction. However, consider the case where we want to compute the node-to-node derivatives of all nodes with respect to source (input) nodes $s_1 \ldots s_k$. In other words, we want to compute $\frac{\partial x}{\partial s_i}$ for each non-input node variable $x$ and each input node $s_i$ in the network. Propose a variation of the pseudocode of page 460 that computes node-to-node gradients in the forward direction.

15. All-pairs node-to-node derivatives: Let $y(i)$ be the variable in node $i$ in a directed acyclic computational graph containing $n$ nodes and $m$ edges. Consider the case where one wants to compute $S(i, j) = \frac{\partial y(j)}{\partial y(i)}$ for all pairs of nodes in a computational graph, so that at least one directed path exists from node $i$ to node $j$. Propose an algorithm for all-pairs derivative computation that requires at most $O(n^2 m)$ time. [Hint: The pathwise aggregation lemma is helpful. First compute $S(i, j, t)$, which is the portion of $S(i, j)$ in the lemma belonging to paths of length exactly $t$. How can $S(i, k, t+1)$ be expressed in terms of the different $S(i, j, t)$?]

16. Use the pathwise aggregation lemma to compute the derivative of $y(10)$ with respect to each of $y(1)$, $y(2)$, and $y(3)$ as an algebraic expression (cf. Figure 11.9). You should get the same derivative as obtained using the backpropagation algorithm in the text of the chapter.

17. Consider the computational graph of Figure 11.8. For a particular numerical input $x = a$, you find the unusual situation that the value $\frac{\partial y(j)}{\partial y(i)}$ is 0.3 for each and every edge $(i, j)$ in the network. Compute the numerical value of the partial derivative of the output with respect to the input $x$ (at $x = a$). Show the computations using both the pathwise aggregation lemma and the backpropagation algorithm.

18. Consider the computational graph of Figure 11.8. The upper node in each layer computes $\sin(x + y)$ and the lower node in each layer computes $\cos(x + y)$ with respect to its two inputs. For the first hidden layer, there is only a single input $x$, and therefore the values $\sin(x)$ and $\cos(x)$ are computed. The final output node computes the product of its two inputs. The single input $x$ is 1 radian. Compute the numerical value of the partial derivative of the output with respect to the input $x$ (at $x = 1$ radian). Show the computations using both the pathwise aggregation lemma and the backpropagation algorithm.

19. Matrix factorization with neural networks: Consider a neural network containing an input layer, a hidden layer, and an output layer. The number of outputs is equal to the number of inputs $d$. Each output value corresponds to an input value, and the loss function is the sum of squared differences between the outputs and their corresponding inputs. The number of nodes $k$ in the hidden layer is much less than $d$. The $d$-dimensional rows of a data matrix $D$ are fed one by one to train this neural network. Discuss why this model is identical to that of unconstrained matrix factorization of rank $k$. Interpret the weights and the activations in the hidden layer in the context of matrix factorization. You may assume that the matrix $D$ has full column rank. Define weight matrix and data matrix notations as convenient.
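The equivalence claimed in Exercise 19 can be illustrated with a small experiment. The sketch below is not from the book and the variable names are my own: a linear network with a $k$-node hidden layer is trained by gradient descent to reproduce its input, and its reconstruction error is compared with the optimal rank-$k$ error obtained from a truncated SVD. With the stated learning rate and iteration count the two errors should come out close, though no attempt is made here to tune the optimizer.

```python
import numpy as np

# Linear "autoencoder" view of rank-k matrix factorization (Exercise 19).
rng = np.random.default_rng(42)
n, d, k = 50, 10, 3
D = rng.standard_normal((n, d))

Enc = 0.1 * rng.standard_normal((d, k))   # input-to-hidden weights
Dec = 0.1 * rng.standard_normal((k, d))   # hidden-to-output weights
lr = 0.02

for _ in range(5000):                     # full-batch gradient descent
    H = D @ Enc                           # hidden activations (n x k)
    R = H @ Dec - D                       # reconstruction residual
    g_out = 2.0 * R / n                   # gradient of mean squared error
    g_Dec = H.T @ g_out
    g_Enc = D.T @ (g_out @ Dec.T)
    Dec -= lr * g_Dec
    Enc -= lr * g_Enc

auto_err = np.linalg.norm(D @ Enc @ Dec - D) ** 2

# Best possible rank-k squared error, from the truncated SVD (Eckart-Young).
U, s, Vt = np.linalg.svd(D, full_matrices=False)
svd_err = np.sum(s[k:] ** 2)

print(f"autoencoder squared error:  {auto_err:.3f}")
print(f"optimal rank-{k} squared error: {svd_err:.3f}")  # should be close
```

The hidden activations H play the role of the reduced representation and Dec plays the role of the basis matrix in the factorization D ≈ H Dec, which is exactly the interpretation the exercise asks for.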
[Figure 11.17: Computational graphs for Exercises 21 and 22. Panel (a) is the graph for Exercise 21 and panel (b) the graph for Exercise 22. In both panels, the input nodes x1, x2, x3 feed a graph whose output node computes o = 0.1, the loss is L = -log(o), and the local derivative is marked on each edge.]

20. SVD with neural networks: In the previous exercise, unconstrained matrix factorization finds the same $k$-dimensional subspace as SVD. However, it does not in general find an orthonormal basis like SVD (see Chapter 8). Provide an iterative training method for the computational graph of the previous exercise by gradually increasing the value of $k$ so that an orthonormal basis is found.

21. Consider the computational graph shown in Figure 11.17(a), in which the local derivative $\frac{\partial y(j)}{\partial y(i)}$ is shown for each edge $(i, j)$, where $y(k)$ denotes the activation of node $k$. The output $o$ is 0.1, and the loss $L$ is given by $-\log(o)$. Compute the value of $\frac{\partial L}{\partial x_i}$ for each input $x_i$ using both the path-wise aggregation lemma and the backpropagation algorithm.

22. Consider the computational graph shown in Figure 11.17(b), in which the local derivative $\frac{\partial y(j)}{\partial y(i)}$ is shown for each edge $(i, j)$, where $y(k)$ denotes the activation of node $k$. The output $o$ is 0.1, and the loss $L$ is given by $-\log(o)$. Compute the value of $\frac{\partial L}{\partial x_i}$ for each input $x_i$ using both the path-wise aggregation lemma and the backpropagation algorithm.

23. Convert the weighted computational graph of Figure 11.2 into an unweighted graph by defining additional nodes containing $w_1 \ldots w_5$ along with appropriately defined hidden nodes.

24. Multinomial logistic regression with neural networks: Propose a neural network architecture using the softmax activation function and an appropriate loss function that can perform multinomial logistic regression. You may refer to Chapter 4 for details of multinomial logistic regression.

25. Weston-Watkins SVM with neural networks: Propose a neural network architecture and an appropriate loss function that is equivalent to the Weston-Watkins SVM. You may refer to Chapter 4 for details of the Weston-Watkins SVM.
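As a companion to Exercise 24, the following sketch (illustrative only, not the book's solution) implements multinomial logistic regression as a single-layer network with a softmax output and cross-entropy loss; the data here is synthetic and the variable names are my own.

```python
import numpy as np

# Multinomial logistic regression as a one-layer softmax network.
rng = np.random.default_rng(0)
d, k, n = 5, 3, 8
X = rng.standard_normal((n, d))          # rows are training points
y = rng.integers(0, k, size=n)           # class labels in {0, ..., k-1}
W = np.zeros((k, d))                     # one weight vector per class
lr = 0.1

def softmax(z):
    z = z - z.max(axis=1, keepdims=True) # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for epoch in range(100):
    P = softmax(X @ W.T)                 # n x k class probabilities
    loss = -np.log(P[np.arange(n), y]).mean()
    Y = np.eye(k)[y]                     # one-hot labels
    grad_W = (P - Y).T @ X / n           # gradient of cross-entropy, k x d
    W -= lr * grad_W

print(f"final training loss: {loss:.4f}")
```

The gradient (P - Y) per training point is exactly the softmax cross-entropy derivative used in Section 4.9.2, which is why this network reproduces multinomial logistic regression.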
Index

Symbols: (PD) Constraints, 275
A: Activation, 451; AdaGrad, 214; Adam Algorithm, 215; Additively Separable Functions, 128; Adjacency Matrix, 414; Affine Transform, 42; Affine Transform Definition, 43; Algebraic Multiplicity, 110; Alternating Least-Squares Method, 197; Alternating Least Squares, 349; Anisotropic Scaling, 49; Armijo Rule, 162; Asymmetric Laplacian, 426
B: Backpropagation, 459; Barrier Function, 289; Barrier Methods, 288; Basis, 54; Basis Change Matrix, 104; BFGS, 237, 251; Binary Search, 161; Block Coordinate Descent, 197, 349; Block Diagonal Matrix, 13, 419; Block Upper-Triangular Matrix, 419; Bold-Driver Algorithm, 160; Box Regression, 269
C: Cauchy-Schwarz Inequality; Cayley-Hamilton Theorem, 106; Chain Rule for Vectored Derivatives, 175; Chain Rule of Calculus, 174; Characteristic Polynomial, 105; Cholesky Factorization, 119; Closed Convex Set, 155; Clustering Graphs, 423; Collective Classification, 440; Compact Singular Value Decomposition, 307; Competitive Learning Algorithm, 477; Complementary Slackness Condition, 275; Complex Eigenvalues, 107; Computational Graphs, 34; Condition Number of a Matrix, 85, 326; Conjugate Gradient Method, 233; Conjugate Transpose, 88; Connected Component of Graph, 413; Connected Graph, 413; Constrained Optimization, 255; Convergence in Gradient Descent, 147; Convex Objective Functions, 124, 154; Convex Sets, 154; Coordinate, 2, 55; Coordinate Descent, 194, 348; Coordinate Descent in Recommenders, 348; Cosine Law; Covariance Matrix, 122; Critical Points, 143; Cycle, 413
D: Data-Specific Mercer Kernel Map, 383; Davidson–Fletcher–Powell, 238; Decision Boundary, 181; Decomposition of Matrices, 339; Defective Matrix, 110; Degree Centrality, 434; Degree Matrix, 416; Degree of a Vertex, 413; Degree Prestige, 434; Denominator Layout, 170; DFP, 238; Diagonal Entries of a Matrix, 13; Diameter of Graph, 414; Dimensionality Reduction, 307; Directed Acyclic Graphs, 413; Directed Graph, 412; Directed Link Prediction, 430; Discrete Cosine Transform, 77; Discrete Fourier Transform, 79, 89; Discrete Wavelet Transform, 60; Disjoint Vector Spaces, 61; Divergence in Gradient Descent, 148; Document-Term Matrix, 340; Duality, 255; Dynamic Programming, 248, 453
E: Economy Singular Value Decomposition, 306; Eigenspace, 110, 436; Eigenvalues, 104; Eigenvector Centrality, 434; Eigenvector Prestige, 434; Eigenvectors, 104; Elastic-Net regression, 244; Elementary Matrix, 22; Energy, 20, 311; Epoch, 165; Ergodic Markov Chain, 432; Euler Identity, 33, 87
F: Factorization of Matrices, 299, 339; Fat Matrix; Feasible Direction method, 256; Feature Engineering, 329; Feature Preprocessing with PCA, 327; Feature Spaces, 383; Fields; Finite-Difference Approximation, 159; Frobenius Inner Product, 309; Full Rank Matrix, 63; Full Row/Column Rank, 63; Full Singular Value Decomposition, 306; Fundamental Subspaces of Linear Algebra, 63, 325
G: Gaussian Elimination, 65; Gaussian Radial Basis Kernel, 405; Generalization, 165; Generalized Low-Rank Models, 365; Geometric Multiplicity, 111; Givens Rotation, 47; Global Minimum, 145; GloVe, 361; Golden-Section Search, 161; Gram-Schmidt Orthogonalization, 73; Gram Matrix, 72; Graphs, 411
H: Hard Tanh Activation, 451; Hessian, 127, 152, 217; Hessian-free Optimization, 233; Hinge Loss, 184; Homogeneous System of Equations, 327; Homophily, 440; Householder Reflection Matrix, 47; Huber Loss, 223; Hyperbolic Tangent Activation, 451
I: Idempotent Property, 83; Identity Activation, 451; Ill-Conditioned Matrices, 85; Implicit Feedback Data, 341; Indefinite Matrix, 118; Indegree of a Vertex, 413; Inflection Point, 143; Initialization, 163; Inner Product, 86, 309; Interior Point Methods, 288; Irreducible Matrix, 420; ISOMAP, 394; ISTA, 246; Iterative Label Propagation, 442; Iterative Soft Thresholding Algorithm, 246
J: Jacobian, 170, 217, 467; Jordan Normal Form, 111
K: K-Means Algorithm, 197, 342; Katz Measure, 418; Kernel Feature Spaces, 383; Kernel K-Means, 395, 397; Kernel Methods, 122; Kernel PCA, 391; Kernel SVD, 384; Kernel SVM, 396, 398; Kernel Trick, 395, 397; Kuhn-Tucker Optimality Conditions, 274
L: L-BFGS, 237, 239, 251; Lagrangian Relaxation, 270; Laplacian, 426; Latent Components, 306; Latent Semantic Analysis, 323; Learning Rate Decay, 159; Learning Rate in Gradient Descent, 33, 146; Left Eigenvector, 108; Left Gram Matrix, 73; Left Inverse, 79; Left Null Space, 63; Leibniz formula, 100; Levenberg–Marquardt Algorithm, 251; libFM, 374; LIBLINEAR, 199; Linear Activation, 451; Linear Conjugate Gradient Method, 237; Linear Independence/Dependence, 53; Linear Kernel, 405; Linearly Additive Functions, 149; Linear Programming, 257; Linear Transform as Matrix Multiplication; Linear Transform Definition, 42; Line Search, 160; Link Prediction, 360; Local Minimum, 145; Logarithmic Barrier Function, 289; Loss Function, 142; Loss Functions, 28; Low-Rank Approximation, 308; Low-Rank Matrix Update, 19; Lower Triangular Matrix, 13; LSA, 323; LU Decomposition, 66, 119
M: Mahalanobis Distance, 329; Manhattan Norm; Markov Chain, 432; Matrix Calculus, 170; Matrix Decomposition, 339; Matrix Factorization, 299, 339; Matrix Inversion, 67; Matrix Inversion Lemma, 18; Maximum Margin Matrix Factorization, 364; Minimax Theorem, 272; Momentum-based Learning, 212; Moore-Penrose Pseudoinverse, 81, 179, 325; Multivariate Chain Rule, 175
N: Negative Semidefinite Matrix, 118; Newton Update, 218; Nilpotent Matrix, 14; Noise Removal with SVD, 324; Non-Differentiable Optimization, 239; Nonlinear Conjugate Gradient Method, 237; Nonnegative Matrix Factorization, 350; Nonsingular Matrix, 15; Norm; Normal Equation, 56, 80; Null Space, 63; Nyström Technique, 385
O: One-Sided Inverse, 79; Open Convex Set, 155; Orthogonal Complementary Subspace, 62; Orthogonal Matrix, 17; Orthogonal Vectors; Orthogonal Vector Spaces, 61; Orthonormal Vectors; Outdegree of a Vertex, 413; Outer Product, 10; Outlier Detection, 328; Overfitting, 166
P: Path, 413; PCA, 320; Permutation Matrix, 24; Perron-Frobenius Theorem, 421, 422; Polar Decomposition of Matrix, 303; Polynomial Kernel, 405; Polynomial of Matrix, 14; Positive Semidefinite Matrices, 117; Power Method, 133; Primal-Dual Constraints, 275; Primal-Dual Methods, 286; Principal Component Analysis, 123, 320; Principal Components, 321; Principal Components Regression, 327; Principal Eigenvector, 133, 421; Probabilistic Graphical Models, 477; Projected Gradient Descent, 256; Projection; Projection Matrix, 82, 114, 259; Proportional Odds Model, 368; Proximal Gradient Method, 244; Pseudoinverse, 179, 325; Push-Through Identity, 19
Q: QR Decomposition, 74; Quadratic Programming, 124, 130, 173, 257; Quasi-Newton Methods, 237
R: Ranking Support Vector Machines, 247; Recommender Systems, 346; Rectangular Diagonal Matrix, 13; Rectangular Matrix; Reduced Singular Value Decomposition, 307; Reducible Matrix, 420; Regressand, 30, 176; Regressors, 30, 176; ReLU Activation, 451; Representer Theorem, 400; Residual Matrix, 319; Response Variable, 30; Right Eigenvector, 108; Right Inverse, 79; Right Null Space, 63; Rigid Transformation, 48; RMSProp, 215; Rotreflection, 46; Row Echelon Form, 65; Row Space, 63
S: Saddle Point, 143; Saddle Points, 128, 230; Scatter Matrix, 123; Schur's Product Theorem, 405; Schur Decomposition, 112, 132; Schwarz Theorem, 152; Semi-Supervised Classification, 443; Separator, 181; Sequential Linear Programming, 266; Sequential Quadratic Programming, 267; Shear Matrix, 26; Shear Transform, 26; Sherman–Morrison–Woodbury Identity, 19; Shortest Path, 414; Sigmoid Activation, 451; Sigmoid Kernel, 405; Similarity Graph, 442; Similarity Transformation, 113; Simultaneously Diagonalizable, 115; Singular Matrix, 15; Singular Value Decomposition, 299; Span of Vector Set, 59; Sparse Matrix, 14; Spectral Decomposition, 134, 306; Spectral Theorem, 115; Square Matrix; Standard Basis, 55; Stationarity Condition, 275; Steepest Descent Direction, 146; Stochastic Gradient Descent, 164; Stochastic Transition Matrix, 415; Strict Convexity, 158; Strictly Triangular Matrix, 13; Strong Duality, 274; Strongly Connected Graph, 413; Subgradient Method, 240; Subgraph, 413; Subspace of Vector Space, 52; Subtangents, 240; Super-diagonal Entries of a Matrix, 112; SVD, 299; Sylvester's Inequality, 72; Symmetric Laplacian, 426; Symmetric Matrix, 12, 115
T: Tall Matrix; Taylor Expansion, 31, 217; Trace, 20, 113; Triangle Inequality; Triangular Matrix, 13; Triangular Matrix Inversion, 18; Truncated SVD, 307; Trust Region Method, 232; Tuning Hyperparameters, 168
U: Undirected Graph, 412; Unitary Matrix, 89; Univariate Optimization, 142; Upper Triangular Matrix, 13
V: Vector Space, 51, 87; Vertex Classification, 440; Von Mises Iterations, 133
W: Walk, 413; Weak Duality, 272; Weighted Graph, 412; Weston-Watkins SVM, 190; Whitening with PCA, 327; Wide Matrix; Widrow-Hoff Update, 183; Woodbury Identity, 19; Word2vec, 364