A Tutorial on Deep Learning

Kai Yu, Multimedia Department, Baidu

Background

Classification models since the late 80's:
- Neural Networks
- Boosting
- Support Vector Machines
- Maximum Entropy
- ...

Since 2000, learning with structures:
- Kernel Learning
- Transfer Learning
- Semi-supervised Learning
- Manifold Learning
- Sparse Learning
- Matrix Factorization
- Structured Input-Output Prediction
- ...

Mining for Structure (a mission yet to be accomplished)
There has been a massive increase in both computational power and the amount of data available from the web, video cameras, and laboratory measurements: images & video, text & language, speech & audio, gene expression, product recommendation, relational data / social networks, climate change, geological data. These data are mostly unlabeled.
- Develop statistical models that can discover underlying structure, causes, or statistical correlations from data in an unsupervised or semi-supervised way.
- Multiple application domains.
(Slide courtesy: Russ Salakhutdinov)

The pipeline of machine visual perception
Low-level sensing -> Preprocessing -> Feature extraction -> Feature selection -> Inference (prediction, recognition)
Most machine-learning effort has gone into the final inference step, yet the feature stages are:
- most critical for accuracy
- responsible for most of the computation at test time
- the most time-consuming part of the development cycle
- often hand-crafted in practice

Computer vision features: SIFT, HoG, Spin image, RIFT, GLOH.
(Slide courtesy: Andrew Ng)

Learning features from data
Instead of hand-designing the feature extraction and selection stages, learn the features from data, so that machine learning spans the pipeline from low-level sensing all the way to inference.

Convolutional Neural Networks
A stack of alternating coding (convolution) and pooling stages; a minimal code sketch of one such stage appears at the end of this part, after the biological-motivation slides.
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. "Backpropagation applied to handwritten zip code recognition." Neural Computation, 1989.

"Winter of Neural Networks" since the 90's:
- non-convex optimization
- many tricks needed to make training work
- hard to analyze theoretically

Layer-by-Layer Supervised Training
- easy to add many error terms to the loss function
- joint learning of related tasks yields better representations
Example architecture: Collobert et al., "NLP (almost) from scratch", JMLR 2011.
(Slide courtesy: Marc'Aurelio Ranzato)

Biological & Theoretical Justification: Why Hierarchy?
- Theoretical: "...well-known depth-breadth tradeoff in circuit design [Hastad 1987]. This suggests many functions can be much more efficiently represented with deeper architectures..." [Bengio & LeCun 2007]
- Biological: the visual cortex is hierarchical (the Hubel-Wiesel [Thorpe] model).

A sparse DBN trained on face images learns a feature hierarchy: pixels -> edges -> object parts (combinations of edges) -> object models. [Lee, Grosse, Ranganath & Ng, 2009]

Deep architecture in the brain: retina (pixels) -> area V1 (edge detectors) -> area V2 (primitive shape detectors) -> area V4 (higher-level visual abstractions).

Sensor representation in the brain: the auditory cortex learns to see when visual input is rewired to it, and "seeing with your tongue" is possible; the same rewiring process also works for touch (the somatosensory cortex).
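To make the "coding + pooling" idea from the Convolutional Neural Networks slide concrete, here is a minimal NumPy sketch of one such stage. It is only an illustration: the filters are random rather than trained by backpropagation, tanh and max pooling stand in for whatever nonlinearity and subsampling a particular network uses, and the names (conv2d_valid, max_pool, stage) and the 28x28 input size are invented for this example, not taken from the tutorial.

import numpy as np

def conv2d_valid(image, kernel):
    # Naive 'valid' 2-D convolution (implemented as cross-correlation) of a single-channel image.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    # Non-overlapping max pooling over size x size windows.
    h, w = fmap.shape
    h, w = h - h % size, w - w % size                          # crop to a multiple of the window size
    blocks = fmap[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

def stage(image, filters):
    # One "coding + pooling" stage: filter bank -> nonlinearity -> spatial pooling.
    maps = [np.tanh(conv2d_valid(image, f)) for f in filters]  # coding
    return [max_pool(m) for m in maps]                         # pooling

rng = np.random.default_rng(0)
img = rng.standard_normal((28, 28))                 # stand-in for a 28x28 input image
filters1 = 0.1 * rng.standard_normal((4, 5, 5))     # 4 random (untrained) 5x5 filters
feature_maps = stage(img, filters1)
print([m.shape for m in feature_maps])              # four 12x12 pooled feature maps

For a 28x28 input, 5x5 filters, and 2x2 pooling this prints four 12x12 feature maps; a real network stacks several such stages and learns the filters by backpropagation, as in the zip-code paper cited above.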
Human echolocation (sonar) is a further example of how flexibly the brain builds sensor representations. [Roe et al., 1992; BrainPort; Welsh & Blasch, 1997]

Large-scale training: the challenge
A large-scale problem has:
- lots of training samples (>10M)
- lots of classes (>10K)
- lots of input dimensions (>10K)
The difficulties:
- the best optimizer in practice is online SGD, which is naturally sequential and hard to parallelize
- layers cannot be trained independently and in parallel, so training is hard to distribute
- the model can have so many parameters that they clog the network, making it hard to distribute across machines
(Slide courtesy: Marc'Aurelio Ranzato)

A solution: model parallelism plus data parallelism
The model is partitioned across machines (model parallelism: the 1st, 2nd, and 3rd machine each hold part of the network), and several such model replicas process different inputs (#1, #2, #3) at the same time (data parallelism).
Le et al., "Building high-level features using large-scale unsupervised learning", ICML 2012.
(Slides courtesy: Marc'Aurelio Ranzato)

Asynchronous SGD
Each model replica trains on its own shard of the data and exchanges parameters with a central parameter server: a replica pushes its gradient ∂L/∂θ to the server and pulls the freshly updated parameters θ back, without waiting for the other replicas. A toy code sketch appears after the closing slide below.
(Slides courtesy: Marc'Aurelio Ranzato)

Unsupervised learning with 1B parameters
Deep net:
- three stages, each consisting of local filtering, L2 pooling, and local contrast normalization (LCN)
  - 18x18 filters
  - a bank of filters at each location
  - L2 pooling and LCN over 5x5 neighborhoods
- the three layers are trained jointly by
  - reconstructing the input of each layer
  - enforcing sparsity on the code
(Slide courtesy: Marc'Aurelio Ranzato)

Thank you.
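As a supplement to the large-scale training slides above, the following is a toy, single-process sketch of asynchronous SGD with a parameter server. Threads stand in for the model replicas, a linear least-squares problem stands in for the deep network, and all names (ParameterServer, replica, w_true) and hyperparameters are invented for the illustration; none of this comes from the tutorial or from any particular distributed system. The three replicas working on their own data shards also illustrate data parallelism; model parallelism (splitting a single model across machines) is not shown.

import threading
import numpy as np

# Toy problem: recover w_true by linear least squares; it stands in for a deep net's parameters.
DIM = 10
rng = np.random.default_rng(0)
w_true = rng.standard_normal(DIM)

class ParameterServer:
    # Holds the shared parameters; replicas push gradients and pull fresh weights.
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):
        with self.lock:
            self.w -= self.lr * grad          # apply each update as soon as it arrives

def replica(server, n_steps=200, batch=32):
    # One model replica: trains on its own data shard and never waits for the others.
    local_rng = np.random.default_rng()
    for _ in range(n_steps):
        X = local_rng.standard_normal((batch, DIM))            # this replica's mini-batch
        y = X @ w_true + 0.1 * local_rng.standard_normal(batch)
        w = server.pull()                                      # parameters may be slightly stale
        grad = (2.0 / batch) * X.T @ (X @ w - y)               # dL/dw for mean squared error
        server.push(grad)                                      # asynchronous update

server = ParameterServer(DIM)
replicas = [threading.Thread(target=replica, args=(server,)) for _ in range(3)]
for t in replicas:
    t.start()
for t in replicas:
    t.join()
print("distance to w_true:", np.linalg.norm(server.pull() - w_true))

Because updates are applied the moment they arrive, a replica may compute its gradient against slightly stale parameters; asynchronous SGD accepts this staleness in exchange for removing all synchronization between replicas.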
