Deep Learning Theory
Yoshua Bengio
April 15, 2015, London & Paris ML Meetup

Breakthrough
• Deep Learning: machine learning algorithms based on learning multiple levels of representation / abstraction.
• Amazing improvements in error rate in object recognition, object detection, speech recognition, and more recently some in machine translation.

Ongoing Progress: Natural Language Understanding
• Recurrent nets generating credible sentences, even better if conditioned:
• Machine translation
• Image to text (Xu et al., to appear at ICML 2015)

Why is Deep Learning Working so Well?

Machine Learning, AI & No Free Lunch
• Three key ingredients for ML towards AI:
1. Lots & lots of data
2. Very flexible models
3. Powerful priors that can defeat the curse of dimensionality

Ultimate Goals
• AI
• Needs knowledge
• Needs learning (involves priors + optimization/search)
• Needs generalization (guessing where probability mass concentrates)
• Needs ways to fight the curse of dimensionality (exponentially many configurations of the variables to consider)
• Needs disentangling the underlying explanatory factors (making sense of the data)

ML 101. What We Are Fighting Against: The Curse of Dimensionality
• To generalize locally, we need representative examples for all relevant variations!
• Classical solution: hope for a smooth enough target function, or make it smooth by handcrafting good features / kernels.

Not Dimensionality so much as Number of Variations (Bengio, Delalleau & Le Roux 2007)
• Theorem: Gaussian kernel machines need at least k examples to learn a function that has 2k zero-crossings along some line.
• Theorem: For a Gaussian kernel machine to learn some maximally varying functions over d inputs, O(2^d) examples are required.

Putting Probability Mass where Structure is Plausible
• Empirical distribution: mass at training examples.
• Smoothness: spread mass around; insufficient on its own.
• Guess some "structure" and generalize accordingly.

Bypassing the Curse of Dimensionality
• We need to build compositionality into our ML models, just as human languages exploit compositionality to give representations and meanings to complex ideas.
• Exploiting compositionality gives an exponential gain in representational power.
• Distributed representations / embeddings: feature learning.
• Deep architecture: multiple levels of feature learning.
• Prior: compositionality is useful to describe the world around us efficiently.

Manifold Learning = Representation Learning
• [Figure: data on a curved manifold, with tangent directions spanning the tangent plane at a point.]

Non-Parametric Manifold Learning: Hopeless Without Powerful Enough Priors
• Manifolds estimated out of the neighborhood graph: node = example, arc = near neighbor.
• AI-related data manifolds have too many twists and turns, and not enough examples to cover all the ups & downs & twists.

Auto-Encoders Learn Salient Variations, Like a Non-Linear PCA
• Minimizing reconstruction error forces the model to keep variations along the manifold.
• The regularizer wants to throw away all variations.
• With both: keep ONLY sensitivity to variations ON the manifold.
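For a concrete, toy picture of the trade-off on the slide above, the sketch below trains a one-hidden-layer auto-encoder whose regularizer is a contractive penalty on the encoder Jacobian. The slides do not commit to a particular regularizer, so this choice, the use of PyTorch, the layer sizes and the synthetic data are all illustrative assumptions.

```python
# Sketch only: a regularized auto-encoder in the spirit of the slide above.
# The reconstruction term keeps variations along the data manifold, while a
# contractive penalty ||dh(x)/dx||_F^2 tries to remove all sensitivity;
# together, the encoder stays sensitive mainly to on-manifold directions.
# PyTorch, layer sizes, and the toy data are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_hid, lam = 20, 8, 0.05

enc = nn.Linear(d_in, d_hid)
dec = nn.Linear(d_hid, d_in)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

# Toy data concentrated near a 3-dimensional (here linear) manifold plus noise.
basis = torch.randn(3, d_in)
x = torch.randn(512, 3) @ basis + 0.01 * torch.randn(512, d_in)

for step in range(2000):
    h = torch.sigmoid(enc(x))                      # encoder h(x)
    r = dec(h)                                     # reconstruction r(x)
    recon = ((r - x) ** 2).sum(dim=1).mean()
    # For a sigmoid layer the encoder Jacobian is diag(h * (1 - h)) @ W, so
    # ||J||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2  (per example).
    w_row_sq = (enc.weight ** 2).sum(dim=1)        # shape: (d_hid,)
    contract = (((h * (1 - h)) ** 2) * w_row_sq).sum(dim=1).mean()
    loss = recon + lam * contract
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, the encoder's Jacobian remains large only along directions needed to reconstruct the data, which is the "keep only sensitivity to variations on the manifold" behaviour the slide describes.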
Denoising Auto-Encoder
• Learns a vector field pointing towards higher-probability configurations (Alain & Bengio 2013): reconstruction(x) − x ∝ σ² ∂ log p(x) / ∂x
• Some DAEs correspond to a kind of Gaussian RBM with regularized score matching (Vincent 2011) [equivalent as the corruption noise → 0].
• Prior: examples concentrate near a lower-dimensional "manifold", and the corrupted input is mapped back towards it.

Regularized Auto-Encoders Learn a Vector Field that Estimates a Gradient Field (Alain & Bengio, ICLR 2013)

Denoising Auto-Encoder Markov Chain
• Alternate corruption and denoising: Xt → X̃t (corrupt), X̃t → Xt+1 (denoise), then Xt+1 → X̃t+1 → Xt+2 → X̃t+2, and so on.

Denoising Auto-Encoders Learn a Markov Chain Transition Distribution (Bengio et al., NIPS 2013)

Generative Stochastic Networks (GSN) (Bengio et al., ICML 2014; Alain et al., arXiv 2015)
• Recurrent parametrized stochastic computational graph that defines a transition operator for a Markov chain whose asymptotic distribution is implicitly estimated by the model.
• Noise injected in input and hidden layers.
• Trained to maximize the reconstruction probability of the example at each step.
• Example structure inspired from the DBM Gibbs chain, unrolled for a few (up to 5) steps. [Figure: weights W1, W2, W3 shared across the unrolled steps, noise injected at x0 and at hidden layers h1, h2, h3, with a reconstruction target at each sampled x1, x2, x3.]

Space-Filling in Representation-Space
• Deeper representations → abstractions → disentangling.
• Manifolds are expanded and flattened.
• [Figure: the 3's and 9's manifolds in pixel space (X-space) versus representation space (H-space); linear interpolation between digits in pixel space, at layer 1, and at layer 2.]

Extracting Structure by Gradual Disentangling and Manifold Unfolding (Bengio 2014, arXiv 1407.7906)
• Each level transforms the data into a representation in which it is easier to model, unfolding it more, contracting the noise dimensions and mapping the signal dimensions to a factorized (uniform-like) distribution.
• min KL(Q(x, h) || P(x, h)) for each intermediate level h.
• [Figure: a ladder of encoders f1, f2, …, fL and decoders g1, g2, …, gL relating the inference side Q(x), Q(h1|x), Q(h1), Q(h2|h1), …, Q(hL) to the generative side P(x|h1), P(h1), P(h2|h1), …, P(hL), with noise entering at the deepest level and signal extracted through the levels.]

DRAW: the Latest Variant of the Variational Auto-Encoder
"DRAW: A Recurrent Neural Network For Image Generation" (Gregor et al., Google DeepMind, arXiv 1502.04623, 2015)
• Even for a static input, the encoder and decoder are now recurrent nets, which gradually add elements to the answer and use an attention mechanism to choose where to do so.
• [Figure: a conventional variational auto-encoder (feed-forward encoder and decoder; during generation a sample z is drawn from the prior P(z) and passed through the decoder to compute P(x|z)) next to the DRAW architecture, where recurrent encoder and decoder exchange a canvas ct over time and latents z1, …, zT are sampled sequentially.]
• [Paper excerpt shown on the slide: unlike sequential attention models whose main challenge is learning where to look, often addressed with reinforcement-learning techniques (Mnih et al., 2014), DRAW's attention mechanism is fully differentiable and trainable with standard backpropagation, in the spirit of the selective read and write operations of the Neural Turing Machine (Graves et al., 2014).]
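Since DRAW is introduced as a variant of the variational auto-encoder, and the criterion min KL(Q(x, h) || P(x, h)) above reduces, up to the constant entropy of the data, to the familiar VAE loss of reconstruction error plus KL(Q(h|x) || P(h)) for a single level h, a minimal single-step VAE sketch may help fix ideas before the DRAW results. PyTorch, the Gaussian encoder and prior, the Bernoulli decoder and all sizes are illustrative assumptions; this is not the DRAW model.

```python
# Sketch only: a one-level variational auto-encoder, the non-recurrent,
# non-attentive model that DRAW extends. Minimizing
#   KL(Q(x,h) || P(x,h)) = const + E_x[ KL(Q(h|x)||P(h)) - E_{Q(h|x)} log P(x|h) ]
# over the decoder P and encoder Q(h|x) gives the usual VAE loss below.
# PyTorch and all sizes are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_x, d_h = 784, 20

enc = nn.Sequential(nn.Linear(d_x, 200), nn.Tanh(), nn.Linear(200, 2 * d_h))
dec = nn.Sequential(nn.Linear(d_h, 200), nn.Tanh(), nn.Linear(200, d_x))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

def vae_loss(x):
    mu, log_var = enc(x).chunk(2, dim=1)           # Q(h|x) = N(mu, diag(exp(log_var)))
    h = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterization
    x_logits = dec(h)                               # P(x|h) as independent Bernoullis
    recon = nn.functional.binary_cross_entropy_with_logits(
        x_logits, x, reduction="none").sum(dim=1)   # -log P(x|h)
    kl = 0.5 * (mu ** 2 + log_var.exp() - log_var - 1).sum(dim=1)  # KL(Q(h|x)||N(0,I))
    return (recon + kl).mean()

# Toy "data": random binary vectors standing in for binarized MNIST images.
x = (torch.rand(256, d_x) < 0.3).float()
for step in range(200):
    loss = vae_loss(x)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

DRAW replaces the feed-forward encoder and decoder with recurrent networks that exchange a canvas over several steps and read/write through a differentiable attention window, but its loss keeps the same reconstruction-plus-KL form, with the KL term summed over time steps.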
• [Figure shown on the slides: a trained DRAW network generating MNIST digits; each row shows successive stages in the generation of a single digit, the lines composing the digits appearing to be "drawn" by the network, with a red rectangle delimiting the area attended to at each step.]

DRAW experimental hyper-parameters (Gregor et al., 2015):

Task           #glimpses   LSTM #h   #z
MNIST Model        64         256    100
SVHN Model         32         800    100
CIFAR Model        64         400    200
(100 × 100 MNIST classification model: LSTM #h = 256)

Samples of SVHN Images: the DRAW Drawing Process
• [Figure: successive canvases as DRAW draws SVHN house-number images.]

Samples of SVHN Images: Generated Samples vs. Training Nearest Neighbor
• [Figure: generated SVHN images; the rightmost column shows the nearest training example for the last column of samples.]

Conclusions
• Distributed representations: a prior that can buy an exponential gain in generalization.
• Deep composition of non-linearities: a prior that can buy an exponential gain in generalization.
• Both yield non-local generalization.
• Strong evidence that local minima are not an issue (saddle points are).
• Auto-encoders capture the data-generating distribution:
• Gradient of the energy
• Markov chain generating an estimator of the data-generating distribution
• Can be generalized to deep generative models

MILA: Montreal Institute for Learning Algorithms
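As a concrete rendering of the concluding point that an auto-encoder's Markov chain estimates the data-generating distribution, the sketch below trains a small denoising auto-encoder and then runs the corrupt-then-denoise chain from the earlier slides. PyTorch, the noise level, the toy data, and the use of a deterministic reconstruction (instead of sampling from the reconstruction distribution) are illustrative simplifications, not the setup from the talk.

```python
# Sketch only: the denoising auto-encoder Markov chain from the slides,
#   X_t --corrupt--> X~_t --denoise--> X_{t+1} --corrupt--> ...
# whose stationary distribution estimates the data-generating distribution.
# PyTorch, sizes, noise level and data are illustrative assumptions, and a
# deterministic reconstruction stands in for sampling from P(X | X~).
import torch
import torch.nn as nn

torch.manual_seed(0)
d, sigma = 10, 0.3

# Toy data concentrated on a 2-D linear manifold.
basis = torch.randn(2, d)
data = torch.randn(2048, 2) @ basis

# Train a small denoising auto-encoder: reconstruct clean x from x + noise.
dae = nn.Sequential(nn.Linear(d, 64), nn.Tanh(), nn.Linear(64, d))
opt = torch.optim.Adam(dae.parameters(), lr=1e-3)
for step in range(3000):
    x = data[torch.randint(0, len(data), (128,))]
    x_tilde = x + sigma * torch.randn_like(x)        # corrupt
    loss = ((dae(x_tilde) - x) ** 2).mean()          # denoise towards clean x
    opt.zero_grad()
    loss.backward()
    opt.step()

# Run the Markov chain: alternately corrupt and denoise, starting from noise.
with torch.no_grad():
    x_t = torch.randn(1, d)
    samples = []
    for t in range(500):
        x_tilde = x_t + sigma * torch.randn_like(x_t)   # corrupt X_t -> X~_t
        x_t = dae(x_tilde)                              # denoise X~_t -> X_{t+1}
        samples.append(x_t.clone())
    # After burn-in, the samples should concentrate near the data manifold.
```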