Neural network lecture (Thân Quang Khoát)


Introduction to Machine Learning and Data Mining
Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology, 2020

Outline
- Introduction to Machine Learning & Data Mining
- Unsupervised learning
- Supervised learning
- Artificial neural network
- Practical advice

Artificial neural network: introduction (1)
- An artificial neural network (ANN) simulates biological neural systems (the human brain).
- An ANN is a structure/network made of interconnected artificial neurons.
- A neuron:
  - has inputs and an output;
  - executes a local calculation (a local function).
- The output of a neuron is characterized by:
  - its input/output characteristics;
  - its connections to other neurons;
  - (possibly) other inputs.

Artificial neural network: introduction (2)
- An ANN can be thought of as a highly decentralized and parallel information-processing structure.
- An ANN has the ability to learn, recall, and generalize from training data.
- The ability of an ANN depends on:
  - the topology of the network;
  - the input/output characteristics of its neurons;
  - the learning algorithm;
  - the training data.

ANN: a huge breakthrough
- AlphaGo (Google) beat the world champion at Go in March 2016.
  - Go is a 2500-year-old game and one of the most complex games.
  - AlphaGo learned from 30 million human moves, and plays against itself to discover new moves.
  - It beat Lee Sedol (the world champion).
- http://www.wired.com/2016/03/two-moves-alphago-lee-sedol-redefined-future/
- http://www.nature.com/news/google-ai-algorithm-masters-ancient-game-of-go-1.19234

Structure of a neuron
- Input signals $\{x_i, i = 1 \dots m\}$; each input signal $x_i$ is associated with a weight $w_i$.
- Bias $w_0$ (with $x_0 = 1$).
- The net input is a combination of the input signals: $Net(\mathbf{w}, \mathbf{x})$.
- An activation/transfer function $f$ computes the output of the neuron.
- Output: $Out = f(Net(\mathbf{w}, \mathbf{x}))$.
[Figure: a neuron with inputs $x_0 = 1, x_1, \dots, x_m$, weights $w_0, \dots, w_m$, a summation producing the net input (Net), an activation/transfer function $f$, and the output (Out).]

Net input
- The net input is usually calculated by a linear function:
$$Net = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_m x_m = w_0 + \sum_{i=1}^{m} w_i x_i = \sum_{i=0}^{m} w_i x_i$$
- Role of the bias: $Net = w_1 x_1$ may not separate the classes well, whereas $Net = w_1 x_1 + w_0$ can do better.
[Figure: two plots of Net versus $x_1$, one for $Net = w_1 x_1$ and one for $Net = w_1 x_1 + w_0$.]

Activation function: hard-limiter
- Also known as a threshold function:
$$Out(Net) = HL(Net, \theta) = \begin{cases} 1 & \text{if } Net \ge \theta \\ 0 & \text{otherwise} \end{cases}$$
- The output takes one of two values; the bipolar variant is
$$Out(Net) = HL2(Net, \theta) = \mathrm{sign}(Net - \theta)$$
- $\theta$ is the threshold value.
- Disadvantages: discontinuous, non-smooth.
[Figure: the binary hard-limiter (outputs 0/1) and the bipolar hard-limiter (outputs -1/1).]

Activation function: threshold logic
- Also known as a saturating linear function; a combination of the linear and hard-limit activation functions:
$$Out(Net) = tl(Net, \alpha, \theta) = \begin{cases} 0 & \text{if } Net < -\theta \\ \alpha (Net + \theta) & \text{if } -\theta \le Net \le \frac{1}{\alpha} - \theta \\ 1 & \text{if } Net > \frac{1}{\alpha} - \theta \end{cases} \quad (\alpha > 0)$$
$$= \max(0, \min(1, \alpha (Net + \theta)))$$
- $\alpha$ determines the slope of the linear range.
- Disadvantage: continuous but non-smooth.
[Figure: the threshold-logic function, rising linearly from 0 at $Net = -\theta$ to 1 at $Net = \frac{1}{\alpha} - \theta$.]

Activation function: sigmoid
$$Out(Net) = sf(Net, \alpha, \theta) = \frac{1}{1 + e^{-\alpha (Net + \theta)}}$$
- Popular; the parameter $\alpha$ determines the slope.
- Output in the range (0, 1).
- Advantages: continuous and smooth; the gradient of a sigmoid function can be expressed as a function of the sigmoid itself.
[Figure: the sigmoid function, with $Out = 0.5$ at $Net = -\theta$.]
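To make the formulas above concrete, here is a minimal NumPy sketch of a single neuron with the three activation functions; it is not from the slides, and the function names, weights, and the $\alpha$, $\theta$ values are illustrative.

```python
import numpy as np

def net_input(w, x):
    """Net = w_0 + sum_i w_i * x_i, with the bias w_0 stored in w[0] (x_0 = 1)."""
    return w[0] + np.dot(w[1:], x)

def hard_limiter(net, theta):
    """Binary hard-limiter: 1 if Net >= theta, else 0."""
    return 1.0 if net >= theta else 0.0

def threshold_logic(net, alpha, theta):
    """Saturating linear function: max(0, min(1, alpha * (Net + theta))), alpha > 0."""
    return max(0.0, min(1.0, alpha * (net + theta)))

def sigmoid(net, alpha, theta):
    """Sigmoid: 1 / (1 + exp(-alpha * (Net + theta)))."""
    return 1.0 / (1.0 + np.exp(-alpha * (net + theta)))

def sigmoid_grad(net, alpha, theta):
    """The sigmoid's gradient is a function of its own output:
    d Out / d Net = alpha * Out * (1 - Out)."""
    out = sigmoid(net, alpha, theta)
    return alpha * out * (1.0 - out)

# Example: one neuron with 3 inputs (values are illustrative)
w = np.array([0.5, -0.2, 0.8, 0.1])   # w[0] is the bias w_0
x = np.array([1.0, 2.0, -1.0])
net = net_input(w, x)
print(sigmoid(net, alpha=1.0, theta=0.0), sigmoid_grad(net, alpha=1.0, theta=0.0))
```

The last function shows the advantage noted on the slide: the sigmoid's gradient, $\alpha \cdot Out \cdot (1 - Out)$, is computed from the output already produced in the forward pass, which is exactly what back-propagation exploits.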
BP algorithm: Update weights (steps 1-6)
[Figure: the network used in these steps has inputs $x_1, x_2$, first-hidden-layer neurons 1-3, second-hidden-layer neurons 4-5, and output neuron 6 producing $Out_6$; each step highlights the incoming weights of one neuron. A training sketch follows the momentum slide below.]
- Neurons 1-3 (first hidden layer, inputs $x_1, x_2$): for $j = 1, 2, 3$,
$$w_{j,x_1} \leftarrow w_{j,x_1} + \eta \delta_j x_1, \qquad w_{j,x_2} \leftarrow w_{j,x_2} + \eta \delta_j x_2$$
- Neurons 4-5 (second hidden layer, inputs $Out_1, Out_2, Out_3$): for $k = 1, 2, 3$,
$$w_{4k} \leftarrow w_{4k} + \eta \delta_4 Out_k, \qquad w_{5k} \leftarrow w_{5k} + \eta \delta_5 Out_k$$
- Neuron 6 (output, inputs $Out_4, Out_5$):
$$w_{64} \leftarrow w_{64} + \eta \delta_6 Out_4, \qquad w_{65} \leftarrow w_{65} + \eta \delta_6 Out_5$$

BP algorithm: Initialize weights
- Normally, the weights are initialized with small random values.
- If the weights have large initial values, the sigmoid functions reach saturation soon and the system gets stuck at a saddle/stationary point.

BP algorithm: Learning rate
- The learning rate has an important effect on the efficiency and convergence of the BP algorithm.
- A large value of $\eta$ can accelerate convergence, but may cause the system to overshoot the global optimum or settle at bad points (saddle points).
- A small value of $\eta$ can make the learning process take a long time.
- $\eta$ is often selected experimentally for each problem; good values at the beginning of the learning process may not be good at a later time.
- Consider using an adaptive (dynamic) learning rate.

BP algorithm: Momentum
- Gradient descent can be very slow if $\eta$ is small, and can fluctuate greatly if $\eta$ is too large.
- To reduce the fluctuations, add a momentum component:
$$\Delta w(t) = -\eta \nabla E(t) + \alpha \Delta w(t-1)$$
where $\alpha \in [0, 1]$ is the momentum parameter (usually set to 0.9).
- Based on experience, choose the learning rate and momentum so that $\eta + \alpha \gtrsim 1$, with $\alpha > \eta$, to avoid fluctuations.
[Figure: gradient descent on a simple quadratic error function; the left trajectory does not use momentum, the right trajectory does.]
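Here is a minimal sketch of the whole training step described above, assuming squared error $E = \frac{1}{2}(d - Out)^2$, plain sigmoid units ($\alpha = 1$, $\theta = 0$), and the 2-3-2-1 topology of the figures. The learning rate, momentum value, initialization scale, and the choice to keep biases in separate vectors are illustrative, not prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The 2-3-2-1 topology of the figures: inputs x1, x2 -> neurons 1-3 -> neurons 4-5 -> neuron 6.
sizes = [2, 3, 2, 1]
# Small random initial weights, as the slides recommend (the 0.1 scale is illustrative).
W = [rng.normal(0.0, 0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]
dW_prev = [np.zeros_like(w) for w in W]   # Delta w(t-1), for the momentum term
db_prev = [np.zeros_like(v) for v in b]

eta, alpha = 0.5, 0.9                     # learning rate and momentum (illustrative values)

def train_step(x, d):
    # Forward phase: compute Out at every layer.
    outs = [x]
    for w, bias in zip(W, b):
        outs.append(sigmoid(w @ outs[-1] + bias))
    # Backward phase: delta at the output for E = 0.5 * (d - Out)^2 with sigmoid units,
    # then back-propagate delta through the hidden layers.
    delta = (d - outs[-1]) * outs[-1] * (1.0 - outs[-1])
    for l in reversed(range(len(W))):
        # Delta w(t) = eta * delta * Out_prev + alpha * Delta w(t-1)
        dW = eta * np.outer(delta, outs[l]) + alpha * dW_prev[l]
        db = eta * delta + alpha * db_prev[l]
        if l > 0:  # compute the previous layer's delta before W[l] is changed
            delta = (W[l].T @ delta) * outs[l] * (1.0 - outs[l])
        W[l] += dW
        b[l] += db
        dW_prev[l], db_prev[l] = dW, db

# A few steps on a single (x, d) pair, just to show the updates in action.
for _ in range(10):
    train_step(np.array([1.0, 0.0]), np.array([1.0]))
```

Each weight update is the slide's rule $w \leftarrow w + \eta \delta \cdot Out_{prev}$ augmented with the momentum term $\alpha \Delta w(t-1)$; since $\eta \delta \cdot Out_{prev} = -\eta \nabla E$ for this error function, the code implements $\Delta w(t) = -\eta \nabla E(t) + \alpha \Delta w(t-1)$ exactly as stated above.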
BP algorithm: Number of neurons
- The size (number of neurons) of the hidden layer is an important question when applying multi-layer neural networks to practical problems.
- In practice, it is difficult to identify the exact number of neurons needed to achieve a desired accuracy.
- The size of the hidden layer is usually determined through experiments (trial and error).

ANN: Learning limit
- Boolean functions: any Boolean function can be learned (approximated well) by an ANN with one hidden layer.
- Continuous functions: any bounded continuous function can be approximated by an ANN with one hidden layer [Cybenko, 1989; Hornik et al., 1991].

ANN: advantages, disadvantages
- Advantages:
  - Its nature (structure) supports highly parallel computation.
  - It achieves high accuracy in many problems (images, video, audio, text).
  - It is very flexible in network architecture.
- Disadvantages:
  - There are no general rules for determining the network architecture and the optimal parameters for a given problem.
  - There is no general method for assessing an ANN's inner workings, so an ANN system is viewed as a "black box".
  - It is difficult (or impossible) to give explanations to the user.
  - There is little fundamental theory to explain the practical successes.

ANN: When?
- The form of the target function is not known in advance.
- It is not necessary (or not important) to provide the user with an explanation of the results.
- A long training time is acceptable.
- A large number of labels can be collected for the data.
- Domains related to images, video, speech, and text.

Open libraries
[Slide listing open-source libraries; the content was not recovered in extraction.]

References
- Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function". Mathematics of Control, Signals, and Systems, 2(4), 303-314.
- Hornik, K. (1991). "Approximation capabilities of multilayer feedforward networks". Neural Networks, 4(2), 251-257.
