Artificial intelligence (AI) is the field of building intelligent machines and systems by means of computational models, techniques, and related technologies, enabling them to perform tasks that normally require human intelligence. Broadly speaking, it is a very wide discipline that draws on psychology, computer science, and engineering. Common examples of AI include self-driving cars, automatic translation software, virtual assistants on smartphones, and computer-controlled opponents in mobile games.
AI Foundations and Applications
8. Optimization of the learning process
Thien Huynh-The, HCM City University of Technology and Education
Jan 2023
Challenges in Deep Learning
Local minima
• The objective function of deep learning usually has many local minima
• The numerical solution obtained by the final iteration may only minimize the objective function locally, rather than globally, when the gradient of the objective function's solutions approaches or becomes zero
Challenges in Deep Learning
Vanishing gradient
• As more layers using certain activation functions are added to a neural network, the gradients of the loss function approach zero, making the network hard to train
• The simplest solution is to use other activation functions, such as ReLU, which does not produce a small derivative
• Residual networks are another solution, as they provide residual connections straight to earlier layers
Exploding gradient
• On the contrary, in some cases the gradients keep getting larger and larger as the backpropagation algorithm progresses. This, in turn, causes very large weight updates and makes gradient descent diverge. This is known as the exploding gradients problem (a small numerical illustration of the sigmoid vs. ReLU gradient behaviour follows below).
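The sigmoid derivative is at most 0.25, so the chain-rule product over many layers shrinks rapidly, while the ReLU derivative is exactly 1 for active units. A minimal NumPy sketch (illustration only, not part of the original slides; the layer depth and pre-activation values are assumed) makes this concrete:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth = 30
z = rng.normal(size=depth)                          # hypothetical pre-activation values, one per layer

sigmoid_factors = sigmoid(z) * (1 - sigmoid(z))     # each derivative factor is at most 0.25
relu_factors = (z > 0).astype(float)                # derivative is 1 for active units, 0 otherwise

# The backpropagated gradient is (roughly) scaled by the product of these factors.
print("product of sigmoid derivatives over 30 layers:",
      np.prod(sigmoid_factors))                     # essentially zero: vanishing gradient
print("product of ReLU derivatives (active units only):",
      np.prod(relu_factors[relu_factors > 0]))      # stays 1.0: no extra shrinking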
Challenges in Deep Learning
Which is better at preventing the vanishing gradient problem in a neural network with many activation layers: sigmoid or ReLU?
Challenges in Deep Learning
How to choose the right Activation Function?
A few guidelines to help you out:
• ReLU activation function should only be used in the hidden layers.
• Sigmoid/Logistic and Tanh functions should not be used in hidden layers as they make the model more susceptible to problems during training (due to vanishing gradients).
• Swish function is used in neural networks having a depth greater than 40 layers.
Choose the activation function for your output layer based on the type of prediction problem that you are solving (a small sketch of these output activations follows after the lists below):
• Regression: Linear activation function
• Binary classification: Sigmoid/Logistic activation function
• Multiclass classification: Softmax
• Multilabel classification: Sigmoid
The activation function used in hidden layers is typically chosen based on the type of neural network architecture.
• Convolutional Neural Network (CNN): ReLU activation function.
• Recurrent Neural Network: Tanh and/or Sigmoid activation function.
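As a quick illustration of the output-layer choices above, here is a small NumPy sketch (not from the slides; the function names and example logits are assumed):

import numpy as np

def linear(z):
    return z                                   # regression: raw, unbounded scores

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # binary / multilabel: independent probabilities in (0, 1)

def softmax(z):
    e = np.exp(z - np.max(z))                  # subtract the max for numerical stability
    return e / e.sum()                         # multiclass: probabilities that sum to 1

logits = np.array([2.0, -1.0, 0.5])            # hypothetical output-layer scores
print(softmax(logits))                         # multiclass prediction
print(sigmoid(logits))                         # multilabel prediction
print(linear(logits))                          # regression prediction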
Challenges in Deep Learning
Overfitting and underfitting
• Overfitting is a modeling error in statistics that occurs when a function is too closely aligned to a limited set of data points. As a result, the model is useful in reference only to its initial data set, and not to any other data sets
• The model fits the training data well but does not perform well on the test data
• Underfitting is a scenario in data science where a data model is unable to capture the relationship between the input and output variables accurately, generating a high error rate on both the training set and unseen data
Optimization schemes - Momentum
• The method of momentum is designed to accelerate learning
• The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction
[Figure: gradient descent trajectories with 2 variables, comparing learning rates 0.4 and 0.6]
Optimization schemes - Momentum
• Instead of using only the gradient of the current step to guide the search, momentum also accumulates the gradients of past steps to determine the direction to go
• The equations of gradient descent are revised as follows
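The equations themselves did not survive the slide extraction; the standard momentum update (notation assumed here: learning rate $\eta$, momentum coefficient $\gamma$, loss $L$, parameters $\theta$) is commonly written as:

% Gradient descent with momentum (standard formulation; notation assumed)
\[
\begin{aligned}
v_t &= \gamma\, v_{t-1} + \eta\, \nabla_{\theta} L(\theta_{t-1}) \\
\theta_t &= \theta_{t-1} - v_t
\end{aligned}
\]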
Adaptive Gradient Descent (Adagrad)
• Decay the learning rate for parameters in proportion to their update history
• Adapts the learning rate to the parameters, performing smaller updates (low learning rates) for parameters associated with frequently occurring features, and larger updates (high learning rates) for parameters associated with infrequent features
• It is well-suited for dealing with sparse data
• Adagrad greatly improves the robustness of SGD and has been used for training large-scale neural networks
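A minimal NumPy sketch of a single Adagrad step (illustration only, not from the slides; the function and variable names are assumed):

import numpy as np

def adagrad_update(theta, grad, accum, lr=0.01, eps=1e-8):
    """One Adagrad step: the accumulated squared gradient shrinks the step size
    for parameters that have been updated often."""
    accum += grad ** 2                               # per-parameter sum of squared gradients
    theta -= lr * grad / (np.sqrt(accum) + eps)      # per-parameter adaptive learning rate
    return theta, accum

theta = np.zeros(3)                                  # hypothetical parameters
accum = np.zeros(3)
grad = np.array([0.5, 0.0, -2.0])                    # hypothetical gradient
theta, accum = adagrad_update(theta, grad, accum)
print(theta)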
Root Mean Squared Propagation (RMSProp)
• Adapts the learning rate to the parameters
• Divides the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight
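In symbols, a common way to write the RMSProp update (notation assumed: decay rate $\rho$, learning rate $\eta$, gradient $g_t$, small constant $\epsilon$) is:

% RMSProp update (standard formulation; notation assumed)
\[
\begin{aligned}
s_t &= \rho\, s_{t-1} + (1 - \rho)\, g_t^{2} \\
\theta_t &= \theta_{t-1} - \frac{\eta}{\sqrt{s_t} + \epsilon}\, g_t
\end{aligned}
\]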
Adaptive Moment Estimation (Adam)
• Adam combines two stochastic gradient descent approaches: Adaptive Gradients (Adagrad) and Root Mean Square Propagation (RMSProp)
• Adam also keeps an exponentially decaying average of past gradients, similar to SGD with momentum
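A minimal NumPy sketch of one Adam step (illustration only, not from the slides; the hyperparameter values shown are common defaults and all names are assumed). An update written out by hand like this is in the spirit of the assignment at the end of the deck, which forbids built-in functions:

import numpy as np

def adam_update(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum-style first moment plus RMSProp-style second moment,
    both with bias correction."""
    m = beta1 * m + (1 - beta1) * grad               # decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2          # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)                     # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.zeros(3)                                  # hypothetical parameters
m, v = np.zeros(3), np.zeros(3)
for t in range(1, 4):                                # a few steps with a fixed hypothetical gradient
    grad = np.array([0.5, -1.0, 2.0])
    theta, m, v = adam_update(theta, grad, m, v, t)
print(theta)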
Dropout
• Helps avoid the overfitting problem
• Probabilistically dropping out nodes in the network is a simple and effective regularization method
• Dropout is implemented per-layer in a neural network
• A common value is a probability of 0.5 for retaining the output of each node in a hidden layer (see the short sketch below)
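A minimal NumPy sketch of inverted dropout applied to one hidden-layer activation (illustration only, not from the slides; the keep probability of 0.5 matches the value quoted above):

import numpy as np

def dropout(activations, keep_prob=0.5, training=True):
    """Randomly zero units during training and rescale the survivors so the
    expected activation is unchanged (inverted dropout)."""
    if not training:
        return activations                           # dropout is disabled at test time
    mask = (np.random.rand(*activations.shape) < keep_prob).astype(activations.dtype)
    return activations * mask / keep_prob

h = np.random.rand(4, 5)                             # hypothetical hidden-layer output
print(dropout(h, keep_prob=0.5))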
Assignment 2 (mandatory)
Design a multilayer neural network with an input layer, 02 hidden layers (sigmoid), and an output layer (softmax). Apply the optimization methods Momentum and Adam. Compare the accuracy and convergence time of the two methods. Assume that the MNIST dataset is used for training and testing the neural network. Important: the use of built-in functions is prohibited.
Students submit the Python code on Google Classroom.