VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE
Dr NGUYEN VINH TIEP
HO CHI MINH CITY, 2024
LIST OF THESIS DEFENSE COMMITTEES
Thesis Defense Committee, established according to the decision of the Rector of the University of Information Technology:

1. Chairman: PhD. Duong Viet Hang
2. Secretary: MSc. Cap Pham Dinh Thang
3. Members: MSc. Do Van Tien
I wish to express my gratitude to my esteemed university for providing a robust theoretical foundation, facilitating a swift assimilation of new knowledge. The university's unwavering support and comprehensive resources have been instrumental in enhancing my comprehension and application of intricate concepts.

Furthermore, I extend my appreciation to the MMLab laboratory for fostering an academic environment conducive to research and learning. The laboratory's ample resources have significantly contributed to an enriching scholarly experience, enabling me to delve deeper into my studies.

A heartfelt acknowledgment is reserved for my supervisor, Dr. Nguyen Vinh Tiep, whose guidance and support have been indispensable throughout the thesis endeavor. His mentorship has not only directed my research but has also served as a source of motivation. I am grateful for his encouragement and insightful feedback. Additionally, the financial support provided has allowed me to dedicate myself fully to the research process.

In recognizing the collective contributions of my university, the MMLab laboratory, and my teacher, I express my deep appreciation for their pivotal roles in the successful culmination of this thesis.
Currently, the achievements of Generative AI have become popular and powerful. They can not only answer user questions about various aspects of an image but also perform edits, creating unique images based on user requests. Chatbots like GPT-4 Vision, BingChat, or Bard can provide excellent feedback when users inquire about different aspects of an image. Models like DALL-E, Midjourney, and Parti can generate and edit images according to specific requests.

A gap persists in integrating image commentary and editing functionalities into a single efficient tool. Users often resort to multiple tools for varied needs, resulting in time-consuming and cumbersome workflows. Existing models for image feedback are generally trained for general applications, yielding overly broad responses that lack specificity in aesthetic evaluation.

This study aims to explore Multimodal models, which can process and synthesize information from diverse sources like images, text, and audio. The focus is on developing a Chatbot that simultaneously handles images and questions, proficient in both editing and providing insightful commentary. Specializing in detailed aesthetic commentary on elements like lighting, color, composition, and depth, the Chatbot executes localized edits. Designed to interpret and respond to natural language instructions, it enhances the user experience in digital image interaction.
Trang 516 ‘Thesis struciire) Ƒ ge 9 (0Ô ff ⁄ 8
2 Background and Related Work 10
Trang 64
2.1.2.6 ‘Training-efficient technquel
2.2 Research on Mulimodal
2.3 End-to-end training with Large Language Model]
2.3.1 Image Assessment Modelsl
2.3.1.1 Jointly Training with Image and Text]
2.3.1.2 Learned Image Embedding as (Frozen) LM Prefix] .
2.3.1.3 Text-lmage Cross-Attention Fuse Mechanisms]
2.3.2 Local Region Image Editing Modell
2.4 Chaining tools with Large Language Model]
2.5 Deep Learning in Aesthetic Àssessmenl|
2.5.1 Data Survey] 2.5.2 Architectural Survey} 2 ee ee Proposed Method 3.1 Proposal for Aesthetic Àssessmenl|
3.1.1 Proposal for data lnmmtallonsl
3.1.2 Proposal for resource limitatlon|l
3.2 Local Image Editing Pipelnel ốc 3.2.1 Model selection 3.2.2 Open-set Object detector), 2 2.0 2.000002 0 00] 3.2.3 Finding mask for detected obJects|
3.2.4 Complete photo editing pipelmel
3 End-to-end Chatbot] Experiment 4.1 Create Instruction-following dataset]
4.2 Fine tune LLaVaj
4.2.1 Training process
vii
31
38
41
47 52 53
56
58
58
60 62
64
64 66
68 70
70
List of Figures

1.1 The initial intent of editing is to preserve most of the original content.
1.2 Recent models cannot edit while preserving the original image.
1.3 Chatbots like GPT-4V, Bard can respond to questions about an image.
1.4 Since trained on general-purpose tasks, these chatbots can't perform well in a specific task.
2.1 Illustration of CNN architecture.
2.2 Some GAN applications. Source: LearnOpenCV.
2.3 ViT [22] marked the first time a Transformer was used in vision problems.
2.4 A Recurrent Neural Network (RNN) utilizes sequential neurons to extract and retain information from past inputs. Source: Wikipedia.
2.5 Transformer consists of the Encoder to comprehend the textual information and the Decoder to generate sequences.
2.6 Large Language Models require substantial resource consumption. Source: Epoch AI.
2.7 Some Parameter-efficient tuning methods [95].
2.8 LoRA [30] significantly lowers the trainable parameters.
2.9 Comparing per-device memory use for model states with three ZeRO-DP [66] optimization stages. Model size Ψ = 7.5B, DP degree N_d = 64, and K = 12 in mixed-precision training with the Adam optimizer.
2.10 CLIP [61] uses Contrastive Learning, jointly learning different modalities.
2.11 Responses from MiniGPT-4 [98] have shown that this open-source MLLM has the ability to deeply understand the content of the image.
2.12 Traditional CNN-LSTM architectures were widely used previously [83].
3.1 LLaVa-v1.5-13B, the latest in the LLaVa series [49], performs well in benchmarks but falls short in enhancing visual appeal.
3.2 LLaVa [49] leverages ChatGPT/GPT-4 to synthesize data.
3.3 Image editing outcomes in InstructPix2Pix [7] and ControlNet [94].
3.4 Object Detectors struggle to detect them effectively [50].
3.5 The improved Grounding DINO [50] framework, derived from GLIP [46], establishes a strong interconnection between modalities.
3.6 SAM [40] is a promptable segmentation model.
3.7 A robust image encoder efficiently generates an image embedding, enabling prompt-based queries to produce masks in SAM [40].
3.8 Proposed Local Image Editing Model.
4.1 Word Cloud illustrates the word frequency in the new dataset.
4.6 Examples of unsuccessful editing.
4.7 Complete Chatbot.
List of Tables

4.1 Prompt used for ChatGPT to synthesize Instruction-following data.
4.2 Comparison between Our LLaVa, GPT-4V [57], and LLaVa-v1.5-13B [49].
4.6 Some responses are too short and do not contain much information.
Chapter 1

Problem statement

This chapter will provide an overview of the issue under investigation. Additionally, it will elucidate the key concepts referenced in this thesis, namely Aesthetic Assessment and Local Region Image Editing. Subsequently, we will delineate the scope of the proposed solution and articulate the research objectives. Finally, the primary contributions made in this study will be highlighted.
However, it is imperative to acknowledge that not everyone possesses expertise in the realm of photography. For many users, the creation of visually appealing images presents a challenge, particularly in meeting aesthetic criteria such as color, light, composition, depth, and other relevant factors. Recognizing this challenge, there is a growing necessity for the involvement of experts who can offer constructive feedback and suggest viable solutions to enhance the visual appeal of these photographs.
Consequently, a discernible demand has arisen for a tool designed to assist general users by providing insightful comments, valuable suggestions, and facilitating photo edits. This tool serves as a means of empowering users to refine their photographic endeavors and elevate the overall quality of their visual content.

There are several software programs that have efficiently addressed the demand for photo editing, such as the Adobe tools, including Photoshop and Lightroom. However, these are professional tools designed for a highly specialized group of users. For the general user, effectively utilizing these tools can be extremely challenging, as they require a significant amount of time to accumulate knowledge and learn how to use them. Recognizing this issue, some famous applications have integrated editing tools and color-changing filters with a remarkably simple interface to make image editing more appealing. TikTok and Instagram are among the most famous examples that have captured the market well, providing users with simple yet effective editing tools.

However, there is still no tool that provides feedback on photos. Users often edit photos based on intuition, without a foundation or any specific criteria, making this process time-consuming. They frequently share their photos on photography forums, seeking suggestions on aesthetic aspects from highly skilled photographers. Although chatbots like Bard, ChatGPT-Plus, or BingChat support conversation and commenting on photos, they are trained on general-purpose tasks, making it challenging to get specialized feedback in a specific field, such as aesthetics.
1.2 Definition
Local Region Image Editing addresses a basic user demand for photos. Almost no captured photo can be directly posted to social media; instead, it goes through a time-consuming process that makes the photo more attractive. Some basic kinds of editing include changing the brightness, applying filters, adding stickers, etc.
Trang 15"Change the man's
hair to pink hair"
Figure 1.1: The initial intent of editing is to preserve most of the original content.
Since these activities require intensive work, which is time-consuming, several artificial intelligence tools have emerged out of a desire to help users create a good-looking image.

As depicted in Figure 1.1, the AI agent, primarily utilizing Stable Diffusion [70], can generate a modified version of the original image based on a provided natural language prompt, while preserving most of the initial image.

But studies have shown that some recent well-known text-to-image models are sensitive to the prompt; a small change in the prompt may lead to completely different images (shown in Figure 1.2). As mentioned in [27], the basic intention of editing an image is to preserve most of the original content. To do some editing work, like replacing objects in the image, users have to carefully prepare prompts, which hinders the creativity and convenience of the prompt-based approach.

Aesthetic Assessment is one type of suggestion that mostly concerns the aesthetic aspect. The goal of this suggestion is to point out which parts of the image are not good and then give some advice on how to enhance the visual content. People often assume that the factors that constitute an image's beauty are mostly intuitive, depending on the feelings of assessors. This may not hold true; similar to other domains, numerous criteria must be met for an image to be deemed "beautiful". Otherwise, any assessment without a system of standard criteria is not valuable. There are some cornerstone aspects that even amateurs must follow, including colour, light, composition, etc.

Figure 1.2: Recent models cannot edit while preserving the original image.
Within the realm of Deep Learning, this research topic has been extensively discussed over an extended period, employing various approaches. However, one pioneering work stands out for framing its returned results as words rather than numbers or bounding boxes, as was done previously. The authors posit that representing responses in natural language is not only more appealing to users but also provides insight into how the underlying model processes information.
A recent advance in AI is Multimodal, which has explored a new ability that allows users to chat with AI agents about images. As shown in Figure 1.3, users can ask about many aspects related to the image, and the agent will generate a response based on the question and image. The impressive responses recently produced by GPT-4V [57], Bing Chat, and Bard have made this research promising for the future of generalist AI.

However, since these types of models are mostly trained on general-purpose tasks, it is hard for their responses to be specific in a single domain, such as aesthetics. Experiments (Figure 1.4) show that the responses of these AI agents when asked about aesthetic aspects are not supportive, and sometimes they mention objects that are not included in the image ("leaves" are not in Figure 1.4), which raises doubt about their ability to truthfully understand the visual content.
Figure 1.3: Chatbots like GPT-4V, Bard can respond to questions about an image. (In the figure, a user asks "What do you think about the composition of this image?" and the AI agent describes the well-balanced composition, the pier leading towards the water, the mountains in the background, and the reflection adding a sense of depth.)
Figure 1.4: Since trained on general-purpose tasks, these chatbots can't perform well in a specific task. (In the figure, Bard answers "What is your opinion on the aesthetic of the image?" about a minimalist cat photo with generic praise and repeatedly mentions "leaves" that are not present in the image.)
1.3 Research Objectives

This thesis aims to research methodologies capable of:

1. Enhancing visual chatbots in the aspect of aesthetic assessment so that they can provide responses that properly describe the content of the image. Moreover, when users pose questions that require aesthetic knowledge, the model's response should not only offer supportive information but also ensure reliability. The response must avoid being overly general or cliché.

2. Researching advanced methods that focus solely on local editing regions. While there are many text-to-image models that can both generate and edit based on user prompts, the initial intent of editing was to make the necessary changes to the required regions while preserving the majority of the image structure. Many recent models still lack this ability, making it challenging for these models to be genuinely useful for general image editing work.

3. Proposing a method that can combine the aforementioned abilities into a unified AI agent capable of not only providing information about the given image but also editing it based on the user's request. Users won't need to manually switch between tools since the AI agent is intelligent enough to select the appropriate tool for each task.
"Language serves as the attire of thought", and undoubtedly, it plays a pivotal role in articulating an individual's desires. This thesis will leverage Prompt-based Multimodal models for both purposes: editing images and delivering feedback based on user requests expressed in natural language. Natural language is used as the primary communication method between the AI agent and the human, since it is the most effective medium for a computer to comprehend desires deeply and engage seamlessly with users.

Multimodal models have emerged in recent years as an innovative research topic, showcasing their ability to synthesize and process information from diverse sources, such as images and text. The objective of this thesis is to explore multimodal models capable of providing feedback on and editing images, subsequently refining them based on aesthetic evaluation tasks. As image feedback and editing models are typically distinct, the project also proposes a solution to seamlessly integrate them into a unified chatbot. Users can engage in a conversation through the chat interface, send an image, and request feedback on various aesthetic aspects while editing the image according to suggestions.
1.4 Research scope
This topic focuses on researching Multimodal, primarily concentrating on two areas: language and image. The objective is to develop a chatbot with two main functions, image critique and editing, combining two models to perform tasks such as Visual Question Answering and Prompt-based Image Editing. The primary target audience for this research is non-expert users without specialized knowledge in the field of aesthetics.

During the critique process, the chatbot focuses on common aesthetic aspects such as lighting, color, composition, depth, and content. Additionally, the chatbot can suggest ways to improve the image quality, providing practical assistance to users. Its editing capabilities are concentrated on the local regions of the image, including changes to hair color, clothing, and other detailed factors.

Given the diversity of photo editing needs, there are numerous aspects that demand attention beyond the scope of this Multimodal research. These discoveries will therefore be left for future exploration.
1.5 Contribution
In summary, the contributions of the thesis are the following:
1. Research on the type of Multimodal model that is capable of assessing and chatting about an image. We then leverage ChatGPT to synthesize Instruction-following data. Finally, by using some modern training techniques, we successfully fine-tuned a Vision Language Model whose LLM has 7B parameters on a single RTX 3090.

2. Research on Generative Models, focusing on Local Region Editing. Recent years have produced several models capable of both synthesising and editing. This research seeks an end-to-end pipeline that goes straightforwardly from the user's image and prompt to the edited image, while preserving most of the original content.

3. Proposing a solution that connects those separate models, unifying them into a single chatbot that can do both Aesthetic Assessment and Local Image Editing.
1.6 Thesis structure
The structure of the thesis is divided into the following chapters:
• Chapter 1: Problem statement: Presents the main problem addressed by the thesis, along with an overview of the trends in artificial intelligence (AI) development and their relevance to the research problem.

• Chapter 2: Background and Related Work: Offers a comprehensive overview of the current landscape within the domain of Deep Learning, conducting surveys and investigations into prior and existing models related to Multimodal applications.

• Chapter 3: Proposed Methodology: Analyzes and selects appropriate data, models, and training methods.

• Chapter 4: Experiments: Conducts model training and evaluation, and identifies solution directions.

• Chapter 5: Discussion: Summarizes the work and the results achieved in the thesis. Discusses research work and potential future developments.
In this initial chapter, the problem statement has been introduced. The conceptual framework surrounding two pivotal keywords frequently referenced in this thesis, namely Aesthetic Assessment and Local Region Image Editing, has been presented. Additionally, discussion has been provided on the research scope, the research objectives, and the contributions made in the study.
Chapter 2

Background and Related Work

In the realm of artificial intelligence, there is a persistent endeavor to develop a versatile assistant capable of emulating human perception, inference-making, and interaction with the real world. Recent research has concentrated on constructing Foundation Models: general AI models proficient in acquiring information from diverse modalities and performing a broad spectrum of tasks. This thesis, in particular, focuses on Vision Language Models, a set of meticulously designed models intended for synthesizing and comprehending information from both images and texts. However, before delving into the specific area of study, an exploration of the background within the Deep Learning domain is undertaken to gain insights into the innovations that contribute to the successful development of Multimodal systems.
2.1 Background
2.1.1 Research on Computer Vision
Computer Vision is a multidisciplinary field at the intersection of computer science and artificial intelligence that empowers machines to interpret, understand, and extract meaningful information from visual data, such as images or videos, mimicking human visual perception. Several studies have been proposed in this field since the 1950s. One of the most important was the AlexNet architecture [41], introduced in 2012, which not only marked a breakthrough but also signaled the beginning of an era of Deep Learning that continues to this day.

Figure 2.1: Illustration of CNN architecture.
2.1.1.1 Pre-Transformer: Power and Limits of CNN

This model is particularly impressive for its application of the Convolutional Neural Network (CNN) architecture, inspired by how humans perceive images. The uniqueness of CNN lies in using sliding windows to extract features over small regions of an image, as depicted in Figure 2.1. As the network goes deeper, these windows effectively cover larger regions, helping to extract larger features, thereby enhancing the learning capability and helping the model extract more crucial information from the data.
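To make the sliding-window intuition concrete, the following is a minimal PyTorch sketch; the layer sizes and input resolution are illustrative assumptions, not the architecture of any model discussed in this thesis:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Illustrative CNN: small sliding windows whose effective receptive
    field grows as the network gets deeper."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3x3 window slides over the image
            nn.ReLU(),
            nn.MaxPool2d(2),                              # halves the spatial size
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer sees a larger region
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)                   # (B, 32, 56, 56) for a 224x224 input
        return self.classifier(h.flatten(1))   # class logits

logits = TinyCNN()(torch.randn(1, 3, 224, 224))  # -> shape (1, 10)
```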
In the following years, the AlexNet architecture underwent significant improvements, optimizing the number of parameters while maintaining performance, in models such as VGG [76], ResNet [26], InceptionNet [77], DenseNet [31], and many other architectures. At that time, increasing the number of parameters of CNNs was not considered an optimal way to improve model performance. This is because models within this architecture family often faced the issue of overfitting: a situation where the model performs well on the training data but less effectively on new data.

To overcome these challenges, several new techniques emerged, such as Batch Normalization [32], Dropout, and Regularization. These improvements not only help stabilize the training process but also enhance the model's performance. The success of these architectures has laid the foundation for various common tasks, including Image Classification, Object Localization, Object Detection, and Object Segmentation, and has been widely applied in many fields of life. Applications range from assisting in tumor detection in the medical field to urban traffic management and enforcing social distancing measures during the COVID-19 pandemic.
All of these applications belong to the group of Discriminative Models, which approach a problem by learning a decision boundary, determining a clear separation between classes or categories in the data. They primarily learn how to discriminate between different classes based on input features. Generative Models, on the other hand, are a novel approach that focuses on generation.
2.1.1.2 Generative Models

In contrast to Discriminative Models, Generative AI focuses on learning the data distribution. Instead of solely concentrating on classification, Generative AI emphasizes the ability to create new data from the distribution of the training data.

This opens up various creative applications, from generating new images to synthesizing speech, expanding the realm of imagination in the field of artificial intelligence. Nowadays, it has the capability to create images of faces that have never existed and to transform videos with a level of realism that is challenging to distinguish from reality. It also innovates in altering image styles, unlocking new potentials for creativity and applications across various domains.

The group of popular architectures in the field of Generative AI includes the Generative Adversarial Network (GAN) [17], the Variational Autoencoder (VAE) [89], and, currently, the Diffusion Model, which is the most popular choice. The training method of Diffusion is evaluated to be more stable than previous methods, delivering results with high image quality and superior realism. Figure 2.2 shows some GAN applications in generating humans at different ages, super resolution, and Image-to-Image translation.

Figure 2.2: Some GAN applications. Source: LearnOpenCV.
2.1.1.3 Transformer's Entry: Intersection of vision and language

By the end of 2020, the emergence of the Vision Transformer (ViT) [22] highlighted the Transformer [82] architecture's presence in the field of image processing. The ViT architecture is shown in Figure 2.3, where the first step is splitting the image into multiple patches treated as "image tokens". After flattening the extracted visual features, a Position Embedding is added to the embedding space to append the order information. Finally, a Transformer Encoder processes those "image tokens" just as it does word tokens. The Self-Attention mechanism allows image patches to attend to each other and helps the model learn their spatial and visual information in different parts of the image.

Figure 2.3: ViT [22] marked the first time a Transformer was used in vision problems. (The figure also shows the extra learnable class embedding prepended to the patch tokens.)
This was not only a significant step forward but also marked a special convergence between two crucial domains: Computer Vision and Natural Language Processing. The Vision Transformer not only achieved results far superior to CNNs but also opened up many possibilities in how we understand and apply the Transformer architecture across various vision and language applications.
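As a concrete illustration of the patch-splitting step described above, here is a minimal sketch assuming standard ViT-Base settings (224x224 input, 16x16 patches, 768-dimensional embeddings); actual ViT variants may use different configurations:

```python
import torch
import torch.nn as nn

# ViT "patchify" sketch: split the image into 16x16 patches, project each
# patch to an embedding ("image token"), and add position embeddings.
image = torch.randn(1, 3, 224, 224)                    # (B, C, H, W)
patch, dim = 16, 768
n_patches = (224 // patch) ** 2                        # 196 image tokens

# A strided convolution is the idiomatic way to cut non-overlapping patches.
to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
tokens = to_tokens(image).flatten(2).transpose(1, 2)   # (1, 196, 768)

cls_token = nn.Parameter(torch.zeros(1, 1, dim))       # extra learnable token
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))  # order information
x = torch.cat([cls_token, tokens], dim=1) + pos_embed  # input to the Encoder
```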
2.1.2 Research on Natural Language Processing
2.1.2.1 Transformers: NLP Milestone
Formerly, the classical Recurrent Neural Network (RNN) architecture (Figure 2.4) was commonly applied in the natural language processing area. RNNs, characterized by the sequential connection of neurons forming a directed cycle, maintain hidden states to gather information from previous inputs. This makes RNNs ideal for handling sequential data such as text and time series, where the relationships between components often depend on context and order.

Figure 2.4: A Recurrent Neural Network (RNN) utilizes sequential neurons to extract and retain information from past inputs. Source: Wikipedia.

However, despite the benefits that RNNs bring to processing sequential data, they also face significant challenges. The issue of the Vanishing Gradient is one of the crucial aspects, making the backward flow of information through time steps difficult and leading to the loss of important information. This inefficiency poses a major limitation in handling long sequences of data.
A significant breakthrough with the Transformer [82] architecture has overcome these limitations, providing the ability to process information in parallel and leveraging the Self-Attention mechanism. As a result, this architecture not only addresses the Vanishing Gradient problem but also opens up new potentials in understanding and generating natural language in a distinctive way. The Transformer is the architecture behind the success of Large Language Models (LLM).

As shown in Figure 2.5, the Transformer consists of an Encoder to comprehend the textual information, connected to a Decoder that uses that information to generate a sequence in a probabilistic way.

Figure 2.5: Transformer consists of the Encoder to comprehend the textual information and the Decoder to generate sequences.

The core of the architecture lies in the Self-Attention mechanism. Drawing inspiration from database principles, the embedding of tokens undergoes transformation via a linear layer, generating three vectors: $Q$ as the query, $K$ as the key, and $V$ as the value. The query vector $Q$ identifies the key vector $K$ that is most similar, followed by normalization using the softmax function to create a distribution between 0 and 1. A higher value indicates a higher probability of the query vector aligning with the key vector. This distribution is then multiplied with $V$ to determine which parts of $V$ should receive more attention, distinguishing between vectors that require more or less attention. These attention mechanisms allow Transformers to learn different and distant dependencies in language:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

Multi-Head Attention is a crucial element in the Transformer architecture, complementing the earlier introduced Self-Attention mechanism. While Self-Attention allows a model to weigh different parts of the input sequence based on their relevance, Multi-Head Attention extends this capability by employing multiple attention heads in parallel. Each head operates independently, enabling the model to capture diverse features and patterns simultaneously. This parallel processing not only enhances the model's ability to understand complex relationships within the data but also facilitates more efficient training and inference through parallelization:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}$$
$$\text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
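The attention formula above translates almost line-for-line into code. The following is a minimal single-head sketch with toy tensor sizes, not an optimized implementation:

```python
import math
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention, exactly as in the formula above."""
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)  # query-key similarities
    weights = scores.softmax(dim=-1)                 # distribution over the keys
    return weights @ V                               # weighted sum of the values

# Toy usage: a batch of 4 tokens with model dimension 8.
x = torch.randn(1, 4, 8)
out = attention(x, x, x)   # self-attention: Q, K, V all derive from x
```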
The subsequent years witnessed the remarkable development of Large Language Models (LLM), achieved by increasing the number of parameters and the size of training data. These LLMs have demonstrated impressive capabilities, enabling them to perform tasks such as poetry generation [63], text summarization, and translation [64] at levels comparable to humans.
2.1.2.2 Training method
In addition to training on the traditional next-word-prediction task, various training methods have been proposed to enhance the performance of Large Language Models (LLM). These approaches have enabled some smaller-sized models to compete effectively with larger ones. An important step in the development process involves Supervised Tuning and the application of Reinforcement Learning, specifically the Reinforcement Learning from Human Feedback (RLHF) method.

Learning through interaction with humans, the GPT-3 model [8] evolved into GPT-3.5 [59], becoming a particularly flexible and multitasking model. With the ability to function as a chatbot, users can request the model to perform various tasks simply by providing desired prompts. This development serves as a precursor to the current state of ChatGPT.
Instruction Tuning is also an advanced fine-tuning method designed to optimize model performance by adjusting specific instructions for each particular task. Well-known Large Language Models (LLM) like T5 [64], LLaMA [79][80], and others have demonstrated impressive results on various specific tasks with the help of Instruction Tuning. Some powerful models employing this method, such as Alpaca [78], Flan-T5 [16], and WizardLM [89], have become outstanding, excelling not only in performing specific tasks but also in responding effectively to human instructions. An illustrative instruction-following sample is shown below.
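For concreteness, instruction-tuning datasets are commonly stored as instruction/input/output triples. The sample below follows the Alpaca-style schema; the field contents and template wording are illustrative assumptions, not data from this thesis:

```python
# One Alpaca-style instruction-following record (illustrative content).
sample = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large Language Models are trained on next-word prediction ...",
    "output": "LLMs learn language by repeatedly predicting the next word.",
}

# A typical prompt template that turns the record into training text.
PROMPT = (
    "Below is an instruction that describes a task, paired with an input.\n"
    "### Instruction:\n{instruction}\n### Input:\n{input}\n### Response:\n"
)
training_text = PROMPT.format(**sample) + sample["output"]
```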
Through Instruction Tuning, many models have seen notable advancements, especially in their capabilities for tasks that demand high-level reasoning and inference, such as mathematical problem-solving and code generation. Models like Llemma [5], WizardMath [52], WizardCoder [53], and CodeLLaMa serve as evidence of the success of this approach, improving LLMs' logical reasoning abilities in mathematical problem-solving and code writing.
In some cases, the crucial aspect lies in carefully crafting and providing accurate prompts, which can sometimes be sufficient to achieve the desired results without the need for expensive fine-tuning. This has introduced a new level of convenience and opened up many creative possibilities in the real-world applications of Large Language Models. Several popular prompting methods, such as Chain of Thought and In-Context Learning [8], have emerged, achieving impressive results through the simple act of writing prompts appropriately.
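As a brief illustration (the wording below is a made-up example, not taken from the thesis), a few-shot Chain-of-Thought prompt simply prepends a worked example whose answer spells out intermediate reasoning steps, which the model then imitates:

```python
# Few-shot Chain-of-Thought prompt: the in-context example demonstrates
# step-by-step reasoning before the final answer.
cot_prompt = """Q: A photographer takes 12 photos per shoot and does 3 shoots. How many photos?
A: Each shoot yields 12 photos. 3 shoots yield 3 x 12 = 36 photos. The answer is 36.

Q: An album holds 24 photos and 5 albums are full. How many photos in total?
A:"""
# The model is expected to continue in the same style:
# "5 albums hold 5 x 24 = 120 photos. The answer is 120."
```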
Despite their strength, these models are often trained on general tasks, meaning they perform well across a broad range of tasks. Therefore, when applied to specific tasks, the results may not reach the optimal level expected. To address this issue, fine-tuning is a common choice to optimize the performance of Large Language Models on specific tasks.
2.1.2.3 Success Recipe: Larger Models
In recent years, we have witnessed not only a significant leap in the size of Large Language Models (LLM) but also a rapid evolution. Looking back, the BERT model [27] in 2018 had approximately 340 million parameters. Just a year later, OpenAI's GPT-2 [63], a project initially considered too risky for widespread release due to concerns about its potential misuse by politicians and scammers, became more impressive with 1.5 billion parameters. By 2020, the precursor to ChatGPT, GPT-3 [8], took this number to a staggering 175 billion parameters. The increase in model size implies that the success recipe of Language Models is expanding data size and parameters.

Figure 2.6: Large Language Models require substantial resource consumption. Source: Epoch AI.

Figure 2.6 shows the cost of training Large Language Models; the growth in size not only brings opportunities but also poses challenges in computational performance and model training cost. This presents a significant barrier for research groups with limited computational resources, making it difficult for them to participate in the ever-increasing race for model size.

Figure 2.7: Some Parameter-efficient tuning methods [95]: (a) Adapter Tuning, (b) Prefix Tuning, (c) Prompt Tuning, (d) Low-Rank Adaptation.

Figure 2.8: LoRA [30] significantly lowers the trainable parameters.
2.1.2.4 Parameter-efficient Fine-tuning methods

Several methods have been developed to optimize the fine-tuning process of Large Language Models (LLM) with limited resources, enabling successful training on constrained hardware. One significant research direction is Parameter-efficient fine-tuning (PEFT), aimed at reducing the number of trainable parameters during training while preserving model performance. Notable approaches include Adapter Tuning [29], which trains small neural network adapters integrated into the Transformer architecture (Figure 2.7(a)); Prefix Tuning [47], which adds a prefix string as a set of trainable vectors to each Transformer layer (Figure 2.7(b)); and Prompt Tuning [42], which, unlike Prefix Tuning, integrates the vectors into the input layer (Figure 2.7(c)). Among them, LoRA stands out as a particularly effective method (Figure 2.7(d)).
LoRA, proposed by Microsoft researchers, is a specialized training mechanism that freezes the entire pre-trained model and focuses solely on training low-rank matrices embedded in each layer of the Transformer architecture. By concentrating on training low-rank matrices, LoRA significantly reduces the number of parameters, alleviating resource burdens.

Inspired by previous research indicating that pre-trained language models possess a low "intrinsic dimension" and can learn efficiently despite a random projection to a smaller subspace, the authors extend this insight to hypothesize that the updates to the weights also exhibit a low "intrinsic rank" during adaptation.

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the authors constrain its update by employing a low-rank decomposition $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$. Throughout training, $W_0$ remains static without receiving gradient updates, while matrices $A$ and $B$ are the trainable parameters. Importantly, both $W_0$ and $\Delta W = BA$ are multiplied with the same input, and their respective output vectors are summed coordinate-wise. For $h = W_0 x$, the update is:

$$h = W_0 x + \Delta W x = W_0 x + BA x$$

The variable $r$ denotes the rank, influencing the amount of trainable parameters and memory consumption. As depicted in Figure 2.8, the matrix $A$ is initialized from a Gaussian distribution, while $B$ starts at zero. Consequently, the initial change in weights $\Delta W = BA$ is zero, indicating no update during the initial phase. Upon completion of training, the integration of LoRA weights into the original model becomes a seamless process through matrix addition. Notably, while LoRA can theoretically be applied to various layers of the Transformer, the authors' experiments have shown promising results specifically when applied to Self-Attention layers.
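The decomposition above is straightforward to express in code. Below is a minimal sketch of a LoRA-augmented linear layer (an illustrative re-implementation, not the exact code of the official library or of this thesis; the alpha/r scaling factor follows the LoRA paper):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W0 plus a trainable low-rank update BA."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # W0 stays frozen
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))         # zero init, so BA = 0 at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + BA x (scaled), matching the update rule above
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
# Only A and B train: 8*768 + 768*8 = 12,288 parameters vs 589,824 in W0.
```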
Experimental results have also shown that LoRA can fine-tune the 175-billion-parameter GPT-3 model with roughly 10,000 times fewer trainable parameters and up to three times less GPU memory, while maintaining high performance. Compared to other methods, this is indeed an optimized solution, particularly valuable when fine-tuning models with limited computational resources while ensuring high performance.
2.1.2.5 Memory-efficient fine-tuning method
Quantization is a crucial technique employed in Large Language Models (LLMs) to optimize and enhance their efficiency. In the context of LLMs, which often have massive numbers of parameters, quantization addresses the challenge of reducing model size and computational complexity without compromising performance. Naive casting, or the reduction of precision without careful consideration, may result in a loss of critical information, impacting the model's overall performance.

Figure 2.9: Comparing per-device memory use for model states with three ZeRO-DP [66] optimization stages. Model size Ψ = 7.5B, DP degree N_d = 64, and K = 12 in mixed-precision training with the Adam optimizer.

To address these challenges, sophisticated techniques have emerged for effective quantization fine-tuning. LLM.int8() seamlessly converts float32 to int8, preserving overall performance. QLoRA [20], an innovative approach, combines quantization with LoRA to optimize memory usage during model training. Introducing the NF4 data type, QLoRA uniquely quantizes based on quantiles of the standard normal distribution, achieving memory-efficient optimization compared to traditional 4-bit integer and float quantization, all while maintaining high accuracy.

These methods aim to preserve the essential information while achieving the desired compression of model parameters and activations. By navigating the trade-off between reduced precision and model accuracy, these techniques contribute to the successful implementation of quantized models in real-world applications.
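In practice, QLoRA-style fine-tuning is commonly set up through the Hugging Face transformers, bitsandbytes, and peft libraries. The sketch below shows the typical pattern; the model id and hyperparameters are placeholders, and API details may vary across library versions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base model in 4-bit NF4 (QLoRA-style quantization).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # higher-precision compute dtype
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
)

# Attach trainable LoRA adapters to the quantized, frozen attention layers.
lora_config = LoraConfig(r=8, lora_alpha=16,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA matrices train
```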
2.1.2.6 Training-efficient technique
The large size of the model consumes nearly all the GPU memory, forcing us to use a smaller batch size during training and significantly prolonging the training process. Gradient Accumulation is a technique that addresses this issue by allowing training with a small batch size while achieving the same effect as a large batch size: it does this by accumulating gradients for a certain number of steps before making a parameter update. (Gradient Checkpointing is a complementary memory-saving technique that recomputes intermediate activations during the backward pass instead of storing them.)
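A minimal sketch of gradient accumulation follows; the tiny model and synthetic data are placeholders for illustration:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)                          # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
loader = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(8)]

ACCUM_STEPS = 4   # 4 micro-batches of 8 behave like one batch of 32
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets)
    (loss / ACCUM_STEPS).backward()               # scale so gradients average correctly
    if (step + 1) % ACCUM_STEPS == 0:             # update once per ACCUM_STEPS micro-batches
        optimizer.step()
        optimizer.zero_grad()
```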
Another solution is to scale the number of GPUs, enabling training on multiple devices. This technique, supported by the PyTorch framework, is known as Data Parallel. PyTorch also supports Distributed Data Parallel, allowing training on multiple nodes and alleviating computational pressure by leveraging multiple devices.

It's important to note that these techniques are effective when the model can fit into a single GPU. If the model is too large for a single GPU, partitioning the model becomes a viable solution.
DeepSpeed ZeRO [67], introduced by the Microsoft research team, represents a significant breakthrough in memory optimization during model training. Illustrated in Figure 2.9, the three main methods are optimizer state sharding ($P_{os}$), gradient sharding ($P_g$), and parameter sharding ($P_p$). These advancements not only enable efficient training on devices with multiple GPUs but also integrate an Offload mechanism to simultaneously leverage both GPU VRAM and CPU RAM, ensuring comprehensive system performance optimization.
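For reference, ZeRO stages are typically selected through a DeepSpeed configuration passed to deepspeed.initialize. The snippet below is an illustrative minimal config; the field values are assumptions, not the settings used in this thesis:

```python
# Minimal DeepSpeed config selecting ZeRO stage 2 (optimizer state and
# gradient sharding) with CPU offload of optimizer states.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                              # 1: P_os, 2: +P_g, 3: +P_p
        "offload_optimizer": {"device": "cpu"},  # spill optimizer states to CPU RAM
    },
}
```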
Thanks to these technological advances, optimizing efficiency on devices with limited resources has become more accessible than ever, and fine-tuning Large Language Models (LLMs) is no longer a formidable challenge.
2.2 Research on Multimodal
Although Large Language Models (LLM) have demonstrated remarkable zero-shot capabilities on many language-related tasks, they still appear "blind" to the field of images, as they are primarily trained to process words.

To overcome this limitation and leverage simultaneous advancements in both image processing and natural language, researchers have conducted studies combining image and language models. One notable contribution is OpenAI's CLIP [61].

Figure 2.10: CLIP [61] uses Contrastive Learning, jointly learning different modalities.

CLIP proposes building a shared embedding space to align images and language. Using Contrastive Learning to optimize the similarity between pairs of images and text (Figure 2.10), CLIP is pretrained on over 400 million image-text pairs. As a result, CLIP exhibits strong Zero-shot and Few-shot capabilities, enabling object recognition in images without prior training, based on understanding natural language descriptions.
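The symmetric contrastive objective at the heart of CLIP can be sketched in a few lines. This is a simplified illustration of the published loss, with random tensors standing in for the outputs of the real image and text encoders:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)           # unit-length embeddings
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature       # (B, B) similarity matrix
    labels = torch.arange(len(logits))               # diagonal entries are the matches
    return (F.cross_entropy(logits, labels) +        # image -> text direction
            F.cross_entropy(logits.T, labels)) / 2   # text -> image direction

# Toy batch: 8 image and 8 text embeddings of dimension 512.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```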
Usually, when referring to Multimodal Large Language Models (MLLM), we mean models based on Large Language Models (LLM) with the capability to intake and process information from multiple sources. According to [92], from the perspective of developing Artificial General Intelligence (AGI), MLLM is considered a significant step beyond LLM, for the following reasons:

• MLLM is similar to how humans observe the world, utilizing multiple senses to naturally intake information, often in a complementary and cooperative manner.

• MLLM provides a more user-friendly interactive interface. Thanks to its capability to support diverse inputs, users can interact and communicate with this intelligent assistant in a more flexible manner.

• MLLM is a more comprehensive tool for solving tasks. While LLM can handle various NLP tasks, MLLM has the ability to address a broader range of tasks, providing diversity and comprehensiveness in information processing capabilities.
Figure 2.11: Responses from MiniGPT-4 [98] have shown that this open-source MLLM has the ability to deeply understand the content of the image. (In the figure, the model writes an advertisement for a brass toucan lamp from its photo, and flags a person running across a busy street in a surveillance image as a potential safety hazard.)
GPT-4 [57] by OpenAI has garnered significant attention from the research community, especially for the impressive capabilities demonstrated by this chatbot. However, it's worth noting that GPT-4 is not open-source software, and as of the current moment, no official information about its configuration has been disclosed. Nevertheless, the research community continues its efforts to develop open-source MLLMs, and some projects have achieved notable successes.
As shown in Figure 2.11, MLLMs are able to follow the human's prompt to generate proper responses. They are now capable of writing an advertisement from an image, recognizing risks that appear in CCTV cameras, generating code for a website based on images of that website, understanding the meaning of memes on social media, and reading and comprehending text from images without the need for Optical Character Recognition (OCR) technology. These advancements demonstrate the strength of the research community and open up the prospect of practical applications for MLLMs.
As per [92], these works can be broadly classified into two primary groups:
25
Trang 381 End-to-end Training with Large Language Models: In this area,
collec-tions of multimodal datasets that include images, text, and cross-modal
instruc-tions have been gathered to aid in the training of Language Models (LLM)
fo-cused on understanding visual information Various models, such as Flamingo
[1], GPT-4 [57], as well as open-source versions like MiniGPT-4 [98], Kosmos-2
have demonstrated remarkable proficiency in comprehending visual data and
generating responses that align contextually.
While these models have improved Language Models for diverse tasks, persistent challenges impede their progress toward an optimal solution A major hurdle is
data sourcing, requiring extensive conversational and question-and-answer datarelated to imagery Generating intricate task-specific data remains complex Ad-ditionally, creating a unified architectural framework for tasks like segmentation
and image generation poses a significant challenge, crucial for practical tions
applica-2 Chaining Tools with Large Language Models: This approach involves
care-ful preparation of prompts with the aim of empowering Language Models (LLM),
as seen in Langchain [12] The goal is to enable these models to coordinate various
tools, such as segmentation models, to perform specific sub-tasks without
requir-ing additional resource-intensive trainrequir-ing processes Noteworthy contributions
like Visual ChatGPT [88], MM-REACT illustrate the advantages of this
ap-proach, especially in achieving diverse task capabilities by seamlessly integrating tools within an AI model.
Despite the inherent adaptability of this approach, user prompts frequently volve a level of complexity that is not consistently straightforward The presence
in-of uncertainties within these prompts and their occasional inadequacy in
accu-rately selecting and initiating required tasks presents common challenges
Conse-quently, the ensuing process may involve integrating multiple models, potentiallyleading to delays in execution and incurring operational expenses associated with
26
Trang 39the simultaneous use of numerous models for various tasks.
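As a schematic illustration of the tool-chaining idea (all names below are hypothetical; real systems such as Visual ChatGPT use an LLM itself to make the routing decision):

```python
# Schematic tool-chaining loop: a router picks one tool per user request.
def segment_tool(image, prompt):    # hypothetical wrapper around a segmentation model
    return f"mask for '{prompt}'"

def caption_tool(image, prompt):    # hypothetical wrapper around a captioning model
    return "a description of the image"

TOOLS = {"segment": segment_tool, "caption": caption_tool}

def route(user_request: str) -> str:
    # A real agent would prompt the LLM to choose; this keyword stub stands in.
    return "segment" if "edit" in user_request else "caption"

request = "Please edit the image: change the man's hair to pink"
result = TOOLS[route(request)](image=None, prompt=request)
```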
This study aims to develop an image-assessing and image-editing Chatbot with an emphasis on aesthetics. It will scrutinize models in this category, conducting a thorough analysis of their strengths and weaknesses. The goal is to establish a foundational understanding for informed decisions on training and seamless integration, aligned with the intended purpose.
2.3 End-to-end training with Large Language Model
Within the realm of Vision Language Models, there are two predominant structural paradigms: Image-to-Text and Text-to-Image architectures. In the former, a model generates text based on the provided image or text input, while in the latter, a model produces an image from such input.

The Image-to-Text framework finds prominence in tasks such as Image Captioning and Visual Question Answering, where the input predominantly comprises images and queries, yielding natural language responses as output. These models primarily serve tasks related to the assessment of images.

Conversely, the Text-to-Image framework is commonly deployed for image generation, typically involving a descriptive directive sentence as input and generating a corresponding image as output. In some cases, the input goes beyond a simple directive sentence and includes additional information in the form of prompts or binary masks. These masks outline specific regions for alteration, a method commonly used in tasks involving localized image editing.
2.3.1 Image Assessment Models
Within the domain of artificial intelligence models designed to clarify and facilitate user inquiries pertaining to images, the predominant tasks often revolve around Image Captioning and Visual Question Answering. Image Captioning, in particular, focuses on the generation of comprehensive descriptions elucidating the content depicted within a given image. This underscores the model's proficiency in discerning and articulating image attributes through the use of natural and coherent language.

Figure 2.12: Traditional CNN-LSTM architectures were widely used previously [83].

In contrast, Visual Question Answering establishes a question-and-answer framework wherein users interact with various aspects of the image. This necessitates that the model not only recognize image elements but also possess cognitive reasoning capabilities in order to provide accurate responses. This task demands a deeper comprehension of both visual stimuli and textual queries when compared to the inherent complexities involved in Image Captioning.

These research problems were historically tackled through the classical CNN-LSTM architecture, as depicted in Figure 2.12 and referenced in previous works [90, 3]. However, recent years have witnessed significant advancements in this domain, notably through the adoption of Transformer architectures [82] in lieu of the antiquated CNN or LSTM models. Notably, OpenAI's CLIP model stands as a pioneering work that has showcased promising outcomes, underscoring the potential of this burgeoning research field.

Subsequently, this subject matter has garnered substantial attention from numerous research groups, employing diverse training methodologies across varied datasets,