VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE
Dr NGUYEN VINH TIEP
HO CHI MINH CITY, 2024
LIST OF THESIS DEFENSE COMMITTEES
Thesis Defense Committee, established according to the decision of the Rector of the University of Information Technology:

1. Chairman: PhD. Duong Viet Hang
2. Secretary: MSc. Cap Pham Dinh Thang
3. Members: MSc. Do Van Tien
I wish to express my gratitude to my esteemed university for providing a robust theoretical foundation, facilitating a swift assimilation of new knowledge. The university's unwavering support and comprehensive resources have been instrumental in enhancing my comprehension and application of intricate concepts.

Furthermore, I extend my appreciation to the MMLab laboratory for fostering an academic environment conducive to research and learning. The laboratory's ample resources have significantly contributed to an enriching scholarly experience, enabling me to delve deeper into my studies.

A heartfelt acknowledgment is reserved for my supervisor, Dr. Nguyen Vinh Tiep, whose guidance and support have been indispensable throughout the thesis endeavor. His mentorship has not only directed my research but has also served as a source of motivation. I am grateful for his encouragement and insightful feedback. Additionally, the financial support provided has allowed me to dedicate myself fully to the research process.

In recognizing the collective contributions of my university, the MMLab laboratory, and my teacher, I express my deep appreciation for their pivotal roles in the successful culmination of this thesis.
Currently, the achievements of Generative AI have become popular and powerful. They can not only answer user questions about various aspects of an image but also perform edits, creating unique images based on user requests. Chatbots like GPT-4 Vision, BingChat, or Bard can provide excellent feedback when users inquire about different aspects of an image. Models like DALL-E, Midjourney, and Parti can generate and edit images according to specific requests.

A gap persists in integrating image commentary and editing functionalities into a single efficient tool. Users often resort to multiple tools for varied needs, resulting in time-consuming and cumbersome workflows. Existing models for image feedback are generally trained for general applications, yielding overly broad responses that lack specificity in aesthetic evaluation.

This study aims to explore Multimodal models, which can process and synthesize information from diverse sources like images, text, and audio. The focus is on developing a Chatbot that simultaneously handles images and questions, proficient in both editing and providing insightful commentary. Specializing in detailed aesthetic commentary on elements like lighting, color, composition, and depth, the Chatbot executes localized edits. Designed to interpret and respond to natural language instructions, it enhances the user experience in digital image interaction.
Trang 516 ‘Thesis struciire) Ƒ ge 9 (0Ô ff ⁄ 8
2 Background and Related Work 10
Trang 64
2.1.2.6 ‘Training-efficient technquel
2.2 Research on Mulimodal
2.3 End-to-end training with Large Language Model]
2.3.1 Image Assessment Modelsl
2.3.1.1 Jointly Training with Image and Text]
2.3.1.2 Learned Image Embedding as (Frozen) LM Prefix] .
2.3.1.3 Text-lmage Cross-Attention Fuse Mechanisms]
2.3.2 Local Region Image Editing Modell
2.4 Chaining tools with Large Language Model]
2.5 Deep Learning in Aesthetic Àssessmenl|
2.5.1 Data Survey] 2.5.2 Architectural Survey} 2 ee ee Proposed Method 3.1 Proposal for Aesthetic Àssessmenl|
3.1.1 Proposal for data lnmmtallonsl
3.1.2 Proposal for resource limitatlon|l
3.2 Local Image Editing Pipelnel ốc 3.2.1 Model selection 3.2.2 Open-set Object detector), 2 2.0 2.000002 0 00] 3.2.3 Finding mask for detected obJects|
3.2.4 Complete photo editing pipelmel
3 End-to-end Chatbot] Experiment 4.1 Create Instruction-following dataset]
4.2 Fine tune LLaVaj
4.2.1 Training process
vii
31
38
41
47 52 53
56
58
58
60 62
64
64 66
68 70
70
List of Figures

1.1 The initial intent of editing is to preserve most of the original content.
1.2 Recent models cannot edit while preserving the original image.
1.3 Chatbots like GPT-4V, Bard can respond to questions about an image.
1.4 Since trained on general-purpose tasks, these chatbots can't perform well in a specific task.
2.1 Illustration of CNN architecture.
2.2 Some GAN applications. Source: LearnOpenCV.
2.3 ViT [22] marked the first time a Transformer was used in vision problems.
2.4 A Recurrent Neural Network (RNN) utilizes sequential neurons to extract and retain information from past inputs. Source: Wikipedia.
2.5 Transformer consists of the Encoder to comprehend the textual information and the Decoder to generate sequences.
2.6 Large Language Models require substantial resource consumption. Source: Epoch AI.
2.7 Some Parameter-efficient tuning methods [95].
2.8 LoRA [30] significantly lowers the trainable parameters.
2.9 Comparing per-device memory use for model states with three ZeRO-DP [66] optimization stages. Model size Ψ = 7.5B, DP degree N_d = 64, and K = 12 in mixed-precision training with the Adam optimizer.
2.10 CLIP [61] uses Contrastive Learning, jointly learning different modalities.
2.11 Responses from MiniGPT-4 [98] have shown that this open-source MLLM has the ability to deeply understand the content of the image.
2.12 Traditional CNN-LSTM architectures were widely used previously [83].
3.1 LLaVa-v1.5-13B, the latest in the LLaVa series [49], performs well in benchmarks but falls short in enhancing visual appeal.
3.2 LLaVa [49] leverages ChatGPT/GPT-4 to synthesize data.
3.3 Image editing outcomes in InstructPix2Pix [7] and ControlNet [94].
3.4 Object Detectors struggle to detect them effectively [50].
3.5 The improved Grounding DINO [50] framework, derived from GLIP [46], establishes a strong interconnection between modalities.
3.6 SAM [40] is a promptable segmentation model.
3.7 A robust image encoder efficiently generates an image embedding, enabling prompt-based queries to produce masks in SAM [40].
3.8 Proposed Local Image Editing Model.
4.1 Word Cloud illustrates the word frequency in the new dataset.
4.6 Examples of unsuccessful editing.
4.7 Complete Chatbot.
List of Tables

4.1 Prompt used for ChatGPT to synthesize Instruction-following data.
4.2 Comparison between Our LLaVa, GPT-4V [57], and LLaVa-v1.5-13B [49].
4.6 Some responses are too short and do not contain much information.
Chapter 1

Problem statement

This chapter will provide an overview of the issue under investigation. Additionally, it will elucidate the key concepts referenced in this thesis, namely Aesthetic Assessment and Local Region Image Editing. Subsequently, we will delineate the scope of the proposed solution and articulate the research objectives. Finally, the primary contributions made in this study will be highlighted.
However, it is imperative to acknowledge that not everyone possesses expertise in the realm of photography. For many users, the creation of visually appealing images presents a challenge, particularly in meeting aesthetic criteria such as color, light, composition, depth, and other relevant factors. Recognizing this challenge, there is a growing necessity for the involvement of experts who can offer constructive feedback and suggest viable solutions to enhance the visual appeal of these photographs.
Consequently, a discernible demand has arisen for a tool designed to assist general users by providing insightful comments, valuable suggestions, and facilitating photo edits. This tool serves as a means of empowering users to refine their photographic endeavors and elevate the overall quality of their visual content.

There are several software programs that have efficiently addressed the demand for photo editing, such as the Adobe tools, including Photoshop and Lightroom. However, these are professional tools designed for a highly specialized group of users. For the general user, effectively utilizing these tools can be extremely challenging, as they require a significant amount of time to accumulate knowledge and learn how to use them. Recognizing this issue, some famous applications have integrated editing tools and color-changing filters with a remarkably simple interface to make image editing more appealing. TikTok and Instagram are among the most famous examples that have captured the market well, providing users with simple yet effective editing tools.

However, there is still no tool that provides feedback on photos. Users often edit photos based on intuition, without a foundation or any specific criteria, making this process time-consuming. They frequently share their photos on photography forums, seeking suggestions on aesthetic aspects from highly skilled photographers. Although chatbots like Bard, ChatGPT-Plus, or BingChat support conversation and commenting on photos, they are trained on general-purpose tasks, making it challenging to get specialized feedback in a specific field, such as aesthetics.
1.2 Definition
Local Region Image Editing addresses a basic user demand for photos. Almost no captured photo can be directly posted to social media; instead, it goes through a time-consuming process that makes the photo more attractive. Some basic kinds of editing include changing the brightness, applying filters, adding stickers, etc.
Trang 15"Change the man's
hair to pink hair"
Figure 1.1: The initial intent of editing is to preserve most of the original content.
Since these activities require intensive work, which is time-consuming, several artificial intelligence tools have emerged out of a desire to help users create a good-looking image.

As depicted in Figure 1.1, the AI agent, primarily utilizing Stable Diffusion [70], can generate a modified version of the original image based on a provided natural language prompt, while preserving most of the initial image.

But studies have shown that some recent well-known text-to-image models are sensitive to the prompt; a small change in the prompt may lead to completely different images (shown in Figure 1.2). As mentioned in [27], the basic intention of editing an image is to preserve most of the original content. To do some editing work, like replacing objects in the image, users have to carefully prepare prompts, which hinders the creativity and convenience of the prompt-based approach.

Aesthetic Assessment is one type of suggestion that mostly concerns the aesthetic aspect. The goal of this suggestion is to point out which parts of the image are not good and then give some advice on how to enhance the visual content. People often assume that the factors that constitute an image's beauty are mostly intuitive, depending on the feelings of assessors. This may not hold true; similar to other domains, numerous criteria must be met for an image to be deemed "beautiful". Otherwise, any assessment without a system of standard criteria is not valuable. There are some cornerstone aspects that even amateurs must follow, including colour, light, composition, etc.

Figure 1.2: Recent models cannot edit while preserving the original image.
Within the realm of Deep Learning, this research topic has been extensively discussed over an extended period, employing various approaches. However, one pioneering work stands out for framing its returned results as words rather than numbers or bounding boxes, as was done previously. The authors posit that representing responses in natural language is not only more appealing to users but also provides insight into how the underlying model processes information.
A recent advance in AI is Multimodal, which has explored a new ability that allows users to chat with AI agents about images. As shown in Figure 1.3, users can ask about many aspects related to the image, and the agent will generate a response based on the question and image. The impressive responses recently produced by GPT-4V [57], Bing Chat, and Bard have made this research promising for the future of generalist AI.

However, since these types of models are mostly trained on general-purpose tasks, it is hard for their responses to be specific in a single domain, such as aesthetics. Experiments (Figure 1.4) show that the responses of these AI agents when asked about aesthetic aspects are not supportive, and sometimes they mention objects that are not included in the image ("leaves" are not in Figure 1.4), which raises doubt about their ability to truthfully understand the visual content.
Figure 1.3: Chatbots like GPT-4V, Bard can respond to questions about an image. (In the figure, a user asks "What do you think about the composition of this image?" and the AI agent describes the well-balanced composition, the pier leading towards the water, the mountains in the background, and the reflection adding a sense of depth.)
Figure 1.4: Since trained on general-purpose tasks, these chatbots can't perform well in a specific task. (In the figure, Bard answers "What is your opinion on the aesthetic of the image?" about a minimalist cat photo with generic praise and repeatedly mentions "leaves" that are not present in the image.)
1.3 Research Objectives

This thesis aims to research methodologies capable of:

1. Enhancing visual chatbots in the aspect of aesthetic assessment so that they can provide responses that properly describe the content of the image. Moreover, when users pose questions that require aesthetic knowledge, the model's response should not only offer supportive information but also ensure reliability. The response must avoid being overly general or cliché.

2. Researching advanced methods that focus solely on local editing regions. While there are many text-to-image models that can both generate and edit based on user prompts, the initial intent of editing was to make the necessary changes to the required regions while preserving the majority of the image structure. Many recent models still lack this ability, making it challenging for these models to be genuinely useful for general image editing work.

3. Proposing a method that can combine the aforementioned abilities into a unified AI agent capable of not only providing information about the given image but also editing it based on the user's request. Users won't need to manually switch between tools since the AI agent is intelligent enough to select the appropriate tool for each task.
"Language serves as the attire of thought", and undoubtedly, it plays a pivotal role in articulating an individual's desires. This thesis will leverage Prompt-based Multimodal models for both purposes: editing images and delivering feedback based on user requests expressed in natural language. Natural language is used as the primary communication method between the AI agent and the human, since it is the most effective medium for a computer to comprehend desires deeply and engage seamlessly with users.

Multimodal models have emerged in recent years as an innovative research topic, showcasing their ability to synthesize and process information from diverse sources, such as images and text. The objective of this thesis is to explore multimodal models capable of providing feedback on and editing images, subsequently refining them based on aesthetic evaluation tasks. As image feedback and editing models are typically distinct, the project also proposes a solution to seamlessly integrate them into a unified chatbot. Users can engage in a conversation through the chat interface, send an image, and request feedback on various aesthetic aspects while editing the image according to suggestions.
1.4 Research scope
This topic focuses on researching Multimodal, primarily concentrating on two areas: language and image. The objective is to develop a chatbot with two main functions, image critique and editing, combining two models to perform tasks such as Visual Question Answering and Prompt-based Image Editing. The primary target audience for this research is non-expert users without specialized knowledge in the field of aesthetics.

During the critique process, the chatbot focuses on common aesthetic aspects such as lighting, color, composition, depth, and content. Additionally, the chatbot can suggest ways to improve the image quality, providing practical assistance to users. Its editing capabilities are concentrated on the local regions of the image, including changes to hair color, clothing, and other detailed factors.

Given the diversity of photo editing needs, there are numerous aspects that demand attention beyond the scope of this Multimodal research. These discoveries will therefore be left for future exploration.
1.5 Contribution
In summary, the contributions of the thesis are the following:
1. Research on the type of Multimodal model that is capable of assessing and chatting about an image. We then leverage ChatGPT to synthesize Instruction-following data. Finally, by using some modern training techniques, we successfully fine-tuned a Vision Language Model whose LLM has 7B parameters on a single RTX 3090.

2. Research on Generative Models, focusing on Local Region Editing. Recent years have produced several models capable of both synthesising and editing. This research seeks an end-to-end pipeline that goes straightforwardly from the user's image and prompt to the edited image, while preserving most of the original content.

3. Proposing a solution that connects those separate models, unifying them into a single chatbot that can do both Aesthetic Assessment and Local Image Editing.
1.6 Thesis structure
The structure of the thesis is divided into the following chapters:
• Chapter 1: Problem statement: Presents the main problem addressed by the thesis, along with an overview of the trends in artificial intelligence (AI) development and their relevance to the research problem.

• Chapter 2: Background and Related Work: Offers a comprehensive overview of the current landscape within the domain of Deep Learning, conducting surveys and investigations into prior and existing models related to Multimodal applications.

• Chapter 3: Proposed Methodology: Analyzes and selects appropriate data, models, and training methods.

• Chapter 4: Experiments: Conducts model training and evaluation, and identifies solution directions.

• Chapter 5: Discussion: Summarizes the work and the results achieved in the thesis. Discusses research work and potential future developments.
In this initial chapter, the problem statement has been introduced. The conceptual framework surrounding two pivotal keywords frequently referenced in this thesis, namely Aesthetic Assessment and Local Region Image Editing, has been presented. Additionally, discussion has been provided on the research scope, the research objectives, and the contributions made in the study.
Chapter 2

Background and Related Work

In the realm of artificial intelligence, there is a persistent endeavor to develop a versatile assistant capable of emulating human perception, inference-making, and interaction with the real world. Recent research has concentrated on constructing Foundation Models: general AI models proficient in acquiring information from diverse modalities and performing a broad spectrum of tasks. This thesis, in particular, focuses on Vision Language Models, a set of meticulously designed models intended for synthesizing and comprehending information from both images and texts. However, before delving into the specific area of study, an exploration of the background within the Deep Learning domain is undertaken to gain insights into the innovations that contribute to the successful development of Multimodal systems.
2.1 Background
2.1.1 Research on Computer Vision
Computer Vision is a multidisciplinary field at the intersection of computer science and artificial intelligence that empowers machines to interpret, understand, and extract meaningful information from visual data, such as images or videos, mimicking human visual perception. Several studies have been proposed in this field since the 1950s. One of the most important was the AlexNet architecture [41], introduced in 2012, which not only marked a breakthrough but also signaled the beginning of an era of Deep Learning that continues to this day.

Figure 2.1: Illustration of CNN architecture.
2.1.1.1 Pre-Transformer: Power and Limits of CNN

This model is particularly impressive for its application of the Convolutional Neural Network (CNN) architecture, inspired by how humans perceive images. The uniqueness of CNN lies in using sliding windows to extract features over small regions of an image, as depicted in Figure 2.1. As the network goes deeper, these windows effectively cover larger regions, helping to extract larger features, thereby enhancing the learning capability and helping the model extract more crucial information from the data.
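To make the sliding-window intuition concrete, the following is a minimal PyTorch sketch; the layer sizes and input resolution are illustrative assumptions, not the architecture of any model discussed in this thesis:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Illustrative CNN: small sliding windows whose effective receptive
    field grows as the network gets deeper."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3x3 window slides over the image
            nn.ReLU(),
            nn.MaxPool2d(2),                              # halves the spatial size
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer sees a larger region
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)                   # (B, 32, 56, 56) for a 224x224 input
        return self.classifier(h.flatten(1))   # class logits

logits = TinyCNN()(torch.randn(1, 3, 224, 224))  # -> shape (1, 10)
```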
In the following years, the AlexNet architecture underwent significant improvements, optimizing the number of parameters while maintaining performance, in models such as VGG [76], ResNet [26], InceptionNet [77], DenseNet [31], and many other architectures. At that time, increasing the number of parameters of CNNs was not considered an optimal way to improve model performance. This is because models within this architecture family often faced the issue of overfitting: a situation where the model performs well on the training data but less effectively on new data.

To overcome these challenges, several new techniques emerged, such as Batch Normalization [32], Dropout, and Regularization. These improvements not only help stabilize the training process but also enhance the model's performance. The success of these architectures has laid the foundation for various common tasks, including Image Classification, Object Localization, Object Detection, and Object Segmentation, and has been widely applied in many fields of life. Applications range from assisting in tumor detection in the medical field to urban traffic management and enforcing social distancing measures during the COVID-19 pandemic.
All of these applications belong to the group of Discriminative Models, which approach a problem by learning a decision boundary, determining a clear separation between classes or categories in the data. They primarily learn how to discriminate between different classes based on input features. Generative Models, on the other hand, are a novel approach that focuses on generation.
2.1.1.2 Generative Models

In contrast to Discriminative Models, Generative AI focuses on learning the data distribution. Instead of solely concentrating on classification, Generative AI emphasizes the ability to create new data from the distribution of the training data.

This opens up various creative applications, from generating new images to synthesizing speech, expanding the realm of imagination in the field of artificial intelligence. Nowadays, it has the capability to create images of faces that have never existed and to transform videos with a level of realism that is challenging to distinguish from reality. It also innovates in altering image styles, unlocking new potentials for creativity and applications across various domains.

The group of popular architectures in the field of Generative AI includes the Generative Adversarial Network (GAN) [17], the Variational Autoencoder (VAE) [89], and, currently, the Diffusion Model, which is the most popular choice. The training method of Diffusion is evaluated to be more stable than previous methods, delivering results with high image quality and superior realism. Figure 2.2 shows some GAN applications in generating humans at different ages, super resolution, and Image-to-Image translation.

Figure 2.2: Some GAN applications. Source: LearnOpenCV.
2.1.1.3 Transformer's Entry: Intersection of vision and language

By the end of 2020, the emergence of the Vision Transformer (ViT) [22] highlighted the Transformer [82] architecture's presence in the field of image processing. The ViT architecture is shown in Figure 2.3, where the first step is splitting the image into multiple patches treated as "image tokens". After flattening the extracted visual features, a Position Embedding is added to the embedding space to append the order information. Finally, a Transformer Encoder processes those "image tokens" just as it does word tokens. The Self-Attention mechanism allows image patches to attend to each other and helps the model learn their spatial and visual information in different parts of the image.

Figure 2.3: ViT [22] marked the first time a Transformer was used in vision problems. (The figure also shows the extra learnable class embedding prepended to the patch tokens.)
This was not only a significant step forward but also marked a special convergence between two crucial domains: Computer Vision and Natural Language Processing. The Vision Transformer not only achieved results far superior to CNNs but also opened up many possibilities in how we understand and apply the Transformer architecture across various vision and language applications.
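As a concrete illustration of the patch-splitting step described above, here is a minimal sketch assuming standard ViT-Base settings (224x224 input, 16x16 patches, 768-dimensional embeddings); actual ViT variants may use different configurations:

```python
import torch
import torch.nn as nn

# ViT "patchify" sketch: split the image into 16x16 patches, project each
# patch to an embedding ("image token"), and add position embeddings.
image = torch.randn(1, 3, 224, 224)                    # (B, C, H, W)
patch, dim = 16, 768
n_patches = (224 // patch) ** 2                        # 196 image tokens

# A strided convolution is the idiomatic way to cut non-overlapping patches.
to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
tokens = to_tokens(image).flatten(2).transpose(1, 2)   # (1, 196, 768)

cls_token = nn.Parameter(torch.zeros(1, 1, dim))       # extra learnable token
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))  # order information
x = torch.cat([cls_token, tokens], dim=1) + pos_embed  # input to the Encoder
```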
2.1.2 Research on Natural Language Processing
2.1.2.1 Transformers: NLP Milestone
Formerly, the classical Recurrent Neural Network (RNN) architecture (Figure 2.4) was commonly applied in the natural language processing area. RNNs, characterized by the sequential connection of neurons forming a directed cycle, maintain hidden states to gather information from previous inputs. This makes RNNs ideal for handling sequential data such as text and time series, where the relationships between components often depend on context and order.

Figure 2.4: A Recurrent Neural Network (RNN) utilizes sequential neurons to extract and retain information from past inputs. Source: Wikipedia.

However, despite the benefits that RNNs bring to processing sequential data, they also face significant challenges. The issue of the Vanishing Gradient is one of the crucial aspects, making the backward flow of information through time steps difficult and leading to the loss of important information. This inefficiency poses a major limitation in handling long sequences of data.
A significant breakthrough with the Transformer [82] architecture has overcome these limitations, providing the ability to process information in parallel and leveraging the Self-Attention mechanism. As a result, this architecture not only addresses the Vanishing Gradient problem but also opens up new potentials in understanding and generating natural language in a distinctive way. The Transformer is the architecture behind the success of Large Language Models (LLM).

As shown in Figure 2.5, the Transformer consists of an Encoder to comprehend the textual information, connected to a Decoder that uses that information to generate a sequence in a probabilistic way.

Figure 2.5: Transformer consists of the Encoder to comprehend the textual information and the Decoder to generate sequences.

The core of the architecture lies in the Self-Attention mechanism. Drawing inspiration from database principles, the embedding of tokens undergoes transformation via a linear layer, generating three vectors: $Q$ as the query, $K$ as the key, and $V$ as the value. The query vector $Q$ identifies the key vector $K$ that is most similar, followed by normalization using the softmax function to create a distribution between 0 and 1. A higher value indicates a higher probability of the query vector aligning with the key vector. This distribution is then multiplied with $V$ to determine which parts of $V$ should receive more attention, distinguishing between vectors that require more or less attention. These attention mechanisms allow Transformers to learn different and distant dependencies in language:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

Multi-Head Attention is a crucial element in the Transformer architecture, complementing the earlier introduced Self-Attention mechanism. While Self-Attention allows a model to weigh different parts of the input sequence based on their relevance, Multi-Head Attention extends this capability by employing multiple attention heads in parallel. Each head operates independently, enabling the model to capture diverse features and patterns simultaneously. This parallel processing not only enhances the model's ability to understand complex relationships within the data but also facilitates more efficient training and inference through parallelization:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}$$
$$\text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
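The attention formula above translates almost line-for-line into code. The following is a minimal single-head sketch with toy tensor sizes, not an optimized implementation:

```python
import math
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention, exactly as in the formula above."""
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)  # query-key similarities
    weights = scores.softmax(dim=-1)                 # distribution over the keys
    return weights @ V                               # weighted sum of the values

# Toy usage: a batch of 4 tokens with model dimension 8.
x = torch.randn(1, 4, 8)
out = attention(x, x, x)   # self-attention: Q, K, V all derive from x
```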
The subsequent years witnessed the remarkable development of Large Language Models (LLM), achieved by increasing the number of parameters and the size of training data. These LLMs have demonstrated impressive capabilities, enabling them to perform tasks such as poetry generation [63], text summarization, and translation [64] at levels comparable to humans.
2.1.2.2 Training method
In addition to training on the traditional next-word-prediction task, various training methods have been proposed to enhance the performance of Large Language Models (LLM). These approaches have enabled some smaller-sized models to compete effectively with larger ones. An important step in the development process involves Supervised Tuning and the application of Reinforcement Learning, specifically the Reinforcement Learning from Human Feedback (RLHF) method.

Learning through interaction with humans, the GPT-3 model [8] evolved into GPT-3.5 [59], becoming a particularly flexible and multitasking model. With the ability to function as a chatbot, users can request the model to perform various tasks simply by providing desired prompts. This development serves as a precursor to the current state of ChatGPT.
Instruction Tuning is also an advanced fine-tuning method designed to optimize model performance by adjusting specific instructions for each particular task. Well-known Large Language Models (LLM) like T5 [64], LLaMA [79][80], and others have demonstrated impressive results on various specific tasks with the help of Instruction Tuning. Some powerful models employing this method, such as Alpaca [78], Flan-T5 [16], and WizardLM [89], have become outstanding, excelling not only in performing specific tasks but also in responding effectively to human instructions. An illustrative instruction-following sample is shown below.
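For concreteness, instruction-tuning datasets are commonly stored as instruction/input/output triples. The sample below follows the Alpaca-style schema; the field contents and template wording are illustrative assumptions, not data from this thesis:

```python
# One Alpaca-style instruction-following record (illustrative content).
sample = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large Language Models are trained on next-word prediction ...",
    "output": "LLMs learn language by repeatedly predicting the next word.",
}

# A typical prompt template that turns the record into training text.
PROMPT = (
    "Below is an instruction that describes a task, paired with an input.\n"
    "### Instruction:\n{instruction}\n### Input:\n{input}\n### Response:\n"
)
training_text = PROMPT.format(**sample) + sample["output"]
```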
Through Instruction Tuning, many models have seen notable advancements, especially in their capabilities for tasks that demand high-level reasoning and inference, such as mathematical problem-solving and code generation. Models like Llemma [5], WizardMath [52], WizardCoder [53], and CodeLLaMa serve as evidence of the success of this approach, improving LLMs' logical reasoning abilities in mathematical problem-solving and code writing.
In some cases, the crucial aspect lies in carefully crafting and providing accurate prompts, which can sometimes be sufficient to achieve the desired results without the need for expensive fine-tuning. This has introduced a new level of convenience and opened up many creative possibilities in the real-world applications of Large Language Models. Several popular prompting methods, such as Chain of Thought and In-Context Learning [8], have emerged, achieving impressive results through the simple act of writing prompts appropriately.
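As a brief illustration (the wording below is a made-up example, not taken from the thesis), a few-shot Chain-of-Thought prompt simply prepends a worked example whose answer spells out intermediate reasoning steps, which the model then imitates:

```python
# Few-shot Chain-of-Thought prompt: the in-context example demonstrates
# step-by-step reasoning before the final answer.
cot_prompt = """Q: A photographer takes 12 photos per shoot and does 3 shoots. How many photos?
A: Each shoot yields 12 photos. 3 shoots yield 3 x 12 = 36 photos. The answer is 36.

Q: An album holds 24 photos and 5 albums are full. How many photos in total?
A:"""
# The model is expected to continue in the same style:
# "5 albums hold 5 x 24 = 120 photos. The answer is 120."
```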
Despite their strength, these models are often trained on general tasks, meaning they perform well across a broad range of tasks. Therefore, when applied to specific tasks, the results may not reach the optimal level expected. To address this issue, fine-tuning is a common choice to optimize the performance of Large Language Models on specific tasks.
2.1.2.3 Success Recipe: Larger Models
In recent years, we have witnessed not only a significant leap in the size of Large Language Models (LLM) but also a rapid evolution. Looking back, the BERT model [27] in 2018 had approximately 340 million parameters. Just a year later, OpenAI's GPT-2 [63], a project initially considered too risky for widespread release due to concerns about its potential misuse by politicians and scammers, became more impressive with 1.5 billion parameters. By 2020, the precursor to ChatGPT, GPT-3 [8], took this number to a staggering 175 billion parameters. The increase in model size implies that the success recipe of Language Models is expanding data size and parameters.

Figure 2.6: Large Language Models require substantial resource consumption. Source: Epoch AI.

Figure 2.6 shows the cost of training Large Language Models; the growth in size not only brings opportunities but also poses challenges in computational performance and model training cost. This presents a significant barrier for research groups with limited computational resources, making it difficult for them to participate in the ever-increasing race for model size.

Figure 2.7: Some Parameter-efficient tuning methods [95]: (a) Adapter Tuning, (b) Prefix Tuning, (c) Prompt Tuning, (d) Low-Rank Adaptation.

Figure 2.8: LoRA [30] significantly lowers the trainable parameters.
2.1.2.4 Parameter-efficient Fine-tuning methods

Several methods have been developed to optimize the fine-tuning process of Large Language Models (LLM) with limited resources, enabling successful training on constrained hardware. One significant research direction is Parameter-efficient fine-tuning (PEFT), aimed at reducing the number of trainable parameters during training while preserving model performance. Notable approaches include Adapter Tuning [29], which trains small neural network adapters integrated into the Transformer architecture (Figure 2.7(a)); Prefix Tuning [47], which adds a prefix string as a set of trainable vectors to each Transformer layer (Figure 2.7(b)); and Prompt Tuning [42], which, unlike Prefix Tuning, integrates the vectors into the input layer (Figure 2.7(c)). Among them, LoRA stands out as a particularly effective method (Figure 2.7(d)).
LoRA, proposed by Microsoft researchers, is a specialized training mechanism that freezes the entire pre-trained model and focuses solely on training low-rank matrices embedded in each layer of the Transformer architecture. By concentrating on training low-rank matrices, LoRA significantly reduces the number of parameters, alleviating resource burdens.

Inspired by previous research indicating that pre-trained language models possess a low "intrinsic dimension" and can learn efficiently despite a random projection to a smaller subspace, the authors extend this insight to hypothesize that the updates to the weights also exhibit a low "intrinsic rank" during adaptation.

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the authors constrain its update by employing a low-rank decomposition $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$. Throughout training, $W_0$ remains static without receiving gradient updates, while matrices $A$ and $B$ are the trainable parameters. Importantly, both $W_0$ and $\Delta W = BA$ are multiplied with the same input, and their respective output vectors are summed coordinate-wise. For $h = W_0 x$, the update is:

$$h = W_0 x + \Delta W x = W_0 x + BA x$$

The variable $r$ denotes the rank, influencing the amount of trainable parameters and memory consumption. As depicted in Figure 2.8, the matrix $A$ is initialized from a Gaussian distribution, while $B$ starts at zero. Consequently, the initial change in weights $\Delta W = BA$ is zero, indicating no update during the initial phase. Upon completion of training, the integration of LoRA weights into the original model becomes a seamless process through matrix addition. Notably, while LoRA can theoretically be applied to various layers of the Transformer, the authors' experiments have shown promising results specifically when applied to Self-Attention layers.
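The decomposition above is straightforward to express in code. Below is a minimal sketch of a LoRA-augmented linear layer (an illustrative re-implementation, not the exact code of the official library or of this thesis; the alpha/r scaling factor follows the LoRA paper):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W0 plus a trainable low-rank update BA."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # W0 stays frozen
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))         # zero init, so BA = 0 at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + BA x (scaled), matching the update rule above
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
# Only A and B train: 8*768 + 768*8 = 12,288 parameters vs 589,824 in W0.
```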
Experimental results have also shown that LoRA can fine-tune the 175-billion-parameter GPT-3 model with roughly 10,000 times fewer trainable parameters and up to three times less GPU memory, while maintaining high performance. Compared to other methods, this is indeed an optimized solution, particularly valuable when fine-tuning models with limited computational resources while ensuring high performance.
2.1.2.5 Memory-efficient fine-tuning method
Quantization is a crucial technique employed in Large Language Models (LLMs) to optimize and enhance their efficiency. In the context of LLMs, which often have massive numbers of parameters, quantization addresses the challenge of reducing model size and computational complexity without compromising performance. Naive casting, or the reduction of precision without careful consideration, may result in a loss of critical information, impacting the model's overall performance.

Figure 2.9: Comparing per-device memory use for model states with three ZeRO-DP [66] optimization stages. Model size Ψ = 7.5B, DP degree N_d = 64, and K = 12 in mixed-precision training with the Adam optimizer.

To address these challenges, sophisticated techniques have emerged for effective quantization fine-tuning. LLM.int8() seamlessly converts float32 to int8, preserving overall performance. QLoRA [20], an innovative approach, combines quantization with LoRA to optimize memory usage during model training. Introducing the NF4 data type, QLoRA uniquely quantizes based on quantiles of the standard normal distribution, achieving memory-efficient optimization compared to traditional 4-bit integer and float quantization, all while maintaining high accuracy.

These methods aim to preserve the essential information while achieving the desired compression of model parameters and activations. By navigating the trade-off between reduced precision and model accuracy, these techniques contribute to the successful implementation of quantized models in real-world applications.
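In practice, QLoRA-style fine-tuning is commonly set up through the Hugging Face transformers, bitsandbytes, and peft libraries. The sketch below shows the typical pattern; the model id and hyperparameters are placeholders, and API details may vary across library versions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base model in 4-bit NF4 (QLoRA-style quantization).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # higher-precision compute dtype
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
)

# Attach trainable LoRA adapters to the quantized, frozen attention layers.
lora_config = LoraConfig(r=8, lora_alpha=16,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA matrices train
```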
2.1.2.6 Training-efficient technique
The large size of the model consumes nearly all the GPU memory, forcing us to use a smaller batch size during training and significantly prolonging the training process. Gradient Accumulation is a technique that addresses this issue by allowing training with a small batch size while achieving the same effect as a large batch size: it does this by accumulating gradients for a certain number of steps before making a parameter update. (Gradient Checkpointing is a complementary memory-saving technique that recomputes intermediate activations during the backward pass instead of storing them.)
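A minimal sketch of gradient accumulation follows; the tiny model and synthetic data are placeholders for illustration:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)                          # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
loader = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(8)]

ACCUM_STEPS = 4   # 4 micro-batches of 8 behave like one batch of 32
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets)
    (loss / ACCUM_STEPS).backward()               # scale so gradients average correctly
    if (step + 1) % ACCUM_STEPS == 0:             # update once per ACCUM_STEPS micro-batches
        optimizer.step()
        optimizer.zero_grad()
```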
Another solution is to scale the number of GPUs, enabling training on multiple devices. This technique, supported by the PyTorch framework, is known as Data Parallel. PyTorch also supports Distributed Data Parallel, allowing training on multiple nodes and alleviating computational pressure by leveraging multiple devices.

It's important to note that these techniques are effective when the model can fit into a single GPU. If the model is too large for a single GPU, partitioning the model becomes a viable solution.
DeepSpeed ZeRO [67], introduced by the Microsoft research team, represents a significant breakthrough in memory optimization during model training. Illustrated in Figure 2.9, the three main methods are optimizer state sharding ($P_{os}$), gradient sharding ($P_g$), and parameter sharding ($P_p$). These advancements not only enable efficient training on devices with multiple GPUs but also integrate an Offload mechanism to simultaneously leverage both GPU VRAM and CPU RAM, ensuring comprehensive system performance optimization.
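For reference, ZeRO stages are typically selected through a DeepSpeed configuration passed to deepspeed.initialize. The snippet below is an illustrative minimal config; the field values are assumptions, not the settings used in this thesis:

```python
# Minimal DeepSpeed config selecting ZeRO stage 2 (optimizer state and
# gradient sharding) with CPU offload of optimizer states.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                              # 1: P_os, 2: +P_g, 3: +P_p
        "offload_optimizer": {"device": "cpu"},  # spill optimizer states to CPU RAM
    },
}
```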
Thanks to these technological advances, optimizing efficiency on devices with limited resources has become more accessible than ever, and fine-tuning Large Language Models (LLMs) is no longer a formidable challenge.
2.2 Research on Multimodal
Although Large Language Models (LLM) have demonstrated remarkable zero-shot capabilities on many language-related tasks, they still appear "blind" to the field of images, as they are primarily trained to process words.

To overcome this limitation and leverage simultaneous advancements in both image processing and natural language, researchers have conducted studies combining image and language models. One notable contribution is OpenAI's CLIP [61].

Figure 2.10: CLIP [61] uses Contrastive Learning, jointly learning different modalities.

CLIP proposes building a shared embedding space to align images and language. Using Contrastive Learning to optimize the similarity between pairs of images and text (Figure 2.10), CLIP is pretrained on over 400 million image-text pairs. As a result, CLIP exhibits strong Zero-shot and Few-shot capabilities, enabling object recognition in images without prior training, based on understanding natural language descriptions.
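The symmetric contrastive objective at the heart of CLIP can be sketched in a few lines. This is a simplified illustration of the published loss, with random tensors standing in for the outputs of the real image and text encoders:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)           # unit-length embeddings
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature       # (B, B) similarity matrix
    labels = torch.arange(len(logits))               # diagonal entries are the matches
    return (F.cross_entropy(logits, labels) +        # image -> text direction
            F.cross_entropy(logits.T, labels)) / 2   # text -> image direction

# Toy batch: 8 image and 8 text embeddings of dimension 512.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```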
Usually, when referring to Multimodal Large Language Models (MLLM), we mean models based on Large Language Models (LLM) with the capability to intake and process information from multiple sources. According to [92], from the perspective of developing Artificial General Intelligence (AGI), MLLM is considered a significant step beyond LLM, for the following reasons:

• MLLM is similar to how humans observe the world, utilizing multiple senses to naturally intake information, often in a complementary and cooperative manner.

• MLLM provides a more user-friendly interactive interface. Thanks to its capability to support diverse inputs, users can interact and communicate with this intelligent assistant in a more flexible manner.

• MLLM is a more comprehensive tool for solving tasks. While LLM can handle various NLP tasks, MLLM has the ability to address a broader range of tasks, providing diversity and comprehensiveness in information processing capabilities.
Figure 2.11: Responses from MiniGPT-4 [98] have shown that this open-source MLLM has the ability to deeply understand the content of the image. (In the figure, the model writes an advertisement for a brass toucan lamp from its photo, and flags a person running across a busy street in a surveillance image as a potential safety hazard.)
GPT-4 [57] by OpenAI has garnered significant attention from the research community, especially for the impressive capabilities demonstrated by this chatbot. However, it's worth noting that GPT-4 is not open-source software, and as of the current moment, no official information about its configuration has been disclosed. Nevertheless, the research community continues its efforts to develop open-source MLLMs, and some projects have achieved notable successes.
As shown in Figure 2.11, MLLMs are able to follow the human's prompt to generate proper responses. They are now capable of writing an advertisement from an image, recognizing risks that appear in CCTV cameras, generating code for a website based on images of that website, understanding the meaning of memes on social media, and reading and comprehending text from images without the need for Optical Character Recognition (OCR) technology. These advancements demonstrate the strength of the research community and open up the prospect of practical applications for MLLMs.
As per [92], these works can be broadly classified into two primary groups:
25
Trang 381 End-to-end Training with Large Language Models: In this area,
collec-tions of multimodal datasets that include images, text, and cross-modal
instruc-tions have been gathered to aid in the training of Language Models (LLM)
fo-cused on understanding visual information Various models, such as Flamingo
[1], GPT-4 [57], as well as open-source versions like MiniGPT-4 [98], Kosmos-2
have demonstrated remarkable proficiency in comprehending visual data and
generating responses that align contextually.
While these models have improved Language Models for diverse tasks, persistent challenges impede their progress toward an optimal solution A major hurdle is
data sourcing, requiring extensive conversational and question-and-answer datarelated to imagery Generating intricate task-specific data remains complex Ad-ditionally, creating a unified architectural framework for tasks like segmentation
and image generation poses a significant challenge, crucial for practical tions
applica-2 Chaining Tools with Large Language Models: This approach involves
care-ful preparation of prompts with the aim of empowering Language Models (LLM),
as seen in Langchain [12] The goal is to enable these models to coordinate various
tools, such as segmentation models, to perform specific sub-tasks without
requir-ing additional resource-intensive trainrequir-ing processes Noteworthy contributions
like Visual ChatGPT [88], MM-REACT illustrate the advantages of this
ap-proach, especially in achieving diverse task capabilities by seamlessly integrating tools within an AI model.
Despite the inherent adaptability of this approach, user prompts frequently volve a level of complexity that is not consistently straightforward The presence
in-of uncertainties within these prompts and their occasional inadequacy in
accu-rately selecting and initiating required tasks presents common challenges
Conse-quently, the ensuing process may involve integrating multiple models, potentiallyleading to delays in execution and incurring operational expenses associated with
26
Trang 39the simultaneous use of numerous models for various tasks.
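As a schematic illustration of the tool-chaining idea (all names below are hypothetical; real systems such as Visual ChatGPT use an LLM itself to make the routing decision):

```python
# Schematic tool-chaining loop: a router picks one tool per user request.
def segment_tool(image, prompt):    # hypothetical wrapper around a segmentation model
    return f"mask for '{prompt}'"

def caption_tool(image, prompt):    # hypothetical wrapper around a captioning model
    return "a description of the image"

TOOLS = {"segment": segment_tool, "caption": caption_tool}

def route(user_request: str) -> str:
    # A real agent would prompt the LLM to choose; this keyword stub stands in.
    return "segment" if "edit" in user_request else "caption"

request = "Please edit the image: change the man's hair to pink"
result = TOOLS[route(request)](image=None, prompt=request)
```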
This study aims to develop an image-assessing and image-editing Chatbot with an emphasis on aesthetics. It will scrutinize models in this category, conducting a thorough analysis of their strengths and weaknesses. The goal is to establish a foundational understanding for informed decisions on training and seamless integration, aligned with the intended purpose.
2.3 End-to-end training with Large Language Model
Within the realm of Vision Language Models, there are two predominant structural paradigms: Image-to-Text and Text-to-Image architectures. In the former, a model generates text based on the provided image or text input, while in the latter, a model produces an image from such input.

The Image-to-Text framework finds prominence in tasks such as Image Captioning and Visual Question Answering, where the input predominantly comprises images and queries, yielding natural language responses as output. These models primarily serve tasks related to the assessment of images.

Conversely, the Text-to-Image framework is commonly deployed for image generation, typically involving a descriptive directive sentence as input and generating a corresponding image as output. In some cases, the input goes beyond a simple directive sentence and includes additional information in the form of prompts or binary masks. These masks outline specific regions for alteration, a method commonly used in tasks involving localized image editing.
2.3.1 Image Assessment Models
Within the domain of artificial intelligence models designed to clarify and facilitate user inquiries pertaining to images, the predominant tasks often revolve around Image Captioning and Visual Question Answering. Image Captioning, in particular, focuses on the generation of comprehensive descriptions elucidating the content depicted within a given image. This underscores the model's proficiency in discerning and articulating image attributes through the use of natural and coherent language.

Figure 2.12: Traditional CNN-LSTM architectures were widely used previously [83].

In contrast, Visual Question Answering establishes a question-and-answer framework wherein users interact with various aspects of the image. This necessitates that the model not only recognize image elements but also possess cognitive reasoning capabilities in order to provide accurate responses. This task demands a deeper comprehension of both visual stimuli and textual queries when compared to the inherent complexities involved in Image Captioning.

These research problems were historically tackled through the classical CNN-LSTM architecture, as depicted in Figure 2.12 and referenced in previous works [90, 3]. However, recent years have witnessed significant advancements in this domain, notably through the adoption of Transformer architectures [82] in lieu of the antiquated CNN or LSTM models. Notably, OpenAI's CLIP model stands as a pioneering work that has showcased promising outcomes, underscoring the potential of this burgeoning research field.

Subsequently, this subject matter has garnered substantial attention from numerous research groups, employing diverse training methodologies across varied datasets,