UNIVERSITY OF INFORMATION TECHNOLOGY
FACULTY OF COMPUTER SCIENCE
PHAM THI BICH NGA
DAM VU TRONG TAI
GRADUATE DISSERTATION
BACHELOR OF COMPUTER SCIENCE
VIETNAM NATIONAL UNIVERSITY — HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
FACULTY OF COMPUTER SCIENCE
PHAM THI BICH NGA - 20521642
DAM VU TRONG TAI - 20521855
GRADUATE DISSERTATION
IMAGE RECONSTRUCTION FROM FUNCTIONAL
MAGNETIC RESONANCE IMAGING (fMRI) DATA
USING LATENT DIFFUSION MODEL
BACHELOR OF COMPUTER SCIENCE
SUPERVISED BY
DR NGUYEN VINH TIEP
The committee for evaluating the graduation thesis is established according to Decision No. .........., dated .........., by the Rector of the University of Information Technology.
.......................................... — Chairman.
.......................................... — Secretary.
.......................................... — Member.
.......................................... — Member.
The successful completion of this dissertation is attributed to the invaluable assistance and support extended by numerous individuals, and we express profound gratitude for their feedback. Primarily, we wish to convey appreciation to our supervisor, Dr. Nguyen Vinh Tiep, for his dedicated guidance, enthusiastic direction, and invaluable instructions throughout this research endeavor. His wise counsel and unwavering support played a vital role in navigating the research process, leading to the successful completion of this thesis.
Our sincere thanks are extended to the Dean of the Faculty and all the faculty members in the Faculty of Computer Science at the University of Information Technology for their assistance and for equipping us with the knowledge necessary to complete this thesis.
Acknowledgment is also extended to the Multimedia Laboratory (MMLab-UIT) for providing a research environment and equipment for this study. Additionally, our appreciation goes to the researchers of the MMLab for their valuable feedback and probing questions that contributed to our research. Their insights aided in identifying and rectifying mistakes, ultimately enhancing the quality of this thesis.
2.1.1 Vision-Language Model Foundations
2.1.2 Contrastive Language-Image Pretraining Model
2.2 Diffusion Models
2.2.1 Intuition of Diffusion Models
2.2.2 Latent Diffusion Models
3 Related Work
3.1 Human Brain Activity and Its Representations
5.3.1 Comparison with Other Models
LIST OF FIGURES
1.1.1 Example of the image reconstruction process from fMRI data. fMRI data are acquired from an fMRI scanner. We process the fMRI reconstruction stage to reconstruct the original image from the fMRI data.
2.1.1 The new learning paradigm with VLMs enables effective usage of web data and zero-shot predictions without task-specific fine-tuning.
2.1.2 Architecture of the Vision Transformer.
2.1.3 Summary of the CLIP approach.
2.2.1 Forward diffusion and reverse diffusion process.
2.2.2 Training and sampling algorithms of DDPMs.
2.2.3 The U-Net architecture.
2.2.4 The architecture of the latent diffusion model.
3.1.1 Magnetoencephalography (MEG) data have a high temporal resolution (on the order of msec), which allows us to directly assess latency differences in these neural responses. Functional magnetic resonance imaging (fMRI) measures have high spatial resolution (on the order of mm), which allows us to pinpoint the location of activity associated with a sensorimotor event.
3.2.1 Framework diagram for the visual stimulus reconstruction task.
3.2.2 Overview of two variations of frameworks proposed by Shen et al. (2019b): (A) ShenDNN and (B) ShenDNN+DGN. The yellow color denotes the use of pre-trained components.
3.2.3 The BeliyEncDec framework, introduced by Beliy et al. (2019), involves two main training stages: (A) supervised training of the Encoder and (B) a combination of supervised and self-supervised training for the Decoder. During this process, the Encoder's weights remain constant. The components of the model trained on external unlabeled data are indicated in blue.
3.2.4 GAN-based frameworks. (A) SeeligerDCGAN framework based on a deep convolutional GAN. (B) Framework proposed by Mozafari et al.
3.2.5 VanRullenVAE-GAN framework proposed by VanRullen and Reddy.
4.1.1 Analysis of limitations in the results of the Brain-Diffuser model. Corresponding to each groundtruth image in the MS COCO dataset that the user observes, there are from 1 to 5 accompanying captions. The caption sentences have a structure that describes the object categories and their relationships. As shown in the image above, the caption includes the object categories "vase" and "roses", and their relationship is that the "roses" are in the "vase". With the corresponding groundtruth, the image reconstructed from the fMRI data must have the same content as the original image and description. However, the reconstructed image still does not show the correct object category ("roses"), and the details of the image are still not the same as in the original image.
4.1.2 Overview of the entire image reconstruction process.
4.1.3 Architecture of the Brain-Diffuser model. When a subject views the original image, their brain activity is recorded and processed into fMRI data. These fMRI data are then simultaneously input into three branches. The first branch uses a low-level reconstruction model to create a low-resolution image. The Text Branch takes the fMRI data as input and outputs fMRI Text Embeddings based on the Ridge model learned for the Text Branch. The Vision Branch takes the fMRI data as input and outputs fMRI Vision Embeddings based on the Ridge model learned for the Vision Branch. These low-resolution images, along with the Text and Vision Embeddings, are incorporated into a pretrained Diffusion model to generate reconstructed images. The Text Branch focuses on enriching information related to object categories and relationships, while the Vision Branch concentrates on refining image details. The collaborative process between these branches ensures the accurate and comprehensive reproduction of the original image.
4.1.4 Visualization of the distributions of the Text Branch and Vision Branch of the Brain-Diffuser model. The Text Distribution on the left shows the Brain-Diffuser fMRI Text Embeddings (green points) and the groundtruth CLIP Text Embeddings (blue points). The Vision Distribution on the right shows the Brain-Diffuser fMRI Vision Embeddings (green points) and the groundtruth CLIP Vision Embeddings (blue points). The Text Distribution and Vision Distribution obtained with the Ridge Regression model exhibit a significant gap. Based on the principle that the greater the overlap between two distributions, the more information is retained, these large gaps in the Text and Vision Distributions cause information loss during the mapping process. Therefore, the Ridge Regression model is not effective in mapping the distributions. UMAP dimensionality reduction is applied to visualize the data points of the distributions with 982 samples in the testing set.
4.1.5 Ridge Regression model architecture overview for the Vision and Text Branches.
4.2.1 Overview of our training framework using MindMapper for mapping the Text and Vision Branches.
4.2.2 Overview of our MindMapper architecture for mapping fMRI to CLIP Text/Vision Embeddings.
4.2.3 UMAP plots depict the groundtruth Text Embeddings (blue), Ridge (Baseline) Text Branch embeddings (green), and MindMapper Text Branch embeddings (red). The distance between the groundtruth embeddings and the MindMapper embeddings is lower, suggesting that MindMapper helps to align the two embedding spaces.
4.2.4 The UMAP plots illustrate embeddings for the Vision groundtruth (blue), Ridge (Baseline) Vision Branch embeddings (green), and MindMapper Vision Branch embeddings (red). Both the left and right images show a significant separation. This results in information loss, indicating that models such as Ridge Regression or MindMapper are not effective in learning these mappings for the Vision Branch.
4.2.5 Overview of our framework using MindMapper and VisualDenoiser for mapping high-level feature components.
4.2.6 UMAP plots illustrate Ridge (Baseline) Vision Embeddings (green), MindMapper Vision Embeddings (red), and MindMapper+VisualDenoiser Embeddings (pink), compared with the groundtruth Vision Embeddings (blue). The Baseline and MindMapper vision distributions have a large gap compared to the groundtruth distribution, but the MindMapper+VisualDenoiser vision distribution has the highest overlap, showing that the model contributes effectively to the mapping process.
4.2.7 A comprehensive overview of the Training and Inference processes.
5.3.1 Comparison of fMRI reconstructions for different models.
5.3.2 Percentage of each model in ranks.
5.4.1 MindMapper model used for the Text Branch, with results compared to the Brain-Diffuser model results and the groundtruth image. Applying the MindMapper model to the Text Branch significantly contributes to making the model aware of object classes and relationships in the semantic context of the image.
5.4.2 Comparison of the outcomes of the MindMapper and VisualDenoiser model for the Vision Branch and MindMapper for the Text Branch at the final stage, against the scenario where only MindMapper is employed for the Text Branch, along with Brain-Diffuser and the groundtruth. The combined MindMapper and VisualDenoiser model showcases a substantial improvement in reconstructing the meaningful content of the image with high detail for both the Text and Vision Branches.
List of Tables
5.3.1 Scores of different models on the brain-image reconstruction task. For each measure, the best value is in bold. For the PixCorr, SSIM, AlexNet-2, AlexNet-5, InceptionV3 and CLIP metrics, higher is better. For the EffNet-B and SwAV distances, lower is better. The arrow pointing up or down indicates this.
5.3.2 The average rank of different models based on the results of user studies.
5.4.1 Quantitative comparison of MindMapper against the Baseline result. For each measure, the best value is in bold. For the PixCorr, SSIM, AlexNet-2, AlexNet-5, Inception and CLIP metrics, higher is better. For the EffNet-B and SwAV distances, lower is better. The arrow pointing up or down indicates this.
5.4.2 Quantitative comparison of MindMapper and VisualDenoiser against the Baseline result. For the PixCorr, SSIM, AlexNet(2), AlexNet(5), Inception and CLIP metrics, higher is better. For the EffNet-B and SwAV distances, lower is better. When comparing our MDIR model, consisting of MindMapper for the Text Branch combined with MindMapper and VisualDenoiser for the Vision Branch, the model shows significantly improved performance on the high-level metrics EffNet-B and SwAV, which increase by 12.5% and 11.5%, while the remaining metrics also increase by up to 2.6%.
5.4.3 Performance metrics of MDIR with different DDIM steps. They consume 12 GB of GPU memory for the inference phase.
In the realm of neural decoding research, a particularly fascinating area involves the reconstruction of visually perceived natural images using fMRI signals. Previous approaches have succeeded in recreating the low-level properties (shape, texture, structures) in the reconstructed images, while the high-level features (object category, image details) cannot be generated exactly. This is because they utilized simple projection methods such as Ridge Regression to compress fMRI data into a corresponding space, such as the CLIP space, that can be used to condition the visual stimulus reconstruction process. Nevertheless, those methods are unable to capture the complicated patterns in the CLIP space, and most of them used separate simple models to map fMRI to distinct tokens in a high-dimensional space, which is not reliable: missing the interaction information between tokens can lead to information loss issues. In this work, we focus on improving the projection of the high-level features by leveraging two branches: a text branch, mainly for object class and relationship perception, and a vision branch for image detail reconstruction. We propose a MindMapper model to deal with the information loss of previous methods in the Text Branch. Additionally, due to the complexity of the vision distribution, MindMapper is proposed to be used in conjunction with VisualDenoiser. VisualDenoiser is introduced to further reduce the gap between the outputs of MindMapper and the CLIP Image Embeddings, and it provides additional information to enhance the projection process. Our model shows a significant increase of +12.5% and +11.5% in EfficientNet and SwAV respectively, while the remaining metrics, such as PixCorr, AlexNet-2, AlexNet-5, InceptionV3, and CLIP, show a slight improvement of up to 3%.
Chapter 1
Introduction
In this chapter, we systematically introduce the focal theme, offering a comprehensive overview that delves into the intrinsic importance of this subject in the realm of research. Additionally, we meticulously define the problem. Following this, we articulate our motivation and establish the objectives of this thesis. To conclude, we provide the contributions that this research makes to the existing body of knowledge.
1.1 Overview
1.1.1 Practical Context
Have you ever wondered if scientists could peek inside your brain and decode the images you're thinking about, the scenes in your dreams, or memories from your past? Right now, they can do this by measuring brain activity signals. There are many methods to capture signals of brain activity, like fMRI (functional Magnetic Resonance Imaging) and EEG (Electroencephalography), but those signals are like a meaningless secret code that cannot directly reveal the specific content of the images represented by those data. Therefore, to understand what kind of visual content those signal data represent in the brain, a new challenge has emerged: image reconstruction from human brain activity. fMRI data in particular is used because of its efficiency and its ability to preserve rich information. Image reconstruction from functional Magnetic Resonance Imaging (fMRI) data has many applications in real life.
• Memories Restoration:
Image reconstruction from human brain signals, especially leveraging fMRI, promises revolutionary applications in memory restoration. By decoding the neural patterns associated with memories, this technology holds the potential to restore and visualize past experiences. The contribution lies in offering a direct window into the neural underpinnings of memory, paving the way for therapeutic interventions and cognitive rehabilitation for individuals with memory-related disorders.
• Dream Visualization:
The application of image reconstruction from human brain signals extends into the fascinating realm of dream visualization. Using fMRI, researchers can decode brain activity during dreaming states, translating these neural patterns into visual representations of dreams. This not only enriches our understanding of the dreaming process but also holds potential applications in studying and addressing sleep disorders.
• Biomedical Research Advancements:
In the domain of biomedical research, the fusion of image reconstruction with fMRI data opens avenues for transformative advancements. By accurately capturing and reconstructing neural responses to visual stimuli, researchers gain unprecedented insights into the intricate neural mechanisms underpinning cognitive processes. This technology contributes to advancing our understanding of neurological disorders and cognitive functions, and it facilitates the development of targeted interventions and treatments.
• Creativity in Art and Design:
The synergy between image reconstruction from human brain signals and creative endeavors in art and design unlocks innovative possibilities. By utilizing fMRI data, artists and designers can tap directly into mental imagery, fostering a new era of creative expression. Beyond the artistic realm, this technology serves as a means for individuals with limited traditional forms of expression to communicate their creative visions, particularly benefiting those with physical disabilities.
• Smart Interaction:
Image reconstruction from human brain signals, when integrated into Brain-Computer Interfaces (BCIs), propels smart interaction to new heights. By decoding intentions and visualizations directly from the brain using fMRI, this technology enhances the efficiency and intuitiveness of human-computer interaction. It has the potential to redefine communication for individuals with motor impairments, providing a direct link between neural activity and external devices.
In summary, image reconstruction from human brain signals, with a focus on fMRI, is a frontier in neuroscience with far-reaching applications. From restoring memories to envisioning dreams, advancing biomedical research, fostering creativity, and enabling smart interactions, the contributions of this technology are profound. As the field continues to evolve, the potential for real-world impact expands, positioning image reconstruction from human brain signals as a key player at the intersection of neuroscience and advanced imaging technologies.
1.1.2 Problem Statement
Image reconstruction from human brain activity is a subtask of human visual decoding. It aims to reconstruct the image from recorded brain activity data, typically obtained through neuroimaging techniques such as functional magnetic resonance imaging (fMRI), electroencephalography (EEG), and magnetoencephalography (MEG).

Figure 1.1.1: Example of the image reconstruction process from fMRI data. fMRI data are acquired from an fMRI scanner. We process the fMRI reconstruction stage to reconstruct the original image from the fMRI data.

In this thesis, we utilize fMRI to reconstruct visual stimuli. Figure 1.1.1 illustrates an example of image reconstruction from human brain activity.
• Input: 3D brain fMRI data acquired from an fMRI scanner while a subject observes a given image.
• Output: A corresponding high-resolution reconstructed image of what the subject observed.
1.1.3 Challenges
Besides common challenges of computer vision tasks, Image Reconstruction from fMRI data has its own
difficulties.
• The training dataset is one of the main challenges we encounter in this work. With a small number of sample instances, this poses a significant limitation to constructing a model that yields satisfactory image reconstruction results.
• Limited resolution. Current neuroimaging techniques, such as functional magnetic resonance imaging (fMRI) and electroencephalography (EEG), have limitations in both spatial and temporal resolution. This makes it challenging to reconstruct detailed and semantically reasonable images of brain activity.
• Data preprocessing is also a significant challenge, because fMRI data is raw 3D data with an extremely large number of dimensions. Feature extraction methods used by previous models to generate fMRI data after preprocessing lack information or overlook important details. Extracting information from 3D fMRI data and then flattening it into 1D vectors, which are subsequently used as input to the reconstruction model, can lead to the loss of spatial information in brain activity patterns.
• Requires high computational resources. Since we are constructing a complex model for a small amount of data, it necessitates numerous modules to incorporate additional information. Therefore, these techniques demand substantial computational resources.
• The essence of image reconstruction models is that they are generative models, and the main task of the image reconstruction problem is to transform fMRI data into the type of data accepted by the generative model. Therefore, the key factor is to build a good mapping model from fMRI data to the data type of the generative model. However, the limited number of samples in the dataset and the diversity and complexity in the representation of patterns of the generative model's data type make it very challenging to construct a mapping model for this problem.
• Lack of evaluation metrics. The absence of a standardized evaluation procedure for assessing reconstruction quality in terms of realism, naturalness, and logical coherence compared to the original image and to other methods makes evaluation extremely difficult. Evaluating these images relies on human perception, and there is no perfect metric that matches human judgment.
1.2 Motivation
The quest to solve the challenge of image reconstruction from human brain signals is driven by its tremendous potential to decode the intricacies of the nervous system and gain insights into the information encoded within the human brain. Successfully addressing this problem could have far-reaching implications, unlocking a myriad of applications. For instance, it could pave the way for memory restoration, offering hope to individuals with memory-related conditions, and advance biomedical research by aiding in the understanding and diagnosis of cognitive disorders and neurological conditions. Additionally, it could spark innovations in the fields of art and design. However, despite its transformative potential, the exploration of this problem is currently limited. That is one of the motivations why we study this problem.
Additionally, the development of models to address this problem is still wide open. Many earlier studies on image reconstruction focused on either the Generic Object Decoding or the Deep Image Reconstruction datasets curated by the Kamitani Lab. These datasets comprise 1200 training and 50 testing images sourced from ImageNet. Pioneering works in this field include those by Shen et al., Beliy et al., and Gaziv et al., who utilized basic CNN and DNN models to examine the image generation process. However, these generated images primarily conveyed low-level features such as layout, sharpness, and contours, resulting in blurry and indistinct images.
More recently, Allen et al. introduced another dataset for visual encoding and decoding studies named the Natural Scenes Dataset (NSD). With its increased number, diversity, and complexity of images, the NSD dataset has become the predominant benchmark for fMRI-based image reconstruction. Works by Gu et al., Ozcelik et al., and Takagi et al. also employed the NSD dataset in conjunction with deep generative models like GANs and Diffusion models to reconstruct images with high semantic and high-level detail aspects. Nevertheless, reconstructing scenes with multiple objects and complex semantic descriptions, such as COCO images from the NSD dataset, remains a challenge.
Given the notable recent success of latent diffusion models in generative AI applications, an image reconstruction model could also benefit from the advantages offered by these models.
1.3 Objectives
In this research, we aim to propose a novel image reconstruction model that builds upon the sensible and effective architecture of an existing baseline model, incorporating novel mapping modules to enhance its performance and avoid information loss during data mapping. The proposed model is designed to mitigate information loss, preventing inaccuracies in the semantics and structure of the reconstructed image compared to the original.
1.4 Contributions
Our main contributions in this work are listed in the following points:
• Investigating the overview of related research, we conducted a survey on the current research regarding image reconstruction using different datasets and methods. Subsequently, we identified key components that can be leveraged to design a framework for our image reconstruction from fMRI data.
• Our work focuses on improving the two main branches of the task, namely the Text Branch and the Vision Branch, to support the image generation process.
— In the Text Branch, we propose using a single enhanced deep network model called MindMapper, capable of mapping fMRI data to corresponding text embeddings to provide information about object categories and relationships in the image.
— In the Vision Branch, we suggest using the MindMapper model that we proposed, combined with the pretrained DALL-E 2 prior model, to assist in supplementing information for the details of the image.
Our proposed model can reconstruct high-resolution and highly naturalistic images effectively and efficiently. Additionally, our model amplifies object class information and proficiently extracts scene details from fMRI data.
1.5 Dissertation Structure
Our thesis consists of 6 chapters:
• Chapter 1: Introduction. This chapter presents an overview of the image reconstruction from human brain activity problem, including the research motivation, definition, challenges, and our main contributions.
• Chapter 2: Background. This chapter gives an introduction to the fundamental knowledge that plays a vital role in our thesis.
• Chapter 3: Related Work. Prior research related to our thesis is presented in this chapter.
• Chapter 4: Mapping Denoising Image Reconstruction (MDIR). This chapter delves into our proposed MDIR model.
• Chapter 5: Experiments. This chapter contains the evaluation metrics used to evaluate our results, our experimental settings, as well as the results we attained.
• Chapter 6: Conclusion. This chapter sums up our thesis and our contributions, and some future work is mentioned.
Chapter 2
Background
Prior to discussing our methodology, we present the intuition behind the components used in our approach. These include robust and powerful models such as the Variational Autoencoder (VAE) and the Contrastive Language-Image Pretraining (CLIP) model.
2.1 Vision Language Models
Many deep neural network (DNN) training processes for visual recognition heavily depend on crowd-labeled data, often involving the training of a separate DNN for each specific visual recognition task. This conventional approach results in a laborious and time-consuming visual recognition paradigm. To tackle these challenges, there has been considerable recent research into Vision-Language Models (VLMs). These models leverage large-scale image-text pairs from the internet to learn rich vision-language correlations, offering the potential for zero-shot predictions across diverse visual recognition tasks using a single VLM.
2.1.1 Vision-Language Model Foundations
A Vision-Language Model (VLM) undergoes pre-training using extensive image-text pairs readily available on the internet. The pre-trained VLM can then be directly employed for various downstream visual recognition tasks without the need for fine-tuning, as illustrated in Figure 2.1.1. The pre-training of the VLM is guided by specific vision-language objectives, facilitating the learning of image-text correspondences from the vast image-text pairs dataset. This process employs a contrastive objective, wherein the VLM learns by bringing paired images and texts close together while pushing others farther apart in the embedding space. Consequently, the pre-trained VLMs acquire comprehensive knowledge of vision-language correspondences, allowing them to make zero-shot predictions by matching the embeddings of any given images and texts. This novel learning paradigm makes effective use of web data and enables zero-shot predictions without the need
for task-specific fine-tuning. It is a straightforward implementation that delivers impressive results, as demonstrated by the superior zero-shot performance achieved by the pre-trained CLIP model on 36 visual recognition tasks, ranging from classic image classification to human action and optical character recognition.

Figure 2.1.1: The new learning paradigm with VLMs enables effective usage of web data and zero-shot predictions without task-specific fine-tuning.
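To make the zero-shot prediction mechanism concrete, the following sketch classifies an image with a pre-trained CLIP model by comparing its embedding against prompt embeddings built from nothing but class names. This is an illustrative example rather than code from this thesis; the image file, label set, and the ViT-B/32 checkpoint are placeholder assumptions.

import torch
import clip  # OpenAI's CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # pretrained CLIP

# Build a zero-shot "classifier" purely from class names: each class becomes a
# text prompt embedded with the text encoder.
class_names = ["dog", "cat", "car"]  # hypothetical label set
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image

with torch.no_grad():
    image_emb = model.encode_image(image)   # (1, 512) image embedding
    text_emb = model.encode_text(prompts)   # (3, 512) one embedding per class
    # Cosine similarity in the shared embedding space decides the prediction.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_emb @ text_emb.T  # temperature-scaled similarities
    probs = logits.softmax(dim=-1)

print(class_names[probs.argmax().item()])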
The primary goal of Vision-Language Model (VLM) pre-training is to impart the model with the ability to understand and correlate images and texts effectively, ultimately enabling proficient zero-shot predictions in visual recognition tasks. The process involves leveraging image-text pairs to train the VLM through specific pre-training objectives. To achieve this, the VLM employs both a text encoder and an image encoder on the given image-text pairs. These encoders extract features from the images and texts, respectively. The model then learns the correlation between vision and language using predefined pre-training objectives. The resulting vision-language correlation is crucial for subsequent zero-shot evaluations on new, unseen data, accomplished by comparing the embeddings of any provided images and texts. The VLM pre-training mechanism utilizes a deep neural network operating on N image-text pairs within a designated pre-training dataset. This neural network consists of an image encoder and a text encoder, responsible for encoding the features of the image and text in an image-text pair, producing an image embedding and a text embedding, respectively. The subsequent section delves into the architectural details of widely adopted deep neural networks in the context of VLM pre-training.
To be specific, the architecture widely utilized for learning image features is the Transformer-based architecture. The Vision Transformer utilizes a series of Transformer blocks, each composed of a multi-head self-attention layer and a feed-forward network. In the architecture depicted in Figure 2.1.2, the input image undergoes an initial step of being divided into fixed-size patches. These patches are then processed through linear projection and position embedding before being fed into the Transformer encoder. This allows the model to capture complex relationships within the image. In the context of Vision-Language Model (VLM) studies, subtle adjustments are made, often including the addition of an extra normalization layer preceding the Transformer encoder. This modified architecture, building upon the Transformer framework, is designed to efficiently extract features from images, making it well-suited for tasks involving both vision and language understanding in VLM pre-training.
The utilization of the Transformer and its variations is also widespread in the realm of learning text features.

Figure 2.1.2: Architecture of the Vision Transformer.

The standard Transformer follows an encoder-decoder structure: the encoder consists of 6 blocks, each incorporating a multi-head self-attention layer and a multi-layer perceptron (MLP). The decoder mirrors this structure, comprising 6 blocks, each hosting 3 sub-layers: a multi-head attention layer, a masked multi-head attention layer, and an MLP. Numerous Vision-Language Model (VLM) studies, such as CLIP [58], adhere to the standard Transformer with minor adjustments, akin to those found in GPT-2. These studies typically involve training from scratch, foregoing the use of pre-trained GPT-2 weights for initialization.
To learn rich vision-language correlations, VLMs utilize contrastive learning. More specifically, contrastive learning allows VLMs to learn discriminative representations by pulling paired samples close together and pushing others far apart in the feature space, which can be defined as follows:

$$\mathcal{L}_{I \rightarrow T} = -\sum_{i \in B} \frac{1}{|P(i)|} \sum_{k \in P(i)} \log \frac{\exp\left(z_i^{I} \cdot z_k^{T} / \tau\right)}{\sum_{j \in B} \exp\left(z_i^{I} \cdot z_j^{T} / \tau\right)} \qquad (2.1.1)$$

$$\mathcal{L}_{T \rightarrow I} = -\sum_{i \in B} \frac{1}{|P(i)|} \sum_{k \in P(i)} \log \frac{\exp\left(z_i^{T} \cdot z_k^{I} / \tau\right)}{\sum_{j \in B} \exp\left(z_i^{T} \cdot z_j^{I} / \tau\right)} \qquad (2.1.2)$$

where $k \in P(i) = \{k \mid k \in B,\ y_k = y_i\}$, $y$ is the category label of the pair $(x^I, x^T)$, $z^I$ and $z^T$ are the image and text embeddings, $B$ is the batch, and $\tau$ is a temperature parameter. With Eqs. 2.1.1 and 2.1.2, the image-text label InfoNCE loss is defined as $\mathcal{L}^{IT}_{\text{InfoNCE}} = \mathcal{L}_{I \rightarrow T} + \mathcal{L}_{T \rightarrow I}$.
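As an illustration of the contrastive objective, the following sketch implements a simple symmetric image-text InfoNCE loss over a batch of paired embeddings. It is a minimal example assuming one matching caption per image (the special case where P(i) = {i}); the batch size, embedding dimension, and temperature are arbitrary placeholder values.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss for paired image/text embeddings.

    image_emb, text_emb: (B, D) tensors where row i of each tensor comes from
    the same image-text pair, i.e. the positives sit on the diagonal.
    """
    # Normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)     # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random tensors standing in for encoder outputs.
img = torch.randn(8, 512)   # placeholder image embeddings
txt = torch.randn(8, 512)   # placeholder text embeddings
print(clip_contrastive_loss(img, txt).item())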
Figure 2.1.3: Summary of the CLIP approach. (1) Contrastive pre-training. (2) Creating a dataset classifier from label text.
2.1.2 Contrastive Language-Image Pretraining Model
CLIP, which stands for Contrastive Language-Image Pre-training, represents a multimodal learning framework crafted by OpenAI. Its training involves acquiring visual concepts through natural language guidance and establishing a connection between textual and visual information. The model undergoes joint training on an extensive dataset comprising images and their associated textual descriptions, akin to the zero-shot capabilities observed in GPT-2 and GPT-3.
As a Vision-Language model, CLIP is a deep neural network that contains two encoders: an image encoder and a text encoder. The dual encoders generate embeddings within a common vector space. This shared embedding space enables CLIP to analyze and understand the connections between text and image representations, facilitating the learning of their inherent relationships. CLIP provides various pre-trained models based on different vision backbones, such as ResNet and the Vision Transformer (ViT).
CLIP undergoes pre-training using an extensive dataset comprising 400 million pairs of image and text data sourced from the internet. Throughout this pre-training process, the model encounters pairs consisting of images and corresponding text captions. Among these pairs, some accurately align (where the caption precisely describes the image), while others present mismatches. This diverse dataset facilitates the creation of shared latent space embeddings as CLIP learns to capture the intricate relationships between visual and textual elements. This pivotal characteristic positions CLIP as a breakthrough in Vision-Language models, enabling its application across a spectrum of tasks and achieving state-of-the-art performance.
2.2 Diffusion Models
Diffusion Models (DMs), also referred to as Denoising Diffusion Probabilistic Models (DDPMs), are generative models that have been gaining popularity in the field of deep learning; their latent-space variant was introduced by Robin Rombach et al. in 2021. Diffusion models capture a gradual noising process applied to the data and learn the dynamics needed to reverse it. This approach allows for state-of-the-art synthesis results on image data and beyond, while also providing a guiding mechanism to control the image generation process without retraining.
2.2.1 Intuition Of Diffusion Models
Researchers previously concentrated on Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and autoregressive models for AI generation. An innovative generative model was proposed by Ho et al. in 2020, known as the Denoising Diffusion Probabilistic Model (DDPM). Diffusion models are fundamentally different from all the previous generative methods, as illustrated in Figure 2.2.1. Intuitively, they aim to decompose the image generation process (sampling) into many small "denoising" steps. With their ability to generate high-quality, high-fidelity images, diffusion models have risen to become state-of-the-art in generative AI, drawing a great deal of attention in research and changing the game. Diffusion models recently achieved state-of-the-art results for most image tasks, including text-to-image generation with DALL-E, but also many other image-generation-related tasks, like image inpainting, style transfer, and image super-resolution.
Figure 2.2.1: Forward diffusion and reverse diffusion process.
The diffusion process involves the incremental addition of Gaussian noise to an input image $x_0$, accomplished through a sequence of $T$ steps. Referred to as the forward process, it is distinct from the forward pass of a neural network. This initial stage is pivotal in creating the targets for subsequent neural network training (the image after applying $t < T$ noise steps). Afterward, a neural network (known as a noise predictor) is trained to recover the original data by reversing the noising process. By being able to model the reverse process, we can generate new data.
Diffusion models can be seen as latent variable models. Latent means that we are referring to a hidden continuous feature space; in this way, they may look similar to variational autoencoders (VAEs).
In practice, they are formulated using a Markov chain of $T$ steps. Here, a Markov chain means that each step only depends on the previous one, which is a mild assumption. Importantly, we are not constrained to using a specific type of neural network, unlike flow-based models.
Given a data point $x_0$ sampled from the real data distribution $q(x)$ ($x_0 \sim q(x)$), one can define a forward diffusion process by adding noise. Specifically, at each step of the Markov chain we add Gaussian noise with variance $\beta_t$ to $x_{t-1}$, producing a new latent variable $x_t$ with distribution $q(x_t \mid x_{t-1})$. This diffusion process can be formulated as follows:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \mu_t = \sqrt{1-\beta_t}\, x_{t-1},\ \Sigma_t = \beta_t \mathbf{I}\right)$$

Since we are in the multi-dimensional scenario, $\mathbf{I}$ is the identity matrix, indicating that each dimension has the same variance $\beta_t$. Note that $q(x_t \mid x_{t-1})$ is still a normal distribution, defined by the mean $\mu_t = \sqrt{1-\beta_t}\, x_{t-1}$ and the variance $\Sigma_t = \beta_t \mathbf{I}$, where $\Sigma_t$ will always be a diagonal matrix of variances (here $\beta_t$).

Thus, we can go in closed form from the input data $x_0$ to $x_T$ in a tractable way. Mathematically, this is the posterior probability and is defined as:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$
Variance Schedule:
The variance parameter $\beta_t$ can be fixed to a constant or chosen as a schedule over the $T$ timesteps. In fact, one can define a variance schedule, which can be linear, quadratic, cosine, etc. The original DDPM authors utilized a linear schedule increasing from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$. In a later paper, it was shown that employing a cosine schedule works even better.
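The forward process and the variance schedule are straightforward to express in code. The sketch below is a minimal illustration, not the exact implementation used later in this thesis: it builds a linear beta schedule and draws $x_t$ directly from $q(x_t \mid x_0)$ in closed form using the cumulative products $\bar{\alpha}_t = \prod_s (1-\beta_s)$; the number of timesteps and the image shape are placeholder choices.

import torch

T = 1000                                   # number of diffusion steps (placeholder)
betas = torch.linspace(1e-4, 0.02, T)      # linear variance schedule from the DDPM paper
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)   # \bar{alpha}_t

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    abar_t = alphas_cumprod.to(x0.device)[t].view(-1, 1, 1, 1)  # broadcast over image dims
    return abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * noise

# Toy usage: noise a batch of random "images" at random timesteps.
x0 = torch.randn(4, 3, 64, 64)             # placeholder images
t = torch.randint(0, T, (4,))
eps = torch.randn_like(x0)
xt = q_sample(x0, t, eps)
print(xt.shape)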
Reverse Diffusion Process:
As $T \to \infty$, the latent $x_T$ tends towards an approximately isotropic Gaussian distribution. Therefore, by successfully learning the inverse distribution $q(x_{t-1} \mid x_t)$, we can sample $x_T$ from $\mathcal{N}(0, \mathbf{I})$, proceed through the reverse process, and obtain a sample from $q(x_0)$, creating a new data point from the original data distribution.

In practical terms, we lack precise knowledge of $q(x_{t-1} \mid x_t)$. This is due to its complexity, as estimating $q(x_{t-1} \mid x_t)$ involves computations over the entire data distribution.

To address this, we opt to approximate $q(x_{t-1} \mid x_t)$ with a parameterized model $p_\theta$, such as a neural network. As $q(x_{t-1} \mid x_t)$ is also expected to be Gaussian, especially for sufficiently small $\beta_t$, we can design $p_\theta$ as Gaussian, focusing on defining its mean and variance:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

If we apply the reverse formula for all timesteps ($p_\theta(x_{0:T})$, also called the trajectory), we can go from $x_T$ back to the data distribution:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$
By additionally conditioning the model on the timestep $t$, it will learn to predict the Gaussian parameters for each timestep.

Figure 2.2.2: Training and sampling algorithms of DDPMs.
To do that, we need to train a model by optimizing the negative log-likelihood of the training data. The simplified version of the objective is:

$$L_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon,\, t}\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\right)\right\|^2\right]$$
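The two algorithms summarized in Figure 2.2.2 correspond to a training step that regresses the injected noise and a sampling loop that walks the reverse chain. The sketch below is a hedged illustration of both, reusing the schedule tensors and q_sample from the previous snippet; model stands for any noise-prediction network (for example a U-Net) that is assumed to take (x_t, t) and return a tensor with the same shape as x_t.

import torch
import torch.nn.functional as F

def training_step(model, x0):
    """One DDPM training step: minimize || eps - eps_theta(x_t, t) ||^2."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    xt = q_sample(x0, t, eps)                 # closed-form forward diffusion
    eps_pred = model(xt, t)                   # predict the injected noise
    return F.mse_loss(eps_pred, eps)

@torch.no_grad()
def sample(model, shape):
    """Ancestral sampling: start from pure noise and denoise for T steps."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps_pred = model(x, torch.full((shape[0],), t, dtype=torch.long))
        coef = (1.0 - alphas[t]) / (1.0 - alphas_cumprod[t]).sqrt()
        x = (x - coef * eps_pred) / alphas[t].sqrt() + betas[t].sqrt() * z
    return x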
In order for the model's input and output to be the same size, a U-Net was employed. A U-Net is a symmetric architecture (Figure 2.2.3) that uses skip connections between encoder and decoder blocks of corresponding feature dimensions, with input and output having the same spatial size. Typically, the input image is down- and up-sampled to recover its original size. Additionally, U-Nets help generate high-quality images at high resolution. A pipeline of multiple diffusion models can be used at increasing resolutions. Noise conditioning augmentation between pipeline models is significant to the final image quality; it applies strong data augmentation to the conditioning input $z$ of each super-resolution model $p_\theta(x \mid z)$. The conditioning noise helps reduce compounding errors in the pipeline setup.
Subsequently, a neural network undergoes training to reverse the noising process, aiming to recover the original data. This reverse diffusion process, often termed the sampling process of a generative model, holds the key to generating new data. This thesis delineates the mechanisms underlying both the forward and reverse diffusion processes, emphasizing their significance in the realm of generative models.
Guided Diffusion:
A vital aspect of image generation is conditioning the sampling process to manipulate the generated
sam-ples To further "guide" the generation, techniques have even been developed that incorporate image
embed-dings into the diffusion Mathematically, the guidance refers to conditioning a prior data distribution p(x) with
a condition y (the class label or an image/text embedding, resulting in p(z |) To turn a diffusion model pg into
a conditional diffusion model, we can add conditioning information y at each diffusion step
T
po(ror | U) = po(@r) [1-1 Po(we-1 | te, 9)
Overall, guided diffusion models aim to learn V log po (x; | y) we could formulate it into a formulation:
V log p(t | y) = Vlog pe(xt) + s - V log(pa(w | x2)
Figure 2.2.3: The U-Net architecture.

To do that, we need to refer to a technique called classifier guidance. Sohl-Dickstein et al. illustrated that we can use a second model $f_\phi(y \mid x_t, t)$ to guide the diffusion toward the target class $y$ during training. A classifier $f_\phi(y \mid x_t, t)$ is trained on the noisy image $x_t$ to predict its class $y$. Then we can use its gradient to guide the diffusion.
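A minimal sketch of classifier guidance is shown below: at each reverse step, the gradient of a noisy-image classifier's log-probability with respect to x_t is added to the predicted mean, scaled by a guidance weight. This is illustrative only; classifier is an assumed model trained on noisy images, guidance_scale is a placeholder hyperparameter, and the reverse-step mean and variance are assumed to come from the sampler.

import torch
import torch.nn.functional as F

def classifier_grad(classifier, x_t, t, y, guidance_scale=1.0):
    """Gradient of log p_phi(y | x_t, t) w.r.t. x_t, used to steer sampling."""
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        logits = classifier(x_in, t)                        # classifier applied to the noisy input
        log_prob = F.log_softmax(logits, dim=-1)
        selected = log_prob[torch.arange(len(y)), y].sum()  # log p(y | x_t) for the target classes
        grad = torch.autograd.grad(selected, x_in)[0]
    return guidance_scale * grad

@torch.no_grad()
def guided_step(mean, variance, grad):
    """Shift the reverse-step mean in the direction that increases p(y | x_t)."""
    return mean + variance * grad + variance.sqrt() * torch.randn_like(mean)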
The most common model used to guide a diffusion model to generate images as we expect is CLIP (Contrastive Language-Image Pretraining). CLIP, a product of OpenAI's groundbreaking research, is a neural network architecture trained through contrastive learning on a vast dataset encompassing diverse images and their associated textual descriptions. This unique training approach empowers CLIP to understand intricate associations between images and text, effectively learning joint representations that capture the relationships between the two modalities.
By leveraging the joint image-text representations learned by CLIP, the diffusion model attains a more robust foundation for understanding correlations between images and their corresponding textual contexts. The model gains a nuanced comprehension of the complex interplay between visual and textual elements, which significantly contributes to its ability to produce contextually accurate and coherent outputs.
This joint comprehension facilitates the diffusion model's capacity to generate more contextually precise and cohesive responses, particularly when the task involves producing content that encompasses both images and text. It enables the model to synthesize responses that reflect a deeper understanding of the relationship between visual and textual elements, resulting in more accurate, relevant, and meaningful outputs.
The fusion of CLIP's capabilities within the diffusion model not only expands the model's horizons by imbuing it with the ability to comprehend and relate visual and textual information, but also equips it with the potential to generate more sophisticated and contextually relevant image-based outputs. This integration ultimately holds promise for applications requiring the synthesis of images in response to textual prompts, presenting a new frontier in AI-driven content generation and comprehension.
2.2.2 Latent Diffusion Models
Although diffusion models achieved impressive results, one of their limitations is that they work sequentially on the whole image, meaning that both the training and inference times are expensive. The image space is enormous: a 512x512 image with three color channels (red, green, and blue) is a 786,432-dimensional space, which is a great many values for one image. That is the reason it takes hundreds of GPUs to train such a model and why we have to wait several minutes to get our results, restricting the playground to only the biggest companies like Google or OpenAI.
To overcome this limitation, Rombach et al. transformed diffusion models into Latent Diffusion Models. This means that they implemented the diffusion approach within a compressed image representation instead of the image itself and then worked to reconstruct the image. So they are no longer working with the pixel space, or regular images. Working in such a compressed space not only allows for more efficient and faster generation, as the data size is much smaller, but also allows for working with different modalities. Since the inputs are encoded, you can feed in any kind of input, like images or texts, and the model will learn to encode these inputs in the same sub-space that the diffusion model will use to generate an image.
The Latent Diffusion Model (LDM) is an innovative approach that employs an initial image, denoted as $X$, which is then encoded into a condensed, information-rich space known as the latent space $Z$. This encoding process resembles that of a Generative Adversarial Network (GAN). It involves an encoder model that extracts essential information from the image, akin to downsampling, reducing its size while preserving as much pertinent information as possible. This results in a condensed representation of the input within the latent space.
Following this, the model incorporates additional conditioning inputs, such as text or other images, merging them with the condensed image representation using an attention mechanism. This attention mechanism, a feature of transformer models, learns the optimal way to combine these inputs within the latent space. The merged inputs then serve as the initial noise for the subsequent diffusion process.
The diffusion process within the latent space begins with an initial random noise signal. This signal undergoes a sequence of steps, gradually refining and transforming within the latent space. With each iteration, the noise evolves into more organized and structured representations. Throughout this iterative process, the latent space encapsulates increasingly meaningful features and patterns, culminating in the generation of high-quality, diverse data samples. More specifically, the key idea of LDMs is to separate the training into two phases: perceptual image compression and latent diffusion.
Perceptual Image Compression:
In the initial phase, Rombach et al. utilize an autoencoder training method that combines a perceptual loss and a patch-based adversarial objective. This approach is designed to ensure reconstructed images adhere closely to the image manifold, fostering local realism and reducing the blurriness that often arises when relying solely on pixel-space losses such as L2 or L1 objectives. Starting with an image $x$ in RGB space, the encoder transforms $x$ into a latent representation $z$, followed by the decoder reconstructing the image based on this derived latent representation.
Specifically, in the input layer, images are encoded by an AutoKL Encoder. This process converts a high-dimensional RGB image into a condensed two-dimensional representation, enabling the Diffusion Probabilistic Model (DPM) to operate within this structured latent space where intricate, high-frequency details are abstracted. This more compressed latent space is better suited for likelihood-based generative models, allowing them to focus on the essential, meaningful aspects of the data and train in a computationally efficient, lower-dimensional space.
Additionally, to prevent a high degree of variability in the latent space, Rombach et al. experiment with two forms of regularization: KL-reg, resembling a Variational Autoencoder (VAE), and VQ-reg. Consequently, the forward chain within a Diffusion Probabilistic Model is considered during autoencoder training, enabling the encoder to estimate the distribution of the learned latent space during the learning process.
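As a concrete illustration of the compression stage, the sketch below encodes an image into the latent space with a pretrained KL-regularized autoencoder from the Hugging Face diffusers library and decodes it back. This is an assumed setup shown for demonstration only, not the exact component used by the methods in this thesis; the checkpoint name, scaling factor, and image path are placeholders.

import torch
from PIL import Image
from torchvision import transforms
from diffusers import AutoencoderKL

# Pretrained KL-regularized autoencoder (the "AutoKL" encoder/decoder used by Stable Diffusion).
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

to_tensor = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),                  # values in [0, 1]
])
x = to_tensor(Image.open("example.jpg").convert("RGB")).unsqueeze(0) * 2 - 1  # scale to [-1, 1]

with torch.no_grad():
    posterior = vae.encode(x).latent_dist   # approximate posterior q(z | x)
    z = posterior.sample() * 0.18215        # latent, with Stable Diffusion's conventional scaling
    x_rec = vae.decode(z / 0.18215).sample  # reconstruction back in pixel space

print(x.shape, z.shape, x_rec.shape)        # e.g. (1,3,512,512) -> (1,4,64,64) -> (1,3,512,512)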
Latent Diffusion Models:
Once the autoencoder is trained, the Diffusion Probabilistic Model (DPM) advances through its forward chain in the subsequent phase by directly sampling $z$ from the encoder. To facilitate the reverse process, Rombach et al. implement a time-conditional U-Net. Moreover, they harness the DPM's capacity to model conditional distributions, transforming their proposed Latent Diffusion Model (LDM) into a more adaptable conditional image generator. This is achieved by enhancing the foundational U-Net with a cross-attention mechanism, which is particularly effective for attention-based models learning from diverse input modalities.
Rombach et al. introduce an additional encoder, functioning as a transformer. This encoder is responsible for converting input from a different modality, typically in the form of textual prompts, into an intermediary representation. Subsequently, this representation is aligned with the layers of the U-Net using multi-head cross-attention. Throughout this second stage, both the time-conditional U-Net and the encoder are concurrently optimized. The proposed process is visually depicted in Figure 2.2.4. This integration of language representation and visual synthesis culminates in a robust model that demonstrates remarkable adaptability to complex, user-defined text prompts.
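To illustrate how the conditioning enters the U-Net, the following sketch implements a single cross-attention block in which the spatial latent features act as queries and the text-encoder tokens act as keys and values. It is a simplified, single-head version written for clarity; the dimensions are placeholder choices, and real LDM implementations use multi-head attention with additional normalization and projection layers.

import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: image latents attend to conditioning tokens."""
    def __init__(self, latent_dim: int, context_dim: int, inner_dim: int = 256):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, inner_dim, bias=False)   # queries from latents
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)  # keys from text tokens
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)  # values from text tokens
        self.to_out = nn.Linear(inner_dim, latent_dim)
        self.scale = inner_dim ** -0.5

    def forward(self, latents: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # latents: (B, N_pixels, latent_dim), context: (B, N_tokens, context_dim)
        q, k, v = self.to_q(latents), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, N_pixels, N_tokens)
        return self.to_out(attn @ v)                                        # project back to latent_dim

# Toy usage: a 64x64 latent grid flattened to 4096 tokens, conditioned on 77 text tokens.
block = CrossAttention(latent_dim=320, context_dim=768)
out = block(torch.randn(1, 4096, 320), torch.randn(1, 77, 768))
print(out.shape)  # torch.Size([1, 4096, 320])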
Essentially, the model navigates and explores the latent space, progressively refining the initial noise into coherent data representations. Each step in the diffusion process enhances the quality and richness of the generated data by refining the latent representations.
The LDM fundamentally utilizes diffusion processes to transition from randomness to structured, realistic data. It achieves this by iteratively refining the latent representations, thereby transforming noise into meaningful data. This progressive refinement constitutes the foundation of the Latent Diffusion Model, enabling the generation of diverse and high-fidelity data across various domains, including images, text, and more.
Additionally, the Latent Diffusion Model stands out for its efficiency, as the latent space is significantly smaller than the original pixel space. Because of the significantly reduced data size, working in such a compressed space not only enables working with diverse modalities but also makes generation more efficient and faster. You can give it any type of input, such as texts or images, because the inputs are encoded; the model learns to encode these inputs in the same sub-space that the diffusion model uses to generate images. One model can then use text or visuals to direct generation, much like the CLIP model.
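Putting the pieces together, the sketch below runs a pretrained latent diffusion model end to end through the Hugging Face diffusers library: the text encoder produces the conditioning, the U-Net denoises in latent space, and the autoencoder decodes the result to pixels. It is an illustrative example of the general LDM workflow rather than the reconstruction pipeline proposed in this thesis; the checkpoint name, prompt, and sampling settings are placeholder assumptions.

import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained latent diffusion (Stable Diffusion) pipeline.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# The prompt is encoded by the CLIP text encoder, the U-Net iteratively denoises a
# random latent under that conditioning, and the VAE decoder maps it back to pixels.
image = pipe(
    "a vase of red roses on a wooden table",   # placeholder text condition
    num_inference_steps=50,                    # number of denoising steps
    guidance_scale=7.5,                        # classifier-free guidance strength
).images[0]

image.save("ldm_sample.png")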
Trang 29Related Work
In this chapter, we delve into the fascinating topic of visual stimulus reconstruction by exploring human brain activity and its common representations, which can be used for visual stimulus reconstruction. Alongside an analysis of the advantages and disadvantages of these representations, we present a rich literature review on human brain signal-based image reconstruction. This includes an exploration of Machine Learning-based methods, VAE-based methods, GAN-based methods, and Diffusion-based methods. Through this journey, we aim not only to provide insights into the methodologies but also to ignite curiosity about the profound implications of understanding and reconstructing visual stimuli from the human brain.
3.1 Human Brain Activity And Its Representations
For a long time, scientists have desired to research how the human brain reacts to the world. This fascination has led to a profound exploration of human brain activity and its representations, delving into the intricate mechanisms that govern cognitive processes and shape our perceptions of the surrounding environment. Understanding the complex interplay between neural signals, cognitive functions, and the formation of mental representations holds the key to unlocking the mysteries of consciousness and paving the way for advancements in fields such as neuroscience, psychology, and artificial intelligence.
3.1.1 Human Brain Activity
Human brain activity refers to the intricate and dynamic electrochemical processes occurring within the brain, orchestrating the vast network of neurons that enables cognition, perception, and various bodily functions. At its core, brain activity involves the generation and transmission of electrical impulses between neurons, forming complex neural circuits. This intricate dance of neurotransmitters and electrical signals allows the brain to process information, store memories, and regulate bodily functions. Neurotransmitters act as messengers, facilitating communication between neurons by crossing synapses. Brain activity is not confined to conscious thought; it also encompasses subconscious processes such as regulating heartbeat, breathing, and maintaining homeostasis. Employing advanced neuroimaging techniques such as functional magnetic resonance imaging (fMRI), electroencephalography (EEG), and magnetoencephalography (MEG) enables the study of these patterns, unveiling correlations between specific cognitive tasks and neural responses. The brain's plasticity, or its ability to adapt and reorganize, is reflected in changing patterns of activity, influenced by learning, experience, and environmental factors. Understanding human brain activity is essential for unraveling the mysteries of consciousness and mental health, and for developing treatments for neurological disorders.
3.1.2 Signal Representations
Currently, there are many techniques to capture human brain activity. Nonetheless, Electroencephalography (EEG), Magnetoencephalography (MEG), and functional magnetic resonance imaging (fMRI) are the most common neuroimaging techniques used to elicit brain behavior. By using these techniques, different activity patterns can be measured within the brain to decode the content of mental processes, especially visual and auditory content. The resulting datasets capture a spectrum of neural signals and responses, enabling the study of brain activities during various cognitive tasks, sensory perceptions, motor actions, and emotional processing. The definition and analysis of these techniques are provided in the following.
EEG measures the electrical activity of our brain via electrodes that are placed on the scalp. The information collected from these surface measurements indicates how active the brain is. This can be useful for quickly determining how brain activity changes in response to stimuli, and can also be useful for measuring abnormal activity, such as with epilepsy.
Electroencephalography (EEG) presents several notable advantages as a neuroimaging technique. One of its primary strengths lies in its exceptional temporal resolution, capable of capturing brain activity in real time. This means that EEG can detect and monitor neural activity within milliseconds, allowing researchers to observe and analyze rapid changes in brain function, an attribute particularly beneficial in the study of dynamic brain processes. Another significant advantage of EEG is its non-invasive nature. This technique involves placing electrodes on the scalp, making it a safe and comfortable method for individuals participating in studies. Being non-invasive reduces the risk and discomfort commonly associated with other, more invasive neuroimaging methods.
Moreover, EEG is relatively more affordable and accessible compared to other neuroimaging technologies. The cost-effectiveness and ease of access to EEG equipment make it a preferred choice for many researchers and clinicians, enabling a wider range of studies and applications. Furthermore, the versatility of EEG is noteworthy. It finds application across various fields, from clinical diagnostics to cognitive neuroscience and brain-computer interface research. Its adaptability allows for a broad spectrum of investigations, making EEG a valuable tool in exploring brain activity and cognitive functions.
Electroencephalography (EEG), despite its many strengths, suffers from a significant drawback in the form of limited spatial resolution. This restriction is primarily due to the methodology of measuring electrical brain activity via electrodes placed on the scalp. The signals detected by EEG are subject to distortion and attenuation as they pass through the skull and surrounding tissues. Consequently, the accurate localization of the exact source of neural activity becomes a challenging task. EEG struggles to precisely pinpoint the specific brain regions contributing to the recorded electrical signals. This limitation hampers the ability to discern deeper brain structures: neural activity from these regions faces substantial distortion as it travels through the skull and superficial tissues, making it difficult to accurately locate the source.
Magnetoencephalography (MEG) is a cutting-edge neuroimaging method that captures the complex neural dynamics within the human brain by detecting neuromagnetic fields outside the skull. It shares similarities with EEG in providing exceptional temporal resolution, registering brain activity in milliseconds. However, MEG distinguishes itself with a clear advantage in spatial resolution: this technology excels at mapping the spatial distribution of neural processes, pinpointing brain activity locations with remarkable accuracy, often reaching mere millimeters. This capability allows for an in-depth exploration of the specific and dynamic neural mechanisms governing human cognition and behavior.
By measuring the spatial distribution of the magnetic field outside the head, the locations of the neural sources of the MEG signals can be estimated. The spatial resolution of MEG depends on the characteristics of the underlying activity.

MEG directly measures the brain's magnetic fields, offering an accurate assessment of neuronal activity. This direct measurement not only provides a clearer understanding of the brain's functional organization but also makes MEG less susceptible to interference caused by the skull's electrical properties, enhancing its effectiveness in studying brain regions close to the skull and deep brain structures. This sensitivity to deep brain structures is particularly beneficial for understanding neurological conditions originating in these areas.

While both technologies complement each other, it is crucial to note that only MEG provides precise timing together with spatial information about brain activity. MEG signals stem directly from neuronal electrical activity, allowing data to be collected even from subjects in a sleeping state. This characteristic opens up the potential for real-world applications of MEG data, expanding its utility beyond controlled laboratory settings.
Although MEG provides good temporal resolution (detecting changes in brain activity quickly) and its spatial resolution is better than that of EEG, its spatial resolution is still limited. It can identify the brain region where activity is occurring but might not offer precise localization within that region. Additionally, acquiring MEG data is expensive, which can limit its accessibility in many research settings.
Overcoming the limitations of MEG and EEG, functional MRI (fMRI) offers superior spatial resolution, facilitating highly accurate localization of brain functions and activities. It helps researchers analyze brain activity and use this data effectively in neural decoding.
Functional Magnetic Resonance Imaging (fMRI) is a powerful neuroimaging technique that offers substantial advantages in decoding brain activity, particularly in the domain of visual processing. By capturing changes in blood flow and oxygenation levels, fMRI grants researchers the ability to discern and map the engagement of specific brain regions during visual tasks or stimulus processing. This capability proves invaluable in understanding the intricate neural mechanisms underlying visual decoding.
One significant advantage of fMRI lies in its non-invasive nature, making it a safe and widely accessible technique. By measuring brain activity via blood oxygen level-dependent (BOLD) signals, fMRI enables researchers to infer neural engagement without intruding into the brain's physiological integrity, offering a safe means to study cognitive functions.

Figure 3.1.1: Magnetoencephalography (MEG) data have a high temporal resolution (on the order of msec), which allows us to directly assess latency differences in these neural responses. Functional magnetic resonance imaging (fMRI) measures have high spatial resolution (on the order of mm), which allows us to pinpoint the location of activity associated with a sensorimotor event.
Moreover, the spatial resolution provided by fMRI is exceptional. This high-resolution mapping capability is instrumental in discerning the detailed localization and temporal sequence of brain activity during visual tasks. This precision aids in identifying and differentiating the neural activation patterns associated with various aspects of visual processing, such as object recognition, visual attention, or even the perception of motion. The ability to pinpoint specific brain regions involved in these processes provides a comprehensive understanding of how the brain processes visual information.
Furthermore, fMRI facilitates the examination of neural connectivity and interactions among the different brain regions involved in visual decoding. By analyzing functional connectivity, researchers can uncover how these regions communicate and coordinate during visual tasks, contributing to a more holistic comprehension of the neural networks engaged in visual processing.
The absence of known side effects associated with fMRI further enhances its appeal as a tool for investigating visual decoding. Its non-invasive nature, coupled with its ability to deliver detailed spatial and temporal information, makes fMRI an indispensable asset in unraveling the complexities of visual cognition and understanding the underlying neural mechanisms associated with various visual processes.
The superior spatial resolution and comprehensive whole-brain coverage of functional Magnetic Resonance Imaging (fMRI) have captured significant attention from researchers in the fields of Neuroscience and Computer Science. These advantages position fMRI as a leading technique in brain decoding, particularly in the context of visual reconstruction. The wealth of high-quality fMRI data available for research purposes has been pivotal in facilitating significant strides in understanding the intricacies of brain activity during visual processing. Leveraging these robust datasets, researchers have achieved promising outcomes in brain decoding, utilizing fMRI to unveil the complex neural underpinnings of visual reconstruction. This advancement not only sheds light on the intricate mechanisms governing how the brain processes visual information but also enhances our understanding of the broader cognitive processes involved. Ultimately, the utilization of fMRI has enabled a deeper exploration and comprehension of the human brain, providing invaluable insights into its functions, behaviors, and capabilities.

The analysis above illustrates that functional Magnetic Resonance Imaging (fMRI) stands out as the most suitable modality for accurate visual stimulus reconstruction.
3.2 Human Brain Signal-based Image Reconstruction
Addressing the challenge of understanding visual cognition at a deeper level, human brain signal-based image reconstruction emerges as an innovative solution at the intersection of neuroscience and computer science. The problem lies in the complexity of decoding and reconstructing visual information from the intricate neural signals generated during visual tasks. Researchers employ advanced techniques like functional magnetic resonance imaging (fMRI) and electroencephalography (EEG) to record and analyze these brain signals. The solution entails leveraging these signals to decipher the neural representations underlying visual stimuli, ultimately reconstructing images that closely mirror what individuals perceive. This approach not only provides valuable insights into the complexities of visual perception but also holds significant promise for applications in neuroscience, such as brain-computer interfaces, neuroprosthetics, and rehabilitation. As our understanding of the human brain advances, human brain signal-based image reconstruction stands as a powerful tool for unraveling the mysteries of visual cognition and enhancing neurotechnological capabilities.
3.2.1 Definition
Human brain signal-based image reconstruction refers to the process of creating visual representations of images or scenes based on the neural activity recorded from the human brain. This technology aims to decode the complex patterns of electrical signals generated by the brain when an individual observes or imagines specific visual stimuli. By employing advanced neuroimaging techniques such as EEG, MEG, or fMRI, researchers can capture the complicated neural patterns associated with different visual experiences. Depending on the context and requirements, such as real-time processing or reconstruction accuracy, the appropriate technique is chosen. These signals are then subjected to decoding algorithms, which unravel the neural responses related to specific visual stimuli or mental tasks. The culmination of this process is the reconstruction of images that reflect the perceived visual information encoded in the brain signals.
The importance of human brain signal-based image reconstruction lies in its potential applications and relevance. By understanding how the brain represents and processes visual information, this approach contributes to the development of brain-computer interfaces, advancements in neuroimaging, and applications in medical diagnostics. It provides a unique window into the neural mechanisms underlying visual perception. Key characteristics include the intricate process of decoding the relevant brain signals and the subsequent image reconstruction. The complexity of this endeavor requires expertise in both neuroscience and computational methods to bridge the gap between observed neural activity and the resulting visual representations.

Figure 3.2.1: Framework diagram for the visual stimulus reconstruction task
EEG is a non-invasive technique, which makes it the most practical methodology for recording the electrophysiological dynamics of the brain. However, the incapacity of EEG to record spatial information is the main factor behind its poorer results in visual stimulus reconstruction. A complementary technique to EEG is MEG, which can measure brain activity at high spatial and temporal resolution by recording fluctuations of the magnetic fields elicited by the post-synaptic potentials of pyramidal neurons (https://ai.meta.com/static-resource/image-decoding). Therefore, these techniques can be applied to decode dynamic stimuli such as speech and video, and can be used in real-time applications. In contrast to EEG, fMRI's ability to visualize deep brain structures opens new frontiers for exploring complex cognitive functions. By creating detailed maps of brain regions engaged during specific tasks, fMRI contributes to a richer understanding of the neural underpinnings of various cognitive processes. In the context of visual stimulus reconstruction, fMRI's high spatial fidelity enables researchers to pinpoint with greater accuracy the regions involved in processing and responding to external stimuli.
3.2.2 Approaches
In the exploration of decoding neural activity and advancing our understanding of the brain, various approaches have been employed. Two prominent methodologies that have gained significant traction are machine learning-based methods and deep learning-based methods. These computational techniques harness the power of algorithms to analyze complex patterns within neuroimaging data, enabling researchers to uncover hidden insights and make predictions about brain function. In the following sections, we delve into the nuances of these approaches, highlighting their applications, strengths, and contributions to the burgeoning field of neuroscientific research.
Prior to the advent of deep learning, conventional approaches to natural image reconstruction relied on estimating a linear relationship between fMRI signals and manually crafted image features through the use of linear regression models. These techniques primarily concentrated on extracting predetermined low-level features from the stimulus images, such as local image structures or the responses of Gabor filters.
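As a rough, illustrative sketch of this classical pipeline (not the exact setup of the cited studies), the following fits a ridge-regularized linear map from fMRI voxel patterns to precomputed low-level image features; the array shapes, variable names, and the use of scikit-learn are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical training data: one row per stimulus presentation.
# X_fmri: voxel responses (n_samples, n_voxels); Y_feat: handcrafted
# image features such as Gabor filter energies (n_samples, n_features).
rng = np.random.default_rng(0)
X_fmri = rng.standard_normal((1200, 4500))
Y_feat = rng.standard_normal((1200, 256))

# A single multi-output ridge regression; alpha controls shrinkage and
# would normally be chosen by cross-validation on the training split.
decoder = Ridge(alpha=100.0)
decoder.fit(X_fmri, Y_feat)

# At test time, the predicted features are combined (e.g. as a weighted
# sum of fixed feature bases) to form a coarse reconstruction.
X_test = rng.standard_normal((50, 4500))
Y_pred = decoder.predict(X_test)   # (50, 256) predicted image features
print(Y_pred.shape)
```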
In recent years, there has been a notable shift toward the integration of deep learning-based methods in the domain of visual stimulus reconstruction. Deep learning methods have demonstrated remarkable capabilities in capturing the complex, high-dimensional relationships between neural signals and visual stimuli, surpassing the limitations of traditional linear regression models. Moreover, their multilayer architecture enables learning non-linear projections from human brain activity. In general, four families of approaches have proven reliable and effective: convolutional neural networks (CNNs), Generative Adversarial Networks (GANs), variational autoencoders (VAEs), and especially Diffusion Models (DMs).
Non-Generative Methods:
One notable approach is the use of convolutional neural networks (CNNs) for visual stimulus reconstruction. Unlike the hand-crafted features used in traditional methods, CNNs automatically learn hierarchical representations of the input data. These networks consist of multiple layers of convolutional filters that can capture both low-level and high-level features, enabling a more nuanced understanding of the relationships between neural responses and visual stimuli. CNNs were among the first deep learning models used for reconstruction and classification from human brain activity. At that time, only simple datasets were available, so training a CNN model from scratch to reconstruct images was still feasible.
In contrast to a simpler multilayer feed-forward neural network that overlooks the structural information of input images, convolutional neural networks (CNNs) exhibit superior feature-extraction capabilities. This is attributed to the information filtering carried out by convolutional layers within a localized pixel neighborhood. Stacking convolutional layers facilitates the learning of hierarchical visual features, commonly referred to as feature abstraction, from input images: the lower layers focus on grasping low-level details, while the higher layers extract overarching high-level visual information. CNNs find extensive application in image processing tasks, including image reconstruction, where architectures such as encoder-decoders, U-Net, generative adversarial networks, and variational autoencoders leverage stacked convolutional layers for feature extraction at multiple levels.
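To make the notion of hierarchical features concrete, the sketch below collects activations from several depths of a pre-trained VGG-19 using PyTorch/torchvision. The chosen tap indices, input size, and the weights argument (which assumes a recent torchvision release) are illustrative, not tied to any specific study.

```python
import torch
from torchvision import models

# Load a pre-trained VGG-19 and keep only its convolutional stack.
# The tap indices (end of each conv block) are illustrative; studies
# differ in which layers they treat as "low" vs "high" level features.
vgg = models.vgg19(weights="IMAGENET1K_V1").features.eval()
tap_points = {3: "conv1", 8: "conv2", 17: "conv3", 26: "conv4", 35: "conv5"}

def hierarchical_features(image: torch.Tensor) -> dict:
    """Run one normalized image (1, 3, 224, 224) through the VGG-19
    convolutional stack and collect activations at several depths."""
    feats, x = {}, image
    with torch.no_grad():
        for idx, layer in enumerate(vgg):
            x = layer(x)
            if idx in tap_points:
                feats[tap_points[idx]] = x.clone()
    return feats

feats = hierarchical_features(torch.randn(1, 3, 224, 224))
for name, f in feats.items():
    print(name, tuple(f.shape))  # shallow layers: large maps, few channels
```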
Figure 3.2.2: Overview of two variations of the frameworks proposed by Shen et al. (2019b): (A) ShenDNN and (B) ShenDNN+DGN. The yellow color denotes the use of pre-trained components.

Shen et al. employed a pre-trained DNN based on VGG-19 to extract hierarchical features from stimulus images (refer to Figure 3.2.2). The DNN comprises sixteen convolutional layers followed by three fully connected layers. This approach was inspired by the observation that hierarchical image representations from various layers of deep neural networks correlate with brain activity in the visual cortex. Leveraging this correlation, a hierarchical mapping is established from fMRI signals in the low- and high-level areas of the visual cortices to the corresponding multilayer DNN features through a decoder (D) that translates fMRI activity patterns into multilayer DNN features. Prior to the reconstruction task, the decoder D is trained on the training set. The decoded fMRI features align with the hierarchical image features derived from the DNN. Optimization in the feature space is then conducted by minimizing the disparity between the hierarchical DNN features of the image and the multilayer features decoded from fMRI activity.
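Below is a minimal sketch of this feature-space optimization step. A small stand-in network and random targets replace the pre-trained DNN and the features decoded from fMRI; it illustrates the idea rather than reproducing Shen et al.'s exact procedure.

```python
import torch
import torch.nn as nn

# Stand-in multilayer feature extractor; in practice this would be a
# pre-trained DNN (e.g. VGG-19) returning activations from several layers.
class TinyExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        f1 = self.block1(x)
        f2 = self.block2(f1)
        return [f1, f2]          # multilayer features

extractor = TinyExtractor().eval()
for p in extractor.parameters():
    p.requires_grad_(False)

# target_feats stands in for the multilayer features decoded from fMRI.
with torch.no_grad():
    target_feats = extractor(torch.rand(1, 3, 64, 64))

# Start from noise and optimize the pixels so that the image's features
# match the decoded targets across all layers.
image = torch.rand(1, 3, 64, 64, requires_grad=True)
opt = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    opt.zero_grad()
    loss = sum(nn.functional.mse_loss(f, t)
               for f, t in zip(extractor(image), target_feats))
    loss.backward()
    opt.step()
    image.data.clamp_(0.0, 1.0)   # keep pixels in a valid range
```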
Another non-generative approach is the encoder-decoder model, which finds extensive application in tasks such as image-to-image translation and sequence-to-sequence modeling. These models employ a two-stage architecture featuring an encoder (E) responsible for compressing the input vector x into a latent representation z = E(x), and a decoder (D) generating the output vector y = D(z) from this latent representation. The compressed latent representation z acts as a bottleneck, encapsulating a low-dimensional representation of the input. Training aims to minimize the reconstruction error, which quantifies the disparity between the reconstructed output and the ground-truth input.
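A minimal autoencoder sketch of this idea, showing the bottleneck z = E(x), the reconstruction y = D(z), and training by minimizing the reconstruction error; the dimensions and fully connected architecture are placeholders.

```python
import torch
import torch.nn as nn

# Minimal encoder-decoder: E compresses x into a bottleneck z = E(x),
# D reconstructs y = D(z); training minimizes the reconstruction error.
class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.encoder(x)        # low-dimensional bottleneck
        return self.decoder(z)

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)            # hypothetical batch of flattened inputs

for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)   # reconstruction error
    loss.backward()
    opt.step()
```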
Beliy et al. introduced a convolutional neural network (CNN)-based encoder-decoder model named BeliyEncDec.

Figure 3.2.3: The BeliyEncDec framework, introduced by Beliy et al. (2019), involves two main training stages: (A) supervised training of the Encoder and (B) a combination of supervised and self-supervised training for the Decoder, during which the Encoder's weights remain constant. The components of the model trained on external unlabeled data are indicated in blue.
In this model, the encoder (E) learns the mapping from stimulus images to the corresponding fMRI activity, while the decoder (D) learns the reverse mapping. The architecture of BeliyEncDec, illustrated in Figure 3.2.3, involves two combined networks (E-D and D-E) whose inputs and outputs correspond to natural images and fMRI recordings, respectively. This setup facilitates self-supervised training on a larger corpus of unlabeled data, including 50,000 additional images from the ImageNet validation set and unlabeled fMRI samples. Competitive results were demonstrated on two natural image reconstruction datasets: Generic Object Decoding and vim-1. The training occurs in two steps. Initially, the encoder (E) establishes a mapping from stimulus images to fMRI activity, utilizing the weights of the first convolutional layer of the pretrained AlexNet. Subsequently, with the encoder fixed, the decoder (D) is jointly trained using both labeled and unlabeled data. The overall loss of the model encompasses the fMRI loss of the encoder (E) and the image loss (RGB and feature losses) of the decoder (D).
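The sketch below illustrates the second training stage in schematic form: with a (placeholder) encoder frozen, the decoder receives a supervised loss on paired data plus self-supervised consistency terms on unlabeled images (E-D branch) and unlabeled fMRI (D-E branch). The linear stand-in networks, dimensions, and equal loss weights are assumptions, not the published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder networks; the real model uses convolutional E and D and
# initializes E with AlexNet's first convolutional layer.
img_dim, fmri_dim = 3 * 112 * 112, 4500
E = nn.Linear(img_dim, fmri_dim)      # image -> predicted fMRI
D = nn.Linear(fmri_dim, img_dim)      # fMRI  -> reconstructed image

# Stage (B): E is frozen, D is trained on paired and unpaired data.
for p in E.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(D.parameters(), lr=1e-4)

paired_img = torch.rand(8, img_dim)
paired_fmri = torch.rand(8, fmri_dim)
unlabeled_img = torch.rand(8, img_dim)    # e.g. extra ImageNet images
unlabeled_fmri = torch.rand(8, fmri_dim)  # e.g. fMRI without image labels

opt.zero_grad()
sup_loss = F.mse_loss(D(paired_fmri), paired_img)              # supervised D
img_cycle = F.mse_loss(D(E(unlabeled_img)), unlabeled_img)     # E-D branch
fmri_cycle = F.mse_loss(E(D(unlabeled_fmri)), unlabeled_fmri)  # D-E branch
loss = sup_loss + img_cycle + fmri_cycle
loss.backward()
opt.step()
```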
In a subsequent study, Gaziv et al. enhanced the reconstruction accuracy of BeliyEncDec by introducing a loss function based on perceptual similarity measures. The perceptual similarity loss is computed by extracting multilayer features from both the original and reconstructed images using VGG and comparing these features layerwise.
Generative Methods:
Generative models assume that the data is generated from some probability distribution p(x) and can be classified as implicit or explicit. Implicit models do not define the distribution of the data but instead specify a random sampling process with which to draw samples from p(x). Explicit models, on the other hand, explicitly define the probability density function, which is used to train the model.
Robust image-generation models play a crucial role in enhancing performance across diverse tasks. Specifically, generative models contribute to improving the quality of reconstructions. The prevalent types of generative models for visual stimulus reconstruction are Generative Adversarial Networks (GANs), combinations of Variational Autoencoders (VAEs) and GANs, and Diffusion Models (DMs).
Generative Adversarial Networks (GANs), a category of implicitly defined generative models, have garnered considerable attention for their capacity to generate lifelike images. In the realm of natural image reconstruction, GANs are extensively utilized to capture the distribution of stimulus images. Comprising generator and discriminator networks, a GAN operates by having the generator G take a random noise vector z (typically drawn from a Gaussian distribution) and produce a synthetic sample G(z) with statistical properties akin to those of the training set images. Throughout the training process, the generator's proficiency in creating realistic images steadily improves, reaching a point where the discriminator cannot differentiate between a genuine sample and a generated counterfeit. GAN-based frameworks boast several advantageous features in comparison to other generative methods. Firstly, they do not necessitate strong assumptions about the form of the output probability distribution. Secondly, the use of adversarial training, involving the discriminator, facilitates unsupervised training of the GAN.
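For reference, a toy adversarial training step with fully connected generator and discriminator networks over flattened images; real GAN-based reconstruction frameworks use convolutional architectures, but the alternating objectives follow the same pattern.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator over flattened 64x64 grayscale images.
z_dim, img_dim = 100, 64 * 64
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                  nn.Linear(256, img_dim), nn.Tanh())
Dnet = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
                     nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(Dnet.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()
real = torch.rand(32, img_dim) * 2 - 1   # stand-in for training images

# Discriminator step: real images labelled 1, generated images labelled 0.
z = torch.randn(32, z_dim)
fake = G(z).detach()
d_loss = bce(Dnet(real), torch.ones(32, 1)) + bce(Dnet(fake), torch.zeros(32, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: try to make the discriminator label fakes as real.
z = torch.randn(32, z_dim)
g_loss = bce(Dnet(G(z)), torch.ones(32, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```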
Seeliger et al. employed a deep convolutional Generative Adversarial Network (DCGAN) architecture incorporating advancements through the integration of convolutional and deconvolutional layers. The researchers focused on learning a direct linear mapping from the functional magnetic resonance imaging (fMRI) space to the latent space of the GAN, as illustrated in Figure 3.2.4A. For the image domain of natural stimuli, the generator G was pre-trained on down-sampled 64 x 64 grayscale images converted from the ImageNet and Microsoft COCO datasets. In the domain of handwritten character stimuli, the DCGAN was pre-trained using 15,000 handwritten characters.

Figure 3.2.4: GAN-based frameworks. (A) SeeligerDCGAN framework based on a deep convolutional GAN. (B) Framework proposed by Mozafari et al.
In a separate study, Mozafari et al. adopted a variant of the GAN known as the BigBiGAN model, enabling the reconstruction of more realistic images. The BigBiGAN model, with its latent space, captures high-level semantic information from fMRI data to enhance the generation of lifelike images. Referred to as MozafariBigBiGAN, this framework incorporates a pre-trained encoder E that generates a latent vector E(x) from the input image x, and a generator G that produces an image G(z) from a latent vector z, as depicted in Figure 3.2.4B. During training, the authors computed a linear mapping W from the latent vectors E(x) to fMRI activity using a general linear regression model. In the test stage, the linear mapping is inverted to compute the latent vectors z from the test fMRI activity.
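A small sketch of this train/test logic on synthetic arrays: a least-squares fit of the latent-to-fMRI mapping W during training, and a pseudoinverse of W at test time to estimate latent vectors from new fMRI patterns. The dimensions and the plain least-squares estimator are assumptions; the cited work uses a general linear regression model whose exact regularization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, latent_dim, n_voxels = 1000, 120, 4000

# Synthetic stand-ins: E(x) latent vectors and corresponding fMRI patterns.
Z_train = rng.standard_normal((n_train, latent_dim))
V_train = Z_train @ rng.standard_normal((latent_dim, n_voxels)) \
          + 0.1 * rng.standard_normal((n_train, n_voxels))

# Training: fit a linear map W such that Z_train @ W approximates V_train.
W, *_ = np.linalg.lstsq(Z_train, V_train, rcond=None)   # (latent_dim, n_voxels)

# Test: invert the mapping with the pseudoinverse to estimate latent
# vectors from new fMRI activity, which are then fed to the generator G.
V_test = rng.standard_normal((10, n_voxels))
Z_est = V_test @ np.linalg.pinv(W)                       # (10, latent_dim)
print(Z_est.shape)
```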
The majority of prior research on visual decoding from fMRI data involves selecting voxels corresponding to specific visual regions of interest (ROIs). These voxel measurements are then flattened into 1D vectors, which serve as input for the visual decoding process. However, this approach exhibits significant drawbacks. The definition of ROIs and the selection process during fMRI preprocessing are subjective and can vary between individuals. Moreover, the spatial topology of 2D cortical areas is often overlooked when using vectorized voxel responses. To address these limitations, a novel visual decoding framework, Cortex2Image, was proposed by Ozcelik et al. This framework comprises a Cortex2Semantic model, a Cortex2Detail model, and a pre-trained and frozen image generator named Instance-Conditioned GAN (IC-GAN). Notable improvements include an architecture shared across individual subjects in the Cortex2Image model, which consumes cortex-wide brain activity through a standardized mesh representation of the cortex rather than relying on specific ROIs. The use of surface convolutions in this model enables the exploitation of spatial information in brain activity patterns. Additionally, the end-to-end training of the Cortex2Detail model in conjunction with IC-GAN enhances computational efficiency compared to previous methods that involve additional noise-vector optimization steps.
Larsen et al. introduced a hybrid model, VAE-GAN, seamlessly integrating a Variational Autoencoder (VAE) and a Generative Adversarial Network (GAN). This framework combines a VAE for generating latent features with a GAN discriminator that learns to differentiate between fake and authentic images. VAE-GAN unifies the VAE decoder and the GAN generator into a single entity. The advantages of VAE-GAN are twofold: firstly, the adversarial loss of the GAN facilitates the generation of visually more realistic images, and secondly, VAE-GAN alleviates the mode-collapse problem of GANs, where a generator produces only a limited subset of different outcomes.
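A compact sketch of a combined VAE-GAN objective, with placeholder fully connected networks: a reparameterized VAE reconstruction and KL term, plus an adversarial term that pushes the shared decoder/generator to fool a discriminator. The loss weights and architectures are illustrative, not those of the cited work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, img_dim = 64, 3 * 64 * 64

# Placeholder VAE encoder (mean and log-variance), a decoder that doubles
# as the GAN generator, and a discriminator.
enc = nn.Linear(img_dim, 2 * latent_dim)
dec = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, img_dim))
disc = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

x = torch.rand(16, img_dim)

# VAE part: reparameterized sample, reconstruction and KL terms.
mu, logvar = enc(x).chunk(2, dim=1)
z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
x_rec = dec(z)
rec_loss = F.mse_loss(x_rec, x)
kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

# GAN part: the decoder/generator is also pushed to fool the discriminator.
adv_loss = F.binary_cross_entropy_with_logits(disc(x_rec), torch.ones(16, 1))

# In practice separate optimizers update enc/dec and disc with tuned weights.
total = rec_loss + kl_loss + adv_loss
total.backward()
```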
VanRullen and Reddy employed a VAE network pre-trained on the CelebA dataset using GAN procedures to acquire a variational latent space. Similar to the MozafariBigBiGAN framework, they established a linear mapping between the latent feature space and fMRI patterns, bypassing probabilistic inference. During training, the pre-trained encoder from VAE-GAN remains fixed, and the learning process focuses on the linear mapping between the latent feature space and fMRI patterns. In the testing stage, fMRI patterns are translated into VAE latent codes through the inverse mapping, and these codes are then used for facial reconstruction. The VAE's latent space, functioning as a variational layer, offers a meaningful representation of each image, capable of portraying faces and facial features as linear combinations. Due to the VAE's training objective, proximate points in this space correspond to similar facial images, consistently yielding visually plausible results. Consequently, the VAE's latent space ensures robustness in brain decoding, minimizing mapping errors and producing more realistic reconstructions that closely resemble the original stimulus images. This approach not only facilitates the reconstruction of natural-looking faces but also enables gender decoding. The architectural configuration of the framework, termed VanRullenVAE-GAN, encompasses three networks, as depicted in Figure 3.2.5.

Figure 3.2.5: VanRullenVAE-GAN framework proposed by VanRullen and Reddy
Ren et al. demonstrate the feasibility of deriving promising solutions by acquiring visually-guided latent cognitive representations from fMRI signals and subsequently decoding them into the corresponding image stimuli. The approach involves training an encoder to map brain signals to a low-dimensional representation that retains crucial visually relevant information. Simultaneously, a robust decoder is trained to recover this latent representation back into the original visual stimulus. To achieve this objective, the D-VAE/GAN model employs a Dual VAE-based Encoding Network to generate low-dimensional latent features for both fMRI signals and perceived images. Subsequently, a novel GAN-based inter-modality knowledge distillation method