UNIVERSITY OF INFORMATION TECHNOLOGY
FACULTY OF COMPUTER SCIENCE
PHAM THI BICH NGA
DAM VU TRONG TAI
GRADUATE DISSERTATION
BACHELOR OF COMPUTER SCIENCE
VIETNAM NATIONAL UNIVERSITY — HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
FACULTY OF COMPUTER SCIENCE
PHAM THI BICH NGA - 20521642
DAM VU TRONG TAI - 20521855
GRADUATE DISSERTATION
IMAGE RECONSTRUCTION FROM FUNCTIONAL
MAGNETIC RESONANCE IMAGING (fMRI) DATA
USING LATENT DIFFUSION MODEL
BACHELOR OF COMPUTER SCIENCE
SUPERVISED BY
DR NGUYEN VINH TIEP
The committee for evaluating the graduation thesis is established according to Decision No. .........., dated .........., by the Rector of the University of Information Technology.
.......................................... — Chairman.
.......................................... — Secretary.
.......................................... — Member.
.......................................... — Member.
The successful completion of this dissertation is attributed to the invaluable assistance and support extended by numerous individuals, and we express profound gratitude for their feedback. Primarily, we wish to convey appreciation to our supervisor, Dr. Nguyen Vinh Tiep, for his dedicated guidance, enthusiastic direction, and invaluable instructions throughout this research endeavor. His wise counsel and unwavering support played a vital role in navigating the research process, leading to the successful completion of this thesis.
Our sincere thanks are extended to the Dean of the Faculty and all the faculty members in the Faculty of Computer Science at the University of Information Technology for their assistance and for equipping us with the knowledge necessary to complete this thesis.
Acknowledgment is also extended to the Multimedia Laboratory (MMLab-UIT) for providing a research environment and equipment for this study. Additionally, our appreciation goes to the researchers of the MMLab for their valuable feedback and probing questions that contributed to our research. Their insights aided in identifying and rectifying mistakes, ultimately enhancing the quality of this thesis.
2.1.1 Vision-Language Model Foundations
2.1.2 Contrastive Language-Image Pretraining Model
2.2 Diffusion Models
2.2.1 Intuition of Diffusion Models
2.2.2 Latent Diffusion Models
3 Related Work
3.1 Human Brain Activity and Its Representations
5.3.1 Comparison with Other Models
LIST OF FIGURES
1.1.1 Example of the image reconstruction process from fMRI data. fMRI data are acquired from an fMRI scanner. We process the fMRI reconstruction stage to reconstruct the original image from the fMRI data.
2.1.1 The new learning paradigm with VLMs enables effective usage of web data and zero-shot predictions without task-specific fine-tuning.
2.1.2 Architecture of the Vision Transformer.
2.1.3 Summary of the CLIP approach.
2.2.1 Forward diffusion and reverse diffusion process.
2.2.2 Training and sampling algorithms of DDPMs.
2.2.3 The U-Net architecture.
2.2.4 The architecture of the latent diffusion model.
3.1.1 Magnetoencephalography (MEG) data have a high temporal resolution (on the order of msec), which allows us to directly assess latency differences in these neural responses. Functional magnetic resonance imaging (fMRI) measures have high spatial resolution (on the order of mm), which allows us to pinpoint the location of activity associated with a sensorimotor event.
3.2.1 Framework diagram for the visual stimulus reconstruction task.
3.2.2 Overview of two variations of frameworks proposed by Shen et al. (2019b): (A) ShenDNN and (B) ShenDNN+DGN. The yellow color denotes the use of pre-trained components.
3.2.3 The BeliyEncDec framework, introduced by Beliy et al. (2019), involves two main training stages: (A) supervised training of the Encoder and (B) a combination of supervised and self-supervised training for the Decoder. During this process, the Encoder's weights remain constant. The components of the model trained on external unlabeled data are indicated in blue.
3.2.4 GAN-based frameworks. (A) SeeligerDCGAN framework based on a deep convolutional GAN. (B) Framework proposed by Mozafari et al.
3.2.5 VanRullenVAE-GAN framework proposed by VanRullen and Reddy.
4.1.1 Analysis of limitations in the results of the Brain-Diffuser model. Corresponding to each groundtruth image in the MS COCO dataset that the user observes, there are from 1 to 5 accompanying captions. The caption sentences have a structure that describes the object categories and their relationships. As shown in the image above, the caption includes the object categories "vase" and "roses", and their relationship is that the "roses" are in the "vase". With the corresponding groundtruth, the image reconstructed from the fMRI data must have the same content as the original image and description. However, the reconstructed image still does not show the correct object category ("roses"), and the details of the image are still not the same as in the original image.
4.1.2 Overview of the entire image reconstruction process.
4.1.3 Architecture of the Brain-Diffuser model. When a subject views the original image, their brain activity is recorded and processed into fMRI data. These fMRI data are then simultaneously input into three branches. The first branch uses a low-level reconstruction model to create a low-resolution image. The Text Branch takes the fMRI data as input and outputs fMRI Text Embeddings based on the Ridge model learned for the Text Branch. The Vision Branch takes the fMRI data as input and outputs fMRI Vision Embeddings based on the Ridge model learned for the Vision Branch. These low-resolution images, along with the Text and Vision Embeddings, are incorporated into a pretrained Diffusion model to generate reconstructed images. The Text Branch focuses on enriching information related to object categories and relationships, while the Vision Branch concentrates on refining image details. The collaborative process between these branches ensures the accurate and comprehensive reproduction of the original image.
4.1.4 Visualization of the distributions of the Text Branch and Vision Branch of the Brain-Diffuser model. The Text Distribution on the left shows the Brain-Diffuser fMRI Text Embeddings (green points) and the groundtruth CLIP Text Embeddings (blue points). The Vision Distribution on the right shows the Brain-Diffuser fMRI Vision Embeddings (green points) and the groundtruth CLIP Vision Embeddings (blue points). The Text Distribution and Vision Distribution obtained with the Ridge Regression model exhibit a significant gap. Based on the principle that the greater the overlap between two distributions, the more information is retained, these large gaps in the Text and Vision Distributions cause information loss during the mapping process. Therefore, the Ridge Regression model is not effective in mapping the distributions. UMAP dimensionality reduction is applied to visualize the data points of the distributions with 982 samples in the testing set.
4.1.5 Ridge Regression model architecture overview for the Vision and Text Branches.
4.2.1 Overview of our training framework using MindMapper for mapping the Text and Vision Branches.
4.2.2 Overview of our MindMapper architecture for mapping fMRI to CLIP Text/Vision Embeddings.
4.2.3 UMAP plots depict the groundtruth Text Embeddings (blue), Ridge (Baseline) Text Branch embeddings (green), and MindMapper Text Branch embeddings (red). The distance between the groundtruth embeddings and the MindMapper embeddings is lower, suggesting that MindMapper helps to align the two embedding spaces.
4.2.4 The UMAP plots illustrate embeddings for the Vision groundtruth (blue), Ridge (Baseline) Vision Branch embeddings (green), and MindMapper Vision Branch embeddings (red). Both the left and right images show a significant separation. This results in information loss, indicating that models such as Ridge Regression or MindMapper are not effective in learning these mappings for the Vision Branch.
4.2.5 Overview of our framework using MindMapper and VisualDenoiser for mapping high-level feature components.
4.2.6 UMAP plots illustrate Ridge (Baseline) Vision Embeddings (green), MindMapper Vision Embeddings (red), and MindMapper+VisualDenoiser Embeddings (pink), compared with the groundtruth Vision Embeddings (blue). The Baseline and MindMapper vision distributions have a large gap compared to the groundtruth distribution, but the MindMapper+VisualDenoiser vision distribution has the highest overlap, showing that the model contributes effectively to the mapping process.
4.2.7 A comprehensive overview of the Training and Inference processes.
5.3.1 Comparison of fMRI reconstructions for different models.
5.3.2 Percentage of each model in ranks.
5.4.1 MindMapper model used for the Text Branch, with results compared to the Brain-Diffuser model results and the groundtruth image. Applying the MindMapper model to the Text Branch significantly contributes to making the model aware of object classes and relationships in the semantic context of the image.
5.4.2 Comparison of the outcomes of the MindMapper and VisualDenoiser model for the Vision Branch and MindMapper for the Text Branch at the final stage, against the scenario where only MindMapper is employed for the Text Branch, along with Brain-Diffuser and the groundtruth. The combined MindMapper and VisualDenoiser model showcases a substantial improvement in reconstructing the meaningful content of the image with high detail for both the Text and Vision Branches.
List of Tables
5.3.1 Scores of different models on the brain-image reconstruction task. For each measure, the best value is in bold. For the PixCorr, SSIM, AlexNet-2, AlexNet-5, InceptionV3 and CLIP metrics, higher is better. For the EffNet-B and SwAV distances, lower is better. The arrow pointing up or down indicates this.
5.3.2 The average rank of different models based on the results of user studies.
5.4.1 Quantitative comparison of MindMapper against the Baseline result. For each measure, the best value is in bold. For the PixCorr, SSIM, AlexNet-2, AlexNet-5, Inception and CLIP metrics, higher is better. For the EffNet-B and SwAV distances, lower is better. The arrow pointing up or down indicates this.
5.4.2 Quantitative comparison of MindMapper and VisualDenoiser against the Baseline result. For the PixCorr, SSIM, AlexNet(2), AlexNet(5), Inception and CLIP metrics, higher is better. For the EffNet-B and SwAV distances, lower is better. When comparing our MDIR model, consisting of MindMapper for the Text Branch combined with MindMapper and VisualDenoiser for the Vision Branch, the model shows significantly improved performance on the high-level metrics EffNet-B and SwAV, which increase by 12.5% and 11.5%, while the remaining metrics also increase by up to 2.6%.
5.4.3 Performance metrics of MDIR with different DDIM steps. They consume 12 GB of GPU memory for the inference phase.
In the realm of neural decoding research, a particularly fascinating area involves the reconstruction of visually perceived natural images using fMRI signals. Previous approaches have succeeded in recreating the low-level properties (shape, texture, structures) in the reconstructed images, while the high-level features (object category, image details) cannot be generated exactly. This is because they utilized simple projection methods such as Ridge Regression to compress fMRI data into a corresponding space, such as the CLIP space, that can be used to condition the visual stimulus reconstruction process. Nevertheless, those methods are unable to capture the complicated patterns in the CLIP space, and most of them used separate simple models to map fMRI to distinct tokens in a high-dimensional space, which is not reliable: missing the interaction information between tokens can lead to information loss issues. In this work, we focus on improving the projection of the high-level features by leveraging two branches: a text branch, mainly for object class and relationship perception, and a vision branch for image detail reconstruction. We propose a MindMapper model to deal with the information loss of previous methods in the Text Branch. Additionally, due to the complexity of the vision distribution, MindMapper is proposed to be used in conjunction with VisualDenoiser. VisualDenoiser is introduced to further reduce the gap between the outputs of MindMapper and the CLIP Image Embeddings, and it provides additional information to enhance the projection process. Our model shows a significant increase of +12.5% and +11.5% in EfficientNet and SwAV respectively, while the remaining metrics, such as PixCorr, AlexNet-2, AlexNet-5, InceptionV3, and CLIP, show a slight improvement of up to 3%.
Chapter 1
Introduction
In this chapter, we systematically introduce the focal theme, offering a comprehensive overview that delves into the intrinsic importance of this subject in the realm of research. Additionally, we meticulously define the problem. Following this, we articulate our motivation and establish the objectives of this thesis. To conclude, we provide the contributions that this research makes to the existing body of knowledge.
1.1 Overview
1.1.1 Practical Context
Have you ever wondered if scientists could peek inside your brain and decode the images you're thinking about, the scenes in your dreams, or memories from your past? Right now, they can do this by measuring brain activity signals. There are many methods to capture signals of brain activity, like fMRI (functional Magnetic Resonance Imaging) and EEG (Electroencephalography), but those signals are like a meaningless secret code that cannot directly reveal the specific content of the images represented by those data. Therefore, to understand what kind of visual content those signal data represent in the brain, a new challenge has emerged: image reconstruction from human brain activity. fMRI data in particular is used because of its efficiency and its ability to preserve rich information. Image reconstruction from functional Magnetic Resonance Imaging (fMRI) data has many applications in real life.
• Memories Restoration:
Image reconstruction from human brain signals, especially leveraging fMRI, promises revolutionary applications in memory restoration. By decoding the neural patterns associated with memories, this technology holds the potential to restore and visualize past experiences. The contribution lies in offering a direct window into the neural underpinnings of memory, paving the way for therapeutic interventions and cognitive rehabilitation for individuals with memory-related disorders.
• Dream Visualization:
The application of image reconstruction from human brain signals extends into the fascinating realm of dream visualization. Using fMRI, researchers can decode brain activity during dreaming states, translating these neural patterns into visual representations of dreams. This not only enriches our understanding of the dreaming process but also holds potential applications in studying and addressing sleep disorders.
• Biomedical Research Advancements:
In the domain of biomedical research, the fusion of image reconstruction with fMRI data opens avenues for transformative advancements. By accurately capturing and reconstructing neural responses to visual stimuli, researchers gain unprecedented insights into the intricate neural mechanisms underpinning cognitive processes. This technology contributes to advancing our understanding of neurological disorders and cognitive functions, and it facilitates the development of targeted interventions and treatments.
• Creativity in Art and Design:
The synergy between image reconstruction from human brain signals and creative endeavors in art and design unlocks innovative possibilities. By utilizing fMRI data, artists and designers can tap directly into mental imagery, fostering a new era of creative expression. Beyond the artistic realm, this technology serves as a means for individuals with limited traditional forms of expression to communicate their creative visions, particularly benefiting those with physical disabilities.
• Smart Interaction:
Image reconstruction from human brain signals, when integrated into Brain-Computer Interfaces (BCIs), propels smart interaction to new heights. By decoding intentions and visualizations directly from the brain using fMRI, this technology enhances the efficiency and intuitiveness of human-computer interaction. It has the potential to redefine communication for individuals with motor impairments, providing a direct link between neural activity and external devices.
In summary, image reconstruction from human brain signals, with a focus on fMRI, is a frontier in neuroscience with far-reaching applications. From restoring memories to envisioning dreams, advancing biomedical research, fostering creativity, and enabling smart interactions, the contributions of this technology are profound. As the field continues to evolve, the potential for real-world impact expands, positioning image reconstruction from human brain signals as a key player at the intersection of neuroscience and advanced imaging technologies.
1.1.2 Problem Statement
Image reconstruction from human brain activity is a subtask of human visual decoding. It aims to reconstruct the image from recorded brain activity data, typically obtained through neuroimaging techniques such as functional magnetic resonance imaging (fMRI), electroencephalography (EEG), and magnetoencephalography (MEG).

Figure 1.1.1: Example of the image reconstruction process from fMRI data. fMRI data are acquired from an fMRI scanner. We process the fMRI reconstruction stage to reconstruct the original image from the fMRI data.

In this thesis, we utilize fMRI to reconstruct visual stimuli. Figure 1.1.1 illustrates an example of image reconstruction from human brain activity.
• Input: 3D brain fMRI data acquired from an fMRI scanner while a subject observes a given image.
• Output: A corresponding high-resolution reconstructed image of what the subject observed.
1.1.3 Challenges
Besides common challenges of computer vision tasks, Image Reconstruction from fMRI data has its own
difficulties.
• The training dataset is one of the main challenges we encounter in this work. With a small number of sample instances, this poses a significant limitation to constructing a model that yields satisfactory image reconstruction results.
• Limited resolution. Current neuroimaging techniques, such as functional magnetic resonance imaging (fMRI) and electroencephalography (EEG), have limitations in both spatial and temporal resolution. This makes it challenging to reconstruct detailed and semantically reasonable images of brain activity.
• Data preprocessing is also a significant challenge, because fMRI data is raw 3D data with an extremely large number of dimensions. Feature extraction methods used by previous models to generate fMRI data after preprocessing lack information or overlook important details. Extracting information from 3D fMRI data and then flattening it into 1D vectors, which are subsequently used as input to the reconstruction model, can lead to the loss of spatial information in brain activity patterns.
• Requires high computational resources. Since we are constructing a complex model for a small amount of data, it necessitates numerous modules to incorporate additional information. Therefore, these techniques demand substantial computational resources.
• The essence of image reconstruction models is that they are generative models, and the main task of the image reconstruction problem is to transform fMRI data into the type of data accepted by the generative model. Therefore, the key factor is to build a good mapping model from fMRI data to the data type of the generative model. However, the limited number of samples in the dataset and the diversity and complexity in the representation of patterns of the generative model's data type make it very challenging to construct a mapping model for this problem.
• Lack of evaluation metrics. The absence of a standardized evaluation procedure for assessing reconstruction quality in terms of realism, naturalness, and logical coherence compared to the original image and to other methods makes evaluation extremely difficult. Evaluating these images relies on human perception, and there is no perfect metric that matches human judgment.
1.2 Motivation
The quest to solve the challenge of image reconstruction from human brain signals is driven by its tremendous potential to decode the intricacies of the nervous system and gain insights into the information encoded within the human brain. Successfully addressing this problem could have far-reaching implications, unlocking a myriad of applications. For instance, it could pave the way for memory restoration, offering hope to individuals with memory-related conditions, and advance biomedical research by aiding in the understanding and diagnosis of cognitive disorders and neurological conditions. Additionally, it could spark innovations in the fields of art and design. However, despite its transformative potential, the exploration of this problem is currently limited. That is one of the motivations why we study this problem.
Additionally, the development of models to address this problem is still wide open. Many earlier studies on image reconstruction focused on either the Generic Object Decoding or the Deep Image Reconstruction datasets curated by the Kamitani Lab. These datasets comprise 1200 training and 50 testing images sourced from ImageNet. Pioneering works in this field include those by Shen et al., Beliy et al., and Gaziv et al., who utilized basic CNN and DNN models to examine the image generation process. However, these generated images primarily conveyed low-level features such as layout, sharpness, and contours, resulting in blurry and indistinct images.
More recently, Allen et al. introduced another dataset for visual encoding and decoding studies named the Natural Scenes Dataset (NSD). With its increased number, diversity, and complexity of images, the NSD dataset has become the predominant benchmark for fMRI-based image reconstruction. Works by Gu et al., Ozcelik et al., and Takagi et al. also employed the NSD dataset in conjunction with deep generative models like GANs and Diffusion models to reconstruct images with high semantic and high-level detail aspects. Nevertheless, reconstructing scenes with multiple objects and complex semantic descriptions, such as COCO images from the NSD dataset, remains a challenge.
Given the notable recent success of latent diffusion models in generative AI applications, an image reconstruction model could also benefit from the advantages offered by these models.
1.3 Objectives
In this research, we aim to propose a novel image reconstruction model that builds upon the sensible and effective architecture of an existing baseline model, incorporating novel mapping modules to enhance its performance and avoid information loss during data mapping. The proposed model is designed to mitigate information loss, preventing inaccuracies in the semantics and structure of the reconstructed image compared to the original.
1.4 Contributions
Our main contributions in this work are listed in the following points:
• Investigating the overview of related research, we conducted a survey on the current research regarding image reconstruction using different datasets and methods. Subsequently, we identified key components that can be leveraged to design a framework for our image reconstruction from fMRI data.
• Our work focuses on improving the two main branches of the task, namely the Text Branch and the Vision Branch, to support the image generation process.
— In the Text Branch, we propose using a single enhanced deep network model called MindMapper, capable of mapping fMRI data to corresponding text embeddings to provide information about object categories and relationships in the image.
— In the Vision Branch, we suggest using the MindMapper model that we proposed, combined with the pretrained DALL-E 2 prior model, to assist in supplementing information for the details of the image.
Our proposed model can reconstruct high-resolution and highly naturalistic images effectively and efficiently. Additionally, our model amplifies object class information and proficiently extracts scene details from fMRI data.
1.5 Dissertation Structure
Our thesis consists of 6 chapters:
• Chapter 1: Introduction. This chapter presents an overview of the image reconstruction from human brain activity problem, including the research motivation, definition, challenges, and our main contributions.
• Chapter 2: Background. This chapter gives an introduction to the fundamental knowledge that plays a vital role in our thesis.
• Chapter 3: Related Work. Prior research related to our thesis is presented in this chapter.
• Chapter 4: Mapping Denoising Image Reconstruction (MDIR). This chapter delves into our proposed MDIR model.
• Chapter 5: Experiments. This chapter contains the evaluation metrics used to evaluate our results, our experimental settings, as well as the results we attained.
• Chapter 6: Conclusion. This chapter sums up our thesis and our contributions, and some future work is mentioned.
Chapter 2
Background
Prior to discussing our methodology, we present the intuition behind the components used in our approach. These include robust and powerful models such as the Variational Autoencoder (VAE) and the Contrastive Language-Image Pretraining (CLIP) model.
2.1 Vision Language Models
Many deep neural network (DNN) training processes for visual recognition heavily depend on crowd-labeled data, often involving the training of a separate DNN for each specific visual recognition task. This conventional approach results in a laborious and time-consuming visual recognition paradigm. To tackle these challenges, there has been considerable recent research into Vision-Language Models (VLMs). These models leverage large-scale image-text pairs from the internet to learn rich vision-language correlations, offering the potential for zero-shot predictions across diverse visual recognition tasks using a single VLM.
2.1.1 Vision-Language Model Foundations
A Vision-Language Model (VLM) undergoes pre-training using extensive image-text pairs readily available on the internet. The pre-trained VLM can then be directly employed for various downstream visual recognition tasks without the need for fine-tuning, as illustrated in Figure 2.1.1. The pre-training of the VLM is guided by specific vision-language objectives, facilitating the learning of image-text correspondences from the vast image-text pairs dataset. This process employs a contrastive objective, wherein the VLM learns by bringing paired images and texts close together while pushing others farther apart in the embedding space. Consequently, the pre-trained VLMs acquire comprehensive knowledge of vision-language correspondences, allowing them to make zero-shot predictions by matching the embeddings of any given images and texts. This novel learning paradigm makes effective use of web data and enables zero-shot predictions without the need
for task-specific fine-tuning. It is a straightforward implementation that delivers impressive results, as demonstrated by the superior zero-shot performance achieved by the pre-trained CLIP model on 36 visual recognition tasks, ranging from classic image classification to human action and optical character recognition.

Figure 2.1.1: The new learning paradigm with VLMs enables effective usage of web data and zero-shot predictions without task-specific fine-tuning.
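To make the zero-shot prediction mechanism concrete, the following sketch classifies an image with a pre-trained CLIP model by comparing its embedding against prompt embeddings built from nothing but class names. This is an illustrative example rather than code from this thesis; the image file, label set, and the ViT-B/32 checkpoint are placeholder assumptions.

import torch
import clip  # OpenAI's CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # pretrained CLIP

# Build a zero-shot "classifier" purely from class names: each class becomes a
# text prompt embedded with the text encoder.
class_names = ["dog", "cat", "car"]  # hypothetical label set
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image

with torch.no_grad():
    image_emb = model.encode_image(image)   # (1, 512) image embedding
    text_emb = model.encode_text(prompts)   # (3, 512) one embedding per class
    # Cosine similarity in the shared embedding space decides the prediction.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_emb @ text_emb.T  # temperature-scaled similarities
    probs = logits.softmax(dim=-1)

print(class_names[probs.argmax().item()])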
The primary goal of Vision-Language Model (VLM) pre-training is to impart the model with the ability to understand and correlate images and texts effectively, ultimately enabling proficient zero-shot predictions in visual recognition tasks. The process involves leveraging image-text pairs to train the VLM through specific pre-training objectives. To achieve this, the VLM employs both a text encoder and an image encoder on the given image-text pairs. These encoders extract features from the images and texts, respectively. The model then learns the correlation between vision and language using predefined pre-training objectives. The resulting vision-language correlation is crucial for subsequent zero-shot evaluations on new, unseen data, accomplished by comparing the embeddings of any provided images and texts. The VLM pre-training mechanism utilizes a deep neural network operating on N image-text pairs within a designated pre-training dataset. This neural network consists of an image encoder and a text encoder, responsible for encoding the features of the image and text in an image-text pair, producing an image embedding and a text embedding, respectively. The subsequent section delves into the architectural details of widely adopted deep neural networks in the context of VLM pre-training.
To be specific, the architecture widely utilized for learning image features is the Transformer-based architecture. The Vision Transformer utilizes a series of Transformer blocks, each composed of a multi-head self-attention layer and a feed-forward network. In the architecture depicted in Figure 2.1.2, the input image undergoes an initial step of being divided into fixed-size patches. These patches are then processed through linear projection and position embedding before being fed into the Transformer encoder. This allows the model to capture complex relationships within the image. In the context of Vision-Language Model (VLM) studies, subtle adjustments are made, often including the addition of an extra normalization layer preceding the Transformer encoder. This modified architecture, building upon the Transformer framework, is designed to efficiently extract features from images, making it well-suited for tasks involving both vision and language understanding in VLM pre-training.
The utilization of the Transformer and its variations is also widespread in the realm of learning text features.

Figure 2.1.2: Architecture of the Vision Transformer.

The standard Transformer follows an encoder-decoder structure: the encoder consists of 6 blocks, each incorporating a multi-head self-attention layer and a multi-layer perceptron (MLP). The decoder mirrors this structure, comprising 6 blocks, each hosting 3 sub-layers: a multi-head attention layer, a masked multi-head attention layer, and an MLP. Numerous Vision-Language Model (VLM) studies, such as CLIP [58], adhere to the standard Transformer with minor adjustments, akin to those found in GPT-2. These studies typically involve training from scratch, foregoing the use of pre-trained GPT-2 weights for initialization.
To learn rich vision-language correlations, VLMs utilize contrastive learning. More specifically, contrastive learning allows VLMs to learn discriminative representations by pulling paired samples close together and pushing others far apart in the feature space, which can be defined as follows:

$$\mathcal{L}_{I \rightarrow T} = -\sum_{i \in B} \frac{1}{|P(i)|} \sum_{k \in P(i)} \log \frac{\exp\left(z_i^{I} \cdot z_k^{T} / \tau\right)}{\sum_{j \in B} \exp\left(z_i^{I} \cdot z_j^{T} / \tau\right)} \qquad (2.1.1)$$

$$\mathcal{L}_{T \rightarrow I} = -\sum_{i \in B} \frac{1}{|P(i)|} \sum_{k \in P(i)} \log \frac{\exp\left(z_i^{T} \cdot z_k^{I} / \tau\right)}{\sum_{j \in B} \exp\left(z_i^{T} \cdot z_j^{I} / \tau\right)} \qquad (2.1.2)$$

where $k \in P(i) = \{k \mid k \in B,\ y_k = y_i\}$, $y$ is the category label of the pair $(x^I, x^T)$, $z^I$ and $z^T$ are the image and text embeddings, $B$ is the batch, and $\tau$ is a temperature parameter. With Eqs. 2.1.1 and 2.1.2, the image-text label InfoNCE loss is defined as $\mathcal{L}^{IT}_{\text{InfoNCE}} = \mathcal{L}_{I \rightarrow T} + \mathcal{L}_{T \rightarrow I}$.
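As an illustration of the contrastive objective, the following sketch implements a simple symmetric image-text InfoNCE loss over a batch of paired embeddings. It is a minimal example assuming one matching caption per image (the special case where P(i) = {i}); the batch size, embedding dimension, and temperature are arbitrary placeholder values.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss for paired image/text embeddings.

    image_emb, text_emb: (B, D) tensors where row i of each tensor comes from
    the same image-text pair, i.e. the positives sit on the diagonal.
    """
    # Normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)     # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random tensors standing in for encoder outputs.
img = torch.randn(8, 512)   # placeholder image embeddings
txt = torch.randn(8, 512)   # placeholder text embeddings
print(clip_contrastive_loss(img, txt).item())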
Figure 2.1.3: Summary of the CLIP approach. (1) Contrastive pre-training. (2) Creating a dataset classifier from label text.
2.1.2 Contrastive Language-Image Pretraining Model
CLIP, which stands for Contrastive Language-Image Pre-training, represents a multimodal learning framework crafted by OpenAI. Its training involves acquiring visual concepts through natural language guidance and establishing a connection between textual and visual information. The model undergoes joint training on an extensive dataset comprising images and their associated textual descriptions, akin to the zero-shot capabilities observed in GPT-2 and GPT-3.
As a Vision-Language model, CLIP is a deep neural network that contains two encoders: an image encoder and a text encoder. The dual encoders generate embeddings within a common vector space. This shared embedding space enables CLIP to analyze and understand the connections between text and image representations, facilitating the learning of their inherent relationships. CLIP provides various pre-trained models based on different vision backbones, such as ResNet and the Vision Transformer (ViT).
CLIP undergoes pre-training using an extensive dataset comprising 400 million pairs of image and text data sourced from the internet. Throughout this pre-training process, the model encounters pairs consisting of images and corresponding text captions. Among these pairs, some accurately align (where the caption precisely describes the image), while others present mismatches. This diverse dataset facilitates the creation of shared latent space embeddings as CLIP learns to capture the intricate relationships between visual and textual elements. This pivotal characteristic positions CLIP as a breakthrough in Vision-Language models, enabling its application across a spectrum of tasks and achieving state-of-the-art performance.
2.2 Diffusion Models
Diffusion Models (DMs), also referred to as Denoising Diffusion Probabilistic Models (DDPMs), are generative models that have been gaining popularity in the field of deep learning; their latent-space variant was introduced by Robin Rombach et al. in 2021. Diffusion models capture a gradual noising process applied to the data and learn the dynamics needed to reverse it. This approach allows for state-of-the-art synthesis results on image data and beyond, while also providing a guiding mechanism to control the image generation process without retraining.
2.2.1 Intuition Of Diffusion Models
Researchers previously concentrated on Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and autoregressive models for AI generation. An innovative generative model was proposed by Ho et al. in 2020, known as the Denoising Diffusion Probabilistic Model (DDPM). Diffusion models are fundamentally different from all the previous generative methods, as illustrated in Figure 2.2.1. Intuitively, they aim to decompose the image generation process (sampling) into many small "denoising" steps. With their ability to generate high-quality, high-fidelity images, diffusion models have risen to become state-of-the-art in generative AI, drawing a great deal of attention in research and changing the game. Diffusion models recently achieved state-of-the-art results for most image tasks, including text-to-image generation with DALL-E, but also many other image-generation-related tasks, like image inpainting, style transfer, and image super-resolution.
Figure 2.2.1: Forward diffusion and reverse diffusion process.
The diffusion process involves the incremental addition of Gaussian noise to an input image $x_0$, accomplished through a sequence of $T$ steps. Referred to as the forward process, it is distinct from the forward pass of a neural network. This initial stage is pivotal in creating the targets for subsequent neural network training (the image after applying $t < T$ noise steps). Afterward, a neural network (known as a noise predictor) is trained to recover the original data by reversing the noising process. By being able to model the reverse process, we can generate new data.
Diffusion models can be seen as latent variable models. Latent means that we are referring to a hidden continuous feature space; in this way, they may look similar to variational autoencoders (VAEs).
In practice, they are formulated using a Markov chain of $T$ steps. Here, a Markov chain means that each step only depends on the previous one, which is a mild assumption. Importantly, we are not constrained to using a specific type of neural network, unlike flow-based models.
Given a data point $x_0$ sampled from the real data distribution $q(x)$ ($x_0 \sim q(x)$), one can define a forward diffusion process by adding noise. Specifically, at each step of the Markov chain we add Gaussian noise with variance $\beta_t$ to $x_{t-1}$, producing a new latent variable $x_t$ with distribution $q(x_t \mid x_{t-1})$. This diffusion process can be formulated as follows:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \mu_t = \sqrt{1-\beta_t}\, x_{t-1},\ \Sigma_t = \beta_t \mathbf{I}\right)$$

Since we are in the multi-dimensional scenario, $\mathbf{I}$ is the identity matrix, indicating that each dimension has the same variance $\beta_t$. Note that $q(x_t \mid x_{t-1})$ is still a normal distribution, defined by the mean $\mu_t = \sqrt{1-\beta_t}\, x_{t-1}$ and the variance $\Sigma_t = \beta_t \mathbf{I}$, where $\Sigma_t$ will always be a diagonal matrix of variances (here $\beta_t$).

Thus, we can go in closed form from the input data $x_0$ to $x_T$ in a tractable way. Mathematically, this is the posterior probability and is defined as:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$
Variance Schedule:
The variance parameter $\beta_t$ can be fixed to a constant or chosen as a schedule over the $T$ timesteps. In fact, one can define a variance schedule, which can be linear, quadratic, cosine, etc. The original DDPM authors utilized a linear schedule increasing from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$. In a later paper, it was shown that employing a cosine schedule works even better.
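The forward process and the variance schedule are straightforward to express in code. The sketch below is a minimal illustration, not the exact implementation used later in this thesis: it builds a linear beta schedule and draws $x_t$ directly from $q(x_t \mid x_0)$ in closed form using the cumulative products $\bar{\alpha}_t = \prod_s (1-\beta_s)$; the number of timesteps and the image shape are placeholder choices.

import torch

T = 1000                                   # number of diffusion steps (placeholder)
betas = torch.linspace(1e-4, 0.02, T)      # linear variance schedule from the DDPM paper
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)   # \bar{alpha}_t

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    abar_t = alphas_cumprod.to(x0.device)[t].view(-1, 1, 1, 1)  # broadcast over image dims
    return abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * noise

# Toy usage: noise a batch of random "images" at random timesteps.
x0 = torch.randn(4, 3, 64, 64)             # placeholder images
t = torch.randint(0, T, (4,))
eps = torch.randn_like(x0)
xt = q_sample(x0, t, eps)
print(xt.shape)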
Reverse Diffusion Process:
As $T \to \infty$, the latent $x_T$ tends towards an approximately isotropic Gaussian distribution. Therefore, by successfully learning the inverse distribution $q(x_{t-1} \mid x_t)$, we can sample $x_T$ from $\mathcal{N}(0, \mathbf{I})$, proceed through the reverse process, and obtain a sample from $q(x_0)$, creating a new data point from the original data distribution.

In practical terms, we lack precise knowledge of $q(x_{t-1} \mid x_t)$. This is due to its complexity, as estimating $q(x_{t-1} \mid x_t)$ involves computations over the entire data distribution.

To address this, we opt to approximate $q(x_{t-1} \mid x_t)$ with a parameterized model $p_\theta$, such as a neural network. As $q(x_{t-1} \mid x_t)$ is also expected to be Gaussian, especially for sufficiently small $\beta_t$, we can design $p_\theta$ as Gaussian, focusing on defining its mean and variance:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

If we apply the reverse formula for all timesteps ($p_\theta(x_{0:T})$, also called the trajectory), we can go from $x_T$ back to the data distribution:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$
By additionally conditioning the model on the timestep $t$, it will learn to predict the Gaussian parameters for each timestep.

Figure 2.2.2: Training and sampling algorithms of DDPMs.
To do that, we need to train a model by optimizing the negative log-likelihood of the training data. The simplified version of the objective is:

$$L_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon,\, t}\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\right)\right\|^2\right]$$
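The two algorithms summarized in Figure 2.2.2 correspond to a training step that regresses the injected noise and a sampling loop that walks the reverse chain. The sketch below is a hedged illustration of both, reusing the schedule tensors and q_sample from the previous snippet; model stands for any noise-prediction network (for example a U-Net) that is assumed to take (x_t, t) and return a tensor with the same shape as x_t.

import torch
import torch.nn.functional as F

def training_step(model, x0):
    """One DDPM training step: minimize || eps - eps_theta(x_t, t) ||^2."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    xt = q_sample(x0, t, eps)                 # closed-form forward diffusion
    eps_pred = model(xt, t)                   # predict the injected noise
    return F.mse_loss(eps_pred, eps)

@torch.no_grad()
def sample(model, shape):
    """Ancestral sampling: start from pure noise and denoise for T steps."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps_pred = model(x, torch.full((shape[0],), t, dtype=torch.long))
        coef = (1.0 - alphas[t]) / (1.0 - alphas_cumprod[t]).sqrt()
        x = (x - coef * eps_pred) / alphas[t].sqrt() + betas[t].sqrt() * z
    return x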
In order for the model's input and output to be the same size, a U-Net was employed. A U-Net is a symmetric architecture (Figure 2.2.3) that uses skip connections between encoder and decoder blocks of corresponding feature dimensions, with input and output having the same spatial size. Typically, the input image is down- and up-sampled to recover its original size. Additionally, U-Nets help generate high-quality images at high resolution. A pipeline of multiple diffusion models can be used at increasing resolutions. Noise conditioning augmentation between pipeline models is significant to the final image quality; it applies strong data augmentation to the conditioning input $z$ of each super-resolution model $p_\theta(x \mid z)$. The conditioning noise helps reduce compounding errors in the pipeline setup.
Subsequently, a neural network undergoes training to reverse the noising process, aiming to recover the original data. This reverse diffusion process, often termed the sampling process of a generative model, holds the key to generating new data. This thesis delineates the mechanisms underlying both the forward and reverse diffusion processes, emphasizing their significance in the realm of generative models.
Guided Diffusion:
A vital aspect of image generation is conditioning the sampling process to manipulate the generated
sam-ples To further "guide" the generation, techniques have even been developed that incorporate image
embed-dings into the diffusion Mathematically, the guidance refers to conditioning a prior data distribution p(x) with
a condition y (the class label or an image/text embedding, resulting in p(z |) To turn a diffusion model pg into
a conditional diffusion model, we can add conditioning information y at each diffusion step
T
po(ror | U) = po(@r) [1-1 Po(we-1 | te, 9)
Overall, guided diffusion models aim to learn V log po (x; | y) we could formulate it into a formulation:
V log p(t | y) = Vlog pe(xt) + s - V log(pa(w | x2)
Figure 2.2.3: The U-Net architecture.

To do that, we need to refer to a technique called classifier guidance. Sohl-Dickstein et al. illustrated that we can use a second model $f_\phi(y \mid x_t, t)$ to guide the diffusion toward the target class $y$ during training. A classifier $f_\phi(y \mid x_t, t)$ is trained on the noisy image $x_t$ to predict its class $y$. Then we can use its gradient to guide the diffusion.
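A minimal sketch of classifier guidance is shown below: at each reverse step, the gradient of a noisy-image classifier's log-probability with respect to x_t is added to the predicted mean, scaled by a guidance weight. This is illustrative only; classifier is an assumed model trained on noisy images, guidance_scale is a placeholder hyperparameter, and the reverse-step mean and variance are assumed to come from the sampler.

import torch
import torch.nn.functional as F

def classifier_grad(classifier, x_t, t, y, guidance_scale=1.0):
    """Gradient of log p_phi(y | x_t, t) w.r.t. x_t, used to steer sampling."""
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        logits = classifier(x_in, t)                        # classifier applied to the noisy input
        log_prob = F.log_softmax(logits, dim=-1)
        selected = log_prob[torch.arange(len(y)), y].sum()  # log p(y | x_t) for the target classes
        grad = torch.autograd.grad(selected, x_in)[0]
    return guidance_scale * grad

@torch.no_grad()
def guided_step(mean, variance, grad):
    """Shift the reverse-step mean in the direction that increases p(y | x_t)."""
    return mean + variance * grad + variance.sqrt() * torch.randn_like(mean)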
The most common model used to guide a diffusion model to generate images as we expect is CLIP (Contrastive Language-Image Pretraining). CLIP, a product of OpenAI's groundbreaking research, is a neural network architecture trained through contrastive learning on a vast dataset encompassing diverse images and their associated textual descriptions. This unique training approach empowers CLIP to understand intricate associations between images and text, effectively learning joint representations that capture the relationships between the two modalities.
By leveraging the joint image-text representations learned by CLIP, the diffusion model attains a more robust foundation for understanding correlations between images and their corresponding textual contexts. The model gains a nuanced comprehension of the complex interplay between visual and textual elements, which significantly contributes to its ability to produce contextually accurate and coherent outputs.
This joint comprehension facilitates the diffusion model's capacity to generate more contextually precise and cohesive responses, particularly when the task involves producing content that encompasses both images and text. It enables the model to synthesize responses that reflect a deeper understanding of the relationship between visual and textual elements, resulting in more accurate, relevant, and meaningful outputs.
The fusion of CLIP's capabilities within the diffusion model not only expands the model's horizons by imbuing it with the ability to comprehend and relate visual and textual information, but also equips it with the potential to generate more sophisticated and contextually relevant image-based outputs. This integration ultimately holds promise for applications requiring the synthesis of images in response to textual prompts, presenting a new frontier in AI-driven content generation and comprehension.
2.2.2 Latent Diffusion Models
Although diffusion models achieved impressive results, one of their limitations is that they work sequentially on the whole image, meaning that both the training and inference times are expensive. The image space is enormous: a 512x512 image with three color channels (red, green, and blue) is a 786,432-dimensional space, which is a great many values for one image. That is the reason it takes hundreds of GPUs to train such a model and why we have to wait several minutes to get our results, restricting the playground to only the biggest companies like Google or OpenAI.
To overcome this limitation, Rombach et al. transformed diffusion models into Latent Diffusion Models. This means that they implemented the diffusion approach within a compressed image representation instead of the image itself and then worked to reconstruct the image. So they are no longer working with the pixel space, or regular images. Working in such a compressed space not only allows for more efficient and faster generation, as the data size is much smaller, but also allows for working with different modalities. Since the inputs are encoded, you can feed in any kind of input, like images or texts, and the model will learn to encode these inputs in the same sub-space that the diffusion model will use to generate an image.
The Latent Diffusion Model (LDM) is an innovative approach that employs an initial image, denoted as $X$, which is then encoded into a condensed, information-rich space known as the latent space $Z$. This encoding process resembles that of a Generative Adversarial Network (GAN). It involves an encoder model that extracts essential information from the image, akin to downsampling, reducing its size while preserving as much pertinent information as possible. This results in a condensed representation of the input within the latent space.
Following this, the model incorporates additional conditioning inputs, such as text or other images, merging them with the condensed image representation using an attention mechanism. This attention mechanism, a feature of transformer models, learns the optimal way to combine these inputs within the latent space. The merged inputs then serve as the initial noise for the subsequent diffusion process.
The diffusion process within the latent space begins with an initial random noise signal. This signal undergoes a sequence of steps, gradually refining and transforming within the latent space. With each iteration, the noise evolves into more organized and structured representations. Throughout this iterative process, the latent space encapsulates increasingly meaningful features and patterns, culminating in the generation of high-quality, diverse data samples. More specifically, the key idea of LDMs is to separate the training into two phases: perceptual image compression and latent diffusion.
Perceptual Image Compression:
In the initial phase, Rombach et al. utilize an autoencoder training method that combines a perceptual loss and a patch-based adversarial objective. This approach is designed to ensure reconstructed images adhere closely to the image manifold, fostering local realism and reducing the blurriness that often arises when relying solely on pixel-space losses such as L2 or L1 objectives. Starting with an image $x$ in RGB space, the encoder transforms $x$ into a latent representation $z$, followed by the decoder reconstructing the image based on this derived latent representation.
Specifically, in the input layer, images are encoded by an AutoKL Encoder. This process converts a high-dimensional RGB image into a condensed two-dimensional representation, enabling the Diffusion Probabilistic Model (DPM) to operate within this structured latent space where intricate, high-frequency details are abstracted. This more compressed latent space is better suited for likelihood-based generative models, allowing them to focus on the essential, meaningful aspects of the data and train in a computationally efficient, lower-dimensional space.
Additionally, to prevent a high degree of variability in the latent space, Rombach et al. experiment with two forms of regularization: KL-reg, resembling a Variational Autoencoder (VAE), and VQ-reg. Consequently, the forward chain within a Diffusion Probabilistic Model is considered during autoencoder training, enabling the encoder to estimate the distribution of the learned latent space during the learning process.
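As a concrete illustration of the compression stage, the sketch below encodes an image into the latent space with a pretrained KL-regularized autoencoder from the Hugging Face diffusers library and decodes it back. This is an assumed setup shown for demonstration only, not the exact component used by the methods in this thesis; the checkpoint name, scaling factor, and image path are placeholders.

import torch
from PIL import Image
from torchvision import transforms
from diffusers import AutoencoderKL

# Pretrained KL-regularized autoencoder (the "AutoKL" encoder/decoder used by Stable Diffusion).
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

to_tensor = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),                  # values in [0, 1]
])
x = to_tensor(Image.open("example.jpg").convert("RGB")).unsqueeze(0) * 2 - 1  # scale to [-1, 1]

with torch.no_grad():
    posterior = vae.encode(x).latent_dist   # approximate posterior q(z | x)
    z = posterior.sample() * 0.18215        # latent, with Stable Diffusion's conventional scaling
    x_rec = vae.decode(z / 0.18215).sample  # reconstruction back in pixel space

print(x.shape, z.shape, x_rec.shape)        # e.g. (1,3,512,512) -> (1,4,64,64) -> (1,3,512,512)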
Latent Diffusion Models:
Once the autoencoder is trained, the Diffusion Probabilistic Model (DPM) advances through its forward chain in the subsequent phase by directly sampling $z$ from the encoder. To facilitate the reverse process, Rombach et al. implement a time-conditional U-Net. Moreover, they harness the DPM's capacity to model conditional distributions, transforming their proposed Latent Diffusion Model (LDM) into a more adaptable conditional image generator. This is achieved by enhancing the foundational U-Net with a cross-attention mechanism, which is particularly effective for attention-based models learning from diverse input modalities.
Rombach et al. introduce an additional encoder, functioning as a transformer. This encoder is responsible for converting input from a different modality, typically in the form of textual prompts, into an intermediary representation. Subsequently, this representation is aligned with the layers of the U-Net using multi-head cross-attention. Throughout this second stage, both the time-conditional U-Net and the encoder are concurrently optimized. The proposed process is visually depicted in Figure 2.2.4. This integration of language representation and visual synthesis culminates in a robust model that demonstrates remarkable adaptability to complex, user-defined text prompts.
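To illustrate how the conditioning enters the U-Net, the following sketch implements a single cross-attention block in which the spatial latent features act as queries and the text-encoder tokens act as keys and values. It is a simplified, single-head version written for clarity; the dimensions are placeholder choices, and real LDM implementations use multi-head attention with additional normalization and projection layers.

import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: image latents attend to conditioning tokens."""
    def __init__(self, latent_dim: int, context_dim: int, inner_dim: int = 256):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, inner_dim, bias=False)   # queries from latents
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)  # keys from text tokens
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)  # values from text tokens
        self.to_out = nn.Linear(inner_dim, latent_dim)
        self.scale = inner_dim ** -0.5

    def forward(self, latents: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # latents: (B, N_pixels, latent_dim), context: (B, N_tokens, context_dim)
        q, k, v = self.to_q(latents), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, N_pixels, N_tokens)
        return self.to_out(attn @ v)                                        # project back to latent_dim

# Toy usage: a 64x64 latent grid flattened to 4096 tokens, conditioned on 77 text tokens.
block = CrossAttention(latent_dim=320, context_dim=768)
out = block(torch.randn(1, 4096, 320), torch.randn(1, 77, 768))
print(out.shape)  # torch.Size([1, 4096, 320])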
Essentially, the model navigates and explores the latent space, progressively refining the initial noise into coherent data representations. Each step in the diffusion process enhances the quality and richness of the generated data by refining the latent representations.
The LDM fundamentally utilizes diffusion processes to transition from randomness to structured, realistic data. It achieves this by iteratively refining the latent representations, thereby transforming noise into meaningful data. This progressive refinement constitutes the foundation of the Latent Diffusion Model, enabling the generation of diverse and high-fidelity data across various domains, including images, text, and more.
Additionally, the Latent Diffusion Model stands out for its efficiency, as the latent space is significantly smaller than the original pixel space. Because of the significantly reduced data size, working in such a compressed space not only enables working with diverse modalities but also makes generation more efficient and faster. You can give it any type of input, such as texts or images, because the inputs are encoded; the model learns to encode these inputs in the same sub-space that the diffusion model uses to generate images. One model can then use text or visuals to direct generation, much like the CLIP model.
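Putting the pieces together, the sketch below runs a pretrained latent diffusion model end to end through the Hugging Face diffusers library: the text encoder produces the conditioning, the U-Net denoises in latent space, and the autoencoder decodes the result to pixels. It is an illustrative example of the general LDM workflow rather than the reconstruction pipeline proposed in this thesis; the checkpoint name, prompt, and sampling settings are placeholder assumptions.

import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained latent diffusion (Stable Diffusion) pipeline.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# The prompt is encoded by the CLIP text encoder, the U-Net iteratively denoises a
# random latent under that conditioning, and the VAE decoder maps it back to pixels.
image = pipe(
    "a vase of red roses on a wooden table",   # placeholder text condition
    num_inference_steps=50,                    # number of denoising steps
    guidance_scale=7.5,                        # classifier-free guidance strength
).images[0]

image.save("ldm_sample.png")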
Trang 29Related Work
In this chapter, we delve into the fascinating topic of visual stimulus reconstruction by exploring human brain activity and its common representations, which can be used for visual stimulus reconstruction. Alongside an analysis of the advantages and disadvantages of these representations, we present a rich literature review on human brain signal-based image reconstruction. This includes an exploration of Machine Learning-based methods, VAE-based methods, GAN-based methods, and Diffusion-based methods. Through this journey, we aim not only to provide insights into the methodologies but also to ignite curiosity about the profound implications of understanding and reconstructing visual stimuli from the human brain.
3.1 Human Brain Activity And Its Representations
For a long time, scientists have desired to research how the human brain reacts to the world. This fascination has led to a profound exploration of human brain activity and its representations, delving into the intricate mechanisms that govern cognitive processes and shape our perceptions of the surrounding environment. Understanding the complex interplay between neural signals, cognitive functions, and the formation of mental representations holds the key to unlocking the mysteries of consciousness and paving the way for advancements in fields such as neuroscience, psychology, and artificial intelligence.
3.1.1 Human Brain Activity
Human brain activity refers to the intricate and dynamic electrochemical processes occurring within the brain, orchestrating the vast network of neurons that enables cognition, perception, and various bodily functions. At its core, brain activity involves the generation and transmission of electrical impulses between neurons, forming complex neural circuits. This intricate dance of neurotransmitters and electrical signals allows the brain to process information, store memories, and regulate bodily functions. Neurotransmitters act as messengers, facilitating communication between neurons by crossing synapses. Brain activity is not confined to conscious thought; it also encompasses subconscious processes such as regulating heartbeat, breathing, and maintaining homeostasis. Employing advanced neuroimaging techniques such as functional magnetic resonance imaging (fMRI), electroencephalography (EEG), and magnetoencephalography (MEG) enables the study of these patterns, unveiling correlations between specific cognitive tasks and neural responses. The brain's plasticity, or its ability to adapt and reorganize, is reflected in changing patterns of activity, influenced by learning, experience, and environmental factors. Understanding human brain activity is essential for unraveling the mysteries of consciousness and mental health, and for developing treatments for neurological disorders.
3.1.2 Signal Representations
Currently, there are many techniques to capture human brain activity. Nonetheless, Electroencephalography (EEG), Magnetoencephalography (MEG), and functional magnetic resonance imaging (fMRI) are the most common neuroimaging techniques used to elicit brain behavior. By using these techniques, different activity patterns can be measured within the brain to decode the content of mental processes, especially visual and auditory content. The resulting datasets capture a spectrum of neural signals and responses, enabling the study of brain activities during various cognitive tasks, sensory perceptions, motor actions, and emotional processing. The definition and analysis of these techniques are provided in the following.
EEG measures the electrical activity of our brain via electrodes that are placed on the scalp. The information collected from these surface measurements indicates how active the brain is. This can be useful for quickly determining how brain activity changes in response to stimuli, and can also be useful for measuring abnormal activity, such as with epilepsy.
Electroencephalography (EEG) presents several notable advantages as a neuroimaging technique. One of its primary strengths lies in its exceptional temporal resolution, capable of capturing brain activity in real time. This means that EEG can detect and monitor neural activity within milliseconds, allowing researchers to observe and analyze rapid changes in brain function, an attribute particularly beneficial in the study of dynamic brain processes. Another significant advantage of EEG is its non-invasive nature. This technique involves placing electrodes on the scalp, making it a safe and comfortable method for individuals participating in studies. Being non-invasive reduces the risk and discomfort commonly associated with other, more invasive neuroimaging methods.
Moreover, EEG is relatively more affordable and accessible compared to other neuroimaging technologies. The cost-effectiveness and ease of access to EEG equipment make it a preferred choice for many researchers and clinicians, enabling a wider range of studies and applications. Furthermore, the versatility of EEG is noteworthy. It finds application across various fields, from clinical diagnostics to cognitive neuroscience and brain-computer interface research. Its adaptability allows for a broad spectrum of investigations, making EEG a valuable tool in exploring brain activity and cognitive functions.
Electroencephalography (EEG), despite its many strengths, suffers from a significant drawback in the form of limited spatial resolution. This restriction is primarily due to the methodology of measuring electrical brain activity via electrodes placed on the scalp. The signals detected by EEG are subject to distortion and attenuation as they pass through the skull and surrounding tissues. Consequently, the accurate localization of the exact source of neural activity becomes a challenging task. EEG struggles to precisely pinpoint the specific brain regions contributing to the recorded electrical signals. This limitation hampers the ability to discern deeper brain structures: neural activity from these regions faces substantial distortion as it travels through the skull and superficial tissues, making it difficult to accurately locate the source.
Magnetoencephalography (MEG) is a cutting-edge neuroimaging method that captures the complex neural dynamics within the human brain by detecting neuromagnetic fields outside the skull. It shares similarities with EEG in providing exceptional temporal resolution, registering brain activity in milliseconds. However, MEG distinguishes itself with a clear advantage in spatial resolution: this technology excels at mapping the spatial distribution of neural processes, pinpointing brain activity locations with remarkable accuracy, often reaching mere millimeters. This capability allows for an in-depth exploration of the specific and dynamic neural mechanisms governing human cognition and behavior.
By measuring the spatial distribution of the magnetic field outside the head, the locations of the neural sources of the MEG signals can be estimated. The spatial resolution of MEG depends on the characteristics of the underlying activity.

MEG directly measures the brain's magnetic fields, offering an accurate assessment of neuronal activity. This direct measurement not only provides a clearer understanding of the brain's functional organization but also makes MEG less susceptible to interference caused by the skull's electrical properties, enhancing its effectiveness in studying brain regions close to the skull and deep brain structures. This sensitivity to deep brain structures is particularly beneficial for understanding neurological conditions originating in these areas.

While both technologies complement each other, it is crucial to note that only MEG provides precise timing together with spatial information about brain activity. MEG signals stem directly from neuronal electrical activity, allowing data to be collected even from subjects in a sleeping state. This characteristic opens up the potential for real-world applications of MEG data, expanding its utility beyond controlled laboratory settings.
Although MEG provides good temporal resolution (detecting changes in brain activity quickly) and its spatial resolution is better than that of EEG, its spatial resolution is still limited. It can identify the brain region where activity is occurring but might not offer precise localization within that region. Additionally, acquiring MEG data is expensive, which can limit its accessibility in many research settings.
Overcoming the limitations of MEG and EEG, functional MRI (fMRI) offers superior spatial resolution, facilitating highly accurate localization of brain functions and activities. It helps researchers analyze brain activity and use this data effectively in neural decoding.
Functional Magnetic Resonance Imaging (fMRI) is a powerful neuroimaging technique that offers substantial advantages in decoding brain activity, particularly in the domain of visual processing. By capturing changes in blood flow and oxygenation levels, fMRI grants researchers the ability to discern and map the engagement of specific brain regions during visual tasks or stimulus processing. This capability proves invaluable in understanding the intricate neural mechanisms underlying visual decoding.
One significant advantage of fMRI lies in its non-invasive nature, making it a safe and widely accessible technique. By measuring brain activity via blood oxygen level-dependent (BOLD) signals, fMRI enables researchers to infer neural engagement without intruding into the brain's physiological integrity, offering a safe means to study cognitive functions.

Figure 3.1.1: Magnetoencephalography (MEG) data have a high temporal resolution (on the order of msec), which allows us to directly assess latency differences in these neural responses. Functional magnetic resonance imaging (fMRI) measures have high spatial resolution (on the order of mm), which allows us to pinpoint the location of activity associated with a sensorimotor event.
Moreover, the spatial resolution provided by fMRI is exceptional. This high-resolution mapping capability is instrumental in discerning the detailed localization and temporal sequence of brain activity during visual tasks. This precision aids in identifying and differentiating the neural activation patterns associated with various aspects of visual processing, such as object recognition, visual attention, or even the perception of motion. The ability to pinpoint specific brain regions involved in these processes provides a comprehensive understanding of how the brain processes visual information.
Furthermore, fMRI facilitates the examination of neural connectivity and interactions among the different brain regions involved in visual decoding. By analyzing functional connectivity, researchers can uncover how these regions communicate and coordinate during visual tasks, contributing to a more holistic comprehension of the neural networks engaged in visual processing.
The absence of known side effects associated with fMRI further enhances its appeal as a tool for investigating visual decoding. Its non-invasive nature, coupled with its ability to deliver detailed spatial and temporal information, makes fMRI an indispensable asset in unraveling the complexities of visual cognition and understanding the underlying neural mechanisms associated with various visual processes.
The superior spatial resolution and comprehensive whole-brain coverage of functional Magnetic Resonance Imaging (fMRI) have captured significant attention from researchers in the fields of Neuroscience and Computer Science. These advantages position fMRI as a leading technique in brain decoding, particularly in the context of visual reconstruction. The wealth of high-quality fMRI data available for research purposes has been pivotal in facilitating significant strides in understanding the intricacies of brain activity during visual processing. Leveraging these robust datasets, researchers have achieved promising outcomes in brain decoding, utilizing fMRI to unveil the complex neural underpinnings of visual reconstruction. This advancement not only sheds light on the intricate mechanisms governing how the brain processes visual information but also enhances our understanding of the broader cognitive processes involved. Ultimately, the utilization of fMRI has enabled a deeper exploration and comprehension of the human brain, providing invaluable insights into its functions, behaviors, and capabilities.

The analysis above illustrates that functional Magnetic Resonance Imaging (fMRI) stands out as the most suitable modality for accurate visual stimulus reconstruction.
3.2 Human Brain Signal-based Image Reconstruction
Addressing the challenge of understanding visual cognition at a deeper level, human brain signal-based image reconstruction emerges as an innovative solution at the intersection of neuroscience and computer science. The problem lies in the complexity of decoding and reconstructing visual information from the intricate neural signals generated during visual tasks. Researchers employ advanced techniques like functional magnetic resonance imaging (fMRI) and electroencephalography (EEG) to record and analyze these brain signals. The solution entails leveraging these signals to decipher the neural representations underlying visual stimuli, ultimately reconstructing images that closely mirror what individuals perceive. This approach not only provides valuable insights into the complexities of visual perception but also holds significant promise for applications in neuroscience, such as brain-computer interfaces, neuroprosthetics, and rehabilitation. As our understanding of the human brain advances, human brain signal-based image reconstruction stands as a powerful tool for unraveling the mysteries of visual cognition and enhancing neurotechnological capabilities.
3.2.1 Definition
Human brain signal-based image reconstruction refers to the process of creating visual representations of images or scenes based on the neural activity recorded from the human brain. This technology aims to decode the complex patterns of electrical signals generated by the brain when an individual observes or imagines specific visual stimuli. By employing advanced neuroimaging techniques such as EEG, MEG, or fMRI, researchers can capture the complicated neural patterns associated with different visual experiences. Depending on the context and requirements, such as real-time processing or reconstruction accuracy, the appropriate technique is chosen. These signals are then subjected to decoding algorithms, which unravel the neural responses related to specific visual stimuli or mental tasks. The culmination of this process is the reconstruction of images that reflect the perceived visual information encoded in the brain signals.
The importance of human brain signal-based image reconstruction lies in its potential applications and relevance. By understanding how the brain represents and processes visual information, this approach contributes to the development of brain-computer interfaces, advancements in neuroimaging, and applications in medical diagnostics. It provides a unique window into the neural mechanisms underlying visual perception. Key characteristics include the intricate process of decoding the relevant brain signals and the subsequent image reconstruction. The complexity of this endeavor requires expertise in both neuroscience and computational methods to bridge the gap between observed neural activity and the resulting visual representations.

Figure 3.2.1: Framework diagram for the visual stimulus reconstruction task
EEG is a non-invasive technique, which makes it the most practical methodology for recording the electrophysiological dynamics of the brain. However, the incapacity of EEG to record spatial information is the main factor behind its poorer results in visual stimulus reconstruction. A complementary technique to EEG is MEG, which can measure brain activity at high spatial and temporal resolution by recording fluctuations of the magnetic fields elicited by the post-synaptic potentials of pyramidal neurons (https://ai.meta.com/static-resource/image-decoding). Therefore, these techniques can be applied to decode dynamic stimuli such as speech and video, and can be used in real-time applications. In contrast to EEG, fMRI's ability to visualize deep brain structures opens new frontiers for exploring complex cognitive functions. By creating detailed maps of brain regions engaged during specific tasks, fMRI contributes to a richer understanding of the neural underpinnings of various cognitive processes. In the context of visual stimulus reconstruction, fMRI's high spatial fidelity enables researchers to pinpoint with greater accuracy the regions involved in processing and responding to external stimuli.
3.2.2 Approaches
In the exploration of decoding neural activity and advancing our understanding of the brain, various approaches have been employed. Two prominent methodologies that have gained significant traction are machine learning-based methods and deep learning-based methods. These computational techniques harness the power of algorithms to analyze complex patterns within neuroimaging data, enabling researchers to uncover hidden insights and make predictions about brain function. In the following sections, we delve into the nuances of these approaches, highlighting their applications, strengths, and contributions to the burgeoning field of neuroscientific research.
Prior to the advent of deep learning, conventional approaches to natural image reconstruction relied on estimating a linear relationship between fMRI signals and manually crafted image features through the use of linear regression models. These techniques primarily concentrated on extracting predetermined low-level features from the stimulus images, such as local image structures or the responses of Gabor filters.
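As a rough, illustrative sketch of this classical pipeline (not the exact setup of the cited studies), the following fits a ridge-regularized linear map from fMRI voxel patterns to precomputed low-level image features; the array shapes, variable names, and the use of scikit-learn are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical training data: one row per stimulus presentation.
# X_fmri: voxel responses (n_samples, n_voxels); Y_feat: handcrafted
# image features such as Gabor filter energies (n_samples, n_features).
rng = np.random.default_rng(0)
X_fmri = rng.standard_normal((1200, 4500))
Y_feat = rng.standard_normal((1200, 256))

# A single multi-output ridge regression; alpha controls shrinkage and
# would normally be chosen by cross-validation on the training split.
decoder = Ridge(alpha=100.0)
decoder.fit(X_fmri, Y_feat)

# At test time, the predicted features are combined (e.g. as a weighted
# sum of fixed feature bases) to form a coarse reconstruction.
X_test = rng.standard_normal((50, 4500))
Y_pred = decoder.predict(X_test)   # (50, 256) predicted image features
print(Y_pred.shape)
```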
In recent years, there has been a notable shift toward the integration of deep learning-based methods in the domain of visual stimulus reconstruction. Deep learning methods have demonstrated remarkable capabilities in capturing the complex, high-dimensional relationships between neural signals and visual stimuli, surpassing the limitations of traditional linear regression models. Moreover, their multilayer architecture enables learning non-linear projections from human brain activity. In general, four families of approaches have proven reliable and effective: convolutional neural networks (CNNs), Generative Adversarial Networks (GANs), variational autoencoders (VAEs), and especially Diffusion Models (DMs).
Non-Generative Methods:
One notable approach is the use of convolutional neural networks (CNNs) for visual stimulus reconstruction. Unlike the hand-crafted features used in traditional methods, CNNs automatically learn hierarchical representations of the input data. These networks consist of multiple layers of convolutional filters that can capture both low-level and high-level features, enabling a more nuanced understanding of the relationships between neural responses and visual stimuli. CNNs were among the first deep learning models used for reconstruction and classification from human brain activity. At that time, only simple datasets were available, so training a CNN model from scratch to reconstruct images was still feasible.
In contrast to a simpler multilayer feed-forward neural network that overlooks the structural information of input images, convolutional neural networks (CNNs) exhibit superior feature-extraction capabilities. This is attributed to the information filtering carried out by convolutional layers within a localized pixel neighborhood. Stacking convolutional layers facilitates the learning of hierarchical visual features, commonly referred to as feature abstraction, from input images: the lower layers focus on grasping low-level details, while the higher layers extract overarching high-level visual information. CNNs find extensive application in image processing tasks, including image reconstruction, where architectures such as encoder-decoders, U-Net, generative adversarial networks, and variational autoencoders leverage stacked convolutional layers for feature extraction at multiple levels.
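To make the notion of hierarchical features concrete, the sketch below collects activations from several depths of a pre-trained VGG-19 using PyTorch/torchvision. The chosen tap indices, input size, and the weights argument (which assumes a recent torchvision release) are illustrative, not tied to any specific study.

```python
import torch
from torchvision import models

# Load a pre-trained VGG-19 and keep only its convolutional stack.
# The tap indices (end of each conv block) are illustrative; studies
# differ in which layers they treat as "low" vs "high" level features.
vgg = models.vgg19(weights="IMAGENET1K_V1").features.eval()
tap_points = {3: "conv1", 8: "conv2", 17: "conv3", 26: "conv4", 35: "conv5"}

def hierarchical_features(image: torch.Tensor) -> dict:
    """Run one normalized image (1, 3, 224, 224) through the VGG-19
    convolutional stack and collect activations at several depths."""
    feats, x = {}, image
    with torch.no_grad():
        for idx, layer in enumerate(vgg):
            x = layer(x)
            if idx in tap_points:
                feats[tap_points[idx]] = x.clone()
    return feats

feats = hierarchical_features(torch.randn(1, 3, 224, 224))
for name, f in feats.items():
    print(name, tuple(f.shape))  # shallow layers: large maps, few channels
```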
Figure 3.2.2: Overview of two variations of the frameworks proposed by Shen et al. (2019b): (A) ShenDNN and (B) ShenDNN+DGN. The yellow color denotes the use of pre-trained components.

Shen et al. employed a pre-trained DNN based on VGG-19 to extract hierarchical features from stimulus images (refer to Figure 3.2.2). The DNN comprises sixteen convolutional layers followed by three fully connected layers. This approach was inspired by the observation that hierarchical image representations from various layers of deep neural networks correlate with brain activity in the visual cortex. Leveraging this correlation, a hierarchical mapping is established from fMRI signals in the low- and high-level areas of the visual cortices to the corresponding multilayer DNN features through a decoder (D) that translates fMRI activity patterns into multilayer DNN features. Prior to the reconstruction task, the decoder D is trained on the training set. The decoded fMRI features align with the hierarchical image features derived from the DNN. Optimization in the feature space is then conducted by minimizing the disparity between the hierarchical DNN features of the image and the multilayer features decoded from fMRI activity.
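Below is a minimal sketch of this feature-space optimization step. A small stand-in network and random targets replace the pre-trained DNN and the features decoded from fMRI; it illustrates the idea rather than reproducing Shen et al.'s exact procedure.

```python
import torch
import torch.nn as nn

# Stand-in multilayer feature extractor; in practice this would be a
# pre-trained DNN (e.g. VGG-19) returning activations from several layers.
class TinyExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        f1 = self.block1(x)
        f2 = self.block2(f1)
        return [f1, f2]          # multilayer features

extractor = TinyExtractor().eval()
for p in extractor.parameters():
    p.requires_grad_(False)

# target_feats stands in for the multilayer features decoded from fMRI.
with torch.no_grad():
    target_feats = extractor(torch.rand(1, 3, 64, 64))

# Start from noise and optimize the pixels so that the image's features
# match the decoded targets across all layers.
image = torch.rand(1, 3, 64, 64, requires_grad=True)
opt = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    opt.zero_grad()
    loss = sum(nn.functional.mse_loss(f, t)
               for f, t in zip(extractor(image), target_feats))
    loss.backward()
    opt.step()
    image.data.clamp_(0.0, 1.0)   # keep pixels in a valid range
```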
Another non-generative approach is the encoder-decoder model, which finds extensive application in tasks such as image-to-image translation and sequence-to-sequence modeling. These models employ a two-stage architecture featuring an encoder (E) responsible for compressing the input vector x into a latent representation z = E(x), and a decoder (D) generating the output vector y = D(z) from this latent representation. The compressed latent representation z acts as a bottleneck, encapsulating a low-dimensional representation of the input. Training aims to minimize the reconstruction error, which quantifies the disparity between the reconstructed output and the ground-truth input.
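A minimal autoencoder sketch of this idea, showing the bottleneck z = E(x), the reconstruction y = D(z), and training by minimizing the reconstruction error; the dimensions and fully connected architecture are placeholders.

```python
import torch
import torch.nn as nn

# Minimal encoder-decoder: E compresses x into a bottleneck z = E(x),
# D reconstructs y = D(z); training minimizes the reconstruction error.
class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.encoder(x)        # low-dimensional bottleneck
        return self.decoder(z)

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)            # hypothetical batch of flattened inputs

for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)   # reconstruction error
    loss.backward()
    opt.step()
```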
Beliy et al. introduced a convolutional neural network (CNN)-based encoder-decoder model named BeliyEncDec.

Figure 3.2.3: The BeliyEncDec framework, introduced by Beliy et al. (2019), involves two main training stages: (A) supervised training of the Encoder and (B) a combination of supervised and self-supervised training for the Decoder, during which the Encoder's weights remain constant. The components of the model trained on external unlabeled data are indicated in blue.
In this model, the encoder (E) learns the mapping from stimulus images to the corresponding fMRI activity, while the decoder (D) learns the reverse mapping. The architecture of BeliyEncDec, illustrated in Figure 3.2.3, involves two combined networks (E-D and D-E) whose inputs and outputs correspond to natural images and fMRI recordings, respectively. This setup facilitates self-supervised training on a larger corpus of unlabeled data, including 50,000 additional images from the ImageNet validation set and unlabeled fMRI samples. Competitive results were demonstrated on two natural image reconstruction datasets: Generic Object Decoding and vim-1. The training occurs in two steps. Initially, the encoder (E) establishes a mapping from stimulus images to fMRI activity, utilizing the weights of the first convolutional layer of the pretrained AlexNet. Subsequently, with the encoder fixed, the decoder (D) is jointly trained using both labeled and unlabeled data. The overall loss of the model encompasses the fMRI loss of the encoder (E) and the image loss (RGB and feature losses) of the decoder (D).
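The sketch below illustrates the second training stage in schematic form: with a (placeholder) encoder frozen, the decoder receives a supervised loss on paired data plus self-supervised consistency terms on unlabeled images (E-D branch) and unlabeled fMRI (D-E branch). The linear stand-in networks, dimensions, and equal loss weights are assumptions, not the published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder networks; the real model uses convolutional E and D and
# initializes E with AlexNet's first convolutional layer.
img_dim, fmri_dim = 3 * 112 * 112, 4500
E = nn.Linear(img_dim, fmri_dim)      # image -> predicted fMRI
D = nn.Linear(fmri_dim, img_dim)      # fMRI  -> reconstructed image

# Stage (B): E is frozen, D is trained on paired and unpaired data.
for p in E.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(D.parameters(), lr=1e-4)

paired_img = torch.rand(8, img_dim)
paired_fmri = torch.rand(8, fmri_dim)
unlabeled_img = torch.rand(8, img_dim)    # e.g. extra ImageNet images
unlabeled_fmri = torch.rand(8, fmri_dim)  # e.g. fMRI without image labels

opt.zero_grad()
sup_loss = F.mse_loss(D(paired_fmri), paired_img)              # supervised D
img_cycle = F.mse_loss(D(E(unlabeled_img)), unlabeled_img)     # E-D branch
fmri_cycle = F.mse_loss(E(D(unlabeled_fmri)), unlabeled_fmri)  # D-E branch
loss = sup_loss + img_cycle + fmri_cycle
loss.backward()
opt.step()
```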
In a subsequent study, Gaziv et al. enhanced the reconstruction accuracy of BeliyEncDec by introducing a loss function based on perceptual similarity measures. The perceptual similarity loss is computed by extracting multilayer features from both the original and reconstructed images using VGG and comparing these features layerwise.
Generative Methods:
Generative models assume that the data is generated from some probability distribution p(x) and can be classified as implicit or explicit. Implicit models do not define the distribution of the data but instead specify a random sampling process with which to draw samples from p(x). Explicit models, on the other hand, explicitly define the probability density function, which is used to train the model.
Robust image-generation models play a crucial role in enhancing performance across diverse tasks. Specifically, generative models contribute to improving the quality of reconstructions. The prevalent types of generative models for visual stimulus reconstruction are Generative Adversarial Networks (GANs), combinations of Variational Autoencoders (VAEs) and GANs, and Diffusion Models (DMs).
Generative Adversarial Networks (GANs), a category of implicitly defined generative models, have garnered considerable attention for their capacity to generate lifelike images. In the realm of natural image reconstruction, GANs are extensively utilized to capture the distribution of stimulus images. Comprising generator and discriminator networks, a GAN operates by having the generator G take a random noise vector z (typically drawn from a Gaussian distribution) and produce a synthetic sample G(z) with statistical properties akin to those of the training set images. Throughout the training process, the generator's proficiency in creating realistic images steadily improves, reaching a point where the discriminator cannot differentiate between a genuine sample and a generated counterfeit. GAN-based frameworks boast several advantageous features in comparison to other generative methods. Firstly, they do not necessitate strong assumptions about the form of the output probability distribution. Secondly, the use of adversarial training, involving the discriminator, facilitates unsupervised training of the GAN.
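For reference, a toy adversarial training step with fully connected generator and discriminator networks over flattened images; real GAN-based reconstruction frameworks use convolutional architectures, but the alternating objectives follow the same pattern.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator over flattened 64x64 grayscale images.
z_dim, img_dim = 100, 64 * 64
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                  nn.Linear(256, img_dim), nn.Tanh())
Dnet = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
                     nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(Dnet.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()
real = torch.rand(32, img_dim) * 2 - 1   # stand-in for training images

# Discriminator step: real images labelled 1, generated images labelled 0.
z = torch.randn(32, z_dim)
fake = G(z).detach()
d_loss = bce(Dnet(real), torch.ones(32, 1)) + bce(Dnet(fake), torch.zeros(32, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: try to make the discriminator label fakes as real.
z = torch.randn(32, z_dim)
g_loss = bce(Dnet(G(z)), torch.ones(32, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```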
Seeliger et al. employed a deep convolutional Generative Adversarial Network (DCGAN) architecture incorporating advancements through the integration of convolutional and deconvolutional layers. The researchers focused on learning a direct linear mapping from the functional magnetic resonance imaging (fMRI) space to the latent space of the GAN, as illustrated in Figure 3.2.4A. For the image domain of natural stimuli, the generator G was pre-trained on down-sampled 64 x 64 grayscale images converted from the ImageNet and Microsoft COCO datasets. In the domain of handwritten character stimuli, the DCGAN was pre-trained using 15,000 handwritten characters.

Figure 3.2.4: GAN-based frameworks. (A) SeeligerDCGAN framework based on a deep convolutional GAN. (B) Framework proposed by Mozafari et al.
In a separate study, Mozafari et al. adopted a variant of the GAN known as the BigBiGAN model, enabling the reconstruction of more realistic images. The BigBiGAN model, with its latent space, captures high-level semantic information from fMRI data to enhance the generation of lifelike images. Referred to as MozafariBigBiGAN, this framework incorporates a pre-trained encoder E that generates a latent vector E(x) from the input image x, and a generator G that produces an image G(z) from a latent vector z, as depicted in Figure 3.2.4B. During training, the authors computed a linear mapping W from the latent vectors E(x) to fMRI activity using a general linear regression model. In the test stage, the linear mapping is inverted to compute the latent vectors z from the test fMRI activity.
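A small sketch of this train/test logic on synthetic arrays: a least-squares fit of the latent-to-fMRI mapping W during training, and a pseudoinverse of W at test time to estimate latent vectors from new fMRI patterns. The dimensions and the plain least-squares estimator are assumptions; the cited work uses a general linear regression model whose exact regularization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, latent_dim, n_voxels = 1000, 120, 4000

# Synthetic stand-ins: E(x) latent vectors and corresponding fMRI patterns.
Z_train = rng.standard_normal((n_train, latent_dim))
V_train = Z_train @ rng.standard_normal((latent_dim, n_voxels)) \
          + 0.1 * rng.standard_normal((n_train, n_voxels))

# Training: fit a linear map W such that Z_train @ W approximates V_train.
W, *_ = np.linalg.lstsq(Z_train, V_train, rcond=None)   # (latent_dim, n_voxels)

# Test: invert the mapping with the pseudoinverse to estimate latent
# vectors from new fMRI activity, which are then fed to the generator G.
V_test = rng.standard_normal((10, n_voxels))
Z_est = V_test @ np.linalg.pinv(W)                       # (10, latent_dim)
print(Z_est.shape)
```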
The majority of prior research on visual decoding from fMRI data involves selecting voxels corresponding to specific visual regions of interest (ROIs). These voxel measurements are then flattened into 1D vectors, which serve as input for the visual decoding process. However, this approach exhibits significant drawbacks. The definition of ROIs and the selection process during fMRI preprocessing are subjective and can vary between individuals. Moreover, the spatial topology of 2D cortical areas is often overlooked when using vectorized voxel responses. To address these limitations, a novel visual decoding framework, Cortex2Image, was proposed by Ozcelik et al. This framework comprises a Cortex2Semantic model, a Cortex2Detail model, and a pre-trained and frozen image generator named Instance-Conditioned GAN (IC-GAN). Notable improvements include an architecture shared across individual subjects in the Cortex2Image model, which consumes cortex-wide brain activity through a standardized mesh representation of the cortex rather than relying on specific ROIs. The use of surface convolutions in this model enables the exploitation of spatial information in brain activity patterns. Additionally, the end-to-end training of the Cortex2Detail model in conjunction with IC-GAN enhances computational efficiency compared to previous methods that involve additional noise-vector optimization steps.
Larsen et al. introduced a hybrid model, VAE-GAN, seamlessly integrating a Variational Autoencoder (VAE) and a Generative Adversarial Network (GAN). This framework combines a VAE for generating latent features with a GAN discriminator that learns to differentiate between fake and authentic images. VAE-GAN unifies the VAE decoder and the GAN generator into a single entity. The advantages of VAE-GAN are twofold: firstly, the adversarial loss of the GAN facilitates the generation of visually more realistic images, and secondly, VAE-GAN alleviates the mode-collapse problem of GANs, where a generator produces only a limited subset of different outcomes.
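A compact sketch of a combined VAE-GAN objective, with placeholder fully connected networks: a reparameterized VAE reconstruction and KL term, plus an adversarial term that pushes the shared decoder/generator to fool a discriminator. The loss weights and architectures are illustrative, not those of the cited work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, img_dim = 64, 3 * 64 * 64

# Placeholder VAE encoder (mean and log-variance), a decoder that doubles
# as the GAN generator, and a discriminator.
enc = nn.Linear(img_dim, 2 * latent_dim)
dec = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, img_dim))
disc = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

x = torch.rand(16, img_dim)

# VAE part: reparameterized sample, reconstruction and KL terms.
mu, logvar = enc(x).chunk(2, dim=1)
z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
x_rec = dec(z)
rec_loss = F.mse_loss(x_rec, x)
kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

# GAN part: the decoder/generator is also pushed to fool the discriminator.
adv_loss = F.binary_cross_entropy_with_logits(disc(x_rec), torch.ones(16, 1))

# In practice separate optimizers update enc/dec and disc with tuned weights.
total = rec_loss + kl_loss + adv_loss
total.backward()
```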
VanRullen and Reddy employed a VAE network pre-trained on the CelebA dataset using GAN procedures to acquire a variational latent space. Similar to the MozafariBigBiGAN framework, they established a linear mapping between the latent feature space and fMRI patterns, bypassing probabilistic inference. During training, the pre-trained encoder from VAE-GAN remains fixed, and the learning process focuses on the linear mapping between the latent feature space and fMRI patterns. In the testing stage, fMRI patterns are translated into VAE latent codes through the inverse mapping, and these codes are then used for facial reconstruction. The VAE's latent space, functioning as a variational layer, offers a meaningful representation of each image, capable of portraying faces and facial features as linear combinations. Due to the VAE's training objective, proximate points in this space correspond to similar facial images, consistently yielding visually plausible results. Consequently, the VAE's latent space ensures robustness in brain decoding, minimizing mapping errors and producing more realistic reconstructions that closely resemble the original stimulus images. This approach not only facilitates the reconstruction of natural-looking faces but also enables gender decoding. The architectural configuration of the framework, termed VanRullenVAE-GAN, encompasses three networks, as depicted in Figure 3.2.5.

Figure 3.2.5: VanRullenVAE-GAN framework proposed by VanRullen and Reddy
Ren et al. demonstrate the feasibility of deriving promising solutions by acquiring visually-guided latent cognitive representations from fMRI signals and subsequently decoding them into the corresponding image stimuli. The approach involves training an encoder to map brain signals to a low-dimensional representation that retains crucial visually relevant information. Simultaneously, a robust decoder is trained to recover this latent representation back into the original visual stimulus. To achieve this objective, the D-VAE/GAN model employs a Dual VAE-based Encoding Network to generate low-dimensional latent features for both fMRI signals and perceived images. Subsequently, a novel GAN-based inter-modality knowledge distillation method