2021 8th NAFOSTED Conference on Information and Computer Science (NICS)

Image Retrieval with Text Feedback based on Transformer Deep Model

Truc Luong-Phuong Huynh (1), Ngoc Quoc Ly (2)
Faculty of Information Technology, Computer Vision & Cognitive Cybernetics Dept.
VNUHCM-University of Science, Ho Chi Minh City, Vietnam
1712842@student.hcmus.edu.vn, lqngoc@fit.hcmus.edu.vn

Abstract—Image retrieval with text feedback has great potential when applied to product retrieval on e-commerce platforms. Given an input image and text feedback, the system needs to retrieve images that not only look visually similar to the input image but also reflect the modifications mentioned in the text feedback. This is a tricky task, as it requires a good understanding of the image, the text, and their combination. In this paper, we propose a novel framework called Image-Text Modify Attention (ITMA) and a Transformer-based combining function that preserves and transforms features of the input image according to the text feedback and captures important features of database images. By using image features at different Convolutional Neural Network (CNN) depths, the combining function has access to multi-level visual information and can produce a representation well suited for effective image retrieval. We conduct quantitative and qualitative experiments on two datasets: CSS and FashionIQ. ITMA outperforms existing approaches on these datasets and can handle many types of text feedback, such as object attributes and natural language. We are also the first to report an exceptional behavior of the attention mechanism in this task: it ignores the input image regions that the text feedback wants to remove or change.

Index Terms—Image Retrieval with Text Feedback, Convolutional Neural Network, Attention Mechanism, Transformer Deep Model

I. INTRODUCTION

Image retrieval is a well-known problem in computer vision that has been researched and applied in industry for a long time. The major challenge of this task is how to capture the intention of users when they retrieve something. Most retrieval systems are based on image-to-image matching [7] or text-to-image matching [5], and it is difficult for people to express their ideas in a single image or a couple of words. To overcome these limitations, researchers are trying to build models whose inputs not only allow accurate predictions but also retain the full intent of the user. Some of them describe the target image through desired changes to the user's input image. These changes are usually expressed in text form as certain attributes [8, 18] or relative attributes [13]. More recently, researchers have used a more general type of text feedback, natural language [28], but this is still an unusual area of research. Being able to deal with more general language feedback is key to creating a system that can be used in applications for real users.

Fig. 1. Examples of customers using image retrieval with text feedback in a fashion e-commerce application. Customers upload their clothes (red outline) and type text feedback describing the changes they want (red sentence). The system gives suggestions about the customers' target clothes (green outline).

In this work, the inputs are an image and text feedback describing modifications to that image. Unlike previous approaches, we consider more general language input types, namely object attributes [26] and natural language [28], and propose a simpler approach
that gives better performance. The core ideas behind our method are: (1) using text features to modify image features, and (2) making the modified image features "live in" the same space as the database image features so that they can be matched. For the first idea, we use the attention mechanism to take charge of both preserving and transforming image features according to the text features at the same time. As a result, we not only obtain a more compact model than previous methods but also, through our visualizations, are the first to observe a very particular behavior of the attention mechanism: it avoids retaining the features that the text semantics asks to transform. For the second idea, we perform attention learning on features from various CNN depths for both input and database images. Applying the attention mechanism to database images helps select the best features for matching, so that each image reaches its best representation.

Overall, our work makes the following contributions:
• We introduce Image-Text Modify Attention (ITMA), an approach for image retrieval with text feedback based on attention learning over multi-level visual features and linguistic semantics, which improves over the state of the art on two visual datasets with text feedback, CSS and FashionIQ.
• We quantitatively demonstrate the contribution of the various components of our retrieval system through baseline and ablation experiments.
• We show, through visualization experiments, a distinctive behavior of the attention mechanism on image features conditioned on text feedback when applied to this retrieval task.

II. RELATED WORK

A. Image Retrieval with various types of text feedback

Many efforts have been made to improve the accuracy of retrieval systems. Some of them use user feedback as an interactive signal to steer the model toward the target. User feedback comes in various forms, such as relative attributes [13], attributes [8, 18], natural language [28], and sketches [32]. However, text is the most common form of communication between humans and computers and can hold enough information to specify people's complicated ideas about the target. As a first attempt at image retrieval with text feedback, Text Image Residual Gating (TIRG) [26] proposed two separate functions for transforming and preserving image features. Visiolinguistic Attention Learning (VAL) [2] went further by using image features at multiple CNN depths. Both approaches use only a CNN to learn image representations and the two functions above to modify the input image features so that they resemble the CNN features of the target image. Since the representation learned by the CNN alone was not good enough, VAL added another loss function to bring the image features closer to the semantics of their text descriptions. We share with these methods the idea of transforming and preserving image features according to text semantics and of using image features at multiple convolution layers. However, we improve on them by building a single combining function that handles both transformation and preservation and that operates once over the multi-level CNN image features. We use the CNN and the combining function for all images to get their best representations with a single loss function, and we do not need image descriptions during training. Our ultimate aim is to improve results without an overcomplicated framework.

B. Attention Mechanism
The attention mechanism [1] is widely used in tasks related to images and language. Its purpose is to imitate the human sense of focusing on important information that stands out from the background [3]. To select positions in the image, the attention mechanism assigns different weights to spatial locations, and the value of each weight reflects the importance of the corresponding area. This helps choose the image areas that carry the information needed to describe the image for Image Captioning [25] or to answer the question for Visual Question Answering (VQA) [30]. For problems involving both vision and language, co-attention [15, 17] is used to fuse information from the inputs by weighting their important factors. In addition, several approaches based on self-attention, which is built on the Transformer [24], have recently been proposed for VQA [6, 14, 33] to learn the latent relations among spatial locations.

C. Compositional Learning

Compositional learning is considered an essential capability of an intelligent machine. Its goal is to learn encoded features that can encompass multiple primitives [27]. Although CNNs can learn the composition of visual information, they still cannot learn a clear composition of language and image information. Recently, research [23] extending the pre-training strategies of BERT [4] has been proposed to learn a compositional representation of image and text for VQA, Image Captioning, and Image-Text Matching. Unfortunately, these works mostly rely on fixed feature extractors from complicated object detection [21] and recognition [29] models. This not only limits their applicability to a variety of problems but also leads to an overcomplicated and heavy framework. We propose to use image features at varying depths inside a CNN and combine them with text features. This is an effective way to fuse image and text features into a compositional representation with a simpler and lighter model.

III. METHOD

The training process is illustrated in Fig. 2. During training, the system predicts a representation φxt that should be most similar to the target representation φy. We start from an image-to-image retrieval system that matches features of the input image x with features of the target y, and then gradually learn meaningful modifications of the features of x according to the features of the text t. Target images y are not available during testing, so the system retrieves the database images whose representations are most similar to φxt.

Fig. 2. An overview of the training pipeline. Given are an input pair of an image x (red outline) and text feedback t (red sentence), a target image y (green outline), and images from the database (gray outline). Red arrows show the flow of the input pair (x and t), green arrows show the flow of database images including the target y, and blue arrows show the flow shared by both. The pipeline has four modules: (a) Image Encoder, (b) Text Encoder, (c) Combining Function, and (d) Loss Function between the combined features and the features of database images.

A. Image Encoder

Following VAL [2], we use a standard CNN to encode images and take features from multiple layers inside it. In essence, a CNN gradually filters image features through its layers to retain the most representative components. However, it may also remove features that are important for our modification step. Therefore, to prevent important details from being discarded, we use features from two different layers of the CNN, which we call the Middle Layer φM ∈ R^(w×h×c) and the High Layer φH ∈ R^(w'×h'×c'):

    {φM, φH} = f_CNN(x)    (1)

The number of CNN layers used is a hyperparameter; in our case, using more than two layers does not improve the performance. After this step, φM and φH are projected to the same number of channels, C = 512, through a learned linear projection.

B. Text Encoder

To get the text representation, we define a function f_text(t) that encodes the text feedback t into a representative vector φt of size d = 512:

    φt = f_LSTM(t) ∈ R^d    (2)

We use a standard Long Short-Term Memory (LSTM) [10], followed by a max-pooling and a linear transformation, as the text encoder. φt is obtained by passing each word of t into the text encoder and taking the output of the last timestep.
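To make the two encoders concrete, the sketch below shows one way they could be implemented in PyTorch, the framework used in our experiments. The ResNet-50 stage names (layer3, layer4), the 1×1 projection convolutions, and the exact pooling point in the text encoder are our assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from torchvision import models


class MultiLevelImageEncoder(nn.Module):
    """Extracts Middle and High feature maps from ResNet-50 and projects
    both to C = 512 channels, as in Eq. (1)."""

    def __init__(self, out_channels=512):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        # Everything up to and including layer3 gives the Middle Layer map.
        self.stem_to_mid = nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
            backbone.layer1, backbone.layer2, backbone.layer3)  # -> (B, 1024, 14, 14)
        self.mid_to_high = backbone.layer4                      # -> (B, 2048, 7, 7)
        # Learned linear projections to a common channel width C.
        self.proj_mid = nn.Conv2d(1024, out_channels, kernel_size=1)
        self.proj_high = nn.Conv2d(2048, out_channels, kernel_size=1)

    def forward(self, x):
        phi_m = self.stem_to_mid(x)       # Middle Layer features
        phi_h = self.mid_to_high(phi_m)   # High Layer features
        return self.proj_mid(phi_m), self.proj_high(phi_h)


class TextEncoder(nn.Module):
    """LSTM text encoder producing a d = 512 vector, Eq. (2). The exact
    placement of the max-pooling step is our guess."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, out_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, out_dim)

    def forward(self, token_ids):                       # (B, T) word indices
        outputs, _ = self.lstm(self.embed(token_ids))   # (B, T, hidden)
        pooled = outputs.max(dim=1).values               # max-pool over timesteps
        return self.fc(pooled)                           # phi_t, (B, 512)
```

For a 224×224 input, this yields the 14×14 Middle Layer and 7×7 High Layer maps referred to in the ablation tables, each with C = 512 channels.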
C. Combining Function

To obtain a composite representation of image and text, we transform and preserve image features according to the semantics of the text feedback. Inspired by the use of Transformers in multimodal learning [12], we build a composite Transformer over multi-level CNN features.

1) Image-Text Representation: As the information flows through the visual and linguistic domains, the input image features from multiple CNN layers, φM and φH, and the text features φt are fused into the image-text representation:

    φ = [φM, φH, φt]    (3)

More specifically, we reshape φM to R^(n×C) (n = w×h) and φH to R^(n'×C) (n' = w'×h') and concatenate all of them. In a similar spirit to the Relation Network [22], the relation between the input image and the text is modeled within φ. For features of database images, φ has contributions only from φM and φH:

    φ = [φM, φH]    (4)

2) Image-Text Self-attention: To figure out the latent connections between image regions needed for learning the preservation and transformation, we pass the image-text representation φ through a multi-head Transformer. The core idea is to capture important visual and linguistic information through self-attention learning. First, we project φ into the latent space as query, key, and value (i.e., Q, K, V):

    Q = φW_Q, K = φW_K, V = φW_V    (5)

where W_Q, W_K, and W_V are 1×1 convolutions. The self-attention is followed by fully connected layers as in the Transformer encoder [24]. Here self-attention refers to the equation:

    Attn(Q, K, V) = f(QK^T)V    (6)

where f is the softmax function as in [24]. This self-attention exploits the interactions between the components of the image-text representation. For each component, it generates an attention mask that highlights the spatial information needed for learning feature transformation, feature preservation, and visual matching.

3) Embeddings: Since φt is absent from the database image features, the input and target features must be embedded in the same space in order to be matched. We first average-pool within each feature type, separately over φM, φH, and φt for image-text features and over φM and φH for database features, to obtain two representations of shape 512×3 and 512×2 respectively. Each is then averaged and normalized to become a vector with 512 elements. Finally, we multiply each vector by a learned scale weight. During training, the representations φxt and φy (Fig. 2d) are passed to the loss function described in the next section. During testing, the database images whose representations are in the top k by cosine similarity with φxt are returned as results.
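The combining function of Eqs. (3)–(6) can be sketched as follows. This is a minimal illustration built on PyTorch's standard Transformer encoder, in which the 1×1 convolution projections of Eq. (5) become the equivalent linear projections over flattened tokens; the numbers of blocks and heads here are placeholders, not the exact values used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CombiningFunction(nn.Module):
    """Fuses multi-level image features (and, for queries, text features)
    with a Transformer encoder, then pools to a single 512-d embedding."""

    def __init__(self, dim=512, heads=8, ff_dim=256, blocks=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=ff_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=blocks)
        self.scale = nn.Parameter(torch.ones(1))  # learned scale weight

    def forward(self, phi_m, phi_h, phi_t=None):
        # Flatten spatial maps into token sequences, Eqs. (3)/(4).
        tokens_m = phi_m.flatten(2).transpose(1, 2)   # (B, n,  C)
        tokens_h = phi_h.flatten(2).transpose(1, 2)   # (B, n', C)
        parts = [tokens_m, tokens_h]
        if phi_t is not None:                          # query side only
            parts.append(phi_t.unsqueeze(1))           # (B, 1, C)
        phi = torch.cat(parts, dim=1)                  # image-text representation
        phi = self.encoder(phi)                        # self-attention, Eqs. (5)-(6)
        # Average-pool within each feature type, then average across types.
        sizes = [p.shape[1] for p in parts]
        pooled = torch.stack([chunk.mean(dim=1)
                              for chunk in phi.split(sizes, dim=1)], dim=1)
        out = F.normalize(pooled.mean(dim=1), dim=-1)
        return self.scale * out                        # (B, dim)
```

The same module embeds both query-side features (with φt) and database-side features (without φt), so both end up in the same 512-dimensional space, as required for matching.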
D. Loss Function

Our training target is to bring the representation φxt closer to the target representation φy while pushing it away from other representations. We adopt the classification-based loss from TIRG [26] for our training. Specifically, we train on a batch of B queries; for the i-th query, the representation φ_{x_i t_i} should move closer to its target representation φ_{y_i}, while the representations to push away, φ_{y_j} with j ≠ i, are the target representations of the other queries in the same batch. Thus:

    L = (1/B) Σ_{i=1}^{B} −log [ exp{κ(φ_{x_i t_i}, φ_{y_i})} / Σ_{j=1}^{B} exp{κ(φ_{x_i t_i}, φ_{y_j})} ]    (7)

where κ is the similarity function, implemented as the dot product in our experiments. We do not use the triplet-based loss for CSS as in [26] because, in early experiments, it did not work well for the R@k metric.
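A minimal sketch of the batch classification loss of Eq. (7), with κ implemented as the dot product as stated above; this mirrors the standard softmax cross-entropy formulation rather than reproducing the authors' code.

```python
import torch
import torch.nn.functional as F


def batch_classification_loss(phi_xt, phi_y):
    """Eq. (7): softmax cross-entropy over a batch of B queries.

    phi_xt: (B, D) combined query representations phi_{x_i t_i}
    phi_y:  (B, D) target representations phi_{y_i}
    """
    # kappa as the dot product: (B, B) similarity matrix,
    # entry (i, j) = kappa(phi_{x_i t_i}, phi_{y_j}).
    sims = phi_xt @ phi_y.t()
    # The correct target for query i is the i-th target in the batch.
    labels = torch.arange(phi_xt.size(0), device=phi_xt.device)
    # cross_entropy averages -log softmax(sims[i])[i] over i, exactly Eq. (7).
    return F.cross_entropy(sims, labels)
```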
IV. EXPERIMENTS

A. Implementation details

We conduct all experiments in PyTorch. For the image encoder, we use ResNet-50 [9] (output feature size 512) pretrained on ImageNet. For the text encoder, we use an LSTM [10] (hidden size 512) with randomly initialized parameters. The model is trained with an initial learning rate of 0.001 for the parameters of the image encoder and 0.01 for the remaining parameters. The learning rate decays by a factor of 10 every 50K iterations, and training ends after 150K iterations for FashionIQ and 60K iterations for CSS. Our model uses attention blocks with 512 units for each of Q, K, and V, multi-head attention, and 256 units in the fully connected layer of the attention blocks. We use a batch size of 32, the same as in previous papers.

B. Results

1) CSS [26]: CSS (short for Color, Shape, and Size) is an attribute-based retrieval dataset including 32K queries (16K for the train set and 16K for the test set). Text feedback, e.g., "make yellow sphere small", modifies synthesized images of a 3-by-3 grid scene. Although CSS looks relatively simple, it allows carefully controlled experiments. Unlike on most other datasets, models trained on CSS are likely to overfit when the configuration is large. Therefore, we limit our model to features from the High Layer of ResNet-50 only and a single attention block. In addition, replacing softmax with the identity function in the attention block improves R@1. Using sinusoidal encodings as in [24] not only fails to improve performance with softmax but also impairs performance with the identity function. We compare with the R@1 performances reported in [11, 26] for those methods and other recent methods [19, 20, 22, 25]. We also use the CSS 3D images provided by [26] for our experiments, and ITMA outperforms them on the retrieval task, as shown in Table I.

Table I. Retrieval performance on the CSS test set, compared with existing approaches as reported in [11, 26]. Using image features from multiple levels boosts results compared to using only features from the final convolution layers, as in previous methods.

Method                          R@1
Show & Tell [25]                33.0 ± 3.2
Param hashing [19]              60.5 ± 1.9
Relationship [22]               62.1 ± 1.2
FiLM [20]                       65.6 ± 0.5
TIRG [26]                       73.7 ± 1.0
Locally Bounded Features [11]   79.2 ± 1.2
ITMA (Ours)                     87.8 ± 0.9

2) FashionIQ [28]: FashionIQ is a fashion retrieval dataset with natural language text feedback. It consists of 77,648 images collected from the e-commerce site Amazon.com in three categories: Dresses, Toptees, and Shirts. Among the 46,609 training images, there are 18,000 pairs of query-target images along with text feedback sentences in natural language describing one or two modified properties, e.g., "longer more dressy". We report R@10 and R@50 and use the same FashionIQ dataset as [2]. As FashionIQ is a relatively new dataset, there have not been many experiments on it; we outperform the other competitors, as shown in Table II.

Table II. Retrieval performance on the FashionIQ validation set, compared with existing approaches as reported in [2]. Our simple combining function, which performs both feature preservation and transformation, helps the model generalize better than the two separate functions of TIRG and VAL.

Method          (R@10 + R@50)/2
TIRG [26]       31.2
VAL [2]         35.4
ITMA (Ours)     36.6 ± 0.4

Fig. 3 presents our qualitative results on FashionIQ. Although much semantics is hidden behind the natural language text feedback, ITMA is still able to capture almost every aspect of the text, including fashion-related content such as color, gloss, and printing. We also found that our model understands not only global descriptions, such as the overall colors and patterns of the outfit, but also local details, like a logo in a specific location.

Fig. 3. Top-10 results of ITMA for FashionIQ validation set queries. Images with a green outline are "correct" images.
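For concreteness, the R@k numbers above can be computed as in the following sketch (our illustration, not code from the paper): database images are embedded once, queries are ranked by cosine similarity as described in Section III-C3, and a query counts as a hit if its target appears among the top k results.

```python
import torch
import torch.nn.functional as F


def recall_at_k(query_emb, db_emb, target_idx, k=10):
    """query_emb: (Q, D) combined query embeddings; db_emb: (N, D) database
    embeddings; target_idx: (Q,) index of each query's target in the database."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(db_emb, dim=-1)
    sims = q @ d.t()                          # cosine similarity, (Q, N)
    topk = sims.topk(k, dim=-1).indices       # indices of the k best matches
    hits = (topk == target_idx.unsqueeze(1)).any(dim=-1)
    return hits.float().mean().item()
```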
C. Ablation studies

We first examine the influence of the Middle Layer (14×14) and High Layer (7×7) of ResNet-50 on our model. Table III shows that using both substantially improves results. Table IV shows the sensitivity of the model to the number of units in the fully connected layer behind the attention blocks. Table V shows the relatively small effect of varying the number of attention blocks in the architecture. Overall, we use the combination that achieves the best results for our architecture: features from two different layers of ResNet-50, 256 units in the fully connected layer of the attention blocks, and the corresponding number of attention blocks.

Table III. Retrieval performance on the FashionIQ validation set with and without image features from the Middle Layer (14x14). We use 256 units in the fully connected layer of the attention blocks; the second row is our standard model.

7x7 | 14x14 | Dresses R@10 (R@50)     | Top & Tees R@10 (R@50)  | Shirts R@10 (R@50)      | (R@10 + R@50)/2
 X  |       | 22.9 ± 1.1 (47.7 ± 1.0) | 26.4 ± 0.6 (52.4 ± 0.5) | 19.5 ± 0.9 (42.0 ± 0.6) | 35.1 ± 0.6
 X  |  X    | 23.8 ± 0.6 (48.6 ± 1.0) | 27.9 ± 0.8 (53.6 ± 0.6) | 21.3 ± 0.7 (44.2 ± 0.3) | 36.6 ± 0.4

Table IV. Retrieval performance on the FashionIQ validation set when adjusting the width of the fully connected layer of the attention blocks. We use image features from the High Layer (7x7) and Middle Layer (14x14).

Width | Dresses R@10 (R@50)     | Top & Tees R@10 (R@50)  | Shirts R@10 (R@50)      | (R@10 + R@50)/2
 64   | 23.9 ± 0.8 (48.6 ± 0.8) | 27.7 ± 1.0 (53.7 ± 0.8) | 21.1 ± 0.5 (43.3 ± 0.6) | 36.4 ± 0.3
 128  | 24.7 ± 0.6 (49.0 ± 0.7) | 27.8 ± 0.6 (54.1 ± 0.3) | 21.1 ± 0.4 (43.9 ± 0.9) | 36.8 ± 0.2
 256  | 23.8 ± 0.6 (48.6 ± 1.0) | 27.9 ± 0.8 (53.6 ± 0.6) | 21.3 ± 0.7 (44.2 ± 0.3) | 36.6 ± 0.4
 512  | 24.2 ± 0.4 (48.3 ± 0.4) | 27.7 ± 0.8 (53.8 ± 1.1) | 21.2 ± 1.1 (43.2 ± 0.3) | 36.4 ± 0.5

Table V. Retrieval performance on the FashionIQ validation set using different numbers of attention blocks. We use 256 units in the fully connected layers and image features from the High Layer (7x7) and Middle Layer (14x14).

# Attention Blocks | Dresses R@10 (R@50)     | Top & Tees R@10 (R@50)  | Shirts R@10 (R@50)      | (R@10 + R@50)/2
 –                 | 24.2 ± 1.2 (48.3 ± 1.3) | 27.9 ± 0.8 (54.0 ± 0.9) | 20.6 ± 0.7 (43.3 ± 0.6) | 36.4 ± 0.5
 –                 | 23.8 ± 0.6 (48.6 ± 1.0) | 27.9 ± 0.8 (53.6 ± 0.6) | 21.3 ± 0.7 (44.2 ± 0.3) | 36.6 ± 0.4
 –                 | 24.1 ± 0.3 (48.3 ± 0.4) | 27.3 ± 0.7 (53.6 ± 0.9) | 20.6 ± 0.2 (43.5 ± 0.7) | 36.2 ± 0.4
 –                 | 23.7 ± 0.8 (47.9 ± 1.0) | 27.1 ± 0.8 (52.9 ± 1.0) | 20.0 ± 0.5 (42.6 ± 0.7) | 35.7 ± 0.6

D. Visualization

The attention maps in Fig. 4 emphasize image regions according to the bold words in the text feedback, which explains the behavior of attention in this retrieval task. For additive changes like "longer dress" or "longer sleeves", the model puts positive weights on the corresponding areas of the items (red areas). In contrast, for diminishing changes such as "shorter dress", the model puts negative weights on the corresponding areas (purple areas). The behavior of a word ignoring what it refers to seems to contradict the attention mechanism in Image Captioning [31], where words attend to the relevant regions of the image. However, our text feedback describes modifications of the image, and the model tries to learn a final representation that is close to the target representation. A word therefore avoids attending to the item it refers to, because that item tends to carry no information about the target.

Fig. 4. Attention visualization on FashionIQ. We use attention maps from the attention heads in one block of our best model and obtain this visualization by averaging over the first 200 images of the FashionIQ validation set.

V. CONCLUSION AND FUTURE WORKS

In this paper, we introduced ITMA, a novel approach for image retrieval with text feedback. ITMA extracts representative features for each input and stacks them into one sequence that can be processed by the attention model. We evaluated it on two retrieval datasets with text feedback and showed its ability to deal with various types of user language, including attributes and natural language. In the future, we will use a fashion ontology [16], whose structure includes categories, attributes, colors, textures, shapes, etc., to support the text feedback and our training model. We will also develop a smarter system that recommends the most suitable clothes for each customer.

ACKNOWLEDGMENT

We would like to thank Hong-Huan Do, Vinh-Loi Ly, and Sy-Tuyen Ho for their helpful discussions.

REFERENCES

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[2] Yanbei Chen, Shaogang Gong, and Loris Bazzani. Image search with text feedback by visiolinguistic attention learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3001–3011, 2020.
[3] Maurizio Corbetta and Gordon L. Shulman. Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience, 3(3):201–215, 2002.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[5] Andrea Frome, Greg Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. 2013.
[6] Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven C. H. Hoi, Xiaogang Wang, and Hongsheng Li. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6639–6648, 2019.
[7] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. Deep image retrieval: Learning global representations for image search. In European Conference on Computer Vision, pages 241–257. Springer, 2016.
[8] Xintong Han, Zuxuan Wu, Phoenix X. Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, and Larry S. Davis. Automatic spatially-aware fashion concept discovery. In Proceedings of the IEEE International Conference on Computer Vision, pages 1463–1471, 2017.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[11] Mehrdad Hosseinzadeh and Yang Wang. Composed query image retrieval using locally bounded features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3596–3605, 2020.
[12] Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4634–4643, 2019.
[13] Adriana Kovashka, Devi Parikh, and Kristen Grauman. WhittleSearch: Image search with relative attribute feedback. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2973–2980. IEEE, 2012.
[14] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.
[15] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. Advances in Neural Information Processing Systems, 29:289–297, 2016.
[16] Ngoc Q. Ly, Tuong K. Do, and Binh X. Nguyen. Large-scale coarse-to-fine object retrieval ontology and deep local multitask learning. Computational Intelligence and Neuroscience, 2019, 2019.
[17] Duy-Kien Nguyen and Takayuki Okatani. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6087–6096, 2018.
[18] Hung M. Nguyen, Ngoc Q. Ly, and Trang T. T. Phung. Large-scale face image retrieval system at attribute level based on facial attribute ontology and deep neuron network. In Asian Conference on Intelligent Information and Database Systems, pages 539–549. Springer, 2018.
[19] Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. Image question answering using convolutional neural network with dynamic parameter prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 30–38, 2016.
[20] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[21] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28:91–99, 2015.
[22] Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. arXiv preprint arXiv:1706.01427, 2017.
[23] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.
[24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[25] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
[26] Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. Composing text and image for image retrieval - an empirical odyssey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6439–6448, 2019.
[27] Kun Wei, Muli Yang, Hao Wang, Cheng Deng, and Xianglong Liu. Adversarial fine-grained composition learning for unseen attribute-object recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3741–3749, 2019.
[28] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11307–11317, 2021.
[29] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning for video understanding. arXiv preprint arXiv:1712.04851, 1(2):5, 2017.
[30] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.
[31] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057. PMLR, 2015.
[32] Qian Yu, Feng Liu, Yi-Zhe Song, Tao Xiang, Timothy M. Hospedales, and Chen-Change Loy. Sketch me that shoe. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 799–807, 2016.
[33] Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6281–6290, 2019.