
Report for Specialized Project, Semester 232, Academic Year 2023-2024: Develop a Multimodal Chatbot for a Fashion Store


DOCUMENT INFORMATION

Basic information

Title: Develop A Multimodal Chatbot For A Fashion Store
Authors: Vo Hoang Nhat Khang, Nguyen Phan Tri Duc
Supervisor: Assoc. Prof. Quan Thanh Tho, Ph.D.
Institution: Vietnam National University Ho Chi Minh City
Major: Computer Science
Type: Thesis
Year: 2023-2024
City: Ho Chi Minh City
Pages: 105
File size: 9.54 MB

Structure

  • 4.1.2 Dataset Segmentation
  • 4.3.1 Settings
  • 4.3.2 Results
  • 4.4.1 How Vector Database Is Used
  • References
  • A.1 Text-only queries
  • A.2 Image-only queries
  • A.3 Image-text combined queries
  • B.1 Phase 1: Done
  • B.2 Phase 2: Coming up

Content

CLIP Encoder: Contrastive Language-Image Pretraining Encoder, a deep learning model that learns joint representations of images and text through contrastive... Transformer: A machine learning...

Dataset Segmentation

The PhoBERT tokenizer requires word- and sentence-segmented input, which we achieve using RDRSegmenter. Prior to training, we segment the dataset captions with this segmentation tool.
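As an illustration, the segmentation step could look like the following minimal sketch, assuming the py_vncorenlp wrapper around VnCoreNLP's RDRSegmenter (the excerpt does not specify the exact interface used, and the caption string is a made-up example):

import os
import py_vncorenlp

# VnCoreNLP requires Java and an absolute path to its model directory.
save_dir = os.path.abspath("./vncorenlp")
os.makedirs(save_dir, exist_ok=True)
py_vncorenlp.download_model(save_dir=save_dir)  # one-time download

# Load only the word-segmentation annotator (RDRSegmenter).
segmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir=save_dir)

caption = "Một cô gái mặc áo dài trắng đứng trước cửa hàng."  # sample caption
# word_segment returns one string per sentence, with multi-syllable words
# joined by underscores (e.g. "cô_gái", "áo_dài"), the input format the
# PhoBERT tokenizer expects.
print(segmenter.word_segment(caption))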

The training set of UIT-OpenViIC consists of 9,088 images with 41,238 captions, the validation set has 2,011 images with 10,002 captions, and the test set comprises 2,001 images with 10,001 captions.

PhoCLIP is an innovative multimodal retrieval-augmented generation system designed specifically for the Vietnamese language, leveraging the strengths of CLIP and PhoBERT. This approach uses CLIP's multimodal representation learning to link images and text within a shared embedding space. By incorporating PhoBERT, a leading text encoder tailored for Vietnamese, PhoCLIP significantly improves its understanding and interpretation of Vietnamese text inputs.

Our design mirrors that of OpenAI's CLIP, which features a text encoder and an image encoder. These components are followed by a projection layer that aligns their hidden states into a unified multimodal space, typically with a dimensionality of 512.

Figure 4.1: Architecture of PhoCLIP, adapted from OpenAI CLIP

In our project, we replace the original text encoder with the PhoBERT encoder and its tokenizer, leveraging PhoBERT's pretrained weights. For image encoding, we test two widely used architectures: Vision Transformer and ResNet. The projection layer is designed as a linear layer; while CLIP employs a single linear layer for this purpose, we investigate the impact of using one or two linear layers in our experiments. By combining PhoBERT with CLIP's framework, PhoCLIP effectively understands and associates Vietnamese text with images, facilitating precise image retrieval from textual queries.
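A minimal PyTorch sketch of this dual-encoder design is shown below. The class and variable names, the use of the <s> token embedding for text pooling, and the shape of the two-layer projection variant are our illustrative assumptions, not the authors' exact code:

import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class PhoCLIP(nn.Module):
    # Dual encoder in the spirit of CLIP: PhoBERT for text, a vision
    # backbone (e.g. ViT-Base or ResNet50) for images, each followed by
    # a projection into a shared 512-dimensional multimodal space.
    def __init__(self, vision_encoder, vision_dim, embed_dim=512, num_proj_layers=1):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("vinai/phobert-base")
        self.vision_encoder = vision_encoder
        text_dim = self.text_encoder.config.hidden_size  # 768 for PhoBERT-base

        def projection(in_dim):
            # One linear layer as in CLIP, or two, per the experiments below.
            if num_proj_layers == 1:
                return nn.Linear(in_dim, embed_dim, bias=False)
            return nn.Sequential(nn.Linear(in_dim, embed_dim), nn.GELU(),
                                 nn.Linear(embed_dim, embed_dim, bias=False))

        self.text_proj = projection(text_dim)
        self.image_proj = projection(vision_dim)

    def forward(self, input_ids, attention_mask, images):
        text_out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        text_feat = text_out.last_hidden_state[:, 0]  # <s> token embedding
        image_feat = self.vision_encoder(images)      # pooled image features
        text_emb = F.normalize(self.text_proj(text_feat), dim=-1)
        image_emb = F.normalize(self.image_proj(image_feat), dim=-1)
        return text_emb, image_emb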

Settings

We conducted experiments with four models using PhoBERT-base as the text encoder, freezing the vision encoders so that only the PhoBERT component and the projection layers are trained. We tested both ViT-Base and ResNet50 as frozen vision encoders, adapting a method similar to LiT, while varying the number of projection layers. The models were trained until the training loss no longer decreased; the model size configurations are detailed in Table 4.1.

Table 4.1: Experimental model sizes. Columns: Vision Encoder, Linear Layers, Parameters.
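The frozen-tower setup described above might be sketched as follows; model refers to the PhoCLIP sketch earlier, and the temperature value is illustrative (the excerpt does not state it):

import torch
import torch.nn.functional as F

# LiT-style training: freeze the pretrained vision encoder so that only
# PhoBERT and the projection layers receive gradient updates.
for p in model.vision_encoder.parameters():
    p.requires_grad = False

def clip_contrastive_loss(text_emb, image_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of matched (text, image) pairs.
    logits = text_emb @ image_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)      # text -> its image
    loss_i2t = F.cross_entropy(logits.t(), targets)  # image -> its text
    return (loss_t2i + loss_i2t) / 2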

We employ the COCO-35L, UIT-OpenViIC, and KTVIC datasets for training and testing our models, while the CC3M-35L dataset is set aside for future development due to its substantial size. In total, we trained on approximately 666,518 image-text pairs.

Results

How Vector Database Is Used

We use a Vector Database system to store and manage our product data, which is essential for our inferencing pipeline. This database acts as the core of our system, enabling efficient storage and retrieval of critical information.

We use the PhoCLIP encoder to convert product images into vector representations, which are then enriched with essential details such as product name, price, description, size, color, and discount. This combined data is stored in our Vector Database, enabling efficient retrieval of relevant information during inferencing.
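As a sketch of this storage step, with FAISS standing in for whatever vector database the system actually deploys, and encode_image as a hypothetical helper wrapping the PhoCLIP image encoder (the product record is made up):

import faiss
import numpy as np

# Illustrative product records; in the real system these come from the catalogue.
products = [
    {"image": "ao_thun.jpg", "name": "Áo thun nam", "price": 199000,
     "description": "Áo cotton co giãn", "size": ["M", "L"],
     "color": "trắng", "discount": 0.10},
]

# PhoCLIP image embeddings, L2-normalized so that inner product equals
# cosine similarity in the shared 512-dimensional space.
embeddings = np.stack([encode_image(p["image"]) for p in products]).astype("float32")
faiss.normalize_L2(embeddings)

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product search
index.add(embeddings)
# Row i of the index maps back to products[i], so retrieval returns the
# full metadata (name, price, description, size, color, discount).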

Our inferencing pipeline, shown in Figure 4.3, uses the RAG model enhanced by PhoCLIP. It starts with a user query that includes both an image and text describing the appearance of the user's desired product.

The PhoCLIP encoder transforms the multimodal input into vector representations, drawing on its strength in the Vietnamese language. These vectors retrieve relevant contexts from the RAG Database, which holds vector representations, images, and detailed product information such as name, price, description, size, color, and discounts. This context, along with the original query, is processed by a Vietnamese LLM to generate tailored responses and corresponding images, delivering comprehensive and personalized results that enhance the user experience.
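A condensed sketch of this pipeline, continuing the FAISS example above; encode_multimodal and format_product are hypothetical helpers, and the prompt template is our own illustration (BARTPho is loaded from its public vinai/bartpho-syllable checkpoint):

import faiss
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
llm = AutoModelForSeq2SeqLM.from_pretrained("vinai/bartpho-syllable")

def answer_query(query_text, query_image, index, products, k=5):
    # 1. Encode the image-plus-text query into the shared embedding space.
    q = encode_multimodal(query_text, query_image).astype("float32")[None, :]
    faiss.normalize_L2(q)

    # 2. Retrieve the top-k most similar products from the vector database.
    _, ids = index.search(q, k)
    context = "\n".join(format_product(products[i]) for i in ids[0])

    # 3. Condition the Vietnamese LLM on the retrieved context plus the query.
    prompt = f"Ngữ cảnh:\n{context}\nCâu hỏi: {query_text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output = llm.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output[0], skip_special_tokens=True)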

Integrating PhoCLIP's multimodal capabilities with the comprehensive data in the RAG Database enables our inferencing pipeline to provide strong, contextually relevant responses tailored to the varied needs of users in the Vietnamese market. We have selected BARTPho as the foundational LLM for our RAG system in this project.

In the future, we may experiment with ViT5 or even MixSUra. Our approach will involve trying, testing, and evaluating the performance of these models against predetermined metrics, particularly in Phase 2 of the thesis, which focuses on future work.

By starting with BARTPho, we can establish a solid foundation for our experiments and pipeline development, leveraging its fine-tuned capabilities for the Vietnamese language.

We will systematically evaluate the effectiveness of alternative models such as ViT5 and MixSUra to determine which one best meets our specific goals and requirements.

The implementation of a multimodal chatbot offers an effective solution to the challenges faced by consumers and fashion retailers in today's retail environment. By integrating image and text, this technology allows customers to efficiently explore a wide range of products, reducing the time and effort spent searching. Additionally, by leveraging intelligent algorithms and machine learning, the chatbot provides personalized recommendations based on individual preferences, enhancing the overall shopping experience.

The development of the PhoCLIP model, designed specifically for the Vietnamese language, represents a significant advancement in multimodal technologies for the Vietnamese market. As the first CLIP model for Vietnamese, PhoCLIP paves the way for new opportunities in multimodal systems that cater to the unique linguistic and cultural characteristics of Vietnamese consumers.

PhoCLIP offers a significant opportunity to set new benchmarks for multimodal systems in the Vietnamese language. By using PhoCLIP as a core component, developers can build advanced and precise multimodal chatbots that effectively combine image and text comprehension.

Fashion retailers can significantly improve product discovery and customer assistance by implementing a pipeline built on hybrid Multimodal-RAG technology. This innovation automates processes, leading to greater customer satisfaction through accurate, context-aware recommendations and personalized support.

Fashion retailers also gain operational efficiency through streamlined processes, which reduce the demands on human customer service representatives. For additional details on our workload and future plans for this project, please see Appendix B.

Date posted: 09/12/2024, 17:32
