MINISTRY OF EDUCATION AND TRAINING
AI05DE01 ARTIFICIAL INTELLIGENCE
FINAL PROJECT REPORT
HSU CHATBOT
Lecturer: Lê Thanh Tung
Member List:
1 Lê Văn Niềm - 22207193
2 Phan Văn Khải - 22206077
3 Nguyễn Trần Trung Kiên - 22205375
JULY 02, 2024
HOA SEN UNIVERSITY
FACULTY OF INFORMATION TECHNOLOGY
PLEDGE
“We have read and understand the academic integrity violations. We pledge on
our personal honor that this work was done by us and does not violate academic
integrity.”
Day month year
(Student’s full name and signature)
ABSTRACT
This project introduces an AI chatbot that is intended to improve
user engagement and streamline information access for prospective and current
Hoa Sen University (HSU) clients. The chatbot uses a large language model and
retrieval over the university's own content to offer a tailored and thorough guide to HSU.
The chatbot's underlying language model is Google's Gemini 1.5, which is
well known for its natural language processing and generation capabilities.
To provide a complete knowledge base, the project integrates data taken from
approximately 300 webpages on HSU's official website, which is processed and
stored in a vector database called ChromaDB. This database enables the efficient
retrieval of relevant information based on user queries.
The chatbot's functionality is based on LangChain, a framework for
creating conversational bots. LangChain's Retrieval-Augmented
Generation (RAG) approach is used, which enables the chatbot to retrieve
relevant information from the ChromaDB database and integrate it
into its responses. This ensures that the chatbot's responses are not just useful but
also tailored to each user's specific question.
The HSU AI Chatbot aims to improve user satisfaction and foster better ties
between the university and its community by making information easily
available, engaging, and instructive. Its ability to provide tailored insights on
HSU's academic programs, facilities, student life, and other facets of university
life has the potential to improve the user experience and promote a
better awareness of the university's offerings and values.
ACKNOWLEDGEMENT
LECTURER’S REVIEW
Ho Chi Minh City, Day month year 2023
REVIEWER
TABLE OF CONTENTS
LECTURER’S REVIEW
1 Introduction
2.1 Overview of the system
2.2 System Architecture/System Flow
2.3 Detailed Description of System Components
3 Project Scope
6 Reference
LIST OF TABLES, DIAGRAMS, IMAGES
Image 3: Chatbot’s output
1 Introduction
This project introduces an artificial intelligence chatbot that acts as a thorough guide for prospective students and current Hoa Sen University clients. The chatbot is
powered by Google's Gemini 1.5 language model and LangChain's Retrieval-Augmented
Generation (RAG) approach, and it uses a vector database (ChromaDB) to store and retrieve information extracted from over 300 webpages on HSU's main website. The chatbot
responds to user requests with individualized responses that provide thorough
information about HSU's academic programs, facilities, student life, and other topics.
This project seeks to improve the user experience by offering an easily available and
useful resource, streamlining communication, and increasing engagement with the
university.
2 System Description
2.1 Overview of the system
The Hoa Sen University AI Chatbot is a conversational assistant that aims to increase user engagement and provide quick access to academic information. It makes use of Google's Gemini 1.5 language model to produce natural and
informative responses, as well as a knowledge base derived from over 300
webpages scraped from HSU's official website. This information is kept in a vector
database known as ChromaDB, which allows the chatbot to quickly retrieve relevant
information in response to user inquiries.
The chatbot's functionality is based on LangChain, a framework for creating
conversational bots. It uses the Retrieval-Augmented Generation (RAG) approach,
which allows the chatbot to retrieve relevant information from ChromaDB and
integrate it into its responses.
This approach enables the chatbot to deliver personalized and
comprehensive responses to user inquiries about HSU's academic programs, facilities, student life, and more. It offers a convenient and engaging way for prospective students and current customers to explore the university and find the information they need. The
chatbot aims to improve user satisfaction and strengthen the connection between HSU and its community by offering a readily accessible and informative resource.
2.2 System Architecture/System Flow
[Diagram labels: Chunked Texts, Generate Embeddings, Embeddings; Prompt, Prompt Embedding, Most relevant text passages (context), LLM, Result]
Image 1: Chatbot’s flow
2.3 Detailed Description of System Components
1 Data Gathering and Preparation:
Webpage Scraping: The project begins by extracting relevant information from Hoa Sen University's official website. This involves systematically collecting data from various webpages, potentially using BeautifulSoup.
Data Chunking: The scraped data is divided into manageable chunks to keep processing and storage efficient. This involves splitting the text of each webpage into chunks of 1000 words before it is loaded into the vector database.
Embedding with Google AI: Google AI's embedding technology is used to transform the text chunks into numerical representations. This allows the ChromaDB database to store and retrieve information based on semantic similarity, so that retrieval matches on the meaning of the text rather than on exact keywords.
ChromaDB Storage: The embedded data chunks are stored within the ChromaDB vector database. ChromaDB is designed for storing and retrieving large amounts of text data, enabling efficient search based on semantic similarity.
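The chunking step described above can be sketched as follows. This is a minimal illustration and not the project's actual code: the function name `chunk_words` and the optional `overlap` parameter are assumptions introduced here (the report only states a size of 1000 words per chunk), and words are taken to be whitespace-separated tokens.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 1000, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words.

    `overlap` (an assumption, not stated in the report) repeats that many words
    from the end of one chunk at the start of the next.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Example: a page of 2500 words yields chunks of 1000, 1000, and 500 words.
page_text = " ".join(f"word{i}" for i in range(2500))
print([len(chunk.split()) for chunk in chunk_words(page_text)])
```

Each resulting chunk would then be passed to the embedding model and stored together with its vector.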
2 Agent phase:
User Query: The user interacts with the chatbot by typing in a question about Hoa Sen University.
The query is passed to a language understanding model (such as Gemini 1.5) to interpret its meaning and intent.
ChromaDB Retrieval: The chatbot uses the ChromaDB database to retrieve relevant chunks of information based on the user's query. This retrieval, which is implemented as a tool included in the
custom agent, is driven by semantic similarity, so the chatbot can find the most relevant
information even if the user's query uses different words than the original text.
Response Generation: The chatbot combines the retrieved data with the Gemini model's
capabilities to generate a comprehensive and informative response for the user.
Chat History: The chatbot stores the history of user interactions, allowing it to
potentially provide more personalized responses in future interactions.
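The retrieval and response-generation steps above can be illustrated with a small, self-contained sketch. It is not the project's LangChain or ChromaDB code: the in-memory `store`, the three-dimensional toy vectors, the function names, and the prompt wording are all hypothetical, and cosine similarity is assumed as the similarity measure because the report does not state which metric was configured.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          store: List[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[str]:
    """Return the texts of the k stored chunks most similar to the query vector."""
    ranked = sorted(store,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: List[str]) -> str:
    """Combine retrieved chunks and the user's question into one LLM prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy vectors stand in for real embeddings (hypothetical values).
store = [
    ("chunk about admissions", [0.9, 0.1, 0.0]),
    ("chunk about campus facilities", [0.1, 0.9, 0.0]),
    ("chunk about student clubs", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.0]  # pretend embedding of the user's question

context = top_k(query_vec, store, k=2)
print(build_prompt("How do I apply?", context))
```

In the real system, the query vector would come from the same embedding model used for the chunks, and the assembled prompt would be sent to Gemini.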
3 Project Scope
Data Gathering:
The project begins with acquiring important information from Hoa Sen University (HSU).
Webpages are scraped from HSU's official website, with a focus on academics, student life,
facilities, and general university information. In addition, documents providing essential
information concerning HSU are collected. To facilitate efficient storage and processing,
scraped data and document content are separated into manageable chunks. These chunks are
subsequently converted to numerical representations via Google AI's embedding technology.
Finally, the embedded data chunks are saved in a ChromaDB vector database, which is designed
to store and retrieve large volumes of text data based on semantic similarity. This method builds
a comprehensive knowledge base from which the chatbot can present users with correct
and relevant information.
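As a rough illustration of the scraping step, the sketch below extracts visible text from an HTML page. It deliberately uses Python's standard-library `html.parser` instead of BeautifulSoup (which the report mentions as a possible tool) so that the example stays self-contained. The class and function names are hypothetical, and fetching the pages themselves is left out.

```python
from html.parser import HTMLParser
from typing import List


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and drop empty text nodes.
        cleaned = " ".join(data.split())
        if cleaned and self._skip_depth == 0:
            self._parts.append(cleaned)

    def text(self) -> str:
        return " ".join(self._parts)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as a single string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Admissions</h1><p>Apply   online.</p></body></html>"
)
print(html_to_text(sample))
```

The extracted text of each page is what would then be chunked and embedded.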
Chatbot Development:
The chatbot’s functionality is based on LangChain, a framework for creating advanced
conversational bots. The project uses the ChromaDB database for information retrieval,
allowing the chatbot to access the relevant knowledge base during conversations. To interpret user questions and provide natural responses, the chatbot incorporates Google's Gemini 1.5 language model. A basic conversational flow is created to guide user interactions, and a small
chat history function is incorporated to save the current conversation for future reference. This
combination of technologies results in a chatbot that can interact with users, interpret their
questions, retrieve relevant information, and deliver informative responses.
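A small chat history function of the kind described above might look like the following sketch. The class name `ChatHistory`, the `max_turns` limit, and the "User:/Assistant:" rendering are assumptions made for illustration; the report does not describe how the history is stored or passed to the model.

```python
from collections import deque
from typing import Deque, Tuple


class ChatHistory:
    """Keeps the most recent (user, assistant) turns of a single conversation."""

    def __init__(self, max_turns: int = 5) -> None:
        # A bounded deque silently drops the oldest turn once it is full.
        self._turns: Deque[Tuple[str, str]] = deque(maxlen=max_turns)

    def add_turn(self, user_message: str, bot_reply: str) -> None:
        self._turns.append((user_message, bot_reply))

    def as_prompt_prefix(self) -> str:
        """Render the stored turns as text to prepend to the next prompt."""
        lines = []
        for user_message, bot_reply in self._turns:
            lines.append(f"User: {user_message}")
            lines.append(f"Assistant: {bot_reply}")
        return "\n".join(lines)

    def __len__(self) -> int:
        return len(self._turns)


history = ChatHistory(max_turns=2)
history.add_turn("first question", "first answer")
history.add_turn("second question", "second answer")
history.add_turn("third question", "third answer")  # evicts the first turn
print(history.as_prompt_prefix())
```

Because this sketch holds one shared history, it matches a design without separate sessions; keeping one `ChatHistory` per user would be one way to add them.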
Data from social platforms: The project will not include data from social media platforms such as Facebook or Instagram, focusing instead on the official website and the provided documents.
Multi-lingual support: The chatbot will primarily operate in Vietnamese.
Advanced voice interaction: The chatbot will primarily be text-based.
Complex user authentication: User accounts and authentication will be kept simple for the
initial version.
Real-time data updates: The chatbot’s knowledge base will be updated periodically, but it will not have real-time access to dynamically changing information.
Different chat sessions: The chatbot does not implement separate chat sessions for different
users.
4 Results
Image 2: User's input
Image 3: Chatbot’s output
The chatbot was able to retrieve relevant information from the knowledge base and generate coherent responses to user queries.
Notably, the chatbot consistently retrieved accurate answers from the ChromaDB vector
database, even when user questions were phrased differently from the original text within the
scraped webpages and documents.
For example, when a user inquired about the university's history, the chatbot retrieved information from a webpage about the university's founding and milestones, successfully
identifying and extracting the relevant information from a child page within the HSU website.
This demonstrates the chatbot's ability to draw on content from across the
HSU website and identify relevant passages based on user questions, indicating that it has
implemented the core processes of data gathering, embedding, storage, and
retrieval as designed.
This result indicates that the project's core components were implemented successfully and supports the use of Google AI's embedding technology and ChromaDB for
storing and retrieving information based on semantic similarity. It shows that the chatbot can
find relevant information across a complex web structure, providing a
valuable resource for users seeking information about Hoa Sen University.
5 Summary
The HSU AI Chatbot project aimed to develop a conversational assistant that would serve
as a comprehensive guide for prospective students and current customers of the university. The
project utilized advanced natural language processing (NLP) and a knowledge base built from
data scraped from HSU's official website and provided documents.
The project faced several challenges:
Data Gathering: While the goal was to collect data from both the website and social media platforms, the project ultimately focused on the website and provided documents due to
limitations. The sheer volume of data collected from the website posed a challenge in terms of
processing and storage.
Data Integration: Finding a suitable method to integrate the data into ChromaDB, a vector database optimized for semantic search, proved to be a hurdle.
Agent Development: Creating a custom agent that could effectively utilize the knowledge base and provide engaging responses required considerable effort.
User Interface: Developing a user interface for interacting with the chatbot presented a significant challenge.
Despite these difficulties, the project demonstrated the core functionality of the chatbot. It retrieved relevant information from the ChromaDB database, which supports the effectiveness of the embedding technology and data storage methods. The chatbot was able
to draw on content from across the website and identify pertinent information, even when user questions were phrased differently from the original text.
This project laid the groundwork for future development and highlighted the need to address the challenges encountered, particularly in relation to user interface development, data
integration, and the inclusion of social media data. The project demonstrated the potential of AI chatbots to enhance user experience and provide valuable information about Hoa Sen
University.
6 Reference
Custom agent | LangChain (n.d.)
Chroma Docs (n.d.) https://docs.trychroma.com/
Quickstart: Send requests to the Vertex AI API for Gemini (n.d.) Google Cloud https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/quickstart-multimodal