STUDENT RESEARCH REPORT
Building AI applications based on URL classification to improve advertising performance
TEAM LEADER INFORMATION
- Program: Management Information System
- Address: Thanh Xuan, Hanoi
- Phone no /Email: 21070222@vnu.edu.vn
II Academic Results (from the first year to now)
Overall score Academic rating
2021-2022
III Other achievements:
Hanoi, April 15, 2024
Advisor Team Leader
(Sign and write full name) (Sign and write full name)
Truong Cong Doan Bui Khanh Linh
[TABLE OF CONTENTS]
INTRODUCTION
1 Project Code
2 Member List
4 Abstract
5 Keywords
SUMMARY REPORT IN STUDENT RESEARCH
Literature Review
Data & Methodology
2.1 Data
2.2.1 Model Architecture
2.2.2 Evaluation Matrix
3 Results & Discussions
3.1 Environment and Hyperparameters Setup
3.2 Result
3.3 Discussion
4 Conclusion & Recommendations
[LIST OF TABLES]
Table 1 Hyperparameter
Table 2 Accuracy comparison with other methods
[LIST OF FIGURES]
Figure 1 Sample of the dataset with URL links and Categories
Figure 2 Example of processing data
Figure 3 Overview of the dataset
Figure 5 Ensemble Model
Figure 6 Evaluation metrics formula
Figure 7 (title illegible)
Figure 8 Model deployment
INTRODUCTION
Building AI applications based on URL classification to improve advertising performance
1 Project Code
CN.NC.SV.23_ 15
2 Member List:
Full Name Class ID
Bui Khanh Linh MIS2021A 21070222
Nguyen Quynh Trang MIS2021A 21070365
Dr. Truong Cong Doan - Faculty of Applied Sciences.
4 Abstract:
Nowadays, there are still many challenges in bringing products to many customers smartly and effectively through marketing processes. One of the main reasons is that businesses need to optimize customer data analysis. From there, many issues arise, such as real-time decision making, audience segmentation, and optimizing ad quality. Applying AI to improve advertising performance is a good way to address these problems. AI automates time-consuming tasks, allowing advertisers to focus on strategy and creativity rather than manual data analysis and adjustments. It makes marketing more efficient and accurate, which not only saves costs but also creates a competitive advantage. In addition, AI technologies continuously learn and adapt based on new data, ensuring ongoing optimization and improved performance over time.
Our research is centered on developing a robust AI solution for URL classification in advertising. We employ a combination of deep learning Convolutional Neural Network (CNN) models, several machine learning models combined through ensemble techniques, and meticulous hyperparameter tuning to optimize performance. This comprehensive approach aims to enhance the precision of advertising placement and optimization, maximizing visibility and relevance to the intended audience. Moreover, our system features a user-friendly interface that lets users efficiently classify URL links into relevant categories. While our current efforts do not extend to the analysis of user behavior or demographic factors, implementing this model serves as a foundational step toward enhancing advertising performance.
In summary, the research aims to develop an AI application that incorporates machine learning and deep learning techniques to enhance advertisement placement accuracy and ultimately maximize return on investment for advertisers. The expected results of this study will provide actionable insights and recommendations, with which we aim to empower advertisers and publishers alike in navigating the dynamic landscape of digital marketing. Through ongoing refinement and exploration of AI modeling methodologies, we anticipate further advancements in the performance and impact of online advertising initiatives.
5 Keywords:
AI-driven advertising, URL classification, Deep learning for marketing, Ensemble learning techniques, Hyperparameter tuning
SUMMARY REPORT IN STUDENT RESEARCH, 2023-2024 ACADEMIC YEAR
Literature Review
The optimization of online advertising performance through URL classification has garnered significant attention in recent years. This section provides an overview of existing research in the field, focusing on the application of ensemble learning and hyperparameter tuning techniques to address this challenge.
Previous studies have highlighted the importance of URL classification in enhancing advertising performance. For example, Hung and Diep (2022) proposed a CNN-based method for URL classification, achieving an F1-score of 0.9759. Their study demonstrated the effectiveness of CNNs in improving ad targeting and conversion rates, emphasizing the potential of deep learning models in online advertising. Similarly, Nugroho and Suhartanto (2020) addressed the challenges associated with training DenseNet models by employing hyperparameter tuning techniques. By optimizing learning rate and batch size using random search, their study achieved an average accuracy of 95%, showcasing the effectiveness of hyperparameter tuning in enhancing advertising performance through improved model optimization. Furthermore, Phaisangittisagul (2019) proposed an algorithm based on deep learning and text models for target advertising classification. Their study achieved an accuracy of 82.95% on testing data, highlighting the potential of advanced models in classifying advertising content and enhancing targeted advertising strategies.
In this research, we extend the existing literature by incorporating ensemble learning and hyperparameter tuning methodologies. Our study investigates various machine learning algorithms, including Decision Trees, Random Forests, XGBoost, Logistic Regression, Support Vector Machines (SVM), and deep learning CNNs, to identify the most effective approach for URL classification and advertising performance optimization.
One of our challenges is that our data set is limited to Vietnam, which makes it difficult to expand to the global market. Future research directions will include how to process data from different languages, combined with other advanced aggregation techniques such as stacking and aggregation pruning for targeted advertising strategies.
In conclusion, the literature review provides a comprehensive overview of existing research in URL classification and advertising performance optimization. By leveraging ensemble learning and hyperparameter tuning techniques, our research aims to contribute to the advancement of personalized and efficient online advertising strategies.
Data & Methodology
2.1 Data
In our research, we ran experiments on a dataset provided by a website browser. This dataset contains approximately 380,000 URLs, categorized into 19 distinct classes. These classifications serve dual purposes: aiding in strategically displaying relevant advertisements on web pages based on their respective topics, and serving as training data for machine learning algorithms. All the URL contents are in Vietnamese, with URLs typically composed of Vietnamese words without accent marks. Classes include: Automotive, Books & Literature, Business & Finance, Careers, Education, Entertainment & Art, Family & Relationships, Food & Drink, Healthy Living, Home & Garden, News & Politics, Science & Technology, Sports, Style & Fashion, Travel, Real Estate, Games, Laws & Policies, Environment.
        URL                                               Category
205946  kienthuc.net.vn/giai-ma/3-con-giap-khong-can-t    Science & Technology
141599  24h.com.vn/am-thuc/ngon-quen-sau-voi-thit-ngam    Food & Drink
151951  riviu.vn/bun-dau-hang-khay-co-tuyen               Food & Drink
45137   tai-lieu.com/tai-lieu/hoan-thien-nghiep-vu-tu-    Business & Finance
355610  https://poki.com/en/g/the-walking-merge           Games
Figure 1 Sample of the dataset with URL links and Categories.
Preprocessing:
Text preprocessing methods play a crucial role in enhancing the performance of both traditional machine learning classifiers and deep learning models. The effectiveness of these methods varies across domains and languages, underscoring the importance of tailoring them to specific contexts. In our experiment, we employ several preprocessing techniques, including lowercasing, tokenization, stop-word removal, a pipeline, and additional steps. These methods are implemented using member functions of the ViUtils class from the Python Vietnamese Toolkit.
Lowercasing: Lowercasing converts all letters in a document to lowercase. The reason is that uppercase and lowercase letters typically convey the same meaning in text analysis. Treating them uniformly ensures consistency and avoids treating the same word differently based on its case. For instance, "Cat" and "cat" would be considered identical after lowercasing. Neglecting to perform lowercasing may decrease accuracy, as it introduces unnecessary variations that could negatively impact the analysis.
Stop-word removal: Stop words are frequently occurring words in a language that contribute little to the overall meaning of a text, such as articles, prepositions, and conjunctions. Removing stop words streamlines the text by eliminating redundant information, allowing the model to focus on more relevant content. This process typically does not have adverse effects on the model's performance. Furthermore, removing stop words reduces the dataset size, leading to shorter training times due to the reduced number of tokens. In the context of URL data preprocessing, specific types of stop words are identified and removed:
- Scheme: Common schemes like "http" or "https" provide information about the protocol used.
- Resource types: These denote server-side file types such as ".html", ".htm", or ".php", which are essential for web page retrieval but may not contribute substantially to content analysis.
- Delimiters: Characters like "-", ":", and "/" are used to separate components and words in URLs. While important for URL structure, they are often irrelevant for content analysis and can be safely removed.
Pipeline: The workflows and stages of a machine learning model can be automated and codified using a pipeline. Data extraction, preprocessing, model training, and deployment are all handled by the pipeline's consecutive phases. The pipeline is a key component of the product for systems that employ ML models. It incorporates the recommended data processing procedures needed to build a machine learning model that is well suited to a given collection of data. Furthermore, the pipeline makes it possible to run the model at scale. With an end-to-end pipeline design, the system can update its machine learning models quickly and regularly.
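As a rough illustration of the idea of consecutive phases, the sketch below chains named steps so that each step receives the output of the previous one. It uses only the Python standard library. The class name SimplePipeline and the three toy steps are assumptions made for this example and are not the pipeline used in the study.

```python
class SimplePipeline:
    """Runs named steps in order, feeding each output into the next step."""

    def __init__(self, steps):
        # steps: list of (name, callable) pairs; names are only for readability
        self.steps = list(steps)

    def run(self, data):
        for _name, step in self.steps:
            data = step(data)
        return data


# Hypothetical toy steps, for illustration only
pipeline = SimplePipeline([
    ("lowercase", str.lower),
    ("strip", str.strip),
    ("split", str.split),
])

print(pipeline.run("  Bun Dau Hang Khay  "))  # ['bun', 'dau', 'hang', 'khay']
```

Keeping the steps in one ordered list means the same sequence is applied at training time and at prediction time, which is the main practical benefit the paragraph describes.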
Domain removal: While domains provide valuable context in certain cases, such as identifying reputable sources, they often yield little insight into the specific topics or content of URLs. As such, we opt to remove domain information from the URLs during preprocessing, streamlining the dataset for subsequent analysis. Following this, numerical digits are removed from the text to eliminate numerical information that may not contribute significantly to the understanding of the content.
An example of the data preprocessing process with the original URL
bonbanh.com/can-tho/oto/toyota-vios-nam-2015
Original URL: bonbanh.com/lam-dong/oto-so-tu-dong-tu-nam-2016/page,2
URL after removing stopwords: bonbanh.com/lam-dong/oto-so-tu-dong-tu-nam-2016/page,2
URL after removing domains: /lam-dong/oto-so-tu-dong-tu-nam-2016/page,2
URL after removing numbers: /lam-dong/oto-so-tu-dong-tu-nam-/page,
URL after replacing underscores, hyphens, and slashes with spaces: lam dong oto so tu dong tu nam page,
Figure 2 Example of processing data.
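The sequence of steps in Figure 2 could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' url_to_text function: the regular expressions, the short list of resource types, and the rule that everything before the first "/" counts as the domain are all assumptions made for illustration.

```python
import re

# Assumed patterns for the URL "stop words" described in the text
SCHEME_RE = re.compile(r"^[a-z][a-z0-9+.-]*://")   # e.g. "http://", "https://"
RESOURCE_RE = re.compile(r"\.(html?|php)\b")       # ".html", ".htm", ".php"


def url_to_text(url):
    """Turn a URL into a space-separated string of words (illustrative only)."""
    text = url.lower()
    # 1. Remove stop words: scheme and resource-type suffixes
    text = SCHEME_RE.sub("", text)
    text = RESOURCE_RE.sub("", text)
    # 2. Remove the domain: assume it is everything before the first "/"
    _domain, sep, path = text.partition("/")
    text = sep + path
    # 3. Remove numerical digits
    text = re.sub(r"\d+", "", text)
    # 4. Replace underscores, hyphens, slashes, and colons with spaces
    text = re.sub(r"[-_/:]+", " ", text)
    return " ".join(text.split())


print(url_to_text("bonbanh.com/lam-dong/oto-so-tu-dong-tu-nam-2016/page,2"))
# lam dong oto so tu dong tu nam page,
```

The trailing comma is left in place because this sketch, like the example in Figure 2, only replaces the listed delimiters.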
Tokenization: Tokenization is the text preprocessing step for the CNN model that breaks a piece of text into smaller units known as tokens. These tokens can be individual words, characters, or character n-grams, providing flexibility in subsequent processing tasks. In our experiments, we predominantly encounter URLs sourced from Vietnamese newspapers and forums. Notably, these URLs are often crafted with SEO (Search Engine Optimization) techniques, featuring SEO-friendly structures that include descriptive words reflecting the content or page title rather than query strings or parameters. During this preprocessing phase, we incorporate several additional steps tailored to our dataset:
- Accent removal: Vietnamese URLs frequently feature accent marks, which are systematically removed during preprocessing to ensure uniformity and ease of analysis.
- Single words as tokens: Vietnamese words can be composed of multiple smaller words joined together. In our experiment, we opt to split these compound words into their constituent parts, considering single words as individual tokens. This approach facilitates clearer token representation, particularly given the prevalence of unaccented or accent-removed words in our dataset.
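The two steps above can be illustrated with a small sketch. It is a standard-library stand-in written for this example, not the ViUtils functions used in the study: the function names remove_accents and tokenize are assumptions, and splitting on whitespace is assumed to be enough to obtain single words as tokens.

```python
import unicodedata


def remove_accents(text):
    """Strip Vietnamese diacritics using Unicode decomposition."""
    # "đ"/"Đ" have no canonical decomposition, so they are mapped explicitly
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (category "Mn"), keeping the base letters
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def tokenize(text):
    """Lowercase, remove accents, and split into single-word tokens."""
    return remove_accents(text).lower().split()


print(remove_accents("Ẩm thực Việt Nam"))   # Am thuc Viet Nam
print(tokenize("Bún đậu Hàng Khay"))        # ['bun', 'dau', 'hang', 'khay']
```

Treating each syllable as its own token avoids the need for a word segmenter, which would be unreliable on text that has already lost its accent marks.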
Stemming words:
After the above preprocessing steps, which are defined in url_to_text, we apply stemming for the SVM model to reduce words to their root or base form. To achieve this, we use a Snowball-derived algorithm provided by the Natural Language Toolkit (NLTK). This algorithm systematically removes suffixes from words to derive their root form, allowing us to capture the core meaning of each word regardless of its grammatical variations. The StemmedCountVectorizer class, a custom implementation derived from scikit-learn's CountVectorizer, integrates stemming into the tokenization process by overriding the build_analyzer method so that the stemmer is applied to each token. By applying word stemming during tokenization, we ensure that words with similar meanings are consistently represented, enhancing the model's ability to extract meaningful features from text data. This preprocessing step not only reduces the dimensionality of the feature space but also improves the generalization ability of our machine learning model by promoting semantic consistency across documents. Overall, stemming serves as a basic preprocessing technique that plays an important role in preparing text data for the next stages of our classification process.
The statistics by class are shown in Figure 3. The data can be seen to be quite unbalanced: the number of samples per class varies from about 10,000 to about 45,000.
[Chart: Dataset, number of samples per class]
Figure 3 Overview of the dataset.
We evaluated a set of models comprising five machine learning models and one deep learning model, and found that they generally performed well, with minimal variation in accuracy on the test set. To enhance the efficacy of our research outcomes, we utilized the Majority Voting Ensemble technique, which aggregates predictions from multiple models. This process involves tallying the predictions for each class label and selecting the label with the highest number of votes.
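The tallying step can be sketched in a few lines. This is a minimal illustration rather than the study's implementation: the function names are hypothetical, and the tie-breaking rule (prefer the label predicted by the earliest model in the list) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label among one sample's model predictions."""
    counts = Counter(votes)
    top = max(counts.values())
    # Assumed tie-break: the first model (in list order) holding a top label wins
    for label in votes:
        if counts[label] == top:
            return label


def ensemble_predict(per_model_predictions):
    """per_model_predictions: one list of labels per model, aligned by sample."""
    return [majority_vote(list(votes)) for votes in zip(*per_model_predictions)]


# Three hypothetical models predicting two samples
preds = [
    ["Sports", "Travel"],
    ["Sports", "Games"],
    ["Travel", "Games"],
]
print(ensemble_predict(preds))  # ['Sports', 'Games']
```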
2.2.1 Model Architecture
Decision Tree
In machine learning, decision trees are an important tool for solving regression and classification problems. Understanding and explaining the relationship between attributes and outcomes is crucial during data analysis. By providing a structure that is simple to comprehend and resembles the decisions made by people, decision trees aid in this process.
A decision tree is built as follows. The tree begins at the root node and finishes at the leaf nodes. At each intermediate node, the data is split into sub-branches according to a criterion, which is typically a data feature together with a comparison threshold. Each node in the tree assesses a particular feature against a threshold value, and these features represent the characteristics of the data. The optimal split is determined during the data partitioning process using purity measures such as Gini impurity or entropy. This process continues until a stopping condition is reached, such as the tree's maximum depth or the point at which no further splits can be made.
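To make the purity measures concrete: for class proportions p_k in a node, Gini impurity is 1 - sum(p_k^2) and entropy is -sum(p_k * log2(p_k)). The sketch below computes both and searches one numeric feature for the threshold with the lowest weighted impurity. It is a simplified illustration with hypothetical function names, not the library implementation used in the experiments.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels, impurity=gini):
    """Find the split 'value <= t' with the lowest weighted child impurity."""
    n = len(labels)
    best_t, best_score = None, float("inf")
    # The largest value is skipped because it would leave the right side empty
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


print(gini(["a", "a", "b", "b"]))                              # 0.5
print(entropy(["a", "a", "b", "b"]))                           # 1.0
print(best_threshold([1, 2, 3, 4], ["a", "a", "b", "b"]))      # (2, 0.0)
```

A full tree applies this search over every feature at every node and recurses on the two resulting subsets until a stopping condition is met.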
Random Forest
The Random Forest model is one of the important methods in machine learning used for both classification and regression problems. The process of building the Random Forest model includes the following steps. First, from the training data set, several subsamples are randomly drawn through the bootstrap sampling method. Next, at each node of each tree, only a random subset of features is selected to best evaluate the