Scientific research report: Building AI applications based on URL classification to improve advertising performance



STUDENT RESEARCH REPORT

Building AI applications based on URL classification to improve advertising performance


TEAM LEADER INFORMATION

- Program: Management Information System

- Address: Thanh Xuan, Hanoi

- Phone no /Email: 21070222@vnu.edu.vn

II Academic Results (from the first year to now)

Overall score Academic rating

2021-2022

III Other achievements:

Hanoi, April 15, 2024

Advisor Team Leader

(Sign and write full name) (Sign and write full name)

Truong Cong Doan Bui Khanh Linh


[TABLE OF CONTENTS]

SUMMARY REPORT IN STUDENT RESEARCH
2.2.1 Model Architecture
2.2.2 Evaluation Matrix
3 Results & Discussions
3.1 Environment and Hyperparameters Setup
3.2 Result
3.3 Discussion
4 Conclusion & Recommendations


[LIST OF TABLES]

Table 1 Hyperparameter

Table 2 Accuracy comparison with other methods


[LIST OF FIGURES]

Figure 1 Sample of the dataset with URL links and Categories

Figure 2 Example of processing data

Figure 3 Overview of the dataset

Figure 5 Ensemble Model

Figure 6 Evaluation metrics formula

Figure 7 [illegible]

Figure 8 Model deployment


INTRODUCTION

Building AI applications based on URL classification to improve advertising performance

1 Project Code

CN.NC.SV.23_15

2 Member List:

Full Name Class ID

Bui Khanh Linh MIS2021A 21070222

Nguyen Quynh Trang MIS2021A 21070365

3 Advisor:

Dr. Truong Cong Doan - Faculty of Applied Sciences.

4 Abstract:

Nowadays, businesses still face many challenges in bringing products to customers smartly and effectively through marketing. One of the main reasons is that businesses need to optimize customer data analysis. From there, many issues arise, such as real-time decision making, audience segmentation, and optimizing ad quality. Applying AI to improve advertising performance is a good way to address these problems. AI automates time-consuming tasks, allowing advertisers to focus on strategy and creativity rather than manual data analysis and adjustments. It makes marketing more efficient and accurate, which not only saves costs but also creates a competitive advantage. Besides, AI technologies continuously learn and adapt based on new data, ensuring ongoing optimization and improved performance over time.

Our research is centered on developing a robust AI solution for URL classification in advertising. We employ a combination of deep learning Convolutional Neural Network (CNN) models, several machine learning models combined using ensemble techniques, and meticulous hyperparameter tuning to optimize performance. This comprehensive approach aims to enhance the precision of advertising placement and optimization, maximizing visibility and relevance to the intended audience. Moreover, our system features a user-friendly interface that lets users efficiently classify URL links into relevant categories. While our current efforts do not extend to the analysis of user behavior or demographic factors, implementing this model serves as a foundational step toward enhancing advertising performance.

In summary, the research aims to develop an AI application that incorporates machine learning and deep learning techniques to enhance advertisement placement accuracy and ultimately maximize return on investment for advertisers. The expected results of this study will provide actionable insights and recommendations, with which we aim to empower advertisers and publishers alike in navigating the dynamic landscape of digital marketing. Through ongoing refinement and exploration of AI modeling methodologies, we anticipate further advancements in enhancing the performance and impact of online advertising initiatives.

5 Keywords:

AI-driven advertising, URL classification, Deep learning for marketing, Ensemble Learning Techniques, Hyperparameter tuning


SUMMARY REPORT IN STUDENT RESEARCH, 2023-2024 ACADEMIC YEAR

Literature Review

The optimization of online advertising performance through URL classification has garnered significant attention in recent years. This section provides an overview of existing research in the field, focusing on the application of ensemble learning and hyperparameter tuning techniques to address this challenge.

Previous studies have highlighted the importance of URL classification in enhancing advertising performance. For example, Hung and Diep (2022) proposed a CNN-based method for URL classification, achieving an F1-score of 0.9759. Their study demonstrated the effectiveness of CNNs in improving ad targeting and conversion rates, emphasizing the potential of deep learning models in online advertising. Similarly, Nugroho and Suhartanto (2020) addressed the challenges associated with training DenseNet models by employing hyperparameter tuning techniques. By optimizing learning rate and batch size using random search, their study achieved an average accuracy of 95%, showcasing the effectiveness of hyperparameter tuning in enhancing advertising performance through improved model optimization. Furthermore, Phaisangittisagul (2019) proposed an algorithm based on deep learning and text models for target advertising classification. Their study achieved an accuracy of 82.95% on testing data, highlighting the potential of advanced models in classifying advertising content and enhancing targeted advertising strategies.

In this research, we extend the existing literature by incorporating ensemble learning and hyperparameter tuning methodologies. Our study investigates various machine learning algorithms, including decision trees, random forests, XGBoost, Logistic Regression, Support Vector Machines (SVM), and deep learning CNNs, to identify the most effective approach for URL classification and advertising performance optimization.


One of our challenges is that our dataset is limited to Vietnam, which makes it difficult to expand to the global market. Future research directions will include how to process data from different languages, combined with other advanced ensemble techniques such as stacking and ensemble pruning, for targeted advertising strategies.

In conclusion, the literature review provides a comprehensive overview of existing research in URL classification and advertising performance optimization. By leveraging ensemble learning and hyperparameter tuning techniques, our research aims to contribute to the advancement of personalized and efficient online advertising strategies.

Data & Methodology

2.1 Data

In our research, we ran experiments on a dataset provided by a web browser. This dataset comprises approximately 380,000 URLs, categorized into 19 distinct classes. These classifications serve two purposes: aiding in strategically displaying relevant advertisements on web pages based on their respective topics, and serving as training data for machine learning algorithms. All the URL contents are in Vietnamese, with URLs typically composed of Vietnamese words without accent marks. Classes include: Automotive, Books & Literature, Business & Finance, Careers, Education, Entertainment & Art, Family & Relationships, Food & Drink, Healthy Living, Home & Garden, News & Politics, Science & Technology, Sports, Style & Fashion, Travel, Real Estate, Games, Laws & Policies, Environment.


        URL                                               Category
205946  kienthuc.net.vn/giai-ma/3-con-giap-khong-can-t    Science & Technology
141599  24h.com.vn/am-thuc/ngon-quen-sau-voi-thit-ngam    Food & Drink
151951  riviu.vn/bun-dau-hang-khay-co-tuyen               Food & Drink
45137   tai-lieu.com/tai-lieu/hoan-thien-nghiep-vu-tu-    Business & Finance
355610  https://poki.com/en/g/the-walking-merge           Games

Figure 1 Sample of the dataset with URL links and Categories.

Preprocessing:

Text preprocessing methods play a crucial role in enhancing the performance of both traditional machine learning classifiers and deep learning models. The effectiveness of these methods varies across domains and languages, underscoring the importance of tailoring them to specific contexts. In our experiment, we employ several preprocessing techniques, including lowercasing, tokenization, stop-word removal, a pipeline, and additional steps. These methods are implemented using member functions of the ViUtils class from the Python Vietnamese Toolkit.

Lowercasing: Lowercasing converts all letters in a document to lowercase. The reason is that uppercase and lowercase letters typically convey the same meaning in text analysis. Treating them uniformly ensures consistency and avoids treating the same word differently based on its case. For instance, "Cat" and "cat" would be considered identical after lowercasing. Neglecting to perform lowercasing may result in decreased accuracy, as it introduces unnecessary variations that could impact the analysis negatively.

Stop-word removal: Stop words are frequently occurring words in a language that contribute little to the overall meaning of a text, such as articles, prepositions, and conjunctions. Removing stop words helps streamline the text by eliminating redundant information, allowing the model to focus on more relevant content. This process typically does not have adverse effects on the model's performance. Furthermore, removing stop words reduces the dataset size, leading to shorter training times due to the reduced number of tokens. In the context of URL data preprocessing, specific types of stop words are identified and removed:

- Scheme: Common schemes like "http" or "https" provide information about the protocol used.

- Resource types: These denote server-side file types such as ".html", ".htm", or ".php", which are essential for web page retrieval but may not contribute substantially to content analysis.

- Delimiters: Characters like "-", ":", and "/" are used to separate components and words in URLs. While important for URL structure, they are often irrelevant for content analysis and can be safely removed.

Pipeline: A pipeline automates and codifies the workflow and stages of a machine learning model. Data extraction, preprocessing, model training, and deployment are all handled by the pipeline's consecutive phases. For systems that employ ML models, the pipeline is a key component of the product. It incorporates all recommended data processing procedures to build a machine learning model that is optimal for a given collection of data. Furthermore, the pipeline makes it possible to run the model at scale. With an end-to-end pipeline design, the system can quickly and regularly update its machine learning models.
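As a rough illustration of the pipeline idea described above, the sketch below chains a vectorizer and a classifier with scikit-learn's Pipeline class. The choice of scikit-learn, the step names, the classifier, and the four toy training texts are all assumptions made for this example; the report does not state how its pipeline is built.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy, already-preprocessed URL texts (hypothetical examples, not the real dataset).
texts = ["mon ngon bun dau", "oto toyota vios", "quan an ngon", "xe oto cu"]
labels = ["Food & Drink", "Automotive", "Food & Drink", "Automotive"]

# Each step is a (name, estimator) pair; fit() runs the steps in order.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),                  # text -> token counts
    ("classifier", LogisticRegression(max_iter=1000)),  # counts -> category
])
pipeline.fit(texts, labels)

# The same object applies vectorization and classification to new text.
prediction = pipeline.predict(["bun dau ngon"])[0]
print(prediction)
```

Because feature extraction and the model live in one object, retraining or swapping a step only requires changing the list of steps.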

Domain removal: While domains provide valuable context in some settings, such as identifying reputable sources, they often yield little insight into the specific topics or content of URLs. As such, we remove domain information from the URLs during preprocessing, streamlining the dataset for subsequent analysis. Following this, numerical digits are removed from the text to eliminate numerical information that may not contribute significantly to the understanding of the content.


An example of the data preprocessing process with the original URL

bonbanh.com/can-tho/oto/toyota-vios-nam-2015

Original URL: bonbanh.com/lam-dong/oto-so-tu-dong-tu-nam-2016/page,2

URL after removing stopwords: bonbanh.com/lam-dong/oto-so-tu-dong-tu-nam-2016/page,2

URL after removing domains: /lam-dong/oto-so-tu-dong-tu-nam-2016/page,2

URL after removing numbers: /lam-dong/oto-so-tu-dong-tu-nam-/page,

URL after replacing underscores, hyphens, and slashes with spaces: lam dong oto so tu dong tu nam page,

Figure 2 Example of processing data.
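The steps in Figure 2 can be sketched as a single function. This is a minimal illustration rather than the report's actual code: only the name url_to_text appears in the report, and the stop-token lists, regular expressions, and step order below are assumptions. Unlike Figure 2, this sketch also treats the comma as a delimiter, so the trailing "," does not survive.

```python
import re

# Assumed stop tokens, taken from the categories named in the text.
SCHEMES = ("https://", "http://")
RESOURCE_TYPES = (".html", ".htm", ".php")


def url_to_text(url: str) -> str:
    """Turn a raw URL into a space-separated string of word tokens."""
    text = url.lower()
    # 1. Drop the scheme, if present.
    for scheme in SCHEMES:
        if text.startswith(scheme):
            text = text[len(scheme):]
            break
    # 2. Drop server-side resource types.
    for ext in RESOURCE_TYPES:
        text = text.replace(ext, "")
    # 3. Drop the domain: keep only what follows the first "/".
    _, sep, path = text.partition("/")
    text = path if sep else ""
    # 4. Drop digits.
    text = re.sub(r"\d+", "", text)
    # 5. Replace delimiters with spaces, then collapse repeated whitespace.
    text = re.sub(r"[-_/:,.?=&]+", " ", text)
    return " ".join(text.split())


print(url_to_text("bonbanh.com/lam-dong/oto-so-tu-dong-tu-nam-2016/page,2"))
# lam dong oto so tu dong tu nam page
```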

Tokenization: Tokenization is the text preprocessing step for the CNN model that breaks a piece of text into smaller units known as tokens. These tokens can be individual words, characters, or character n-grams, providing flexibility in subsequent processing tasks. In our experiments, we predominantly encounter URLs sourced from Vietnamese newspapers and forums. Notably, these URLs are often crafted with SEO (Search Engine Optimization) techniques, featuring SEO-friendly structures that include descriptive words reflecting the content or page title rather than query strings or parameters. During this preprocessing phase, we incorporate several additional steps tailored to our dataset:

- Accent removal: Any accent marks that appear in Vietnamese URLs are systematically removed during preprocessing to ensure uniformity and ease of analysis.

- Single words as tokens: Vietnamese words can be composed of multiple smaller words joined together. In our experiment, we split these compound words into their constituent parts, treating single words as individual tokens. This approach gives a clearer token representation, particularly given the prevalence of unaccented or accent-removed words in our dataset.
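The two steps above can be sketched with the standard library alone. This is an illustrative stand-in, not the ViUtils implementation the report relies on; the function names remove_accents and tokenize are hypothetical here.

```python
import unicodedata


def remove_accents(text: str) -> str:
    """Strip Vietnamese diacritics, e.g. "Hà Nội" -> "Ha Noi"."""
    # "đ"/"Đ" have no canonical decomposition, so map them explicitly.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def tokenize(text: str) -> list[str]:
    """Treat every whitespace-separated single word as one token."""
    return remove_accents(text).lower().split()


print(tokenize("Bún đậu Hàng Khay"))
# ['bun', 'dau', 'hang', 'khay']
```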

Stemming words:

After the above preprocessing steps are defined in url_to_text, we apply stemming for the SVM model, reducing words to their root or base form. To achieve this, we use a Snowball-derived algorithm provided by the Natural Language Toolkit (NLTK). This algorithm systematically removes suffixes from words to derive their root form, allowing us to capture the core meaning of each word regardless of its grammatical variations. The StemmedCountVectorizer class, a custom implementation derived from scikit-learn's CountVectorizer, integrates stemming into the tokenization process. In this class, the build_analyzer method is overridden to incorporate the stemmer into the tokenization process. By applying stemming during tokenization, we ensure that words with similar meanings are consistently represented, enhancing the model's ability to extract meaningful features from text data. This preprocessing step not only reduces the dimensionality of the feature space but also improves the generalization ability of our machine learning model by promoting semantic consistency across documents. Overall, stemming serves as a basic preprocessing technique that plays an important role in preparing text data for the next stages of our classification process.
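A common way to write such a class is sketched below. The class name and the overridden build_analyzer method come from the report; the choice of the English Snowball stemmer and the toy input are assumptions, since the report does not say which stemmer language is used.

```python
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: the English Snowball stemmer; the report does not name the language.
stemmer = SnowballStemmer("english")


class StemmedCountVectorizer(CountVectorizer):
    """CountVectorizer that stems every token produced by the default analyzer."""

    def build_analyzer(self):
        # Reuse the parent's preprocessing and tokenization, then stem each token.
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(token) for token in analyzer(doc)]


vectorizer = StemmedCountVectorizer()
counts = vectorizer.fit_transform(["running runs", "run quickly"])
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
# ['quick', 'run']
```

Here "running", "runs", and "run" collapse into a single feature, which is the reduction in feature-space dimensionality the paragraph describes.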

The statistics by class are shown in Figure 3. The data is quite unbalanced: the number of samples per class varies from about 10,000 to about 45,000.

[Figure 3 Overview of the dataset: bar chart titled "Dataset" showing the number of samples in each class]


... comprising five machine learning models and one deep learning model, and found that they generally performed well, with minimal variation in accuracy on the test set. To improve our results further, we used the Majority Voting Ensemble technique, which aggregates predictions from multiple models. This process tallies the predictions for each class label and selects the label with the highest number of votes.
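The voting rule can be sketched in a few lines. This is a minimal illustration of hard majority voting, not the report's implementation; the function names and the tie-breaking rule (the label seen first wins) are assumptions.

```python
from collections import Counter


def majority_vote(predictions: list[str]) -> str:
    """Return the label predicted by the most models for one sample."""
    counts = Counter(predictions)
    # most_common orders equal counts by first occurrence,
    # so ties go to the earliest model in the list (an assumption).
    return counts.most_common(1)[0][0]


def ensemble_predict(per_model_predictions: list[list[str]]) -> list[str]:
    """Combine predictions given as one list of labels per model."""
    # zip(*...) regroups the labels so that each tuple holds all votes for one sample.
    return [majority_vote(list(votes)) for votes in zip(*per_model_predictions)]


per_model = [
    ["Sports", "Travel", "Games"],  # model 1
    ["Sports", "Games", "Games"],   # model 2
    ["Travel", "Travel", "Games"],  # model 3
]
print(ensemble_predict(per_model))
# ['Sports', 'Travel', 'Games']
```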

2.2.1 Model Architecture

Decision Tree

In machine learning, decision trees are an important tool for solving regression and classification problems. Understanding and explaining the relationship between attributes and outcomes is crucial during data analysis. Decision trees aid in this process by providing a structure that is simple to comprehend and resembles the way people make decisions.

A decision tree's architecture can be described as follows. The tree begins at the root node and finishes at the leaf nodes. At each intermediate node, the data is split into sub-branches according to a criterion, which is typically a data feature and a comparison threshold. Each node in the tree assesses a particular feature together with a threshold value, and these features represent the characteristics of the data. During partitioning, the optimal split is determined using purity measures such as Gini impurity or entropy. This process continues until a stopping condition is reached, such as the tree's maximum depth or the point at which no further splits can be made.
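To make the purity measures concrete, the sketch below computes Gini impurity and entropy for a list of labels and searches one numeric feature for the threshold with the lowest weighted impurity. It is an illustrative, simplified version of how a single node could be split, under the assumption of "value <= threshold" splits; the function names are hypothetical.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum of p * log2(p) over classes."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels, impurity=gini):
    """Try each candidate threshold and keep the lowest weighted impurity."""
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


print(gini(["a", "a", "b", "b"]))     # 0.5
print(entropy(["a", "a", "b", "b"]))  # 1.0
print(best_threshold([1, 2, 3, 4], ["a", "a", "b", "b"]))  # (2, 0.0)
```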

Random Forest

The Random Forest model is one of the important methods in machine learning, used for both classification and regression problems. The process of building the Random Forest model includes the following steps. First, from the training data set, several subsamples are randomly taken through the bootstrap sampling method. Next, at each node of each tree, only a random subset of features is selected to best evaluate the


Title	Building AI applications based on URL classification to improve advertising performance
Author	Bui Khanh Linh
Advisor	Truong Cong Doan
Institution	Vietnam National University, Hanoi
Major	Management Information System
Type	Student Research Report
Year of publication	2024
City	Hanoi
