Application of deep learning and text embedding methods for self-admitted technical debt detection


Application of deep learning and text embedding methods for self-admitted technical debt detection

TRAN THI DINH

dinh.tt212255m@sis.hust.edu.vn

Thesis advisor : Dr Bui Thi Mai Anh

Signature of advisor

Department : Department of Software Engineering

Institute : School of Information and Communication Technology

Hanoi, 10-2023


THESIS ASSIGNMENT

1 Student’s information :

Name : Tran Thi Dinh

Phone : 0971236392 Email: dinh.tt212255m@sis.hust.edu.vn

Class : Data Science (Elitech)

Affiliation : Hanoi University of Science and Technology

Duration : 10/2021 - 10/2023

2 Thesis title : Application of deep learning and text embedding methods for self-admitted technical debt detection

3 Declarations/Disclosures :

I hereby formally declare that I, Tran Thi Dinh, have performed the work and presentation in this thesis independently under the supervision of Dr Bui Thi Mai Anh. All of the results are genuine and are not copied from any other sources. All reference materials are clearly listed in the bibliography. I will accept full responsibility for even one copy that violates school regulations.

Hanoi, date month year 2023

Author

Tran Thi Dinh

4 Attestation of thesis advisor:

Hanoi, date month year 2023

Thesis Advisor

Dr Bui Thi Mai Anh


I would like to express my deep and heartfelt gratitude to the individuals whose support, guidance, and assistance have been the cornerstone of my journey in completing this thesis.

Foremost, I extend my sincerest thanks to Dr. Bui Thi Mai Anh and Dr. Nguyen Thanh Phuong, whose mentorship has been a beacon of wisdom and expertise. Their continuous support has not only illuminated the path of this research but has also significantly contributed to its depth and quality. The insightful feedback and constructive critique they have generously provided have steered my work in the right direction and fostered my academic growth.

Furthermore, beyond these formal acknowledgments, I wish to express my gratitude to my friends and family, whose unwavering support, encouragement, and invaluable insights have accompanied me throughout this thesis journey. Your camaraderie and motivation have been a constant source of inspiration, reminding me of the strength derived from community and shared aspirations.

In conclusion, I humbly acknowledge that this thesis would have remained a dream without the support and collaboration of these remarkable individuals. Their collective contributions have elevated the quality and depth of this research, and I am profoundly thankful for their dedication to my academic voyage.


ABSTRACT

Software teams typically turn to sub-optimal solutions that deviate from the best software development principles in order to strike a balance between short-term efficiency and long-term stability. Such solutions might lead to maintenance issues, so-called Technical Debt (TD), which need to be paid later on.

Previous studies have leveraged text-mining techniques for automated TD detection in source code comments (a.k.a. Self-Admitted Technical Debt, SATD), primarily focusing on object-oriented languages like Java. However, SATD detection becomes challenging in scripting languages such as R, which employ dynamic programming paradigms and have highly compact and algorithm-aligned comments.

In this thesis, we introduce DebtSniffer as a practical approach for detecting SATD in both R packages and Java source code. We utilize a code-embedding technique, i.e., pre-trained BERT models, to retain the rich semantic information embedded in R source code comments and Java source code. Subsequently, we apply graph convolutional networks to establish connections between scattered comment sentences and learn representations for both labeled training data and unlabeled test data by propagating label impact through the graph convolution. To assess the performance of DebtSniffer, we conducted experiments over 4,961 R comments from 503 open source projects, categorized into 12 TD classes, and over four Java sources: source code comments, commit messages, pull requests, and issue tracking systems.

The experimental results show that DebtSniffer accurately identifies SATD, outperforming the current state-of-the-art approaches based on traditional word embedding techniques.

Keywords: Self-Admitted Technical Debt, Pretrained BERT model, Graph Convolutional Network, Software Engineering

Contents

Abstract
1 Introduction
1.1 Problem Statement
1.2 Research Objectives
1.3 Contributions
1.4 Organization of Thesis
2 Related Work
2.1 General techniques for SATD detection
2.2 SATD detection in the R language
2.3 SATD detection in the Java language
3 Background
3.1 Convolutional Neural Networks (CNN)
3.1.1 Architecture
3.1.2 Applications
3.2 Graph Convolutional Networks (GCN)
3.2.1 Structure
3.2.2 Applications
3.3 Text embedding models
3.3.1 ALBERT (A Lite BERT)
3.3.2 RoBERTa (A Robustly Optimized BERT Pretraining Approach)
3.3.3 CodeBERT
3.3.4 Transfer Learning for Code Tasks
4 Methodology
4.1 Data encoding
4.2 Pretrained LM with CNN Model
4.3 Pretrained LM with GCN Model (DebtSniffer)
5 Experiments
5.1 Empirical Settings
5.1.1 Dataset and baselines
5.1.2 Settings
5.1.3 Metrics
5.2 Results and Discussion
5.2.1 Effectiveness of DebtSniffer on R dataset
5.2.2 Effectiveness of DebtSniffer on Java datasets
5.2.3 Ablation study
5.2.4 Threats to validity
6 Conclusions and Future works
6.1 Summary
6.2 Future works

List of Figures

1.1.1 An example of Self-Admitted Technical Debt of the Defects type in R
1.1.2 Real-world scenarios where SATD has caused substantial problems
1.1.3 Example - SATD Type: Algorithm in Java
1.1.4 Possible Resolution
3.1.1 A basic convolutional neural network (CNN) architecture
3.2.1 The graph convolutional neural network [14]
3.2.2 The graph convolutional neural network for text data
4.1.1 The SATD preprocessing workflow
4.2.1 The architecture of Pretrained LM with CNN model. Pretrained CodeBERT is used as an example
4.2.2 The input representation of BERT model
4.3.1 The architecture of Pretrained LM with GCN model
4.3.2 Schematic of GCN model
5.1.1 Comment length distribution in R dataset
5.1.2 SATD length distribution in Java dataset
5.2.1 Influence of the number of GC layers in SATD detection for R packages. The baseline has zero GC layers
5.2.2 Influence of the number of GC layers in SATD detection for the code comment source in Java. The baseline has zero GC layers

List of Tables

1.1.1 Taxonomy TD definitions, based on Codabux et al. [5]
1.1.2 Types of SATD with Examples
5.1.1 Statistics of the dataset
5.1.2 Number of different types of SATD
5.2.1 Comparison Results in R dataset (%)
5.2.2 Code comment dataset
5.2.3 Pull request dataset
5.2.4 Commit message dataset
5.2.5 Issue dataset
5.2.6 Influence of the number of GC layers on average F1-score of R language dataset and Java Code comment data


CNN Convolutional Neural Network.

GCN Graph Convolutional Network.

LM Language Models.

NLP Natural Language Processing.

OOP Object Oriented Programming.

PMI Pointwise Mutual Information.

PPMI Positive Pointwise Mutual Information.

RNN Recurrent Neural Networks.

SATD Self-Admitted Technical Debt.


Chapter 1

Introduction

SATD refers to code artifacts within a software project where developers explicitly acknowledge the presence of suboptimal or problematic code but do not immediately address it. SATD can manifest as code comments, such as "// TODO" or "// FIXME," and typically reflects issues related to design flaws, code smells, or deferred maintenance. The identification of SATD instances is crucial for several reasons: it allows development teams to prioritize technical debt repayment, maintain code quality, and reduce the risk of project delays and increased maintenance costs.

1.1 Problem Statement

In software development, developers frequently work against the clock. The pressures of deadlines and project constraints often force them to make expedient decisions and adopt shortcuts to speed up the development process. While these shortcuts may offer immediate relief, they can have long-term consequences [30]. One of the foremost concerns is the inadvertent creation of low-quality code. Hasty coding practices, driven by the urgency of project timelines, may result in code that lacks the robustness, efficiency, and maintainability that are the hallmarks of high-quality software. Consequently, such code can become a source of frustration for developers and may hinder the overall progress of the project.

The concept of Technical Debt (TD) was introduced by Cunningham [6], referring to the phenomenon of “not-quite-right code” that represents an incomplete, temporary, or sub-optimal solution. Debts incurred in the pursuit of short-term advantages must be repaid over the long term at an increasing cost [30].

Self-Admitted Technical Debt (SATD) represents a particular form of TD where developers leave comments within the source code to acknowledge sections that are not fully completed or optimized, indicating the need for further refinement or additional attention [17].

Figure 1.1.1: An example of Self-Admitted Technical Debt of the Defects type in R

Figure 1.1.1 is an example of SATD categorized as "Defect". In this example, the comment explicitly mentions the function name, 'validate input', and the variable 'threshold', indicating that there is an issue related to the initialization of the variable. However, it suggests deferring the resolution of this defect, which is characteristic of the "Defects" type of SATD.

Detecting and effectively dealing with SATD presents a multifaceted challenge in software development. One of the primary complexities is that SATD concerns often fly under the radar, remaining obscure and known to only a select few developers.

The difficulty of identifying and managing SATD is compounded by the urgency of the matter. Failing to address these debts in a timely and efficient manner can lead to a cascade of adverse consequences across various facets of software quality, making it imperative to allocate the requisite time and resources for resolution.

In essence, the proactive detection and management of SATD issues are critical components in the continuous pursuit of software quality enhancement. This process requires a thorough understanding of the various SATD types, as well as the adoption of effective strategies for their detection and subsequent mitigation. By doing so, software development teams can ensure that their projects remain on a trajectory towards improved reliability, maintainability, and overall excellence.

While Self-Admitted Technical Debt (SATD) may not lead to "disasters" in the sense of natural disasters, it can have significant negative impacts on software projects, organizations, and even individuals. SATD can lead to software failures, causing applications to crash or malfunction. For instance, in 2012, Knight Capital Group lost 440 million dollars in under an hour due to a trading algorithm error caused by an undetected software bug, which can be considered a financial disaster.

Figure 1.1.2: Real-world scenarios where SATD has caused substantial problems

Technical debt often includes poor security practices. Unresolved security issues in software can lead to data breaches, exposing sensitive information. The Equifax data breach in 2017, affecting 147 million people, was partly attributed to unpatched software vulnerabilities.

As a multi-paradigm programming language, R is being used more and more in applications related to data science and statistics [33]. Although the R end-user programming community has been growing, the bulk of contributors are statisticians and scientists rather than software engineers. Indeed, prior research has indicated that few R users are familiar with the nuances of the programming language and that they do not view themselves as developers [26]. While several studies deal with SATD in programming languages such as Java [30], little attention has been paid to the detection of technical debt in the R language. We hypothesize that the detection of SATD in R packages is more difficult than in other languages because R mixes several programming paradigms and is dynamically typed.

In the context of Java, which is widely used in various software projects, SATD detection becomes particularly relevant. SATD in Java refers to code segments where developers have knowingly introduced suboptimal solutions, which are often documented as comments within the code. Java codebases are typically large and complex, making it challenging to identify SATD manually. Moreover, SATD comments in Java can vary widely in content and style, making automated detection a non-trivial task. Several tools and frameworks are available for SATD detection in Java, including CodeBERT, DebtWatcher, and SATDClassifier, each offering different features and approaches.

SATD encompasses a broad spectrum of issues and shortcomings in software development. It recognizes that not all technical debt is created equal: just as in real life, debt comes in various forms. Their definitions are summarised in Table 1.1.1.

Table 1.1.1: Taxonomy TD definitions, based on Codabux et al. [5]

Debt type | Definition
Architecture | Refers to the problems encountered in product architecture, for example, violation of modularity, which can affect architectural requirements (e.g. performance, robustness).
Build | Refers to issues that make the build task harder and unnecessarily time-consuming. The build process can involve code that does not contribute value to the customer. Moreover, if the build process needs to run ill-defined dependencies, the process becomes unnecessarily slow. In the context of R, Build TD encompasses anything related to Travis, Codecov.io, GitHub Actions, CI, AppVeyor, CRAN, CMD [...]
Code | [...] negatively affect the legibility of the code, making it more difficult to maintain. Usually, this TD can be identified by examining the source code for issues related to bad coding practices. In the context of R, code debt encompasses anything related to renaming classes and functions, '<-' and '=', parameters and arguments in functions, FALSE/TRUE vs F/T, print vs warning/message.
Defect | Refers to known defects, usually identified by testing activities or by the user and reported on bug tracking systems.
Design | Refers to debt that can be discovered by analyzing the source code and identifying violations of the principles of good object-oriented design (e.g. very large or tightly coupled classes). In the context of R, design debt encompasses anything related to S3 classes and S4 methods, exporting functions with '@export' or the name pattern (visibility), internal functions with coupling issues, location of functions in the same file, selective importing '@import' (whole package) or '@importFrom' (a specific function), notations '::' and ':::', returning objects (dataframes or tibbles), and Tidyverse vs base R.
Documentation | Refers to the problems found in software project documentation and can be identified by looking for missing, inadequate, or incomplete documentation. In the context of R, documentation debt encompasses anything related to Roxygen2 (e.g., '@param', '@return', '@example'), Pkgdown, Readme files, and Vignettes.
Requirements | Refers to trade-offs made concerning what requirements the development team needs to implement or how to implement them. Some examples of this type of debt are: requirements that are only partially implemented, requirements that are implemented but not for all cases, requirements that are implemented but in a way that does not fully satisfy all the non-functional requirements (e.g. security, performance).
Test | Refers to issues found in testing activities that can affect the quality of those activities. Examples of this type of debt are planned tests that were not run, or known deficiencies in the test suite (e.g. low code coverage). In the context of R, test debt encompasses anything related to coverage, covr, unit testing (e.g., testthat), and test automation.
Usability | Refers to inappropriate usability decisions that must be adjusted later. Examples of this debt are the lack of usability standards and inconsistency among navigational aspects of the software. In the context of R, this encompasses anything related to usability, interfaces, visualization, and so on.
Versioning | Refers to problems in source code versioning, such as unnecessary code forks.

Table 1.1.2: Types of SATD with Examples

Type of SATD | Example
Code/Design | "Oh I didn't realize we got duplicated logic. We need to refactor this." - [from Superset-pull-request-6831]
Code/Design | "Need to add better handling for hz instance cleanup." - [from Camel-jira-issue-10563]
Code/Design | "Some new, friendlier APIs may be called for." - [from Druid-github-issue-5940]
Documentation | "Could you also please document the meaning of the various metrics" - [from Spark-pull-request-6905]
Documentation | "I think we should document this" - [from Accumulo-jira-issue-1905]
Documentation | "Currently, the API docs are missing from our website." - [from Mxnet-github-issue-6648]
Test | "It'd be good to add some usages of DurationGranularity to the query tests" - [from Druid-github-issue-3994]
Test | "I did another cycle of review the unit tests, sorry I still not see value in denial-of-service tests?" - [from Zookeeper-pull-request-689]
Test | "I would like to have at least a simple testcase around the UseV2WireProtocol feature" - [from Bookkeeper-github-issue-272]
Requirement | "TODO: add a dynamic context in front of every selector with a traversal" - [from Heron-code-comment]
Requirement | "Remaining todo list for SQL parse module" - [from Pinot-github-issue-2505]
Requirement | "Union is not supported yet. But I might be adding that capability quite soon." - [from Samza-pull-request-295]

To provide some insight into what the different types of SATD look like, Table 1.1.2 provides representative examples of each type.

Figure 1.1.3: Example - SATD Type: Algorithm in Java

Figure 1.1.4: Possible Resolution

SATD is a prevalent issue in software development. While its existence is widely acknowledged, classifying SATD into specific types is essential for several reasons.

By classifying SATD into types, development teams can better prioritize which aspects of technical debt to address first. Not all types of SATD are equal in terms of their impact on software quality, so classification helps identify critical areas that require immediate attention. Different types of SATD may also require distinct mitigation strategies. For example, architectural debt might call for refactoring at a higher level, while code debt may need code-level improvements. Classification guides the selection of appropriate strategies to resolve each type effectively.

Addressing SATD is crucial for maintaining and enhancing software quality. When SATD is categorized, it becomes easier to identify the root causes of problems and apply solutions that lead to cleaner, more maintainable code, reduced defects, and enhanced software reliability. Limited development resources, such as time and manpower, must be allocated judiciously. Classification helps in distributing resources efficiently, as teams can allocate effort to specific types of SATD based on their potential impact and urgency. By classifying SATD, teams can also track the occurrence of different types over time. This allows for better monitoring of the debt and helps in implementing preventive measures and best practices to reduce the accumulation of specific types of SATD in the future.

The classification of Self-Admitted Technical Debt is therefore not just an academic exercise but a practical necessity. It enables software development teams to tackle debt in a more structured, effective, and strategic manner, ultimately leading to higher software quality and more successful projects.

Figure 1.1.3 shows an example of SATD of type Algorithm in Java, and Figure 1.1.4 shows code that could address it. In this example, the flagged code uses a linear scan to find the maximum element in an array, which takes O(n) time. The revised code sorts the array in ascending order so that the maximum can be read from the last position. Sorting takes O(n log n) time in the worst case, so it is asymptotically slower than a single linear scan; the rewrite pays off only when the sorted order is reused, for example for repeated order-statistic queries on the same array. In this example, the SATD type "Algorithm" is addressed by replacing the algorithm that the comment flags as non-optimal.

Conventional SATD detection approaches using text-mining techniques, particularly those relying on source code comment patterns [2, 27], yielded promising results. Still, despite the high precision that pattern-based techniques can attain, they are unable to report SATD comments that do not fit any established pattern. Recent studies investigated the advantages of using Natural Language Processing to capture more accurately the semantic connections between various SATD types and textual comments [7, 23, 30]. Although text-mining-based methods outperformed pattern-based approaches in prediction accuracy, almost all of these studies have exclusively targeted object-oriented programming languages.

[...] a comment. Our goal is not only to capture the links among scattered sentences, but also to learn representations for both training data and unlabeled test data by propagating label impact through graph convolution. In addition to comparing with state-of-the-art (SOTA) methods on two datasets, we built a hybrid model that combines the pre-trained LM with a CNN, to compare against our model DebtSniffer.
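To make the idea of propagating information through graph convolution more concrete, the following is a minimal pure-Python sketch of a single graph-convolution step. It assumes the commonly used propagation rule H' = ReLU(Â H W), with Â = D^{-1/2}(A + I)D^{-1/2}; the three-node graph, the feature values, and the identity weight matrix are hypothetical and are not taken from DebtSniffer.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix."""
    n = len(adj)
    # Add self-loops so that each node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, features, weights):
    """One graph-convolution step: ReLU(A_norm @ features @ weights)."""
    a_norm = normalize_adjacency(adj)
    hidden = matmul(matmul(a_norm, features), weights)
    return [[max(0.0, value) for value in row] for row in hidden]


# Hypothetical toy graph: node 0 -- node 1 -- node 2.
ADJ = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
# Node 1 starts with an all-zero feature vector.
FEATURES = [[1.0, 0.0],
            [0.0, 0.0],
            [0.0, 1.0]]
# Identity weights, chosen so that only the propagation effect is visible.
WEIGHTS = [[1.0, 0.0],
           [0.0, 1.0]]

OUTPUT = gcn_layer(ADJ, FEATURES, WEIGHTS)
```

In this toy setting, node 1 ends up with non-zero features gathered from its two neighbours, which illustrates how information can flow between connected nodes, including nodes that carry no label of their own.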

1.3 Contributions

Our work makes the following contributions:

1. We introduced DebtSniffer, an innovative SATD detection tool capable of addressing the SATD problem across two major programming languages: R and Java. DebtSniffer leverages the power of pre-trained LM models and augments their capabilities through the incorporation of graph convolutional networks.

2. We conducted an extensive evaluation using real-world datasets. Our evaluation process involved rigorous comparisons with state-of-the-art baseline models and an alternative hybrid model, BERT-CNN. By systematically benchmarking DebtSniffer against these models, we demonstrated its superior performance across various SATD categories and programming languages.

3. In a commitment to transparency and openness, we have made the DebtSniffer tool publicly available online (footnote 1). By sharing this tool with the broader research and software engineering communities, we actively contribute to the principles of open science.

In summary, our contributions in the development of DebtSniffer, its comprehensive evaluation against state-of-the-art models, and our commitment to open science collectively represent a substantial and impactful advancement in the domain of SATD detection. DebtSniffer not only extends the reach of SATD detection to multiple programming languages but also sets a high standard for future research and tools in the field of software engineering.

1.4 Organization of Thesis

The rest of the thesis is constructed as follows:

Chapter 2 provides a comprehensive review of previous studies and research endeavors within the domain of Self-Admitted Technical Debt (SATD) detection. We delve into the existing literature to gain insights into the evolution of this field and identify the gaps our work aims to address.

Chapter 3 offers foundational knowledge that is pivotal for comprehending the intricate concepts and methodologies featured in this thesis. We offer concise explanations of key principles, theories, and technologies that form the bedrock of our work.

Chapter 4 introduces and elaborates on the novel models we have developed for the detection of SATD. We delve into the technical details of our approach, explaining how it harnesses various machine-learning techniques to enhance the accuracy of SATD identification.

Chapter 5 showcases the outcomes of our experimentation. We present a detailed analysis of our approach's performance across a diverse array of datasets and in comparison to various baseline models. Through this, we substantiate the effectiveness and robustness of our methodology.

Chapter 6 provides a concise summary of our accomplishments and discusses some directions for further research.

1 https://github.com/ICSME2023-DebtSniffer/DebtSniffer/


Chapter 2

Related Work

The detection and management of Self-Admitted Technical Debt (SATD) have become increasingly critical in software development, as they impact code quality, maintainability, and overall project success. In this chapter, we review the existing literature and approaches related to SATD detection, highlighting the evolution of techniques and research in this domain. Besides, we provide a comprehensive overview of Convolutional Neural Networks (CNNs), Graph Convolutional Networks (GCNs), and pre-trained Transformer models, including ALBERT, RoBERTa, and CodeBERT.

Our work focuses on SATD detection from R code comments and Java source code. Therefore, we divide the related work into three sections: general techniques for SATD detection, SATD detection in the R language, and SATD detection in the Java language.

2.1 General techniques for SATD detection

The concept of self-admitted technical debt (SATD) was initially introduced by Potdar and Shihab [27]. They analyzed 101,762 comments from four open source projects and identified 62 patterns that can be used to detect SATD comments.

However, comment pattern-based approaches cannot identify SATD comments that do not conform to established patterns. To overcome this limitation, other techniques have since been explored, with promising outcomes [10].

Lexical and Comment-Based Approaches. In the initial stages of Self-Admitted Technical Debt (SATD) detection, early methodologies predominantly hinged upon lexical and comment-based analyses. These techniques involved scouring code comments for particular keywords or phrases, with common examples being "TODO" or "FIXME." While these methods were straightforward to implement, their simplicity brought two main limitations: a lack of contextual understanding and a relatively high incidence of false positives. The primary challenge lay in the fact that these approaches failed to consider the broader code context within which comments were embedded. As a result, they could easily misinterpret innocuous comments as instances of SATD, leading to a significant number of false positives. Additionally, these methods were typically keyword-driven, making them rigid and less adaptable to variations in how developers expressed and documented technical debt. Consequently, the lack of contextual analysis hindered the precision and granularity of SATD detection, limiting the utility of these early methods in practical software development contexts.
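The brittleness of keyword-driven detection described above can be illustrated with a small sketch. The marker list below is a hypothetical sample chosen for illustration only (it is not the pattern set of Potdar and Shihab [27]), and the three comments are invented examples.

```python
import re

# Hypothetical marker words, for illustration only.
SATD_MARKERS = ("todo", "fixme", "hack", "workaround")
_MARKER_PATTERN = re.compile(
    r"\b(" + "|".join(SATD_MARKERS) + r")\b", re.IGNORECASE
)


def is_satd_by_keyword(comment: str) -> bool:
    """Flag a comment as SATD if it contains any marker word."""
    return _MARKER_PATTERN.search(comment) is not None


# Invented example comments.
COMMENTS = [
    "// TODO: handle empty input",                               # flagged: genuine debt
    "# render the user's todo items as a list",                  # flagged: false positive
    "# not quite right for large inputs, revisit before release",  # missed: no marker word
]

RESULTS = [is_satd_by_keyword(comment) for comment in COMMENTS]
```

Under these assumptions, the second comment is flagged although it merely describes functionality, and the third is missed although it admits a shortcoming, which mirrors the false-positive and rigidity issues discussed in the paragraph above.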

Machine Learning-Based Approaches. Machine learning-based methodologies have surged in popularity for the detection of SATD. This rise is attributed to the capacity of machine learning models to analyze code comments within their broader context. Among these approaches, Natural Language Processing (NLP) techniques have emerged as a particularly powerful toolset, allowing for precise and context-aware SATD detection. One of the key strengths of NLP-based SATD detection lies in its ability to perform text classification, enabling the model to discern the underlying intent and tone of code comments. This goes beyond the mere identification of keywords or phrases, as it takes the subtleties of human language into account.

Furthermore, NLP-based SATD detection methods are adaptable and extensible. They can be fine-tuned and customized to the specific requirements and coding conventions of different software projects and programming languages. This adaptability allows developers and researchers to tailor SATD detection models to the unique characteristics of their codebases, fostering more accurate and context-aware results.

SVM and Naive Bayes Classifiers. In the initial stages of adopting machine learning for SATD detection, early methods made use of classification algorithms like Support Vector Machines (SVM) and Naive Bayes classifiers. These techniques were instrumental in distinguishing between SATD and non-SATD comments. However, it is important to note that during this early phase, the models relied heavily on handcrafted features, marking a significant difference from the more modern, data-driven approaches.
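As an illustration of this family of classifiers, the following is a minimal sketch of a multinomial Naive Bayes classifier with Laplace smoothing over bag-of-words counts, written in pure Python. The four training comments, the labels, and the whitespace tokenizer are hypothetical simplifications; the studies cited above used their own features and datasets.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokenizer (a deliberately simple feature extractor)."""
    return text.lower().split()


def train_naive_bayes(samples):
    """Collect per-class document and word counts from (comment, label) pairs."""
    class_docs = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in samples:
        class_docs[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for token in tokenize(text):
            counts[token] += 1
            vocabulary.add(token)
    return {
        "class_docs": class_docs,
        "word_counts": word_counts,
        "vocabulary": vocabulary,
        "total_docs": len(samples),
    }


def predict(model, text):
    """Return the label with the highest log-posterior under Laplace smoothing."""
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_docs"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / model["total_docs"])
        for token in tokenize(text):
            if token in model["vocabulary"]:  # words never seen in training are ignored
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
TRAINING = [
    ("todo refactor this ugly hack", "SATD"),
    ("fixme temporary workaround for parser bug", "SATD"),
    ("returns the mean of the vector", "non-SATD"),
    ("compute the sum of the input values", "non-SATD"),
]

MODEL = train_naive_bayes(TRAINING)
```

With this toy model, `predict(MODEL, "temporary hack needs refactor")` yields the SATD label, while `predict(MODEL, "returns the sum of values")` yields the non-SATD label. The quality of such a classifier depends directly on the chosen features, which is the limitation of handcrafted representations noted above.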


Deep Learning Models. Recent advancements in deep learning have introduced more sophisticated models for SATD detection. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been used to capture complex relationships within code comments, achieving improved accuracy.

The integration of CNNs and RNNs into SATD detection models has led to substantial accuracy improvements. These models are no longer confined to manual feature engineering, but instead learn their features from data. This adaptability equips them to cater to diverse coding styles, programming languages, and domains, providing a robust and versatile solution for SATD detection. In a software development landscape characterized by constant evolution and change, these models help ensure that SATD detection remains effective and context-aware.
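To give an intuition for how a CNN extracts features from a comment without manual feature engineering, the sketch below slides a single filter over a sequence of token embeddings, applies ReLU, and keeps the strongest response (max-over-time pooling). It is a pure-Python illustration under simplifying assumptions: one filter, a window of two tokens, and hypothetical two-dimensional embeddings; in a trained model the filter weights would be learned from data.

```python
def conv1d_max_pool(embeddings, kernel, bias=0.0):
    """Apply one convolution filter over a token sequence, then max-pool.

    embeddings: list of token vectors, shape (sequence_length, dim)
    kernel:     list of weight vectors, shape (window, dim)
    Returns the largest ReLU activation over all window positions.
    """
    window = len(kernel)
    if len(embeddings) < window:
        raise ValueError("sequence is shorter than the filter window")
    activations = []
    for start in range(len(embeddings) - window + 1):
        total = bias
        for offset in range(window):
            token_vector = embeddings[start + offset]
            total += sum(w * x for w, x in zip(kernel[offset], token_vector))
        activations.append(max(0.0, total))  # ReLU
    return max(activations)  # max-over-time pooling


# Hypothetical embeddings for a four-token comment.
TOKEN_EMBEDDINGS = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.0, 0.0],
]
# Hypothetical filter that responds to the first dimension of one token
# followed by the second dimension of the next token.
FILTER = [
    [1.0, 0.0],
    [0.0, 1.0],
]

FEATURE = conv1d_max_pool(TOKEN_EMBEDDINGS, FILTER)
```

The pooled value is a single feature for the whole comment regardless of its length; a full model would use many such filters and feed the resulting feature vector to a classification layer.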

Hybrid Approaches. Hybrid approaches combine lexical analysis, machine learning, and code metrics to improve the precision and depth of SATD identification. These methods draw on code comments, source code, and project history, yielding a more complete view of technical debt instances.

The vast body of research on SATD primarily revolves around its detection and management within Object-Oriented (OO) software projects. Java, in particular, has served as a focal point for many of these investigations. However, as software development landscapes continue to evolve, it becomes necessary to explore how established SATD detection methodologies fare in the context of scientific programming languages.

Scientific programming languages, such as Matlab, R, and Python, introduce a unique set of challenges due to their diverse paradigms and specialized use cases. Unlike traditional OO languages, these languages are specifically tailored to scientific and data analysis tasks. Consequently, they often employ different coding practices, structures, and idioms.

This paradigm shift in programming languages necessitates a critical examination of the adaptability and efficacy of existing SATD detection techniques. What has proven effective in the OO realm may not translate seamlessly to these scientific languages. Hence, it is imperative to assess how SATD manifests itself in codebases developed using Matlab, R, Python, and similar languages. This includes an exploration of the specific types of SATD that emerge, as well as an evaluation of whether current detection strategies can accurately pinpoint these issues.


In essence, this research seeks to bridge a significant gap in SATD scholarship by extending its purview to encompass scientific programming languages. Through a comprehensive analysis, we aim to shed light on how SATD manifests in these diverse coding environments and to refine our SATD detection methodologies accordingly. Ultimately, this expansion of scope will contribute to a more holistic understanding of SATD across the software development spectrum.

Furthermore, it is important to note that SATD detection in the realm of Java programming holds significant importance in the overarching goal of maintaining code quality and managing technical debt effectively. Java, being one of the most widely used programming languages, forms the backbone of countless software applications across various domains. Therefore, ensuring the integrity and cleanliness of Java codebases is paramount.

Traditionally, SATD detection in Java codebases has been recognized as a cornerstone in this pursuit of code quality. It serves as a proactive mechanism to identify and address potential sources of technical debt before they escalate into more complex and costly issues. This proactive stance aligns with modern software development paradigms that emphasize the importance of preventive maintenance over reactive fixes.

In the context of Java, two primary approaches have emerged as fundamental strategies for SATD detection: lexical analysis and comment-based analysis. These approaches, rooted in the linguistic and structural aspects of Java code, provide a robust foundation for identifying and categorizing self-admitted technical debt. By leveraging lexical patterns and insights from code comments, these methodologies enable developers and software teams not only to flag instances of SATD but also to gain valuable context regarding the nature and scope of the debt.

2.2 SATD detection in the R language

Because Self-Admitted Technical Debt (SATD) research has concentrated on Object-Oriented (OO) projects, with a primary focus on languages like Java, there is a growing need to understand how SATD detection methods can be adapted for scientific programming languages. Scientific programming languages, such as Matlab, R, and Python, possess distinct characteristics compared to their Object-Oriented counterparts. Therefore, the exploration of SATD in the context of these languages introduces a host of challenges and opportunities.

Scientific programming languages are frequently chosen for data analysis, computational modeling, and other research activities. These languages are well-suited for data manipulation, statistical analysis, and visual representation, making them indispensable in fields such as data science, bioinformatics, and quantitative research. Given the central role of these languages in scientific domains, the presence of SATD can have critical implications. Yet, compared to the extensive research conducted in the OO realm, the understanding and detection of SATD within scientific programming languages remain relatively underexplored.

In Java and other OO languages, SATD is primarily observed in the form of code comments. These comments typically reflect developers' awareness of suboptimal code or areas needing improvement. However, in the scientific programming landscape, SATD may manifest differently. It could be embedded in code comments, but it can also surface in the documentation, function and variable names, or even in the choice of data structures and algorithms. These unique expressions of SATD in scientific programming languages require tailored detection strategies.

Furthermore, SATD in scientific programming may encompass issues distinct from OO contexts. For example, in Matlab or Python, inefficiencies in numerical algorithms can lead to suboptimal performance. In R, data handling inefficiencies might impact statistical analyses. The SATD challenges in these languages extend beyond mere syntactic analysis and often necessitate a profound understanding of the specific scientific domain to identify subtle technical debt manifestations.

As the adoption of scientific programming languages continues to grow, ensuring code quality and managing technical debt in these contexts is becoming increasingly important. Consequently, research is needed to develop and adapt SATD detection techniques that can effectively pinpoint these unique forms of debt in scientific codebases. Doing so helps ensure that the robustness, reliability, and maintainability of scientific software align with the high standards required by researchers in various fields, ultimately advancing the quality and impact of scientific research.

Codabux et al. [5] collected over 5,000 comments from 157 packages published on the rOpenSci platform, manually analyzed them, and identified 10 types of TD for R packages. In addition to proposing a taxonomy of TD for R packages, this work also highlighted that Documentation Debt is reported more commonly than other TD types. Inspired by the study of Codabux et al. [5], Khan and Uddin [12] applied a pre-trained BERT model combined with several Machine Learning algorithms to automatically classify SATD comments from two R platforms, rOpenSci and BioConductor. They found that generic platforms such as rOpenSci are more prone to TD than domain-specific platforms (i.e., BioConductor). Vidoni [32] analyzed 503 R packages from GitHub and examined more than 164k comments to generate a baseline dataset. The author suggested two novel TD types for R source code, namely ALGORITHM and PEOPLE, instead of only the 10 categories used in earlier research. The closest work to ours is the research of Sharma et al. [31], where the authors investigated two variants of the pre-trained BERT model to automatically detect 12 SATD types from R source code comments. However, BERT is a model pre-trained on natural language rather than code. Thus, DebtSniffer differs from Sharma et al. [31] in that it exploits CodeBERT, a model trained on source code, to classify SATD.

2.3 SATD detection in the Java language

Detecting Self-Admitted Technical Debt (SATD) in Java is of paramount importance when it comes to ensuring and enhancing code quality and effectively managing technical debt within software projects. Technical debt can accumulate in codebases over time due to various factors, such as development pressures, tight project schedules, or the evolving nature of software requirements. However, if not addressed promptly, this technical debt can have severe repercussions on software projects, including reduced maintainability, increased defect rates, and higher development costs.

The detection of SATD in Java, specifically, plays a pivotal role in mitigating these issues. It offers a proactive approach to identifying areas in the code where developers have consciously or unconsciously taken shortcuts or made compromises due to time constraints, complexity, or a lack of better solutions. By recognizing these instances, SATD detection provides an opportunity to rectify, refactor, or document these suboptimal code segments, thus reducing the long-term impact on software quality.

In the realm of SATD detection, two fundamental approaches are commonly employed: lexical and comment-based approaches. The lexical approach analyzes code elements, such as identifiers, comments, and their relationships, to identify potential instances of technical debt. It focuses on code patterns, naming conventions, and specific code smells that may indicate the presence of SATD. Comment-based approaches, on the other hand, center on the textual information embedded in code comments, often termed self-admitted comments. Developers leave these comments to communicate aspects of the code that might need improvement or areas that have been compromised in some way.

Potdar and Shihab [27] conducted a comprehensive analysis of SATD comments in open-source Java projects. The study identified common keywords like "TODO", "FIXME", and "XXX" as indicators of SATD. In this work, Potdar and Shihab manually summarized 62 patterns that can be used to identify SATD comments after reading more than 100,000 source-code comments from different Java projects. Building on this first work, Wehaibi et al. [36] examined the relation between self-admitted technical debt and defects. They found that SATD is not related to defects but rather makes the system more difficult to change in the future. In addition, Maldonado and Shihab [24] further divided SATD into five types, namely design debt, defect debt, documentation debt, requirement debt, and test debt.

All previous SATD detection studies aimed to identify debt instances at the file level. Yan et al. [37] proposed a novel approach to automate the detection of SATD at the change level. The idea is to catch the introduction of SATD when a software change occurs, instead of inspecting whether a previously changed file contains SATD. The authors built a determination model using a Random Forest classifier with data labeled from comment analysis and features extracted from source control repositories. Thereafter, Ren et al. [29] introduced a convolutional neural network-based approach to improve identification performance, while Wang et al. [35] explored the efficiency of an attention-based approach in SATD identification. Additionally, Chen et al. [4] trained an XGBoost classifier to identify three types of SATD, namely design, defect, and requirement debt, from source code comments. Most recently, Li et al. [18] created a SATD dataset from four different sources: source code comments, commit messages, pull requests, and issue-tracking systems. They manually tagged each item in this dataset as non-SATD or SATD (including types of SATD) and then proposed an approach (MT-Text-CNN) to identify four types of SATD from the four sources. They also analyzed a sample of the identified SATD to explore the relations between SATD in different sources.
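To make the keyword-based idea concrete, the following minimal sketch flags a comment as SATD when it contains one of the three markers named above (TODO, FIXME, XXX). It is only an illustration: the 62 patterns of Potdar and Shihab [27] are not reproduced here, and the function name `is_satd_comment` and the sample comments are hypothetical.

```python
import re

# Illustrative keyword list only: the three markers named in the text.
# The full pattern set of Potdar and Shihab [27] is NOT reproduced here.
SATD_KEYWORDS = ("TODO", "FIXME", "XXX")

# Whole-word, case-insensitive match, so that a word such as "mastodon"
# (which contains "todo") is not flagged.
_SATD_RE = re.compile(r"\b(" + "|".join(SATD_KEYWORDS) + r")\b", re.IGNORECASE)


def is_satd_comment(comment: str) -> bool:
    """Return True if the comment contains one of the illustrative keywords."""
    return _SATD_RE.search(comment) is not None


# Hypothetical sample comments.
comments = [
    "// TODO: replace this quick hack with a proper parser",
    "// fixme - breaks on empty input",
    "// compute the checksum of the payload",
]
flags = [is_satd_comment(c) for c in comments]
```

A fixed keyword list of this kind misses comments that admit debt in free-form wording, which is one motivation for the learning-based detectors surveyed in this section.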


In this chapter, we provide a comprehensive overview of Convolutional Neural Networks (CNNs), Graph Convolutional Networks (GCNs), and pre-trained Transformer models, including ALBERT, RoBERTa, and CodeBERT.

3.1 Convolutional Neural Networks (CNN)

A convolutional neural network (CNN) is one of the most significant networks in the deep learning field. Because CNNs have achieved impressive results in many areas, including but not limited to computer vision and natural language processing, they have attracted much attention from both industry and academia in the past few years [19]. The powerful learning ability of deep CNNs is primarily due to the use of multiple feature extraction stages that can automatically learn representations from the data [11].

3.1.1 Architecture

A typical CNN architecture generally comprises alternating convolution and pooling layers followed by one or more fully connected layers at the end. In some cases, a fully connected layer is replaced with a global average pooling layer [11].

Convolutional Layers. Convolutional layers are the core of the feature extraction process. They consist of multiple filters (kernels) that slide across the input image. Each filter detects different features, such as edges, textures, or more complex patterns. Convolution involves element-wise multiplication of the filter with a local region of the input image and then summing the results. These layers are responsible for learning hierarchical and increasingly abstract features from the input data.

Figure 3.1.1: A basic convolutional neural network (CNN) architecture

Pooling Layers. Pooling layers reduce the spatial dimensions of the feature maps while retaining the most salient information. Common pooling operations include max-pooling and average-pooling. Pooling helps reduce computational complexity and makes the network more robust to variations in input scale and position.

Fully Connected Layers. A fully connected layer is mostly used at the end of the network for classification. Unlike pooling and convolution, it is a global operation [11]. Each neuron in a fully connected layer is connected to all neurons in the previous layer. This layer processes the high-level features learned in earlier layers. In a classification task, the output layer typically has as many neurons as there are classes.
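The three layer types can be traced end to end on a toy input. The sketch below is a hedged NumPy illustration rather than the architecture used in this thesis: the 6x6 input, the vertical-edge kernel, the random fully connected weights, and the helper names `conv2d_valid`, `max_pool2d`, and `softmax` are all assumptions chosen for readability.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiplication with the local region, then sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h = feature_map.shape[0] // size
    w = feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


rng = np.random.default_rng(0)
image = rng.random((6, 6))                    # toy single-channel input (assumed)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)     # simple vertical-edge filter (assumed)
feature_map = np.maximum(conv2d_valid(image, kernel), 0.0)  # convolution + ReLU
pooled = max_pool2d(feature_map)              # (4, 4) -> (2, 2)
weights = rng.normal(size=(3, pooled.size))   # fully connected layer, 3 classes
probs = softmax(weights @ pooled.ravel())     # class probabilities
```

In practice the kernel and the fully connected weights are learned by backpropagation; they are fixed or random here only so that the data flow through the layers is visible.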

3.1.2 Applications

CNNs have found applications beyond computer vision, including natural language processing (NLP). CNNs can classify images into predefined categories, such as recognizing objects in photos, identifying handwritten digits, or classifying diseases in medical images [1]. They have also been used for text classification, sentiment analysis, and even text generation, by treating text data as a two-dimensional matrix [3, 34].

Figure 3.2.1: The graph convolutional neural network [14]

3.2 Graph Convolutional Networks (GCN)

Graph Convolutional Networks (GCNs) are designed for feature extraction and analysis on graph-structured data [14]. A graph is a versatile mathematical abstraction used to represent and model relationships between entities: data entities are depicted as nodes, while their connections or interactions are encoded as edges. This representation is exceptionally valuable in scenarios where relationships and context are crucial, which spans a wide array of fields. GCNs leverage the inherent structure of graphs to enhance their understanding of the data. As illustrated in Figure 3.2.1, GCNs operate on the principle of information propagation, where each node aggregates information from its neighboring nodes. This iterative process occurs through multiple layers, allowing GCNs to capture increasingly complex patterns and dependencies in the data. The critical insight behind GCNs is that each node's representation is continually refined based on both its local neighborhood and the global graph structure, facilitating the extraction of rich and contextually relevant features.

While the primary application of GCNs is graph data, they can be adapted to text-based data when the text is represented as a graph [39]. Figure 3.2.2 illustrates a basic architecture of a GCN suitable for text feature extraction in such cases.


Figure 3.2.2: The graph convolutional neural network for text data

3.2.1 Structure

Graph Representation from Text. Converting text data into a graph representation can be done in various ways, such as:

• Document-Word Graph: Treat each document as a node and words as edges connecting documents. Edges can be weighted by word co-occurrence or similarity [39].

• Word-Word Graph: Each word is a node, and edges connect words that co-occur in documents. Edges can be weighted based on co-occurrence frequency or semantic similarity.

• Dependency Tree: Construct a graph using the syntactic or semantic dependencies between words in sentences.
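As an illustration of the word-word option, the sketch below builds a co-occurrence graph in which the weight of an edge is the number of documents containing both words. Whitespace tokenization, document-level co-occurrence counts, the function name `build_word_graph`, and the three sample documents are simplifying assumptions, not the construction used in this thesis.

```python
from collections import Counter
from itertools import combinations


def build_word_graph(documents):
    """Return (vocab, edges); edges maps a sorted word pair to the number
    of documents in which both words appear."""
    edges = Counter()
    vocab = set()
    for doc in documents:
        # Naive whitespace tokenization (assumption); duplicates are removed
        # so that a pair is counted at most once per document.
        words = sorted(set(doc.lower().split()))
        vocab.update(words)
        # Every unordered pair of distinct words in the document co-occurs once.
        edges.update(combinations(words, 2))
    return sorted(vocab), edges


# Hypothetical sample documents.
docs = ["fix this hack later", "temporary hack", "fix later"]
vocab, edges = build_word_graph(docs)
```

The resulting edge weights can be placed in an adjacency matrix indexed by `vocab`, which is the form a graph convolutional layer expects.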

Input Layer. Each node in the graph corresponds to a text element (e.g., document, word, or sentence) and is associated with an initial feature vector. These feature vectors can represent word embeddings, TF-IDF scores, or other text-based representations [25].

Graph Convolutional Layers. The core of the GCN architecture involves multiple graph convolutional layers that operate on the graph representation of the text data. Each layer performs the following steps:

• Message Aggregation: Nodes collect information from their neighboring nodes based on the edges in the graph. This involves aggregating features from neighboring nodes, often with a weighted sum based on edge weights.

• Feature Transformation: The aggregated information is then transformed using a learnable weight matrix.

• Non-Linearity: An activation function (e.g., ReLU) is applied element-wise to the transformed features.

These operations enable nodes to capture information from their local context in the graph.

For instance, in a two-layer GCN, the output representation Z for input X is computed as follows:

\[
Z = f(X, A) = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)
\]

in which $A$ is a symmetric adjacency matrix, $\hat{A}$ is the normalization of $A$, $W^{(0)}$ is the input-to-hidden weight matrix for hidden layer 0, and $W^{(1)}$ is the hidden-to-output weight matrix.
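The two-layer propagation rule above can be written in a few lines of NumPy. The sketch makes assumptions that the text leaves open: the normalization is taken to be the commonly used symmetric form with self-loops, D^{-1/2}(A + I)D^{-1/2}; the graph is a four-node path; and the weights are random rather than trained. The names `normalize_adjacency`, `softmax_rows`, and `gcn_forward` are illustrative.

```python
import numpy as np


def normalize_adjacency(A):
    """Assumed normalization: D^{-1/2} (A + I) D^{-1/2} with self-loops."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def softmax_rows(Z):
    """Row-wise, numerically stable softmax."""
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def gcn_forward(X, A, W0, W1):
    """Z = softmax(A_hat ReLU(A_hat X W0) W1); one probability row per node."""
    A_hat = normalize_adjacency(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)   # aggregate, transform, ReLU
    return softmax_rows(A_hat @ H @ W1)   # second layer + softmax


rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)  # path graph on 4 nodes (assumed)
X = np.eye(4)                              # one-hot node features (assumed)
W0 = rng.normal(size=(4, 8))               # input-to-hidden weights (random)
W1 = rng.normal(size=(8, 3))               # hidden-to-output weights, 3 classes
Z = gcn_forward(X, A, W0, W1)
```

Each multiplication by the normalized adjacency matrix mixes a node's features with those of its direct neighbors, so two layers let information travel two hops across the graph.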

Pooling Layers (Optional). Similar to CNNs, GCNs for text can incorporate pooling layers to down-sample the graph, reducing its size while retaining important structural information.

Output Layer. The final layer produces the output of the GCN. Depending on the specific text-based task, this layer can have different architectures. For text classification tasks, a softmax activation function may be applied to predict class labels. For text generation or summarization, a recurrent or Transformer-based decoder can be used to generate text sequences.

3.2.2 Applications

GCNs have emerged as a powerful tool for various applications in the realm of graph-based machine learning. Their ability to capture and process complex relationships within graph-structured data has led to significant advancements in a wide range of fields. GCNs are used to identify influential nodes or users in social networks [20], which is essential for targeted marketing and recommendation systems [40]. Additionally, GCNs improve syntactic and semantic parsing tasks in NLP by capturing dependencies and relationships between words and phrases [21]. They also aid in extracting relationships between entities in text, a crucial task in information extraction [28].

3.3 Text embedding models

Pre-trained BERT (Bidirectional Encoder Representations from Transformers) models have revolutionized the field of natural language processing (NLP) by capturing rich contextual information from large text corpora [8]. Among the prominent BERT-based models, ALBERT, RoBERTa, and CodeBERT stand out for their unique enhancements and specialized applications. In this section, we delve into each of these models:

3.3.1 ALBERT (A Lite BERT)

ALBERT [15] is designed to address the efficiency and scalability challenges of BERT while maintaining or even improving its performance. Here are some key features and innovations of ALBERT:

Parameter Reduction. ALBERT significantly reduces the number of parameters compared to the original BERT model. It achieves this by factorizing the embedding matrix and sharing parameters across layers.
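The effect of factorizing the embedding matrix can be checked with simple arithmetic. Instead of one V x H matrix (vocabulary size times hidden size), the factorized form uses a V x E matrix followed by an E x H projection, where E is a smaller embedding size. The sizes below (V = 30,000, H = 768, E = 128) are illustrative assumptions and are not taken from this thesis.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Parameter count of the token-embedding block.

    Without factorization: one V x H matrix.
    With factorization: a V x E matrix followed by an E x H projection.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


# Assumed illustrative sizes, not taken from the thesis.
V, H, E = 30_000, 768, 128
full = embedding_params(V, H)           # 30,000 * 768
factorized = embedding_params(V, H, E)  # 30,000 * 128 + 128 * 768
```

Under these assumptions the embedding block shrinks from 23,040,000 to 3,938,304 parameters, and the saving grows as the hidden size H increases while E stays fixed.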

Cross-layer Parameter Sharing. ALBERT shares the same parameters across all Transformer layers, so the parameter count does not grow with the depth of the network. This sharing also acts as a form of regularization that stabilizes training and helps generalization.

Sentence Order Prediction (SOP). In addition to the standard Masked Language Modeling (MLM) task, ALBERT uses the SOP task, which involves predicting whether two consecutive segments of a document appear in their original order or have been swapped. This task further improves pre-training.

Improved Performance. Despite its parameter reduction, ALBERT often outperforms BERT on various downstream NLP tasks while being more parameter- and memory-efficient.

3.3.2 RoBERTa (A Robustly Optimized BERT Pretraining Approach)

RoBERTa [22] builds upon BERT's success and focuses on optimization and robust pre-training. It addresses various aspects of the pre-training process:

Larger Batch Sizes. RoBERTa uses larger batch sizes and dynamic masking to expose the model to more diverse training data, leading to improved generalization.
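Dynamic masking means that a new masking pattern is drawn each time a sequence is fed to the model, rather than fixing one pattern during preprocessing. The sketch below is a deliberately simplified illustration: it replaces every selected token with `[MASK]` and omits the random-token and keep-original cases of the full masking scheme. The function name `dynamic_mask`, the 15% default rate, and the sample sentence are assumptions.

```python
import random

MASK = "[MASK]"


def dynamic_mask(tokens, mask_prob=0.15, rng=None):
    """Return a copy of tokens in which each position is independently
    replaced by [MASK] with probability mask_prob.

    A fresh pattern is drawn on every call, which is the essence of
    dynamic masking (simplified: no random-token or keep-original cases).
    """
    rng = rng or random
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


# Hypothetical sample sentence, masked anew for each of three epochs.
tokens = "this is a temporary workaround until the parser is fixed".split()
rng = random.Random(0)
epochs = [dynamic_mask(tokens, rng=rng) for _ in range(3)]
```

Because the pattern is resampled on every pass, the model is asked to predict different tokens of the same sentence across epochs, whereas static masking would reuse one pattern throughout training.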

Title	Application of Deep Learning and Text Embedding Methods for Self-Admitted Technical Debt Detection
Author	Tran Thi Dinh
Advisor	Dr. Bui Thi Mai Anh
University	Hanoi University of Science and Technology
Major	Software Engineering
Type	thesis
Year of publication	2023
City	Hanoi
