Application of deep learning and text embedding methods for self-admitted technical debt detection
TRAN THI DINH
dinh.tt212255m@sis.hust.edu.vn
Thesis advisor : Dr Bui Thi Mai Anh
Signature of advisor
Department : Department of Software Engineering
Institute : School of Information and Communication Technology
Hanoi, 10-2023
THESIS ASSIGNMENT
1 Student’s information :
Name : Tran Thi Dinh
Phone : 0971236392 Email: dinh.tt212255m@sis.hust.edu.vn
Class : Data Science (Elitech)
Affiliation : Hanoi University of Science and Technology
Duration : 10/2021 - 10/2023
2 Thesis title : Application of deep learning and text embedding methods for
self-admitted technical debt detection
3 Declarations/Disclosures :
I herewith formally declare that I, Tran Thi Dinh, have performed the work and presentation in this thesis independently under the supervision of Dr Bui Thi Mai Anh. All of the results are genuine and are not copied from any other sources. All reference materials are clearly listed in the bibliography. I will accept full responsibility for even one copy that violates school regulations.
Hanoi, date month year 2023
Author
Tran Thi Dinh
4 Attestation of thesis advisor:
Hanoi, date month year 2023
Thesis Advisor
Dr Bui Thi Mai Anh
I would like to take this moment to express my deep and heartfelt gratitude to the individuals whose unwavering support, invaluable guidance, and unwavering assistance have been the cornerstone of my successful journey in completing this thesis.
Foremost, I extend my sincerest thanks to Dr. Bui Thi Mai Anh and Dr. Nguyen Thanh Phuong, whose mentorship has been a beacon of wisdom and expertise. Their continuous support and mentorship have not only illuminated the path of this research but have also significantly contributed to its depth and quality. The insightful feedback and constructive critique they have generously provided have not only steered my work in the right direction but have also fostered my academic growth.
Furthermore, beyond these formal acknowledgments, I wish to express my gratitude to my friends and family, whose unwavering support, encouragement, and invaluable insights have accompanied me throughout this thesis journey. Your camaraderie and motivation have been a constant source of inspiration, reminding me of the strength derived from community and shared aspirations.
In conclusion, I humbly acknowledge that this thesis would have remained a dream without the unwavering support and collaboration of these remarkable individuals. Their collective contributions have elevated the quality and depth of this research, and I am profoundly thankful for their unwavering dedication to my academic voyage.
Thesis advisor : Dr Bui Thi Mai Anh Tran Thi Dinh
ABSTRACT
Software teams typically turn to sub-optimal solutions that deviate from the best software development principles in order to strike a balance between short-term efficiency and long-term stability. Such solutions might lead to maintenance issues, so-called Technical Debt (TD), which need to be paid back later on.
Previous studies have leveraged text-mining techniques for automated TD detection in source code comments (a.k.a. Self-Admitted Technical Debt, SATD), primarily focusing on object-oriented languages like Java. However, SATD detection becomes challenging in scripting languages such as R, which employ dynamic programming paradigms and have highly compact and algorithm-aligned comments.
In this thesis, we introduce DebtSniffer as a practical approach for detecting SATD in both R packages and Java source code. We utilize a code-embedding technique, i.e., pre-trained BERT models, to retain the rich semantic information embedded in R source code comments and Java source code. Subsequently, we apply graph convolutional networks to establish connections between scattered comment sentences and learn representations for both labeled training data and unlabeled test data by propagating label impact through the graph convolution. To assess the performance of DebtSniffer, we conducted experiments over 4,961 R comments from 503 open source projects, which were categorized into 12 TD classes, and over four Java sources: source code comments, commit messages, pull requests, and issue tracking systems.
The experimental results show that DebtSniffer accurately identifies SATD, outperforming the current state-of-the-art approaches based on traditional word embedding techniques.
Keywords: Self-Admitted Technical Debt, Pretrained BERT model, Graph Convolutional Network, Software Engineering
Contents
Abstract
1 Introduction
1.1 Problem Statement
1.2 Research Objectives
1.3 Contributions
1.4 Organization of Thesis
2 Related Work
2.1 General techniques for SATD detection
2.2 SATD detection in the R language
2.3 SATD detection in the Java language
3 Background
3.1 Convolutional Neural Networks (CNN)
3.1.1 Architecture
3.1.2 Applications
3.2 Graph Convolutional Networks (GCN)
3.2.1 Structure
3.2.2 Applications
3.3 Text embedding models
3.3.1 ALBERT (A Lite BERT)
3.3.2 RoBERTa (A Robustly Optimized BERT Pretraining Approach)
3.3.3 CodeBERT
3.3.4 Transfer Learning for Code Tasks
4 Methodology
4.1 Data encoding
4.2 Pretrained LM with CNN Model
4.3 Pretrained LM with GCN Model (DebtSniffer)
5 Experiments
5.1 Empirical Settings
5.1.1 Dataset and baselines
5.1.2 Settings
5.1.3 Metrics
5.2 Results and Discussion
5.2.1 Effectiveness of DebtSniffer on R dataset
5.2.2 Effectiveness of DebtSniffer on Java datasets
5.2.3 Ablation study
5.2.4 Threats to validity
6 Conclusions and Future works
6.1 Summary
6.2 Future works
List of Figures
1.1.1 An example of Self-Admitted Technical Debt of the Defects type in R
1.1.2 Real-world scenarios where SATD has caused substantial problems
1.1.3 Example - SATD Type: Algorithm in Java
1.1.4 Possible Resolution
3.1.1 A basic convolutional neural network (CNN) architecture
3.2.1 The graph convolutional neural network [14]
3.2.2 The graph convolutional neural network for text data
4.1.1 The SATD preprocessing workflow
4.2.1 The architecture of Pretrained LM with CNN model. Pretrained CodeBERT is used as an example
4.2.2 The input representation of BERT model
4.3.1 The architecture of Pretrained LM with GCN model
4.3.2 Schematic of GCN model
5.1.1 Comment length distribution in R dataset
5.1.2 SATD length distribution in Java dataset
5.2.1 Influence of the number of GC layers in SATD detection for R packages. The baseline is equal to zero GC layers
5.2.2 Influence of the number of GC layers in SATD detection for the Code comment source in Java. The baseline is equal to zero GC layers
List of Tables
1.1.1 Taxonomy TD definitions, based on Codabux et al. [5]
1.1.2 Types of SATD with Examples
5.1.1 Statistics of the dataset
5.1.2 Number of different types of SATD
5.2.1 Comparison Results in R dataset (%)
5.2.2 Code comment dataset
5.2.3 Pull request dataset
5.2.4 Commit message dataset
5.2.5 Issue dataset
5.2.6 Influence of the number of GC layers on average F1-score of R language dataset and Java Code comment data
CNN Convolutional Neural Network.
GCN Graph Convolutional Network.
LM Language Models.
NLP Natural Language Processing.
OOP Object Oriented Programming.
PMI Pointwise Mutual Information.
PPMI Positive Pointwise Mutual Information.
RNN Recurrent Neural Networks.
SATD Self-Admitted Technical Debt.
Chapter 1
Introduction
SATD refers to code artifacts within a software project where developers explicitly acknowledge the presence of suboptimal or problematic code but do not immediately address it. SATD can manifest as code comments, such as "// TODO" or "// FIXME", and typically reflects issues related to design flaws, code smells, or deferred maintenance. The identification of SATD instances is crucial for several reasons: it allows development teams to prioritize technical debt repayment, maintain code quality, and reduce the risk of project delays and increased maintenance costs.
1.1 Problem Statement
In the dynamic realm of software development, a common challenge that developers frequently encounter is the inexorable march of time. The pressures of deadlines and project constraints often force them to make expedient decisions and adopt shortcuts to expedite the development process. However, while these shortcuts may offer immediate relief, they can sow the seeds of long-term consequences [30]. These expedient measures, taken in the heat of project development, can inadvertently give rise to a myriad of issues. One of the foremost concerns is the inadvertent creation of low-quality code. Hasty coding practices, often driven by the urgency of project timelines, may result in code that lacks the robustness, efficiency, and maintainability that are the hallmarks of high-quality software. Consequently, such code can become a source of frustration for developers and may hinder the overall progress of the project.
The concept of Technical Debt (TD) was introduced by Cunningham [6], referring to the phenomenon of "not-quite-right code" that represents an incomplete, temporary, or sub-optimal solution. Debts incurred in the pursuit of short-term advantages must be repaid over the long term at an increasing cost [30].
Self-Admitted Technical Debt (SATD) represents a particular form of TD where developers leave comments within the source code to acknowledge sections that are not fully completed or optimized, indicating the need for further refinement or additional attention [17].
Figure 1.1.1: An example of Self-Admitted Technical Debt of the Defects type in R
Figure 1.1.1 is an example of SATD categorized as "Defect". In this example, the comment explicitly mentions the function name, 'validate input', and the variable 'threshold', indicating that there is an issue related to the initialization of the variable. However, it suggests deferring the resolution of this defect, which is a characteristic of the "Defects" type of SATD.
Detecting and effectively dealing with SATD issues presents a multifaceted challenge within the realm of software development. One of the primary complexities lies in the fact that SATD concerns often fly under the radar, remaining obscure and known to only a select few developers.
The inherent intricacy of SATD identification and management is compounded by the urgency of the matter. Failing to address these debts in a timely and efficient manner can lead to a cascading effect of adverse consequences. These repercussions reverberate through various facets of software quality, making it imperative to allocate the requisite time and resources for resolution.
In essence, the proactive detection and management of SATD issues are critical components in the continuous pursuit of software quality enhancement. The intricacies involved in this process necessitate a thorough understanding of the various SATD types, as well as the adoption of effective strategies for their detection and subsequent mitigation. By doing so, software development teams can ensure that their projects remain on a trajectory towards improved reliability, maintainability, and overall excellence.
While Self-Admitted Technical Debt (SATD) may not lead to traditional "disasters" in the sense of natural disasters, it can have significant negative impacts on software projects, organizations, and even individuals. SATD can lead to software failures, causing applications to crash or malfunction. For instance, in 2012, Knight Capital Group lost 440 million dollars in under an hour due to a trading algorithm error caused by an undetected software bug, which can be considered a financial disaster.
Figure 1.1.2: Real-world scenarios where SATD has caused substantial problems
Technical debt often includes poor security practices. Unresolved security issues in software can lead to data breaches, exposing sensitive information. The Equifax data breach in 2017, affecting 147 million people, was partly attributed to unpatched software vulnerabilities.
As a multi-paradigm programming language, R is being used more and more in applications related to data science and statistics [33]. Despite the fact that the R end-user programming community has been growing, the bulk of contributors are statisticians and scientists rather than software engineers. Indeed, prior research has indicated that few R users are familiar with the nuances of the programming language and that they do not view themselves as developers [26]. While there are several studies dealing with SATD in programming languages such as Java [30], little attention has been paid to the detection of technical debt in the R language. We hypothesize that the detection of SATD in R packages is more difficult than in other languages due to the mixing of several programming paradigms, including dynamic typing.
In the context of Java, which is widely used in various software projects, SATD detection becomes particularly relevant. SATD in Java refers to code segments where developers have knowingly introduced suboptimal solutions, which are often documented as comments within the code. Java codebases are typically large and complex, making it challenging to manually identify SATD. Moreover, SATD comments in Java can vary widely in content and style, making automated detection a non-trivial task. Several tools and frameworks are available for SATD detection in Java, including CodeBERT, DebtWatcher, and SATDClassifier, each offering different features and approaches.
SATD encompasses a broad spectrum of issues and shortcomings in software development, reflecting the fact that not all technical debt is created equal. Just as in real life, there are various facets and forms of debt. Their definitions are summarised in Table 1.1.1.
Table 1.1.1: Taxonomy of TD definitions, based on Codabux et al. [5]
Debt type    Definition
Architecture    Refers to the problems encountered in product architecture, for example, violation of modularity, which can affect architectural requirements (e.g. performance, robustness).
Build    Refers to issues that make the build task harder and unnecessarily time-consuming. The build process can involve code that does not contribute value to the customer. Moreover, if the build process needs to run ill-defined dependencies, the process becomes unnecessarily slow. In the context of R, Build TD encompasses anything related to Travis, cov.io, GitHub Actions, CI, AppVeyor, CRAN, CMD [...]
Code    [...] negatively affect the legibility of the code, making it more difficult to maintain. Usually, this TD can be identified by examining the source code for issues related to bad coding practices. In the context of R, code debt encompasses anything related to renaming classes and functions, '<-' and '=', parameters and arguments in functions, FALSE/TRUE vs F/T, print vs warning/message.
Defect    Refers to known defects, usually identified by testing activities or by the user and reported on bug tracking systems.
Design    Refers to debt that can be discovered by analyzing the source code and identifying violations of the principles of good object-oriented design (e.g. very large or tightly coupled classes). In the context of R, design debt encompasses anything related to S3 classes and S4 methods, exporting functions with '@export' or the name pattern (visibility), internal functions with coupling issues, location of functions in the same file, selective importing '@import' (whole package) or '@importFrom' (a specific function), notations '::' and ':::', returning objects (dataframes or tibbles), and Tidyverse vs baseR.
Documentation    Refers to the problems found in software project documentation and can be identified by looking for missing, inadequate, or incomplete documentation. In the context of R, documentation debt encompasses anything related to Roxygen2 (e.g., '@param', '@return', '@example'), Pkgdown, Readme files, and Vignettes.
Requirements    Refers to trade-offs made concerning what requirements the development team needs to implement or how to implement them. Some examples of this type of debt are: requirements that are only partially implemented, requirements that are implemented but not for all cases, requirements that are implemented but in a way that does not fully satisfy all the non-functional requirements (e.g. security, performance).
Test    Refers to issues found in testing activities that can affect the quality of those activities. Examples of this type of debt are planned tests that were not run, or known deficiencies in the test suite (e.g. low code coverage). In the context of R, test debt encompasses anything related to coverage, covr, unit testing (e.g., testthat), and test automation.
Usability    Refers to inappropriate usability decisions that must be adjusted later. Examples of this debt are the lack of usability standards and inconsistency among navigational aspects of the software. In the context of R, this encompasses anything related to usability, interfaces, visualization, and so on.
Versioning    Refers to problems in source code versioning, such as unnecessary code forks.
Table 1.1.2: Types of SATD with Examples
Type of SATD    Example
Code/Design    "Oh I didn't realize we got duplicated logic. We need to refactor this." - [from Superset-pull-request-6831]
    "Need to add better handling for hz instance cleanup." - [from Camel-jira-issue-10563]
    "Some new, friendlier APIs may be called for." - [from Druid-github-issue-5940]
Documentation    "Could you also please document the meaning of the various metrics" - [from Spark-pull-request-6905]
    "I think we should document this" - [from Accumulo-jira-issue-1905]
    "Currently, the API docs are missing from our website." - [from Mxnet-github-issue-6648]
Test    "It'd be good to add some usages of DurationGranularity to the query tests" - [from Druid-github-issue-3994]
    "I did another cycle of review the unit tests, sorry I still not see value in denial-of-service tests?" - [from Zookeeper-pull-request-689]
    "I would like to have at least a simple testcase around the UseV2WireProtocol feature" - [from Bookkeeper-github-issue-272]
Requirement    "TODO: add a dynamic context in front of every selector with a traversal" - [from Heron-code-comment]
    "Remaining todo list for SQL parse module " - [from Pinot-github-issue-2505]
    "Union is not supported yet. But I might be adding that capability quite soon." - [from Samza-pull-request-295]
To provide some insight into what the different types of SATD look like, Table 1.1.2 provides representative examples of each type.
Figure 1.1.3: Example - SATD Type: Algorithm in Java
Figure 1.1.4: Possible Resolution
SATD is a prevalent issue in software development. While its existence is widely acknowledged, the need to classify SATD into specific types is essential for several reasons. By classifying SATD into types, development teams can better prioritize which aspects of technical debt to address first. Not all types of SATD are equal in terms of their impact on software quality, so classification helps identify critical areas that require immediate attention. Different types of SATD may require distinct mitigation strategies. For example, architectural debt might call for refactoring at a higher level, while code debt may need code-level improvements. Classification guides the selection of appropriate strategies to resolve each type effectively.
Addressing SATD is crucial for maintaining and enhancing software quality. When SATD is categorized, it becomes easier to identify the root causes of problems and apply solutions that lead to cleaner, more maintainable code, reduced defects, and enhanced software reliability. Limited development resources, such as time and manpower, must be allocated judiciously. Classification helps in distributing resources efficiently, as teams can allocate efforts to specific types of SATD based on their potential impact and urgency. By classifying SATD, teams can track the occurrence of different types over time. This not only allows for better monitoring of the debt but also helps in implementing preventive measures and best practices to reduce the accumulation of specific types of SATD in the future. The classification of Self-Admitted Technical Debt is not just an academic exercise but a practical necessity. It enables software development teams to tackle debt in a more structured, effective, and strategic manner, ultimately leading to higher software quality and more successful projects.
Figure 1.1.3 shows an example of the SATD type Algorithm in Java, and Figure 1.1.4 shows code that could address it. In this example, the original code uses a linear scan to find the maximum element of an array, which takes O(n) time on every lookup. The revised code sorts the array in ascending order, after which the maximum is simply the last element. Sorting costs O(n log n) in the worst case, so it is not cheaper than a single linear scan; it pays off when the sorted array is reused for many subsequent lookups. In this example, the SATD type "Algorithm" is addressed by replacing the algorithm that the comment flags as non-optimal with the alternative shown in Figure 1.1.4.
Conventional SATD detection approaches using text-mining techniques, particularly those relying on source code comment patterns [2, 27], yielded promising results. Still, despite the high precision that pattern-based techniques can attain, they are unable to report SATD comments that do not fit any established patterns. Recent studies investigated the advantages of using Natural Language Processing to more accurately portray the semantic connections between various SATD types and textual comments [7, 23, 30]. Although text-mining-based methods outperformed pattern-based approaches in prediction accuracy, almost all of these studies have exclusively targeted object-oriented programming languages.
[...] a comment. Our goal is to capture not only the links among scattered sentences, but also to learn representations for both training data and unlabeled test data by propagating label impact through graph convolution. In addition to comparing with state-of-the-art (SOTA) methods on two datasets, we built a hybrid model that combines the pre-trained LM with a CNN as a further point of comparison for our model, DebtSniffer.
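To make the idea of propagating information through graph convolution more concrete, the following is a minimal sketch of a single graph convolution layer in plain Python. It assumes the widely used formulation ReLU(D^{-1/2} (A + I) D^{-1/2} H W), which may differ in detail from the DebtSniffer implementation. The toy adjacency matrix, the feature vectors, and the helper names normalize_adjacency, matmul, and gcn_layer are illustrative assumptions, not taken from the thesis.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix."""
    n = len(adj)
    # Add self-loops so that every node also keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj_norm, features, weights):
    """One graph convolution: ReLU(adj_norm @ features @ weights)."""
    propagated = matmul(matmul(adj_norm, features), weights)
    return [[max(0.0, value) for value in row] for row in propagated]


# Hypothetical toy graph: node 1 is linked to nodes 0 and 2.
adjacency = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
# Node 2 starts with an all-zero feature vector.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
# Identity weights keep the example easy to follow; real weights are learned.
weights = [[1.0, 0.0], [0.0, 1.0]]

hidden = gcn_layer(normalize_adjacency(adjacency), features, weights)
```

In this toy graph, node 2 begins with no signal of its own yet ends up with a non-zero representation taken from its neighbour, which is the mechanism that lets unlabeled nodes benefit from connected labeled ones.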
1.3 Contributions
Our work makes the following contributions:
1. We introduced DebtSniffer, an innovative SATD detection tool capable of addressing the SATD problem across two major programming languages: R and Java. DebtSniffer leverages the power of pre-trained language models and augments their capabilities through the incorporation of graph convolutional networks.
2. We conducted an extensive evaluation using real-world datasets. Our evaluation process involved rigorous comparisons with state-of-the-art baseline models and an alternative hybrid model, BERT-CNN. By systematically benchmarking DebtSniffer against these models, we demonstrated its superior performance across various SATD categories and programming languages.
3. In a commitment to transparency and openness, we have made the DebtSniffer tool publicly available online¹. By sharing this tool with the broader research and software engineering communities, we actively contribute to the principles of open science.
In summary, our contributions in the development of DebtSniffer, its comprehensive evaluation against state-of-the-art models, and our commitment to open science collectively represent a substantial and impactful advancement in the domain of SATD detection. DebtSniffer not only extends the reach of SATD detection to multiple programming languages but also sets a high standard for future research and tools in the field of software engineering.
1.4 Organization of Thesis
The rest of the thesis is organized as follows:
Chapter 2 provides a comprehensive review of previous studies and research endeavors within the domain of Self-Admitted Technical Debt (SATD) detection. We delve into the existing literature to gain insights into the evolution of this field and identify the gaps our work aims to address.
Chapter 3 offers foundational knowledge that is pivotal for comprehending the intricate concepts and methodologies featured in this thesis. We offer concise explanations of key principles, theories, and technologies that form the bedrock of our work.
Chapter 4 introduces and elaborates on the novel models we have developed for the detection of SATD. We delve into the technical details of our approach, explaining how it harnesses various machine-learning techniques to enhance the accuracy of SATD identification.
Chapter 5 showcases the outcomes of our rigorous experimentation. We present a detailed analysis of our approach's performance across a diverse array of datasets and in comparison to various baseline models. Through this, we substantiate the effectiveness and robustness of our methodology.
Chapter 6 provides a concise summary of our accomplishments and discusses some directions for further research.
¹ https://github.com/ICSME2023-DebtSniffer/DebtSniffer/
Chapter 2
Related Work
The detection and management of Self-Admitted Technical Debt (SATD) have become increasingly critical in software development, as they impact code quality, maintainability, and overall project success. In this chapter, we review the existing literature and approaches related to SATD detection, highlighting the evolution of techniques and research in this domain. In addition, we provide a comprehensive overview of Convolutional Neural Networks (CNNs), Graph Convolutional Networks (GCNs), and pre-trained Transformer models, including ALBERT, RoBERTa, and CodeBERT.
Our work focuses on SATD detection from R code comments and Java source code. Therefore, we divided the related work into three sections, i.e., general techniques for SATD detection, SATD detection in the R language, and SATD detection in the Java language.
2.1 General techniques for SATD detection
The concept of self-admitted technical debt (SATD) was initially introduced by Potdar and Shihab [27]. They analyzed 101,762 comments from four open source projects, leading to the identification of 62 patterns that can be used to detect SATD comments.
However, comment pattern-based approaches have limitations in identifying SATD comments that do not conform to established patterns. To overcome this, other techniques have since been explored, demonstrating promising outcomes [10].
Lexical and Comment-Based Approaches. In the initial stages of Self-Admitted Technical Debt (SATD) detection, early methodologies predominantly hinged upon lexical and comment-based analyses. These techniques involved scouring code comments for particular keywords or phrases, with common examples being "TODO" or "FIXME". While these methods were straightforward to implement, they often encountered limitations stemming from their inherent simplicity. The main issues included a lack of contextual understanding and a relatively high incidence of false positives. The primary challenge lay in the fact that these approaches failed to consider the broader code context within which comments were embedded. As a result, they could easily misinterpret innocuous comments as instances of SATD, leading to a significant number of false positives. Additionally, these methods were typically keyword-driven, making them rigid and less adaptable to variations in how developers expressed and documented technical debt. Consequently, the lack of contextual analysis hindered the precision and granularity of SATD detection, limiting the utility of these early methods in practical software development contexts.
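The limitations of keyword-driven detection can be illustrated with a small sketch. The pattern list below is a hypothetical handful of keywords chosen for illustration; it is not the set of 62 patterns identified by Potdar and Shihab [27], and the three sample comments are invented for this example.

```python
import re

# Illustrative keyword patterns only (an assumption for this sketch).
SATD_PATTERNS = [r"\btodo\b", r"\bfixme\b", r"\bhack\b", r"\bworkaround\b"]
SATD_REGEX = re.compile("|".join(SATD_PATTERNS), re.IGNORECASE)


def is_satd_by_pattern(comment: str) -> bool:
    """Flag a comment as SATD if any keyword pattern occurs in it."""
    return SATD_REGEX.search(comment) is not None


comments = [
    # Contains a keyword and admits debt: correctly flagged.
    "# TODO: validate the threshold before use",
    # Admits debt without any keyword: missed by the patterns.
    "# this is not quite right, revisit after release",
    # Contains a keyword but admits no debt: flagged by mistake.
    "# see the TODO list in the README for the roadmap",
]

flags = [is_satd_by_pattern(comment) for comment in comments]
```

The second comment shows why pattern-based detectors miss debt that is phrased freely, and the third shows how a keyword without context produces a false positive.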
Machine Learning-Based Approaches. Machine learning-based methodologies have surged in popularity for the detection of SATD. This rise in prominence is attributed to the inherent capacity of machine learning models to conduct a nuanced analysis of code comments within their broader contextual framework. Among these machine learning approaches, Natural Language Processing (NLP) techniques have emerged as a particularly powerful toolset, allowing for precise and context-aware SATD detection. One of the key strengths of NLP-based SATD detection lies in its ability to perform text classification, enabling the model to discern the underlying intent and tone of code comments. This goes beyond the mere identification of keywords or phrases, as it delves into the subtleties of human language.
Furthermore, NLP-based SATD detection methods are adaptable and extensible. They can be fine-tuned and customized to the specific requirements and coding conventions of different software projects and programming languages. This adaptability allows developers and researchers to tailor SATD detection models to suit the unique characteristics of their codebases, fostering more accurate and context-aware results.
SVM and Naïve Bayes Classifiers. In the initial stages of adopting machine learning for SATD detection, early methods made use of classification algorithms like Support Vector Machines (SVM) and Naïve Bayes classifiers. These techniques were instrumental in the task of distinguishing between SATD and non-SATD comments. However, it is important to note that during this early phase, the models heavily relied on handcrafted features, marking a significant difference from the more modern, data-driven approaches.
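As a rough illustration of this family of classifiers, the sketch below implements a multinomial Naïve Bayes classifier with Laplace smoothing over word counts, using only the Python standard library. The whitespace tokenizer stands in for a handcrafted feature extractor, and the four training comments, the labels "satd" and "non-satd", and the class name NaiveBayes are assumptions made for this example rather than details from the cited studies.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple handcrafted feature extractor: lowercase word tokens."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.classes:
            # Log prior of the class.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # ignore words never seen during training
                # Smoothed log likelihood of the word given the class.
                score += math.log((self.word_counts[label][word] + 1)
                                  / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training comments, invented for this sketch.
train_texts = [
    "todo fix this ugly hack",
    "temporary workaround remove later",
    "returns the mean of x",
    "compute the column sums",
]
train_labels = ["satd", "satd", "non-satd", "non-satd"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction_debt = model.predict("fix this hack later")
prediction_clean = model.predict("compute the mean")
```

With features this simple, the classifier can only reward words it has already seen, which is one reason later work moved towards learned representations.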
Trang 22Thesis advisor : Dr Bui Thi Mai Anh Tran Thi Dinh
Deep Learning Models Recent advancements in deep learning have introduced more
so-phisticated models for SATD detection Convolutional Neural Networks (CNNs) and current Neural Networks (RNNs) have been used to capture complex relationships withincode comments, achieving improved accuracy
Re-The integration of CNNs and RNNs into SATD detection models has led to substantialaccuracy improvements These models are no longer confined to manual feature engineer-ing, but instead harness the full potential of data-driven learning This adaptability equipsthem to cater to diverse coding styles, programming languages, and domains, providing
a robust and versatile solution for SATD detection In a software development landscapecharacterized by constant evolution and change, these sophisticated models are crucial inensuring that SATD detection remains effective and context-aware
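For readers unfamiliar with how a CNN reads text, the following sketch shows the core operation in plain Python: one filter slides over consecutive token embeddings, a ReLU is applied, and the strongest response is kept (max-over-time pooling). The embedding values, the filter weights, and the function name conv1d_max_pool are hypothetical; in a trained model both the embeddings and the filter are learned, and many filters of several widths are used in parallel.

```python
def conv1d_max_pool(embeddings, kernel):
    """Apply one convolution filter over token windows, then max-pool.

    embeddings: list of token vectors (one list of floats per token).
    kernel: list of weight vectors, one per position in the window.
    """
    width = len(kernel)
    activations = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        # Dot product between the filter and the current window of tokens.
        value = sum(weight * x
                    for kernel_row, token_row in zip(kernel, window)
                    for weight, x in zip(kernel_row, token_row))
        activations.append(max(0.0, value))  # ReLU
    # Max-over-time pooling keeps the strongest response in the comment.
    return max(activations)


# Hypothetical 2-dimensional embeddings for a four-token comment.
token_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Hypothetical filter spanning two consecutive tokens.
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]

feature = conv1d_max_pool(token_embeddings, bigram_filter)
```

Each such filter acts like a learned n-gram detector, so the network discovers informative phrases on its own instead of relying on a fixed keyword list.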
Hybrid Approaches. Hybrid approaches represent a dynamic and comprehensive evolution in the realm of Self-Admitted Technical Debt (SATD) detection. These innovative methodologies seamlessly integrate lexical analysis, machine learning, and code metrics, forging a powerful synergy that enhances the precision and depth of SATD identification. These multifaceted methods draw insights from code comments, source code, and project history, culminating in a holistic and nuanced perspective on technical debt instances.
The vast body of research on SATD primarily revolves around its detection and management within the domain of Object-Oriented (OO) software projects. Java, in particular, has served as a focal point for many of these investigations. However, as software development landscapes continue to evolve, there arises a pressing necessity to broaden our horizons and explore how established SATD detection methodologies fare in the context of scientific programming languages.
Scientific programming languages, such as Matlab, R, Python, and others, introduce a unique set of challenges due to their diverse paradigms and specialized use cases. Unlike traditional OO languages, these languages are specifically tailored to cater to scientific and data analysis tasks. Consequently, they often employ different coding practices, structures, and idioms.
This shift in programming paradigm calls for a critical examination of the adaptability and efficacy of existing SATD detection techniques. What has proven effective in the OO realm may not translate directly to these scientific languages. Hence, it is imperative to assess how SATD manifests itself in codebases developed using Matlab, R, Python, and similar languages. This includes an exploration of the specific types of SATD that emerge, as well as an evaluation of whether current detection strategies can accurately pinpoint these issues.
In essence, this research seeks to bridge a significant gap in SATD scholarship by extending its purview to scientific programming languages. Through a comprehensive analysis, we aim to shed light on how SATD manifests in these diverse coding environments and to refine our SATD detection methodologies accordingly. Ultimately, this expansion of scope will contribute to a more holistic understanding of SATD across the software development spectrum.
Furthermore, SATD detection in Java programming remains important to the overarching goal of maintaining code quality and managing technical debt effectively. Java, being one of the most widely used programming languages, forms the backbone of countless software applications across various domains. Ensuring the integrity and cleanliness of Java codebases is therefore paramount.
Traditionally, SATD detection in Java codebases has been recognized as a cornerstone of this pursuit of code quality. It serves as a proactive mechanism to identify and address potential sources of technical debt before they escalate into more complex and costly issues. This proactive stance aligns with modern software development paradigms that emphasize preventive maintenance over reactive fixes.
In the context of Java, two primary approaches have emerged as fundamental strategies for SATD detection: lexical analysis and comment-based analysis. These approaches, rooted in the linguistic and structural aspects of Java code, provide a robust foundation for identifying and categorizing self-admitted technical debt. By leveraging lexical patterns and insights from code comments, they enable developers and software teams not only to flag instances of SATD but also to gain valuable context regarding the nature and scope of the debt.
2.2 SATD detection in the R language
Because Self-Admitted Technical Debt (SATD) research has concentrated on Object-Oriented (OO) projects, with a primary focus on languages like Java, there is a growing need to understand how SATD detection methods can be adapted for scientific programming languages. Scientific programming languages, such as Matlab, R, Python, and others, possess distinct characteristics compared to their Object-Oriented counterparts. Therefore, the exploration of SATD in the context of these languages introduces a host of challenges and opportunities.
Scientific programming languages are frequently chosen for data analysis, computational modeling, and other research activities. These languages are well-suited for data manipulation, statistical analysis, and visual representation, making them indispensable in fields such as data science, bioinformatics, and quantitative research. Given the central role of these languages in scientific domains, the presence of SATD can have critical implications. Yet, compared to the extensive research conducted in the OO realm, the understanding and detection of SATD within scientific programming languages remain relatively underexplored.
In Java and other OO languages, SATD is primarily observed in the form of code comments. These comments typically reflect developers' awareness of suboptimal code or areas needing improvement. However, in the scientific programming landscape, SATD may manifest differently. It could be embedded in code comments, but it can also surface in the documentation, function and variable names, or even in the choice of data structures and algorithms. These unique expressions of SATD in scientific programming languages require tailored detection strategies.
Furthermore, SATD in scientific programming may encompass issues distinct from OO contexts. For example, in Matlab or Python, inefficiencies in numerical algorithms can lead to suboptimal performance. In R, data handling inefficiencies might impact statistical analyses. The SATD challenges in these languages extend beyond mere syntactic analysis and often require a thorough understanding of the specific scientific domain to identify subtle technical debt manifestations.
As the adoption of scientific programming languages continues to grow, ensuring code quality and managing technical debt in these contexts is becoming increasingly important. Consequently, research is needed to develop and adapt SATD detection techniques that can effectively pinpoint these unique forms of debt in scientific codebases. Doing so helps ensure that the robustness, reliability, and maintainability of scientific software meet the high standards required by researchers in various fields, ultimately advancing the quality and impact of scientific research.
Codabux et al. [5] collected over 5,000 comments from 157 packages published on the rOpenSci platform, analyzed them manually, and identified 10 types of TD for R packages. In addition to proposing a taxonomy of TD for R packages, this work highlighted that Documentation Debt is reported more commonly than other TD types. Inspired by the study of Codabux et al. [5], Khan and Uddin [12] applied a pre-trained BERT model combined with several machine learning algorithms to automatically classify SATD comments from two R platforms, rOpenSci and BioConductor. They found that generic platforms such as rOpenSci are more prone to TD than domain-specific platforms (i.e., BioConductor). Vidoni [32] analyzed 503 R packages from GitHub and examined more than 164k comments to generate a baseline dataset. The author suggested two novel TD types for R source code, namely ALGORITHM and PEOPLE, rather than only the 10 categories of earlier research. The closest work to ours is that of Sharma et al. [31], who investigated two variants of the pre-trained BERT model to automatically detect 12 SATD types from R source code comments. However, BERT is pre-trained on natural language rather than on code. DebtSniffer therefore differs from Sharma et al. [31] in that it exploits CodeBERT, a model trained on source code, to classify SATD.
2.3 SATD detection in the Java language
Detecting Self-Admitted Technical Debt (SATD) in Java is important for ensuring and enhancing code quality and for effectively managing technical debt within software projects. Technical debt can accumulate in codebases over time due to various factors, such as development pressures, tight project schedules, or the evolving nature of software requirements. If not addressed promptly, this technical debt can have severe repercussions on software projects, including reduced maintainability, increased defect rates, and higher development costs.
The detection of SATD in Java plays a pivotal role in mitigating these issues. It offers a proactive approach to identifying areas in the code where developers have consciously or unconsciously taken shortcuts or made compromises due to time constraints, complexity, or a lack of better solutions. By recognizing these instances, SATD detection provides an opportunity to rectify, refactor, or document these suboptimal code segments, thus reducing the long-term impact on software quality.
In SATD detection, two fundamental approaches are commonly employed: lexical and comment-based approaches. The lexical approach analyzes code elements, such as identifiers, comments, and their relationships, to identify potential instances of technical debt. It focuses on code patterns, naming conventions, and specific code smells that may indicate the presence of SATD. Comment-based approaches, on the other hand, center on the textual information embedded in code comments, often termed self-admitted comments. Developers leave these comments to communicate aspects of the code that might need improvement, or areas that have been compromised in some way.
Potdar and Shihab [27] conducted a comprehensive analysis of SATD comments in open-source Java projects. The study identified common keywords like "TODO", "FIXME", and "XXX" as indicators of SATD. After reading more than 100,000 source code comments from different Java projects, Potdar and Shihab manually summarized 62 patterns that can be used to identify SATD comments. Building on this first work, Wehaibi et al. [36] examined the relation between self-admitted technical debt and defects. They found that SATD is not related to defects but instead makes the system more difficult to change in the future. In addition, Maldonado and Shihab [24] further divided SATD into five types, namely design debt, defect debt, documentation debt, requirement debt, and test debt.
All of these earlier SATD detection studies aimed to identify debt instances at the file level. Yan et al. [37] proposed a novel approach to automate the detection of SATD at the change level. The idea is to catch the introduction of SATD when a software change occurs, instead of inspecting whether a previously changed file contains SATD. The authors built a determination model using a Random Forest classifier, with data labeled from comment analysis and features extracted from source control repositories. Thereafter, Ren et al. [29] introduced a convolutional neural network-based approach to improve identification performance, while Wang et al. [35] explored the efficiency of an attention-based approach in SATD identification. Additionally, Chen et al. [4] trained an XGBoost classifier to identify three types of SATD, namely design, defect, and requirement debt, from source code comments. Most recently, Li et al. [18] created a SATD dataset from four different sources: source code comments, commit messages, pull requests, and issue-tracking systems. They manually tagged each item in this dataset as non-SATD or SATD (including the type of SATD) and then proposed an approach (MT-Text-CNN) to identify four types of SATD from the four sources. They also analyzed a sample of the identified SATD to explore the relations between SATD in different sources.
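The keyword-based idea behind the early comment-based studies can be illustrated with a short sketch. It uses only the three keywords named above; the 62 patterns of Potdar and Shihab are not reproduced here, and the function name `is_satd_comment` and the sample comments are hypothetical.

```python
import re

# Illustrative keyword list only: the three keywords mentioned in the text.
SATD_KEYWORDS = ("TODO", "FIXME", "XXX")
_PATTERN = re.compile(r"\b(" + "|".join(SATD_KEYWORDS) + r")\b", re.IGNORECASE)


def is_satd_comment(comment: str) -> bool:
    """Return True if the comment contains one of the illustrative keywords."""
    return _PATTERN.search(comment) is not None


# Hypothetical comments used only to exercise the function.
comments = [
    "// TODO: handle the null case properly",
    "// compute the checksum of the buffer",
    "/* FIXME this is a temporary hack */",
]
flags = [is_satd_comment(c) for c in comments]
```

Such a matcher is cheap but only finds debt that is announced with a fixed marker, which is one motivation for the learning-based approaches surveyed above.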
In this chapter, we provide a comprehensive overview of Convolutional Neural Networks (CNNs), Graph Convolutional Networks (GCNs), and pre-trained Transformer models, including ALBERT, RoBERTa, and CodeBERT.
3.1 Convolutional Neural Networks (CNN)
A convolutional neural network (CNN) is one of the most significant networks in the deep learning field. Because CNNs have made impressive achievements in many areas, including but not limited to computer vision and natural language processing, they have attracted much attention from both industry and academia in the past few years [19]. The powerful learning ability of deep CNNs is primarily due to the use of multiple feature extraction stages that can automatically learn representations from the data [11].
3.1.1 Architecture
A typical CNN architecture generally comprises alternating convolution and pooling layers, followed by one or more fully connected layers at the end. In some cases, a fully connected layer is replaced with a global average pooling layer [11].
Figure 3.1.1: A basic convolutional neural network (CNN) architecture
Convolutional Layers. Convolutional layers are the core of the feature extraction process. They consist of multiple filters (kernels) that slide across the input image. Each filter detects different features, such as edges, textures, or more complex patterns. Convolution involves element-wise multiplication of the filter with a local region of the input image and then summing the results. These layers are responsible for learning hierarchical and increasingly abstract features from the input data.
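The multiply-and-sum operation of a convolutional layer can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a single channel, stride 1, and no padding; the function name `conv2d_valid`, the toy image, and the hand-picked kernel are hypothetical (in a trained CNN the kernel values are learned).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding).

    Each output value is the sum of the element-wise products between the
    kernel and the image region it currently covers. As in most deep learning
    libraries, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kh)
                for b in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy 4x4 image: dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, vertical_edge)
```

The resulting feature map is non-zero only in the middle column, where the window straddles the boundary between the dark and bright halves.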
Pooling Layers. Pooling layers reduce the spatial dimensions of the feature maps while retaining the most salient information. Common pooling operations include max-pooling and average-pooling. Pooling helps reduce computational complexity and makes the network more robust to small variations in the position of features in the input.
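As a minimal sketch of max-pooling, the function below keeps the largest value in each non-overlapping window. The window size of 2, the function name `max_pool2d`, and the sample feature map are assumptions made for illustration.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling with a square window (stride = window size)."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[r * size + i][c * size + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 4x4 feature map; pooling reduces it to 2x2.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool2d(feature_map)
```

Each output cell summarizes a 2x2 region, so the spatial size is halved in both dimensions while the strongest activation in each region is preserved.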
Fully Connected Layers. A fully connected layer is mostly used at the end of the network for classification. Unlike pooling and convolution, it is a global operation [11]. Each neuron in a fully connected layer is connected to all neurons in the previous layer. This layer processes the high-level features learned in earlier layers. In a classification task, the output layer typically has as many neurons as there are classes.
3.1.2 Applications
CNNs can classify images into predefined categories, such as recognizing objects in photos, identifying handwritten digits, or classifying diseases in medical images [1]. Their applications also extend beyond computer vision to natural language processing (NLP): they have been used for text classification, sentiment analysis, and even text generation, by treating text data as a two-dimensional matrix [3, 34].
Figure 3.2.1: The graph convolutional neural network [14]
3.2 Graph Convolutional Networks (GCN)
Graph Convolutional Neural Networks (GCNs) are primarily designed for feature extraction and analysis on graph-structured data [14]. A graph is a versatile mathematical abstraction used to represent and model relationships between entities. In a graph, data entities are depicted as nodes, while their connections or interactions are encoded as edges. This representation is exceptionally valuable in scenarios where relationships and context are crucial, which spans a wide array of fields. GCNs leverage the inherent structure of graphs to enhance their understanding of the data. As illustrated in Figure 3.2.1, GCNs operate on the principle of information propagation, where each node aggregates information from its neighboring nodes. This iterative process occurs through multiple layers, allowing GCNs to capture increasingly complex patterns and dependencies in the data. The critical insight behind GCNs is that each node's representation is continually refined based on both its local neighborhood and the global graph structure, facilitating the extraction of rich and contextually relevant features.
While the primary application of GCNs is on graph data, they can be adapted to text-based data when the text is represented as a graph [39]. Figure 3.2.2 illustrates a basic architecture of a GCN suitable for text feature extraction in such cases.
Figure 3.2.2: The graph convolutional neural network for text data
3.2.1 Structure
Graph Representation from Text. Converting text data into a graph representation can be done in various ways, such as:
• Document-Word Graph: Treat each document as a node and words as edges connecting documents. Edges can be weighted by word co-occurrence or similarity [39].
• Word-Word Graph: Each word is a node, and edges connect words that co-occur in documents. Edges can be weighted based on co-occurrence frequency or semantic similarity.
• Dependency Tree: Construct a graph using the syntactic or semantic dependencies between words in sentences.
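To make the word-word variant concrete, the sketch below builds a symmetric adjacency matrix whose entries count the documents in which two words appear together. Whitespace tokenization, document-level co-occurrence counts as edge weights, the function name `build_cooccurrence_graph`, and the sample comments are all simplifying assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(documents):
    """Return (vocabulary, adjacency) for a word-word co-occurrence graph.

    adjacency[i][j] is the number of documents containing both word i and
    word j. Tokenization is a plain lowercase whitespace split.
    """
    tokenized = [sorted(set(doc.lower().split())) for doc in documents]
    vocab = sorted({word for doc in tokenized for word in doc})
    index = {word: i for i, word in enumerate(vocab)}

    counts = Counter()
    for doc in tokenized:
        # Each unordered pair of distinct words in a document is one co-occurrence.
        for u, v in combinations(doc, 2):
            counts[(u, v)] += 1

    n = len(vocab)
    adjacency = [[0] * n for _ in range(n)]
    for (u, v), count in counts.items():
        adjacency[index[u]][index[v]] = count
        adjacency[index[v]][index[u]] = count  # undirected graph: symmetric matrix
    return vocab, adjacency


# Hypothetical comments standing in for documents.
docs = ["fix this hack later", "temporary hack", "fix later"]
vocab, adjacency = build_cooccurrence_graph(docs)
```

Here "fix" and "later" co-occur in two documents, so the edge between them has weight 2, while words that never share a document remain unconnected.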
Input Layer. Each node in the graph corresponds to a text element (e.g., document, word, or sentence) and is associated with an initial feature vector. These feature vectors can represent word embeddings, TF-IDF scores, or other text-based representations [25].
Graph Convolutional Layers. The core of the GCN architecture involves multiple graph convolutional layers that operate on the graph representation of the text data. Each layer performs the following steps:
• Message Aggregation: Nodes collect information from their neighboring nodes based on the edges in the graph. This involves aggregating features from neighboring nodes, often with a weighted sum based on edge weights.
• Feature Transformation: The aggregated information is then transformed using a learnable weight matrix.
• Non-Linearity: An activation function (e.g., ReLU) is applied element-wise to the transformed features.
These operations enable nodes to capture information from their local context in the graph. For instance, in a two-layer GCN, the output representation Z for input X is computed as follows:
$$Z = f(X, A) = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$
in which $A$ is a symmetric adjacency matrix, $\hat{A}$ is the normalization of $A$, $W^{(0)}$ is the input-to-hidden weight matrix for hidden layer 0, and $W^{(1)}$ is the hidden-to-output weight matrix.
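The two-layer forward pass can be traced with a small dependency-free sketch. The text only states that the adjacency matrix is normalized; the specific choice D^{-1/2}(A + I)D^{-1/2} used below (self-loops added, symmetric degree normalization) is an assumption, as are the toy graph, the one-hot features, and the fixed weight values, which would be learned in practice.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(a):
    """Assumed normalization: D^{-1/2} (A + I) D^{-1/2}."""
    n = len(a)
    a_tilde = [[a[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_tilde]
    return [
        [a_tilde[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def softmax_rows(m):
    result = []
    for row in m:
        shift = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def gcn_forward(x, a, w0, w1):
    """Z = softmax(A_hat ReLU(A_hat X W0) W1) for a two-layer GCN."""
    a_hat = normalize_adjacency(a)
    hidden = relu(matmul(a_hat, matmul(x, w0)))          # aggregate, transform, activate
    return softmax_rows(matmul(a_hat, matmul(hidden, w1)))  # second layer plus softmax


# Toy path graph 0 - 1 - 2 with one-hot node features and arbitrary weights.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W0 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.6]]
W1 = [[1.0, -1.0], [0.5, 0.5]]
Z = gcn_forward(X, A, W0, W1)
```

Each row of `Z` is a probability distribution over the two output classes for one node. Real implementations use array libraries and sparse matrices rather than nested lists.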
Pooling Layers (Optional). Similar to CNNs, GCNs for text can incorporate pooling layers to down-sample the graph, reducing its size while retaining important structural information.
Output Layer. The final layer produces the output of the GCN. Depending on the specific text-based task, this layer can have different architectures. For text classification tasks, a softmax activation function may be applied to predict class labels. For text generation or summarization, a recurrent or transformer-based decoder can be used to generate text sequences.
3.2.2 Applications
GCNs have emerged as a powerful tool for various applications in the realm of graph-based machine learning. Their ability to capture and process complex relationships within graph-structured data has led to significant advancements in a wide range of fields. GCNs are used to identify influential nodes or users in social networks [20], which is essential for targeted marketing and recommendation systems [40]. Additionally, GCNs improve syntactic and semantic parsing tasks in NLP by capturing dependencies and relationships between words and phrases [21]. They also aid in extracting relationships between entities in text, a crucial task in information extraction [28].
3.3 Text embedding models
Pre-trained BERT (Bidirectional Encoder Representations from Transformers) models have revolutionized the field of natural language processing (NLP) by capturing rich contextual information from large text corpora [8]. Among the prominent BERT-based models, ALBERT, RoBERTa, and CodeBERT stand out for their unique enhancements and specialized applications. In this section, we describe each of these models.
3.3.1 ALBERT (A Lite BERT)
ALBERT [15] is designed to address the efficiency and scalability challenges of BERT while maintaining or even improving its performance. The key features and innovations of ALBERT are as follows:
Parameter Reduction. ALBERT significantly reduces the number of parameters compared to the original BERT model. It achieves this by factorizing the embedding matrix into two smaller matrices and by sharing parameters across layers.
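The effect of factorizing the embedding matrix can be checked with simple arithmetic. A single V x H embedding matrix is replaced by a V x E matrix followed by an E x H projection, with E much smaller than H. The sizes below (V = 30,000, H = 768, E = 128) are illustrative values chosen for this sketch, and the function name `embedding_parameters` is hypothetical.

```python
def embedding_parameters(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters with or without factorization."""
    if embedding_size is None:
        # Unfactorized: one V x H matrix.
        return vocab_size * hidden_size
    # Factorized: a V x E matrix plus an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30000, 768, 128  # illustrative sizes, not taken from the thesis
full = embedding_parameters(V, H)
factorized = embedding_parameters(V, H, E)
reduction = 1 - factorized / full
```

With these sizes the embedding block shrinks from about 23.0 million to about 3.9 million parameters, a reduction of roughly 83%.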
Cross-layer Parameter Sharing. ALBERT shares parameters across its Transformer layers, so the same weights are reused at every layer. This keeps the parameter count from growing with network depth and acts as a form of regularization that helps model training and generalization.
Sentence Order Prediction (SOP). In addition to the standard Masked Language Modeling (MLM) task, ALBERT uses the SOP task, which involves predicting whether two consecutive sentences in a document are in the correct order. This task further improves pre-training.
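How SOP training pairs can be formed is sketched below: consecutive sentences kept in their original order are positive examples, and the same two sentences swapped are negative examples. The 50/50 split, the function name `make_sop_examples`, and the sample sentences are assumptions for illustration; this is not the original pre-training pipeline.

```python
import random


def make_sop_examples(sentences, rng):
    """Build (first, second, label) triples from consecutive sentences.

    label 1 means the pair is in its original order; label 0 means the two
    sentences were swapped.
    """
    examples = []
    for first, second in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:
            examples.append((first, second, 1))
        else:
            examples.append((second, first, 0))
    return examples


# Hypothetical document split into sentences.
sentences = [
    "The parser reads the file.",
    "It then builds a syntax tree.",
    "The tree is passed to the checker.",
    "Errors are reported at the end.",
]
examples = make_sop_examples(sentences, random.Random(0))
```

Because both sentences of a negative pair come from the same document, the model cannot rely on topic differences and has to learn about discourse order.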
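Improved Performance. Despite its parameter reduction, ALBERT often outperforms BERT on various downstream NLP tasks while being more memory-efficient. Because the shared parameters are still applied at every layer, inference speed depends on the chosen configuration rather than on the parameter count alone.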
3.3.2 RoBERTa (A Robustly Optimized BERT Pretraining Approach)
RoBERTa [22] builds upon BERT's success and focuses on optimization and robust pre-training. It addresses various aspects of the pre-training process:
Larger Batch Sizes. It uses larger batch sizes and dynamic masking to expose the model to more diverse training data, leading to improved generalization.