VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
COMPUTER SCIENCES
NGA PHAM THI - 21521168
GRADUATE THESIS
APPLYING MACHINE LEARNING FOR
CHILI PEPPER PHENOTYPING AND
FEATURE EXTRACTION
BACHELOR OF COMPUTER SCIENCE
LECTURER: PhD DUNG MAI TIEN
HO CHI MINH CITY, 2024
No one achieves anything great without the help of those around them, whether directly or indirectly. To complete this thesis, I was fortunate to receive much help and support from teachers, colleagues, friends, and family. I would like to dedicate these first pages to express my gratitude to everyone who has accompanied our group during this period.
First of all, I would like to extend my deepest thanks to all the teachers at the University of Information Technology in general and the teachers of the Department of Computer Science in particular. Thanks to the valuable knowledge imparted by the teachers, as well as their dedicated support throughout the process, our group was able to complete the thesis and achieve commendable results.
I especially want to thank Ph.D. Dung Mai Tien and Ph.D. Tuan Thai Thanh, who inspired, meticulously guided, and provided extensive knowledge, creating a favorable environment for me to learn and exchange ideas with the seniors and peers in the research group. These are invaluable insights and experiences, beneficial not only for this graduation thesis but also for the future work ahead.
Finally, I express my heartfelt gratitude to my family and loved ones, who have always been a strong support and consistently backed every decision our group made.
Despite having put in a lot of effort to perfect this thesis, it is hard to avoid mistakes and limitations. I hope to receive understanding and constructive feedback from the teachers and friends.
Ho Chi Minh City, June 24, 2024
Nga Pham Thi, Student
Table of Contents
1.2.1 Objectives
1.2.2 Scope
2.1 Chapter Overview
2.2 Problem in the Field of Biology
2.3 Perspective of Computer Vision
3.4 Chili pepper Dataset for Step 1
3.5 Statistics for the Chili Pepper Dataset
4 CHILI PEPPERS AND SEEDS DETECTION PROBLEM
4.1 Chapter Overview
4.2 Object Detection Problem
4.4 YOLO
4.4.1 YOLOv5
4.4.2 YOLOv7
4.5 Evaluation Metrics
4.5.1 Precision and Recall
4.5.2 IoU
4.5.3 mAP (Mean Average Precision)
4.6 Results and Evaluation
4.6.1 Results
4.6.2 Evaluation
5 FEATURE EXTRACTING PROBLEM
5.1 Chapter Overview
5.2.1 Oriented Block
5.2.2 Ratio convert Pixel to Millimeter Block
5.2.3 Chili pepper Mask Block
5.3 Width and Length of Bounding_box Chili pepper
5.4 Number of Seeds
5.5 Average Width and Length
5.6 Surface Area
5.7 Degree of Redness
5.8 Wrinkles of Chili pepper's edge
5.8.1 By Angles Formed by Three Consecutive Vertices
Analysis Procedure
Conclusion
5.8.2 By Smoothness of Contour
Analysis Procedure
Conclusion
5.8.3 Using Contour Over a Defined Segment
Analysis Procedure
Conclusion
5.9 Summary
6 EVALUATION FEATURES EXTRACTING
6.1 Chapter Overview
6.2 330 Consumer
6.3 Analysis Results
6.4 Clustering Data
7 CONCLUSION AND FUTURE RESEARCH
7.1 Conclusion
7.2 Future research
REFERENCE
List of Figures
1.1 Input and Output
1.2 Phenotyping Feature
2.1 Architecture Pipeline
3.1 Unique species code (IT name of each pepper variety)
3.2 Green House
3.3 Camera and environment setup for image capture
3.4 QR barcode to calculated mm/pixel
3.5 Cropping Chili pepper Images
3.6 Chili Pepper Dataset Label
3.7 Structured of Chili pepper Dataset
3.8 Distribution of Data
3.9 Distribution of Object Counts
4.1 Input and Output of Step 1
4.2 YOLO architecture
4.3 Overview of YOLO
4.4 Diagram of YOLO architecture
4.5 Darknet53
4.6 The network architecture of YOLOv5. It consists of three parts: (1) Backbone: CSPDarknet, (2) Neck: PANet, and (3) Head: YOLO Layer. The data are first inputted to CSPDarknet for feature extraction and then fed to PANet for feature fusion. Finally, YOLO Layer outputs detection results (class, score, location, size)
4.7 Backbone YOLOv7
4.8 Neck YOLOv7
4.9 Head YOLOv7
4.10 Illustration of how to calculate Precision and Recall
4.11 Illustration of IoU Metrics
5.1 Input of Step 2
5.2 QR barcode
5.3 Illustration of Chili pepper mask with only Threshold
5.4 Illustration of Chili pepper mask after using Closing method
5.5 Refined Chili Pepper segmentation
5.6 Illustration of bb_x
6.1 Examples of 330 Consumer
6.2 Histogram of Feature Extracting
6.3 Scatter Matrix of Feature Extracting
6.4 Correlation Matrix of Feature Extracting
6.5 PCA Feature Extracting
List of Tables
3.1 Distribution of Object Counts
4.1 Evaluation with class Chili
4.2 Evaluation with class Seed
4.3 Overall Model Evaluation
6.1 Statistical Summary of Feature Extracting
List of Abbreviations
Ph.D.: Doctor of Philosophy
Provisional glossary
Machine Learning: Học máy
Deep Learning: Học sâu
In the current agricultural sector, identifying phenotypes and accurately describing the morphology of chili peppers involve manual inspections and measurements performed by trained personnel. This process is labor-intensive, time-consuming, and prone to errors due to subjective biases and human mistakes.
With the rapid advancements in computer vision and machine learning, we propose a method that utilizes machine learning and computer vision to automate the process of phenotype identification and feature extraction of chili peppers. Additionally, we aim to establish a dataset for information retrieval regarding various chili pepper varieties. This study is supported by a secure dataset provided through the collaboration between the Department of Computer Science and the BIO-RESOURCE COMPUTING RESEARCH CENTER of Jeju National University, South Korea.
Our approach involves using computer vision and machine learning techniques to automatically extract features from images and to store these characteristics for each chili pepper variety. This is highly beneficial for managing pepper breeding and reproduction in the biological field, catering to the expansive market for chili peppers today.
To address the outlined problem, we will divide it into smaller sub-problems for step-by-step resolution, including localization, image processing, and feature extraction. Subsequently, we will address the application and implementation of the initial objectives.
In summary, this thesis accomplishes the following:
1. Constructing a dataset from images of chili peppers cultivated by biologists at the research institute.
2. For the localization and seed detection of chili peppers in images, to meet real-time conditions, we propose using the one-stage object detection model YOLO [11].
3. For feature extraction after chili identification, we will utilize image processing techniques within the computer vision domain, which will be elaborated on later.
Chapter 1
INTRODUCTION
In this chapter, we will provide an overview of the problem of APPLYING MACHINE LEARNING FOR CHILI PEPPER PHENOTYPING AND FEATURE EXTRACTION, along with the challenges encountered during the implementation of this project. Subsequently, we will summarize the subjects, scope, and research objectives of this thesis. At the end of the chapter, we will present the accomplished work and the main structure of the thesis.
1.1 Problem statement
Chili peppers, cultivated worldwide and used for thousands of years, are spicy fruits belonging to the Solanaceae family. They are highly valued for their unique flavor, nutritional properties, and medicinal benefits. Chili peppers are rich in various vitamins, including vitamins E, C, A, and B complex (such as thiamine and folate), as well as minerals such as molybdenum, manganese, potassium, calcium, and iron [2].
Additionally, they contain polyphenols (mainly luteolin), flavonoids, and quercetin. In many regions, chili peppers play a crucial role in local cuisine, providing unique flavors and adding depth to traditional dishes. Beyond culinary applications, chili peppers are used in various industries, including pharmaceuticals, cosmetics, and even self-defense products, due to their content of capsaicinoids [12], the compounds responsible for their characteristic spiciness.
The chili pepper market has seen significant growth, driven by increasing consumer preference for diverse and authentic flavors, as well as the recognition of the potential health benefits associated with capsaicinoids [12]. Their widespread use as a spice and functional food ingredient has increased global demand for both fresh and processed chili products, creating opportunities for growers, processors, and traders. Demand from the industrial applications noted above adds to this growth.
The vast diversity of chili pepper varieties presents numerous challenges. With many types of chili peppers available, each having distinct phenotypic traits in terms of shape, size, color, spiciness, and flavor, this diversity brings both opportunities and challenges for breeding programs and variety management. Accurate and efficient characterization of these phenotypic traits is crucial for unlocking the full potential of chili pepper varieties and promoting targeted breeding efforts. Traditional methods of phenotyping and accurately describing chili pepper morphology involve manual inspections and measurements by trained personnel, which are labor-intensive, time-consuming, and prone to human error and subjectivity. Moreover, these manual methods often lack the precision and consistency required for comprehensive analysis and comparison of varieties.
Digitizing phenotypic traits through advanced imaging techniques and computer-assisted analysis offers a transformative solution to these challenges. By capturing high-resolution images of chili peppers and leveraging machine learning algorithms, researchers can quantify and extract numerical features with greater accuracy and objectivity than manual measurement allows. This digital approach facilitates precise measurements of traits such as fruit size, seed count, and color parameters, as well as other relevant morphological and biochemical characteristics. The obtained digital data enables detailed variety profiling and supports data-driven decision-making in breeding programs. Moreover, digitizing phenotypic traits allows the creation of comprehensive databases, enabling efficient storage, retrieval, and analysis of varietal information. This data-driven approach allows breeders to identify desirable traits, assess genetic diversity, and make informed choices for developing new varieties that meet market demands or specific environmental conditions. By adopting phenotypic digitization, the chili pepper industry can unlock new avenues for variety management, accelerate breeding cycles, and foster the development of improved varieties to meet the growing demands of consumers and stakeholders.
For these reasons, we were motivated to undertake the project “Applying Machine Learning for Chili Pepper Phenotyping and Feature Extraction”. The project is divided into several sub-problems, which we will discuss later. First, we need to describe this project:
• We will first define the phenotypic traits that need to be extracted.
• Input: Images of chili peppers from which we want to extract information, including a QR barcode (Figure 3.4).
• Output: The phenotypic characteristics of chili peppers, digitized into a CSV file (Figure 1.1).
Figure 1.1: Input and Output
To the best of our knowledge at the time of writing, we found no published work dedicated to image processing of chili peppers for phenotypic information extraction. Our team therefore decided to define this problem by breaking it down into sequential sub-problems. These include the following tasks:
• Identifying chili peppers and their seeds, which lays the foundation for subsequent information extraction steps.
• Defining the extractable information fields. Specifically, we can extract the following eight pieces of information:
- Width and Length of the chili pepper’s bounding box
- Average Width and Length of the chili pepper
- Area of the chili pepper
- Degree of Redness of the chili pepper
- Number of seeds in a chili pepper
- Wrinkle of the chili pepper’s edge
• Image processing to extract phenotypic characteristics.
Breaking the problem down in this way lets us reuse existing solutions. For the object detection sub-problem, many models have already been developed to solve similar tasks for other types of fruit. For the feature extraction sub-problem, we analyze each feature and find a way to extract it from images. What remains is to analyze algorithms and pattern similarities so that existing solutions can be applied to our sub-problems. We hope that our research in this project will provide an automated system that contributes to the storage of phenotypic characteristics of various chili pepper varieties, builds a standardized and highly applicable dataset for future management and breeding, and reduces manual effort while avoiding errors caused by human subjectivity. This will aid in the breeding and preservation of essential characteristics of each chili pepper variety.
After the implementation and exploration process, we identified the following challenges for this project:
• Data:
- Currently, there is no complete dataset containing longitudinal sections of chili peppers. Existing datasets are fragmented and lack consistency in labeling and visualization. They also lack specific cultivation and harvesting processes and are not strictly monitored to ensure data fairness.
- Chili seeds often face occlusion issues, and the seeds are relatively small. Cross-sectional images may not accurately reflect the total number of seeds for varieties with small and numerous seeds.
- The number of collected images is limited, while the number of chili pepper varieties is very large. Accurate phenotypic information requires images of purebred varieties that are strictly managed and cared for.
• Method:
- The processing flow of the system for extracting phenotypic characteristics from chili pepper images is divided into two main modules: detecting chili peppers and seeds, followed by extracting information from each chili pepper. Each module requires accurate determination and appropriate reasoning regarding the nature and methods for each phenotypic information requirement.
• Integration between biology and computer vision:
- Our model is based on requirements from the field of biology, necessitating the integration of biology and computer vision. Essentially, computer vision is used to solve biological problems, which can lead to difficulties in reasoning and evaluation when definitions from both fields need to be met simultaneously.
1.2 The Objectives and Scope
1.2.1 Objectives
We focus on solving the problem of extracting as much information as possible from the longitudinal section images of chili peppers. To accomplish this, we set out the following specific objectives:
1 Define the parameters that need to be extracted in the field of biology
2 Identify the sub-problems to facilitate the determination of extractable
parameters according to the above definitions
3 Explore algorithms and studies of models in other related domains
4 Investigate and reason about the extractable information characteristics
and build a theoretical foundation for these parameters
5 Construct a dataset to support the information extraction process from
chili pepper images.
1.2.2 Scope
Within the limited scope of this thesis, our team focuses on completing the following tasks:
• Define the parameters that need to be extracted in the field of biology.
• Identify the sub-problems to facilitate the determination of extractable parameters according to the above definitions.
• Research algorithms for each sub-problem identified from the definitions.
• Construct a dataset of images of chili peppers grown and harvested by biologists.
• Experiment with YOLOv5 [13] and YOLOv7 [14] methods for the detection of chili peppers and seeds.
• Implement algorithms and rationale for the extractable information characteristics and build a theoretical foundation for these parameters.
• The phenotypic characteristics that the team focuses on extracting and describing from chili pepper images are illustrated in Figure 1.2.
Figure 1.2: Phenotyping Feature
1.3 Contributions
• Systematized knowledge, approaches, and solutions for the problem of extracting phenotypic and morphological information from longitudinal section images of chili peppers.
• Extracted phenotypic characteristics of chili peppers using computer vision techniques for a biology-related problem.
• Evaluated models and methods for the chili pepper and seed detection module:
- For the problem of detecting chili peppers and seeds, we used YOLOv5 [13] and YOLOv7 [14]. YOLOv7 [14] provided better results, with a precision of 93.7% and a recall of 94.3%.
• Constructed a dataset of chili pepper images that were grown and harvested according to the standards evaluated by biological experts, serving the problem of extracting morphological characteristics of chili peppers.
• Developed a demonstration program that allows users to extract phenotypic information about chili peppers from images they provide.
1.4 Implementation thesis
The content we have implemented in this thesis is presented as follows:
• Research the definitions of morphological characteristics of chili peppers in the field of biology.
• Study algorithms and methods in related domains to solve the sub-problems identified.
• Construct a dataset of chili pepper images to support the detection of chili peppers and seeds.
• Conduct experiments to evaluate and compare the effectiveness of various algorithms based on the identified problems.
• Design processes and organize the workflow to ensure that the model’s execution aligns with the input and required output information flow.
• Implement the rationale for each morphological characteristic of chili peppers.
• Develop a demonstration program for our thesis.
1.5 Structure thesis
The thesis is divided into seven main chapters, structured as follows:
• Chapter 1: Introduction to the thesis.
• Chapter 2: Perspective on the problem in the field of Computer Vision and an overview of definitions of characteristics in Biology.
• Chapter 3: Chili Pepper Dataset, with information on the image collection process and the dataset construction process.
• Chapter 4: Overview of the approach to solve the detection phase of chili peppers and seeds.
• Chapter 5: Theoretical foundation and implementation of the rationale for extracting morphological characteristics of chili peppers from images.
• Chapter 6: Preliminary evaluation and discussion on the accuracy of the extracted characteristic parameters.
• Chapter 7: Conclusion and future research directions.
2.1 Chapter Overview
[...] we will clearly state the problem and define the basic concepts of the morphological characteristics of chili peppers while reviewing some related research methods. Additionally, we will present the perspective of the field of computer vision on extracting characteristic information according to the definitions of biology.
2.2 Problem in the Field of Biology
Research on chili pepper varieties and breeding is a promising and important topic to which biologists in Vietnam and worldwide are dedicating significant effort. Examples of such research include:
1. Assessment of Genetic Diversity in Pepper (Capsicum sp.) Landraces from Ghana Using Agro-morphological Characters [1]: This study, reported in the American Journal of Experimental Agriculture, examined the agro-morphological traits of Capsicum species, including cultivated varieties and wild species. It emphasized the extensive genetic pool and the importance of traits such as fruit size, shape, and disease resistance. This research provides valuable insights for breeding programs targeting specific fruit characteristics and improving resistance to pests and diseases.
The collection of morphological characteristics of chili peppers in the field of Biology currently focuses on extracting information to evaluate how these characteristics relate to the genetic traits of chili peppers. Given the extensive breeding and the large number of chili pepper varieties, storing this characteristic information is essential for any research unit. This information is currently measured and calculated manually by research teams, and there is no specific database for this purpose. This necessitates the creation of a dataset that comprehensively stores the morphological characteristics of each chili pepper variety. Such a dataset would greatly facilitate the assessment of the genetic influence on morphological traits, thereby providing a theoretical basis for genetic breeding experiments in chili peppers.
Importance of Morphological Characteristics in the Field of Biology
Number of Seeds
The number of seeds in a chili pepper is an important trait as it directly affects the plant’s reproductive potential and yield. Studies have shown that seed count is not only related to reproductive capability but also impacts fruit quality and seedling development. For example:
• Genetic studies: Some studies have explored the genes controlling seed count in chili peppers. These genes influence fruit development and the number of seeds inside, contributing to yield improvement through breeding varieties with higher seed counts.
• Agricultural applications: Farmers often prefer chili varieties with more seeds due to their high reproductive capacity, which helps maintain and expand cultivation areas.
Area of Chili pepper
The area of a chili pepper is a crucial indicator of the fruit’s size and shape, affecting its commercial and consumer appeal. Research has indicated that:
• Impact on yield: Larger fruit areas generally correspond to higher yields due to the increased weight and size of the fruit.
• Genes related to fruit area: Genetic studies have identified genes that affect fruit size and shape, allowing the breeding of varieties with desired fruit areas.
Degree of Redness
The redness of a chili pepper is one of the most important factors related to the quality and commercial value of the pepper. Redness affects not only the color and appearance but also the levels of capsaicin and carotenoids, which are crucial compounds determining the pepper’s spiciness and nutritional value.
• Nutritional quality and commercial value: High redness is often associated with high capsaicin and carotenoid content, increasing the nutritional and commercial value of the pepper.
• Genetics and breeding: Genes controlling the synthesis of capsaicin and carotenoids have been extensively studied, helping to create chili varieties with the desired redness to meet market demands.
Wrinkle of the Chili Pepper’s Edge
The wrinkle degree of a chili pepper’s skin is an important characteristic in evaluating the quality and commercial value of the pepper. Wrinkling affects the pepper’s visual appeal and its mouthfeel for consumers. Notable studies on this trait include:
• Morphological studies: The wrinkle degree of the chili pepper skin is related to genetic and environmental factors. Genes controlling this trait have been identified, aiding in the breeding of varieties with less wrinkled skin or a desired wrinkle degree.
• Impact on processing: Less wrinkled chili pepper skins are generally preferred in processing methods like drying or making chili powder, as they are easier to process and preserve.
Length and Width of the Fruit
The length and width of a chili pepper are important indicators for determining the size and shape of the fruit, affecting yield and product quality. Research on the length and width of chili peppers typically focuses on:
• Genetic studies: Genes controlling the length and width of chili peppers have been thoroughly studied. These traits have high heritability and strongly influence the plant’s yield and resilience.
• Breeding applications: Breeding chili varieties with desired length and width is an important goal in agriculture, helping to improve yield and meet market demands. For example, long and thin varieties are often favored in Asian cuisines, while round and short varieties may be better suited for stuffing or pickling.
2.3 Perspective of Computer Vision
1. High-throughput Characterization of Fruit Phenotypic Diversity among New Mexican Chile Pepper (Capsicum spp.) Using the Tomato Analyzer Software [6]: Published in HortScience, this research characterized 105 genotypes of New Mexican chile pepper using the Tomato Analyzer software. The study identified key descriptors like perimeter, area, width, and height as major contributors to fruit shape diversity. It highlighted the high heritability and genetic effects of these traits, which are crucial for breeding programs aimed at improving fruit morphology and yield potential.
2. Genetic Diversity, Population Structure, and Heritability of Fruit Traits in Capsicum annuum [9]: This study, published in PLOS ONE, explores the heritability and diversity of fruit traits in Capsicum annuum. It found significant genetic diversity and heritability for traits such as fruit mass, length, diameter, and shape. The study also used the Tomato Analyzer [8] software to obtain precise measurements of fruit characteristics, aiding in the understanding of the genetic factors controlling these traits.
In the field of Computer Vision, the Tomato Analyzer [8] software was introduced to extract 32 pieces of information about tomatoes. Although some papers use this software to extract information about chili peppers, the extracted features are limited due to the differences in shape between tomatoes and chili peppers. Additionally, important characteristics that affect the quality and yield of chili peppers are not adequately captured.
Moreover, the software outputs a large amount of information that does not fit the phenotypic characteristics of chili peppers, which makes information extraction and storage more challenging. This drives us to develop a similar program tailored specifically for chili peppers, capable of extracting more critical and relevant information. Our program aims to offer a user-friendly interface with unique features that simplify storage and use compared to the original software.
2.4 Summary
[...] related research and identified their shortcomings, reiterating the motivations for our research. The information flow that our team aims to extract in this thesis is illustrated in Figure 2.1.
Figure 2.1: Architecture Pipeline
Our thesis is divided into two phases:
1. Phase of detecting chili seeds and fruits: In this phase, the team will build a Chili Pepper dataset specifically for this purpose.
2. Phase of extracting morphological characteristic information of chili peppers: In this phase, the team will analyze and reason about the characteristics and use techniques in the field of Computer Vision to extract these characteristics to meet the requirements in the field of Biology.
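The hand-off between the two phases can be illustrated with one feature, the number of seeds per chili pepper. The sketch below is one possible way to derive it from detector output and is not confirmed as the thesis's method: it assumes the detector returns axis-aligned boxes labeled `"chili"` or `"seed"` (hypothetical class names) and assigns a seed to a chili pepper when the seed's box centre lies inside the chili pepper's box.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    label: str  # "chili" or "seed"; class names are assumed for illustration
    x1: float
    y1: float
    x2: float
    y2: float

    def center(self) -> Tuple[float, float]:
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)

    def contains(self, point: Tuple[float, float]) -> bool:
        x, y = point
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2


def count_seeds_per_chili(boxes: List[Box]) -> List[int]:
    """For each chili box, count the seed boxes whose centre falls inside it."""
    chilis = [b for b in boxes if b.label == "chili"]
    seeds = [b for b in boxes if b.label == "seed"]
    return [sum(chili.contains(seed.center()) for seed in seeds) for chili in chilis]


# Hypothetical detections: one chili pepper, two seeds inside it, one stray seed.
detections = [
    Box("chili", 0, 0, 100, 40),
    Box("seed", 10, 10, 14, 14),
    Box("seed", 50, 20, 54, 24),
    Box("seed", 200, 200, 204, 204),
]
seed_counts = count_seeds_per_chili(detections)
```

With overlapping chili boxes, a seed centre may fall inside more than one box and be counted twice, so a production version would need a tie-breaking rule.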
Chapter 3
CHILI PEPPER IMAGES AND CHILI
PEPPER DATASET
3.1 Chapter Overview
To tackle any machine learning problem, it is essential to have a dataset to begin with so that the model can learn and produce the desired output for the user. This chapter will present the process of building the Chili Pepper dataset, discuss the consistency of the dataset and data assurance in the field of Biology, and explain how we label and process the data from the perspective of those in the field of Computer Vision.
In biology, there are two types of experimental setups for trait evaluation:
1. Evaluation of the same variety under different environmental conditions: This includes variations in water supply, fertilizers, light, temperature, wind, etc. This type of experiment is typically conducted when a variety has been selected, and there is a desire to evaluate its adaptability to specific conditions.
2. Evaluation based on genetic makeup: growing multiple varieties under the same environmental conditions.
The dataset construction in this research follows the second method. Chili pepper varieties were obtained from the Korean Genebank and grown under controlled conditions with consistent temperature, irrigation, and fertilization. Each variety was photographed individually.
3.2 Chili Pepper Cultivation and Sample Selection
The database used in this study comprises longitudinal cross-section images of chili peppers with meticulously labeled seeds. These chili peppers were not sourced from commercial outlets. Instead, they were cultivated and harvested at the Rural Development Administration (RDA) located in Jeonju, South Korea, under the direct supervision of experts in the field of Biology. This approach allowed us to maintain a high level of control over the growing conditions and overall quality of the peppers.
The cultivation of chili pepper varieties was conducted under tightly controlled environmental conditions. We carefully monitored and adjusted the temperature, water irrigation, and fertilization for the chili plants. This level of control was crucial in maintaining the uniformity of our samples and reducing any potential deviations in our data caused by environmental variations. We conducted experiments in a garden with 77 chili pepper varieties, each contributing unique characteristics to our research. Each variety was randomly assigned to plots in our greenhouse and given a unique species code (Figure 3.1) for easy identification and data tracking.
Figure 3.1: Unique species code (IT name of each pepper variety)
The varieties were arranged in a single plot in the field (here, in the greenhouse) as in Figure 3.2, marked sequentially with red numbers (corresponding to the labels in the picture) along with the variety codes (e.g., IT1234567). Each variety was planted with 6 plants per plot, and the order of each plot in the field was random. The plots were repeated three times (in three different greenhouses), resulting in 18 plants per variety (6 plants × 3 repetitions) to minimize systematic errors and edge effects.
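The layout described above (random plot order, repeated in three greenhouses) can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used at the RDA: the function name, the fixed random seed, and the placeholder variety codes are hypothetical.

```python
import random


def randomized_layout(variety_codes, greenhouses=3, plants_per_plot=6, seed=0):
    """Return an independent random plot order per greenhouse and the
    resulting number of plants per variety."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    layout = {}
    for greenhouse in range(1, greenhouses + 1):
        order = list(variety_codes)
        rng.shuffle(order)  # independent random order in every greenhouse
        layout[greenhouse] = order
    plants_per_variety = greenhouses * plants_per_plot
    return layout, plants_per_variety


# 77 placeholder codes; these are not the real IT codes of the varieties.
codes = [f"V{i:02d}" for i in range(1, 78)]
layout, plants_per_variety = randomized_layout(codes)
```

Each greenhouse contains every variety exactly once, so the total of 18 plants per variety follows directly from 6 plants per plot times 3 repetitions.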
Our chili selection process was meticulous and based on standardized criteria. We considered factors such as the area and weight of the chili fruit, their color, and other relevant characteristics. This allowed us to maintain a high level of consistency in sample selection and ensure the reliability of the data.
Sampling was conducted when the plants had flowered (were mature in terms of vegetative growth). Leaves, flowers, and fruits were selected from representative positions of the variety based on criteria (area, weight, color, etc.). Growth curves were drawn for each species and variety to base the sample selection on this data. The sampling process began when the plants had fully fruited and reached maturity, ensuring that we evaluated the chili peppers at the standard growth cycle stage.
Figure 3.4: QR barcode used to calculate the mm/pixel scale
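The scale conversion behind Figure 3.4 can be illustrated with a minimal sketch. It assumes that the physical side length of the printed code is known and that its four corners have already been located in the image; the function name, the 20 mm side length, and the corner coordinates are hypothetical and are not taken from the thesis.

```python
import math

def mm_per_pixel(corners_px, side_mm):
    """Estimate the image scale from the four corners of a square code.

    corners_px: four (x, y) corner points in pixels, ordered around the square.
    side_mm:    physical side length of the printed code (assumed known).
    """
    # Average the four side lengths to dampen small detection errors.
    sides = [
        math.dist(corners_px[i], corners_px[(i + 1) % 4])
        for i in range(4)
    ]
    mean_side_px = sum(sides) / 4.0
    return side_mm / mean_side_px

# Hypothetical values: a 20 mm code that spans 400 pixels in the image.
corners = [(100, 100), (500, 100), (500, 500), (100, 500)]
scale = mm_per_pixel(corners, side_mm=20.0)

# Any pixel measurement can then be converted to millimetres.
fruit_length_mm = 300 * scale
print(scale, fruit_length_mm)
```

Averaging the four sides is only one possible choice; it is used here because it tolerates a slightly tilted code better than measuring a single edge.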
We emphasized maintaining consistent lighting conditions during the photography process and keeping a consistent distance and angle for the shots. This was done to eliminate any potential biases or variations in the image data due to changing lighting conditions or different angles. The images were saved in JPG format, preserving the intricate details of the chili peppers with a pixel resolution of 6024 x 4024. The Chili Pepper dataset contains a total of 32 images before augmentation.
3.4 Chili pepper Dataset for Step 1
We used cropping techniques to extract square patches of 640x640 pixels from the original images, ensuring that each cropped region contains at least one chili pepper (Figure 3.5).
Figure 3.5: Cropping Chili pepper Images
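The cropping step can be sketched as follows. The thesis does not state how crop positions were chosen, so this example assumes a simple sliding window and keeps only windows that fully contain at least one chili bounding box; the function names, the stride, and the example box are hypothetical.

```python
PATCH = 640  # patch side length in pixels, as stated in the text

def window_origins(length, patch, stride):
    """Top-left offsets along one axis so that windows cover the full length."""
    if length <= patch:
        return [0]
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] != length - patch:
        origins.append(length - patch)  # last window sits flush with the border
    return origins

def patches_with_chili(img_w, img_h, chili_boxes, patch=PATCH, stride=PATCH):
    """Return (x, y) origins of patches that fully contain at least one box.

    chili_boxes: list of (x_min, y_min, x_max, y_max) in pixels.
    """
    kept = []
    for y in window_origins(img_h, patch, stride):
        for x in window_origins(img_w, patch, stride):
            for bx0, by0, bx1, by1 in chili_boxes:
                if bx0 >= x and by0 >= y and bx1 <= x + patch and by1 <= y + patch:
                    kept.append((x, y))
                    break  # one chili is enough to keep this patch
    return kept

# Hypothetical example: one chili box in a 6024 x 4024 image.
origins = patches_with_chili(6024, 4024, [(700, 100, 1200, 600)])
print(origins)  # [(640, 0)]
```

A smaller stride would produce overlapping patches and therefore more candidate crops per image, at the cost of more near-duplicate samples.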
This preprocessing step simplified subsequent image-processing tasks and optimized the use of computational resources during model training and inference. We utilized the Roboflow tool for the labeling process. Roboflow is a powerful and user-friendly platform that simplifies the task of annotating objects in images, a crucial step in preparing data for training machine learning models, especially in computer vision.
We identified two classes for our labeling task: "chili" and "seed." The "chili" class represents the entire chili pepper, while the "seed" class includes individual seeds within each chili pepper. Labeling involved manually drawing bounding boxes around each chili pepper and seed in the images. For the "chili" class, we drew tight bounding boxes around the chili pepper. For the "seed" class, we drew boxes around each visible seed within the chili pepper. Seeds with over 70% transparency and over 90% occlusion were ignored to avoid introducing noise into the model.
Figure 3.6: Chili Pepper Dataset Label
Our Chili Pepper dataset is structured in the YOLOv5[13] folder format, making it compatible with most popular model architectures and object detection frameworks, and specifically suitable for the input requirements of YOLOv5[13] and YOLOv7[14], which we will experiment with later. However, at this stage, we have determined that the dataset is relatively small for training a deep learning model. To enrich the diversity of the dataset and enhance the model's generalization capability and performance on unseen samples, we employed various data augmentation techniques. These data augmentation strategies aim to expand the dataset by introducing variations in image orientation and other image features.
The final Chili pepper dataset includes over 700 annotated images of different chili pepper varieties captured under consistent lighting conditions and backgrounds. The images were manually labeled with bounding boxes around each chili pepper and seed sample. We split our annotated dataset into training, validation, and test sets to ensure robust model training and performance evaluation. This partitioning is a common practice in machine learning to prevent overfitting and to assess the generalization capability of the trained models. We followed a 70-20-10 split ratio, allocating 70% of the data for training purposes, 20% for validation, and 10% for testing. The training set is used to optimize the model parameters during training. In contrast, the validation set is used to tune hyperparameters and monitor the model's performance during training to prevent overfitting. The test set, entirely separate from the training process, serves as an objective measure of the model's performance on unseen data. By evaluating the model's predictions on the test set, we can obtain a realistic estimate of its generalization capability and assess its suitability for deployment in real-world scenarios.
The training, validation, and test sets were carefully managed and organized, with image file names and corresponding annotations stored in separate files for each subset (Figure 3.7). This structured organization facilitated efficient data loading and preprocessing during the training and evaluation phases of our machine learning workflow.
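A 70-20-10 partition of the kind described above could be produced along the following lines. This is only a sketch: the random shuffle, the fixed seed, and the file names are assumptions, and the thesis does not specify which tool performed the split.

```python
import random

def split_dataset(names, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle file names and split them into train/val/test lists."""
    names = sorted(names)               # fixed starting order for reproducibility
    random.Random(seed).shuffle(names)  # seeded shuffle, independent of global state
    n_train = round(len(names) * ratios[0])
    n_val = round(len(names) * ratios[1])
    return {
        "train": names[:n_train],
        "val": names[n_train:n_train + n_val],
        "test": names[n_train + n_val:],  # the remainder goes to the test set
    }

# Hypothetical file names; the real dataset uses its own naming scheme.
files = [f"img_{i:04d}.jpg" for i in range(100)]
subsets = split_dataset(files)
print({name: len(items) for name, items in subsets.items()})
# {'train': 70, 'val': 20, 'test': 10}
```

Each subset list can then be written to its own file, which matches the per-subset organization described in the text.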
Figure 3.7: Structure of the Chili Pepper Dataset
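For reference, a YOLO-style label file stores one line per object in the form "class x_center y_center width height", with all four geometric values normalized by the image size. The sketch below converts a pixel bounding box into such a line; the directory names in the comment and the class index order (chili = 0, seed = 1) are assumptions, not details confirmed by the thesis.

```python
# Commonly used YOLOv5-style layout (directory names are an assumption here):
#   dataset/images/{train,val,test}/name.jpg
#   dataset/labels/{train,val,test}/name.txt   <- one line per object

CLASS_IDS = {"chili": 0, "seed": 1}  # assumed index order

def to_yolo_line(label, box, img_w, img_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    xc = (x_min + x_max) / 2 / img_w   # normalized box center, x
    yc = (y_min + y_max) / 2 / img_h   # normalized box center, y
    w = (x_max - x_min) / img_w        # normalized width
    h = (y_max - y_min) / img_h        # normalized height
    return f"{CLASS_IDS[label]} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# Hypothetical seed box inside a 640 x 640 patch.
line = to_yolo_line("seed", (160, 320, 192, 352), 640, 640)
print(line)  # 1 0.275000 0.525000 0.050000 0.050000
```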
3.5 Statistics for the Chili Pepper Dataset
The number of samples in the dataset after the final collection and preprocessing steps:
Figure 3.8: Distribution of Data
Distribution of Data in Training, Validation, and Test Sets:
• Training set (Train): 584 samples
• Validation set (Validation): 40 samples
• Test set (Test): 30 samples
Figure 3.9: Distribution of Object Counts
The dataset contains two object classes: chili and seed.
Training set | Validation set | Test set
3.6 Summary
This chapter described the cultivation environment and the data collection process, as well as how we shaped and built the Chili Pepper Dataset from the perspective of Computer Vision to serve the subsequent machine learning training process. We also conducted a preliminary evaluation of the Chili Pepper dataset and discussed the assurance and consistency of each chili pepper variety in this dataset.
Chapter 4
CHILI PEPPERS AND SEEDS
DETECTION PROBLEM
4.1 Chapter Overview
As outlined from the beginning, our proposed processing flow consists of two steps: detecting chili peppers and seeds, and extracting the morphological characteristic information of the chili peppers. This chapter focuses on Step 1. For the problem of detecting chili peppers and seeds, we need the output to be bounding boxes containing information about the location and labels of the chili pepper and seed objects. Essentially, this is an object detection problem, so this chapter will introduce the concept of object detection, how we approached this problem, the methods we experimented with, and the experimental evaluation results, along with the rationale for our final choice in this stage.
4.2 Object Detection Problem
Object detection is a fundamental problem that involves classifying and locating objects within an image or video. There are two essential concepts to understand:
Image Classification
Predicting the label of an object in an image. The input is an image containing one object, and the output is the label of the object.
Object Localization
Determining the presence of objects in an image and specifying their locations using bounding boxes. The input is an image with one or more objects, and the output is one or more bounding boxes defined by the coordinates of the center, width, and height.
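The center/width/height representation of a bounding box is equivalent to the corner representation often used for cropping. A minimal sketch of the conversion in both directions is given below; the function names and the example values are illustrative only.

```python
def center_to_corners(xc, yc, w, h):
    """(center x, center y, width, height) -> (x_min, y_min, x_max, y_max)."""
    return (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)

def corners_to_center(x_min, y_min, x_max, y_max):
    """(x_min, y_min, x_max, y_max) -> (center x, center y, width, height)."""
    return ((x_min + x_max) / 2, (y_min + y_max) / 2, x_max - x_min, y_max - y_min)

# Hypothetical box: centered at (320, 240), 100 pixels wide and 60 pixels high.
corners = center_to_corners(320, 240, 100, 60)
print(corners)                      # (270.0, 210.0, 370.0, 270.0)
print(corners_to_center(*corners))  # (320.0, 240.0, 100.0, 60.0)
```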
Object detection combines image classification and object localization. It involves drawing a bounding box around each object of interest in the image and assigning a label to it. For our problem, this first module provides the parameters needed to segment the input image into individual chili peppers, facilitating the extraction of information in the next module. Additionally, the number of seeds per chili pepper is derived from the seeds detected by this module.
The input and output of Step 1 are as follows (see Figure 4.1):
• Input:
  - Longitudinal cross-section images of chili peppers.
• Output:
  - A text file where each line includes:
    * The class label of the object (bb[0])
    * The confidence score of the detection (bb[1])
    * The coordinates of the center of the bounding box (bb[2], bb[3])
    * The width and height of the bounding box (bb[4], bb[5])
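To illustrate how such an output file could be consumed, the sketch below parses each line into the fields bb[0] to bb[5] and counts the seeds belonging to each chili pepper. Several details are assumptions rather than facts from the thesis: the fields are taken to be whitespace-separated, the class label is taken to be the text "chili" or "seed", and a seed is assigned to the first chili box that contains its center.

```python
def parse_detections(text):
    """Parse lines of 'label confidence xc yc w h' into lists bb[0..5]."""
    detections = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skip malformed lines
        detections.append([parts[0]] + [float(v) for v in parts[1:]])
    return detections

def seeds_per_chili(detections):
    """Count seeds whose center lies inside each chili bounding box."""
    chilis = [bb for bb in detections if bb[0] == "chili"]
    seeds = [bb for bb in detections if bb[0] == "seed"]
    counts = [0] * len(chilis)
    for s in seeds:
        for i, c in enumerate(chilis):
            inside_x = abs(s[2] - c[2]) <= c[4] / 2
            inside_y = abs(s[3] - c[3]) <= c[5] / 2
            if inside_x and inside_y:
                counts[i] += 1
                break  # assign each seed to at most one chili
    return counts

# Hypothetical detections in pixel coordinates.
sample = """
chili 0.95 200 300 150 400
chili 0.91 500 300 140 380
seed 0.80 190 310 12 12
seed 0.77 210 250 11 12
seed 0.70 505 320 12 13
"""
seed_counts = seeds_per_chili(parse_detections(sample))
print(seed_counts)  # [2, 1]
```

The same chili boxes, converted to corner coordinates, could also serve as the crop parameters passed to the next module.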