Computer Science graduation thesis: Applying machine learning for phenotyping and feature extraction of chili peppers


VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

COMPUTER SCIENCES

NGA PHAM THI - 21521168

GRADUATE THESIS

APPLYING MACHINE LEARNING FOR

CHILI PEPPER PHENOTYPING AND

FEATURE EXTRACTION

BACHELOR OF COMPUTER SCIENCE

LECTURER: PhD DUNG MAI TIEN

HO CHI MINH CITY, 2024


No one achieves anything great without the help of those around them, whether directly or indirectly. To complete this thesis, I was fortunate to receive much help and support from teachers, colleagues, friends, and family. I would like to dedicate these first pages to express my gratitude to everyone who has accompanied our group during this period.

First of all, I would like to extend my deepest thanks to all the teachers at the University of Information Technology in general and the teachers of the Department of Computer Science in particular. Thanks to the valuable knowledge imparted by the teachers, as well as their dedicated support throughout the process, our group was able to complete the thesis and achieve commendable results.

I especially want to thank Ph.D. Dung Mai Tien and Ph.D. Tuan Thai Thanh, who inspired, meticulously guided, and provided extensive knowledge, creating a favorable environment for me to learn and exchange ideas with the seniors and peers in the research group. These are invaluable insights and experiences, beneficial not only for this graduation thesis but also for the future work ahead.

Finally, I express my heartfelt gratitude to my family and loved ones, who have always been a strong support and consistently backed every decision our group made.

Despite having put in a lot of effort to perfect this thesis, it is hard to avoid mistakes and limitations. I hope to receive understanding and constructive feedback from the teachers and friends.

Ho Chi Minh city, June 24, 2024

Nga Pham Thi, Student

    1.2.1 Objectives
    1.2.2 Scope
  2.1 Chapter Overview
  2.2 Problem in the Field of Biology
  2.3 Perspective of Computer Vision
  3.4 Chili Pepper Dataset for Step 1
  3.5 Statistics for the Chili Pepper Dataset
4 CHILI PEPPERS AND SEEDS DETECTION PROBLEM
  4.1 Chapter Overview
  4.2 Object Detection Problem
  4.4 YOLO
    4.4.1 YOLOv5
    4.4.2 YOLOv7
  4.5 Evaluation Metrics
    4.5.1 Precision and Recall
    4.5.2 IoU
    4.5.3 mAP (Mean Average Precision)
  4.6 Results and Evaluation
    4.6.1 Results
    4.6.2 Evaluation
5 FEATURE EXTRACTING PROBLEM
  5.1 Chapter Overview
    5.2.1 Oriented Block
    5.2.2 Ratio Convert Pixel to Millimeter Block
    5.2.3 Chili Pepper Mask Block
  5.3 Width and Length of Bounding_box Chili Pepper
  5.4 Number of Seeds
  5.5 Average Width and Length
  5.6 Surface Area
  5.7 Degree of Redness
  5.8 Wrinkles of Chili Pepper's Edge
    5.8.1 By Angles Formed by Three Consecutive Vertices
      Analysis Procedure
      Conclusion
    5.8.2 By Smoothness of Contour
      Analysis Procedure
      Conclusion
    5.8.3 Using Contour Over a Defined Segment
      Analysis Procedure
      Conclusion
  5.9 Summary
6 EVALUATION FEATURES EXTRACTING
  6.1 Chapter Overview
  6.2 330 Consumer
  6.3 Analysis Results
  6.4 Clustering Data
7 CONCLUSION AND FUTURE RESEARCH
  7.1 Conclusion
  7.2 Future research
REFERENCE

List of Figures

1.1 Input and Output
1.2 Phenotyping Feature
2.1 Architecture Pipeline
3.1 Unique species code (IT name of each pepper variety)
3.2 Green House
3.3 Camera and environment setup for image capture
3.4 QR barcode used to calculate mm/pixel
3.5 Cropping Chili Pepper Images
3.6 Chili Pepper Dataset Label
3.7 Structure of Chili Pepper Dataset
3.8 Distribution of Data
3.9 Distribution of Object Counts
4.1 Input and Output of Step 1
4.2 YOLO architecture
4.3 Overview of YOLO
4.4 Diagram of YOLO architecture
4.5 Darknet53
4.6 The network architecture of YOLOv5. It consists of three parts: (1) Backbone: CSPDarknet, (2) Neck: PANet, and (3) Head: YOLO Layer. The data are first input to CSPDarknet for feature extraction and then fed to PANet for feature fusion. Finally, the YOLO Layer outputs detection results (class, score, location, size)
4.7 Backbone YOLOv7
4.8 Neck YOLOv7
4.9 Head YOLOv7
4.10 Illustration of how to calculate Precision and Recall
4.11 Illustration of IoU Metrics
5.1 Input of Step 2
5.2 QR barcode
5.3 Illustration of Chili Pepper mask with only Threshold
5.4 Illustration of Chili Pepper mask after using Closing method
5.5 Refined Chili Pepper segmentation
5.6 Illustration of bb_x
6.1 Examples of 330 Consumer
6.2 Histogram of Feature Extracting
6.3 Scatter Matrix of Feature Extracting
6.4 Correlation Matrix of Feature Extracting
6.5 PCA Feature Extracting

List of Tables

3.1 Distribution of Object Counts
4.1 Evaluation with class Chili
4.2 Evaluation with class Seed
4.3 Overall Model Evaluation
6.1 Statistical Summary of Feature Extracting

List of Abbreviations

Ph.D    Doctor of Philosophy

Provisional glossary

Machine Learning    Học máy
Deep Learning       Học sâu

In the current agricultural sector, identifying phenotypes and accurately describing the morphology of chili peppers involve manual inspections and measurements performed by trained personnel. This process is labor-intensive, time-consuming, and prone to errors due to subjective biases and human mistakes.

With the rapid advancements in computer vision and machine learning, we propose a method that utilizes machine learning and computer vision to automate the process of phenotype identification and feature extraction of chili peppers. Additionally, we aim to establish a dataset for information retrieval regarding various chili pepper varieties. This study is supported by a secure dataset provided through the collaboration between the Department of Computer Science and the BIO-RESOURCE COMPUTING RESEARCH CENTER of Jeju National University, South Korea.

Our approach involves using computer vision and machine learning techniques to automatically extract features from images and store these characteristics for each chili pepper variety. This is highly beneficial for managing pepper breeding and reproduction in the biological field, catering to the expansive market for chili peppers today.

To address the outlined problem, we divide it into smaller sub-problems for step-by-step resolution, including localization, image processing, and feature extraction. Subsequently, we address the application and implementation of the initial objectives.

In summary, this thesis accomplishes the following:

1. Constructing a dataset from images of chili peppers cultivated by biologists at the research institute.

2. For the localization and seed detection of chili peppers in images, to meet real-time conditions, we propose using the one-stage object detection model YOLO [11].

3. For feature extraction after chili identification, we utilize image processing techniques within the computer vision domain, which will be elaborated on later.

Chapter 1

INTRODUCTION

In this chapter, we provide an overview of the problem of APPLYING MACHINE LEARNING FOR CHILI PEPPER PHENOTYPING AND FEATURE EXTRACTION, along with the challenges encountered during the implementation of this project. Subsequently, we summarize the subjects, scope, and research objectives of this thesis. At the end of the chapter, we present the accomplished work and the main structure of the thesis.

1.1 Problem statement

Chili peppers, cultivated worldwide and used for thousands of years, are spicy fruits belonging to the Solanaceae family. They are highly valued for their unique flavor, nutritional properties, and medicinal benefits. Chili peppers are rich in various vitamins, including vitamins E, C, A, and B complex (such as thiamine and folate), as well as minerals such as molybdenum, manganese, potassium, calcium, and iron [2].

Additionally, they contain polyphenols (mainly luteolin), flavonoids, and quercetin. In many regions, chili peppers play a crucial role in local cuisine, providing unique flavors and adding depth to traditional dishes. Beyond culinary applications, chili peppers are used in various industries, including pharmaceuticals, cosmetics, and even self-defense products, due to their content of capsaicinoids [12], the compounds responsible for their characteristic spiciness.

The chili pepper market has seen significant growth, driven by increasing consumer preference for diverse and authentic flavors, as well as the recognition of the potential health benefits associated with capsaicinoids [12]. Their widespread use as a spice and functional food ingredient has increased global demand for both fresh and processed chili products, creating opportunities for growers, processors, and traders. Furthermore, chili peppers have become an important component in many industrial applications, as noted above for pharmaceuticals, cosmetics, and self-defense products.

The vast diversity of chili pepper varieties presents numerous challenges. With many types of chili peppers available, each having distinct phenotypic traits in terms of shape, size, color, spiciness, and flavor, this diversity brings both opportunities and challenges for breeding programs and variety management. Accurate and efficient characterization of these phenotypic traits is crucial for unlocking the full potential of chili pepper varieties and promoting targeted breeding efforts. Traditional methods of phenotyping and accurately describing chili pepper morphology involve manual inspections and measurements by trained personnel, which are labor-intensive, time-consuming, and prone to human error and subjectivity. Moreover, these manual methods often lack the precision and consistency required for comprehensive analysis and comparison of varieties.

Digitizing phenotypic traits through advanced imaging techniques and computer-assisted analysis offers a transformative solution to these challenges. Researchers can quantify and extract numerical features with greater accuracy and objectivity by capturing high-resolution images of chili peppers and leveraging machine learning algorithms. This digital approach facilitates precise measurements of traits such as fruit size, seed count, and color parameters, as well as other relevant morphological and biochemical characteristics. The obtained digital data enables detailed variety profiling and supports data-driven decision-making in breeding programs. Moreover, digitizing phenotypic traits allows the creation of comprehensive databases, enabling efficient storage, retrieval, and analysis of varietal information. This data-driven approach allows breeders to identify desirable traits, assess genetic diversity, and make informed choices for developing new varieties that meet market demands or specific environmental conditions. By adopting phenotypic digitization, the chili pepper industry can unlock new avenues for variety management, accelerate breeding cycles, and foster the development of improved varieties to meet the growing demands of consumers and stakeholders.

For these reasons, we were motivated to undertake the project “Applying Machine Learning for Chili Pepper Phenotyping and Feature Extraction”. The project is divided into several sub-problems, which we discuss later. First, we describe the project:

• We first define the phenotypic traits that need to be extracted.

• Input: Images of chili peppers from which we want to extract information, including a QR barcode (Figure 3.4).

• Output: The phenotypic characteristics of chili peppers digitized into a CSV file (Figure 1.1).

Figure 1.1: Input and Output

Based on our limited understanding during the execution of this thesis, we found no scientific papers on the image processing of chili peppers for information extraction. Our team therefore decided to define this problem by breaking it down into sequential sub-problems. These include the following tasks:

• Identifying chili peppers and their seeds, which lays the foundation for subsequent information extraction steps.

• Defining the extractable information fields. Specifically, we can extract the following eight pieces of information:

  - Width and Length of the chili pepper's bounding box
  - Average Width and Length of the chili pepper
  - Area of the chili pepper
  - Degree of Redness of the chili pepper
  - Number of seeds in a chili pepper
  - Wrinkle of the chili pepper's edge

• Image processing to extract phenotypic characteristics.

Breaking the problem down in this way lets us reuse existing solutions. For the object detection problem, many models have already been developed to solve similar issues for other types of fruits. For the feature extraction problem, we analyze each feature and find ways to extract that type of information from images. Therefore, what we need to do is analyze algorithms and pattern similarities in order to apply existing solutions to our sub-problems. We hope that our research in this project will provide an automated system that contributes to the storage of phenotypic characteristics of various chili pepper varieties, builds a standardized and highly applicable dataset for future management and breeding, and reduces manual effort while avoiding errors caused by human subjectivity. This will aid in the breeding and preservation of essential characteristics of each chili pepper variety.

After the implementation and exploration process, we identified the following challenges for this project:

• Data:

  - Currently, there is no complete dataset containing longitudinal sections of chili peppers. Existing datasets are fragmented and lack consistency in labeling and visualization. They also lack specific cultivation and harvesting processes and are not strictly monitored to ensure data fairness.

  - Chili seeds often face occlusion issues, and the seeds are relatively small. Cross-sectional images may not accurately reflect the total number of seeds for varieties with small and numerous seeds.

  - The number of collected images is limited, while the number of chili pepper varieties is very large. Accurate phenotypic information requires images of purebred varieties that are strictly managed and cared for.

• Method:

  - The processing flow of the system for extracting phenotypic characteristics from chili pepper images is divided into two main modules: detecting chili peppers and seeds, followed by extracting information from each chili pepper. Each module requires accurate determination and appropriate reasoning regarding the nature and methods for each phenotypic information requirement.

• Integration between biology and computer vision:

  - Our model is based on requirements from the field of biology, necessitating the integration of biology and computer vision. Essentially, computer vision is used to solve biological problems, which can lead to difficulties in reasoning and evaluation when definitions from both fields need to be met simultaneously.

1.2 The Objectives and Scope

1.2.1 Objectives

We focus on solving the problem of extracting as much information as possible from the longitudinal section images of chili peppers. To accomplish this, we set out the following specific objectives:

1. Define the parameters that need to be extracted in the field of biology.

2. Identify the sub-problems to facilitate the determination of extractable parameters according to the above definitions.

3. Explore algorithms and studies of models in other related domains.

4. Investigate and reason about the extractable information characteristics and build a theoretical foundation for these parameters.

5. Construct a dataset to support the information extraction process from chili pepper images.

1.2.2 Scope

Within the limited scope of this thesis, our team focuses on completing the following tasks:

• Define the parameters that need to be extracted in the field of biology.

• Identify the sub-problems to facilitate the determination of extractable parameters according to the above definitions.

• Research algorithms for each sub-problem identified from the definitions.

• Construct a dataset of images of chili peppers grown and harvested by biologists.

1.3 Contributions

• Experiment with YOLOv5 [13] and YOLOv7 [14] methods for the detection of chili peppers and seeds.

• Implement algorithms and rationale for the extractable information characteristics and build a theoretical foundation for these parameters.

• The phenotypic characteristics that the team focuses on extracting and describing from chili pepper images are illustrated in Figure 1.2.

Figure 1.2: Phenotyping Feature (diagram of the extracted fields, including image name, pixel-to-mm ratio, number of seeds, average width, and length)

• Systematized knowledge, approaches, and solutions for the problem of extracting phenotypic and morphological information from longitudinal section images of chili peppers.

• Extracted phenotypic characteristics of chili peppers using computer vision techniques for a biology-related problem.

• Evaluated models and methods for the module that detects chili peppers and seeds:

  - For the problem of detecting chili peppers and seeds, we used YOLOv5 [13] and YOLOv7 [14]. YOLOv7 [14] provided better results, with a precision of 93.7% and a recall of 94.3%.

• Constructed a dataset of chili pepper images that were grown and harvested according to the standards evaluated by biological experts, serving the problem of extracting morphological characteristics of chili peppers.

• Developed a demonstration program that allows users to extract phenotypic information about chili peppers from images provided by users.

1.4 Implementation thesis

The content we have implemented in this thesis is presented as follows:

• Research the definitions of morphological characteristics of chili peppers in the field of biology.

• Study algorithms and methods in related domains to solve the sub-problems identified.

• Construct a dataset of chili pepper images to support the detection of chili peppers and seeds.

• Conduct experiments to evaluate and compare the effectiveness of various algorithms based on the identified problems.

• Design processes and organize the workflow to ensure that the model's execution aligns with the input and required output information flow.

• Implement the rationale for each morphological characteristic of chili peppers.

• Develop a demonstration program for our thesis.

1.5 Structure thesis

The thesis is divided into seven main chapters, structured as follows:

Chapter 1: Introduction to the thesis.

Chapter 2: Perspective on the problem in the field of Computer Vision and an overview of definitions of characteristics in Biology.

Chapter 3: Chili Pepper Dataset and information on the image collection process and dataset construction process.

Chapter 4: Overview of the approach to solve the detection phase of chili peppers and seeds.

Chapter 5: Theoretical foundation and implementation of the rationale for extracting morphological characteristics of chili peppers from images.

Chapter 6: Preliminary evaluation and discussion on the accuracy of the extracted characteristic parameters.

Chapter 7: Conclusion and future research directions.

[…] we will clearly state the problem and define the basic concepts of the morphological characteristics of chili peppers while reviewing some related research methods. Additionally, we will present the perspective of the field of computer vision on extracting characteristic information according to the definitions of biology.

2.2 Problem in the Field of Biology

Research on chili pepper varieties and breeding is a promising and important topic to which biologists in Vietnam and worldwide are dedicating significant effort. Examples of such research include:

1. Assessment of Genetic Diversity in Pepper (Capsicum sp.) Landraces from Ghana Using Agro-morphological Characters [1]: This study, reported in the American Journal of Experimental Agriculture, examined the agro-morphological traits of Capsicum species, including cultivated varieties and wild species. It emphasized the extensive genetic pool and the importance of traits such as fruit size, shape, and disease resistance. This research provides valuable insights for breeding programs targeting specific fruit characteristics and improving resistance to pests and diseases.

The collection of morphological characteristics of chili peppers in the field of Biology currently focuses on extracting information to evaluate how these characteristics relate to the genetic traits of chili peppers. With extensive breeding and the large number of chili pepper varieties, storing this characteristic information is essential for any research unit. This information is currently measured and calculated manually by research teams, and there is no specific database for this purpose. This necessitates the creation of a dataset that comprehensively stores the morphological characteristics of each chili pepper variety. Such a dataset would greatly facilitate the assessment of the genetic influence on morphological traits, thereby providing a theoretical basis for genetic breeding experiments in chili peppers.

Importance of Morphological Characteristics in the Field of Biology

Number of Seeds

The number of seeds in a chili pepper is an important trait as it directly affects the plant's reproductive potential and yield. Studies have shown that seed count is not only related to reproductive capability but also impacts fruit quality and seedling development. For example:

• Genetic studies: Some studies have explored the genes controlling seed count in chili peppers. These genes influence fruit development and the number of seeds inside, contributing to yield improvement through breeding varieties with higher seed counts.

• Agricultural applications: Farmers often prefer chili varieties with more seeds due to their high reproductive capacity, which helps maintain and expand cultivation areas.

Area of Chili Pepper

The area of a chili pepper is a crucial indicator of the fruit's size and shape, affecting its commercial and consumer appeal. Research has indicated that:

• Impact on yield: Larger fruit areas generally correspond to higher yields due to the increased weight and size of the fruit.

• Genes related to fruit area: Genetic studies have identified genes that affect fruit size and shape, allowing the breeding of varieties with desired fruit areas.

Degree of Redness

The redness of a chili pepper is one of the most important factors related to the quality and commercial value of the pepper. Redness affects not only the color and appearance but is also associated with the levels of capsaicin and carotenoids, which are crucial compounds determining the pepper's spiciness and nutritional value.

• Nutritional quality and commercial value: High redness is often associated with high capsaicin and carotenoid content, increasing the nutritional and commercial value of the pepper.

• Genetics and breeding: Genes controlling the synthesis of capsaicin and carotenoids have been extensively studied, helping to create chili varieties with the desired redness to meet market demands.

Wrinkle of the Chili Pepper's Edge

The degree of wrinkling of a chili pepper's skin is an important characteristic in evaluating the quality and commercial value of the pepper. Wrinkling affects the aesthetic appeal of the fruit and its mouthfeel for consumers. Notable studies on this trait include:

• Morphological studies: The degree of wrinkling of the chili pepper skin is related to genetic and environmental factors. Genes controlling this trait have been identified, aiding in the breeding of varieties with less wrinkled skin or a desired degree of wrinkling.

• Impact on processing: Less wrinkled chili pepper skins are generally preferred in processing methods like drying or making chili powder, as they are easier to process and preserve.

Length and Width of the Fruit

The length and width of a chili pepper are important indicators for determining the size and shape of the fruit, affecting yield and product quality. Research on the length and width of chili peppers typically focuses on:

• Genetic studies: Genes controlling the length and width of chili peppers have been thoroughly studied. These traits have high heritability and strongly influence the plant's yield and resilience.

• Breeding applications: Breeding chili varieties with desired length and width is an important goal in agriculture, helping to improve yield and meet market demands. For example, long and thin varieties are often favored in Asian cuisines, while round and short varieties may be better suited for stuffing or pickling.

1. High-throughput Characterization of Fruit Phenotypic Diversity among New Mexican Chile Pepper (Capsicum spp.) Using the Tomato Analyzer Software [6]: Published in HortScience, this research characterized 105 genotypes of New Mexican chile pepper using the Tomato Analyzer software. The study identified key descriptors like perimeter, area, width, and height as major contributors to fruit shape diversity. It highlighted the high heritability and genetic effects of these traits, which are crucial for breeding programs aimed at improving fruit morphology and yield potential.

2. Genetic Diversity, Population Structure, and Heritability of Fruit Traits in Capsicum annuum [9]: This study, published in PLOS ONE, explores the heritability and diversity of fruit traits in Capsicum annuum. It found significant genetic diversity and heritability for traits such as fruit mass, length, diameter, and shape. The study also used the Tomato Analyzer [8] software to obtain precise measurements of fruit characteristics, aiding in the understanding of the genetic factors controlling these traits.

In the field of Computer Vision, the Tomato Analyzer [8] software was introduced to extract 32 pieces of information about tomatoes. Although some papers use this software to extract information about chili peppers, the extracted features are limited due to the differences in shape between tomatoes and chili peppers. Additionally, important characteristics that affect the quality and yield of chili peppers are not adequately captured.

Moreover, the software outputs a large amount of information that does not fit the phenotypic characteristics of chili peppers, which makes information extraction and storage more challenging. This drives us to develop a similar program tailored specifically for chili peppers, capable of extracting more critical and relevant information. Our program aims to offer a user-friendly interface with features that simplify storage and use compared to the original software.

2.4 Summary

[…] related research and identified their shortcomings, reiterating the motivations for our research. The information flow that our team aims to extract in this thesis is illustrated in Figure 2.1.

[Figure 2.1 shows the pipeline output as one row per image (for example 004_IT158286_1_1.jpg), with columns including image name, number of seeds, bounding-box width, average chili width, length, area, and degree of redness.]

Figure 2.1: Architecture Pipeline

Our thesis is divided into two phases:

1. Phase of detecting chili seeds and fruits: In this phase, the team will build a Chili Pepper dataset specifically for this purpose.

2. Phase of extracting morphological characteristic information of chili peppers: In this phase, the team will analyze and reason about the characteristics and use techniques in the field of Computer Vision to extract these characteristics to meet the requirements in the field of Biology.
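To make the output of the second phase concrete, the following is a minimal sketch of how one extracted record per image could be stored in a CSV file. It is an illustration only: the class name `PepperFeatures`, the column names, the units, and the sample values are assumptions introduced here, loosely following the eight fields listed in Chapter 1, and they are not the thesis's actual file layout.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class PepperFeatures:
    """One row of extracted traits (hypothetical column names and units)."""
    image_name: str
    num_seeds: int
    bbox_width_mm: float
    bbox_length_mm: float
    avg_width_mm: float
    avg_length_mm: float
    area_mm2: float
    redness: float
    edge_wrinkle: float


def write_features_csv(records, stream):
    """Write a header row followed by one row per record."""
    names = [f.name for f in fields(PepperFeatures)]
    writer = csv.DictWriter(stream, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


def demo_csv():
    """Return CSV text for a single made-up record (values are not real data)."""
    buffer = io.StringIO()
    sample = PepperFeatures("sample_1.jpg", 19, 39.0, 151.0, 29.0, 140.0,
                            3500.0, 0.8, 0.1)
    write_features_csv([sample], buffer)
    return buffer.getvalue()
```

Keeping one flat row per image makes the file easy to load later for the statistical summaries and clustering described in the evaluation chapter.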


Chapter 3

CHILI PEPPER IMAGES AND CHILI PEPPER DATASET

3.1 Chapter Overview

Any machine learning problem requires a dataset from which the model can learn to produce the desired output for the user. This chapter presents the process of building the Chili Pepper dataset, discusses the consistency of the dataset and data assurance from the perspective of Biology, and explains how we label and process the data from the perspective of Computer Vision.

In biology, there are two types of experimental setups for trait evaluation:

1. Evaluation of the same variety under different environmental conditions: This includes variations in water supply, fertilizers, light, temperature, wind, etc. This type of experiment is typically conducted when a variety has been selected, and there is a desire to evaluate its adaptability to specific conditions.

2. Evaluation based on genetic makeup, growing multiple varieties under the same environmental conditions.

The dataset construction in this research follows the second method. Chili pepper varieties were obtained from the Korean Genebank and grown under controlled conditions with consistent temperature, irrigation, and fertilization. Each variety was photographed individually.

3.2 Chili Pepper Cultivation and Sample Selection

The database used in this study comprises longitudinal cross-section images of chili peppers and meticulously labeled seeds. These chili peppers were not sourced from commercial outlets. Instead, they were cultivated and harvested at the Rural Development Administration (RDA) located in Jeonju, South Korea, under the direct supervision of experts in the field of Biology. This approach allowed us to maintain a high level of control over the growing conditions and overall quality of the peppers.

The cultivation of chili pepper varieties was conducted under tightly controlled environmental conditions. We carefully monitored and adjusted the temperature, water irrigation, and fertilization for the chili plants. This level of control was crucial in maintaining the uniformity of our samples and reducing any potential deviations in our data caused by environmental variations. We conducted experiments in a garden with 77 chili pepper species, each contributing unique characteristics to our research. Each species was randomly assigned to plots in our greenhouse and given a unique species code (Figure 3.1) for easy identification and data tracking.

Figure 3.1: Unique species code (IT name of each pepper variety)

The varieties were arranged in a single plot in the field (here, in the greenhouse) as shown in Figure 3.2, marked sequentially with red numbers (corresponding to the labels in the picture) along with the variety codes (IT1234567). Each variety was planted with 6 plants per plot, and the order of each plot in the field was random. The plots were repeated three times (in three different greenhouses), resulting in 18 plants per variety (6 plants × 3 repetitions) to minimize systematic errors and edge effects.
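The layout described above can be illustrated with a minimal sketch. It assumes that each greenhouse receives an independent random ordering of the plots; the function name, the placeholder variety codes, and the fixed random seed are hypothetical and are not taken from the actual field protocol.

```python
import random


def randomized_layout(variety_codes, n_greenhouses=3, plants_per_plot=6, seed=0):
    """Return a random plot order per greenhouse and the plants per variety."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    layout = {}
    for greenhouse in range(1, n_greenhouses + 1):
        order = list(variety_codes)
        rng.shuffle(order)  # independent random plot order in each greenhouse
        layout[greenhouse] = order
    # One plot per variety per greenhouse: 6 plants x 3 repetitions = 18
    plants_per_variety = n_greenhouses * plants_per_plot
    return layout, plants_per_variety


# Placeholder codes standing in for the 77 IT-numbered varieties
codes = [f"V{i:02d}" for i in range(1, 78)]
layout, plants_per_variety = randomized_layout(codes)
print(plants_per_variety)  # 18
```

Shuffling each greenhouse separately means a variety does not sit at the same position (for example, at an edge) in all three repetitions.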

Our chili selection process was meticulous and based on standardized criteria. We considered factors such as the area and weight of the chili fruit, their color, and other relevant characteristics. This allowed us to maintain a high level of consistency in sample selection and ensure the reliability of the data.

Sampling was conducted when the plants had flowered (were mature in terms of vegetative growth). Leaves, flowers, and fruits were selected from representative positions of the variety based on criteria (area, weight, color, etc.). Growth curves were drawn for each species and variety to base the sample selection on this data. The sampling process began when the plants had fully fruited and reached maturity, ensuring that we evaluated the chili peppers at the standard growth cycle stage.


3.4 Chili pepper Dataset for Step 1

Figure 3.4: QR barcode used to calculate mm/pixel
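The mm/pixel scale referred to in Figure 3.4 can be sketched as a simple ratio. This is only an illustration: it assumes the printed side length of the QR code is known and that its side length in pixels has already been measured in the photograph. The numeric values and function names below are hypothetical.

```python
def mm_per_pixel(reference_side_mm, reference_side_px):
    """Scale factor from a reference object of known physical size."""
    if reference_side_px <= 0:
        raise ValueError("reference_side_px must be positive")
    return reference_side_mm / reference_side_px


def pixels_to_mm(length_px, scale_mm_per_px):
    """Convert a length measured in pixels to millimetres."""
    return length_px * scale_mm_per_px


# Hypothetical values: a 30 mm QR code that spans 600 pixels in the image
scale = mm_per_pixel(30.0, 600.0)
fruit_length_mm = pixels_to_mm(1500, scale)
```

Because the camera distance and angle are kept constant, a single scale per image is a reasonable simplification in this sketch.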

We emphasized maintaining consistent lighting conditions during the photography process and keeping a consistent distance and angle for the shots. This was done to eliminate any potential biases or variations in the image data due to changing lighting conditions or different angles. The images were saved in JPG format, preserving the intricate details of the chili peppers with a pixel resolution of 6024 x 4024. The Chili Pepper dataset contains a total of 32 images before augmentation.

We used cropping techniques to extract square patches of 640x640 pixels from the original images, ensuring that each cropped region contains at least one chili pepper (Figure 3.5).

Figure 3.5: Cropping Chili pepper Images
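One possible way to realize the cropping step is sketched below. It is an assumption, not the procedure actually used: the image is covered with a grid of 640x640 windows, and only windows that overlap at least one chili bounding box are kept. The function names and the example box are hypothetical.

```python
def tile_grid(image_w, image_h, tile=640):
    """Return (x1, y1, x2, y2) windows of size tile x tile covering the image.

    The last row and column are shifted inward so they end at the image
    border, which means edge tiles may overlap their neighbours.
    """
    def starts(length):
        positions = list(range(0, max(length - tile, 0) + 1, tile))
        if positions[-1] + tile < length:
            positions.append(length - tile)
        return positions

    return [(x, y, x + tile, y + tile)
            for y in starts(image_h) for x in starts(image_w)]


def overlaps(a, b):
    """True if two (x1, y1, x2, y2) boxes share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def tiles_with_chili(image_w, image_h, chili_boxes, tile=640):
    """Keep only the tiles that overlap at least one chili bounding box."""
    return [t for t in tile_grid(image_w, image_h, tile)
            if any(overlaps(t, box) for box in chili_boxes)]


# Hypothetical chili box (pixel coordinates) in a 6024 x 4024 image
kept = tiles_with_chili(6024, 4024, [(100, 100, 300, 500)])
```

Here "contains" is interpreted as "overlaps"; requiring the whole pepper to lie inside a window would need a stricter test.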


This preprocessing step simplified subsequent image-processing tasks and optimized the use of computational resources during model inference and training. We utilized the Roboflow tool for the labeling process. Roboflow is a powerful and user-friendly platform that simplifies the task of annotating objects in images, a crucial step in preparing data for training machine learning models, especially in computer vision.

We identified two classes for our labeling task: "chili" and "seed." The "chili" class represents the entire chili pepper, while the "seed" class includes individual seeds within each chili pepper. Labeling involved manually drawing bounding boxes around each chili pepper and seed in the images. For the "chili" class, we drew tight bounding boxes around the chili pepper. For the "seed" class, we drew boxes around each visible seed within the chili pepper. Seeds with over 70% transparency and over 90% occlusion were ignored to avoid introducing noise into the model.

Figure 3.6: Chili Pepper Dataset Label

Our Chili Pepper dataset is structured in the YOLOv5[13] folder format, making it compatible with most popular model architectures and object detection frameworks, and specifically suitable for the input requirements of YOLOv5[13] and YOLOv7[14], which we will experiment with later. However, at this stage, we have determined that the dataset is relatively small for training a deep learning model. To enrich the diversity of the dataset and enhance the model’s generalization capability and performance on unseen samples, we employed various data augmentation techniques. These data augmentation strategies aim to expand the dataset by introducing variations in image orientation and other image features.

The final Chili pepper dataset includes over 700 annotated images of different chili pepper varieties captured under consistent lighting conditions and backgrounds. The images were manually labeled with bounding boxes around each chili pepper and seed sample. We split our annotated dataset into training, validation, and test sets to ensure robust model training and performance evaluation. This partitioning is a common practice in machine learning to prevent overfitting and to assess the generalization capability of the trained models. We followed a 70-20-10 split ratio, allocating 70% of the data for training purposes, 20% for validation, and 10% for testing. The training set is used to optimize the model parameters during training. In contrast, the validation set is used to tune hyperparameters and monitor the model’s performance during training to prevent overfitting. The test set, entirely separate from the training process, serves as an objective measure of the model’s performance on unseen data. By evaluating the model’s predictions on the test set, we can obtain a realistic estimate of its generalization capability and assess its suitability for deployment in real-world scenarios.
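A 70-20-10 partition of the kind described above can be sketched as follows. This is a generic illustration rather than the tooling actually used (the split may well have been produced by the labeling platform); the function name, the file names, and the random seed are hypothetical.

```python
import random


def split_dataset(items, percents=(70, 20, 10), seed=42):
    """Shuffle items and split them into train, validation, and test lists."""
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n = len(shuffled)
    n_train = n * percents[0] // 100
    n_val = n * percents[1] // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so no item is dropped
    return train, val, test


# Hypothetical image file names
images = [f"img_{i:03d}.jpg" for i in range(100)]
train, val, test = split_dataset(images)
```

Integer percentages avoid floating-point rounding surprises, and assigning the remainder to the test set guarantees that the three subsets are disjoint and cover every item.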

The training, validation, and test sets were carefully managed and organized, with image file names and corresponding annotations stored in separate files for each subset (Figure 3.7). This structured organization facilitated efficient data loading and preprocessing during the training and evaluation phases of our machine learning workflow.


Figure 3.7: Structure of the Chili Pepper Dataset
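In the YOLO text-label convention, each annotation file holds one line per object in the form `class x_center y_center width height`, with coordinates normalized by the image size. The sketch below converts a pixel-space box into such a line. The class indices (0 for "chili", 1 for "seed") and the helper name are assumptions for illustration; the actual index order depends on the exported dataset configuration.

```python
# Assumed class indices; the real mapping comes from the dataset config file
CLASS_IDS = {"chili": 0, "seed": 1}


def to_yolo_line(class_name, box_px, image_w, image_h):
    """Convert a pixel box (x1, y1, x2, y2) to a normalized YOLO label line."""
    x1, y1, x2, y2 = box_px
    xc = (x1 + x2) / 2 / image_w   # normalized box centre, x
    yc = (y1 + y2) / 2 / image_h   # normalized box centre, y
    w = (x2 - x1) / image_w        # normalized box width
    h = (y2 - y1) / image_h        # normalized box height
    return f"{CLASS_IDS[class_name]} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"


# Hypothetical box inside a 640 x 640 crop
line = to_yolo_line("chili", (160, 160, 480, 480), 640, 640)
print(line)  # 0 0.500000 0.500000 0.500000 0.500000
```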

The number of samples in each subset after the final collection and preprocessing steps is shown in Figure 3.8.

Figure 3.8: Distribution of Data

Distribution of Data in Training, Validation, and Test Sets:

• Training set (Train): 584 samples
• Validation set (Validation): 40 samples
• Test set (Test): 30 samples
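As a quick arithmetic check, the counts listed above can be converted into shares of the total. The snippet uses only those three counts. How augmentation was distributed across the subsets is not stated here, so the resulting shares need not match the nominal split ratio.

```python
counts = {"train": 584, "validation": 40, "test": 30}

total = sum(counts.values())
# Share of each subset as a percentage of all listed samples
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {counts[name]} samples ({share}%)")
```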


Figure 3.9: Distribution of Object Counts

The dataset consists of 2 object classes: chili and seed.

Training set | Validation set | Test set

3.6 Summary

[...] collection process, as well as how we shaped and built the Chili Pepper Dataset from the perspective of Computer Vision to serve the subsequent machine learning training process. We also conducted a preliminary evaluation of the Chili Pepper dataset and discussed the assurance and consistency of each chili pepper variety in this dataset.


Chapter 4

CHILI PEPPERS AND SEEDS DETECTION PROBLEM

4.1 Chapter Overview

As outlined from the beginning, our proposed processing flow consists of two steps: detecting chili peppers and seeds, and extracting the morphological characteristic information of the chili peppers. This chapter focuses on Step 1. For the problem of detecting chili peppers and seeds, we need the output to be bounding boxes containing information about the location and labels of the chili pepper and seed objects. Essentially, this is an object detection problem, so this chapter will introduce the concept of object detection, how we approached this problem, the methods we experimented with, and the experimental evaluation results, along with the rationale for our final choice in this stage.

Object detection is a fundamental problem that involves classifying and locating objects within an image or video. There are two essential concepts to understand:

Image Classification

Predicting the label of an object in an image. The input is an image containing one object, and the output is the label of the object.

Object Localization

Determining the presence of objects in an image and specifying their locations using bounding boxes. The input is an image with one or more objects, and the output is one or more bounding boxes defined by the coordinates of the center, width, and height.
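A bounding box given by its center, width, and height can be converted to corner coordinates, which is the form usually needed for cropping. The helper names below are hypothetical, and the example values are arbitrary pixel coordinates.

```python
def center_to_corners(xc, yc, w, h):
    """Convert (center x, center y, width, height) to (x1, y1, x2, y2)."""
    return (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)


def corners_to_center(x1, y1, x2, y2):
    """Convert (x1, y1, x2, y2) back to (center x, center y, width, height)."""
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


corners = center_to_corners(320, 240, 100, 60)
print(corners)  # (270.0, 210.0, 370.0, 270.0)
```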

Object detection combines image classification and object localization. It involves drawing a bounding box around each object of interest in the image and assigning a label to it. For this problem, this first module provides the parameters needed to segment the input image into individual chili peppers, facilitating the extraction of information in the next module. Additionally, the number of seeds per chili pepper is extracted based on the seeds detected by this module.
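Counting seeds per chili pepper from the detections can be sketched as below. The assignment rule is an assumption made for illustration: a seed is attributed to the first chili box that contains the seed's center. Boxes are given as (center x, center y, width, height), and the example coordinates are hypothetical.

```python
def count_seeds_per_chili(chili_boxes, seed_boxes):
    """Return a list with the number of seeds assigned to each chili box."""
    counts = [0] * len(chili_boxes)
    for sx, sy, _, _ in seed_boxes:
        for i, (cx, cy, cw, ch) in enumerate(chili_boxes):
            # The seed centre lies inside the chili box
            if abs(sx - cx) <= cw / 2 and abs(sy - cy) <= ch / 2:
                counts[i] += 1
                break  # each seed is counted for at most one chili
    return counts


# Hypothetical detections: two chilies and four seeds (one outside both)
chilies = [(100, 100, 80, 200), (300, 100, 80, 200)]
seeds = [(90, 120, 5, 5), (110, 60, 5, 5), (305, 150, 5, 5), (500, 500, 5, 5)]
seed_counts = count_seeds_per_chili(chilies, seeds)
print(seed_counts)  # [2, 1]
```

Seeds whose center falls outside every chili box are simply left uncounted in this sketch.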

The input and output of Step 1 are shown in Figure 4.1:

• Input:
  - Longitudinal cross-section images of chili peppers.
• Output:
  - A text file where each line includes:
    + The class label of the object (bb[0])
    + The confidence score of the detection (bb[1])
    + The coordinates of the center of the bounding box (bb[2], bb[3])
    + The width and height of the bounding box (bb[4], bb[5])
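A line of the output file described above could be parsed as in the following sketch. It assumes whitespace-separated fields in the listed order bb[0] to bb[5]. Whether the coordinates are normalized or in pixels is not specified here, so the parser leaves them unchanged. The `Detection` type and the sample line are hypothetical.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    label: str         # bb[0]: class label
    confidence: float  # bb[1]: confidence score
    xc: float          # bb[2]: box centre, x
    yc: float          # bb[3]: box centre, y
    w: float           # bb[4]: box width
    h: float           # bb[5]: box height


def parse_detection_line(line):
    """Parse one whitespace-separated output line into a Detection."""
    bb = line.split()
    if len(bb) != 6:
        raise ValueError(f"expected 6 fields, got {len(bb)}")
    return Detection(bb[0], *(float(v) for v in bb[1:]))


# Hypothetical output line
det = parse_detection_line("seed 0.91 0.5 0.4 0.02 0.03")
```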


Title	Applying Machine Learning For Chili Pepper Phenotyping And Feature Extraction
Author	Nga Pham Thi
Supervisor	PhD. Dung Mai Tien
Institution	Vietnam National University, Ho Chi Minh City University of Information Technology
Major	Computer Science
Type	Graduate Thesis
Year of publication	2024
City	Ho Chi Minh City
