MINISTRY OF EDUCATION AND TRAINING
DUY TAN UNIVERSITY
INTERNATIONAL TRAINING FACULTY
ARTIFICIAL INTELLIGENCE (FOR BUSINESS)
INSTRUCTOR: Dr. Soon Goo Hong
CLASS: IS-CS 468 AIS
Term Group Project
“IRIS FLOWER CLASSIFICATION”
Team 4
Phan Van Minh Manh - 27211445925
Tran Thi Thu Hong - 27201401792
Doan Thien Nhan - 27211201936
Da Nang, 12 December 2023
1 Overview
The Iris Flower Classification problem requires identifying three iris flower species from four features: sepal length, sepal width, petal length, and petal width. The problem matters because it has several practical uses, including plant breeding, horticulture, and environmental monitoring.
This report presents our analysis of the Iris Flower dataset using three classification algorithms: K-Nearest Neighbors (KNN), Decision Tree, and Logistic Regression. We compare the performance of these algorithms using accuracy, recall, precision, and F1 score. We also predict the species of a new instance from the supplied features and suggest some future improvements.
1 K-Nearest Neighbors (KNN)
BRIGHTICS PROCESS
DATA LOAD
- Load the KNN data from “sample_iris.csv”.
- We upload the sample data provided by Samsung Brightics AI.
- Click the ‘upload’ button, search for ‘sample_iris.csv’, then select it.
- Click the ‘Run’ button.
Group by: species.
In the output panel, you can see two distinct data sets: "Split data (train_table)" and "Split data (test_table)".
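The split grouped by species can be sketched outside Brightics as a stratified split. This is a minimal sketch, not the Brightics implementation: the function name `stratified_split`, the placeholder rows, the seed, and the 80/20 ratio are assumptions (the ratio is chosen because it yields the 10 test records per species reported later in this report).

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="species", test_fraction=0.2, seed=0):
    """Split rows into train/test so every label keeps the same test share."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Placeholder rows standing in for sample_iris.csv (50 records per species).
rows = [
    {"id": i, "species": species}
    for species in ("setosa", "versicolor", "virginica")
    for i in range(50)
]
train_table, test_table = stratified_split(rows)
```

Grouping before sampling keeps each species equally represented in both tables, which is what makes the per-species metrics comparable.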
KNN Classification
Parameter:
Inputs: click ‘Empty’ in the ‘test_table’.
Set Inputs: drag ‘test_table’ in the Split Data and drop it to the ‘Drop Data’ in the ‘test_table’.
Feature Columns: select all
Label Columns: species.
Inputs: default
Label Column: ‘species’
Prediction Column: ‘prediction’.
Comment: The prediction accuracy was perfect (Accuracy: 1.0).
With the KNN model, we classified all three flower species (setosa, versicolor, and virginica) with 100% accuracy.
Accuracy is the proportion of all predictions that are correct. A high accuracy score means that the model makes correct predictions most of the time.
Precision is the proportion of the model's positive predictions that are true positives, so its denominator is TP + FP. A higher precision means that fewer of the model's positive predictions are false positives.
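For reference, the four metrics used in this report can be written with the standard definitions below, where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives for the class being evaluated. Recall and F1 are not defined in the text above; the formulas given here are the usual textbook ones.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```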
With k = 1, 3, or 5, the model achieved an accuracy of 0.967. With k = 7, 9, or 11, it achieved an accuracy of 1.0.
As we can see, accuracy is lower when k is 1, 3, or 5 than when k is greater than 5. A small k makes the model sensitive to noise in individual neighbors (overfitting), while an unnecessarily large k smooths the decision boundary and can blur the differences between species. A k of 7 is the smallest value at which the four evaluation metrics reach their highest values, so it limits both risks while still giving correct predictions on the test set.
Therefore, the best k for this dataset is 7.
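The selection rule can be made explicit with a short sketch. The accuracies are the ones reported in this section; the tie-break (prefer the smallest k among the best-scoring values) is our reading of the reasoning, and the variable names are illustrative.

```python
# Test-set accuracy for each k, as reported in this section.
accuracy_by_k = {1: 0.967, 3: 0.967, 5: 0.967, 7: 1.0, 9: 1.0, 11: 1.0}

best_accuracy = max(accuracy_by_k.values())

# Tie-break: among the k values that reach the best accuracy, keep the smallest.
best_k = min(k for k, acc in accuracy_by_k.items() if acc == best_accuracy)

print(f"best k = {best_k} (accuracy {best_accuracy})")
```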
Based on the classification evaluation results with k = 7, the model performed very well on the sample_iris dataset. Here is the analysis of the metrics:
Accuracy: The model's accuracy is 1.0, meaning the model correctly predicts the flower species in 100% of the test cases.
Setosa: The model classified this species perfectly, with F1, Precision, and Recall all 1.0. This shows that the model recognized all Setosa samples without any errors.
Virginica: The model classified this species perfectly, with F1, Precision, and Recall all 1.0. This shows that the model recognized all Virginica samples without any errors.
Versicolor: The model classified this species perfectly, with F1, Precision, and Recall all 1.0. This shows that the model recognized all Versicolor samples without any errors.
2 Decision Tree
BRIGHTICS PROCESS
DATA LOAD
- Load the data from “sample_iris.csv”.
- We upload sample data provided by Samsung Brightics AI.
We ran the Decision Tree classifier with max depth = 3, 5, and 7, using species as the outcome variable. With max depth = 3, the model achieved an accuracy of 0.967. With max depth = 5 or 7, the model achieved the same accuracy of 0.93. Therefore, the best max depth is 3.
Decision Tree Classification Train
Splitter: Best.
Max Depth: 3 , 5, 7 (Replace the values one by one).
Decision Tree Classification Predict
- Parameter
Inputs: click ‘Empty’ in the ‘test_table’.
Set Inputs: drag ‘test_table’ in the Split Data and drop it to the ‘Drop Data’ in the ‘test_table’.
Parameter:
Label Column: ‘species’
Prediction Column: ‘prediction’.
Comment: With max depth = 3, the accuracy of the predictions was high (Accuracy: 0.967). With the Decision Tree model, we classified the 3 flower species as follows: setosa (10/10), versicolor (10/10), virginica (9/10). The classification of setosa and versicolor is completely accurate, while only 9 of the 10 Iris virginica records (0.9) were classified correctly.
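The overall accuracy follows directly from the per-species counts in the comment above. The short check below uses only those counts; the variable names are illustrative.

```python
# Correctly classified test records per species, from the comment above.
correct = {"setosa": 10, "versicolor": 10, "virginica": 9}
total = {"setosa": 10, "versicolor": 10, "virginica": 10}

# Share of each species that was classified correctly (per-class recall).
per_class_recall = {s: correct[s] / total[s] for s in correct}

# Overall accuracy: all correct predictions over all test records (29 / 30).
overall_accuracy = sum(correct.values()) / sum(total.values())

print(per_class_recall)
print(round(overall_accuracy, 3))
```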
3 Logistic Regression
BRIGHTICS PROCESS
DATA LOAD
- Load the data from “sample_iris.csv”.
- We upload the sample data provided by Samsung Brightics AI.
PRE-PROCESSING
Query Executor
Convert the dependent variable (species) into a numeric code (spec_cd): 1 for setosa, 2 for virginica, and 0 for versicolor.
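Outside Brightics, the same encoding can be sketched as a simple lookup. The codes are the ones used for spec_cd throughout this report (setosa = 1, virginica = 2, versicolor = 0); the function name `encode_species` and the case-insensitive matching are assumptions made for illustration.

```python
# Species-to-code mapping used for spec_cd in this report.
SPECIES_TO_CODE = {"versicolor": 0, "setosa": 1, "virginica": 2}


def encode_species(species):
    """Return the numeric spec_cd for a species name (case-insensitive)."""
    return SPECIES_TO_CODE[species.strip().lower()]


spec_cd = [encode_species(s) for s in ("setosa", "virginica", "versicolor")]
```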
DESCRIPTIVE ANALYSIS
Statistic Summary
For the numeric variables (sepal length, sepal width, petal length, petal width), examine various statistics by species.
Parameter:
Columns: sepal length, sepal width, petal length, petal width
Target Statistic: Max, Min, Average, Standard deviation
Group by: species
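A grouped summary of this kind can be sketched with the Python standard library. The rows below are illustrative placeholders, not the contents of sample_iris.csv, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the report does not say which variant Brightics computes.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_species(rows, column):
    """Compute max, min, average, and standard deviation of a column per species."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["species"]].append(row[column])
    return {
        species: {
            "max": max(values),
            "min": min(values),
            "avg": mean(values),
            "std": stdev(values),  # sample standard deviation (assumption)
        }
        for species, values in groups.items()
    }


# Illustrative rows only.
rows = [
    {"species": "setosa", "petal_length": 1.3},
    {"species": "setosa", "petal_length": 1.4},
    {"species": "setosa", "petal_length": 1.5},
    {"species": "versicolor", "petal_length": 4.5},
    {"species": "versicolor", "petal_length": 4.7},
    {"species": "versicolor", "petal_length": 4.9},
]
summary = summarize_by_species(rows, "petal_length")
```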
Select Column
Convert the numeric variables to String type so that they can be summarized as categorical variables.
Parameter:
Condition: Change the type of the sepal length, sepal width, petal length, and petal width variables to String.
String Summary
Examine the frequencies and proportions of the categorical variables, grouped by species.
Parameter:
Input Columns: sepal length, sepal width, petal length, petal width
Group by: species.
Logistic Regression Train
Select the dependent variable (spec_cd) and the explanatory variables (sepal length, sepal width, petal length, petal width), then proceed with the analysis.
- Parameter:
Inputs: Split Data-train_table.
Feature Columns: sepal length, sepal width, petal length, petal width
Label Column: spec_cd.
Logistic Regression Predict
Perform predictions by applying the regression equation generated in the training step.
Inputs: Logistic Regression Predict
Label Column: spec_cd.
Prediction Column: prediction.
Based on the classification evaluation results, the logistic regression model performed very well on the sample_iris dataset. Here are some detailed comments:
Accuracy: The model's accuracy is 0.967, which means it predicts the flower species correctly in about 96.7% of cases.
Species 1 (Setosa): The model classified this species perfectly, with F1, Precision, and Recall all 1.0. It identified every Setosa sample, and every sample it predicted as Setosa was correct.
Species 2 (Virginica): The model performed well for this species as well, with an F1 of 0.95, Precision of 1.0, and Recall of 0.9. Every sample predicted as Virginica was in fact Virginica, but the model missed some actual Virginica samples.
Species 0 (Versicolor): The model has an F1 of 0.95, Precision of 0.91, and Recall of 1.0. It found every actual Versicolor sample, but it also labeled some samples of another species as Versicolor.
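To see how these per-species figures fit together, the sketch below computes precision, recall, and F1 from a confusion matrix. The matrix is not taken from the Brightics output: it is a reconstruction that assumes 10 test records per species and a single Virginica record predicted as Versicolor, which is one arrangement consistent with the metrics reported above.

```python
# confusion[actual][predicted]; a reconstruction consistent with the reported
# metrics (assumption: 10 test records per species, one Virginica misclassified).
confusion = {
    "setosa": {"setosa": 10, "versicolor": 0, "virginica": 0},
    "versicolor": {"setosa": 0, "versicolor": 10, "virginica": 0},
    "virginica": {"setosa": 0, "versicolor": 1, "virginica": 9},
}


def class_metrics(confusion, label):
    """Return (precision, recall, f1) for one class, treated as the positive class."""
    tp = confusion[label][label]
    fp = sum(confusion[actual][label] for actual in confusion if actual != label)
    fn = sum(n for predicted, n in confusion[label].items() if predicted != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


metrics = {label: class_metrics(confusion, label) for label in confusion}
correct = sum(confusion[label][label] for label in confusion)
total = sum(sum(row.values()) for row in confusion.values())
accuracy = correct / total
```

Under this reconstruction, the one misclassified record lowers Virginica's recall and Versicolor's precision at the same time, which is why both species end up with the same F1.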
Evaluation 2
Plot ROC and PR Curves
Setosa_1
Check the performance through the ROC (Receiver Operating Characteristic) and PR (Precision-Recall) plots.
In this case, the target class is spec_cd = 1 (setosa).
Parameter:
Label Column: spec_cd.
Probability Column: probability_1
Positive Label: 1.
In the ROC curve chart, verify the threshold of 0.69 and the AUC (Area Under the Curve) value of 1.00.
Check the performance through the ROC (Receiver Operating Characteristic) and PR (Precision-Recall) plots.
In this case, the target class is spec_cd = 0 (versicolor).
Label Column: spec_cd.
Probability Column: probability_0
Positive Label: 0.
In the ROC curve chart, verify the threshold of 0.55 and the AUC (Area Under the Curve) value of 1.00.
Check the performance through the ROC (Receiver Operating Characteristic) and PR (Precision-Recall) plots.
In this case, the target class is spec_cd = 2 (virginica).
Label Column: spec_cd.
Probability Column: probability_2
Positive Label: 2.
In the ROC curve chart, verify the threshold of 0.46 and the AUC (Area Under the Curve) value of 1.00.
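An AUC of 1.00 means that every record of the positive class received a higher probability than every record of the other classes. The sketch below computes a one-vs-rest AUC with the pairwise-ranking definition; the function name `roc_auc` and the scores are hypothetical and are not the Brightics output.

```python
def roc_auc(labels, scores, positive):
    """One-vs-rest AUC: the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for label, s in zip(labels, scores) if label == positive]
    neg = [s for label, s in zip(labels, scores) if label != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative record")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical spec_cd labels and probability_1 scores (setosa as the positive class).
labels = [1, 1, 0, 2, 0, 2]
probability_1 = [0.95, 0.80, 0.30, 0.05, 0.20, 0.10]
auc_setosa = roc_auc(labels, probability_1, positive=1)
```

With perfectly separated scores like these, any threshold between the lowest positive score and the highest negative score classifies every record correctly.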
Comment: The logistic regression model achieves good accuracy (96.7%) in iris flower classification using sepal length, sepal width, petal length, and petal width.
5 Prediction for the given data
Create table
K-Nearest Neighbors
Comment: Using the provided data, we use the previously trained KNN model to predict the species. The model suggests that this is Iris Setosa, with a probability of 85.71% for Iris Setosa and 14.29% for Iris Versicolor.
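These probabilities are consistent with a vote among k = 7 neighbors: 6 of 7 is 85.71% and 1 of 7 is 14.29%. The sketch below shows that computation, assuming the reported probability is the share of neighbor votes; the function name `knn_vote_shares`, the 2-D training points, and the query are hypothetical and chosen only to reproduce a 6-to-1 vote.

```python
import math
from collections import Counter


def knn_vote_shares(train, query, k):
    """Return label -> share of votes among the k nearest training points.

    train is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / k for label, count in votes.items()}


# Hypothetical (petal length, petal width) points, for illustration only.
train = [
    ((1.4, 0.2), "setosa"), ((1.3, 0.2), "setosa"), ((1.5, 0.2), "setosa"),
    ((1.6, 0.2), "setosa"), ((1.4, 0.3), "setosa"), ((1.7, 0.4), "setosa"),
    ((2.0, 0.5), "versicolor"), ((4.5, 1.5), "versicolor"),
    ((4.7, 1.4), "versicolor"), ((6.0, 2.5), "virginica"),
]
shares = knn_vote_shares(train, query=(1.5, 0.3), k=7)
percentages = {label: round(share * 100, 2) for label, share in shares.items()}
```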
Decision Tree
Comment: Using the provided data, we use the previously trained Decision Tree model to predict the species. The model suggests that this is Iris Versicolor, with a probability of 0% for Iris Setosa, 97.44% for Iris Versicolor, and 2.56% for Iris Virginica.
Logistic Regression
Comment: Using the provided data, we use the previously trained Logistic Regression model to predict the species. The model suggests that this is Iris Setosa, with a probability of 89.69% for Iris Setosa, 10.02% for Iris Versicolor, and 0.29% for Iris Virginica.
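Each model's predicted species is simply the class with the highest probability. The sketch below applies that rule to the probabilities reported in the three comments above; the 0% for Iris Virginica under KNN is inferred from the other two values summing to 100% and is not stated in the report.

```python
# Class probabilities (in percent) reported for the given data.
reported = {
    "K-Nearest Neighbors": {"setosa": 85.71, "versicolor": 14.29, "virginica": 0.0},  # virginica inferred
    "Decision Tree": {"setosa": 0.0, "versicolor": 97.44, "virginica": 2.56},
    "Logistic Regression": {"setosa": 89.69, "versicolor": 10.02, "virginica": 0.29},
}

# The predicted species is the class with the highest probability.
predicted = {model: max(probs, key=probs.get) for model, probs in reported.items()}

for model, species in predicted.items():
    print(f"{model}: Iris {species.capitalize()}")
```

The three models do not agree on this record: two predict Iris Setosa and the Decision Tree predicts Iris Versicolor.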
Executive Summary
Our team analyzed the data and classified it into three categories: Setosa, Versicolor, and Virginica. Three machine learning techniques were used for classification: K-Nearest Neighbors (KNN), Decision Tree, and Logistic Regression. The three algorithms achieved accuracies of 100%, 96.7%, and 96.7%, respectively.
To conduct the analysis, our team performed the following steps:
1 Data Collection: We obtained the data from Samsung Brightics AI.
2 Cleaning the Data: We cleaned the data by eliminating duplicates, missing values, and outliers.
3 Data Preprocessing: We preprocessed the data by dividing it into training
and testing sets.
4 Model Training: On the training set, we trained three machine learning models: K-Nearest Neighbors (KNN) with k = 7, Decision Tree with max depth = 3, and Logistic Regression.
5 Model Evaluation: We evaluated the models on the testing set and obtained accuracies of 100%, 96.7%, and 96.7%, respectively.
6 Classification: We classified the data into three groups using the trained
models: Setosa, Versicolor, and Virginica.
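As an illustration of the cleaning step, the sketch below drops rows with missing values and exact duplicates. It is a minimal sketch, not the procedure actually run in Brightics: the function name `clean_rows`, the column names, and the sample rows are hypothetical, and outlier removal is left out because the report does not state the rule that was used.

```python
def clean_rows(rows, required):
    """Drop rows with a missing required value, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Skip rows where any required column is absent or empty.
        if any(row.get(col) in (None, "") for col in required):
            continue
        key = tuple(row[col] for col in required)
        if key in seen:  # same values already kept once
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical rows, for illustration only.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
raw = [
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2, "species": "setosa"},
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2, "species": "setosa"},
    {"sepal_length": 6.0, "sepal_width": None, "petal_length": 4.5, "petal_width": 1.5, "species": "versicolor"},
    {"sepal_length": 6.3, "sepal_width": 3.3, "petal_length": 6.0, "petal_width": 2.5, "species": "virginica"},
]
cleaned = clean_rows(raw, columns)
```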
Result: On the testing set (accuracy): K-Nearest Neighbors 100%, Decision Tree 96.7%, Logistic Regression 96.7%.
On the given data
K-Nearest Neighbors: The model predicts that this is Iris Setosa.
Decision Tree: The model predicts that this is Iris Versicolor.
Logistic Regression: The model predicts that this is Iris Setosa.