SUMMARY
The document covers an analysis of four machine learning models: k-Nearest Neighbors (kNN), Linear Regression, Naive Bayes, and Decision Tree. It compares the models, applies several classifiers to a wine quality prediction problem, and discusses overfitting in machine learning and strategies to mitigate it.
ANALYSIS OF MACHINE LEARNING MODELS –
1.1 K-Nearest Neighbors
The main goal of kNN is to create a model capable of predicting the value or class of a new data point based on information from the closest data points in the training set (finding similarities between new data points and available data). In addition, the model must ensure accuracy and flexibility in prediction.
Figure 1.1: A visual representation of how a kNN model works.
- kNN is based on finding similarities between new data points and available training data
- It makes predictions by searching through the entire training set to find the K most similar instances
- Similarity is determined using distance metrics like Euclidean distance
- The model predicts values of new data points based on the values of their nearest neighbors in the training set
- Collect data to build a kNN model
- Select number K: this is the number of closest points to make predictions for new data points
- Data preprocessing: includes handling missing or noisy data and normalizing the data
- Prediction: calculate the distance between the new data point and the existing data points in the set, then select the K closest points
The kNN algorithm assumes that similar data points lie close to each other in the feature space. Predictions are made by searching the entire training data set for the K instances closest (with the smallest distance) to the new data point.
To determine the K closest cases to the input data, we need to calculate the distance between two points. The Euclidean distance between two points x and y with n attributes is:
d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
In addition, the distance between two points can also be calculated using the Manhattan or Minkowski distance, among others.
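For illustration, a minimal NumPy sketch of both distance measures between two example points (the vectors x and y below are made up for the example):
# Distance between two points x and y with n attributes (illustrative sketch)
import numpy as np

x = np.array([5.1, 3.5, 1.4])
y = np.array([6.2, 2.9, 4.3])

euclidean = np.sqrt(np.sum((x - y) ** 2))   # square root of the sum of squared differences
manhattan = np.sum(np.abs(x - y))           # sum of absolute differences

print(euclidean, manhattan)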
- Model evaluation: Evaluate the performance of the kNN model using learning criteria such as accuracy for classification problems or Mean Squared Error (MSE) for regression problems
- Check and adjust: if the performance does not meet expectations, you can adjust the K value or preprocess the data accordingly
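The steps above can be sketched with scikit-learn; the iris data set, the train/test split, and K = 5 are illustrative assumptions rather than part of the report:
# Minimal kNN classification sketch (data set and K value are assumptions)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Data preprocessing: normalize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Select K and fit the model (Euclidean distance is the default metric)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Model evaluation: accuracy for a classification problem
print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))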
In the classification problem, the learning criterion of kNN is accuracy. In the regression problem, the criterion is Mean Squared Error or Root Mean Squared Error.
1.1.5 kNN Applications
kNN is used in both classification and regression. It is suitable for problems with numerical data because this makes it easy to measure the distance between data points.
- Regression: applications in the investment industry, including bankruptcy prediction, stock price prediction,
1.1.6 Advantages and Disadvantages of kNN
Advantages:
Adapts easily to new information
No assumptions about class distribution
Disadvantages:
High computational and storage requirements for large datasets
Sensitive to noise, especially with small values of k
Curse of dimensionality in high-dimensional spaces
Prone to overfitting without proper feature selection or dimensionality reduction.
1.2 Linear Regression
1.2.1 Goal of Linear Regression Model
The main goal of Linear Regression is to create a linear model that predicts continuous output values based on information from the input variables. This model finds the line (or hyperplane) that minimizes the error between the actual values and the predicted values.
Figure 1.2: A visual representation of how a simple linear regression model works.
1.2.2 Linear Regression Methods
The main method of Linear Regression is to create a linear function that describes the relationship between the input variable and the target variable
The linear function has the form:
y = β0 + β1*x1 + β2*x2 + ... + βn*xn
where:
y: the target variable to predict
x1, ..., xn: the input variables
β0, β1, ..., βn: the coefficients of the model
Optimize the coefficients: Linear Regression finds the optimal values for the coefficients β so that the squared error (Residual Sum of Squares) between predictions and actual values is as small as possible.
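As a small illustration of this step, the β coefficients that minimize the Residual Sum of Squares can be found with a least-squares solver; the tiny data set below is made up for the example:
# Least-squares sketch: find beta minimizing the Residual Sum of Squares (illustrative data)
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])       # one input variable
y = np.array([2.1, 4.0, 6.2, 7.9])               # target values

X_b = np.c_[np.ones((len(X), 1)), X]             # add a column of ones for the intercept
beta, *_ = np.linalg.lstsq(X_b, y, rcond=None)   # solves min ||X_b @ beta - y||^2

print("Intercept and slope:", beta)
rss = np.sum((y - X_b @ beta) ** 2)              # Residual Sum of Squares
print("RSS:", rss)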
- Determine the linear relationship: it is necessary to determine whether the relationship between the input variables and the target variable is linear. This can be done through scatter plots and basic statistical analysis.
- Collect data: Collect data with input variables and target variables
- Data preprocessing: Data preprocessing includes handling missing values, noise, removing outliers, and normalizing data
- Model building: Calculate β coefficients by optimizing the sum of squared errors using the minimization method
- Predict target values for new data points
- Model evaluation: Evaluate model performance using criteria such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²).
- Check and adjust: If the model performance is not satisfactory, the model can be adjusted by changing the input variables or using nonlinear transformations of the variables.
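A minimal scikit-learn sketch of these steps, using synthetic data purely for illustration:
# Linear Regression sketch with MSE, RMSE and R-squared evaluation (illustrative data)
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

reg = LinearRegression().fit(X_train, y_train)   # optimizes the beta coefficients
y_pred = reg.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse, "RMSE:", np.sqrt(mse), "R2:", r2_score(y_test, y_pred))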
1.2.4 Learning Criteria for Linear Regression
Minimize MSE: adjust the slope and intercept coefficients so that the MSE is as small as possible. This helps the predicted results to be close to the actual values.
1.2.5 Applications of Linear Regression
Linear Regression is often used in regression problems. It is suitable for data with linear relationships and data that does not contain noise or complex interactions. It is often used to predict continuous output values such as sales or prices.
1.2.6 Advantages and Disadvantages of Linear Regression
Advantages:
Easy to understand, deploy, and quick to predict
Performs well when the data has a linear relationship
Disadvantages:
Does not work well for non-linear data
Sensitive to noisy data and prone to overfitting.
1.3 Naïve Bayes Classification
1.3.1 Goal of Naïve Bayes Model
The main goal of Naive Bayes classifiers is to classify data into groups based on probability. This model uses the Bayesian formula to calculate the probability that a data point belongs to a particular group based on information from the characteristics of that data point.
Figure 1.3: A visual representation of how a Naive Bayes classifier works.
1.3.2 Naïve Bayes Method
The Naïve Bayes classification method is based on applying Bayes' theorem with the naïve independence assumption between features to perform probabilistic classification.
- Determine the events and the probability of each event based on the training data
- Collect training data consisting of labeled data points
- Data preprocessing: clean and transform the data to remove noise, normalize it, and convert it into a form the model can use
- Use the Bayesian formula to calculate the posterior probability of each event based on the training data set and the characteristics of the data points. The formula is as follows:
P(A|B) = P(B|A) * P(A) / P(B)
where:
P(A|B): the posterior probability of event A given event B
P(B|A): the conditional probability of event B given that event A has occurred
P(A): the prior probability of event A
P(B): the prior probability of event B
- Based on the posterior probability, select the event with the highest probability for each data point
- Evaluate the model based on measures such as accuracy, recall,
- Test and adjust: if the desired performance is not achieved, you can change the parameters, perform optimization using variations of Naive Bayes, or change the model
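A minimal sketch of these steps using scikit-learn's GaussianNB; the wine data set and the train/test split are illustrative assumptions:
# Naive Bayes classification sketch (illustrative data set)
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, recall_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

nb = GaussianNB()                      # applies Bayes' theorem with the independence assumption
nb.fit(X_train, y_train)

y_pred = nb.predict(X_test)            # picks the class with the highest posterior probability
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Recall (macro):", recall_score(y_test, y_pred, average='macro'))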
1.3.4 Learning Criteria for Naïve Bayes
Naïve Bayes is based on probability. The model uses probabilities to predict how likely a data point is to belong to each class or specific event.
1.3.5 Applications of Naïve Bayes
Naïve Bayes is suitable for classification problems, especially problems with discrete data such as spam filtering, document classification, and recommendation features.
1.3.6 Advantages and Disadvantages of Naïve Bayes
Advantages:
Easy to deploy, effective in data classification, works well in cases with many features
Rapid learning and prediction process is suitable for real-time applications
Disadvantages:
It assumes that features are independent of each other, which may not be true in real cases
Performance declines when the data is noisy.
1.4 Decision Tree
1.4.1 Goal of Decision Tree Model
The main goal of a Decision Tree is to create a tree structure that predicts or classifies data points based on information from their characteristics. The model divides the data set into branches so that the data points in each branch have similar or nearly identical properties.
Figure 1.4: A simple example demonstrating how the decision tree algorithm works.
1.4.2 Decision Tree Method
Building a decision tree usually starts from the root node and continues dividing the data set based on features. There are many algorithms related to decision trees, such as ID3 and Random Forest (an ensemble of decision trees), among others.
- Start by creating a root node containing the entire training dataset
- Create child nodes by dividing the data set based on the selected feature
- Repeat the process for the child nodes
- The process continues until a stopping condition is met. Stopping conditions can include the depth of the tree, no longer having features to divide on, etc.
- When the process ends, each leaf node is assigned a predicted value. For a classification problem, the value of a leaf node is a class or predicted label. For a regression problem, the value of the leaf node is a numeric value (for example, the mean of the target values in that node).
- Build the tree until all nodes meet the stopping condition
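A minimal sketch of this process with scikit-learn's DecisionTreeClassifier; the data set and the depth-based stopping condition are illustrative assumptions:
# Decision Tree sketch: build a tree with a maximum-depth stopping condition (illustrative data)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(export_text(tree))                               # the learned splits, node by node
print("Accuracy:", accuracy_score(y_test, tree.predict(X_test)))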
1.4.4 Learning Criteria for Decision Tree
Learning involves selecting features and deciding how to split the data based on criteria such as similarity, purity, or information value.
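For example, purity can be quantified with Gini impurity or entropy; a short sketch of both measures (the class proportions below are made up):
# Purity measures commonly used to choose splits (illustrative class proportions)
import numpy as np

def gini(p):
    return 1.0 - np.sum(np.square(p))          # Gini impurity: 1 - sum(p_i^2)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))             # entropy: -sum(p_i * log2(p_i))

p = np.array([0.7, 0.2, 0.1])                  # class proportions in a candidate node
print("Gini:", gini(p), "Entropy:", entropy(p))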
1.4.5 Applications of Decision Tree
Decision Tree is suitable for classification problems (whether a case belongs to a class or not) and prediction problems (predicting continuous values such as salary or sales). It is suitable for both discrete and continuous data. In addition, the model is also capable of handling noisy data and missing information. It is widely applied in fields such as health, credit, and economics.
1.4.6 Advantages and Disadvantages of Decision Tree
Advantages:
Easy to understand and explain
Can handle both discrete and continuous data
Able to handle noisy and non-linear data
Comparing the Models
Goal:
- kNN: Predict values based on the nearest points.
- Linear Regression: Determine the linear relationship between the independent variable and the target variable to predict the metric value.
- Naive Bayes: Classify data into groups based on probability.
- Decision Tree: Predict or classify data points based on information from the data set's characteristics.
Method:
- kNN: Calculate the distance between data points to find the closest points.
- Linear Regression: Create a linear function that describes the relationship between the input variable and the target variable.
- Naive Bayes: Use probability and the independence assumption to calculate the predicted probability.
- Decision Tree: Evaluate similarity and purity to determine how to divide the data.
Effectiveness on data size:
- kNN: Small data sets.
- Linear Regression: Data of any size.
- Naive Bayes: Huge data sets.
- Decision Tree: Large data sets.
WINE QUALITY ANALYSIS –
Problem Description
The problem our team chose to solve using a machine learning approach is predicting wine quality, which depends on multiple characteristics such as 'free sulfur dioxide,' 'total sulfur dioxide,' 'residual sugar,' and more. The primary target variable is 'quality'.
The process involves several data preprocessing steps and the application of different classification models
# Check for missing values
df.isnull().sum()
# Fill missing values with the mean of the respective column
for col in df.columns:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].mean())
# Check for duplicate rows and remove them
df.duplicated().sum()
df.drop_duplicates(inplace=True)
# Function to remove outliers from a specific column using the IQR method
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    # Keep only the rows within the IQR bounds
    return df[(df[column] >= lower) & (df[column] <= upper)]

# Create the binary target column and encode the wine type
df['good quality'] = [1 if x >= 7 else 0 for x in df['quality']]
df.replace({'white': 1, 'red': 0}, inplace=True)
X = df.drop(['quality', 'good quality'], axis=1)
y = df['good quality']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
We use classification models such as Logistic Regression, XGBoost, Random Forest, and K-Nearest Neighbors, together with cross-validation, to predict the quality of wine.
"Random Forest": RandomForestClassifier(n_estimators0, random_stateB), "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5)
} models_name = list(models.keys())
# Cross-validation and evaluation
accuracy_scores = {}
validation_scores = {}
for model_name, model in models.items():
    train_acc = cross_val_score(model, X_train_scaled, y_train, scoring=make_scorer(accuracy_score), cv=5).mean()
    test_acc = cross_val_score(model, X_test_scaled, y_test, scoring=make_scorer(accuracy_score), cv=5).mean()
    accuracy_scores[model_name] = train_acc
    validation_scores[model_name] = test_acc
Feature Selection
Feature selection is performed by training a RandomForestClassifier on the training data and using it to estimate the importance of each feature. Then, the `SelectFromModel` class from scikit-learn selects features based on the importances calculated by the RandomForestClassifier.
# Create and fit RandomForestClassifier for feature importance
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train_scaled, y_train)

# Feature selection using SelectFromModel
selector_sfm = SelectFromModel(estimator=rfc)
selector_sfm.fit(X_train_scaled, y_train)
# Transform the data using the selected features
X_train_sfm = selector_sfm.transform(X_train_scaled)
X_test_sfm = selector_sfm.transform(X_test_scaled)
Figure 2.6: The 9 most important features identified by RandomForestClassifier.
Prediction using selected features
While feature selection can sometimes lead to improved model performance, in this case it resulted in a decrease in accuracy for XGBoost, Random Forest, and K-Nearest Neighbors. Only Logistic Regression maintained the same level of accuracy. This suggests that the feature selection method may have excluded some important features that contribute to the prediction.
Figure 2.7: Comparison of model accuracy before and after feature selection.
OVERFITTING IN MACHINE LEARNING –
Definition of Overfitting in Machine Learning
Overfitting is an undesirable behavior in machine learning that occurs when a model produces accurate predictions on training data but not on new data. When data scientists use machine learning models to make predictions, they first train the model on a known data set.
Based on this information, the model then attempts to predict outcomes for new data sets. An overfitted model can produce inaccurate predictions and may not perform well on all types of new data.
Causes of Overfitting
Overfitting typically arises due to the following reasons:
Excessive Training: Overfitting can occur if the model is trained for too many iterations on a sample dataset, causing it to learn not just the underlying patterns but also the noise in the data
Model Complexity: A model that is too complex or has too many parameters can easily overfit the training data
Insufficient Training Data: If there isn’t enough data for training, the model may memorize individual data points instead of learning the overall trend.
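As a hedged illustration of the model-complexity cause, the sketch below compares training and test accuracy of decision trees of increasing depth on a synthetic data set (the data set and depth values are assumptions); a widening gap between the two scores signals overfitting:
# Illustrative overfitting check: training vs. test accuracy for different tree depths
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for depth in (2, 5, None):                      # None lets the tree grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
A large gap at the deepest setting indicates the model has memorized the training data rather than learned the underlying pattern.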
Strategies to Mitigate Overfitting
This technique involves selecting only the most relevant features for model training. Irrelevant or redundant features can be identified and removed by training individual models with different features and comparing their performance.
# import necessary libraries
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
import matplotlib.pyplot as plt
from numpy import set_printoptions
# read the file
filename = 'auto-mpg.csv'
df = pd.read_csv(filename, delimiter=",")

X = df[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']].values
Y = df['mpg'].values  # target column assumed to be 'mpg'
# feature extraction
test = SelectKBest(score_func=f_classif, k=4)
fit = test.fit(X, Y)
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
print(features[0:5, :])
# visualization of selected features
feature_names = ['cylinders', 'displacement', 'horsepower', 'weight']
for i in range(4):
    plt.figure(figsize=(5, 2))
    plt.hist(features[:, i], rwidth=0.8)
    plt.xlabel(feature_names[i])
    plt.ylabel('Frequency')
    plt.title(f'Histogram of {feature_names[i]}')
    plt.show()
Monitoring the model's performance during each iteration of the training phase can help prevent overfitting. Training can be stopped before the model begins to learn from the noise in the data. However, care should be taken to avoid stopping the training process too early, which could result in underfitting.
# import necessary libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
X = df[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']].values

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)
# Build the neural network
model = Sequential()
model.add(Dense(25, input_dim=X.shape[1], activation='relu'))  # Hidden 1
model.add(Dense(10, activation='relu'))                        # Hidden 2
model.add(Dense(1))                                            # Output
model.compile(loss='mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=1, mode='auto', restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_test, y_test), callbacks=[monitor], verbose=2, epochs=1000)
# Load the training and validation loss data
train_loss = model.history.history['loss']
val_loss = model.history.history['val_loss']
# visualization of early stopping
fig, ax = plt.subplots()
ax.plot(train_loss, label='Training loss')
ax.plot(val_loss, label='Validation loss')
epoch_stopped = monitor.stopped_epoch
ax.axvline(epoch_stopped, linestyle='--', color='gray', label='Early stopping')
ax.legend()
ax.set_title('Training and Validation Loss')
plt.show()
Cross-Validation techniques such as Hold-out, K-folds, Leave-one-out, and Leave-p-out are widely used to prevent overfitting These techniques help evaluate how well a model can generalize to unseen data.
# import libraries
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
from sklearn.linear_model import Ridge
X = df[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']].values

# Split into validation and training sets using KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# Create a Ridge regression model with L2 regularization
model = Ridge(alpha=0.1)
# Train the model on each training fold
val_scores = []
for train_index, val_index in kfold.split(X):
    model.fit(X[train_index], Y[train_index])
    val_score = model.score(X[val_index], Y[val_index])
    val_scores.append(val_score)
print(val_scores)
# Calculate the average validation score
avg_val_score = np.mean(val_scores)
y_pred = model.predict(X_test)
# Calculate the R-squared score on the test set
test_score = r2_score(y_test, y_pred)
# Print the test score and validation score
print("Test score:", test_score)
print("Validation score:", avg_val_score)
# visualization of cross-validation
plt.hist(val_scores)
plt.xlabel('Validation Score')
plt.ylabel('Frequency')
plt.title('Distribution of Cross-Validation Scores')
plt.show()