SUMMARY
The document covers an analysis of four machine learning models: k-Nearest Neighbors (kNN), Linear Regression, Naive Bayes, and Decision Tree. It compares the models, applies several classifiers to a wine quality prediction problem, and discusses overfitting in machine learning and strategies to mitigate it.
ANALYSIS OF MACHINE LEARNING MODELS –
1.1 K-Nearest Neighbors
The main goal of kNN is to create a model capable of predicting the value or class of a new data point based on information from the closest data points in the training set (finding similarities between new data points and available data). In addition, the model must ensure accuracy and flexibility in prediction.
Figure 1.1: A visual representation of how a kNN model works.
- kNN is based on finding similarities between new data points and available training data
- It makes predictions by searching through the entire training set to find the K most similar instances
- Similarity is determined using distance metrics like Euclidean distance
- The model predicts values of new data points based on the values of their nearest neighbors in the training set
- Collect data to build a kNN model
- Select number K: this is the number of closest points to make predictions for new data points
- Data preprocessing: includes handling missing or noisy data and normalizing the data
- Prediction: calculate the distance between the new data point and the existing data points in the set, then select the K closest points
The kNN algorithm assumes that similar data points lie close to each other in the feature space. Predictions are made by searching the entire training data set for the K instances closest (with the smallest distance) to the new data point.
To determine the K closest cases to the input data, we need to calculate the distance between two points. The Euclidean distance between two points x and y with n attributes is:
d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
In addition, the distance between two points can also be calculated using the Manhattan or Minkowski distance, among others.
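For illustration, a minimal NumPy sketch of both distance measures between two example points (the vectors x and y below are made up for the example):
# Distance between two points x and y with n attributes (illustrative sketch)
import numpy as np

x = np.array([5.1, 3.5, 1.4])
y = np.array([6.2, 2.9, 4.3])

euclidean = np.sqrt(np.sum((x - y) ** 2))   # square root of the sum of squared differences
manhattan = np.sum(np.abs(x - y))           # sum of absolute differences

print(euclidean, manhattan)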
- Model evaluation: Evaluate the performance of the kNN model using learning criteria such as accuracy for classification problems or Mean Squared Error (MSE) for regression problems
- Check and adjust: if the performance does not meet expectations, you can adjust the K value or preprocess the data accordingly
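The steps above can be sketched with scikit-learn; the iris data set, the train/test split, and K = 5 are illustrative assumptions rather than part of the report:
# Minimal kNN classification sketch (data set and K value are assumptions)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Data preprocessing: normalize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Select K and fit the model (Euclidean distance is the default metric)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Model evaluation: accuracy for a classification problem
print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))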
In the classification problem, the learning criterion of kNN is accuracy. In the regression problem, the criterion is Mean Squared Error or Root Mean Squared Error.
1.1.5 kNN Applications
kNN is used in both classification and regression. It is suitable for problems with numerical data because this makes it easy to measure the distance between data points.
- Regression: applications in the investment industry, including bankruptcy prediction, stock price prediction,
1.1.6 Advantages and Disadvantages of kNN
Advantages:
Adapts easily to new information
No assumptions about class distribution
Disadvantages:
High computational and storage requirements for large datasets
Sensitive to noise, especially with small values of k
Curse of dimensionality in high-dimensional spaces
Prone to overfitting without proper feature selection or dimensionality reduction.
1.2 Linear Regression
1.2.1 Goal of Linear Regression Model
The main goal of Linear Regression is to create a linear model that predicts continuous output values based on information from the input variables. This model finds the line (or hyperplane) that minimizes the error between the actual values and the predicted values.
Figure 1.2: A visual representation of how a simple linear regression model works.
1.2.2 Linear Regression Methods
The main method of Linear Regression is to create a linear function that describes the relationship between the input variable and the target variable
The linear function has the form:
y = β0 + β1*x1 + β2*x2 + ... + βn*xn
where:
y: the target variable to predict
x1, ..., xn: the input variables
β0, β1, ..., βn: the coefficients of the model
Optimize the coefficients: Linear Regression finds the optimal values for the coefficients β so that the squared error (Residual Sum of Squares) between predictions and actual values is as small as possible.
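As a small illustration of this step, the β coefficients that minimize the Residual Sum of Squares can be found with a least-squares solver; the tiny data set below is made up for the example:
# Least-squares sketch: find beta minimizing the Residual Sum of Squares (illustrative data)
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])       # one input variable
y = np.array([2.1, 4.0, 6.2, 7.9])               # target values

X_b = np.c_[np.ones((len(X), 1)), X]             # add a column of ones for the intercept
beta, *_ = np.linalg.lstsq(X_b, y, rcond=None)   # solves min ||X_b @ beta - y||^2

print("Intercept and slope:", beta)
rss = np.sum((y - X_b @ beta) ** 2)              # Residual Sum of Squares
print("RSS:", rss)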
- Determine the linear relationship: it is necessary to determine whether the relationship between the input variables and the target variable is linear. This can be done through scatter plots and basic statistical analysis.
- Collect data: Collect data with input variables and target variables
- Data preprocessing: Data preprocessing includes handling missing values, noise, removing outliers, and normalizing data
- Model building: Calculate β coefficients by optimizing the sum of squared errors using the minimization method
- Predict target values for new data points
- Model evaluation: Evaluate model performance using criteria such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²).
- Check and adjust: If the model performance is not satisfactory, the model can be adjusted by changing the input variables or using nonlinear transformations of the variables.
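A minimal scikit-learn sketch of these steps, using synthetic data purely for illustration:
# Linear Regression sketch with MSE, RMSE and R-squared evaluation (illustrative data)
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

reg = LinearRegression().fit(X_train, y_train)   # optimizes the beta coefficients
y_pred = reg.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse, "RMSE:", np.sqrt(mse), "R2:", r2_score(y_test, y_pred))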
1.2.4 Learning Criteria for Linear Regression
Minimize MSE: adjust the slope and intercept coefficients so that the MSE is as small as possible. This helps the predicted results to be close to the actual values.
1.2.5 Applications of Linear Regression
Linear Regression is often used in regression problems. It is suitable for data with linear relationships and data that does not contain noise or complex interactions. It is often used to predict continuous output values such as sales or prices.
1.2.6 Advantages and Disadvantages of Linear Regression
Advantages:
Easy to understand, deploy, and quick to predict
Performs well when the data has a linear relationship
Disadvantages:
Does not work well for non-linear data
Sensitive to noisy data and prone to overfitting.
1.3 Naïve Bayes Classification
1.3.1 Goal of Naïve Bayes Model
The main goal of Naive Bayes classifiers is to classify data into groups based on probability. This model uses the Bayesian formula to calculate the probability that a data point belongs to a particular group based on information from the characteristics of that data point.
Figure 1.3: A visual representation of how a Naive Bayes classifier works.
1.3.2 Naïve Bayes Method
The Naïve Bayes classification method is based on applying Bayes' theorem with the naïve independence assumption between features to perform probabilistic classification.
- Determine the events and the probability of each event based on the training data
- Collect training data consisting of labeled data points
- Data preprocessing: clean and transform the data to remove noise, normalize it, and convert it into a form the model can use
- Use the Bayesian formula to calculate the posterior probability of each event based on the training data set and the characteristics of the data points. The formula is as follows:
P(A|B) = P(B|A) * P(A) / P(B)
where:
P(A|B): the posterior probability of event A given event B
P(B|A): the conditional probability of event B given that event A has occurred
P(A): the prior probability of event A
P(B): the prior probability of event B
- Based on the posterior probability, select the event with the highest probability for each data point
- Evaluate the model based on measures such as accuracy, recall,
- Test and adjust: if the desired performance is not achieved, you can change the parameters, perform optimization using variations of Naive Bayes, or change the model
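A minimal sketch of these steps using scikit-learn's GaussianNB; the wine data set and the train/test split are illustrative assumptions:
# Naive Bayes classification sketch (illustrative data set)
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, recall_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

nb = GaussianNB()                      # applies Bayes' theorem with the independence assumption
nb.fit(X_train, y_train)

y_pred = nb.predict(X_test)            # picks the class with the highest posterior probability
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Recall (macro):", recall_score(y_test, y_pred, average='macro'))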
1.3.4 Learning Criteria for Naïve Bayes
Naïve Bayes is based on probability. The model uses probabilities to predict how likely a data point is to belong to each class or specific event.
1.3.5 Applications of Naïve Bayes
Naïve Bayes is suitable for classification problems, especially problems with discrete data such as spam filtering, document classification, and recommendation features.
1.3.6 Advantages and Disadvantages of Naïve Bayes
Advantages:
Easy to deploy, effective in data classification, works well in cases with many features
Rapid learning and prediction process is suitable for real-time applications
Disadvantages:
It assumes that features are independent of each other, which may not be true in real cases
Performance declines when the data is noisy.
1.4 Decision Tree
1.4.1 Goal of Decision Tree Model
The main goal of a Decision Tree is to create a tree structure that predicts or classifies data points based on information from their characteristics. The model divides the data set into branches so that the data points in each branch have similar or nearly identical properties.
Figure 1.4: A simple example demonstrating how the decision tree algorithm works.
1.4.2 Decision Tree Method
Building a decision tree usually starts from the root node and continues dividing the data set based on features. There are many algorithms related to decision trees, such as ID3 and Random Forest (an ensemble of decision trees), among others.
- Start by creating a root node containing the entire training dataset
- Create child nodes by dividing the data set based on the selected feature
- Repeat the process for the child nodes
- The process continues until a stopping condition is met. Stopping conditions can include the depth of the tree, no longer having features to divide on, etc.
- When the process ends, each leaf node is assigned a predicted value. For a classification problem, the value of a leaf node is a class or predicted label. For a regression problem, the value of the leaf node is a numeric value (for example, the mean of the target values in that node).
- Build the tree until all nodes meet the stopping condition
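A minimal sketch of this process with scikit-learn's DecisionTreeClassifier; the data set and the depth-based stopping condition are illustrative assumptions:
# Decision Tree sketch: build a tree with a maximum-depth stopping condition (illustrative data)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(export_text(tree))                               # the learned splits, node by node
print("Accuracy:", accuracy_score(y_test, tree.predict(X_test)))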
1.4.4 Learning Criteria for Decision Tree
Learning involves selecting features and deciding how to split the data based on criteria such as similarity, purity, or information value.
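For example, purity can be quantified with Gini impurity or entropy; a short sketch of both measures (the class proportions below are made up):
# Purity measures commonly used to choose splits (illustrative class proportions)
import numpy as np

def gini(p):
    return 1.0 - np.sum(np.square(p))          # Gini impurity: 1 - sum(p_i^2)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))             # entropy: -sum(p_i * log2(p_i))

p = np.array([0.7, 0.2, 0.1])                  # class proportions in a candidate node
print("Gini:", gini(p), "Entropy:", entropy(p))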
1.4.5 Applications of Decision Tree
Decision Tree is suitable for classification problems (whether a case belongs to a class or not) and prediction problems (predicting continuous values such as salary or sales). It is suitable for both discrete and continuous data. In addition, the model is also capable of handling noisy data and missing information. It is widely applied in fields such as health, credit, and economics.
1.4.6 Advantages and Disadvantages of Decision Tree
Advantages:
Easy to understand and explain
Can handle both discrete and continuous data
Able to handle noisy and non-linear data
Comparing the Models
Goal:
- kNN: Predict values based on the nearest points.
- Linear Regression: Determine the linear relationship between the independent variable and the target variable to predict the metric value.
- Naive Bayes: Classify data into groups based on probability.
- Decision Tree: Predict or classify data points based on information from the data set's characteristics.
Method:
- kNN: Calculate the distance between data points to find the closest points.
- Linear Regression: Create a linear function that describes the relationship between the input variable and the target variable.
- Naive Bayes: Use probability and the independence assumption to calculate the predicted probability.
- Decision Tree: Evaluate similarity and purity to determine how to divide the data.
Effectiveness on data size:
- kNN: Small data sets.
- Linear Regression: Data of any size.
- Naive Bayes: Huge data sets.
- Decision Tree: Large data sets.
WINE QUALITY ANALYSIS –
Problem Description
The problem our team chose to solve using a machine learning approach is predicting wine quality, which depends on multiple characteristics such as 'free sulfur dioxide,' 'total sulfur dioxide,' 'residual sugar,' and more. The primary target variable is 'quality'.
The process involves several data preprocessing steps and the application of different classification models
# Check for missing values
df.isnull().sum()
# Fill missing values with the mean of the respective column
for col in df.columns:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].mean())
# Check for duplicate rows and remove them
df.duplicated().sum()
df.drop_duplicates(inplace=True)
# Function to remove outliers from a specific column using the IQR method
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    # Keep only the rows within the IQR bounds
    return df[(df[column] >= lower) & (df[column] <= upper)]

# Create the binary target column and encode the wine type
df['good quality'] = [1 if x >= 7 else 0 for x in df['quality']]
df.replace({'white': 1, 'red': 0}, inplace=True)
X = df.drop(['quality', 'good quality'], axis=1)
y = df['good quality']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
We use classification models such as Logistic Regression, XGBoost, Random Forest, and K-Nearest Neighbors, together with cross-validation, to predict the quality of wine.
"Random Forest": RandomForestClassifier(n_estimators0, random_stateB), "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5)
} models_name = list(models.keys())
# Cross-validation and evaluation
accuracy_scores = {}
validation_scores = {}
for model_name, model in models.items():
    train_acc = cross_val_score(model, X_train_scaled, y_train, scoring=make_scorer(accuracy_score), cv=5).mean()
    test_acc = cross_val_score(model, X_test_scaled, y_test, scoring=make_scorer(accuracy_score), cv=5).mean()
    accuracy_scores[model_name] = train_acc
    validation_scores[model_name] = test_acc
Feature Selection
Feature selection is performed by training a RandomForestClassifier on the training data and using it to estimate the importance of each feature. Then, the `SelectFromModel` class from scikit-learn selects features based on the importances calculated by the RandomForestClassifier.
# Create and fit RandomForestClassifier for feature importance
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train_scaled, y_train)

# Feature selection using SelectFromModel
selector_sfm = SelectFromModel(estimator=rfc)
selector_sfm.fit(X_train_scaled, y_train)
# Transform the data using the selected features
X_train_sfm = selector_sfm.transform(X_train_scaled)
X_test_sfm = selector_sfm.transform(X_test_scaled)
Figure 2.6: The 9 most important features identified by RandomForestClassifier.
Prediction using selected features
While feature selection can sometimes lead to improved model performance, in this case it resulted in a decrease in accuracy for XGBoost, Random Forest, and K-Nearest Neighbors. Only Logistic Regression maintained the same level of accuracy. This suggests that the feature selection method may have excluded some important features that contribute to the prediction.
Figure 2.7: Comparison of model accuracy before and after feature selection.
OVERFITTING IN MACHINE LEARNING –
Definition of Overfitting in Machine Learning
Overfitting is an undesirable behavior in machine learning that occurs when a model produces accurate predictions on training data but not on new data. When data scientists use machine learning models to make predictions, they first train the model on a known data set.
Based on this information, the model then attempts to predict outcomes for new data sets. An overfitted model can produce inaccurate predictions and may not perform well on all types of new data.
Causes of Overfitting
Overfitting typically arises due to the following reasons:
Excessive Training: Overfitting can occur if the model is trained for too many iterations on a sample dataset, causing it to learn not just the underlying patterns but also the noise in the data
Model Complexity: A model that is too complex or has too many parameters can easily overfit the training data
Insufficient Training Data: If there isn’t enough data for training, the model may memorize individual data points instead of learning the overall trend.
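As a hedged illustration of the model-complexity cause, the sketch below compares training and test accuracy of decision trees of increasing depth on a synthetic data set (the data set and depth values are assumptions); a widening gap between the two scores signals overfitting:
# Illustrative overfitting check: training vs. test accuracy for different tree depths
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for depth in (2, 5, None):                      # None lets the tree grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
A large gap at the deepest setting indicates the model has memorized the training data rather than learned the underlying pattern.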
Strategies to Mitigate Overfitting
This technique involves selecting only the most relevant features for model training. Irrelevant or redundant features can be identified and removed by training individual models with different features and comparing their performance.
# import necessary libraries
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
import matplotlib.pyplot as plt
from numpy import set_printoptions
# read the file
filename = 'auto-mpg.csv'
df = pd.read_csv(filename, delimiter=",")

X = df[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']].values
Y = df['mpg'].values  # target column assumed to be 'mpg'
# feature extraction
test = SelectKBest(score_func=f_classif, k=4)
fit = test.fit(X, Y)
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
print(features[0:5, :])
# visualization of selected features
feature_names = ['cylinders', 'displacement', 'horsepower', 'weight']
for i in range(4):
    plt.figure(figsize=(5, 2))
    plt.hist(features[:, i], rwidth=0.8)
    plt.xlabel(feature_names[i])
    plt.ylabel('Frequency')
    plt.title(f'Histogram of {feature_names[i]}')
    plt.show()
Monitoring the model's performance during each iteration of the training phase can help prevent overfitting. Training can be stopped before the model begins to learn from the noise in the data. However, care should be taken to avoid stopping the training process too early, which could result in underfitting.
# import necessary libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
X = df[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']].values

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)
# Build the neural network
model = Sequential()
model.add(Dense(25, input_dim=X.shape[1], activation='relu'))  # Hidden 1
model.add(Dense(10, activation='relu'))                        # Hidden 2
model.add(Dense(1))                                            # Output
model.compile(loss='mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=1, mode='auto', restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_test, y_test), callbacks=[monitor], verbose=2, epochs=1000)
# Load the training and validation loss data
train_loss = model.history.history['loss']
val_loss = model.history.history['val_loss']
# visualization of early stopping
fig, ax = plt.subplots()
ax.plot(train_loss, label='Training loss')
ax.plot(val_loss, label='Validation loss')
epoch_stopped = monitor.stopped_epoch
ax.axvline(epoch_stopped, linestyle='--', color='gray', label='Early stopping')
ax.legend()
ax.set_title('Training and Validation Loss')
plt.show()
Cross-Validation techniques such as Hold-out, K-folds, Leave-one-out, and Leave-p-out are widely used to prevent overfitting These techniques help evaluate how well a model can generalize to unseen data.
# import libraries
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
from sklearn.linear_model import Ridge
X = df[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']].values

# Split into validation and training sets using KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# Create a Ridge regression model with L2 regularization
model = Ridge(alpha=0.1)
# Train the model on each training fold
val_scores = []
for train_index, val_index in kfold.split(X):
    model.fit(X[train_index], Y[train_index])
    val_score = model.score(X[val_index], Y[val_index])
    val_scores.append(val_score)
print(val_scores)
# Calculate the average validation score
avg_val_score = np.mean(val_scores)
y_pred = model.predict(X_test)
# Calculate the R-squared score on the test set
test_score = r2_score(y_test, y_pred)
# Print the test score and validation score
print("Test score:", test_score)
print("Validation score:", avg_val_score)
# visualization of cross-validation
plt.hist(val_scores)
plt.xlabel('Validation Score')
plt.ylabel('Frequency')
plt.title('Distribution of Cross-Validation Scores')
plt.show()