Scientific research report: Advancing thyroid cancer prediction: a comparison of statistic and machine learning methods

VIETNAM NATIONAL UNIVERSITY, HANOI INTERNATIONAL SCHOOL STUDENT RESEARCH REPORT ADVANCING THYROID CANCER PREDICTION: A COMPARISON OF STATISTIC AND MACHINE LEARNING METHODS CN.NC.SV.2

Project Name

- English: Advancing Thyroid Cancer Prediction: A Comparison Of Statistic And Machine Learning Methods

- Vietnamese: Nâng cao dự đoán ung thư tuyến giáp: So sánh các phương pháp thống kê và học máy

Project Code

Member List

Nguyễn Thị Yến Nhi BDA2020B 20070968

Advisor(s)

Phạm Thị Việt Hương PhD

Abstract

- English: Thyroid cancer is a prevalent type of endocrine carcinoma that develops within the thyroid gland. Considerable resources have been dedicated to enhancing its diagnosis, with thyroidectomy serving as the primary treatment approach. The effectiveness of surgery, while minimizing unnecessary damage, hinges on an exact preoperative diagnosis. However, human evaluation of the malignancy of thyroid nodules is susceptible to inaccuracies and does not always ensure precise preoperative diagnoses. This study investigates the application of machine learning, statistical analysis, and data mining techniques to enhance thyroid cancer detection. Building upon previous research, our findings demonstrate the promising potential of these computational methods in improving diagnostic accuracy. Despite slight variations in performance metrics, both our study and previous work consistently show superior performance compared to expert assessments alone, with an accuracy of 83.4% surpassing the previous study's 82%.

- Vietnamese: Ung thư tuyến giáp là một loại ung thư biểu mô nội tiết phổ biến phát triển bên trong tuyến giáp Nguồn lực đáng kể đã được dành riêng để nâng cao khả năng chẩn đoán, trong đó phẫu thuật cắt tuyến giáp đóng vai trò là phương pháp điều trị chính Hiệu quả của phẫu thuật, đồng thời giảm thiểu những tổn thương không cần thiết, phụ thuộc vào việc chẩn đoán chính xác trước phẫu thuật Tuy nhiên, đánh giá của con người về tính ác tính của các khối u tuyến giáp dễ bị thiếu chính xác và không phải lúc nào cũng đảm bảo chẩn đoán trước phẫu thuật chính xác Nghiên cứu này điều tra việc ứng dụng kỹ thuật học máy, phân tích thống kê và khai thác dữ liệu để tăng cường phát hiện ung thư tuyến giáp Dựa trên nghiên cứu trước đây, những phát hiện của chúng tôi chứng minh tiềm năng đầy hứa hẹn của các phương pháp tính toán này trong việc cải thiện độ chính xác của chẩn đoán Mặc dù có sự khác biệt nhỏ về số liệu hiệu suất, cả nghiên cứu của chúng tôi và nghiên cứu trước đây đều cho thấy hiệu suất vượt trội so với đánh giá của chuyên gia, với độ chính xác 83,4%, vượt qua con số 82% của nghiên cứu trước đó.

Keywords

SUMMARY REPORT IN STUDENT RESEARCH

In the realm of thyroid cancer detection and prediction, recent literature highlights the increasing integration of machine learning techniques to enhance diagnostic accuracy and prognostic capabilities. One notable study, as detailed in [1], developed a machine learning-assisted system for thyroid nodule diagnosis, utilizing a variety of algorithms and validation techniques to refine models that exhibited promising performance metrics, including sensitivity, specificity, and area under the curve (AUC). Building upon this foundation, a subsequent study introduced a machine learning-based prediction model for papillary thyroid carcinoma recurrence, employing a combination of decision tree, random forest, and gradient boosting algorithms, alongside cross-validation techniques, to effectively predict disease outcomes [2]. Furthermore, a third study contributed to the field by conducting a comprehensive performance analysis of machine learning algorithms for thyroid disease, demonstrating the efficacy of methods such as KNN, Naïve Bayes, and SVM in diagnostic applications [3]. Collectively, these studies underscore the transformative potential of machine learning in thyroid cancer diagnosis and management, offering valuable insights into the development of more accurate and personalized healthcare solutions.

Concerning the Rationale of the Study

Thyroid cancer represents a significant burden on global healthcare systems, with its incidence rates steadily rising in recent years. Despite advancements in medical technology and diagnostic methodologies, the timely detection of thyroid cancer remains a daunting challenge. This gap between incidence and detection underscores the critical need for innovative approaches to enhance diagnostic accuracy and improve patient outcomes.

Statistical methods offer transformative potential in thyroid cancer diagnosis, addressing the need for objective and accurate assessments. Traditional diagnostic techniques rely on subjective, qualitative observations, resulting in inconsistencies and inaccuracies. Our study aims to harness the power of statistical methods to overcome these limitations, ensuring more precise and reliable thyroid cancer diagnosis.

In contrast to these traditional techniques, statistical methods offer a systematic and objective framework for analyzing complex datasets, thereby enabling healthcare professionals to extract meaningful insights and make informed decisions.

Furthermore, the growing convergence of technology and healthcare presents a unique opportunity to leverage computational algorithms and machine learning techniques to augment traditional diagnostic practices. Through the integration of advanced statistical methodologies into our research framework, we aim to bridge the gap between theory and practice, translating theoretical concepts into tangible solutions that can be readily applied in clinical settings.

Ultimately, the rationale of our study lies in the belief that by harnessing the power of statistical methods and technological innovation, we can revolutionize the landscape of thyroid cancer diagnosis. Through our research efforts, we seek to not only advance scientific knowledge but also to make a tangible difference in the lives of patients by facilitating earlier detection, more accurate diagnosis, and ultimately, improved treatment outcomes.

Research questions

The study's goal is to develop a comprehensive machine learning framework for accurately predicting nodule malignancy based on clinical data. To attain this goal, the following research questions are addressed:

● What role do symptoms have in determining whether a nodule is benign or malignant?

● In what ways does developing this machine learning model help professionals detect thyroid cancer?

● To what extent can the thyroid cancer diagnosis be made using this model?

Motivation and Objective

Thyroid cancer stands as a prevalent disease in contemporary society, yet its diagnosis often occurs at an advanced stage, posing significant challenges to effective treatment and patient outcomes. Recognizing this critical gap in current medical practices, our research endeavors to harness the power of technology to revolutionize the landscape of thyroid cancer diagnosis. By leveraging advanced statistical methods and cutting-edge technology, we aim to pioneer a paradigm shift towards more accurate and effective detection strategies. Through this innovative approach, we aspire to not only enhance early detection rates but also to empower healthcare professionals with the tools and insights needed to combat thyroid cancer more effectively than ever before.

At the heart of our research lies a singular objective: to develop an improved method for thyroid cancer diagnosis that surpasses existing approaches in both accuracy and efficacy. Building upon the foundations of our previous methodologies, we are driven by the desire to demonstrate the distinct advantages offered by statistical methods in the realm of medical diagnostics. Through rigorous experimentation and meticulous analysis, our goal is to showcase the superiority of our proposed method in identifying thyroid cancer with unparalleled precision and reliability. By achieving this objective, we seek to catalyze a transformative shift in medical practice, ushering in a new era of personalized and proactive healthcare for patients battling thyroid cancer worldwide.

Research Methods

The research methodology centers on a dataset comprising 724 patients admitted to Shengjing Hospital of China Medical University from 2010 to 2012, with a primary focus on enhancing thyroid cancer detection. The study delves into the integration of predictions from multiple models, employing logistic regression models to highlight the efficacy of statistical methods alongside prominent machine learning algorithms like Gradient Boosting, Random Forest, and Extra Trees. Through rigorous experimentation and comparative analysis, the research aims to identify optimal methodologies for thyroid cancer detection, leveraging the strengths of both statistical and machine learning approaches to achieve enhanced diagnostic accuracy and improved outcomes for patients.

Structure

The research paper begins with a Literature Review section that discusses the rationale of the study, presents the research questions, and outlines the motivation and objectives. It also describes the research methods and provides an overview of the structure of the study.

This is followed by the Data & Methodology section, which is divided into three chapters. The first chapter focuses on exploratory data analysis of the dataset. The second chapter explores the algorithms used, which include both statistical and machine learning methods. The third chapter details the data processing procedures.

The next section, Results & Discussions, presents the detailed results of the study and provides an analysis. It also reviews the methodology and discusses the challenges, limitations, and potential future work.

The Conclusion & Recommendations section summarizes the findings of the study, provides a concluding statement, and discusses practical applications. It also offers recommendations for future research.

The paper concludes with an Appendix that describes the dataset and the preprocessing steps. The paper also includes sections for Abbreviations and References.

DATA & METHODOLOGY

VISUALIZATION

Based on the information provided by [5], the data used in this study were collected from 724 patients admitted to Shengjing Hospital of China Medical University between 2010 and 2012. In total, 1232 thyroid nodules were analyzed from patients who underwent flexible thyroidectomy and tumor resection. Data collection included malignancy status, patient demographics, ultrasound characteristics, and blood test results. Each patient exhibited at least one nodule within one of three regions: left lobe, right lobe, or isthmus. If multiple nodules were present within a region, only the largest nodule was included in the dataset, which comprised a total of 19 variables.

Age: The Age of the Patient Quantitative Predictor

FT3: Triiodothyronine Test Result Quantitative Predictor

FT4: Thyroxine Test Result Quantitative Predictor

TSH: Thyroid-Stimulating Hormone Test Result Quantitative Predictor

TPO: Thyroid Peroxidase Antibody Test Result Quantitative Predictor

TGAb: Thyroglobulin Antibodies Test Result Quantitative Predictor

Multifocality: If Multiple Nodules Exist in One

Size: The Nodule Size in Cm Quantitative Predictor

Margin: The Clarity of Nodule Margin

Echo Strength: The Nodule Echogenicity

Blood Flow: The Nodule Blood Flow

Multilateral: If Nodules Occur in More Than One

This dataset includes 12 categorical variables and 7 numeric variables. The binary target variable labels 819 nodules as benign and 413 as malignant. The distribution of each variable is presented in Figure 1. Most variables have a right-skewed distribution; only 'age' appears approximately normal. Categorical variables are also unevenly distributed, especially 'gender', 'echo_pattern', 'composition', 'echo_strength', and 'multilateral'. This can affect model performance and should be carefully considered during data analysis and processing.

The dataset shows little to no correlation between its variables (Figure 2), indicating favorable conditions for analysis. Only the correlation between FT3 and FT4 was observed to be moderate, suggesting a potential relationship between these two variables.

METHODOLOGY

Figure 3: Flow diagram of the proposed method

Predictive models for nodule malignancy were developed using machine learning techniques following the methodology outlined in Figure 3. The dataset comprises 18 predictor variables, with malignancy as the output variable.

Once the data is collected, we explore and visualize it before preprocessing to better understand its characteristics, the distribution of variables, and the relationships between them. This step provides insights that shape the data preprocessing strategy and guide the selection of important features for our model.

We then checked for missing values and outliers and normalized the numerical features. Additionally, categorical variables may need to be encoded or transformed into a format suitable for machine learning algorithms.

In the feature engineering stage of our methodology, crucial transformations are applied to the dataset to extract meaningful features pertinent to thyroid cancer detection. Feature extraction is employed to discern the most informative attributes, mitigating redundancy and noise in the dataset. This meticulous curation of features not only enhances the discriminative power of our model but also facilitates better interpretability of the underlying patterns driving thyroid cancer detection.

In the training phase, the models' hyperparameters need to be tuned to optimize performance. This involves selecting the best combination of hyperparameters through techniques such as grid search or random search, or using BIC to choose among fitted models.

In order to evaluate the performance of the trained models, we compared the predicted results with the actual malignancy of the nodule. Several metrics were calculated for each model, including accuracy, F1-score, recall, and precision. These metrics provide insight into the model's diagnostic capability as well as its ability to accurately predict malignant and benign nodules.
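As a minimal sketch of how these four metrics follow from the confusion counts, the snippet below computes them from two label lists. The labels are made up for illustration and are not results from this study; treating malignant as the positive class (coded 1) is an assumption.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Precision answers "of the nodules flagged malignant, how many were", while recall answers "of the malignant nodules, how many were flagged"; F1 is their harmonic mean.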

To further improve the accuracy of the models, we combine two hyperparameter search techniques, random search and grid search. Random search involves randomly sampling hyperparameter combinations from a predefined search space, while grid search exhaustively evaluates all possible hyperparameter combinations within a defined grid.
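The difference between the two search strategies can be sketched in a few lines. Everything in this example is hypothetical: the search space, the parameter names, and `toy_score`, which stands in for a cross-validated accuracy that would in practice come from training a model.

```python
import itertools
import random

# Hypothetical search space (not the grid used in this study).
search_space = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.1, 0.3],
}


def toy_score(params):
    """Stand-in for cross-validated accuracy; peaks at (100, 3, 0.1)."""
    return (-((params["max_depth"] - 3) ** 2)
            - abs(params["learning_rate"] - 0.1)
            - abs(params["n_estimators"] - 100) / 1000)


def grid_search(space, score):
    """Evaluate every combination in the grid and return the best one."""
    keys = list(space)
    candidates = [dict(zip(keys, values))
                  for values in itertools.product(*(space[k] for k in keys))]
    return max(candidates, key=score), len(candidates)


def random_search(space, score, n_iter, seed=0):
    """Evaluate n_iter randomly sampled combinations and return the best."""
    rng = random.Random(seed)
    candidates = [{k: rng.choice(v) for k, v in space.items()}
                  for _ in range(n_iter)]
    return max(candidates, key=score), len(candidates)


best_grid, n_grid = grid_search(search_space, toy_score)
best_random, n_random = random_search(search_space, toy_score, n_iter=10)
```

Grid search here evaluates all 27 combinations, while random search evaluates only the 10 it samples, so it is cheaper but is not guaranteed to find the best grid point.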

This research presents a data preprocessing methodology that combines ensemble learning techniques with meticulous hyperparameter tuning. This methodology aims to enhance the performance of predictive models used to determine the malignancy of thyroid nodules. By refining these models, we seek to improve their predictive accuracy and provide more reliable insights for clinical decision-making.

This paper focuses primarily on data preprocessing methods. We carefully identified missing values, encoded categorical variables, and removed outliers to ensure the dataset was suitable for model training. Through these preprocessing steps, the study aimed to enhance the robustness and credibility of predictive models.

Furthermore, statistical techniques play a pivotal role in analyzing data, identifying patterns, and drawing meaningful conclusions. They provide the framework for rigorous experimentation and ensure the reliability of research findings. Therefore, the meticulous application of statistical methods is crucial for the success and validity of this study.

The proposed methodology in this study utilizes a combination of advanced data preprocessing techniques, statistical analysis, and data mining methods to contribute to medical research. The findings reveal that this approach effectively enhances the precision and diagnostic potential of predictive models for nodule malignancy assessment. This improved understanding empowers healthcare professionals to make well-informed decisions for patient management and treatment, ultimately leading to improved healthcare outcomes.

In supervised learning, logistic regression predicts the probability of an outcome based on one or more input variables. As with other generalized linear models, it assumes that the features and a transformed version of the target variable are linearly related. The transformation is done by applying a sigmoid function, which maps any real number to a value between 0 and 1. As a result, the logistic regression model's output can be interpreted as the probability of belonging to a specific group. The logistic regression equation can be represented as follows:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-z}}$$

where $z$ is the linear combination of the independent variables and their coefficients:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

with $\beta_0$ the intercept term and $\beta_1, \beta_2, \dots, \beta_n$ the coefficients associated with the independent variables $x_1, x_2, \dots, x_n$, respectively.
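To make the logistic regression mapping concrete, the following sketch evaluates the linear combination z and passes it through the sigmoid. The intercept, coefficients, and feature values are hypothetical and are not the fitted values from this study.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(intercept, coefficients, features):
    """Logistic regression: sigmoid of intercept + sum(beta_i * x_i)."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)


# Hypothetical model with two predictors (illustration only).
probability = predict_probability(
    intercept=-1.0,
    coefficients=[0.8, 0.5],
    features=[1.5, 1.0],
)
# A common (but adjustable) decision rule: predict the positive class
# when the estimated probability is at least 0.5.
predicted_label = int(probability >= 0.5)
```

Here z = -1.0 + 0.8 * 1.5 + 0.5 * 1.0 = 0.7, so the estimated probability lies above 0.5 and the positive class is predicted under the 0.5 threshold assumed in the sketch.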

Gradient Boosting is a powerful boosting algorithm that combines several weak learners into a strong learner. Each new model is trained to reduce the loss function (such as mean squared error or cross-entropy) of the current ensemble using gradient descent. In each iteration, the algorithm computes the gradient of the loss function with respect to the predictions of the current ensemble and then trains a new weak model to fit the negative of this gradient. The predictions of the new model are then added to the ensemble, and the process is repeated until a stopping criterion is met. The equation for predicting with a Gradient Boosting model can be represented as follows:

$$F_M(x) = \sum_{m=1}^{M} \gamma_m h_m(x)$$

where $F_M(x)$ represents the final prediction or model output for the input $x$, $M$ is the total number of trees (or weak learners) in the ensemble, $\gamma_m$ is the learning rate (or shrinkage parameter) associated with the $m$-th tree, and $h_m(x)$ represents the $m$-th weak learner (e.g., decision tree) that is added to the ensemble.
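The additive structure of Gradient Boosting can be illustrated with a small from-scratch sketch. To keep it short, it assumes squared-error loss on a one-dimensional regression problem, where the negative gradient is simply the residual; a classifier such as the one in this study would use a classification loss instead. The data, the number of rounds, and the learning rate are all made up.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump by least squares.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Return a predictor: constant base plus a weighted sum of stumps."""
    base = sum(ys) / len(ys)          # initial constant prediction
    learners = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]

    def predict(x):
        return base + sum(learning_rate * h(x) for h in learners)

    return predict


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

Each round adds one stump scaled by the learning rate, so the final predictor is a weighted sum of weak learners on top of a constant starting value.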

Extremely Randomized Trees Classifier (Extra Trees Classifier) is a type of ensemble learning technique which aggregates the results of multiple de-correlated decision trees collected in a “forest” to output its classification result.

In concept, it is very similar to a Random Forest Classifier and differs from it only in the manner in which the decision trees in the forest are constructed. Each decision tree in the Extra Trees forest is constructed from the original training sample. Then, at each test node, each tree is provided with a random sample of k features from the feature set, from which it must select the best feature to split the data based on some mathematical criterion (typically the Gini index). This random sample of features leads to the creation of multiple de-correlated decision trees.

A Random Forest is like a group decision-making team in machine learning. It combines the opinions of many “trees” (individual models) to make better predictions, creating a more robust and accurate overall model. The algorithm's strength lies in its ability to handle complex datasets and mitigate overfitting, making it a valuable tool for various predictive tasks in machine learning. One of the most important features of the Random Forest algorithm is that it can handle datasets containing continuous variables, as in the case of regression, and categorical variables, as in the case of classification. It performs well on both classification and regression tasks.
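The "group decision" idea can be sketched as a majority vote over the individual trees' predictions. The bootstrap resampling shown alongside it is the standard way a Random Forest gives each tree a different view of the data; it is not described in the text above and is included here as an assumption. The votes and rows are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one sample per tree)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))                 # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)   # some rows repeat, some are left out

# Hypothetical predictions from five trees for a single nodule.
tree_votes = ["malignant", "benign", "malignant", "malignant", "benign"]
forest_prediction = majority_vote(tree_votes)
```

With three of five trees voting "malignant", the forest predicts "malignant"; using an odd number of trees avoids ties in a binary problem.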

DATA PREPROCESSING

Base model

We ran the base models after handling outliers and used cross-validation to evaluate the performance of 13 different models. The results show that normalizing the data with the Standard Scaler or the MinMax Scaler can improve model accuracy. In particular, MinMaxScaler gave slightly better results than Standard Scaler.

Based on the results shown in Table 2, we selected the three highest-performing models, Gradient Boosting, Random Forest, and Extra Trees, for Feature Selection and Hyperparameter Tuning. This helps us optimize the models and ensure that they can be applied effectively to real-world data. These results and processes are an important part of developing the final model and of informing further recommendations and developments in our research.

Model	Without scaler	Standard scaler	Min-Max scaler

Scaling

We employed two fundamental methods, the Min-Max scaler and the Standard scaler, for data preprocessing. The Min-Max scaler normalizes the range of each feature to a fixed interval, typically between 0 and 1. This is achieved by computing the minimum ($X_{min}$) and maximum ($X_{max}$) values of each feature and scaling the values using the formula:

$$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

While this approach preserves the shape of the original distribution and is straightforward to implement, it may be sensitive to outliers.

The Standard scaler standardizes the distribution of features by removing the mean (μ) and dividing by the standard deviation (σ), so that each feature has zero mean and unit variance. This makes it suitable for algorithms that assume zero-mean, unit-variance features. It is accomplished through the formula:

$$X_{scaled} = \frac{X - \mu}{\sigma}$$

Although the Standard scaler handles outliers better, it does not constrain values to a specific range. These methods offer researchers flexibility in data preprocessing, allowing them to choose the most appropriate scaling technique based on the characteristics of their dataset and analysis requirements.
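Both scaling rules can be written in a few lines of plain Python. The nodule sizes below are made-up values chosen to include one large outlier; the population standard deviation is used, which is an assumption about the convention intended here.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:                      # constant feature: avoid dividing by 0
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def standard_scale(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:                       # constant feature
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical nodule sizes in cm, with one outlier (6.0).
sizes = [0.5, 1.0, 1.5, 2.0, 6.0]
sizes_min_max = min_max_scale(sizes)
sizes_standard = standard_scale(sizes)
```

In the Min-Max output the outlier takes the value 1 and squeezes the other four sizes into the lower part of the interval, which illustrates the sensitivity to outliers mentioned above.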

After comparing the results obtained from both the Standard Scaler and Min-Max Scaler methods, we opted for the Min-Max Scaler due to its slightly higher performance. Although both methods are widely used for data normalization, the Min-Max Scaler demonstrated a marginally better outcome in our specific context. The decision to select the Min-Max Scaler was informed by its ability to normalize the range of features within a fixed interval, typically between 0 and 1, preserving the shape of the original distribution while ensuring each feature contributed equally to the analysis. While the difference in results may appear subtle, even minor improvements can have significant implications, particularly in scenarios where precision is crucial. Thus, the selection of the Min-Max Scaler aligns with the objective of optimizing our data preprocessing pipeline for enhanced analytical performance.

RESULTS & DISCUSSIONS

From the base model results, we proceed to perform a feature selection process to identify the most important variables for the model. Specifically, we selected variables whose influence on the model is greater than 20%. After obtaining the set of important variables for each model, we normalize the data using the Min-Max method and then adjust the model's hyperparameters using Grid Search to optimize forecasting performance. This process helps us fine-tune the model parameters so that they best reflect the predictability of the data.
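One possible reading of the 20% rule is sketched below: keep every variable whose importance is more than 20% of the most important variable's score. This interpretation, the function name, and the importance values are all assumptions made for illustration; the text does not specify how "influence" was measured.

```python
def select_features(importances, threshold=0.20):
    """Keep features whose importance exceeds `threshold` times the top score."""
    top = max(importances.values())
    return [name for name, value in importances.items()
            if value / top > threshold]


# Hypothetical importance scores (not taken from the fitted models).
hypothetical_importances = {
    "size": 0.30,
    "TGAb": 0.25,
    "site": 0.20,
    "blood_flow": 0.15,
    "age": 0.05,
    "gender": 0.03,
    "composition": 0.02,
}
selected = select_features(hypothetical_importances)
```

With these made-up scores, the four strongest variables pass the threshold and the remaining three are dropped before tuning.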

Figure 5: Generalized Linear Model Regression Results

Based on the table shown in Figure 5, among the 18 independent variables considered, several showed a significant influence on the "mal" forecast target. Specifically, the variables "site", "shape", "margin", "calcification", "blood_flow", "multilateral", "TGAb" and "size" were identified as significantly influencing the ability to forecast "mal". Other variables, such as "FT4" and "TSH", have a moderate influence, with statistical significance. Notably, "gender", "echo_pattern", "multifocality", "echo_strength", "composition", "age", "FT3", and "TPO" do not have a significant influence. The Pseudo R-squared in this case is 0.3370; it is a likelihood-based analogue of the share of variation in the dependent variable that the model can explain.

The Bayesian Information Criterion (BIC) is a statistical measure used for model selection, balancing model complexity and goodness of fit. It penalizes complex models by incorporating a penalty term that depends on the number of parameters in the model. The formula for BIC is $\mathrm{BIC} = -2\log(L) + k\log(n)$, where $L$ is the maximum likelihood of the model given the data, $k$ is the number of parameters in the model, and $n$ is the number of data points in the dataset. This penalty term discourages overfitting by favoring simpler models. Lower BIC values indicate better model fit with less complexity, aiding in more reliable model selection.
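As a small worked example of the BIC formula, the sketch below compares two hypothetical models fitted to n = 1232 observations. The log-likelihoods and parameter counts are invented for illustration. Note also that some software reports a deviance-based variant of BIC, so absolute values are only comparable when the same definition is used.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """BIC = -2 * log(L) + k * log(n), with log(L) passed in directly."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


n_obs = 1232  # number of nodules in the dataset

# Hypothetical fits: the larger model fits slightly better
# but uses almost twice as many parameters.
bic_full = bic(log_likelihood=-520.0, n_params=19, n_obs=n_obs)
bic_reduced = bic(log_likelihood=-525.0, n_params=10, n_obs=n_obs)

preferred = "reduced" if bic_reduced < bic_full else "full"
```

The reduced model gives up 5 units of log-likelihood (a cost of 10 in the first term) but saves 9 parameters, each penalized by log(1232), roughly 7.1, so its BIC is lower and it is preferred.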

We use the BIC to identify important variables for predicting the target variable "mal" with the Logistic model. The results show that the BIC value is -7601.5 when using the variables FT4, TGAb, site, size, shape, margin, calcification, blood_flow, and multilateral. We then tuned the hyperparameters of the Logit model using parameter grids, testing many different parameter values to find the best configuration. The final results show that the Logit model achieved an accuracy of 78.8%, indicating good forecasting performance after selecting the important variables and adjusting the hyperparameters.

We performed a thorough hyperparameter tuning process for the Random Forest model, using Grid Search to search for optimal parameters. The results show a stable accuracy, reaching 80.6%. During this process, we selected the 10 most important variables from the original data, each with an effect greater than 20%: 'TPO', 'TGAb', 'site', 'size', 'calcification', 'blood_flow', 'TSH', 'FT4', 'age', and 'FT3'.

The Gradient Boosting model achieved a remarkably high accuracy, reaching 83%. During this process, we selected variables that had a greater than 20% influence on the model, namely 'TPO', 'TGAb', 'site', 'size', 'calcification', and 'blood_flow', and also used Grid Search. This choice reduced the number of variables from the original data to 6, which greatly reduced the complexity of the model compared to the other two tree-based models.

Finally, the Extra Trees model also achieved relatively high results after the hyperparameter tuning process. It reached an accuracy of 80.3% using 14 variables from the original data: 'TPO', 'TGAb', 'site', 'size', 'calcification', 'blood_flow', 'TSH', 'FT4', 'age', 'FT3', 'shape', 'margin', 'multifocality', and 'echo_strength'. This helped the Extra Trees model achieve a balance between complexity and forecasting performance. The performance of our methodology is shown in Table 3. All four models produced relatively impressive results in predicting the target variable. The Gradient Boosting model, with the highest accuracy of 83.4%, is considered to have the best forecasting performance among the evaluated models. Random Forest and Extra Trees also showed positive results, with accuracies of 80.6% and 80.3%, respectively.

Despite Logistic Regression's lower accuracy (79.2%), it allows for a comprehensive analysis of variable effects Decision tree-based models, such as Gradient Boosting, Random Forest, and Extra Trees, excel in predictive performance and are typically preferred for higher accuracy requirements.

Methodology	Accuracy	Precision	Recall	F1-Score

Table 4: Comparison with other state-of-the-art research

As seen in Table 4, our GBM model demonstrated superior overall performance, with an accuracy of 83.4% and a precision of 82.3%, indicating its effectiveness in classification tasks. These values are higher than those of the previous study, which reported 77.41% accuracy and a precision of 80.3%. In terms of precision, the study by Xi et al. (2022) [5] shows a higher percentage for the Logistic Regression and Random Forest models, but a lower one for Gradient Boosting.

From these comparisons, it can be seen that, although both studies were performed on the same dataset, the results of our study appear to be more stable and accurate, which suggests that our preprocessing step is efficient in choosing appropriate variables.

During the process of researching and developing models, we noticed a number of challenges and limitations that could potentially affect the process and results of the research.

One of the challenges we encountered was imbalance in the data, especially in the target variable "mal". This imbalance can bias the model towards the majority class and lead to uneven forecasting performance across classes. Furthermore, some variables in the dataset have a skewed distribution, mainly skewed to the right. This can affect model performance, especially for statistical models such as logistic regression, whose coefficient estimates can be distorted by heavily skewed predictors and extreme values. While the models have achieved good performance on training data, the next challenge is to ensure that they generalize well to new data. This is especially important when deploying the model in real life, where the data varies and is not the same as the training data.
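The degree of imbalance can be quantified directly from the class counts reported for this dataset (819 benign, 413 malignant). The short calculation below derives the accuracy of a trivial classifier that always predicts the majority class, which is a useful reference point when reading the reported accuracies.

```python
benign, malignant = 819, 413           # class counts reported for the dataset
total = benign + malignant             # 1232 nodules

# Accuracy of a classifier that always predicts the majority class.
majority_baseline = max(benign, malignant) / total

# How many benign nodules there are per malignant nodule.
imbalance_ratio = benign / malignant

malignant_share = malignant / total
```

Always predicting "benign" would already be correct for about two thirds of the nodules, so accuracy alone can overstate a model's usefulness; this is why precision, recall, and F1 are reported alongside it.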

Based on what we learned from the research and development process, we have a number of recommendations for the future.

Collecting more, and more diverse, data is key to building accurate and generalizable forecasting models. In the future, we recommend increasing efforts to collect data from a variety of sources to improve the diversity and richness of the dataset. Although current models have achieved relatively high results, research and development of more complex models may yield better forecasting results. We propose focusing on deep learning models and artificial neural networks to further exploit information from the data. To ensure the applicability and generalization of the model, it is necessary to test and evaluate it on unseen data to confirm its effectiveness in medical practice.

CONCLUSION & RECOMMENDATIONS

In summary, both the previous paper and the current study highlight the promising potential of machine learning, statistical analysis, and data mining techniques in enhancing thyroid cancer detection. Although the performance metrics may vary slightly between studies, these methods have consistently demonstrated improved diagnostic accuracy when compared with expert assessments alone.

It is imperative that future research endeavors focus on mitigating existing limitations, such as data quality issues and interpretability concerns. For the selection of features and the evaluation of models, more sophisticated statistical techniques can be used. A further enhancement to machine learning models may be achieved by integrating advanced data mining algorithms.

The exploration of ensemble learning methods, which combine multiple models to improve predictive performance, presents a promising avenue for further development of this area. With the help of larger datasets and innovative feature engineering approaches, researchers can refine and optimize machine learning models for more reliable and efficient thyroid cancer detection tools. These concerted efforts are pivotal in realizing the full potential of computational methods in medical diagnostics and ultimately improving patient outcomes.

APPENDIX

The dataset used in this research serves as the foundation upon which our analyses and findings are built. Comprising information collected from 724 patients admitted to Shengjing Hospital of China Medical University between 2010 and 2012, this dataset provides a comprehensive insight into the demographic, clinical, and pathological characteristics of individuals with thyroid nodules. Each patient underwent flexible thyroidectomy and tumor resection, with malignancy status, ultrasound characteristics, and blood test results recorded in detail.

To ensure data integrity, preprocessing involved removing missing values and outliers, and normalizing numerical variables The dataset was then split into training and testing subsets to train predictive models on a representative sample and validate them on unseen data, assessing their generalizability and performance To prevent overfitting and enhance model performance, cross-validation techniques were utilized during training.
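The split-then-cross-validate procedure described above can be sketched with index lists alone. The 80/20 split ratio, the choice of 5 folds, and the random seed are assumptions for illustration; the text does not state the values used in the study.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_splits(indices, k=5):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        training = [idx for f, fold in enumerate(folds)
                    if f != held_out for idx in fold]
        yield training, validation


# 1232 nodules in the dataset; the test rows are never used during training.
train_idx, test_idx = train_test_split_indices(1232)
splits = list(k_fold_splits(train_idx, k=5))
```

Each training row is used for validation exactly once across the five folds, while the held-out test rows stay untouched until the final evaluation.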

ABBREVIATION

REFERENCES

[1] BALIKÇI ÇİÇEK, İ., & KÜÇÜKAKÇALI, Z. (2023). Machine Learning Approach for Thyroid Cancer Diagnosis Using Clinical Data. Middle Black Sea Journal of Health Science, 9(3), 440–452. https://doi.org/10.19127/mbsjohs.1282265

[2] Zhang, B., Tian, J., Pei, S., Chen, Y., He, X., Dong, Y., Zhang, L., Mo, X., Huang, W., Cong, S., & Zhang, S. (2019). Machine Learning–Assisted System for Thyroid Nodule Diagnosis. Thyroid, 29(6), 858–867. https://doi.org/10.1089/thy.2018.0380

[3] Abbad Ur Rehman, H., Lin, C.-Y., Mushtaq, Z., & Su, S.-F. (2021). Performance Analysis of Machine Learning Algorithms for Thyroid Disease. Arabian Journal for Science and Engineering, 46(10), 9437–9449. https://doi.org/10.1007/s13369-020-05206-x

[4] T. A. Vu, N. A. Huyen, H. Q. Huy and P. T. V. Huong, "Enhancing Thyroid Cancer Detection Through Machine Learning Approach," 2023 12th International Conference on Control, Automation and Information Sciences (ICCAIS), Hanoi, Vietnam, 2023, pp. 188-193, doi: 10.1109/ICCAIS59597.2023.10382297

[5] Xi, N. M., Wang, L., & Yang, C. (2022). Improving the diagnosis of thyroid cancer by machine learning and clinical data. Scientific Reports, 12(1), 11143. https://doi.org/10.1038/s41598-022-15342-z

[6] Tripathi, A. (2022). Gradient Boosting Algorithm Guide with examples. [online] Blogs & Updates on Data Science, Business Analytics, AI Machine Learning. Available at: https://www.analytixlabs.co.in/blog/gradient-boosting-algorithm/


Title	Advancing Thyroid Cancer Prediction
Author	Nguyen Thu Huyen
Advisor	Pham Thi Viet Huong, PhD
School	Vietnam National University, Hanoi
Major	Business Data Analytics
Type	Student Research Report
Year of publication	2024
City	Hanoi
