Research objectives: (i) summarize the theoretical basis related to the application of machine learning to forecast stock price fluctuations; (ii) conduct a literature review on domestic and foreign research.
THEORETICAL BASIS AND LITERATURE REVIEW
1.1 Theoretical basis
1.1.1 Stock market analysis
1.1.2 Text mining approach
1.1.3 Introducing the problem of stock market forecasting
1.2 Literature review
1.2.1 Overview of domestic research
1.2.2 Overview of foreign research
Fundamental analysis (FA) is a technique used to assess a security's intrinsic value by analyzing various economic and financial factors. Analysts in this field evaluate both macroeconomic elements, such as overall economic conditions and industry trends, and microeconomic aspects, including the performance of a company's management.
The end goal is to arrive at a number that an investor can compare with a security's current price in order to see whether the security is undervalued or overvalued.
Technical analysis is a trading discipline employed to evaluate investments and identify trading opportunities by analyzing statistical trends gathered from trading activity, such as price movement and volume.
Technical analysis differs from fundamental analysis by concentrating on price and volume rather than business performance metrics like sales and earnings. It employs various tools to analyze how supply and demand influence price changes, volume, and implied volatility. While primarily used for generating short-term trading signals through charting techniques, technical analysis also enhances the assessment of a security's relative strength or weakness in comparison to the broader market or specific sectors, ultimately aiding analysts in refining their overall valuation estimates.
Technical analysis can be used on any security with historical trading data. This includes stocks, futures, commodities, fixed income, currencies, and other securities.
The Efficient Market Hypothesis posits that financial markets, particularly the stock market, are efficient, meaning that security prices fully incorporate all available information. As a result, it is impossible for investors to achieve profits based on known data or historical price trends, as they cannot outperform the market's collective understanding.
The efficient market hypothesis comprises three forms: weak, semi-strong, and strong. The weak form asserts that securities prices incorporate all past information, indicating that speculators cannot outperform the market based on historical data. The semi-strong form posits that stock prices reflect both past and newly published information, meaning that speculators cannot capitalize on recent news to acquire stocks at lower prices, as the market quickly adjusts to reflect this information.
The strong form of market efficiency theory asserts that security prices incorporate all available information, including past and present public data and insider knowledge. This theory posits that when insiders possess non-public information, they will act swiftly to buy or sell securities for profit, causing immediate adjustments in stock prices until the opportunity for profit diminishes.
1.1.1.4 Evidence against the efficient market assumption
Numerous studies highlight market inefficiencies that allow speculators to capitalize on opportunities, such as purchasing stocks with low price-to-earnings (P/E) ratios to outperform the market. The author identifies various factors that challenge the efficient market hypothesis, suggesting that these anomalies can be exploited by informed investors.
The small firm effect posits that investments in smaller firms, or those with low market capitalization, yield higher returns than investments in larger firms. This market anomaly is a key component in explaining the substantial returns identified in Eugene Fama and Kenneth French's Three Factor Model, which includes market return, high book-to-market value, and small stock capitalization as its three critical factors.
The small firm effect hypothesis suggests that smaller firms have greater growth opportunities than their larger counterparts. Smaller companies typically operate in a more volatile business environment, allowing them to address challenges more flexibly. Additionally, stocks of companies with small market capitalization generally trade at lower prices, while those of larger companies tend to be priced higher.
The January effect refers to the seasonal rise in stock prices observed in January, often attributed to increased stock purchases following a decline in prices during December. This decline is commonly linked to investors selling off stocks to realize losses for tax benefits.
The January effect presents a potential profit opportunity for investors, but its impact has diminished in recent years. This decline can be attributed to several factors, including the tendency for investors to hold assets in tax-advantaged accounts, which lessens the need to sell in order to reduce tax liabilities. Additionally, widespread predictions of rising stock prices in early January may lead to these expectations being priced in earlier, further weakening the effect.
Research also indicates that stock returns can reverse: stocks currently performing poorly often yield high future returns, and vice versa. When reversals occur, prices may shift significantly, potentially resulting in substantial losses or diminished profits for traders. Consequently, trend traders typically exit positions while prices are still favorable, rather than waiting to learn whether a trend is reversing or merely correcting. Reversals can be identified through indicators or price action, yet prices may subsequently resume their previous trend direction.
The efficient market hypothesis remains a contentious topic, as evidenced by various studies. In the following section, the author explores specific research in computer science related to market prediction, highlighting instances where the stock market has demonstrated that this hypothesis does not consistently hold true.
1.1.1.5 Stock market
a. Stock Market Concept
The stock market is a marketplace where individuals buy and sell stocks, reflecting their ownership in various businesses, including both publicly traded and private companies. Private company shares can be sold to investors via crowdfunding platforms. Typically, investments in the stock market are made through stockbrokers and electronic trading platforms.
The stock market operates as a direct financing platform where capital seekers and providers engage without intermediaries. It functions continuously, allowing securities issued in the primary market to be traded repeatedly in the secondary market. This structure enables investors to easily liquidate their securities into cash whenever needed.
The stock market operates similarly to a perfectly competitive market, allowing free entry for all participants without price imposition. Prices are determined by the dynamics of supply and demand between buyers and sellers. This public trading environment fosters transparency in financial transactions, ensuring that all market participants have access to the same information regarding share prices. As a result, traders can engage freely and efficiently in the market.
PREPROCESSING DATA
2.1 Data sources
2.2 The method of combining the content of news
2.3 Assign news labels to prepare data for the training phase
2.4 Eliminate interfering characters
2.5 Eliminate stopwords
2.6 Represent news in vector space
The group collected a total of 69,947 news texts from high-traffic websites, including cafef.vn, vietstock.vn, vnexpress.net, and thanhnien.vn This extensive dataset is utilized to train machine learning models, focusing on essential elements such as the time, title, and content of the news articles.
The study utilized the Beautiful Soup 4 library in Python to extract data from files This powerful library enables users to retrieve all HTML content from web pages, filter the HTML tags, and extract the essential text, which is then converted into a TXT file.
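As an illustration of this extraction step, the following is a minimal sketch rather than the author's actual crawler; the URL, parser choice, and output file name are illustrative assumptions.
import requests
from bs4 import BeautifulSoup

def fetch_article_text(url):
    """Download a news page and keep only its readable text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style tags so only the visible text remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

# Illustrative URL; the actual pages were crawled from cafef.vn, vietstock.vn,
# vnexpress.net and thanhnien.vn.
text = fetch_article_text("https://vnexpress.net/kinh-doanh/chung-khoan")
with open("news.txt", "w", encoding="utf-8") as f:
    f.write(text)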
Table 2.1: Table of data collected from online newspapers
Web | Time period | No. of news articles
SUM | | 69,947
Figure 2.1: Example source of text data from the VnExpress website
Figure 2.2: Diagram of data collection process
After utilizing Python's "Beautiful Soup4" library to collect and process text data, the author moved on to gather numerical data This additional data collection aims to enhance the reliability of the research findings and provide a stronger foundation for testing results.
Therefore, the author collected data on the daily price change (Change) of the VN-Index from the website https://vn.investing.com/, covering the period from 2001 to 2021.
Figure 2.3: Price history table of the VN-Index in the Visual Studio Code editor
This historical data source encompasses key elements such as trading time, opening price, closing price, and volatility trends It is utilized for automatic news categorization, aiding in the preparation of training datasets The data has been tested and is specifically designed for predictive models that rely solely on numeric information.
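To illustrate how this price history can be loaded for the labeling step, here is a small sketch; the file name and the "Date" / "Change %" column labels are assumptions based on the usual export format of vn.investing.com, not the author's exact files.
import pandas as pd

# Assumed file name and column labels from the vn.investing.com export.
prices = pd.read_csv("vnindex_2001_2021.csv")
prices["Date"] = pd.to_datetime(prices["Date"], dayfirst=True)
# Turn the "Change %" strings (e.g. "2.93%") into numbers usable for labeling.
prices["Change"] = prices["Change %"].str.rstrip("%").astype(float)
print(prices[["Date", "Change"]].head())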
2.2 The method of combining the content of news
Numerous studies discussed in the literature review (section 1.2) propose a method that consolidates same-day news into a unified text, yielding favorable predictive outcomes. The author notes that investors frequently seek financial and securities news from diverse sources, with the Vietnamese websites described in section 2.1 being particularly popular among Vietnamese investors.
To enhance the accuracy of the forecast, the author aggregated information from four pages using an automated computer program (see Appendix 4) This news aggregation occurs during both the data preparation phase for training and the subsequent forecasting phase.
The diagram illustrates the process of aggregating articles from the financial and securities categories of the Thanh Nien, VnExpress, Cafef, and Vietstock websites. After downloading the news, the author sorts the articles by their publication dates (see Appendix 4) and combines all news published on the same day from these four sources into a single aggregated document (refer to Appendix 5).
Figure 2.4: Diagram of the method to combine the news into a single document.
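As a sketch of this aggregation step, assuming the articles have already been downloaded and stored with their publication dates; the column names and sample rows below are illustrative only, not the author's data.
import pandas as pd

# One row per downloaded article: publication date, source site, extracted text.
articles = pd.DataFrame({
    "date": ["2021-02-08", "2021-02-08", "2021-02-09"],
    "source": ["vietstock", "cafef", "vnexpress"],
    "content": ["tin thứ nhất ...", "tin thứ hai ...", "tin thứ ba ..."],
})

# Combine everything published on the same day into one aggregated document.
daily_news = (
    articles.groupby("date")["content"]
    .apply(" ".join)
    .reset_index(name="combined_text")
)
print(daily_news)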
2.3 Assign news labels to prepare data for the training phase
After acquiring articles from online news platforms, the author systematically categorized the news content into three distinct levels—good news, medium news, and bad news—using historical data to create a training dataset.
While manual news reading and labeling can be time-consuming and prone to inaccuracies, the author advocates for the use of automatic news labeling to enhance program expansion and practical application in the future.
Figure 2.5: Chart of trend prediction of VN-Index using built-in machine learning model
The way to automatically label the news in the training set is as follows:
The news text of a given day t is taken from the aggregated data file (Appendix 4). The price movement (%) of the following day t + 1 is taken from the "change" column of the VN-Index price history.
The price movement of day t + 1 determines the label assigned to the news data of the previous day t:
+ Positive value of price movement - Label is assigned 0 (Up).
In this case, the previous day's news is interpreted as signaling a rise in the VN-Index, consistent with the positive price movement of the index.
+ Negative value of the price movement - The label is assigned as 1 (Down).
The previous day's news data suggests a decline in the VN-Index value, reflecting a negative trend in price volatility associated with the index.
+ Zero value of the price movement - The label is assigned as 2 (No change). The previous day's news is interpreted as signaling a stable VN-Index, neither increasing nor decreasing, consistent with the zero price movement.
For example, if the price movement on 9/2/2021 is 2.93%, the news data of 8/2/2021 is labeled 0 (Up), i.e. it is identified as positive forecast information.
Table 2.2: Example of news classification by price history
Date of news (t) | Date of price (t+1) | Price movement (Change) | Label assigned to news data
8/2/2021 | 9/2/2021 | +2.93% | 0 (Up)
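As an illustration of this labeling rule, the following is a minimal sketch; the function name is illustrative, while the rule and the example value follow the description above.
def label_from_change(change_pct):
    """Label the news of day t from the VN-Index price change of day t + 1."""
    if change_pct > 0:
        return 0  # Up
    if change_pct < 0:
        return 1  # Down
    return 2      # No change

# Example from the text: the change on 9/2/2021 is +2.93%,
# so the news of 8/2/2021 receives label 0 (Up).
print(label_from_change(2.93))  # -> 0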
2.4 Eliminate interfering characters
To prepare the data for processing, interfering characters must be removed from the news articles. Articles that contain only a headline or an image often produce numerous blank lines (M.K.C., P.K. Sahoo, 2019). The headlines of such articles can still be used, but their empty values, initially represented as "NaN" in the extracted web structure, need to be converted into spaces after the data is filtered. Because the text is unstructured and contains noise such as HTML tags and JavaScript code, these issues must be addressed to ensure accurate processing and to avoid later discrepancies caused by extraneous characters.
To do this, confusing special characters and punctuation marks are removed from the articles so that the remaining content is cleaner and easier to process in the subsequent steps.
+ Remove punctuation marks, numbers, etc.
[Code snippet: a filter function that keeps only Vietnamese letters and removes other characters, applied to the example sentence "Hôm nay cổ phiếu VNI tăng lên".]
Figure 2.6: Example illustrating removing unnecessary characters
+ Convert uppercase letters to lowercase letters:
def text_lowercase(string): return string.lower()
Output: 'hôm nay cổ phiếu vni tăng lên'
Figure 2.7: Example illustrating changing upper case to lower case
+ Split the text into meaningful words:
def tokenize(strings): return word_tokenize(strings, format="text")
tokenize('hôm nay cổ phiếu vni tăng lên')
Figure 2.8: Example illustrating splitting a text/sentence into meaningful words
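Putting these steps together, the following is a minimal sketch of the cleaning pipeline. It assumes the Underthesea library (introduced later, in the modules section) for word segmentation; the regular expression used to drop punctuation and digits is a simplification of the author's character filter.
import re
from underthesea import word_tokenize

def remove_special_characters(text):
    # Simplified filter: keep letters (including Vietnamese diacritics) and spaces,
    # drop punctuation marks and digits.
    return re.sub(r"[^\w\s]|\d", " ", text)

def text_lowercase(text):
    return text.lower()

def tokenize(text):
    # Underthesea joins the syllables of a Vietnamese word, e.g. "chứng_khoán".
    return word_tokenize(text, format="text")

sentence = "Hôm nay cổ phiếu VNI tăng lên!"
print(tokenize(text_lowercase(remove_special_characters(sentence))))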
DEVELOPING TEST PROGRAMS
3.1 Code Editor - Visual Studio Code (Visual Code)
Visual Studio Code is developed by Microsoft, free to download for Windows, Linux and MacOS.
This widely used source code editor offers built-in support for JavaScript, TypeScript, and Node.js, while also featuring a robust ecosystem of extensions for additional programming languages like C++, C#, Java, Python, PHP, and Go, as well as runtime environments such as .NET and Unity.
Visual Studio Code provides rich IntelliSense for imported variables, methods, and modules, along with graphical debugging, linting, multi-cursor editing, parameter hints, and fast code navigation; it also supports refactoring and integrated source control, streamlining the development process.
Python, created by Guido van Rossum and launched in 1991, is a high-level programming language known for its versatility Its design emphasizes readability and simplicity, making it easy for beginners to learn and remember With a clean structure, Python enables users to write code efficiently, often requiring fewer keystrokes.
Additionally, Python has a rich ecosystem of libraries for artificial intelligence and machine learning. Here are some popular libraries and frameworks:
* Keras, TensorFlow, and Scikit-learn for Machine Learning
* NumPy for high-performance scientific computation and data analysis
* SciPy for advanced computing
* Pandas for general-purpose data analysis
* Seaborn for data visualization
Beautiful Soup Library
BeautifulSoup is a powerful Python library designed for extracting data from HTML and XML files. It works with a parser, enabling users to navigate, search, and modify the parse tree. BeautifulSoup can be combined with libraries such as Requests to retrieve pages from remote sources without manually saving HTML files, and with modules that store data in formats such as CSV and JSON.
Using this library, the author can extract the desired content and discard unnecessary information when retrieving HTML from the news websites, in order to prepare the input data for the training stage.
Scikit-learn Library
Scikit-learn (Sklearn) is a leading Python library for machine learning. It offers a comprehensive suite of tools for machine learning and statistical modeling, covering classification, regression, clustering, and dimensionality reduction.
The scikit-learn library relies on supporting packages such as:
* NumPy for multidimensional arrays and matrices
* SciPy for scientific computing functions
* Matplotlib for visualizing data with 2D and 3D graphs
* IPython for an interactive notebook environment
* SymPy for symbolic mathematics
* Pandas for processing and analyzing tabular data
3.2 Introduction of modules
Table 3.1: Explanatory table for modules
Module name | Function explanation
Module for downloading news and prices | Uses the Beautiful Soup library to get content from the HTML tags of the news websites and of Investing.com's quote pages.
Module for word separation | Uses the Underthesea library, a widely used Vietnamese language processing toolkit, for word segmentation. Unlike English, where words are clearly separated, a Vietnamese word may consist of several syllables without distinct boundaries; for example, the word for "stock", "chứng khoán", has two syllables but is a single word. Without segmentation it is very difficult for the computer to separate Vietnamese words and to encode the language in a form it can learn from and process.
Module for processing machine learning algorithms | Uses the Scikit-learn library, which implements a wide range of classical and modern machine learning techniques and allows different models to be tested and compared. Scikit-learn organizes its algorithms into the following groups:
* Clustering: algorithms for grouping unlabeled data, for example the KMeans algorithm.
* Cross-validation: evaluating the performance of supervised models on validation data during training.
* Datasets: built-in standardized data sets, such as the Iris and digits data sets, for efficient training and testing.
* Dimensionality reduction: reducing the number of significant attributes in the data through aggregation, data representation, and feature selection, for example Principal Component Analysis (PCA).
* Ensemble methods: combining multiple learning algorithms to obtain better predictive performance than any of the constituent algorithms alone.
* Feature extraction: defining attributes for image and text data.
* Feature selection: identifying the significant features used to train supervised models.
* Parameter tuning: adjusting algorithm parameters to optimize model performance.
* Manifold learning: algorithms for summarizing and analyzing complex multidimensional data.
* Supervised models: a wide range of machine learning algorithms, including linear models, discriminant analysis, naive Bayes, lazy methods, neural networks, support vector machines, and decision trees.
3.3 Evaluate the model's accuracy
To calculate the accuracy of the model, we define the following:
TU: the number of correct predictions for an uptrend
TD: the number of correct predictions for a downtrend
TN: the number of correct predictions for the no-change trend (neither increase nor decrease)
FU: the number of false predictions for an uptrend
FD: the number of false predictions for a downtrend
FN: the number of false predictions for the no-change trend
% Accuracy = (TU + TD + TN) / (TU + TD + TN + FU + FD + FN) * 100
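As a small sketch directly implementing this formula (the counts below are illustrative only, not the author's results):
def accuracy_percent(tu, td, tn, fu, fd, fn):
    # % Accuracy = correct predictions / all predictions * 100
    correct = tu + td + tn
    total = correct + fu + fd + fn
    return correct / total * 100

# Illustrative counts only.
print(round(accuracy_percent(tu=40, td=35, tn=10, fu=25, fd=20, fn=15), 1))  # 58.6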
3.4 Testing program
[Diagram: the four models (Decision Tree, Random Forest, K-Nearest Neighbor, Support Vector Machine) are first tested on the data set combining Vietstock, VnExpress, Thanhnien, and Cafef; the best model (SVM) is then tested on the Vietstock data set and improved by changing its parameters in turn to obtain the optimal model.]
Figure 3.2: Process diagram of the experimental model
The common data set, after being preprocessed, is grouped into two data sets:
Data set 1: daily data aggregated from all four electronic newspaper sources.
Data set 2: daily data from each individual website.
The above data, after preprocessing, was divided 70:30, with 70% of the data used for training and 30% used to test the model.
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
Figure 3.3: Library call and vectorization, train-test split
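To make the vectorization and training steps referenced in Figure 3.3 concrete, here is a minimal sketch rather than the author's exact program; the toy texts and labels stand in for the aggregated daily news documents and their Up/Down/No-change labels prepared in Chapter 2, and the SVC parameters are scikit-learn defaults.
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Toy stand-ins for the aggregated daily news and their labels (0 Up, 1 Down, 2 No change).
texts = [
    "vn-index tăng mạnh nhờ cổ phiếu ngân hàng",
    "thị trường giảm điểm vì áp lực chốt lời",
    "chứng khoán đi ngang thanh khoản thấp",
    "khối ngoại mua ròng vn-index tăng điểm",
    "cổ phiếu bất động sản giảm sâu",
    "vn-index đứng giá trong phiên cuối tuần",
]
labels = [0, 1, 2, 0, 1, 2]

# Represent each daily news document as a TF-IDF vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=1)

model = SVC(kernel="rbf")   # default SVM, as in the first round of tests
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))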
Decision Tree, Random Forest, K-Nearest Neighbor, and Support Vector Machine models were tested in the first round on data set 1.
The model with the highest accuracy is selected as the optimal model and tested again in the second round. The results are as follows:
Figure 3.4: Test results on four models
The Support Vector Machine (SVM) model achieved the highest accuracy rate of 52.8%. In subsequent tests, the author applies the SVM model to forecast the VN-Index, which represents the Vietnamese stock market.
After observing unsatisfactory results from the initial trial, the author proceeded to conduct a second test by applying the SVM model to the second dataset.
The research utilizes a data set comprising financial and economic news from prominent Vietnamese websites, including Vietstock, VnExpress, Thanhnien, and Cafef, to evaluate the data quality of each source This data set serves as the foundation for enhancing the research model.
The analysis reveals that Vietstock performs best, with a success rate of 55.87% based on a sample of 1,274 news articles. Consequently, the author selected financial and economic articles from the Vietstock website as the input data for the model.
The standard Support Vector Machine (SVM) aims to identify the margin that separates positive and negative examples, but it can produce suboptimal models when examples are mislabeled or atypical. To address this issue, Cortes and Vapnik introduced the "soft margin" SVM in 1995, which permits certain examples to be ignored or to lie on the wrong side of the margin. This typically results in a better overall model fit.
Therefore, the author enhances the test program results by making 2 parameter changes as follows:
C and Gamma are the parameters of the nonlinear Support Vector Machine (SVM) with the Gaussian radial basis function (RBF) kernel.
C is the parameter of the soft-margin cost function, which controls the influence of each individual support vector; it sets the penalty for misclassification errors and thus trades off training error against margin size.
* C = ∞: no deviation is allowed, i.e. a hard margin.
* Large C: allows only small deviations, resulting in a small margin.
* Small C: allows large deviations, resulting in a large margin.
Gamma is a crucial hyperparameter of the nonlinear Support Vector Machine (SVM) when the radial basis function (RBF) kernel is used. The gamma parameter of the RBF kernel determines how far the influence of a single training example reaches, and it significantly affects the model's performance and decision boundary.
A low Gamma value indicates a broad similarity radius, causing numerous data points to cluster together, while a high Gamma value requires points to be in close proximity to be classified within the same group.
The test group varies the model's two main parameters:
C: 1, 10, and 100
Gamma: from 0.0001 to 1
with the kernel set to 'rbf'.
The results are as follows:
Table 4.4: Table of improved test results
C=1 | gamma=0.1 | kernel='rbf' | 58.1%
C=1 | gamma=0.01 | kernel='rbf' | 57.1%
C=1 | gamma=0.001 | kernel='rbf' | 57.1%
C=10 | gamma=0.1 | kernel='rbf' | 55.2%
C=10 | gamma=0.01 | kernel='rbf' | 60.1%
C=10 | gamma=0.001 | kernel='rbf' | 57.1%
C=10 | gamma=0.0001 | kernel='rbf' | 57.1%
C=100 | gamma=1 | kernel='rbf' | 56.3%
C=100 | gamma=0.1 | kernel='rbf' | 56.0%
C=100 | gamma=0.01 | kernel='rbf' | 56.0%
C=100 | gamma=0.001 | kernel='rbf' | 60.1%
C=100 | gamma=0.0001 | kernel='rbf' | 57.1%
The best accuracy achieved after this tuning is 60.1%.
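As a sketch of this parameter sweep (not the author's exact code), the loop below reuses the X_train, X_test, y_train, y_test variables from the TF-IDF/SVM sketch in section 3.4; the C values 1, 10, and 100 are read from Table 4.4, where the garbled "C0"/"C00" entries are interpreted as 10 and 100.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Parameter ranges taken from the description and Table 4.4 above.
c_values = [1, 10, 100]
gamma_values = [1, 0.1, 0.01, 0.001, 0.0001]

best_params, best_acc = None, 0.0
for c in c_values:
    for gamma in gamma_values:
        model = SVC(C=c, gamma=gamma, kernel="rbf")
        model.fit(X_train, y_train)            # split from the earlier TF-IDF/SVM sketch
        acc = accuracy_score(y_test, model.predict(X_test))
        if acc > best_acc:
            best_params, best_acc = (c, gamma), acc
print("Best (C, gamma):", best_params, "accuracy:", best_acc)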
Following the enhancements to the SVM model, the author obtained more viable outcomes than in the initial test. These findings indicate that financial and securities news from electronic newspapers frequently read by the Vietnamese public significantly influences the VN-Index stock price.
The stock market is characterized by constant volatility, influenced by various factors including investor sentiment and domestic news. Global economic conditions and international stock market information also play a significant role. Even after positive news, stock price indices can still decline the following day. Currently, the ongoing effects of the pandemic on economies worldwide, including Vietnam's, contribute to a sensitive and ever-changing market environment.
The author emphasizes that the results presented are modest and serve primarily as a reference for managers and investors in the Vietnamese stock market Additionally, the research aims to provide readers with fresh insights and innovative approaches to data analysis and stock market evaluation.
The rapid advancement of Information Technology has significantly transformed various fields, including economics and finance, by facilitating global data collection through sophisticated computer systems and transmission networks As a result, the volume of available information has surged, with numerous newspapers and websites providing updates on an hourly basis For stock market investors to gain a comprehensive understanding of market dynamics, it is essential to sift through and classify this vast amount of information Traditional methods of data processing and classification have become inadequate; however, the application of Machine Learning and text-mining techniques has proven to be highly effective in automating the classification of information, enabling investors to navigate the complexities of the market more efficiently.