Creating a Sentiment Analysis System to Process Customer Feedback from an E-commerce Site
INTRODUCTION
Research Background
In the era of technology and digital transformation, understanding customer feedback is crucial for e-commerce businesses to enhance their service quality and maintain competitive advantages. Sentiment analysis of customer feedback allows businesses to gain insights into customer satisfaction and areas needing improvement. This research focuses on developing a sentiment analysis system to process customer feedback from Shopee, a leading e-commerce platform. By leveraging modern technologies such as RESTful APIs and Apache Kafka, the research aims to create a robust and efficient system to classify customer feedback into positive, neutral, and negative sentiments.
Objectives of the Research
The primary objectives of this research are to:
• Develop a sentiment analysis system to process customer feedback from an e-commerce site
• Utilize Shopee's RESTful API to collect customer comments and ratings for a specific product
• Implement a data streaming pipeline using Apache Kafka to handle real-time data processing
• Build and evaluate a sentiment analysis model to categorize feedback as positive or negative
The research aims to provide a comprehensive solution that not only automates the sentiment analysis process but also delivers actionable insights to improve customer satisfaction.
Scope and Limitations of the Research
The scope of this research includes:
• Collecting and processing feedback data for a specific product on Shopee
• Utilizing Apache Kafka for real-time data streaming and processing
• Building and testing a sentiment analysis model on the collected dataset
However, there are several limitations:
• Data scope: Limited to feedback from one specific product on Shopee
• Model accuracy: Dependent on the quality and quantity of the collected data
• External factors: Influences such as language and context of feedback that may not be fully addressed by the model
RESTFUL APIs AND KAPPA ARCHITECTURE IN DATA STREAMING
What is a RESTful API?
REST (representational state transfer) is an architectural style for developing services that can be accessed across several platforms and contexts to promote interoperability on the World Wide Web (Wu Chou; Wei Zhou; Min Luo, 2008). This architecture has two important characteristics: statelessness and cross-platform consumption readiness. It has become a widely accepted, standardized method of publishing services via the Internet (Pahl, C., & Jamshidi, 2016). REST application programming interfaces (APIs) are commonly used in the construction of microservices (Khare, R.; Taylor, R., 2004). Research has also been conducted to extend the REST architecture to distributed systems (Pautasso, Zimmermann, Leymann, 2008).
RESTful APIs, or web APIs, are made up of endpoints. Every endpoint represents a concretely realized functionality of a business process. These APIs are normally available over HTTP (hypertext transfer protocol) and support standard verbs such as GET, POST, PUT, and DELETE. RESTful APIs are invoked using a URI (uniform resource identifier).
One of the issues was defining a uniform message format (request and response). Initially, REST APIs were described using informal text (Ed-Douibi, H.; Izquierdo, J.L.C.; Cabot, 2018). JSON (JavaScript Object Notation) documents subsequently emerged as a standard. The format is pure text, which is easily recognized and processed by machines across networks and devices. The challenge of developing a common method for describing REST services remains. The OpenAPI specification (Ed-Douibi, H.; Izquierdo, J.L.C.; Cabot, 2018) is emerging as one answer to that problem.
Any of the verbs listed above can be used to invoke a REST API endpoint. For example, if there is an endpoint that returns a list of products from an online shopping website, it can be called using the GET verb, indicating that we wish to retrieve some data as a list of instances or a single instance. The request is called a GET request and may include some arguments. Data can also be inserted using a POST request, deleted with a DELETE request, and modified with a PUT request. To access any REST API endpoint, create a URL containing the address and any necessary parameters, then mark it correctly with one of the verbs. The POST verb is used to convey data as part of the request body rather than embedding it in the URL. This keeps the data out of the URL, which is preferable for sensitive information such as credentials. When a user signs in, the username and password are sent to a REST API endpoint via a POST request.
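As a concrete illustration, the following Python sketch shows a GET request with URL parameters and a POST request carrying credentials in the request body. The endpoint and field names are hypothetical, chosen only to mirror the description above.

```python
import requests

BASE_URL = "https://api.example-shop.test"  # hypothetical endpoint for illustration

# GET: retrieve a list of products, passing paging arguments in the URL
resp = requests.get(f"{BASE_URL}/products", params={"limit": 20, "offset": 0})
products = resp.json()

# POST: send credentials in the request body so they do not appear in the URL
resp = requests.post(
    f"{BASE_URL}/login",
    json={"username": "alice", "password": "secret"},
)
token = resp.json().get("access_token")
```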
In this section, we learned about RESTful APIs and their importance in software development. RESTful APIs are not only a functional and efficient approach to interacting with online services but also the basis of a platform for application integration and extension. We have seen that RESTful APIs provide an easy way to send and receive data over the HTTP protocol, helping to connect applications and services efficiently. At the same time, using RESTful APIs also provides security, performance, and maintainability benefits as applications change and extend.
In the next part of the thesis, we will apply Shopee's RESTful API to collect feedback data from customers, better understand how this API works, and use it to build a sentiment analysis system from customer feedback on e-commerce sites.
The role of RESTful APIs in developing the system
In this thesis, RESTful APIs play an important role in retrieving data for analysis. Specifically, they are used to get data from customer feedback on Shopee pages. It is important to note that Shopee is an extremely large e-commerce platform on which many big brands are concentrated, so security is critical for the platform, and restricting access to its APIs is understandable. A few limitations are worth mentioning:
- Rate limits: Shopee imposes limits on the number of requests or the amount of data accessible from its API, which makes accessing the API more difficult
- Access authentication: To access Shopee's API, you must authenticate with information such as an API key and access token. This authentication process can be complex and requires an understanding of how the API works
- Frequent changes: Companies often update and tune their APIs to improve performance and security, resulting in changes to the structure or authentication requirements and causing difficulties for those using the APIs
In this thesis, APIs are used to retrieve comment data, including username, rating, and comment. The following is the method I used to call the APIs in this thesis:
- The code in this thesis uses the Python language to call the APIs, because Python is a popular language with strong support for calling APIs and building models. However, a limitation of using Python is that some websites, especially the e-commerce platform Shopee, often render product links with JavaScript, making it difficult to obtain URLs
- The page I intend to call APIs for is https://shopee.vn/sofirnlight.vn#product_list. Because of Shopee's security, I cannot use Python's Selenium library to automate collecting product page links on Shopee. Instead, I use JavaScript code to get all the links on that product page and save them in an HTML file for ease of calling the API for all products.
Figure 2.1 JavaScript code to fetch the product links in the product list
- Required parameters: The API requires several parameters, including filter, flag, itemid, limit, offset, shopid, and type. These parameters determine the type of data to retrieve and paging, if any. Within the limits of the thesis, the parameters listed above basically meet the requirements of the problem
Below is a diagram of the process for calling the APIs used in the thesis:
Figure 2.2 Process for fetching data from Shopee's APIs
In this section, we learned about the process of getting feedback data from Shopee using its RESTful API. Shopee is one of the most popular e-commerce platforms, providing a powerful API that allows developers to access product information, comments, and reviews from the user community. However, to get a better source of APIs and deeper access, we must become an official agent of Shopee and register for a business API package.
We have seen that pulling data from Shopee can bring many benefits, including better understanding customer opinions about products, detecting shopping trends, and improving business strategy. However, getting data from Shopee can also face challenges such as restrictions on the number of requests or limits on the data accessible from the API.
The source of the APIs and how to crawl them
The Python script employs a data extraction technique from Shopee via reverse engineering. This involves dissecting and re-creating network requests to obtain data directly, bypassing Shopee's official user interface. The methodology for this technique is detailed at https://apify.com/marc_plouhinec/shopee-api-scraper#api-reference, where the APIs are meticulously examined and reconstructed. The script starts by setting up a Kafka producer to stream data to the my-stream topic, and then it reads a prepared HTML file (list_of_links.html), which contains links to various Shopee product pages. Using BeautifulSoup, it parses these URLs and retrieves the shop_id and item_id using regular expressions (re.search). For each product link, the fetch_and_produce_ratings function dynamically generates an API URL (ratings_url) for getting ratings data from Shopee's API. HTTP requests are made via requests.get() using special headers (User-Agent, Accept, and Content-Type) to ensure compatibility and authorization. JSON replies are processed to extract ratings information such as author_username, rating_star, and comment, which are then prepared and transmitted to Kafka. Robust error handling is used to handle HTTP failures and missing data fields gracefully. The script concludes by flushing the Kafka producer to guarantee all messages are successfully transmitted. This automated approach enables efficient extraction and integration of Shopee product ratings for applications in data analysis and real-time monitoring.
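The sketch below reconstructs the flow described above. It is illustrative rather than the exact thesis code: the ratings URL pattern follows the public write-up linked above, and the link format assumed inside list_of_links.html ("...-i.<shop_id>.<item_id>") is an assumption.

```python
import re
import json
import requests
from bs4 import BeautifulSoup
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "application/json",
    "Content-Type": "application/json",
}

def fetch_and_produce_ratings(shop_id: str, item_id: str) -> None:
    # Ratings endpoint pattern as documented in the public scraper write-up;
    # the query parameters select the data type and control paging.
    ratings_url = (
        "https://shopee.vn/api/v2/item/get_ratings"
        f"?filter=0&flag=1&itemid={item_id}&limit=20&offset=0&shopid={shop_id}&type=0"
    )
    try:
        resp = requests.get(ratings_url, headers=HEADERS, timeout=10)
        resp.raise_for_status()
    except requests.RequestException as exc:
        print(f"Request failed for item {item_id}: {exc}")
        return
    for rating in resp.json().get("data", {}).get("ratings", []) or []:
        producer.send("my-stream", {
            "author_username": rating.get("author_username"),
            "rating_star": rating.get("rating_star"),
            "comment": rating.get("comment"),
        })

# Parse the prepared HTML file and extract shop_id / item_id from each link.
with open("list_of_links.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

for a in soup.find_all("a", href=True):
    # Shopee product URLs end in "...-i.<shop_id>.<item_id>" (assumed format).
    match = re.search(r"i\.(\d+)\.(\d+)", a["href"])
    if match:
        fetch_and_produce_ratings(match.group(1), match.group(2))

producer.flush()  # ensure all messages are delivered before exiting
```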
Through this section, we have gained an overview of how to get data from Shopee through its RESTful API, and we will continue to apply this knowledge to building a sentiment analysis system from customer feedback in the next parts of the thesis.
Constructing the Kappa architecture
In the modern technological world, real-time data processing and analysis has become an urgent requirement for many applications and systems. Kappa Architecture, one of the cutting-edge data architectures, is designed to process continuous data streams and provide efficient real-time analytics. Proposed by Jay Kreps, a data architect at LinkedIn and one of the original Kafka authors, Kappa Architecture focuses on simplifying data processing by eliminating the complexity of maintaining two separate systems for batch data and streaming data, as in the Lambda Architecture model.
Kappa Architecture relies heavily on a single stream processing system, using a continuous stream of data instead of individual batches of data. Apache Kafka is one of the main tools used in Kappa Architecture, thanks to its real-time data processing capabilities, high fault tolerance, and easy scalability. By using Kafka, applications can collect, store, and process data with low latency, quickly responding to user and system requests.
As an example, the proposed design utilizes the Kappa architecture for data streaming and has a streaming layer and a serving layer dedicated to serving other layers, as shown in Figure 3.1. The information is obtained from Shopee and fed into the streaming layer. Kafka receives the raw data, processes it, and transforms it into a structured, organized data set before it is utilized by consumers. After a consumer fetches the messages from the my-stream topic, it saves the data as data4.csv and sends it to the sentiment analysis model for analysis.
Figure 3.1 Flowchart for visualizing Shopee streaming data
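A minimal consumer sketch matching this flow, assuming a local broker, JSON-encoded records on the my-stream topic, and the data4.csv file name mentioned above:

```python
import csv
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "my-stream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=10000,  # stop when no new messages arrive for 10 s
)

with open("data4.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["author_username", "rating_star", "comment"])
    writer.writeheader()
    for message in consumer:
        writer.writerow(message.value)  # each record was produced as a JSON dict

# data4.csv is now ready to be fed to the sentiment analysis model
```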
What is Kafka?
Apache Kafka functions as a distributed messaging platform that enables the publishing and subscribing of substantial amounts of data with minimal delay. Unlike conventional message queues that manage high message volumes by employing several processors per topic, where only one recipient can receive each message at a time, Kafka facilitates efficient data transmission. Both Kafka and publish/subscribe systems share the common goal of transferring information from producers to consumers; however, message queues can only deliver each message on a topic to a single consumer. Today, in the context of big data, information must be transmitted to multiple systems, including batch and stream processing systems, while maintaining low latency. We deploy Apache Kafka on Docker; this containerized deployment design makes the system easy to deploy and portable. Figure 3.2 shows how it works in Docker.
Figure 3.2 The layers in the Docker container
To meet all the obligations above, Apache Kafka has features such as:
• Multiple consumer distribution. Apache Kafka is a publish/subscribe system that connects various clients and consumers to messages. This feature's integration with several technologies, like Apache Hadoop, Apache Storm, and TensorFlow, makes it very useful
• Efficient message delivery rate. This is achieved by utilizing several features: (1) message set abstractions that cluster messages together, minimizing the overhead of network round trips; (2) a binary message format, which allows data chunks to be transferred without modification; and (3) zero-copy optimizations, which prevent multiple copies of the page cache. One important feature is the Kafka consumer group, which allows for message delivery among a cluster of clients maintained by Apache Kafka, similar to message queues
In Kafka, topics are communication streams that producers can post to and consumers can subscribe to. Unlike other distributed queue systems, Kafka stores messages on disk with a flexible retention policy, enabling components to access them at a later time. The distributed log provides consumers with the ability to read the log whenever necessary. This capability is useful for ML training in Kafka-ML, since it allows for simultaneous processing of all data; if a fault occurs during the procedure, the consumer can resume without losing data or storing it in a file system. The distributed log in Kafka provides a durable and scalable way to store and access streams of records, which is particularly advantageous for machine learning workflows where continuous data processing is required. The ability to retain messages on disk with a customizable retention policy means that data can be reprocessed if necessary, and it also supports long-term storage for analysis. If an error occurs during processing, Kafka's architecture allows for the recovery and continuation of the workflow without data loss, which would be more challenging in systems that require data to be stored in traditional file systems. Moreover, Kafka's architecture supports high availability and fault tolerance, ensuring that even if some components fail, the system can recover and continue to operate. This resilience is crucial for maintaining uninterrupted data streams, particularly in real-time applications where delays or data loss could have significant impacts. Kafka's distributed nature also allows it to scale horizontally, adding more brokers to handle increased data loads, which is essential for large-scale data processing tasks such as those found in machine learning training.
Partitioning topics allows for load balancing and fault tolerance, with each partition having multiple replicas. Partitions divide the log into smaller segments to distribute the load, and topic replicas provide fault tolerance. Apache Kafka clusters consist of broker networks operating on a peer-to-peer basis, managing both partitions and replicas. Employing a consumer group enables efficient data distribution by allocating specific partitions to designated clients. Apache Kafka has customizable QoS policies for message dispatching, including "at most once", "at least once", and "exactly once". Figure 3.3 describes the Kafka model in detail:
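As a small illustration of partitions and replicas, the sketch below creates a topic with the kafka-python admin client; the broker address, partition count, and replication factor are assumptions suited to a single-broker Docker setup:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Three partitions spread the load across consumers; replication_factor=1
# suits a single-broker Docker setup (use >1 on a multi-broker cluster).
admin.create_topics([
    NewTopic(name="my-stream", num_partitions=3, replication_factor=1)
])
admin.close()
```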
Apache Kafka has become the preferred method for connecting systems, consuming data, and disseminating information due to its widespread adoption and interoperability with many cloud computing technologies.
Adopting Kafka in the Kappa architecture
Applying Kafka within Kappa Architecture provides a powerful and efficient method for real-time data processing and analysis. Kappa Architecture, unlike its predecessor Lambda Architecture, simplifies data flow by focusing only on processing streaming data, thereby eliminating the need for separate real-time and batch processing layers. This approach takes full advantage of Kafka's capabilities to manage and process continuous data in a flexible, scalable, and fault-tolerant manner.
In the first stage, Kafka acts as the central hub where all input data streams are received. Producers, which can be different data sources such as IoT devices, web logs, user interactions, or any application that generates data, publish records to Kafka topics. Kafka's ability to handle high-volume input data makes it an ideal choice for gathering large amounts of real-time data efficiently.
Processing Streaming Data with Kafka Streams:
Once the data has been ingested into Kafka, the next step is to process these data streams in real time. This is where Kafka Streams, a powerful data processing library for Kafka, comes into play. Kafka Streams enables the transformation, aggregation, and enrichment of live data streams in Kafka. Using Kafka Streams, we can build complex data processing pipelines that handle tasks like filtering, combining, and aggregating data in real time.
For example, in an e-commerce scenario, user interactions with the website (such as product views, clicks, and purchases) are ingested into Kafka topics. Using Kafka Streams, these interactions can be processed to generate real-time insights such as trending products, customer behavior analysis, and personalized recommendations. The real-time nature of Kafka Streams ensures that these insights are always up to date and can be acted upon immediately.
One of the key advantages of using Kafka in Kappa Architecture is its built-in fault tolerance and scalability. Kafka is designed to be distributed and can replicate data across multiple brokers, ensuring that there is no single point of failure that can disrupt data flow. In case a broker goes down, Kafka automatically balances the load and ensures data remains available. This resilience is key to maintaining continuous data processing without interruption.
Additionally, Kafka's horizontal scalability allows the system to grow easily with increasing data volumes. By adding more brokers, Kafka can process more data and provide higher throughput, which is important for large-scale applications that require real-time data processing.
Data Storage and Query Layer:
After processing the data streams, the results need to be stored and queried efficiently. In Kappa Architecture, this is managed by the Data Query Layer (Serving Layer). Processed data from Kafka Streams can be written to different storage solutions such as Elasticsearch, Cassandra, or HDFS, depending on the specific requirements of the application.
For example, in an e-commerce scenario, aggregated data such as sales statistics, user activity logs, and recommendation models can be stored in Elasticsearch. This enables fast and efficient queries for real-time dashboards and analytics. Users can query this data to gain insights, run reports, and make data-driven decisions without delay.
Integrating Kafka into Kappa Architecture also facilitates seamless machine learning (ML) workflows. Data streams can be used to continuously train and update machine learning models, ensuring that they always learn from the latest data. Kafka's ability to retain data on disk allows ML pipelines to reprocess historical data as needed, improving model accuracy and robustness.
For example, user interaction data ingested into Kafka can be used to train real-time recommendation algorithms. As new data is brought in, the models are updated, providing more accurate and relevant suggestions. This continuous learning loop ensures that the system quickly adapts to changing user behaviors and preferences.
In summary, adopting Kafka in Kappa Architecture provides a powerful and efficient solution for real-time data processing. Kafka's strengths in data ingestion, stream processing, fault tolerance, scalability, and integration with machine learning make it an indispensable component in building robust and responsive data pipelines. By leveraging Kafka in Kappa Architecture, we can achieve continuous, real-time data processing and analysis, allowing businesses to make timely and informed decisions based on the latest data.
Building streaming data flows with Kafka
Kafka Streams (M.J. Sax, G. Wang, M. Weidlich, J.-C. Freytag, 2018) is a Java-based library designed to streamline the development of real-time processing applications that interact with data streams stored in Apache Kafka, focusing on handling input and output within Kafka Streams applications. These inputs and outputs are called topics.
Kafka Streams provides abstractions over the data in the form of streams and tables (aggregations of data), which are intended for the development of streaming and microservice-based applications. Faust (Faust - Python Stream Processing, 2021) is a free software library designed for stream processing, inspired by Kafka Streams but implemented in Python. Similar to Kafka Streams, Faust supports tasks such as data stream processing, windowing, and aggregations. Its interface is less verbose compared to Kafka Streams, allowing for the creation of streams and applications with minimal lines of code. While not directly incorporating ML/AI technologies, these streaming libraries rely on Apache Kafka for distribution, similar to Kafka-ML.
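To illustrate how concise a Faust application can be, here is a minimal sketch that consumes records from a topic and prints a placeholder label; the topic name, broker address, and labeling rule are assumptions, not the thesis code:

```python
import json
import faust

app = faust.App("feedback-processor", broker="kafka://localhost:9092")
feedback_topic = app.topic("my-stream", value_type=bytes)

@app.agent(feedback_topic)
async def process(stream):
    # Each event is one customer feedback record from the topic.
    async for raw in stream:
        record = json.loads(raw)           # records are JSON dicts (see producer sketch)
        text = record.get("comment") or ""
        # Placeholder rule; a real system would call the trained model here.
        label = "positive" if "tốt" in text else "unknown"
        print(f"{label}: {text[:60]}")

if __name__ == "__main__":
    app.main()  # run with: python app.py worker
```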
In building a sentiment analysis system from customer feedback, the next task is designing an effective data flow. Apache Kafka, a real-time data streaming platform, was chosen to ensure continuous and reliable data collection, transmission, and processing. The process of building a streaming data flow with Kafka includes the following main steps:
1. Collecting Data from Shopee's RESTful API:
First, the system collects data from Shopee's RESTful API. An API (Application Programming Interface) is a set of definitions and protocols that different software components can use to communicate with each other. In this case, a RESTful API is an API design approach that uses standard HTTP methods, making it easy to interact with web services.
Customer feedback, including comments and product reviews, is received continuously. Producers, the components responsible for sending data into Kafka, send these data records to specific topics in Kafka. Topics in Kafka are channels or streams of data through which data is written and read. Each topic corresponds to a data stream and can be configured to store data with a custom retention period, ensuring that data can be retrieved when needed.
For example, when a customer posts a comment or review about a product on Shopee, an HTTP request is sent to Shopee's API. The response data is then received by the producer in the Kafka system and sent to a topic in Kafka. This helps centralize and standardize the data, preparing it for further processing.
2. Processing Streaming Data with Kafka Streams:
After data is received into Kafka topics, the next step is to process these data streams in real time using Kafka Streams. Kafka Streams is a powerful Kafka library that allows performing data transformations, aggregations, and enrichments directly on data streams. With Kafka Streams, we can perform tasks such as filtering data, combining multiple data streams, and calculating aggregate metrics.
For example, customer comments and reviews can be analyzed to isolate important keywords, determine user sentiment (positive, negative, or neutral), and calculate aggregate metrics such as the average rating score in real time. These processing results can then be transferred to other storage or analysis systems for further exploitation.
A specific case might be to build a pipeline in Kafka Streams that reads from a topic containing feedback data, applies transformations to normalize the data, and then uses natural language processing (NLP) algorithms to assign sentiment labels to each feedback record.
3. Storing and Querying the Results:
Results from data stream processing are stored in data storage systems such as HDFS (Hadoop Distributed File System), Elasticsearch, or other NoSQL databases. HDFS is a distributed file system designed to run on commodity hardware, providing efficient storage and retrieval of large data sets. Elasticsearch is a powerful search and analytics engine, commonly used to query and analyze real-time data. NoSQL databases are non-relational databases, optimized for storing and querying flexibly structured data.
This allows data to be queried and analyzed quickly and efficiently. After being analyzed and labeled with sentiments, data records are exported as CSV files, making them easy to store and use for machine learning models later.
For example, processed data can be written to Elasticsearch, making it easy to execute complex queries and providing intuitive dashboards for real-time data analysis. This allows managers and analysts to track customer sentiment and trends quickly and effectively.
4. Integrating Machine Learning:
A crucial aspect of this system involves utilizing processed data for training machine learning models. Machine learning, a subset of artificial intelligence (AI), focuses on developing algorithms and methodologies that enable computers to learn from data and enhance performance progressively. These machine learning models are intended to forecast customer sentiment based on their feedback. Data extracted from Kafka and stored in CSV files serves as the input for these models, aiding in their learning from real-world data and enhancing prediction precision.
The integration of machine learning encompasses essential stages such as data preprocessing, feature selection, model training, and performance assessment. Popular machine learning algorithms like Logistic Regression, Support Vector Machine (SVM), and deep learning models are employed to construct sentiment prediction models. The system maintains the capability to continuously update these models with new data, ensuring ongoing improvements in prediction accuracy over time.
5. Ensuring Scalability and Fault Tolerance:
Kafka is designed to scale easily and be fault tolerant. The system can be expanded by adding more brokers to handle increased data traffic, and data is replicated across multiple brokers to ensure no data loss in case of failure. Brokers in Kafka are servers responsible for receiving, storing, and sending data. This ensures that data flow is always maintained and that the system can continue to operate even if part of the system fails.
For example, if one broker in a Kafka cluster fails, the remaining brokers continue to operate without interruption. This feature is important for systems that require high reliability and continuous operation, such as systems that analyze sentiment from customer feedback.
APPLYING MACHINE LEARNING IN SENTIMENT ANALYSIS
What is sentiment analysis?
The rise of social media platforms has led to the emergence of various fields dedicated to analyzing these networks and their content to extract pertinent information. Sentiment analysis aims to discern the emotions conveyed by text based on its content. Situated within the broader field of natural language processing (NLP), sentiment analysis holds significant importance in decision-making processes influenced by public opinion. While early efforts in sentiment analysis have existed for some time, its relevance persists well into the new millennium.
Various practical applications necessitate sentiment analysis for comprehensive examination. For instance, it is employed in product analysis to identify which components or attributes of a product resonate with buyers in terms of quality. In a study referenced as (Subhashini, L. D. C. S., Li, Y., Zhang, J., Atukorale, A. S., & Wu, 2021), the authors present findings from a thorough evaluation of current literature on opinion mining. This study details methods for extracting textual features from opinions that may contain noise or ambiguity, articulating sentiment in opinions, and categorizing them.
Additionally, (Mowlaei, M. E., Abadeh, M. S., & Keshavarz, 2020) proposes adaptive aspect-based lexicons for sentiment classification. The authors suggest developing two dynamic lexicons using statistical and genetic algorithms to enhance sentiment classification based on various aspects. (Naresh Kumar, K. E., & Uma, 2021) highlight that dynamic lexicons enable more precise grading of context-specific ideas through automated updates.
Reviews were categorized using lexicons sourced from various dictionaries. Sentiment analysis has found application across diverse industries such as hospitality, aviation, healthcare, and financial markets (Zvarevashe K, Olugbara, 2018). In the hospitality sector, sentiment analysis is utilized to gain insights into consumer preferences and dislikes in hotel reviews. Valencia et al. (2019) employ sentiment analysis of market sentiment to forecast developments in stock markets and cryptocurrencies. (Ahmad S, Asghar MZ, Alotaibi FM, 2019) investigate sentiment in tweets across different domains.
In healthcare, sentiment analysis is increasingly used for analyzing customer opinions and satisfaction (Rufer N, Knitza J, Krusche, 2020). The business sector also utilizes sentiment analysis for various purposes including reputation management, market research, competitive analysis, product evaluation, and customer feedback analysis.
Sentiment analysis and natural language processing encounter various challenges such as informal writing styles, sarcasm, irony, and language-specific nuances. Words across different languages often carry context-dependent meanings and orientations. Accessible resources for all languages remain limited. Recently, scholars have focused on addressing challenges like identifying sarcasm and irony in texts, making significant advancements in this area. Numerous hurdles exist in sentiment analysis, which we explore in this chapter along with various approaches, applications, and algorithms used in the field. Our analysis includes comparative data presented through tables, charts, and graphs for clarity.
As far as we know, most existing studies tend to overlook a range of sentiment analysis methodologies in favor of machine learning, transformer learning, and lexicon-based approaches. While this study encompasses all these methods, it diverges from prior research by specifically emphasizing the most commonly employed strategies. Other surveys may examine sentiment analysis within specific domains, across multiple tasks, or focus on particular subjects like product reviews. In contrast, this chapter takes a comprehensive approach to sentiment analysis, addressing various perspectives including challenges, applications, tools, and methodologies.
This overview is valuable for researchers and newcomers alike, as it offers a thorough treatment of the subject within a single resource. The primary contributions of the survey are outlined as follows:
• Studies have been undertaken to define sentiment analysis and identify prevalent technologies utilized for this task
• Various approaches are examined to determine the most suitable option for a specific application
• Well-known methodologies for sentiment analysis are classified and outlined, encompassing machine learning, lexicon-based analysis, and hybrid approaches
• The advantages and drawbacks of sentiment analysis are summarized to align with contemporary research trends
• The approaches' advantages and disadvantages are compared one by one to choose the best sentiment analysis method for the task
Applying sentiment analysis goes beyond simple classification and opens up many opportunities to explore and exploit data more deeply. From anticipating consumer trends to optimizing business strategies, sentiment analysis provides a powerful tool for businesses to gain a competitive advantage. Overall, sentiment analysis plays a central role in making smart business decisions, based on deep and practical insights from consumer feedback.
Figure 4.1 Levels of sentiment analysis
Machine learning methods and techniques are applied to classify emotions from text.
Sentiment analysis may be done using three approaches: lexicon-based, machine learning, or hybrid. Researchers aim to improve task accuracy and reduce computing costs.
Figure 4.2 illustrates the different methodologies used for sentiment analysis.
Figure 4.2 Approach of sentiment analysis
Lexicons consist of a set of tokens, each assigned a predefined score that indicates its sentiment, whether neutral, positive, or negative (Kiritchenko, S., Zhu, X., & Mohammad, SM, 2014). These scores are typically assigned based on polarity, represented as +1, 0, or -1 for positive, neutral, and negative sentiments respectively. Alternatively, scores can reflect intensity, ranging over [-1, +1], where +1 denotes highly positive and -1 denotes highly negative sentiment. In this method, the scores of each token within a review or text are aggregated: positive, negative, and neutral scores are calculated separately and then summed.
The text is segmented into individual word tokens, with each token's polarity determined based on the highest sentiment score. The lexicon-based approach proves highly effective for sentiment analysis at both sentence and feature levels. Since it operates without requiring training data, it can be considered an unsupervised method. However, a significant drawback of this approach is its dependency on specific domains, where words may have multiple meanings or contexts. Consequently, a term considered positive in one domain may be negative in another. For instance, the word "small" in the sentence "The TV screen is too small" conveys negativity because larger screens are preferred, whereas in "This camera is extremely small," it conveys positivity due to portability. Addressing this issue involves developing domain-specific sentiment lexicons or adapting existing language resources.
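A toy sketch of the lexicon-based scoring just described; the lexicon entries and their polarity values are invented for illustration:

```python
# Toy sentiment lexicon: polarity scores in [-1, +1] (entries are illustrative)
LEXICON = {"good": 1.0, "great": 1.0, "small": -0.5, "bad": -1.0, "slow": -0.5}

def lexicon_score(text: str) -> str:
    tokens = text.lower().split()
    # Sum the polarity of every token found in the lexicon
    total = sum(LEXICON.get(tok, 0.0) for tok in tokens)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

print(lexicon_score("the screen is too small"))  # negative
print(lexicon_score("great camera and good price"))  # positive
```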
Machine learning algorithms can be employed for sentiment classification tasks. Sentiment analysis entails identifying and evaluating sentiment expressed in text or audio through natural language processing, text analysis, computational linguistics, and related methodologies. There exist two primary machine learning methodologies for sentiment analysis:
This task can be accomplished using both supervised and unsupervised learning techniques. Unsupervised sentiment analysis methods utilize knowledge bases, ontologies, databases, and lexicons curated and tailored for sentiment analysis.
In contrast, supervised learning approaches are gaining popularity due to their precision. Algorithms require training on a labeled dataset before they can be applied to real-world data. Features are extracted from textual data during this process.
The machine learning approach utilizes syntactic and/or linguistic features to classify sentiments, which is a typical task in text classification. The classification model compares the attributes of the data to predefined class labels. It then predicts class labels for instances that belong to unknown classes. When a single label is assigned to an instance, this is known as hard classification; the soft classification problem occurs when labels are applied to instances based on a probabilistic value. Machine learning allows computers to develop new skills without explicit programming. Sentiment analysis algorithms may be trained to recognize contextual information, sarcasm, and misapplied words. Models commonly used for sentiment analysis include:
The Naive Bayes (NB) approach serves dual roles in classification and training. NB operates on Bayesian principles, calculating the likelihood of a set of features being associated with a specific label using Bayes' theorem. It computes the conditional probability of event A given the individual probabilities of A and B, as well as the conditional probability of event B. NB assumes feature independence. The Bag of Words (BoW) model is often used for feature extraction. NB is particularly effective with small training datasets. In specific tests, NB showed 10% higher accuracy in classifying positive sentiments compared to negative ones, though this reduced overall accuracy.
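In symbols, for a class $c$ and a feature vector $x = (x_1, \dots, x_n)$, Naive Bayes applies Bayes' theorem together with the feature-independence assumption:

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)} \propto P(c)\prod_{i=1}^{n} P(x_i \mid c)$$

and the predicted label is the class that maximizes this quantity.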
To address these challenges, (Kang H, Yoo SJ, Han D, 2012) enhanced the NB classifier, applying it to a restaurant review dataset. (Tripathy A, Agrawal A, Rath SK, 2015) utilized machine learning for review classification. (Hajek P, Barushka A, Munk M, 2020) and (Bordes A, Glorot X, Weston J, Bengio Y, 2014) proposed a hybrid NB and SVM model, training and testing on a movie review dataset. Preprocessing and vectorization of 2,000 reviews were conducted before training the machine learning model using CountVectorizer and TF-IDF. Tripathy et al. (2015) achieved 89.05% accuracy in K-fold cross-validation, demonstrating superior performance compared to other probabilistic NB-based models (Calders T, Verwer S, 2010).
Logistic regression is a machine learning method that assigns weights to input values. This classifier identifies which input features are most relevant for distinguishing between positive and negative classes. It operates as a probabilistic regression technique primarily used for binary classification tasks. Logistic regression is widely applied in scenarios where multiple explanatory variables are present, calculating the odds ratio when multiple predictors are involved. Maximum likelihood estimation is employed to determine the optimal parameters. Independent variables can be continuous, discrete, ordinal, or nominal. (Hamdan H, Bellot P, Bechet F, 2015) utilized the logistic regression model with a binary dependent variable and minimal multicollinearity among predictor variables.
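A compact sketch contrasting the two classifiers described above on toy review data; the texts and labels are invented, and a Bag-of-Words representation is used, as is common with Naive Bayes:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

texts = ["great product, works well", "terrible quality, broke fast",
         "very good price", "bad packaging, slow delivery"]
labels = ["positive", "negative", "positive", "negative"]

# Bag-of-Words features shared by both classifiers
X = CountVectorizer().fit_transform(texts)

for model in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    model.fit(X, labels)
    print(type(model).__name__, model.predict(X))
```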
The process of building the machine learning model
1. Collect the data
First, we collected data from user comments and reviews on Shopee. Product URLs on Shopee are used to extract reviews through Shopee's API. Using APIs, we can automate the process of getting data without having to do it manually. The data collected includes user names, star ratings, and comment content, giving us a comprehensive view of user feedback on the products. In this thesis, we take all the products at the https://shopee.vn/sofirnlight.vn shop. Because we use Python to interact with the website, we face some trouble fetching the APIs: we cannot use the Selenium library in Python to interact with Shopee, because the product links on a Shopee page require JavaScript to interact with them. So, in the next step, we use some JavaScript code to capture all the product links on the Sofirnlight page. However, our API-calling code retrieves data for only one product at a time, so we collect all the product links, print them to an HTML file, and loop over that file to fetch the comments of each product in turn.
2. Preprocess the data
Data preprocessing is an indispensable step in building a machine learning model, ensuring the quality and accuracy of the model. Key steps in data preprocessing include:
- Eliminate empty comments: During the data collection process, there may be comments that contain no content, only a star rating. These comments do not provide useful information to the sentiment analysis model, so we remove them from the dataset to minimize noise
- Split words: using word_tokenize in the UnderTheSea library of Python, which parses text into separate words or phrases. This is an important preprocessing step in natural language processing (NLP), especially for languages like Vietnamese, where spaces separate syllables rather than words. For languages like Vietnamese, much of the important semantic meaning lies in phrases rather than single words. Word splitting helps identify these phrases, thereby improving the accuracy of NLP models (see the sketch after this list)
- Convert reviews into sentiment labels: We convert star ratings into two main sentiment labels: 'positive' and 'negative'. Reviews of 4 stars or more are labeled 'positive', while reviews below 4 stars are labeled 'negative'. This classification simplifies the problem and focuses on distinguishing between positive and negative feedback
- Clean text: The text of comments is cleaned by removing symbols, punctuation, and stop words. Stop words are words that do not carry much meaning in analysis, such as "and", "but", "is", etc. This cleaning reduces distracting elements and focuses on the main content of the comment. We also remove icons and special characters to reduce noise
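A sketch of these preprocessing steps, using the UnderTheSea library named above; the stop-word list and cleaning regex are illustrative assumptions, not the exact thesis code:

```python
import re
from underthesea import word_tokenize

# Illustrative Vietnamese stop words; a real list would be much longer
STOP_WORDS = {"và", "nhưng", "là", "thì", "của"}

def preprocess(comment: str) -> str:
    # Remove icons, punctuation and special characters (keep letters/digits/spaces)
    text = re.sub(r"[^\w\s]", " ", comment).lower()
    # Segment Vietnamese multi-syllable words, e.g. "sản phẩm" -> "sản_phẩm"
    tokens = word_tokenize(text, format="text").split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(preprocess("Sản phẩm rất tốt và giao hàng nhanh!"))
```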
3. Divide the data into training set and test set
After preprocessing the data, we divide the data into two sets: a training set and a test set. The training set is used to build the model, while the test set is used to evaluate the performance of the model. Splitting the data is done using the train_test_split method with an 80-20 ratio, meaning 80% of the data is used for training and the remaining 20% is used for testing.
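Sketched with scikit-learn's train_test_split; the toy comments stand in for the preprocessed dataset:

```python
from sklearn.model_selection import train_test_split

comments = ["hàng tốt", "giao chậm", "rất đẹp", "tệ quá"]   # preprocessed comments
labels = ["positive", "negative", "positive", "negative"]   # from star ratings

X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.2, random_state=42  # 80-20 split
)
```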
4. Represent text as feature vectors
Text cannot be fed directly into a machine learning model; it needs to be converted into numerical form. To do this, we use a TF-IDF vectorizer to convert text into TF-IDF feature vectors. TF-IDF (Term Frequency-Inverse Document Frequency) is a method of evaluating the importance of a word in a document relative to the entire set of documents. Using TF-IDF helps the model focus on meaningful words in each comment.
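Continuing the sketch, the TF-IDF vocabulary is learned on the training comments only and then reused for the test set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)  # learn vocabulary + weights on train only
X_test_vec = vectorizer.transform(X_test)        # reuse the same vocabulary for test
```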
5. Balance the data
In reality, data are often unbalanced, meaning that the number of samples belonging to one class may be much larger than that of other classes. In our collected dataset, positive samples far outnumber negative ones. This can cause the machine learning model to be biased and make inaccurate predictions. To solve this problem, we use the SMOTE (Synthetic Minority Over-sampling Technique) to generate more data for the minority class. SMOTE creates new synthetic samples by interpolating between existing samples of the minority class, which balances the data and improves model accuracy.
Figure 4.3 The volume of words appearing in the cleaned data
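Continuing the sketch with SMOTE from the imbalanced-learn library; this assumes a realistically sized training set (the toy four-comment example above is too small for SMOTE's default neighborhood size):

```python
from imblearn.over_sampling import SMOTE

# Oversample the minority class on the TF-IDF training matrix only,
# never on the test set, to avoid leaking synthetic data into evaluation
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train_vec, y_train)
```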
6. Train the machine learning model
We first train a Logistic Regression model on the balanced TF-IDF features. Along with Logistic Regression, we also make use of the Naive Bayes model in sentiment analysis. Naive Bayes is a probabilistic classifier based on Bayes' theorem that assumes the features are conditionally independent within a class, which keeps it simple. Even though it is simple, Naive Bayes can be quite effective for text classification tasks, particularly when dealing with large data sets. For each class it computes a probability and chooses the class with the highest probability as its prediction: straightforward but surprisingly powerful.
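Continuing the sketch, both models are trained on the balanced TF-IDF features:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_bal, y_train_bal)

nb_model = MultinomialNB()  # TF-IDF features are non-negative, as MultinomialNB requires
nb_model.fit(X_train_bal, y_train_bal)
```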
7. Evaluate the model
After training, the model is evaluated on the test set using metrics such as precision, recall, and F1-score. These metrics help determine how accurate the model is in classifying comments. Precision indicates the ratio of correct positive predictions to the total number of positive predictions made by the model, recall indicates the ratio of correct positive predictions to the total number of actually positive samples, and F1-score is the harmonic mean of precision and recall.
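Continuing the sketch, scikit-learn's classification_report prints precision, recall, and F1-score per class:

```python
from sklearn.metrics import classification_report

y_pred = lr_model.predict(X_test_vec)
print(classification_report(y_test, y_pred))  # precision, recall, F1 for each class
```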
8. Predict sentiment for new comments
Finally, the model is used to predict sentiment for new comments. These comments are cleaned, converted into TF-IDF vectors, and their sentiment predicted using the trained model. The prediction results indicate whether new comments belong to the 'positive' or 'negative' class.
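Continuing the sketch, a new raw comment goes through the same cleaning and vectorization before prediction (the example comment is illustrative):

```python
new_comments = ["Sản phẩm dùng rất thích"]        # raw new comment
cleaned = [preprocess(c) for c in new_comments]   # same cleaning as the training data
features = vectorizer.transform(cleaned)          # same TF-IDF vocabulary
print(lr_model.predict(features))                 # e.g. ['positive']
```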
The process of building a machine learning model for sentiment analysis includes important steps: data collection and preprocessing, representing text as feature vectors, data balancing, model training, model evaluation, and finally sentiment prediction for new comments. By using techniques such as TF-IDF and SMOTE, along with the Logistic Regression model, a sentiment analysis model is built that is capable of accurately classifying positive and negative feedback from users.
In addition to Logistic Regression, the Naive Bayes model is also employed in our sentiment analysis system. Naive Bayes, a probabilistic classifier based on Bayes' theorem, is particularly effective for text classification tasks due to its simplicity and robustness. It calculates the probability of each class given the features of a sample and selects the class with the highest probability as the prediction. Incorporating Naive Bayes enhances the system's ability to classify sentiments accurately, leveraging its strength in handling large text datasets efficiently.
Critical Discussion of Methodology and Model Validation
The process of developing a machine learning model for sentiment analysis entails numerous critical steps, including data collection and preprocessing, text representation into feature vectors, data balancing, model training, model assessment, and lastly predicting sentiment for fresh comments. We created a sentiment analysis model that can effectively categorize positive and negative user comments using methods like TF-IDF and SMOTE, as well as models like Logistic Regression and Naive Bayes.
We used a variety of data analysis approaches to ensure our model's correctness and efficiency. Data preparation includes cleaning the data to reduce noise, tokenizing text to efficiently handle languages such as Vietnamese, and balancing the data to prevent model bias. These procedures are crucial for preparing the data for analysis and ensuring that the model learns properly from it.
The models are evaluated using metrics such as precision, recall, and F1-score, which offer an overall picture of the model's performance. These metrics help us assess how effectively the model is doing and where improvements might be made. Precision assesses the accuracy of positive predictions, recall measures the coverage of true positives, and the F1-score strikes a balance between precision and recall.
Throughout the development process, we encountered issues such as data imbalance and the necessity for effective feature extraction. Using sophisticated approaches such as SMOTE to balance the data and TF-IDF to extract significant features was critical in overcoming these hurdles. Furthermore, by validating the model with robust metrics, we ensure that our system not only performs well on training data, but also generalizes successfully to new, previously unseen data.
By deepening the discussion on our methodology, particularly regarding data analysis techniques and model validation, we provide a clearer understanding of the steps involved in building and validating a sentiment analysis system. This comprehensive approach helps ensure the reliability and accuracy of the system in classifying user sentiments.
How to assess the models
Figure 4.4 Results of the LR model
- Precision, Recall, F1-Score: These metrics all reach 1.00 on both the positive and negative classes, showing that the Logistic Regression model classifies the test data set with absolute accuracy
- Accuracy: Reaches 1.00, showing that Logistic Regression correctly predicts all samples in the test set
- Advantages: Absolute classification ability, suitable for applications requiring high accuracy
- Disadvantages: Logistic Regression models can require higher computational resources and can be affected by unbalanced data if techniques like SMOTE are not applied
Figure 4.5 Results of the NB model
- Precision, Recall, F1-Score: These metrics show that Naive Bayes has lower performance than Logistic Regression, especially on the negative class (precision = 0.74). However, for the positive class, these metrics are still very high (precision and recall are both near 1.00)
- Accuracy: Reaches 0.99, showing that this model is still very accurate, although slightly lower than Logistic Regression
- Advantages: Naive Bayes is a simple, fast, and effective model that is often suitable for large datasets and can perform well even when the data is not well balanced
- Disadvantages: Lower performance in negative class classification, susceptible to imbalanced data
In summary, in sentiment analysis, the choice of model depends on the specific requirements of the problem. Logistic Regression shows the ability to classify with absolute accuracy, which is especially useful for applications that require high accuracy. However, it may require higher computational resources. In contrast, Naive Bayes is a simple and fast model, suitable for large datasets, but its performance may be lower in classifying some specific classes.
Both models have their own advantages and disadvantages, and model selection should be based on application-specific criteria, such as desired accuracy, available computational resources, and characteristics of the data. In this case, if absolute accuracy is required, Logistic Regression is the better choice, but if a fast and efficient model is required, Naive Bayes is still a suitable choice.
CONCLUSION
Future research directions
The thesis not only enhances sentiment analysis, but also opens up new research opportunities and practical applications in data mining and machine learning. In the field of data engineering, there is compelling potential to develop powerful Extract, Transform, and Load (ETL) pipelines that interface smoothly with current technologies such as Apache Kafka and Spark. These pipelines not only make it easier to extract data from streaming sources like Kafka, which dynamically captures real-time consumer feedback and product evaluations, but they also make use of Apache Spark's processing capability for complex data transformations and feature engineering.
In the future, this system will be expanded to accommodate a wider range of data sources, including multimedia inputs and structured data from multiple e-commerce platforms. This extension intends not just to broaden the range of insights produced, but also to increase model accuracy through more detailed data inputs.
Automation plays a pivotal role in realizing these ambitions. By leveraging Apache Airflow, the orchestration of ETL workflows becomes automated and streamlined. Airflow enables the scheduling, monitoring, and execution of tasks across the entire ETL pipeline, from data ingestion through Kafka consumers to Spark-powered data processing, model training, and finally, storing results in MongoDB. This automation not only enhances operational efficiency but also ensures consistency and reliability in data processing tasks.
Furthermore, integrating such automated ETL pipelines into e-commerce platforms benefits businesses by offering deeper insights into customer moods and preferences, while also improving the entire purchasing experience for customers. Businesses that immediately analyze and act on consumer input can quickly address complaints, improve product offerings, and create a more customized and responsive purchasing experience.
To summarize, the combination of advanced data mining techniques, machine learning models, and automated ETL procedures is a huge step toward creating a more flexible and successful e-commerce ecosystem. This comprehensive approach not only fosters innovation in sentiment analysis, but also lays the groundwork for future advances in data-driven decision-making across other domains.
REFERENCES
Ahmad S, Asghar MZ, Alotaibi FM (2019). Detection and classification of social media-based extremist affiliations using sentiment analysis techniques. Hum Centric Comput Inf Sci 9(1):1–23.
Bordes A, Glorot X, Weston J, Bengio Y (2014). A semantic matching energy function for learning with multi-relational data. Mach Learn 94(2):233–259.
Calders T, Verwer S (2010). Three naive Bayes approaches for discrimination-free classification.
Ed-Douibi, H., Izquierdo, J.L.C., & Cabot, J. (2018). OpenAPItoUML: A tool to generate UML models from OpenAPI definitions. In Proceedings of the International Conference on Web Engineering (ICWE), Switzerland.
Faust - Python Stream Processing (2021, April 12). Retrieved from https://faust.readthedocs.io/
Hajek P, Barushka A, Munk M (2020). Fake consumer review detection using deep neural networks integrating word embeddings and emotion mining. Neural Comput Appl.
Hamdan H, Bellot P, Bechet F (2015). Lsislif: CRF and logistic regression for opinion target extraction and sentiment polarity analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp 753–758.
Kang H, Yoo SJ, Han D (2012). Senti-lexicon and improved Naive Bayes algorithms for sentiment analysis of restaurant reviews. Expert Syst Appl 39(5):6000–6010.
Khare, R., & Taylor, R. (2004). Extending the Representational State Transfer (REST) architectural style for decentralized systems. In Proceedings of the 26th International Conference on Software Engineering.
Kiritchenko, S., Zhu, X., & Mohammad, S.M. (2014). Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research, 50, 723–762.
Kreps, J., Narkhede, N., & Rao, J. (2011). Kafka: A distributed messaging system for log processing. In Proceedings of the NetDB (Vol. 11, pp 1–7).
M.J. Sax, G. Wang, M. Weidlich, J.-C. Freytag (2018). Streams and tables: Two sides of the same coin. In Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics.
Marc Plouhinec (2023). Shopee API scraper. Retrieved from https://apify.com/marc_plouhinec/shopee-api-scraper#api-reference
Mowlaei, M.E., Abadeh, M.S., & Keshavarz (2020). Aspect-based sentiment analysis using adaptive aspect-based lexicons. Expert Systems with Applications, 148, 113234.
Naresh Kumar, K.E., & Uma (2021). Intelligent sentinet-based lexicon for context-aware sentiment analysis: optimized neural network for sentiment classification on social media. The Journal of Supercomputing, 77(11), 12801–12825.
Pahl, C., & Jamshidi, P. (2016). Microservices: A systematic mapping study.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. arXiv preprint cs/0205070.
Pautasso, C., Zimmermann, O., & Leymann, F. (2008). RESTful web services vs. "big" web services: Making the right architectural decision. Beijing, China: ACM.
Rufer N, Knitza J, Krusche (2020). Covid4Rheum: an analytical twitter study in the time of the COVID-19 pandemic.
Subhashini, L.D.C.S., Li, Y., Zhang, J., Atukorale, A.S., & Wu (2021). Mining and classifying customer reviews: a survey. Artificial Intelligence Review, 1–47.
Tripathy A, Agrawal A, Rath SK (2015). Classification of sentimental reviews using machine learning techniques. Procedia Comput Sci 57:821–829.
Wu Chou, Wei Zhou, & Min Luo (2008). Design patterns and extensibility of REST API for networking.
Yoon Kim (2014). Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Zvarevashe K, Olugbara (2018). A framework for sentiment analysis with opinion mining of hotel reviews. In: 2018 Conference on Information Communications Technology and Society (ICTAS), IEEE.
SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
EXPLANATORY REPORT ON CHANGES/ADDITIONS
BASED ON THE DECISION OF GRADUATION THESIS COMMITTEE
FOR UNDERGRADUATE PROGRAMS WITH DEGREE AWARDED BY VIETNAM NATIONAL UNIVERSITY, HANOI
Student’s full name: Nguyễn Đức Huy
Graduation thesis topic: Creating a Sentiment Analysis System to Process Customer Feedback from an E-commerce Site
Based on the VNU-IS decision no. …… QĐ/TQT, dated … / … / ……, on the establishment of the Graduation Thesis Committee for Bachelor programs whose degrees are awarded by Vietnam National University, Hanoi, the thesis was defended and modified in the following sections:
No. | Change/Addition Suggestions by the Committee | Detailed Changes/Additions | Page