HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

SCHOOL OF ECONOMICS AND MANAGEMENT

PROJECT REPORT

Sentiment analysis:

Women's E-Commerce Clothing Reviews

Course: Data Science for Business (MI4062E)

Instructor: Prof. Le Hai Ha

Student: Ngo The Nam

Student ID: 20192655

Hà Nội – 2022

Contents

1 Project Description and Objectives

2 Data Descriptions

2.1 Import libraries and Dataset

2.2 Data Preprocessing

2.3 Text Processing

2.4 Data Exploration

3 Methodology

3.1 Sentiment Analysis

3.2 Predictive Modeling

3.3 Performance of Models

3.4 Confusion Matrix

3.5 ROC

4 Key findings & Conclusions

1 Project Description and Objectives

I will analyze a dataset of women's e-commerce clothing reviews. The project has three parts. First, I will examine how recommendations vary across classes and departments to determine which of them receive generally positive or negative assessments. Second, I will perform sentiment analysis and create word clouds to examine positive and negative terms. Finally, I will build several predictive models and evaluate their effectiveness.

*All analysis was carried out in Microsoft Visual Studio Code.

2 Data Descriptions

2.1 Import libraries and Dataset:

First, I imported the necessary libraries.

Next, I imported the dataset into Visual Studio Code.

I am using the dataset Women's E-Commerce Clothing Reviews:

https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews?select=Womens+Clothing+E-Commerce+Reviews.csv

This dataset is real commercial data. It includes 11 feature columns and 23486 rows, and every row represents a customer review.
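The import and loading code is not reproduced in this copy of the report. The following is a minimal sketch, assuming pandas is used and the Kaggle CSV has been downloaded locally. The helper name `load_reviews` and the tiny in-memory sample are illustrative assumptions, not the author's code.

```python
import io

import pandas as pd


def load_reviews(source):
    """Read the reviews CSV from a file path or a file-like object."""
    return pd.read_csv(source)


# Tiny stand-in for the real file, using column names from the Kaggle CSV.
sample_csv = io.StringIO(
    "Clothing ID,Age,Review Text,Rating,Recommended IND\n"
    "767,33,Absolutely wonderful,4,1\n"
    "1080,34,Runs small and the fabric is thin,2,0\n"
)
sample = load_reviews(sample_csv)
print(sample.shape)  # (rows, columns)
```

With the real file, the call would look like `load_reviews("Womens Clothing E-Commerce Reviews.csv")`, where the path depends on where the download was saved.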

2.2 Data Preprocessing

I dropped the unnecessary columns and renamed the rest for further analysis.

# Drop Unnecessary Columns

# Rename Columns

# Show Dataframe
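The report does not show which columns were dropped. The sketch below assumes pandas, assumes that the unnamed index column and "Clothing ID" are the ones removed, and assumes that renaming replaces spaces with underscores, which matches names used later such as "Review_Text" and "Recommended_IND". The sample rows are made up for illustration.

```python
import pandas as pd

# Stand-in rows with the original Kaggle column names.
raw = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1],
        "Clothing ID": [767, 1080],
        "Review Text": ["Absolutely wonderful", "Runs small"],
        "Rating": [4, 2],
        "Recommended IND": [1, 0],
        "Division Name": ["Intimates", "General"],
    }
)

# Drop Unnecessary Columns (which columns to drop is an assumption)
cleaned = raw.drop(columns=["Unnamed: 0", "Clothing ID"], errors="ignore")

# Rename Columns: replace spaces with underscores
cleaned = cleaned.rename(columns=lambda name: name.replace(" ", "_"))

# Show Dataframe
print(cleaned.head())
```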

2.3 Text Processing

I started text processing by removing all punctuation and special characters from the reviews. This filters out noise characters and is a step toward further categorization. The next step is to convert all uppercase letters to lowercase.

Finally, I removed the stopwords, which in this case are also treated as noise.
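The three steps above (strip punctuation, lowercase, remove stopwords) can be sketched with the standard library alone. The function name `clean_review` and the short stopword list are illustrative assumptions. A real run would use a fuller list, for example the English stopwords shipped with NLTK.

```python
import re

# Small illustrative stopword list (an assumption, not the report's list).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "i", "in", "of", "to", "was"}


def clean_review(text, stopwords=STOPWORDS):
    """Strip punctuation, lowercase, and drop stopwords from one review."""
    # Replace anything that is not a letter, digit, or whitespace with a space.
    no_punct = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Lowercase, then split on whitespace.
    tokens = no_punct.lower().split()
    # Keep only tokens that are not stopwords.
    return " ".join(tok for tok in tokens if tok not in stopwords)


print(clean_review("I LOVE this dress!!! So pretty, and the fit is perfect."))
```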

2.4 Data Exploration

Using exploratory data analysis, I investigate the distribution of positive and negative reviews across clothing divisions, departments, classes, ratings, and other factors. I treat a review as favorable if the customer recommends the product.

“Recommended_IND” is a binary variable stating whether the customer recommends the product: 1 means recommended and 0 means not recommended.

Below are the bar charts of “Division_Name”, “Class_Name”, “Department_Name”, and “Rating” by “Recommended_IND”.
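The counts behind such a bar chart can be tabulated before plotting. The sketch below assumes pandas and uses a made-up five-row sample. Only the column names come from the report.

```python
import pandas as pd

# Made-up sample rows; the real dataframe has one row per review.
reviews = pd.DataFrame(
    {
        "Division_Name": ["General", "General", "General Petite", "Intimates", "General"],
        "Recommended_IND": [1, 0, 1, 1, 1],
    }
)

# Count recommended (1) and not recommended (0) reviews per division.
counts = pd.crosstab(reviews["Division_Name"], reviews["Recommended_IND"])
print(counts)

# counts.plot(kind="bar") would draw the grouped bar chart (requires matplotlib).
```

The same call works for “Class_Name”, “Department_Name”, and “Rating” by swapping the first argument.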

3 Methodology

3.1 Sentiment Analysis

I applied VADER for sentiment analysis. Each text is scored primarily by its compound score.

# Calculate Compound Score

New dataframe after calculation:

To categorize the texts, I set 0.05 and -0.05 as thresholds: text with a compound score larger than 0.05 is labeled positive, text with a compound score smaller than -0.05 is labeled negative, and all other text is labeled neutral.

Next, I dropped the neutral texts to get a new dataframe with only the Review_Text and Label.
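The thresholding and filtering steps can be sketched as follows. The compound scores here are made-up inputs rather than real VADER output, and the names `label_from_compound` and `kept` are illustrative assumptions.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"


# Made-up (review text, compound score) pairs.
scored = [
    ("love dress perfect fit", 0.84),
    ("poor quality returned", -0.42),
    ("ordered size medium", 0.02),
]

# Attach a label to every review, then drop the neutral ones.
labeled = [(text, label_from_compound(score)) for text, score in scored]
kept = [(text, label) for text, label in labeled if label != "neutral"]
print(kept)
```

A score exactly equal to a threshold is labeled neutral here, following the "larger than" and "smaller than" wording in the text.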

Lastly, I updated the stopword list with some frequently occurring words and then generated word clouds for the positive and negative reviews.

Positive Reviews

Negative Reviews

3.2 Predictive Modeling

The dataset is imbalanced: it has 15152 positive rows and 485 negative rows. To address this, I used SMOTE to oversample the “negative” class. After oversampling, both classes have 15152 rows.
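To illustrate the idea behind SMOTE, the sketch below creates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified, dependency-free illustration on made-up numeric features, not the report's code. The report does not say how the text was vectorized, and in practice a library implementation such as the `SMOTE` class in imbalanced-learn would normally be used.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy numeric features: 6 majority rows and 3 minority rows.
majority = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4], [0.4, 0.1]]
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3]]

new_rows = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_rows
print(len(majority), len(balanced_minority))
```

Because each synthetic row lies on a segment between two real minority rows, the new points stay inside the region already occupied by the minority class.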

Then, I built four models and made predictions: Logistic Regression, Naive Bayes, Decision Tree, and Random Forest. To compare their predictive performance, I calculated Accuracy, Recall, F1-Score, the Confusion Matrix, ROC, and AUC.
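As a reminder of how these metrics relate to the confusion matrix, here is a small dependency-free sketch on made-up labels. The function names are illustrative assumptions, and the numbers are not the report's results.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def summarize(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up true labels and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
print(confusion_counts(y_true, y_pred))
print(summarize(y_true, y_pred))
```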

3.3 Performance of Models

3.4 Confusion Matrix

3.5 ROC
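The ROC figure is not reproduced in this copy. As a rough illustration of what the AUC measures, it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by direct pair counting on made-up scores. The function name `roc_auc` is an illustrative assumption.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities of the positive class.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
auc = roc_auc(y_true, scores)
print(round(auc, 3))
```

Pair counting is quadratic in the number of examples, so it suits small illustrations. Library routines compute the same quantity from sorted scores.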

Logistic Regression is the best model: its F1 Score and AUC are both the highest among the four models.

4 Key findings & Conclusions

Based on the bar chart of division name, the women's e-commerce platform performs well overall: the recommended rate is higher than the unrecommended rate. Moreover, units sold decrease as clothing sizes get smaller: the “General” division has the highest units sold, “General Petite” ranks second, and “Intimates” has the lowest. In other words, a larger proportion of customers recommend “General” than “General Petite” or “Intimates”. Based on these metrics, we should design products in larger sizes, such as medium, large, and extra-large, for the following seasons. The platform has to embrace the new trend of body positivity, even though it could be a sharp turn away from the styles that defined the women's apparel industry for decades. The platform will have a more promising future if it offers a wider variety of clothing types.

Consistent product quality is fundamental to the overall success of every business. Consistency lets customers know what to expect every time they shop and with every product they purchase, which can increase their trust in the brand and increase units sold in the long term. Based on "Class name" and "Department name", the unrecommended rates for Blouses, Dresses, Knits, and Tops are relatively higher than for other categories. Customers can observe and judge the quality of clothing products, so improving product consistency in these four segments could significantly increase the recommended rate.

According to the rating bar chart, the rating score (from 1 to 5) is positively related to customers' propensity to recommend. The compelling takeaway concerns customers who rate the platform below 3 stars: their inclination to recommend is inconsistent. The platform could therefore strengthen customer relationship management for this segment, which could potentially turn a negative shopping experience into a positive one. For customers who rate the platform 4 or 5, product consistency is key to retention, while customers who rate the platform under 3 are at risk of churn, which is a loss for the platform. In addition, the proportion of ratings above 3 stars among customers who recommend (81%) is higher than the proportion of ratings below 3 stars among customers who do not recommend (19%).

According to the word clouds, “fabric” and “sweaters” are words that appear frequently in both positive and negative comments. Based on that, the platform should narrow its customer segmentation to clarify which types of customers to target, depending on the type of fabric they choose. Moreover, if the platform would like to expand the business, it could create multiple product lines to target different types of customers, such as a high-end line for customers aged 28-35, who have greater financial capability, and a classic line for customers aged 18-25, who have relatively less money to spend on apparel.

Pants are a product category mentioned notably often in the positive comments. Based on this, the platform could feature this category in advertising to attract new customers, building on its strong reputation among existing customers. Size is customers' biggest concern, which may be related to the bar chart of "division name". The platform mainly produces smaller sizes, which could harm its reputation. As mentioned before, the platform could benefit from producing medium, large, and extra-large apparel.

Performance of Models

Comparing the performance of the models, Logistic Regression performs best because its AUC and F1-Score, 0.82 and 0.96 respectively, are the highest of all four models.

Based on the charts and tables presented above, the women's e-commerce platform can, first, expand its product range by producing larger sizes of apparel. Moreover, customers value product quality, which is reflected in higher ratings and more positive reviews. The platform should make more effort to improve and maintain consistent product quality so that it can attract new customers. As long as the platform is willing to change and adapt, it can achieve increased earnings, a better reputation, and greater user loyalty.

Title	Sentiment Analysis: Women's E-Commerce Clothing Reviews
Author	Ngo The Nam
Supervisor	Prof. Le Hai Ha
University	Hanoi University of Science and Technology
Major	Data Science for Business
Type	Project Report
Year of publication	2022
City	Hà Nội
