VAN LANG UNIVERSITY
HONOR PROGRAM
-oOo-
BUSINESS ANALYTICS
Topic: ANALYZING OF NESTLE
Instructor: Dr Nguyen Nguyen Phuong
Class code: 232_72BUSI30053_01
Ho Chi Minh city, 2024
TASK 1: MULTIPLE LINEAR REGRESSION
The video demonstrates a step-by-step guide to performing a regression analysis in Microsoft Excel to determine the factors that contribute most to the price of an apartment. The professor uses a dataset with features such as neighborhood, brick built, number of bedrooms, number of bathrooms, and square feet to analyze the relationship between these variables and the apartment price.
Data preparation:
The professor begins by preparing the data, creating dummy variables for the categorical variables, namely neighborhood and brick built. This is done by creating new columns with binary values (0 or 1) that represent the presence or absence of each category. For example, the neighborhood column is split into three new columns: neighborhood_1, neighborhood_2, and neighborhood_3, with values of 1 or 0 indicating whether the apartment is in each respective neighborhood.
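The dummy-variable step can be sketched outside Excel as well. The Python snippet below is a minimal illustration, not the professor's workbook: the helper name make_dummies and the sample values are made up. It also drops the first category of each variable, because a regression with an intercept cannot estimate a separate coefficient for every category (the indicator columns would always sum to 1), so one category usually serves as the baseline.

```python
def make_dummies(values, prefix, drop_first=True):
    """Turn a list of category labels into 0/1 indicator columns."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the baseline absorbed by the intercept.
        categories = categories[1:]
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }


# Made-up sample rows, for illustration only.
neighborhood = [1, 2, 3, 2, 1]
brick = ["Yes", "No", "No", "Yes", "No"]

neighborhood_dummies = make_dummies(neighborhood, "neighborhood")
brick_dummies = make_dummies(brick, "brick")
print(neighborhood_dummies)
print(brick_dummies)
```

Here neighborhood 1 and brick = "No" act as the baselines, so each remaining coefficient would be read as a difference from that baseline.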
Regression Analysis
The professor then performs a regression analysis using the Data Analysis tool in Excel. The dependent variable is the apartment price, and the independent variables are the dummy variables created earlier, along with the number of bedrooms, number of bathrooms, and square feet. The regression analysis produces a table with coefficients, standard errors, t-statistics, and p-values for each independent variable.
To run the regression, select Data -> Data Analysis -> Regression, then enter the input ranges for X and Y. Because Y is the dependent variable, its range is the Price column; the X range is shown in the image below.
After entering the X and Y ranges, tick Labels so that Excel does not treat the first row (the variable names) as data, and then click OK.
Figure 1.4
After getting the results in Figure 1.4, we can see that the output is divided into 3 tables. The first table is Regression Statistics. The R Square value measures how well the equation fits the data. For example, R Square = 0.868621 means the equation explains about 86.86% of the variation in the apartment price.
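As a rough illustration of what the R Square cell reports, the sketch below computes R squared as 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name r_squared and the numbers are made up for the example; they are not taken from the video's dataset.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` explained by `predicted`."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # unexplained
    ss_tot = sum((y - mean_y) ** 2 for y in actual)                # total
    return 1 - ss_res / ss_tot


# Hypothetical prices and fitted values, for illustration only.
prices = [100.0, 150.0, 200.0, 250.0]
fitted = [110.0, 140.0, 210.0, 240.0]

r2 = r_squared(prices, fitted)
print(round(r2, 4))
```

A value close to 1 means the fitted values track the observed prices closely; a value close to 0 means the model does little better than predicting the average price.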
The second table is the ANOVA table, which tests the overall relationship between Y and the X variables using an F-test.
Looking at the table, we can see that Significance F = 4.62E-51, which is far below 0.05, so the model as a whole is statistically significant.
The third table is the Coefficients table. It is used to test the relationship between Y and each individual X. Because this model has many X variables, each one can be tested separately.
The R-squared value is 0.96, which shows that the regression model fits the data very closely.
The regression coefficient for the area is 20, which shows that each additional square meter will increase the apartment price by 20 million VND, holding the other variables constant.
The regression coefficient for the number of bedrooms is 100, which shows that each additional bedroom will increase the apartment price by 100 million VND.
The regression coefficient for location is 300, which shows that an apartment in the center costs 300 million VND more than an apartment in the suburbs.
Based on the video, the regression formulas used to calculate house prices are:
1. Linear regression formula:
House price = constant + coefficient * Area + coefficient * Number of bedrooms + coefficient * Location + coefficient * Condition + coefficient * Amenities + ...
2. Polynomial regression formula:
House price = constant + coefficient * Area + coefficient * Area^2 + coefficient * Number of bedrooms + coefficient * Number of bedrooms^2 + coefficient * Location + coefficient * Location^2 + coefficient * Condition + coefficient * Condition^2 + coefficient * Amenities + coefficient * Amenities^2 + ...
3. Non-linear regression formula:
House price = constant + coefficient * Area^(1/2) + coefficient * Number of bedrooms^(1/3) + coefficient * Location^(1/4) + coefficient * Condition^(1/5) + coefficient * Amenities^(1/6) + ...
The formula for calculating simple linear regression is:
y = a + bx
where:
y is the dependent variable (house price)
x is the independent variable (area)
a is a constant
b is the regression coefficient
The regression coefficient b is calculated using the formula:
b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)^2
where:
Σ denotes the sum over all observations
x̄ is the average value of x
ȳ is the average value of y
The constant a is calculated using the formula:
a = ȳ - b x̄
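The two formulas for b and a can be checked with a few lines of Python. This is a minimal sketch: the function name fit_simple_regression is hypothetical, and the area and price values are made up so that they lie exactly on the line price = 100 + 20 * area.

```python
def fit_simple_regression(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # b = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    # a = y_bar - b * x_bar
    a = y_bar - b * x_bar
    return a, b


# Made-up data constructed to follow price = 100 + 20 * area exactly.
area = [50, 60, 70, 80]
price = [1100, 1300, 1500, 1700]

a, b = fit_simple_regression(area, price)
print(a, b)
```

Because the sample points sit exactly on a line, the function recovers the slope and intercept used to build them; with real data the fitted line would only approximate the points.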
TASK 2: K-MEANS AND RFM
K-means:
Calculate the distance from each plot to each central plot. Apply the formula below to calculate the distance.
Randomly assign a cluster to each point (keep the points selected as cluster centers in step 3).
Choose the smallest distance for partitioning.
Compare the new cluster with the assumed cluster.
The returned convergence value is 6 < 13 (the total number of plots) => the result is FALSE => the plots whose cluster did not match go through another iteration.
Assign the plots to their new clusters.
With a returned convergence value of 13, equal to the total number of plots => the result is TRUE => the plots have been allocated to the correct clusters.
After 2 runs, the K-means process ends.
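The loop described above can be sketched in Python as follows. Several details are assumptions rather than the worksheet's exact method: the distance is taken to be Euclidean, the centers are recomputed as the mean of their members after each pass, and the sample points are made up. As in the worksheet, the stopping check counts how many points kept their cluster and stops when that count equals the number of points.

```python
import math


def nearest_center(point, centers):
    """Index of the center with the smallest Euclidean distance to point."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def k_means(points, centers, max_iter=100):
    assigned = [None] * len(points)
    for iteration in range(1, max_iter + 1):
        new_assigned = [nearest_center(p, centers) for p in points]
        # Convergence value: number of points whose cluster did not change.
        unchanged = sum(1 for old, new in zip(assigned, new_assigned) if old == new)
        assigned = new_assigned
        if unchanged == len(points):
            return assigned, centers, iteration
        # Assumed update rule: move each center to the mean of its members.
        updated = []
        for k, center in enumerate(centers):
            members = [p for p, c in zip(points, assigned) if c == k]
            updated.append(mean_point(members) if members else center)
        centers = updated
    return assigned, centers, max_iter


# Made-up points forming two obvious groups; the first center of each
# group is simply chosen from the data.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, final_centers, iterations = k_means(points, [(1, 1), (8, 8)])
print(labels, iterations)
```

On this toy data the second pass reproduces the assignments of the first, so the check succeeds after 2 runs.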
RFM ANALYSIS IMPLEMENTATION PROCESS
RFM is a marketing technique used to identify a company’s best customers and understand their behavior by categorizing them based on three quantitative factors:
1. Recency: The last time a customer made a purchase
2. Frequency: How often a customer makes a purchase within a given time period
3. Monetary: How much a customer spends on purchases within a given time period
By understanding these factors, businesses can create targeted marketing campaigns to increase sales and customer loyalty.
Step 1: Prepare the Data
The dataset contains the following columns: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country.
Step 2: Create a Pivot Table
A Pivot Table is a powerful data analysis tool that allows you to summarize, sort, total, average, and perform other aggregations on your data. To create a Pivot Table, follow these steps:
1. Select the data range
2. Go to the "Insert" tab and select "Pivot Table"
3. Choose "New Worksheet" and tick the box "Add this data to the Data Model"
4. Click "OK"
Amount = UnitPrice * Quantity
Step 3: Prepare the Data for RFM analysis
Step 4: Create a sheet and calculate Days since
Days since = Today’s Date - Invoice Date
Step 5: Calculate the RFM Scores
Recency and Frequency of customers:
R5 & F5 – Champions/VIP
R4 & F5 – Loyal Customers
R1 & F1 – Hibernating
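Steps 1 to 5 can be sketched end to end in Python. This is an illustration under stated assumptions, not the workbook's exact method: scores are assigned by rank into five equal-sized groups (ties are broken by position), frequency counts distinct invoice numbers, the function names are hypothetical, and the transactions are made up. Only the three segments named above are mapped; every other combination is labeled "Other".

```python
from datetime import date


def quintile_scores(values, higher_is_better=True):
    """Score each value from 1 (worst) to 5 (best) by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=not higher_is_better)
    scores = [0] * len(values)
    for rank, i in enumerate(order):
        scores[i] = rank * 5 // len(values) + 1
    return scores


def rfm_scores(transactions, today):
    """Return {customer: (R, F, M)} from rows of
    (CustomerID, InvoiceNo, InvoiceDate, Quantity, UnitPrice)."""
    summary = {}
    for customer, invoice_no, invoice_date, quantity, unit_price in transactions:
        row = summary.setdefault(
            customer, {"last": invoice_date, "invoices": set(), "amount": 0.0})
        row["last"] = max(row["last"], invoice_date)
        row["invoices"].add(invoice_no)
        row["amount"] += quantity * unit_price      # Amount = UnitPrice * Quantity
    customers = sorted(summary)
    days_since = [(today - summary[c]["last"]).days for c in customers]
    frequency = [len(summary[c]["invoices"]) for c in customers]
    monetary = [summary[c]["amount"] for c in customers]
    r = quintile_scores(days_since, higher_is_better=False)  # fewer days is better
    f = quintile_scores(frequency)
    m = quintile_scores(monetary)
    return {c: (r[i], f[i], m[i]) for i, c in enumerate(customers)}


def segment(r, f):
    """Map R and F scores to the segments listed in the text."""
    if r == 5 and f == 5:
        return "Champions/VIP"
    if r == 4 and f == 5:
        return "Loyal Customers"
    if r == 1 and f == 1:
        return "Hibernating"
    return "Other"


# Made-up transactions for five hypothetical customers.
transactions = [
    ("C1", "A1", date(2024, 1, 5), 1, 100.0), ("C1", "A2", date(2024, 1, 10), 1, 100.0),
    ("C1", "A3", date(2024, 1, 15), 1, 100.0), ("C1", "A4", date(2024, 1, 20), 1, 100.0),
    ("C1", "A5", date(2024, 1, 30), 1, 100.0),
    ("C2", "B1", date(2024, 1, 3), 1, 50.0), ("C2", "B2", date(2024, 1, 8), 1, 50.0),
    ("C2", "B3", date(2024, 1, 18), 1, 50.0), ("C2", "B4", date(2024, 1, 25), 1, 50.0),
    ("C3", "D1", date(2023, 12, 1), 1, 30.0), ("C3", "D2", date(2023, 12, 10), 1, 30.0),
    ("C3", "D3", date(2024, 1, 1), 1, 30.0),
    ("C4", "E1", date(2023, 11, 1), 1, 20.0), ("C4", "E2", date(2023, 12, 1), 1, 20.0),
    ("C5", "F1", date(2023, 9, 1), 1, 10.0),
]

scores = rfm_scores(transactions, today=date(2024, 1, 31))
for customer, (r, f, m) in scores.items():
    print(customer, r, f, m, segment(r, f))
```

Rank-based scoring is only one way to form the five groups; fixed thresholds on days, order counts, or spend would work as well and may be easier to explain to a business audience.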
TASK 3: DATA VISUALIZATION
1/ Introduction to the dataset
1.1 What industry does this dataset represent?
The dataset represents the retail industry.
1.2 Industry Introduction
The retail industry is a vital sector of the global economy, encompassing businesses that sell goods and services directly to consumers. This industry includes a variety of retail formats, such as department stores, specialty stores, supermarkets, and online retailers. The significance of the retail industry lies in its ability to meet the diverse needs of consumers, driving economic growth and employment.
One of the most significant trends reshaping the retail landscape is the rapid growth of online shopping and e-commerce. With conveniences like home delivery, wide product assortments, and price comparisons, online sales have disrupted traditional in-store retail models. However, physical stores still maintain relevance by offering experiential shopping, instant gratification, and high-touch customer service experiences that digital channels cannot fully replicate.
Technology is another major driving force, profoundly impacting retail operations, supply chains, customer engagement, and data analytics capabilities. Mobile apps, self-checkout systems, AI-powered recommendations, augmented reality for virtual try-ons, and sophisticated inventory management are just some examples of technology transforming retail.
In North America, the retail sector faces challenges like supply chain disruptions, evolving consumer preferences, labor shortages, and the need to deliver omnichannel experiences that seamlessly blend the digital and physical worlds. However, opportunities exist in leveraging data, personalization, sustainability initiatives, and exceptional customer service to build loyalty and brand affinity.
Ultimately, the retail industry's significance lies in its ability to cater to diverse consumer demands, embrace innovation, adapt to changing market forces, and create engaging shopping journeys that foster customer delight and drive economic growth.
1.3 Describe the dataset's structure, categorizing columns by type
• Shipping Mode: Specifies the shipping method as "Delivery Truck," "Regular Air," "Express Air," or "Delivery In-Store."
• Region: Indicates the geographical region of the customer as "Central," "East," "South," or "West."
Numerical variables:
• Order ID: Uniquely identifies each order.
• Order Date: Specifies the date the order was placed.
• Number of Orders: Indicates the total number of orders a customer has placed.
• Order Quantity: Specifies the quantity of items ordered in a particular order.
• Discount: Reflects the percentage discount applied to an order.
• Profit: Represents the profit generated from a particular order.
• Sales: Indicates the total sales amount for an order.
• Shipping Cost: Specifies the cost of shipping an order.
• Unit Price: Represents the price per unit of a product.
Additional variables:
• City: Specifies the city where the customer is located.
• State: Indicates the state where the customer is located.
• Zip Code: Provides the customer's zip code.
• Product Base Margin: May represent the base profit margin for a product.
• Row ID: Not classified above as categorical or numerical; it appears to be a unique identifier for each row in the dataset.
• Product Name: Likely contains the names of specific products.
• Customer Name: Likely contains customer names.
1.4 Identify data columns containing missing values? Specify how many rows, and what % of rows in that column have missing values?
1.5 Are missing values handled? State the imputation method for each data column containing missing values.
Use the median of the value column to fill in missing values:
=MEDIAN($B$27:$B$8400)
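The same median imputation can be sketched in Python. The function name fill_missing_with_median and the sample values are made up for illustration. Missing cells are represented as None and are left out when the median is computed, which mirrors how Excel's MEDIAN skips empty cells in a range.

```python
from statistics import median


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing cells.
column = [10.0, None, 30.0, 20.0, None, 40.0]
filled = fill_missing_with_median(column)
print(filled)
```

The median is a common choice over the mean here because a few extreme values do not pull it far from the typical value.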
2/ Preparation steps
2.1 How many columns are used in the analysis? List columns.
• City
• Customer Age
• Customer Name
• Customer Segment
• Discount
• Number of Records
• Order Date
• Order ID
• Order Priority
• Order Quantity
• Product Base Margin
• Product Category
• Product Container
• Product Name
• Product Sub-Category
• Profit
• Region
• Row ID
• Sales
• Ship Date
• Ship Mode
• Shipping Cost
• State
• Unit Price
• Zip Code
2.2 Outline the content you want to convey to readers through the analysis.
The analysis includes 4 dashboards covering Walmart's business situation, operating performance, product performance, and customer insights for 2012-2015.
Dashboard 1: Overview of the Company’s revenue and profit (2012-2015)
• Sum of sale by Category and Customer Segment
• Sales by region
• Profit detail
• Profit level
• Profit Forecast
• Profit Trend
Dashboard 2: Operational Efficiency
• Shipping ratio
• Order status by month
Dashboard 3: Product Performance
• Product performance by Category (heatmap): comparing sales and profit for each product category to identify strengths and weaknesses.
• Top products measured by sales, quantity, and profit (bar chart).
• Market basket
Dashboard 4: Customer Insights:
• Customer Segments by region
• Top sales of customers by category and segmentation
DASHBOARD 1: OVERVIEW OF THE COMPANY’S REVENUE AND PROFIT (2012-2015)
Chart 1: Sum of sale by Category and Customer Segment:
The chart shows that sales for all product categories and customer segments grew over the four-year period. The biggest growth was in the Technology category, which saw sales more than double from 2012 to 2015. The Home Office category also saw strong growth, with sales increasing by about 70% over the same period.
The Consumer segment was the largest customer segment in terms of sales throughout the period, but the Small Business segment grew the fastest. Sales to the Small Business segment more than tripled from 2012 to 2015.
• In 2012, the Consumer segment had the highest sales, followed by the Business segment and then the Small Business segment.
• In 2013, the Consumer segment again had the highest sales, followed by the Business segment and then the Small Business segment.
• In 2014, the Consumer segment still had the highest sales, but the Business segment and the Small Business segment were much closer in terms of sales.
• In 2015, the Consumer segment once again had the highest sales, but the Business segment and the Small Business segment were even closer in terms of sales than they were in 2014.
Chart 2: Sales by region
This bar chart displays the sales figures for the regions and states where Walmart operates. The x-axis shows the region or state names, while the y-axis represents the sales values.
Top performing regions:
• California has the highest sales among the regions shown, with sales of $1,372,210.
• Illinois is the second-highest performer with sales of $959,327.
• Texas follows with sales of $863,891.
Other notable regions:
• New York ($738,894), Ohio ($729,426), and Florida ($777,664) have considerable sales contributions.
• Midwest regions like Minnesota ($490,010), Michigan ($475,171), and Indiana ($466,670) also show significant sales figures.
State-level observations:
• Within the West Coast region, California dominates, while Washington ($560,356) and Oregon ($354,325) contribute smaller portions.
• In the South, Texas leads, followed by Florida and Georgia ($325,852).
• Smaller states like MD ($347,458), NJ ($328,234), MA ($242,051), and ME ($235,917) have lower but notable sales.
The bar chart effectively visualizes the regional and state-level sales performance for Walmart, highlighting the top contributors and providing insights into the varying sales patterns across different geographic areas.
Chart 3: Profit detail
The visualization presents Walmart's profit figures across product sub-categories, quarters, and years from 2012 to 2014 (Q1).
Overall profit trend: Walmart's overall profit, as measured by SUM(Profit), shows an increasing trend from 2012 to 2014 (Q1). The total profit was negative (-$41,504) in 2012 but rose to a positive $97,353 by the first quarter of 2014.
Top profitable sub-categories:
• The Office Machines sub-category consistently generated high profits across all years, with a peak of $12,558 in 2014 (Q1).
• The Telephone sub-category also contributed significant profits, reaching $97,353 in 2012.
• The Office Furniture and Appliances sub-categories were other major profit contributors.
Sub-categories with losses:
• The Tables sub-category incurred substantial losses across all years, with a maximum loss of $41,504 in 2012.
• The Bookcases and Chairs sub-categories also experienced losses in certain years, though with improvements over time.
Quarterly variations: The data reveals quarterly fluctuations in profits for many sub-categories. For instance, Office Machines had higher profits in Q3 and Q4 compared to Q1 and Q2 across multiple years.