Business analytics: data science for business problems

THÔNG TIN TÀI LIỆU

Thông tin cơ bản

Tiêu đề	Business Analytics: Data Science for Business Problems
Tác giả	Walter R. Paczkowski
Chuyên ngành	Business Analytics
Thể loại	Book
Năm xuất bản	2021
Thành phố	Cham

Định dạng
Số trang	416
Dung lượng	13,92 MB

Cấu trúc

1.1 Types of Business Problems (40)
1.2 The Role of Information in Business Decision Making (41)
1.3 Uncertainty vs. Risk (0)
1.4 The Data-Information Nexus (45)
- 1.4.1 Data and Information Confusion (46)
- 1.4.2 The Data Component (46)
- 1.4.3 The Extractor Component (51)
- 1.4.4 The Information Component (57)
1.5 Analytics Requirements (60)
- 1.5.1 Theoretical Framework (61)
- 1.5.2 Data Handling (63)
- 1.5.3 Programming Literacy (64)
- 1.5.4 Component Interconnections (66)
2.1 Data Dimensions: A Taxonomy for Defining Data (68)
- 2.1.1 Taxonomy Component #1: Source (68)
- 2.1.2 Taxonomy Component #2: Domain (74)
- 2.1.3 Taxonomy Component #3: Levels (74)
- 2.1.4 Taxonomy Component #4: Continuity (75)
- 2.1.5 Taxonomy Component #5: Measurement Scale (76)
2.2 Data Organization (78)
- 2.2.1 External Database Structures (78)
- 2.2.2 Internal Database Structures (81)
2.3 Data Dictionary (91)
- 3.1.1 Case Study 1: Customer Transactions Data (94)
- 3.1.2 Case Study 2: Measures of Order Fulfillment (95)
3.2 Importing Your Data (97)
- 3.2.1 Data Formats (97)
- 3.2.2 Importing a CSV Text File into Pandas (99)
- 3.2.3 Importing Large Files in Chunks (101)
- 3.2.4 Checking Your Imported Data (0)
3.3 Merging or Joining DataFrames (113)
3.4 Reshaping DataFrames (0)
3.5 Sorting a DataFrame (116)
3.6 Querying a DataFrame (117)
- 3.6.1 Boolean Operators and Indicator Functions (117)
- 3.6.2 Pandas Query Method (119)
4.1 Background for Data Visualization (120)
4.2 Gestalt Principles of Visual Design (121)
4.3 Issues Complicating Data Visualization (122)
- 4.3.1 Human Visual Limitations (122)
- 4.3.2 Data Visualization Tools (124)
- 4.3.3 Types of Visuals (0)
- 4.3.4 What to Look for in a Graph (127)
4.4 Visualizing Spatial Data (132)
- 4.4.1 Data Preparation (0)
- 4.4.2 Visualizing Continuous Spatial Data (133)
- 4.4.3 Visualizing Categorical Spatial Data (144)
- 4.4.4 Visualizing Continuous and Categorical Spatial Data (147)
4.5 Visualizing Temporal (Time Series) Data (150)
- 4.5.1 Properties of Temporal (Time Series) Data (152)
- 4.5.2 Visualizing Time Series Data (153)
- 4.5.3 Times Series Complications (154)
4.6 Faceted Plots (0)
4.7 Appendix (0)
- 4.7.1 Taylor Series Expansion for Growth Rates (161)
5.1 Transformations (163)
- 5.1.1 Linear Transformations (164)
- 5.1.2 Nonlinear Transformations (171)
- 5.1.3 A Family of Transformations (173)
5.2 Encoding (176)
- 5.2.1 Dummy or One-Hot Encoding (177)
- 5.2.2 Patsy Encoding (181)
- 5.2.3 Label Encoding (182)
- 5.2.4 Binarizing Data (182)
5.3 Dimension Reduction (0)
5.4 Handling Missing Data (186)
5.5 Appendix (188)
- 5.5.1 Mean and Variance of Standardized Variable (0)
- 5.5.2 Mean and Variance of Adjusted Standardized Variable (189)
- 5.5.3 Unbiased Estimators of μ and σ 2 (190)
Part II Intermediate Analytics (0)
- 6.1 Basic OLS Concept (0)
  - 6.1.1 The Disturbance Term and the Residual (195)
  - 6.1.2 OLS Estimation (196)
  - 6.1.3 The Gauss-Markov Theorem (200)
- 6.2 Analysis of Variance (200)
- 6.3 Case Study (0)
  - 6.3.1 Basic OLS Regression (203)
  - 6.3.2 The Log-Log Model (203)
  - 6.3.3 Model Set-up (205)
  - 6.3.4 Estimation Summary (206)
  - 6.3.5 ANOVA for Basic Regression (206)
  - 6.3.6 Elasticities (206)
- 6.4 Basic Multiple Regression (208)
  - 6.4.1 ANOVA for Multiple Regression (209)
  - 6.4.2 Alternative Measures of Fit: AIC and BIC (211)
- 6.5 Case Study: Expanded Analysis (213)
- 6.6 Model Portfolio (217)
- 6.7 Predictive Analysis: Introduction (218)
  - 6.7.1 Predicting vs. Forecasting (219)
  - 6.7.2 Developing a Prediction (219)
  - 6.7.3 Simulation Tool for Prediction Application (220)
- 7.1 Time Series Basics (222)
  - 7.1.1 Time Series Definition (223)
  - 7.1.2 Time Series Concepts (224)
- 7.2 Importing a Date/Time Variable (226)
- 7.3 The Data Cube and Time Series Data (226)
- 7.4 Handling Dates and Times in Python and Pandas (227)
  - 7.4.1 Datetimes vs. Periods (228)
  - 7.4.2 Aggregating Datetime Measures (229)
  - 7.4.3 Converting Time Periods in Pandas (229)
  - 7.4.4 Date-Time Mini-Language (231)
- 7.5 Some Calendrical Calculations (0)
- 7.8 Durbin-Watson Test Statistic (237)
- 7.9 Lagged Dependent and Independent Variables (243)
  - 7.9.1 Lagged Independent Variable: ARDL(0, 1) (244)
  - 7.9.2 Lagged Dependent Variable: ARDL(1, 0) (244)
  - 7.9.3 Lagged Dependent and Independent Variables: ARDL(1, 1) (244)
- 7.10 Further Exploration of Time Series Analysis (244)
  - 7.10.1 Step 1: Identification of a Model (247)
  - 7.10.2 Step 2: Estimation of the Model (252)
  - 7.10.3 Step 3: Validation of the Model (254)
  - 7.10.4 Step 4: Forecasting with the Model (255)
- 7.11 Appendix (256)
  - 7.11.1 Backshift Operator (256)
  - 7.11.2 Useful Algebra Results (257)
  - 7.11.3 Mean and Variance of Y t (257)
  - 7.11.4 Demeaned Data (258)
  - 7.11.5 Time Trend Addition (258)
- 8.1 Data Preprocessing (259)
- 8.2 Categorical Data (260)
- 8.3 Creating a Frequency Table (261)
- 8.4 Hypothesis Testing: A First Step (263)
- 8.5 Cross-tabs and Hypothesis Tests (265)
  - 8.5.1 Hypothesis Testing (0)
  - 8.5.2 Plotting a Frequency Table (270)
- 8.6 Extending the Cross-tab (277)
- 8.7 Pivot Tables (279)
- 8.8 Appendix (281)
  - 8.8.1 Pearson Chi-Square Statistic (281)
Part III Advanced Analytics (0)
- 9.1 Supervised and Unsupervised Learning (284)
- 9.2 Working with the Data Cube (286)
- 9.3 The Data Cube and DataFrame Indexing (287)
- 9.4 Sampling From a DataFrame (292)
  - 9.4.1 Simple Random Sampling (SRS) (293)
  - 9.4.2 Stratified Random Sampling (294)
  - 9.4.3 Cluster Random Sampling (0)
- 9.5 Index Sorting of a DataFrame (295)
- 9.6 Splitting a DataFrame: The Train-Test Splits (296)
  - 9.6.1 Model Tuning of Hyperparameters (297)
  - 9.6.2 Incorrect Use of Testing Data (299)
  - 9.6.3 Creating the Training/Testing Data Sets (300)
  - 9.6.4 Recombining the Data Sets (306)
- 9.7 Appendix (307)
  - 9.7.1 Primer on Random Numbers (307)
- 10.1 Link Functions: An Introduction (310)
- 10.2 Data Preprocessing (311)
  - 10.2.1 Data Standardization for Regression Analysis (311)
  - 10.2.2 One-Hot and Effects (or Sum) Encoding (313)
- 10.3 Case Study Application (315)
- 10.4 Heteroskedasticity Issues and Tests (320)
  - 10.4.1 Heteroskedasticity Problem (322)
  - 10.4.2 Heteroskedasticity Detection (323)
  - 10.4.3 Heteroskedasticity Remedy (325)
- 10.5 Multicollinearity (327)
  - 10.5.1 Digression on Multicollinearity (328)
  - 10.5.2 Detection with VIF and the Condition Index (330)
  - 10.5.3 Principal Component Regression and High-Dimensional Data (331)
- 10.6 Predictions and Scenario Analysis (332)
  - 10.6.1 Making Predictions (332)
  - 10.6.2 Scenario Analysis (333)
  - 10.6.3 Prediction Error Analysis (PEA) (334)
- 10.7 Panel Data Models (340)
- 11.1 Case Study: Background (345)
- 11.2 Logistic Regression (345)
  - 11.2.1 A Choice Interpretation (346)
  - 11.2.2 Properties of this Problem (346)
  - 11.2.3 A Model for the Binary Problem (347)
  - 11.2.4 Case Study: Train-Test Data Split (350)
  - 11.2.5 Case Study: Logit Model Training (351)
  - 11.2.6 Making and Assessing Predictions (0)
  - 11.2.7 Classification with a Logit Model (359)
- 11.3 K-Nearest Neighbor (KNN) (0)
  - 11.3.1 Case Study: Predicting (364)
- 11.4 Naive Bayes (364)
  - 11.4.1 Background: Bayes Theorem (364)
  - 11.4.2 A General Statement (0)
  - 11.4.3 The Naive Adjective: A Simplifying Assumption (368)
  - 11.4.4 Distribution Assumptions (368)
  - 11.4.5 Case Study: Naive Bayes Training (370)
  - 11.5.2 Gini Index and Entropy (375)
  - 11.5.3 Case Study: Growing a Tree (379)
  - 11.5.4 Case Study: Predicting with a Tree (381)
  - 11.5.5 Random Forests (382)
- 11.6 Support Vector Machines (382)
  - 11.6.1 Case Study: SVC Application (384)
  - 11.6.2 Case Study: Prediction (384)
- 11.7 Classifier Accuracy Comparison (386)
- 12.1 Training and Testing Data Sets (388)
- 12.2 Hierarchical Clustering (389)
  - 12.2.1 Forms of Hierarchical Clustering (389)
  - 12.2.2 Agglomerative Algorithm Description (390)
  - 12.2.3 Metrics and Linkages (391)
  - 12.2.4 Preprocessing Data (392)
  - 12.2.5 Case Study Application (392)
  - 12.2.6 Examining More than One Solution (397)
- 12.3 K-Means Clustering (398)
  - 12.3.1 Algorithm Description (398)
  - 12.3.2 Case Study Application (399)
- 12.4 Mixture Model Clustering (401)
verbs I mentioned in the text (0)

Nội dung

I have two objectives for this introductory chapter regarding my spoiler alert. I will first discuss the types of problems business decision makers confront and who the decision makers are. I will then discuss the role and importance of information to set the foundations for succeeding chapters. This will include a definition of information. People frequently use the words data and information interchangeably as if they have the same meaning. I will draw a distinction between them. First, they are not the same despite the fact that they are used interchangeably. Second, as I will argue, information is latent, hidden inside data and must be extracted and revealed which makes it a deeper, more complex topic. As a data analyst, you need to have a handle on the significance of information because extracting it from data is the sole reason for the existence of BDA

Types of Business Problems

What types of business problems warrant BDA? The types are too numerous to mention, but to give a sense of them consider a few examples:

• Anomaly Detection: production surveillance, predictive maintenance, manufac- turing yield optimization;

– Marketing cross-sell and up-sell;

– Pricing: leakage monitoring, promotional effects tracking, competitive price responses;

– Fulfillment: management and pipeline tracking;

• Competitive Environment Analysis (CEA); and

And the list goes on, and on.

Effective decision-making is crucial in addressing various challenges, particularly in new product development, which involves a complex decision process through multiple stages According to Paczkowski (2020), the product development pipeline consists of five key stages: ideation, design, testing, launch, and post-launch tracking At each stage, critical decisions are made regarding whether to advance to the next phase or discontinue development These decisions are supported by business case analyses that evaluate expected revenue, market share, sales forecasts, pricing strategies, production and marketing costs, and competitive landscape assessments If any aspect of the analysis suggests negative outcomes for the product concept, it is promptly removed from the pipeline Comprehensive information is essential at each business case checkpoint to ensure informed decision-making.

As new and improved information becomes available during product development, the expected revenue and market share are refined for each business case analysis While data is collected, it is the analysis of this data, guided by the methods outlined in this book, that determines whether to advance the concept to the next stage The initial "Go/No Go" decision is based on the information gathered, with similar evaluations occurring at subsequent stages of the process.

Product pricing is a critical two-fold decision that involves choosing a pricing structure, such as uniform pricing or price discrimination, and determining the appropriate price level within that structure These decisions are made throughout the product life cycle, starting from the development stage and continuing until the product is eventually removed from the market Incorrect pricing strategies can lead to significant consequences, including lost profits and market share For more insights on effective pricing strategies, refer to Paczkowski (2018) for an analysis of pricing structures and levels, and Paczkowski (2020) for guidance on pricing during new product development at various stages of the pipeline.

The Role of Information in Business Decision Making

Effective decisions are those that address specific problems and contribute to your business's success in the marketplace A successful business is typically defined by its ability to generate profit and provide positive returns for its stakeholders, including shareholders, partners, employees in employee-owned companies, or sole proprietors.

• the state of the market;

• consumer, social, and technology trends and developments;

• the size of customer churn.

The GIGO Principle (Garbage In–Garbage Out) emphasizes that poor-quality information leads to poor decisions While this concept seems obvious, it is often difficult to determine the quality and sufficiency of the information at the time of decision-making Consequently, decision-makers frequently confront uncertainty stemming from the volume and quality of the available data.

Making decisions without sufficient information leads to costly guesses, as illustrated in Fig 1.1 When lacking data, individuals rely on instinct or past experiences, which may not accurately reflect the current situation This reliance on naive approximations can result in misguided choices, emphasizing the importance of informed decision-making.

The financial repercussions of inaccurate approximations can be substantial, leading to significant losses, diminished market share, or even bankruptcy However, as the volume of available information grows, decision-makers gain deeper insights, thereby enhancing the accuracy of their approximations and reducing associated costs This improvement is evident throughout the business case process, where increased and refined information allows for more informed decisions at each stage Consequently, approximations can be derived from trends, statistically significant impact estimates, or model-based what-if analyses, transforming raw data into valuable information.

The cost curve depicted in Fig 1.1 demonstrates the relationship between decision-making costs and the level of information available As information increases, the costs associated with decisions tend to decrease The Base Approximation Cost represents the minimum achievable cost, which remains above zero due to inherent uncertainties in decision-making.

According to Adriaans (2019), there exists an inverse relationship between the amount of information available and the level of uncertainty encountered While Adriaans describes this relationship as linear, it is argued that uncertainty does not decrease to zero, as perfect knowledge is unattainable Consequently, the costs associated with decision-making will never reach zero; instead, they will asymptotically approach a positive lower limit, indicating that some uncertainty will always remain This nonlinear relationship is illustrated in the cost curve presented in Fig 1.1, which demonstrates that although increased information reduces uncertainty, the costs of potential errors will decline but never completely vanish.

Uncertainty is an inherent aspect of life, stemming from our limited knowledge about various situations, whether spatial—such as current events in Congress—or temporal, like predicting sales for the upcoming year This lack of understanding pertains to the state of the world (SOW), highlighting the challenges businesses face in navigating unpredictable environments Renowned business texts, including those by Freund and Williams (1969) and Spurr and Bonini, delve into the implications of this uncertainty for decision-making and strategy.

In their discussions, Hildebrand et al (2005) emphasize the importance of assigning probabilities to various states of the world (SOWs) to make informed predictions before outcomes occur Although the method for assigning numeric values to these SOWs is not clearly defined, these values represent potential payoffs By utilizing the assigned probabilities and corresponding payoffs, one can calculate an expected average payoff across all possible SOWs For instance, when evaluating the return on investment (ROI) for a capital expansion project, the ROI may hinge on the projected average annual growth of real GDP over the next five years This growth can be categorized as declining, flat, slow, or robust, with respective probabilities of 0.05, 0.20, 0.50, and 0.25, creating a probability distribution where the sum of probabilities equals 1.0 for the four possible states.

SOWs, probabilities, andROIvalues in Table1.1 The expectedROIis4 i = 1 p i ×

ROI i =2.15% This is the amount expected to be earned on average over the next

SOW Real GDP growth Probability ROI

Table 1.1 For the three SOW s shown here, the expected ROI is 3 i = 1 ROI i × p i = 0.0215 or 2.15%

Savage (1972) emphasizes that the term "world" in the phrase "state of the world" is context-specific and should not be interpreted literally It represents a dynamic concept that reflects the particular issues or concerns relevant to an individual.

The concept of the "state" of the world refers to a comprehensive description of its conditions, as highlighted by Savage (1972), who emphasizes that it encompasses all relevant aspects However, he also points out the existence of a true state—a reality that remains unknown until it is revealed Consequently, the best approach for decision-making is to assign probabilities to various potential states, as illustrated in Table 1.1 The core issue lies in the fact that this true state will always remain elusive, as no amount of information can fully disclose it prior to its actual occurrence.

To anticipate future events, you can create a model that captures some features of the anticipated world, acknowledging that it won't encompass all complexities This candidate model can be trained using a dataset reflecting actual outcomes, such as the creditworthiness of customers, to enhance its predictive accuracy After training, you can evaluate the model's performance with a separate dataset to ensure reliability Once confident in its predictive capabilities, the model can be implemented to assess credit extension decisions for your customer base Further details on developing training and testing datasets, as well as training and utilizing models for predictions, are discussed in subsequent chapters.

Although Table1.1is a good textbook example for an introduction to expected values, it has several problems These are the identification of:

The origins of probability distributions in business contexts raise important questions, particularly regarding metrics like SOW (Share of Wallet) definitions and ROI (Return on Investment) values, which may be based on historical business cycles in the U.S While these metrics can be justified through past performance, the source of the probability distributions remains unclear This uncertainty is evident in credit assessments, where customers are rated on their likelihood to default—categories such as "Very Likely," "Somewhat Likely," and "Not at all Likely" do not reveal the actual probabilities until credit is extended Understanding where these probability distributions originate is crucial for accurately determining the expected value of payments.

Probabilities can be categorized into frequency-based and subjective types Frequency-based probabilities are obtained through repeated experiments, such as flipping a fair coin, while subjective probabilities stem from personal judgment and can be contentious Unlike experimental results, which are generally reliable when conducted correctly, future real GDP growth and default ratings cannot be determined through experimentation, making their probabilities inherently subjective However, for assessing default risks, an alternative approach exists: customers can be classified based on their credit history and current financial status.

FICO® Credit Scores, along with sales force input, can be utilized to develop a classification model that assigns ratings to customers This model predicts the probability of default, generating the necessary probability distribution In classification problems, these probabilities are known as propensities, and class assignments are determined based on these propensities Further details on classification modeling and probability assignments are discussed in Chapter 11.

Knight (1921) differentiated between risk and uncertainty by emphasizing the lack of knowledge regarding the subjective probability distribution of states of the world (SOWs) According to Knight, true uncertainty can never be fully known Zeckhauser (2006) elaborates on Knight's perspective, while Arrow notes that initial probabilities assigned to SOWs can be considered "flimsy" or "flat," indicating equal chances for outcomes such as real GDP growth or credit defaults These initial probabilities serve as priors, which are updated as new data—both news and numerical—becomes available, allowing for the formation of new probability distributions This process enables better decision-making with reduced uncertainty and lower costs, as decisions rely on the information derived from data rather than the data itself Although uncertainty can be diminished through increased knowledge of probability distributions, it can never be entirely eliminated.

Business analytics: data science for business problems

Types of Business Problems

The Role of Information in Business Decision Making

The Data-Information Nexus

Analytics Requirements

Data Dimensions: A Taxonomy for Defining Data

Data Organization

Data Dictionary

Importing Your Data

Querying a DataFrame

Issues Complicating Data Visualization

Visualizing Spatial Data

Visualizing Temporal (Time Series) Data

Appendix

Transformations

Encoding

Appendix

Intermediate Analytics

Advanced Analytics