Maclennan c01.tex V2 - 10/04/2008 1:59am Page 3
Introduction to Data Mining in SQL Server 2008
Figure 1-1 Student table
In contrast, the data mining approach for this problem is almost the reverse
of the query-and-explore method. Instead of guessing a hypothesis and trying
it out in different ways, you ask the question in terms of the data that can
support many hypotheses, and allow your data mining system to explore them
for you.
In this case, you indicate that the columns IQ, Gender, ParentIncome, and
ParentEncouragement are to be used as hypotheses in determining
CollegePlans. As the data mining system passes over the data, it analyzes the
influence of each input column on the target column.
Figure 1-2 shows the hypothetical result of a decision tree algorithm operat-
ing on this data set. In this case, each path from the root node to the leaf node
forms a rule about the data. Looking at this tree, you see that students with IQs
greater than 100 and who are encouraged by their parents are highly likely to
attend college. In this case, you have extracted knowledge from the data.
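One common way to measure the influence of an input column on a target column is information gain, which many decision tree algorithms use to choose splits. The following is a minimal sketch, not the method SQL Server uses internally. The rows are invented for illustration, IQ is pre-binned into "high" and "low", and the function names are hypothetical.

```python
from collections import Counter
from math import log2

# Hypothetical rows in the spirit of the Student table; all values are invented.
COLUMNS = ("IQ", "ParentEncouragement", "Gender", "CollegePlans")
RAW_ROWS = [
    ("high", "Encouraged", "Male", "Yes"),
    ("high", "Encouraged", "Female", "Yes"),
    ("high", "Not Encouraged", "Male", "Yes"),
    ("high", "Not Encouraged", "Male", "No"),
    ("low", "Encouraged", "Female", "Yes"),
    ("low", "Encouraged", "Female", "No"),
    ("low", "Not Encouraged", "Male", "No"),
    ("low", "Not Encouraged", "Female", "No"),
]
students = [dict(zip(COLUMNS, row)) for row in RAW_ROWS]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column, target="CollegePlans"):
    """Reduction in target entropy obtained by splitting the rows on one column."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[column] for row in rows}:
        subset = [row[target] for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


for column in ("IQ", "ParentEncouragement", "Gender"):
    print(column, round(information_gain(students, column), 4))
```

In this invented sample, IQ and ParentEncouragement carry some information about CollegePlans while Gender carries none, so a tree builder would prefer to split on one of the first two.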
As shown here, data mining applies algorithms such as decision trees,
clustering, association, time series, and so on to a data set, and then analyzes
its contents. This analysis produces patterns, which can be explored for
valuable information. Depending on the underlying algorithm, these patterns
can be in the form of trees, rules, clusters, or simply a set of mathematical
formulas. The information found in the patterns can be used for reporting (to
guide marketing strategies, for instance) and for prediction. For example, if
you could collect data about undecided students, you could select those who
are likely to be interested in continued education and preemptively market to
that audience.
[Decision tree diagram: each node shows Attend College percentages (55% Yes / 45% No, 79% / 21%, 35% / 65%, 94% / 6%, and 69% / 31%), with splits on IQ > 100 versus IQ ≤ 100 and on Encouragement = Encouraged versus Not Encouraged.]
Figure 1-2 Decision tree
Business Problems for Data Mining
Data mining techniques can be used in virtually all business applications,
answering various types of business questions. In truth, given the software
available today, all you need is the motivation and the know-how. In general,
data mining can be applied whenever something could be known, but is not.
The following examples describe some scenarios:
Recommendation generation — What products or services should you
offer to your customers? Generating recommendations is an important
business challenge for retailers and service providers. Customers who
are provided appropriate and timely recommendations are likely to be
more valuable (because they purchase more) and more loyal (because
they feel a stronger relationship to the vendor). For example, if you go to
online stores such as Amazon.com or Barnesandnoble.com to purchase
an item, you are provided with recommendations about other items
you may be interested in. These recommendations are derived from
using data mining to analyze the purchase behavior of all of the retailer’s
customers, and applying the derived rules to your personal information.
Anomaly detection — How do you know whether your data is ‘‘good’’
or not? Data mining can analyze your data and pick out those items that
don’t fit with the rest. Credit card companies use data mining–driven
anomaly detection to determine if a particular transaction is valid. If
the data mining system flags the transaction as anomalous, you get a
call to see if it was really you who used your card. Insurance companies
also use anomaly detection to determine if claims are fraudulent.
Because these companies process thousands of claims a day, it is impossible
to investigate each case, and data mining can identify which claims
are likely to be false. Anomaly detection can even be used to validate
data entry — checking to see if the data entered is correct at the point
of entry.
Churn analysis — Which customers are most likely to switch to a com-
petitor? The telecom, banking, and insurance industries face severe com-
petition. On average, obtaining a single new mobile phone subscriber
costs more than $200. Every business would like to retain as many cus-
tomers as possible. Churn analysis can help marketing managers identify
the customers who are likely to leave and why, and as a result, they can
improve customer relations and retain customers.
Risk management — Should a loan be approved for a particular cus-
tomer? Since the subprime mortgage meltdown, this is the single most
common question in banking. Data mining techniques are used to determine
the risk of a loan application, helping the loan officer make appropriate
decisions on the cost and validity of each application.
Customer segmentation — How do you think of your customers? Are
your customers the indescribable masses, or can you learn more about
your customers to have a more intimate and appropriate discussion with
them? Customer segmentation determines the behavioral and descriptive
profiles for your customers. These profiles are then used to provide
personalized marketing programs and strategies that are appropriate for
each group.
Targeted ads — Web retailers or portal sites like to personalize their
content for their Web customers. Using navigation or online purchase
patterns, these sites can use data mining solutions to display targeted
advertisements to their Web navigators.
Forecasting — How many cases of wine will you sell next week in this
store? What will the inventory level be in one month? Data mining
forecasting techniques can be used to answer these types of time-related
questions.
Data Mining Tasks
For each question that can be asked of a data mining system, there are many
tasks that may be applied. In some cases, an answer will become obvious
with the application of a single task. In others, you will explore and combine
multiple tasks to arrive at a solution. The following sections describe the
general data mining tasks.
Classification
Classification is the most common data mining task. Business problems such
as churn analysis, risk management, and targeted advertising usually involve
classification.
Classification is the act of assigning a category to each case. Each case
contains a set of attributes, one of which is the class attribute. The task requires
finding a model that describes the class attribute as a function of input
attributes. In the College Plans data set shown in Figure 1-1, the class is the
CollegePlans attribute with two states: Yes and No. A classification model will
use the other attributes of a case (the input attributes) to determine patterns
about the class (the output attribute). Data mining algorithms that require a
target to learn against are considered supervised algorithms.
Typical classification algorithms include decision trees, neural networks, and
Naïve Bayes.
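To make the idea of describing the class attribute as a function of the input attributes concrete, here is a minimal sketch of a Naïve Bayes classifier for categorical inputs, written in plain Python. It is an illustration under stated assumptions, not the Microsoft Naïve Bayes implementation: the training rows are invented, the function names are hypothetical, and add-one (Laplace) smoothing is an assumed choice.

```python
from collections import Counter, defaultdict

TARGET = "CollegePlans"

# Invented training cases; each case has input attributes and the class attribute.
training_cases = [
    {"IQ": "high", "ParentEncouragement": "Encouraged", TARGET: "Yes"},
    {"IQ": "high", "ParentEncouragement": "Encouraged", TARGET: "Yes"},
    {"IQ": "high", "ParentEncouragement": "Not Encouraged", TARGET: "Yes"},
    {"IQ": "high", "ParentEncouragement": "Not Encouraged", TARGET: "No"},
    {"IQ": "low", "ParentEncouragement": "Encouraged", TARGET: "Yes"},
    {"IQ": "low", "ParentEncouragement": "Encouraged", TARGET: "No"},
    {"IQ": "low", "ParentEncouragement": "Not Encouraged", TARGET: "No"},
    {"IQ": "low", "ParentEncouragement": "Not Encouraged", TARGET: "No"},
]


def train_naive_bayes(cases, target):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(case[target] for case in cases)
    value_counts = defaultdict(Counter)  # (column, class) -> value counts
    values_seen = defaultdict(set)       # column -> distinct values
    for case in cases:
        for column, value in case.items():
            if column == target:
                continue
            value_counts[(column, case[target])][value] += 1
            values_seen[column].add(value)
    return class_counts, value_counts, values_seen


def predict(model, case):
    """Return the normalized probability of each class for one new case."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # prior probability of the class
        for column, value in case.items():
            count = value_counts[(column, cls)][value]
            # Add-one smoothing keeps unseen values from zeroing the score.
            score *= (count + 1) / (n + len(values_seen[column]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: score / norm for cls, score in scores.items()}


model = train_naive_bayes(training_cases, TARGET)
print(predict(model, {"IQ": "high", "ParentEncouragement": "Encouraged"}))
```

Because the model learns against a known class attribute, this is a supervised algorithm in the sense described above.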
Clustering
Clustering is also called segmentation. It is used to identify natural groupings of
cases based on a set of attributes. Cases within the same group have more or
less similar attribute values.
Figure 1-3 shows a very simple customer data set containing two attributes:
Age and Income. The clustering algorithm groups the data set into three seg-
ments based on these two attributes. Cluster 1 contains a younger population
with low income. Cluster 2 contains middle-aged customers with higher income.
Cluster 3 is a group of older individuals with a relatively low income.
Clustering is an unsupervised data mining task. There is no single attribute
used to guide the training process, so all input attributes are treated equally.
Most clustering algorithms build the model through a number of iterations,
and stop when the model converges (that is, the boundaries of these segments
are stabilized).
[Scatter plot with axes labeled Age and Income, showing three groups labeled Cluster 1, Cluster 2, and Cluster 3.]
Figure 1-3 Clustering
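The iterate-until-convergence behavior described above can be sketched with k-means, one widely used clustering technique. This is an illustrative example only, not necessarily the algorithm behind the figure. The customer points are invented, income is expressed in thousands so that both attributes have a comparable scale, and the starting centroids are chosen by hand.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iter=100):
    """Assign points to the nearest centroid and recompute until stable."""
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(point, centroids[i]),
            )
            clusters[nearest].append(point)
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: boundaries are stable
            break
        centroids = new_centroids
    return centroids, clusters


# Invented (Age, Income in thousands) pairs for illustration.
customers = [
    (22, 20), (25, 24), (27, 22),   # younger, low income
    (40, 80), (45, 90), (43, 85),   # middle-aged, higher income
    (65, 30), (70, 28), (68, 32),   # older, relatively low income
]
initial = [customers[0], customers[3], customers[6]]  # hand-picked seeds
centroids, clusters = kmeans(customers, initial)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

No attribute is singled out as a target here, which is what makes the task unsupervised. With real data, attributes on very different scales usually need to be normalized first, and the result can depend on the starting centroids.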
Association
Association is also called market basket analysis. A typical association business
problem is to analyze a sales transaction table and identify those products
often found in the same shopping basket. The common usage of association is to
identify common sets of items and rules for the purpose of cross-selling, as
shown in Figure 1-4.
[Diagram of associated products: Cheese, Wine, Milk, Cake, Beer, Coke, Pepsi, Juice, Beef, Donut.]
Figure 1-4 Product association
In terms of association, each piece of information is considered an item.
The association task has two goals: to find those items that appear together
frequently, and from that, to determine rules about the associations.
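The two goals of the association task can be sketched in a few lines: count how often pairs of items appear together (support), then derive rules whose confidence passes a threshold. This is a simplified illustration limited to item pairs, not a full association algorithm. The baskets are invented, and the thresholds and function names are assumptions.

```python
from collections import Counter
from itertools import combinations

# Invented shopping baskets using product names from the figure.
baskets = [
    {"Cheese", "Wine"},
    {"Cheese", "Wine", "Beef"},
    {"Milk", "Cake"},
    {"Milk", "Cake", "Donut"},
    {"Cheese", "Wine", "Milk"},
    {"Beer", "Coke"},
    {"Milk", "Juice"},
    {"Wine", "Beef"},
]


def frequent_pairs(baskets, min_support):
    """Return {pair: count} for item pairs found in enough baskets."""
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1
    threshold = min_support * len(baskets)
    return {pair: n for pair, n in pair_counts.items() if n >= threshold}


def association_rules(baskets, pairs, min_confidence):
    """Turn frequent pairs into (lhs, rhs, confidence) rules."""
    item_counts = Counter(item for basket in baskets for item in basket)
    rules = []
    for (a, b), together in pairs.items():
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence: share of baskets with lhs that also contain rhs.
            confidence = together / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, confidence))
    return rules


pairs = frequent_pairs(baskets, min_support=0.25)
rules = association_rules(baskets, pairs, min_confidence=0.8)
for lhs, rhs, confidence in rules:
    print(f"{lhs} -> {rhs} (confidence {confidence:.2f})")
```

A rule such as Cheese -> Wine is the kind of result a retailer could use for cross-selling.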
Regression
The regression task is similar to classification, except that instead of looking for
patterns that describe a class, the goal is to find patterns to determine a numerical
value. Simple linear line-fitting techniques are an example of regression, where
the result is a function to determine the output based on the values of the
inputs. More advanced forms of regression support categorical inputs as well
as numerical inputs. The most popular techniques used for regression are
linear regression and logistic regression. Other techniques supported by SQL
Server Data Mining are regression trees (part of the Microsoft Decision Trees
algorithm) and neural networks.
Regression is used to solve many business problems — for example, to
predict a coupon redemption rate based on the face value, distribution method,
distribution volume, and season, or to predict wind velocities based on
temperature, air pressure, and humidity.
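As a small illustration of line fitting, the sketch below fits a least-squares line that predicts a coupon redemption rate from the face value alone. The data points are invented and deliberately simple, and a real model would also use the other inputs mentioned above, such as distribution method and season.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data: coupon face value in dollars, redemption rate in percent.
face_values = [0.5, 1.0, 1.5, 2.0]
redemption_rates = [2.0, 3.0, 4.0, 5.0]

slope, intercept = fit_line(face_values, redemption_rates)


def predict_rate(face_value):
    """Predicted redemption rate (percent) for a given face value."""
    return slope * face_value + intercept


print(slope, intercept, predict_rate(1.25))
```

The result of the fit is a function of the input, which is exactly what distinguishes regression from classification: the output is a number rather than a class.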
Forecasting
Forecasting is yet another important data mining task. What will the stock
value of Microsoft Corporation (NASDAQ symbol MSFT) be tomorrow? What
will the sales amount of wine be next month? Forecasting can help answer
these questions. As input, it takes sequences of numbers indicating a series
of values through time, and then it imputes future values of those series
using a variety of machine-learning and statistical techniques that deal with
seasonality, trending, and noisiness of data.
Figure 1-5 shows two curves. The solid line curve is the actual time-series
data on Microsoft stock value, and the dotted curve is a time-series model that
predicts values based on past values.
[Line chart: MSFT 3-year price history, with a vertical axis running from 20 to 38.]
Figure 1-5 Time series
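A very small example of predicting a next value from past values is simple exponential smoothing. It is used here only as an easy-to-read stand-in. It does not handle the trend or seasonality mentioned above and is not the technique behind the figure. The price series and the smoothing factor are invented.

```python
def exponential_smoothing_forecast(series, alpha=0.5):
    """One-step-ahead forecast: a weighted average favoring recent values."""
    level = series[0]
    for value in series[1:]:
        # Blend the newest observation with the running level.
        level = alpha * value + (1 - alpha) * level
    return level


# Invented daily closing prices for illustration.
prices = [24.0, 26.0, 25.0, 27.0]
next_price = exponential_smoothing_forecast(prices, alpha=0.5)
print(next_price)
```

A larger alpha makes the forecast react faster to recent changes, while a smaller alpha smooths out more of the noise.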
Sequence Analysis
Sequence analysis is used to find patterns in a series of events called a sequence.
For example, a DNA sequence is a long series composed of four different states:
A, G, C, and T. A click sequence on the Web contains a series of URLs. In certain
circumstances, you may model customer purchases as a sequence of data. For
example, a customer first buys a computer, and then buys speakers, and
finally buys a webcam. Both sequence and time-series data are similar in that
they contain adjacent observations that are order-dependent. The difference is
that where a time series contains numerical data, a sequence series contains
discrete states.
Figure 1-6 shows Web click sequences from a news website. Each node
is a URL category, and the lines represent transitions between them. Each
transition is associated with a weight, representing the probability of the
transition between one URL and another.
[Graph of URL categories (Home Page, Business, News, Sport, Weather, Science) connected by transitions with weights between 0.1 and 0.4.]
Figure 1-6 Web navigation sequence
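Transition weights like those in the figure can be estimated by counting how often one state follows another. The sketch below builds a first-order transition table from click sequences. The sessions are invented, and treating the sequence as a first-order Markov chain is an assumption made for illustration.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next state | current state) from observed sequences."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        # Pair each state with the state that immediately follows it.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for state, nexts in counts.items()
    }


# Invented click sessions over URL categories.
sessions = [
    ["Home Page", "News", "Sport"],
    ["Home Page", "News", "Weather"],
    ["Home Page", "Business"],
    ["Home Page", "News", "Sport"],
]
probabilities = transition_probabilities(sessions)
for state, nexts in probabilities.items():
    print(state, nexts)
```

Each state here is a discrete category rather than a number, which is the distinction the text draws between sequence data and time-series data.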
Deviation Analysis
Deviation analysis is used to find rare cases that behave very differently from
the norm. Deviation analysis is widely applicable, the most common usage
being credit card fraud detection. Identifying abnormal cases among millions
of transactions is a very challenging task. Other applications include network
intrusion detection, manufacturing error analysis, and so on.
There is no standard technique for deviation analysis. Usually, analysts
apply decision trees, clustering, or neural network algorithms for this task.
Data Mining Project Cycle
From the initial business problem formation through to deployment and
sustained management, most data mining projects pass through the same
phases.
Business Problem Formation
What are the problems you are trying to solve? What techniques are you going
to apply to solve the problem? How do you know if you will be successful?
These are important questions to ask before embarking on any project.
You may find that a simple OLAP, reporting, or data integration solution
is sufficient. A predictive or data mining solution involves determining
the unknown, relying on a belief that making sense of that unknown will add
value. This is a shaky precipice from which to begin any business endeavor.
Luckily, successful data mining solutions have been shown to have an average
150-percent return on investment (ROI), which makes justification easier.
Data Collection
Business data is stored in many systems across an enterprise. For example,
at Microsoft, there are hundreds of online transaction processing (OLTP)
databases and more than 70 data warehouses. The first step is to pull the
relevant data into a database or a data mart where the data analysis is applied.
For example, if you want to analyze your website’s click stream, the first step
is to download the log data from your web servers.
Sometimes you might be lucky and find that there is already an existing
data warehouse on the subject of your analysis. However, in many cases, the
data in the data warehouse is not rich enough and must be supplemented with
additional data. For example, the log data from the web servers contains only
data about web behavior and little (if any) data about the customers. You may
need to gather customer information from other company systems or purchase
demographic data to build models that meet your business requirements.
Data Cleaning and Transformation
Data cleaning and transformation are the most resource-consuming steps in
a data mining project. The purpose of data cleaning is to remove noise and
irrelevant information from the data set. The purpose of data transformation is
to modify the source data in ways that make it useful for mining.
Various techniques are applied to clean and transform data, including the
following:
Numerical transformation — For continuous data such as income and
age, a typical transformation is to bin (or discretize) the data into buckets.
For example, you may want to bin Age into five predefined age groups.
SQL Server Data Mining has automatic discretization methods, but if
you have meaningful groupings, they may be more informative both
from a business sense and an algorithmic sense. Additionally, continuous
data is often normalized. Normalization maps all numerical values to
a range (such as between 0 and 1) or to have a specific standard deviation
(such as 1).
Grouping — Discrete data often has more distinct values than are use-
ful. You can group these values to reduce the model complexity. For
example, the column Profession may have many different types of engineers,
such as Software Engineer, Telecom Engineer, Mechanical Engineer, and so on.
You can group all of these professions into the single value Engineer.
Aggregation — Aggregation is an important transformation to derive
additional value from your data. Suppose you want to group customers
based on their phone usage. If the call detail record information is too
detailed for the model, you must aggregate all the calls into a few
derived attributes such as total number of calls and the average call
duration. These derived attributes can later be used in the model.
Missing value handling — Most data sets contain missing values. This
can be caused by many different things. For example, you may have two
customer tables coming from two OLTP databases that, when merged,
have missing values because the tables are not aligned. Another example
occurs when customers don’t supply data values such as age. Another is
when you have stock market values with blanks because the markets are
closed on weekends and holidays.
Addressing missing values is important, because it is reflected in the
business value of your solution. You may need to retain the missing
data (for example, customers who refuse to report their age may have
other interesting things in common). You may need to discard the entire
record (having too many unknowns could pollute your model). Or, you
may simply be able to replace missing values with some other value
(such as the previous value for time-series data such as stock market val-
ues, or the most popular value). For more advanced cases, you can use
data mining to predict the most likely value for each missing case.
Removing outliers — Outliers are abnormal data and can be real or (as
is often the case) errors. Abnormal data has an effect on the quality of
your results. The best way to deal with outliers typically is to simply
remove them before beginning the analysis. For example, you could
remove 0.5 percent of the customers with highest or lowest income to
eliminate any situations of people having negative or extremely unlikely
incomes.
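Several of the transformations in this list are short enough to sketch directly. The helper functions below are hypothetical illustrations in plain Python, not part of SQL Server or SSIS. The bucket edges, the sample values, and the choice to fill a missing value with the previous one are assumptions.

```python
def bin_age(age, edges=(18, 30, 45, 60)):
    """Discretize an age into one of five buckets, numbered 0 to 4."""
    return sum(age >= edge for edge in edges)


def min_max_normalize(values):
    """Map values to the range 0..1 (assumes they are not all equal)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def group_profession(title):
    """Collapse every kind of engineer into the single value Engineer."""
    return "Engineer" if title.endswith("Engineer") else title


def fill_with_previous(values):
    """Replace missing entries (None) with the most recent known value."""
    filled, last = [], None
    for value in values:
        if value is None:
            value = last  # stays None if nothing has been seen yet
        filled.append(value)
        last = value
    return filled


print(bin_age(25), bin_age(70))
print(min_max_normalize([20, 30, 40]))
print(group_profession("Telecom Engineer"), group_profession("Teacher"))
print(fill_with_previous([30.0, None, None, 31.5]))
```

Filling with the previous value suits time-series data such as stock prices over a weekend. For other columns, retaining the missing state or discarding the record may be the better choice, as discussed above.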
SQL Server Integration Services (SSIS), which is included with Microsoft SQL
Server, is an excellent tool for performing data cleaning and transformation
tasks.
Model Building
Model building is the core of data mining, though it is not as time- and
resource-intensive as data transformation. When you understand the shape of
the business problem and the type of data mining task, it is relatively easy to
pick algorithms that are suitable. Usually, you don’t know which algorithm
is the best fit for the problem until you have built the model. The accuracy
of an algorithm depends on the nature of the data. For example, a decision
tree algorithm is usually a very good choice for classification. However,
if the relationships among attributes are complicated, a neural network may
perform better.
A good approach is to build multiple models using different algorithms,
and then compare the accuracy of these models. Even with a single algorithm,
you can tune the parameter settings to optimize the model accuracy.
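Comparing models comes down to scoring each one on cases it has not seen during training. The sketch below compares two deliberately simple classifiers on a held-out set: a baseline that always predicts the majority class, and a one-column rule model. Both models, the invented data, and the function names are stand-ins chosen to keep the example short. They are not the algorithms discussed above.

```python
from collections import Counter

TARGET = "CollegePlans"


def majority_model(train, target):
    """Baseline: always predict the most frequent class in the training set."""
    most_common = Counter(row[target] for row in train).most_common(1)[0][0]
    return lambda case: most_common


def one_rule_model(train, column, target):
    """Predict the most frequent class seen for each value of one column."""
    by_value = {}
    for row in train:
        by_value.setdefault(row[column], Counter())[row[target]] += 1
    table = {value: c.most_common(1)[0][0] for value, c in by_value.items()}
    default = Counter(row[target] for row in train).most_common(1)[0][0]
    return lambda case: table.get(case[column], default)


def accuracy(model, cases, target):
    """Fraction of cases whose predicted class matches the actual class."""
    return sum(model(case) == case[target] for case in cases) / len(cases)


def make_rows(pairs):
    return [{"ParentEncouragement": e, TARGET: plans} for e, plans in pairs]


# Invented training and held-out cases.
train = make_rows([
    ("Encouraged", "Yes"), ("Encouraged", "Yes"), ("Encouraged", "Yes"),
    ("Encouraged", "No"), ("Not Encouraged", "No"), ("Not Encouraged", "No"),
    ("Not Encouraged", "Yes"), ("Encouraged", "Yes"),
])
holdout = make_rows([
    ("Encouraged", "Yes"), ("Not Encouraged", "No"),
    ("Not Encouraged", "No"), ("Encouraged", "No"),
])

models = {
    "majority": majority_model(train, TARGET),
    "one_rule": one_rule_model(train, "ParentEncouragement", TARGET),
}
scores = {name: accuracy(m, holdout, TARGET) for name, m in models.items()}
print(scores)
```

Scoring on held-out cases rather than on the training cases gives a fairer estimate of how each model will behave on new data.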
Model Assessment
In the model assessment stage, you use tools to determine the accuracy of
the models that were created, and you examine the models to determine the
meaning of discovered patterns and how they apply to your business. For
example, a model may determine that Relationship = Husband ⇒ Gender = Male
with 100-percent confidence. Although the rule is valid, it doesn’t contain any
business value. It is very important to work with business analysts who have
the proper domain knowledge to validate the discoveries.
Sometimes, the model doesn’t contain useful patterns. This is generally
because the variables in the model are not the right ones to solve your
business problem. You may need to repeat the data cleaning and transformation
steps, or even redefine your problem in order to derive more meaningful
variables. Data mining is an exploratory process, and it often takes a few
iterations before you find the right model.
Reporting and Prediction
In many organizations, the goal of data miners is to deliver reports to marketing
executives. SQL Server Data Mining is integrated with SQL Server Reporting
Services to generate reports directly from data mining results. Reports may
contain predictions (such as lists of customers with the highest value potential)
or the rules found in the data mining analysis.
To provide predictions, you apply the selected model against new cases of
data. Consider a banking scenario where you build a model about loan risk
prediction. Every day there are thousands of new loan applications. You can