Part V Big Data and Future Directions for Business
5.2 Data Mining Concepts and Applications
In an interview with Computerworld magazine in January 1999, Dr. Arno Penzias (Nobel laureate and former chief scientist of Bell Labs) identified data mining from organizational databases as a key application for corporations of the near future. In response to Computerworld’s age-old question of “What will be the killer applications in the corporation?” Dr. Penzias replied: “Data mining.” He then added, “Data mining will become much more important and companies will throw away nothing about their customers because it will be so valuable. If you’re not doing this, you’re out of business.”
Similarly, in an article in Harvard Business Review, Thomas Davenport (2006) argued that the latest strategic weapon for companies is analytical decision making, providing examples of companies such as Amazon.com, Capital One, Marriott International, and others that have used analytics to better understand their customers and optimize their extended supply chains to maximize their returns on investment while providing the best customer service. This level of success is highly dependent on a company understanding its customers, vendors, business processes, and the extended supply chain very well.
A large portion of “understanding the customer” can come from analyzing the vast amount of data that a company collects. The cost of storing and processing data has decreased dramatically in the recent past, and, as a result, the amount of data stored in electronic form has grown at an explosive rate. With the creation of large databases, the possibility of analyzing the data stored in them has emerged. The term data mining was originally used to describe the process through which previously unknown patterns in data were discovered. Some software vendors have since stretched this definition to include most forms of data analysis, in order to increase sales by capitalizing on the popularity of the data mining label. In this chapter, we accept the original definition of data mining.
Although the term data mining is relatively new, the ideas behind it are not. Many of the techniques used in data mining have their roots in traditional statistical analysis and artificial intelligence work done since the early part of the 1980s. Why, then, has it suddenly gained the attention of the business world? Following are some of the most pronounced reasons:
• More intense competition at the global scale driven by customers’ ever-changing needs and wants in an increasingly saturated marketplace.
• General recognition of the untapped value hidden in large data sources.
• Consolidation and integration of database records, which enables a single view of customers, vendors, transactions, etc.
• Consolidation of databases and other data repositories into a single location in the form of a data warehouse.
• The exponential increase in the capabilities of data processing and storage technologies.
• Significant reduction in the cost of hardware and software for data storage and processing.
• Movement toward the de-massification (conversion of information resources into nonphysical form) of business practices.
Data generated by the Internet is increasing rapidly in both volume and complexity. Large amounts of genomic data are being generated and accumulated all over the world. Disciplines such as astronomy and nuclear physics create huge quantities of data on a regular basis. Medical and pharmaceutical researchers constantly generate and store data that can then be used in data mining applications to identify better ways to accurately diagnose and treat illnesses and to discover new and improved drugs.
On the commercial side, perhaps the most common use of data mining has been in the finance, retail, and healthcare sectors. Data mining is used to detect and reduce fraudulent activities, especially in insurance claims and credit card use (Chan et al., 1999); to identify customer buying patterns (Hoffman, 1999); to reclaim profitable customers (Hoffman, 1998); to identify trading rules from historical data; and to aid in increased profitability using market-basket analysis. Data mining is already widely used to better target clients, and with the widespread development of e-commerce, this can only become more imperative with time. See Application Case 5.1 for information on how Infinity P&C has used predictive analytics and data mining to improve customer service, combat fraud, and increase profit.
Application Case 5.1
Smarter Insurance: Infinity P&C Improves Customer Service and Combats Fraud with Predictive Analytics
Infinity Property & Casualty Corporation, a provider of nonstandard personal automobile insurance with an emphasis on higher-risk drivers, depends on its ability to identify fraudulent claims for sustained profitability. As a result of implementing analytics tools (from IBM SPSS), Infinity P&C has doubled the accuracy of its fraud identification, contributing to a return on investment of 403 percent, according to a Nucleus Research study. And the benefits don’t stop there:
According to Bill Dibble, senior vice president in Claims Operations at Infinity P&C, the use of predictive analytics in serving the company’s legitimate claimants is of equal or even greater importance.
Low-hanging fruit
Initially, Dibble focused the power of predictive analytics (i.e., data mining) to assist the company’s Special Investigative Unit (SIU). “In the early days of SIU, adjusters would use laminated cards with ‘red flags’ to indicate potential fraud. Taking those ‘red flags’ and developing rules seemed like an area of low-hanging fruit where we could quickly demonstrate the benefit of our investment in predictive analytics.”
Dibble then leveraged a successful approach from another part of the business. “We recognized how important credit was in the underwriting arena, and I thought, ‘Let’s score our claims in the same way, to give us an indicator of potential fraud.’ The larger the number we attach to a case, the more apt we are to have a fraud situation. Lower number, get the claim paid.” Dibble notes that fraud represents a $20 billion exposure to the insurance industry and in certain venues could be an element in around 40 percent of claims. “A key benefit of the IBM SPSS system is its ability to continually analyze and score these claims, which helps ensure that we get the claim to the right adjuster at the right time,” he says.
Adds Tony Smarrelli, vice president of National Operations: “Industry reports estimate one out of five claims is pure fraud: either opportunity fraud, where someone exaggerates an injury or vehicle damage, or the hard-core criminal rings that work with unethical clinics and attorneys. Rather than putting all five customers through an investigatory process, SPSS helps us ‘fast-track’ four of them and close their cases within a matter of days. This results in much happier customers, contributes to a more efficient workflow with improved cycle times, and improves retention due to an overall better claims experience.”
An unexpected benefit
Dibble saw subrogation, the process of collecting damages from the at-fault driver’s insurance company, as another piece of low-hanging fruit, and he was right. In the first month of using SPSS, Infinity P&C saw record recovery on paid collision claims, adding about $1 million directly to the company’s bottom line and virtually eliminating the third-party collection fees of more than $70,000 per month that the company was used to paying. What’s more, each of the following 4 or 5 months was even better than the previous one. “I never thought we would recover the money that we’ve recovered with SPSS in the subrogation area,” he says. “That was a real surprise to us. It brought a lot of attention to SPSS within the company, and to the value of predictive analytics in general.”
The rules-based IBM SPSS solution is well suited to Infinity P&C’s business. For example, in states that have no-fault benefits, an insurance company can recover commercial vehicles or vehicles over a certain gross vehicle weight. “We can put a rule in IBM SPSS that if medical expenses are paid on a claim involving this type of vehicle, it is immediately referred to the subrogation department,” explains Dibble. “This is a real-time ability that keeps us from missing potentially valuable subrogation opportunities, which used to happen a lot when we relied solely on adjuster intuition.”
The rules are just as important on the fraud investigation side. Continues Dibble: “If we see an accident that happened around 1:00 a.m. and involved a gas-guzzling GMC Suburban, we need to start looking for fraud. So we dig a little deeper: Is this guy upside-down on his loan, such that he owes more money than the car is worth? Did the accident happen in a remote spot, suggesting that it may have been staged? Does the individual move frequently or list multiple addresses? As these elements are added to the equation, the score keeps building, and the case is more and more likely to be referred to one of our SIU investigators.”
With SPSS, Infinity P&C has reduced SIU referral time from an average of 45–60 days to approximately 1–3 days, which means that investigators can get to work on the case before memories and stories start to change, rental and storage charges mount, and the likelihood of getting an attorney involved increases. The company is also creating a better claim for the SIU to investigate; a higher score correlates to a higher probability of fraud.
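The additive scoring that Dibble describes can be illustrated with a short sketch. This is not Infinity P&C's or IBM SPSS's actual rule set: the rule names, point values, and referral threshold below are hypothetical, chosen only to show how each triggered red flag adds to a claim's score until the claim is either referred to the SIU or fast-tracked.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    accident_hour: int        # hour of day, 0-23
    loan_balance: float       # amount still owed on the vehicle
    vehicle_value: float      # current market value of the vehicle
    remote_location: bool     # accident happened in a remote spot
    addresses_on_file: int    # number of addresses listed by the claimant


# Each rule is (description, predicate, points). All weights are made up.
RULES = [
    ("late-night accident (before 5 a.m.)", lambda c: c.accident_hour < 5, 20),
    ("owes more than the car is worth", lambda c: c.loan_balance > c.vehicle_value, 25),
    ("remote accident location", lambda c: c.remote_location, 15),
    ("multiple addresses on file", lambda c: c.addresses_on_file > 1, 10),
]

REFERRAL_THRESHOLD = 50  # hypothetical cut-off for SIU referral


def score_claim(claim):
    """Return the total score and the descriptions of the rules that fired."""
    fired = [(name, points) for name, predicate, points in RULES if predicate(claim)]
    total = sum(points for _, points in fired)
    return total, [name for name, _ in fired]


def route(claim):
    """Refer high-scoring claims to the SIU; fast-track the rest."""
    total, _ = score_claim(claim)
    return "refer to SIU" if total >= REFERRAL_THRESHOLD else "fast-track payment"


suspicious = Claim(1, 30000.0, 18000.0, True, 3)
routine = Claim(14, 5000.0, 12000.0, False, 1)

print(score_claim(suspicious), route(suspicious))
print(score_claim(routine), route(routine))
```

Keeping the rules in a plain list makes it easy to add or reweight red flags without touching the scoring logic, which mirrors the incremental way the case describes rules being built up.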
Making us smarter
SPSS rules start to score the claim immediately on first notice of loss (FNOL) when the claimant reports the accident. “We have completely revised our FNOL screens to collect more data points,” says Dibble. “SPSS has made us much smarter in asking questions.” Currently SPSS collects data mainly from the company’s claims and policy systems; a future initiative to leverage the product’s text mining capabilities will make the information in claims notes available as well.
Having proven its value in subrogation and SIU, the SPSS solution is poised for expansion within Infinity P&C. “One of our key objectives moving forward will be what we call ‘right scripting,’ where we can script the appropriate questions for call center agents based on the answers they get from the claimant,” says Dibble. “We’ll also be instituting a process to flag claims with high litigation potential. By reviewing past litigation claims, we can identify predictive traits and handle those cases on a priority basis.” Decision management, customer retention, pricing analysis, and dashboards are also potential future applications of SPSS technology.
But at the end of the day, excellent customer service remains the driving force behind Infinity P&C’s use of predictive analytics. Concludes Dibble:
“My goal is to pay the legitimate customer very quickly and get him on his way. People who are more economically challenged need their car; they typically don’t have a spare vehicle. This is the car they use to go back and forth to work, so I want to get them out and on the road without delay. IBM SPSS makes this possible.”
Questions for Discussion
1. How did Infinity P&C improve customer service with data mining?
2. What were the challenges, the proposed solution, and the obtained results?
3. What was their implementation strategy? Why is it important to produce results as early as possible in data mining studies?
Source: public.dhe.ibm.com/common/ssi/ecm/en/ytc03160usen/ytc03160usen.Pdf (accessed January 2013).
Definitions, Characteristics, and Benefits
Simply defined, data mining is a term used to describe discovering or “mining” knowledge from large amounts of data. When considered by analogy, one can easily realize that the term data mining is a misnomer; that is, mining of gold from within rocks or dirt is referred to as “gold” mining rather than “rock” or “dirt” mining. Therefore, data mining perhaps should have been named “knowledge mining” or “knowledge discovery.” Despite the mismatch between the term and its meaning, data mining has become the choice of the community. Other names associated with data mining include knowledge extraction, pattern analysis, data archaeology, information harvesting, pattern searching, and data dredging.
Technically speaking, data mining is a process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge (or patterns) from large sets of data. These patterns can be in the form of business rules, affinities, correlations, trends, or prediction models (see Nemati and Barko, 2001). Most literature defines data mining as “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases,” where the data are organized in records structured by categorical, ordinal, and continuous variables (Fayyad et al., 1996). In this definition, the meanings of the key terms are as follows:
• Process implies that data mining comprises many iterative steps.
• Nontrivial means that some experimentation-type search or inference is involved; that is, it is not as straightforward as a computation of predefined quantities.
• Valid means that the discovered patterns should hold true on new data with a sufficient degree of certainty.
• Novel means that the patterns are not previously known to the user within the context of the system being analyzed.
• Potentially useful means that the discovered patterns should lead to some benefit to the user or task.
• Ultimately understandable means that the pattern should make business sense, leading the user to say, “Hmm! It makes sense; why didn’t I think of that?” if not immediately, then at least after some postprocessing.
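The requirement that a pattern be valid, meaning that it holds on new data, is commonly checked by discovering the pattern on one set of records and measuring it on records that were held back. The following is a minimal sketch under stated assumptions: the claim amounts, fraud labels, and the single-threshold rule are invented for illustration and are not drawn from any study cited here.

```python
def rule_accuracy(records, threshold):
    """Share of records where the rule 'amount > threshold means fraud' is right."""
    hits = sum((amount > threshold) == is_fraud for amount, is_fraud in records)
    return hits / len(records)


def best_threshold(records):
    """Pick the observed amount that gives the highest accuracy on these records."""
    candidates = sorted({amount for amount, _ in records})
    return max(candidates, key=lambda t: rule_accuracy(records, t))


# (claim amount, was it fraud?) pairs; all values are made up.
train = [(120, False), (300, False), (450, False),
         (900, True), (1500, True), (2000, True)]
holdout = [(200, False), (700, False), (1100, True), (1800, True)]

threshold = best_threshold(train)
train_acc = rule_accuracy(train, threshold)
holdout_acc = rule_accuracy(holdout, threshold)

print(f"rule: amount > {threshold} suggests fraud")
print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on held-out data: {holdout_acc:.2f}")
```

The rule fits the training records perfectly but is less accurate on the held-out records, which is why validity is judged on data that played no part in discovering the pattern.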
Data mining is not a new discipline, but rather a new definition for the use of many disciplines. Data mining is tightly positioned at the intersection of many disciplines, including statistics, artificial intelligence, machine learning, management science, information systems, and databases (see Figure 5.1). Using advances in all of these disciplines, data mining strives to make progress in extracting useful information and knowledge from large databases. It is an emerging field that has attracted much attention in a very short time.
The following are the major characteristics and objectives of data mining:
• Data are often buried deep within very large databases, which sometimes contain data from several years. In many cases, the data are cleansed and consolidated into a data warehouse. Data may be presented in a variety of formats (see Technology Insights 5.1 for a brief taxonomy of data).
Figure 5.1 Data Mining as a Blend of Multiple Disciplines. [Diagram: data mining at the intersection of statistics, artificial intelligence, machine learning, pattern recognition, mathematical modeling, databases, and management science and information systems.]
• The data mining environment is usually a client/server architecture or a Web-based information systems architecture.
• Sophisticated new tools, including advanced visualization tools, help to remove the information ore buried in corporate files or archival public records. Finding it involves massaging and synchronizing the data to get the right results. Cutting-edge data miners are also exploring the usefulness of soft data (i.e., unstructured text stored in such places as Lotus Notes databases, text files on the Internet, or enterprise-wide intranets).
• The miner is often an end user, empowered by data drills and other power query tools to ask ad hoc questions and obtain answers quickly, with little or no programming skill.
• Striking it rich often involves finding an unexpected result and requires end users to think creatively throughout the process, including the interpretation of the findings.
• Data mining tools are readily combined with spreadsheets and other software development tools. Thus, the mined data can be analyzed and deployed quickly and easily.
• Because of the large amounts of data and massive search efforts, it is sometimes necessary to use parallel processing for data mining.
A company that effectively leverages data mining tools and technologies can acquire and maintain a strategic competitive advantage. Data mining offers organizations an indispensable decision-enhancing environment to exploit new opportunities by transforming data into a strategic weapon. See Nemati and Barko (2001) for a more detailed discussion on the strategic benefits of data mining.
Technology Insights 5.1 A Simple Taxonomy of Data
Data refers to a collection of facts usually obtained as the result of experiences, observations, or experiments. Data may consist of numbers, letters, words, images, voice recordings, and so on as measurements of a set of variables. Data are often viewed as the lowest level of abstraction from which information and then knowledge is derived.
At the highest level of abstraction, one can classify data as structured and unstructured (or semistructured). Unstructured/semistructured data is composed of any combination of textual, imagery, voice, and Web content. Unstructured/semistructured data will be covered in more detail in the text mining and Web mining chapters (see Chapters 7 and 8). Structured data is what data mining algorithms use, and can be classified as categorical or numeric. The categorical data can be subdivided into nominal or ordinal data, whereas numeric data can be subdivided into interval or ratio. Figure 5.2 shows a simple taxonomy of data.
• Categorical data represent the labels of multiple classes used to divide a variable into specific groups. Examples of categorical variables include race, sex, age group, and educational level. Although the latter two variables may also be considered in a numerical manner by using exact values for age and highest grade completed, it is often more informative to categorize such variables into a relatively small number of ordered classes.
The categorical data may also be called discrete data, implying that it represents a finite number of values with no continuum between them. Even if the values used for the categorical (or discrete) variables are numeric, these numbers are nothing more than symbols and do not imply the possibility of calculating fractional values.
• Nominal data contain simple codes assigned to objects as labels; these codes are not measurements. For example, the variable marital status can be generally categorized as (1) single, (2) married, and (3) divorced. Nominal data can be represented with binomial values having two possible values (e.g., yes/no, true/false, good/bad), or multinomial values having three or more possible values (e.g., brown/green/blue, white/black/Latino/Asian, single/married/divorced).
Figure 5.2 A Simple Taxonomy of Data in Data Mining. [Tree: Data divides into Structured (Categorical: Nominal, Ordinal; Numerical: Interval, Ratio) and Unstructured or Semi-structured (Textual; Multimedia: Audio, Image/Video; HTML/XML).]
• Ordinal data contain codes assigned to objects or events as labels that also represent the rank order among them. For example, the variable credit score can be generally categorized as (1) low, (2) medium, or (3) high. Similar ordered relationships can be seen in variables such as age group (i.e., child, young, middle-aged, elderly) and educational level (i.e., high school, college, graduate school). Some data mining algorithms, such as ordinal multiple logistic regression, take into account this additional rank-order information to build a better classification model.
• Numeric data represent the numeric values of specific variables. Examples of numerically valued variables include age, number of children, total household income (in U.S. dollars), travel distance (in miles), and temperature (in Fahrenheit degrees). Numeric values representing a variable can be integer (taking only whole numbers) or real (also taking fractional values). Numeric data may also be called continuous data, implying that the variable contains continuous measures on a specific scale that allows insertion of interim values. Unlike a discrete variable, which represents finite, countable data, a continuous variable represents scalable measurements, and it is possible for the data to contain an infinite number of fractional values.
• Interval data are variables that can be measured on interval scales. A common example of interval scale measurement is temperature on the Celsius scale. In this particular scale, the unit of measurement is 1/100 of the difference between the melting temperature and the boiling temperature of water at atmospheric pressure; that is, the scale has no absolute zero value.
• Ratio data include measurement variables commonly found in the physical sciences and engineering. Mass, length, time, plane angle, energy, and electric charge are examples of physical measures that are ratio scales. The scale type takes its name from the fact that measurement is the estimation of the ratio between a magnitude of a continuous quantity and a unit magnitude of the same kind. Informally, the distinguishing feature of a ratio scale is the possession of a nonarbitrary zero value. For example, the Kelvin temperature scale has a nonarbitrary zero point of absolute zero, which is equal to –273.15 degrees Celsius. This zero point is nonarbitrary, because the particles that comprise matter at this temperature have zero kinetic energy.
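The practical difference between interval and ratio scales can be seen by comparing two temperatures. In the sketch below (the two readings are arbitrary illustrative values), differences are meaningful on both the Celsius and Kelvin scales, but a statement such as "twice as warm" only makes sense on the Kelvin scale, whose zero point is nonarbitrary.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15


a_c, b_c = 10.0, 20.0  # two arbitrary readings in degrees Celsius

# Differences agree on both scales, so subtraction is valid for interval data.
diff_c = b_c - a_c
diff_k = celsius_to_kelvin(b_c) - celsius_to_kelvin(a_c)

# Ratios depend on where zero sits, so they are only meaningful in Kelvin.
naive_ratio = b_c / a_c                                          # misleading
kelvin_ratio = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)   # meaningful

print(f"difference: {diff_c:.2f} C, {diff_k:.2f} K")
print(f"ratio computed in Celsius: {naive_ratio:.3f}")
print(f"ratio computed in Kelvin:  {kelvin_ratio:.3f}")
```

The Celsius ratio suggests the second reading is twice the first, while the Kelvin ratio shows the two differ by only a few percent.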
Other data types, including textual, spatial, imagery, and voice, need to be converted into some form of categorical or numeric representation before they can be processed by data mining algorithms. Data can also be classified as static or dynamic (i.e., temporal or time-series).
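As a rough illustration of preparing categorical data for a numeric-only algorithm, the sketch below encodes a nominal variable as indicator (one-hot) columns, so that no false ordering is implied, and an ordinal variable as rank codes, so that the order is preserved. The record, the helper functions, and the choice of encodings are assumptions made for this example; the category lists reuse the marital status and credit score examples given above.

```python
def one_hot(value, categories):
    """Encode a nominal value as 0/1 indicators, one per category."""
    return [1 if value == category else 0 for category in categories]


def ordinal_code(value, ordered_levels):
    """Encode an ordinal value as its 1-based rank in the ordered levels."""
    return ordered_levels.index(value) + 1


MARITAL_STATUS = ["single", "married", "divorced"]  # nominal: no inherent order
CREDIT_SCORE = ["low", "medium", "high"]            # ordinal: low < medium < high

# A hypothetical customer record mixing nominal, ordinal, and numeric fields.
record = {"marital_status": "married", "credit_score": "high", "age": 42}

features = (
    one_hot(record["marital_status"], MARITAL_STATUS)
    + [ordinal_code(record["credit_score"], CREDIT_SCORE)]
    + [record["age"]]  # numeric values pass through unchanged
)

print(features)
```

Coding marital status as 1, 2, 3 instead would invite an algorithm to treat "divorced" as three times "single", which is exactly the misreading that the nominal/ordinal distinction is meant to prevent.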
Some data mining methods and algorithms are very selective about the type of data that they can handle. Providing them with incompatible data types may lead to incorrect models or (more often) halt the model development process. For example, some data mining methods