Glossary
XML (Extensible Markup Language) Based on SGML, XML is used to describe the format, presentation, and control of content of documents that are based on this language. The Extensible Markup Language (XML) is descriptively identified in the XML 1.0 W3C Recommendation as an extremely simple dialect, or subset, of SGML, the goal of which is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML, for which reason XML has been designed for ease of implementation and for interoperability with both SGML and HTML.
Appendix B
References
Pieter Adriaans and Dolf Zantinge. Data Mining. Addison-Wesley, 1996.
Michael J. A. Berry and Gordon Linoff. Data Mining Techniques for Marketing, Sales, and Customer Support. John Wiley & Sons, 1997.
W. A. Belson. “A technique for studying the effects of a television broadcast,” Applied Statistics, 5, 1956, 195.
Michael J. A. Berry and Gordon S. Linoff. Mastering Data Mining: The Art and Science of Customer Relationship Management. John Wiley & Sons,
Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984.
Barry de Ville. “Applying statistical knowledge to database analysis and knowledge base construction,” Proceedings of the Sixth IEEE Conference on Artificial Intelligence Applications, IEEE Computer Society,
Morten T. Hansen, Nitin Nohria, and Thomas Tierney. “What’s Your Strategy for Managing Knowledge?” Harvard Business Review, 77, 2, 1999, 106–16. (Available: http://www.hbsp.harvard.edu/products/hbr/marapr99/99206.html)
E. Hunt, J. Marin, and P. Stone. Experiments in Induction. Academic Press, 1966.
Bill Inmon. Managing the Data Warehouse. John Wiley & Sons, 1996.
Robert S. Kaplan and David P. Norton. The Balanced Scorecard: Translating Strategy into Action. Harvard Business School Press, 1996.
Olivia Parr Rud. Data Mining Cookbook. John Wiley & Sons, 2001.
Abraham Kaplan. The Conduct of Inquiry: Methodology for Behavioral Science. Chandler Publishing Company, 1964.
G. V. Kass. “Significance testing in automatic interaction detection,” Applied Statistics, 24, 2, 1976, 178–189.
G. V. Kass. “An exploratory technique for investigating large quantities of categorical data,” Applied Statistics, 29, 2, 1980, 119–127.
Thomas Kuhn. The Structure of Scientific Revolutions, Third Edition. University of Chicago Press, 1996.
Jesus Mena. Data Mining Your Website. Butterworth-Heinemann, 1999.
D. Michie. “Methodologies from Machine Learning in Data Analysis and Software,” The Computer Journal, 34, 6, 1991, 559–565.
Shigeru Mizuno. Management for Quality Improvement: The Seven New QC Tools. Productivity Press, 1979.
J. N. Morgan and J. A. Sonquist. “Problems in the Analysis of Survey Data, and a Proposal,” Journal of the American Statistical Association, 58, June 1963, 415.
C. O’Dell, F. Hasanali, C. Hubert, K. Lopez, and C. Raybourn. Stages of Implementation: A Guide for Your Journey to Knowledge Management Best Practices. APQC’s Passport to Success Series, Houston, Texas, 2000.
L. W. Payne and S. Elliot. “Knowledge sharing at Texas Instruments: Turning best practices inside out,” Knowledge Management in Practice, 6, 1997.
Dorian Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
J. A. Sonquist, E. Baker, and J. Morgan. Searching for Structure. Institute for Social Research, University of Michigan, Ann Arbor, Michigan, 1973.
Thomas A. Stewart. Intellectual Capital, The New Wealth of Organizations. Doubleday-Currency, 1997.
Jake Sturm. Data Warehousing with Microsoft® SQL Server 7.0 Technical Reference. Microsoft Press, 1998.
Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
Appendix C
Web Sites
http://www.kdnuggets.com/ KD Nuggets is a leading electronic newsletter on data mining and Web mining. Its monthly release provides up-to-date news items on developments in data mining and knowledge discovery.
http://www.oasis-open.org/ Organization for the Advancement of Structured Information Standards (OASIS) is a nonprofit international consortium that creates interoperable industry specifications based on public standards such as XML and SGML. OASIS members include organizations and individuals who provide, use, and specialize in implementing the technologies that make these standards work in practice. Provides information on such emerging standards as Predictive Model Markup Language (PMML) in the separate XML Cover Pages site http://www.oasis-open.org/cover/
http://www.xml.org A credible, independent resource for news, education, and information about the application of XML in industrial and commercial settings. Hosted by OASIS and funded by organizations that are committed to product-independent data exchange, XML.ORG offers valuable tools, such as the XML.ORG Catalog, to help you make critical decisions about whether and how to employ XML in your business. For businesspeople and technologists alike, XML.ORG offers a uniquely independent view of what’s happening in the XML industry.
http://www.microsoft.com/sql/index.htm This is the home page for Microsoft SQL Server. This particular URL provides a list of Microsoft
http://www.msdn.microsoft.com/downloads/ MSDN Online Downloads offers you one place to find and download all developer-related tools and add-ons, service packs, product updates, and beta and preview releases.
http://msdn.microsoft.com/library/default.asp The MSDN Library is an essential resource for developers using Microsoft tools, products, and technologies. It contains a bounty of technical programming information, including sample code, documentation, technical articles, and reference guides.
http://backoffice.microsoft.com This site provides information about any of the Microsoft back office products. Many of Microsoft’s back office products integrate with SQL Server.
http://www.mssqlserver.com/ Technical reviews, frequently asked questions (FAQs), and all-around information resource for SQL Server issues and operations.
http://www.microsoft.com/solutions/km/DigitalDashboard.htm/ A Microsoft site that describes how to implement a digital dashboard.
http://www.microsoft.com/business/ The Microsoft Business Web site provides news, information, and executive perspectives from Microsoft about the technologies that can provide an edge in the digital age. This site provides a glimpse into Microsoft’s vision for the future of technology, and how to use it to grow your business.
Sections include:
• Microsoft’s vision. Learn about the Microsoft .NET platform and how it changes how business interacts with customers, employees, and suppliers.
• Business strategy. In Measuring Business Value, use a tool Microsoft calls Rapid Economic Justification. It can help you quantify the business value of strategic technology investments to your management team. In e-commerce, find resources to help you start or grow your online business. Get details about how to get real-time access to your most powerful data in business intelligence, how to manage your business partnerships more effectively in customer relationship management, or how to share information within your organization through knowledge management. Or read how companies plan to use wireless and other mobile technologies in mobility.
• Industries. Get specifics on how other companies in the retail, health-care, financial services, manufacturing, hospitality, and engineering industries are using solutions from Microsoft and its partners to grow their businesses.
• Find a solution. Find listings in various industries or regions for independent software vendors (ISVs) who build solutions for businesses in the solution directory.
http://www.mlnet.org/ This site is dedicated to the field of machine learning, knowledge discovery, case-based reasoning, knowledge acquisition, and data mining. It provides information about research groups and persons within the community. Browse through the list of software and data sets, and check out the events page for the latest calls for papers. Alternatively, have a look at the list of job offerings if you are looking for a new opportunity within the field. The site's maintainers welcome feedback, so send them your comments and suggestions.
www.mdcinfo.com/ This site provides information on the Meta Data Coalition, an organization originally set up by Microsoft to provide metadata solutions in data warehousing, business intelligence, and data mining.
http://www.icpsr.umich.edu/DDI/Resources.html The Data Documentation Initiative (DDI) is an effort to establish an international criterion and methodology for the content, presentation, transport, and preservation of metadata (data about data) about data sets in the social and behavioral sciences. Metadata constitute the information that enables the effective, efficient, and accurate use of those data sets. The site is hosted by the ICPSR (Inter-university Consortium for Political and Social Research) at the University of Michigan.
http://www.dhutton.com/ David Hutton Associates are consultants in quality management. They are specialists in Baldrige-style business excellence assessment as a tool to drive organizational change and improvement.
http://www.salford-systems.com/ Salford Systems are developers of CART and MARS data mining decision tree and regression modeling products. The site contains information about these products, white papers, and other technical reports.
http://research.swisslife.ch/kdd-sisyphus/ This is a site for a workgroup devoted to data preparation, preprocessing, and reasoning for real-world data mining applications. This workgroup is designed to bring together developers of algorithms who want to think about the preprocessing steps necessary to apply their algorithms to the data in a real-world database, as well as people who are interested in building tools that integrate various data mining algorithms as possible core phases for KDD applications.
The workgroup is especially interested in the following topics:
• Identify necessary and useful preprocessing operations and tools (i.e., get the application know-how from the algorithm developer).
• Examine ways in which these preprocessing operations can be represented (e.g., for documentation and reuse) as well as executed efficiently on large data sets.
• Compare the different data mining approaches with respect to their input requirements.
• Compare different (logical) representations of the problem and discuss their advantages/disadvantages. Examine the need for multirelational representations to cover all the 1:N and N:M relations between the different entities of this KDD-Sisyphus problem.
• Establish usability criteria for various data mining approaches; for example:
  - scalability: number of records, number of attributes, multiple relations versus learning time and space requirements
  - robustness: handling of missing values, missing related tuples, noise tolerance, nominal attributes with many different values, etc.
  - learning goal: classification, clustering, rule learning, etc.
  - understandability: size and presentation of mining results
  - parameter settings of the data mining algorithm and their impact on the mining result
The KDD-Sisyphus Workgroup provides the Sisyphus I package, which is based on data extracted from a real-world insurance business application. As such, it shows typical properties like fragmentation, varying data quality, irregular data value codings, and so on, which make the application of data mining or machine learning algorithms a real challenge and usually require sophisticated preprocessing methods.
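Two of the robustness concerns listed above, missing values and nominal attributes with many different values, can be checked with a quick column profile before any mining algorithm is applied. The following is only a minimal sketch: the missing-value codes, the function name `profile_column`, and the sample column are invented for illustration and are not taken from the Sisyphus I package.

```python
from collections import Counter

# Hypothetical sentinel codes that stand for "missing" in a raw extract.
MISSING_CODES = {"", "?", "NA", "-1"}


def profile_column(values):
    """Return the missing-value rate and the number of distinct non-missing codes."""
    present = [v.strip() for v in values if v.strip() not in MISSING_CODES]
    missing_rate = 1 - len(present) / len(values) if values else 0.0
    return {
        "missing_rate": missing_rate,
        "distinct": len(set(present)),
        "most_common": Counter(present).most_common(1),
    }


# Invented column showing irregular codings of one nominal attribute.
region = ["north", "North ", "?", "south", "", "N", "south"]
report = profile_column(region)
print(report)
```

In this toy column, "north", "North", and "N" count as three distinct codes although they probably mean the same thing, which is the kind of irregular value coding the paragraph above describes.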
The work package of KDD-Sisyphus I contains:
• A data set consisting of 10 relations with 5 to 50 attributes and around 200,000 data tuples in ASCII format
• A rough schema description explaining the data types and their semantic relationships
• Three data mining task descriptions (two classification and one clustering task)
Appendix D
Data Mining and Knowledge Discovery Data Sets in the Public Domain

Australian (Australian credit)
Diabetes (diabetes of Pima Indians)
DNA (DNA sequence)
German (German credit)
Heart (heart disease)
Letter (letter recognition)
Segment (image segmentation)
Shuttle (shuttle control)
Satimage (Landsat satellite image)
Vehicle (vehicle recognition using silhouettes)
http://kdd.ics.uci.edu/
The UCI Knowledge Discovery in Databases Archive is an online repository of large data sets that encompasses a wide variety of data types, analysis tasks, and application areas. The primary role of this repository is to enable researchers in knowledge discovery and data mining to scale existing and future data analysis algorithms to very large and complex data sets.
This repository is currently under construction and is still in a preliminary form. This work is supported by a grant from the Information and Data Management Program at the National Science Foundation and is intended to extend the current UCI Machine Learning Database Repository by several orders of magnitude.
In addition to storing data and description files, the repository also archives task files that describe a specific analysis, such as clustering or regression, for the data sets stored. The call for data sets lists typical data types and tasks of interest.
D.2.1 Discrete sequence data
UNIX user data
This file contains nine sets of sanitized user data drawn from the command histories of eight UNIX computer users at Purdue over the course of up to two years.
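Discrete sequence data of this kind is often summarized by counting short runs of consecutive symbols. The sketch below assumes a history is simply a list of command names; the helper `ngram_counts` and the sample history are invented for illustration and do not reflect the file format of the data set.

```python
from collections import Counter


def ngram_counts(commands, n=2):
    """Count every run of n consecutive commands in one user's history."""
    windows = zip(*(commands[i:] for i in range(n)))
    return Counter(windows)


# Invented history for illustration only.
history = ["cd", "ls", "vi", "make", "ls", "vi", "make", "cd", "ls"]
bigrams = ngram_counts(history, n=2)
print(bigrams.most_common(3))
```

Counts like these can serve as a simple per-user profile against which later command sequences are compared.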
D.2.2 Customer preference and recommendation data
Entree Chicago recommendation data
This data contains a record of user interactions with the Entree Chicago restaurant recommendation system. This is an interactive system that recommends restaurants to the user based on factors such as cuisine, price, style, atmosphere, and so on, or based on similarity to a restaurant in another city (e.g., “find me a restaurant similar to the Patina in Los Angeles”). The user can then provide feedback such as “find a nicer or less expensive restaurant.”
D.2.3 Image data
CMU face images
This data consists of 640 black-and-white face images of people taken with varying pose (straight, left, right, up), expression (neutral, happy, sad, angry), eyes (wearing glasses or not), and size.
Volcanoes on Venus
The JARtool project was a pioneering effort to develop an automatic system for cataloging small volcanoes in the large set of Venus images returned by the Magellan spacecraft. This package contains a variety of data to enable researchers to evaluate algorithms over the same images as used for the JARtool experiments.
COIL data
This data set is from the 1999 Computational Intelligence and Learning (COIL) competition. The data contains measurements of river chemical concentrations and algae densities.
Corel image features
This data set contains image features extracted from a Corel image collection. Four sets of features are available based on the color histogram, color histogram layout, color moments, and co-occurrence texture.
Forest CoverType
The forest cover type for 30 × 30 meter cells obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data.
The insurance company benchmark (COIL 2000)
This data set, used in the COIL 2000 Challenge, contains information on customers of an insurance company. The data consists of 86 variables and includes product usage data and socio-demographic data derived from zip area codes. The data was collected to answer the following question: Can you predict who would be interested in buying a caravan insurance policy and give an explanation why?
Internet usage data
This data contains general demographic information on internet users in 1997.
IPUMS census data
This data set contains unweighted PUMS census data from the Los Angeles and Long Beach areas for the years 1970, 1980, and 1990. The coding schemes have been standardized (by the IPUMS project) to be consistent across years.
KDD CUP 1998 data
This is the data set used for The Second International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-98, the Fourth International Conference on Knowledge Discovery and Data Mining. The competition task is a regression problem where the goal is to estimate the return from a direct mailing in order to maximize donation profits.
KDD CUP 1999 data
This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between “bad” connections, called intrusions or attacks, and “good” normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.
D.2.5 Relational data
Movies
This data set contains a list of more than 10,000 films, including many older, odd, and cult films. There is information on actors, casts, directors, producers, studios, and so on. The material also includes some social information, such as “lived with” and “married to.”
D.2.6 Spatio-temporal data
El Niño data
The data set contains oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. The data is expected to aid in the understanding and prediction of El Niño/Southern Oscillation (ENSO) cycles.
D.2.7 Text
20 newsgroups data
This data set consists of 20,000 messages taken from 20 Usenet newsgroups.
Reuters-21578 text categorization collection
This is a collection of documents that appeared on Reuters newswire in 1987. The documents were assembled and indexed with categories.
D.2.8 Time series
Australian sign language data
This data consists of samples of Auslan (Australian Sign Language) signs. Examples of 95 signs were collected from five signers with a total of 6,650 sign samples.
EEG data
This data arises from a large study to examine EEG correlates of genetic predisposition to alcoholism. It contains measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9-msec epoch) for 1 second.
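The figures in this description fit together arithmetically: a 256 Hz sampling rate gives one sample roughly every 3.9 msec, and 64 electrodes recorded for one second yield a fixed number of values. The short sketch below only reworks the numbers quoted above; the variable names, and the word "recording" for a one-second measurement, are choices made here for illustration.

```python
SAMPLING_RATE_HZ = 256  # from the description above
ELECTRODES = 64         # from the description above
DURATION_S = 1          # one-second recording

# Time between consecutive samples, in milliseconds.
sample_interval_ms = 1000 / SAMPLING_RATE_HZ

# Samples per electrode and total values in one one-second recording.
samples_per_electrode = SAMPLING_RATE_HZ * DURATION_S
values_per_recording = ELECTRODES * samples_per_electrode

print(f"{sample_interval_ms:.2f} ms between samples")
print(f"{values_per_recording} values per one-second recording")
```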
Japanese vowels
This data set records 640 time series of 12 LPC cepstrum coefficients taken from nine male speakers.
Pioneer-1 mobile robot data
This data set contains time series sensor readings of the Pioneer-1 mobile robot. The data is broken into “experiences” in which the robot takes action for some period of time and experiences a controlled interaction with its environment (e.g., bumping into a garbage can).
Pseudo periodic synthetic time series
This data set is designed for testing indexing schemes in time series databases. The data appears highly periodic, but never exactly repeats itself. This feature is designed to challenge the indexing tasks.
Robot execution failures
This data set contains force and torque measurements on a robot after failure detection. Each failure is characterized by 15 force/torque samples collected at regular time intervals starting immediately after failure detection.
Synthetic control chart time series
This data consists of synthetically generated control charts.
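The description does not say how the charts were generated. As a rough illustration of what "synthetically generated" can mean here, the sketch below produces three common control chart patterns: noise around a mean, a steady upward trend, and a step shift. The pattern names, the function `control_chart`, and every parameter value are assumptions made for this example, not the generator behind the data set.

```python
import random

PATTERNS = ("normal", "trend", "shift")


def control_chart(pattern, length=60, mean=30.0, noise=2.0, seed=0):
    """Generate one synthetic control chart as a list of floats.

    pattern: "normal" (noise around the mean), "trend" (steady upward
    drift), or "shift" (step change halfway through). All parameter
    values are illustrative assumptions.
    """
    if pattern not in PATTERNS:
        raise ValueError(f"unknown pattern: {pattern}")
    rng = random.Random(seed)
    series = []
    for t in range(length):
        value = mean + rng.gauss(0.0, noise)
        if pattern == "trend":
            value += 0.4 * t
        elif pattern == "shift" and t >= length // 2:
            value += 10.0
        series.append(value)
    return series


# With the same seed the three charts share their noise, so they differ
# only by the pattern that was added.
normal = control_chart("normal")
trend = control_chart("trend")
shift = control_chart("shift")
```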
D.2.9 Web data
Microsoft anonymous Web data
This data set records which areas (Vroots) of www.microsoft.com each user visited in a one-week timeframe in February 1998.
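Data of this shape maps naturally onto a set of visited areas per user, from which area popularity and pairs of areas visited together can be counted. The sketch below is an assumption-laden illustration: the user ids, the area names, and the dictionary layout are invented and do not reflect the actual file format.

```python
from collections import Counter
from itertools import combinations

# Invented visits: anonymous user id -> areas (Vroots) seen in the week.
visits = {
    10001: {"/support", "/products", "/search"},
    10002: {"/support", "/search"},
    10003: {"/products"},
}

# How many users visited each area.
area_counts = Counter(area for areas in visits.values() for area in areas)

# How many users visited both areas of a pair (sorted so each pair has one key).
pair_counts = Counter(
    pair
    for areas in visits.values()
    for pair in combinations(sorted(areas), 2)
)

print(area_counts.most_common())
print(pair_counts.most_common(1))
```

Pair counts like these are a common starting point for predicting which other areas a user is likely to visit.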
Syskill Webert Web data
This database contains the HTML source of Web pages plus the ratings of a single user on these pages. The Web pages are on four separate subjects (bands, or recording artists; goats; sheep; and biomedical).
http://www.mlnet.org/
The MLnet Online Information Service is dedicated to the field of machine learning, knowledge discovery, case-based reasoning, knowledge acquisition, and data mining. The site provides information on research groups and persons in the community. You can browse through the list of software and data sets, and check out the events page for the latest calls for papers. The site also provides lists of job offerings if you are looking for a new opportunity within the field.
http://research.swisslife.ch/kdd-sisyphus/
This site provides a large, unpreprocessed, multirelational, and partially documented database extract. This data is intended for use in research on preprocessing techniques for real-world data. “The KDD-Sisyphus Workgroup provides the Sisyphus I package, which is based on data extracted from a real-world insurance business application. As such it shows typical properties like fragmentation, varying data quality, irregular data value codings, etc., which makes the application of data mining or machine learning algorithms a real challenge and usually requires sophisticated preprocessing methods.”