Microsoft Data Mining: Integrated Business Intelligence for e-Commerce and Knowledge Management (part 8)


Glossary

XML (Extensible Markup Language) Based on SGML, XML is used to describe the format, presentation, and control of content of documents that are based on this language. The Extensible Markup Language (XML) is descriptively identified in the XML 1.0 W3C Recommendation as an extremely simple dialect, or subset, of SGML, the goal of which is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. For this reason, XML has been designed for ease of implementation and for interoperability with both SGML and HTML.


B

References

Pieter Adriaans and Dolf Zantinge. Data Mining. Addison-Wesley, 1996.

Michael J. A. Berry and Gordon Linoff. Data Mining Techniques for Marketing, Sales, and Customer Support. John Wiley & Sons, 1997.

W. A. Belson. “A technique for studying the effects of a television broadcast,” Applied Statistics, 5, 1956, 195.

Michael J. A. Berry and Gordon S. Linoff. Mastering Data Mining: The Art and Science of Customer Relationship Management. John Wiley & Sons,

Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984.

Barry de Ville. “Applying statistical knowledge to database analysis and knowledge base construction,” Proceedings of the Sixth IEEE Conference on Artificial Intelligence Applications, IEEE Computer Society,

Morten T. Hansen, Nitin Nohria, and Thomas Tierney. “What’s Your Strategy for Managing Knowledge?” Harvard Business Review, 77, 2, 1999, 106–16. (Available: http://www.hbsp.harvard.edu/products/hbr/marapr99/99206.html)

E. Hunt, J. Marin, and P. Stone. Experiments in Induction. Academic Press, 1966.

Bill Inmon. Managing the Data Warehouse. John Wiley & Sons, 1996.

Robert S. Kaplan and David P. Norton. The Balanced Scorecard: Translating Strategy into Action. Harvard Business School Press, 1996.

Olivia Parr Rud. Data Mining Cookbook. John Wiley & Sons, 2001.

Abraham Kaplan. The Conduct of Inquiry: Methodology for Behavioral Science. Chandler Publishing Company, 1964.

G. V. Kass. “Significance testing in automatic interaction detection,” Applied Statistics, 24, 2, 1976, 178–189.

G. V. Kass. “An exploratory technique for investigating large quantities of categorical data,” Applied Statistics, 29, 2, 1980, 119–127.

Thomas Kuhn. The Structure of Scientific Revolutions, Third Edition. University of Chicago Press, 1996.

Jesus Mena. Data Mining Your Website. Butterworth–Heinemann, 1999.

D. Michie. “Methodologies from Machine Learning in Data Analysis and Software,” The Computer Journal, 34, 6, 1991, 559–565.

Shigeru Mizuno. Management for Quality Improvement: The Seven New QC Tools. Productivity Press, 1979.

J. N. Morgan and J. A. Sonquist. “Problems in the Analysis of Survey Data, and a Proposal,” Journal of the American Statistical Association, 58, June 1963, 415.

C. O’Dell, F. Hasanali, C. Hubert, K. Lopez, and C. Raybourn. Stages of Implementation: A Guide for Your Journey to Knowledge Management Best Practices. APQC’s Passport to Success Series, Houston, Texas, 2000.

L. W. Payne and S. Elliot. “Knowledge sharing at Texas Instruments: Turning best practices inside out,” Knowledge Management in Practice, 6, 1997.

Dorian Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.

J. A. Sonquist, E. Baker, and J. Morgan. Searching for Structure. Institute for Social Research, University of Michigan, Ann Arbor, Michigan, 1973.

Thomas A. Stewart. Intellectual Capital: The New Wealth of Organizations. Doubleday-Currency, 1997.

Jake Sturm. Data Warehousing with Microsoft® SQL Server 7.0 Technical Reference. Microsoft Press, 1998.

Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.

C

Web Sites

http://www.kdnuggets.com/ KD Nuggets is a leading electronic newsletter on data mining and Web mining. Its monthly release provides up-to-date news items on developments in data mining and knowledge discovery.

http://www.oasis-open.org/ The Organization for the Advancement of Structured Information Standards (OASIS) is a nonprofit international consortium that creates interoperable industry specifications based on public standards such as XML and SGML. OASIS members include organizations and individuals who provide, use, and specialize in implementing the technologies that make these standards work in practice. The site provides information on such emerging standards as Predictive Model Markup Language (PMML) in the separate XML Cover Pages site, http://www.oasis-open.org/cover/

http://www.xml.org A credible, independent resource for news, education, and information about the application of XML in industrial and commercial settings. Hosted by OASIS and funded by organizations that are committed to product-independent data exchange, XML.ORG offers valuable tools, such as the XML.ORG Catalog, to help you make critical decisions about whether and how to employ XML in your business. For businesspeople and technologists alike, XML.ORG offers a uniquely independent view of what’s happening in the XML industry.

http://www.microsoft.com/sql/index.htm This is the home page for Microsoft SQL Server. This particular URL provides a list of Microsoft

http://www.msdn.microsoft.com/downloads/ MSDN Online Downloads offers you one place to find and download all developer-related tools and add-ons, service packs, product updates, and beta and preview releases.

http://msdn.microsoft.com/library/default.asp The MSDN Library is an essential resource for developers using Microsoft tools, products, and technologies. It contains a bounty of technical programming information, including sample code, documentation, technical articles, and reference guides.

http://backoffice.microsoft.com This site provides information about any of the Microsoft back office products. Many of Microsoft’s back office products integrate with SQL Server.

http://www.mssqlserver.com/ Technical reviews, frequently asked questions (FAQs), and an all-around information resource for SQL Server issues and operations.

http://www.microsoft.com/solutions/km/DigitalDashboard.htm/ A Microsoft site that describes how to implement a digital dashboard.

http://www.microsoft.com/business/ The Microsoft Business Web site provides news, information, and executive perspectives from Microsoft about the technologies that can provide an edge in the digital age. This site provides a glimpse into Microsoft’s vision for the future of technology, and how to use it to grow your business.

Sections include:

• Microsoft’s vision. Learn about the Microsoft .NET platform and how it changes how business interacts with customers, employees, and suppliers.

• Business strategy. In Measuring Business Value, use a tool Microsoft calls Rapid Economic Justification. It can help you quantify the business value of strategic technology investments to your management team. In e-commerce, find resources to help you start or grow your online business. Get details about how to get real-time access to your most powerful data in business intelligence, how to manage your business partnerships more effectively in customer relationship management, or how to share information within your organization through knowledge management. Or read how companies plan to use wireless and other mobile technologies in mobility.

• Industries. Get specifics on how other companies in the retail, healthcare, financial services, manufacturing, hospitality, and engineering industries are using solutions from Microsoft and its partners to grow their businesses.

• Find a solution. Find listings in various industries or regions for independent software vendors (ISVs) who build solutions for businesses in the solution directory.

http://www.mlnet.org/ This site is dedicated to the field of machine learning, knowledge discovery, case-based reasoning, knowledge acquisition, and data mining. It provides information about research groups and persons within the community. Browse through the list of software and data sets, and check out the events page for the latest calls for papers. Alternatively, have a look at the list of job offerings if you are looking for a new opportunity within the field. The site maintainers appreciate any kind of feedback, so send them your comments and suggestions.

www.mdcinfo.com/ This site provides information on the Meta Data Coalition, an organization originally set up by Microsoft to provide metadata solutions in data warehousing, business intelligence, and data mining.

http://www.icpsr.umich.edu/DDI/Resources.html The Data Documentation Initiative (DDI) is an effort to establish an international criterion and methodology for the content, presentation, transport, and preservation of metadata (data about data) about data sets in the social and behavioral sciences. Metadata constitute the information that enables the effective, efficient, and accurate use of those data sets. The site is hosted by the ICPSR (Inter-university Consortium for Political and Social Research) at the University of Michigan.

http://www.dhutton.com/ David Hutton Associates are consultants in quality management. They are specialists in Baldrige-style business excellence assessment as a tool to drive organizational change and improvement.

http://www.salford-systems.com/ Salford Systems are developers of the CART and MARS data mining decision tree and regression modeling products. The site contains information about these products, white papers, and other technical reports.

http://research.swisslife.ch/kdd-sisyphus/ This is a site for a workgroup devoted to data preparation, preprocessing, and reasoning for real-world data mining applications. This workgroup is designed to bring together developers of algorithms who want to think about the preprocessing steps necessary to apply their algorithms to the data in a real-world database, as well as people who are interested in building tools that integrate various data mining algorithms as possible core phases for KDD applications.

The workgroup is especially interested in the following topics:

• Identify necessary and useful preprocessing operations and tools (i.e., get the application know-how from the algorithm developer).
• Examine ways of how these preprocessing operations can be represented (e.g., for documentation and reuse) as well as executed efficiently on large data sets.
• Compare the different data mining approaches with respect to their input requirements.
• Compare different (logical) representations of the problem and discuss their advantages/disadvantages. Examine the need for multirelational representations to cover all the 1:N and N:M relations between the different entities of this KDD-Sisyphus problem.
• Establish usability criteria for various data mining approaches; for example:
    • scalability: number of records, number of attributes, multiple relations versus learning time and space requirements
    • robustness: handling of missing values, missing related tuples, noise tolerance, nominal attributes with many different values, etc.
    • learning goal: classification, clustering, rule learning, etc.
    • understandability: size and presentation of mining results
    • parameter settings of the data mining algorithm and their impact on the mining result

The KDD-Sisyphus Workgroup provides the Sisyphus I package, which is based on data extracted from a real-world insurance business application. As such it shows typical properties like fragmentation, varying data quality, irregular data value codings, and so on, which make the application of data mining or machine learning algorithms a real challenge and usually require sophisticated preprocessing methods.
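As a rough illustration of the kind of preprocessing the workgroup topics refer to, the sketch below flattens a 1:N relation into one row per parent entity while tolerating missing values and missing related tuples. It is only a minimal Python sketch: the relation names, field names, and values are hypothetical and are not taken from the Sisyphus I package.

```python
from collections import defaultdict

# Hypothetical 1:N data: one customer can have many policies.
customers = [
    {"customer_id": 1, "region": "north"},
    {"customer_id": 2, "region": None},      # missing value in the parent relation
    {"customer_id": 3, "region": "south"},   # no related policy tuples at all
]
policies = [
    {"customer_id": 1, "premium": 120.0},
    {"customer_id": 1, "premium": 80.0},
    {"customer_id": 2, "premium": None},     # missing value in the child relation
    {"customer_id": 2, "premium": 50.0},
]


def flatten_one_to_many(parents, children, key, value):
    """Return one row per parent with simple aggregates over its child tuples."""
    grouped = defaultdict(list)
    for child in children:
        if child[value] is not None:          # skip missing child values
            grouped[child[key]].append(child[value])

    rows = []
    for parent in parents:
        values = grouped.get(parent[key], [])
        row = dict(parent)
        row["n_children"] = len(values)
        row["total_" + value] = sum(values)
        # The mean is undefined when a parent has no usable child tuples.
        row["mean_" + value] = sum(values) / len(values) if values else None
        rows.append(row)
    return rows


flat = flatten_one_to_many(customers, policies, "customer_id", "premium")
for row in flat:
    print(row)
```

Aggregating child tuples is only one possible representation; a learner that accepts multirelational input directly would not need this flattening step.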

The work package of KDD-Sisyphus I contains:

• A data set consisting of 10 relations with 5 to 50 attributes and around 200,000 data tuples in ASCII format
• A rough schema description explaining the data types and their semantic relationships
• Three data mining task descriptions (two classification and one clustering task)

D

Data Mining and Knowledge Discovery Data Sets in the Public Domain

Australian (Australian credit)

Diabetes (diabetes of Pima Indians)

DNA (DNA sequence)

German (German credit)

Heart (heart disease)

Letter (letter recognition)

Segment (image segmentation)

Shuttle (shuttle control)

Satimage (Landsat satellite image)

Vehicle (vehicle recognition using silhouettes)

http://kdd.ics.uci.edu/

The UCI Knowledge Discovery in Databases Archive is an online repository of large data sets that encompasses a wide variety of data types, analysis tasks, and application areas. The primary role of this repository is to enable researchers in knowledge discovery and data mining to scale existing and future data analysis algorithms to very large and complex data sets.

This repository is currently under construction and is still in a preliminary form. This work is supported by a grant from the Information and Data Management Program at the National Science Foundation and is intended to extend the current UCI Machine Learning Database Repository by several orders of magnitude.

In addition to storing data and description files, the repository also archives task files that describe a specific analysis, such as clustering or regression, for the data sets stored. The call for data sets lists typical data types and tasks of interest.

D.2.1 Discrete sequence data

UNIX user data

This file contains nine sets of sanitized user data drawn from the command histories of eight UNIX computer users at Purdue over the course of up to two years.

D.2.2 Customer preference and recommendation data

Entree Chicago recommendation data

This data contains a record of user interactions with the Entree Chicago restaurant recommendation system. This is an interactive system that recommends restaurants to the user based on factors such as cuisine, price, style, atmosphere, and so on, or based on similarity to a restaurant in another city (e.g., “find me a restaurant similar to the Patina in Los Angeles”). The user can then provide feedback, such as asking for a nicer or less expensive restaurant.

D.2.3 Image data

CMU face images

This data consists of 640 black-and-white face images of people taken with varying pose (straight, left, right, up), expression (neutral, happy, sad, angry), eyes (wearing glasses or not), and size.

Volcanoes on Venus

The JARtool project was a pioneering effort to develop an automatic system for cataloging small volcanoes in the large set of Venus images returned by the Magellan spacecraft. This package contains a variety of data to enable researchers to evaluate algorithms over the same images as used for the JARtool experiments.

COIL data

This data set is from the 1999 Computational Intelligence and Learning (COIL) competition. The data contains measurements of river chemical concentrations and algae densities.

Corel image features

This data set contains image features extracted from a Corel image collection. Four sets of features are available based on the color histogram, color histogram layout, color moments, and co-occurrence texture.

Forest CoverType

The forest cover type for 30 × 30 meter cells obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data.

The insurance company benchmark (COIL 2000)

This data set, used in the COIL 2000 Challenge, contains information on customers of an insurance company. The data consists of 86 variables and includes product usage data and socio-demographic data derived from zip area codes. The data was collected to answer the following question: Can you predict who would be interested in buying a caravan insurance policy and give an explanation why?

Internet usage data

This data contains general demographic information on internet users in 1997.

IPUMS census data

This data set contains unweighted PUMS census data from the Los Angeles and Long Beach areas for the years 1970, 1980, and 1990. The coding schemes have been standardized (by the IPUMS project) to be consistent across years.

KDD CUP 1998 data

This is the data set used for The Second International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-98, The Fourth International Conference on Knowledge Discovery and Data Mining. The competition task is a regression problem where the goal is to estimate the return from a direct mailing in order to maximize donation profits.

KDD CUP 1999 data

This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between “bad” connections, called intrusions or attacks, and “good” normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

D.2.5 Relational data

Movies

This data set contains a list of more than 10,000 films, including many older, odd, and cult films. There is information on actors, casts, directors, producers, studios, and so on. The material also includes some social information, such as “lived with” and “married to.”

D.2.6 Spatio-temporal data

El Niño data

The data set contains oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. The data is expected to aid in the understanding and prediction of El Niño/Southern Oscillation (ENSO) cycles.

D.2.7 Text

20 newsgroups data

This data set consists of 20,000 messages taken from 20 Usenet newsgroups.

Reuters-21578 text categorization collection

This is a collection of documents that appeared on Reuters newswire in 1987. The documents were assembled and indexed with categories.

D.2.8 Time series

Australian sign language data

This data consists of samples of Auslan (Australian Sign Language) signs. Examples of 95 signs were collected from five signers, with a total of 6,650 sign samples.

EEG data

This data arises from a large study to examine EEG correlates of genetic predisposition to alcoholism. It contains measurements from 64 electrodes placed on the scalp, sampled at 256 Hz (3.9-msec epoch) for 1 second.
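The figures in this description can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the three numbers given above (64 electrodes, 256 Hz, 1 second); the variable names are illustrative, and the actual layout of the files in the archive is not assumed.

```python
SAMPLING_RATE_HZ = 256   # from the data set description
N_ELECTRODES = 64        # from the data set description
DURATION_S = 1.0         # from the data set description

# Time between consecutive samples, in milliseconds.
sample_interval_ms = 1000.0 / SAMPLING_RATE_HZ

# Number of samples each electrode contributes to a one-second recording.
samples_per_electrode = int(SAMPLING_RATE_HZ * DURATION_S)

# Total number of measurements in one recording across all electrodes.
values_per_recording = N_ELECTRODES * samples_per_electrode

print(f"{sample_interval_ms:.2f} ms between samples")
print(samples_per_electrode, "samples per electrode")
print(values_per_recording, "values per one-second recording")
```

The sample interval of 1000/256 ms is consistent with the 3.9-msec epoch quoted in the description.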

Japanese vowels

This data set records 640 time series of 12 LPC cepstrum coefficients taken from nine male speakers.

Pioneer-1 mobile robot data

This data set contains time series sensor readings of the Pioneer-1 mobile robot. The data is broken into “experiences” in which the robot takes action for some period of time and experiences a controlled interaction with its environment (e.g., bumping into a garbage can).

Pseudo periodic synthetic time series

This data set is designed for testing indexing schemes in time series databases. The data appears highly periodic, but never exactly repeats itself. This feature is designed to challenge the indexing tasks.

Robot execution failures

This data set contains force and torque measurements on a robot after failure detection. Each failure is characterized by 15 force/torque samples collected at regular time intervals starting immediately after failure detection.

Synthetic control chart time series

This data consists of synthetically generated control charts.

D.2.9 Web data

Microsoft anonymous Web data

This data set records which areas (Vroots) of www.microsoft.com each user visited in a one-week timeframe in February 1998.

Syskill Webert Web data

This database contains the HTML source of Web pages plus the ratings of a single user on these pages. The Web pages are on four separate subjects (bands, or recording artists; goats; sheep; and biomedical).

http://www.mlnet.org/

The MLnet Online Information Service is dedicated to the field of machine learning, knowledge discovery, case-based reasoning, knowledge acquisition, and data mining. The site provides information on research groups and persons in the community. You can browse through the list of software and data sets, and check out the events page for the latest calls for papers. The site also provides lists of job offerings if you are looking for a new opportunity within the field.

http://research.swisslife.ch/kdd-sisyphus/

This site provides a large, unpreprocessed, multirelational, and partially documented database extract. This data is intended for use in research on preprocessing techniques for real-world data. “The KDD-Sisyphus Workgroup provides the Sisyphus I package, which is based on data extracted from a real-world insurance business application. As such it shows typical properties like fragmentation, varying data quality, irregular data value codings, etc. which makes the application of data mining or machine learning algorithms a real challenge and usually requires sophisticated preprocessing methods.”
