Glossary

XML (Extensible Markup Language) Based on SGML, XML is used to describe the format, presentation, and control of content of documents that are based on this language. The Extensible Markup Language (XML) is described in the XML 1.0 W3C Recommendation as an extremely simple dialect, or subset, of SGML, the goal of which is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. For this reason, XML has been designed for ease of implementation and for interoperability with both SGML and HTML.

B References

Pieter Adriaans and Dolf Zantinge. Data Mining. Addison-Wesley, 1996.
Michael J. A. Berry and Gordon Linoff. Data Mining Techniques for Marketing, Sales, and Customer Support. John Wiley & Sons, 1997.
W. A. Belson. “A technique for studying the effects of a television broadcast,” Applied Statistics, 5, 1956, 195.
Michael J. A. Berry and Gordon S. Linoff. Mastering Data Mining: The Art and Science of Customer Relationship Management. John Wiley & Sons, 2000.
Alex Berson, Stephen Smith, and Kurt Thearling. Building Data Mining Applications for CRM. McGraw-Hill, 2000.
David Biggs, B. de Ville, and E. Suen. “A method of choosing multiway partitions for classification and decision trees,” Journal of Applied Statistics, 18, 1, 1991, 49–62.
Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984.
Barry de Ville. “Applying statistical knowledge to database analysis and knowledge base construction,” Proceedings of the Sixth IEEE Conference on Artificial Intelligence Applications, IEEE Computer Society, Washington, 30–36, March 1990.
N. M. Dixon. Common Knowledge: How Companies Thrive by Sharing What They Know. Harvard Business School Press, 2000.
H. J. Einhorn. “Alchemy in the behavioral sciences,” Public Opinion Quarterly, 36, 1972, 367–378.
Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press, The MIT Press, 1996.
Morten T. Hansen, Nitin Nohria, and Thomas Tierney. “What’s Your Strategy for Managing Knowledge?” Harvard Business Review, 77, 2, 1999, 106–16. (Available: http://www.hbsp.harvard.edu/products/hbr/marapr99/99206.html)
E. Hunt, J. Marin, and P. Stone. Experiments in Induction. Academic Press, 1966.
Bill Inmon. Managing the Data Warehouse. John Wiley & Sons, 1996.
Robert S. Kaplan and David P. Norton. The Balanced Scorecard: Translating Strategy into Action. Harvard Business School Press, 1996.
Olivia Parr Rud. Data Mining Cookbook. John Wiley & Sons, 2001.
Abraham Kaplan. The Conduct of Inquiry: Methodology for Behavioral Science. Chandler Publishing Company, 1964.
G. V. Kass. “Significance testing in automatic interaction detection,” Applied Statistics, 24, 2, 1976, 178–189.
G. V. Kass. “An exploratory technique for investigating large quantities of categorical data,” Applied Statistics, 29, 2, 1980, 119–127.
Thomas Kuhn. The Structure of Scientific Revolutions, Third Edition. University of Chicago Press, 1996.
Jesus Mena. Data Mining Your Website. Butterworth–Heinemann, 1999.
D. Michie. “Methodologies from Machine Learning in Data Analysis and Software,” The Computer Journal, 34, 6, 1991, 559–565.
Shigeru Mizuno. Management for Quality Improvement: The Seven New QC Tools. Productivity Press, 1979.
J. N. Morgan and J. A. Sonquist. “Problems in the Analysis of Survey Data, and a Proposal,” Journal of the American Statistical Association, 58, June 1963, 415.
C. O’Dell, F. Hasanali, C. Hubert, K. Lopez, and C. Raybourn. Stages of Implementation: A Guide for Your Journey to Knowledge Management Best Practices. APQC’s Passport to Success Series, Houston, Texas, 2000.
L. W. Payne and S. Elliot. “Knowledge sharing at Texas Instruments: Turning best practices inside out,” Knowledge Management in Practice, 6, 1997.
Dorian Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
R. Quinlan. “Discovering rules by induction from large collections of examples,” Expert Systems in the Micro-electronic Age, D. Michie (ed), Edinburgh, 1979, 168–201.
Reid G. Smith and Adam Farquhar. “The road ahead for knowledge management: an AI perspective,” AI Magazine, 21, 4, Winter 2000, 17–40.
J. A. Sonquist, E. Baker, and J. Morgan. Searching for Structure. Institute for Social Research, University of Michigan, Ann Arbor, Michigan, 1973.
Thomas A. Stewart. Intellectual Capital: The New Wealth of Organizations. Doubleday-Currency, 1997.
Jake Sturm. Data Warehousing with Microsoft® SQL Server 7.0 Technical Reference. Microsoft Press, 1998.
Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.

C Web Sites

http://www.dmg.org/ The Data Mining Group is a consortium of industry and academics formed to facilitate the creation of useful standards for the data mining community. The site is hosted by the National Center for Data Mining at the University of Illinois at Chicago (UIC). The site provides a member area (for members only) and a software repository, along with news and announcements.

http://www.kdnuggets.com/ KD Nuggets is a leading electronic newsletter on data mining and Web mining. Its monthly release provides up-to-date news items on developments in data mining and knowledge discovery.

http://www.oasis-open.org/ Organization for the Advancement of Structured Information Standards (OASIS) is a nonprofit international consortium that creates interoperable industry specifications based on public standards such as XML and SGML.
OASIS members include organizations and individuals who provide, use, and specialize in implementing the technologies that make these standards work in practice. The site provides information on such emerging standards as Predictive Model Markup Language (PMML) in the separate XML Cover Pages site, http://www.oasis-open.org/cover/.

http://www.xml.org A credible, independent resource for news, education, and information about the application of XML in industrial and commercial settings. Hosted by OASIS and funded by organizations that are committed to product-independent data exchange, XML.ORG offers valuable tools, such as the XML.ORG Catalog, to help you make critical decisions about whether and how to employ XML in your business. For businesspeople and technologists alike, XML.ORG offers a uniquely independent view of what’s happening in the XML industry.

http://www.microsoft.com/sql/index.htm This is the home page for Microsoft SQL Server. This particular URL provides a list of Microsoft white papers related to SQL Server. The general site provides news and information about SQL Server and future releases.

http://www.microsoft.com/data/ This Microsoft Web site provides information on current and evolving Microsoft data access products, documentation (including standards documents), technical materials, and downloads. Here you will find the OLE DB for Data Mining and OLE DB for OLAP specifications and such evolving developments as the XML for Analysis Specification.

http://www.msdn.microsoft.com/downloads/ MSDN Online Downloads offers you one place to find and download all developer-related tools and add-ons, service packs, product updates, and beta and preview releases.

http://msdn.microsoft.com/library/default.asp The MSDN Library is an essential resource for developers using Microsoft tools, products, and technologies. It contains a bounty of technical programming information, including sample code, documentation, technical articles, and reference guides.

http://backoffice.microsoft.com This site provides information about any of the Microsoft back office products. Many of Microsoft’s back office products integrate with SQL Server.

http://www.mssqlserver.com/ Technical reviews, frequently asked questions (FAQs), and an all-around information resource for SQL Server issues and operations.

http://www.microsoft.com/solutions/km/DigitalDashboard.htm/ A Microsoft site that describes how to implement a digital dashboard.

http://www.microsoft.com/business/ The Microsoft Business Web site provides news, information, and executive perspectives from Microsoft about the technologies that can provide an edge in the digital age. This site provides a glimpse into Microsoft’s vision for the future of technology, and how to use it to grow your business. Sections include:

  Microsoft’s vision. Learn about the Microsoft .NET platform and how it changes how business interacts with customers, employees, and suppliers.
  Business strategy. In Measuring Business Value, use a tool Microsoft calls Rapid Economic Justification. It can help you quantify the business value of strategic technology investments to your management team. In e-commerce, find resources to help you start or grow your online business. Get details about how to get real-time access to your most powerful data in business intelligence, how to manage your business partnerships more effectively in customer relationship management, or how to share information within your organization through knowledge management. Or read how companies plan to use wireless and other mobile technologies in mobility.
  Industries. Get specifics on how other companies in the retail, healthcare, financial services, manufacturing, hospitality, and engineering industries are using solutions from Microsoft and its partners to grow their businesses.
  Find a solution. Find listings in various industries or regions for independent software vendors (ISVs) who build solutions for businesses in the solution directory.

http://www.mlnet.org/ This site is dedicated to the field of machine learning, knowledge discovery, case-based reasoning, knowledge acquisition, and data mining. This site provides information about research groups and persons within the community. Browse through the list of software and data sets, and check out the events page for the latest calls for papers. Alternatively, have a look at the list of job offerings if you are looking for a new opportunity within the field. The maintainers greatly appreciate any kind of feedback, so send them your comments and suggestions.

www.mdcinfo.com/ This site provides information on the Meta Data Coalition, an organization originally set up by Microsoft to provide meta data solutions in data warehousing, business intelligence, and data mining.

http://www.icpsr.umich.edu/DDI/Resources.html The Data Documentation Initiative (DDI) is an effort to establish an international criterion and methodology for the content, presentation, transport, and preservation of metadata (data about data) about data sets in the social and behavioral sciences. Metadata constitute the information that enables the effective, efficient, and accurate use of those data sets. The site is hosted by the ICPSR (Inter-university Consortium for Political and Social Research) at the University of Michigan.

http://www.dhutton.com/ David Hutton Associates are consultants in quality management. They are specialists in Baldrige-style business excellence assessment as a tool to drive organizational change and improvement.

http://www.salford-systems.com/ Salford Systems are developers of CART and MARS data mining decision tree and regression modeling products. The site contains information about these products, white papers, and other technical reports.

http://research.swisslife.ch/kdd-sisyphus/ This is a site for a workgroup devoted to data preparation, preprocessing, and reasoning for real-world data mining applications. This workgroup is designed to bring together developers of algorithms who want to think about the preprocessing steps necessary to apply their algorithms to the data in a real-world database, as well as people who are interested in building tools that integrate various data mining algorithms as possible core phases for KDD applications. The workgroup is especially interested in the following topics:

  Identify necessary and useful preprocessing operations and tools (i.e., get the application know-how from the algorithm developer).
  Examine ways in which these preprocessing operations can be represented (e.g., for documentation and reuse) as well as executed efficiently on large data sets.
  Compare the different data mining approaches with respect to their input requirements.
  Compare different (logical) representations of the problem and discuss their advantages/disadvantages.
  Examine the need for multirelational representations to cover all the 1:N and N:M relations between the different entities of this KDD-Sisyphus problem.
  Establish usability criteria for various data mining approaches; for example:
    scalability: number of records, number of attributes, multiple relations versus learning time and space requirements
    robustness: handling of missing values, missing related tuples, noise tolerance, nominal attributes with many different values, etc.
    learning goal: classification, clustering, rule learning, etc.
    understandability: size and presentation of mining results
    parameter settings of the data mining algorithm and their impact on the mining result

The KDD-Sisyphus Workgroup provides the Sisyphus I package, which is based on data extracted from a real-world insurance business application. As such, it shows typical properties like fragmentation, varying data quality, irregular data value codings, and so on, which make the application of data mining or machine learning algorithms a real challenge and usually require sophisticated preprocessing methods.

The work package of KDD-Sisyphus I contains:

  A data set consisting of 10 relations with 5 to 50 attributes and around 200,000 data tuples in ASCII format
  A rough schema description explaining the data types and their semantic relationships
  Three data mining task descriptions (two classification and one clustering task)

[...]

... variety of data to enable researchers to evaluate algorithms over the same images as used for the JARtool experiments.

Data Mining and Knowledge Discovery Data Sets in the Public Domain

D.2.4 Multivariate data

Census-income database This data set contains unweighted PUMS census data from the Los Angeles and Long Beach areas for the years 1970, 1980, and 1990. The coding schemes have been standardized ...

... “experiences” in which the robot takes action for some period of time and experiences a controlled interaction with its environment (i.e., bumping into a garbage can).

Pseudo periodic synthetic time series This data set is designed for testing indexing schemes in time series databases. The data appears highly periodic, but never exactly repeats itself. This feature is designed to challenge the indexing ...
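
The idea of a series that looks periodic without ever exactly repeating can be illustrated with a short sketch. The Python below is only an assumption about how such data might be produced, not the generator used for the data set described above: it sums a few sine waves whose frequencies are power-of-two values perturbed by random offsets, so the components do not share a common period. The function name, the number of components, and the amplitude choices are all hypothetical.

```python
import math
import random


def pseudo_periodic_series(n_points, seed=0):
    """Return n_points samples of a sum of sines with perturbed frequencies.

    Hypothetical illustration only: the component count, amplitudes, and
    frequency ranges are assumptions, not taken from the data set.
    """
    rng = random.Random(seed)

    # Build five (amplitude, frequency) pairs. Each frequency is a power of
    # two plus a random offset, so the components do not line up on a
    # shared period.
    components = []
    for i in range(3, 8):
        amplitude = 1 / 2 ** i
        frequency = 2 ** (2 + i) + rng.uniform(0, 2 ** i)
        components.append((amplitude, frequency))

    # Sample the summed signal on the interval [0, 1).
    series = []
    for k in range(n_points):
        t = k / n_points
        value = sum(a * math.sin(2 * math.pi * f * t) for a, f in components)
        series.append(value)
    return series


if __name__ == "__main__":
    data = pseudo_periodic_series(1000)
    print(len(data), min(data), max(data))
```

Fixing the seed makes the series reproducible, which is useful when comparing indexing schemes against the same input.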
... published by Hobart Press (books@hobart.com) and written by William S. Cleveland (wsc@research.att.com). There is also a README file, so there are 26 files in all. Each of the 25 files has the data in an ASCII table format. The name of each data file is the name of the data set used in the book. To find the description of the data set in the book, look under the entry “data, name” in the index. For example, one data ...

Robot execution failures This data set contains force and torque measurements on a robot after failure detection. Each failure is characterized by 15 force/torque samples collected at regular time intervals starting immediately after failure detection.

Synthetic control chart time series This data consists of synthetically generated control charts

... types and tasks of interest

D.2.1 Discrete sequence data

UNIX user data This file contains nine sets of sanitized user data drawn from the command histories of eight UNIX computer users at Purdue over the course of up to two years.

D.2.2 Customer preference and recommendation data

Entree Chicago recommendation data This data contains a record of user interactions with the Entree Chicago restaurant recommendation ...

... data sets taken from various case studies. These data sets are suitable for model building exercises such as are discussed in the textbook Time Series Modeling of Water Resources and Environmental Systems by K. W. Hipel and A. I. McLeod (Elsevier, 1994). For PC users there is also a zip file, mhsets.zip. The shar file and the zip files are about 1.7 Mb and 0.5 Mb, respectively. Ian McLeod (aim@fisher.stats.uwo.ca) ...

... biomedical.)

D.3 MLnet online information service

http://www.mlnet.org/ The MLnet Online Information Service is dedicated to the field of machine learning, knowledge discovery, case-based reasoning, knowledge acquisition, and data mining. The site provides information on research groups and persons in the community. You can browse through the list of software and data sets, and check out the events page ...

... Applied Statistics 1989) There is a large amount of data. Please be sure you want it before you ask for it! There are two entries to obtain:

  wind.desc, a short description of the data (815 bytes)
  wind.data, the data (532494 bytes)

D.5.79 wind.correlations Estimated correlations between daily 3 p.m. wind measurements during September and ...

... The data contain a trend and outliers. Source: Laurie Davies (mata00@de0hrz1a.BITNET) (43k, 5/Feb/93)

D.5.8 baseball Data on the salaries of North American major league baseball players. The data set has performance and salary information on players during the 1986 season. This was the 1988 ASA Graphics Section Poster Session data set, organized by Lorraine Denby. There are two files to retrieve: baseball.data, ...

... carriers from normals as the more difficult measurements. Unfortunately, I don’t remember which measurement is which. There are two files to retrieve:

  biomed.desc, which is a short description of the data and a reference (1457 bytes)
  biomed.data, which is a shar archive containing the data for carriers and normals (7843 bytes)

D.5.10 ...

... enable researchers in knowledge discovery and data mining to scale existing and future data analysis algorithms to very large and complex data sets.