Introductory material on data mining
Data Mining
Adrian Tuhtan
004757481
CS157A Section 1

Overview
- Introduction
- Explanation of Data Mining Techniques
- Advantages
- Applications
- Privacy

Data Mining

What is Data Mining?
- “The process of semi automatically analyzing large databases to find useful patterns” (Silberschatz)
- KDD: “Knowledge Discovery in Databases” (3)
- “Attempts to discover rules and patterns from data”
- Discover rules
- Make predictions

Areas of Use
- Internet: discover the needs of customers
- Economics: predict stock prices
- Science: predict environmental change
- Medicine: match patients with similar problems to find a cure

Example of Data Mining
A credit card company wants to discover information about its clients from its databases. It wants to find:
- Clients who respond to promotions sent as “junk mail”
- Clients who are likely to switch to a competitor
- Clients who are unlikely to pay
- Services that clients use, so that services affiliated with the credit card company can be promoted to them
- Anything else that may help the company provide or promote services to its clients and ultimately make more money

Data Mining & Data Warehousing
- Data Warehouse: “is a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site.” (Silberschatz)
- Collect data
- Store it in a single repository
- Queries are easier to develop because only a single repository has to be queried.
- Data Mining: analyzing databases or data warehouses to discover patterns in the data and gain knowledge from them.
- Knowledge is power.

Discovery of Knowledge

Data Mining Techniques
- Classification
- Clustering
- Regression
- Association Rules

Classification
- Classification: given a set of items that each belong to one of several classes, and past instances (training instances) with their associated classes, classification is the process of predicting the class of a new item, that is, identifying which class the new item belongs to.
- Example: a bank wants to classify its home loan customers into groups according to their response to bank advertisements. The bank might use the classes “Responds Rarely”, “Responds Sometimes”, and “Responds Frequently”. It will then attempt to find rules about the customers who respond frequently and sometimes. These rules could be used to predict the needs of potential customers.

Technique for Classification
- Decision-Tree Classifiers
[Figure: decision tree. The root splits on Job (Carpenter, Engineer, Doctor); each job branch then splits on Income and ends in a Bad or Good leaf. Income thresholds shown: <30K, <40K, <50K, >50K, >90K, >100K.]
Predicting the credit risk of a person with the jobs specified.

Clustering
- “Clustering algorithms find groups of items that are similar. … It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other.” (2)
- Example: an insurance company could use clustering to group clients by their age, location, and types of insurance purchased.
- The categories are not specified in advance; this is referred to as “unsupervised learning”.
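To make the clustering idea concrete, here is a minimal sketch in Python of k-means, one common clustering algorithm. The slides do not name a specific algorithm, so k-means is an assumption chosen for illustration. The client records, the two features (age and number of insurance products purchased), and the function name kmeans are all hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means: returns (centroids, labels) for a list of numeric tuples."""
    # Assumption: seed the centroids with k evenly spaced points so that
    # the result is repeatable (real implementations usually seed randomly).
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels

# Hypothetical clients: (age, number of insurance products purchased).
clients = [(22, 1), (25, 1), (27, 2), (48, 4), (52, 5), (55, 4)]
centroids, labels = kmeans(clients, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

No class labels are supplied to the function; the two groups (younger clients with few products, older clients with several) emerge from the data alone, which is what makes this unsupervised. Categorical attributes such as location or insurance type would first need a numeric encoding, and features on different scales would typically be normalized before computing distances.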