Contents
Chapter 3: Data regression
Chapter 7: Data mining and database technology
Chapter 7: Data mining and database technology
7.1 Overview of database technology
7.2 Support for data mining in database technology
7.3 Query languages for data mining
7.4 Data mining support in today's DBMSs
7.5 Summary
7.0 Scenario 1
Is the person using the card with ID = 1234 really the cardholder, or a thief?
7.0 Scenario 3
Will the STB stock price go up tomorrow?
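7.0 Scenario 4
[Table: student records with the columns Cohort (Khóa), Course1 (MônHọc1), Course2 (MônHọc2), …, Graduated (TốtNghiệp). Students of the 2004 cohort have known outcomes, for example grades 9.0 and 8.5: Yes; 6.5 and 8.0: Yes; 4.0 and 2.5: No. Students of later cohorts have predicted outcomes, for example 2005 with grades 7.0 and 6.0: Yes (80%); 2006 with grades 9.5 and 7.5: Yes (90%); 2007: No (97%).]
How can we determine the likelihood of graduation of …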
7.0 Scenario …
We are data rich, but information poor.
“Necessity is the mother of invention” - Plato
7.1 Overview of database technology
… the data mining process (summarized from Chapter 1)
Originates from practical application requirements
Real data / synthetic data from simulation
Structure ranging from simple to complex
Large data volume, changing frequently
Long-term storage / temporary storage
→ Manage and exploit effectively
… a specific mining process
… parameters
Long-term storage / temporary storage
→ Manage and exploit effectively
Model: “a representation of something, either as a physical object which is usually smaller than the real object, or as a simple …”
→ Modeling the data for the mining process
→ Modeling the results of the mining process
Simple Data without Queries
Simple Data with Queries
Complex Data without Queries
Complex Data with Queries
Complex data: Video, Audio, Image, Text, 3D Graphical Data, etc.
                Without Queries                 With Queries
Simple Data     File Systems                    Relational DB Systems
Complex Data    Object (Oriented) DB Systems    Object Relational DB Systems
Source: M Stonebraker, P Brown with D Moore, Object-Relational DBMS’s – Tracking the Next
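The classification above can be read as a two-question lookup: is the data complex, and does the application need queries? The following is a minimal Python sketch of that lookup. The function name `dbms_quadrant` and its parameters are hypothetical, and the pairing of quadrants with system types follows the usual reading of the Stonebraker matrix rather than anything stated explicitly on the slide.

```python
def dbms_quadrant(complex_data: bool, needs_queries: bool) -> str:
    """Return the kind of system suggested for a data/query profile.

    Hypothetical helper: the mapping is an assumption based on the
    simple/complex data versus with/without queries classification.
    """
    quadrants = {
        (False, False): "File Systems",
        (False, True): "Relational DB Systems",
        (True, False): "Object (Oriented) DB Systems",
        (True, True): "Object Relational DB Systems",
    }
    return quadrants[(complex_data, needs_queries)]


# Video or image data that must also be queried falls in the last quadrant.
suggestion = dbms_quadrant(complex_data=True, needs_queries=True)
```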
Conceptual Data Model        ERM                 NIAM/ORM            UML
Set of objects of interest   Entity type         Object type         Class
Set of dependent objects     Weak entity type    -                   -
Object attribute             Attribute           -                   Attribute
Object identity              Key attribute       Identifier          OID (implicit)
Relationships                Relationship type   Relationship type   Relationship type
Constraints                  Fewer               Richer              OCL expressions
Object behaviors             No                  …                   …
…                            …                   5NF relations       1NF relations
Logical data modeling for the mining process

Data Model: Relational
  Key Construct: Relation with atomic attributes
  Identity: Primary key (attribute values)
  Referential Constraint: Foreign key (attribute values)
  Language: Relational algebra, tuple relational calculus, SQL:89, SQL:92
Data Model: Nested Relational
  Key Construct: Nested relation with nested relation attributes
  Identity: Primary key (attribute values)
  Referential Constraint: Foreign key (attribute values)
  Language: Nested relational algebra with nest/unnest operations
Data Model: Object Relational
  Key Construct: Relation/un-encapsulated object with atomic/non-atomic attributes
  Identity: Primary key (attribute values) / OID (ROWID, REFC) (system-generated)
  Referential Constraint: Foreign key (attribute values) / logical pointer REF (system-generated)
  Language: SQL:3, SQL:99, SQL:2003, OQL
Data Model: Object
  Key Construct: Fully encapsulated object with atomic/non-atomic attributes
  Identity: OID (system-generated)
  Referential Constraint: Logical pointer REF (system-generated)
  Language: Methods calling
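The nest/unnest operations mentioned for the nested relational model can be illustrated with a small Python sketch. The functions `nest` and `unnest`, their parameters, and the toy student/course rows are all hypothetical; they only mimic the idea of turning a flat relation into one with a relation-valued attribute and back, not any particular algebra implementation.

```python
from itertools import groupby
from operator import itemgetter


def nest(rows, key, flat_attr, nested_attr):
    """Group flat rows by `key`, collecting `flat_attr` values into a list."""
    ordered = sorted(rows, key=itemgetter(key))  # stable sort keeps inner order
    return [
        {key: k, nested_attr: [r[flat_attr] for r in group]}
        for k, group in groupby(ordered, key=itemgetter(key))
    ]


def unnest(rows, key, nested_attr, flat_attr):
    """Flatten a list-valued attribute back into one row per element."""
    return [{key: r[key], flat_attr: v} for r in rows for v in r[nested_attr]]


# Toy data (illustrative only).
flat = [
    {"student": 1, "course": "DB"},
    {"student": 1, "course": "DM"},
    {"student": 2, "course": "DB"},
]
nested = nest(flat, "student", "course", "courses")
round_trip = unnest(nested, "student", "courses", "course")
```

In this example unnesting the nested relation gives back the original flat rows, which is the property that makes the two operations a natural pair.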
“A data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management’s decisions.”
UML conceptual model
Star (relational)/multidimensional model
Figure 2.5 The structure of the data warehouse.
Source: W.H. Inmon, Building the data warehouse, 3rd Edition, John Wiley & Sons, Inc., 2001.
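A star (relational) model places one fact table at the center and joins it to dimension tables. The sketch below uses Python's built-in `sqlite3` module to show the shape of such a schema and a typical aggregate query over it. All table names, column names, and rows (`fact_sales`, `dim_time`, `dim_product`) are hypothetical toy values, not taken from the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe the context of each fact.
    CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    -- The fact table holds the measures and points to the dimensions.
    CREATE TABLE fact_sales (
        time_id INTEGER REFERENCES dim_time(time_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        amount REAL
    );
    INSERT INTO dim_time VALUES (1, 2005), (2, 2006);
    INSERT INTO dim_product VALUES (1, 'book'), (2, 'music');
    INSERT INTO fact_sales VALUES (1, 1, 10.0), (1, 2, 5.0), (2, 1, 7.5), (2, 1, 2.5);
    """
)

# Aggregate the measure along two dimensions (year and category).
totals = conn.execute(
    """
    SELECT t.year, p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_time AS t ON t.time_id = f.time_id
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY t.year, p.category
    ORDER BY t.year, p.category
    """
).fetchall()
conn.close()
```

Each row of `totals` is one cell of the multidimensional view (year by category) of the same data.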
Figure 1 Decision support system architecture, which consists of three principal components: a data
warehouse server, analysis and data mining tools, and data warehouse back-end tools.
A large collection of discovered knowledge is managed by a so-called pattern management system, just as data is managed by a defined/developed/used DBMS
Architectural issues
Representation constructs: Pattern type, Pattern, Class
Implicit constraints: Pattern-Pattern type, Pattern-Class, Pattern-Pattern type
Specialization, composition, refinement
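[108] S Rizzi, E Bertino, B Catania, M Golfarelli, M Halkidi, M Terrovitis, P Vassiliadis, M Vazirgiannis, E Vrachnos. Towards a logical model for patterns. In Proceedings of the ER 2003, LNCS 2813, pp 77-90, 2003.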
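The three representation constructs (pattern type, pattern, class) and two of the implicit constraints (Pattern-Pattern type and Pattern-Class) can be sketched with Python dataclasses. This is only an illustration under assumptions: the field names (`structure_schema`, `measure_schema`, `pid`, and so on) and the association-rule example values are hypothetical and are not the formal definitions of the cited logical model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PatternType:
    """Describes what every pattern of this type must contain (assumed fields)."""
    name: str
    structure_schema: tuple  # attribute names of the pattern structure
    measure_schema: tuple    # names of the quality measures


@dataclass
class Pattern:
    pid: int
    ptype: PatternType
    structure: dict
    measures: dict

    def __post_init__(self):
        # Pattern-Pattern type constraint: a pattern conforms to its type.
        if set(self.structure) != set(self.ptype.structure_schema):
            raise ValueError("structure does not match the pattern type")
        if set(self.measures) != set(self.ptype.measure_schema):
            raise ValueError("measures do not match the pattern type")


@dataclass
class PatternClass:
    """A named collection of patterns that share one pattern type."""
    name: str
    ptype: PatternType
    patterns: list = field(default_factory=list)

    def add(self, pattern: Pattern) -> None:
        # Pattern-Class constraint: only patterns of the class's type are allowed.
        if pattern.ptype != self.ptype:
            raise TypeError("pattern type differs from the class's pattern type")
        self.patterns.append(pattern)


rule_type = PatternType("AssociationRule", ("body", "head"), ("support", "confidence"))
rule = Pattern(1, rule_type, {"body": "bread", "head": "milk"},
               {"support": 0.5, "confidence": 0.67})
grocery_rules = PatternClass("GroceryRules", rule_type)
grocery_rules.add(rule)
```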
Reference architecture for a pattern base management system using the logical model
Source: S Rizzi, E Bertino, B Catania, M Golfarelli, M Halkidi, M Terrovitis, P Vassiliadis, M Vazirgiannis, E Vrachnos. Towards a logical model for patterns. In Proceedings of the ER 2003, LNCS 2813, pp 77-90, 2003.
[3-2006] (language and system development – PhD thesis)
[87-2007] (Interoperability issues + support for application programs + driver development)
[73-2008] (summary)
Related Works - [108-2003] → [105-2007]
[108-2003] (architectural issues + representational constructs + pattern relationships)
[12-2004, 2007] (formal definition, pattern warehouse, query types, predicates and operators)
[105-2007] (more operators on pattern warehouse + indexing techniques – PhD thesis)
Related Works - [108-2003] → [101-2009]
[108-2003]
[99-2007] (model extension with superclass, ontology for knowledge evaluation of association rules and queries)
[100-2008] (pattern comparison methods for clustering)
[101-2009] (pattern comparison for crisp/fuzzy clustering, open source prototype development (PatternMiner) – PhD thesis)
[98-2005] (Database approach: relational, object relational, and XML-based databases)
Related works
4. B Catania, A Maddalena, M Mazza, E Bertino, S Rizzi. A framework for data mining pattern management. In Proceedings of PKDD 2004, LNAI 3202, pp 87-98, 2004.
97. B Catania, A Maddalena. Pattern Management: Practice and Challenges. In Processing and Managing Complex Data for Decision Support, J Darmont, O Boussaid (eds.), Idea Group Publishing, 2006.
73. B Catania. Towards effective solutions for pattern management. International Journal of Computer Science and Applications, Vol 5(3), 2008, 36-45.
98. E Kotsifakos, I Ntoutsi, Y Theodoridis. Database support for data mining patterns. In Proceedings of the 10th Panhellenic Conference on Informatics (PCI’05), Advances in Informatics – Springer-Verlag LNCS 3746, 2005.
99. E.E Kotsifakos, G Marketos, Y Theodoridis. A framework for integrating ontologies and pattern-bases. Data Mining with Ontologies: Implementations, Findings, and Frameworks, H.O Nigro, S G Cisaro, D Xodo (eds.), Chapter 12, IDEA Group, 2007.
100. E.E Kotsifakos, I Ntoutsi, Y Vrahoritis, Y Theodoridis. PATTERN-MINER: Integrated management and mining over data mining models (Demo). In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08), 2008.
101. E.E Kotsifakos. Pattern representation and management techniques – The PBMS concept. PhD Thesis, Department of Informatics, University of Piraeus, 2009.
3. A Maddalena. A unified framework for heterogeneous pattern management. PhD thesis in Computer Science, University of Genova, April 2006.
87. A Maddalena, B Catania. Towards an interoperable solution for pattern management. In Proceedings of VLDB’07, 2007.
106. R Meo, G Psaila. An XML-based database for knowledge discovery. In Proceedings of the EDBT 2006 Workshops, LNCS 4254, pp 814-828, 2006.
108. S Rizzi, E Bertino, B Catania, M Golfarelli, M Halkidi, M Terrovitis, P Vassiliadis, M Vazirgiannis, E Vrachnos. Towards a logical model for patterns. In Proceedings of the ER 2003, LNCS 2813, pp 77-90, 2003.
105. M Terrovitis. Modelling and operational issues for pattern base management systems. PhD Thesis, Computer Science Division, School of Electrical and Computer Engineering, National Technical University of Athens, 2007.
12. M Terrovitis, P Vassiliadis, S Skiadopoulos, E Bertino, B Catania, A Maddalena, S Rizzi. Modeling and language support for the management of pattern-bases. Data & Knowledge
7.2 Support for data mining in database technology
From the demand for knowledge in the data collected today to the requirements placed on the data mining process
From the requirements of the data mining process to the requirements placed on database technology
→ conventional DBMS, in-memory DBMS, column-oriented DBMS, IR + DBMS, semantic technologies + DBMS, service-oriented DBMS, …
7.3 Query languages for data mining
A data mining query language provides primitives to:
select the data to be mined and pre-process these data,
specify the kind of patterns to be mined,
specify the needed background knowledge (such as item hierarchies when mining generalized association rules),
define the constraints on the desired patterns,
post-process extracted patterns
Source: J-F Boulicaut, C Masson, Data Mining Query Languages, Chapter 1 in: The Data Mining and Knowledge Discovery Handbook, O Maimon and L Rokach (Eds.), Springer, 2005, pp 715-727.
Proposals for association rule mining
MSQL (Imielinski and Virmani, 1999) at Rutgers University
MINE RULE (Meo et al., 1998) at the University of Torino and the Politecnico di Milano
DMQL (Han et al., 1996) at Simon Fraser University
OLE DB for DM by Microsoft Corporation (Netz et al., 2000)
MSQL (Imielinski and Virmani, 1999) at Rutgers University
Inductive queries to mine rules
Post-processing queries over a materialized collection of rules
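The two kinds of queries can be told apart with a small Python sketch: the first step mines rules from data under support and confidence thresholds (the inductive query), and the second step only filters the stored rules without touching the data again (the post-processing query). This is not MSQL syntax; the function names, the toy transactions, and the restriction to single-item rules are assumptions made to keep the example short.

```python
from itertools import combinations

# Toy transaction data (illustrative only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def mine_rules(min_support, min_confidence):
    """Inductive query: derive single-item rules body -> head from the data."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for body, head in ((a, b), (b, a)):
            s = support({body, head})
            c = s / support({body})
            if s >= min_support and c >= min_confidence:
                rules.append({"body": body, "head": head,
                              "support": s, "confidence": c})
    return rules


# Step 1: mine once and keep the result as a materialized rule collection.
rulebase = mine_rules(min_support=0.5, min_confidence=0.6)

# Step 2: post-processing query over the stored rules only.
rules_about_milk = [r for r in rulebase if r["head"] == "milk"]
```

Keeping the mined rules materialized means that follow-up questions, such as selecting the rules with a given head, need no further pass over the transactions.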
MINE RULE (Meo et al., 1998) at the University of Torino and the Politecnico di Milano
DMQL (Han et al., 1996) at Simon Fraser University
OLE DB for DM by Microsoft Corporation (Netz et al., 2000)
7.4 Data mining support in today's DBMSs
The SQL Multimedia and Applications Packages Standard (SQL/MM)
An initiative developed and published by the International Organization for Standardization (ISO)
Part 1: Framework
Part 2: Full-Text
Part 3: Spatial
Part 5: Still Image
Part 6: Data Mining
Part 6 specifies an SQL interface to data mining applications and services through accessing data from SQL/MM-compliant relational databases
A standardized interface to data mining algorithms that can be layered atop any object-relational database system and even deployed as middleware when required
A collection of user-defined types provided for the key data mining functions, namely Association Rule Discovery, Clustering, Classification, and Regression
Source: S S Anand, M Grobelnik, F Herrmann, D Wettschereck, M Hornick, C Lingenfelder, N Rooney, Knowledge discovery standards, Artif Intell Rev (2007) 27:21-56.
The SQL Multimedia and Applications Packages Standard (SQL/MM) – Part 6
User-defined types related to data
used to submit a single record of data for model application
User-defined types by phase (<Technique> stands for the mining technique):
Build phase: DM_<Technique>Settings, DM_<Technique>BldTask, DM_<Technique>Model
Test phase: DM_<Technique>TestTask, DM_<Technique>Model, DM_<Technique>TestResult
Apply phase: DM_<Technique>ApplTask, DM_<Technique>Model, DM_<Technique>Result, DM_ApplicationData
Source: S S Anand, M Grobelnik, F Herrmann, D Wettschereck, M Hornick, C Lingenfelder, N Rooney, Knowledge discovery standards, Artif Intell Rev (2007) 27:21-56.
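The type groups above suggest a workflow in three phases: settings and a build task produce a model, a test task evaluates the model, and an apply task scores single records. The Python sketch below only mirrors that flow. It is not the SQL/MM interface: the names `ClasSettings`, `ClasModel`, `build`, and `run_test` are hypothetical stand-ins, the "model" is a trivial majority-class predictor, and the test and application records are made-up toy rows.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ClasSettings:
    """Stand-in for the role of DM_<Technique>Settings: names the target field."""
    target: str


@dataclass
class ClasModel:
    """Stand-in for the role of DM_<Technique>Model (majority-class predictor)."""
    target: str
    majority: str

    def apply(self, record: dict) -> str:
        # Apply phase: one record of application data in, one result out.
        return self.majority


def build(settings: ClasSettings, training_rows: list) -> ClasModel:
    # Build phase: settings plus training data yield a model.
    counts = Counter(row[settings.target] for row in training_rows)
    return ClasModel(settings.target, counts.most_common(1)[0][0])


def run_test(model: ClasModel, test_rows: list) -> float:
    # Test phase: the test result here is simply the accuracy.
    hits = sum(model.apply(row) == row[model.target] for row in test_rows)
    return hits / len(test_rows)


training = [
    {"course1": 9.0, "course2": 8.5, "graduated": "Yes"},
    {"course1": 6.5, "course2": 8.0, "graduated": "Yes"},
    {"course1": 4.0, "course2": 2.5, "graduated": "No"},
]
held_out = [  # hypothetical test rows
    {"course1": 7.0, "course2": 6.0, "graduated": "Yes"},
    {"course1": 2.0, "course2": 3.0, "graduated": "No"},
]
model = build(ClasSettings(target="graduated"), training)
accuracy = run_test(model, held_out)
prediction = model.apply({"course1": 9.5, "course2": 7.5})
```

Separating the three phases into distinct objects is what lets a stored model be tested or applied later without repeating the build step.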
The application retrieves the model with the statement:
and calls the following to compute the predicted class:
Microsoft’s OLE DB for Data Mining (OLE-DB 2000): an approach specially designed for data mining needs. It combines SQL with a low-level API (a set of COM interfaces) to achieve interoperability with other client and server technologies.
MS Naïve Bayes, MS Decision Trees, MS Time Series, MS Clustering, MS Sequence Clustering, MS Association Rules, MS Neural Network
IBM’s DB2 Intelligent Miner products contain a set of DB2 database extenders (DB2-IM 2004) that incorporate data mining functionality into the standard database SQL language in a relatively standard way.
Functionality is based on IBM’s “Intelligent Miner” data mining product, now part of the IBM DB2 Data Warehouse Edition V9.1.
Intelligent Miner fully implements SQL/MM data mining as well as most of PMML.
Oracle Data Mining (Oracle 2004): a set of functions available in Oracle’s database and accessible through PL/SQL (the programming language available to database programmers) and through a Java interface.
Decision Tree, Generalized Linear Models, Minimum Description Length, Naïve Bayes, Support Vector Machines, Apriori, k-Means, Non-Negative Matrix Factorization, One Class Support Vector Machine, Orthogonal Partitioning Clustering
Source: S S Anand, M Grobelnik, F Herrmann, D Wettschereck, M Hornick, C Lingenfelder, N Rooney, Knowledge discovery standards, Artif Intell Rev (2007) 27:21-56.