
DATA MINING LECTURE 5 Sequential Pattern Mining




Outline
• Sequence database
• Sequential pattern mining
• Methods for sequential pattern mining
  – Apriori-based approaches
    • GSP
    • SPADE
  – Pattern-growth-based approaches
    • PrefixSpan

Sequence Databases
• A sequence database consists of ordered elements or events
• Transaction databases vs. sequence databases

A transaction database
TID   itemsets
10    a, b, d
20    a, c, d
30    a, d, e
40    b, e, f

A sequence database
SID   sequences
10    <(a) (abc) (ac) d (cf)>
20    <…>
30    <…>
40    <…>

Applications
• Applications of sequential pattern mining
  – Customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, within months
  – Medical treatments, natural disasters (e.g., earthquakes), science and engineering processes, stocks and markets, etc.
  – Telephone calling patterns, Weblog click streams
  – DNA sequences and gene structures

Subsequence vs. Supersequence
• A sequence is an ordered list of events, denoted <e1 e2 … el>
• Given two sequences α = <a1 a2 … an> and β = <b1 b2 … bm> (m ≥ n)
• α is called a subsequence of β, denoted α ⊆ β, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn; β is then a supersequence of α
• Example: with β = <(abc) (de)>, the sequence <(ab) d> is a subsequence of β

What Is Sequential Pattern Mining?
• Given a set of sequences and a support threshold, find the complete set of frequent subsequences
• A sequence database (SIDs 10 to 30 shown on the slide)
• A sequence: <(ef) (ab) (df) c b>
  – An element may contain a set of items
  – Items within an element are unordered, and we list them alphabetically
• <…> is a subsequence of <…>
• Given support threshold min_sup = 2, <…> is a sequential pattern

Challenges on Sequential Pattern Mining
• A huge number of possible sequential patterns are hidden in databases
• A mining algorithm should:
  – find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  – be highly efficient and scalable, involving only a small number of database scans
  – be able to incorporate various kinds of user-specific constraints

Methods for sequential pattern mining
• Apriori-based approaches
  – GSP
  – SPADE
• Pattern-growth-based approaches
  – PrefixSpan

The Apriori Property of Sequential Patterns
• A basic property: Apriori (Agrawal & Srikant '94)
  – If a sequence S is not frequent, then none of the super-sequences of S is frequent
  – E.g., if <…> is infrequent, so are <…> and <…>
• Example database: Seq IDs 10, 20, 30, 40, 50, with support threshold min_sup = 2

GSP: Generalized Sequential Pattern Mining
• GSP (Generalized Sequential Pattern) mining algorithm
• Outline of the method
  – Initially, every item in the DB is a candidate of length 1
  – For each level (i.e., sequences of length k):
    • scan the database to collect the support count for each candidate sequence
    • generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
  – Repeat until no frequent sequence or no candidate can be found
• Major strength: candidate pruning by Apriori

Completeness of PrefixSpan
[Figure: the sequence database (SDB, SIDs 10 to 40) is split by its length-1 sequential patterns into one projected database per prefix; each projected database yields length-2 sequential patterns and is split again by prefix, recursively.]

PrefixSpan - Example
• Sequence database with ids 10, 20, 30, 40 and min_support = …
• Find length-1 sequential patterns
• Divide the search space by prefix

PrefixSpan - Example (2)
• Find subsets of sequential patterns

The Algorithm of PrefixSpan
• Input: a sequence database S, and the minimum support threshold min_sup
• Output: the complete set of sequential patterns
• Method: call PrefixSpan(<>, 0, S)
• Subroutine PrefixSpan(α, l, S|α)
• Parameters:
  – α: a sequential pattern
  – l: the length of α
  – S|α: the α-projected database if α ≠ <>; otherwise, the sequence database S

The Algorithm of PrefixSpan (2)
• Method
  1. Scan S|α once and find the set of frequent items b such that:
     a) b can be assembled to the last element of α to form a sequential pattern (I-concatenation); or
     b) (b) can be appended to α to form a sequential pattern (S-concatenation)
  2. For each frequent item b, append it to α to form a sequential pattern α', and output α'
  3. For each α', construct the α'-projected database S|α', and call PrefixSpan(α', l+1, S|α')

Efficiency of PrefixSpan
• No candidate sequence needs to be generated
• Projected databases keep shrinking
• Major cost of PrefixSpan: constructing projected databases

Optimization in PrefixSpan
• Bi-level projection technique
  – Bi-level projection can reduce the number and size of projected databases
• Pseudo-projection technique
  – Pseudo-projection can reduce the effort of projection when the projected database fits in main memory

Scaling Up by Bi-Level Projection
• Partition the search space based on length-2 sequential patterns
• Create projected databases and pursue recursive mining over bi-level projected databases

Speed-up by Pseudo-projection
• Observation: postfixes of a sequence often appear repeatedly in recursive projected databases
• Method: instead of constructing a physical projection by collecting all the postfixes, we can use pointers referring to the sequences in the database as a pseudo-projection
• Every projection consists of two pieces of information: a pointer to the sequence in the database and an offset to the postfix in the sequence
[Table: pointer, offset, and postfix for each projection of sequence s1]

Bi-Level Projection: Pair-wise Checking Using S-matrix
• Input: the SDB (SIDs 10 to 40) and its length-1 sequential patterns
• Each off-diagonal cell holds three support counts for a pair of items: the column item followed by the row item, the row item followed by the column item, and both items in the same element. The slide callouts read these off as "happens … times" and "happens twice".
• All length-2 sequential patterns are found in the S-matrix

S-matrix (lower triangle; diagonal entries are missing from the text)
a    …
b    (4, 2, 2)  …
c    (4, 2, 1)  (3, 3, 2)  …
d    (2, 1, 1)  (2, 2, 0)  (1, 3, 0)  …
e    (1, 2, 1)  (1, 2, 0)  (1, 2, 0)  (1, 1, 0)  …
f    (2, 1, 1)  (2, 2, 0)  (1, 2, 1)  (1, 1, 1)  (2, 0, 1)  …
     a          b          c          d          e          f

Mining <…>-projected Database
• Starting from the SDB, its length-1 sequential patterns, and the S-matrix above, construct the <…>-projected database
• Local length-1 sequential patterns: …
• Callouts on the figure mark cells that lead to a pattern, cells with no hope, and a combination involving (a_c) that does not need to be counted
• S-matrix of the projected database:

a     …
c     (1, 0, 1)  …
(_c)  (∅, 2, ∅)  (∅, 1, ∅)  ∅
      a          c          (_c)

Benefits of Bi-level Projection
• More patterns are found in each shoot
• Far fewer projections
• In the example, there are 51 patterns
  – 51 level-by-level projections
  – 22 bi-level projections (the S-matrix has 22 entries with a value >= 2)

3-way Apriori Checking
• Use the Apriori heuristic to prune items in projected databases
• <…> is a pattern! <…> is not a pattern
• From the S-matrix above, when constructing the <…>-projected database, item d can be removed, because <…> is certainly not a pattern

Example - Bi-level Projection
• Sequence database with ids 10, 20, 30, 40 and min_support = 2
• Scan to get the length-1 sequences
• Construct a triangular matrix instead of projected databases for each length-1 pattern (the S-matrix shown above)
• All length-2 sequential patterns are read from the matrix; the slide lists four Support(<…>) values taken from it

Example - Bi-level Projection (2)
• For each length-2 sequential pattern α, construct the α-projected database and find the frequent items
• Construct the corresponding S-matrix
[Table: local counts for the items a, b, c, (_c), d, (_d), e, (_e), f, (_f), followed by the S-matrix of the projected database over a, c, (_c) with the cells (1, 0, 1), (∅, 2, ∅), (∅, 1, ∅)]
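
The subsequence definition and the notion of support can be made concrete with a short Python sketch. It is only an illustration: the names `make_sequence`, `is_subsequence`, and `support` are hypothetical, and the database holds just the two sequences that survive in the text, <(a) (abc) (ac) d (cf)> and <(ef) (ab) (df) c b>. Elements are modeled as sets, and matching each element of α against the earliest possible element of β is enough to decide containment.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[str]]


def make_sequence(*elements: str) -> Sequence:
    """Each string is one element; its characters are the items ("abc" -> {a, b, c})."""
    return [frozenset(element) for element in elements]


def is_subsequence(alpha: Sequence, beta: Sequence) -> bool:
    """True if every element of alpha is a subset of a distinct, later element of beta."""
    j = 0
    for a in alpha:
        # Advance to the earliest element of beta that contains a.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match strictly later
    return True


def support(pattern: Sequence, db: List[Sequence]) -> int:
    """Number of sequences in db that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in db)


# Only the two sequences that are legible in the text (an assumption of this sketch).
db = [
    make_sequence("a", "abc", "ac", "d", "cf"),
    make_sequence("ef", "ab", "df", "c", "b"),
]

print(is_subsequence(make_sequence("ab", "d"), make_sequence("abc", "de")))  # True
print(support(make_sequence("ab", "c"), db))  # 2
```

With min_sup = 2, the pattern <(ab) c> is therefore frequent even in this two-sequence excerpt.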
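
The GSP outline (count candidates level by level, then generate longer candidates from the frequent ones) can be sketched as follows. This is a deliberately simplified version, not the full algorithm: it assumes every element is a single item, so a sequence is just a string, and the function names and the toy database are made up for illustration. Candidates of length k+1 are built by joining two frequent length-k sequences that overlap on k-1 items, then pruned with the Apriori property.

```python
from typing import Dict, List


def contains(sequence: str, candidate: str) -> bool:
    """True if candidate's items occur in sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in candidate)


def gsp(db: List[str], min_sup: int) -> Dict[str, int]:
    """Level-wise mining for single-item elements; returns {pattern: support}."""

    def support(candidate: str) -> int:
        return sum(contains(s, candidate) for s in db)

    frequent: Dict[str, int] = {}
    # Level 1: every item in the database is a candidate.
    candidates = sorted({item for s in db for item in s})
    while candidates:
        # One pass over the database per level to count the candidates.
        level = {}
        for c in candidates:
            count = support(c)
            if count >= min_sup:
                level[c] = count
        frequent.update(level)
        # Join step: p and q merge when p without its first item equals q without its last.
        joined = {p + q[-1] for p in level for q in level if p[1:] == q[:-1]}
        # Apriori pruning: dropping any single item must leave a frequent sequence.
        candidates = sorted(
            c for c in joined
            if all(c[:i] + c[i + 1:] in level for i in range(len(c)))
        )
    return frequent


toy_db = ["abc", "acb", "ab", "bc"]  # hypothetical data, not from the slides
print(gsp(toy_db, min_sup=2))
```

On the toy data the loop runs three levels: all nine length-2 joins are counted, only `ab`, `ac`, and `bc` survive, and the single length-3 candidate `abc` fails the support threshold.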
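
The PrefixSpan recursion combined with pseudo-projection can be sketched in a few lines under a strong simplifying assumption: each element holds exactly one item, so only S-concatenation arises and the (_x) bookkeeping of I-concatenation is not needed. Each projected database is kept as a list of (pointer, offset) pairs rather than copied postfixes. The function name and the toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def prefixspan(db: List[str], min_sup: int) -> Dict[str, int]:
    """Pattern growth over single-item sequences; returns {pattern: support}."""
    patterns: Dict[str, int] = {}

    def mine(prefix: str, projected: List[Tuple[int, int]]) -> None:
        # projected: (pointer, offset) pairs, i.e. sequence index and postfix start.
        extensions: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
        for sid, offset in projected:
            seen = set()
            for pos in range(offset, len(db[sid])):
                item = db[sid][pos]
                if item not in seen:  # count each sequence once per item
                    seen.add(item)
                    # The new postfix starts right after the first occurrence.
                    extensions[item].append((sid, pos + 1))
        for item in sorted(extensions):
            entries = extensions[item]
            if len(entries) >= min_sup:
                grown = prefix + item
                patterns[grown] = len(entries)
                mine(grown, entries)  # recurse on the pseudo-projected database

    mine("", [(sid, 0) for sid in range(len(db))])
    return patterns


toy_db = ["abc", "acb", "ab", "bc"]  # hypothetical data, not from the slides
print(prefixspan(toy_db, min_sup=2))
```

No candidate is ever generated: every pattern that gets counted already occurs in some postfix, and the lists of (pointer, offset) pairs shrink at each level of recursion.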
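
The off-diagonal S-matrix cells can be recomputed with a small sketch. One loud assumption: sequences 20 and 40 are not legible in the text, so the database below is assumed to be the four-sequence running example commonly used in the PrefixSpan literature; rows 10 and 30 match the sequences that do survive, and the resulting triples agree with the cells listed in the S-matrix. The names `contains` and `s_matrix` are hypothetical, and diagonal cells are left out.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Sequence = List[FrozenSet[str]]


def contains(pattern: Sequence, sequence: Sequence) -> bool:
    """True if pattern is a subsequence of sequence (elements matched by subset)."""
    j = 0
    for element in pattern:
        while j < len(sequence) and not element <= sequence[j]:
            j += 1
        if j == len(sequence):
            return False
        j += 1
    return True


def s_matrix(db: List[Sequence], items: str) -> Dict[Tuple[str, str], Tuple[int, int, int]]:
    """For each pair x < y: (support of <x y>, support of <y x>, support of <(x y)>)."""

    def support(pattern: Sequence) -> int:
        return sum(contains(pattern, s) for s in db)

    matrix = {}
    for x, y in combinations(sorted(items), 2):
        fx, fy = frozenset({x}), frozenset({y})
        matrix[(x, y)] = (
            support([fx, fy]),               # x, then y in a later element
            support([fy, fx]),               # y, then x in a later element
            support([frozenset({x, y})]),    # x and y in the same element
        )
    return matrix


# ASSUMPTION: rows 20 and 40 are taken from the standard running example.
raw = [
    ["a", "abc", "ac", "d", "cf"],    # SID 10
    ["ad", "c", "bc", "ae"],          # SID 20 (assumed)
    ["ef", "ab", "df", "c", "b"],     # SID 30
    ["e", "g", "af", "c", "b", "c"],  # SID 40 (assumed)
]
db = [[frozenset(element) for element in row] for row in raw]

matrix = s_matrix(db, "abcdef")
print(matrix[("a", "b")])  # (4, 2, 2)
print(matrix[("c", "d")])  # (1, 3, 0)
```

Any cell with a component of at least min_sup = 2 yields a length-2 sequential pattern directly, which is why bi-level projection can skip the separate pass over each length-1 projected database.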

Posted: 08/11/2022, 14:02