DATA MINING LECTURE 5: Sequential Pattern Mining
Outline
• Sequence database
• Sequential pattern mining
• Methods for sequential pattern mining
  – Apriori-based approaches
    • GSP
    • SPADE
  – Pattern-growth-based approaches
    • PrefixSpan

Sequence Databases
• A sequence database consists of sequences of ordered elements or events
• Transaction databases vs. sequence databases

A transaction database
TID   itemsets
10    a, b, d
20    a, c, d
30    a, d, e
40    b, e, f

A sequence database
SID   sequences
10    < (a) (abc) (ac) d (cf) >
20    <…>
30    <…>
40    <…>

Applications
• Applications of sequential pattern mining
  – Customer shopping sequences:
    • First buy a computer, then a CD-ROM, and then a digital camera, within months
  – Medical treatments, natural disasters (e.g., earthquakes), science and engineering processes, stocks and markets, etc.
  – Telephone calling patterns, Weblog click streams
  – DNA sequences and gene structures

Subsequence vs. Supersequence
• A sequence is an ordered list of events, denoted < e1 e2 … el >
• Given two sequences α = < a1 a2 … an > and β = < b1 b2 … bm > (m ≥ n)
• α is called a subsequence of β, denoted α ⊆ β, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn
• β is then a supersequence of α
• E.g., α = <…> and β = < (abc) (de) >

What Is Sequential Pattern Mining?
• Given a set of sequences and a support threshold, find the complete set of frequent subsequences

A sequence database
SID   sequence
10    <…>
20    <…>
30    <…>

• A sequence: < (ef) (ab) (df) c b >
• An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.
• <…> is a subsequence of <…>
• Given support threshold min_sup = 2, <…> is a sequential pattern

Challenges on Sequential Pattern Mining
• A huge number of possible sequential patterns are hidden in databases
• A mining algorithm should:
  – find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  – be highly efficient and scalable, involving only a small number of database scans
  – be able to incorporate various kinds of user-specific constraints

Methods for sequential pattern mining
• Apriori-based approaches
  – GSP
  – SPADE
• Pattern-growth-based approaches
  – PrefixSpan

The Apriori Property of Sequential Patterns
• A basic property: Apriori (Agrawal & Srikant '94)
  – If a sequence S is not frequent, then none of the super-sequences of S is frequent
  – E.g., if <…> is infrequent, so are <…> and <…>

Seq ID   Sequence
10       <…>
20       <…>
30       <…>
40       <…>
50       <…>
Given support threshold min_sup = 2

GSP: Generalized Sequential Pattern Mining
• GSP (Generalized Sequential Pattern) mining algorithm
• Outline of the method
  – Initially, every item in the DB is a candidate of length 1
  – For each level (i.e., sequences of length k):
    • scan the database to collect the support count for each candidate sequence
    • generate candidate length-(k+1) sequences from the length-k frequent sequences using Apriori
  – Repeat until no frequent sequence or no candidate can be found
• Major strength: candidate pruning by Apriori

Completeness of PrefixSpan
[Figure: the SDB (SIDs 10, 20, 30, 40) is divided according to its length-1 sequential patterns into one projected database per prefix; each projected database is divided again according to its length-2 sequential patterns, and so on.]

Example in detail
PrefixSpan: Example
id    Sequence
10    <…>
20    <…>
30    <…>
40    <…>
• Find length-1 sequential patterns
• Divide the search space by prefix

PrefixSpan: Example (2)
• Find subsets of sequential patterns

The Algorithm of PrefixSpan
• Input: a sequence database S and the minimum support threshold min_sup
• Output: the complete set of sequential patterns
• Method: call PrefixSpan(<>, 0, S)
• Subroutine PrefixSpan(α, l, S|α)
• Parameters:
  – α: a sequential pattern
  – l: the length of α
  – S|α: the α-projected database if α ≠ <>; otherwise, the sequence database S

The Algorithm of PrefixSpan (2)
• Method:
  – Scan S|α once and find the set of frequent items b such that:
    a) b can be assembled to the last element of α to form a sequential pattern (I-concatenation), or
    b) (b) can be appended to α to form a sequential pattern (S-concatenation)
  – For each frequent item b, append it to α to form a sequential pattern α', and output α'
  – For each α', construct the α'-projected database S|α', and call PrefixSpan(α', l+1, S|α')

Efficiency of PrefixSpan
• No candidate sequence needs to be generated
• Projected databases keep shrinking
• Major cost of PrefixSpan: constructing projected databases

Optimization in PrefixSpan
• Bi-level projection technique
  – Bi-level projection can reduce the number and size of projected databases
• Pseudo-projection technique
  – Pseudo-projection can reduce the effort of projection when the projected database fits in main memory

Scaling Up by Bi-Level Projection
• Partition the search space based on length-2 sequential patterns
• Create projected databases and pursue recursive mining over bi-level projected databases

Speed-up by Pseudo-projection
• Observation: postfixes of a sequence often appear repeatedly in recursive projected databases
• Method: instead of constructing a physical projection by collecting all the postfixes, we can use pointers referring to the sequences in the database as a pseudo-projection
• Every projection consists of two pieces of information: a pointer to the sequence in the database and the offset of the postfix in the sequence
[Table: for sequence s1 = <…>, three projections listed with columns Pointer (s1 in every row), Offset, and Postfix; the offsets and postfixes are not recoverable.]

Bi-Level Projection
Pair-wise Checking Using S-matrix

SDB
SID   sequence
10    <…>
20    <…>
30    <…>
40    <…>

Length-1 sequential patterns: < a >, < b >, < c >, < d >, < e >, < f >

S-matrix
a
b   (4,2,2)
c   (4,2,1)  (3,3,2)
d   (2,1,1)  (2,2,0)  (1,3,0)
e   (1,2,1)  (1,2,0)  (1,2,0)  (1,1,0)
f   (2,1,1)  (2,2,0)  (1,2,1)  (1,1,1)  (2,0,1)
    a        b        c        d        e        f

• Callouts: <…> happens … times; <…> happens twice; <…> happens twice; <…> happens twice
• All length-2 sequential patterns are found in the S-matrix

Mining the <…>-projected Database
[Figure: the SDB, its length-1 sequential patterns, and the S-matrix above, with one length-2 pattern leading to the <…>-projected database.]
• Callout: lead to pattern <…>
• Local length-1 sequential patterns: < a >, < c >, < (_c) >
• Callout: no hope to form (a_c), so no need to count it

S-matrix of the projected database
a
c      (1,0,1)
(_c)   (∅,2,∅)  (∅,1,∅)  ∅
       a        c        (_c)

Benefits of Bi-level Projection
• More patterns are found in each shot
• Far fewer projections
  – In the example, there are 51 patterns
  – 51 level-by-level projections
  – 22 bi-level projections (the S-matrix has 22 entries with value ≥ 2)

3-way Apriori Checking
• Use the Apriori heuristic to prune items in projected databases
• <…> is a pattern!
• <…> is not a pattern

a
b   (4,2,2)
c   (4,2,1)  (3,3,2)
d   (2,1,1)  (2,2,0)  (1,3,0)
e   (1,2,1)  (1,2,0)  (1,2,0)  (1,1,0)
f   (2,1,1)  (2,2,0)  (1,2,1)  (1,1,1)  (2,0,1)
    a        b        c        d        e        f

• From the S-matrix above, when constructing the <…>-projected database, item d can be removed, since <…> is certainly not a pattern

Example: Bi-level Projection
id    Sequence
10    <…>
20    <…>
30    <…>
40    <…>
min_support = 2
• Scan to get the length-1 sequences
• Construct a triangular matrix instead of projected databases for each length-1 pattern

a
b   (4,2,2)
c   (4,2,1)  (3,3,2)
d   (2,1,1)  (2,2,0)  (1,3,0)
e   (1,2,1)  (1,2,0)  (1,2,0)  (1,1,0)
f   (2,1,1)  (2,2,0)  (1,2,1)  (1,1,1)  (2,0,1)
    a        b        c        d        e        f

• All length-2 sequential patterns
• Support(<…>) = …   Support(<…>) = …   Support(<…>) = …   Support(<…>) = …

Example: Bi-level Projection (2)
• For each length-2 sequential pattern α, construct the α-projected database and find the frequent items
• Construct the corresponding S-matrix
[Table: counts of the local items a, b, c, (_c), d, (_d), e, (_e), f, (_f); the counts are not recoverable.]

a
c      (1,0,1)
(_c)   (∅,2,∅)  (∅,1,∅)  ∅
       a        c        (_c)
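
The PrefixSpan recursion and the pseudo-projection idea from the slides above can be sketched together in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the lecture's own implementation: every element holds exactly one item, so only S-concatenation is needed and I-concatenation is left out. The function name `prefixspan` and the toy database are hypothetical and are not the example database used in the slides. Each projected database is stored as a list of (sequence index, offset) pairs, so postfixes are never copied.

```python
from collections import defaultdict


def prefixspan(db, min_sup):
    """Mine frequent subsequences from sequences of single items.

    Simplifying assumption: every element is one item, so patterns grow
    only by appending an item (S-concatenation).
    Returns a dict mapping each pattern (a tuple of items) to its support.
    """
    patterns = {}

    def mine(prefix, projection):
        # projection is a pseudo-projection: (sequence index, offset) pairs.
        # The postfix of db[i] that starts at offset is read in place.
        first_pos = defaultdict(dict)
        for i, offset in projection:
            seq = db[i]
            for pos in range(offset, len(seq)):
                # Keep only the first occurrence of each item per sequence.
                first_pos[seq[pos]].setdefault(i, pos)

        for item in sorted(first_pos):
            occurrences = first_pos[item]
            support = len(occurrences)  # number of sequences containing item
            if support < min_sup:
                continue
            pattern = prefix + (item,)
            patterns[pattern] = support
            # The new projection points just past the matched item.
            mine(pattern, [(i, pos + 1) for i, pos in occurrences.items()])

    mine((), [(i, 0) for i in range(len(db))])
    return patterns


if __name__ == "__main__":
    toy_db = ["abc", "acb", "ab"]  # hypothetical data, one item per element
    for pattern, support in prefixspan(toy_db, min_sup=2).items():
        print("".join(pattern), support)
```

On this toy input with min_sup = 2 the sketch reports five patterns: a, b, ab (support 3 each) and c, ac (support 2 each). No candidate sequences are generated, and each recursive call scans only the postfixes referenced by its (index, offset) pairs, which is the shrinking-projection behaviour the efficiency slide describes.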