CloSpan: Mining Closed Sequential Patterns in Large Datasets
Natural Language Processing Lab, NTU, 2006
Outline
– Introduction
– Search Space Pruning
– CloSpan
– Experimental Results
– Conclusions

Introduction
Definition
– Sequence, Elements, Subsequence and Sequential Pattern
    A sequence: <(ef)(ab)(df)cb>
    A sequence database (columns SID, sequence): SIDs 10, 20, 30, 40
    Elements: items within an element are listed alphabetically
    … is a subsequence of …
    Given support threshold min_sup_count = 2, … is a sequential pattern

Introduction (Cont.)
Definition
– Frequent Sequential Pattern (FS): includes all the sequences whose support is no less than min_sup
– Closed Frequent Sequential Pattern (CS): includes no sequence that has a super-sequence with the same support
– $CS \subseteq FS$

Introduction (Cont.)
Example – FS & CS
– Sequences: (af)dea, eab, e(abf)(bde)
– min_sup_count = 2
– FS: a:3, b:2, d:2, e:3, f:2, ab:2, ad:2, ae:2, (af):2, ea:3, eb:2, fd:2, fe:2, (af)d:2, (af)e:2, eab:2
– CS: ea:3, (af)d:2, (af)e:2, eab:2

Introduction (Cont.)
Definition
– Prefix and Postfix (Projection)
    …, … and … are prefixes of sequence …
    Table columns: Given sequence, Prefix, Postfix (Projection)

Introduction (Cont.)
Definition
– A sequence s = … and an item …
– I-Step extension: the item is added to the last element of s
    Ex: with {e}, … is an I-Step extension of …
– S-Step extension: the item is appended to s as a new last element
    Ex: … is an S-Step extension of …

Introduction (Cont.)
Definition
– Prefix Search Tree (figure): each node is labelled with an item and a subscript for the extension that produced it, s for S-Step and i for I-Step (labels shown: a_s, b_s, b_i, c_i, d_i)

Search Space Pruning (Cont.)
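As a baseline for what pruning has to improve on, the FS/CS example from the Introduction can be reproduced by brute force. The Python sketch below is illustrative only and is not CloSpan itself: it enumerates every subsequence of the three example sequences, counts supports, and filters the closed patterns. It also computes the postfix projection for a one-element prefix and the item count Γ of the projected database, which the pruning definitions rely on. All function names (contains, subsequences, mine_frequent, closed_only, show, project, gamma) are hypothetical, the threshold of 2 is taken from the supports listed in the example, and project handles only prefixes that consist of a single element.

```python
from itertools import combinations

# Example database from the Introduction. Each sequence is a list of
# elements; each element is a set of single-letter items.
DB = [
    [set("af"), set("d"), set("e"), set("a")],  # (af)dea
    [set("e"), set("a"), set("b")],             # eab
    [set("e"), set("abf"), set("bde")],         # e(abf)(bde)
]
MIN_SUP_COUNT = 2  # threshold implied by the supports listed in the example


def contains(seq, pattern):
    """True if pattern is a subsequence of seq (greedy left-to-right match)."""
    pos = 0
    for elem in pattern:
        while pos < len(seq) and not elem <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def subsequences(seq):
    """Every non-empty subsequence of seq, as a tuple of frozensets."""
    found = {()}
    for elem in seq:
        items = sorted(elem)
        subsets = [frozenset(c) for r in range(1, len(items) + 1)
                   for c in combinations(items, r)]
        found |= {p + (s,) for p in found for s in subsets}
    found.discard(())
    return found


def mine_frequent(db, min_sup):
    """Brute force: count the support of every candidate subsequence."""
    candidates = set().union(*(subsequences(s) for s in db))
    support = {p: sum(contains(s, p) for s in db) for p in candidates}
    return {p: n for p, n in support.items() if n >= min_sup}


def closed_only(frequent):
    """Keep patterns that have no proper super-sequence of equal support."""
    return {p: n for p, n in frequent.items()
            if not any(q != p and m == n and contains(q, p)
                       for q, m in frequent.items())}


def show(pattern):
    """Render a pattern in slide notation, e.g. (af)d."""
    parts = ("".join(sorted(e)) for e in pattern)
    return "".join(s if len(s) == 1 else "(" + s + ")" for s in parts)


def project(db, elem):
    """Postfixes of db with respect to a prefix made of one element."""
    postfixes = []
    for seq in db:
        for i, e in enumerate(seq):
            if elem <= e:
                # Items left in the matched element after the prefix items.
                rest = {x for x in e if x > max(elem)}
                postfixes.append(([rest] if rest else []) + seq[i + 1:])
                break
    return postfixes


def gamma(projected):
    """Total number of items in a (projected) database."""
    return sum(len(e) for seq in projected for e in seq)


frequent = mine_frequent(DB, MIN_SUP_COUNT)
closed = closed_only(frequent)
print(sorted(f"{show(p)}:{n}" for p, n in closed.items()))
print(gamma(project(DB, set("f"))), gamma(project(DB, set("af"))))
```

Under these assumptions the sketch yields the 16 frequent patterns and the four closed patterns listed in the example, and Γ is 6 for both D_f and D_(af), so the two projected databases coincide. Note that project keeps every item; the {de, (de)} shown for D_f in the slides appears to be the same projection with the locally infrequent items a and b removed.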
Definition
– $\Gamma(D)$ (Gamma): total number of items in D
– Equivalence of Projected Database
    For two sequences s and s′ with $s \sqsubseteq s'$: $D_s = D_{s'} \Leftrightarrow \Gamma(D_s) = \Gamma(D_{s'})$
Example
– $D_f = D_{(af)} = \{de, (de)\}$
– $\Gamma(D_{(af)}) = \Gamma(D_f)$

Search Space Pruning (Cont.)
Definition
– Early Termination by Equivalence
    For two sequences s and s′, if $s \sqsubseteq s'$ and $\Gamma(D_s) = \Gamma(D_{s'})$, then $\mathrm{support}(s \diamond \gamma) = \mathrm{support}(s' \diamond \gamma)$ for every extension $\gamma$
Example: $\Gamma(D_{(af)}) = \Gamma(D_f)$
– (af)d and (af)e are frequent
– support((af)d) = support(fd)
– support((af)e) = support(fe)
– the supports of fd and fe are not computed separately

CloSpan (Cont.)
Example (Cont.)
– Figure: prefix search tree rooted at nil with nodes a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3, a_s:3, b_s:2, next to a table indexed by Γ(D_s) Mod (entry 0: nil); D_f = {de, (de)} with d:2, e:2

CloSpan (Cont.)
Example (Cont.)
– Figure: the same prefix search tree (a_s:3, f_i:2, d_s:2, a_s:3, b_s:2, e_s:2, e_s:3, b_s:2) with the resulting closed patterns ea:3, (af)d:2, (af)e:2, eab:2

Experimental Results
Synthetic Data
– Parameters
    D: number of sequences (in thousands)
    C: average itemsets per sequence
    T: average items per itemset
    N: number of different items (in thousands)
    S: average itemsets in maximal sequences
    I: average items in maximal sequences
– Two data sets
    D10 C10 T2.5 N10 S6 I2.5
    D5 C20 T20 N10 S20 I20
Real world datasets
– KDDCup2000
– Gazelle Click Stream

Experimental Results (Cont.)
Synthetic Data: D10 C10 T2.5 N10 S6 I2.5 (figure)

Experimental Results (Cont.)
Synthetic Data: D5 C20 T20 N10 S20 I20 (figure)

Experimental Results (Cont.)
Real world datasets – KDDCup2000
– 29,369 sequences
– 35,722 sessions
– 87,546 page views
– The average number of sessions in a sequence is around …
– The average number of page views in a session is …
– The largest session contains 342 page views
– The longest sequence has 140 sessions
– The largest sequence contains 651 page views

Experimental Results (Cont.)

Conclusions
– CloSpan mines frequent closed sequences efficiently
– CloSpan outperforms PrefixSpan

Lexicographic Order
Definition
– Lexicographic Order
    $t = \{i_1, i_2, \ldots, i_k\}$, $i_1 \le i_2 \le \ldots \le i_k$
    $t' = \{j_1, j_2, \ldots, j_l\}$, $j_1 \le j_2 \le \ldots \le j_l$
    t