University ID: 10532
Student ID: LB2010034
Classification Code: TP391
Security Level: Normal

Hunan University Doctoral Thesis

RESEARCH ON THE SEQUENTIAL PATTERN MINING ALGORITHMS USING PREFIX-TREE STRUCTURE
(Chinese title: 基于前缀树结构的序列模式挖掘算法研究, English edition)

Applicant's Name: PHAM THI THIET
College: Information Science and Engineering
Supervisor: Professor Luo Jiawei (骆嘉伟)
Major: Computer Science
Research Field: Data Mining and Knowledge Discovery
Submission Date: 2013-05
Defense Date: 2013-06-07
Defense Committee Chairman: Professor Bo Liao

Research On The Sequential Pattern Mining Algorithms Using Prefix-Tree Structure

By PHAM THI THIET
Masters in Computer Science (Guru Gobind Singh Indraprastha University) 2008

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Science in Computer Science in the Graduate School of Hunan University

Supervisor: Professor Luo Jiawei
June, 2013

HUNAN UNIVERSITY DECLARATION

I, Pham Thi Thiet, hereby declare that the work presented in this PhD thesis, titled "Research on The Sequential Pattern Mining Algorithms Using Prefix-Tree Structure", is my original work and has not been presented elsewhere for any academic qualification. Where references have been used from books, published papers, reports, and web sites, they are fully acknowledged in accordance with the standard referencing practices of the discipline.

Student's signature: ……………………… Date: ………………………

Copyright Statement

Permission is herewith granted to Hunan University to circulate and reproduce this thesis for non-commercial purposes, at its discretion, upon the request of individuals or institutions. The author reserves other publication rights, and neither the thesis nor extensive extracts from it may be printed or otherwise reproduced without the author's written permission.

This thesis belongs to:
Secure □, and this power of attorney is valid after ……
Not secure □
(Please mark the corresponding check box above with "√")

Author's Signature: ……………………… Date: ……………………………
Supervisor's Signature: ………………… Date: ……………………………

ABSTRACT

With the rapid development of computer and Internet technology, the amount of data gathered from all kinds of applications has grown enormously and now far exceeds what people can comprehend without powerful tools. This has been described as a data-rich but information-poor situation. Finding the valuable information and knowledge hidden in vast amounts of data has therefore become one of the most important tasks in data mining research. The variety and richness of data have given rise to different kinds of data, including transaction data, sequence data, stream data, and time-series data.

Sequence data is an important type of data that occurs frequently in many applications. It is composed of sequences of ordered elements or events, recorded with or without a specific notion of time. Although many general data mining methods exist for other kinds of data, they cannot be applied to sequence data, because sequence data has its own distinctive sequential features. Sequence data also arises in many interesting applications, which leads to many new kinds of knowledge to be discovered, including sequential patterns, approximate biological sequence patterns, partially ordered patterns, periodic patterns, and motifs. These kinds of patterns assist the development of new classification, clustering, and outlier analysis methods, which in turn call for the development of new kinds of applications.

Sequential pattern mining is one of the important tasks of data mining research and is widely used in sequence data mining applications. The process of sequential pattern mining is to
extract frequent subsequences from a sequence database. This task has attracted considerable attention from data mining researchers. Many studies have examined sequential pattern mining; however, the main challenges remain: large search spaces and ineffective handling of dense datasets. To address these challenges, the problems of mining closed sequential patterns, sequential generator patterns, and sequential rules have been proposed. In this thesis, we propose novel algorithms for these problems with the following two main objectives:

⚫ Exploit secondary information, namely sequential patterns, closed sequential patterns, and sequential generator patterns, based on the corresponding prefix-tree structures.
⚫ Generate the various kinds of sequential rules from the secondary information stored in the prefix-tree structure.

This thesis makes four main contributions, which can be briefly described as follows:

⚫ Firstly, this thesis considers several interestingness measures, such as Lift, Conviction, Piatetsky-Shapiro, Cosine, and Jaccard, which have been proposed for mining association rules and classification rules but, unlike the traditional support and confidence measures, have not been applied to mining sequential rules in sequence databases. We then propose an efficient algorithm that generates all relevant sequential rules with these interestingness measures from a prefix-tree that stores the whole set of sequential patterns, where each child node stores a sequential pattern and its corresponding support value. By traversing the prefix-tree, the algorithm can easily identify the components of a rule and calculate the rule's measure values. The experimental results show that sequential rule mining with interestingness measures using the proposed prefix-tree-based algorithm was always much faster than that using
the existing modified Full algorithm. The advantage was especially clear when mining large sequence databases with low minimum support values: the number of sequential patterns generated was large, and the proposed algorithm only needs to traverse the prefix-tree to determine immediately which sequences form the left- and right-hand sides of a rule, together with their support values, in order to compute the rule's interestingness measure values from the sequential pattern set. In addition, the experimental results show that mining sequential rules with the confidence measure took the least time, because it did not need to revisit the prefix-tree to determine the support of Y (the consequent of the rule), whereas the other interestingness measures need to revisit the prefix-tree to determine the support of the consequent, or of both the antecedent and the consequent.

⚫ Secondly, the characteristics of sequential generator patterns are combined with sequence extension on the prefix-tree to propose two efficient algorithms, MSGPs and MSGP_PreTree, which find all sequential generator patterns while the sequential patterns are being generated. Using the prefix-tree, new sequences (child nodes) are easily created by appending an item to the last position of the parent node's sequence, as either an itemset extension or a sequence extension. The proposed algorithms use the prime block encoding approach to represent candidate sequences and use join operations over the prime blocks to determine the frequency of each candidate. The MSGPs algorithm uses a hash table to store sequential generator patterns, with the support of the pattern as the hash key for fast checking. The MSGP_PreTree algorithm improves on MSGPs to generate all sequential generator patterns. The idea of the improved algorithm is to modify the prefix-tree such that each
node on the prefix-tree carries additional fields that indicate whether the sequence stored in the node is a sequential generator pattern. Because all information about the sequences is stored on the prefix-tree, the MSGP_PreTree algorithm does not need a hash table to store sequential generator patterns, which significantly reduces memory use. Supersequence frequency-based pruning and non-generator-based pruning on the prefix-tree are applied in the MSGP_PreTree algorithm to reduce the search space. The MSGP_PreTree algorithm extends the prefix-tree and determines sequential patterns in the same way as the MSGPs algorithm. All experimental results on synthetic and real databases show that the number of sequential generator patterns is always smaller than the number of sequential patterns, and that in all cases the proposed algorithms outperform the compared algorithm in running time.

⚫ Thirdly, we propose an efficient algorithm, called CloGen (Closed sequential pattern-sequential Generator pattern), that directly finds both closed sequential patterns and their sequential generator patterns during the sequential pattern generation process. It is based on combining the child-parent relationship on the prefix-tree structure with the definitions of closed sequential patterns and sequential generator patterns. Each node on the prefix-tree in our approach stores a sequential pattern and its corresponding support value. In addition, each node has one field (IsmSGP) that indicates whether the node is a minimal sequential generator pattern and another field (IsCSP) that indicates whether the node is a closed sequential pattern. Based on these fields, the algorithm easily determines whether the sequence at each node is a minimal sequential generator pattern or a closed sequential pattern, so the mining time is reduced significantly. This algorithm also uses join operations over the prime block encoding
approach, based on prime factorization theory, to represent candidate sequences and determine the frequency of each candidate. Experimental results show that mining closed sequential patterns and their minimal sequential generator patterns with the CloGen algorithm is faster by more than one order of magnitude. The CloGen algorithm can generate all sequential patterns, sequential generator patterns, and closed sequential patterns at the same time. Furthermore, the prefix-tree built by our approach provides an efficient basis for mining non-redundant sequential rules, and also for mining all sequential rules.

⚫ Fourthly, an efficient algorithm called MNSR-Pretree for mining non-redundant sequential rules is proposed. The algorithm consists of two phases. In the first phase, it builds a prefix-tree that stores all the sequential patterns from a given sequence database. In the second phase, it mines non-redundant sequential rules from this prefix-tree. During the tree-building process, each node on the prefix-tree is given a field (IsmSGP) that indicates whether the node is a minimal sequential generator pattern and another field (IsCSP) that indicates whether the node is a closed sequential pattern; both are set by the CloGen algorithm from the previous contribution. By traversing the prefix-tree, non-redundant sequential rules can be easily mined from a minimal sequential generator pattern X to a closed sequential pattern Y such that X is a prefix of Y, which greatly reduces the mining time required. Based on the values of IsmSGP and IsCSP, the MNSR-Pretree algorithm only mines rules from a parent node whose IsmSGP value is true to child nodes whose IsCSP value is true: the sequence at the parent node is taken as the antecedent of the rules to be generated, and the consequents of the rules
are generated by removing the prefix, that is, the sequence at the parent node, from the closed sequential patterns. The experimental results on synthetic and real databases show that the number of non-redundant sequential rules is much smaller than the number of sequential rules, and that mining non-redundant sequential rules takes much less time than mining all sequential rules. The results also show that the proposed algorithm mines non-redundant sequential rules in less time than an existing algorithm.

In summary, this thesis proposes efficient algorithms and fulfils the objective stated at the outset, namely to improve the efficiency of algorithms that exploit secondary information (sequential patterns, closed sequential patterns, and sequential generator patterns) based on the prefix-tree structure. The main contribution is the use of the prefix-tree to generate different kinds of sequential rules, namely sequential rules with interestingness measures and non-redundant sequential rules, from this secondary information. This goal has been achieved by using the child-parent relationship on the prefix-tree structure and the extension of sequences to propose novel algorithms for mining tasks related to sequential patterns in sequence databases, including algorithms for mining sequential rules with interestingness measures, mining sequential generator patterns, mining closed sequential patterns together with their sequential generator patterns, and mining non-redundant sequential rules. The proposed methods are evaluated on both synthetic and real datasets. Experimental results illustrate the effectiveness and efficiency of our algorithms.

Keywords: Sequential pattern, closed sequential pattern, sequential generator pattern, interestingness measure, sequential rule, non-redundant sequential rule,
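prefix-tree

摘要 (ABSTRACT IN CHINESE)

With the rapid development of computer and Internet technology, the volume of data collected from all kinds of applications keeps growing; without effective tools to mine the needed information, this mass of data will exceed human comprehension. Over time this becomes a situation that is rich in data but poor in useful information. Extracting valuable information and needed knowledge from massive data has therefore become one of the important tasks in data mining research. The diversity and richness of data have produced different kinds of data, including transaction data, sequence data, stream data, and time-series data.

Sequence data is an important type of data, widely used in science and engineering, business, customer behavior analysis, stock trend prediction, DNA sequence analysis, web usage analysis, and other practical applications. It consists of sequences of ordered elements or events, recorded with or without a specific notion of time. Although many general data mining methods exist for other kinds of data, they cannot be applied to sequence data, because sequence data has its own distinctive sequential characteristics. Sequence data also appears in many interesting applications, which leads to many new kinds of knowledge to discover, including sequential patterns, approximate biological sequence patterns, partially ordered patterns, periodic patterns, and motifs. These patterns help in developing new classification, clustering, and outlier analysis methods, which calls for the development of new kinds of applications.

Sequential pattern mining is one of the important tasks in data mining research and is commonly used in sequence data mining applications. It plays a fundamental role in mining associations, correlations, and many other interesting relationships among data. It also supports data classification, clustering, and other data mining tasks. The process of sequential pattern mining is to extract frequent subsequences from a sequence database. This task has attracted growing attention from data mining researchers, and many studies on mining sequential patterns have been carried out. However, the main challenges remain: large search spaces and ineffective handling of dense datasets. For example, mining long frequent sequences, which contain a combinatorial number of frequent subsequences, generates a large number of frequent subsequences, and mining sequential patterns with a very low support threshold is very expensive in both time and space. As a result, the performance of sequential pattern mining algorithms often degrades sharply. To address these challenges, the problems of mining sequential rules, closed sequential patterns, and sequential generator patterns have been proposed.

A prefix-tree is an ordered tree data structure used to store sequences for fast lookup, in which all child nodes of a parent node share a common prefix, namely the sequence associated with that parent, and the root node is associated with the empty sequence. In its simplest form, a list of keywords or [...]

[...] prefix-tree. To reduce the search space and memory use, we applied supersequence frequency-based pruning and non-generator-based pruning on the prefix-tree, and also improved the prefix-tree structure by adding a 1-bit field to each node to indicate whether the sequential pattern at that node is a sequential generator pattern. All experimental results on synthetic and real databases showed that the proposed algorithms outperform the others in running time. The details can be found in the corresponding chapter.

⚫ Given the important role of sequential generator patterns and closed sequential patterns in data mining, and to reduce the time needed to mine these types of patterns, in Chapter 5 we proposed an efficient algorithm called CloGen that mines them in the same process. The algorithm is built by modifying the prefix-tree structure to store their information; specifically, it is modified by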
adding IsCSP and IsmSGP fields to each node on the prefix-tree to indicate whether the sequential pattern at that node is a closed sequential pattern, a sequential generator pattern, or only a sequential pattern. The resulting prefix-tree provides an efficient basis for mining non-redundant sequential rules, and also for mining all sequential rules. More details are given in that chapter.

⚫ To date, only two methods, both discussed in Chapter 2, have been proposed for generating non-redundant sequential rules. These methods can remove a significant number of redundant sequential rules but require a lot of time to check sequential generator patterns and closed sequential patterns when generating non-redundant sequential rules. In this study we therefore proposed an efficient algorithm for mining non-redundant sequential rules that applies the child-parent relationship of the prefix-tree structure from the previous contribution. By traversing the prefix-tree, non-redundant sequential rules can be easily mined from a sequence X at a node that is a sequential generator pattern to a sequence Y at a node that is a closed sequential pattern, such that X is a prefix of Y, which greatly reduces the mining time required. The experimental results on synthetic and real databases have shown that the number of non-redundant sequential rules is much smaller than the number of sequential rules, and that mining non-redundant sequential rules takes much less time than mining all sequential rules. The results have also shown that the proposed algorithm mines non-redundant sequential rules in less time than existing algorithms. The details can be found in the corresponding chapter.

The prime block encoding approach and the join operations over prime blocks from [30] have been applied to generate candidate sequences and determine the frequency of each candidate in all of the proposed
algorithms in this thesis. Our algorithms are evaluated on both real and synthetic datasets and compared with existing methods. Experimental results illustrate the effectiveness and efficiency of our algorithms.

Future works

In this thesis, we have pointed out limitations of sequential pattern mining and addressed them by modifying the prefix-tree structure to propose efficient algorithms for mining secondary information and for generating sequential rules and non-redundant sequential rules from that information, which significantly improved efficiency. However, this work has so far been examined only in theory and has not been applied to practical applications to demonstrate its applicability and effectiveness. We will therefore survey practical applications such as weblog usage, online navigation, online shopping malls, and biological sequences, and apply our methods to them to make the results more convincing.

Maximal sequential pattern mining also plays an important role in sequential pattern mining. Like closed sequential patterns and sequential generator patterns, maximal sequential patterns were proposed to reduce the number of sequential patterns. The maximal sequential pattern mining problem [91] is to determine long frequent sequences instead of listing all their subsequences. In the future, we will study and propose efficient algorithms for mining maximal sequential patterns.

Inter-sequence mining [92~95] is an approach to sequence mining in which sequential patterns are mined both within a transaction and across several transactions (inter-transaction patterns). In the future, we will study prefix-tree-based methods for mining inter-sequence patterns, and rules will then be generated from the inter-sequence set. Besides,
several methods for mining frequent weighted itemsets, including frequent weighted utility itemsets [96~97] and high utility itemsets [98~104], have recently been reported. As future work, we can study how to apply these methods to mine sequential patterns with weight constraints and extend them to mine closed weighted sequential patterns based on the prefix-tree structure. Mining rules from these pattern sets is another possible direction for future research.

The lattice is a concept introduced very early in mathematics. Lattice-based approaches have been proposed in recent years for mining association rules and classification association rules [50,105~106] in order to reduce the mining time. Building a lattice means establishing the direct child-parent relationships between frequent itemsets (or frequent closed itemsets) so that rule generation becomes faster and more intuitive. We will study how to apply this lattice-based approach to mining sequential patterns and sequential rules in the future.

The incremental data mining problem has also been proposed in recent years for maintaining sequential patterns when records in a database are inserted, deleted, or modified [34~36,107~108]. We will also study incremental mining and the generation of sequential rules from incrementally maintained patterns in the future.

REFERENCES

[1] Han, J., Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006
[2] Huang, T.C.-K. Mining the change of customer behavior in fuzzy time-interval sequential patterns. Applied Soft Computing, 2012, 12: 1068–1086
[3] Miller, H., Han, J. Geographic Data Mining and Knowledge Discovery. Taylor and Francis, 2001
[4] Shekhar, S., Chawla, S. Spatial Databases: A Tour. Prentice Hall, 2003
[5] Dong, G.,
Pei, J. Sequence Data Mining. Springer Science + Business Media, LLC, 2007
[6] Nishia, M.A., Ahmed, C.F., Samiullah, M., Jeon, B.-S. Effective periodic pattern mining in time series databases. Expert Systems with Applications, 2013, 40: 3015–3027
[7] Shasha, D., Zhu, Y. High Performance Discovery In Time Series: Techniques and Case Studies. Springer, 2004
[8] Berry, M.J., Linoff, G.S. Data mining techniques for marketing, sales and customer support. John Wiley & Sons, 1997
[9] Chen, Y.-L., Kuo, M.-H., Wu, S.-Y., Tang, K. Discovering recency, frequency, and monetary (RFM) sequential patterns from customers' purchasing data. Electronic Commerce Research and Applications, 2009, 8: 241–251
[10] Olaniyi, S.A.S., Adewole, K.S., Jimoh, R.G. Stock Trend Prediction Using Regression Analysis – A Data Mining Approach. ARPN Journal of Systems and Software, 2011, 1(4): 154-157
[11] Wu, Y.-P., Wu, K.-P., Lee, H.-M. Stock Trend Prediction by Sequential Chart Pattern via K-Means and AprioriAll Algorithm. Technologies and Applications of Artificial Intelligence (TAAI2012), 2012: 176–181
[12] Chang, Y.I., Wu, C.C., Chen, J.R., Jeng, Y.H. Mining sequential motifs from protein databases based on a bit pattern approach. International Journal of Innovative Computing, Information and Control, 2012, 8(1B): 647–657
[13] Guerbas, A., Addam, O., Zaarour, O., Nagi, M., Elhajj, A., Ridley, M., Alhajj, R. Effective web log mining and online navigational pattern prediction. Knowledge-Based Systems, 2013 (Available online xxxx)
[14] Wang, Y.-T., Lee, A.J.T. Mining Web navigation patterns with a path traversal graph. Expert Systems with Applications, 2011, 38(6): 7112–7122
[15] Jeong, H., Shin, D., Choi, J. FEROM: Feature Extraction and Refinement for Opinion Mining. ETRI Journal, 2011, 33(5): 7112–7122
[16] Agrawal, R., Imielinski, T., Swami, A.N. Mining association rules between sets of items in large databases. Proc. of the 1993 ACM SIGMOD International Conference on Management of
Data, 1993: 207-216
[17] Agrawal, R., Srikant, R. Fast algorithms for mining association rules. Proc. of International Conference on Very Large Data Bases, 1994: 487–499
[18] Harms, S.K., Deogun, J., Tadesse, T. Discovering sequential association rules with constraints and time lags in multiple sequences. Proc. of 13th Int. Symp. on Methodologies for Intelligent Systems, Lyon, France, 2002: 373-376
[19] Harms, S.K., Deogun, J.S. Sequential Association Rule Mining with Time Lags. Journal of Intelligent Information Systems, 2004, 22(1): 7-22
[20] Brin, S., Motwani, R., Silverstein, C. Beyond market baskets: Generalizing association rules to correlations. ACM SIGMOD/PODS '97 Joint Conference, 1997: 265–276
[21] Beil, F., Ester, M., Xu, X. Frequent term-based text clustering. Proc. of the eighth ACM SIGKDD International Conference on Knowledge discovery and data mining, 2002: 436–442
[22] Guha, S., Rastogi, R., Shim, K. CURE: an efficient clustering algorithm for large databases. Proceedings of the 1998 ACM SIGMOD International Conference on Management of data, 1998, 27(2): 73-84
[23] Maitra, R., Melnykov, V. Simulation data to study performance of finite mixture modeling and clustering algorithms. The Journal of Computational and Graphical Statistics, 2010, 19(2): 354-376
[24] Agrawal, R., Srikant, R. Mining sequential patterns. Proc. of 11th International Conference on Data Engineering, Taipei, Taiwan, 1995: 3–14
[25] Srikant, R., Agrawal, R. Mining sequential patterns: Generalizations and performance improvements. Proc. of 5th Int'l Conf. Extending Database Technology, 1996: 3–17
[26] Masseglia, F., Cathala, F., Poncelet, P. The PSP Approach for Mining Sequential Patterns. Proc. of the Second European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD '98), 1998: 176-184
[27] Zaki, M.J. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning Journal, 2000, 42(1/2): 31–60
[28] Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., et al. Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE Trans. Knowledge and Data Engineering, 2004, 16(10): 1424–1440
[29] Ayres, J., Gehrke, J.E., Yiu, T., Flannick, J. Sequential pattern mining using a bitmap representation. Proc. of ACM SIGKDD International Conference on Knowledge discovery and data mining, Edmonton, Alberta, Canada, 2002: 429–435
[30] Gouda, K., Hassaan, M., Zaki, M.J. PRISM: A primal-encoding approach for frequent sequence mining. Journal of Computer and System Sciences, 2010, 76(1): 88-102
[31] Chen, E., Cao, H., Li, Q., Qian, T. Efficient strategies for tough aggregate constraint-based sequential pattern mining. Information Sciences, 2008, 178: 1498–1518
[32] Garofalakis, M.N., Rastogi, R., Shim, K. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. Proc. of the 25th VLDB Conference, Edinburgh, Scotland, 1999: 223-234
[33] Geng, L., Hamilton, H.J. Interestingness Measures for Data Mining: A Survey. ACM Computing Surveys, 2006, 38(3), Article
[34] Chang, L., Wang, T., Yang, D., Luan, H., Tang, S. Efficient algorithms for incremental maintenance of closed sequential patterns in large databases. Data and Knowledge Engineering, 2009, 68(1): 68-106
[35] Hong, T.P., Wang, C.Y., Tseng, S.S. An incremental mining algorithm for maintaining sequential patterns using pre-large sequences. Expert Systems with Applications, 2011, 38(6): 7051–7058
[36] Lin, C.-W., Hong, T.-P., Lu, W.-H. The Pre-FUFP algorithm for incremental mining. Expert Systems with Applications, 2009, 36: 9498–9505
[37] Huynh, H.X. Interestingness Measures for Association Rules in A KDD Process: Postprocessing of Rules With Arqat Tool. PhD Thesis, Université de Nantes, 2006
[38] Yun, U., Ryu, K.H. Approximate weighted frequent pattern mining with/without noisy environments. Knowledge-Based Systems, 2011, 24: 73–82
[39] Yang, K.-J., Hong, T.-P., Chen, Y.-M., Lan, G.-C. Projection-based
partial periodic pattern mining for event sequences. Expert Systems with Applications, 2013, 40: 4232–4240
[40] Anwar, F., Petrounias, I., Morris, T., Kodogiannis, V. Mining anomalous events against frequent sequences in surveillance videos from commercial environments. Expert Systems with Applications, 2012, 39: 4511–4531
[41] Huang, G., Yang, F., Hu, C., Ren, J. Fast discovery of frequent closed sequential patterns based on positional data. Proceedings of the 9th International Conference on Machine Learning and Cybernetics, Qingdao, 2010: 444–449
[42] Hamilton, H.J., Karimi, K. The TIMERS II Algorithm for the discovery of causality. Proc. of 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Hanoi, Vietnam, 2005: 744-750
[43] Fournier-Viger, P., Nkambou, R., Poirier, P. Generic episodic learning model implemented in a cognitive agent by means of temporal pattern mining. Proc. of IEA-AIE 2010, Cordoba, Spain, 2010: 438-449
[44] Liu, C., Fei, L., Yan, X., Han, J., Midkiff, S.P. Statistical debugging: A hypothesis testing-based approach. IEEE Trans. Software Engineering, 2006, 32: 831-848
[45] Lo, D., Khoo, S.-C. SMArTIC: Toward building an accurate, robust and scalable specification miner. Proc. of SIGSOFT Symposium on the Foundations of Software Engineering, 2006: 265–275
[46] Lo, D., Khoo, S.C., Liu, C. Efficient Mining of Recurrent Rules from a Sequence Database. Proceedings of International Conference on Database Systems for Advanced Applications, 2008: 67–83
[47] Lo, D., Khoo, S.C., Wong, L. Non-redundant sequential rules - theory and algorithm. Information Systems, 2009, 34(4/5): 438-453
[48] Yang, J., Evans, D., Bhardwaj, D., Bhat, T., Das, M. Mining temporal API rules from imperfect traces. Proc. of International Conference on Software Engineering, 2006: 282–291
[49] Lenca, P., Meyer, P., Vaillant, B., Lallich, S. On selecting interestingness measures for association rules: User oriented description and multiple criteria decision aid.
European Journal of Operational Research, 2008, 184(2): 610–626
[50] Vo, B., Le, B. Interestingness measures for association rules: Combination between lattice and hash tables. Expert Systems with Applications, 2011, 38(9): 11630-11640
[51] Shaharanee, I.N.M., Hadzic, F., Dillon, T.S. Interestingness measures for association rules based on statistical validity. Knowledge-Based Systems, 2011, 24(3): 386–392
[52] Nguyen, L.T.T., Vo, B., Hong, T.P., Thanh, H.C. Interestingness Measures for Classification Based on Association Rules. Computational Collective Intelligence Technologies and Applications (ICCCI 2012), 2012, LNAI 7654: 383-392
[53] Yan, X., Han, J., Afshar, R. CloSpan: Mining closed sequential patterns in large datasets. Proc. of the 3rd SIAM International Conference on Data Mining, San Francisco, CA, USA, SIAM Press, 2003: 166-177
[54] Wang, J., Han, J. BIDE: Efficient mining of frequent closed sequences. Proc. of the 20th International Conference on Data Engineering (ICDE'04), IEEE Computer Society Press, 2004: 79-91
[55] Tzvetkov, P., Yan, X., Han, J. TSP: Mining Top-K Closed Sequential Patterns. Knowl. Inf. Syst., 2005: 438-457
[56] Thilagu, M., Nadarajan, R., Ahmed, M.S.I., Bama, S.S. PBFMCSP: Prefix Based Fast Mining of Closed Sequential Patterns. The International Conference on Advances in Computing, Control, and Telecommunication Technologies ATC'09, Trivandrum, Kerala, India, 2009: 484–488
[57] Lo, D., Khoo, S.-C., Li, J. Mining and ranking generators of sequential patterns. SIAM Conference on Data Mining (SDM 2008), Atlanta, Georgia, USA, 2008: 553-564
[58] Gao, C.C., Wang, J.Y., He, Y.K., Zhou, L.Z. Efficient mining of frequent sequence generators. Proceedings of the 17th international conference on World Wide Web, Beijing, China, 2008: 1051-1052
[59] Yia, S.W., Zhao, T.H., Zhang, Y.Y., Ma, S.L., Che, Z.B. An effective algorithm for mining sequential generators. Proc. of 2011 International
Conference on Advanced in Control Engineering and Information Science (CEIS 2011), 2011, 15: 3653 – 3657 [60] Zang, H., Xu, Y., Li, Y Non-Redundant Sequential Association Rule Mining and Application in Recommender Systems Proc of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, DC, USA, 2010, 3: 292-295 [61] Zang, H Non-Redundant Sequential Association Rule Mining based on Closed Sequential Patterns Thesis, Queensland University of Technology, 2010 [62] Tsai, C.Y., Lo, P.H A sequential pattern based route suggestion system International Journal of Innovative Computing, Information and Control, 2010, 6(10): 4389 – 4408 [63] Pei, J., Han, J., Mao, R CLOSET An efficient algorithm for mining frequent closed itemsets In DMKD’01 workshop, Dallas, TX, 2001 [64] Zaki, M.J., and Hsiao, C CHARM An efficient algorithm for closed itemset mining, In SDM ‘02, Arlington, VA, 2002: 457—473 [65] Li, J., Li, H., Wong, L., Pei, J., Dong, G Minimum description length (MDL) principle Generators are preferable to closed patterns Proc of the 21th National Conference on Artificial Intelligence (AAAI’06), Boston, Massachusetts, USA, 2006: 409 - 414 [66] Baralis, E., Chiusano, S., Dutto, R Applying Sequential Rules to Protein Localization Prediction Computer and Mathematics with Applications, 2008, 55(5): 867–878 106 DOCTORAL THESIS [67] Spiliopoulou, M Managing interesting rules in sequence mining Proc of European Conference on Principles of Data Mining and Knowledge Discovery, 1999: 554–560 [68] Van,T.-T., Van, B., Le, B Mining sequential rules based on prefix-tree Studies in Computational Intelligence (Springer), 2011, 351:147-156 [69] Deogun, J.S., Jiang, L Prediction mining – An approach to mining association rules for prediction Proc of RSFDGrC 2005 Conference Regina, Canada, 2005: 98-108 [70] Fournier-Viger, P., Nkambou, R., Tseng, V.S RuleGrowth Mining sequential rules common to several sequences by Pattern-Growth SAC’11 Proc of 
the 2011 ACM Symposium on Applied Computing, TaiChung, Taiwan, 2011: 956-961 [71] Fournier-Viger, P., Faghihi, U., Nkambou, R., Nguifo, E.M CMRules An efficient algorithm for mining sequential rules common to several sequences Knowledge-based Systems, 2012a, 25(1): 63-76 [72] Fournier-Viger, P., Wu, C.-W., Tseng, V.S., Nkambou, R Mining Sequential Rules Common to Several Sequences with the Window Size Constraint Proc of the 25th Canadian International Conference on Artificial Intelligence (AI 2012), 2012b, LNAI 7310: 299-304 [73] Hsieh, Y L., Yang, D.-L., Wu, J Using data mining to study upstream and downstream causal relationship in stock market Proc of 2006 Joint Conference on Information Sciences, Kaohsiung, Taiwan, 2006 [74] Mannila, H., Toivonen, H., Verkamo, A.I Discovery of frequent episodes in event sequences DMKD 1, 1997: 259–289 [75] Han, J., Pei, J., Yin, Y Mining frequent patterns without candidate generation In proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000: 1–12 [76] Dave, B.A., Andpriestley, H A Introduction to Lattices and Order Cambridge University Press, 1990 [77] Han, J., Pei, J., Mortazavi-Asl, B., Chen, Q., Dayal, U., and Hsu, M.C., Freespan Frequent pattern-projected sequential pattern mining Proc Of 2000 International Conference on Knowledge Discovery and Data Mining (KDD’00), 2000: 355–359 [78] Huynh, H X., Guillet, F., Blanchard, J., Kuntz, P., Gras, R., Briand, H A graph based clustering approach to evaluate interestingness measures a tool and a comparative study Quality measures in data mining Springer-Verlag, 2007, 43: 25–50 [79] Tan, P N., Kumar, V., Srivastava, J Selecting the right interestingness measure for association patterns Proc of the ACM SIGKDD international conference on knowledge discovery in databases (KDD’02), 2002: 32–41 107 RESEARCH ON THE SEQUENTIAL PATTERN MINING ALGORITHMS USING PREFIX-TREE STRUCTURE [80] Silberschatz, A & Tuzhilin, A What makes patterns interesting in 
knowledge discovery systems IEEE Transactions on Knowledge and Data Engineering, 1996, 5(6): 970-974 [81] Liu, B., Hsu, W., Mun, L.-F & Lee, H.-Y Finding interesting patterns using user expectations IEEE Transactions on Knowledge and Data Engineering, 1999, 11(6): 817-832 [82] Padmanabhan, B & Tuzhilin, A A belief-driven method for discovering un-expected patterns KDD'98, Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 2012: 94-100 [83] Frawley, W.J., Piatetsky-Shapiro, G & Matheus, C.J Knowledge discovery in databases an overview Knowledge Discovery in Databases, 1991:1-27 [84] Piatetsky-Shapiro, G & Matheus, C J The interestingness of deviations AAAI'94, Knowledge Discovery in Databases Workshop, 1994: 25-36 [85] Aze′, J., Kodratoff, Y A study of the effect of noisy data in rule extraction systems Proc of the Sixteenth European Meeting on Cybernetics and Systems Research, 2002: 781–786 [86] Hilderman, R., Hamilton, H Knowledge discovery and measures of interest Kluwer Academic Publishers, 2001 [87] Huebner, R A Diversity-based interestingness measures for association rule mining In Proceedings of ASBBS, Las Vegas, 2009, 16(1) [88] Piatetsky-Shapiro, G Discovery, analysis, and presentation of strong rules Knowledge Discovery in Databases, 1991: 229–248 [89] Tan, P.-N., Kuma, V., Srivastava, J Selecting the right objective measure for association analysis Information Systems, 2004, 29(4): 293–313 [90] http//fimi.ua.ac.be/data/ [91] Luo, C., Chung, S.M A scalable algorithm for mining maximal frequent sequences using a sample Knowledge and Information Systems, 2008, 15(2): 149-179 [92] Wang, C.-S., Chu, K.-C Using a projection-based approach to mine frequent inter-transaction patterns Expert Systems with Applications, 2011, 38: 11024–11031 [93] Vo, B., Tran, M.-T., Hong, T.-P, Nguyen, H., Le, B A Dynamic Bit-vector Approach for Efficiently Mining Inter-sequence Patterns Proc of the 3rd International Conference on Innovations 
in Bio-Inspired Computing and Application (IBICA), 2012: 51 – 56 [94] Wang, C.-S., Lee, A.J.T Mining inter-sequence patterns Expert Systems with Applications, 2009, 36: 8649–8658 108 DOCTORAL THESIS [95] Wang, C.-S., Liu, Y.-H., Chu, K.-C Closed inter-sequence pattern mining The Journal of Systems and Software, 2013, 86: 1603 – 1612 [96] Sulaiman, M., Maybin K., Muyeba, K., Frans C A weighted utility framework for mining association rules Proceedings of Second UKSIM European Symposium on Computer Modeling and Simulation Second UKSIM European Symposium on Computer Modeling and Simulation, 2008: 87 – 92 [97] Maybin, K Muyeba, M Sulaiman, K., Frans C Fuzzy weighted association rule mining with weighted support and confidence framework Proceedings of 1st Int Workshop on Algorithms for Large-Scale Information Processing in Knowledge Discovery (ALSIP 2008), held in conjunction with PAKDD 2008 (Japan), 2008: 52 – 64 [98] Alva, E., Raj, P., Gopalan, N., Achuthan, R A bottom-up projection based algorithm for mining high utility itemsets Proc of the 2nd international workshop on Integrating artificial intelligence and data mining, Gold Coast, Australia, 2007: – 11 [99] Alva, E., Raj, P., Gopalan, N., Achuthan, R CTU-Mine An efficient high utility itemset mining algorithm using the pattern growth approach Proceedings of the IEEE 7th International Conferences on Computer and Information Technology, Aizu Wakamatsu, Japan, 2007: 71 – 76 [100] Hong, Y., Hamilton, H J Mining itemsets utilities from transaction databases Data and Knowledge Engineering, 2005, 59(3), 603–626 [101] Le, B., Nguyen, H., Vo, B Efficient algorithms for mining frequent weighted itemsets from weighted items databases IEEE-RIVF 2010, Ha Noi, Viet Nam, 2010: 59-64 [102] Le, B., Nguyen, H., Vo, B An efficient strategy for mining high utility itemsets International Journal of Intelligent Information and Database Systems, 2011, 5(2): 164–176 [103] Liu, J., Liao, W., Choudhary, A A fast high utility itemsets 
mining algorithm Proceedings of UBDM '05, August 21, Chicago, USA, 2005: 90–99 [104] Yu, G., Shao, S., Zeng, X Mining long high utility itemsets in transaction databases WSEAS Transactions on Information Science & Applications, 2008, 2(5): 326 – 331 [105] Nguyen L.T.T, Vo B., Hong T.P., Thanh H.C Classification based on association rules a lattice-based Expert Systems with Applications, 2012, 39(13): 11357–11366 [106] Vo, B., Hong, T.P, Le, B A lattice-based approach for mining most generalization association rules Knowledge-Based Systems, 2013, 45: 20-30 109 RESEARCH ON THE SEQUENTIAL PATTERN MINING ALGORITHMS USING PREFIX-TREE STRUCTURE [107] Chang, L., Wang, T., Yang, D., Luan, H., Tang, S Efficient algorithms for incremental maintenance of closed sequential patterns in large databases Data & Knowledge Engineering, 2009, 68: 68–106 [108] Hong, T.-P., Wang, C.-Y., Lin, C.-W Providing Timely updated Sequential Patterns in Decision Making International Journal of Information Technology and Decision Making, 2010, 9(6): 873-888 [109] Ren, T.-D., Zhou, X.-L An Efficient Algorithm for Incremental Mining of Sequential Pattern Proc of the 4th international conference on Advances in Machine Learning and Cybernetics (ICMLC 2005), 2006, LNAI 3930: 179-188 110 DOCTORAL THESIS APPENDIX A: LIST OF RESEARCH PUBLICATIONS Thi-Thiet Pham, Jiawei Luo, Tzung-Pei Hong, Bay Vo: MSGPs: A Novel Algorithm for Mining Sequential Generator Patterns Computational Collective Intelligence Technologies and Applications (ICICIC 2012), 2012, LNAI 7654: 393-401 (EI Compendex) Thi-Thiet Pham, Jiawei Luo, Tzung-Pei Hong: An Efficient Algorithm for Mining Sequential Rules with Interestingness Measures International Journal of Innovative Computing, Information and Control (IJICIC), 2013, 9(11) (November) (Accepted-April, 2013) (SCI Expanded, EI Compendex) Thi-Thiet Pham, Jiawei Luo, Bay Vo: An Effective Algorithm for Mining Closed Sequential Patterns and Their Minimal Generators Based on Prefix-Trees 
International Journal of Intelligent Information and Database Systems (IJIIDS), 2013, 7(4), accepted in January 2013 (Scopus, EI Compendex)
Thi-Thiet Pham, Jiawei Luo, Tzung-Pei Hong: An Efficient Algorithm for Mining Sequential Generator Pattern using Prefix-Trees and Hash Tables. International Journal of Intelligent Systems Technologies and Applications (IJISTA), accepted in April 2013 (Scopus, EI Compendex)
Thi-Thiet Pham, Jiawei Luo, Tzung-Pei Hong, Bay Vo: Efficiently Mining Sequential Generator Patterns Using Prefix-Trees. Fundamenta Informaticae, submitted in July 2012 (Under Review - SCIE)
Thi-Thiet Pham, Jiawei Luo, Tzung-Pei Hong, Bay Vo: An Efficient Method for Mining Non-Redundant Sequential Rules Using Prefix-Trees. Engineering Applications of Artificial Intelligence, submitted in February 2013 (Revising - SCI)
APPENDIX B: PROJECTS
This dissertation is supported by the National Natural Science Foundation of China (Grant no. 61240046) and the Science and Technology Planning Project of Hunan Province (Grant no. 2011FJ3048).
ACKNOWLEDGMENTS
It is my pleasure to thank all the people who made this thesis possible. First of all, I would like to express my sincere gratitude to my supervisor, Professor Luo Jiawei, for her encouragement, guidance and willing support during this research work. I would also like to thank Hunan University - China and University of Industry HoChiMinh City - Vietnam for the opportunity to study and complete my studies in China. Special thanks go to all members of the International Student Office and the School of Information Sciences and Engineering - Hunan University. Secondly, I am very thankful to Doctor Vo Dinh Bay for his valuable discussions on the ideas in my work. In particular, I owe my gratitude to Professor Hong Tzung-Pei for correcting the English in my articles. I am grateful to my teacher, Associate
Professor Le Hoai Bac, for his advice and encouragement during my graduate study. Thirdly, I would like to thank all the members of the Biological Informatics Computing lab. During my stay in this lab, I met many interesting friends who provided an excellent and stimulating working environment and always had time for interesting discussions. Special thanks go to Wu Yuan, Wei Miao, and Yu Ling Yao for volunteering to translate my work into Chinese and for guiding me in filling out forms. Finally, I am deeply thankful to my parents, my parents-in-law, my sisters, my brothers, and my friends for their continuous support, endless encouragement and love throughout all these years. Last but not least, I thank my dear husband for bringing so much fun to my life and sharing every moment of my success and disappointment.
Pham Thi Thiet
Hunan, …/…/2013