THE UNIVERSITY OF DA NANG
UNIVERSITY OF SCIENCE AND TECHNOLOGY

HUỲNH TRIỆU VỸ

RESEARCH AND DEVELOP A NUMBER OF TECHNIQUES TO HIDE SENSITIVE INFORMATION IN HIGH UTILITY MINING

Major: Computer science
Code/ID: 9480101

SUMMARY OF TECHNICAL DOCTORAL THESIS

Đà Nẵng - 2023

The doctoral dissertation has been finished at UNIVERSITY OF SCIENCE AND TECHNOLOGY
Advisors: Dr Trương Ngọc Châu
Dr Lê Quốc Hải
Reviewer 1: ………………………………………………
Reviewer 2: ………………………………………………
Reviewer 3: ………………………………………………
The dissertation is defended before The Assessment Committee at University of Science and Technology - The University of Danang
Time … h … Date:……/……./202
The dissertation is available at:
- National Library of Vietnam
- Center for Learning Materials and Communication, University of Science and Technology, University of Danang

PREFACE

Rationale
Today, global trade cooperation and transnational business are a common trend worldwide. Businesses that want to grow cannot operate independently; they must associate and cooperate closely with other partners. Sharing data is an important requirement for companies to leverage collaboration for their mutual benefit. However, a shared database may implicitly contain sensitive information that is valuable for supporting the decision making of the database owner. The owner is put at a competitive disadvantage if this sensitive information is discovered by competitors. To protect private information from being exploited when a database is shared outside a company, the database needs to be distorted by a data hiding algorithm. Such an algorithm hides the private information by modifying the values of some data items. There are several approaches to this problem, such as the heuristic approach, the border-based approach and the exact approach. The goal of all of them is to hide all sensitive information with minimal side effects. In general, the proposed algorithms are based on the heuristic approach to modify the database
for local optimization. However, each algorithm provides a locally optimal method for only one or a few of the criteria for minimizing side effects, while the remaining side effects stay high. Therefore, continuing to research and propose algorithms that hide sensitive information in high utility mining more effectively than current algorithms is a necessary research direction. To contribute to solving this problem, the PhD student has chosen the topic "Research and develop a number of techniques to hide sensitive information in high utility mining" as the research content of the doctoral thesis.

The target of the research
The thesis addresses the challenge of hiding private knowledge in high utility mining, so that database owners can protect their sensitive information when sharing data with outside parties. Specifically, the thesis concentrates on two targets:
- First, research and propose algorithms for hiding sensitive high-utility itemsets and sensitive high-utility association rules based on heuristic techniques.
- Second, research and apply lattice theory to reduce side effects in the process of hiding sensitive information in high-utility mining.

Objectives and scope of the research
The objectives of the research include: high utility itemset hiding; lattice theory for the high utility pattern hiding problem.
The scope of the research:
- Hiding sensitive high utility association rules; hiding sensitive high utility itemsets; hiding sensitive high-average utility itemsets; hiding sensitive high utility and frequent itemsets.
- Applying the properties of the lattice of high utility and frequent itemsets to specify exactly the victim item, in order to hide sensitive itemsets or association rules with minimal side effects.

Research Methodology
The thesis uses both theoretical and experimental research methodologies.

Thesis presentation
The thesis is organized into three chapters, together with the introduction and
conclusion sections.
Chapter 1: The overview of high utility mining and hiding sensitive information in high utility mining from transactional databases. This chapter presents an overview of high utility mining, focusing on the literature on mining high utility patterns and on privacy preservation in high utility pattern mining. It lays the foundation for the efficient high utility pattern hiding algorithms proposed in the following chapters. In addition, this chapter introduces the basics of lattice theory and its application to knowledge discovery problems. The thesis uses this mathematical theory to optimize the high utility pattern hiding algorithms.
Chapter 2: Hiding sensitive information in high utility mining based on heuristic techniques. The first part of this chapter deals with the problem of hiding private information in high utility pattern mining. The remainder of the chapter presents improved models and algorithms for hiding private information in high utility pattern mining.
Chapter 3: Hiding sensitive information in high utility mining based on lattice theory. This chapter presents strategies for specifying victim items based on the intersection lattice of high utility and frequent itemsets, applied both to hiding high utility and frequent itemsets and to improving the high utility association rule hiding algorithm proposed in Chapter 2.

Contributions of the research
This research makes the following contributions:
(1) Propose an algorithm for hiding sensitive high utility itemsets in the problem of privacy preservation in high utility pattern mining.
(2) Propose a model and an algorithm for hiding sensitive high-average utility itemsets.
(3) Propose an algorithm for hiding sensitive high utility and frequent itemsets.
(4) Propose a model and an algorithm for hiding high utility association rules.
(5) Propose a constrained intersection lattice of high utility and frequent itemsets for
proposing the sensitive high utility and frequent itemset hiding algorithm.

CHAPTER 1
THE OVERVIEW OF HIGH UTILITY MINING AND HIDING SENSITIVE INFORMATION IN HIGH UTILITY MINING FROM TRANSACTIONAL DATABASES

1.1 Overview of high utility mining from transactional databases

1.1.1 Theoretical foundations of high utility mining
Let $I = \{x_1, x_2, \ldots, x_m\}$ be a finite set of items, where each item $x_i \in I$ has an external utility $p(x_i)$. An itemset is a set $X = \{x_1, x_2, \ldots, x_k\}$ with $x_j \in I$, $1 \le j \le k$. A transaction database $D$ is a set of transactions $\{T_1, T_2, \ldots, T_n\}$, where each transaction $T_c \subseteq I$, $1 \le c \le n$, has a unique identifier called Tid. Each item $x$ in a transaction $T_c$ is associated with a weight called the quantity $q(x, T_c)$, which is the number of units of item $x$ appearing in $T_c$.

1.1.1.1 High utility itemset mining
Given a minimum utility threshold $\varepsilon$, an itemset $X$ is said to be a high utility itemset if the utility of $X$ is not less than $\varepsilon$.
- High utility itemset mining from a transactional database is the process of extracting from the database all itemsets whose utility is not less than a given minimum utility threshold.
- The utility of an item $x$ in a transaction $T_c$, denoted $u(x, T_c)$, is defined as: $u(x, T_c) = q(x, T_c) \times p(x)$
- The utility of an itemset $X$ in a transaction $T_c$, denoted $u(X, T_c)$, is defined as: $u(X, T_c) = \sum_{x \in X} u(x, T_c)$
- The utility of an itemset $X$ in database $D$, denoted $u(X)$, is defined as: $u(X) = \sum_{X \subseteq T_c \wedge T_c \in D} u(X, T_c)$

1.1.1.2 High utility and frequent itemset mining
An itemset $X$ is said to be a high utility and frequent itemset in database $D$ if its utility is not less than the minimum utility threshold and its support is not less than the minimum support threshold.

1.1.1.3 High-average utility itemset mining
An itemset $X$ is said to be a high-average utility itemset in database $D$ if its average utility is greater than or equal to a minimum average utility threshold.
- The average utility of an itemset $X$ in a transaction $T_c$, denoted $au(X, T_c)$, is defined as: $au(X, T_c) = \frac{u(X, T_c)}{|X|}$
- The average utility of an itemset $X$ in database $D$, denoted $au(X)$, is defined as: $au(X) = \sum_{X \subseteq T_c \wedge T_c \in D} au(X, T_c)$

1.1.1.4 High utility association rule mining
Given a high utility itemset $XY$ ($X \cap Y = \emptyset$), an association rule $R: X \to Y$ is a high utility association rule if the utility confidence of $R$ is not less than the minimum utility confidence.
- The local utility value of an item $x$ in an itemset $X$ at transaction $T_c$, denoted $luv(x, X, T_c)$, is defined as: $luv(x, X, T_c) = u(x, T_c)$, where $x \in X \wedge X \subseteq T_c \wedge T_c \in D$
- The local utility value of an itemset $X$ in another itemset $Y$ (such that $X \subseteq Y$) at transaction $T_c$, denoted $luv(X, Y, T_c)$, is defined as: $luv(X, Y, T_c) = \sum_{x \in X \wedge X \subseteq Y \wedge Y \subseteq T_c} luv(x, X, T_c)$
- The local utility value of an itemset $X$ in another itemset $Y$ in database $D$, denoted $luv(X, Y)$, is defined as: $luv(X, Y) = \sum_{X \subseteq Y \wedge Y \subseteq T_c} \sum_{x \in X} luv(x, X, T_c)$
- The utility confidence of an association rule $R: X \to Y$, denoted $uconf(R)$, is defined as: $uconf(R) = \frac{luv(X, X \cup Y)}{u(X)}$

Overview of high utility mining
The first high utility itemset (HUI) mining model was proposed by H. Yao et al. in 2004. Y. Liu et al. proposed a two-phase HUI mining algorithm. In the first phase, the authors applied a downward closure property based on TWU (Transaction-Weighted Utilization) to prune the search space when generating candidate itemsets. In 2012, M. Liu et al. proposed a new data structure named utility-list and the HUI-Miner algorithm, which mines HUIs without generating candidate itemsets. Many improved algorithms based on the utility-list structure have since been proposed. In 2015, Jayakrushna Sahoo et al. proposed a method for discovering association rules from a set of high utility itemsets. These rules reflect the relationships between high utility itemsets in the database and are called high utility association rules. The two characteristics of a high utility association rule are its utility and its utility confidence.

1.2 Hiding sensitive information in high utility mining
The problems of hiding sensitive information in data
mining can be classified into two categories: hiding sensitive raw data and hiding sensitive patterns. This thesis studies only the hiding of sensitive patterns in high utility mining from transactional databases.

Some techniques to hide sensitive information in data mining
Many techniques are currently applied to develop algorithms that hide sensitive patterns in data mining. The most popular fall into three main approaches: the heuristic approach, the border-based approach and the exact approach.

Overview of hiding sensitive information in high utility mining
In 2010, Jieh-Shan Yeh et al. were the first to propose heuristic algorithms for hiding sensitive high utility itemsets (SHUIs), named HHUIF and MSICF. The main idea of both algorithms is to minimize side effects by selecting an appropriate victim item for database modification. Building on this foundation, many more efficient algorithms have been proposed, such as: algorithms for hiding sensitive high utility itemsets based on genetic algorithms (GA-based, PPUMGAT); algorithms for hiding sensitive high utility itemsets based on Max-Min theory (MSU-MAU, MSU-MIU, SMAU, SMIU, SMSE); and algorithms for hiding sensitive high utility and frequent itemsets based on Max-Min theory (MSMU, MCRSU, HUFI).

Units of measure in evaluating the side effects of sensitive information hiding algorithms in high utility mining
The side effects of the itemset hiding process are the differences between the original database and the distorted database, and between the itemsets mined from the original database and those mined from the distorted database. To estimate the side effects of itemset hiding algorithms, the following measures have been used:
- Missing Cost (MC): the proportion of non-sensitive high utility itemsets that are lost through the hiding process.
- Hiding Failure (HF): the number of sensitive high utility itemsets that can still be discovered from the sanitized database.
- Artificial Cost (AC): AC is the ratio of the number of
non-high utility itemsets in the original database that become high utility itemsets in the sanitized database.
- The similarity between the original database and the sanitized database, which reflects how much they differ, including: database structure similarity, database utility similarity and itemset utility similarity.

1.3 The lattice theory applied in data mining
Recently, many approaches have been proposed to improve the efficiency of frequent pattern mining. The lattice theory approach is one of the most effective among them, achieving better performance than the others.

1.4 Describe the transactional databases used to run the experiments of the algorithms in the thesis
The databases selected for the experiments on the algorithms proposed in this thesis differ in the following characteristics: the number of items in the set I, the total number of transactions in the database, the average transaction length, and the maximum transaction length. The databases include:
- A database randomly generated by a program written in Java.
- Foodmart, Retail, Mushroom, Chess and Chainstore: these databases are taken from the open-source SPMF library (http://www.philippe-fournier-viger.com/spmf/). This shared library provides the source code of more than 226 data mining algorithms published in the world's major journals in the field of information technology, together with the experimental databases for these algorithms. Currently, the library has recorded over a million visits.

1.5 Summary of chapter
This chapter has presented an overview of high utility mining and of hiding sensitive information in high utility mining.

CHAPTER 2
HIDING SENSITIVE INFORMATION IN HIGH UTILITY MINING BASED ON HEURISTIC TECHNIQUES

2.1 The process of hiding sensitive information in high utility mining from transactional databases based on heuristic techniques
The process of hiding sensitive high utility itemsets using the heuristic approach is to
modify the database in such a way that no sensitive itemset can be discovered from the modified database:
- Step 1: Identify the set of sensitive high utility patterns that need to be concealed;
- Step 2: Apply the algorithm to hide the set of sensitive high utility patterns specified in Step 1;
- Step 3: Evaluate the sanitized database before publishing it to the outside.

2.2 The review on hiding sensitive information in high utility mining from transactional databases based on heuristic techniques

2.2.1 Hiding sensitive high utility itemset
Jieh-Shan Yeh et al. were the first to propose heuristic algorithms for hiding sensitive high utility itemsets (SHUIs), named HHUIF and MSICF. The main idea of both algorithms is to minimize side effects by selecting an appropriate victim item for database modification. The victim item specified by HHUIF is the item with maximal utility among the sensitive items in a SHUI, while the victim item selected by MSICF is the item with maximal frequency among the sensitive items of all SHUIs. Although these algorithms hide SHUIs well, with low HF, they cause high MC and DIF because they do not specify exactly the minimal utility value that needs to be reduced to hide a SHUI. As a result, data modification may continue even after a SHUI has already been hidden. Moreover, if the utility of a SHUI is equal to the minimum utility threshold, it cannot be hidden by these algorithms.
In 2013, R. Selvaraj et al. proposed an improvement named MHIS (Modified HHUIF algorithm with Item Selector). When a SHUI contains more than one maximal-utility item, MHIS gives priority to modifying the item with the higher frequency.
A novel method that hides SHUIs by adding pseudo transactions to the database was proposed by Chun-Wei Lin et al. (2014). The authors applied a GA methodology to compute exactly the number of additional transactions and the set of items in each transaction. The experiment shows that it
is more efficient than previous methods. However, this method creates new HUIs (ghost HUIs: itemsets that are non-HUIs in the original database but HUIs in the distorted database).
In 2016, Chun-Wei Lin et al. proposed two heuristic algorithms: MSU-MAU (Maximum Sensitive Utility-Maximum item Utility) and MSU-MIU (Maximum Sensitive Utility-Minimum item Utility). MSU-MAU selects as the victim transaction the transaction in which the SHUI achieves maximal utility, and as the victim item the item with maximal utility among the sensitive items. MSU-MIU selects the victim transaction in the same way as MSU-MAU, but selects as the victim item the item with minimal utility among the sensitive items. Experimental results indicate that these algorithms achieve better performance than the HHUIF and MSICF algorithms. However, MSU-MAU and MSU-MIU do not resolve the drawbacks of HHUIF and MSICF.

2.2.2 Hiding sensitive high utility and frequent itemset
In 2012, Rajalaxmi et al. proposed two novel algorithms named MSMU (Minimum Support and Maximum Utility) and MCRSU (Maximum Conflict Ratio for Support and Utility). The methodology [14] is to modify the values of data items so that both the support and the utility of the sensitive itemset fall below the minimum support threshold and the minimum utility threshold, respectively. Both algorithms first reduce the support of the sensitive itemset. If the sensitive itemset is still not hidden, they then reduce its utility below the minimum utility threshold. Although this strategy succeeds in hiding all of the sensitive itemsets, MSMU and MCRSU still cause many non-sensitive itemsets to be hidden.
This weakness was then overcome by the HUFI algorithm proposed by X. Liu. To hide a sensitive itemset, the HUFI algorithm modifies the value of a sensitive item until either the support of the sensitive itemset is less than the minimum support threshold or the utility of the sensitive itemset is less than the minimum
utility threshold. To minimize the side effects, X. Liu proposed a maximal border to decide whether the algorithm should reduce support or utility to achieve better performance. This method obtains better results than the previous works. However, the HUFI algorithm uses the same method to select the victim item and victim transaction for both support reduction and utility reduction, which causes more side effects.

2.2.3 Hiding sensitive high average utility itemsets
In recent years, mining high average utility itemsets from transactional databases has attracted the attention of many researchers, and many research results have been published. In parallel with the development of algorithms for mining high average utility itemsets, it is necessary to study and propose algorithms for hiding sensitive information in high average utility mining, so that sensitive information cannot be mined by such algorithms from a database that is shared with a partner or published externally. However, to the knowledge of the author of this thesis, there have so far been no publications on the problem of hiding sensitive information in high average utility itemset mining. Therefore, this chapter proposes a model and an algorithm for hiding sensitive high average utility itemsets to solve this problem.

2.2.4 Hiding sensitive high utility association rules
To the knowledge of the author of the thesis, there have so far been no publications on hiding sensitive high utility association rules.

2.3 The proposed algorithm for hiding sensitive high utility itemset
The process of hiding sensitive high utility itemsets using the heuristic approach is to modify the database in such a way that no sensitive itemset can be discovered from the modified database. This process is presented in Fig 2.1. The side effects of this process depend on the approach used to
specify the victim item and victim transaction for the modification.
Fig 2.1: The process of high utility itemset hiding
This algorithm aims to hide every sensitive high utility itemset with minimal side effects. To minimize the side effects, this research proposes theorems and properties as the basis for a heuristic that specifies the victim item. Specifying the victim item exactly allows exactly the right data item to be modified, so that sensitive itemsets are hidden with minimal effect on the database and the side effects are therefore minimized. The experimental results indicate that the proposed algorithm achieves better performance than the previous works.

2.4 The proposed algorithm for hiding high utility and frequent itemsets
The proposed algorithm applies a heuristic approach to specify victim items and victim transactions for local optimization, in order to hide all sensitive high utility and frequent itemsets with acceptable side effects. The process of hiding sensitive high utility and frequent itemsets in the proposed algorithm is shown in Figure 2.2. The experimental results show that the proposed algorithm achieves better performance than the previous works.

2.5 The proposed algorithm for hiding high-average utility itemsets
The algorithm for hiding sensitive high-average utility itemsets executes three main steps to modify data items: (1) specify the victim transaction, (2) specify the victim item, and (3) modify the victim item in the victim transaction. Theorems are proposed to prove the mathematical correctness of the minimal value of internal utility that needs to be reduced from the victim item in order to hide the sensitive itemset. In addition, a system of properties and theorems is designed to prove that specifying the victim item exactly helps minimize the side effects. The experimental results indicate that the proposed algorithm achieves the target of minimizing side effects, because its performance is better than that of the previous works when hiding every sensitive
high-average utility itemset. The side effects are lower than those of the author's algorithm proposed in 2018.

2.6 The proposed algorithm for hiding sensitive high utility association rules
The proposed algorithm is based on modifying the data of items on the left-hand side of the sensitive association rule. To specify the victim item and victim transaction, the method compares, item by item, how strongly modifying each sensitive item affects the database. The item with the lowest effect on the database is selected as the victim item. The results indicate that the proposed algorithm achieves the target of hiding all sensitive association rules with acceptable side effects.

Summary
This chapter proposed:
- An algorithm for hiding sensitive high utility itemsets;
- An algorithm for hiding sensitive high utility and frequent itemsets;
- A model and an algorithm for hiding sensitive high-average utility itemsets;
- A model and an algorithm for hiding sensitive high utility association rules.

CHAPTER 3
HIDING SENSITIVE INFORMATION IN HIGH UTILITY MINING BASED ON LATTICE THEORY

3.1 Lattice theory
This section presents the basis of lattice theory proposed by G. Birkhoff. This theory is the mathematical basis for the method of specifying the victim item in the data modification process that hides sensitive high utility patterns in a database.
3.1.1 Lattice as orders
3.1.2 Lattice as algebras
3.1.3 The set lattice
3.1.4 The intersection lattice of frequent itemsets

3.2 Hiding sensitive information in high utility itemset mining based on lattice theory
The set of high utility and frequent itemsets does not form a set lattice because it does not satisfy the Apriori property. To build a lattice of high utility and frequent itemsets, this research proposes a constrained intersection operator and, based on this operator, defines a new constrained intersection lattice of high utility and frequent itemsets, denoted by ℒ∩𝐻. By proposing a set of properties of ℒ∩𝐻, this research proves that
the constrained intersection lattice of high utility and frequent itemsets can be applied to specify exactly the victim item for data modification, in order to hide sensitive high utility and frequent itemsets with minimal side effects. The experimental results indicate that the proposed algorithm achieves better performance than the previous works.

Summary
Based on lattice theory, this chapter presents a review of lattice theory as applied to finding the victim item for hiding sensitive high utility and frequent itemsets, namely by defining a constrained intersection lattice of the set of high utility and frequent itemsets. Based on this lattice, the target of minimizing side effects is transformed into the target of protecting the nodes of the constrained intersection lattice from being removed when the data is modified.
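
As an illustration of the utility measures defined in Section 1.1 and of the heuristic hiding idea reviewed in Section 2.2.1, the following Python sketch computes u(X), au(X) and uconf(R) on a small database and then hides one itemset by greedily lowering quantities. The toy data, all function names and the greedy loop are assumptions made for this illustration. The loop only mimics the maximal-utility victim selection described for HHUIF; it is not the algorithm proposed in the thesis and makes no attempt to minimize side effects.

```python
from typing import AbstractSet, Dict, Iterable, List

Transaction = Dict[str, int]  # item -> quantity q(x, Tc)

# Hypothetical toy data, invented for illustration (not taken from the thesis).
PROFIT: Dict[str, int] = {"a": 5, "b": 2, "c": 1}  # external utility p(x)
DATABASE: List[Transaction] = [
    {"a": 1, "b": 2},          # T1
    {"a": 2, "b": 1, "c": 3},  # T2
    {"b": 4, "c": 1},          # T3
]


def contains(t: Transaction, items: Iterable[str]) -> bool:
    """True if every item of `items` occurs in transaction t."""
    return all(x in t for x in items)


def u_item(x: str, t: Transaction) -> int:
    """u(x, Tc) = q(x, Tc) * p(x)."""
    return t[x] * PROFIT[x]


def u_in(X: Iterable[str], t: Transaction) -> int:
    """u(X, Tc): sum of the item utilities of X in one transaction."""
    return sum(u_item(x, t) for x in X)


def u(X: AbstractSet[str], db: List[Transaction]) -> int:
    """u(X): sum of u(X, Tc) over the transactions that contain X."""
    return sum(u_in(X, t) for t in db if contains(t, X))


def au(X: AbstractSet[str], db: List[Transaction]) -> float:
    """au(X): sum of u(X, Tc) / |X| over the transactions that contain X."""
    return u(X, db) / len(X)


def uconf(X: AbstractSet[str], Y: AbstractSet[str], db: List[Transaction]) -> float:
    """uconf(X -> Y) = luv(X, X ∪ Y) / u(X); assumes u(X) > 0."""
    xy = set(X) | set(Y)
    luv = sum(u_in(X, t) for t in db if contains(t, xy))
    return luv / u(X, db)


def hide(X: AbstractSet[str], db: List[Transaction], min_util: int) -> List[Transaction]:
    """Naive sanitization sketch: lower quantities until u(X) < min_util.

    Each round picks the occurrence of an item of X with the largest utility
    among the transactions containing X and reduces its quantity by one.
    Assumes positive quantities and positive external utilities.
    """
    if min_util <= 0:
        raise ValueError("min_util must be positive")
    out = [dict(t) for t in db]  # work on a copy; the input stays untouched
    while u(X, out) >= min_util:
        t, x = max(
            ((t, x) for t in out if contains(t, X) for x in X),
            key=lambda pair: u_item(pair[1], pair[0]),
        )
        t[x] -= 1
        if t[x] == 0:
            del t[x]  # the transaction no longer supports X
    return out


if __name__ == "__main__":
    sensitive = {"a", "b"}
    print(u(sensitive, DATABASE))        # 21
    print(au(sensitive, DATABASE))       # 10.5
    print(uconf({"a"}, {"b"}, DATABASE)) # 1.0
    sanitized = hide(sensitive, DATABASE, min_util=20)
    print(u(sensitive, sanitized))       # 16
```

With a threshold of 20, the itemset {a, b} starts with utility 21 and is hidden after a single modification (the quantity of a in T2 drops from 2 to 1). Reducing quantities one unit at a time stops as soon as the utility falls below the threshold, which avoids the over-modification noted for HHUIF and MSICF, but the sketch does not measure MC, HF or AC.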