430 Chapter 16 Parallel Data Mining—Association Rules and Sequential Patterns
Figure 16.2 Building a data warehouse (diagram: operational data in a DB passes through a data extraction process of extract, filter, transform, integrate, classify, aggregate, and summarize into a data warehouse that is integrated, non-volatile, time-variant, and subject-oriented)
A data warehouse is integrated and subject-oriented: the data is already integrated from various sources through the cleaning process, and each data warehouse is developed for a certain subject area in an organization, such as sales, and is therefore subject-oriented. The data is nonvolatile, meaning that the data in a data warehouse is not update-oriented, unlike operational data. The data is also historical and normally grouped to reflect a certain period of time, and hence it is time-variant.
Once a data warehouse has been developed, management is able to perform operations on the data warehouse, such as drill-down and rollup. Drill-down is performed in order to obtain a more detailed breakdown of a certain dimension, whereas rollup, its exact opposite, is performed in order to obtain more general information about a certain dimension. Business reporting often makes use of data warehouses to produce historical analysis for decision support. Parallelism of OLAP has already been presented in Chapter 15.
As can be seen from the above, the main difference between a database and a data warehouse lies in the data itself: operational versus historical. However, using a data warehouse for decision support has its own limitations. A query for historical reporting must be formulated explicitly, just like a query over operational data. If management does not know what information, pattern, or knowledge to expect, data warehousing cannot satisfy this requirement. A typical anecdote is that a manager gives a pile of data to subordinates and asks them to find something useful in it. The manager does not know what to expect but is sure that something useful and surprising may be extracted from this pile of data. This is not a typical database query or data warehouse processing. This raises the need for a data mining process.
Data mining, defined as a process to mine knowledge from a collection of data,
generally involves three components: the data, the mining process, and the knowl-
edge resulting from the mining process (see Fig. 16.1). The data itself needs to go
through several processes before it is ready for the mining process. This prelimi-
nary process is often referred to as data preparation. Although Figure 16.1 shows
that the data for data mining is coming from a data warehouse, in practice this
may or may not be the case. The data may well come from any data repository. Therefore, the data needs to be transformed so that it becomes ready for the mining process.
Data preparation steps generally cover:

• Data selection: Only data relevant to the analysis is selected from the database.

• Data cleaning: Data is cleaned of noise and errors. Missing and irrelevant data is also excluded.

• Data integration: Data from multiple, heterogeneous sources may be integrated into one simple flat table format.

• Data transformation: Data is transformed and consolidated into forms appropriate for mining by performing summary or aggregate operations.
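The four preparation steps can be sketched in a few lines of Python. The records and field names below ("region", "amount") are hypothetical, invented only to make each step concrete:

```python
# A minimal sketch of the four data-preparation steps, using plain Python
# dictionaries as records.

raw = [
    {"id": 1, "region": "east", "amount": "120"},
    {"id": 2, "region": "west", "amount": None},   # missing value
    {"id": 3, "region": "east", "amount": "80"},
]

# 1. Data selection: keep only the relevant attributes.
selected = [{"region": r["region"], "amount": r["amount"]} for r in raw]

# 2. Data cleaning: exclude records with missing data.
cleaned = [r for r in selected if r["amount"] is not None]

# 3. Data integration: merge with a second (hypothetical) source into
#    one simple flat table format.
other_source = [{"region": "east", "amount": "30"}]
integrated = cleaned + other_source

# 4. Data transformation: consolidate by an aggregate (sum per region).
totals = {}
for r in integrated:
    totals[r["region"]] = totals.get(r["region"], 0) + int(r["amount"])

print(totals)  # {'east': 230}
```

In practice each step would be far more involved (schema matching, outlier detection, and so on), but the order of the steps is the same.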
Once the data is ready for the mining process, the mining process can start.
The mining process employs an intelligent method applied to the data in order
to extract data patterns. There are various mining techniques, including but not
limited to association rules, sequential patterns, classification, and clustering. The
results of this mining process are knowledge or patterns.
16.2 DATA MINING: A BRIEF OVERVIEW
As mentioned earlier, data mining is a process for discovering useful, interesting,
and sometimes surprising knowledge from a large collection of data. Therefore,
we need to understand various kinds of data mining tasks and techniques. Also
required is a deeper understanding of the main difference between querying and the
data mining process. Accepting the difference between querying and data mining
can be considered as one of the main foundations of the study of data mining
techniques. Furthermore, it is also necessary to recognize the need for parallelism
of the data mining technique. All of the above will be discussed separately in the
following subsections.
16.2.1 Data Mining Tasks
Data mining tasks can be classified into two categories:

• Descriptive data mining and

• Predictive data mining
Descriptive data mining describes the data set in a concise manner and presents interesting general properties of the data. It summarizes the data in terms of its properties and correlations with other data. For example, within a set of data, some data items share common similarities and are hence grouped into one cluster. Another example would be that when certain data exists in a transaction, another type of data tends to follow.
Predictive data mining builds a prediction model whereby it makes inferences
from the available set of data and attempts to predict the behavior of new data
sets. For example, for a class or category, a set of rules has been inferred from
the available data set, and when new data arrives the rules can be applied to this
new data to determine to which class or category it should belong. Prediction is
made possible because the model consisting of a set of rules is able to predict the
behavior of new information.
Whether descriptive or predictive, there are various data mining techniques. Some
of the common data mining techniques include class description or characteri-
zation, association, classification, prediction, clustering, and time-series analysis.
Each of these techniques has many approaches and algorithms.
Class description or characterization summarizes a set of data in a concise way
that distinguishes this class from others. Class characterization provides the char-
acteristics of a collection of data by summarizing the properties of the data. Once
a class of data has been characterized, it may be compared with other collections
in order to determine the differences between classes.
Association rules discover association relationships or correlation among a set
of items. Association analysis is widely used in transaction data analysis, such
as a market basket. A typical example of an association rule in a market basket
analysis is the finding of the rule (magazine → sweet), indicating that if a magazine
is bought in a purchase transaction, there is a likely chance that a sweet will also
appear in the same transaction. Association rule mining is one of the most widely
used data mining techniques. Since its introduction in the early 1990s through the
Apriori algorithm, association rule mining has received huge attention across var-
ious research communities. The association rule mining methods aim to discover
rules based on the correlation between different attributes/items found in the data
set. To discover such rules, association rule mining algorithms at first capture a
set of significant correlations present in a given data set and then deduce mean-
ingful relationships from these correlations. Since the discovery of such rules is
a computationally intensive task, many association rule mining algorithms have
been proposed.
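As a hedged illustration of how such correlations are quantified, the following sketch computes the support and confidence of the rule (magazine → sweet) over a toy transaction list; the data is invented:

```python
# Support and confidence for an association rule over market-basket data.
transactions = [
    {"magazine", "sweet", "milk"},
    {"magazine", "sweet"},
    {"magazine", "bread"},
    {"milk", "bread"},
]

def support(itemset, txns):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in txns if itemset <= t) / len(txns)

def confidence(lhs, rhs, txns):
    """Estimated P(rhs appears | lhs appears) in a transaction."""
    return support(lhs | rhs, txns) / support(lhs, txns)

print(support({"magazine", "sweet"}, transactions))      # 0.5
print(confidence({"magazine"}, {"sweet"}, transactions)) # ~0.667
```

Apriori-style algorithms essentially search for all rules whose support and confidence exceed user-given thresholds, without being told which items to examine.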
Classification analyzes a set of training data and constructs a model for each
class based on the features in the data. There are many different kinds of classi-
fications. One of the most common is the decision tree. A decision tree is a tree
consisting of a set of classification rules, which is generated by such a classifica-
tion process. These rules can be used to gain a better understanding of each class
in the database and for classification of new incoming data. An example of classification using a decision tree is that a "fraud" class has been labeled and it has
been identified with the characteristics of fraudulent credit card transactions. These
characteristics are in the form of a set of rules. When a new credit card transaction
takes place, this incoming transaction is checked against a set of rules to identify
whether or not this incoming transaction is classified as a fraudulent transaction.
In constructing a decision tree, the primary task is to form a set of rules in the form
of a decision tree that correctly reflects the rules for a certain class.
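A minimal sketch of such rule-based classification is shown below. The thresholds and attribute names are hypothetical; in practice the rules would be learned from labeled training data rather than hand-written:

```python
# A tiny hand-written "decision tree": each branch tests one attribute.
# Real classifiers induce these tests from training data.

def classify(txn):
    if txn["amount"] > 5000:
        if txn["country"] != txn["home_country"]:
            return "fraud"
        return "legitimate"
    return "legitimate"

incoming = {"amount": 9000, "country": "XY", "home_country": "AB"}
print(classify(incoming))  # fraud
```

When a new transaction arrives, it is simply routed down the tree until a class label is reached, which is exactly the checking-against-rules step described above.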
Prediction predicts the possible values of some missing data or the value dis-
tribution of certain attributes in a set of objects. It involves the finding of the set
of attributes relevant to the attribute of interest and predicting the value distribu-
tion based on the set of data similar to the selected objects. For example, in a
time-series data analysis, a column in the database indicates a value over a period
of time. Some values for a certain period of time might be missing. Since the
presence of these values might affect the accuracy of the mining algorithm, a pre-
diction algorithm may be applied to predict the missing values, before the main
mining algorithm may proceed.
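A simple sketch of this idea follows, with linear interpolation standing in for the prediction algorithm and under the assumption that missing values are isolated (not adjacent and not at the ends of the series):

```python
# Predict a missing time-series value before the main mining step.
# Linear interpolation between the neighbouring known values is used
# here only as a stand-in for a real prediction model.

series = [10.0, 12.0, None, 16.0, 18.0]

def fill_missing(values):
    filled = list(values)
    for i, v in enumerate(filled):
        if v is None:
            filled[i] = (filled[i - 1] + filled[i + 1]) / 2
    return filled

print(fill_missing(series))  # [10.0, 12.0, 14.0, 16.0, 18.0]
```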
Clustering is a process to divide the data into clusters, whereby a cluster con-
tains a collection of data objects that are similar to one another. The similarity is
expressed by a similarity function, which is a metric to measure how similar two
data objects are. The opposite of a similarity function is a distance function, which
is used to measure the distance between two data objects. The further the distance,
the greater is the difference between the two data objects. Therefore, the distance
function is exactly the opposite of the similarity function, although both of them
may be used for the same purpose, to measure two data objects in terms of their
suitability for a cluster. Data objects within one cluster should be as similar as pos-
sible, compared with data objects from a different cluster. Therefore, the aim of
a clustering algorithm is to ensure that the intracluster similarity is high and the
intercluster similarity is low.
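The relationship between the two functions can be sketched as follows, using Euclidean distance and one common (but by no means the only) distance-to-similarity transform:

```python
import math

def distance(a, b):
    """Euclidean distance between two data objects."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Similarity decreases as distance grows."""
    return 1.0 / (1.0 + distance(a, b))

p, q, r = (0.0, 0.0), (1.0, 0.0), (3.0, 4.0)
print(distance(p, q), distance(p, r))       # 1.0 5.0
print(similarity(p, q) > similarity(p, r))  # True
```

A clustering algorithm would use either function interchangeably: maximizing intracluster similarity is the same goal as minimizing intracluster distance.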
Time-series analysis analyzes a large set of time series data to find certain reg-
ularities and interesting characteristics. This may include finding sequences or
sequential patterns, periodic patterns, trends, and deviations. A stock market value
prediction and analysis is a typical example of a time-series analysis.
16.2.2 Querying vs. Mining
Although it has been stated that the purpose of mining (or data mining) is to dis-
cover knowledge, it should be differentiated from querying (or database querying),
which simply retrieves data. In some cases, this is easier said than done. Conse-
quently, highlighting the differences is critical in studying both database querying
and data mining. The differences can generally be categorized into unsupervised
and supervised learning.
Unsupervised Learning
The previous section gave the example of a pile of data from which some knowl-
edge can be extracted. The difference in attitude between a data miner and a data
warehouse reporter was outlined, albeit in an exaggerated manner. In this example,
no direction is given about where the knowledge may reside. There is no guideline
of where to start and what to expect. In machine learning terms, this is called unsupervised learning, in which the learning process is not guided, or even dictated, by the expected results. To put it another way, unsupervised learning does
not require a hypothesis. Describing it as exploring the entire possible space in a jungle of data might be an overstatement, but the analogy holds.
Using the example of a supermarket transaction list, a data mining process is
used to analyze all transaction records. As a result, perhaps, a pattern, such as the
majority of people who bought milk will also buy cereal in the same transaction, is
found. Whether this is interesting or not is a different matter. Nevertheless, this is
data mining, and the result is an association rule. On the contrary, a query such as
“What do people buy together with milk?” is a database query, not a data mining
process.
If the pattern milk → cereal is generalized into X → Y, where X and Y are items in the supermarket, X and Y are not predefined in data mining. On the other hand, database querying requires X as an input to the query in order to find Y, or vice versa. Both are important in their own context. Database querying requires some selection predicates, whereas data mining does not.
Definition 16.1 (association rule mining vs. database querying): Given a database D, association rule mining produces an association rule Ar(D) = X → Y, where X, Y ∈ D. A query Q(D, X) = Y produces records Y matching the predicate specified by X.
The pattern X → Y may be based on certain criteria, such as:

• Majority

• Minority

• Absence

• Exception
The majority indicates that the rule X → Y is formed because the majority of records follow this rule. For example, the rule X → Y might indicate that if a person buys X, it is 99% likely that the person will also buy Y at the same time; moreover, both items X and Y must be bought frequently by all customers, meaning that items X and Y (separately or together) must appear frequently in the transactions.
Some interesting rules or patterns might not include items that frequently appear in the transactions. Therefore, some patterns may be based on the minority. This type of rule indicates that the items occur very rarely or sporadically, but the pattern is important. Using X and Y above, it might be that although both X and Y occur rarely in the transactions, when they both appear together the pattern becomes interesting.
Some rules may also involve the absence of items, which is sometimes called negative association. For example, if it is true that a purchase transaction that includes coffee is very likely NOT to include tea, then the items tea and coffee are negatively associated. Therefore, rule X → ~Y, where the ~ symbol in front of Y indicates the absence of Y, shows that when X appears in a transaction, it is very unlikely that Y will appear in the same transaction.
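A rough sketch of how such a negative association might be detected follows. The data is invented, and a simple conditional-frequency test stands in for a full negative-association-mining algorithm:

```python
# Estimate how often tea appears among transactions that contain coffee.
transactions = [
    {"coffee", "sugar"}, {"coffee", "milk"}, {"coffee"},
    {"tea", "lemon"}, {"coffee", "tea"},
]

with_coffee = [t for t in transactions if "coffee" in t]
p_tea_given_coffee = sum(1 for t in with_coffee if "tea" in t) / len(with_coffee)

print(p_tea_given_coffee)        # 0.25
print(p_tea_given_coffee < 0.5)  # True: coffee -> ~tea is plausible here
```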
Other rules may indicate an exception, referring to a pattern that contradicts the common belief or practice. Therefore, pattern X → Y is an exception if it is uncommon to see X and Y appear together. In other words, it is common to see X or Y occur just by itself without the other one.
Regardless of the criteria that are used to produce the patterns, the patterns can
be produced only after analyzing the data globally. This approach has the greatest
potential, since it provides information that is not accessible in any other way. On
the contrary, database querying relies on some directions or inputs given by the
user in order to retrieve suitable records from the database.
Definition 16.2 (sequential patterns vs. database querying): Given a database D, a sequential pattern Sp(D) = O: X → Y, where O indicates the owner of a transaction and X, Y ∈ D. A query Q(D, X, Y) = O, or Q(D, aggr) = O, where aggr indicates some aggregate functions.
Given a set of database transactions, where each transaction involves one customer and possibly many items, an example of a sequential pattern is one in which a customer who bought item X previously will later come back after some allowable period of time to buy item Y. Hence, O: X → Y, where O refers to the customer set.
If this were a query, the query could possibly request "Retrieve customers who have bought a minimum of two different items at different times." The results will not show any patterns, but merely a collection of records. Even if the query were rewritten as "Retrieve customers who have bought items X and Y at different times," it would work only if items X and Y are known a priori. The sequential pattern O: X → Y obviously requires a number of processing steps in order to produce such a rule, and each step might involve several queries, including the query mentioned above.
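The multi-step nature of the task can be hinted at with a naive sketch that checks a single candidate sequence X → Y directly; a real sequential pattern miner would instead enumerate candidates over all item pairs. The data layout and field names are invented:

```python
# Customers who bought x and then, strictly later, bought y.
purchases = [  # (customer, time, item)
    ("c1", 1, "camera"), ("c1", 5, "tripod"),
    ("c2", 2, "camera"), ("c2", 9, "tripod"),
    ("c3", 3, "camera"),
]

def customers_with_sequence(data, x, y):
    found = set()
    for cust, t1, item1 in data:
        if item1 != x:
            continue
        for cust2, t2, item2 in data:
            if cust2 == cust and item2 == y and t2 > t1:
                found.add(cust)
    return found

print(customers_with_sequence(purchases, "camera", "tripod"))
# {'c1', 'c2'} (set order may vary)
```

Note that this is the *query* form: x and y are inputs. Mining the pattern means discovering that ("camera", "tripod") is worth checking in the first place.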
Definition 16.3 (clustering vs. database querying): Given database D, a clustering Cl(D) = Σ_{i=1..n} {X_i1, X_i2, ...} produces n clusters, each of which consists of a number of items X. A query Q(D, X_1) = {X_2, X_3, X_4, ...} produces a list of items {X_2, X_3, X_4, ...} that are in the same cluster as the given item X_1.
Given a movement database consisting of mobile users and their locations at a specific time, a cluster containing a list of mobile users {m1, m2, m3, ...} might indicate that they are moving together or are at a place together for a period of time. This shows that there is a cluster of users with the same characteristics, which in this case is the location.
On the contrary, a query is able to retrieve only those mobile users who are moving together, or are at a place at the same time for a period of time, with the given mobile user, say m1. So the query can be expressed as something like: "Which mobile users usually go with m1?" There are two issues here. One is whether or not the query can be answered directly, which depends on the data itself and whether there is explicit information about the question in the query. Second, the records to be retrieved are dependent on the given input.
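The input-dependence of the query can be illustrated with a sketch. The data is invented, and co-location at the same time stands in for "usually going with":

```python
# Query form: the target user m1 must be given as input.
observations = [  # (user, time, location)
    ("m1", 1, "mall"), ("m2", 1, "mall"), ("m3", 1, "park"),
    ("m1", 2, "cafe"), ("m2", 2, "cafe"), ("m3", 2, "park"),
]

def usually_with(data, target):
    """Users seen at the same place and time as `target`."""
    target_at = {(t, loc) for u, t, loc in data if u == target}
    return {u for u, t, loc in data
            if u != target and (t, loc) in target_at}

print(sorted(usually_with(observations, "m1")))  # ['m2']
```

A clustering algorithm, by contrast, would have to produce the groups {m1, m2} and {m3} with no target user supplied at all.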
Supervised Learning
Supervised learning is naturally the opposite of unsupervised learning, since super-
vised learning starts with a direction pointing to the target. For example, given a
list of top salesmen, a data miner would like to find the other properties that they
have in common. In this example, it starts with something, namely, a list of top
salesmen. This is different from unsupervised learning, which does not start with
any particular instances.
In data warehousing and OLAP, as explained in Chapter 15, we can use drill-down and rollup to find more detailed (or higher-level) information about a given record. However, these operations are still unable to formulate the desired properties or rules of the given input data. The mining process is complex enough that it looks not only at a particular category (e.g., top salesmen) but at all other categories as well. Database querying is not designed for this.
Definition 16.4 (decision tree classification vs. database querying): Given database D, a decision tree Dt(D, C) = P, where C is the given category and P is the resulting set of properties. A query Q(D, P) = R is one where the properties P are known in order to retrieve records R.
Continuing the above example, when mining all properties of a given category, we can also find other instances or members who possess the same properties. For example, find the properties of a good salesman, and then find who the good salesmen are. In database querying, the properties have to be given so that we can retrieve the names of the salesmen. But in data mining, and in particular decision tree classification, the task is to formulate such properties in the first place.
16.2.3 Parallelism in Data Mining
Like any other data-intensive applications, parallelism is used purely because of the
large size of data involved in the processing, with an expectation that parallelism
will speed up the process and therefore the elapsed time will be much reduced.
This is certainly still applicable to data mining. Additionally, the data in data mining often has high dimensionality (a large number of attributes), not only a large volume (a large number of records). Depending on how the data is structured, high-dimensional data is very common in data mining. Processing high-dimensional data introduces a degree of complexity not previously found in, or applicable to, databases or even data warehousing. Moreover, it is common in data mining that even a simple technique requires a number of iterations of the process, each of which refines the results until the ultimate results are generated.
Data mining is often needed to process complex data such as images, geograph-
ical data, scientific data, unstructured or semistructured documents, etc. Basically,
the data can be anything. This phenomenon is rather different from databases and
data warehouses, whose data follows a particular structure and model, such as
relational structure in relational databases or star schema or data cube in data
warehouses. The data in data mining is more flexible in terms of the structures,
as it is not confined to a relational structure only. As a result, the processing of
complex data also requires parallelism to speed up the process.
The other motivation is due to the widely available multiple processors or par-
allel computers. This makes the use of such a machine inevitable, not only for
data-intensive applications, but basically for any application.
The objectives of parallelism in data mining are not uniquely different from
those of parallel query processing in databases and data warehouses. Reducing
data mining time, in terms of speed up and scale up, is still the main objective.
However, since data mining processes and techniques might be considered much
more complex than query processing, parallelism of data mining is expected to
simplify the mining tasks as well. Furthermore, it is sometimes expected to produce
better mining results.
There are several forms of parallelism that are available for data mining. Chapter
1 described various forms of parallelism, including: interquery parallelism (paral-
lelism among queries), intraquery parallelism (parallelism within a query), intra-
operation parallelism (partitioned parallelism or data parallelism), interoperation
parallelism (pipelined parallelism and independent parallelism), and mixed paral-
lelism. In data mining, for simplicity purposes, parallelism exists in either
• Data parallelism or

• Result parallelism
If we look at the data mining process at a high level as a process that takes data
input and produces knowledge or patterns or models, data parallelism is where
parallelism is created due to the fragmentation of the input data, whereas result
parallelism focuses on the fragmentation of the results, not necessarily the input
data. More details about these two data mining parallelisms are given below.
Data Parallelism
In data parallelism, as the name states, parallelism is created because the data is partitioned across a number of processors and each processor focuses on its partition of the data set. After each processor completes its local processing and produces its local results, the final results are formed by combining all local results.
Since data mining processes normally involve several iterations, data parallelism raises some complexities. Every stage of the process requires an input and produces an output. In the first iteration, the input of the process in each processor is its local data partition, and after the first iteration completes, each processor will have produced its local results. The question is: What will the input be for the subsequent iterations? In many cases, the next iteration requires the global picture of the results from the immediately previous iteration. Therefore, the local results from each processor need to be reassembled globally. In other words, at the end of each iteration, a global reassembling stage to compile all local results is necessary before the subsequent iteration starts.
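The scheme can be sketched by simulating processors with plain Python data structures; a real system would run the partitions on separate processors and exchange the local counts over an interconnect:

```python
# Data parallelism with a global reassembling stage: each simulated
# "processor" counts items in its own partition, then local counts are
# merged into the global result needed before the next iteration.
from collections import Counter

transactions = [["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]

# Data partitioning: split the input among 2 (simulated) processors.
partitions = [transactions[:2], transactions[2:]]

# Each processor produces local results from its partition.
local_results = [Counter(item for t in part for item in t)
                 for part in partitions]

# Global reassembling: combine all local results.
global_result = sum(local_results, Counter())
print(dict(global_result))  # {'a': 3, 'b': 3, 'c': 3}
```

In association rule mining this merge step corresponds to the count-distribution synchronization that every iteration must perform before candidates for the next iteration can be generated.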
Figure 16.3 Data parallelism for data mining (diagram: the database is split by data partitioning into data partitions 1 to n, one per processor Proc 1 to Proc n; the 1st iteration produces local results 1 to n, which pass through a global re-assembling stage to form the global results after the first iteration; the 2nd iteration refines these into new local results, again globally re-assembled; and so on, until the k-th iteration produces the final results)
This situation is not that common in database query processing because, for a primitive database operation, even if there exist several stages of processing, each processor may not need to see other processors' results until the final results are ultimately generated.
Figure 16.3 illustrates how data parallelism is achieved in data mining. Note
that the global temporary result reassembling stage occurs between iterations. It is
clear that parallelism is driven by the database partitions.
Result Parallelism
Result parallelism focuses on how the target results, which are the output of
the processing, can be parallelized during the processing stage without having
produced any results or temporary results. This is exactly the opposite of data
parallelism, where parallelism is created because of the input data partitioning.
Data parallelism might be easier to grasp because the partitioning is done up
front, and then parallelism occurs. Result parallelism, on the other hand, works
by partitioning the target results, and each processor focuses on its target result
partition.
The way result parallelism works can be explained as follows. The target result
space is normally known in advance. The target result of an association rule min-
ing is frequent itemsets in a lexical order. Although we do not know the actual
instances of frequent itemsets before they are created, nevertheless, we should
know the range of the items, as they are confined by the itemsets of the input data. Therefore, result parallelism partitions the frequent itemset space into a number of partitions; for example, frequent itemsets starting with items A to I will be processed by processor 1, frequent itemsets starting with items H to N by the next processor, and so on. In classification mining, since the target categories are known, each target category can be assigned a processor.
Once the target result space has been partitioned, each processor will do what-
ever it takes to produce the result within the given range. Each processor will take
any input data necessary to produce the desired result space. Suppose that the ini-
tial data partition 1 is assigned to processor 1, and if this processor needs data
partitions from other processors in order to produce the desired target result space,
it will gather data partitions from other processors. The worst case would be one
where each processor needs the entire database to work with.
Because the target result space is already partitioned, there is no global tem-
porary result reassembling stage at the end of each iteration. The temporary local
results will be refined only in the next iteration, until ultimately the final results are
generated. Figure 16.4 illustrates result parallelism for data mining processes.
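A hedged sketch of result parallelism for frequent 2-itemsets follows, partitioning the result space by the starting letter of the first item. The data and ranges are invented, and a real miner would iterate over growing itemset sizes:

```python
# Each simulated processor produces only the frequent 2-itemsets whose
# first (lexically sorted) item falls in its assigned range, scanning
# whatever input data it needs.
from itertools import combinations

transactions = [{"apple", "milk"}, {"apple", "milk"}, {"milk", "tea"}]
min_support = 2

def frequent_pairs_in_range(txns, lo, hi):
    items = sorted({i for t in txns for i in t})
    result = []
    for pair in combinations(items, 2):
        if lo <= pair[0] <= hi:
            if sum(1 for t in txns if set(pair) <= t) >= min_support:
                result.append(pair)
    return result

# Processor 1 handles results starting 'a'-'i'; processor 2 'j'-'z'.
print(frequent_pairs_in_range(transactions, "a", "i"))  # [('apple', 'milk')]
print(frequent_pairs_in_range(transactions, "j", "z"))  # []
```

Because the result space is disjointly assigned, no global reassembling of temporary results between iterations is required; each processor simply refines its own partition.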
Contrasting with the parallelism that is normally adopted by database queries,
query parallelism to some degree follows both data and result parallelism. Data
parallelism is quite an obvious choice for parallelizing query processing. However,
result parallelism is inherently used as well. For example, in a disjoint partition-
ing parallel join, each processor receives a disjoint partition based on a certain
partitioning function. The join results of a processor will follow the assigned par-
titioning function. In other words, result parallelism is used. However, because
disjoint partitioning parallel join is already achieved by correctly partitioning the
input data, it is also said that data parallelism is utilized. Consequently, it has never
been necessary to distinguish between data and result parallelism.
The difference between these two parallelism models is highlighted in the data
mining processing because of the complexity of the mining process itself, where
there are multiple iterations of the entire process and the local results may need
to be refined in each iteration. Therefore, adopting a specific parallelism model
becomes necessary, thereby emphasizing the difference between the two paral-
lelism models.