430 Chapter 16 Parallel Data Mining—Association Rules and Sequential Patterns
Figure 16.2 Building a data warehouse (diagram: operational data in a DB passes through a data extraction process of extract, filter, transform, integrate, classify, aggregate, and summarize into a data warehouse that is integrated, non-volatile, time-variant, and subject-oriented)
A data warehouse is integrated and subject-oriented: the data is already integrated from various sources through the cleaning process, and each data warehouse is developed for a certain subject area in an organization, such as sales, and is therefore subject-oriented. The data is nonvolatile, meaning that the data in a data warehouse is not update-oriented, unlike operational data. The data is also historical and normally grouped to reflect a certain period of time, and hence it is time-variant.
Once a data warehouse has been developed, management is able to perform operations on the data warehouse, such as drill-down and rollup. Drill-down is performed in order to obtain a more detailed breakdown of a certain dimension, whereas rollup, its exact opposite, is performed in order to obtain more general information about a certain dimension. Business reporting often makes use of data warehouses to produce historical analysis for decision support. Parallelism of OLAP has already been presented in Chapter 15.
As can be seen from the above, the main difference between a database and a data warehouse lies in the data itself: operational versus historical. However, using a data warehouse for decision support has its own limitations. A query for historical reporting must be formulated explicitly, just like a query over operational data. If management does not know what information, pattern, or knowledge to expect, data warehousing cannot satisfy this requirement. A typical anecdote is that a manager gives a pile of data to subordinates and asks them to find something useful in it. The manager does not know what to expect but is sure that something useful and surprising may be extracted from this pile of data. This is not a typical database query or data warehouse processing. This raises the need for a data mining process.
Data mining, defined as a process to mine knowledge from a collection of data,
generally involves three components: the data, the mining process, and the knowl-
edge resulting from the mining process (see Fig. 16.1). The data itself needs to go
through several processes before it is ready for the mining process. This prelimi-
nary process is often referred to as data preparation. Although Figure 16.1 shows
that the data for data mining is coming from a data warehouse, in practice this
may or may not be the case. The data may well come from any data repository. Therefore, the data needs to be transformed so that it becomes ready for the mining process.
Data preparation steps generally cover:

• Data selection: Only data relevant to the analysis is selected from the database.

• Data cleaning: Data is cleaned of noise and errors. Missing and irrelevant data is also excluded.

• Data integration: Data from multiple, heterogeneous sources may be integrated into one simple flat table format.

• Data transformation: Data is transformed and consolidated into forms appropriate for mining by performing summary or aggregate operations.
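The four preparation steps can be sketched in a few lines of Python. The records and field names below ("region", "amount") are hypothetical, invented only to make each step concrete:

```python
# A minimal sketch of the four data-preparation steps, using plain Python
# dictionaries as records.

raw = [
    {"id": 1, "region": "east", "amount": "120"},
    {"id": 2, "region": "west", "amount": None},   # missing value
    {"id": 3, "region": "east", "amount": "80"},
]

# 1. Data selection: keep only the relevant attributes.
selected = [{"region": r["region"], "amount": r["amount"]} for r in raw]

# 2. Data cleaning: exclude records with missing data.
cleaned = [r for r in selected if r["amount"] is not None]

# 3. Data integration: merge with a second (hypothetical) source into
#    one simple flat table format.
other_source = [{"region": "east", "amount": "30"}]
integrated = cleaned + other_source

# 4. Data transformation: consolidate by an aggregate (sum per region).
totals = {}
for r in integrated:
    totals[r["region"]] = totals.get(r["region"], 0) + int(r["amount"])

print(totals)  # {'east': 230}
```

In practice each step would be far more involved (schema matching, outlier detection, and so on), but the order of the steps is the same.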
Once the data is ready for the mining process, the mining process can start.
The mining process employs an intelligent method applied to the data in order
to extract data patterns. There are various mining techniques, including but not
limited to association rules, sequential patterns, classification, and clustering. The
results of this mining process are knowledge or patterns.
16.2 DATA MINING: A BRIEF OVERVIEW
As mentioned earlier, data mining is a process for discovering useful, interesting,
and sometimes surprising knowledge from a large collection of data. Therefore,
we need to understand various kinds of data mining tasks and techniques. Also
required is a deeper understanding of the main difference between querying and the
data mining process. Accepting the difference between querying and data mining
can be considered as one of the main foundations of the study of data mining
techniques. Furthermore, it is also necessary to recognize the need for parallelism
of the data mining technique. All of the above will be discussed separately in the
following subsections.
16.2.1 Data Mining Tasks
Data mining tasks can be classified into two categories:

• Descriptive data mining and

• Predictive data mining
Descriptive data mining describes the data set in a concise manner and presents interesting general properties of the data. It summarizes the data in terms of its properties and correlations with other data. For example, within a set of data, some data items share common similarities and are hence grouped into one cluster. Another example would be that when certain data exists in a transaction, another type of data tends to follow.
Predictive data mining builds a prediction model whereby it makes inferences
from the available set of data and attempts to predict the behavior of new data
sets. For example, for a class or category, a set of rules has been inferred from
the available data set, and when new data arrives the rules can be applied to this
new data to determine to which class or category it should belong. Prediction is
made possible because the model consisting of a set of rules is able to predict the
behavior of new information.
Whether descriptive or predictive, there are various data mining techniques. Some
of the common data mining techniques include class description or characteri-
zation, association, classification, prediction, clustering, and time-series analysis.
Each of these techniques has many approaches and algorithms.
Class description or characterization summarizes a set of data in a concise way
that distinguishes this class from others. Class characterization provides the char-
acteristics of a collection of data by summarizing the properties of the data. Once
a class of data has been characterized, it may be compared with other collections
in order to determine the differences between classes.
Association rules discover association relationships or correlation among a set
of items. Association analysis is widely used in transaction data analysis, such
as a market basket. A typical example of an association rule in a market basket
analysis is the finding of the rule (magazine → sweet), indicating that if a magazine
is bought in a purchase transaction, there is a likely chance that a sweet will also
appear in the same transaction. Association rule mining is one of the most widely
used data mining techniques. Since its introduction in the early 1990s through the
Apriori algorithm, association rule mining has received huge attention across var-
ious research communities. The association rule mining methods aim to discover
rules based on the correlation between different attributes/items found in the data
set. To discover such rules, association rule mining algorithms at first capture a
set of significant correlations present in a given data set and then deduce mean-
ingful relationships from these correlations. Since the discovery of such rules is
a computationally intensive task, many association rule mining algorithms have
been proposed.
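As a hedged illustration of how such correlations are quantified, the following sketch computes the support and confidence of the rule (magazine → sweet) over a toy transaction list; the data is invented:

```python
# Support and confidence for an association rule over market-basket data.
transactions = [
    {"magazine", "sweet", "milk"},
    {"magazine", "sweet"},
    {"magazine", "bread"},
    {"milk", "bread"},
]

def support(itemset, txns):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in txns if itemset <= t) / len(txns)

def confidence(lhs, rhs, txns):
    """Estimated P(rhs appears | lhs appears) in a transaction."""
    return support(lhs | rhs, txns) / support(lhs, txns)

print(support({"magazine", "sweet"}, transactions))      # 0.5
print(confidence({"magazine"}, {"sweet"}, transactions)) # ~0.667
```

Apriori-style algorithms essentially search for all rules whose support and confidence exceed user-given thresholds, without being told which items to examine.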
Classification analyzes a set of training data and constructs a model for each
class based on the features in the data. There are many different kinds of classi-
fications. One of the most common is the decision tree. A decision tree is a tree
consisting of a set of classification rules, which is generated by such a classifica-
tion process. These rules can be used to gain a better understanding of each class
in the database and for classification of new incoming data. An example of classification using a decision tree is that a "fraud" class has been labeled and it has
been identified with the characteristics of fraudulent credit card transactions. These
characteristics are in the form of a set of rules. When a new credit card transaction
takes place, this incoming transaction is checked against a set of rules to identify
whether or not this incoming transaction is classified as a fraudulent transaction.
In constructing a decision tree, the primary task is to form a set of rules in the form
of a decision tree that correctly reflects the rules for a certain class.
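A minimal sketch of such rule-based classification is shown below. The thresholds and attribute names are hypothetical; in practice the rules would be learned from labeled training data rather than hand-written:

```python
# A tiny hand-written "decision tree": each branch tests one attribute.
# Real classifiers induce these tests from training data.

def classify(txn):
    if txn["amount"] > 5000:
        if txn["country"] != txn["home_country"]:
            return "fraud"
        return "legitimate"
    return "legitimate"

incoming = {"amount": 9000, "country": "XY", "home_country": "AB"}
print(classify(incoming))  # fraud
```

When a new transaction arrives, it is simply routed down the tree until a class label is reached, which is exactly the checking-against-rules step described above.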
Prediction predicts the possible values of some missing data or the value dis-
tribution of certain attributes in a set of objects. It involves the finding of the set
of attributes relevant to the attribute of interest and predicting the value distribu-
tion based on the set of data similar to the selected objects. For example, in a
time-series data analysis, a column in the database indicates a value over a period
of time. Some values for a certain period of time might be missing. Since the
presence of these values might affect the accuracy of the mining algorithm, a pre-
diction algorithm may be applied to predict the missing values, before the main
mining algorithm may proceed.
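A simple sketch of this idea follows, with linear interpolation standing in for the prediction algorithm and under the assumption that missing values are isolated (not adjacent and not at the ends of the series):

```python
# Predict a missing time-series value before the main mining step.
# Linear interpolation between the neighbouring known values is used
# here only as a stand-in for a real prediction model.

series = [10.0, 12.0, None, 16.0, 18.0]

def fill_missing(values):
    filled = list(values)
    for i, v in enumerate(filled):
        if v is None:
            filled[i] = (filled[i - 1] + filled[i + 1]) / 2
    return filled

print(fill_missing(series))  # [10.0, 12.0, 14.0, 16.0, 18.0]
```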
Clustering is a process to divide the data into clusters, whereby a cluster con-
tains a collection of data objects that are similar to one another. The similarity is
expressed by a similarity function, which is a metric to measure how similar two
data objects are. The opposite of a similarity function is a distance function, which
is used to measure the distance between two data objects. The further the distance,
the greater is the difference between the two data objects. Therefore, the distance
function is exactly the opposite of the similarity function, although both of them
may be used for the same purpose, to measure two data objects in terms of their
suitability for a cluster. Data objects within one cluster should be as similar as pos-
sible, compared with data objects from a different cluster. Therefore, the aim of
a clustering algorithm is to ensure that the intracluster similarity is high and the
intercluster similarity is low.
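The relationship between the two functions can be sketched as follows, using Euclidean distance and one common (but by no means the only) distance-to-similarity transform:

```python
import math

def distance(a, b):
    """Euclidean distance between two data objects."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Similarity decreases as distance grows."""
    return 1.0 / (1.0 + distance(a, b))

p, q, r = (0.0, 0.0), (1.0, 0.0), (3.0, 4.0)
print(distance(p, q), distance(p, r))       # 1.0 5.0
print(similarity(p, q) > similarity(p, r))  # True
```

A clustering algorithm would use either function interchangeably: maximizing intracluster similarity is the same goal as minimizing intracluster distance.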
Time-series analysis analyzes a large set of time series data to find certain reg-
ularities and interesting characteristics. This may include finding sequences or
sequential patterns, periodic patterns, trends, and deviations. A stock market value
prediction and analysis is a typical example of a time-series analysis.
16.2.2 Querying vs. Mining
Although it has been stated that the purpose of mining (or data mining) is to dis-
cover knowledge, it should be differentiated from querying (or database querying),
which simply retrieves data. In some cases, this is easier said than done. Conse-
quently, highlighting the differences is critical in studying both database querying
and data mining. The differences can generally be categorized into unsupervised
and supervised learning.
Unsupervised Learning
The previous section gave the example of a pile of data from which some knowl-
edge can be extracted. The difference in attitude between a data miner and a data
warehouse reporter was outlined, albeit in an exaggerated manner. In this example,
no direction is given about where the knowledge may reside. There is no guideline
of where to start and what to expect. In machine learning terms, this is called unsupervised learning, in which the learning process is not guided, or even dictated, by the expected results. To put it another way, unsupervised learning does
not require a hypothesis. Describing it as exploring the entire possible space in a jungle of data might be an overstatement, but the analogy holds.
Using the example of a supermarket transaction list, a data mining process is
used to analyze all transaction records. As a result, perhaps, a pattern, such as the
majority of people who bought milk will also buy cereal in the same transaction, is
found. Whether this is interesting or not is a different matter. Nevertheless, this is
data mining, and the result is an association rule. On the contrary, a query such as
“What do people buy together with milk?” is a database query, not a data mining
process.
If the pattern milk → cereal is generalized into X → Y, where X and Y are items in the supermarket, X and Y are not predefined in data mining. On the other hand, database querying requires X as an input to the query in order to find Y, or vice versa. Both are important in their own context. Database querying requires some selection predicates, whereas data mining does not.
Definition 16.1 (association rule mining vs. database querying): Given a database D, association rule mining produces an association rule Ar(D) = X → Y, where X, Y ∈ D. A query Q(D, X) = Y produces records Y matching the predicate specified by X.
The pattern X → Y may be based on certain criteria, such as:

• Majority

• Minority

• Absence

• Exception
The majority indicates that the rule X → Y is formed because the majority of records follow this rule. For example, the rule X → Y might indicate that if a person buys X, it is 99% likely that the person will also buy Y at the same time; moreover, both items X and Y must be bought frequently by all customers, meaning that items X and Y (separately or together) must appear frequently in the transactions.
Some interesting rules or patterns might not include items that frequently appear in the transactions. Therefore, some patterns may be based on the minority. This type of rule indicates that the items occur very rarely or sporadically, but the pattern is important. Using X and Y above, it might be that although both X and Y occur rarely in the transactions, when they both appear together the pattern becomes interesting.
Some rules may also involve the absence of items, which is sometimes called negative association. For example, if it is true that a purchase transaction that includes coffee is very likely NOT to include tea, then the items tea and coffee are negatively associated. Therefore, rule X → ~Y, where the ~ symbol in front of Y indicates the absence of Y, shows that when X appears in a transaction, it is very unlikely that Y will appear in the same transaction.
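A rough sketch of how such a negative association might be detected follows. The data is invented, and a simple conditional-frequency test stands in for a full negative-association-mining algorithm:

```python
# Estimate how often tea appears among transactions that contain coffee.
transactions = [
    {"coffee", "sugar"}, {"coffee", "milk"}, {"coffee"},
    {"tea", "lemon"}, {"coffee", "tea"},
]

with_coffee = [t for t in transactions if "coffee" in t]
p_tea_given_coffee = sum(1 for t in with_coffee if "tea" in t) / len(with_coffee)

print(p_tea_given_coffee)        # 0.25
print(p_tea_given_coffee < 0.5)  # True: coffee -> ~tea is plausible here
```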
Other rules may indicate an exception, referring to a pattern that contradicts the common belief or practice. Therefore, pattern X → Y is an exception if it is uncommon to see X and Y appear together. In other words, it is common to see X or Y occur just by itself without the other one.
Regardless of the criteria that are used to produce the patterns, the patterns can
be produced only after analyzing the data globally. This approach has the greatest
potential, since it provides information that is not accessible in any other way. On
the contrary, database querying relies on some directions or inputs given by the
user in order to retrieve suitable records from the database.
Definition 16.2 (sequential patterns vs. database querying): Given a database D, a sequential pattern Sp(D) = O: X → Y, where O indicates the owner of a transaction and X, Y ∈ D. A query Q(D, X, Y) = O, or Q(D, aggr) = O, where aggr indicates some aggregate functions.
Given a set of database transactions, where each transaction involves one customer and possibly many items, an example of a sequential pattern is one in which a customer who bought item X previously will later come back after some allowable period of time to buy item Y. Hence, O: X → Y, where O refers to the customer set.
If this were a query, the query could possibly request "Retrieve customers who have bought a minimum of two different items at different times." The results will not show any patterns, but merely a collection of records. Even if the query were rewritten as "Retrieve customers who have bought items X and Y at different times," it would work only if items X and Y are known a priori. The sequential pattern O: X → Y obviously requires a number of processing steps in order to produce such a rule, and each step might involve several queries, including the query mentioned above.
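The multi-step nature of the task can be hinted at with a naive sketch that checks a single candidate sequence X → Y directly; a real sequential pattern miner would instead enumerate candidates over all item pairs. The data layout and field names are invented:

```python
# Customers who bought x and then, strictly later, bought y.
purchases = [  # (customer, time, item)
    ("c1", 1, "camera"), ("c1", 5, "tripod"),
    ("c2", 2, "camera"), ("c2", 9, "tripod"),
    ("c3", 3, "camera"),
]

def customers_with_sequence(data, x, y):
    found = set()
    for cust, t1, item1 in data:
        if item1 != x:
            continue
        for cust2, t2, item2 in data:
            if cust2 == cust and item2 == y and t2 > t1:
                found.add(cust)
    return found

print(customers_with_sequence(purchases, "camera", "tripod"))
# {'c1', 'c2'} (set order may vary)
```

Note that this is the *query* form: x and y are inputs. Mining the pattern means discovering that ("camera", "tripod") is worth checking in the first place.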
Definition 16.3 (clustering vs. database querying): Given database D, a clustering Cl(D) = Σ_{i=1..n} {X_i1, X_i2, ...} produces n clusters, each of which consists of a number of items X. A query Q(D, X_1) = {X_2, X_3, X_4, ...} produces a list of items {X_2, X_3, X_4, ...} that are in the same cluster as the given item X_1.
Given a movement database consisting of mobile users and their locations at a specific time, a cluster containing a list of mobile users {m1, m2, m3, ...} might indicate that they are moving together or are at a place together for a period of time. This shows that there is a cluster of users with the same characteristics, which in this case is the location.
On the contrary, a query is able to retrieve only those mobile users who are moving together, or are at a place at the same time for a period of time, with the given mobile user, say m1. So the query can be expressed as something like: "Which mobile users usually go with m1?" There are two issues here. One is whether or not the query can be answered directly, which depends on the data itself and whether there is explicit information about the question in the query. Second, the records to be retrieved are dependent on the given input.
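The input-dependence of the query can be illustrated with a sketch. The data is invented, and co-location at the same time stands in for "usually going with":

```python
# Query form: the target user m1 must be given as input.
observations = [  # (user, time, location)
    ("m1", 1, "mall"), ("m2", 1, "mall"), ("m3", 1, "park"),
    ("m1", 2, "cafe"), ("m2", 2, "cafe"), ("m3", 2, "park"),
]

def usually_with(data, target):
    """Users seen at the same place and time as `target`."""
    target_at = {(t, loc) for u, t, loc in data if u == target}
    return {u for u, t, loc in data
            if u != target and (t, loc) in target_at}

print(sorted(usually_with(observations, "m1")))  # ['m2']
```

A clustering algorithm, by contrast, would have to produce the groups {m1, m2} and {m3} with no target user supplied at all.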
Supervised Learning
Supervised learning is naturally the opposite of unsupervised learning, since super-
vised learning starts with a direction pointing to the target. For example, given a
list of top salesmen, a data miner would like to find the other properties that they
have in common. In this example, it starts with something, namely, a list of top
salesmen. This is different from unsupervised learning, which does not start with
any particular instances.
In data warehousing and OLAP, as explained in Chapter 15, we can use drill-down and rollup to find more detailed (or higher-level) information about a given record. However, these operations are still unable to formulate the desired properties or rules of the given input data. The mining process is complex enough that it looks not only at a particular category (e.g., top salesmen) but at all other categories as well. Database querying is not designed for this.
Definition 16.4 (decision tree classification vs. database querying): Given database D, a decision tree Dt(D, C) = P, where C is the given category and P is the resulting set of properties. A query Q(D, P) = R is one where the properties P are known in order to retrieve records R.
Continuing the above example, when mining all properties of a given category, we can also find other instances or members who possess the same properties. For example, find the properties of a good salesman, and then find who the good salesmen are. In database querying, the properties have to be given so that we can retrieve the names of the salesmen. But in data mining, and in particular decision tree classification, the task is to formulate such properties in the first place.
16.2.3 Parallelism in Data Mining
Like any other data-intensive applications, parallelism is used purely because of the
large size of data involved in the processing, with an expectation that parallelism
will speed up the process and therefore the elapsed time will be much reduced.
This is certainly still applicable to data mining. Additionally, the data in data mining often has high dimensionality (a large number of attributes), not only a large volume (a large number of records). Depending on how the data is structured, high-dimensional data is very common in data mining. Processing high-dimensional data introduces a degree of complexity not previously found in, or applicable to, databases or even data warehousing. Moreover, it is common in data mining that even a simple technique requires a number of iterations of the process, each of which refines the results until the ultimate results are generated.
Data mining is often needed to process complex data such as images, geograph-
ical data, scientific data, unstructured or semistructured documents, etc. Basically,
the data can be anything. This phenomenon is rather different from databases and
data warehouses, whose data follows a particular structure and model, such as
relational structure in relational databases or star schema or data cube in data
warehouses. The data in data mining is more flexible in terms of the structures,
as it is not confined to a relational structure only. As a result, the processing of
complex data also requires parallelism to speed up the process.
The other motivation is due to the widely available multiple processors or par-
allel computers. This makes the use of such a machine inevitable, not only for
data-intensive applications, but basically for any application.
The objectives of parallelism in data mining are not uniquely different from
those of parallel query processing in databases and data warehouses. Reducing
data mining time, in terms of speed up and scale up, is still the main objective.
However, since data mining processes and techniques might be considered much
more complex than query processing, parallelism of data mining is expected to
simplify the mining tasks as well. Furthermore, it is sometimes expected to produce
better mining results.
There are several forms of parallelism that are available for data mining. Chapter
1 described various forms of parallelism, including: interquery parallelism (paral-
lelism among queries), intraquery parallelism (parallelism within a query), intra-
operation parallelism (partitioned parallelism or data parallelism), interoperation
parallelism (pipelined parallelism and independent parallelism), and mixed paral-
lelism. In data mining, for simplicity purposes, parallelism exists in either
• Data parallelism or

• Result parallelism
If we look at the data mining process at a high level as a process that takes data
input and produces knowledge or patterns or models, data parallelism is where
parallelism is created due to the fragmentation of the input data, whereas result
parallelism focuses on the fragmentation of the results, not necessarily the input
data. More details about these two data mining parallelisms are given below.
Data Parallelism
In data parallelism, as the name states, parallelism is created because the data is partitioned across a number of processors and each processor focuses on its partition of the data set. After each processor completes its local processing and produces its local results, the final results are formed by combining all local results.
Since data mining processes normally involve several iterations, data parallelism raises some complexities. Every stage of the process requires an input and produces an output. In the first iteration, the input of the process in each processor is its local data partition, and after the first iteration completes, each processor will have produced its local results. The question is: What will the input be for the subsequent iterations? In many cases, the next iteration requires the global picture of the results from the immediately previous iteration. Therefore, the local results from each processor need to be reassembled globally. In other words, at the end of each iteration, a global reassembling stage to compile all local results is necessary before the subsequent iteration starts.
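The scheme can be sketched by simulating processors with plain Python data structures; a real system would run the partitions on separate processors and exchange the local counts over an interconnect:

```python
# Data parallelism with a global reassembling stage: each simulated
# "processor" counts items in its own partition, then local counts are
# merged into the global result needed before the next iteration.
from collections import Counter

transactions = [["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]

# Data partitioning: split the input among 2 (simulated) processors.
partitions = [transactions[:2], transactions[2:]]

# Each processor produces local results from its partition.
local_results = [Counter(item for t in part for item in t)
                 for part in partitions]

# Global reassembling: combine all local results.
global_result = sum(local_results, Counter())
print(dict(global_result))  # {'a': 3, 'b': 3, 'c': 3}
```

In association rule mining this merge step corresponds to the count-distribution synchronization that every iteration must perform before candidates for the next iteration can be generated.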
Figure 16.3 Data parallelism for data mining (diagram: the database is split by data partitioning into data partitions 1 to n, one per processor Proc 1 to Proc n; the 1st iteration produces local results 1 to n, which pass through a global re-assembling stage to form the global results after the first iteration; the 2nd iteration refines these into new local results, again globally re-assembled; and so on, until the k-th iteration produces the final results)
This situation is not that common in database query processing because, for a primitive database operation, even if there exist several stages of processing, each processor may not need to see other processors' results until the final results are ultimately generated.
Figure 16.3 illustrates how data parallelism is achieved in data mining. Note
that the global temporary result reassembling stage occurs between iterations. It is
clear that parallelism is driven by the database partitions.
Result Parallelism
Result parallelism focuses on how the target results, which are the output of
the processing, can be parallelized during the processing stage without having
produced any results or temporary results. This is exactly the opposite of data
parallelism, where parallelism is created because of the input data partitioning.
Data parallelism might be easier to grasp because the partitioning is done up
front, and then parallelism occurs. Result parallelism, on the other hand, works
by partitioning the target results, and each processor focuses on its target result
partition.
The way result parallelism works can be explained as follows. The target result
space is normally known in advance. The target result of an association rule min-
ing is frequent itemsets in a lexical order. Although we do not know the actual
instances of frequent itemsets before they are created, nevertheless, we should
know the range of the items, as they are confined by the itemsets of the input data. Therefore, result parallelism partitions the frequent itemset space into a number of partitions; for example, frequent itemsets starting with items A to I will be processed by processor 1, frequent itemsets starting with items H to N by the next processor, and so on. In classification mining, since the target categories are known, each target category can be assigned a processor.
Once the target result space has been partitioned, each processor will do what-
ever it takes to produce the result within the given range. Each processor will take
any input data necessary to produce the desired result space. Suppose that the ini-
tial data partition 1 is assigned to processor 1, and if this processor needs data
partitions from other processors in order to produce the desired target result space,
it will gather data partitions from other processors. The worst case would be one
where each processor needs the entire database to work with.
Because the target result space is already partitioned, there is no global tem-
porary result reassembling stage at the end of each iteration. The temporary local
results will be refined only in the next iteration, until ultimately the final results are
generated. Figure 16.4 illustrates result parallelism for data mining processes.
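A hedged sketch of result parallelism for frequent 2-itemsets follows, partitioning the result space by the starting letter of the first item. The data and ranges are invented, and a real miner would iterate over growing itemset sizes:

```python
# Each simulated processor produces only the frequent 2-itemsets whose
# first (lexically sorted) item falls in its assigned range, scanning
# whatever input data it needs.
from itertools import combinations

transactions = [{"apple", "milk"}, {"apple", "milk"}, {"milk", "tea"}]
min_support = 2

def frequent_pairs_in_range(txns, lo, hi):
    items = sorted({i for t in txns for i in t})
    result = []
    for pair in combinations(items, 2):
        if lo <= pair[0] <= hi:
            if sum(1 for t in txns if set(pair) <= t) >= min_support:
                result.append(pair)
    return result

# Processor 1 handles results starting 'a'-'i'; processor 2 'j'-'z'.
print(frequent_pairs_in_range(transactions, "a", "i"))  # [('apple', 'milk')]
print(frequent_pairs_in_range(transactions, "j", "z"))  # []
```

Because the result space is disjointly assigned, no global reassembling of temporary results between iterations is required; each processor simply refines its own partition.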
Contrasting with the parallelism that is normally adopted by database queries,
query parallelism to some degree follows both data and result parallelism. Data
parallelism is quite an obvious choice for parallelizing query processing. However,
result parallelism is inherently used as well. For example, in a disjoint partition-
ing parallel join, each processor receives a disjoint partition based on a certain
partitioning function. The join results of a processor will follow the assigned par-
titioning function. In other words, result parallelism is used. However, because
disjoint partitioning parallel join is already achieved by correctly partitioning the
input data, it is also said that data parallelism is utilized. Consequently, it has never
been necessary to distinguish between data and result parallelism.
The difference between these two parallelism models is highlighted in the data
mining processing because of the complexity of the mining process itself, where
there are multiple iterations of the entire process and the local results may need
to be refined in each iteration. Therefore, adopting a specific parallelism model
becomes necessary, thereby emphasizing the difference between the two paral-
lelism models.