CHAPTER 20. INFORMATION INTEGRATION
it appears. Figure 20.17 suggests the process of adding a border to the cube in each dimension, to represent the * value and the aggregated values that it implies. In this figure we see three dimensions, with the lightest shading representing aggregates in one dimension, darker shading for aggregates over two dimensions, and the darkest cube in the corner for aggregation over all three dimensions. Notice that if the number of values along each dimension is reasonably large, but not so large that most points in the cube are unoccupied, then the "border" represents only a small addition to the volume of the cube (i.e., the number of tuples in the fact table). In that case, the size of the stored data CUBE(F) is not much greater than the size of F itself.
Figure 20.17: The cube operator augments a data cube with a border of aggregations in all combinations of dimensions
A tuple of the table CUBE(F) that has * in one or more dimensions will have, for each dependent attribute, the sum (or another aggregate function) of the values of that attribute in all the tuples that we can obtain by replacing the *'s by real values. In effect, we build into the data the result of aggregating along any set of dimensions. Notice, however, that the CUBE operator does not support aggregation at intermediate levels of granularity based on values in the dimension tables. For instance, we may either leave data broken down by day (or whatever the finest granularity for time is), or we may aggregate time completely, but we cannot, with the CUBE operator alone, aggregate by weeks, months, or years.
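As a concrete illustration of what CUBE(F) contains, here is a minimal sketch (not code from the book) that computes the cube of a tiny fact table held as Python tuples. For every subset of the dimension positions it replaces those dimensions by '*' and sums the dependent attribute; the case with no stars reproduces F itself.

```python
from itertools import combinations
from collections import defaultdict

def cube(facts, n_dims):
    """facts: list of tuples (d1, ..., dn, value). Returns a dict mapping
    every (possibly starred) dimension tuple to the summed value."""
    out = defaultdict(int)
    for row in facts:
        dims, value = row[:n_dims], row[n_dims]
        # Star out every subset of the dimension positions (including none).
        for k in range(n_dims + 1):
            for starred in combinations(range(n_dims), k):
                key = tuple('*' if i in starred else dims[i]
                            for i in range(n_dims))
                out[key] += value
    return dict(out)

# Invented sample data: (model, color, total price).
facts = [('Gobi', 'red', 45000), ('Gobi', 'blue', 30000),
         ('Bison', 'red', 60000)]
c = cube(facts, 2)
print(c[('Gobi', '*')])   # 75000: all Gobis, any color
print(c[('*', 'red')])    # 105000: all red cars
print(c[('*', '*')])      # 135000: everything
```

Note that a real system would also carry a count alongside the sum, for the reasons discussed in Example 20.17 below.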
Example 20.17: Let us reconsider the Aardvark database from Example 20.12 in the light of what the CUBE operator can give us. Recall the fact table from that example is

Sales(serialNo, date, dealer, price)

However, the dimension represented by serialNo is not well suited for the cube, since the serial number is a key for Sales. Thus, summing the price over all dates, or over all dealers, but keeping the serial number fixed has no effect; we would still get the "sum" for the one auto with that serial number. A more useful data cube would replace the serial number by the two attributes, model and color, to which the serial number connects Sales via the dimension table Autos.
Notice that if we replace serialNo by model and color, then the cube no longer has a key among its dimensions. Thus, an entry of the cube would have the total sales price for all automobiles of a given model, with a given color, by a given dealer, on a given date.
There is another change that is useful for the data-cube implementation of the Sales fact table. Since the CUBE operator normally sums dependent variables, and we might want to get average prices for sales in some category, we need both the sum of the prices for each category of automobiles (a given model of a given color sold on a given day by a given dealer) and the total number of sales in that category. Thus, the relation Sales to which we apply the CUBE operator is

Sales(model, color, date, dealer, val, cnt)
The attribute val is intended to be the total price of all automobiles for the given model, color, date, and dealer, while cnt is the total number of automobiles in that category. Notice that in this data cube, individual cars are not identified; they only affect the value and count for their category.

Now let us consider the relation CUBE(Sales). A hypothetical tuple that would be in both Sales and CUBE(Sales) is

('Gobi', 'red', '2001-05-21', 'Friendly Fred', 45000, 2)
The interpretation is that on May 21, 2001, dealer Friendly Fred sold two red Gobis for a total of $45,000. The tuple

('Gobi', *, '2001-05-21', 'Friendly Fred', 152000, 7)

says that on May 21, 2001, Friendly Fred sold seven Gobis of all colors, for a total price of $152,000. Note that this tuple is in CUBE(Sales) but not in Sales.

Relation CUBE(Sales) also contains tuples that represent the aggregation over more than one attribute. For instance,

('Gobi', *, '2001-05-21', *, 2348000, 100)

says that on May 21, 2001, there were 100 Gobis sold by all the dealers, and the total price of those Gobis was $2,348,000.

('Gobi', *, *, *, 1339800000, 58000)

says that over all time, dealers, and colors, 58,000 Gobis have been sold for a total price of $1,339,800,000. Lastly, the tuple

(*, *, *, *, 3521727000, 198000)

tells us that total sales of all Aardvark models in all colors, over all time, at all dealers is 198,000 cars for a total price of $3,521,727,000.
Consider how to answer a query in which we specify conditions on certain attributes of the Sales relation and group by some other attributes, while asking for the sum, count, or average price. In the relation CUBE(Sales), we look for those tuples t with the following properties:

1. If the query specifies a value v for attribute a, then tuple t has v in its component for a.

2. If the query groups by an attribute a, then t has any non-* value in its component for a.

3. If the query neither groups by attribute a nor specifies a value for a, then t has * in its component for a.
Each tuple t has the sum and count for one of the desired groups. If we want the average price, a division is performed on the sum and count components of each tuple t.
Example 20.18: The query

SELECT color, AVG(price)
FROM Sales
WHERE model = 'Gobi'
GROUP BY color;

is answered by looking for all tuples of CUBE(Sales) with the form

('Gobi', c, *, *, v, n)

where c is any specific color. In this tuple, v will be the sum of sales of Gobis in that color, while n will be the number of sales of Gobis in that color. The average price, although not an attribute of Sales or CUBE(Sales) directly, is v/n. The answer to the query is the set of (c, v/n) pairs obtained from all ('Gobi', c, *, *, v, n) tuples.
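The three rules above can be sketched in code. This is a hypothetical illustration: the cube tuples below are invented, and the function hard-codes the shape of the Example 20.18 query (model specified, color grouped, date and dealer aggregated).

```python
# Hypothetical CUBE(Sales) tuples: (model, color, date, dealer, val, cnt).
cube_tuples = [
    ('Gobi', 'red',  '*', '*',  90000, 2),
    ('Gobi', 'blue', '*', '*',  60000, 3),
    ('Gobi', '*',    '*', '*', 150000, 5),
    ('*',    'red',  '*', '*', 200000, 6),
]

def avg_price_by_color(tuples, model):
    """Rule 1: model has the specified value. Rule 2: color is grouped,
    so it must be any non-* value. Rule 3: date and dealer are neither
    grouped nor specified, so they must be *."""
    answer = {}
    for m, c, d, dl, val, cnt in tuples:
        if m == model and c != '*' and d == '*' and dl == '*':
            answer[c] = val / cnt   # average price = sum / count
    return answer

print(avg_price_by_color(cube_tuples, 'Gobi'))
# {'red': 45000.0, 'blue': 20000.0}
```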
20.5.2 Cube Implementation by Materialized Views
We suggested in Fig. 20.17 that adding aggregations to the cube doesn't cost much in terms of space, and saves a lot in time when the common kinds of decision-support queries are asked. However, our analysis is based on the assumption that queries choose either to aggregate completely in a dimension or not to aggregate at all. For some dimensions, there are many degrees of granularity that could be chosen for a grouping on that dimension.

We have already mentioned the case of time, where numerous options such as aggregation by weeks, months, quarters, or years exist, in addition to the all-or-nothing choices of grouping by day or aggregating over all time. For another example based on our running automobile database, we could choose to aggregate dealers completely or not aggregate them at all. However, we could also choose to aggregate by city, by state, or perhaps by other regions, larger or smaller. Thus, there are at least six choices of grouping for time and at least four for dealers.

When the number of choices for grouping along each dimension grows, it becomes increasingly expensive to store the results of aggregating by every possible combination of groupings. Not only are there too many of them, but they are not as easily organized as the structure of Fig. 20.17 suggests for the all-or-nothing case. Thus, commercial data-cube systems may help the user to choose some materialized views of the data cube. A materialized view is the result of some query, which we choose to store in the database, rather than reconstructing (parts of) it as needed in response to queries. For the data cube, the views we would choose to materialize will typically be aggregations of the full data cube.

The coarser the partition implied by the grouping, the less space the materialized view takes. On the other hand, if we want to use a view to answer a certain query, then the view must not partition any dimension more coarsely than the query does. Thus, to maximize the utility of materialized views, we generally want some large views that group dimensions into a fairly fine partition. In addition, the choice of views to materialize is heavily influenced by the kinds of queries that the analysts are likely to ask. An example will suggest the tradeoffs involved.
INSERT INTO SalesV1
SELECT model, color, month, city,
       SUM(val) AS val, SUM(cnt) AS cnt
FROM Sales JOIN Dealers ON dealer = name
GROUP BY model, color, month, city;

Figure 20.18: The materialized view SalesV1
Example 20.19: Let us return to the data cube

Sales(model, color, date, dealer, val, cnt)

that we developed in Example 20.17. One possible materialized view groups dates by month and dealers by city. This view, which we call SalesV1, is constructed by the query in Fig. 20.18. This query is not strict SQL, since we imagine that dates and their grouping units such as months are understood by the data-cube system without being told to join Sales with the imaginary relation representing days that we discussed in Example 20.14.
INSERT INTO SalesV2
SELECT model, week, state,
       SUM(val) AS val, SUM(cnt) AS cnt
FROM Sales JOIN Dealers ON dealer = name
GROUP BY model, week, state;

Figure 20.19: Another materialized view, SalesV2
Another possible materialized view aggregates colors completely, aggregates time into weeks, and dealers by states. This view, SalesV2, is defined by the query in Fig. 20.19. Either view SalesV1 or SalesV2 can be used to answer a query that partitions no more finely than either in any dimension. Thus, the query

Q1: SELECT model, SUM(val)
    FROM Sales
    GROUP BY model;

can be answered either by

SELECT model, SUM(val)
FROM SalesV1
GROUP BY model;

or by

SELECT model, SUM(val)
FROM SalesV2
GROUP BY model;
On the other hand, the query

Q2: SELECT model, year, state, SUM(val)
    FROM Sales JOIN Dealers ON dealer = name
    GROUP BY model, year, state;

can only be answered from SalesV1, as

SELECT model, year, state, SUM(val)
FROM SalesV1
GROUP BY model, year, state;
Incidentally, the query immediately above, like the queries that aggregate time units, is not strict SQL. That is, state is not an attribute of SalesV1; only city is. We must assume that the data-cube system knows how to perform the aggregation of cities into states, probably by accessing the dimension table for dealers.

We cannot answer Q2 from SalesV2. Although we could roll up cities into states (i.e., aggregate the cities into their states) to use SalesV1, we cannot roll up weeks into years, since years are not evenly divided into weeks, and data from a week beginning, say, Dec. 29, 2001, contributes to years 2001 and 2002 in a way we cannot tell from the data aggregated by weeks.
Finally, a query like

Q3: SELECT model, color, date, SUM(val)
    FROM Sales
    GROUP BY model, color, date;

can be answered from neither SalesV1 nor SalesV2. It cannot be answered from SalesV1 because its partition of days by months is too coarse to recover sales by day, and it cannot be answered from SalesV2 because that view does not group by color. We would have to answer this query directly from the full data cube.
20.5.3 The Lattice of Views
To formalize the observations of Example 20.19, it helps to think of a lattice of possible groupings for each dimension of the cube. The points of the lattice are the ways that we can partition the values of a dimension by grouping according to one or more attributes of its dimension table. We say that partition P1 is below partition P2, written P1 ≤ P2, if and only if each group of P1 is contained within some group of P2.
         All
        /   \
    Years   Weeks
      |       |
  Quarters    |
      |       |
   Months     |
       \     /
        Days

Figure 20.20: A lattice of partitions for time intervals
Example 20.20: For the lattice of time partitions we might choose the diagram of Fig. 20.20. A path from some node P2 down to P1 means that P1 ≤ P2. These are not the only possible units of time, but they will serve as an example of what units a system might support. Notice that days lie below both weeks and months, but weeks do not lie below months. The reason is that while a group of events that took place in one day surely took place within one week and within one month, it is not true that a group of events taking place in one week necessarily took place in any one month. Similarly, a week's group need not be contained within the group corresponding to one quarter or to one year. At the top is a partition we call "all," meaning that events are grouped into a single group; i.e., we make no distinctions among different times.
   All
    |
  State
    |
   City
    |
  Dealer

Figure 20.21: A lattice of partitions for automobile dealers
Figure 20.21 shows another lattice, this time for the dealer dimension of our automobiles example. This lattice is simpler; it shows that partitioning sales by dealer gives a finer partition than partitioning by the city of the dealer, which is in turn finer than partitioning by the state of the dealer. The top of the lattice is the partition that places all dealers in one group.
Having a lattice for each dimension, we can now define a lattice for all the possible materialized views of a data cube that can be formed by grouping according to some partition in each dimension. If V1 and V2 are two views formed by choosing a partition (grouping) for each dimension, then V1 ≤ V2 means that in each dimension, the partition P1 that we use in V1 is at least as fine as the partition P2 that we use for that dimension in V2; that is, P1 ≤ P2.
Many OLAP queries can also be placed in the lattice of views. In fact, frequently an OLAP query has the same form as the views we have described: the query specifies some partitioning (possibly none or all) for each of the dimensions. Other OLAP queries involve this same sort of grouping, and then "slice" the cube to focus on a subset of the data, as was suggested by the diagram in Fig. 20.15. The general rule is:

    We can answer a query Q using view V if and only if V ≤ Q.
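The partition-lattice test can be sketched in code. The sketch below is an illustration under assumptions of my own: each per-dimension lattice is encoded as a graph whose edges point from a partition to the next coarser ones (following Figs. 20.20 and 20.21), only the time and dealer dimensions are modeled, and the names (sales_v1, q2, etc.) echo Example 20.19.

```python
# Edge direction: finer partition -> immediately coarser partitions.
TIME = {'day': ['week', 'month'], 'week': ['all'], 'month': ['quarter'],
        'quarter': ['year'], 'year': ['all'], 'all': []}
DEALER = {'dealer': ['city'], 'city': ['state'], 'state': ['all'], 'all': []}

def coarser_or_equal(lattice, p1, p2):
    """True iff p1 <= p2, i.e., p2 is reachable from p1 going coarser."""
    seen, stack = set(), [p1]
    while stack:
        p = stack.pop()
        if p == p2:
            return True
        if p not in seen:
            seen.add(p)
            stack.extend(lattice[p])
    return False

def view_answers_query(view, query, lattices):
    """A view V can answer query Q iff V <= Q in every dimension."""
    return all(coarser_or_equal(lattices[d], view[d], query[d]) for d in view)

lattices = {'time': TIME, 'dealer': DEALER}
sales_v1 = {'time': 'month', 'dealer': 'city'}    # SalesV1 of Fig. 20.18
sales_v2 = {'time': 'week',  'dealer': 'state'}   # SalesV2 of Fig. 20.19
q2 = {'time': 'year', 'dealer': 'state'}          # query Q2
print(view_answers_query(sales_v1, q2, lattices))  # True
print(view_answers_query(sales_v2, q2, lattices))  # False: weeks don't roll up to years
```

The False result reproduces the earlier observation that weeks cannot be rolled up into years, because 'year' is not reachable from 'week' in the time lattice.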
Example 20.21: Figure 20.22 takes the views and queries of Example 20.19 and places them in a lattice. Notice that the Sales data cube itself is technically a view, corresponding to the finest possible partition along each dimension. As we observed in the original example, Q1 can be answered from either SalesV1 or SalesV2; of course it could also be answered from the full data cube Sales, but there is no reason to want to do so if one of the other views is materialized. Q2 can be answered from either SalesV1 or Sales, while Q3 can only be answered from Sales. Each of these relationships is expressed in Fig. 20.22 by the paths downward from the queries to their supporting views.

Figure 20.22: The lattice of views and queries from Example 20.19
Placing queries in the lattice of views helps design data-cube databases. Some recently developed design tools for data-cube systems start with a set of queries that they regard as "typical" of the application at hand. They then select a set of views to materialize so that each of these queries is above at least one of the views, preferably identical to it or very close (i.e., the query and the view use the same grouping in most of the dimensions).
20.5.4 Exercises for Section 20.5
Exercise 20.5.1: What is the ratio of the size of CUBE(F) to the size of F if fact table F has the following characteristics?

* a) F has ten dimension attributes, each with ten different values.

b) F has ten dimension attributes, each with two different values.
Exercise 20.5.2: Let us use the cube CUBE(Sales) from Example 20.17, which was built from the relation

Sales(model, color, date, dealer, val, cnt)

Tell what tuples of the cube we would use to answer the following queries:

* a) Find the total sales of blue cars for each dealer.

b) Find the total number of green Gobis sold by dealer "Smilin' Sally."

c) Find the average number of Gobis sold on each day of March, 2002 by each dealer.
*! Exercise 20.5.3: In Exercise 20.4.1 we spoke of PC-order data organized as a cube. If we are to apply the CUBE operator, we might find it convenient to break several dimensions more finely. For example, instead of one processor dimension, we might have one dimension for the type (e.g., AMD Duron or Pentium-IV), and another dimension for the speed. Suggest a set of dimensions and dependent attributes that will allow us to obtain answers to a variety of useful aggregation queries. In particular, what role does the customer play? Also, the price in Exercise 20.4.1 referred to the price of one machine, while several identical machines could be ordered in a single tuple. What should the dependent attribute(s) be?
Exercise 20.5.4: What tuples of the cube from Exercise 20.5.3 would you use to answer the following queries?

a) Find, for each processor speed, the total number of computers ordered in each month of the year 2002.

b) List for each type of hard disk (e.g., SCSI or IDE) and each processor type the number of computers ordered.

c) Find the average price of computers with 1500 megahertz processors for each month from Jan., 2001.
! Exercise 20.5.5: The computers described in the cube of Exercise 20.5.3 do not include monitors. What dimensions would you suggest to represent monitors? You may assume that the price of the monitor is included in the price of the computer.
Exercise 20.5.6: Suppose that a cube has 10 dimensions, and each dimension has 5 options for granularity of aggregation, including "no aggregation" and "aggregate fully." How many different views can we construct by choosing a granularity in each dimension?
Exercise 20.5.7: Show how to add the following time units to the lattice of Fig. 20.20: hours, minutes, seconds, fortnights (two-week periods), decades, and centuries.
Exercise 20.5.8: How would you change the dealer lattice of Fig. 20.21 to include "regions," if:

a) A region is a set of states.

* b) Regions are not commensurate with states, but each city is in only one region.

c) Regions are like area codes; each region is contained within a state, some cities are in two or more regions, and some regions have several cities.
! Exercise 20.5.9: In Exercise 20.5.3 we designed a cube suitable for use with the CUBE operator. However, some of the dimensions could also be given a nontrivial lattice structure. In particular, the processor type could be organized by manufacturer (e.g., Sun, Intel, AMD, Motorola), series (e.g., Sun UltraSparc, Intel Pentium or Celeron, AMD Athlon, or Motorola G-series), and model (e.g., Pentium-IV or G4).

a) Design the lattice of processor types following the examples described above.

b) Define a view that groups processors by series, hard disks by type, and removable disks by speed, aggregating everything else.

c) Define a view that groups processors by manufacturer, hard disks by speed, and aggregates everything else except memory size.

d) Give examples of queries that can be answered from the view of (b) only, the view of (c) only, both, and neither.
*!! Exercise 20.5.10: If the fact table F to which we apply the CUBE operator is sparse (i.e., there are many fewer tuples in F than the product of the number of possible values along each dimension), then the ratio of the sizes of CUBE(F) and F can be very large. How large can it be?
20.6 Data Mining

A family of database applications called data mining or knowledge discovery in databases has captured considerable interest because of opportunities to learn surprising facts from existing databases. Data-mining queries can be thought of as an extended form of decision-support query, although the distinction is informal (see the box on "Data-Mining Queries and Decision-Support Queries"). Data mining stresses both the query-optimization and data-management components of a traditional database system, as well as suggesting some important extensions to database languages, such as language primitives that support efficient sampling of data. In this section, we shall examine the principal directions data-mining applications have taken. We then focus on the problem called "frequent itemsets," which has received the most attention from the database point of view.
20.6.1 Data-Mining Applications

Broadly, data-mining queries ask for a useful summary of data, often without suggesting the values of parameters that would best yield such a summary. This family of problems thus requires rethinking the way database systems are to be used to provide such insights about the data. Below are some of the applications and problems that are being addressed using very large amounts
(stop words) such as "and" or "the," which tend to be present in all documents and tell us nothing about the content. A document is placed in this space according to the fraction of its word occurrences that are any particular word. For instance, if the document has 1000 word occurrences, two of which are "database," then the document would be placed at the .002 coordinate in the dimension corresponding to "database." By clustering documents in this space, we tend to get groups of documents that talk about the same thing. For instance, documents that talk about databases might have occurrences of words like "data," "query," "lock," and so on, while documents about baseball are unlikely to have occurrences of these words.

The data-mining problem here is to take the data and select the "means" or centers of the clusters. Often the number of clusters is given in advance, although that number may be selectable by the data-mining process as well. Either way, a naive algorithm for choosing the centers so that the average distance from a point to its nearest center is minimized involves many queries, each of which does a complex aggregation.
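The center-selection idea described above can be illustrated with a naive k-means sketch. This is not an algorithm from the book, just a minimal in-memory illustration on invented points of the alternation between "assign each point to its nearest center" and "move each center to the mean of its cluster."

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Naive k-means: points is a list of numeric tuples."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest current center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers

# Invented data: two obvious groups, one near (0,0) and one near (5,5).
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
print(sorted(kmeans(pts, 2)))   # one center near (0,0), one near (5,5)
```

A database implementation of the same loop is expensive precisely because each assignment pass is an aggregation over the whole data set.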
20.6.2 Finding Frequent Sets of Items

Now we shall see a data-mining problem for which algorithms using secondary storage effectively have been developed. The problem is most easily described in terms of its principal application: the analysis of market-basket data. Stores today often hold in a data warehouse a record of what customers have bought together. That is, a customer approaches the checkout with a "market basket" full of the items he or she has selected. The cash register records all of these items as part of a single transaction. Thus, even if we don't know anything about the customer, and we can't tell if the customer returns and buys additional items, we do know certain items that a single customer buys together.

If items appear together in market baskets more often than would be expected, then the store has an opportunity to learn something about how customers are likely to traverse the store. The items can be placed in the store so that customers will tend to take certain paths through the store, and attractive items can be placed along these paths.
Example 20.22: A famous example, which has been claimed by several people, is the discovery that people who buy diapers are unusually likely also to buy beer. Theories have been advanced for why that relationship is true, including the possibility that people who buy diapers, having a baby at home, are less likely to go out to a bar in the evening and therefore tend to drink beer at home. Stores may use the fact that many customers will walk through the store from where the diapers are to where the beer is, or vice versa. Clever marketers place beer and diapers near each other, with potato chips in the middle. The claim is that sales of all three items then increase.
We can represent market-basket data by a fact table:

Baskets(basket, item)

where the first attribute is a "basket ID," or unique identifier for a market basket, and the second attribute is the ID of some item found in that basket. Note that it is not essential for the relation to come from true market-basket data; it could be any relation from which we want to find associated items. For instance, the "baskets" could be documents and the "items" could be words, in which case we are really looking for words that appear in many documents together.
The simplest form of market-basket analysis searches for sets of items that frequently appear together in market baskets. The support for a set of items is the number of baskets in which all those items appear. The problem of finding frequent sets of items is to find, given a support threshold s, all those sets of items that have support at least s.

If the number of items in the database is large, then even if we restrict our attention to small sets, say pairs of items only, the time needed to count the support for all pairs of items is enormous. Thus, the straightforward way to solve even the frequent-pairs problem (compute the support for each pair of items i and j, as suggested by the SQL query in Fig. 20.24) will not work. This query involves joining Baskets with itself, grouping the resulting tuples by the two items found in that tuple, and throwing away groups where the number of baskets is below the support threshold s. Note that the condition I.item < J.item in the WHERE-clause is there to prevent the same pair from being considered in both orders, or for a "pair" consisting of the same item twice from being considered at all.
SELECT I.item, J.item, COUNT(I.basket)
FROM Baskets I, Baskets J
WHERE I.basket = J.basket AND I.item < J.item
GROUP BY I.item, J.item
HAVING COUNT(I.basket) >= s;

Figure 20.24: Naive way to find all high-support pairs of items
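The query of Fig. 20.24 can be run as-is against a small in-memory database. The sketch below uses Python's sqlite3 with a handful of invented baskets and a threshold s = 2; the Dealers table and real data volumes are of course absent.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE Baskets (basket INTEGER, item TEXT)')
# Invented sample data: three small baskets.
data = [(1, 'diapers'), (1, 'beer'),
        (2, 'diapers'), (2, 'beer'), (2, 'milk'),
        (3, 'milk'), (3, 'beer')]
con.executemany('INSERT INTO Baskets VALUES (?, ?)', data)

s = 2   # support threshold
rows = con.execute('''
    SELECT I.item, J.item, COUNT(I.basket)
    FROM Baskets I, Baskets J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= ?''', (s,)).fetchall()
print(sorted(rows))
# [('beer', 'diapers', 2), ('beer', 'milk', 2)]
```

The I.item < J.item condition is visible in the output: each pair appears once, in lexicographic order.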
20.6.3 The A-Priori Algorithm

There is an optimization that greatly reduces the running time of a query like Fig. 20.24 when the support threshold is sufficiently large that few pairs meet it. It is reasonable to set the threshold high, because a list of thousands or millions of pairs would not be very useful anyway; we want the data-mining query to focus our attention on a small number of the best candidates. The a-priori algorithm is based on the following observation:
Association Rules

A more complex type of market-basket mining searches for association rules of the form {i1, i2, ..., in} → j. Two possible properties that we might want in useful rules of this form are:

1. Confidence: the probability of finding item j in a basket that has all of {i1, i2, ..., in} is above a certain threshold, e.g., 50%; e.g., "at least 50% of the people who buy diapers buy beer."

2. Interest: the probability of finding item j in a basket that has all of {i1, i2, ..., in} is significantly higher or lower than the probability of finding j in a random basket. In statistical terms, j correlates with {i1, i2, ..., in}, either positively or negatively. The discovery in Example 20.22 was really that the rule {diapers} → beer has high interest.

Note that even if an association rule has high confidence or interest, it will tend not to be useful unless the set of items involved has high support. The reason is that if the support is low, then the number of instances of the rule is not large, which limits the benefit of a strategy that exploits the rule.
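The three measures in the box (support, confidence, interest) can be computed directly, treating each basket as a set. The baskets below are invented, and interest is taken here as the difference between the conditional and unconditional probability of j, which is one simple way to read "significantly higher or lower."

```python
# Invented sample baskets.
baskets = [{'diapers', 'beer'}, {'diapers', 'beer', 'milk'},
           {'milk', 'beer'}, {'diapers'}, {'beer'}]

def support(itemset):
    """Number of baskets containing every item of itemset."""
    return sum(1 for b in baskets if itemset <= b)

def confidence(lhs, j):
    """P(j in basket | basket contains all of lhs)."""
    return support(lhs | {j}) / support(lhs)

def interest(lhs, j):
    """Conditional probability of j minus its unconditional probability."""
    return confidence(lhs, j) - support({j}) / len(baskets)

print(support({'diapers', 'beer'}))      # 2
print(confidence({'diapers'}, 'beer'))   # 2/3: two of three diaper baskets
print(interest({'diapers'}, 'beer'))     # negative here: beer is so common
                                         # that diapers predict it poorly
```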
    If a set of items S has support s, then each subset of S must also have support at least s.

In particular, if a pair of items, say {i, j}, appears in, say, 1000 baskets, then we know there are at least 1000 baskets with item i, and we know there are at least 1000 baskets with item j.
The converse of the above rule is that if we are looking for pairs of items with support at least s, we may first eliminate from consideration any item that does not by itself appear in at least s baskets. The a-priori algorithm answers the same query as Fig. 20.24 by:

1. First finding the set of candidate items (those that appear in a sufficient number of baskets by themselves), and then

2. Running the query of Fig. 20.24 on only the candidate items.

The a-priori algorithm is thus summarized by the sequence of two SQL queries in Fig. 20.25. It first computes Candidates, the subset of the Baskets relation whose items have high support by themselves, then joins Candidates with itself, as in the naive algorithm of Fig. 20.24.
INSERT INTO Candidates
SELECT *
FROM Baskets
WHERE item IN (
    SELECT item
    FROM Baskets
    GROUP BY item
    HAVING COUNT(*) >= s
);

SELECT I.item, J.item, COUNT(I.basket)
FROM Candidates I, Candidates J
WHERE I.basket = J.basket AND I.item < J.item
GROUP BY I.item, J.item
HAVING COUNT(*) >= s;

Figure 20.25: The a-priori algorithm first finds frequent items before finding frequent pairs
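The same two-phase idea can be restated in plain Python, as a sketch rather than the SQL of Fig. 20.25: count single items first, keep only those meeting the threshold, and count pairs only over the surviving items.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """baskets: iterable of sets of items; s: support threshold.
    Returns {(i, j): support} for frequent pairs, with i < j."""
    # Phase 1: find candidate items, i.e., items frequent by themselves.
    item_counts = Counter(i for b in baskets for i in b)
    frequent = {i for i, c in item_counts.items() if c >= s}
    # Phase 2: count pairs, but only among candidate items.
    pair_counts = Counter()
    for b in baskets:
        kept = sorted(i for i in b if i in frequent)   # prune non-candidates
        pair_counts.update(combinations(kept, 2))
    return {p: c for p, c in pair_counts.items() if c >= s}

# Invented sample baskets.
baskets = [{'milk', 'coke', 'beer'}, {'milk', 'beer'},
           {'coke', 'juice'}, {'milk', 'beer', 'juice'}]
print(apriori_pairs(baskets, 3))   # {('beer', 'milk'): 3}
```

With s = 3, only milk and beer survive phase 1, so phase 2 counts a single pair instead of all six.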
Example 20.23: To get a feel for how the a-priori algorithm helps, consider a supermarket that sells 10,000 different items. Suppose that the average market basket has 20 items in it. Also assume that the database keeps 1,000,000 baskets as data (a small number compared with what would be stored in practice). Then the Baskets relation has 20,000,000 tuples, and the join in Fig. 20.24 (the naive algorithm) has 190,000,000 pairs. This figure represents one million baskets times (20 choose 2), which is 190, pairs of items. These 190,000,000 tuples must all be grouped and counted.

However, suppose that s is 10,000, i.e., 1% of the baskets. It is impossible that more than 20,000,000/10,000 = 2000 items appear in at least 10,000 baskets, because there are only 20,000,000 tuples in Baskets, and any item appearing in 10,000 baskets appears in at least 10,000 of those tuples. Thus, if we use the a-priori algorithm of Fig. 20.25, the subquery that finds the candidate items cannot produce more than 2000 items, and will probably produce many fewer than 2000.

We cannot be sure how large Candidates is, since in the worst case all the items that appear in Baskets will appear in at least 1% of them. However, in practice Candidates will be considerably smaller than Baskets, if the threshold s is high. For the sake of argument, suppose Candidates has on the average 10 items per basket; i.e., it is half the size of Baskets. Then the join of Candidates with itself in step (2) has 1,000,000 times (10 choose 2) = 45,000,000 tuples, less than 1/4 of the number of tuples in the join of Baskets with itself. We would thus expect the a-priori algorithm to run in about 1/4 the time of the naive
algorithm. In common situations, where Candidates has much less than half the tuples of Baskets, the improvement is even greater, since running time shrinks quadratically with the reduction in the number of tuples involved in the join.
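The arithmetic of Example 20.23 is easy to check directly (the basket counts and sizes are the example's own assumptions):

```python
from math import comb

baskets = 1_000_000
naive_tuples = baskets * comb(20, 2)      # join of Baskets with itself
apriori_tuples = baskets * comb(10, 2)    # join of Candidates with itself
print(comb(20, 2), comb(10, 2))           # 190 45
print(naive_tuples, apriori_tuples)       # 190000000 45000000
print(apriori_tuples / naive_tuples)      # about 0.24, i.e., roughly 1/4
```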
20.6.4 Exercises for Section 20.6
Exercise 20.6.1: Suppose we are given the eight "market baskets" of Fig. 20.26.

B1 = {milk, coke, beer}
B2 = {milk, pepsi, juice}
B3 = {milk, beer}
B4 = {coke, juice}
B5 = {milk, pepsi, beer}
B6 = {milk, beer, juice, pepsi}
B7 = {coke, beer, juice}
B8 = {beer, pepsi}

Figure 20.26: Example market-basket data
* a) As a percentage of the baskets, what is the support of the set {beer, juice}?

b) What is the support of the set {coke, pepsi}?

* c) What is the confidence of milk given beer (i.e., of the association rule {beer} → milk)?

d) What is the confidence of juice given milk?

e) What is the confidence of coke, given beer and juice?

* f) If the support threshold is 35% (i.e., 3 out of the eight baskets are needed), which pairs of items are frequent?

g) If the support threshold is 50%, which pairs of items are frequent?
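The support and confidence questions above can be checked mechanically. The helpers below are a hedged sketch (the function names and representation are ours, not the text's): support is the fraction of baskets containing every item of the set, and the confidence of {beer} → milk is the fraction of beer-containing baskets that also contain milk.

```python
# The eight baskets of Fig. 20.26.
B = [
    {"milk", "coke", "beer"},            # B1
    {"milk", "pepsi", "juice"},          # B2
    {"milk", "beer"},                    # B3
    {"coke", "juice"},                   # B4
    {"milk", "pepsi", "beer"},           # B5
    {"milk", "beer", "juice", "pepsi"},  # B6
    {"coke", "beer", "juice"},           # B7
    {"beer", "pepsi"},                   # B8
]

def support(itemset, baskets):
    """Fraction of baskets containing every item of itemset."""
    s = set(itemset)
    return sum(1 for b in baskets if s <= b) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Confidence of the rule antecedent -> consequent."""
    a = set(antecedent)
    with_a = [b for b in baskets if a <= b]
    return sum(1 for b in with_a if consequent in b) / len(with_a)

support({"beer", "juice"}, B)    # 0.25: only B6 and B7 contain both
confidence({"beer"}, "milk", B)  # 2/3: milk is in 4 of the 6 beer baskets
```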
! Exercise 20.6.2: The a-priori algorithm also may be used to find frequent sets of more than two items. Recall that a set S of k items cannot have support at least s unless every proper subset of S has support at least s. In particular, the subsets of S that are of size k-1 must all have support at least s. Thus, having found the frequent itemsets (those with support at least s) of size k-1, we can define the candidate sets of size k to be those sets of k items, all of whose subsets of size k-1 have support at least s. Write SQL queries that, given the frequent itemsets of size k-1, first compute the candidate sets of size k, and then compute the frequent sets of size k.
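The exercise asks for SQL; as a cross-check on the definitions, the same level-wise construction can be sketched in Python (the names below are ours): a candidate k-set is one all of whose (k-1)-subsets are frequent, and the frequent k-sets are the candidates that meet the support threshold.

```python
from itertools import combinations

def candidates_of_size_k(frequent_smaller, k):
    """Sets of k items, all of whose (k-1)-subsets are frequent.

    frequent_smaller: collection of frozensets, each of size k-1.
    """
    freq = set(frequent_smaller)
    # Only items that appear in some frequent (k-1)-set can matter.
    items = sorted(set().union(*freq)) if freq else []
    return {
        frozenset(c)
        for c in combinations(items, k)
        if all(frozenset(s) in freq for s in combinations(c, k - 1))
    }

def frequent_of_size_k(candidates, baskets, support):
    """Candidates occurring in at least `support` baskets."""
    return {c for c in candidates
            if sum(1 for b in baskets if c <= b) >= support}
```

Applied to the data of Fig. 20.26, these two functions reproduce the level-wise computation that Exercise 20.6.3 asks about.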
Exercise 20.6.3: Using the baskets of Exercise 20.6.1, answer the following:

a) If the support threshold is 35%, what is the set of candidate triples?

b) If the support threshold is 35%, what sets of triples are frequent?
20.7   Summary of Chapter 20
+ Integration of Information: Frequently, there exist a variety of databases or other information sources that contain related information. We have the opportunity to combine these sources into one. However, heterogeneities in the schemas often exist; these incompatibilities include differing types, codes or conventions for values, interpretations of concepts, and different sets of concepts represented in different schemas.

+ Approaches to Information Integration: Early approaches involved "federation," where each database would query the others in the terms understood by the second. More recent approaches involve warehousing, where data is translated to a global schema and copied to the warehouse. An alternative is mediation, where a virtual warehouse is created to allow queries to a global schema; the queries are then translated to the terms of the data sources.

+ Extractors and Wrappers: Warehousing and mediation require components at each source, called extractors and wrappers, respectively. A major function is to translate queries and results between the global schema and the local schema at the source.

+ Wrapper Generators: One approach to designing wrappers is to use templates, which describe how a query of a specific form is translated from the global schema to the local schema. These templates are tabulated and interpreted by a driver that tries to match queries to templates. The driver may also have the ability to combine templates in various ways, and/or perform additional work such as filtering, to answer more complex queries.

+ Capability-Based Optimization: The sources for a mediator often are able or willing to answer only limited forms of queries. Thus, the mediator must select a query plan based on the capabilities of its sources, before it can even think about optimizing the cost of query plans as conventional DBMS's do.

+ OLAP: An important application of data warehouses is the ability to ask complex queries that touch all or much of the data, at the same time that transaction processing is conducted at the data sources. These queries, which usually involve aggregation of data, are termed on-line analytic processing, or OLAP, queries.
+ ROLAP and MOLAP: It is frequently useful when building a warehouse for OLAP to think of the data as residing in a multidimensional space, with dimensions corresponding to independent aspects of the data represented. Systems that support such a view of data take either a relational point of view (ROLAP, or relational OLAP systems), or use the specialized data-cube model (MOLAP, or multidimensional OLAP systems).

+ Star Schemas: In a star schema, each data element (e.g., a sale of an item) is represented in one relation, called the fact table, while information helping to interpret the values along each dimension (e.g., what kind of product is item 1234?) is stored in a dimension table for each dimension.

+ The Cube Operator: A specialized operator called CUBE pre-aggregates the fact table along all subsets of dimensions. It may add little to the space needed by the fact table, and greatly increases the speed with which many OLAP queries can be answered.

+ Dimension Lattices and Materialized Views: A more powerful approach than the CUBE operator, used by some data-cube implementations, is to establish a lattice of granularities for aggregation along each dimension (e.g., different time units like days, months, and years). The warehouse is then designed by materializing certain views that aggregate in different ways along the different dimensions, and the view with the closest fit is used to answer a given query.

+ Data Mining: Warehouses are also used to ask broad questions that involve not only aggregating on command, as in OLAP queries, but searching for the "right" aggregation. Common types of data mining include clustering data into similar groups, designing decision trees to predict one attribute based on the value of others, and finding sets of items that occur together frequently.

+ The A-Priori Algorithm: An efficient way to find frequent itemsets is to use the a-priori algorithm. This technique exploits the fact that if a set occurs frequently, then so do all of its subsets.
20.8   References for Chapter 20

Recent surveys of warehousing and related technologies are in [9], [3], and [7]. Federated systems are surveyed in [12]. The concept of the mediator comes from [14].

Implementation of mediators and wrappers, especially the wrapper-generator approach, is covered in [5]. Capabilities-based optimization for mediators was explored in [11, 15].

The cube operator was proposed in [6]. The implementation of cubes by materialized views appeared in [8].

[4] is a survey of data-mining techniques, and [13] is an on-line survey of data mining. The a-priori algorithm was developed in [1] and [2].
1. R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," Proc. ACM SIGMOD Intl. Conf. on Management of Data (1993), pp. 207-216.

2. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," Proc. Intl. Conf. on Very Large Databases (1994), pp. 487-499.

3. S. Chaudhuri and U. Dayal, "An overview of data warehousing and OLAP technology," SIGMOD Record 26:1 (1997), pp. 65-74.

4. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park CA, 1996.

5. H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, V. Vassalos, J. D. Ullman, and J. Widom, "The TSIMMIS approach to mediation: data models and languages," J. Intelligent Information Systems 8:2 (1997), pp. 117-132.

6. J. N. Gray, A. Bosworth, A. Layman, and H. Pirahesh, "Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals," Proc. Intl. Conf. on Data Engineering (1996), pp. 152-159.

7. A. Gupta and I. S. Mumick, Materialized Views: Techniques, Implementations, and Applications, MIT Press, Cambridge MA, 1999.

8. V. Harinarayan, A. Rajaraman, and J. D. Ullman, "Implementing data cubes efficiently," Proc. ACM SIGMOD Intl. Conf. on Management of Data (1996), pp. 205-216.

9. D. Lomet and J. Widom (eds.), Special issue on materialized views and data warehouses, IEEE Data Engineering Bulletin 18:2 (1995).

10. Y. Papakonstantinou, H. Garcia-Molina, and J. Widom, "Object exchange across heterogeneous information sources," Proc. Intl. Conf. on Data Engineering (1995), pp. 251-260.

11. Y. Papakonstantinou, A. Gupta, and L. Haas, "Capabilities-based query rewriting in mediator systems," Conference on Parallel and Distributed Information Systems (1996). Available as:

12. A. P. Sheth and J. A. Larson, "Federated databases for managing distributed, heterogeneous, and autonomous databases," Computing Surveys 22:3 (1990), pp. 183-236.
rialized
view takes. On the other hand, if ire ~vant to use a view to answer a
certain query,
then the view. only a small addition to the volume of the cube
(i.e., the number of tuples in the fact table).
In
that case, the size of the stored
data
CCBE(F)