Tài liệu hạn chế xem trước, để xem đầy đủ mời bạn chọn Tải xuống
1
/ 19 trang
THÔNG TIN TÀI LIỆU
Thông tin cơ bản
Định dạng
Số trang
19
Dung lượng
1,28 MB
Nội dung
tells us that total sales of all Aardvark lnodels in all colors, over all time at all
dealers is 198.000 cars for
a
total price of $3,521,727,000.
Consider how to answer
a
query in \\-hich we specify conditions on certain
attributes of the Sales relation and group by
some other attributes, n-hile
asking for the sum, count, or average price. In the relation
are r sales),
we
look for those tuples
t
with the fo1lov;ing properties:
1. If the query specifies
a
value
v
for attribute
a;
then tuple
t
has
v
in its
component for
a.
2. If the query groups by an attribute
a,
then
t
has any non-* value in its
conlponent for
a.
3.
If the query neither groups by attribute
a
nor specifies a value for
a.
then
t
has
*
in its component for
a.
Each tuple
t
has tlie sum and count for one of the desired groups. If n-e \%-ant
the average price, a division is performed on the sum and count conlponents of
each tuple
t.
Example
20.18
:
The query
SELECT color,
AVG(price)
FROM Sales
WHERE model
=
'Gobi'
GROUP
BY
color;
is ansn-ered by looking for all tuples of
sales)
~vith the form
('Gobi',
C.
*,
*,
21,
n)
here
c
is any specific color. In this tuple,
v
will be the sum of sales of Gobis
in that color, while
n
will be the nlini!)cr of sales of Gobis in that color. Tlie
average price. although not an attribute of Sales or
sales)
directly. is
v/n.
Tlie answer to the query is the set of
(c,
vln)
pairs obtained fi-om all
('Gobi'.
c,
*,
*.
v.
n)
tuples.
20.5.2
Cube ImplementaOion
by
Materialized Views
11%
suggested in Fig. 20.17 that adding
aggregations
to the cube doesn't cost
much in tcrms of space. and saves a lot in time \vhen the common kincis of
decision-support queries are asked.
Ho~vever: our analysis is based on the as-
sumption that queries choose either to aggregate completely in a dimension
or not to aggregate at all. For some
dime~isions. there are many degrees of
granularity that could be chosen for a grouping on
that dimension.
Uc have already mentioned thc case of time. xvl-here numerolls options such
as aggregation by
weeks, months: quarters, or ycars exist,, in addition to the
all-or-nothing choices of grouping by day or aggregating over all time. For
another esanlple based on our running automobile database, Ive could choose
to aggregate dealers completely or not aggregate them at all.
Hon-ever, we could
also choose to aggregate by city, by state, or perhaps by other regions, larger
or smaller. Thus: there are at least
sis choices of grouping for time and at least
four for dealers.
l\Tllen the number of choices for grouping along each dimension grows, it
becomes increasingly expensive to store the results of aggregating by every
possible
conlbination of groupings. Sot only are there too many of them, but
they are not as easily organized as
the structure of Fig. 20.17 suggests for tlle
all-or-nothing case. Thus, commercial data-cube systems may help the user to
choose
some
n~aterialized
views
of the data cube.
A
materialized view is the
result of some query,
which we choose to store in the database, rather than
reconstructing (parts of) it as needed in response to queries. For the data cube,
the vie~vs we n-ould choose to materialize xi11 typically be aggregations of the
full data cube.
The coarser the partition implied by the grouping, the less space the mate-
rialized
view takes. On the other hand, if ire ~vant to use a view to answer a
certain query,
then the view must not partition any dimension more coarsely
than the query does. Thus, to
maximize the utility of materialized views, we
generally n-ant some large \-iers that group dimensions into a fairly fine parti-
tion. In addition, the choice of
vien-s to materialize is heavily influenced by the
kinds of
queries that the analysts are likely
to
ask.
.in
example will suggest tlie
tradeoffs in\-011-ed.
INSERT INTO SalesVl
SELECT model, color, month, city,
SUM(va1) AS val, SUM(cnt) AS cnt
FROM Sales JOIN Dealers
ON
dealer
=
name
GROUP
BY
model, color, month, city;
Figure 20.18: The materialized
vien. SalesVl
Example
20.19
:
Let us return to the data cube
Sales (model, color, date, dealer,
val
,
cnt)
that ne de\-eloped in Esample 20.17. One possible materialized vie\\- groups
dates by
nionth and dealers by city. This view. 1%-hich
1%-e
call SalesV1, is
constlucted
by
the query in Fig. 20.18. This query is not strict
SQL.
since n-e
imagine that dates and their grouping units such as months are understood
by the data-cube system n-ithout being told to join Sales with the imaginary
relation
rep~esenting dajs that \ve discussed in Example 20.14.
CHAPTER
20.
IiYFORI\IATIOAr IArTEGR.4TION
20.5.
DdT.4 CUBES
1055
INSERT INTO SalesV2
SELECT model, week, state,
SUM(va1) AS val, SUM(cnt) AS cnt
FROM Sales
JOIN Dealers
ON
dealer
=
name
GROUP
BY
model, week, state;
Figure 20.19: Another materialized view,
SalesV2
Another possible materialized view aggregates colors completely, aggregates
time into
u-eeks, and dealers by states. This view,
SalesV2,
is defined by the
query in Fig. 20.19. Either view
SalesVl
or
SalesV2
can be used to ansn-er a
query that partitions no more finely than either in any dimension. Thus, the
query
41:
SELECT model, SUM(va1)
FROM Sales
GROUP
BY
model;
can be answered either by
SELECT model, SUM(va1)
FROM SalesVl
GROUP
BY
model;
SELECT model,
SUM(va1)
FROM SalesV2
GROUP BY model;
On the other hand, the query
42: SELECT model, year, state, SUM(va1)
FROM Sales JOIN Dealers
ON
dealer
=
name
GROUP
BY
model, year, state;
can on1 be ans\vered from
SalesV1.
as
SELECT model, year, state, SUM(va1)
FROM SalesVl
GROUP
BY
model, year, state;
Incidentally. the query inmediately above. like the qu'rics that nggregate time
units, is not strict
SQL.
That is.
state
is not ari attribute of
SalesVl:
only
city
is. \Ye rmust assume that the data-cube systenl knol\-s how to perform the
aggregation of cities into states, probably by accessing the dimension table for
dealers.
\Ye
cannot answer Q2 from
SalesV2.
Although we could roll-up cities into
states
(i.e aggregate the cities into their states) to use
SalesV1,
we
carrrlot
roll-up ~veeks into years, since years are not evenly divided into weeks. and
data from a
week beginning. say, Dec.
29,
2001. contributes to years 2001 and
2002 in a way we
carinot tell from the data aggregated by weeks.
Finally, a query like
43:
SELECT model, color, date, ~~~(val)
FROM Sales
GROUP BY model, color, date;
can be anslvered from neither
SalesVl
nor
SalesV2.
It cannot be answered
from
Salesvl
because its partition of days by ~nonths is too coarse to recover
sales by day,
and it cannot be ans~vered from
SalesV2
because that view does
not group by color. We would have to answer this query directly from the full
data cube.
20.5.3
The Lattice
of
Views
To formalize the cbservations of Example 20.10. it he!ps to think of a lattice of
possibl~ groupings for each dimension of the cube. The points of the lattice are
the
ways that we can partition the ~alucs of a dimension
by
grouping according
to one or
more attributes of its dimension table.
nB
say that partition
PI
is
belo~v partition
P2.
written
PI
5
P2
if and only if each group of
Pl
is contained
within some group of
PZ.
All
Years
/
1
I
Quarters
I
Weeks Months
Days
Figure 20.20:
A
lattice of partitions for time inter\-als
Example
20.20:
For the lattice of time partitions n-e might choose the dia-
gram of Fig. 20.20.
-4
path from some node
fi
dotvn to
PI
means that
PI
5
4.
These are not the only possible units of time, but they
\\-ill
serve as an example
of what units a s~stern might support. Sotice that daks lie below both \reeks
and months, but weeks do not lie below months. The reason is that while a
group of events that took place in
one day surely took place within one \reek
and within one month. it is not true that a group of events taking place in one
week necessarily took place in any one month. Similarly, a week's group need
not be contained within the group
cor~esponding to one quarter or to one year.
At
tlie top is a partition we call "all," meaning that events are grouped into a
single group;
i.e we niake no distinctions among diffeient times.
All
I
State
I
City
I
Dealer
Figure 20.21:
A
lattice of partitions for automobile dealers
Figure 20.21
shows another lattice, this time for the dealer dimension of our
automobiles example. This lattice is siniplcr: it shows that partitioning sales
by
dealer gives a finer partition than partitioning by the city of the dealer. i<-hich is
in turn finer than partitioning by tlie state of tlie dealer.
The top of tlle ldrtice
is the partition that places all dealers in one group.
Having a lattice for each dimension,
15-12
can now define a lattice for all the
possible materialized
views of a data cube that can be formed by grouping
according to some partition
in each dimension. If
15
and
1%
are two views
formed by choosing a partition (grouping) for
each dimension, then
1;
5
11
means that in each dimension, the partition
Pl
that ~ve use in
1;
is at least as
fine as the partition
Pl
that n.e use for that dimension in
Ti;
that is.
Pl
5
P?
Man) OLAP queries can also be placed in the lattice of views
In
fact. fie-
quently an OLAP query has the same form as the views we have described: the
query specifies some pa~titioning (possibly none or all) for each of the dimen-
sions. Other
OL.iP queiics involve tliis same soit of grouping, and then "slice
tlie cube to
focus
011
a subset of the data. as nas suggested
by
the diag~ani in
Fig. 20.15.
The general rule is.
I\c can ansn-er a quciy
Q
using view
1-
if and o~ily if
1-
5
Q.
Example 20.21
:
Figure 20.22 takes the vielvs and queries of Example 20.19
and places them in a lattice.
Sotice that the Sales data cube itself is technically
a view. corresponding to tlie finest possible partition along each climensio~l. As
we observed in the original example,
QI
can be ans~vered from either SalesVl or
Sales
Figure 20.22: The lattice of
views and queries from Example 20.19
SalesV2; of course it could also be answered froni the full data cube Sales, but
there is
no reason to want to do so if one of the other views is materialized.
Q2
can be answered from either SalesVl or Sales, while
Q3
can only be answered
from Sales. Each of these relationships is expressed in Fig. 20.22 by the paths
downxard from the queries to their supporting vie~vs.
Placing queries in the lattice of views helps design data-cube databases.
Some recently developed design tools for data-cube
systems start with a set of
queries that they regard as
typical" of the application at hand. They then
select
a
set of views to materialize so that each of these queries is above at least
one of the
riel\-s, preferably identical to it or very close (i.e., the query and the
view use the same grouping in most of the dimensions).
20.5.4
Exercises
for
Section
20.5
Exercise 20.5.1
:
IVhat is the ratio of the size of CCBE(F) to the size of
F
if
fact table
F
has the follorving characteristics?
*
a)
F
has ten dimension attributes, each with ten different values.
b)
F
has ten dimension attributes. each with two differcnt values.
Exercise 20.5.2:
Let us use the cube ~nBE(Sa1es) from Example 20.17,
~vhich was built from the relation
Sales (model, color, date, dealer,
val,
cnt)
Tcll I\-hat tuples of the cube n-e 15-ould use to answer tlle follon-ing queries:
*
a) Find the total sales of I~lue cars for each dealer.
b) Find the total
nurnber of green Gobis sold by dealer .'Smilin' Sally."
c) Find the average number of
Gobis sold on each day of March, 2002 by
each dealer.
1088
CHAPTER
20.
ISFORJlATIOS IXTEGRA4TIOS
*!
Exercise
20.5.3:
In Exercise 20.4.1 lve spoke of PC-order data organized as
a cube. If we are to apply the
CCBE operator, we might find it convenient to
break several dimensions more finely. For example, instead of one processor
dimension,
we might have one dimension for the type (e.g., AlID Duron or
Pentium-IV), and another
d~mension for the speed. Suggest a set of dimrnsions
and dependent attributes that will allow us to obtain answers to a variety of
useful aggregation queries. In particular,
what role does the customer play?
.Also, the price in Exercise 20.4.1 referred to the price of one macll~ne, while
several identical machines could be ordered in a single tuple.
What should the
dependent
attribute(s) be?
Exercise
20.5.4
:
What tuples of the cube from Exercise 20.5.3 would you use
to answer the following queries?
a) Find, for each processor speed, the total number of computers ordered in
each month of the year 2002.
b) List for
each type of hard disk (e.g., SCSI or IDE) and eacli processor
type
the number of computers ordered.
c) Find the average price of computers with
1500 megahertz processors for
each month from Jan., 2001.
!
Exercise
20.5.5
:
The computers described in the cube of Exercise 20.5.3 do
not include monitors.
IVhat dimensions would you suggest to represent moni-
tors? You
may assume that the price of the monitor is included in the price of
the computer.
Exercise
20.5.6
:
Suppose that a cube has 10 dimensions. and eacli dimension
has
5
options for granularity of aggregation. including "no aggregation" and
"aggregate fully.''
How many different views can we construct by clioosing a
granularity in each
dinlension?
Exercise
20.5.7
:
Show how to add the following time units to the lattice of
Fig. 20.20: hours, minutes, seconds, fortnights
(two-week periods). decades.
and centuries.
Exercise
20.5.8:
How 15-onld you change the dealer lattice of Fig. 20.21 to
include
-regions." ~f:
a)
A
region is a set of states.
*
b) Regions are not com~liensurate with states. but each city is in only one
region.
c) Regions are like area codes: each region is contained
\vithin a state. some
cities are in
two or more regions. and some regions ha~e several cities.
20.6.
DATA
-111-YIA-G
1089
!
Exercise
20.5.9:
In Exercise 20.5.3 ne designed a cube suitable for use ~vith
the CCBE operator.
Horn-ever.
some of the dimensions could also be given a non-
trivial lattice structure. In particular, the processor type could be organized by
manufacturer (e
g., SUT, Intel. .AND. llotorola). series (e.g
SUN
Ult~aSparc.
Intel Pentium or Celeron. AlID rlthlon, or llotorola G-series), and model (e.g.,
Pentiuni-I\- or G4).
a) Design tlie lattice of processor types following the examples described
above.
b) Define a view that groups processors by series, hard disks by type, and
removable disks by speed, aggregating everything else.
c) Define a
view that groups processors by manufacturer, hard disks by
speed. and aggregates everything else except memory size.
d) Give esamples of
qneries that can be ansn-ered from the view of (11) only,
the
vieiv of (c) only, both, and neither.
*!!
Exercise
20.5.10:
If the fact table
F
to n-hicli n-e apply the
CuBE
operator is
sparse
(i.e there are inany fen-er tuples in
F
than the product of the number
of possihle values along each dimension), then tlie ratio of the sizes of CCBE(F)
and
F
can be very large. Hon large can it be?
20.6
Data
Mining
A
family of database applications cal!ed
data
rnin,ing
or
knowledge discovery
in
dntnbases
has captured considerable interest because of opportunities to learn
surprising facts
fro111 esisting databases. Data-mining queries can be thought
of as an estended
form of decision-support querx, although the distinction is in-
formal (see the
box on -Data-llining Queries and Decision-Support Queries").
Data
nli11i11:. stresses both the cpcry-optimization and data-management com-
ponents of a traditional database system, as
1%-ell as suggesting some important
estensions to database languages, such as language
primitix-es that support effi-
cient sampling of data. In
this section, we shall esamine the principal directions
data-mining applications have taken.
Me then focus on tlie problem called "fre-
quc'iit iteinsets." n-hich has 1-eceiwd the most attention from thedatabase point
of
view.
20.6.1
Data-Iblining Applications
Broadly. data-mining queries ask for a useful summary of data, often ~vithout
suggcstir~g the values of para~netcrs that would best yield such a summary.
This family of problems thus requires rethinking the nay database systems are
to be used to provide
snch insights abo~it the data. Below are some of tlie
applications
and problems that are being addressed using very large amounts
[...]... Data structure 503 Data type 292 See also UDT Data ~\-areho~ls' 9 See also %rehouse Database 2 Database address 579-580, 582 Database administrator 10 Database element 879, 957 Database management system 1 910 Database programming 1, 15 17 Database schema See Relational database schema Database state See State, of a database Data-definition language 10 292 See also ODL, Schema Datalog 463-502 Data-manipulation... 717, 733 735736, 765 822 879, 888 See also Database address Disk controller 517: 522 Disk crash See Media failure Disk failure 546-563 See also Disk crash Disk head 516 See also Head assembly Disk I/O 511.519-523.525-526,717, 840: 832 8.56 Disk scheduling 538 Disk striping Sce RAID Striping Diskette 519 See also Floppy disk DISTINCT 277, 279 429-430 Distributed database 1018-1035 Distributive law 218... lei-el -107-408 ISO\VG3 313 Iterator 720-723, 728, 733-734, 871 See also Pipelining Java 393 Java database connectivity See JDBC JDBC 349, 393-397 Join 112-113,192-193.254-255.270272, j05-506 See also antisemijoin, CROSS JOIN, Equijoin, Satural join, Sestedloop join, Outerjoin, Selectivity, of a join Semijoin, Theta-join, Zig-zag join Join ordcring 818, 8-17-859 Join tree 848 Joined tuple 198 Juke box 512... 664 Schedule 918, 923-924 See also Serial schedule, Serializable schedule Scheduler 917 932 934-936 951937.969.973-97.5.979-980 Schema 49 8.5, 167 173 504, 572 575 See also Database schema Global schema Relat~onschema, Relational database schema Star schema Schrleider, R 711 Sclinarz, P 916 1044 Scope, of names 269 Scrolling cursor 361 Search key 605-606 612 614 623 663 See also Ilash key Second nornlal... 716 719, 721, 861-862, 867-868 Tag 178 Tagged field 593 Tanaka H 785 Tape 512 Template 1058-1059 Tertiary memory 512-513 Tertiary storage 6 Thalhein~.B 60 T E 368 HN Theta-join 199-201 205, 220, 477, 731.796-799.802,805,819520 826-827 Theta-outerjoin 229 Third norrnal form See 3 S F Thomas R H 1045 Tliomasian -1 988 Thrashing 766 3 S F 114-116 124-125 Three-valued logic 249-2.51 Thuraisingliam B 988... 652-656 Estensible markup language See SAIL Cstensional predicate 469 Estent 131-152.170 Estractor See \\lapper Faloutsos, C 663,712 Faulstich, L.C 188 Fayyad, U.1099 FD See Functional dependency Federated databases 1047,1049-1051 FETCH 356,361,389-390 Field 132.567,570.573 FIFO See First-in-first-out File 504,506,567 See also Sequential file File system 2 Filter 844,860-862,868 Filter, for a wrapper 1060-1061... Griffiths P.P 424 438-441 GROUP BY 277.280-284 Group co~nrnit996-997 Group niode 954-955, 961 Groupi~lg 221-226.279.727-728.737 740-741.747 751.755.771 773 780,806-808.834 See also GROUP BY Gulutzan P 314 Gunther 0 712 Gupta, A 237,785,1099 Guttman, A 712 H Haderle, D J 916,1044 Hadzilacos, 1 916,987 ' Haerder, T 916 Hall, P.A V 874 Hamilton, G 424 Hamming code 557,562 Hamming distance 562 Handle 386 Hapner,... Idempotence 230,891, 998 Identity 555 IDREF 183 I F 368 Imielinski, T 1099 Immediate constraint checking 323325 Immutable object 133 Impedence mismatch 350-351 I N 266-267,430 Inapplicable value 248 Incomplete transaction 889, 898 Increment lock 946-949 Incremental dump 910 Incremental update 1052 Index 12-13, 16, 295-300,318-319, 605-606,757-764, 1065 See also Bitmap index, B-tree, Clustering index,... 390-392 Binding parameters 392-393 Bit 572 Bit string 246 292 Bitmap indes 666 702-710 Blair, R H 502 Blasgen, 11 W 785, 916 BLOB See Binary large object Block 509 See also Disk block Block address See Database address Block header 576-577 Body 465 Boolean 292 Bosworth, A 1099 Bottom-up plan selection 843 Bound adornment See Adornment Branch and bound 844 B-tree 16, 609, 611, 632-648, 652, 665,670-671,674,762,963964,... 785 kd-tree 666 690-694 Iiedem, Z 988 Ice)- 17-51, 70 8-1-88 97 152-154 161, 316 See also Foreign key, Hash key Prirnary key, Search key Sort key Kim Mr 188 Iiitsurega~va \I 785 Iinon-ledge discovery in databases See Data mining Iinuth D E 604, 663 KO H.-P 988 Iiorth H F 988 Iiossma11 D 785 Iireps P 21 Iiriegel H.-P 711 Iiunlar I- 916 Iiumg H.-T 988 L Lampson B 366 1044 Larson J A 1099 Latency 519 535 . implied by the grouping, the less space the mate-
rialized
view takes. On the other hand, if ire ~vant to use a view to answer a
certain query,
then the view. if the document has
1000 word occurrences, two of which
are " ;database. " then the doculllent ~vould be placed at the ,002 coordinate in
the