Tài liệu Database Systems: The Complete Book- P13 docx

THÔNG TIN TÀI LIỆU

Thông tin cơ bản

Định dạng
Số trang	19
Dung lượng	1,28 MB

Nội dung

tells us that total sales of all Aardvark lnodels in all colors, over all time at all dealers is 198.000 cars for a total price of $3,521,727,000. Consider how to answer a query in \\-hich we specify conditions on certain attributes of the Sales relation and group by some other attributes, n-hile asking for the sum, count, or average price. In the relation are r sales), we look for those tuples t with the fo1lov;ing properties: 1. If the query specifies a value v for attribute a; then tuple t has v in its component for a. 2. If the query groups by an attribute a, then t has any non-* value in its conlponent for a. 3. If the query neither groups by attribute a nor specifies a value for a. then t has * in its component for a. Each tuple t has tlie sum and count for one of the desired groups. If n-e \%-ant the average price, a division is performed on the sum and count conlponents of each tuple t. Example 20.18 : The query SELECT color, AVG(price) FROM Sales WHERE model = 'Gobi' GROUP BY color; is ansn-ered by looking for all tuples of sales) ~vith the form ('Gobi', C. *, *, 21, n) here c is any specific color. In this tuple, v will be the sum of sales of Gobis in that color, while n will be the nlini!)cr of sales of Gobis in that color. Tlie average price. although not an attribute of Sales or sales) directly. is v/n. Tlie answer to the query is the set of (c, vln) pairs obtained fi-om all ('Gobi'. c, *, *. v. n) tuples. 20.5.2 Cube ImplementaOion by Materialized Views 11% suggested in Fig. 20.17 that adding aggregations to the cube doesn't cost much in tcrms of space. and saves a lot in time \vhen the common kincis of decision-support queries are asked. Ho~vever: our analysis is based on the as- sumption that queries choose either to aggregate completely in a dimension or not to aggregate at all. For some dime~isions. there are many degrees of granularity that could be chosen for a grouping on that dimension. Uc have already mentioned thc case of time. xvl-here numerolls options such as aggregation by weeks, months: quarters, or ycars exist,, in addition to the all-or-nothing choices of grouping by day or aggregating over all time. For another esanlple based on our running automobile database, Ive could choose to aggregate dealers completely or not aggregate them at all. Hon-ever, we could also choose to aggregate by city, by state, or perhaps by other regions, larger or smaller. Thus: there are at least sis choices of grouping for time and at least four for dealers. l\Tllen the number of choices for grouping along each dimension grows, it becomes increasingly expensive to store the results of aggregating by every possible conlbination of groupings. Sot only are there too many of them, but they are not as easily organized as the structure of Fig. 20.17 suggests for tlle all-or-nothing case. Thus, commercial data-cube systems may help the user to choose some n~aterialized views of the data cube. A materialized view is the result of some query, which we choose to store in the database, rather than reconstructing (parts of) it as needed in response to queries. For the data cube, the vie~vs we n-ould choose to materialize xi11 typically be aggregations of the full data cube. The coarser the partition implied by the grouping, the less space the materialized view takes. On the other hand, if ire ~vant to use a view to answer a certain query, then the view must not partition any dimension more coarsely than the query does. Thus, to maximize the utility of materialized views, we generally n-ant some large \-iers that group dimensions into a fairly fine partition. In addition, the choice of vien-s to materialize is heavily influenced by the kinds of queries that the analysts are likely to ask. .in example will suggest tlie tradeoffs in\-011-ed. INSERT INTO SalesVl SELECT model, color, month, city, SUM(va1) AS val, SUM(cnt) AS cnt FROM Sales JOIN Dealers ON dealer = name GROUP BY model, color, month, city; Figure 20.18: The materialized vien. SalesVl Example 20.19 : Let us return to the data cube Sales (model, color, date, dealer, val , cnt) that ne de\-eloped in Esample 20.17. One possible materialized vie\\- groups dates by nionth and dealers by city. This view. 1%-hich 1%-e call SalesV1, is constlucted by the query in Fig. 20.18. This query is not strict SQL. since n-e imagine that dates and their grouping units such as months are understood by the data-cube system n-ithout being told to join Sales with the imaginary relation rep~esenting dajs that \ve discussed in Example 20.14. CHAPTER 20. IiYFORI\IATIOAr IArTEGR.4TION 20.5. DdT.4 CUBES 1055 INSERT INTO SalesV2 SELECT model, week, state, SUM(va1) AS val, SUM(cnt) AS cnt FROM Sales JOIN Dealers ON dealer = name GROUP BY model, week, state; Figure 20.19: Another materialized view, SalesV2 Another possible materialized view aggregates colors completely, aggregates time into u-eeks, and dealers by states. This view, SalesV2, is defined by the query in Fig. 20.19. Either view SalesVl or SalesV2 can be used to ansn-er a query that partitions no more finely than either in any dimension. Thus, the query 41: SELECT model, SUM(va1) FROM Sales GROUP BY model; can be answered either by SELECT model, SUM(va1) FROM SalesVl GROUP BY model; SELECT model, SUM(va1) FROM SalesV2 GROUP BY model; On the other hand, the query 42: SELECT model, year, state, SUM(va1) FROM Sales JOIN Dealers ON dealer = name GROUP BY model, year, state; can on1 be ans\vered from SalesV1. as SELECT model, year, state, SUM(va1) FROM SalesVl GROUP BY model, year, state; Incidentally. the query inmediately above. like the qu'rics that nggregate time units, is not strict SQL. That is. state is not ari attribute of SalesVl: only city is. \Ye rmust assume that the data-cube systenl knol\-s how to perform the aggregation of cities into states, probably by accessing the dimension table for dealers. \Ye cannot answer Q2 from SalesV2. Although we could roll-up cities into states (i.e aggregate the cities into their states) to use SalesV1, we carrrlot roll-up ~veeks into years, since years are not evenly divided into weeks. and data from a week beginning. say, Dec. 29, 2001. contributes to years 2001 and 2002 in a way we carinot tell from the data aggregated by weeks. Finally, a query like 43: SELECT model, color, date, ~~~(val) FROM Sales GROUP BY model, color, date; can be anslvered from neither SalesVl nor SalesV2. It cannot be answered from Salesvl because its partition of days by ~nonths is too coarse to recover sales by day, and it cannot be ans~vered from SalesV2 because that view does not group by color. We would have to answer this query directly from the full data cube. 20.5.3 The Lattice of Views To formalize the cbservations of Example 20.10. it he!ps to think of a lattice of possibl~ groupings for each dimension of the cube. The points of the lattice are the ways that we can partition the ~alucs of a dimension by grouping according to one or more attributes of its dimension table. nB say that partition PI is belo~v partition P2. written PI 5 P2 if and only if each group of Pl is contained within some group of PZ. All Years / 1 I Quarters I Weeks Months Days Figure 20.20: A lattice of partitions for time inter\-als Example 20.20: For the lattice of time partitions n-e might choose the dia- gram of Fig. 20.20. -4 path from some node fi dotvn to PI means that PI 5 4. These are not the only possible units of time, but they \\-ill serve as an example of what units a s~stern might support. Sotice that daks lie below both \reeks and months, but weeks do not lie below months. The reason is that while a group of events that took place in one day surely took place within one \reek and within one month. it is not true that a group of events taking place in one week necessarily took place in any one month. Similarly, a week's group need not be contained within the group cor~esponding to one quarter or to one year. At tlie top is a partition we call "all," meaning that events are grouped into a single group; i.e we niake no distinctions among diffeient times. All I State I City I Dealer Figure 20.21: A lattice of partitions for automobile dealers Figure 20.21 shows another lattice, this time for the dealer dimension of our automobiles example. This lattice is siniplcr: it shows that partitioning sales by dealer gives a finer partition than partitioning by the city of the dealer. i<-hich is in turn finer than partitioning by tlie state of tlie dealer. The top of tlle ldrtice is the partition that places all dealers in one group. Having a lattice for each dimension, 15-12 can now define a lattice for all the possible materialized views of a data cube that can be formed by grouping according to some partition in each dimension. If 15 and 1% are two views formed by choosing a partition (grouping) for each dimension, then 1; 5 11 means that in each dimension, the partition Pl that ~ve use in 1; is at least as fine as the partition Pl that n.e use for that dimension in Ti; that is. Pl 5 P? Man) OLAP queries can also be placed in the lattice of views In fact. fie- quently an OLAP query has the same form as the views we have described: the query specifies some pa~titioning (possibly none or all) for each of the dimensions. Other OL.iP queiics involve tliis same soit of grouping, and then "slice tlie cube to focus 011 a subset of the data. as nas suggested by the diag~ani in Fig. 20.15. The general rule is. I\c can ansn-er a quciy Q using view 1- if and o~ily if 1- 5 Q. Example 20.21 : Figure 20.22 takes the vielvs and queries of Example 20.19 and places them in a lattice. Sotice that the Sales data cube itself is technically a view. corresponding to tlie finest possible partition along each climensio~l. As we observed in the original example, QI can be ans~vered from either SalesVl or Sales Figure 20.22: The lattice of views and queries from Example 20.19 SalesV2; of course it could also be answered froni the full data cube Sales, but there is no reason to want to do so if one of the other views is materialized. Q2 can be answered from either SalesVl or Sales, while Q3 can only be answered from Sales. Each of these relationships is expressed in Fig. 20.22 by the paths downxard from the queries to their supporting vie~vs. Placing queries in the lattice of views helps design data-cube databases. Some recently developed design tools for data-cube systems start with a set of queries that they regard as typical" of the application at hand. They then select a set of views to materialize so that each of these queries is above at least one of the riel\-s, preferably identical to it or very close (i.e., the query and the view use the same grouping in most of the dimensions). 20.5.4 Exercises for Section 20.5 Exercise 20.5.1 : IVhat is the ratio of the size of CCBE(F) to the size of F if fact table F has the follorving characteristics? * a) F has ten dimension attributes, each with ten different values. b) F has ten dimension attributes. each with two differcnt values. Exercise 20.5.2: Let us use the cube ~nBE(Sa1es) from Example 20.17, ~vhich was built from the relation Sales (model, color, date, dealer, val, cnt) Tcll I\-hat tuples of the cube n-e 15-ould use to answer tlle follon-ing queries: * a) Find the total sales of I~lue cars for each dealer. b) Find the total nurnber of green Gobis sold by dealer .'Smilin' Sally." c) Find the average number of Gobis sold on each day of March, 2002 by each dealer. 1088 CHAPTER 20. ISFORJlATIOS IXTEGRA4TIOS *! Exercise 20.5.3: In Exercise 20.4.1 lve spoke of PC-order data organized as a cube. If we are to apply the CCBE operator, we might find it convenient to break several dimensions more finely. For example, instead of one processor dimension, we might have one dimension for the type (e.g., AlID Duron or Pentium-IV), and another d~mension for the speed. Suggest a set of dimrnsions and dependent attributes that will allow us to obtain answers to a variety of useful aggregation queries. In particular, what role does the customer play? .Also, the price in Exercise 20.4.1 referred to the price of one macll~ne, while several identical machines could be ordered in a single tuple. What should the dependent attribute(s) be? Exercise 20.5.4 : What tuples of the cube from Exercise 20.5.3 would you use to answer the following queries? a) Find, for each processor speed, the total number of computers ordered in each month of the year 2002. b) List for each type of hard disk (e.g., SCSI or IDE) and eacli processor type the number of computers ordered. c) Find the average price of computers with 1500 megahertz processors for each month from Jan., 2001. ! Exercise 20.5.5 : The computers described in the cube of Exercise 20.5.3 do not include monitors. IVhat dimensions would you suggest to represent monitors? You may assume that the price of the monitor is included in the price of the computer. Exercise 20.5.6 : Suppose that a cube has 10 dimensions. and eacli dimension has 5 options for granularity of aggregation. including "no aggregation" and "aggregate fully.'' How many different views can we construct by clioosing a granularity in each dinlension? Exercise 20.5.7 : Show how to add the following time units to the lattice of Fig. 20.20: hours, minutes, seconds, fortnights (two-week periods). decades. and centuries. Exercise 20.5.8: How 15-onld you change the dealer lattice of Fig. 20.21 to include -regions." ~f: a) A region is a set of states. * b) Regions are not com~liensurate with states. but each city is in only one region. c) Regions are like area codes: each region is contained \vithin a state. some cities are in two or more regions. and some regions ha~e several cities. 20.6. DATA -111-YIA-G 1089 ! Exercise 20.5.9: In Exercise 20.5.3 ne designed a cube suitable for use ~vith the CCBE operator. Horn-ever. some of the dimensions could also be given a non- trivial lattice structure. In particular, the processor type could be organized by manufacturer (e g., SUT, Intel. .AND. llotorola). series (e.g SUN Ult~aSparc. Intel Pentium or Celeron. AlID rlthlon, or llotorola G-series), and model (e.g., Pentiuni-I\- or G4). a) Design tlie lattice of processor types following the examples described above. b) Define a view that groups processors by series, hard disks by type, and removable disks by speed, aggregating everything else. c) Define a view that groups processors by manufacturer, hard disks by speed. and aggregates everything else except memory size. d) Give esamples of qneries that can be ansn-ered from the view of (11) only, the vieiv of (c) only, both, and neither. *!! Exercise 20.5.10: If the fact table F to n-hicli n-e apply the CuBE operator is sparse (i.e there are inany fen-er tuples in F than the product of the number of possihle values along each dimension), then tlie ratio of the sizes of CCBE(F) and F can be very large. Hon large can it be? 20.6 Data Mining A family of database applications cal!ed data rnin,ing or knowledge discovery in dntnbases has captured considerable interest because of opportunities to learn surprising facts fro111 esisting databases. Data-mining queries can be thought of as an estended form of decision-support querx, although the distinction is in- formal (see the box on -Data-llining Queries and Decision-Support Queries"). Data nli11i11:. stresses both the cpcry-optimization and data-management com- ponents of a traditional database system, as 1%-ell as suggesting some important estensions to database languages, such as language primitix-es that support effi- cient sampling of data. In this section, we shall esamine the principal directions data-mining applications have taken. Me then focus on tlie problem called "fre- quc'iit iteinsets." n-hich has 1-eceiwd the most attention from the database point of view. 20.6.1 Data-Iblining Applications Broadly. data-mining queries ask for a useful summary of data, often ~vithout suggcstir~g the values of para~netcrs that would best yield such a summary. This family of problems thus requires rethinking the nay database systems are to be used to provide snch insights abo~it the data. Below are some of tlie applications and problems that are being addressed using very large amounts [...]... Data structure 503 Data type 292 See also UDT Data ~\-areho~ls' 9 See also %rehouse Database 2 Database address 579-580, 582 Database administrator 10 Database element 879, 957 Database management system 1 910 Database programming 1, 15 17 Database schema See Relational database schema Database state See State, of a database Data-definition language 10 292 See also ODL, Schema Datalog 463-502 Data-manipulation... 717, 733 735736, 765 822 879, 888 See also Database address Disk controller 517: 522 Disk crash See Media failure Disk failure 546-563 See also Disk crash Disk head 516 See also Head assembly Disk I/O 511.519-523.525-526,717, 840: 832 8.56 Disk scheduling 538 Disk striping Sce RAID Striping Diskette 519 See also Floppy disk DISTINCT 277, 279 429-430 Distributed database 1018-1035 Distributive law 218... lei-el -107-408 ISO\VG3 313 Iterator 720-723, 728, 733-734, 871 See also Pipelining Java 393 Java database connectivity See JDBC JDBC 349, 393-397 Join 112-113,192-193.254-255.270272, j05-506 See also antisemijoin, CROSS JOIN, Equijoin, Satural join, Sestedloop join, Outerjoin, Selectivity, of a join Semijoin, Theta-join, Zig-zag join Join ordcring 818, 8-17-859 Join tree 848 Joined tuple 198 Juke box 512... 664 Schedule 918, 923-924 See also Serial schedule, Serializable schedule Scheduler 917 932 934-936 951937.969.973-97.5.979-980 Schema 49 8.5, 167 173 504, 572 575 See also Database schema Global schema Relat~onschema, Relational database schema Star schema Schrleider, R 711 Sclinarz, P 916 1044 Scope, of names 269 Scrolling cursor 361 Search key 605-606 612 614 623 663 See also Ilash key Second nornlal... 716 719, 721, 861-862, 867-868 Tag 178 Tagged field 593 Tanaka H 785 Tape 512 Template 1058-1059 Tertiary memory 512-513 Tertiary storage 6 Thalhein~.B 60 T E 368 HN Theta-join 199-201 205, 220, 477, 731.796-799.802,805,819520 826-827 Theta-outerjoin 229 Third norrnal form See 3 S F Thomas R H 1045 Tliomasian -1 988 Thrashing 766 3 S F 114-116 124-125 Three-valued logic 249-2.51 Thuraisingliam B 988... 652-656 Estensible markup language See SAIL Cstensional predicate 469 Estent 131-152.170 Estractor See \\lapper Faloutsos, C 663,712 Faulstich, L.C 188 Fayyad, U.1099 FD See Functional dependency Federated databases 1047,1049-1051 FETCH 356,361,389-390 Field 132.567,570.573 FIFO See First-in-first-out File 504,506,567 See also Sequential file File system 2 Filter 844,860-862,868 Filter, for a wrapper 1060-1061... Griffiths P.P 424 438-441 GROUP BY 277.280-284 Group co~nrnit996-997 Group niode 954-955, 961 Groupi~lg 221-226.279.727-728.737 740-741.747 751.755.771 773 780,806-808.834 See also GROUP BY Gulutzan P 314 Gunther 0 712 Gupta, A 237,785,1099 Guttman, A 712 H Haderle, D J 916,1044 Hadzilacos, 1 916,987 ' Haerder, T 916 Hall, P.A V 874 Hamilton, G 424 Hamming code 557,562 Hamming distance 562 Handle 386 Hapner,... Idempotence 230,891, 998 Identity 555 IDREF 183 I F 368 Imielinski, T 1099 Immediate constraint checking 323325 Immutable object 133 Impedence mismatch 350-351 I N 266-267,430 Inapplicable value 248 Incomplete transaction 889, 898 Increment lock 946-949 Incremental dump 910 Incremental update 1052 Index 12-13, 16, 295-300,318-319, 605-606,757-764, 1065 See also Bitmap index, B-tree, Clustering index,... 390-392 Binding parameters 392-393 Bit 572 Bit string 246 292 Bitmap indes 666 702-710 Blair, R H 502 Blasgen, 11 W 785, 916 BLOB See Binary large object Block 509 See also Disk block Block address See Database address Block header 576-577 Body 465 Boolean 292 Bosworth, A 1099 Bottom-up plan selection 843 Bound adornment See Adornment Branch and bound 844 B-tree 16, 609, 611, 632-648, 652, 665,670-671,674,762,963964,... 785 kd-tree 666 690-694 Iiedem, Z 988 Ice)- 17-51, 70 8-1-88 97 152-154 161, 316 See also Foreign key, Hash key Prirnary key, Search key Sort key Kim Mr 188 Iiitsurega~va \I 785 Iinon-ledge discovery in databases See Data mining Iinuth D E 604, 663 KO H.-P 988 Iiorth H F 988 Iiossma11 D 785 Iireps P 21 Iiriegel H.-P 711 Iiunlar I- 916 Iiumg H.-T 988 L Lampson B 366 1044 Larson J A 1099 Latency 519 535 . implied by the grouping, the less space the materialized view takes. On the other hand, if ire ~vant to use a view to answer a certain query, then the view. if the document has 1000 word occurrences, two of which are " ;database. " then the doculllent ~vould be placed at the ,002 coordinate in the

Ngày đăng: 26/01/2014, 15:20

Xem thêm