FUNDAMENTALS OF DATABASE SYSTEMS, Fourth Edition (Part 9)


25.7 DISTRIBUTED DATABASES IN ORACLE

In the client-server architecture, the Oracle database system is divided into two parts: (1) a front-end as the client portion, and (2) a back-end as the server portion. The client portion is the front-end database application that interacts with the user. The client has no data access responsibility and merely handles the requesting, processing, and presentation of data managed by the server. The server portion runs Oracle and handles the functions related to concurrent shared access. It accepts SQL and PL/SQL statements originating from client applications, processes them, and sends the results back to the client. Oracle client-server applications provide location transparency by making the location of data transparent to users; several features like views, synonyms, and procedures contribute to this. Global naming is achieved by using <TABLENAME@DATABASENAME> to refer to tables uniquely.

Oracle uses a two-phase commit protocol to deal with concurrent distributed transactions. The COMMIT statement triggers the two-phase commit mechanism. The RECO (recoverer) background process automatically resolves the outcome of those distributed transactions in which the commit was interrupted. The RECO of each local Oracle Server automatically commits or rolls back any "in-doubt" distributed transactions consistently on all involved nodes. For long-term failures, Oracle allows each local DBA to manually commit or roll back any in-doubt transactions and free up resources. Global consistency can be maintained by restoring the database at each site to a predetermined fixed point in the past.

Oracle's distributed database architecture is shown in Figure 25.9. A node in a distributed database system can act as a client, as a server, or both, depending on the situation. The figure shows two sites where databases called HQ (headquarters) and Sales are kept. For example, in the application shown running at the headquarters, for an SQL statement issued against local data (for example, DELETE FROM DEPT ...), the HQ computer acts as a server, whereas for a statement against remote data (for example, INSERT INTO EMP@SALES), the HQ computer acts as a client.
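The two-phase commit idea mentioned above can be illustrated with a minimal Python sketch. It is a generic textbook-style model, not Oracle's implementation: the Participant class, its can_commit flag, and the two_phase_commit function are hypothetical names introduced only for illustration, and failures, timeouts, and the RECO recovery process are not modeled.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    """Hypothetical stand-in for one site taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # assumption: decided in advance for the demo
        self.state = "active"

    def prepare(self):
        # Phase 1: the site either promises to commit or refuses.
        if self.can_commit:
            self.state = "prepared"  # the site is "in doubt" until the decision arrives
            return Vote.YES
        self.state = "aborted"
        return Vote.NO

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator logic: commit only if every participant votes YES."""
    votes = [p.prepare() for p in participants]           # phase 1: collect votes
    decision = "commit" if all(v is Vote.YES for v in votes) else "rollback"
    for p in participants:                                 # phase 2: broadcast decision
        if decision == "commit":
            p.commit()
        else:
            p.rollback()
    return decision


hq, sales = Participant("HQ"), Participant("SALES")
print(two_phase_commit([hq, sales]), hq.state, sales.state)
```

A participant that has voted YES but not yet heard the decision is exactly the "in-doubt" situation the text describes; in this sketch that state is labeled "prepared".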

All Oracle databases in a distributed database system (DDBS) use Oracle's networking software Net8 for interdatabase communication. Net8 allows databases to communicate across networks to support remote and distributed transactions. It packages SQL statements into one of the many communication protocols to facilitate client-to-server communication and then packages the results back similarly to the client. Each database has a unique global name provided by a hierarchical arrangement of network domain names that is prefixed to the database name to make it unique.

Oracle supports database links that define a one-way communication path from one Oracle database to another. For example,

CREATE DATABASE LINK sales.us.americas;

establishes a connection to the sales database in Figure 25.9 under the network domain us, which comes under the domain americas.
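To make the naming convention concrete, the following Python sketch splits a reference of the form OBJECT@DATABASE into its parts. It is plain string handling written for illustration; the function name parse_remote_reference and the returned dictionary keys are assumptions, and the sketch does not reproduce how Oracle itself resolves names or database links.

```python
def parse_remote_reference(ref):
    """Split 'object@database' into the object name and the parts of the database's global name."""
    obj, sep, db = ref.partition("@")
    if not sep or not obj or not db:
        raise ValueError(f"expected OBJECT@DATABASE, got {ref!r}")
    # The first component is the database name; the rest are network domains.
    db_name, *domains = db.split(".")
    return {"object": obj, "database": db_name, "domains": domains}


print(parse_remote_reference("EMP@SALES"))
print(parse_remote_reference("orders@sales.us.americas"))
```

For the second call, the database name is sales and the domain components are us and americas, matching the reading of the database link example above.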

Data in an Oracle DDBS can be replicated using snapshots or replicated master tables. Replication is provided at the following levels:

• Basic replication: Replicas of tables are managed for read-only access. For updates, data must be accessed at a single primary site.

[Figure: two database servers, HQ and Sales. An application at HQ runs one transaction: INSERT INTO EMP@SALES ...; DELETE FROM DEPT ...; SELECT ... FROM EMP@SALES; COMMIT;]

FIGURE 25.9 Oracle distributed database systems. Source: From Oracle (1997a).

• Advanced (symmetric) replication: This extends beyond basic replication by allowing applications to update table replicas throughout a replicated DDBS. Data can be read and updated at any site. This requires additional software called Oracle's advanced replication option.

A snapshot generates a copy of a part of the table by means of a query called the snapshot defining query. A simple snapshot definition looks like this:

CREATE SNAPSHOT sales.orders AS
SELECT * FROM sales.orders@hq.us.americas;

Oracle groups snapshots into refresh groups. By specifying a refresh interval, the snapshot is automatically refreshed periodically at that interval by up to ten Snapshot Refresh Processes (SNPs). If the defining query of a snapshot contains a DISTINCT or aggregate function, a GROUP BY or CONNECT BY clause, or join or set operations, the snapshot is termed a complex snapshot and requires additional processing. Oracle (up to version 7.3) also supports ROWID snapshots that are based on physical row identifiers of rows in the master table.

Heterogeneous Databases in Oracle. In a heterogeneous DDBS, at least one database is a non-Oracle system. Oracle Open Gateways provides access to a non-Oracle database from an Oracle server, which uses a database link to access data or to execute remote procedures in the non-Oracle system. The Open Gateways feature includes the following:

• Distributed transactions: Under the two-phase commit mechanism, transactions may span Oracle and non-Oracle systems.

• Transparent SQL access: SQL statements issued by an application are transparently transformed into SQL statements understood by the non-Oracle system.

• Pass-through SQL and stored procedures: An application can directly access a non-Oracle system using that system's version of SQL. Stored procedures in a non-Oracle SQL-based system are treated as if they were PL/SQL remote procedures.

• Global query optimization: Cardinality information, indexes, etc., at the non-Oracle system are accounted for by the Oracle Server query optimizer to perform global query optimization.

• Procedural access: Procedural systems like messaging or queuing systems are accessed by the Oracle server using PL/SQL remote procedure calls.

In addition to the above, data dictionary references are translated to make the non-Oracle data dictionary appear as a part of the Oracle Server's dictionary. Character set translations are done between national language character sets to connect multilingual databases.

Summary

In this chapter we provided an introduction to distributed databases. This is a very broad topic, and we discussed only some of the basic techniques used with distributed databases. We first discussed the reasons for distribution and the potential advantages of distributed databases over centralized systems. We also defined the concept of distribution transparency and the related concepts of fragmentation transparency and replication transparency. We discussed the design issues related to data fragmentation, replication, and distribution, and we distinguished between horizontal and vertical fragments of relations. We discussed the use of data replication to improve system reliability and availability. We categorized DDBMSs by using criteria such as degree of homogeneity of software modules and degree of local autonomy. We discussed the issues of federated database management in some detail, focusing on the needs of supporting various types of autonomies and dealing with semantic heterogeneity.

We illustrated some of the techniques used in distributed query processing, and discussed the cost of communication among sites, which is considered a major factor in distributed query optimization. We compared different techniques for executing joins and presented the semijoin technique for joining relations that reside on different sites. We briefly discussed the concurrency control and recovery techniques used in DDBMSs. We reviewed some of the additional problems that must be dealt with in a distributed environment that do not appear in a centralized environment.

We then discussed the client-server architecture concepts and related them to distributed databases, and we described some of the facilities in Oracle to support distributed databases.

Review Questions

25.1 What are the main reasons for and potential advantages of distributed databases?

25.2 What additional functions does a DDBMS have over a centralized DBMS?

25.3 What are the main software modules of a DDBMS? Discuss the main functions of each of these modules in the context of the client-server architecture.

25.4 What is a fragment of a relation? What are the main types of fragments? Why is fragmentation a useful concept in distributed database design?

25.5 Why is data replication useful in DDBMSs? What typical units of data are replicated?

25.6 What is meant by data allocation in distributed database design? What typical units of data are distributed over sites?

25.7 How is a horizontal partitioning of a relation specified? How can a relation be put back together from a complete horizontal partitioning?

25.8 How is a vertical partitioning of a relation specified? How can a relation be put back together from a complete vertical partitioning?

25.9 Discuss what is meant by the following terms: degree of homogeneity of a DDBMS, degree of local autonomy of a DDBMS, federated DBMS, distribution transparency, fragmentation transparency, replication transparency, multidatabase system.

25.10 Discuss the naming problem in distributed databases.

25.11 Discuss the different techniques for executing an equijoin of two files located at different sites. What main factors affect the cost of data transfer?

25.12 Discuss the semijoin method for executing an equijoin of two files located at different sites. Under what conditions is an equijoin strategy efficient?

25.13 Discuss the factors that affect query decomposition. How are guard conditions and attribute lists of fragments used during the query decomposition process?

25.14 How is the decomposition of an update request different from the decomposition of a query? How are guard conditions and attribute lists of fragments used during the decomposition of an update request?

25.15 Discuss the factors that do not appear in centralized systems that affect concurrency control and recovery in distributed systems.

25.16 Compare the primary site method with the primary copy method for distributed concurrency control. How does the use of backup sites affect each?

25.17 When are voting and elections used in distributed databases?

25.18 What are the software components in a client-server DDBMS? Compare the two-tier and three-tier client-server architectures.

a. For each employee in department 5, retrieve the employee name and the names of the employee's dependents.

b. Print the names of all employees who work in department 5 but who work on some project not controlled by department 5.

25.20 Consider the following relations:

BOOKS (Book#, Primary_author, Topic, Total_stock, $price)
BOOKSTORE (Store#, City, State, Zip, Inventory_value)
STOCK (Store#, Book#, Qty)

TOTAL_STOCK is the total number of books in stock, and INVENTORY_VALUE is the total inventory value for the store in dollars.

a. Give an example of two simple predicates that would be meaningful for the BOOKSTORE relation for horizontal partitioning.

b. How would a derived horizontal partitioning of STOCK be defined based on the partitioning of BOOKSTORE?

c. Show predicates by which BOOKS may be horizontally partitioned by topic.

d. Show how STOCK may be further partitioned from the partitions in (b) by adding the predicates in (c).

25.21 Consider a distributed database for a bookstore chain called National Books with 3 sites called EAST, MIDDLE, and WEST. The relation schemas are given in question 25.20. Consider that BOOKS are fragmented by $PRICE amounts into:

B1: BOOK1: up to $20
B2: BOOK2: from $20.01 to $50
B3: BOOK3: from $50.01 to $100
B4: BOOK4: $100.01 and above

Similarly, BOOKSTORES are divided by Zipcodes into:

S1: EAST: Zipcodes up to 35000
S2: MIDDLE: Zipcodes 35001 to 70000
S3: WEST: Zipcodes 70001 to 99999

Assume that STOCK is a derived fragment based on BOOKSTORE only.

a. Consider the query:

SELECT Book#, Total_stock
FROM Books
WHERE $price > 15 AND $price < 55;

Assume that fragments of BOOKSTORE are non-replicated and assigned based on region. Assume further that BOOKS are allocated as:

b. If the bookprice of BOOK# = 1234 is updated from $45 to $55 at site MIDDLE, what updates does that generate? Write in English and then in SQL.

c. Give an example query issued at WEST that will generate a subquery for MIDDLE.

d. Write a query involving selection and projection on the above relations and show two possible query trees that denote different ways of execution.

25.22 Consider that you have been asked to propose a database architecture in a large organization, General Motors, as an example, to consolidate all data including legacy databases (from Hierarchical and Network models, which are explained in Appendices C and D; no specific knowledge of these models is needed) as well as relational databases, which are geographically distributed so that global applications can be supported. Assume that alternative one is to keep all databases as they are, while alternative two is to first convert them to relational and then support the applications over a distributed integrated database.

a. Draw two schematic diagrams for the above alternatives showing the linkages among appropriate schemas. For alternative one, choose the approach of providing export schemas for each database and constructing unified schemas for each application.

b. List the steps one has to go through under each alternative from the present situation until global applications are viable.

c. Compare these from the issues of: (i) design time considerations, and (ii) runtime considerations.

Selected Bibliography

The textbooks by Ceri and Pelagatti (1984a) and Ozsu and Valduriez (1999) are devoted to distributed databases. Halsall (1996), Tanenbaum (1996), and Stallings (1997) are textbooks on data communications and computer networks. Comer (1997) discusses networks and internets. Dewire (1993) is a textbook on client-server computing. Ozsu et al. (1994) has a collection of papers on distributed object management.

Distributed database design has been addressed in terms of horizontal and vertical fragmentation, allocation, and replication. Ceri et al. (1982) defined the concept of minterm horizontal fragments. Ceri et al. (1983) developed an integer programming based optimization model for horizontal fragmentation and allocation. Navathe et al. (1984) developed algorithms for vertical fragmentation based on attribute affinity and showed a variety of contexts for vertical fragment allocation. Wilson and Navathe (1986) present an analytical model for optimal allocation of fragments. Elmasri et al. (1987) discuss fragmentation for the ECR model; Karlapalem et al. (1994) discuss issues for distributed design of object databases. Navathe et al. (1996) discuss mixed fragmentation by combining horizontal and vertical fragmentation; Karlapalem et al. (1996) present a model for redesign of distributed databases.

Distributed query processing, optimization, and decomposition are discussed in Hevner and Yao (1979), Kerschberg et al. (1982), Apers et al. (1983), Ceri and Pelagatti (1984), and Bodorick et al. (1992). Bernstein and Goodman (1981) discuss the theory behind semijoin processing. Wong (1983) discusses the use of relationships in relation fragmentation. Concurrency control and recovery schemes are discussed in Bernstein and Goodman (1981a). Kumar and Hsu (1998) have some articles related to recovery in distributed databases. Elections in distributed systems are discussed in Garcia-Molina (1982). Lamport (1978) discusses problems with generating unique timestamps in a distributed system.

A concurrency control technique for replicated data that is based on voting is presented by Thomas (1979). Gifford (1979) proposes the use of weighted voting, and Paris (1986) describes a method called voting with witnesses. Jajodia and Mutchler (1990) discuss dynamic voting. A technique called available copy is proposed by Bernstein and Goodman (1984), and one that uses the idea of a group is presented in El Abbadi and Toueg (1988). Other recent work that discusses replicated data includes Gladney (1989), Agrawal and El Abbadi (1990), El Abbadi and Toueg (1990), Kumar and Segev (1993), Mukkamala (1989), and Wolfson and Milo (1991). Bassiouni (1988) discusses optimistic protocols for DDB concurrency control. Garcia-Molina (1983) and Kumar and Stonebraker (1987) discuss techniques that use the semantics of the transactions. Distributed concurrency control techniques based on locking and distinguished copies are presented by Menasce et al. (1980) and Minoura and Wiederhold (1982). Obermarck (1982) presents algorithms for distributed deadlock detection.

A survey of recovery techniques in distributed systems is given by Kohler (1981). Reed (1983) discusses atomic actions on distributed data. A book edited by Bhargava (1987) presents various approaches and techniques for concurrency and reliability in distributed systems.

Federated database systems were first defined in McLeod and Heimbigner (1985). Techniques for schema integration in federated databases are presented by Elmasri et al. (1986), Batini et al. (1986), Hayne and Ram (1990), and Motro (1987). Elmagarmid and Helal (1988) and Gamal-Eldin et al. (1988) discuss the update problem in heterogeneous DDBSs. Heterogeneous distributed database issues are discussed in Hsiao and Kamel (1989). Sheth and Larson (1990) present an exhaustive survey of federated database management.

Recently, multidatabase systems and interoperability have become important topics. Techniques for dealing with semantic incompatibilities among multiple databases are examined in DeMichiel (1989), Siegel and Madnick (1991), Krishnamurthy et al. (1991), and Wang and Madnick (1989). Castano et al. (1998) present an excellent survey of techniques for analysis of schemas. Pitoura et al. (1995) discuss object orientation in multidatabase systems.

Transaction processing in multidatabases is discussed in Mehrotra et al. (1992), Georgakopoulos et al. (1991), Elmagarmid et al. (1990), and Breitbart et al. (1990), among others. Elmagarmid et al. (1992) discuss transaction processing for advanced applications, including engineering applications discussed in Heiler et al. (1992).

The workflow systems, which are becoming popular to manage information in complex organizations, use multilevel and nested transactions in conjunction with distributed databases. Weikum (1991) discusses multilevel transaction management. Alonso et al. (1997) discuss limitations of current workflow systems.

A number of experimental distributed DBMSs have been implemented. These include distributed INGRES (Epstein et al., 1978), DDTS (Devor and Weeldreyer, 1980), SDD-1 (Rothnie et al., 1980), System R* (Lindsay et al., 1984), SIRIUS-DELTA (Ferrier and Stangret, 1982), and MULTIBASE (Smith et al., 1981). The OMNIBASE system (Rusinkiewicz et al., 1988) and the Federated Information Base developed using the Candide data model (Navathe et al., 1994) are examples of federated DDBMSs. Pitoura et al. (1995) present a comparative survey of the federated database system prototypes. Most commercial DBMS vendors have products using the client-server approach and offer distributed versions of their systems. Some system issues concerning client-server DBMS architectures are discussed in Carey et al. (1991), DeWitt et al. (1990), and Wang and Rowe (1991). Khoshafian et al. (1992) discuss design issues for relational DBMSs in the client-server environment. Client-server management issues are discussed in many books, such as Zantinge and Adriaans (1996).

EMERGING TECHNOLOGIES


We now turn our attention to how databases are used and accessed from the Internet. Many electronic commerce (e-commerce) and other Internet applications provide Web interfaces to access information stored in one or more databases. These databases are often referred to as data sources. It is common to use two-tier and three-tier client-server architectures for Internet applications (see Section 2.5). In some cases, other variations of the client-server model are used. E-commerce and other Internet database applications are designed to interact with the user through Web interfaces that display Web pages. The common method of specifying the contents and formatting of Web pages is through the use of hyperlink documents. There are various languages for writing these documents, the most common being HTML (Hypertext Markup Language). Although HTML is widely used for formatting and structuring Web documents, it is not suitable for specifying structured data that is extracted from databases. A newer language, XML (Extensible Markup Language), has emerged as the standard for structuring and exchanging data over the Web. XML can be used to provide information about the structure and meaning of the data in the Web pages rather than just specifying how the Web pages are formatted for display on the screen. The formatting aspects are specified separately, for example, by using a formatting language such as XSL (Extensible Stylesheet Language).

This chapter describes the basics of accessing and exchanging information over the Internet. We start in Section 26.1 by discussing how traditional Web pages differ from structured databases, and discuss the differences between structured, semistructured, and unstructured data. Then in Section 26.2 we turn our attention to the XML standard and its tree-structured (hierarchical) data model. Section 26.3 discusses XML documents and the languages for specifying the structure of these documents, namely, XML DTD (Document Type Definition) and XML schema. Section 26.4 presents the various approaches for storing XML documents, whether in their native (text) format, in a compressed form, or in relational and other types of databases. Section 26.5 gives an overview of the languages proposed for querying XML data. Section 26.6 summarizes the chapter.

26.1 STRUCTURED, SEMISTRUCTURED, AND UNSTRUCTURED DATA

The information stored in databases is known as structured data because it is represented in a strict format. For example, each record in a relational database table (such as the EMPLOYEE table in Figure 5.6) follows the same format as the other records in that table. For structured data, it is common to carefully design the database using techniques such as those described in Chapters 3, 4, 7, 10, and 11 in order to create the database schema. The DBMS then checks to ensure that all data follows the structures and constraints specified in the schema.

However, not all data is collected and inserted into carefully designed structured databases. In some applications, data is collected in an ad-hoc manner before it is known how it will be stored and managed. This data may have a certain structure, but not all the information collected will have identical structure. Some attributes may be shared among the various entities, but other attributes may exist only in a few entities. Moreover, additional attributes can be introduced in some of the newer data items at any time, and there is no predefined schema. This type of data is known as semistructured data. A number of data models have been introduced for representing semistructured data, often based on using tree or graph data structures rather than the flat relational model structures.

A key difference between structured and semistructured data concerns how the schema constructs (such as the names of attributes, relationships, and entity types) are handled. In semistructured data, the schema information is mixed in with the data values, since each data object can have different attributes that are not known in advance. Hence, this type of data is sometimes referred to as self-describing data.

Consider the following example. We want to collect a list of bibliographic references related to a certain research project. Some of these may be books or technical reports, others may be research articles in journals or conference proceedings, and still others may refer to complete journal issues or conference proceedings. Clearly, each of these may have different attributes and different types of information. Even for the same type of reference (say, conference articles) we may have different information. For example, one article citation may be quite complete, with full information about author names, title, proceedings, page numbers, and so on, whereas another citation may not have all the information available. New types of bibliographic sources may appear in the future (for example, references to Web pages or to conference tutorials) and these may have new attributes that describe them.
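The bibliographic example can be made concrete with a small Python sketch. Each record below carries its own attribute names, which is what "self-describing" means here. All record values are placeholders invented for illustration, and the helper name attribute_usage is an assumption.

```python
# Hypothetical bibliographic records; every value is a placeholder.
references = [
    {"type": "conference article", "authors": ["A. Author", "B. Author"],
     "title": "Placeholder Title One", "proceedings": "Placeholder Conference",
     "pages": "1-10"},
    # Same type of reference, but with less information available.
    {"type": "conference article", "title": "Placeholder Title Two"},
    # A newer kind of source that introduces an attribute of its own.
    {"type": "web page", "title": "Placeholder Page", "url": "http://example.org/page"},
]


def attribute_usage(records):
    """Count how many records carry each attribute name."""
    usage = {}
    for record in records:
        for name in record:
            usage[name] = usage.get(name, 0) + 1
    return usage


usage = attribute_usage(references)
shared = sorted(name for name, n in usage.items() if n == len(references))
partial = sorted(name for name, n in usage.items() if n < len(references))
print("shared by all records:", shared)
print("present only in some: ", partial)
```

No schema is declared anywhere; the only way to learn which attributes exist is to inspect the data itself, as the function does.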


FIGURE 26.1 Representing semistructured data as a graph.

Semistructured data may be displayed as a directed graph, as shown in Figure 26.1. The information shown in Figure 26.1 corresponds to some of the structured data shown in Figure 5.6. As we can see, this model somewhat resembles the object model (see Figure 20.1) in its ability to represent complex objects and nested structures. In Figure 26.1, the labels or tags on the directed edges represent the schema names: the names of attributes, object types (or entity types or classes), and relationships. The internal nodes represent individual objects or composite attributes. The leaf nodes represent actual data values of simple (atomic) attributes.

There are two main differences between the semistructured model and the object model that we discussed in Chapter 20:

1. The schema information (names of attributes, relationships, and classes (object types)) in the semistructured model is intermixed with the objects and their data values in the same data structure.

2. In the semistructured model, there is no requirement for a predefined schema to which the data objects must conform.
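The graph view described above (labeled edges, internal nodes, and leaf values) can be sketched in Python with nested dictionaries. This is only one possible encoding, chosen for illustration: dictionary keys play the role of edge labels, dictionaries are internal nodes, and plain values are leaves. The edge labels and the function name leaf_paths are assumptions; the sample values are taken from the project listing in Figure 26.2, not from Figure 26.1 itself.

```python
# Internal node: dict mapping an edge label to a child (or a list of children).
# Leaf node: a plain atomic value.
company = {
    "project": [
        {
            "name": "ProductX",
            "worker": [
                {"first_name": "John", "last_name": "Smith", "hours": 32.5},
                {"first_name": "Joyce", "last_name": "English", "hours": 20.0},
            ],
        }
    ]
}


def leaf_paths(node, path=()):
    """Yield (edge-label path, value) for every leaf reachable from node."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from leaf_paths(child, path + (label,))
    elif isinstance(node, list):
        # Several edges with the same label leaving the same node.
        for child in node:
            yield from leaf_paths(child, path)
    else:
        yield path, node


for path, value in leaf_paths(company):
    print("/".join(path), "=", value)
```

The schema names appear only as labels inside the data structure, and nothing forces two worker nodes to have the same set of outgoing edges, which mirrors the two differences listed above.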

In addition to structured and semistructured data, a third category exists, known as unstructured data because there is very limited indication of the type of data. A typical example is a text document that contains information embedded within it. Web pages in HTML that contain some data are considered to be unstructured data. Consider part of an HTML file, shown in Figure 26.2. Text that appears between angled brackets, < ... >, is an HTML tag. A tag with a slash, </ ... >, indicates an end tag, which represents the ending of the effect of a matching start tag. The tags mark up the document¹ in order to instruct an HTML processor how to display the text between a start tag and a matching end tag. Hence, the tags specify document formatting rather than the meaning of the various data elements in the document. HTML tags specify information such as font size and style (boldface, italics, and so on), color, heading levels in documents, and so on. Some tags provide text structuring in documents, such as specifying a numbered or unnumbered list or a table. Even these structuring tags specify that the embedded textual data is to be displayed in a certain manner, rather than indicating the type of data represented in the table.

<head>
</head>
<body>
<H1>List of company projects and the employees in each project</H1>
<H2>The ProductX project:</H2>
<TR>
<TD width="50%"><font size="2" face="Arial">John Smith:</font></TD>
<TD>32.5 hours per week</TD>
</TR>
<TR>
<TD width="50%"><font size="2" face="Arial">Joyce English:</font></TD>
<TD>20.0 hours per week</TD>
</TR>
</table>
<H2>The ProductY project:</H2>
<TR>
<TD width="50%"><font size="2" face="Arial">John Smith:</font></TD>
</TR>
<TR>
<TD width="50%"><font size="2" face="Arial">Joyce English:</font></TD>
</TR>
<TR>
<TD width="50%"><font size="2" face="Arial">Franklin Wong:</font></TD>
</TR>
</table>
</body>
</html>

FIGURE 26.2 Part of an HTML document representing unstructured data.

1. That is why it is known as Hypertext Markup Language.

HTML uses a large number of predefined tags, which are used to specify a variety of commands for formatting Web documents for display. The start and end tags specify the range of text to be formatted by each command. A few examples of the tags shown in Figure 26.2 follow:

• The <html> </html> tags specify the boundaries of the document.

• The document header information, within the <head> </head> tags, specifies various commands that will be used elsewhere in the document. For example, it may specify various script functions in a language such as JavaScript or PERL, or certain formatting styles (fonts, paragraph styles, header styles, and so on) that can be used in the document. It can also specify a title to indicate what the HTML file is for, and other similar information that will not be displayed as part of the document.

• The body of the document, specified within the <body> </body> tags, includes the document text and the markup tags that specify how the text is to be formatted and displayed. It can also include references to other objects, such as images, videos, voice messages, and other documents.

• The <H1> </H1> tags specify that the text is to be displayed as a level 1 heading. There are many heading levels (<H2>, <H3>, and so on), each displaying text in a less prominent heading format.

• The <table> </table> tags specify that the following text is to be displayed as a table. Each row in the table is enclosed within <TR> </TR> tags, and the actual text data in a row is displayed within <TD> </TD> tags.²

• Some tags may have attributes, which appear within the start tag and describe additional properties of the tag.³ In Figure 26.2, the <table> start tag has four attributes describing various characteristics of the table. The following <TD> and <font> start tags have one and two attributes, respectively.

HTML has a very large number of predefined tags, and whole books are devoted to describing how to use these tags. If designed properly, HTML documents can be formatted so that humans are able to easily understand the document contents and navigate through the resulting Web documents. However, the source HTML text documents are very difficult to interpret automatically by computer programs because they do not include schema information about the type of data in the documents. As e-commerce and other Internet applications become increasingly automated, it is becoming crucial to be able to exchange Web documents among various computer sites and to interpret their contents automatically. This need was one of the reasons that led to the development of XML, which we discuss in the next section.
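The difficulty of interpreting HTML automatically can be seen in a short Python sketch that uses the standard library's html.parser module. It extracts the text of the table cells from one row of Figure 26.2. The class name CellTextExtractor is an assumption, and the <table> start tag in the fragment is written without attributes because the attributes are not shown in the listing.

```python
from html.parser import HTMLParser


class CellTextExtractor(HTMLParser):
    """Collect the text found inside each <TD> ... </TD> pair."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


fragment = """
<table>
<TR>
<TD width="50%"><font size="2" face="Arial">John Smith:</font></TD>
<TD>32.5 hours per week</TD>
</TR>
</table>
"""

parser = CellTextExtractor()
parser.feed(fragment)
cells = [text.strip() for text in parser.cells]
print(cells)
```

The program recovers two strings, but nothing in the markup says that the first is an employee name and the second a number of hours; that knowledge has to be supplied by whoever writes the program.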

2. <TR> stands for table row, and <TD> for table data.

3. This is how the term attribute is used in document markup languages, which differs from how it is used in database models.

26.2 XML HIERARCHICAL (TREE) DATA MODEL

We now introduce the data model used in XML. The basic object in XML is the XML document. Two main structuring concepts are used to construct an XML document: elements and attributes. It is important to note right away that the term attribute in XML is not used in the same manner as is customary in database terminology, but rather as it is used in document description languages such as HTML and SGML.⁴ Attributes in XML provide additional information that describes elements, as we shall see. There are additional concepts in XML, such as entities, identifiers, and references, but we first concentrate on describing elements and attributes to show the essence of the XML model.

Figure 26.3 shows an example of an XML element called <projects>. As in HTML, elements are identified in a document by their start tag and end tag. The tag names are enclosed between angled brackets < ... >, and end tags are further identified by a slash, </ ... >.⁵ Complex elements are constructed from other elements hierarchically, whereas simple elements contain data values. A major difference between XML and HTML is that XML tag names are defined to describe the meaning of the data elements in the document, rather than to describe how the text is to be displayed. This makes it possible to process the data elements in the XML document automatically by computer programs.

It is straightforward to see the correspondence between the XML textual representation shown in Figure 26.3 and the tree structure shown in Figure 26.1. In the tree representation, internal nodes represent complex elements, whereas leaf nodes represent simple elements. That is why the XML model is called a tree model or a hierarchical model. In Figure 26.3, the simple elements are the ones with the tag names <Name>, <Number>, <Location>, <DeptNo>, <SSN>, <LastName>, <FirstName>, and <hours>. The complex elements are the ones with the tag names <projects>, <project>, and <Worker>. In general, there is no limit on the levels of nesting of elements.
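The distinction between complex (internal) and simple (leaf) elements can be checked mechanically once a document is parsed into a tree. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the sample document is hypothetical data shaped like the <projects> element described above, not a copy of Figure 26.3.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, shaped like the <projects> element described in the text.
SAMPLE = """<projects>
  <project>
    <Name>ProductX</Name>
    <Number>1</Number>
    <Location>Bellaire</Location>
    <DeptNo>5</DeptNo>
    <Worker>
      <SSN>123456789</SSN>
      <LastName>Smith</LastName>
      <hours>32.5</hours>
    </Worker>
  </project>
</projects>"""


def classify(root):
    """Return (complex_tags, simple_tags): internal nodes vs. leaf nodes of the tree."""
    complex_tags, simple_tags = set(), set()
    for elem in root.iter():
        # An element that has child elements is an internal node; otherwise it is a leaf.
        if len(elem) > 0:
            complex_tags.add(elem.tag)
        else:
            simple_tags.add(elem.tag)
    return complex_tags, simple_tags


complex_tags, simple_tags = classify(ET.fromstring(SAMPLE))
print(sorted(complex_tags))
print(sorted(simple_tags))
```

In this sketch a tag is classified by the elements that carry it, so a tag used both with and without children would appear in both sets.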

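In general, it is possible to characterize three main types of XML documents:

• Data-centric XML documents: These documents have many small data items that follow a specific structure and hence may be extracted from a structured database. They are formatted as XML documents in order to exchange them or display them over the Web.

• Document-centric XML documents: These are documents with large amounts of text, such as news articles or books. There are few or no structured data elements in these documents.

• Hybrid XML documents: These documents may have parts that contain structured data and other parts that are predominantly textual or unstructured.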

It is important to note that data-centric XML documents can be considered either as semistructured data or as structured data. If an XML document conforms to a predefined

4. SGML (Standard Generalized Markup Language) is a more general language for describing documents and provides capabilities for specifying new tags. However, it is more complex than HTML and XML.

5. The left and right angled bracket characters (< and >) are reserved characters, as are the ampersand (&), apostrophe ('), and quotation mark ("). To include them within the text of a document, they must be encoded as &lt;, &gt;, &amp;, &apos;, and &quot;, respectively.


FIGURE 26.3 A complex XML element called <projects>.

XML schema or DTD (see Section 26.3), then the document can be considered as structured data. On the other hand, XML allows documents that do not conform to any schema; these would be considered as semistructured data. The latter are also known as schemaless XML documents. When the value of the standalone attribute in an XML document is "yes", as in the first line of Figure 26.3, the document is standalone and schemaless.

XML attributes are generally used in a manner similar to how they are used in HTML (see Figure 26.2), namely, to describe properties and characteristics of the elements (tags) within which they appear. It is also possible to use XML attributes to hold the values of


simple data elements; however, this is definitely not recommended. We discuss XML attributes further in Section 26.3 when we discuss XML schema and DTD.

26.3.1 Well-Formed and Valid XML Documents and XML DTD

In Figure 26.3, we saw what a simple XML document may look like. An XML document is well formed if it follows a few conditions. In particular, it must start with an XML declaration to indicate the version of XML being used as well as any other relevant attributes, as shown in the first line of Figure 26.3. It must also follow the syntactic guidelines of the tree model. This means that there should be a single root element, and every element must include a matching pair of start and end tags within the start and end tags of the parent element. This ensures that the nested elements specify a well-formed tree structure.
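A generic parser can serve as a well-formedness check: it rejects documents with more than one root or with improperly nested tags. The sketch below uses Python's standard xml.etree.ElementTree; the three sample strings are hypothetical. Note that this particular parser also accepts documents that omit the XML declaration, so it checks only the single-root and nesting conditions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as a single-rooted, properly nested XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples: one well-formed document and two violations.
good = ('<?xml version="1.0" standalone="yes"?>'
        "<projects><project><Name>ProductX</Name></project></projects>")
bad_nesting = "<projects><project><Name>ProductX</project></Name></projects>"
two_roots = "<projects></projects><projects></projects>"

for label, text in [("good", good), ("bad_nesting", bad_nesting), ("two_roots", two_roots)]:
    print(label, is_well_formed(text))
```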

A well-formed XML document is syntactically correct. This allows it to be processed by generic processors that traverse the document and create an internal tree representation. A standard set of API (application programming interface) functions called DOM (Document Object Model) allows programs to manipulate the resulting tree representation corresponding to a well-formed XML document. However, the whole document must be parsed beforehand when using DOM. Another API called SAX allows processing of XML documents on the fly by notifying the processing program whenever a start or end tag is encountered. This makes it easier to process large documents and allows for processing of so-called streaming XML documents, where the processing program can process the tags as they are encountered.
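The contrast between the two APIs can be seen in a small sketch that extracts the same values both ways, using Python's standard xml.dom.minidom and xml.sax modules. The document string and the NameCollector class are illustrative assumptions, not part of the original text.

```python
import xml.dom.minidom
import xml.sax

# Hypothetical document with two <Name> elements.
DOC = ("<projects><project><Name>ProductX</Name></project>"
       "<project><Name>ProductY</Name></project></projects>")

# DOM style: the whole document is parsed into a tree first, then navigated.
dom = xml.dom.minidom.parseString(DOC)
dom_names = [node.firstChild.data for node in dom.getElementsByTagName("Name")]


# SAX style: the handler is notified as each start tag, text run, and end tag is met.
class NameCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "Name":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so it is buffered until the end tag.
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "Name":
            self.names.append("".join(self._buffer))
            self._in_name = False


handler = NameCollector()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_names = handler.names

print(dom_names, sax_names)
```

The SAX handler never holds more than the current text buffer, which is why this style suits large or streaming documents.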

A well-formed XML document can have any tag names for the elements within the document. There is no predefined set of elements (tag names) that a program processing the document knows to expect. This gives the document creator the freedom to specify new elements, but limits the possibilities for automatically interpreting the elements within the document.

<!DOCTYPE projects [
  <!ELEMENT projects (project+)>
  <!ELEMENT project (Name, Number, Location, DeptNo?, Workers)>
  <!ELEMENT Name (#PCDATA)>
  <!ELEMENT Number (#PCDATA)>
  <!ELEMENT Location (#PCDATA)>
  <!ELEMENT DeptNo (#PCDATA)>
  <!ELEMENT Workers (Worker*)>
  <!ELEMENT Worker (SSN, LastName?, FirstName?, hours)>
  <!ELEMENT SSN (#PCDATA)>
  <!ELEMENT LastName (#PCDATA)>
  <!ELEMENT FirstName (#PCDATA)>
  <!ELEMENT hours (#PCDATA)>
]>

FIGURE 26.4 An XML DTD file called projects.


A stronger criterion is for an XML document to be valid. In this case, the document must be well formed, and in addition the element names used in the start and end tag pairs must follow the structure specified in a separate XML DTD (Document Type Definition) file or XML schema file. We first discuss XML DTD here, then give an overview of XML schema in Section 26.3.2. Figure 26.4 shows a simple XML DTD file, which specifies the elements (tag names) and their nested structures. Any valid documents conforming to this DTD should follow the specified structure. A special syntax exists for specifying DTD files, as illustrated in Figure 26.4. First, a name is given to the root tag of the document, which is called projects in the first line of Figure 26.4. Then the elements and their nested structure are specified.

When specifying elements, the following notation is used:

• A * following the element name means that the element can be repeated zero or more times in the document. This kind of element is known as an optional multivalued (repeating) element.

• A + following the element name means that the element can be repeated one or more times in the document. This kind of element is a required multivalued (repeating) element.

• A ? following the element name means that the element can be repeated zero or one times. This kind is an optional single-valued (nonrepeating) element.

• An element appearing without any of the preceding three symbols must appear exactly once in the document. This kind is a required single-valued (nonrepeating) element.

• The type of the element is specified via parentheses following the element. If the parentheses include names of other elements, these latter elements are the children of the element in the tree structure. If the parentheses include the keyword #PCDATA or one of the other data types available in XML DTD, the element is a leaf node. PCDATA stands for parsed character data, which is roughly similar to a string data type.

• Parentheses can be nested when specifying elements.

• A bar symbol (e1 | e2) specifies that either e1 or e2 can appear in the document.
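The occurrence symbols, nested parentheses, and bar of a DTD content model behave much like the corresponding regular-expression operators. The following sketch exploits that similarity to check an ordered list of child tag names against a content model. It is an illustration only, not a DTD validator: the helper names are hypothetical, and it handles element-only content models (not #PCDATA or other data types).

```python
import re


def content_model_to_regex(model):
    """Translate a DTD element-content model such as '(SSN, LastName?, hours)' into a regex."""
    pattern = []
    for token in re.findall(r"[A-Za-z_][\w.-]*|[(),|?*+]", model):
        if token == "(":
            pattern.append("(?:")
        elif token == ",":
            continue  # a sequence is plain concatenation in a regex
        elif token in ")|?*+":
            pattern.append(token)  # these symbols carry the same meaning in a regex
        else:
            # One child element; the trailing space marks the end of the tag name.
            pattern.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(pattern))


def children_match(model, child_tags):
    """Check an ordered list of child tag names against a content model."""
    text = "".join(tag + " " for tag in child_tags)
    return content_model_to_regex(model).fullmatch(text) is not None


WORKER = "(SSN, LastName?, FirstName?, hours)"
print(children_match(WORKER, ["SSN", "hours"]))
print(children_match(WORKER, ["hours", "SSN"]))
print(children_match("(project+)", []))
```

Because a comma maps to concatenation, the sketch also reflects the fixed ordering that a DTD imposes on child elements.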

We can see that the tree structure in Figure 26.1 and the XML document in Figure 26.3 conform to the XML DTD in Figure 26.4. To require that an XML document be checked for conformance to a DTD, we must specify this in the declaration of the document. For example, we could change the first line in Figure 26.3 to the following:

<?xml version="1.0" standalone="no"?>

<!DOCTYPE projects SYSTEM "proj.dtd">

When the value of the standalone attribute in an XML document is "no", the document needs to be checked against a separate DTD document. The DTD file shown in Figure 26.4 should be stored in the same file system as the XML document, and should be given the file name "proj.dtd". Alternatively, we could include the DTD document text at the beginning of the XML document itself to allow the checking.

Although XML DTD is quite adequate for specifying tree structures with required, optional, and repeating elements, it has several limitations. First, the data types in DTD


are not very general. Second, DTD has its own special syntax and thus requires specialized processors. It would be advantageous to specify XML schema documents using the syntax rules of XML itself so that the same processors used for XML documents could process XML schema descriptions. Third, all DTD elements are always forced to follow the specified ordering of the document, so unordered elements are not permitted. These drawbacks led to the development of XML schema, a more general language for specifying the structure and elements of XML documents.

26.3.2 XML Schema

The XML schema language is a standard for specifying the structure of XML documents. It uses the same syntax rules as regular XML documents, so that the same processors can be used on both. To distinguish the two types of documents, we will use the term XML instance document or XML document for a regular XML document, and XML schema document for a document that specifies an XML schema. Figure 26.5 shows an XML schema document corresponding to the COMPANY database shown in Figures 3.2 and 5.5. Although it is unlikely that we would want to display the whole database as a single document, there have been proposals to store data in native XML format as an alternative to storing the data in relational databases. The schema in Figure 26.5 would serve the purpose of specifying the structure of the COMPANY database if it were stored in a native XML system. We discuss this topic further in Section 26.4.

As with XML DTD, XML schema is based on the tree data model, with elements and attributes as the main structuring concepts. However, it borrows additional concepts from

<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:annotation>
<xsd:documentation xml:lang="en">Company Schema (Element Approach) - Prepared by Babak Hojabri</xsd:documentation>


<xsd:complexType name="Department">

<xsd:sequence>

<xsd:element name="departmentName" type="xsd:string" />

<xsd:element name="departmentNumber" type="xsd:string" />

<xsd:element name="departmentManagerSSN" type="xsd:string" />

<xsd:element name="departmentManagerStartDate" type="xsd:date" />

<xsd:element name="departmentLocation" type="xsd:string"

<xsd:element name="employeeName" type="Name" />

<xsd:element name="employeeSSN" type="xsd:string" />

<xsd:element name="employeeSex" type="xsd:string" />

<xsd:element name="employeeSalary" type="xsd:unsignedInt" />

<xsd:element name="employeeBirthDate" type="xsd:date" />

<xsd:element name="employeeDepartmentNumber" type="xsd:string" />

<xsd:element name="employeeSupervisorSSN" type="xsd:string" />

<xsd:element name="employeeAddress" type="Address" />

<xsd:element name="employeeWorksOn" type="WorksOn" minOccurs="1" maxOccurs="unbounded" />

<xsd:element name="employeeDependent" type="Dependent" minOccurs="0" maxOccurs="unbounded" />

</xsd:sequence>

</xsd:complexType>

<xsd:complexType name="Project">

<xsd:sequence>

<xsd:element name="projectName" type="xsd:string" />

<xsd:element name="projectNumber" type="xsd:string" />

<xsd:element name="projectLocation" type="xsd:string" />

<xsd:element name="projectDepartmentNumber" type="xsd:string" />

<xsd:element name="projectWorker" type="Worker" minOccurs="1" maxOccurs="unbounded" />

</xsd:sequence>

</xsd:complexType>

<xsd:complexType name="Dependent">

<xsd:sequence>

<xsd:element name="dependentName" type="xsd:string" />

<xsd:element name="dependentSex" type="xsd:string" />

<xsd:element name="dependentBirthDate" type="xsd:date" />

<xsd:element name="dependentRelationship" type="xsd:string" />

</xsd:sequence>

</xsd:complexType>

<xsd:complexType name="Address">

<xsd:sequence>

<xsd:element name="number" type="xsd:string" />

<xsd:element name="street" type="xsd:string" />

<xsd:element name="city" type="xsd:string" />

<xsd:element name="state" type="xsd:string" />

</xsd:sequence>


<xsd:complexType name="Name">

<xsd:sequence>

<xsd:element name="firstName" type="xsd:string" />

<xsd:element name="middleName" type="xsd:string" />

<xsd:element name="lastName" type="xsd:string" />

</xsd:sequence>

</xsd:complexType>

<xsd:complexType name="Worker">

<xsd:sequence>

<xsd:element name="SSN" type="xsd:string" />

<xsd:element name="hours" type="xsd:float" />

</xsd:sequence>

</xsd:complexType>

<xsd:complexType name="WorksOn">

<xsd:sequence>

<xsd:element name="projectNumber" type="xsd:string" />

<xsd:element name="hours" type="xsd:float" />

</xsd:sequence>

</xsd:complexType>

</xsd:schema>

FIGURE 26.5 (CONTINUED) An XML schema file called company.

database and object models, such as keys, references, and identifiers. We here describe the features of XML schema in a step-by-step manner, referring to the example XML schema document of Figure 26.5 for illustration. We introduce and describe some of the schema concepts in the order in which they are used in Figure 26.5.

1. Schema descriptions and XML namespaces: It is necessary to identify the specific set of XML schema language elements (tags) being used by specifying a file stored at a Web site location. The second line in Figure 26.5 specifies the file used in this example, which is "http://www.w3.org/2001/XMLSchema". This is the most commonly used standard for XML schema commands. Each such definition is called an XML namespace, because it defines the set of commands (names) that can be used. The file name is assigned to the variable xsd (XML schema description) using the attribute xmlns (XML namespace), and this variable is used as a prefix to all XML schema commands (tag names). For example, in Figure 26.5, when we write xsd:element or xsd:sequence, we are referring to the definitions of the element and sequence tags as defined in the file "http://www.w3.org/2001/XMLSchema".

2. Annotations, documentation, and language used: The next couple of lines in Figure 26.5 illustrate the XML schema elements (tags) xsd:annotation and xsd:documentation, which are used for providing comments and other descriptions in the XML document. The attribute xml:lang of the xsd:documentation element specifies the language being used, where "en" stands for the English language.


3. Elements and types: Next, we specify the root element of our XML schema. In XML schema, the name attribute of the xsd:element tag specifies the element name, which is called company for the root element in our example (see Figure 26.5). The structure of the company root element can then be specified, which in our example is xsd:complexType. This is further specified to be a sequence of departments, employees, and projects using the xsd:sequence structure of XML schema. It is important to note here that this is not the only way to specify an XML schema for the COMPANY database. We will discuss other options in Section 26.4.

4. First-level elements in the COMPANY database: Next, we specify the three first-level elements under the company root element in Figure 26.5. These elements are named employee, department, and project, and each is specified in an xsd:element tag. Notice that if a tag has only attributes and no further subelements or data within it, it can be ended with the slash symbol (/>) directly instead of having a separate matching end tag. These are called empty elements; examples are the xsd:element elements named department and project in Figure 26.5.

5. Specifying element type and minimum and maximum occurrences: In XML schema, the attributes type, minOccurs, and maxOccurs in the xsd:element tag specify the type and multiplicity of each element in any document that conforms to the schema specifications. If we specify a type attribute in an xsd:element, the structure of the element must be described separately, typically using the xsd:complexType element of XML schema. This is illustrated by the employee, department, and project elements in Figure 26.5. On the other hand, if no type attribute is specified, the element structure can be defined directly following the tag, as illustrated by the company root element in Figure 26.5. The minOccurs and maxOccurs attributes are used for specifying lower and upper bounds on the number of occurrences of an element in any document that conforms to the schema specifications. If they are not specified, the default is exactly one occurrence. These serve a similar role to the *, +, and ? symbols of XML DTD, and to the (min, max) constraints of the ER model (see Section 3.7.4).

6. Specifying keys: In XML schema, it is possible to specify constraints that correspond to unique and primary key constraints in a relational database (see Section 5.2.2), as well as foreign keys (or referential integrity) constraints (see Section 5.2.4). The xsd:unique tag specifies elements that correspond to unique attributes in a relational database that are not primary keys. We can give each such uniqueness constraint a name, and we must specify xsd:selector and xsd:field tags for it to identify the element type that contains the unique element and the element name within it that is unique via the xpath attribute. This is illustrated by the departmentNameUnique and projectNameUnique elements in Figure 26.5. For specifying primary keys, the tag xsd:key is used instead of xsd:unique, as illustrated by the projectNumberKey, departmentNumberKey, and employeeSSNKey elements in Figure 26.5. For specifying foreign keys, the tag xsd:keyref is used, as illustrated by the six xsd:keyref elements in Figure 26.5. When specifying a foreign key, the attribute refer of the xsd:keyref tag specifies the referenced primary key, whereas the tags xsd:selector and xsd:field specify the referencing element type and foreign key (see Figure 26.5).


7. Specifying the structures of complex elements via complex types: The next part of our example specifies the structures of the complex elements Department, Employee, Project, and Dependent, using the tag xsd:complexType (see Figure 26.5). We specify each of these as a sequence of subelements corresponding to the database attributes of each entity type (see Figures 3.2 and 5.7) by using the xsd:sequence and xsd:element tags of XML schema. Each element is given a name and type via the attributes name and type of xsd:element. We can also specify minOccurs and maxOccurs attributes if we need to change the default of exactly one occurrence. For (optional) database attributes where null is allowed, we need to specify minOccurs="0", whereas for multivalued database attributes we need to specify maxOccurs="unbounded" on the corresponding element. Notice that if we were not going to specify any key constraints, we could have embedded the subelements within the parent element definitions directly without having to specify complex types. However, when unique, primary key, and foreign key constraints need to be specified, we must define complex types to specify the element structures.

8. Composite (compound) attributes: Composite attributes from Figure 3.2 are also specified as complex types in Figure 26.5, as illustrated by the Address, Name, Worker, and WorksOn complex types. These could have been directly embedded within their parent elements.
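The correspondence noted in item 5 between minOccurs/maxOccurs and the DTD occurrence symbols can be sketched as a small lookup. The helper names below are hypothetical, and the schema fragment is an illustrative excerpt in the style of Figure 26.5; it is parsed with Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"  # namespace prefix in ElementTree notation

# Illustrative fragment in the style of the Employee complex type.
FRAGMENT = """<xsd:sequence xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="employeeSSN" type="xsd:string" />
  <xsd:element name="employeeWorksOn" type="WorksOn" minOccurs="1" maxOccurs="unbounded" />
  <xsd:element name="employeeDependent" type="Dependent" minOccurs="0" maxOccurs="unbounded" />
</xsd:sequence>"""


def occurrence_bounds(attrs):
    """Return (min, max) for an xsd:element; max is None when it is 'unbounded'."""
    low = int(attrs.get("minOccurs", "1"))  # default is exactly one occurrence
    raw_high = attrs.get("maxOccurs", "1")
    high = None if raw_high == "unbounded" else int(raw_high)
    return low, high


def dtd_symbol(low, high):
    """Closest DTD occurrence symbol for the bounds, or None if a DTD cannot express them."""
    table = {(1, 1): "", (0, 1): "?", (0, None): "*", (1, None): "+"}
    return table.get((low, high))


summary = {
    elem.get("name"): dtd_symbol(*occurrence_bounds(elem.attrib))
    for elem in ET.fromstring(FRAGMENT).findall(XSD + "element")
}
print(summary)
```

Bounds such as (2, 5) have no single DTD symbol, which is one way XML schema is more general than DTD in this respect.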

This example illustrates some of the main features of XML schema. There are other features, but they are beyond the scope of our presentation. In the next section, we discuss the different approaches to creating XML documents from relational databases and storing XML documents.

26.4 XML DOCUMENTS AND DATABASES

We now discuss how various types of XML documents can be stored and retrieved. Section 26.4.1 gives an overview of the various approaches for storing XML documents. Section 26.4.2 discusses one of these approaches, in which data-centric XML documents are extracted from existing databases, in more detail. In particular, we show how tree-structured documents can be created from graph-structured databases. Section 26.4.3 discusses the problem of cycles and how it can be dealt with.

26.4.1 Approaches to Storing XML Documents

Several approaches to organizing the contents of XML documents to facilitate their subsequent querying and retrieval have been proposed. The following are the most common approaches:

1. Using a DBMS to store the documents as text: A relational or object DBMS can be used to store whole XML documents as text fields within the DBMS records or objects. This approach can be used if the DBMS has a special module for document processing, and would work for storing schemaless and document-centric XML documents. The keyword indexing functions of the document processing module (see Chapter 22) can be used to index and speed up search and retrieval of the documents.

2. Using a DBMS to store the document contents as data elements: This approach would work for storing a collection of documents that follow a specific XML DTD or XML schema. Because all the documents have the same structure, one can design a relational (or object) database to store the leaf-level data elements within the XML documents. This approach would require mapping algorithms to design a database schema that is compatible with the XML document structure as specified in the XML schema or DTD and to recreate the XML documents from the stored data. These algorithms can be implemented either as an internal DBMS module or as separate middleware that is not part of the DBMS.

3. Designing a specialized system for storing native XML data: A new type of database system based on the hierarchical (tree) model could be designed and implemented. The system would include specialized indexing and querying techniques, and would work for all types of XML documents. It could also include data compression techniques to reduce the size of the documents for storage.

4. Creating or publishing customized XML documents from preexisting relational databases: Because there are enormous amounts of data already stored in relational databases, parts of this data may need to be formatted as documents for exchanging or displaying over the Web. This approach would use a separate middleware software layer to handle the conversions needed between the XML documents and the relational database.

All four of these approaches have received considerable attention over the past few years. We focus on approach 4 in the next subsection, because it gives a good conceptual understanding of the differences between the XML tree data model and the traditional database models based on flat files (relational model) and graph representations (ER model).
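Approach 2 above can be illustrated with a minimal sketch in which the leaf-level elements of identically structured documents are stored as columns of one relational table and the documents are then recreated from the stored rows. The sample documents, the table and column names, and the rebuild helper are all assumptions made for this example; it uses Python's standard sqlite3 and xml.etree.ElementTree modules.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical documents that all follow the same structure.
DOCS = [
    "<project><Name>ProductX</Name><Number>1</Number><Location>Bellaire</Location></project>",
    "<project><Name>ProductY</Name><Number>2</Number><Location>Sugarland</Location></project>",
]

# Assumed mapping from leaf element names to column names.
LEAVES = [("Name", "pname"), ("Number", "pnumber"), ("Location", "plocation")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE project (pname TEXT, pnumber TEXT, plocation TEXT)")

# Mapping step: one row per document, one column per leaf element.
for text in DOCS:
    root = ET.fromstring(text)
    conn.execute(
        "INSERT INTO project VALUES (?, ?, ?)",
        [root.findtext(tag) for tag, _ in LEAVES],
    )


def rebuild(row):
    """Recreate a <project> document from one stored row."""
    root = ET.Element("project")
    for (tag, _), value in zip(LEAVES, row):
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


rows = conn.execute(
    "SELECT pname, pnumber, plocation FROM project ORDER BY pnumber"
).fetchall()
rebuilt = [rebuild(row) for row in rows]
print(rebuilt)
```

A flat mapping like this only works because every document has the same single-level structure; nested or repeating elements would need additional tables.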


We will use the simplified UNIVERSITY ER schema shown in Figure 26.6 to illustrate our discussion. Suppose that an application needs to extract XML documents for student, course, and grade information from the UNIVERSITY database. The data needed for these documents is contained in the database attributes of the entity types COURSE, SECTION, and STUDENT from Figure 26.6, and the relationships S-S and C-S between them. In general, most documents extracted from a database will only use a subset of the attributes, entity types, and relationships in the database. In this example, the subset of the database that is needed is shown in Figure 26.7.

FIGURE 26.6 An ER schema diagram for a simplified UNIVERSITY database.

At least three possible document hierarchies can be extracted from the database subset in Figure 26.7. First, we can choose COURSE as the root, as illustrated in Figure 26.8. Here, each course entity has the set of its sections as subelements, and each section has its students as subelements. We can see one consequence of modeling the information in a hierarchical tree structure. If a student has taken multiple sections, that student's information will appear multiple times in the document, once under each section. A possible simplified XML schema for this view is shown in Figure 26.9. The Grade database attribute in the S-S relationship is migrated to the STUDENT element. This is because STUDENT becomes a child of SECTION in this hierarchy, so each STUDENT element under a specific SECTION element can have a specific grade in that section. In this document hierarchy, a student taking more than one section will have several replicas, one under each section, and each replica will have the specific grade given in that particular section.
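The replication of student information in the COURSE-rooted view can be seen in a short sketch that builds such a document from flat rows. The rows are hypothetical (as they might come from joining the COURSE, SECTION, and STUDENT data), the year, quarter, and class elements are omitted for brevity, and course_rooted is an illustrative helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows: (cnumber, cname, secnumber, ssn, sname, grade),
# one row per student-section pair.
ROWS = [
    (101, "Databases", 1, "111", "Ana", "A"),
    (101, "Databases", 2, "222", "Ben", "B"),
    (202, "Networks", 1, "111", "Ana", "C"),
]


def course_rooted(rows):
    """Build a document with course > section > student nesting."""
    root = ET.Element("root")
    courses, sections = {}, {}
    for cnumber, cname, secnumber, ssn, sname, grade in rows:
        if cnumber not in courses:
            course = ET.SubElement(root, "course")
            ET.SubElement(course, "cname").text = cname
            ET.SubElement(course, "cnumber").text = str(cnumber)
            courses[cnumber] = course
        key = (cnumber, secnumber)
        if key not in sections:
            section = ET.SubElement(courses[cnumber], "section")
            ET.SubElement(section, "secnumber").text = str(secnumber)
            sections[key] = section
        # A new student element is created under every section the student took.
        student = ET.SubElement(sections[key], "student")
        ET.SubElement(student, "ssn").text = ssn
        ET.SubElement(student, "sname").text = sname
        ET.SubElement(student, "grade").text = grade  # grade migrated from the relationship
    return root


tree = course_rooted(ROWS)
ana_copies = [s for s in tree.iter("student") if s.findtext("ssn") == "111"]
ana_grades = [s.findtext("grade") for s in ana_copies]
print(len(ana_copies), ana_grades)
```

The student with ssn "111" appears twice in the result, each copy carrying the grade for its own section.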

In the second hierarchical document view, we can choose STUDENT as root (Figure 26.10). In this hierarchical view, each student has a set of sections as its child elements, and each section is related to one course as its child, because the relationship between SECTION and COURSE is N:1. We can hence merge the COURSE and SECTION elements in this


<xsd:element name="root">
  <xsd:sequence>
    <xsd:element name="course" minOccurs="0" maxOccurs="unbounded">
      <xsd:sequence>
        <xsd:element name="cname" type="xsd:string" />
        <xsd:element name="cnumber" type="xsd:unsignedInt" />
        <xsd:element name="section" minOccurs="0" maxOccurs="unbounded">
          <xsd:sequence>
            <xsd:element name="secnumber" type="xsd:unsignedInt" />
            <xsd:element name="year" type="xsd:string" />
            <xsd:element name="quarter" type="xsd:string" />
            <xsd:element name="student" minOccurs="0" maxOccurs="unbounded">
              <xsd:sequence>
                <xsd:element name="ssn" type="xsd:string" />
                <xsd:element name="sname" type="xsd:string" />
                <xsd:element name="class" type="xsd:string" />
                <xsd:element name="grade" type="xsd:string" />
              </xsd:sequence>
            </xsd:element>
          </xsd:sequence>
        </xsd:element>
      </xsd:sequence>
    </xsd:element>
  </xsd:sequence>
</xsd:element>

FIGURE 26.9 XML schema document with COURSE as the root.

view, as shown in Figure 26.10. In addition, the GRADE database attribute can be migrated to the SECTION element. In this hierarchy, the combined COURSE/SECTION information is replicated under each student who completed the section. A possible simplified XML schema for this view is shown in Figure 26.11.

The third possible way is to choose SECTION as the root, as shown in Figure 26.12. Similar to the second hierarchical view, the COURSE information can be merged into the SECTION element. The GRADE database attribute can be migrated to the STUDENT element. As we can see, even in this simple example, there can be numerous hierarchical document views, each corresponding to a different root and a different XML document structure.

26.4.3 Breaking Cycles to Convert Graphs into Trees

In the previous examples, the subset of the database of interest had no cycles. It is possible to have a more complex subset with one or more cycles, indicating multiple relationships among the entities. In this case, it is more complex to decide how to create the document hierarchies. Additional duplication of entities may be needed to represent the multiple relationships. We shall illustrate this with an example using the ER schema in Figure 26.6.

FIGURE 26.10 Hierarchical (tree) view with STUDENT as the root.

Suppose that we need the information in all the entity types and relationships of Figure 26.6 for a particular XML document, with STUDENT as the root element. Figure 26.13 illustrates how a possible hierarchical tree structure can be created for this document. First, we get a lattice with STUDENT as the root, as shown in part (1) of Figure 26.13. This is not a tree structure because of the cycles. One way to break the cycles is to replicate the entity types involved in the cycles. First, we replicate INSTRUCTOR as shown in part (2) of Figure 26.13, calling the replica to the right INSTRUCTOR1. The INSTRUCTOR replica on the left represents the relationship between instructors and the sections they teach, whereas the INSTRUCTOR1 replica on the right represents the relationship between instructors and the department each works in. After this, we still have the cycle involving COURSE, so we can replicate COURSE in a similar manner, leading to the hierarchy shown in part (3) of Figure 26.13. The COURSE1 replica to the left represents the relationship between courses and their sections, whereas the COURSE replica to the right represents the relationship between courses and the department that offers each course.

In part (3) of Figure 26.13, we have converted the initial graph to a hierarchy. We can do further merging if desired (as in our previous example) before creating the final hierarchy and the corresponding XML schema structure.
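The replication step can be sketched as unfolding a rooted lattice into a tree: each time an entity type is reached again by a different path, a numbered replica is created. The lattice below is an assumption about the shape of part (1) of Figure 26.13 (the figure itself is not reproduced here), and the unfold helper and the choice of which copy receives the numeric suffix are illustrative. The sketch assumes the edges are already directed away from the root with no directed cycles.

```python
# Assumed lattice: each entity type maps to the entity types below it.
LATTICE = {
    "STUDENT": ["SECTION", "DEPARTMENT"],
    "SECTION": ["COURSE", "INSTRUCTOR"],
    "DEPARTMENT": ["INSTRUCTOR", "COURSE"],
}


def unfold(lattice, root):
    """Turn a rooted lattice into a tree of (label, children) pairs by replicating shared nodes."""
    copies = {}  # entity name -> number of copies created so far

    def copy(node):
        count = copies.get(node, 0)
        copies[node] = count + 1
        # The first visit keeps the plain name; later visits get a numeric suffix.
        label = node if count == 0 else f"{node}{count}"
        return (label, [copy(child) for child in lattice.get(node, [])])

    return copy(root)


def labels(tree):
    """List every node label in the tree, in preorder."""
    label, children = tree
    result = [label]
    for child in children:
        result.extend(labels(child))
    return result


tree = unfold(LATTICE, "STUDENT")
print(labels(tree))
```

Every label in the result is distinct, so each node has exactly one parent and the structure is a tree.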


<xsd:element name="root">
  <xsd:sequence>
    <xsd:element name="student" minOccurs="0" maxOccurs="unbounded">
      <xsd:sequence>
        <xsd:element name="ssn" type="xsd:string" />
        <xsd:element name="sname" type="xsd:string" />
        <xsd:element name="class" type="xsd:string" />
        <xsd:element name="section" minOccurs="0" maxOccurs="unbounded">
          <xsd:sequence>
            <xsd:element name="secnumber" type="xsd:unsignedInt" />
            <xsd:element name="year" type="xsd:string" />
            <xsd:element name="quarter" type="xsd:string" />
            <xsd:element name="cnumber" type="xsd:unsignedInt" />
            <xsd:element name="cname" type="xsd:string" />
            <xsd:element name="grade" type="xsd:string" />
          </xsd:sequence>
        </xsd:element>
      </xsd:sequence>
    </xsd:element>
  </xsd:sequence>
</xsd:element>

FIGURE 26.11 XML schema document with STUDENT as the root.

FIGURE 26.13 Converting a graph with cycles into a hierarchical (tree) structure.

26.4.4 Other Steps for Extracting XML Documents from

There have been several proposals for XML query languages, but two standards have emerged. The first is XPath, which provides language constructs for specifying path expressions to identify certain nodes (elements) within an XML document that match specific patterns. The second is XQuery, which is a more general query language. XQuery uses XPath expressions but has additional constructs. We give an overview of each of these languages in this section.

26.5.1 XPath: Specifying Path Expressions in XML

An XPath expression returns a collection of element nodes that satisfy certain patterns specified in the expression. The names in the XPath expression are node names in the XML document tree that are either tag (element) names or attribute names, possibly with additional qualifier conditions to further restrict the nodes that satisfy the pattern. Two main separators are used when specifying a path: single slash (/) and double slash (//). A single slash before a tag specifies that the tag must appear as a direct child of the previous (parent) tag, whereas a double slash specifies that the tag can appear as a descendant of the previous tag at any level. Let us look at some examples of XPath as shown in Figure 26.14.

The first XPath expression in Figure 26.14 returns the company root node and all its descendant nodes, which means that it returns the whole XML document. We should note that it is customary to include the file name in the XPath query. This allows us to specify any local file name or even any path name that specifies a file on the Web. For example, if the COMPANY XML document is stored at the location

www.company.com/info.xml

then the first XPath expression in Figure 26.14 can be written as

doc(www.company.com/info.xml)/company

This prefix would also be included in the other examples.

The second example in Figure 26.14 returns all department nodes (elements) and their descendant subtrees. Note that the nodes (elements) in an XML document are ordered, so the XPath result that returns multiple nodes will do so in the same order in which the nodes are ordered in the document tree.

The third XPath expression in Figure 26.14 illustrates the use of //, which is convenient to use if we do not know the full path name we are searching for, but do know the name of some tags of interest within the XML document. This is particularly useful for schemaless XML documents or for documents with many nested levels of nodes.6 The

1. /company
2. /company/department
3. //employee [employeeSalary gt 70000]/employeeName
4. /company/employee [employeeSalary gt 70000]/employeeName
5. /company/project/projectWorker [hours ge 20.0]

FIGURE 26.14 Some examples of XPath expressions on XML documents that follow the XML schema file COMPANY in Figure 26.5.

6. We are using the terms node, tag, and element interchangeably here.

expression returns all employeeName nodes that are direct children of an employee node, such that the employee node has another child element employeeSalary whose value is greater than 70000. This illustrates the use of qualifier conditions, which restrict the nodes selected by the XPath expression to those that satisfy the condition. XPath has a number of comparison operations for use in qualifier conditions, including standard arithmetic, string, and set comparison operations.

The fourth XPath expression should return the same result as the previous one, exceptthat we specified the full path name in this example The fifth expression in Figure26.14

returns all p roj ectWo rke r nodes and their descendant nodes that are children under a path/company/project and have a child node hours with a value greater than20.0hours
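The effect of the third expression can be imitated in a general-purpose language. The following is only a minimal sketch in Python, assuming a small hypothetical COMPANY fragment whose element names follow Figure 26.14; the sample employees and the helper name high_earner_names are illustrative and do not come from the figure. It uses the standard xml.etree.ElementTree module, which does not evaluate the gt operator itself, so the qualifier condition is applied by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical COMPANY fragment; the element names follow Figure 26.14,
# the data values are made up for illustration.
COMPANY_XML = """
<company>
  <employee>
    <employeeName><firstName>John</firstName><lastName>Smith</lastName></employeeName>
    <employeeSalary>30000</employeeSalary>
  </employee>
  <employee>
    <employeeName><firstName>James</firstName><lastName>Borg</lastName></employeeName>
    <employeeSalary>85000</employeeSalary>
  </employee>
</company>
"""


def high_earner_names(root, threshold=70000):
    """Imitate //employee[employeeSalary gt 70000]/employeeName."""
    names = []
    # iter("employee") visits employee elements at any depth, in document
    # order, which plays the role of the // step.
    for emp in root.iter("employee"):
        salary = emp.findtext("employeeSalary")
        # The qualifier condition: keep only employees above the threshold.
        if salary is not None and float(salary) > threshold:
            # findall with a bare tag name returns direct children only.
            names.extend(emp.findall("employeeName"))
    return names


root = ET.fromstring(COMPANY_XML)
result = [name.findtext("lastName") for name in high_earner_names(root)]
print(result)
```

Because the employee elements are visited in document order, the selected employeeName nodes come back in the same order in which they appear in the document tree.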

26.5.2 XQuery: Specifying Queries in XML

XPath allows us to write expressions that select nodes from a tree-structured XML document. XQuery permits the specification of more general queries on one or more XML documents. The typical form of a query in XQuery is known as a FLWR expression, which stands for the four main clauses of XQuery and has the following form:

FOR <variable bindings to individual nodes (elements)>

LET <variable bindings to collections of nodes (elements)>

WHERE <qualifier conditions>

RETURN <query result specification>

Figure 26.15 includes some examples of queries in XQuery that can be specified on XML instance documents that follow the XML schema document in Figure 26.5. The first query retrieves the first and last names of employees who earn more than $70,000. The variable $x is bound to each employeeName element that is a child of an employee element, but only for employee elements that satisfy the qualifier that their employeeSalary value is greater than $70,000. The result retrieves the firstName and lastName child elements of the selected employeeName elements. The second query is an alternative way of retrieving the same elements retrieved by the first query.

1. FOR $x IN
   doc(www.company.com/info.xml)//employee [employeeSalary gt 70000]/employeeName
   RETURN <res> $x/firstName, $x/lastName </res>

2. FOR $x IN
   doc(www.company.com/info.xml)/company/employee
   WHERE $x/employeeSalary gt 70000
   RETURN <res> $x/employeeName/firstName,
   $y/employeeName/lastName, $x/hours </res>

FIGURE 26.15 Some examples of XQuery queries on XML documents that follow the XML schema file COMPANY in Figure 26.5.

The third query illustrates how a join operation can be performed by having more than one variable. Here, the $x variable is bound to each projectWorker element that is a child of project number 5, whereas the $y variable is bound to each employee element. The join condition matches SSN values in order to retrieve the employee names.
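The join just described can be mirrored outside XQuery with two nested iteration variables. The following is a minimal sketch in Python, not the query itself; it assumes a hypothetical instance document in which the SSN is stored in a child element named ssn and the project number in projectNumber. Those two element names, the sample data, and the helper project_workers are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document. The element names ssn and projectNumber
# are assumptions; the sample values are made up.
DOC = """
<company>
  <employee>
    <employeeName><firstName>John</firstName><lastName>Smith</lastName></employeeName>
    <ssn>123456789</ssn>
  </employee>
  <employee>
    <employeeName><firstName>Joyce</firstName><lastName>English</lastName></employeeName>
    <ssn>453453453</ssn>
  </employee>
  <project>
    <projectNumber>5</projectNumber>
    <projectWorker><ssn>123456789</ssn><hours>32.5</hours></projectWorker>
  </project>
  <project>
    <projectNumber>7</projectNumber>
    <projectWorker><ssn>453453453</ssn><hours>20.0</hours></projectWorker>
  </project>
</company>
"""

root = ET.fromstring(DOC)


def project_workers(company, number):
    """Yield the projectWorker children of the project with the given number."""
    for project in company.findall("project"):
        if project.findtext("projectNumber") == str(number):
            yield from project.findall("projectWorker")


# Two FOR-style variables: x ranges over the workers of project 5 and
# y over all employees; the condition joins them on matching SSN values.
rows = [
    (
        y.findtext("employeeName/firstName"),
        y.findtext("employeeName/lastName"),
        float(x.findtext("hours")),
    )
    for x in project_workers(root, 5)
    for y in root.findall("employee")
    if x.findtext("ssn") == y.findtext("ssn")
]
print(rows)
```

The nested comprehension examines every (x, y) pair, which is the simplest reading of a two-variable query; a real query processor would be free to pick a more efficient join strategy.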

This concludes our brief introduction to XQuery. The interested reader is referred to the Web site www.w3.org, which contains documents describing the latest standards related to XML.

This chapter gave an overview of the standard for representing and exchanging data over the Internet. We started by discussing the differences between structured, semistructured, and unstructured data, then discussed why there was a need for a specification language such as XML. We described the XML standard and its tree-structured (hierarchical) data model, and discussed XML documents and the languages for specifying the structure of these documents, namely, XML DTD (Document Type Definition) and XML schema. We then gave an overview of the various approaches for storing XML documents, whether in their native (text) format, in a compressed form, or in relational and other types of databases, and discussed the mapping issues that arise when there is a need to convert data stored in traditional databases into XML documents. Finally, we gave an overview of the XPath and XQuery languages proposed for querying XML data.

26.3 What are the differences between the use of tags in XML versus HTML?

26.4 What is the difference between data-centric and document-centric XML

documents?

26.5 What is the difference between attributes and elements in XML? List some of the important attributes used in specifying elements in XML schema.

26.6 What is the difference between XML schema and XML DTD?


26.7 Create an XML instance document to correspond to the data stored in the relational database shown in Figure 5.6 such that the XML document conforms to the XML schema document in Figure 26.5.

26.8 Create XML schema documents to correspond to the hierarchies shown in Figures 26.12 and 26.13 part (3).

26.9 Consider the LIBRARY relational database schema of Figure 5.20. Create an XML schema document that corresponds to this database schema.

26.10 Specify the following views as queries in XQuery on the COMPANY XML schema shown in Figure 26.5.

a. A view that has the department name, manager name, and manager salary for every department.

b. A view that has the employee name, supervisor name, and employee salary for each employee who works in the Research department.

c. A view that has the project name, controlling department name, number of employees, and total hours worked per week on the project for each project.

d. A view that has the project name, controlling department name, number of employees, and total hours worked per week on the project for each project with more than one employee working on it.

Selected Bibliography

There are so many articles and books on various aspects of XML that it would be impossible to make even a modest list. We will mention one book: Chaudhri, Rashid, and Zicari, eds. (2003). This book discusses various aspects of XML and contains a list of some recent references to XML research and practice.

Over the last three decades, many organizations have generated a large amount of machine-readable data in the form of files and databases. To process this data, we have the database technology available that supports query languages like SQL. The problem with SQL is that it is a structured language that assumes the user is aware of the database schema. SQL supports operations of relational algebra that allow a user to select rows and columns of data from tables or join related information from tables based on common fields. In the next chapter, we shall see that data warehousing technology affords several types of functionality: that of consolidation, aggregation, and summarization of data. Data warehouses let us view the same information along multiple dimensions. In this chapter, we will focus our attention on another very popular area of interest known as data mining. As the term connotes, data mining refers to the mining or discovery of new information in terms of patterns or rules from vast amounts of data. To be practically useful, data mining must be carried out efficiently on large files and databases. To date, it is not well integrated with database management systems.

We will briefly review the state of the art of this rather extensive field of data mining, which uses techniques from such areas as machine learning, statistics, neural networks, and genetic algorithms. We will highlight the nature of the information that is discovered, the types of problems faced when trying to mine databases, and the types of applications of data mining. We also survey the state of the art of a large number of commercial tools available (see Section 26.2.5) and describe a number of research advances that are needed to make this area viable.


27.1 OVERVIEW OF DATA MINING TECHNOLOGY

In reports such as the very popular Gartner Report, data mining has been hailed as one of the top technologies for the near future. In this section we relate data mining to the broader area called knowledge discovery and contrast the two by means of an illustrative example.

Data Mining versus Data Warehousing. The goal of a data warehouse (see Chapter 28) is to support decision making with data. Data mining can be used in conjunction with a data warehouse to help with certain types of decisions. Data mining can be applied to operational databases with individual transactions. To make data mining more efficient, the data warehouse should have an aggregated or summarized collection of data. Data mining helps in extracting meaningful new patterns that cannot necessarily be found by merely querying or processing data or metadata in the data warehouse. Data mining applications should therefore be strongly considered early, during the design of a data warehouse. Also, data mining tools should be designed to facilitate their use in conjunction with data warehouses. In fact, for very large databases running into terabytes of data, successful use of data mining applications will depend first on the construction of a data warehouse.

Data Mining as a Part of the Knowledge Discovery Process. Knowledge Discovery in Databases, frequently abbreviated as KDD, typically encompasses more than data mining. The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.

As an example, consider a transaction database maintained by a specialty consumer goods retailer. Suppose the client data includes a customer name, zip code, phone number, date of purchase, item code, price, quantity, and total amount. A variety of new knowledge can be discovered by KDD processing on this client database. During data selection, data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected. The data cleansing process then may correct invalid zip codes or eliminate records with incorrect phone prefixes. Enrichment typically enhances the data with additional sources of information. For example, given the client names and phone numbers, the store may purchase other data about age, income, and credit rating and append them to each record. Data transformation and encoding may be done to reduce the amount of data. For instance, item codes may be grouped in terms of product categories into audio, video, supplies, electronic gadgets, camera, accessories, and so on. Zip codes may be aggregated into geographic regions, incomes may be divided into ranges, and so on. In Figure 28.1, we will show a step called cleaning as a precursor to the data warehouse creation. If data mining is based on an existing warehouse for this retail store chain, we would expect that the cleaning has already been applied. It is only after such preprocessing that data mining techniques are used to mine different rules and patterns.
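The four preparation phases in this example can be pictured as a chain of small functions. The following is a minimal sketch in Python under several assumptions: the Purchase record, the lookup tables, the two income ranges, and the use of the first zip-code digit as a "region" are all hypothetical choices made for illustration, and cleansing here simply drops records with malformed zip codes rather than correcting them.

```python
import re
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Purchase:
    """Hypothetical, cut-down client record."""
    customer: str
    zip_code: str
    item_code: str
    total: float
    income: Optional[int] = None  # filled in by the enrichment step


# Hypothetical lookup tables.
ITEM_CATEGORY = {"A100": "audio", "V200": "video", "C300": "camera"}
PURCHASED_INCOME = {"Ann": 48000, "Bob": 95000}


def select(records, wanted_categories):
    """Data selection: keep only records in the categories of interest."""
    return [r for r in records if ITEM_CATEGORY.get(r.item_code) in wanted_categories]


def cleanse(records):
    """Data cleansing (simplified): drop records whose zip code is not 5 digits."""
    return [r for r in records if re.fullmatch(r"\d{5}", r.zip_code)]


def enrich(records):
    """Enrichment: append externally obtained income data where available."""
    return [replace(r, income=PURCHASED_INCOME.get(r.customer)) for r in records]


def income_range(income):
    """Encode an income as one of two illustrative ranges."""
    if income is None:
        return "unknown"
    return "under 50k" if income < 50000 else "50k and over"


def transform(records):
    """Transformation: reduce each record to (region, category, income range)."""
    return [
        (r.zip_code[0], ITEM_CATEGORY.get(r.item_code, "other"), income_range(r.income))
        for r in records
    ]


raw = [
    Purchase("Ann", "30332", "V200", 499.0),
    Purchase("Bob", "3033X", "C300", 250.0),  # malformed zip code
    Purchase("Cy", "76019", "A100", 80.0),    # no purchased income data
    Purchase("Dee", "10001", "S900", 12.0),   # item outside the selected categories
]

prepared = transform(enrich(cleanse(select(raw, {"audio", "video", "camera"}))))
print(prepared)
```

Each step takes and returns a plain list, so the phases can be reordered or skipped; for instance, the cleansing step could be omitted when the data come from a warehouse that has already been cleaned.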

The result of mining may be to discover the following types of "new" information:

a. Association rules: for example, whenever a customer buys video equipment, he or she also buys another electronic gadget.

b. Sequential patterns: for example, suppose a customer buys a camera, and within three months he or she buys photographic supplies; then within six months he or she is likely to buy an accessory item. This defines a sequential pattern of transactions. A customer who buys more than twice in the lean periods may be likely to buy at least once during the Christmas period.

c. Classification trees: for example, customers may be classified by frequency of visits, by types of financing used, by amount of purchase, or by affinity for types of items, and some revealing statistics may be generated for such classes.

We can see that many possibilities exist for discovering new knowledge about buying patterns, relating factors such as age, income group, and place of residence to what and how much the customers purchase. This information can then be utilized to plan additional store locations based on demographics, to run store promotions, to combine items in advertisements, or to plan seasonal marketing strategies. As this retail store example shows, data mining must be preceded by significant data preparation before it can yield useful information that can directly influence business decisions.

The results of data mining may be reported in a variety of formats, such as listings, graphic outputs, summary tables, or visualizations.

Data mining is typically carried out with some end goals or applications. Broadly speaking, these goals fall into the following classes: prediction, identification, classification, and optimization.

• Prediction: Data mining can show how certain attributes within the data will behave in the future. Examples of predictive data mining include the analysis of buying transactions to predict what consumers will buy under certain discounts, how much sales volume a store would generate in a given period, and whether deleting a product line would yield more profits. In such applications, business logic is used coupled with data mining. In a scientific context, certain seismic wave patterns may predict an earthquake with high probability.

• Identification: Data patterns can be used to identify the existence of an item, an event, or an activity. For example, intruders trying to break a system may be identified by the programs executed, files accessed, and CPU time per session. In biological applications, existence of a gene may be identified by certain sequences of nucleotide symbols in the DNA sequence. The area known as authentication is a form of identification. It ascertains whether a user is indeed a specific user or one from an authorized class, and involves a comparison of parameters or images or signals against a database.

• Classification: Different classes or categories can be identified based on combinations of parameters. For example, customers in a supermarket can be categorized into discount-seeking shoppers, shoppers in a rush, loyal regular shoppers, shoppers attached to name brands, and infrequent shoppers. This classification may be used in different analyses of customer buying transactions as a post-mining activity. Sometimes classification based on common domain knowledge is used as an input to decompose the mining problem and make it simpler. For instance, health foods, party foods, or school lunch foods are distinct categories in the supermarket business. It makes sense to analyze relationships within and across categories as separate problems. Such categorization may be used to encode the data appropriately before subjecting it to further data mining.

• Optimization: One eventual goal of data mining may be to optimize the use of limited resources such as time, space, money, or materials and to maximize output variables such as sales or profits under a given set of constraints. As such, this goal of data mining resembles the objective function used in operations research problems that deal with optimization under constraints.

The term data mining is popularly being used in a very broad sense. In some situations it includes statistical analysis and constrained optimization as well as machine learning. There is no sharp line separating data mining from these disciplines. It is beyond our scope, therefore, to discuss in detail the entire range of applications that make up this vast body of work. For a detailed understanding of the area, readers are referred to specialized books devoted to data mining.

Types of Knowledge Discovered During Data Mining. The term "knowledge" is very broadly interpreted as involving some degree of intelligence. There is a progression from raw data to information to knowledge as we go through additional processing. Knowledge is often classified as inductive versus deductive. Deductive knowledge deduces new information based on applying pre-specified logical rules of deduction on the given data. Data mining addresses inductive knowledge, which discovers new rules and patterns from the supplied data. Knowledge can be represented in many forms: In an unstructured sense, it can be represented by rules or propositional logic. In a structured form, it may be represented in decision trees, semantic networks, neural networks, or hierarchies of classes or frames. It is common to describe the knowledge discovered during data mining in five ways, as follows.

• Association rules: These rules correlate the presence of a set of items with another range of values for another set of variables. Examples: (1) When a female retail shopper buys a handbag, she is likely to buy shoes. (2) An X-ray image containing characteristics a and b is likely to also exhibit characteristic c.

• Classification hierarchies: The goal is to work from an existing set of events or transactions to create a hierarchy of classes. Examples: (1) A population may be divided into five ranges of credit worthiness based on a history of previous credit transactions. (2) A model may be developed for the factors that determine the desirability of location of a store on a 1-10 scale. (3) Mutual funds may be classified based on performance data using characteristics such as growth, income, and stability.


• Sequential patterns: A sequence of actions or events is sought. Example: If a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within the next 18 months. Detection of sequential patterns is equivalent to detecting associations among events with certain temporal relationships.

• Patterns within time series: Similarities can be detected within positions of a time series of data, which is a sequence of data taken at regular intervals such as daily sales or daily closing stock prices. Examples: (1) Stocks of a utility company, ABC Power, and a financial company, XYZ Securities, showed the same pattern during 2002 in terms of closing stock price. (2) Two products show the same selling pattern in summer but a different one in winter. (3) A pattern in solar magnetic wind may be used to predict changes in earth atmospheric conditions.

• Clustering: A given population of events or items can be partitioned (segmented) into sets of "similar" elements. Examples: (1) An entire population of treatment data on a disease may be divided into groups based on the similarity of side effects produced. (2) The adult population in the United States may be categorized into five groups from "most likely to buy" to "least likely to buy" a new product. (3) The web accesses made by a collection of users against a set of documents (say, in a digital library) may be analyzed in terms of the keywords of documents to reveal clusters or categories of users.

For most applications, the desired knowledge is a combination of the above types. We expand on each of the above knowledge types in the following sections.

27.2.1 Market-Basket Model, Support, and Confidence

One of the major technologies in data mining involves the discovery of association rules. The database is regarded as a collection of transactions, each involving a set of items. A common example is that of market-basket data. Here the market basket corresponds to the set of items a consumer buys in a supermarket during one visit. Consider four such transactions in a random sample shown in Figure 27.1.

An association rule is of the form $X \Rightarrow Y$, where $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$ are sets of items, with $x_i$ and $y_j$ being distinct items for all $i$ and all $j$. This association states that if a customer buys $X$, he or she is also likely to buy $Y$. In general, any association rule has the form LHS (left-hand side) => RHS (right-hand side), where LHS and RHS are sets of items. The set LHS ∪ RHS is called an itemset, the set of items purchased by customers. For an association rule to be of interest to a data miner, the rule should satisfy some interest measure. Two common interest measures are support and confidence.
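Both measures can be computed directly from a list of transactions. The following is a minimal sketch in Python, assuming the usual definitions: support is the fraction of transactions that contain every item of LHS ∪ RHS, and confidence is the support of LHS ∪ RHS divided by the support of LHS. The four sample transactions and the function names are hypothetical and chosen only for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """support(LHS u RHS) / support(LHS); taken as 0.0 if LHS never occurs."""
    lhs_support = support(lhs, transactions)
    if lhs_support == 0:
        return 0.0
    return support(set(lhs) | set(rhs), transactions) / lhs_support


# Hypothetical market-basket sample: one set of items per store visit.
transactions = [
    frozenset({"milk", "bread", "cookies", "juice"}),
    frozenset({"milk", "juice"}),
    frozenset({"milk", "eggs"}),
    frozenset({"bread", "cookies", "coffee"}),
]

# Rule {milk} => {juice}: milk and juice appear together in 2 of 4 baskets,
# and milk appears in 3 of 4 baskets.
rule_support = support({"milk", "juice"}, transactions)
rule_confidence = confidence({"milk"}, {"juice"}, transactions)
print(rule_support, rule_confidence)
```

Note that support is symmetric in the two sides of a rule while confidence is not: in this sample, {juice} => {milk} has the same support as {milk} => {juice} but a confidence of 1.0, since every basket with juice also contains milk.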

The support for a rule LHS => RHS is with respect to the itemset: it refers to how frequently a specific itemset occurs in the database. That is, the support is the percentage
