25.7 DISTRIBUTED DATABASES IN ORACLE
In the client-server architecture, the Oracle database system is divided into two parts: (1) a front-end as the client portion, and (2) a back-end as the server portion. The client portion is the front-end database application that interacts with the user. The client has no data access responsibility and merely handles the requesting, processing, and presentation of data managed by the server. The server portion runs Oracle and handles the functions related to concurrent shared access. It accepts SQL and PL/SQL statements originating from client applications, processes them, and sends the results back to the client. Oracle client-server applications provide location transparency by making the location of data transparent to users; several features like views, synonyms, and procedures contribute to this. Global naming is achieved by using <TABLENAME@DATABASENAME> to refer to tables uniquely.
Oracle uses a two-phase commit protocol to deal with concurrent distributed transactions. The COMMIT statement triggers the two-phase commit mechanism. The RECO (recoverer) background process automatically resolves the outcome of those distributed transactions in which the commit was interrupted. The RECO of each local Oracle Server automatically commits or rolls back any "in-doubt" distributed transactions consistently on all involved nodes. For long-term failures, Oracle allows each local DBA to manually commit or roll back any in-doubt transactions and free up resources. Global consistency can be maintained by restoring the database at each site to a predetermined fixed point in the past.
Oracle's distributed database architecture is shown in Figure 25.9. A node in a distributed database system can act as a client, as a server, or both, depending on the situation. The figure shows two sites where databases called HQ (headquarters) and Sales are kept. For example, in the application shown running at the headquarters, for an SQL statement issued against local data (for example, DELETE FROM DEPT ...), the HQ computer acts as a server, whereas for a statement against remote data (for example, INSERT INTO EMP@SALES), the HQ computer acts as a client.
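The two-phase commit mechanism described in this section can be illustrated with a small simulation. The following is only a sketch of the general protocol, not of Oracle's implementation: the Participant class, its can_commit flag, and the two_phase_commit function are hypothetical names introduced here, and failures, timeouts, and the RECO recovery process are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical stand-in for one database node in a distributed transaction."""
    name: str
    can_commit: bool = True            # how this node will vote in phase 1
    state: str = "active"
    log: list = field(default_factory=list)

    def prepare(self) -> bool:
        # Phase 1: the node votes. A "yes" vote leaves it prepared (in doubt)
        # until the coordinator announces the global decision.
        self.state = "prepared" if self.can_commit else "aborted"
        self.log.append(self.state)
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"
        self.log.append(self.state)

    def rollback(self) -> None:
        self.state = "rolled back"
        self.log.append(self.state)


def two_phase_commit(participants) -> str:
    # Phase 1 (voting): ask every node to prepare.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): commit only if every vote was "yes";
    # otherwise roll back on all nodes so the outcome is consistent.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "rolled back"


hq, sales = Participant("HQ"), Participant("SALES")
outcome_ok = two_phase_commit([hq, sales])

hq2, sales2 = Participant("HQ"), Participant("SALES", can_commit=False)
outcome_fail = two_phase_commit([hq2, sales2])
print(outcome_ok, "/", outcome_fail)
```

In the second run a single "no" vote from the SALES node is enough to roll the transaction back at HQ as well, which is the all-or-nothing behavior the protocol is meant to guarantee.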
All Oracle databases in a distributed database system (DDBS) use Oracle's networking software Net8 for interdatabase communication. Net8 allows databases to communicate across networks to support remote and distributed transactions. It packages SQL statements into one of the many communication protocols to facilitate client-to-server communication and then packages the results back similarly to the client. Each database has a unique global name provided by a hierarchical arrangement of network domain names that is prefixed to the database name to make it unique.
Oracle supports database links that define a one-way communication path from one Oracle database to another. For example,
CREATE DATABASE LINK sales.us.americas;
establishes a connection to the sales database in Figure 25.9 under the network domain us that comes under domain americas.
Data in an Oracle DDBS can be replicated using snapshots or replicated master tables. Replication is provided at the following levels:
• Basic replication: Replicas of tables are managed for read-only access. For updates, data must be accessed at a single primary site.
[Figure 25.9 (diagram not reproduced): database servers at the HQ and Sales sites; the application transaction shown is INSERT INTO EMP@SALES ...; DELETE FROM DEPT ...; SELECT ... FROM EMP@SALES ...; COMMIT;]
FIGURE 25.9 Oracle distributed database systems. Source: From Oracle (1997a). Copyright © Oracle Corporation 1997. All rights reserved.
• Advanced (symmetric) replication: This extends beyond basic replication by allowing applications to update table replicas throughout a replicated DDBS. Data can be read and updated at any site. This requires additional software called Oracle's advanced replication option.
A snapshot generates a copy of a part of the table by means of a query called the snapshot defining query. A simple snapshot definition looks like this:
CREATE SNAPSHOT sales.orders AS
SELECT * FROM sales.orders@hq.us.americas;
Oracle groups snapshots into refresh groups. By specifying a refresh interval, the snapshot is automatically refreshed periodically at that interval by up to ten Snapshot Refresh Processes (SNPs). If the defining query of a snapshot contains a distinct or aggregate function, a GROUP BY or CONNECT BY clause, or join or set operations, the snapshot is termed a complex snapshot and requires additional processing. Oracle (up to version 7.3) also supports ROWID snapshots that are based on physical row identifiers of rows in the master table.
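The idea of a snapshot as a stored copy produced by a defining query, and brought up to date by a refresh, can be emulated in a few lines. The sketch below uses SQLite from Python's standard library purely for illustration; it runs in a single local database, so the remote master site and Oracle's refresh processes are not represented, and the table names orders and orders_snapshot and the helper refresh_snapshot are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Master table. In the text this would live at a remote site such as
# hq.us.americas; here it is local because SQLite has no database links.
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])

# The snapshot defining query (a "simple" one: no joins, grouping, or aggregates).
DEFINING_QUERY = "SELECT order_id, amount FROM orders"


def refresh_snapshot(cursor):
    """Complete refresh: rebuild the read-only copy from the defining query."""
    cursor.execute("DROP TABLE IF EXISTS orders_snapshot")
    cursor.execute(f"CREATE TABLE orders_snapshot AS {DEFINING_QUERY}")


refresh_snapshot(cur)

# The master changes after the refresh, so the snapshot is temporarily stale.
cur.execute("INSERT INTO orders VALUES (3, 7.25)")
stale = cur.execute("SELECT COUNT(*) FROM orders_snapshot").fetchone()[0]

# The next refresh (periodic in a real system) brings the copy up to date.
refresh_snapshot(cur)
fresh = cur.execute("SELECT COUNT(*) FROM orders_snapshot").fetchone()[0]
print(stale, fresh)
```

The gap between the two counts shows why the refresh interval matters: between refreshes, readers of the snapshot see the master table as of the last refresh.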
Heterogeneous Databases in Oracle. In a heterogeneous DDBS, at least one database is a non-Oracle system. Oracle Open Gateways provides access to a non-Oracle database from an Oracle server, which uses a database link to access data or to execute remote procedures in the non-Oracle system. The Open Gateways feature includes the following:
• Distributed transactions: Under the two-phase commit mechanism, transactions may span Oracle and non-Oracle systems.
• Transparent SQL access: SQL statements issued by an application are transparently transformed into SQL statements understood by the non-Oracle system.
• Pass-through SQL and stored procedures: An application can directly access a non-Oracle system using that system's version of SQL. Stored procedures in a non-Oracle SQL-based system are treated as if they were PL/SQL remote procedures.
• Global query optimization: Cardinality information, indexes, etc., at the non-Oracle system are accounted for by the Oracle Server query optimizer to perform global query optimization.
• Procedural access: Procedural systems like messaging or queuing systems are accessed by the Oracle server using PL/SQL remote procedure calls.
In addition to the above, data dictionary references are translated to make the non-Oracle data dictionary appear as a part of the Oracle Server's dictionary. Character set translations are done between national language character sets to connect multilingual databases.
In this chapter we provided an introduction to distributed databases. This is a very broad topic, and we discussed only some of the basic techniques used with distributed databases. We first discussed the reasons for distribution and the potential advantages of distributed databases over centralized systems. We also defined the concept of distribution transparency and the related concepts of fragmentation transparency and replication transparency. We discussed the design issues related to data fragmentation, replication, and distribution, and we distinguished between horizontal and vertical fragments of relations. We discussed the use of data replication to improve system reliability and availability. We categorized DDBMSs by using criteria such as degree of homogeneity of software modules and degree of local autonomy. We discussed the issues of federated database management in some detail, focusing on the needs of supporting various types of autonomies and dealing with semantic heterogeneity.
We illustrated some of the techniques used in distributed query processing, and discussed the cost of communication among sites, which is considered a major factor in distributed query optimization. We compared different techniques for executing joins and presented the semijoin technique for joining relations that reside on different sites. We briefly discussed the concurrency control and recovery techniques used in DDBMSs. We reviewed some of the additional problems that must be dealt with in a distributed environment that do not appear in a centralized environment.
We then discussed the client-server architecture concepts and related them to distributed databases, and we described some of the facilities in Oracle to support distributed databases.
Review Questions
25.1 What are the main reasons for and potential advantages of distributed databases?
25.2 What additional functions does a DDBMS have over a centralized DBMS?
25.3 What are the main software modules of a DDBMS? Discuss the main functions of
each of these modules in the context of the client-server architecture.
25.4 What is a fragment of a relation? What are the main types of fragments? Why is
fragmentation a useful concept in distributed database design?
25.5 Why is data replication useful in DDBMSs? What typical units of data are
replicated?
25.6 What is meant by data allocation in distributed database design? What typical
units of data are distributed over sites?
25.7 How is a horizontal partitioning of a relation specified? How can a relation be put
back together from a complete horizontal partitioning?
25.8 How is a vertical partitioning of a relation specified? How can a relation be put
back together from a complete vertical partitioning?
25.9 Discuss what is meant by the following terms: degree of homogeneity of a DDBMS, degree of local autonomy of a DDBMS, federated DBMS, distribution transparency, fragmentation transparency, replication transparency, multidatabase system.
25.10 Discuss the naming problem in distributed databases.
25.11 Discuss the different techniques for executing an equijoin of two files located at different sites. What main factors affect the cost of data transfer?
25.12 Discuss the semijoin method for executing an equijoin of two files located at different sites. Under what conditions is an equijoin strategy efficient?
25.13 Discuss the factors that affect query decomposition. How are guard conditions and attribute lists of fragments used during the query decomposition process?
25.14 How is the decomposition of an update request different from the decomposition of a query? How are guard conditions and attribute lists of fragments used during the decomposition of an update request?
25.15 Discuss the factors that do not appear in centralized systems that affect concurrency control and recovery in distributed systems.
25.16 Compare the primary site method with the primary copy method for distributed concurrency control. How does the use of backup sites affect each?
25.17 When are voting and elections used in distributed databases?
25.18 What are the software components in a client-server DDBMS? Compare the two-tier and three-tier client-server architectures.
a. For each employee in department 5, retrieve the employee name and the names of the employee's dependents.
b. Print the names of all employees who work in department 5 but who work on some project not controlled by department 5.
25.20 Consider the following relations:
BOOKS (Book#, Primary_author, Topic, Total_stock, $price)
BOOKSTORE (Store#, City, State, Zip, Inventory_value)
STOCK (Store#, Book#, Qty)
TOTAL_STOCK is the total number of books in stock, and INVENTORY_VALUE is the total inventory value for the store in dollars.
a. Give an example of two simple predicates that would be meaningful for the BOOKSTORE relation for horizontal partitioning.
b. How would a derived horizontal partitioning of STOCK be defined based on the partitioning of BOOKSTORE?
c. Show predicates by which BOOKS may be horizontally partitioned by topic.
d. Show how the STOCK may be further partitioned from the partitions in (b) by adding the predicates in (c).
25.21 Consider a distributed database for a bookstore chain called National Books with 3 sites called EAST, MIDDLE, and WEST. The relation schemas are given in question 25.20. Consider that BOOKS are fragmented by $PRICE amounts into:
B1: BOOK1: up to $20
B2: BOOK2: from $20.01 to $50
B3: BOOK3: from $50.01 to $100
B4: BOOK4: $100.01 and above
Similarly, BOOKSTORES are divided by Zipcodes into:
S1: EAST: Zipcodes up to 35000
S2: MIDDLE: Zipcodes 35001 to 70000
S3: WEST: Zipcodes 70001 to 99999
Assume that STOCK is a derived fragment based on BOOKSTORE only.
a. Consider the query:
SELECT Book#, Total_stock
FROM Books
WHERE $price > 15 and $price < 55;
Assume that fragments of BOOKSTORE are non-replicated and assigned based on region. Assume further that BOOKS are allocated as:
b. If the bookprice of BOOK# = 1234 is updated from $45 to $55 at site MIDDLE, what updates does that generate? Write in English and then in SQL.
c. Give an example query issued at WEST that will generate a subquery for MIDDLE.
d. Write a query involving selection and projection on the above relations and show two possible query trees that denote different ways of execution.
25.22 Consider that you have been asked to propose a database architecture in a large organization, General Motors, as an example, to consolidate all data including legacy databases (from Hierarchical and Network models, which are explained in Appendices C and D; no specific knowledge of these models is needed) as well as relational databases, which are geographically distributed so that global applications can be supported. Assume that alternative one is to keep all databases as they are, while alternative two is to first convert them to relational and then support the applications over a distributed integrated database.
a. Draw two schematic diagrams for the above alternatives showing the linkages among appropriate schemas. For alternative one, choose the approach of providing export schemas for each database and constructing unified schemas for each application.
b. List the steps one has to go through under each alternative from the present situation until global applications are viable.
c. Compare these from the issues of: (i) design time considerations, and (ii) run-time considerations.
Selected Bibliography
The textbooks by Ceri and Pelagatti (1984a) and Ozsu and Valduriez (1999) are devoted to distributed databases. Halsaal (1996), Tannenbaum (1996), and Stallings (1997) are textbooks on data communications and computer networks. Comer (1997) discusses networks and internets. Dewire (1993) is a textbook on client-server computing. Ozsu et al. (1994) has a collection of papers on distributed object management.
Distributed database design has been addressed in terms of horizontal and vertical fragmentation, allocation, and replication. Ceri et al. (1982) defined the concept of minterm horizontal fragments. Ceri et al. (1983) developed an integer programming based optimization model for horizontal fragmentation and allocation. Navathe et al. (1984) developed algorithms for vertical fragmentation based on attribute affinity and showed a variety of contexts for vertical fragment allocation. Wilson and Navathe (1986) present an analytical model for optimal allocation of fragments. Elmasri et al. (1987) discuss fragmentation for the ECR model; Karlapalem et al. (1994) discuss issues for distributed design of object databases. Navathe et al. (1996) discuss mixed fragmentation by combining horizontal and vertical fragmentation; Karlapalem et al. (1996) present a model for redesign of distributed databases.
Distributed query processing, optimization, and decomposition are discussed in Hevner and Yao (1979), Kerschberg et al. (1982), Apers et al. (1983), Ceri and Pelagatti (1984), and Bodorick et al. (1992). Bernstein and Goodman (1981) discuss the theory behind semijoin processing. Wong (1983) discusses the use of relationships in relation fragmentation. Concurrency control and recovery schemes are discussed in Bernstein and Goodman (1981a). Kumar and Hsu (1998) have some articles related to recovery in distributed databases. Elections in distributed systems are discussed in Garcia-Molina (1982). Lamport (1978) discusses problems with generating unique timestamps in a distributed system.
A concurrency control technique for replicated data that is based on voting is presented by Thomas (1979). Gifford (1979) proposes the use of weighted voting, and Paris (1986) describes a method called voting with witnesses. Jajodia and Mutchler (1990) discuss dynamic voting. A technique called available copy is proposed by Bernstein and Goodman (1984), and one that uses the idea of a group is presented in El Abbadi and Toueg (1988). Other recent work that discusses replicated data includes Gladney (1989), Agrawal and El Abbadi (1990), El Abbadi and Toueg (1990), Kumar and Segev (1993), Mukkamala (1989), and Wolfson and Milo (1991). Bassiouni (1988) discusses optimistic protocols for DDB concurrency control. Garcia-Molina (1983) and Kumar and Stonebraker (1987) discuss techniques that use the semantics of the transactions. Distributed concurrency control techniques based on locking and distinguished copies are presented by Menasce et al. (1980) and Minoura and Wiederhold (1982). Obermark (1982) presents algorithms for distributed deadlock detection.
A survey of recovery techniques in distributed systems is given by Kohler (1981). Reed (1983) discusses atomic actions on distributed data. A book edited by Bhargava (1987) presents various approaches and techniques for concurrency and reliability in distributed systems.
Federated database systems were first defined in McLeod and Heimbigner (1985). Techniques for schema integration in federated databases are presented by Elmasri et al. (1986), Batini et al. (1986), Hayne and Ram (1990), and Motro (1987). Elmagarmid and Helal (1988) and Gamal-Eldin et al. (1988) discuss the update problem in heterogeneous DDBSs. Heterogeneous distributed database issues are discussed in Hsiao and Kamel (1989). Sheth and Larson (1990) present an exhaustive survey of federated database management.
Recently, multidatabase systems and interoperability have become important topics. Techniques for dealing with semantic incompatibilities among multiple databases are examined in DeMichiel (1989), Siegel and Madnick (1991), Krishnamurthy et al. (1991), and Wang and Madnick (1989). Castano et al. (1998) present an excellent survey of techniques for analysis of schemas. Pitoura et al. (1995) discuss object orientation in multidatabase systems.
Transaction processing in multidatabases is discussed in Mehrotra et al. (1992), Georgakopoulos et al. (1991), Elmagarmid et al. (1990), and Brietbart et al. (1990), among others. Elmagarmid et al. (1992) discuss transaction processing for advanced applications, including engineering applications discussed in Heiler et al. (1992).
The workflow systems, which are becoming popular to manage information in complex organizations, use multilevel and nested transactions in conjunction with distributed databases. Weikum (1991) discusses multilevel transaction management. Alonso et al. (1997) discuss limitations of current workflow systems.
A number of experimental distributed DBMSs have been implemented. These include distributed INGRES (Epstein et al., 1978), DDTS (Devor and Weeldreyer, 1980), SDD-1 (Rothnie et al., 1980), System R* (Lindsay et al., 1984), SIRIUS-DELTA (Ferrier and Stangret, 1982), and MULTIBASE (Smith et al., 1981). The OMNIBASE system (Rusinkiewicz et al., 1988) and the Federated Information Base developed using the Candide data model (Navathe et al., 1994) are examples of federated DDBMSs. Pitoura et al. (1995) present a comparative survey of the federated database system prototypes. Most commercial DBMS vendors have products using the client-server approach and offer distributed versions of their systems. Some system issues concerning client-server DBMS architectures are discussed in Carey et al. (1991), DeWitt et al. (1990), and Wang and Rowe (1991). Khoshafian et al. (1992) discuss design issues for relational DBMSs in the client-server environment. Client-server management issues are discussed in many books, such as Zantinge and Adriaans (1996).
EMERGING TECHNOLOGIES
We now turn our attention to how databases are used and accessed from the Internet. Many electronic commerce (e-commerce) and other Internet applications provide Web interfaces to access information stored in one or more databases. These databases are often referred to as data sources. It is common to use two-tier and three-tier client-server architectures for Internet applications (see Section 2.5). In some cases, other variations of the client-server model are used. E-commerce and other Internet database applications are designed to interact with the user through Web interfaces that display Web pages. The common method of specifying the contents and formatting of Web pages is through the use of hypertext documents. There are various languages for writing these documents, the most common being HTML (Hypertext Markup Language). Although HTML is widely used for formatting and structuring Web documents, it is not suitable for specifying structured data. A newer language, XML (Extensible Markup Language), has emerged as the standard for structuring and exchanging data over the Web. XML can be used to provide information about the structure and meaning of the data in the Web pages rather than just specifying how the Web pages are formatted for display on the screen. The formatting aspects are specified separately, for example, by using a formatting language such as XSL (Extensible Stylesheet Language).
This chapter describes the basics of accessing and exchanging information over the Internet. We start in Section 26.1 by discussing how traditional Web pages differ from structured databases, and discuss the differences between structured, semistructured, and unstructured data. Then in Section 26.2 we turn our attention to the XML standard and its tree-structured (hierarchical) data model. Section 26.3 discusses XML documents and the languages for specifying the structure of these documents, namely, XML DTD (Document Type Definition) and XML schema. Section 26.4 presents the various approaches for storing XML documents, whether in their native (text) format, in a compressed form, or in relational and other types of databases. Section 26.5 gives an overview of the languages proposed for querying XML data. Section 26.6 summarizes the chapter.
26.1 STRUCTURED, SEMISTRUCTURED, AND UNSTRUCTURED DATA
The information stored in databases is known as structured data because it is represented in a strict format. For example, each record in a relational database table, such as the EMPLOYEE table in Figure 5.6, follows the same format as the other records in that table. For structured data, it is common to carefully design the database using techniques such as those described in Chapters 3, 4, 7, 10, and 11 in order to create the database schema. The DBMS then checks to ensure that all data follows the structures and constraints specified in the schema.
However, not all data is collected and inserted into carefully designed structured databases. In some applications, data is collected in an ad-hoc manner before it is known how it will be stored and managed. This data may have a certain structure, but not all the information collected will have identical structure. Some attributes may be shared among the various entities, but other attributes may exist only in a few entities. Moreover, additional attributes can be introduced in some of the newer data items at any time, and there is no predefined schema. This type of data is known as semistructured data. A number of data models have been introduced for representing semistructured data, often based on using tree or graph data structures rather than the flat relational model structures.
A key difference between structured and semistructured data concerns how the schema constructs (such as the names of attributes, relationships, and entity types) are handled. In semistructured data, the schema information is mixed in with the data values, since each data object can have different attributes that are not known in advance. Hence, this type of data is sometimes referred to as self-describing data. Consider the following example. We want to collect a list of bibliographic references related to a certain research project. Some of these may be books or technical reports, others may be research articles in journals or conference proceedings, and still others may refer to complete journal issues or conference proceedings. Clearly, each of these may have different attributes and different types of information. Even for the same type of reference (say, conference articles) we may have different information. For example, one article citation may be quite complete, with full information about author names, title, proceedings, page numbers, and so on, whereas another citation may not have all the information available. New types of bibliographic sources may appear in the future (for example, references to Web pages or to conference tutorials) and these may have new attributes that describe them.
FIGURE 26.1 Representing semistructured data as a graph.
Semistructured data may be displayed as a directed graph, as shown in Figure 26.1. The information shown in Figure 26.1 corresponds to some of the structured data shown in Figure 5.6. As we can see, this model somewhat resembles the object model (see Figure 20.1) in its ability to represent complex objects and nested structures. In Figure 26.1, the labels or tags on the directed edges represent the schema names: the names of attributes, object types (or entity types or classes), and relationships. The internal nodes represent individual objects or composite attributes. The leaf nodes represent actual data values of simple (atomic) attributes.
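The graph view described above can be mimicked with nested Python dictionaries and lists, where dictionary keys play the role of edge labels, dictionaries are internal nodes, and plain values are leaves. This is only an illustrative sketch: the company data below and the walk helper are made up for the example and are not a reproduction of Figure 26.1.

```python
# Illustrative, self-describing data: the "schema" (the keys) travels with the
# values, and objects of the same kind need not carry the same attributes.
company = {
    "project": [
        {"name": "ProductX", "worker": [
            {"first_name": "John", "last_name": "Smith", "hours": 32.5},
            {"first_name": "Joyce", "hours": 20.0},      # no last_name recorded
        ]},
        {"name": "ProductY", "location": "Sugarland", "worker": []},
    ]
}


def walk(node, path=()):
    """Yield (edge_label_path, leaf_value) for every atomic value in the graph."""
    if isinstance(node, dict):          # internal node: follow each labeled edge
        for label, child in node.items():
            yield from walk(child, path + (label,))
    elif isinstance(node, list):        # several edges that share the same label
        for child in node:
            yield from walk(child, path)
    else:                               # leaf node: an atomic data value
        yield path, node


leaves = list(walk(company))
labels = sorted({label for path, _ in leaves for label in path})

for path, value in leaves:
    print("/".join(path), "=", value)
print(labels)
```

Note that the set of labels is discovered by traversing the data itself; nothing outside the data declares which attributes a project or a worker may have.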
There are two main differences between the semistructured model and the object model that we discussed in Chapter 20:
1. The schema information (names of attributes, relationships, and classes or object types) in the semistructured model is intermixed with the objects and their data values in the same data structure.
2. In the semistructured model, there is no requirement for a predefined schema to which the data objects must conform.
In addition to structured and semistructured data, a third category exists, known as unstructured data because there is very limited indication of the type of data. A typical example is a text document that contains information embedded within it. Web pages in HTML that contain some data are considered to be unstructured data. Consider part of an HTML file, shown in Figure 26.2. Text that appears between angled brackets, < ... >, is an HTML tag. A tag with a slash, </ ... >, indicates an end tag, which represents the
<html>
<head>
</head>
<body>
<H1>List of company projects and the employees in each project</H1>
<H2>The ProductX project:</H2>
<table width="100%" border=0 cellpadding=0 cellspacing=0>
<TR>
<TD width="50%"><font size="2" face="Arial">John Smith:</font></TD>
<TD>32.5 hours per week</TD>
</TR>
<TR>
<TD width="50%"><font size="2" face="Arial">Joyce English:</font></TD>
<TD>20.0 hours per week</TD>
</TR>
</table>
<H2>The ProductY project:</H2>
<table width="100%" border=0 cellpadding=0 cellspacing=0>
<TR>
<TD width="50%"><font size="2" face="Arial">John Smith:</font></TD>
<TD>7.5 hours per week</TD>
</TR>
<TR>
<TD width="50%"><font size="2" face="Arial">Joyce English:</font></TD>
<TD>20.0 hours per week</TD>
</TR>
<TR>
<TD width="50%"><font size="2" face="Arial">Franklin Wong:</font></TD>
<TD>10.0 hours per week</TD>
</TR>
</table>
</body>
</html>
FIGURE 26.2 Part of an HTML document representing unstructured data.
ending of the effect of a matching start tag. The tags mark up the document¹ in order to instruct an HTML processor how to display the text between a start tag and a matching end tag. Hence, the tags specify document formatting rather than the meaning of the various data elements in the document. HTML tags specify information, such as font size and style (boldface, italics, and so on), color, heading levels in documents, and so on. Some tags provide text structuring in documents, such as specifying a numbered or unnumbered list or a table. Even these structuring tags specify that the embedded textual data is to be displayed in a certain manner, rather than indicating the type of data represented in the table.
1. That is why it is known as Hypertext Markup Language.
HTML uses a large number of predefined tags, which are used to specify a variety of commands for formatting Web documents for display. The start and end tags specify the range of text to be formatted by each command. A few examples of the tags shown in Figure 26.2 follow:
• The <html> </html> tags specify the boundaries of the document.
• The document header information, within the <head> </head> tags, specifies various commands that will be used elsewhere in the document. For example, it may specify various script functions in a language such as JavaScript or PERL, or certain formatting styles (fonts, paragraph styles, header styles, and so on) that can be used in the document. It can also specify a title to indicate what the HTML file is for, and other similar information that will not be displayed as part of the document.
• The body of the document, specified within the <body> </body> tags, includes the document text and the markup tags that specify how the text is to be formatted and displayed. It can also include references to other objects, such as images, videos, voice messages, and other documents.
• The <H1> </H1> tags specify that the text is to be displayed as a level 1 heading. There are many heading levels (<H2>, <H3>, and so on), each displaying text in a less prominent heading format.
• The <table> </table> tags specify that the following text is to be displayed as a table. Each row in the table is enclosed within <TR> </TR> tags, and the actual text data in a row is displayed within <TD> </TD> tags.²
• Some tags may have attributes, which appear within the start tag and describe additional properties of the tag.³ In Figure 26.2, the <table> start tag has four attributes describing various characteristics of the table. The following <TD> and <font> start tags have one and two attributes, respectively.
HTML has a very large number of predefined tags, and whole books are devoted to describing how to use these tags. If designed properly, HTML documents can be formatted so that humans are able to easily understand the document contents, and are able to navigate through the resulting Web documents. However, the source HTML text documents are very difficult to interpret automatically by computer programs because they do not include schema information about the type of data in the documents. As e-commerce and other Internet applications become increasingly automated, it is becoming crucial to be able to exchange Web documents among various computer sites and to interpret their contents automatically. This need was one of the reasons that led to the development of XML, which we discuss in the next section.
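To make the difficulty of automatic interpretation concrete, the following sketch extracts the employee data from a fragment shaped like Figure 26.2 using Python's standard html.parser module. The CellCollector class and the pairing rule at the end are assumptions written for this example; the point is that the program must guess the meaning of each cell from its position and wording, because no tag states that a cell holds a name or a number of hours.

```python
from html.parser import HTMLParser

HTML = """
<H2>The ProductX project:</H2>
<table width="100%" border=0 cellpadding=0 cellspacing=0>
<TR><TD width="50%"><font size="2" face="Arial">John Smith:</font></TD>
<TD>32.5 hours per week</TD></TR>
<TR><TD width="50%"><font size="2" face="Arial">Joyce English:</font></TD>
<TD>20.0 hours per week</TD></TR>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text found inside each <TD> ... </TD> pair."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":                 # HTMLParser reports tag names in lowercase
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


parser = CellCollector()
parser.feed(HTML)
parser.close()
cells = [c.strip() for c in parser.cells]

# Layout-based guess: cells alternate between "Name:" and "N hours per week".
# Nothing in the markup confirms this; a different page layout would break it.
rows = [(cells[i].rstrip(":"), float(cells[i + 1].split()[0]))
        for i in range(0, len(cells), 2)]
print(rows)
```

Any change to the page design, such as adding a third column or rewording "hours per week", would silently invalidate the extraction rule, which is the fragility that self-describing tags are meant to avoid.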
2. <TR> stands for table row, and <TD> for table data.
3. This is how the term attribute is used in document markup languages, which differs from how it is used in database models.
26.2 XML HIERARCHICAL (TREE) DATA MODEL
We now introduce the data model used in XML. The basic object in XML is the XML document. Two main structuring concepts are used to construct an XML document: elements and attributes. It is important to note right away that the term attribute in XML is not used in the same manner as is customary in database terminology, but rather as it is used in document description languages such as HTML and SGML.⁴ Attributes in XML provide additional information that describes elements, as we shall see. There are additional concepts in XML, such as entities, identifiers, and references, but we first concentrate on describing elements and attributes to show the essence of the XML model.
Figure 26.3 shows an example of an XML element called <projects> As in HTML,
elements are identified in a document by their start tag and end tag The tag names areenclosed between angled brackets < >, and end tags are further identified by abackslash, </ >.5Complex elements are constructed from other elements hierarchically,whereas simple elements contain data values A major difference betweenXMLandHTML
is that XML tag names are defined to describe the meaning of the data elements in thedocument, rather than to describe how the text is to be displayed This makes it possible
to process the data elements in theXMLdocument automatically by computer programs
It is straightforward to see the correspondence between the XML textual representation shown in Figure 26.3 and the tree structure shown in Figure 26.1. In the tree representation, internal nodes represent complex elements, whereas leaf nodes represent simple elements. That is why the XML model is called a tree model or a hierarchical model. In Figure 26.3, the simple elements are the ones with the tag names <Name>, <Number>, <Location>, <DeptNo>, <SSN>, <LastName>, <FirstName>, and <hours>. The complex elements are the ones with the tag names <projects>, <project>, and <Worker>. In general, there is no limit on the levels of nesting of elements.
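The distinction between complex (internal) and simple (leaf) elements can be checked mechanically. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the sample document and its data values are made up for illustration and only imitate the shape of the <projects> element described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the <projects> element described in the text.
SAMPLE = """<projects>
  <project>
    <Name>ProductX</Name>
    <Number>1</Number>
    <Location>Bellaire</Location>
    <DeptNo>5</DeptNo>
    <Worker>
      <SSN>123456789</SSN>
      <LastName>Smith</LastName>
      <hours>32.5</hours>
    </Worker>
  </project>
</projects>"""


def classify(root):
    """Return (complex_tags, simple_tags) for the tree rooted at root."""
    complex_tags, simple_tags = set(), set()
    for elem in root.iter():
        # An element that has child elements is an internal node (complex);
        # an element that holds only a data value is a leaf node (simple).
        if len(elem) > 0:
            complex_tags.add(elem.tag)
        else:
            simple_tags.add(elem.tag)
    return complex_tags, simple_tags


complex_tags, simple_tags = classify(ET.fromstring(SAMPLE))
print(sorted(complex_tags))
print(sorted(simple_tags))
```

For this sample, the complex tags are projects, project, and Worker, which matches the classification given in the text.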
In general, it is possible to characterize three main types of XML documents:
follow a specific structure and hence may be extracted from a structured database. They are formatted as XML documents in order to exchange them or display them over the Web.
such as news articles or books. There are few or no structured data elements in these documents.
and other parts that are predominantly textual or unstructured.
4. SGML (Standard Generalized Markup Language) is a more general language for describing documents and provides capabilities for specifying new tags. However, it is more complex than HTML and XML.
5. The left and right angled bracket characters (< and >) are reserved characters, as are the ampersand (&), apostrophe ('), and double quotation mark ("). To include them within the text of a document, they must be encoded as &lt;, &gt;, &amp;, &apos;, and &quot;, respectively.
FIGURE 26.3 A complex XML element called <projects>.
It is important to note that data-centric XML documents can be considered either as semistructured data or as structured data. If an XML document conforms to a predefined XML schema or DTD (see Section 26.3), then the document can be considered as structured data. On the other hand, XML allows documents that do not conform to any schema; these would be considered as semistructured data. The latter are also known as schemaless XML documents. When the value of the standalone attribute in an XML document is "yes", as in the first line of Figure 26.3, the document is standalone and schemaless.
XML attributes are generally used in a manner similar to how they are used in HTML (see Figure 26.2), namely, to describe properties and characteristics of the elements (tags) within which they appear. It is also possible to use XML attributes to hold the values of simple data elements; however, this is definitely not recommended. We discuss XML attributes further in Section 26.3 when we discuss XML schema and DTD.
26.3.1 Well-Formed and Valid XML Documents and XML DTD
In Figure 26.3, we saw what a simple XML document may look like. An XML document is well formed if it follows a few conditions. In particular, it must start with an XML declaration to indicate the version of XML being used as well as any other relevant attributes, as shown in the first line of Figure 26.3. It must also follow the syntactic guidelines of the tree model. This means that there should be a single root element, and every element must include a matching pair of start and end tags within the start and end tags of the parent element. This ensures that the nested elements specify a well-formed tree structure.
A well-formed XML document is syntactically correct. This allows it to be processed by generic processors that traverse the document and create an internal tree representation. A standard set of API (application programming interface) functions called DOM (Document Object Model) allows programs to manipulate the resulting tree representation corresponding to a well-formed XML document. However, the whole document must be parsed beforehand when using DOM. Another API called SAX allows processing of XML documents on the fly by notifying the processing program whenever a start or end tag is encountered. This makes it easier to process large documents and allows for processing of so-called streaming XML documents, where the processing program can process the tags as they are encountered.
A well-formed XML document can have any tag names for the elements within the document. There is no predefined set of elements (tag names) that a program processing the document knows to expect. This gives the document creator the freedom to specify new elements, but limits the possibilities for automatically interpreting the elements within the document.
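The difference between the two APIs can be illustrated with a short sketch that uses Python's standard xml.dom.minidom (a DOM implementation) and xml.sax modules. The two sample strings and the helper names is_well_formed and TagLogger are assumptions introduced here for illustration; the standard-library parsers are non-validating, so they check well-formedness only.

```python
import xml.dom.minidom
import xml.sax
from xml.parsers.expat import ExpatError

DECL = '<?xml version="1.0" standalone="yes"?>'
GOOD = DECL + "<projects><project><Name>ProductX</Name></project></projects>"
# The <Name> and <project> end tags cross, so this is not a tree.
BAD = DECL + "<projects><project><Name>ProductX</project></Name></projects>"


def is_well_formed(text):
    """Return True if a generic, non-validating parser accepts the text."""
    try:
        # DOM style: the whole document is parsed into a tree up front.
        xml.dom.minidom.parseString(text)
        return True
    except ExpatError:
        return False


class TagLogger(xml.sax.ContentHandler):
    """SAX style: a callback fires for each start and end tag as it is read."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


handler = TagLogger()
xml.sax.parseString(GOOD.encode("utf-8"), handler)

print(is_well_formed(GOOD), is_well_formed(BAD))
print(handler.events)
```

The SAX handler never holds the whole tree in memory, which is why that style suits large or streaming documents.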
<!DOCTYPE projects [
<!ELEMENT projects (project+)>
<!ELEMENT project (Name, Number, Location, DeptNo?, Workers)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Number (#PCDATA)>
<!ELEMENT Location (#PCDATA)>
<!ELEMENT DeptNo (#PCDATA)>
<!ELEMENT Workers (Worker*)>
<!ELEMENT Worker (SSN, LastName?, FirstName?, hours)>
<!ELEMENT SSN (#PCDATA)>
<!ELEMENT LastName (#PCDATA)>
<!ELEMENT FirstName (#PCDATA)>
<!ELEMENT hours (#PCDATA)>
] >
FIGURE 26.4 An XML DTD file called projects.
A stronger criterion is for an XML document to be valid. In this case, the document must be well formed, and in addition the element names used in the start and end tag pairs must follow the structure specified in a separate XML DTD (Document Type Definition) file or XML schema file. We first discuss XML DTD here, then give an overview of XML schema in Section 26.3.2. Figure 26.4 shows a simple XML DTD file, which specifies the elements (tag names) and their nested structures. Any valid documents conforming to this DTD should follow the specified structure. A special syntax exists for specifying DTD files, as illustrated in Figure 26.4. First, a name is given to the root tag of the document, which is called projects in the first line of Figure 26.4. Then the elements and their nested structure are specified.
When specifying elements, the following notation is used:
• A * following the element name means that the element can be repeated zero or more times in the document. This kind of element is known as an optional multivalued (repeating) element.
• A + following the element name means that the element can be repeated one or more times in the document. This kind of element is a required multivalued (repeating) element.
• A ? following the element name means that the element can be repeated zero or one times. This kind is an optional single-valued (nonrepeating) element.
• An element appearing without any of the preceding three symbols must appear exactly once in the document. This kind is a required single-valued (nonrepeating) element.
• The type of the element is specified via parentheses following the element. If the parentheses include names of other elements, these latter elements are the children of the element in the tree structure. If the parentheses include the keyword #PCDATA or one of the other data types available in XML DTD, the element is a leaf node. PCDATA stands for parsed character data, which is roughly similar to a string data type.
• Parentheses can be nested when specifying elements.
• A bar symbol (e1 | e2) specifies that either e1 or e2 can appear in the document.
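The occurrence symbols in this notation behave much like regular-expression operators, which suggests a simple way to check the children of one element against its content model. The sketch below is an illustration under stated assumptions, not a DTD validator: the functions model_to_regex and children_match are hypothetical helpers, and they handle only content models made of element names, commas, bars, parentheses, and the *, +, and ? symbols (not #PCDATA).

```python
import re


def model_to_regex(model):
    """Translate a content model such as '(SSN, LastName?, hours)' into a
    compiled regular expression over child tag names, each ending in ';'."""
    # Wrap every element name in a group that also consumes its ';' terminator,
    # so that a following *, + or ? applies to the whole name.
    pattern = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda m: "(?:" + re.escape(m.group(0)) + ";)",
        model,
    )
    # Commas mean "followed by", which is plain concatenation in a regex.
    pattern = re.sub(r"[\s,]", "", pattern)
    return re.compile(pattern)


def children_match(model, child_tags):
    """Return True if the ordered list of child tags satisfies the model."""
    sequence = "".join(tag + ";" for tag in child_tags)
    return model_to_regex(model).fullmatch(sequence) is not None


# Content model of the project element in Figure 26.4.
PROJECT_MODEL = "(Name, Number, Location, DeptNo?, Workers)"

print(children_match(PROJECT_MODEL, ["Name", "Number", "Location", "Workers"]))
print(children_match(PROJECT_MODEL, ["Name", "Location", "Number", "Workers"]))
```

The first call succeeds because DeptNo is optional; the second fails because the children appear in a different order from the one the model specifies.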
We can see that the tree structure in Figure 26.1 and the XML document in Figure 26.3 conform to the XML DTD in Figure 26.4. To require that an XML document be checked for conformance to a DTD, we must specify this in the declaration of the document. For example, we could change the first line in Figure 26.3 to the following:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE projects SYSTEM "proj.dtd">
When the value of the standalone attribute in an XML document is "no", the document needs to be checked against a separate DTD document. The DTD file shown in Figure 26.4 should be stored in the same file system as the XML document, and should be given the file name "proj.dtd". Alternatively, we could include the DTD document text at the beginning of the XML document itself to allow the checking.
Although XML DTD is quite adequate for specifying tree structures with required, optional, and repeating elements, it has several limitations. First, the data types in DTD are not very general. Second, DTD has its own special syntax and thus requires specialized processors. It would be advantageous to specify XML schema documents using the syntax rules of XML itself so that the same processors used for XML documents could process XML schema descriptions. Third, all DTD elements are always forced to follow the specified ordering of the document, so unordered elements are not permitted. These drawbacks led to the development of XML schema, a more general language for specifying the structure and elements of XML documents.
The XML schema language is a standard for specifying the structure of XML documents. It uses the same syntax rules as regular XML documents, so that the same processors can be used on both. To distinguish the two types of documents, we will use the term XML instance document or XML document for a regular XML document, and XML schema document for a document that specifies an XML schema. Figure 26.5 shows an XML schema document corresponding to the COMPANY database shown in Figures 3.2 and 5.5. Although it is unlikely that we would want to display the whole database as a single document, there have been proposals to store data in native XML format as an alternative to storing the data in relational databases. The schema in Figure 26.5 would serve the purpose of specifying the structure of the COMPANY database if it were stored in a native XML system. We discuss this topic further in Section 26.4.
As with XML DTD, XML schema is based on the tree data model, with elements and attributes as the main structuring concepts. However, it borrows additional concepts from
<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:annotation>
<xsd:documentation xml:lang="en">Company Schema (Element Approach)
- Prepared by Babak Hojabri</xsd:documentation>
<xsd:complexType name="Department">
<xsd:sequence>
<xsd:element name="departmentName" type="xsd:string" />
<xsd:element name="departmentNumber" type="xsd:string" />
<xsd:element name="departmentManagerSSN" type="xsd:string" />
<xsd:element name="departmentManagerStartDate" type="xsd:date" />
<xsd:element name="departmentLocation" type="xsd:string"
<xsd:element name="employeeName" type="Name" />
<xsd:element name="employeeSSN" type="xsd:string" />
<xsd:element name="employeeSex" type="xsd:string" />
<xsd:element name="employeeSalary" type="xsd:unsignedInt" />
<xsd:element name="employeeBirthDate" type="xsd:date" />
<xsd:element name="employeeDepartmentNumber" type="xsd:string" />
<xsd:element name="employeeSupervisorSSN" type="xsd:string" />
<xsd:element name="employeeAddress" type="Address" />
<xsd:element name="employeeWorksOn" type="WorksOn" minOccurs="1" maxOccurs="unbounded" />
<xsd:element name="employeeDependent" type="Dependent" minOccurs="0" maxOccurs="unbounded" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Project">
<xsd:sequence>
<xsd:element name="projectName" type="xsd:string" />
<xsd:element name="projectNumber" type="xsd:string" />
<xsd:element name="projectLocation" type="xsd:string" />
<xsd:element name="projectDepartmentNumber" type="xsd:string" />
<xsd:element name="projectWorker" type="Worker" minOccurs="1" maxOccurs="unbounded" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Dependent">
<xsd:sequence>
<xsd:element name="dependentName" type="xsd:string" />
<xsd:element name="dependentSex" type="xsd:string" />
<xsd:element name="dependentBirthDate" type="xsd:date" />
<xsd:element name="dependentRelationship" type="xsd:string" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Address">
<xsd:sequence>
<xsd:element name="number" type="xsd:string" />
<xsd:element name="street" type="xsd:string" />
<xsd:element name="city" type="xsd:string" />
<xsd:element name="state" type="xsd:string" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Name">
<xsd:sequence>
<xsd:element name="firstName" type="xsd:string" />
<xsd:element name="middleName" type="xsd:string" />
<xsd:element name="lastName" type="xsd:string" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Worker">
<xsd:sequence>
<xsd:element name="SSN" type="xsd:string" />
<xsd:element name="hours" type="xsd:float" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="WorksOn">
<xsd:sequence>
<xsd:element name="projectNumber" type="xsd:string" />
<xsd:element name="hours" type="xsd:float" />
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
FIGURE 26.5 (CONTINUED) An XML schema file called company.
database and object models, such as keys, references, and identifiers. We here describe the features of XML schema in a step-by-step manner, referring to the example XML schema document of Figure 26.5 for illustration. We introduce and describe some of the schema concepts in the order in which they are used in Figure 26.5.
of XML schema language elements (tags) being used by specifying a file stored at a Web site location. The second line in Figure 26.5 specifies the file used in this example, which is "http://www.w3.org/2001/XMLSchema". This is the most commonly used standard for XML schema commands. Each such definition is called an XML namespace, because it defines the set of commands (names) that can be used. The file name is assigned to the variable xsd (XML schema description) using the attribute xmlns (XML namespace), and this variable is used as a prefix to all XML schema commands (tag names). For example, in Figure 26.5, when we write xsd:element or xsd:sequence, we are referring to the definitions of the element and sequence tags as defined in the file "http://www.w3.org/2001/XMLSchema".
26.5 illustrate the XML schema elements (tags) xsd:annotation and xsd:documentation, which are used for providing comments and other descriptions in the XML document. The attribute xml:lang of the xsd:documentation element specifies the language being used, where "en" stands for the English language.
schema, the name attribute of the xsd:element tag specifies the element name, which is called company for the root element in our example (see Figure 26.5). The structure of the company root element can then be specified, which in our example is xsd:complexType. This is further specified to be a sequence of departments, employees, and projects using the xsd:sequence structure of XML schema. It is important to note here that this is not the only way to specify an XML schema for the COMPANY database. We will discuss other options in Section 26.4.
4. First-level elements in the COMPANY database: Next, we specify the three first-level elements under the company root element in Figure 26.5. These elements are named employee, department, and project, and each is specified in an xsd:element tag. Notice that if a tag has only attributes and no further subelements or data within it, it can be ended with the slash symbol (/>) directly instead of having a separate matching end tag. These are called empty elements; examples are the xsd:element elements named department and project in Figure 26.5.
5. Specifying element type and minimum and maximum occurrences: In XML schema, the attributes type, minOccurs, and maxOccurs in the xsd:element tag specify the type and multiplicity of each element in any document that conforms to the schema specifications. If we specify a type attribute in an xsd:element, the structure of the element must be described separately, typically using the xsd:complexType element of XML schema. This is illustrated by the employee, department, and project elements in Figure 26.5. On the other hand, if no type attribute is specified, the element structure can be defined directly following the tag, as illustrated by the company root element in Figure 26.5. The minOccurs and maxOccurs attributes are used for specifying lower and upper bounds on the number of occurrences of an element in any document that conforms to the schema specifications. If they are not specified, the default is exactly one occurrence. These serve a similar role to the *, +, and ? symbols of XML DTD, and to the (min, max) constraints of the ER model (see Section 3.7.4).
6. Specifying keys: In XML schema, it is possible to specify constraints that correspond to unique and primary key constraints in a relational database (see Section 5.2.2), as well as foreign key (or referential integrity) constraints (see Section 5.2.4). The xsd:unique tag specifies elements that correspond to unique attributes in a relational database that are not primary keys. We can give each such uniqueness constraint a name, and we must specify xsd:selector and xsd:field tags for it to identify the element type that contains the unique element and the element name within it that is unique via the xpath attribute. This is illustrated by the departmentNameUnique and projectNameUnique elements in Figure 26.5. For specifying primary keys, the tag xsd:key is used instead of xsd:unique, as illustrated by the projectNumberKey, departmentNumberKey, and employeeSSNKey elements in Figure 26.5. For specifying foreign keys, the tag xsd:keyref is used, as illustrated by the six xsd:keyref elements in Figure 26.5. When specifying a foreign key, the attribute refer of the xsd:keyref tag specifies the referenced primary key, whereas the tags xsd:selector and xsd:field specify the referencing element type and foreign key (see Figure 26.5).
7. Specifying the structures of complex elements via complex types: The next part of our example specifies the structures of the complex elements Department, Employee, Project, and Dependent, using the tag xsd:complexType (see Figure 26.5). We specify each of these as a sequence of subelements corresponding to the database attributes of each entity type (see Figures 3.2 and 5.7) by using the xsd:sequence and xsd:element tags of XML schema. Each element is given a name and type via the attributes name and type of xsd:element. We can also specify minOccurs and maxOccurs attributes if we need to change the default of exactly one occurrence. For (optional) database attributes where null is allowed, we need to specify minOccurs = 0, whereas for multivalued database attributes we need to specify maxOccurs = "unbounded" on the corresponding element. Notice that if we were not going to specify any key constraints, we could have embedded the subelements within the parent element definitions directly without having to specify complex types. However, when unique, primary key, and foreign key constraints need to be specified, we must define complex types to specify the element structures.
specified as complex types in Figure 26.5, as illustrated by the Address, Name, Worker, and WorksOn complex types. These could have been directly embedded within their parent elements.
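Two of the points in the list above, the namespace prefix and the default of exactly one occurrence, can be seen by reading a schema fragment programmatically. The following sketch uses Python's standard xml.etree.ElementTree module on the Project complex type of Figure 26.5; the function name occurrence_bounds is a hypothetical helper introduced for illustration.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# The Project complex type from Figure 26.5, wrapped in a schema element.
SCHEMA = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="Project">
    <xsd:sequence>
      <xsd:element name="projectName" type="xsd:string" />
      <xsd:element name="projectNumber" type="xsd:string" />
      <xsd:element name="projectLocation" type="xsd:string" />
      <xsd:element name="projectDepartmentNumber" type="xsd:string" />
      <xsd:element name="projectWorker" type="Worker"
                   minOccurs="1" maxOccurs="unbounded" />
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>"""

root = ET.fromstring(SCHEMA)


def occurrence_bounds(schema_root, type_name):
    """Map each subelement name of a complex type to (minOccurs, maxOccurs)."""
    bounds = {}
    # The parser replaces the xsd: prefix by the namespace name it stands for.
    for ctype in schema_root.iter("{%s}complexType" % XSD_NS):
        if ctype.get("name") != type_name:
            continue
        for elem in ctype.iter("{%s}element" % XSD_NS):
            # When either attribute is omitted, the default is one occurrence.
            bounds[elem.get("name")] = (
                elem.get("minOccurs", "1"),
                elem.get("maxOccurs", "1"),
            )
    return bounds


bounds = occurrence_bounds(root, "Project")
print(root.tag)
print(bounds["projectName"], bounds["projectWorker"])
```

The printed root tag shows the prefix xsd replaced by the full namespace name, which is why the prefix itself can be chosen freely as long as it is bound to the same namespace.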
This example illustrates some of the main features of XML schema. There are other features, but they are beyond the scope of our presentation. In the next section, we discuss the different approaches to creating XML documents from relational databases and storing XML documents.
26.4 XML DOCUMENTS AND DATABASES
We now discuss how various types of XML documents can be stored and retrieved. Section 26.4.1 gives an overview of the various approaches for storing XML documents. Section 26.4.2 discusses one of these approaches, in which data-centric XML documents are extracted from existing databases, in more detail. In particular, we show how tree-structured documents can be created from graph-structured databases. Section 26.4.3 discusses the problem of cycles and how it can be dealt with.
26.4.1 Approaches to Storing XML Documents
Several approaches to organizing the contents of XML documents to facilitate their subsequent querying and retrieval have been proposed. The following are the most common approaches:
1. Using a DBMS to store the documents as text: A relational or object DBMS can be used to store whole XML documents as text fields within the DBMS records or objects. This approach can be used if the DBMS has a special module for document processing, and would work for storing schemaless and document-centric XML documents. The keyword indexing functions of the document processing module (see Chapter 22) can be used to index and speed up search and retrieval of the documents.
2. Using a DBMS to store the document contents as data elements: This approach would work for storing a collection of documents that follow a specific XML DTD or XML schema. Because all the documents have the same structure, one can design a relational (or object) database to store the leaf-level data elements within the XML documents. This approach would require mapping algorithms to design a database schema that is compatible with the XML document structure as specified in the XML schema or DTD and to recreate the XML documents from the stored data. These algorithms can be implemented either as an internal DBMS module or as separate middleware that is not part of the DBMS.
3. Designing a specialized system for storing native XML data: A new type of database system based on the hierarchical (tree) model could be designed and implemented. The system would include specialized indexing and querying techniques, and would work for all types of XML documents. It could also include data compression techniques to reduce the size of the documents for storage.
4. Creating or publishing customized XML documents from preexisting relational databases: Because there are enormous amounts of data already stored in relational databases, parts of this data may need to be formatted as documents for exchanging or displaying over the Web. This approach would use a separate middleware software layer to handle the conversions needed between the XML documents and the relational database.
All four of these approaches have received considerable attention over the past few years. We focus on approach 4 in the next subsection, because it gives a good conceptual understanding of the differences between the XML tree data model and the traditional database models based on flat files (relational model) and graph representations (ER model).
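To make approach 2 in the list above more concrete, the following sketch stores the leaf-level data elements of a small document in a relational table and then recreates the document from the stored rows. It is a minimal illustration using Python's standard sqlite3 and xml.etree.ElementTree modules; the table layout, the sample data, and the function names shred and rebuild are assumptions made for this example, not a prescribed mapping algorithm.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical documents that all follow the same simple structure.
DOC = """<projects>
  <project><Name>ProductX</Name><Number>1</Number><Location>Bellaire</Location></project>
  <project><Name>ProductY</Name><Number>2</Number><Location>Sugarland</Location></project>
</projects>"""


def shred(conn, xml_text):
    """Store the leaf-level elements of each <project> as one table row."""
    conn.execute(
        "CREATE TABLE project (name TEXT, number INTEGER PRIMARY KEY, location TEXT)"
    )
    for p in ET.fromstring(xml_text).findall("project"):
        conn.execute(
            "INSERT INTO project VALUES (?, ?, ?)",
            (p.findtext("Name"), int(p.findtext("Number")), p.findtext("Location")),
        )


def rebuild(conn):
    """Recreate the XML document from the stored rows."""
    root = ET.Element("projects")
    rows = conn.execute("SELECT name, number, location FROM project ORDER BY number")
    for name, number, location in rows:
        p = ET.SubElement(root, "project")
        ET.SubElement(p, "Name").text = name
        ET.SubElement(p, "Number").text = str(number)
        ET.SubElement(p, "Location").text = location
    return root


conn = sqlite3.connect(":memory:")
shred(conn, DOC)
rebuilt = rebuild(conn)
print(ET.tostring(rebuilt, encoding="unicode"))
```

Because every document has the same structure, one fixed table suffices here; documents with nested repeating elements would need additional tables linked by keys.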
We will use the simplified UNIVERSITY ER schema shown in Figure 26.6 to illustrate our discussion. Suppose that an application needs to extract XML documents for student, course, and grade information from the UNIVERSITY database. The data needed for these documents is contained in the database attributes of the entity types COURSE, SECTION, and STUDENT from Figure 26.6, and the relationships S-S and C-S between them. In general, most documents extracted from a database will only use a subset of the attributes, entity types, and relationships in the database. In this example, the subset of the database that is needed is shown in Figure 26.7.
FIGURE 26.6 An ER schema diagram for a simplified UNIVERSITY database.
At least three possible document hierarchies can be extracted from the database subset in Figure 26.7. First, we can choose COURSE as the root, as illustrated in Figure 26.8. Here, each course entity has the set of its sections as subelements, and each section has its students as subelements. We can see one consequence of modeling the information in a hierarchical tree structure. If a student has taken multiple sections, that student's information will appear multiple times in the document, once under each section. A possible simplified XML schema for this view is shown in Figure 26.9. The Grade database attribute in the S-S relationship is migrated to the STUDENT element. This is because STUDENT becomes a child of SECTION in this hierarchy, so each STUDENT element under a specific SECTION element can have a specific grade in that section. In this document hierarchy, a student taking more than one section will have several replicas, one under each section, and each replica will have the specific grade given in that particular section.
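The replication of student information in the COURSE-rooted hierarchy can be seen in a small sketch that builds such a document from flat tuples. The rows below are made-up sample data, and course_rooted is a hypothetical helper; the element names follow those used in Figure 26.9, with some subelements omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat tuples: (cnumber, secnumber, sname, grade).
ROWS = [
    (1310, 1, "Smith", "A"),
    (1310, 1, "Brown", "B"),
    (3380, 1, "Smith", "C"),
]


def course_rooted(rows):
    """Build a root/course/section/student hierarchy from flat tuples."""
    root = ET.Element("root")
    courses, sections = {}, {}
    for cnumber, secnumber, sname, grade in rows:
        if cnumber not in courses:
            courses[cnumber] = ET.SubElement(root, "course")
            ET.SubElement(courses[cnumber], "cnumber").text = str(cnumber)
        key = (cnumber, secnumber)
        if key not in sections:
            sections[key] = ET.SubElement(courses[cnumber], "section")
            ET.SubElement(sections[key], "secnumber").text = str(secnumber)
        # A new student element is created under every section the student
        # took, so the grade can be attached to that particular replica.
        student = ET.SubElement(sections[key], "student")
        ET.SubElement(student, "sname").text = sname
        ET.SubElement(student, "grade").text = grade
    return root


tree = course_rooted(ROWS)
smith_grades = [
    s.findtext("grade") for s in tree.iter("student") if s.findtext("sname") == "Smith"
]
print(smith_grades)
```

In this sample the student named Smith appears twice, once under each section, and each replica carries the grade for that section.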
In the second hierarchical document view, we can choose STUDENT as root (Figure 26.10). In this hierarchical view, each student has a set of sections as its child elements, and each section is related to one course as its child, because the relationship between SECTION and COURSE is N:1. We can hence merge the COURSE and SECTION elements in this
<xsd:element name="root">
<xsd:sequence>
<xsd:element name="course" minOccurs="0" maxOccurs="unbounded">
<xsd:sequence>
<xsd:element name="cname" type="xsd:string" />
<xsd:element name="cnumber" type="xsd:unsignedInt" />
<xsd:element name="section" minOccurs="0" maxOccurs="unbounded">
<xsd:sequence>
<xsd:element name="secnumber" type="xsd:unsignedInt" />
<xsd:element name="year" type="xsd:string" />
<xsd:element name="quarter" type="xsd:string" />
<xsd:element name="student" minOccurs="0" maxOccurs="unbounded">
<xsd:sequence>
<xsd:element name="ssn" type="xsd:string" />
<xsd:element name="sname" type="xsd:string" />
<xsd:element name="class" type="xsd:string" />
<xsd:element name="grade" type="xsd:string" />
</xsd:sequence>
</xsd:element>
</xsd:sequence>
</xsd:element>
</xsd:sequence>
</xsd:element>
</xsd:sequence>
</xsd:element>
FIGURE 26.9 XML schema document with COURSE as the root.
view, as shown in Figure 26.10. In addition, the GRADE database attribute can be migrated to the SECTION element. In this hierarchy, the combined COURSE/SECTION information is replicated under each student who completed the section. A possible simplified XML schema for this view is shown in Figure 26.11.
The third possible way is to choose SECTION as the root, as shown in Figure 26.12. Similar to the second hierarchical view, the COURSE information can be merged into the SECTION element. The GRADE database attribute can be migrated to the STUDENT element. As we can see, even in this simple example, there can be numerous hierarchical document views, each corresponding to a different root and a different XML document structure.
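Choosing a different root changes which information is replicated. As a sketch of the STUDENT-rooted view described earlier, the code below merges the course information into each section element and places the grade there as well. The rows are made-up sample data and student_rooted is a hypothetical helper; element names follow the schema fragments in this section.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat tuples: (ssn, sname, cnumber, cname, secnumber, grade).
ROWS = [
    ("111", "Smith", 1310, "Intro to CS", 1, "A"),
    ("111", "Smith", 3380, "Databases", 1, "C"),
    ("222", "Brown", 1310, "Intro to CS", 1, "B"),
]


def student_rooted(rows):
    """Build a root/student/section hierarchy with course data merged in."""
    root = ET.Element("root")
    students = {}
    for ssn, sname, cnumber, cname, secnumber, grade in rows:
        if ssn not in students:
            students[ssn] = ET.SubElement(root, "student")
            ET.SubElement(students[ssn], "ssn").text = ssn
            ET.SubElement(students[ssn], "sname").text = sname
        # SECTION to COURSE is N:1, so the single related course can be
        # folded into the section element instead of being a separate child.
        section = ET.SubElement(students[ssn], "section")
        ET.SubElement(section, "secnumber").text = str(secnumber)
        ET.SubElement(section, "cnumber").text = str(cnumber)
        ET.SubElement(section, "cname").text = cname
        ET.SubElement(section, "grade").text = grade
    return root


view = student_rooted(ROWS)
course_names = [s.findtext("cname") for s in view.iter("section")]
print(course_names)
```

Here each student appears once, while the combined course and section information is repeated under every student who completed that section.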
26.4.3 Breaking Cycles to Convert Graphs into Trees
In the previous examples, the subset of the database of interest had no cycles. It is possible to have a more complex subset with one or more cycles, indicating multiple relationships among the entities. In this case, it is more complex to decide how to create the document hierarchies. Additional duplication of entities may be needed to represent the multiple relationships. We shall illustrate this with an example using the ER schema in Figure 26.6.
FIGURE 26.10 Hierarchical (tree) view with STUDENT as the root.
Suppose that we need the information in all the entity types and relationships of Figure 26.6 for a particular XML document, with STUDENT as the root element. Figure 26.13 illustrates how a possible hierarchical tree structure can be created for this document. First, we get a lattice with STUDENT as the root, as shown in part (1) of Figure 26.13. This is not a tree structure because of the cycles. One way to break the cycles is to replicate the entity types involved in the cycles. First, we replicate INSTRUCTOR as shown in part (2) of Figure 26.13, calling the replica to the right INSTRUCTOR1. The INSTRUCTOR replica on the left represents the relationship between instructors and the sections they teach, whereas the INSTRUCTOR1 replica on the right represents the relationship between instructors and the department each works in. After this, we still have the cycle involving COURSE, so we can replicate COURSE in a similar manner, leading to the hierarchy shown in part (3) of Figure 26.13. The COURSE1 replica to the left represents the relationship between courses and their sections, whereas the COURSE replica to the right represents the relationship between courses and the department that offers each course.
In part (3) of Figure 26.13, we have converted the initial graph to a hierarchy. We can do further merging if desired (as in our previous example) before creating the final hierarchy and the corresponding XML schema structure.
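The replication step can be sketched as a traversal that creates a fresh copy of an entity type each time it is reached by a different path. The code below is an illustration under stated assumptions: the edge list LATTICE is a guess at the rooted lattice of Figure 26.13, the function graph_to_tree is a hypothetical helper, and it requires that the graph have no directed cycle once edges point away from the root. Which copy receives the numeric suffix is arbitrary and may differ from the labeling in the figure.

```python
# Assumed lattice with STUDENT as the root; edges point away from the root.
LATTICE = {
    "STUDENT": ["SECTION", "DEPARTMENT"],
    "SECTION": ["COURSE", "INSTRUCTOR"],
    "DEPARTMENT": ["COURSE", "INSTRUCTOR"],
}


def graph_to_tree(graph, root):
    """Unfold a rooted lattice into a tree by replicating shared nodes.

    Returns nested (label, children) tuples. The first copy of an entity
    type keeps its name; later copies get a numeric suffix.
    """
    copies = {}  # entity type -> number of copies placed so far

    def label_for(name):
        copies[name] = copies.get(name, 0) + 1
        if copies[name] == 1:
            return name
        return "%s%d" % (name, copies[name] - 1)

    def build(name):
        label = label_for(name)
        children = [build(child) for child in graph.get(name, [])]
        return (label, children)

    return build(root)


tree = graph_to_tree(LATTICE, "STUDENT")
print(tree)
```

Each replica stands for one relationship only, for example instructors in relation to the sections they teach versus instructors in relation to their department, which is what removes the cycle.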
<xsd:element name="root">
<xsd:sequence>
<xsd:element name="student" minOccurs="0" maxOccurs="unbounded">
<xsd:sequence>
<xsd:element name="ssn" type="xsd:string" />
<xsd:element name="sname" type="xsd:string" />
<xsd:element name="class" type="xsd:string" />
<xsd:element name="section" minOccurs="0" maxOccurs="unbounded">
<xsd:sequence>
<xsd:element name="secnumber" type="xsd:unsignedInt" />
<xsd:element name="year" type="xsd:string" />
<xsd:element name="quarter" type="xsd:string" />
<xsd:element name="cnumber" type="xsd:unsignedInt" />
<xsd:element name="cname" type="xsd:string" />
<xsd:element name="grade" type="xsd:string" />
</xsd:sequence>
</xsd:element>
</xsd:sequence>
</xsd:element>
</xsd:sequence>
</xsd:element>
FIGURE 26.13 Converting a graph with cycles into a hierarchical (tree) structure.
26.4.4 Other Steps for Extracting XML Documents from
There have been several proposals for XML query languages, but two standards have emerged. The first is XPath, which provides language constructs for specifying path expressions to identify certain nodes (elements) within an XML document that match specific patterns. The second is XQuery, which is a more general query language. XQuery uses XPath expressions but has additional constructs. We give an overview of each of these languages in this section.
26.5.1 XPath: Specifying Path Expressions in XML
An XPath expression returns a collection of element nodes that satisfy certain patterns specified in the expression. The names in the XPath expression are node names in the XML document tree that are either tag (element) names or attribute names, possibly with additional qualifier conditions to further restrict the nodes that satisfy the pattern. Two main separators are used when specifying a path: the single slash (/) and the double slash (//). A single slash before a tag specifies that the tag must appear as a direct child of the previous (parent) tag, whereas a double slash specifies that the tag can appear as a descendant of the previous tag at any level. Let us look at some examples of XPath as shown in Figure 26.14.
The first XPath expression in Figure 26.14 returns the company root node and all its descendant nodes, which means that it returns the whole XML document. We should note that it is customary to include the file name in the XPath query. This allows us to specify any local file name or even any path name that specifies a file on the Web. For example, if the COMPANY XML document is stored at the location
www.company.com/info.xml
then the first XPath expression in Figure 26.14 can be written as
doc(www.company.com/info.xml)/company
This prefix would also be included in the other examples.
The second example in Figure 26.14 returns all department nodes (elements) and their descendant subtrees. Note that the nodes (elements) in an XML document are ordered, so the XPath result that returns multiple nodes will do so in the same order in which the nodes are ordered in the document tree.
The third XPath expression in Figure 26.14 illustrates the use of //, which is convenient to use if we do not know the full path name we are searching for, but do know the name of some tags of interest within the XML document. This is particularly useful for schemaless XML documents or for documents with many nested levels of nodes.6 The
1. /company
2. /company/department
3. //employee [employeeSalary gt 70000]/employeeName
4. /company/employee [employeeSalary gt 70000]/employeeName
5. /company/project/projectWorker [hours ge 20.0]
FIGURE 26.14 Some examples of XPath expressions on XML documents that follow the XML schema file COMPANY in Figure 26.5.
6. We are using the terms node, tag, and element interchangeably here.
expression returns all employeeName nodes that are direct children of an employee node, such that the employee node has another child element employeeSalary whose value is greater than 70000. This illustrates the use of qualifier conditions, which restrict the nodes selected by the XPath expression to those that satisfy the condition. XPath has a number of comparison operations for use in qualifier conditions, including standard arithmetic, string, and set comparison operations.
The fourth XPath expression should return the same result as the previous one, except that we specified the full path name in this example. The fifth expression in Figure 26.14 returns all projectWorker nodes and their descendant nodes that are children under a path /company/project and have a child node hours with a value greater than or equal to 20.0 hours.
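The effect of these path expressions can be imitated in a general-purpose language. The sketch below uses Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath and has no gt or ge comparisons, so the qualifier conditions are written as ordinary Python tests. The sample document and its values are made up for illustration; the element names follow the schema in Figure 26.5.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document with element names from Figure 26.5.
DOC = """<company>
  <department><departmentName>Research</departmentName></department>
  <employee>
    <employeeName><firstName>John</firstName><lastName>Smith</lastName></employeeName>
    <employeeSalary>30000</employeeSalary>
  </employee>
  <employee>
    <employeeName><firstName>James</firstName><lastName>Borg</lastName></employeeName>
    <employeeSalary>85000</employeeSalary>
  </employee>
  <project>
    <projectWorker><SSN>123456789</SSN><hours>32.5</hours></projectWorker>
    <projectWorker><SSN>453453453</SSN><hours>10.0</hours></projectWorker>
  </project>
</company>"""

company = ET.fromstring(DOC)

# /company/department : direct children of the root, in document order.
departments = company.findall("department")

# //employee [employeeSalary gt 70000]/employeeName : employee at any level,
# filtered by a qualifier on a sibling child element.
well_paid = [
    e.find("employeeName")
    for e in company.iter("employee")
    if int(e.findtext("employeeSalary")) > 70000
]

# /company/project/projectWorker [hours ge 20.0] : full path plus qualifier.
busy = [
    w
    for w in company.findall("project/projectWorker")
    if float(w.findtext("hours")) >= 20.0
]

print([n.findtext("lastName") for n in well_paid])
print([w.findtext("SSN") for w in busy])
```

A full XPath processor would evaluate the expressions of Figure 26.14 directly; the list comprehensions here only mirror their meaning.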
26.5.2 XQuery: Specifying Queries in XML
XPath allows us to write expressions that select nodes from a tree-structured XML document. XQuery permits the specification of more general queries on one or more XML documents. The typical form of a query in XQuery is known as a FLWR expression, which stands for the four main clauses of XQuery and has the following form:
FOR <variable bindings to individual nodes (elements)>
LET <variable bindings to collections of nodes (elements)>
WHERE <qualifier conditions>
RETURN <query result specification>
Figure 26.15 includes some examples of queries in XQuery that can be specified on XML instance documents that follow the XML schema document in Figure 26.5. The first query retrieves the first and last names of employees who earn more than $70,000. The
1. FOR $x IN
   doc(www.company.com/info.xml)//employee [employeeSalary gt 70000]/employeeName
   RETURN <res> $x/firstName, $x/lastName </res>
2. FOR $x IN
   doc(www.company.com/info.xml)/company/employee
   WHERE $x/employeeSalary gt 70000
   RETURN <res> $x/employeeName/firstName, $x/employeeName/lastName </res>
3. ...
   RETURN <res> ..., $y/employeeName/lastName, $x/hours </res>
FIGURE 26.15 Some examples of XQuery queries on XML documents that follow the
XML schema file COMPANY in Figure 26.5.
variable $x is bound to each employeeName element that is a child of an employee
element, but only for employee elements that satisfy the qualifier that their
employeeSalary value is greater than $70,000. The result retrieves the firstName and
lastName child elements of the selected employeeName elements. The second query is an
alternative way of retrieving the same elements retrieved by the first query.
The third query illustrates how a join operation can be performed by having more
than one variable. Here, the $x variable is bound to each projectWorker element that is
a child of project number 5, whereas the $y variable is bound to each employee element.
The join condition matches SSN values in order to retrieve the employee names.
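The clause structure of such a join can be mimicked in a general-purpose language. The following Python sketch is only an analogy, not XQuery: nested iteration plays the role of the two FOR variables, the filter plays the role of the WHERE clause, and the constructed tuple plays the role of the RETURN clause. The dictionaries stand in for employee and projectWorker elements, and all data values and key names are assumptions made for illustration.

```python
# Hypothetical stand-ins for employee elements (assumed data).
employees = [
    {"ssn": "123456789", "firstName": "John", "lastName": "Smith"},
    {"ssn": "333445555", "firstName": "Franklin", "lastName": "Wong"},
    {"ssn": "999887777", "firstName": "Alicia", "lastName": "Zelaya"},
]

# Hypothetical stand-ins for the projectWorker elements of one project.
project_workers = [
    {"ssn": "123456789", "hours": 32.5},
    {"ssn": "333445555", "hours": 10.0},
]


def join_workers_with_employees(workers, emps):
    """Pair each worker with the employee that has the same SSN."""
    return [
        (y["firstName"], y["lastName"], x["hours"])  # RETURN: result construction
        for x in workers                             # FOR $x IN ... projectWorker
        for y in emps                                # $y IN ... employee
        if x["ssn"] == y["ssn"]                      # WHERE: the join condition
    ]


result = join_workers_with_employees(project_workers, employees)
```

Employees who do not work on the project, such as the third one above, produce no result row, because no worker record satisfies the join condition for them.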
This concludes our brief introduction to XQuery. The interested reader is referred to
the Web site www.w3.org, which contains documents describing the latest standards
related to XML.
This chapter gave an overview of the standard for representing and exchanging data over
the Internet. We started by discussing the differences between structured, semistructured,
and unstructured data, then discussed why there was a need for a specification language
such as XML. We described the XML standard and its tree-structured (hierarchical) data
model, and discussed XML documents and the languages for specifying the structure of
these documents, namely, XML DTD (Document Type Definition) and XML schema. We
then gave an overview of the various approaches for storing XML documents, whether in
their native (text) format, in a compressed form, or in relational and other types of
databases, and discussed the mapping issues that arise when there is need to convert data
stored in traditional databases into XML documents. Finally, we gave an overview of the
XPath and XQuery languages proposed for querying XML data.
26.3 What are the differences between the use of tags in XML versus HTML?
26.4 What is the difference between data-centric and document-centric XML
documents?
26.5 What is the difference between attributes and elements in XML? List some of the
important attributes used in specifying elements in XML schema.
26.6 What is the difference between XML schema and XML DTD?
26.7 Create an XML instance document to correspond to the data stored in the relational database shown in Figure 5.6 such that the XML document conforms to the XML schema document in Figure 26.5.
26.8 Create XML schema documents to correspond to the hierarchies shown in Figures 26.12 and 26.13 part (3).
26.9 Consider the LIBRARY relational database schema of Figure 5.20. Create an XML schema document that corresponds to this database schema.
26.10 Specify the following views as queries in XQuery on the COMPANY XML schema shown in Figure 26.5.
a. A view that has the department name, manager name, and manager salary for every department.
b. A view that has the employee name, supervisor name, and employee salary for each employee who works in the Research department.
c. A view that has the project name, controlling department name, number of employees, and total hours worked per week on the project for each project.
d. A view that has the project name, controlling department name, number of employees, and total hours worked per week on the project for each project with more than one employee working on it.
Selected Bibliography
There are so many articles and books on various aspects of XML that it would be impossible to make even a modest list. We will mention one book: Chaudhri, Rashid, and Zicari, eds. (2003). This book discusses various aspects of XML and contains a list of some recent references to XML research and practice.
Over the last three decades, many organizations have generated a large amount of
machine-readable data in the form of files and databases. To process this data, we have
the database technology available that supports query languages like SQL. The problem
with SQL is that it is a structured language that assumes the user is aware of the database
schema. SQL supports operations of relational algebra that allow a user to select rows and
columns of data from tables or join related information from tables based on common
fields. In the next chapter, we shall see that data warehousing technology affords several
types of functionality: that of consolidation, aggregation, and summarization of data. Data
warehouses let us view the same information along multiple dimensions. In this chapter,
we will focus our attention on another very popular area of interest known as data
mining. As the term connotes, data mining refers to the mining or discovery of new
information in terms of patterns or rules from vast amounts of data. To be practically useful, data
mining must be carried out efficiently on large files and databases. To date, it is not
well-integrated with database management systems.
We will briefly review the state of the art of this rather extensive field of data mining,
which uses techniques from such areas as machine learning, statistics, neural networks,
and genetic algorithms. We will highlight the nature of the information that is
discovered, the types of problems faced when trying to mine databases, and the types of
applications of data mining. We also survey the state of the art of a large number of
commercial tools available (see Section 26.2.5) and describe a number of research
advances that are needed to make this area viable.
27.1 OVERVIEW OF DATA MINING TECHNOLOGY
In reports such as the very popular Gartner Report, data mining has been hailed as one
of the top technologies for the near future. In this section we relate data mining to the broader area called knowledge discovery and contrast the two by means of an illustrative example.
Data Mining versus Data Warehousing. The goal of a data warehouse (see Chapter 28) is to support decision making with data. Data mining can be used in conjunction with a data warehouse to help with certain types of decisions. Data mining can be applied to operational databases with individual transactions. To make data mining more efficient, the data warehouse should have an aggregated or summarized collection of data. Data mining helps in extracting meaningful new patterns that cannot
necessarily be found by merely querying or processing data or metadata in the data warehouse. Data mining applications should therefore be strongly considered early, during the design of a data warehouse. Also, data mining tools should be designed to facilitate their use in conjunction with data warehouses. In fact, for very large databases running into terabytes of data, successful use of data mining applications will depend first on the construction of a data warehouse.
Data Mining as a Part of the Knowledge Discovery Process. Knowledge Discovery
in Databases, frequently abbreviated as KDD, typically encompasses more than data mining. The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.
As an example, consider a transaction database maintained by a specialty consumer goods retailer. Suppose the client data includes a customer name, zip code, phone number, date of purchase, item code, price, quantity, and total amount. A variety of new knowledge can be discovered by KDD processing on this client database. During data selection, data about specific items or categories of items, or from stores in a specific region
or area of the country, may be selected. The data cleansing process then may correct invalid zip codes or eliminate records with incorrect phone prefixes. Enrichment typically enhances the data with additional sources of information. For example, given the client names and phone numbers, the store may purchase other data about age, income, and credit rating and append them to each record. Data transformation and encoding may be done to reduce the amount of data. For instance, item codes may be grouped in terms of product categories into audio, video, supplies, electronic gadgets, camera, accessories, and
so on. Zip codes may be aggregated into geographic regions, incomes may be divided into ranges, and so on. In Figure 28.1, we will show a step called cleaning as a precursor to the
data warehouse creation. If data mining is based on an existing warehouse for this retail
store chain, we would expect that the cleaning has already been applied. It is only after
such preprocessing that data mining techniques are used to mine different rules and
patterns.
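The preprocessing phases described above can be sketched as a small pipeline. The following Python code is an illustration under stated assumptions, not a prescribed implementation: the record layout, the region field, the item-code prefixes, the income thresholds, and the purchased demographic data are all hypothetical. For brevity, cleansing here simply drops records whose zip code is malformed rather than correcting them.

```python
# Hypothetical client transactions (assumed layout and values).
transactions = [
    {"name": "A. Lee", "zip": "30332", "region": "south", "item_code": "AUD-101", "total": 120.0},
    {"name": "B. Kim", "zip": "3033", "region": "south", "item_code": "VID-220", "total": 340.0},
    {"name": "C. Diaz", "zip": "10001", "region": "north", "item_code": "CAM-015", "total": 99.0},
]

# Hypothetical purchased data used for enrichment, keyed by client name.
demographics = {"A. Lee": {"age": 34, "income": 72000}}

# Assumed mapping from item-code prefix to product category.
CATEGORY_BY_PREFIX = {"AUD": "audio", "VID": "video", "CAM": "camera"}


def select(records, region):
    """Data selection: keep only records from one region."""
    return [r for r in records if r["region"] == region]


def cleanse(records):
    """Data cleansing: drop records whose zip code is not five digits."""
    return [r for r in records if len(r["zip"]) == 5 and r["zip"].isdigit()]


def enrich(records, extra):
    """Enrichment: append purchased attributes to each matching record."""
    return [{**r, **extra.get(r["name"], {})} for r in records]


def income_range(income):
    """Encode an income as a coarse range (thresholds are assumptions)."""
    if income is None:
        return "unknown"
    if income < 40000:
        return "low"
    return "middle" if income < 100000 else "high"


def transform(records):
    """Transformation and encoding: group codes into categories and ranges."""
    return [
        {
            "zip_prefix": r["zip"][:3],  # coarse stand-in for a geographic region
            "category": CATEGORY_BY_PREFIX.get(r["item_code"].split("-")[0], "other"),
            "income_range": income_range(r.get("income")),
            "total": r["total"],
        }
        for r in records
    ]


prepared = transform(enrich(cleanse(select(transactions, "south")), demographics))
```

Only after a pipeline of this kind has produced prepared would a mining algorithm be applied to it.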
The result of mining may be to discover the following types of "new" information:
a. Association rules-for example, whenever a customer buys video equipment,
he or she also buys another electronic gadget.
b. Sequential patterns-for example, suppose a customer buys a camera, and
within three months he or she buys photographic supplies, then within six
months he or she is likely to buy an accessory item. This defines a sequential pattern
of transactions. A customer who buys more than twice in the lean periods may
be likely to buy at least once during the Christmas period.
c. Classification trees-for example, customers may be classified by frequency of
visits, by types of financing used, by amount of purchase, or by affinity for types
of items, and some revealing statistics may be generated for such classes.
We can see that many possibilities exist for discovering new knowledge about buying
patterns, relating factors such as age, income group, and place of residence to what and how
much the customers purchase. This information can then be utilized to plan additional
store locations based on demographics, to run store promotions, to combine items in
advertisements, or to plan seasonal marketing strategies. As this retail store example
shows, data mining must be preceded by significant data preparation before it can yield
useful information that can directly influence business decisions.
The results of data mining may be reported in a variety of formats, such as listings,
graphic outputs, summary tables, or visualizations.
Data mining is typically carried out with some end goals or applications. Broadly speaking, these goals fall into the
following classes: prediction, identification, classification, and optimization.
• Prediction-Data mining can show how certain attributes within the data will
behave in the future. Examples of predictive data mining include the analysis of
buying transactions to predict what consumers will buy under certain discounts, how
much sales volume a store would generate in a given period, and whether deleting a
product line would yield more profits. In such applications, business logic is used
coupled with data mining. In a scientific context, certain seismic wave patterns may
predict an earthquake with high probability.
• Identification-Data patterns can be used to identify the existence of an item, an
event, or an activity. For example, intruders trying to break a system may be
identified by the programs executed, files accessed, and CPU time per session. In biological
applications, existence of a gene may be identified by certain sequences of nucleotide
symbols in the DNA sequence. The area known as authentication is a form of
identification. It ascertains whether a user is indeed a specific user or one from an authorized
class, and involves a comparison of parameters or images or signals against a database.
• Classification-Categories can be identified based on combinations of parameters. For example, customers
in a supermarket can be categorized into discount-seeking shoppers, shoppers in a rush, loyal regular shoppers, shoppers attached to name brands, and infrequent shoppers. This classification may be used in different analyses of customer buying transactions as a post-mining activity. Sometimes classification based on common domain knowledge is used as an input to decompose the mining problem and make it simpler. For instance, health foods, party foods, or school lunch foods are distinct categories
in the supermarket business. It makes sense to analyze relationships within and across categories as separate problems. Such categorization may be used to encode the data appropriately before subjecting it to further data mining.
• Optimization-One eventual goal of data mining may be to optimize the use of limited resources such as time, space, money, or materials and to maximize output variables such as sales or profits under a given set of constraints. As such, this goal of data mining resembles the objective function used in operations research problems that deal with optimization under constraints.
The term data mining is popularly being used in a very broad sense. In some situations it includes statistical analysis and constrained optimization as well as machine learning. There is no sharp line separating data mining from these disciplines. It is beyond our scope, therefore, to discuss in detail the entire range of applications that make up this vast body of work. For a detailed understanding of the area, readers are referred to
specialized books devoted to data mining.
Types of Knowledge Discovered During Data Mining. The term "knowledge"
is very broadly interpreted as involving some degree of intelligence. There is a progression from raw data to information to knowledge as we go through additional processing. Knowledge is often classified as inductive versus deductive. Deductive knowledge deduces new information based on applying pre-specified logical rules of deduction on the given data. Data mining addresses inductive knowledge, which discovers new rules and patterns from the supplied data. Knowledge can be represented in many forms: In an unstructured sense, it can be represented by rules or propositional logic. In a structured form, it may be represented in decision trees, semantic networks, neural networks, or hierarchies of classes or frames. It is common to describe the knowledge discovered during data mining in five ways, as follows.
• Association rules-These rules correlate the presence of a set of items with another range of values for another set of variables. Examples: (1) When a female retail shopper buys a handbag, she is likely to buy shoes. (2) An X-ray image containing characteristics a and b is likely to also exhibit characteristic c.
• Classification hierarchies-The goal is to work from an existing set of events or transactions to create a hierarchy of classes. Examples: (1) A population may be divided into five ranges of credit worthiness based on a history of previous credit transactions. (2) A model may be developed for the factors that determine the desirability of location of a store on a 1-10 scale. (3) Mutual funds may be classified based
on performance data using characteristics such as growth, income, and stability.
• Sequential patterns-A sequence of actions or events is sought. Example: If a patient
underwent cardiac bypass surgery for blocked arteries and an aneurysm and later
developed high blood urea within a year of surgery, he or she is likely to suffer from
kidney failure within the next 18 months. Detection of sequential patterns is
equivalent to detecting associations among events with certain temporal relationships.
• Patterns within time series-Similarities can be detected within positions of a time
series of data, which is a sequence of data taken at regular intervals such as daily sales
or daily closing stock prices. Examples: (1) Stocks of a utility company, ABC Power,
and a financial company, XYZ Securities, showed the same pattern during 2002 in
terms of closing stock price. (2) Two products show the same selling pattern in
summer but a different one in winter. (3) A pattern in solar magnetic wind may be used
to predict changes in earth atmospheric conditions.
• Clustering-A given population of events or items can be partitioned (segmented)
into sets of "similar" elements. Examples: (1) An entire population of treatment data
on a disease may be divided into groups based on the similarity of side effects
produced. (2) The adult population in the United States may be categorized into five
groups from "most likely to buy" to "least likely to buy" a new product. (3) The web
accesses made by a collection of users against a set of documents (say, in a digital
library) may be analyzed in terms of the keywords of documents to reveal clusters or
categories of users.
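As an illustration of the time-series case in the list above, one simple way to test whether two series "show the same pattern" is to compute their Pearson correlation coefficient. The text does not prescribe a particular similarity measure, so this choice is an assumption, and the closing prices below are made up for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical daily closing prices for the two companies in example (1).
abc_power = [20.0, 21.0, 23.0, 22.0, 25.0]
xyz_securities = [40.0, 42.0, 46.0, 44.0, 50.0]

similarity = pearson(abc_power, xyz_securities)
```

A value near 1 indicates that the two series rise and fall together even when their price levels differ; a value near -1 indicates opposite movement.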
For most applications, the desired knowledge is a combination of the above types.
We expand on each of the above knowledge types in the following sections.
27.2.1 Market-Basket Model, Support, and Confidence
One of the major technologies in data mining involves the discovery of association
rules. The database is regarded as a collection of transactions, each involving a set of
items. A common example is that of market-basket data. Here the market basket
corresponds to the sets of items a consumer buys in a supermarket during one visit.
Consider four such transactions in a random sample shown in Figure 27.1.
An association rule is of the form $X \Rightarrow Y$, where $X = \{x_1, x_2, \ldots, x_n\}$ and
$Y = \{y_1, y_2, \ldots, y_m\}$ are sets of items, with $x_i$ and $y_j$ being distinct items for all $i$ and all $j$. This
association states that if a customer buys $X$, he or she is also likely to buy $Y$. In general,
any association rule has the form LHS (left-hand side) $\Rightarrow$ RHS (right-hand side), where
LHS and RHS are sets of items. The set LHS $\cup$ RHS is called an itemset, the set of
items purchased by customers. For an association rule to be of interest to a data miner, the
rule should satisfy some interest measure. Two common interest measures are support and
confidence.
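The two measures can be computed directly from a list of transactions. The sketch below assumes the usual definitions: the support of an itemset is the fraction of transactions that contain every item in it, and the confidence of a rule LHS $\Rightarrow$ RHS is the support of LHS $\cup$ RHS divided by the support of LHS. The four transactions are illustrative data chosen for this example and are not claimed to reproduce Figure 27.1.

```python
# Illustrative market-basket transactions (assumed data).
transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]


def support(itemset, txns):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in txns if items <= t) / len(txns)


def confidence(lhs, rhs, txns):
    """support(LHS ∪ RHS) / support(LHS) for the rule LHS => RHS."""
    lhs_support = support(lhs, txns)
    if lhs_support == 0:
        raise ValueError("LHS never occurs, so confidence is undefined")
    return support(set(lhs) | set(rhs), txns) / lhs_support


# Rule under test: {milk} => {juice}.
rule_support = support({"milk", "juice"}, transactions)
rule_confidence = confidence({"milk"}, {"juice"}, transactions)
```

With this data, two of the four transactions contain both milk and juice, and three contain milk, so the rule has support 2/4 and confidence 2/3.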
The support for a rule LHS $\Rightarrow$ RHS is with respect to the itemset; it refers to how
frequently a specific itemset occurs in the database. That is, the support is the percentage