Copyright (c) 2003 C J Date
[PNUM = $sx//PNUM][@COLOR = 'Blue'] return
<Supplier>
{ $sx/SNUM, $sx/SNAME, $sx/STATUS, $sx/CITY }
</Supplier>
}
</Result>
27.22 Since the document doesn't have any immediate child elements
of type Supplier, the return clause is never executed, and the result is the empty sequence. Note: If the query had been
formulated slightly differently, as follows──
<Result>
{ for $sx in document("SuppliersOverShipments.xml")/
Supplier[CITY = 'London']
return
<whatever>
{ $sx/SNUM, $sx/SNAME, $sx/STATUS, $sx/CITY }
</whatever>
}
</Result>
──then the result would have looked like this:
<Result>
</Result>
27.23 There appears to be no difference. Here's an actual example
(query 1.1.9.3 Q3 from the W3C XML Query Use Cases document──see
reference [27.29]):
• Query:
<results>
{
for $b in document("http://www.bn.com/bib.xml")/bib/book,
$t in $b/title,
$a in $b/author return
<result>
{ $t } { $a }
</result>
}
</results>
• Query (modified):
<results>
{
for $b in document("http://www.bn.com/bib.xml")/bib/book,
$t in $b/title,
$a in $b/author return
<result>
{ $t, $a }
</result>
}
</results>
• Result (for both queries):*
<results>
<result>
<title>TCP/IP Illustrated</title>
<author>
<last>Stevens</last>
<first>W.</first>
</author>
</result>
<result>
<title>Advanced Unix Programming</title>
<author>
<last>Stevens</last>
<first>W.</first>
</author>
</result>
<result>
<title>Data on the Web</title>
<author>
<last>Abiteboul</last>
<first>Serge</first>
</author>
</result>
</results>
──────────
* Again we've altered the "official" result very slightly for formatting reasons.
──────────
27.24 See Section 27.6.
27.25 The following observations, at least, spring to mind
immediately:
• Several of the functions perform what is essentially type conversion. The expression XMLFILETOCLOB ('BoltDrawing.svg'), for example, might be more conventionally written something like this:
CAST_AS_CLOB ( 'BoltDrawing.svg' )
In other words, XMLDOC should be recognized as a fully fledged type (see Section 27.6, subsection "Documents as Attribute Values").
• Likewise, the expression XMLCONTENT (DRAWING,
'RetrievedBoltDrawing.svg') might more conventionally be
written thus:
DRAWING := CAST_AS_XMLDOC ( 'RetrievedBoltDrawing.svg' ) ;
In fact, XMLCONTENT is an update operator (see Chapter 5), and
the whole idea of being able to invoke it from inside a read-only operation (SELECT in SQL) is more than a little suspect [3.3].
• Consider the expression XMLFILETOCLOB ('BoltDrawing.svg') once again. The argument here is apparently of type character
string. However, that character string is interpreted (in
fact, it is dereferenced──see Chapter 26), which means that it
can't be just any old character string. In fact, the
XMLFILETOCLOB function is more than a little reminiscent of the EXECUTE IMMEDIATE operation of dynamic SQL (see Chapter 4).
• Remarks analogous to those in the previous paragraph apply also to arguments like
'//PartTuple[PNUM = "P3"]/WEIGHT'
(see the XMLEXTRACTREAL example).
27.26 The suggestion is correct, in the following sense. Consider any of the PartsRelation documents shown in the body of the
chapter. Clearly it would be easy, albeit tedious, to show a
tuple containing exactly the same information as that
document──though it's true that the tuple in question would
contain just one component, corresponding to the XML document in its entirety. That component in turn would contain a list or
sequence of further components, corresponding to the first-level content of the XML document in their "document order"; those
components in turn would (in general) contain further components, and so on. Omitted elements can be represented by empty
sequences. Note in particular that tuples in the relational model carry their attribute types with them, just as XML elements carry their tags with them──implying that (contrary to popular opinion!) tuples too, like XML documents, are self-describing, in a sense.
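The nesting just described can be made concrete with a small sketch. The Python below is an illustration only, not anything from the chapter: the sample document is hypothetical (loosely in the style of a PartsRelation document), the function name to_nested is my own, and attributes are ignored. Representing omitted elements as empty sequences would additionally need a known heading and isn't shown.

```python
import xml.etree.ElementTree as ET

def to_nested(elem):
    """Return (tag, content) for an element.

    content is a string for a leaf element; otherwise it is a list of
    nested (tag, content) pairs, kept in document order.
    """
    children = list(elem)
    if not children:
        return (elem.tag, (elem.text or "").strip())
    return (elem.tag, [to_nested(child) for child in children])

# Hypothetical sample document (an assumption, not taken from the book).
doc = ET.fromstring(
    "<PartsRelation>"
    "<PartTuple><PNUM>P1</PNUM><COLOR>Red</COLOR></PartTuple>"
    "<PartTuple><PNUM>P2</PNUM></PartTuple>"
    "</PartsRelation>"
)

# One component for the whole document, carrying its tag with it.
nested = to_nested(doc)
```

Each component carries its tag along with its content, which is the sense in which the nested structure is as "self-describing" as the document it came from.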
27.27 The claim that XML data is "schemaless" is absurd, of
course; data that was "schemaless" would have no known structure, and it would be impossible to query it──except by playing games with SUBSTRING operations, if we stretch a point and think of such game-playing as "querying"──or to design a query language for it.* Rather, the point is that the schemas for XML data and (say) SQL data are expressed in different styles, styles that might seem
distinct at a superficial level but aren't really so very
different at a deep level.
──────────
* In fact, it would be a BLOB──i.e., an arbitrarily long bit
string, with no internal structure that the DBMS is aware of.
──────────
27.28 In one sense we might say that an analogous remark does
apply to relational data. Given that XML fundamentally supports
just one data type, viz., character strings, it's at least
arguable that the options available for structuring such data
(i.e., character-string data specifically) in a relational
database are exactly the same as those available in XML. As a
trivial example, an address might be represented by a single
character string; or by separate strings for street, city, state, and zip; or in a variety of other ways.
In a much larger sense, however, an analogous remark does not
apply. First, relational systems provide a variety of additional (and genuine) data types over and above character strings, as well
as the ability for users to define their own types; they therefore don't force users to represent everything in character-string
form, and indeed they provide very strong incentives not to.
Second, there's a large body of design theory available for
relational databases that militates against certain bad designs. Third, relational systems provide a wide array of operators, the effect of which is (in part) that there's no logical incentive for biasing designs in such a way as to favor some applications at the expense of others (contrast the situation in XML).
27.29 This writer is aware of no differences of substance──except that the hierarchic model is usually regarded as including certain operators and constraints, while it's not at all clear that the same is true of "the semistructured model."
27.30 No answer provided.
The following text speaks for itself:
(Begin quote)
There are four appendixes. Appendix A is an introduction to a new
implementation technology called The TransRelational™ Model.
Appendix B gives further details, for reference purposes, of the syntax and semantics of SQL expressions. Appendix C contains a list of the more important abbreviations, acronyms, and symbols introduced in the body of the text. Finally, Appendix D (online) provides a tutorial survey of common storage structures and access methods.
(End quote)
Appendixes ***
The TransRelational™ Model
Principal Sections
• Three levels of abstraction
• The basic idea
• Condensed columns
• Merged columns
• Implementing the relational operators
General Remarks
This is admittedly only an appendix, but if I were the instructor I would certainly cover it in class. "It's the best possible time
to be alive, when almost everything you thought you knew is wrong"
(from Arcadia, by Tom Stoppard). The appendix is about a
radically new implementation technology, which (among other
things) does mean that an awful lot of what we've taken for
granted for years regarding DBMS implementation is now "wrong," or
at least obsolete. For example:
• The data occupies a fraction of the space required for a
conventional database today.
• The data is effectively stored in many different sort orders
at the same time.
• Indexes and other conventional access paths are completely unnecessary.
• Optimization is much simpler than it is with conventional
systems; often, there's just one obviously best way to
implement any given relational operation. In particular, the need for cost-based optimizing is almost entirely eliminated.
• Join performance is linear!──meaning, in effect, that the
time it takes to join twenty relations is only twice the time
it takes to join ten (loosely speaking). It also means that joining twenty relations, if necessary, is feasible in the
first place; in other words, the system is scalable.
• There's no need to compile database requests ahead of time for performance.
• Performance in general is orders of magnitude better than it
is with a conventional system.
• Logical design can be done properly (in particular, there is never any need to "denormalize for performance").
• Physical database design can be completely automated.
• Database reorganization as conventionally understood is
completely unnecessary.
• The system is much easier to administer, because far fewer human decisions are needed.
• There's no such thing as a "stored relvar" or "stored tuple"
at the physical level at all!
In a nutshell, the TransRelational model allows us to build DBMSs that──at last!──truly deliver on the full promise of the
relational model. Perhaps you can see why it's my honest opinion that "The TransRelational™ Model" is the biggest advance in the
DB field since Ted Codd gave us the relational model, back in
1969.
Note: We're supposed to put that trademark symbol on the term TransRelational, at least the first time we use it, also in titles and the like. Also, you should be aware that various aspects of the TR model──e.g., the idea of storing the data "attribute-wise" rather than "tuple-wise"──do somewhat resemble various ideas that have been described elsewhere in the literature; however, nobody else (so far as I know) has described a scheme that's anything like as comprehensive as the TR model; what's more, there are many aspects of the TR model that (again so far as I know) aren't like anything else, anywhere.
The logarithms analogy from reference [A.1] is helpful: "As
we all know, logarithms allow what would otherwise be complicated, tedious, and time-consuming numeric problems to be solved by
transforming them into vastly simpler but (in a sense) equivalent problems and solving those simpler problems instead. Well, it's
my claim that TR technology does the same kind of thing for data management problems." Give some examples.
Explain and justify the name: The TransRelational™ Model
(which we abbreviate to "TR" in the book and in these notes).
Credit to Steve Tarin, who invented it. Discuss data independence
and the conventional "direct image" style of implementation and the problems it causes.
Note the simplifying assumptions: The database is (a) read-only and (b) in main memory. Stress the fact that these
assumptions are made purely for pedagogic reasons; TR can and does
do well on updates and on disk.
A.2 Three Levels of Abstraction
Straightforward──but stress the fact that the files are
abstractions (as indeed the TR tables are too). Be very careful
to use the terminology appropriate to each level from this point forward. Show but do not yet explain in detail the Field Values Table and the (or, rather, a) Record Reconstruction Table for the
file of Fig. A.3. Note: Each of those tables is derived from the
file independently of the other. Point out that we're definitely not dealing with a direct-image style of implementation!
A.3 The Basic Idea
Explain "the crucial insight": Field values in the Field Values Table, linkage information in the Record Reconstruction Table. By the way, I deliberately don't abbreviate these terms to FVT and RRT. Students have so much that's novel to learn here that I
think such abbreviations get in the way (the names, by contrast,
serve to remind students of the functionality). Note: Almost all
of the terms in this appendix are taken from reference [A.1] and
do not appear in reference [A.2]──which, to be frank, is quite
difficult to understand, in part precisely because its terminology isn't very good (or even consistent).
Regarding the Field Values Table: Built at load time (so
that's when the sorting is done). Explain intuitively obvious advantages for ORDER BY, value lookup, etc. The Field Values
Table is the only TR table that contains user data as such.
Isomorphic to the file.
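If a concrete illustration helps in class, something like the following Python sketch could be used. It is a sketch under stated assumptions, not the scheme as specified in reference [A.1]: the suppliers file is sample data of my own choosing (not necessarily the file of Fig. A.3), the function names are hypothetical, and I'm assuming that each column of the Field Values Table is simply the sorted values of the corresponding field.

```python
from bisect import bisect_left, bisect_right

# Hypothetical file: one record per supplier (sample data, an assumption).
FIELDS = ("SNUM", "SNAME", "STATUS", "CITY")
FILE = [
    ("S4", "Clark", 20, "London"),
    ("S5", "Adams", 30, "Athens"),
    ("S2", "Jones", 10, "Paris"),
    ("S1", "Smith", 20, "London"),
    ("S3", "Blake", 30, "Paris"),
]

def build_field_values_table(records):
    """Load-time step: sort each field independently, one column per field."""
    return [sorted(column) for column in zip(*records)]

def rows_equal_to(column, value):
    """Binary search: the row numbers (0-based) whose cell equals value."""
    return range(bisect_left(column, value), bisect_right(column, value))

field_values = build_field_values_table(FILE)
city = field_values[FIELDS.index("CITY")]

# Every column is already in ORDER BY sequence for its own field.
london_rows = list(rows_equal_to(city, "London"))
```

The table has the same number of rows and columns as the file, which is all "isomorphic" means here; note that a row of the table no longer corresponds to any one record.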
Regarding the Record Reconstruction Table: Also isomorphic, but contains pointers (row numbers). Those row numbers identify
rows in the Field Values Table or the Record Reconstruction Table
or both, depending on the context. Explain the zigzag algorithm. Can enter the rings (zigzags) anywhere! Explain simple equality restriction queries (binary search). TR lets us do a sort/merge join without having to do the sort!──or, at least, without having
to do the run-time sort (explain). Implications for the
optimizer: Little or no access path selection. Don't need
indexes. Physical database design is simplified (in fact, it
should become clear later that it can be automated, given the
logical design). No need for performance tuning. A boon for the tired DBA.
Explain how the Record Reconstruction Table is built (or you could set this subsection as a reading assignment). Not unique;
we can turn this fact to our advantage, but the details are beyond the scope of this appendix; suffice it to say that some Record
Reconstruction Tables are "preferred." See reference [A.1] for further discussion.
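For students who want to see the zigzags actually run, here's a minimal Python sketch. Treat it as one possible reading rather than the definitive construction (see reference [A.1] for that). The assumptions: each Record Reconstruction Table cell holds the row number, in the next column (wrapping around from the last column to the first), of the value belonging to the same record; ties are broken by a stable sort, which yields just one of the many possible tables and not necessarily a "preferred" one; rows are numbered from 0; and the sample file and function names are my own.

```python
# Hypothetical file (sample data, an assumption).
FILE = [
    ("S4", "Clark", 20, "London"),
    ("S5", "Adams", 30, "Athens"),
    ("S2", "Jones", 10, "Paris"),
    ("S1", "Smith", 20, "London"),
    ("S3", "Blake", 30, "Paris"),
]

def build_tables(records):
    """Return (field_values, reconstruction), both indexed as [column][row]."""
    n_rows, n_fields = len(records), len(records[0])
    order, position = [], []
    for j in range(n_fields):
        # order[j][row] = number of the record whose field j sorts into that row.
        by_value = sorted(range(n_rows), key=lambda rec: records[rec][j])
        # position[j][rec] = row of column j that holds record rec's value.
        where = [0] * n_rows
        for row, rec in enumerate(by_value):
            where[rec] = row
        order.append(by_value)
        position.append(where)
    field_values = [[records[rec][j] for rec in order[j]] for j in range(n_fields)]
    reconstruction = [
        [position[(j + 1) % n_fields][rec] for rec in order[j]]
        for j in range(n_fields)
    ]
    return field_values, reconstruction

def zigzag(field_values, reconstruction, start_row, start_col=0):
    """Rebuild one record by following the ring through the given cell."""
    n_fields = len(field_values)
    record = [None] * n_fields
    row, col = start_row, start_col
    for _ in range(n_fields):
        record[col] = field_values[col][row]
        row = reconstruction[col][row]   # pointer into the next column
        col = (col + 1) % n_fields
    return tuple(record)

field_values, reconstruction = build_tables(FILE)
rebuilt = [zigzag(field_values, reconstruction, row) for row in range(len(FILE))]
```

Starting from row 0 of the first column and working down rebuilds the file in supplier-number order; starting from any cell of any other column works equally well, which is the point about entering the rings anywhere.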
A.4 Condensed Columns
An obvious improvement to the Field Values Table but one with far-reaching consequences. Note the implications for update in particular (we're pretending the database is read-only, but this point is worth highlighting in passing). The compression
advantages are staggering!──but note that we're compressing at the level of field values, not of bit string encodings. Don't have
to pay the usual price of extra machine cycles to do the
decompressing!
Explain row ranges.* Emphasize the point that these are
conceptual: Various more efficient internal representations are possible. Histograms. The TR representation is all about
permutations and histograms. Immediately obvious implications for certain kinds of queries──e.g., "How many parts are there of each color?" Explain the revised record reconstruction process.
──────────
* Row ranges look very much like intervals as in Chapter 23. But we'll see in the next section that we sometimes need to deal with
empty row ranges, whereas intervals in Chapter 23 were always
nonempty.
──────────
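A small Python sketch of condensing, for anyone who wants to try the "parts of each color" query by hand. Assumptions: a condensed column keeps each distinct value once, together with the (first, last) range of rows it occupied in the uncondensed sorted column; rows are numbered from 0; the color data and the function names are illustrative only. As noted above, row ranges are conceptual, so this is one convenient representation and certainly not the most efficient.

```python
from itertools import groupby

def condense(sorted_column):
    """Collapse a sorted column to (value, first_row, last_row) entries."""
    condensed = []
    row = 0
    for value, run in groupby(sorted_column):
        count = sum(1 for _ in run)
        condensed.append((value, row, row + count - 1))
        row += count
    return condensed

def histogram(condensed):
    """How many rows hold each value: just the length of each row range."""
    return {value: last - first + 1 for value, first, last in condensed}

# Hypothetical COLOR field of a parts file, sorted as at load time.
colors = sorted(["Red", "Green", "Blue", "Red", "Blue", "Red"])
condensed_colors = condense(colors)
parts_per_color = histogram(condensed_colors)
```

The count for each color falls straight out of the row ranges, with no need to touch any other column or to rebuild a single record.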
A.5 Merged Columns
An extension of the condensed-columns idea (in a way). Go through the bill-of-materials example. Explain the implications for join!
In effect, we can do a sort/merge join without doing the sort and without doing the merge, either! (The sort and merge are done at load time. Do the heavy lifting ahead of time! As with
logarithms, in fact.)
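To make the "no sort, no merge" point tangible, here's a rough Python sketch. It is my own simplification, not the representation from reference [A.1]: two sorted columns over the same domain are merged at load time into one condensed column, each entry carrying a row range for each side (possibly empty, as the footnote in the previous section anticipates), and the join at query time just pairs up row numbers. The data and all names are hypothetical, and rows are numbered from 0.

```python
from itertools import groupby

def row_ranges(sorted_column):
    """Map each distinct value to the range of rows it occupies."""
    ranges, row = {}, 0
    for value, run in groupby(sorted_column):
        count = sum(1 for _ in run)
        ranges[value] = range(row, row + count)
        row += count
    return ranges

def merge_columns(left_sorted, right_sorted):
    """Load-time step: one entry per distinct value, with a row range for
    each side; a value missing on one side gets an empty range there."""
    left, right = row_ranges(left_sorted), row_ranges(right_sorted)
    empty = range(0)
    return [
        (value, left.get(value, empty), right.get(value, empty))
        for value in sorted(left.keys() | right.keys())
    ]

def join_rows(merged):
    """Query-time step: pair row numbers; nothing is sorted or merged here."""
    return [
        (value, l, r)
        for value, left_range, right_range in merged
        for l in left_range
        for r in right_range
    ]

# Hypothetical supplier numbers from a suppliers file and a shipments file.
merged = merge_columns(["S1", "S2", "S3"], ["S1", "S1", "S2"])
pairs = join_rows(merged)
```

All the comparing of values happens in merge_columns, i.e., once, ahead of time; join_rows does no value comparisons at all.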