The XML Standard
Overview of our XML Standards
• Motivation: HTML vs XML
• XML 101: syntax, elements, attributes, DTDs, …
• XML 201: XML Schema, Namespaces
• XSLT: Transforming and Rendering XML
• XQuery: Search, Transform & Integrate
So what is XML (all about)?
Executive Summary:
• XML = HTML – idiosyncrasies (simplified syntax)
+ user-definable ("semantic") tags
• Separation of data and its presentation
=> simple, very flexible data exchange format:
What’s Wrong with HTML?
Y Papakonstantinou, S Abiteboul, H Garcia-Molina.
“Object Fusion in Mediator Systems” In VLDB 96
HTML confuses presentation
with content
What’s Wrong with HTML
Author
Conference
Title
And Some Repercussions
• Lack of schema/semantics when querying the Web (HTML):
– "find documents (books, papers, ) where
author = Michael Jackson"
( and learn how software engineering meets the moon
− automation of information management
(retrieval, manipulation, integration)
XML is Based on Markup
<bibliography>
  <paper ID="object-fusion">
    <authors>
      <author>Y.Papakonstantinou</author>
      <author>S Abiteboul</author>
      <author>H Garcia-Molina</author>
Decoupled from presentation
Elements and their Content
element
element name
Character content
Element Content
Empty Element
<bibliography>
  <paper ID="object-fusion">
    <authors>
      <author>Y.Papakonstantinou</author>
      <author>S Abiteboul</author>
      <author>H Garcia-Molina</author>
XML = Labeled Ordered Trees
<bibliography>
  <paper id="23">
    <authors>
      <author>Yannis</author> <author>Serge</author>
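A minimal sketch of the "labeled ordered tree" view, using Python's standard xml.etree.ElementTree. The closing tags in DOC are an assumption added here so that the slide's fragment parses; the helper name labels_in_document_order is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed completion of the slide's fragment: closing tags added, id quoted.
DOC = """<bibliography>
  <paper id="23">
    <authors>
      <author>Yannis</author>
      <author>Serge</author>
    </authors>
  </paper>
</bibliography>"""


def labels_in_document_order(elem, depth=0):
    """Return (depth, label) pairs in pre-order, which is document order."""
    rows = [(depth, elem.tag)]
    for child in elem:  # children are kept in the order they appear
        rows.extend(labels_in_document_order(child, depth + 1))
    return rows


root = ET.fromstring(DOC)
tree_rows = labels_in_document_order(root)
author_order = [a.text for a in root.iter("author")]

for depth, label in tree_rows:
    print("  " * depth + label)
```

Each node carries a label (the element name) and the order of siblings is significant: swapping the two author elements yields a different tree.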
How do I share structure and metadata/semantics?
How do I learn and use the element structure of a document?
Adding Structure and Semantics
• XML Document Type Definitions (DTDs):
• define the structure of "allowed" documents (i.e.,
– identify your vocabulary
• Resource Description Framework (RDF)
– simple metadata model
XML DTDs as Extended CFGs
<!ELEMENT bibliography (paper*)>
<!ELEMENT paper (authors, fullPaper?, title, booktitle)>
<!ELEMENT authors (author+)>
Document Type Definitions (DTDs)
Define and Constrain Element Names & Structure
Element Type Declaration
<!ELEMENT bibliography (paper*)>
<!ELEMENT paper (authors, fullPaper?, title, booktitle)>
<!ELEMENT authors (author+)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT fullPaper EMPTY>
<!ELEMENT title (#PCDATA)>
<!ELEMENT booktitle (#PCDATA)>
Attribute List Declaration
<!ATTLIST fullPaper source ENTITY #REQUIRED>
<!ATTLIST paper ID ID #IMPLIED>
Element Declarations
<!ELEMENT bibliography (paper*)>
<!ELEMENT paper (authors, fullPaper?, title, booktitle)>
<!ELEMENT authors (author+)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT fullPaper EMPTY>
<!ELEMENT title (#PCDATA)>
<!ELEMENT booktitle (#PCDATA)>
<!ATTLIST fullPaper source ENTITY #REQUIRED>
<!ATTLIST paper ID ID #IMPLIED>
bibliography: sequence of 0 or more paper
paper: authors followed by optional fullPaper, followed by title, followed by booktitle
authors: sequence of 1 or more author
author: character content
Element Content Declarations
Declaration                   Meaning
element name                  Exactly one instance of element
R?                            Zero or one instances of R
R*                            Zero or more instances of R
R+                            One or more instances of R
R1|R2|…|Rn                    One instance of R1 or R2 or … Rn
#PCDATA                       Character content
EMPTY                         Empty element
(#PCDATA | e1 | … | en)*      Mixed content
ANY                           Anything goes
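Since content models are regular expressions over element names, one way to see what they accept is to translate a model into an ordinary regex. The following is a sketch under that assumption, not how a validating parser is implemented; the encoding of each child name as "name;" and the helper matches_model are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Content model of <paper> from the slides: (authors, fullPaper?, title, booktitle)
# Each child name is encoded as "name;" so the model becomes a plain regex.
PAPER_MODEL = re.compile(r"authors;(fullPaper;)?title;booktitle;")
# Content model of <authors>: author+
AUTHORS_MODEL = re.compile(r"(author;)+")


def matches_model(elem, model):
    """Check the sequence of child element names against a content model."""
    child_names = "".join(child.tag + ";" for child in elem)
    return model.fullmatch(child_names) is not None


ok = ET.fromstring("<paper><authors/><title/><booktitle/></paper>")
with_full = ET.fromstring("<paper><authors/><fullPaper/><title/><booktitle/></paper>")
wrong_order = ET.fromstring("<paper><title/><authors/><booktitle/></paper>")
no_authors = ET.fromstring("<authors/>")

print(matches_model(ok, PAPER_MODEL))
print(matches_model(wrong_order, PAPER_MODEL))
```

The optional fullPaper maps to `?` and author+ maps to `+`, exactly as in the table above.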
<title>Object Fusion in Mediator Systems</title>
<related papers="semistructured-data mediators"/>
</paper>
</bibliography>
Object Identity Attribute
CDATA (character data)
<person ID="yannis"> Yannis’ info </person>
IDREF intradocument reference
Reference to external ENTITY
Attribute Types
Type        Meaning
ID          Token unique within the document
IDREF       Reference to an ID token
IDREFS      Reference to multiple ID tokens
ENTITY      External entity (image, video, …)
ENTITIES    External entities
NMTOKEN     Name token
NMTOKENS    Name tokens
More to appear? More types (e.g., DATE) may soon be part of the standard
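A rough sketch of what the ID and IDREFS constraints mean, written with Python's standard library, which does not validate against a DTD. The attribute names (ID on paper, papers on related) follow the slides; the sample document and the helper check_ids are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed sample: "mediators" is referenced but never declared as an ID.
DOC = """<bibliography>
  <paper ID="object-fusion">
    <related papers="semistructured-data mediators"/>
  </paper>
  <paper ID="semistructured-data"/>
</bibliography>"""


def check_ids(root, id_attr="ID", idrefs_attr="papers"):
    """Return (duplicate IDs, dangling references) for the whole document."""
    ids = [e.get(id_attr) for e in root.iter() if e.get(id_attr) is not None]
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    # An IDREFS value is a whitespace-separated list of ID tokens.
    refs = [r for e in root.iter() for r in e.get(idrefs_attr, "").split()]
    dangling = sorted(r for r in refs if r not in ids)
    return duplicates, dangling


duplicates, dangling = check_ids(ET.fromstring(DOC))
print(duplicates, dangling)
```

A validating parser performs both checks itself once the attributes are declared as ID and IDREFS in the DTD.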
Types of Entities
• Internal (to a doc) vs External ( → use URI)
• General (in XML doc) vs Parameter (in DTD)
• Parsed (XML) vs Unparsed (non-XML)
Internal Text Entities
Internal Text Entity Declaration
<!ENTITY WWW "World Wide Web">
Entity Reference
<p>We all use the &WWW;</p>
Logically equivalent to the replacement text actually appearing:
<p>We all use the World Wide Web</p>
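The equivalence above can be observed with a parser. A small sketch, assuming the declaration is placed in an internal DTD subset (the DOCTYPE wrapper is added here so the fragment stands alone) and using Python's standard xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

# The entity declaration lives in the internal DTD subset of the document.
DOC = """<!DOCTYPE p [
  <!ENTITY WWW "World Wide Web">
]>
<p>We all use the &WWW;</p>"""

p = ET.fromstring(DOC)
expanded_text = p.text  # the parser has already replaced the entity reference
print(expanded_text)
```

The application never sees the reference itself, only the replacement text.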
Unparsed (& "Binary") Entities
Declare external and unparsed entity
<!ENTITY fusion SYSTEM "fusion.ps" NDATA ps>
NOTATION declaration (helper app)
<!NOTATION ps SYSTEM "ghostview.exe">
Declare attribute type to be ENTITY
<!ATTLIST fullPaper source ENTITY #REQUIRED>
Element with ENTITY attribute
<fullPaper source="fusion"/>
From Docs to Data: XML Schema
• XML DTDs (part of the XML spec.)
– flexible, semistructured data model (nesting, ANY, ?, *, |, )
– but document-oriented (SGML heritage)
• XML Schema (W3C working draft)
– schema definition language in XML
– data-oriented: data types
– extends capabilities of DTD
Sample Data for Introduction to XML Schema
The Simple “Russian Doll” Approach
<xsd:element name="title" type="xsd:string"/>
<xsd:element name="author" type="xsd:string"/>
<xsd:element name="character"
    minOccurs="0" maxOccurs="unbounded">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="friend-of" type="xsd:string"
          minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="since" type="xsd:date"/>
      <xsd:element name="qualification" type="xsd:string"/>
    </xsd:sequence>
…
<xsd:attribute name="isbn" type="xsd:string"/>
Optional Namespace Definition
Sequence Compositor
Simple Type: content for title and author
Complex Type: content for book
character may appear any number of times
Basic Type of XML Schema
The Catalog Approach to XML Schema: Stand-Alone Declarations & References
<xsd:element name="title" type="xsd:string"/>
<xsd:element name="author" type="xsd:string"/>
<xsd:element name="name" type="xsd:string"/>
Attributes
Complex Type
Element character
Reference
Catalog Approach Cont’d
<xsd:attribute ref="isbn"/>
</xsd:complexType>
</xsd:element>
<xsd:simpleType name="nameType">
  <xsd:restriction base="xsd:string">
Groups: Named containers of sets of Elements or Attributes
<xsd:group name="mainBookElements">
<xsd:sequence>
<xsd:element name="title" type="nameType"/>
<xsd:element name="author" type="nameType"/>
Compositors: Sequence, Choice, All
The group nameTypes consists of one of
• the element “name”
• the sequence containing firstName, middlename, lastName
Compositors (cont’d)
<xsd:complexType name="characterType">
  <xsd:all>
    <xsd:element name="name" type="nameType"/>
    <xsd:element name="friend-of" type="nameType"
        minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element name="since" type="sinceType"/>
    <xsd:element name="qualification" type="descType"/>
  </xsd:all>
</xsd:complexType>
The characterType consists of name, a list of friend-of, since, and qualification particles in no particular order. (Compare with the sequence compositor.)
Derivation of Simple Types:
have seen restrictions and facets
The simple type isbnType will be either
• a 10-digit string (notice the pattern)
• the token "TBD" or the token "NA"
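The schema text for isbnType is not shown here, so the following is only a sketch of the rule as described: a value is accepted if it matches either member of the union. The pattern [0-9]{10} and the function name is_valid_isbn_type are assumptions.

```python
import re

# Assumed pattern for "a 10-digit string"; the slide's actual facet is not shown.
TEN_DIGITS = re.compile(r"[0-9]{10}")
# The two placeholder tokens named on the slide.
PLACEHOLDER_TOKENS = {"TBD", "NA"}


def is_valid_isbn_type(value):
    """Accept a value if it satisfies either member of the union."""
    return TEN_DIGITS.fullmatch(value) is not None or value in PLACEHOLDER_TOKENS


for candidate in ["0123456789", "TBD", "NA", "12345", "tbd"]:
    print(candidate, is_valid_isbn_type(candidate))
```

In XML Schema the same effect comes from deriving a simple type by union of two restricted types.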
By inserting xsd:unique in the book element declaration, we enforce that the character names in each book are unique.
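As a sketch of what this constraint checks (not of how a schema processor implements it), the helper below looks for repeated character names within a single book. The element names character and name come from the slides; the sample data and the helper duplicate_character_names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; only the element names come from the slides.
BOOK = """<book isbn="0123456789">
  <character><name>Snoopy</name></character>
  <character><name>Lucy</name></character>
  <character><name>Snoopy</name></character>
</book>"""


def duplicate_character_names(book):
    """Return names that occur more than once among this book's characters."""
    seen, duplicates = set(), []
    for name in book.findall("character/name"):
        value = (name.text or "").strip()
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


dups = duplicate_character_names(ET.fromstring(BOOK))
print(dups)
```

The scope of the check is one book element, so the same name may still appear in two different books.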
Including Unknown Elements
<xsd:complexType name="descType" mixed="true">
Presenting XML: XSLT
• Why Stylesheets?
– separation of content (XML) from presentation (XSL)
• Why not just CSS for XML?
– XSL is far more powerful:
• selecting elements
• transforming the XML tree
• content based display (result may depend on data)
XSLT Overview
• XSLT stylesheets are written in XML syntax
• XSL components:
1. a language for transforming XML documents (XSLT: integral part of the XSL specification)
2. an XML formatting vocabulary (Formatting Objects: >90% of the formatting properties inherited from CSS)
XSLT Processing Model
XSL stylesheet
Transformation
XSLT Processing Model
• XSL stylesheet: collection of template rules
• template rule: ( pattern ⇒ template )
• main steps:
– match pattern against source tree
– instantiate template (replace current node “.” by the template in the result tree)
– select further nodes for processing
• control can be
– program-driven ("pull": <xsl:for-each> )
– data/event-driven ("push": <xsl:apply-templates> )
Template Rule: Example
<xsl:template match="product">
(i) match pattern: process <product> elements
(ii) instantiate template: replace each <product> with two HTML tables
(iii) select the <product> grandchildren (“sales/domestic”, “sales/foreign”) for further processing
pattern
template
Creating the Result Tree
• Literal result elements : non-XSL elements (e.g., HTML) appear “literally” in the result tree
Example of Turning XML into HTML
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="FitnessCenter.xsl"?>
<FitnessCenter>
  <Member level="platinum">
    <Name>Jeff</Name>
    <Phone type="home">555-1234</Phone>
    <Phone type="work">555-4321</Phone>
    <FavoriteColor>lightgrey</FavoriteColor>
  </Member>
</FitnessCenter>
HTML Document in an XSL Template
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="html"/>
Extracting the Member Name
Extracting a Value from an XML Document, Navigating the XML Document
– A slash at the beginning of the path indicates
that it is an absolute path, starting from the
top of the XML document
/FitnessCenter/Member/Name
"Start from the top of the XML document, go to the FitnessCenter element, from there go to the Member element, and from there go to the Name element."
Document /
Extract the FavoriteColor and use it
Attribute values cannot contain "<" (an element cannot be nested inside an attribute value)
- Consequently, the following is NOT valid:
<Body bgcolor="<xsl:value-of select='/FitnessCenter/Member/FavoriteColor'/>">
To extract the value of an XML element and use it as an attribute value you must use curly braces:
<Body bgcolor="{/FitnessCenter/Member/FavoriteColor}">
Evaluate the expression within the curly braces. Assign the value to the attribute.
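Python's standard library has no XSLT processor, so the following is only an analogue of the two steps just described: evaluate a path against the source document, then assign the resulting string to an attribute of a result element. Paths are written relative to the FitnessCenter root element, which stands in for the absolute /FitnessCenter/… paths of the slides.

```python
import xml.etree.ElementTree as ET

DOC = """<FitnessCenter>
  <Member level="platinum">
    <Name>Jeff</Name>
    <Phone type="home">555-1234</Phone>
    <Phone type="work">555-4321</Phone>
    <FavoriteColor>lightgrey</FavoriteColor>
  </Member>
</FitnessCenter>"""

root = ET.fromstring(DOC)  # root is the FitnessCenter element

# Step 1: evaluate the path expressions (relative to FitnessCenter).
name = root.findtext("Member/Name")
color = root.findtext("Member/FavoriteColor")
home_phone = root.findtext("Member/Phone[@type='home']")

# Step 2: assign the extracted value to an attribute of a result element.
body = ET.Element("Body", bgcolor=color)
body.text = f"{name}: {home_phone}"
html = ET.tostring(body, encoding="unicode")
print(html)
```

The attribute receives a plain string, which is why markup such as xsl:value-of cannot appear inside the attribute value itself.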
Extract the Home Phone Number
</HTML>
</xsl:template>
</xsl:stylesheet>
Creating the Result Tree
• Further XSL elements for
Creating the Result Tree: Repetition
Creating the Result Tree: Sorting
<xsl:template match="employees">
  <ul>
    <xsl:apply-templates select="employee">
      <xsl:sort select="name/last"/>
      <xsl:sort select="name/first"/>
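The two xsl:sort elements act as a primary and a secondary sort key. A Python analogue of that ordering (not an XSLT run) is sketched below; the employee data is invented for illustration, and Python compares strings by code point, whereas an XSLT processor may apply language-specific collation.

```python
import xml.etree.ElementTree as ET

# Invented sample data; only the element names come from the slide.
DOC = """<employees>
  <employee><name><first>Bob</first><last>Smith</last></name></employee>
  <employee><name><first>Alice</first><last>Smith</last></name></employee>
  <employee><name><first>Carol</first><last>Jones</last></name></employee>
</employees>"""

employees = ET.fromstring(DOC).findall("employee")

# Primary key name/last, secondary key name/first, as in the two xsl:sort elements.
ordered = sorted(
    employees,
    key=lambda e: (e.findtext("name/last"), e.findtext("name/first")),
)
sorted_names = [
    (e.findtext("name/last"), e.findtext("name/first")) for e in ordered
]
print(sorted_names)
```

Employees sharing a last name are ordered among themselves by first name.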
More on XSL
• XSL(T):
– Conflict resolution for multiple applicable rules
XQuery: Querying XML Sources
• Functional Query Language
– Operates on the XPath/XQuery data model – List of ordered trees
– A document is a list of size 1
• XQuery expressions are composed of
– Path expressions
– Element constructors
– FLWR expressions
– … and more …
Path Expressions
doc("zoo.xml")//chapter[2]//figure[caption="Tree Frogs"]
In the second chapter of the document zoo.xml find the
figures with caption “Tree Frogs”
[Figure: tree of zoo.xml with nodes book, part, chapter, section, paragraph, figure, caption; captions “Tree Frogs” and “Just Frogs”]
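The same path can be tried with the limited XPath subset in Python's xml.etree.ElementTree. This is a sketch only: the ZOO document below is an invented stand-in for zoo.xml, and the path starts with "." because ElementTree evaluates paths relative to an element rather than through doc().

```python
import xml.etree.ElementTree as ET

# Invented stand-in for zoo.xml, shaped like the tree on the slide.
ZOO = """<book>
  <part>
    <chapter>
      <figure><caption>Just Frogs</caption></figure>
    </chapter>
    <chapter>
      <section>
        <paragraph/>
        <figure><caption>Tree Frogs</caption></figure>
      </section>
      <figure><caption>Just Frogs</caption></figure>
    </chapter>
  </part>
</book>"""

root = ET.fromstring(ZOO)

# chapter[2]: the second chapter child of its parent
# //figure[caption='Tree Frogs']: figures at any depth below that chapter
figures = root.findall(".//chapter[2]//figure[caption='Tree Frogs']")
captions = [f.findtext("caption") for f in figures]
print(captions)
```

The predicate [2] is positional, while [caption='Tree Frogs'] filters on the content of a child element.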
More Path Expressions
doc("zoo.xml")/part/chapter[1]//figure[caption="Tree Frogs"]
Find the first immediate chapter subelement of each immediate part subelement of the document zoo.xml and retrieve figures that have …
[Figure: tree of zoo.xml with nodes book, part, chapter, section, paragraph, figure, caption; captions “Tree Frogs” and “Just Frogs”; the result is marked at the “Tree Frogs” figure]
Bibliography Example Data Set
<bib>
<book>
<author> Aho </author>
<author> Hopcroft </author>
<author> Ullman </author>
<title> Automata Theory </title>
<publisher> Morgan Kaufmann </publisher>
<year> 1998 </year>
</book>
<book>
<author> Ullman </author>
<title> Database Systems </title>
<publisher> Morgan Kaufmann </publisher>
<year> 1998 </year>
</book>
<book>
<author> Abiteboul </author>
<author> Buneman </author>
<author> Suciu </author>
<title> Automata Theory </title>
<publisher> Prentice Hall </publisher>
<year> 1998 </year>
</book>
</bib>
Reviews Example Data Set
<reviews>
<review>
<title> Automata Theory </title>
<comment> It’s the best in automata theory </comment>
<comment> A definitive textbook </comment>
</review>
…
</reviews>
For-Let-Where-Return (FLWR)
for $b in doc("bib.xml")//book
where $b/publisher = "Morgan Kaufmann"
[Figure: book nodes with year and publisher children; values Morgan Kaufmann, Prentice Hall, 1998]
Think (tuples of) variable bindings
[Figure: book nodes with year and publisher children; values Morgan Kaufmann, Prentice Hall, 1998]
for $p in distinct-values(doc("bib.xml")//publisher)
let $b := doc("bib.xml")//book[publisher = $p]
where count($b) > 1
return $p
List publishers who have published more than 1 book
Tuples ($p, $b) are formed
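To make the tuple-of-bindings reading concrete, here is a Python analogue of the query (not XQuery itself). BIB is an abbreviated copy of the bibliography data set with authors and years omitted, and surrounding whitespace is stripped before values are compared, which is an assumption about how the data should be matched.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the bibliography data set (authors and years omitted).
BIB = """<bib>
  <book><title> Automata Theory </title><publisher> Morgan Kaufmann </publisher></book>
  <book><title> Database Systems </title><publisher> Morgan Kaufmann </publisher></book>
  <book><title> Automata Theory </title><publisher> Prentice Hall </publisher></book>
</bib>"""

root = ET.fromstring(BIB)

# Distinct publisher values, in first-seen order.
publishers = []
for p in root.iter("publisher"):
    value = (p.text or "").strip()
    if value not in publishers:
        publishers.append(value)

result = []
for p in publishers:                      # for: one tuple per distinct publisher
    b = [book for book in root.iter("book")
         if (book.findtext("publisher") or "").strip() == p]   # let: bind all matching books
    if len(b) > 1:                        # where: keep tuples with more than one book
        result.append(p)                  # return: emit the publisher

print(result)
```

The for clause iterates and produces one tuple per publisher, while the let clause binds the whole list of that publisher's books to a single variable.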
Boolean Expressions in WHERE
<author> Aho </author>
<author> Hopcroft </author>
<author> Ullman </author>
<title> Automata Theory </title>
<publisher> Morgan Kaufmann </publisher>
<year> 1998 </year>
<comment> It’s the best in automata theory </comment>
<comment> A definitive textbook </comment>
</book_with_review>