Knowledge About the Hybrid Engine
NATIVE XML STORAGE
FIRING UP THE HYBRID ENGINE
IBM’s new hybrid DB2 puts the full power of a relational engine to work on a truly native XML store that sits side by side with DB2’s relational data repository.
LARA HATA
DB2 Magazine, Quarter 3, 2005

Relational databases drive most businesses of any size today. Popular and important as these databases are, they’re simply not a great match for semi-structured (and hierarchical) content represented in XML. Because enterprises have, in aggregate, trillions of dollars invested in relational data and relational database management systems (RDBMSs), simply replacing RDBMSs with a pure XML store isn’t an option. Adding an XML-only database into the infrastructure adds yet another integration and complexity challenge.

IBM is about to introduce true-native support for both XML and relational data. This evolutionary technology, now in beta tests with a small group of IBM customers, provides hybrid relational/XML storage from the ground up. That means DB2 will no longer need the XML Extender (just as it doesn’t need an SQL Extender). DB2 will simply handle XML natively. (There are varying definitions of “native” XML support. To clear up the confusion about what’s typically called “native” today, see the sidebar “What Is True Native?”)

In the hybrid version, XML is handled as a new data type. Nearly every DB2 component, tool, and utility has been enhanced to recognize and handle this new data type. The new storage paradigm retains XML in a parsed, annotated tree form, similar to the XML Document Object Model (DOM), that’s separate from the relational data store (see Figure 1).

On top of both data stores (relational and XML) sits one hybrid database engine. That single engine can process XQuery, XPath, SQL, and SQL/XML. The engine features a bilingual query compiler with parsers for both SQL and XQuery. So developers can access information using either language (or both together) according to what makes the most sense in specific situations.
A hybrid DB2 provides the flexibility to shift paradigms (between XML and SQL) as information management needs change.

Storing relational and XML data in a database management system that understands and supports both models at every level (from the client, through the engine, down to the disk) provides flexibility and consistently fast performance. The XML data inherits the same backup and recovery, optimization, scalability, and high availability DB2 offers for relational data. Ultimately, a unified XML/relational database keeps things simple by avoiding the need to integrate XML and relational data from separate stores.

NATIVE BENEFITS

The first generation of XML support in relational databases was based on either shredding (or decomposing) documents to fit into relational tables or storing documents intact as character or binary large objects (CLOBs or BLOBs). Each of these choices attempts to force XML into a relational model. However, these approaches have serious limitations in capability and performance. The hybrid model stores XML in a model similar to the DOM. The XML data is formatted to buffered data pages for faster navigation and query execution as well as simpler indexing.

When DB2’s true-native XML support debuts with the next major release, existing support for storing XML documents shredded in relational tables or intact as CLOBs and BLOBs will continue. Support for shredding is important because XML can be used to feed existing relational schemas. However, true-native storage offers significant advantages in these areas:

Storage. DB2’s native XML technology will store XML with node-level granularity instead of document-level. While interacting with IBM’s native XML support, the abstraction shown is a column of type XML in a relational table. This column has no maximum length and no mandatory constraining XML schema. Any well-formed XML document can be inserted into that column.
Therefore, the following statement is a valid table definition:

    create table dept (deptID int, deptdoc xml)

A table isn’t limited to a single column of any given type, so the following statement is equally valid:

    create table dept2 (deptID int, deptinfo xml, orgchart xml, employees xml)

In the physical storage layer, the primary storage unit in the IBM implementation is a node. A node exists on a page along with other nodes from the same or different documents. Each node is linked not only to its parent, but also to its children. As a result, navigating to a node’s parent, siblings, or children is highly efficient, operating at little more than pointer traversal speeds as long as the next referenced node is on the same page. Nodes can grow or shrink in size, or they can be relocated to other pages without rewriting the entire document.

Indexing. XML applications that manage millions of XML documents aren’t uncommon; indexing these large collections of XML data is required to provide high query performance. DB2 supports path-specific indexes on XML columns so that elements and attributes frequently used in predicates and cross-document joins can be indexed.

The new XML values index can provide efficient evaluation of XML pattern expressions to improve performance during queries on XML documents. In contrast to traditional relational indexes, in which index keys are composed of one or more table columns specified by the user, the XML values index uses a particular XML pattern expression (a subset of XPath that doesn’t contain predicates, among other things) to index paths and values in XML documents stored in a single XML column. The index can also fill in default attribute and element values from the schema at insertion time if the values aren’t specified in the document. When creating an index, you can specify what paths to index and as what type. Any nodes that match the path expression or the set of path expressions in the XML documents stored in that column are indexed, and the index points directly to the node in storage that’s linked to its parent and children for fast navigation.

Instead of providing index access to the beginning of a document, index entries contain actual document node position information. As a result, the index can quickly provide direct access to the nodes within a document and avoid a document traversal. In addition, because the index has this document node position information, it understands the document hierarchy and can perform containment tests. The index knows which child nodes belong to the same ancestor and can do appropriate filtering.

For example, here’s how to define an index on all employee names in all documents in the XML column deptdoc:

    create index idx1 on dept(deptdoc) generate key using xmlpattern '/dept/employee/name' as sql varchar(35);

The xmlpattern is a path that identifies the XML nodes to be indexed. Because DB2 doesn’t require a single XML schema for all documents in an XML column, it may not know which data type to use in the index for a given xmlpattern. The user must specify the data type explicitly in the as sql <type> clause. If a node matches the xmlpattern but fails to cast to the specified index type, DB2 creates no index entry for that node and raises no error. A single document may contain zero, one, or multiple nodes that match the xmlpattern. Thus there may be zero, one, or multiple index entries for a single row in a table (which is significantly different from indexes on relational columns).

FIGURE 1. DB2’s new XML-relational storage model.

You can create indexes on multiple path expressions on any given column of type XML. Therefore, the following statements are also valid:

    create index idx1 on dept(deptdoc) generate key using xmlpattern '/dept/employee/name' as sql varchar(35);
    create index idx2 on dept(deptdoc) generate key using xmlpattern '/dept/employee/@id' as sql int;

Furthermore, path expressions can include both wildcards and descendant-or-self axis traversal, so the following statements are also valid:

    create index IX3 on dept(deptdoc) generate keys using xmlpattern '/dept/*/name' as sql varchar(20)
    create index IX4 on dept(deptdoc) generate keys using xmlpattern '//office' as sql double
    create index IX5 on dept(deptdoc) generate keys using xmlpattern '/dept/employee/*' as sql varchar(20)

Query. XQuery, the new language for querying XML data, is designed to handle diverse schemas, including constructs such as sequences (instead of sets, as in SQL), multiple nested sequences, and sparse attributes. XQuery can also support heterogeneous schemas and dynamic schema changes.

The IBM implementation has no stand-alone XQuery or XPath processor. The basic XQuery and XPath primitives are built directly into the query engine. The query compiler itself is bilingual, having two interoperating query language parsers, one for SQL and the other for XQuery, to generate a new variation of the Query Graph Model designed to process relational and XML data. Because the intermediate query representation is language-neutral, XQuery, SQL, and combinations of XQuery and SQL compile into the same intermediate representation, go through the same rewrites and transformation, are optimized in a similar manner, and generate similar executable code. This process results in optimal and interoperating query plans regardless of the language used to specify them.

Because the two parsers interoperate, you can mix SQL and XQuery in the same statement, making the searches more powerful by providing the ability to query within the XML document and returning fragments of it from SQL:

    select deptID,
           xmlquery('for $d in $deptdoc/dept where $d/@bldg = 101 return $d/name'
                    passing d.deptdoc as "deptdoc")
    from dept d
    where deptID <> 'PR27';

The query first filters the rows where deptID is not PR27. After that, it returns deptID and the department name as XML fragments if the building is 101.

In DB2, XQuery can operate on XML documents in XML columns. However, if you want to restrict the input to an XQuery based on conditions placed on relational columns, you can do so via db2-fn:sqlquery, which accepts any select statement that returns a single XML column.

    for $e in db2-fn:sqlquery('select deptdoc from dept where deptid = ''PR27''')/dept/employee
    where $e/office = 344
    return $e/name

Deeper levels of nesting (SQL within XQuery in which SQL itself contains nested XQuery) are supported.

Flexibility and performance. With IBM’s hybrid approach, there’s no need to predefine XML schema, limit documents to a given schema, or provide any mapping between XML and relational models.

The hybrid approach offers an important advantage over shredding: It eliminates the cost of joins and other processing necessary to reconstitute XML documents. In the case of complex documents, these costs can be very significant.

When compared to CLOB approaches, truly native storage eliminates the need to parse XML documents at query time. Given XML parsing costs, CLOB-based approaches are impractical if any form of search into the document (that is, parsing) is necessary. CLOB should be considered only when the usage models are expected to be full document insertion, search by purely relational attributes, and full document retrieval.

Native storage improves on the BLOB approach because it provides more consistent behavior as the size of documents increases or when the amount of data to access is a small percentage of the total document size.

RIGHT MODEL, RIGHT TASK

A true-native XML data store does more than expose XML to its clients; it represents the XML as XML throughout the entire data engine stack (from client to disk and back out again).

Hybrid systems don’t mandate that all data be represented as relational data, nor do they require that all data be in XML; instead, they provide the choice of the right model for the right task.

Anjul Bhambhri [bhambhri@us.ibm.com] is the senior development manager for XML support in DB2 and heads the XML effort across DB2 UDB.

www.db2mag.com

WHAT IS TRUE NATIVE?

Each of the currently available (non-native) methods for managing XML in relational databases attempts to make XML conform to the relational model in some way. These approaches include:

Shredding. Most major RDBMSs (including DB2) support shredding. Shredding involves defining a relational schema that corresponds to the XML (for example, representing parent/child relationships in the XML as one or more child tables in a referential integrity constraint with its parent) and defining a mapping from the XML data to the relational schema.

Shredding is a good fit in existing relational environments. However, mapping can be complex and fragile, and you must define a mapping for each XML document you want to store. If the XML schema changes, the mapping may no longer be valid or may require a complex change process. Once decomposed, the data ceases to be XML, loses any digital signature, and becomes difficult and expensive to reconstruct (often requiring many joins).

Storing XML as a CLOB. All major vendors support storing entire XML documents in a variable-length character type (VARCHAR) or as CLOBs. If XML documents are inserted into CLOB or VARCHAR columns, they are typically inserted as unparsed text objects. CLOBs preserve the original document and provide uniform handling of any XML, including highly volatile schemas. Avoiding XML parsing at insert time guarantees high insert performance. However, without XML parsing, XML document structure is entirely ignored. This precludes the database from doing intelligent and efficient search and subdocument-level extract operations on the stored text objects. The only remedy is to invoke the XML parser at query execution time to “look into” the XML documents so that search conditions can be evaluated. The high insert performance comes at the cost of low search and extract performance.

BLOB (pseudo native). BLOB-based storage is conceptually similar to CLOB storage; however, instead of storing the XML data as a pre-parse string, BLOBs store it in a proprietary post-parse binary representation. This approach is sometimes called pseudo native, because the data representation remains in XML within the BLOB.

However, the underlying storage for a document is virtualized as a single contiguous byte range, which can cause performance problems. Updating can require the entire document to be rewritten (and locked). Access to portions of the document might require the entire document to be read from disk.

True native. True native storage holds the post-parsed data on disk, enabling individual nodes of the data model to be stored independently (that is, not as a stream) and then interconnected. True native storage provides the advantages of BLOB and CLOB, but resolves the remaining performance issues because the document storage isn’t virtualized as a single contiguous byte range. The storage for the entire set of documents is virtualized as a contiguous byte range; however, individual nodes can be relocated in this range with minimal impact on other nodes and indexing.
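The behavior the article describes for the XML values index can be modeled with a small Python sketch. Each row may contribute zero, one, or many entries, and values that fail to cast are skipped without an error. This is only an illustration under stated assumptions, not DB2’s implementation. The function name build_values_index, the sample rows, and the use of Python’s xml.etree.ElementTree as a stand-in for DB2’s node storage are all hypothetical. The sketch also handles element paths only, so attribute patterns such as '/dept/employee/@id' are out of scope.

```python
import xml.etree.ElementTree as ET


def build_values_index(rows, pattern, cast):
    """Build sorted (key, row_id, node_position) entries for one XML column.

    rows    -- iterable of (row_id, xml_text) pairs standing in for table rows
    pattern -- element-only path such as '/dept/employee/name' or '//office'
    cast    -- callable playing the role of the 'as sql <type>' clause
    (All three names are hypothetical; this is not a DB2 API.)
    """
    entries = []
    for row_id, xml_text in rows:
        root = ET.fromstring(xml_text)
        # Document-order position of every element, standing in for the
        # node position information that the article says index entries carry.
        position = {node: i for i, node in enumerate(root.iter())}
        # ElementTree paths are relative, so hang the document under a
        # wrapper element and prefix the pattern with '.'.
        wrapper = ET.Element("wrapper")
        wrapper.append(root)
        for node in wrapper.findall("." + pattern):
            try:
                key = cast(node.text)
            except (TypeError, ValueError):
                continue  # value does not cast: no entry, no error
            entries.append((key, row_id, position[node]))
    return sorted(entries)


# Hypothetical sample data: three rows of a dept table with an XML column.
rows = [
    (1, "<dept><employee><name>Ann</name><office>344</office></employee>"
        "<employee><name>Bo</name><office>216</office></employee></dept>"),
    (2, "<dept><employee><name>Cy</name><office>annex</office></employee></dept>"),
    (3, "<dept/>"),
]

office_index = build_values_index(rows, "//office", float)
name_index = build_values_index(rows, "/dept/employee/name", str)
print(office_index)
print(name_index)
```

In this sample, row 1 yields two office entries, row 2 yields none because "annex" does not cast to a number, and row 3 yields none because no node matches. That is the zero-one-or-many relationship between rows and index entries that sets this index apart from an index on a relational column.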