Professional XML Databases, part 9

Chapter 17: Data Warehousing, Archival, and Repositories

An example of a document using this structure is shown below (ch17_ex13.xml):

   <?xml version="1.0"?>
   <!DOCTYPE Invoice SYSTEM "ch17_ex13.dtd">
   <Invoice invoiceDate="10/17/2000" shipDate="10/20/2000" shipMethod="USPS">
      <Customer Name="Homer J. Simpson"
                Address="742 Evergreen Terrace"
                City="Springfield"
                State="KY"
                postalCode="12345" />
      <LineItem Quantity="12" Price="0.10">
         <Part Color="Blue" Size="3-inch" Name="Grommets" />
      </LineItem>
      <LineItem Quantity="12" Price="0.10">
         <Part Color="Blue" Size="3-inch" Name="Grommets" />
      </LineItem>
   </Invoice>

As you can see, this document handles all of the issues we were encountering with our data archival strategy:

❑ The LineItem data is associated directly with the Invoice. Once the document containing the invoice is opened, all of the line item information is directly accessible.

❑ Because we have designed the document to be self-contained, all of the other information about the invoice – such as the customer and parts for the invoice – is described in place. If the relational database has archived off customer or part information, the full meaning of the archived invoice can still be determined from the information contained in the document.

❑ This document is quite easy to read. If a customer has a question about an invoice and this file is identified as the XML document containing information about that particular invoice, it would be a simple matter to glean the information from the document and return it to the customer.

While there is a small price to pay in archive space consumption (as the XML documents contain more information than a traditional bulk-copied file would), the space may be easily recaptured by compressing the XML documents before storing them. At any rate, since archived information is typically stored to a removable medium, space consumption is not as critical an issue as it would be for a live system.
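The compression step mentioned above can be sketched in a few lines. This is only an illustration, not part of the book's sample code: it assumes Python's standard gzip module is an acceptable compressor, and the function names compress_archive and read_archive are hypothetical.

```python
import gzip
from pathlib import Path


def compress_archive(xml_path: Path) -> Path:
    """Write a gzip-compressed copy of an XML archive file next to the original."""
    gz_path = xml_path.with_name(xml_path.name + ".gz")
    gz_path.write_bytes(gzip.compress(xml_path.read_bytes()))
    return gz_path


def read_archive(gz_path: Path) -> str:
    """Return the original XML text from a compressed archive file."""
    return gzip.decompress(gz_path.read_bytes()).decode("utf-8")
```

Because invoice documents repeat the same element and attribute names many times, they tend to compress well, which is why the extra verbosity of XML costs little once the archive is compressed.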
If your system has so many transactions that a month's worth would create a file that is too large to be easily handled, you might consider breaking the archive down into smaller files, such as a file per week.

Another great benefit of using XML for a data archive is the ability to apply some of the emerging XML tools to that information. For example, an XML indexer might be used to make your data archive easily searchable, making it almost as efficient for pure reads as the original relational data was.

When you are creating your data archive, you may want to retain some indexing data in your database to help you locate specific information more easily. For example, you might have a table that contains the file name of the archived data, the identifier of the removable medium where it was stored, and the date ranges of the invoices the file contains. That way, when a specific data recovery request is made, you will be able to find the data you are looking for more easily.

Summary

In this section, we've seen how XML may be used to improve the data archival process. In a properly designed XML archive, each document will be self-contained and have all the information necessary to reconstruct the original business meaning of the information stored in the document. The documents are human-readable, making manual extraction of data simpler than with traditional data archival methods. Finally, an XML data archive may be manipulated with the emerging XML toolsets to make it a more powerful archival medium than flat bulk-copied files.

Data Repositories

One of the challenges sometimes encountered when designing enterprise solutions is that of building data repositories. Data repositories consist of large amounts of data, much of which is seldom aggregated or queried against.
An example where this might occur would be a real estate system – there are literally hundreds of data points associated with a particular property that is for sale, but a buyer is typically only going to run queries against a handful of them. However, the data is interesting once the buyer has drilled down to a particular property and wants to see the detailed information for that property. In this section, we'll see what the traditional approach to creating data repositories has been, and then see how XML can make the process easier.

Classical Approaches

Traditionally, data repositories are built in relational databases. All information, regardless of how often it is queried or summarized, is treated the same – as a column in a normalized structure where it is appropriate. If a column is searched against frequently, it may be indexed to improve performance, but that's about as much as can be done to differentiate it from columns that are only accessed on a single-row basis. Information that is only accessed at a detail level is effectively dead weight in the database from a querying perspective – it clogs up the pages, making more physical reads necessary per row accessed and leading to "cache thrashing".

Let's see a simple example. Suppose we had the following table in our database (ch17_ex14a.sql):

   CREATE TABLE Property (
      PropertyKey integer PRIMARY KEY IDENTITY,
      NumberOfBedrooms tinyint,
      HasSwimmingPool bit,
      Address varchar(50),
      City varchar(30),
      State char(2),
      PostalCode varchar(10),
      SellerName varchar(50),
      SellerAgent varchar(50))

Assuming that the character fields are entirely filled, each row in this table would consume about 200 bytes or so. If the database platform where this table resides uses 2K pages, only about 10 properties would be able to fit on one page (2,048 / 200 ≈ 10, ignoring per-row and per-page overhead).
However, if we want to select all the properties that have three bedrooms and no swimming pool, really we're only interested in six bytes of the record – the key and the two metrics we're querying against. If only those six bytes were stored, about 340 properties would fit on one 2K page (2,048 / 6 ≈ 341). Your mileage may vary, depending on the way your platform chooses to store data, fill factors, and other issues, but generally speaking a table with fewer columns will return the results of a query faster than one with more columns (assuming the query isn't covered by an index, in which case that rule of thumb does not apply).

We can improve our query speed by taking the columns that are not normally queried and moving them into another table (ch17_ex14b.sql):

   CREATE TABLE Property (
      PropertyKey integer PRIMARY KEY IDENTITY,
      NumberOfBedrooms tinyint,
      HasSwimmingPool bit)

   CREATE TABLE PropertyDetail (
      PropertyKey integer PRIMARY KEY,
      Address varchar(50),
      City varchar(30),
      State char(2),
      PostalCode varchar(10),
      SellerName varchar(50),
      SellerAgent varchar(50))

But why stop there? As we've discussed in this chapter, a great way to store detail data that doesn't need to be queried is as XML. In fact, systems with more detail-only data than not can benefit from using XML as their primary data repository. Let's see how we might do this.

Using XML for Data Repositories

Imagine turning the problem around and attacking it from an XML perspective. Information flows into your system in the form of XML. An indexing system picks up the XML document, indexes it into your relational database, and then stores the original XML document in a document repository.
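As a rough sketch of that flow (not the book's own code), the following indexes the two queryable values and then stores the original document on disk. It rests on several assumptions: SQLite stands in for the relational database, each incoming document is a single Property element that carries its data as attributes, HasSwimmingPool is encoded as 0 or 1, and the names index_document, find_documents, and the property_N.xml file naming are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical index table; SQLite is used here purely for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS Property (
    PropertyKey      INTEGER PRIMARY KEY,  -- assigned automatically by SQLite
    NumberOfBedrooms INTEGER,
    HasSwimmingPool  INTEGER,
    DocumentFile     TEXT
)
"""


def index_document(conn: sqlite3.Connection, xml_text: str, repository: Path) -> int:
    """Index the queryable attributes, then store the original document."""
    prop = ET.fromstring(xml_text)  # root is assumed to be a Property element
    cur = conn.execute(
        "INSERT INTO Property (NumberOfBedrooms, HasSwimmingPool) VALUES (?, ?)",
        (int(prop.get("NumberOfBedrooms")), int(prop.get("HasSwimmingPool"))),
    )
    key = cur.lastrowid
    # Keep the untouched XML in the document repository and point the index at it.
    doc_path = repository / f"property_{key}.xml"
    doc_path.write_text(xml_text, encoding="utf-8")
    conn.execute(
        "UPDATE Property SET DocumentFile = ? WHERE PropertyKey = ?",
        (str(doc_path), key),
    )
    conn.commit()
    return key


def find_documents(conn: sqlite3.Connection, bedrooms: int, has_pool: bool) -> list:
    """Query the small index table and return the matching document file names."""
    rows = conn.execute(
        "SELECT DocumentFile FROM Property "
        "WHERE NumberOfBedrooms = ? AND HasSwimmingPool = ?",
        (bedrooms, int(has_pool)),
    )
    return [row[0] for row in rows]
```

The query touches only the narrow index table; the address, seller, and other detail fields are read from the stored XML file only after a specific property has been selected.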
To carry on from our previous example, let's say we have XML documents with the following structure coming into our system (ch17_ex15.dtd):

   <!ELEMENT Property EMPTY>
   <!ATTLIST Property
      NumberOfBedrooms CDATA #REQUIRED
      HasSwimmingPool CDATA #REQUIRED
      Address CDATA #REQUIRED
      City CDATA #REQUIRED
      State CDATA #REQUIRED
      PostalCode CDATA #REQUIRED
      SellerName CDATA #REQUIRED
      SellerAgent CDATA #REQUIRED>

We need to build a structure in our relational database to hold the index into these documents. We've already decided that the fields we may want to query on or summarize are NumberOfBedrooms and HasSwimmingPool. Therefore, we create the following table in our database (ch17_ex15.sql):

   CREATE TABLE Property (
      PropertyKey integer PRIMARY KEY IDENTITY,
      NumberOfBedrooms tinyint,
      HasSwimmingPool tinyint,
      DocumentFile varchar(50))

We would then store the original XML document to a particular predefined location on our network and use the DocumentFile field to point to the document location. If we want to query on the number of bedrooms now, we can do so against the index and return a handful of filenames; these filenames can be used to drill into the original XML documents to provide detail information about the address, the seller, and so on.

There are a number of advantages to using XML for data repositories:

❑ Greater flexibility in providers. With the tendency towards XML standards, more and more external data providers will have the ability to provide data as XML. If you design your data repository to use XML as its primary storage mechanism, it becomes much easier to get data into and out of your system.

❑ Faster querying and summarization. If your relational database index is built properly, you can more quickly obtain a set of keys that will allow you to drill down into the specifics of each item in your repository. In addition, querying will be faster due to reduced database size.

❑ More presentation options.
If your data is stored natively as XML, you will have a greater arsenal of tools at your disposal that can be used to leverage that content without additional coding.

❑ Fewer locking concerns. Like the OLTP database we discussed earlier, keeping most of the information at the file level with only the indexed information in the database will reduce the locking concerns in the database and improve overall performance.

Be aware that if your data archive grows to be a large number of files, and you plan to access those files frequently, you may need to perform file system management to ensure that obtaining the information in those files doesn't become a bottleneck for you.

Summary

If you are designing a system that contains many data points that will never (or rarely) be queried and summarized – but will be reported at the detail level only – then using XML as your data repository platform might be your best bet. Passing the documents in the repository through an indexer, which extracts the information needed to query and summarize your detail and stores it in your relational database, gives you a document index so that you can find the documents that match your search criteria quickly and easily, while still allowing you to leverage existing XML tools to enhance the way you use that data.

Summary

In this chapter, we've seen how XML may be used to improve the way you access and manipulate your data. We've seen:

❑ How XML may be used to help create a data warehouse

❑ The benefits you can realize by using XML as your archival strategy

❑ How XML can improve the functionality of your data repository

As more of your business partners move towards being able to send and receive XML natively, your systems will directly and immediately benefit.
In addition, these strategies will help you to decrease lock contention on your systems and improve your data processing speed.

Chapter 18: Data Transmission

One of the most common uses of XML for data in the enterprise today, and part of its appeal, is data transmission. Companies need to be able to communicate clearly and unambiguously with one another, and with each other's systems, and XML provides a very good medium for doing so. In fact, as we've already seen, XML was created for data transmission between different vendors and systems. XML lets you create your own structure.

In this chapter, we'll take a look at the common goals and engineering tasks involved in data transmission, and see how XML can improve our data transmission strategy. In particular we'll look at:

❑ What data transmission involves

❑ Classic strategies for dealing with data transmission issues, and where their shortcomings lie

❑ How we can overcome some of the problems associated with the classic strategies using XML

❑ SOAP (Simple Object Access Protocol), and the elements that make up SOAP messages

❑ The basics of using SOAP to transmit XML messages over HTTP

Executing a Data Transmission

First, let's take a look at what's involved in transmitting data between two systems. Once we get a feel for the steps involved and the traditional way of handling them, we'll see how XML makes the processing of those steps easier.

Agree on a Format

Before we can send data between two systems, we need to agree what format the data transmission will take. This may or may not involve negotiation between the two teams developing the systems. If one of the systems is larger and has already implemented a data standard, typically the smaller team will write code to handle that standard.
If no standard exists, on the other hand, the two development teams will have to collaborate on a standard that suits each team's needs – a process that may be quite time-consuming, as we'll see when we discuss classical strategies later in this chapter.

Transport

Next, the sending party has to have some way of getting the data to the receiving party – will it be e-mail, HTTP, or FTP? Again, the sending party and the receiving party will have to agree on the mechanism used to transmit the data, which may involve discussions about firewalls and network security.

Routing

As systems become larger and larger, and begin to exchange data with more and more partners, systems that receive data will need to have some way of routing data to the appropriate system or workflow queue. This decision will be based on the sender and the operation that needs to be performed on that data. There are also security implications here, but we'll discuss those when we look at SOAP later in the chapter.

As more and more systems start to interoperate in this scenario, a loosely coupled information sharing approach becomes more practical. System-to-system transmission requires those systems to build an interface to each other, and as more systems are added the cost of this interoperability grows quadratically: n systems need up to n(n-1)/2 point-to-point interfaces. A loosely coupled approach that uses information brokers could reduce this cost to linear, as each system only requires an interface to the broker.

Request-Response Processing

More and more applications are starting to use the Internet as their framework for the processing and transmission of information. It's therefore becoming more important to build a mechanism that allows a specific transmission of data to be responded to in a traceable manner – things like an appraisal verification service, or a credit reporting service. This is especially important for service providers that offer access to their services via the Internet.
Microsoft's BizTalk is one application that facilitates this – we'll mention BizTalk again later in the chapter.

Classic Strategies

In this section, we'll see how the issues of data transmission have traditionally been addressed by systems that were not XML-aware. After we've seen some of the shortcomings of these strategies, we'll take a look at how XML can improve our ability to control the transmission and routing of data.

Selecting on a Format

When one system transmits data to another, that transmission typically takes the form of a character stream or file. Before two companies can set up a communications channel, they need to agree on the exact format of that channel. Typically, the stream or file is broken up into records, which are further subdivided into fields, as you would expect.

Let's see some of the typical structures we might expect to see in a classic data transmission format.

Delimited Files

Delimited files are quite common, and usually have some character (such as a comma or vertical bar |) to separate the fields, and a carriage return to separate the records. Empty or NULL fields are shown by two delimiting characters immediately following each other. You can read more about these in Chapter 12 – Flat File Formats.

Fixed-width Files

Fixed-width flat files have an advantage in that the systems always know the length and exact format of the data being sent. A carriage return will generally still be used as the record delimiter in this case. Again, you can read more about fixed-width files in Chapter 12.

Proprietary/Tagged Record Formats

As you might imagine, proprietary formats can vary in structure from hybrid delimited/fixed-width formats, to relatively normalized structures. The key to these structures is that typically there are different types of records; each record will have some sort of indicator specifying the type of record (and hence the meaning of the fields found in this record).
For each record, however, all of our formatting and other specification rules still apply.

For example, we might have the following specialized format for our invoice example, which we worked on in the first four chapters, where each record is exactly 123 bytes long. The first character of each record is used as the record identifier. Records must always start with the Invoice header record, followed by the Customer record, and then one or more Part records:

1. Invoice header record
2. Customer record
3. One or more Part records

Based on its contents, the fields that make up each record are as follows:

Invoice Header Record

   Field  Start Position  Size  Name             Format             Description
   1      1               1     Record type      Always H           Indicates an invoice header record
   2      2               8     Order Date       Datetime YYYYMMDD  The date the order on the invoice was placed
   3      10              8     Ship Date        Datetime YYYYMMDD  The date the order on the invoice was shipped
   4      18              106   Unused (Filler)  String             Must be filled with all spaces

Customer Record

   Field  Start Position  Size  Name                  Format    Description
   1      1               1     Record type           Always C  Indicates a customer record
   2      2               30    Customer Name         String    The name of the customer for this invoice
   3      32              50    Customer Address      String    The street address of the customer
   4      82              30    Customer City         String    The city of the customer's address
   5      112             2     Customer State        String    The customer's state
   6      114             10    Customer Postal Code  String    The customer's postal code

Part Record

   Field  Start Position  Size  Name                Format                                                      Description
   1      1               1     Record type         Always P                                                    Indicates a part record
   2      2               20    Part Description 1  String                                                      The description of the first part ordered
   3      22              5     Part Quantity 1     Numeric. Left pad with zeroes.                              The quantity of the first part ordered
   4      27              7     Part Price 1        Numeric. Two implied decimal places. Left pad with zeroes.  The unit price of the first part ordered
   5      34              90    Unused              String                                                      Must be filled with all spaces

A file that follows the above format might look like this (we've had to format this slightly to fit it on the page):

   H20001017200010223
   CHomer J. Simpson 742 Evergreen Terrace Springfield KY12345
   P3 inch blue grommets000170000010
   P2 inch red sprockets000230000015
   H20001017200010223
   CKevin B. Williams 744 Evergreen Terrace Springfield KY12345
   P1.5 inch silver spro000110000025
   P3 inch red grommets 000140000030
   P0.5 inch gold widget000090000035

[...]

... away when processing an XML document

XML Documents can Utilize Off-The-Shelf XML Tools

There are many off-the-shelf tools that are well suited to the creation, manipulation, and processing of XML documents. As XML becomes more and more prevalent in the business environment, you can bet that more and more toolsets will be developed that allow programmers to make use of content in an XML form. Significantly, ...

... cost-effective for most applications

How Can XML Help?

We've seen the various problems encountered when attempting to transfer data using traditional means. Now, let's take a look at how using XML to transfer data helps us eliminate many of these challenges.

XML Documents are Self-Documenting

One of the best things about XML is that properly designed XML documents are self-documenting, in the sense ...

... SOAPAction header. Note that we specify the content type as text/xml – this should always be the case for SOAP messages:

   POST /soap/Handler HTTP/1.1
   Content-Type: text/xml; charset="utf-8"
   Content-Length: nnnn
   SOAPAction: ""
   ...

   Content-Type: text/xml; charset="utf-8"
   Content-Length: nnnn
   SOAP-ENV:Client.BusinessRule.NotFound
   The resent record was not found
   ...
... control over the format of the XML created – SQL Server and Oracle simply create an XML string based on the structure of the joined result set created. While these technologies will almost certainly be the way we marshal XML from our relational databases in the long term, for now we will need to take a different approach.

The Manual Approach

To marshal our data into an XML document, there are a few approaches ...

... this:

   Resend

Here, we're saying that there is a MessageStatus associated with the XML payload that's in ...

... Elements and attributes that appear in the XML payload may be assigned to a namespace, but are not obliged to be.

Let's say that what we're retransmitting is a copy of an invoice. We might have a SOAP message that looks like this:

... XML

One of the major concerns with XML is the large files that often result when data is represented in an XML document. A system that is attempting to transmit or receive a large number of documents at once may have to be concerned about the bandwidth consumption of those documents. However, since XML documents are text (and typically repetitive text ...

... element or attribute, assuming the author has designed the XML file well. Take for example the following XML structure (ch18_ex01.xml):

   <?xml version="1.0"?>
   <!DOCTYPE OrderData ...
