Chapter 5 Working with SAX

Section 5.1 Introduction
Section 5.2 Basic Tips for Using SAX
Section 5.3 DOM versus SAX
Section 5.4 Summary

5.1 Introduction

Unlike DOM, the SAX specification is not authorized by W3C. SAX was developed through the xml-dev mailing list, the largest community of XML-related developers. The development of SAX was finished in May 1998. SAX 2.0, which introduced namespace support and the feature/property mechanism, was completed in May 2000.

As described in Chapter 2, SAX is an event-based parsing API. Its methods and data structures are much simpler than those of DOM. This simplicity implies that application programs based on SAX are required to do more work than those based on DOM. On the other hand, SAX-based programs can often achieve high performance.

In this chapter, we describe some tips for using SAX. Then we compare DOM and SAX, and introduce sample programs using DOM and SAX.

5.2 Basic Tips for Using SAX

In Chapter 2, Sections 2.4 (see Figure 2.2) and 2.4.2 describe the basic concepts of SAX and the programming model for SAX. The concept of SAX is simple. A SAX parser reads an XML document from the beginning, and the parser tells an application what it finds by using the callback methods of ContentHandler or other interfaces. However, there are some things you should know. We discuss them in this section.

5.2.1 ContentHandler

In this section, we discuss a major trap for beginning users of SAX and the parser feature mechanism, an important feature introduced in SAX2.

Trap of the characters() Events

The characters() method of ContentHandler confuses SAX beginners. Consider the following document:

Hello, XML & Java!
A programmer might expect the parsing of this document to throw five events:

    startDocument()
    startElement() for the root element
    characters(): "\n Hello,\n XML & Java!\n"
    endElement() for the root element
    endDocument()

Actually, the SAX parser of Xerces produces three characters() events between startElement() and endElement(). They are:

    characters(): "\n Hello,\n XML "
    characters(): "&"
    characters(): " Java!\n"

The SAX parser of Crimson produces eight characters() events:

    characters(): ""
    characters(): "\n"
    characters(): " Hello,"
    characters(): "\n"
    characters(): " XML "
    characters(): "&"
    characters(): " Java!"
    characters(): "\n"

These behaviors are not bugs in these parsers. The SAX specification allows splitting a text segment into several events, so take care when you write an application that processes character data. Listing 5.1 is a program that checks whether the text in an element matches a given string. The program shows a way to solve the problem of split characters() events.

Listing 5.1 A correct way to process text, chap05/TextMatch.java

    package chap05;

    import java.io.IOException;
    import java.util.Stack;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;
    import org.xml.sax.helpers.XMLReaderFactory;

    public class TextMatch extends DefaultHandler {
        StringBuffer buffer;
        String pattern;
        Stack context;

        public TextMatch(String pattern) {
            this.buffer = new StringBuffer();
            this.pattern = pattern;
            this.context = new Stack();
        }

        // Compare the accumulated text with the pattern, then reset the buffer.
        protected void flushText() {
            if (this.buffer.length() > 0) {
                String text = new String(this.buffer);
                if (pattern.equals(text)) {
                    System.out.print("Pattern '"+this.pattern
                                     +"' has been found around ");
                    for (int i = 0; i < this.context.size(); i++) {
                        System.out.print("/"+this.context.elementAt(i));
                    }
                    System.out.println("");
                }
            }
            this.buffer.setLength(0);
        }

        // Text may arrive in several pieces, so only accumulate it here.
        public void characters(char[] ch, int start, int len)
            throws SAXException {
            this.buffer.append(ch, start, len);
        }

        public void ignorableWhitespace(char[] ch, int start, int len)
            throws SAXException {
            this.buffer.append(ch, start, len);
        }

        public void processingInstruction(String target, String data)
            throws SAXException {
            // Nothing to do because PI does not affect the m[...]
            // of a document
        }

        // An element boundary ends the current text segment.
        public void startElement(String uri, String local,
                                 String qname, Attributes atts)
            throws SAXException {
            this.flushText();
            this.context.push(local);
        }

        public void endElement(String uri, String local, String qname)
            throws SAXException {
            this.flushText();
            this.context.pop();
        }

        public static void main(String[] argv) {
            if (argv.length != 2) {
                System.out.println("TextMatch

[...]

Robert Smith 8 Oak Avenue, New York, US Hurry, my lawn is going wild! ABC-123 Hyper Toothbrush Turbo 2002-08-21

A closer look at this generated XML document reveals the mapping rules of Castor XML, as shown in Table 15.1. Class names and field names are converted to element type names and attribute names using Castor XML's name mapping rules. If fields are of Java primitive types such as int and double, they are mapped to attributes. If they are not primitive types (including String), new elements are created for them.

Table 15.1 Mapping Rules of Castor XML

    JAVA CLASS NAME OR FIELD NAME    XML ELEMENT NAME OR ATTRIBUTE NAME
    class PurchaseOrder              purchase-order (element)
    class Customer                   customer (element)
    int customerId                   customer-id (attribute)
    String comment                   comment (element)
    :                                :

This XML document can be read (that is, unmarshaled) into another Java program by calling the org.exolab.castor.xml.Unmarshaller.unmarshal() method, as shown in the following code fragment.

    import org.exolab.castor.xml.Unmarshaller;
    :
    FileReader reader = new FileReader("newpo.xml");
    PurchaseOrder po = (PurchaseOrder) Unmarshaller.unmarshal
        (PurchaseOrder.class, reader);

Note that you need to pass the class object of PurchaseOrder as a parameter, because when looking at the root element, the unmarshaler cannot determine the class to be mapped. There may be more than one class that could be mapped to this element name. For example,
there may be other PurchaseOrder classes in different packages, or there may be a purchaseOrder class (with the uncapitalized "p"). For a similar reason, the XML document contains the following type information in the elements:

    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:type="java:chap15.castor.Item"

These attributes tell the unmarshaler that the items field of a PurchaseOrder instance should have Item instances. If this information is omitted, Castor XML creates Vector objects for the elements instead of Item instances.

15.3.2 Pros and Cons of Generating XML Documents from Java Classes

In what situations are these data binding tools based on Java reflection most useful? Let us consider this question by contrasting these tools with the schema-driven tools that we studied in Section 15.2. The biggest difference is that reflection-driven data binding tools can be used without having a schema before the application development. So the obvious answer is: when the application program is already designed or implemented, and the program needs to exchange its application data with other applications.

Data exchange between distributed applications has been studied for many years in the form of remote procedure calls and distributed objects. In Java, the Java language provides Java object serialization, which is used for implementing Java's Remote Method Invocation (RMI). How are the reflection-based XML data binding tools different from Java object serialization?
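Before comparing the two approaches, it may help to see a Java object serialization round trip in isolation. The following is a minimal sketch rather than code from this chapter: the class SerializationSketch and its nested Order class are hypothetical stand-ins for the purchase order classes, and an in-memory byte array is used instead of a file so that the example is self-contained. It relies only on the standard java.io classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationSketch {

    // Hypothetical stand-in for the chapter's purchase order class.
    // Serializable is a marker interface: it declares no methods.
    static class Order implements Serializable {
        private static final long serialVersionUID = 1L;
        final int customerId;
        final String comment;

        Order(int customerId, String comment) {
            this.customerId = customerId;
            this.comment = comment;
        }
    }

    // Serialize: write the object and everything it references as octets.
    static byte[] serialize(Object obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    // Deserialize: rebuild the object graph from the octets. The receiving
    // side must be able to load the same class definitions.
    static Object deserialize(byte[] octets)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(octets))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Order po = new Order(7, "Hurry, my lawn is going wild!");
        byte[] octets = serialize(po);
        Order copy = (Order) deserialize(octets);

        System.out.println(copy.customerId + ": " + copy.comment);
        // Every serialization stream starts with the magic number 0xACED.
        System.out.printf("%d octets, starting with %02x%02x%n",
                          octets.length, octets[0] & 0xff, octets[1] & 0xff);
    }
}
```

The copy is a distinct object with the same field values, and the stream is a binary format that only a Java program with the matching class on its class path can interpret.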
Java object serialization is a specification of encoding Java objects in an octet stream so that the object can be exchanged over a network. Therefore, serialization in Java object serialization has the same meaning as the marshaling we have described in this chapter.

To use Java object serialization, classes to be serialized must implement the Serializable interface. This interface is used for marking the implementation classes that are subject to the serialization and deserialization operations and has no required methods to be implemented. When a serializable object is written to a java.io.ObjectOutputStream, the object is converted into an octet string. For example, if you want to serialize an object referenced by the variable po, the following code fragment does it for you.

    ObjectOutputStream os = new ObjectOutputStream(new FileOutputStream
        (args[1]));
    os.writeObject(po);
    os.close();

The object referred to by the variable po and all the objects referenced by the fields of the object are serialized into a single octet stream. As an example, in Listing 15.12 we show a serialized purchase order that we created in Section 15.3.1.

Listing 15.12 Serialized purchase order

    00000000: aced 0005 7372 001b 6368 6170 3135 2e63
    00000010: 6173 746f 722e 5075 7263 6861 7365 4f72  asto
    00000020: 6465 7207 41d6 ba16 4616 5602 0003 4c00  der
    00000030: 0763 6f6d 6d65 6e74 7400 124c 6a61 7661  com
    00000040: 2f6c 616e 672f 5374 7269 6e67 3b4c 0008  /lan
    00000050: 6375 7374 6f6d 6572 7400 184c 6368 6170  cust
    00000060: 3135 2f63 6173 746f 722f 4375 7374 6f6d  15/c
    00000070: 6572 3b4c 0005 6974 656d 7374 0012 4c6a  er;L
    00000080: 6176 612f 7574 696c 2f56 6563 746f 723b  ava/
    00000090: 7870 7400 1d48 7572 7279 2c20 6d79 206c  xpt
    000000a0: 6177 6e20 6973 2067 6f69 6e67 2077 696c  awn
    000000b0: 6421 7372 0016 6368 6170 3135 2e63 6173  d!sr
    000000c0: 746f 722e 4375 7374 6f6d 6572 4574 8df5  tor
    000000d0: 5b62 39e3 0200 0349 000a 6375 7374 6f6d  [b9
    000000e0: 6572 4964 4c00 0761 6464 7265 7373 7100  erId
    000000f0: 7e00 014c 0004 6e61 6d65 7100 7e00 0178  ~ L

This octet stream can be deserialized by a receiving Java program by using java.io.ObjectInputStream as follows:

    ObjectInputStream is = new ObjectInputStream(new FileInputStream
        (args[1]));
    po = (PurchaseOrder) is.readObject();
    is.close();

So from its functionality, Java object serialization looks very similar to the data binding of Castor XML. Why would we want to use a separate library when the Java language provides a built-in function that does the same task? There are several reasons.

Data Exchange between Different Implementations

The first limitation of Java object serialization is that it requires exactly the same class implementations on the sending side (the marshaling or serializing side) and on the receiving side (the unmarshaling or deserializing side). Java has a dynamic class loading capability, so it is technically possible to share the same class implementations across a network, but for management and security, it is sometimes not feasible or not desirable.[11]

[11] Executing downloaded programs without an extensive sanity check is not generally recommended for security reasons.

Exchanging Data with Applications Written in a Non-Java Language

A more serious limitation for B2B communication is that the use of Java object serialization as the standard data exchange format requires all the communication parties to use Java as the implementation language. In B2B environments, we cannot make any assumptions about the platform, operating systems, programming languages, and middleware to be used by the communicating parties. Reflection-based data binding tools serialize Java objects to an XML format that is a widely accepted data format for B2B data exchange. Therefore, it is easier for other companies to receive data expressed as an XML document and process it in any programming language.

We have seen that data binding tools have interoperability advantages over Java object serialization. They also have some drawbacks.

Handling Shared Structures

One such drawback is handling shared data structures. When marshaling a graph structure, Java object serialization keeps its topology in the serialized form, but the current data binding tools, at least the ones examined in this chapter, are not capable of preserving the shared structure information.[12] The XML 1.0 Specification defines attributes of type ID and IDREF that could be used for representing data sharing information, so it appears to be one of the ways to handle shared structures in data binding, but it is not widely supported today.[13]

[12] We discussed graph structures in more detail in Chapter 8, Section 8.6.

[13] The SOAP Encoding specification we discussed in Chapter 12 specifies a way of referring to shared data by using the id attributes of the type ID and the href attributes of the type anyURI.

Size and Speed

Another possible limitation of data binding is the size of serialized data and the execution speed of the marshaling and unmarshaling operations. As we have seen, Java object serialization has a binary format, which is generally more efficient than the text representation of XML. One of the most time-consuming parts of XML parsing and generation is the tight loop of scanning characters, so the performance is largely affected by the size of XML documents. Therefore, it is a common belief that XML data binding is less efficient than Java object serialization. There is, however, no decisive evidence that supports this belief. Different benchmarks have shown different results. In many cases, existing implementations of Java object serialization are not as efficient as applications need. Therefore, if the performance of serialization is absolutely critical in your application, you may need to write your own specialized marshaler and unmarshaler anyway.

15.3.3 SOAP Encoding

In Chapter 12, we discussed the use of SOAP for remote procedure calls. SOAP has its encoding rule, which allows you to express a program's internal data structure in XML. In
this sense, the SOAP encoding rule can be considered as marshaling. How is it different from the data binding tools that we have looked at in this chapter?

Encoding from Java Classes

The SOAP encoding rule has a simple type system that tries to capture the common ingredients found in the type systems of various programming languages. The primitive types of the SOAP encoding rule are those of W3C XML Schema (see Chapter 9 for the details of XML Schema primitive types). When the programming language in use has a type system consistent with the SOAP encoding rule, it should be possible to create an XML schema from the application data model. In this book, the programming language in question is Java, and Java's primitive data types have almost straightforward mappings to XML Schema's primitive data types, so we should be able to create XML structures from Java object instances as we did with Castor XML.

Do existing SOAP implementations have this capability? The answer is yes. For example, remember Apache SOAP programming, introduced in Chapter 12?
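The near one-to-one correspondence between Java's primitive types and XML Schema's built-in types mentioned above can be made concrete with a small lookup table. This is an illustrative sketch only: the class XsdTypeSketch and the method xsdTypeOf() are hypothetical names, the xsd: prefix is assumed to be bound to the XML Schema namespace, and the table is not taken from any SOAP implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class XsdTypeSketch {

    // Java types whose value spaces line up with an XML Schema built-in type.
    private static final Map<Class<?>, String> XSD_TYPES = new LinkedHashMap<>();
    static {
        XSD_TYPES.put(boolean.class, "xsd:boolean");
        XSD_TYPES.put(byte.class, "xsd:byte");       // 8-bit signed integer
        XSD_TYPES.put(short.class, "xsd:short");     // 16-bit signed integer
        XSD_TYPES.put(int.class, "xsd:int");         // 32-bit signed integer
        XSD_TYPES.put(long.class, "xsd:long");       // 64-bit signed integer
        XSD_TYPES.put(float.class, "xsd:float");     // 32-bit IEEE 754
        XSD_TYPES.put(double.class, "xsd:double");   // 64-bit IEEE 754
        XSD_TYPES.put(String.class, "xsd:string");
    }

    // Returns the schema type name, or null when this table has no entry.
    static String xsdTypeOf(Class<?> javaType) {
        return XSD_TYPES.get(javaType);
    }

    public static void main(String[] args) {
        for (Map.Entry<Class<?>, String> e : XSD_TYPES.entrySet()) {
            System.out.println(e.getKey().getSimpleName() + " -> " + e.getValue());
        }
        // char is deliberately absent: XML Schema has no single-character
        // built-in type, so a mapping for it needs a convention of its own.
        System.out.println("char -> " + xsdTypeOf(char.class));
    }
}
```

A type such as char, which has no obvious counterpart, is possibly one reason the mapping is described as only almost straightforward.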
In Listing 12.23, we called the mapTypes() method of the SOAPMappingRegistry class to specify how a particular Java class should be mapped to an XML element. Using our purchase order example, the mapping could be specified as in the following fragment of ProcessOrderClient.java.

    [47] SOAPMappingRegistry smr = new SOAPMappingRegistry();
    [48] BeanSerializer beanSer = new BeanSerializer();
    [49]
    [50] // Map the types
    [51] smr.mapTypes(Constants.NS_URI_SOAP_ENC,
    [52]     new QName("urn:purchaseOrder", "purchaseOrder"),
    [53]     PurchaseOrder.class, beanSer, beanSer);

The smr.mapTypes() call in line 51 says that a PurchaseOrder instance should be mapped to an XML element with the namespace URI urn:purchaseOrder and the local name purchaseOrder, and the marshaling and unmarshaling algorithms are provided in the class BeanSerializer.

Unlike the data binding tools we have shown in this chapter, the SOAP encoding rule has no default name conversion rules between XML element and attribute names and Java names. This is because the SOAP encoding rule does not assume any particular programming language. Therefore, we have to supply such rules. In addition, by supplying your own marshaling and unmarshaling algorithms, you can also control the structure of your XML document, for example, the order of child elements of the element. Apache SOAP provides a built-in marshaling and unmarshaling algorithm called org.apache.soap.encoding.soapenc.BeanSerializer.

On the accompanying CD-ROM, you can find a complete set of source files for the Apache SOAP version of our purchase order application. Part of a SOAP message generated by this program is shown in Listing 15.13. It shows the SOAP encoding of our PurchaseOrder object.

Listing 15.13 Part of a generated SOAP message