The extensible markup language
Syntax of XML
XML documents usually start with an XML declaration, which indicates the version of the XML specification to which the document conforms, and the character encoding in which the document is encoded, among other things.
The building blocks of an XML document are elements and attributes. An XML element corresponds to a starting tag and a closing tag, surrounding text, other elements, both text and other elements, or nothing. A starting tag corresponds to a name, possibly followed by one or more attributes, enclosed in angle brackets. Attributes are separated from the starting tag's name and from each other by one or more whitespace characters. A closing tag is a name enclosed in angle brackets, and prefixed with a forward slash. A starting tag and a closing tag corresponding to the same element must have the same name.
Elements with no content (with nothing between the starting and closing tags) can be written using a single tag, the empty-element tag, which differs from the closing tag in that the forward slash appears before the right angle bracket, and not before the tag's name.
Attributes are character strings of the form name=value, with value being enclosed in either single or double quotes. Names of tags and attributes are case-sensitive.
Figure 2.1 shows a sample XML document. In this figure, vehicle is an XML element that consists of three other elements: make, model, and year, each of which has text-only content. The vehicle element has a single attribute, type, having a value of car.
The first element following the XML declaration in an XML document is the document element. All other elements must appear as children of the document element. The document element in Figure 2.1 is vehicles.
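Taken together, these rules can be illustrated with a short Python sketch (a hedged reconstruction: the element and attribute names follow the description of Figure 2.1, while the element values themselves are invented for illustration):

```python
import xml.etree.ElementTree as ET

# A document shaped like the one described for Figure 2.1; the text values
# (Honda, Civic, 2004) are assumptions, not taken from the figure.
doc = """<?xml version="1.0" encoding="UTF-8"?>
<vehicles>
  <vehicle type="car">
    <make>Honda</make>
    <model>Civic</model>
    <year>2004</year>
  </vehicle>
</vehicles>"""

root = ET.fromstring(doc)                # the document element
vehicle = root[0]
print(root.tag)                          # vehicles
print(vehicle.get("type"))               # car
print([child.tag for child in vehicle])  # ['make', 'model', 'year']
```

The document element (vehicles) is the single top-level element; all other elements are its descendants, and the type attribute hangs off the vehicle starting tag exactly as described above.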
XML elements must be properly nested. That is, if XML element b appears inside element a (b is said to be a child element of a, or, equivalently, a is a parent element of b), then b's closing tag must appear before a's closing tag in the XML document.
The text surrounded by <!-- and --> in Figure 2.1 is a comment. It is ignored by an XML processor.³ Any legal XML character⁴ can be used in a comment, except that comments cannot be terminated with --->; in other words, the two dashes before the right angle bracket cannot be preceded by a dash.
Handling whitespace in XML documents
Whitespace in XML documents includes space, tab, and new line characters. In some places within the XML document, whitespace is ignored and not passed by an XML processor to the application. For example, as mentioned above, at least one whitespace character between attributes in an attribute list is required, but an unlimited number of whitespace characters can be inserted between attributes. Also, an unlimited number of whitespace characters can be inserted between the name and the equal sign, as well as between the equal sign and the value, in an attribute declaration.
In other places, the application decides whether whitespace is legal or not and, if legal, whether it should be ignored or not. For example, whitespace between XML elements—between the closing tag of an element and the starting tag of a subsequent element—that are known to only have element content is typically ignored by applications. One of the techniques underlying differential serialization, which is discussed in Section 2.3, is based on explicitly adding whitespace between XML tags.

³ We borrow some of the definitions from the XML v1.0 specification here. Specifically, an XML processor is a software module that is used to read XML documents and provide access to their content and structure. An XML processor works on behalf of another software module, an application, which decides how to interpret the content of the XML document.

⁴ The XML v1.0 specification defines a set of characters from the Unicode v2.0 [28] character set to be legal XML characters. Other characters are considered illegal and may not appear within a valid XML document.
Namespaces in XML
With the widespread use of XML, it is common to have two elements with the same name but with different semantics associated with them, especially in different specifications that are based on XML. The XML namespaces specification [16] was created to solve this problem. The specification extends the XML syntax to allow names of elements and attributes to belong to namespaces, which are identified by URI strings.
For the purpose of declaring namespaces, the XML namespaces specification reserves a set of attributes: all attributes that have names starting with xmlns. The reserved attributes declare namespaces in two ways. The first is by associating a namespace alias with a namespace and using the alias to explicitly declare a name to belong to the namespace associated with it. The second, which only applies to element names, is by changing the default namespace, which is the namespace for element names that are not prefixed with an alias. In both ways, the scope of the declaration is the element in which the reserved attributes appear, as well as its descendant elements. Both the default namespace and namespace aliases declared in parent elements can be overridden by declarations in child elements.
For associating an alias with a namespace, the alias is declared as an attribute prefixed with xmlns: and having a value that is a URI string, which is the name of the namespace. For explicitly declaring an element or an attribute to belong to a certain namespace, the name of the element or attribute is prefixed with the alias, followed by a colon. The XML namespaces specification modified the XML grammar by removing the colon from the set of characters that are legal for a name of an element or an attribute. When the name of an element or an attribute is prefixed with an alias and a colon, it is called a qualified name. If it is not prefixed, then it is called an unqualified name.
For example, the alias ns1, declared in element b in Figure 2.2, is associated with the namespace uri:namespace1. The first child element of element b is explicitly declared to belong to the namespace uri:namespace1 by prefixing it with ns1 and a colon. The second child element does not belong to the same namespace, since it is not prefixed with an alias. In fact, the second child element does not belong to any namespace, since the default namespace is initially undefined.
Figure 2.2: A sample XML document with namespaces. Names belonging to the namespace uri:namespace1, along with the namespace name, are in blue. Names belonging to the namespace uri:namespace2, along with the namespace name, are in red. Names not belonging to any namespace, as well as other content, are in black.
Element b also has two attributes named att. The first belongs to namespace uri:namespace1 because its name is prefixed with ns1 and a colon.
The default namespace is changed by setting the value of the xmlns attribute.
In Figure 2.2, the third child element of element b declares the default namespace to be uri:namespace2. Note how the name of its child element, cElem, implicitly belongs to uri:namespace2. However, attribute names are not affected by the declaration of the default namespace; the name of the attribute attr still does not belong to any namespace. Finally, note how the last child element of the third child element of element b undeclares the default namespace; the special empty value for the xmlns attribute can be used for this purpose.⁵
While the introduction of namespaces undoubtedly led to much wider adoption of XML and easier extensibility and versioning of XML documents, it increased the overhead of XML parsing, since the XML parser must explicitly check for and resolve namespace aliases while processing an element or an attribute. Namespaces can also increase the overhead of DDS, because they can increase the amount of state that DDS manipulates while creating or restoring checkpoints, as well as before switching to fast mode. Chapter 4 discusses in more detail how it is essential for DDS to process and manipulate namespace information (for proper SOAP deserialization) and how DDS performs this.
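As a sketch of how these aliasing rules play out for a parser, the following Python fragment (the element names are invented; only the aliasing rules come from the text) shows qualified and unqualified names being resolved. Python's ElementTree reports resolved names in {namespace}local form:

```python
import xml.etree.ElementTree as ET

# A fragment in the spirit of Figure 2.2; the element names (aElem, cParent,
# cElem) are assumptions, not taken from the figure.
doc = """<b xmlns:ns1="uri:namespace1">
  <ns1:aElem>qualified</ns1:aElem>
  <aElem>unqualified</aElem>
  <cParent xmlns="uri:namespace2">
    <cElem>default namespace</cElem>
  </cParent>
</b>"""

root = ET.fromstring(doc)
children = list(root)
print(children[0].tag)     # {uri:namespace1}aElem : alias resolved to the URI
print(children[1].tag)     # aElem : no alias, no default namespace
print(children[2][0].tag)  # {uri:namespace2}cElem : default namespace applies
```

The resolution work visible here (looking up ns1, tracking the current default namespace down the tree) is exactly the per-element overhead the paragraph above refers to.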
Features of XML
The two most attractive features of XML for data manipulation are its extensibility and interoperability.
All machines, which may dramatically differ in their underlying architecture and data representation formats, have one thing in common: they can process text and convert back and forth between text and their own data representation format. This makes XML documents, especially ones encoded using the widely-adopted ASCII character set, an attractive platform for interoperability between systems.
The hierarchical relationship between elements in XML, as well as the introduction of namespaces into XML, allows XML documents to be easily extended with virtually no effort from applications. For example, a new child element can be inserted to provide more information, or complete element hierarchies from new namespaces can be inserted. The newly inserted elements can be easily ignored by applications that do not understand their meaning, yet provide more information for applications that do.

⁵ In version 1.0 of the XML namespaces specification, namespace aliases cannot be undeclared. However, in version 1.1, which was created in response to version 1.1 of the XML specification, namespace aliases can be undeclared.
XML schemas
For an XML document to be interpretable, an XML schema must be associated with it. An XML schema is a set of rules that define the legal contents of XML documents. For example, a rule in an XML schema can state that an item XML element can only have a price and a description child element, in that order.⁶ An XML document that conforms to XML's syntax rules is said to be well-formed.
A well-formed XML document that conforms to a particular schema is said to be valid with respect to that schema. All XML documents that are valid with respect to a particular schema are said to be instances of that schema.
An XML schema need not explicitly exist for an XML document to be processable. For example, a schema describing a particular SOAP document typically implicitly exists as part of the SOAP deserializer's code. However, a more modular approach is to explicitly define the valid contents of XML documents using an XML schema language.⁷
One of the early XML schema languages is the Document Type Definition (DTD). DTD is described as part of the XML 1.0 specification and is, therefore, recognized by many XML parsers. Other XML schema languages include Document Structure Description (DSD) [53], RELAX NG [26], and Schematron [1]. However, the most widely used XML schema language, and the XML schema language adopted for SOAP, is the W3C XML schema language [81].
The W3C XML schema language is an XML-based schema language; that is, a W3C XML schema is an XML document itself.⁸ The language supports a rich set of primitive (or simple) types and has powerful constructs for defining new types from existing ones. Primitive types can be used to define the legal contents of XML elements with character-only content (i.e., elements that do not contain child elements). For example, a W3C XML schema document can indicate that a particular element in instance documents is of type int (an integer). This element must, in all valid XML documents, contain character data that is within the lexical space of the W3C XML schema's int type. Thus, an XML document containing the character data 3445 for an element of type int is valid with respect to the schema, whereas an XML document containing the character data 334t44 for the same element is not (because of the embedded character t).

⁶ As described in Chapter 4, bSOAP schemas generalize the concept of XML schemas to include information on how to process the information an XML document contains, and not only information about the legal structure of XML documents.

⁷ However, an XML schema language can never be more expressive than an implied schema within an XML application's code.

⁸ In fact, the W3C XML schema specification contains a schema for schemas, illustrating how the W3C XML schema language can define itself (i.e., define valid W3C XML schema XML documents).
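The lexical-space check just described can be sketched as follows (a simplified illustration, not the full specification: the regular expression and the value range follow the general description of the int type, and corner cases such as whitespace handling are glossed over):

```python
import re

# A hedged sketch of a lexical-space and value-space check for the schema
# int type: an optional sign followed by digits, with the value constrained
# to the 32-bit range -2147483648..2147483647.
INT_LEXICAL = re.compile(r"^[+-]?[0-9]+$")

def is_valid_int(chars: str) -> bool:
    if not INT_LEXICAL.match(chars):
        return False                      # not in the lexical space at all
    return -2**31 <= int(chars) <= 2**31 - 1

print(is_valid_int("3445"))    # True
print(is_valid_int("334t44"))  # False: the embedded character t
```

The two calls mirror the example in the text: 3445 is in the lexical space of int, while 334t44 is rejected.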
To give a flavor of the W3C XML schema language, Figure 2.3 shows a sample W3C XML schema document that describes the XML format for an element named vehicles, and for which the XML document in Figure 2.1, on Page 8, is a valid instance. As the XML schema document in the figure indicates, every W3C XML schema document must begin with a schema element, and all W3C XML schema XML elements are defined in the namespace http://www.w3.org/2001/XMLSchema.⁹ A complexType element can be used to define a type that corresponds to XML elements containing attributes, child elements, or a mix of child elements and character data (mixed content).
In the W3C XML schema document of Figure 2.3, a complexType defines a type named vehicleElementType, containing an ordered sequence of three XML elements: make, model, and year. The make, model, and year elements are defined to contain character-only content of types string, string, and int, respectively. In addition, the schema indicates that elements of type vehicleElementType can have an optional attribute of type vehicleType.
The vehicleType is a simple type defined by restricting the set of legal values, or the value space, of the type string. Specifically, vehicleType is defined so that only car, truck, and bus are its legal values. Finally, the element vehicles is defined to be an XML element containing a sequence of one or more child elements, named vehicle, that are of type vehicleElementType.

⁹ W3C XML schemas use namespaces to indicate schema language versions. Therefore, future versions may use a different namespace.
Figure 2.3: A sample W3C XML schema document.
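Since the figure itself is not reproduced here, the following is a hedged sketch of what a schema document like the one in Figure 2.3 might look like, reconstructed from the description above (the exact figure content is an assumption):

```python
import xml.etree.ElementTree as ET

# A reconstruction of a schema like the one described for Figure 2.3;
# details (unqualified type references, attribute placement) are assumptions.
schema_doc = """<schema xmlns="http://www.w3.org/2001/XMLSchema">
  <simpleType name="vehicleType">
    <restriction base="string">
      <enumeration value="car"/>
      <enumeration value="truck"/>
      <enumeration value="bus"/>
    </restriction>
  </simpleType>
  <complexType name="vehicleElementType">
    <sequence>
      <element name="make" type="string"/>
      <element name="model" type="string"/>
      <element name="year" type="int"/>
    </sequence>
    <attribute name="type" type="vehicleType"/>
  </complexType>
  <element name="vehicles">
    <complexType>
      <sequence>
        <element name="vehicle" type="vehicleElementType"
                 maxOccurs="unbounded"/>
      </sequence>
    </complexType>
  </element>
</schema>"""

root = ET.fromstring(schema_doc)
# The schema is itself a well-formed XML document whose root is schema,
# defined in the W3C XML schema namespace.
print(root.tag)  # {http://www.w3.org/2001/XMLSchema}schema
```

Note how the sketch exhibits the two points made in the text: the document begins with a schema element in the http://www.w3.org/2001/XMLSchema namespace, and vehicleType restricts the value space of string to car, truck, and bus.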
As Chapter 4 describes, bSOAP's schemas are modeled after SOAP types (some of which are directly modeled after W3C XML schema types, as Section 2.2 states), and DDS relies on bSOAP schema object references in the process of detecting an opportunity to switch to fast mode. In addition, Chapter 6 describes many XML parsing optimizations that are based on exploiting information in XML schemas.
Parsing XML
While the most common XML format is the text format described in Section 2.1.1, applications typically process XML documents in a more abstract form, XML's data model, which models an XML document as a tree-structured graph. Parsing XML involves transforming the textual XML format to its abstract data model.

Several APIs exist for parsing XML. The most common are the Simple API for XML (SAX) [31], the Document Object Model (DOM) [84], and the Streaming API for XML (StAX) [23]. Applications using the SAX API register callback functions (handlers). When the XML parser recognizes a particular XML construct for which the application has registered a handler, it notifies the application by calling that handler. For this reason, SAX parsing is characterized as a push parsing model. DOM parsers, on the other hand, create a hierarchical data structure corresponding to the XML document before returning this structure to the application. The application can then conveniently traverse this data structure to extract information from the XML document. Finally, with the StAX parsing API, applications drive the parsing process by explicitly calling StAX primitives to extract XML constructs from the document. For this reason, StAX parsing is characterized as a pull parsing model.
Clearly, the three parsing APIs differ in their performance and ease-of-use characteristics. StAX is the fastest parsing paradigm, since it avoids the overhead of calling handlers and does not perform any work besides recognizing XML constructs. It is also the most flexible, since both SAX and DOM can be efficiently implemented using StAX. However, it is typically the least convenient to use, especially when the XML document is complex or the application is only interested in a small part of the XML document. SAX parsing, on the other hand, can be slower than StAX due to function call overhead, but is more convenient to use when the XML document is complex or the application is only interested in small parts of the XML document. Finally, DOM parsing is typically the most convenient to use and is most useful when the application needs to traverse the XML document multiple times. However, it is the slowest parsing paradigm, since it creates a data structure in memory, which involves memory copy and allocation operations. Furthermore, a DOM parser's memory requirements can be large and, for this reason, it may not be usable for parsing relatively large XML documents.
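The push and pull models can be contrasted with a small sketch, here using Python's standard-library xml.sax for the push style and ElementTree's iterparse as a stand-in for a StAX-like pull interface (an illustrative analogy, not the actual StAX API):

```python
import io
import xml.sax
import xml.etree.ElementTree as ET

DOC = b"<vehicles><vehicle>car</vehicle><vehicle>bus</vehicle></vehicles>"

# Push model (SAX): the parser drives, calling back into the application
# whenever it recognizes a construct a handler is registered for.
class TagCollector(xml.sax.ContentHandler):
    def __init__(self):
        self.tags = []
    def startElement(self, name, attrs):
        self.tags.append(name)

handler = TagCollector()
xml.sax.parse(io.BytesIO(DOC), handler)
print(handler.tags)  # ['vehicles', 'vehicle', 'vehicle']

# Pull model (StAX-like): the application drives, asking for the next event.
events = [(ev, el.tag)
          for ev, el in ET.iterparse(io.BytesIO(DOC), events=("start",))]
print(events)  # [('start', 'vehicles'), ('start', 'vehicle'), ('start', 'vehicle')]
```

In the push version the control flow belongs to the parser; in the pull version the application decides when (and whether) to extract the next construct, which is what lets an application skip parts of the document it does not care about.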
As described in Chapter 4, bSOAP’s parser is characterized as a pull parser.
SOAP
Overview of SOAP’s syntax
Figure 2.5 shows a sample SOAP message. Three XML elements correspond to SOAP's three main parts: Envelope, Header, and Body. These elements are defined in the namespace http://schemas.xmlsoap.org/soap/envelope/.
¹⁰ The main reason for this is saving the program stack. Without this, an application's state need not be saved. However, without saving the program stack, the application cannot avoid the overhead of calling primitives for extracting XML constructs. In this case, however, the process of extracting XML constructs can be avoided.
Figure 2.4: Structure of a SOAP message. The Header and Fault elements are optional.
Envelope is the document's root element. It must be present in a SOAP message. The Envelope element can be extended with attributes having qualified names—the attribute's name must belong to some namespace. Envelope may start with the Header element, followed by the Body element; if the Header element is omitted, Envelope must start with the Body element. The Body element may be followed by one or more elements with qualified names. The Header element may also contain one or more elements with qualified names. The Body element may contain one or more elements. However, Body's child elements need not have qualified names. The Body element may contain a single Fault element, which is also defined in the namespace http://schemas.xmlsoap.org/soap/envelope/. The Fault element is usually used to encode SOAP error messages. The Fault element has two child elements, not defined in any namespace, which must always be present: the faultcode and the faultstring elements. Other child elements of the Fault element, which may have to be present in some situations, are the faultactor and the detail elements. The Fault element may contain other elements as long as they are defined in namespaces.

Figure 2.5: A sample SOAP notification message. The SOAP Header element is used to transmit a digital signature so that the recipient of the message can verify the authenticity of the message.
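The structural rules above can be sketched as a simple validity check (a minimal illustration; the actual rules, e.g. for elements following Body or for Fault contents, are richer):

```python
import xml.etree.ElementTree as ET

ENV = "http://schemas.xmlsoap.org/soap/envelope/"

# A hedged sketch of the envelope structure: Envelope is the root, an
# optional Header comes first, and a Body element must follow it.
def check_envelope(doc: str) -> bool:
    root = ET.fromstring(doc)
    if root.tag != f"{{{ENV}}}Envelope":
        return False
    children = list(root)
    if children and children[0].tag == f"{{{ENV}}}Header":
        children = children[1:]          # Header is optional
    return bool(children) and children[0].tag == f"{{{ENV}}}Body"

msg = (f'<e:Envelope xmlns:e="{ENV}">'
       f"<e:Header/><e:Body/></e:Envelope>")
print(check_envelope(msg))  # True
print(check_envelope(f'<e:Envelope xmlns:e="{ENV}"><e:Header/></e:Envelope>'))  # False
```

The second call fails because a Header without a following Body violates the rule that Envelope must contain the Body element.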
Encoding style
All elements in a SOAP message can have an attribute, SOAP-ENV:encodingStyle,¹¹ that indicates the encoding rules used within the element. This attribute can be reset by similar declarations in child elements. The value of this attribute is an ordered list of URI strings that identify the encoding rules used, from the most specific to the least specific. The SOAP specification defines its own encoding rules, which are identified by the namespace http://schemas.xmlsoap.org/soap/encoding/. We discuss the SOAP encoding in Section 2.2.4 below. An empty value for this attribute indicates that no claims are made about the encoding rules. This attribute has no default value.
¹¹ From this point on, we assume that SOAP-ENV is an alias for the namespace http://schemas.xmlsoap.org/soap/envelope/.
Extending a SOAP message: The Header element
The SOAP Header element can be used to extend a message in a modular fashion. Child elements of the Header element can have a SOAP-ENV:mustUnderstand attribute, which, as its name implies, determines whether the recipient of the message must understand the extension. If it has the value 1, then the recipient must understand how to process the relevant element and must return a SOAP error message if it does not know how to process it.
In addition, the Header element can have a SOAP-ENV:actor attribute. This attribute determines the recipient of a header. The recipient cannot forward the header to the next SOAP host if it is not the final destination for the SOAP message. The value of this attribute is a URI string—URI strings identify SOAP hosts.
A special URI, http://schemas.xmlsoap.org/soap/actor/next, indicates that the header is intended for the next recipient of the message. Omitting the SOAP-ENV:actor attribute indicates that the header is intended for the host that is the final destination of the SOAP message.
The SOAP encoding
The SOAP specification defines its own rules for encoding, or serializing, SOAP messages. SOAP's encoding rules are based on constructs commonly found in type systems. A brief description of the SOAP encoding and its features follows. Values in the SOAP encoding are encoded as content of XML elements. Values of primitive, or simple, types are encoded as character-only content (with no child elements). The SOAP encoding adopts all of the W3C XML schema types defined in the "XML Schema Part 2: Datatypes" specification [12], both the value and lexical spaces. Aggregate, or complex, types are encoded as a sequence of elements that correspond to the sub-types they comprise.
The SOAP encoding supports encoding values as multi-reference values. Multi-reference values are values that can be referenced from other parts of a SOAP message. Multi-reference values have an unqualified attribute, id, on their XML element, whose value serves as an identification of the multi-reference value. References to a multi-reference value are encoded as empty elements with an unqualified href attribute whose value is a URI fragment¹² that identifies the multi-reference value. Figure 2.6 demonstrates the encoding of multi-reference values.
Figure 2.6: An XML fragment showing values encoded per the SOAP encoding. The result of a possible encoding of a multi-reference string value, as well as three arrays, is shown in the figure. The first array is an array of strings. The other two are arrays of a polymorphic data type. The second array is a sparse array.
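Resolving multi-reference values can be sketched as a two-pass lookup (an illustrative fragment; the element names are invented, and a real URI-fragment resolver would be more careful):

```python
import xml.etree.ElementTree as ET

# A hedged sketch of resolving multi-reference values: elements carrying an
# id attribute are indexed, and empty elements with href="#id" resolve to
# the referenced content.
doc = """<Body>
  <s id="str0">This is a multi-reference string</s>
  <first><item href="#str0"/></first>
  <second><item href="#str0"/></second>
</Body>"""

root = ET.fromstring(doc)
by_id = {el.get("id"): el for el in root.iter() if el.get("id")}

def resolve(el):
    href = el.get("href")
    if href and href.startswith("#"):    # a URI fragment: "#" plus the id
        return by_id[href[1:]].text
    return el.text

refs = [resolve(item) for item in root.iter("item")]
print(refs)  # both references resolve to the same string value
```

Both href references point at the single serialized copy of the string, which is the point of the multi-reference encoding: the value is encoded once and referenced many times.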
Arrays are encoded as XML elements with a SOAP-ENC:arrayType¹³ attribute that specifies the type of the array's elements as well as its dimensions. Array elements appear as XML elements inside these elements. The name of the XML element corresponding to an array element is insignificant.
The SOAP encoding allows arrays to be partially encoded. This can be accomplished by adding a SOAP-ENC:offset attribute on the XML element corresponding to an array. The SOAP-ENC:offset attribute's value specifies the offset, from the start of the array, of the first encoded element. In addition, the SOAP encoding allows sparse arrays to be encoded efficiently. This can be done by only encoding array elements that have non-default values. A SOAP-ENC:position attribute may be present on an XML element corresponding to an array element. This attribute specifies the position of the element in the array.
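A hedged sketch of decoding such a sparse, partially encoded array follows (the interaction of offset and position shown here is one plausible reading of the rules above; the element names and values are invented):

```python
import xml.etree.ElementTree as ET

ENC = "http://schemas.xmlsoap.org/soap/encoding/"

# A partially encoded sparse array: SOAP-ENC:offset says where decoding
# starts; a per-element SOAP-ENC:position places an element explicitly,
# after which subsequent elements continue from that point.
doc = (f'<nums xmlns:SOAP-ENC="{ENC}" SOAP-ENC:arrayType="xsd:int[10]" '
       f'SOAP-ENC:offset="[2]">'
       f'<item SOAP-ENC:position="[5]">7</item>'
       f'<item>3</item>'
       f'</nums>')

root = ET.fromstring(doc)
size = 10                                 # from arrayType="xsd:int[10]"
arr = [0] * size                          # 0 as the assumed default value
cursor = int(root.get(f"{{{ENC}}}offset").strip("[]"))
for item in root:
    pos = item.get(f"{{{ENC}}}position")
    if pos is not None:
        cursor = int(pos.strip("[]"))
    arr[cursor] = int(item.text)
    cursor += 1
print(arr)  # [0, 0, 0, 0, 0, 7, 3, 0, 0, 0]
```

Only two of the ten elements are transmitted; every element holding the default value is simply omitted, which is what makes the sparse encoding efficient.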
¹² A fragment in a URI string is a "#", possibly followed by a string.
¹³ From this point on, we assume that SOAP-ENC is an alias for the namespace http://schemas.xmlsoap.org/soap/encoding/.
XML elements corresponding to values of sub-types in a complex type can be omitted. This indicates that the value is unknown or that a default value should be used.
Finally, a value of a polymorphic data type—a data type that can hold values of multiple types—can be encoded by including an xsi:type¹⁴ attribute on its corresponding XML element. The value of the xsi:type attribute specifies the actual type of the value. Although not required, the xsi:type attribute can also be present for values of non-polymorphic data types.
The SOAP HTTP binding
Although SOAP messages are self-contained, they do not define the means by which they can be transmitted from their source to their destination. The SOAP specification describes one way of transmitting SOAP messages: the HTTP protocol.¹⁵
SOAP messages can be transmitted over HTTP using HTTP's POST request. The POST request is used in HTTP to request that an HTTP server accept the HTTP payload and possibly perform an action that depends on the content of the payload, as well as on the URI of the resource identified in the POST request. In SOAP, the action can be, for example, calling a procedure hosted on the HTTP server.
For identifying the action associated with an HTTP POST request carrying a SOAP message, the SOAP specification adds an HTTP header. The header is called SOAPAction and must be present in HTTP-transmitted SOAP messages.¹⁶ Its value is a URI string identifying the action associated with the request.
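The shape of such a request can be sketched as follows (the host and action URI follow Figure 2.7; the request line and envelope body are illustrative):

```python
# A hedged sketch of an HTTP POST request carrying a SOAP message, with the
# SOAPAction header identifying the action associated with the request.
body = ('<?xml version="1.0" encoding="UTF-8"?>'
        '<SOAP-ENV:Envelope xmlns:SOAP-ENV='
        '"http://schemas.xmlsoap.org/soap/envelope/">'
        "<SOAP-ENV:Body/></SOAP-ENV:Envelope>")

request = (
    "POST /samples HTTP/1.1\r\n"
    "Host: grid.cs.binghamton.edu\r\n"
    'Content-Type: text/xml; charset="UTF-8"\r\n'
    f"Content-Length: {len(body)}\r\n"
    'SOAPAction: "http://grid.cs.binghamton.edu/samples#add"\r\n'
    "\r\n" + body
)
print("SOAPAction" in request)  # True
```

The SOAP message itself rides in the POST payload, while the SOAPAction header travels alongside it as ordinary HTTP metadata.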
The response to an HTTP POST request carrying a SOAP message can itself contain a SOAP message. This is useful for modeling client-server communication. HTTP status codes are used to indicate the success or failure of the action associated with the POST request.

¹⁴ From this point on, we assume that xsi is an alias for the XML Schema Instance namespace, which is identified by the URI http://www.w3.org/2001/XMLSchema-instance.

¹⁵ Currently, the HTTP protocol is the most commonly used transport protocol for SOAP. The wide adoption of HTTP has significantly contributed to the quick deployment of SOAP as a message exchange protocol.

¹⁶ Note that there is no reason not to include information about the action associated with a SOAP message as an extension inside the SOAP message itself. This is especially useful if the action is not dependent on the transport protocol (e.g., a remote procedure call) or can be deduced from the content of the SOAP message. However, the SOAPAction header might be useful in some situations (e.g., implementing some HTTP proxy policy). It is worth noting that the SOAP v1.2 specification does not require the SOAPAction header to be present.
SOAP as a remote procedure protocol
The SOAP specification discusses one way of using SOAP for calling remote procedures, or methods. A remote method is identified by a URI, which is not included in the SOAP message but is carried using mechanisms provided by the transport protocol.¹⁷
The rules for encoding RPC requests and responses as SOAP messages follow.
In a method call request message, the SOAP Body element contains one child element, named after the method. Each input parameter is modeled as a child element of the method's element. The child elements must be ordered according to the order of the parameters in the method signature. Input parameters can be encoded according to the SOAP encoding rules or any other encoding rules identified using a SOAP-ENV:encodingStyle attribute. Input parameters may be omitted. In this case, it is up to the SOAP server to accept such messages or reject them by returning a SOAP error message.
In a method call response message, the return value and the output parameters, if any, are encoded inside a single child element of the Body element. The name of this child element is insignificant. The element for the return value appears first, followed by elements for the output parameters, ordered according to their order in the method signature. The name of the return value's element is also insignificant.
Like input parameters, output parameters may be encoded according to the SOAP encoding rules or any other encoding rules. The response message may not contain both result elements and a Fault element.
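The request-encoding rule can be sketched for a hypothetical add(a, b) method (an illustration of the rule above, not bSOAP code; namespace qualification and encodingStyle attributes are omitted for brevity):

```python
import xml.etree.ElementTree as ET

# A hedged sketch of the RPC request rule: the Body holds one element named
# after the method, with one ordered child element per input parameter.
def build_request(method, params):
    root = ET.Element(method)
    for name, value in params:       # order must follow the method signature
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")

req = build_request("add", [("a", 5), ("b", 3)])
print(req)  # <add><a>5</a><b>3</b></add>
```

The fragment produced here corresponds to the Body content of a message like the one in Figure 2.7: one element named after the method, with a and b as its ordered parameter children.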
¹⁷ Note that, as is the case with the SOAPAction HTTP header, the URI can be carried inside the SOAP message. However, carrying the URI using mechanisms provided by the transport protocol allows for a simpler and more modular implementation of other techniques (e.g., authentication).
Differential serialization
The potential for improving performance
Earlier work on SOAP performance [21] shows that the most performance-critical task in the serialization of SOAP messages is the conversion of binary-encoded values to text, with the conversion of floating point values being the most expensive. Therefore, a differential serialization scheme can significantly improve serialization performance by bypassing the conversion routines when the binary values do not change. A conversion routine employing differential serialization techniques can potentially execute faster even if binary values do change. SOAP messages for a remote method bear a great deal of resemblance to one another. For example, only about 6% of the HTTP-carried SOAP message for the add method shown in Figure 2.7 may change between subsequent serializations. In other words, the work performed in serializing 94% of each subsequent message would be redundant if the message were always serialized from scratch. This reasoning glosses over the fact that some serialization work, like the conversion of binary values, is more costly than other work, but it still shows that the potential for performance improvement exists.
Host: grid.cs.binghamton.edu
Content-Type: text/xml; charset="UTF-8"
SOAPAction: "http://grid.cs.binghamton.edu/samples#add"
Figure 2.7: A SOAP/HTTP message for a method, add, that takes two integer parameters, a and b, and adds their values. Contents that may change in subsequent serializations of the message are highlighted in blue.
The Data Update Tracking (DUT) table
The DUT table is concerned with detecting changes to the data structures associated with a remote SOAP method; it also records information about the serialized message template to allow for efficient updating of the message the next time the data structures are serialized.
Each remote SOAP method has its own DUT table. As mentioned above, this makes differential serialization more effective, due to the high degree of resemblance between SOAP messages for the same remote SOAP method.
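The idea behind the DUT table can be conveyed with a toy sketch (not bSOAP's actual implementation: a real entry records more state, and the same-length restriction below is lifted by the shifting and stuffing techniques discussed next):

```python
# A toy sketch of tracking serialized values so that unchanged ones need not
# be reconverted: each entry remembers the last binary value and where its
# text representation lives in the message template.
class DUTEntry:
    def __init__(self, value, start, length):
        self.value, self.start, self.length = value, start, length

def update(template, entry, new_value):
    if new_value == entry.value:
        return False                      # fast path: skip conversion entirely
    text = str(new_value).encode()
    assert len(text) == entry.length      # same-length case only, for brevity
    template[entry.start:entry.start + entry.length] = text
    entry.value = new_value
    return True

template = bytearray(b"<a>5</a>")
entry = DUTEntry(5, 3, 1)                 # value 5 serialized at offset 3
print(update(template, entry, 5))  # False: value unchanged, nothing to do
print(update(template, entry, 7))  # True: one field rewritten in place
print(template.decode())           # <a>7</a>
```

When the binary value is unchanged, the expensive binary-to-text conversion is bypassed entirely; when it does change, only the affected field of the template is rewritten.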
Shifting
When a newly serialized value cannot fit in the space allocated for it in the message template, more space has to be allocated for it. The most straightforward way to allocate more space is to shift to the right all the contents of the message to the right of the field.¹⁹ Shifting performance depends on the amount of data that has to be moved: the more data moved, the worse the performance of shifting. Shifting can also be used to shrink the size of a SOAP message, by squeezing out unnecessary whitespace. This might be needed for pipelining, or XML canonicalization [15], for example.

¹⁹ We use the word field to refer to the place in the SOAP message where a serialized value is stored. Sometimes this word may have a different meaning; it will be clear from the context which meaning we intend.
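Shifting can be sketched as follows (a toy illustration on a contiguous buffer; bSOAP's templates are chunked, as described next):

```python
# A toy sketch of shifting: when a new value needs more room, everything to
# the right of the field is moved right to make space; when it needs less,
# the surplus is squeezed out.
def shift_in(buf, start, old_len, new_text):
    grow = len(new_text) - old_len
    out = bytearray(buf)
    if grow > 0:
        out[start + old_len:start + old_len] = b" " * grow  # open a gap
    elif grow < 0:
        del out[start + len(new_text):start + old_len]      # close the gap
    out[start:start + len(new_text)] = new_text
    return out

msg = bytearray(b"<a>5</a>")
print(shift_in(msg, 3, 1, b"1234").decode())  # <a>1234</a>
```

The cost is proportional to the amount of data moved, which is why the chunking and padding techniques below aim to keep that amount small.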
Chunking
Message templates are stored in variable-size, possibly non-contiguous memory regions. This reduces the cost of shifting because it generally results in less data being moved.
There are many trade-offs involved in selecting an appropriate chunk size. Smaller chunks generally increase the performance of shifting, but require more overhead to keep track of and more overhead to put on the wire. Chunk size can also affect network bandwidth utilization, CPU cache effectiveness, and the end-to-end performance of a SOAP call when pipelining is used. For these reasons, chunks can be split or merged to keep their size appropriate.
To further reduce the cost of shifting, padding can be added at the beginning and at the end of the chunks. Padding at the beginning of a chunk allows a shifting algorithm to shift to the left if it determines that it is more efficient to do so. Padding also reduces the number of memory reallocations. Padding has no effect on the message length because it is not sent with the message.
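A chunk with padding can absorb a field's growth locally, avoiding a template-wide shift. The sketch below is a hypothetical model (the class and method names are inventions for illustration): a field grows into the chunk's trailing padding if there is room, and the caller falls back to splitting or global shifting otherwise.

```python
class Chunk:
    """A chunk of the message template with padding on both sides.

    Padding bounds how far data must move when a field resizes; it is not
    sent with the message, so it does not affect the message length.
    """

    def __init__(self, content: bytes, pad: int):
        self.left_pad = pad        # allows shifting left when cheaper
        self.data = bytearray(content)
        self.right_pad = pad       # absorbs growth without reallocation

    def grow_field(self, offset: int, old_len: int, new_value: bytes) -> bool:
        extra = len(new_value) - old_len
        if extra <= self.right_pad:
            # Only bytes inside this chunk move; shrinking (extra < 0)
            # simply returns space to the padding.
            self.data[offset:offset + old_len] = new_value
            self.right_pad -= extra
            return True
        return False  # caller must split the chunk or shift globally
```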
Stuffing
Stuffing is the addition of redundant characters to the message to accommodate future increases in the length of the serialized values. Stuffing can be performed explicitly, by allocating more space than needed when the message is first created, or can exist as a result of overwriting a serialized value with a shorter one.
Stuffing can be added in two places. Structural stuffing is whitespace placed between XML tags; value stuffing is special characters placed within the serialized values.
Value stuffing is performed by adding special characters to the serialized value such that the resulting literal still belongs to the lexical space of the corresponding data type and still maps to the same value in the value space as the previous literal [12]. Value stuffing may not be possible for some data types (e.g., strings) because, for these data types, not all values in the value space map to more than one literal in the lexical space.
Unlike structural stuffing, value stuffing does not require the closing tag to be moved when a new serialized value has a different length than the old one.
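As a concrete illustration of value stuffing for one data type: xsd:int's lexical space admits leading zeros, so "00042" and "42" are different literals mapping to the same value, and a field can be padded to a fixed width without moving its closing tag. The helper below is a sketch, not part of any toolkit's API.

```python
def value_stuff_int(value, field_width):
    """Pad an xsd:int literal with leading zeros to exactly fill a field.

    The stuffed literal remains in xsd:int's lexical space and maps to the
    same value, so neither the surrounding tags nor other fields move.
    Returns None when the value cannot fit, in which case the field must
    be grown by shifting or stealing instead.
    """
    literal = str(value)
    if value < 0:
        body = literal[1:]                    # keep the sign first
        if len(body) + 1 > field_width:
            return None
        return "-" + body.rjust(field_width - 1, "0")
    if len(literal) > field_width:
        return None
    return literal.rjust(field_width, "0")
```

Note that no such trick exists for xsd:string: every extra character changes the value, which is why the text above singles strings out as a type for which value stuffing may be impossible.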
In addition, value stuffing, when possible, allows for updating SOAP messages with canonicalized XML content [15]. Canonical XML is required for XML signature [11], which is, in turn, required for SOAP messages that carry digital signatures [18].
Both stuffing methods increase the message size and therefore result in reduced network bandwidth utilization and increased overhead at the server side, which has to process more data. However, we believe that the increase in performance at the client side far exceeds the decrease in performance at the server side, which is likely to be negligible. This has been validated experimentally.
Stuffing can completely avoid the regeneration of the XML structure if each field is stuffed with enough space to store any serialized value of the corresponding data type. Of course, this may not be possible for some data types without prior knowledge of the length of the values to be serialized. Typically, however, the more stuffing there is, the faster differential serialization is.
Stealing
Stealing is the reallocation of more space for a field by reclaiming unused space in other fields. Stealing performance depends on several factors, which are described in the following subsections. These factors are not independent; they can affect each other, and they all contribute to the performance of a stealing algorithm.
Values in a chunk can be serialized in at least two ways. In left-to-right, or in-order, serialization, values are serialized starting from the first field in the chunk and ending at the last field. In right-to-left, or reverse-order, serialization, values are serialized in the opposite order, starting from the last field in the chunk and ending at the first field.
Reverse-order serialization is more likely to increase the number of CPU cache hits and thus improve serialization performance. This is because data at the beginning of the chunk is serialized last, but the bytes of the message are sent in order. Therefore, it is more likely that the initial part of the chunk is found in the cache when it is about to be sent. This is especially important when the data in the cache is short-lived.
It is worth noting that data can also be serialized out of order. Whether one serialization order is better than another depends on several factors, including the length of the serialized values. Depending on how data is stored and managed in an application, some serialization orders may not be usable in some applications.
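The two orders can be sketched as follows. In this hypothetical helper (names and the flat buffer model are assumptions), the bytes written are identical either way; only the order of the writes, and hence the cache behavior described above, differs.

```python
def serialize_fields(values, offsets, buf, reverse):
    """Copy serialized values into their fields within a chunk buffer.

    In-order writes the first field first; reverse-order writes the last
    field first, so the front of the chunk is touched last and is more
    likely to still be cached when the chunk is put on the wire.
    """
    idx = range(len(values) - 1, -1, -1) if reverse else range(len(values))
    for i in idx:
        buf[offsets[i]:offsets[i] + len(values[i])] = values[i]
```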
When a field needs more space, a stealing algorithm can search for space to steal either on the left or the right of the field, or it can alternate between left and right directions—back-and-forth stealing.
Stealing from the left aims at increasing the performance of the serialization of the current message: it is motivated by the fact that, when values are serialized in order, the stolen space will not be needed again during the serialization of the current message. However, this may have the adverse effect of increasing the overhead of serializing future messages, because future values may need the stolen space.
Stealing from the right, on the other hand, takes into account the fact that stolen space may be needed in the future. Stealing from the right is also more likely to fail to reclaim the needed space near the end of the chunk, in which case shifting performs well. When used with reverse-order serialization, stealing from the right is more likely to perform better because the message quickly and efficiently adapts itself to the application's values and converges to the case where no shifting or stealing is needed.
Neither left stealing nor right stealing exploits the fact that stealing from closer fields is more efficient because less data has to be moved. Back-and-forth stealing, however, alternates between the left and right directions in an attempt to steal from closer fields. At the same time, it inherits, to some extent, the properties of both left and right stealing.
The switching criteria dictates when a back-and-forth stealing algorithm switches to stealing from the other direction, and the start-direction criteria determines from which direction a back-and-forth stealing algorithm should start stealing. In fact, left and right stealing can be thought of as special cases of back-and-forth stealing where the switching criteria never switches direction and the start-direction criteria is left or right, respectively.
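A back-and-forth search order can be sketched as follows. This is a minimal model (the generator name is an invention, and the switching criteria here is the simplest possible: alternate on every candidate), showing how the closest fields on either side are tried before farther ones.

```python
def back_and_forth_order(n_fields, needy):
    """Yield candidate donor fields for field `needy`, alternating sides.

    Closer fields are tried first because stealing from them moves less
    data.  Starting direction here is left; switching happens after every
    candidate -- both are tunable criteria in a real algorithm.
    """
    left, right = needy - 1, needy + 1
    while left >= 0 or right < n_fields:
        if left >= 0:
            yield left
            left -= 1
        if right < n_fields:
            yield right
            right += 1
```

With the switching criteria disabled, the same generator degenerates into pure left or pure right stealing, matching the special-case observation above.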
In some cases, it may not be beneficial to steal from the first field the searching algorithm finds. The selection criteria determines which fields to steal from. Clearly, the selection criteria is constrained by the searching algorithm. For example, the selection criteria cannot select fields to the left of the field that is to be allocated more space if the searching algorithm only searches for space on the right of that field.
When the stealing algorithm is about to steal space from a field, several options are possible. A greedy stealing algorithm steals all the extra space allocated to the field, even if not all of the space is needed; it blindly assumes that the field it is stealing space for will need the space more than any of the other fields.
A conservative stealing algorithm only steals as much space as needed, but no more. It may steal all extraneous space from some fields, but it will never steal more space than needed.
Finally, an optimistic stealing algorithm may not steal all the space from a field, even if it is in need of it, in the hope of finding space to steal from other fields. The goal of an optimistic stealing algorithm, for example, might be to increase serialization performance in the long term.
The halting criteria determines when the stealing algorithm should declare its failure to reclaim the needed space. For back-and-forth stealing, each direction can have its own halting criteria. A good halting criteria would never allow a stealing algorithm's performance to be worse than a shifting algorithm's performance.
Differential serialization optimizations
In this section, we describe two optimizations that are enabled by the techniques underlying differential serialization. The next subsection discusses pipelining, which optimizes performance by allowing computation and communication to overlap; the subsection that follows discusses chunk overlaying, a technique for reducing differential serialization's memory requirements. Both techniques are discussed in more detail in [4].
Pipelining is the sending of each chunk as soon as all the values it contains are serialized. Pipelining aims at overlapping computation at the client and the server, which results in a dramatic increase in end-to-end SOAP performance. The benefits of pipelining have been well studied in the literature [55].
Our chunking and pipelining techniques are much like those of HTTP 1.1. However, a distinctive feature of our pipelining algorithms is that they allow for pipelining even on top of protocols that do not support it and that still require the message length to be sent at the very beginning of the message, like HTTP 1.0 [44].
To enable pipelining for a protocol like HTTP 1.0, our pipelining algorithms predict the message length and use stealing, shifting, and stuffing techniques to ensure that no misprediction occurs, or to adjust the message should a misprediction occur.
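One way a misprediction can be absorbed is sketched below. This is an assumption-laden illustration, not bSOAP's algorithm: if the finished message came up shorter than the Content-Length already on the wire, structural stuffing (whitespace before the final closing tag) pads it out; an over-long message cannot be repaired this way and the caller must fall back.

```python
def fit_to_predicted_length(message: bytes, predicted: int):
    """Pad `message` with structural stuffing to match a pre-sent length.

    Whitespace between tags is legal XML, so inserting spaces before the
    last closing tag changes the byte count without changing the content.
    Returns None when the message exceeds the prediction (hypothetical
    fallback: repredict and resend, or shrink via stealing/shifting).
    """
    short = predicted - len(message)
    if short < 0:
        return None
    if short == 0:
        return message
    end = message.rfind(b"</")           # position of the final closing tag
    return message[:end] + b" " * short + message[end:]
```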
The DUT table and message template typically require orders of magnitude more memory than the application's data structures being serialized. To reduce memory requirements, we allow the serialized forms of data structures that share the same XML structure to be overlaid on the same chunk or group of chunks.
Array elements have the same XML structure and thus are candidates for overlaying. Chunk overlaying typically reduces the amount of memory required for storing both the message template and the DUT table. However, the reduction in memory requirements generally comes at the cost of increased serialization overhead, because the DUT table has to assume that chunks with overlaid data always need to be updated.
An overlaid chunk has to be sent immediately after all the values in it are serialized, because the chunk may be reused and the serialized values lost. Therefore, pipelining is used with chunk overlaying.
Summary and relation to DDS
Differential serialization is an optimization technique that helps alleviate the SOAP serialization bottleneck. Rather than reserializing each message from scratch, DS saves a copy in the sender stub, tracks the changes that need to be made for the next message of the same type, and reuses this saved copy as a template for the next send. DS employs techniques to increase its effectiveness and applicability, including on-the-fly message expansion, stuffing, message chunking, and chunk overlaying. For applications that resend the same messages repeatedly, the performance study in [9] demonstrates an improvement in serialization performance by a factor of four to ten for arrays of different types of data.
In addition, this performance study shows that resending messages with similar structure but containing some different values can also achieve significant speedup.
Like DDS, DS processes changes between consecutive messages for the same service. However, DS employs completely different techniques and works on the sender side, rather than on the receiver side. DS was the topic of my Master's thesis [3] and motivated the work described in this dissertation. In particular, DS is a client-side optimization technique; it requires no help from the server side. It would be even better if the server side could implement the optimization, because it could have a broader effect.
Summary
This chapter describes XML, a hierarchically structured language for the description of data. We briefly describe the syntax and features that make it an attractive base for other technologies, including SOAP.
We also describe SOAP, an XML-based protocol for exchanging messages. We describe the syntax of SOAP and the SOAP encoding, which defines rules that can be used to express the constructs of a type system in XML.
Finally, we describe differential serialization, a sender-side optimization technique, along with techniques to increase its effectiveness and applicability, including on-the-fly message expansion, message chunking, stuffing, and stealing.
Differential deserialization (DDS) [6, 5, 7] is a receiver-side SOAP optimization technique that exploits similarities between incoming SOAP messages to reduce deserialization costs. The idea is to avoid XML parsing and construction of application objects for the portions that are unchanged between incoming messages.
The effectiveness of differential deserialization depends on the degree of similarity between previously-deserialized messages and incoming ones, as well as on the overhead associated with the techniques underlying differential deserialization. The worst case for differential deserialization is for the previous message to be completely different from the new one, in which case it is equivalent to first-time deserialization of the message. The best case is for the new message to completely match the previous one, in which case no deserialization work has to be performed.
As with differential serialization, designing a particular differential deserialization scheme, which is defined by the techniques that underlie it, essentially requires solutions to two problems: how to detect which parts of the previously-deserialized message changed, and how to deserialize, and construct application objects for, only the changed parts.
This chapter discusses our DDS approach, which is based on checksumming and checkpointing of the deserializer's state,1 in some detail; it gives a high-level

1 The deserializer state is the state of the XML parser (e.g., namespace mappings) and that of the deserializer (e.g., a description of the current application object being deserialized). The application objects themselves are not considered part of the state.
Motivation
[Figure 3.1 contents, XML markup lost in extraction: two gSOAP-generated HTTP POST requests to api.google.com (User-Agent: gSOAP/2.7; Content-Type: text/xml; charset=utf-8; SOAPAction: "urn:GoogleSearchAction"; Connection: close), whose SOAP envelopes invoke doGoogleSearch with the query "Binghamton Grid Computing" in (a) and "Differential Deserialization" in (b).]
Figure 3.1: Two similar gSOAP-generated SOAP messages to the Google doGoogleSearch Web service.

overview of our DDS approach and its underlying techniques. Chapter 4 provides more details about our DDS approach and its implementation.
The rest of this chapter is organized as follows. Section 3.1 motivates the potential for DDS to improve performance. Section 3.2 gives an overview of our DDS approach. Section 3.3 discusses the techniques that underlie our DDS approach in more detail. Section 3.4 discusses how errors can be handled by a DDS-based deserializer. Section 3.5 discusses checkpointing mechanisms and their implications for the effectiveness of DDS. Finally, Section 3.6 discusses another DDS approach and compares it to ours.
Figure 3.1 depicts two sample SOAP messages. Each message arrives at a Google
Web services container to invoke a simple search. The message in Figure 3.1(a) searches for "Binghamton Grid Computing," whereas the one in Figure 3.1(b) searches for "Differential Deserialization." Aside from the search string (embedded seven lines from the bottom of the message), the only other difference in SOAP contents is the value of the "Content-Length" field in the header (664 in Figure 3.1(a), 667 in Figure 3.1(b)). If these two messages arrived in succession, the SOAP receiver would ordinarily have to parse and deserialize both of them completely and independently of one another. Ideally, the receiver would be able to save time by recognizing that the messages are largely similar, and that only the search string needs to be deserialized and reconverted into the receiver's memory.
Clearly, the Google Web service can expect to receive many such similar messages that differ only in their search string. Google's SOAP messages are likely to resemble one another because they are primarily generated from a client Web interface, and they all invoke the Google doGoogleSearch service.2
One factor that may limit the opportunity for performance enhancement is that different SOAP toolkits will generate equally valid SOAP messages that differ cosmetically (due to spacing, tag names, etc.). Therefore, it might be necessary for a DDS-enabled receiver to differentiate between messages coming in from different toolkits, considering all gSOAP-generated messages separately from Apache SOAP messages, for example. The relatively small number of SOAP toolkits in prominent use, along with the common practice of most SOAP toolkits identifying themselves in message headers, makes this possible.
The performance improvement with DDS can be even more pronounced for scientific data. This is because, as earlier studies have shown [21], conversion routines can account for 90% of the end-to-end time in a SOAP message exchange of scientific data, and DDS can completely avoid the need to re-convert unchanged data in such messages.
2 However, the performance benefit of DDS may not be significant for small messages like these. The Google Web service is merely used in this section to illustrate the basic ideas behind DDS.
Finally, it is worth noting that DDS has the potential to be more effective in a client-server distributed environment. This is because the client-server model implies interaction between many clients and the same server, and the speed of the server is more often the determining factor for performance.

DDS overview
DDS works by periodically checkpointing the state of the SOAP deserializer that reads and deserializes incoming SOAP messages, and by computing checksums of SOAP message portions. The checksums can be compared against those of the corresponding portions of previous messages to establish, with high probability, that the message portions match one another exactly. When message portions match, the deserializer can avoid duplicating the work of parsing and converting the SOAP message contents in that region.
To do so, the deserializer runs in one of two modes, which we call regular mode and fast mode, and can switch back and forth between the two, as appropriate, while deserializing a message. In regular mode, the deserializer reads and processes all SOAP message contents, as a normal SOAP deserializer would, creating checkpoints and corresponding message-portion checksums along the way. In fast mode, the deserializer considers the sequence of checksums (each corresponding to a disjoint portion of the message) and compares them against the sequence of checksums associated with the most recently received message for the same service.
If all of the checksums match (the best, but unrealistic, case where the next message is identical to the previous one), then the normal cost of deserializing is replaced by the cost of computing and comparing checksums, which is generally significantly lower. When a checksum mismatch occurs, signaling a difference between the incoming message and the corresponding portion of the previous message, the deserializer switches from fast mode to regular mode, and reads and converts that message portion's contents as it would otherwise have had to.
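The fast-mode/regular-mode interplay can be sketched in miniature. The model below is a deliberate simplification of our approach (fixed-size portions stand in for checkpoint-delimited ones, CRC32 stands in for whatever checksum a real implementation uses, and "deserializing" a portion is reduced to recording its index):

```python
import zlib

def deserialize_with_dds(message: bytes, portion_size: int, prev_sums):
    """Checksum each portion; 'deserialize' only those that changed.

    Returns (sums, reparsed): the new checksum sequence to save for the
    next message, and the indices of portions that had to be processed in
    regular mode.  Matching portions are skipped entirely (fast mode).
    """
    sums, reparsed = [], []
    for i in range(0, len(message), portion_size):
        s = zlib.crc32(message[i:i + portion_size])
        idx = len(sums)
        sums.append(s)
        if idx >= len(prev_sums) or prev_sums[idx] != s:
            reparsed.append(idx)   # regular mode: parse and convert
        # else: fast mode -- checksum comparison replaces deserialization
    return sums, reparsed
```

On a first message every portion is reparsed; on a near-identical successor only the changed portion is, which is exactly where the performance benefit comes from.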
Enabling differential deserialization
Checkpoints and message portions
Checkpoints provide restoration points to which the fast-mode processor can revert should a message portion's checksum mismatch occur. In addition, checkpoints function as switching points at which the deserializer can switch to fast mode. Therefore, the more frequently checkpoints are created, the less unchanged message content is unnecessarily re-deserialized and the more time is spent in fast mode.
At the same time, creating too many checkpoints increases the overhead of our DDS approach and its memory requirements. Therefore, mechanisms that balance this overhead/effectiveness tradeoff can increase the overall benefit of our DDS approach. In addition, mechanisms that reduce the overhead of checkpointing or its memory requirements increase the effectiveness and applicability of our DDS approach. Section 3.5 below describes three checkpointing mechanisms and their implications for the overall effectiveness of our DDS approach.
Checkpoints define boundaries for message portions. Each checkpoint has a message portion associated with it. The message portion for a particular checkpoint begins just after the point at which the previous checkpoint was created and ends just before the point at which the checkpoint itself is created. Figure 3.2 illustrates this.
Figure 3.2: Checkpoints and message portions.
Fast mode
In fast mode, the deserializer merely checks the message portions for any changes.
It does so by comparing the previously-computed checksums for message portions to the checksums of the corresponding portions in the incoming message.
If the checksum for a message portion matches, then that message portion is not deserialized for the incoming message. This is where the performance benefit of DDS comes from: the deserialization cost is replaced by the computation and comparison of checksums. If the checksum for a message portion mismatches, then the deserializer switches to regular mode to re-deserialize that message portion.
It may later switch back to fast mode for the following message portions.
Switching to fast mode
For the deserializer to safely switch to fast mode, its state must be the same as state previously saved in a checkpoint.3 We consider a switch to fast mode to be safe if the correct outcome of the deserializer is unaffected by the decision to switch to fast mode and continue processing the next portion of the message in fast mode. The correct outcome of the deserializer is defined to be its outcome when the message is processed entirely in regular mode.
To prove that a switch to fast mode is always safe when the state of the deserializer matches the state saved in a particular checkpoint, the deserialization process can be modeled as a state machine. The checkpoints are the states in this machine, and deserialization of a message portion results in a transition from one state to another.
Switching to fast mode when the state matches that stored in a particular checkpoint guarantees that, when the checksum of the next portion matches, indicating that the portion did not change,4 the fast-mode processor can safely move
3 It is possible to switch to fast mode before reaching a point where a checkpoint, A, was previously created, provided that checksums for the portions between such switching points have been computed. Partial-state matching with checkpoint A would have to take place in this case.
4 Because checksumming does not completely guarantee that all changes are detected, this is not technically completely true. We discuss the implications of this and possible solutions in Section 3.4 below.
to the next checkpoint (state), since this is the state that the deserializer would have reached had the portion been processed in regular mode. This, in turn, guarantees that when the deserializer later switches to regular mode (due to a checksum mismatch), it restores the correct deserializer state.
Updating checkpoints
Due to changes in the message, some checkpoints may become stale when they no longer contain correct state information with regard to the new message content. Stale checkpoints must be removed for correct operation of the deserializer. In addition, newly-created checkpoints must be merged into the checkpoint list. Figure 3.3 demonstrates this process through an example. In Figure 3.3(a), checkpoints A, B, and C have been created from a previous deserialization of a message. In Figure 3.3(b), the message portions for checkpoints B and C have changed. This results in checkpoint B becoming stale (it no longer contains correct state information with regard to the new message content), and it gets removed. While the deserializer deserializes the new message content between checkpoints A and C, it creates two other checkpoints, D and E, which are merged into the list of checkpoints associated with that particular invocation, as shown in Figure 3.3(c).
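The remove-and-merge step can be modeled compactly. In this sketch (the representation of a checkpoint as an (offset, label) pair is an assumption for illustration), stale checkpoints are filtered out and the newly created ones are merged in message order:

```python
def update_checkpoints(old_cps, stale_labels, new_cps):
    """Drop stale checkpoints and merge newly created ones.

    Checkpoints are modeled as (message_offset, label) pairs; the result
    stays sorted by offset so each checkpoint's message portion remains
    the span since its predecessor.
    """
    kept = [cp for cp in old_cps if cp[1] not in stale_labels]
    return sorted(kept + new_cps)

# Mirroring Figure 3.3: B is stale, D and E were created between A and C.
cps = update_checkpoints([(0, "A"), (10, "B"), (20, "C")], {"B"},
                         [(5, "D"), (15, "E")])
```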
Handling errors
In addition to regular deserialization errors (syntax or message-format errors), a checksumming-based DDS deserializer can experience another type of error: changes in a message portion that go undetected.
For regular deserialization errors, the deserializer forces the fast-mode processor to switch to regular mode before processing a portion for which it encountered an error the last time it deserialized the message. For example, in Figure 3.2 above, if checkpoint C's message portion gets processed in regular mode and the deserializer reports an error while it is deserializing that portion, then the next time
(c) Newly created checkpoints D and E are merged into the list of checkpoints
Figure 3.3: Updating checkpoints example. In (a), checkpoints A, B, and C were created from a previous deserialization of the message. In (b), the message portions for checkpoints B and C change, which marks checkpoint B as a stale checkpoint that gets removed. Finally, in (c), two newly created checkpoints are merged into the checkpoint list.

a message for the same service is processed, the fast-mode processor is forced to switch to regular mode just after processing checkpoint B's message portion. For the checksumming-related errors, the deserializer cannot explicitly handle them, since it is unaware of their occurrence. Typically, the probability that the corresponding message portions of two consecutive messages produce the same checksum without actually matching will be extremely low. Should it happen, however, the deserializer is likely to report a spurious error as soon as it switches to regular mode, since its internal view of the expected message format is very likely to be inconsistent with the actual message format. However, should a change in values, rather than structure, go undetected, the back-end method would get passed incorrect values (whatever those values were after the previous deserialization).
Storing message portions instead of checksums would get rid of the (very unlikely) checksumming-related errors, at the expense of increased memory requirements.
Checkpointing in DDS
Full checkpointing
The first checkpointing mechanism, full checkpointing, saves the full deserializer state with each created checkpoint. This gives the deserializer the most flexibility, since it allows checkpoints to be created at arbitrary points while deserializing a message. However, because full state information is saved, the overhead of creating checkpoints and the memory requirements for checkpoints are relatively large.
Unlike with the other checkpointing mechanisms, checkpoints with full checkpointing are self-contained, because they do not depend on the state information stored in other checkpoints. This simplifies the state restoration process, since restoring the deserializer's state is simply a matter of replacing the deserializer's state with that stored in the checkpoint. However, this mechanism incurs overhead because more data gets manipulated when creating checkpoints and when restoring and comparing state.
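Full checkpointing's self-contained restore can be sketched as a deep copy of the whole state. The deserializer fields below (namespace mappings and a path into the object being built) are hypothetical stand-ins for the real state described in footnote 1:

```python
import copy

class Deserializer:
    """Toy deserializer state: hypothetical fields for illustration only."""
    def __init__(self):
        self.namespaces = {}      # e.g., prefix -> URI mappings
        self.current_path = []    # e.g., path to the object being built

def full_checkpoint(d: Deserializer):
    # Self-contained: a deep copy of the entire state, independent of any
    # other checkpoint.  Simple, but large when the state is large.
    return copy.deepcopy(d.__dict__)

def restore(d: Deserializer, cp):
    # Restoration is wholesale replacement -- no other checkpoint needed.
    d.__dict__ = copy.deepcopy(cp)
```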
Differential checkpointing
Differential checkpointing is based on the observation that successive parser states will differ only by the changes due to processing the next message portion. When the message portion is small, or when processing that portion of the message does not result in significant changes to the state of the deserializer, the changes in deserializer state can be small, thereby requiring significantly less memory than full checkpointing.

5 As can be seen in Chapter 5, this can be more efficient than using checksums.
Performance-wise, the primary benefit of differential checkpointing is the reduction in the amount of state being manipulated: less data is copied when creating or restoring a checkpoint,6 and less data is compared when switching to fast mode.
Differential checkpointing has two sources of overhead: (1) the overhead of detecting the changes in the deserializer's state that have occurred since previous checkpoints were created, and (2) the overhead of tracking down and manipulating partial state information while restoring or comparing state. In most cases, both overheads are small compared to the benefit of manipulating less data.
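The delta idea can be sketched as follows. This toy version models the state as a flat dict and records only changed keys (a simplification: it ignores deleted keys and nested structures, and the change-detection scan it performs is exactly the first source of overhead mentioned above):

```python
def diff_checkpoint(prev_state: dict, cur_state: dict) -> dict:
    """Record only the entries that changed since the previous checkpoint.

    The scan over cur_state is the change-detection overhead; the payoff
    is that the stored delta is small when the portion changed little.
    """
    return {k: v for k, v in cur_state.items() if prev_state.get(k) != v}

def restore_from_deltas(base: dict, deltas: list) -> dict:
    # Restoring is not self-contained: replay deltas on top of a base state.
    state = dict(base)
    for d in deltas:
        state.update(d)
    return state
```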
Lightweight checkpointing
Lightweight checkpointing exploits the hierarchical structure of SOAP messages by creating checkpoints at strategic points where changes to the deserializer's state are minimal and can be statically predicted. Thus, lightweight checkpointing shares with differential checkpointing the fact that changes in state are saved; they differ in the way they do this.
Lightweight checkpointing is typically more efficient because it does not incur any overhead for detecting state changes, since most state changes are known before deserializing the message. In addition, eliminating the need to detect changes also makes lightweight checkpointing much less dependent on the internals of the deserializer's state.
As with differential checkpointing, the primary benefit of lightweight checkpointing comes from manipulating less data: memory requirements and the overhead of manipulating state information are reduced. However, this benefit is typically
6 It is possible to avoid restoring some or all of the unchanged deserializer state with differential checkpointing as described in our differential checkpointing implementation in Chapter 4.
more pronounced with lightweight checkpointing due to its ability to manipulate less data more efficiently.
While differential checkpointing should, in practice, match lightweight checkpointing's ability to minimize the saved changes—particularly when checkpoints are created at the same point in the message—it typically does not. This is because a differential checkpointing scheme favors speed over minimizing the manipulated data, so that overall performance increases. That is, for differential checkpointing to minimize the manipulated state information, the overhead of detecting changes to state would increase, which reduces the performance benefit of manipulating less state information. 7
The performance benefit of lightweight checkpoints comes at the price of flexibility. Unlike full or differential checkpoints, lightweight checkpoints cannot be created at arbitrary locations in the message. In addition, the relatively small amount of state data stored in lightweight checkpoints must be complemented by a large amount of state for the full deserializer state to be defined at a particular lightweight checkpoint. Thus, lightweight checkpointing may occasionally need to create heavyweight checkpoints (which can be in the form of full or differential checkpoints); a policy for advantageously creating heavyweight checkpoints is essential for lightweight checkpointing to be most effective.
Hierarchical approaches
Full, differential, and lightweight checkpoints are not mutually exclusive: all three kinds of checkpoints can be used while deserializing a single message. In addition, it is possible to use hierarchical approaches for saving the deserializer's state. For example, lightweight checkpointing's flexibility can be increased by using lightweight checkpointing down to some hierarchical level and then switching to a variant of full checkpointing that does not save the state already saved in the lightweight checkpoints at the upper levels of the hierarchy.
7 Differential checkpointing attempts to compress the saved state information. It is therefore similar, in this sense, to compression algorithms, which typically have a tunable speed/compression-ratio trade-off.
Flexibility:
  Full checkpointing: highly flexible.
  Differential checkpointing: highly flexible.
  Lightweight checkpointing: moderately flexible.

When it is best to use:
  Full checkpointing: when few strategic points exist and the checkpointing frequency is relatively large.
  Differential checkpointing: when few strategic points exist and the checkpointing frequency is relatively small.
  Lightweight checkpointing: when plenty of strategic points exist (a schema-based analysis is possible).

Table 3.1: Comparison of three checkpointing mechanisms.
Summary
Table 3.1 compares the three checkpointing mechanisms described above. Lightweight checkpointing is best both in terms of performance and memory requirements. However, it may not be applicable to all messages. Differential checkpointing is generally better than full checkpointing, except when the checkpointing frequency is large, in which case differential checkpointing's overhead may offset its benefit.
Other DDS approaches
Other approaches to differential deserialization may be used. One approach is to have a SOAP sender tag the SOAP message with information that tells the
receiver which portions of the message have changed since it was last sent. 8 A SOAP receiver not aware of differential deserialization will deserialize the message fully and ignore the extra information, since the extra information will not violate the SOAP protocol, which allows such information to be sent (in a SOAP Header element, for example). A SOAP receiver that can take advantage of such information will deserialize only the portions that have changed.
While such an approach would achieve better performance than a DDS approach based on checkpointing, because it avoids the need to detect changes, it has a major disadvantage: it only works with support from the SOAP sender.
In addition, unlike our approach, this approach does not allow DDS to be applied across messages sent from different clients to the same service. In a DDS approach based on checkpointing, it is sufficient for a service to be invoked once for differential deserialization to be enabled (since the receiver detects any changes). With the approach described in this section, each client must invoke the service at least once before differential deserialization can be enabled between that client and the receiver. Furthermore, supporting multiple clients for a particular service requires that the receiver maintain a separate set of application objects for each client.
8 This is particularly feasible for a sender employing differential serialization techniques.
Differential Deserialization in the bSOAP Toolkit

bSOAP is a C++-based SOAP toolkit with an efficient serialization implementation based on differential serialization, and an efficient deserialization implementation based on the differential deserialization techniques described in Chapter 3. This chapter describes bSOAP's deserializer implementation in detail, with emphasis on its DDS implementation. The chapter begins by briefly describing bSOAP's XML parser in Section 4.1. Section 4.2 gives an overview of how deserialization works in bSOAP. Section 4.3 describes the major types of checkpoints bSOAP can create and the state associated with each type. Section 4.4 explains how bSOAP creates checkpoints. Section 4.5 discusses fast mode processing in bSOAP. Section 4.6 details the process bSOAP goes through when switching to fast mode. Section 4.7 discusses how bSOAP restores state when switching to regular mode. Section 4.8 describes how bSOAP identifies and gets rid of stale checkpoints. Section 4.9 describes how bSOAP updates the information associated with checkpoints and incorporates new checkpoints. Section 4.10 describes policies bSOAP uses to increase the effectiveness of checkpointing. Finally,
Section 4.11 describes memory management, followed by Section 4.12, which describes a limitation in bSOAP's DDS implementation and possible approaches for removing this limitation.
4.1 XML parsing

bSOAP contains an XML parser based on the pull parsing model: the deserializer instructs the parser to pull the next XML construct from the SOAP message stream. The parser incorporates many optimizations found in other parsers, including avoiding unnecessary string copies.
One unique feature of bSOAP's XML parser is that it contains parsing routines optimized for the case where the expected content of the message is known a priori, which is typically true for SOAP messages. In this case, the parser requires only a single pass over the document content.
Another unique feature of bSOAP's XML parser is that it supports checkpointing and restoration of its state. Due to the hierarchical nature of XML, this state consists mainly of stacks. One stack keeps track of string references, and another holds the actual strings. 1 In addition, bSOAP saves namespace alias mappings in a hash table, and the hash table's nodes are stored on their own stack.
4.2 A schema-driven deserializer
bSOAP's deserializer uses a schema to drive the deserialization process. A bSOAP schema is a collection of hierarchically related data structures and supporting deserialization routines that define the acceptable format of a SOAP message, and the rules for converting a SOAP message to application-specific objects. Specifically, schemas are represented using instances of four schema classes, each of which has an associated deserialization routine. These classes are SimpleElement, ComplexElement, ArrayElement, and MethodElement. 2 This representation of schemas is primarily based on the SOAP data model; 3 SimpleElement corresponds to SOAP's simple type, ComplexElement corresponds to SOAP's compound type, and ArrayElement corresponds to SOAP arrays.

1 This separation is needed because some strings may stay in the parsing buffer and never get copied onto the strings stack before they are no longer needed.
The information stored in a schema class object generally depends on the type and instance of the application object(s) to which the schema object maps. For example, an ArrayElement object contains a reference to the schema class object corresponding to the type of the array elements. Similarly, a ComplexElement object contains references to schema class objects corresponding to the subtypes constituting the compound type. In addition, a schema object of type ArrayElement, ComplexElement, or SimpleElement may contain the expected name of the corresponding XML element in the SOAP message.
For bSOAP to deserialize a message, the application must construct a bSOAP schema and pass it to bSOAP. It can do this at run-time (e.g., from a WSDL [34] document) or at compile-time. bSOAP calls an application-defined callback routine, passing it the name of the back-end method. The application returns the corresponding bSOAP schema (if possible) to bSOAP, in the form of a MethodElement object. 4 A MethodElement object contains references to all the information bSOAP needs to deserialize the message. This includes the schema class objects corresponding to the method's parameters and any dynamically allocated memory from previous deserializations (i.e., memory containing the application's objects and checkpoint data).
To make this more concrete, consider, for example, a bSOAP schema for deserializing SOAP messages invoking a method, getAverage, that returns the average of an array of doubles. The method takes two parameters: an array of doubles, d,
2 These schema classes provide the most commonly needed support for SOAP-to-C++ language binding. More support is needed; for example, XML Schema's union types can be supported through a UnionElement schema class.
3 bSOAP currently supports SOAP 1.1. Support for other data models can be achieved by defining the appropriate schema classes and associated deserialization routines.
4 While bSOAP currently supports only SOAP's RPC encoding, DDS is not dependent on the RPC encoding in any way. The schema classes and objects specify the expected format of the message; they can be extended to support other SOAP encodings. In fact, DDS can be applied to other applications of XML, not just SOAP.
and an integer, n, specifying the number of elements in the array. Figure 4.1 illustrates a bSOAP schema for this method. The figure shows the hierarchical relationship between the schema objects.

Figure 4.1: Hierarchical relationship between schema objects in a bSOAP schema for a method, getAverage, that takes two parameters: an array of doubles, d, and an integer, n.
To deserialize a SOAP message invoking getAverage, bSOAP works as follows. When the message arrives, bSOAP parses the initial XML content and verifies that it conforms to SOAP's encoding rules. When it determines the method name (which is getAverage in this case), it calls an application-defined routine, which returns getAverage's bSOAP schema. bSOAP then calls MethodElement's deserialization routine, which in turn calls ArrayElement's deserialization routine. ArrayElement's deserialization routine calls XML parsing routines to verify that the relevant part of the XML message is valid per the SOAP data model, and to extract information about the array (including its size). After that, it calls SimpleElement's deserialization routine once for each element in the doubles array, which, in turn, does its own validation checks and handles the text-to-double conversions. When ArrayElement's deserialization routine returns, MethodElement's deserialization routine calls SimpleElement's deserialization routine again, which deserializes the message content corresponding to the second parameter. The conversion routines take care of properly constructing the application objects. This includes properly aligning data in memory, setting up the stack frame 5 for calling the back-end method, and allocating memory.

Figure 4.2: Checkpoint types in bSOAP and their hierarchy.
4.3 Checkpoint types in bSOAP

bSOAP can create several types of checkpoints, depending on its configuration and the point at which the checkpoint is created. Checkpoints differ in the kind of state they store and in the way their state is manipulated (restored or compared against). The following list describes the major types of checkpoints that bSOAP can create. Figure 4.2 depicts the checkpoint type hierarchy in bSOAP.
• Heavyweight checkpoints: As their name implies, heavyweight checkpoints store a relatively large amount of state. bSOAP does not create heavyweight checkpoints directly. Instead, heavyweight checkpoints provide common support for other types of checkpoints. Table 4.1 briefly describes the state stored with heavyweight checkpoints. Heavyweight checkpoints can be created with both differential checkpointing and full checkpointing in effect.
5 While such an approach reduces the portability of bSOAP, it has the performance advantage of avoiding unnecessary copies when setting up the stack frame, as well as providing a fully dynamic invocation framework. A stub compiler can be used to generate method invocation code if portability is a concern.
Program stack: The portion of the program stack used by the deserializer.
Matching stack: Describes the application object being deserialized.
String references stack: Contains references to strings used during deserialization.
Strings stack: Contains strings used during deserialization.
Namespace aliases stack: Contains the namespace alias mappings hash table's nodes.
Memory blocks stack: Contains pointers to active memory blocks.
Base checkpoints stack: Contains pointers to base checkpoints for lightweight checkpoints.

Table 4.1: Description of state saved in heavyweight checkpoints.
• Lightweight checkpoints: Like heavyweight checkpoints, lightweight checkpoints provide common support for other, more specialized, checkpoints. They currently store a reference to a heavyweight base checkpoint. We discuss the creation of lightweight checkpoints and base checkpoints in Section 4.4.2.
• Context-independent checkpoints: These are heavyweight checkpoints that can be created at any point in the message. In addition to the state common to heavyweight checkpoints, context-independent checkpoints store a couple of pointers needed to properly restore the state of the XML parser and state associated with conversion routines (e.g., a context-independent checkpoint can be created while a base64 converter has not yet fully converted a base64 value).
• Array heavyweight checkpoints: These are heavyweight checkpoints that can only be created before deserializing an array element.
• Array lightweight checkpoints: These are lightweight checkpoints that are similar to array heavyweight checkpoints in that they can only be created before deserializing an array element. The only state they store is the index of the array element that is about to be deserialized when the checkpoint is created. 6

Memory blocks: A linked list of any memory blocks created while deserializing the checkpoint's message portion.
Size: The number of bytes in the checkpoint's message portion.
Checksum: Checksum for the checkpoint's message portion.

Table 4.2: A checkpoint's message portion information.
• Complex element heavyweight checkpoints: Similar to array heavyweight checkpoints, but for complex elements (which correspond to C++ classes/structs in the SOAP encoding). They can only be created before deserializing a sub-element of a complex element. The additional state stored is a reference to the schema object corresponding to the sub-element that is about to be deserialized when the checkpoint is created.
• Complex element lightweight checkpoints: Similar to array lightweight checkpoints, but for complex elements.
• First checkpoint: A smaller heavyweight checkpoint that gets created when bSOAP deserializes a message for a particular service for the first time. This is discussed in Section 4.4.
• Last checkpoint: A dummy checkpoint that does not contain state information. It allows bSOAP to handle all message portions uniformly.
A checkpoint also has information about its associated message portion. This information is associated with all checkpoints, regardless of the checkpoint type. Table 4.2 describes this information.
6 This is implicitly stored on the matching stack for array heavyweight checkpoints.
4.4 Creating checkpoints
4.4.1 Creating differential checkpoints
As described in Section 4.3, the most significant portions of bSOAP's state are stored in various stacks, including the string references stack, the strings stack, the matching stack, and the stack for namespace aliases. Therefore, performing differential checkpointing is a matter of storing only the differences between stacks at various times while processing an incoming message. To make this process efficient, bSOAP only stores differences from full copies, and not from other differential copies. In particular, bSOAP (1) tracks the changes that have been made to each stack since the last time it was stored in full, and (2) stores the changes relative to the full stack copy separately in a partial stack copy, when appropriate.
Figure 4.3: Tracking stack changes. The base stack copy is depicted on the left side and the current stack contents are depicted on the right side. Shaded regions depict the shared portions of the stacks, which are guaranteed to be identical in the base copy and the current stack contents.

These two requirements are described separately below.
Each stack in bSOAP maintains a pointer to one base copy that holds a complete, previously saved copy of an instance of that stack. Each stack also maintains a tracking pointer that determines how much the current instance of the stack differs from its base copy. Whenever a new complete copy of a stack is created, the stack's base copy pointer is set to point to that copy, and its tracking pointer is set to point to the very top of the stack. Whenever data located below the tracking pointer is popped off, the tracking pointer is set to point to the new top of the stack. New data pushed on the stack does not change the tracking pointer. Thus, the tracking pointer essentially splits the stack into two portions: a bottom portion that is guaranteed to be the same as in the base copy, and a top portion that may contain data different from the top portion (if any) of the base copy. Figure 4.3 illustrates this process. In the figure, the left side depicts a full base stack copy and the right side depicts the current stack contents. Shaded regions depict identical portions of the stacks in the base copy and the current stack contents.
All stack copies created in a checkpoint are partial copies. A partial stack copy consists of any data located in the top portion of the stack (the non-shaded portion of the current stack contents in Figure 4.3). A partial stack copy also contains a pointer to the base copy, which contains its bottom portion (the shaded region in Figure 4.3), and the value of the tracking pointer. Should the size of the bottom portion of the stack fall below a certain threshold, a new full copy is created before the partial copy. This full copy functions as a base copy for the current (empty) partial copy as well as for future partial copies. The rationale behind creating a full copy is that future partial stack copies are likely to be smaller when based on a more recent full copy than when based on a less recent one.
4.4.2 Creating lightweight checkpoints
As mentioned in Section 4.3, each lightweight checkpoint has a reference to a base checkpoint. A base checkpoint is a heavyweight checkpoint that contains state information a lightweight checkpoint shares with other lightweight checkpoints. That is, the full state of the deserializer at a point where a lightweight checkpoint is created is defined by the combined state of the lightweight checkpoint and its base checkpoint.

bSOAP maintains the information necessary for creating lightweight checkpoints in two stacks: a base checkpoints stack holds references to base checkpoints for which lightweight checkpoints can still be created, and a lightweight checkpoint location information stack (or location information stack, for short) contains the last encountered potential locations of lightweight checkpoints in the parsing buffer, as well as any lightweight checkpoint state information associated with those locations.
Each deserialization routine for which lightweight checkpointing can be applied (currently, the array and complex type deserialization routines) allocates entries on both stacks. A deserialization routine initially allocates a null entry on the location information stack, which indicates that it has not yet encountered a location suitable for creating a lightweight checkpoint. Whenever it encounters a potential location for a lightweight checkpoint, it updates its entry on that stack with the location of the checkpoint in the parsing buffer, and with the state information associated with the potential lightweight checkpoint. As mentioned in Section 4.3, the current state information associated with lightweight checkpoints is an array index for array lightweight checkpoints and a reference to a schema object for complex type lightweight checkpoints.
Whenever a processing interrupt occurs, the location information stack is inspected to determine whether it is possible to create a lightweight checkpoint. If any entry on the stack is non-null, then bSOAP may create a lightweight checkpoint, in which case it nullifies the entry. We discuss the current policy bSOAP uses for deciding when to create lightweight checkpoints, as well as possible improvements to that policy, in Section 4.10.1 below.
Figure 4.4 illustrates a detailed example of how bSOAP creates a lightweight checkpoint. In this example, bSOAP creates an array heavyweight checkpoint, B2, at time t1 (Figure 4.4(b)), deserializes the array elements (beyond the eleventh element, as illustrated in Figure 4.4(c)), and, when a processing interrupt occurs at time t4 (Figure 4.4(d)), creates an array lightweight checkpoint located just before the twelfth element of the array.
4.5 Fast mode

In fast mode, bSOAP traverses the checkpoints in order, starting from the checkpoint following the one at which the deserializer switched to fast mode. For each checkpoint, bSOAP computes the checksum of the corresponding portion in the message and compares it to the one stored in the checkpoint. If they match, the deserializer can avoid deserializing this portion of the message, since that work was already done for the previously arriving message (the contents of the two message portions are identical).
Figure 4.4: Example of creating a lightweight checkpoint. (a) Initially, the base checkpoints stack (on the left side in each sub-figure) contains one reference to a checkpoint, B1, and the corresponding entry in the location information stack (on the right side in each sub-figure) is null. (b) At time t1, an array base checkpoint, B2, gets created; new entries are pushed on both stacks. (c) At time t2, array element 10 is about to be deserialized, and the top entry in the location information stack is updated accordingly. However, because bSOAP keeps track of only one lightweight checkpoint location, the entry gets overwritten at t3, just before deserializing element 12, whose corresponding XML element occurs at position 347 in the parsing buffer. (d) At time t4, a processing interrupt occurs. Because the top entry in the location information stack is non-null, a lightweight checkpoint with base checkpoint B2 is created (just before element 12, the last encountered potential lightweight checkpoint location). The top entry in the location information stack (for which the lightweight checkpoint was created) is nullified.
4.6 Switching to fast mode
The matching stack
A matching stack contains all the information necessary for detecting a structural match (step one). Each deserialization routine pushes its own frame onto the matching stack. The first item a deserialization routine pushes on the stack is a reference to the schema object for which it has been invoked. This makes the matching process more robust, as each schema object has a unique address in memory. The schema object reference is followed by any information that allows the deserialization routine to detect (1) whether a particular heavyweight checkpoint, given the matching stack stored in that checkpoint, could have been created while the deserializer was deserializing the same application objects, and if so, (2) how far it had proceeded into the deserialization of those objects.
Figure 4.6 shows a sample matching stack for a SOAP message invoking a method, myMethod, which takes an array of MIO objects (struct MIO { int x; int y; double value; };) as a parameter. This matching stack precisely describes the application object the deserializer is currently deserializing: it indicates that the deserializer is deserializing an instance of the integer y for the second element of the array.
Tracking switching candidates with progressive matching
The matching stack not only describes the structural point at which the deserializer is, but also allows each deserialization routine to efficiently check its status with respect to a particular checkpoint; that is, to check whether a checkpoint was created while the deserializer was deserializing an application object that it had already deserialized.
The information pushed on the matching stack enables deserialization routines to perform such checks. For example, a matching stack frame for an ArrayElement object includes the number of array elements that have been deserialized. This allows its deserialization routine to determine whether a particular
checkpoint was created while the deserializer was processing an array element it had not yet deserialized (if the number of elements in the checkpoint is larger) or one it had (if the number of elements in the checkpoint is smaller). Similarly, a matching stack frame for a ComplexElement object includes a pointer to the schema class object (corresponding to one of the compound type's subtypes) that is currently being deserialized, allowing its deserialization routine to use similar logic to determine its status with respect to a particular checkpoint.
Such checks allow the deserializer to gradually determine when its matching stack matches that of a particular checkpoint. We refer to the process of determining when the matching stacks of a checkpoint and the deserializer match as progressive matching, since this process takes place as the deserializer is progressing in the deserialization of the message.
Through the process of progressive matching, the deserializer keeps track of the first checkpoint that is likely to match its state. When it detects that its state can no longer match that checkpoint, it picks the very next checkpoint whose state could match as its new candidate. If the new candidate is a lightweight checkpoint, the deserializer applies progressive matching to its base checkpoint, since lightweight checkpoints do not explicitly contain a matching stack.
Finalizing state matching
When the matching stack of the deserializer completely matches the one stored in a particular checkpoint, the checkpoint was created while the deserializer was deserializing the same application objects that it is currently deserializing. However, because the matching stack does not store actual string values (only references to them), and namespace alias mappings may have changed, this does not guarantee that the deserializer is at the same structural point within the XML message. Furthermore, as mentioned above and illustrated in Figure 4.5, a namespace alias may change without a corresponding change in the data structures being deserialized or even in the contents of subsequent message portions. Consequently, the next step bSOAP performs is checking string values and namespace alias mappings.
For all heavyweight checkpoints except context-independent checkpoints, and for the first checkpoint, this process is simply a word-for-word comparison of the namespace aliases stack, the strings stack, and the string references stack. 8 For context-independent checkpoints, this process is slightly more involved, as discussed in the following subsection.
Matching state with context-independent checkpoints

bSOAP may need to match additional state for context-independent checkpoints, since they can be created at arbitrary locations in the message. In particular, if a context-independent checkpoint gets created while the XML parser or a conversion routine has partially processed some XML content (e.g., a tag or character data), then bSOAP must also check that any state changes that occurred while processing that content match for a switch to fast mode to be safe. For example, an XML parser processing a tag's name may maintain a flag that determines whether it has processed a colon (which indicates whether it may regard the name as a qualified name 9 ). This flag must match for the switch to be safe.

Because matching such state can be cumbersome, bSOAP takes a different approach: it allows switching to occur before getting to the point in the message where the context-independent checkpoint was created, but only when the partially processed content, which is the XML content between the point where the switch is to occur and the point where the checkpoint was created, has not changed since the checkpoint was created. The rationale is that if the partially processed content did not change, then the deserializer's state will match that stored in the checkpoint once the deserializer processes the partially
8 bSOAP's stack memory allocator guarantees returning the same memory addresses for identical sequences of memory allocation requests. This makes word-for-word comparison possible, as pointers are then guaranteed to be identical.
9 Section 2.1.3 in Chapter 2 discusses qualified names.
processed content. 10
To check for changes in partially processed content, bSOAP stores the checksum of the partially processed content with the checkpoint, and compares the stored checksum with a newly computed checksum of the corresponding content in the new message before the switch occurs. For example, if a checkpoint gets created just before processing the character e in the start tag at which bSOAP is about to switch to fast mode, a checksum of

Mtext text elements, then N − Mtext array values would have to be always partially parsed, even if subsequent messages are identical. Furthermore, if a similar subsequent message arrives for the same array, but with M values changed from the previous instance, then the number of byte sequences that would have to be always partially parsed grows up to N + M − Mtext, because byte sequences from both messages would get associated with the same DFA state.
In our DDS approach, there is no reason to impose a limit on the number of lightweight checkpoints (or even other checkpoints). This is because our implementation does not create a checkpoint after parsing every XML construct. Instead, it relies on a dynamically tunable parameter, the interrupt frequency, to trigger the creation of checkpoints. Furthermore, our approach does not use binary search, so creating more checkpoints has no overhead other than that associated with creating the checkpoints.
While we do not account for multiple previous messages in our current implementation, we could. In fact, we recognize that adapting DDS to more than one message could increase its benefit. One approach to supporting this is to connect the checkpoints in a directed graph, instead of a linked list. This
would allow multiple similar messages to share checkpoints, and also maintain order between checkpoints. Note that this approach would need some mechanism (e.g., binary search, a hash map, or a heuristic-based mechanism) to select among multiple outgoing links for some checkpoints. In this case, however, a significant difference from Deltarser's approach is that the number of outgoing links to select from depends on the number of different checkpoints from processing different messages, rather than the number of different checkpoints from processing both the same and different messages.
A couple of other differences also exist in our approach. For example, our approach does not depend on the method used for detecting changes: it could easily take advantage of information provided by the sender about the portions of the message that have changed. This is not as easy to do with Deltarser, due to the lack of information about the ordering of byte sequences in the message. Our DDS approach also supports creating checkpoints at arbitrary locations in the message, which allows XML constructs to be partially parsed.
A Deltarser-based differential deserialization
Suzumura et al. [69] describe a DDS approach based on Deltarser, with an implementation in Java. In this approach, Deltarser's DFA states are classified into fixed states and variable states. A fixed state is a DFA state corresponding to an XML construct that does not change from message to message, as indicated by an XML schema, while a variable state is a DFA state corresponding to an XML construct that could change from message to message (e.g., character data corresponding to a value).
The deserializer maintains a variable table with an entry for each variable state. This table is created when a message is first deserialized and is associated with the final state of the DFA. Each table entry stores the following fields: a variable ID, an object parent, a class type, an object value, and an optional method setter. A variable ID is a key that the deserializer uses to index into the table for
updating an entry. It is not clear how this key or the table is implemented, or whether it is explicitly stored with each variable DFA state. The object parent is the object that contains a reference to (or the actual value of) the object corresponding to the variable DFA state. The class type field identifies the type of the object corresponding to the variable state; it could be used for converting the object value to the appropriate object (although this is not mentioned). The object value is the last value of the object corresponding to the variable state (apparently, in text format). Finally, the optional method setter is a reference to an object with a setter method 6 for setting the value of the object corresponding to the variable DFA state. The actual setter method is on the parent object, and maintaining an explicit reference to such an object with a setter method (which could be the parent object itself) avoids the need to use Java reflection on the parent object to obtain a setter method.
When the deserializer deserializes subsequent messages and partially parses a byte sequence at a particular variable DFA state (due to a change in the subsequent messages), it updates the value in the corresponding entry in the variable table. After it reaches the end state, and if it has updated any entry in the variable table, it reflects those updates to the application objects. It can do this by using the setter method (if present) or by using Java reflection. The deserializer supports modifying references to the application objects or modifying a copy. Copying of the application objects is performed using Java's Object clone method (if it has been overridden to perform a deep copy when necessary), Java serialization, or Java reflection.
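The variable-table mechanism just described can be sketched as follows. The field names follow the paper's description, but the concrete layout, the update flow, and the Vehicle example are our own assumptions (the original is a Java implementation; this is a Python approximation):

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class VariableEntry:
    # The five fields described in the text; layout is an assumption.
    variable_id: int
    object_parent: Any                     # object holding the value
    class_type: type                       # used to convert the text value
    object_value: str                      # last value seen, in text form
    method_setter: Optional[Callable[[Any, Any], None]] = None

class VariableTable:
    def __init__(self):
        self.entries = {}

    def add(self, entry):
        self.entries[entry.variable_id] = entry

    def update(self, variable_id, new_text):
        # Called when a byte sequence at a variable DFA state changes.
        self.entries[variable_id].object_value = new_text

    def flush(self):
        # After the end state: reflect pending updates to application objects.
        for e in self.entries.values():
            converted = e.class_type(e.object_value)
            if e.method_setter is not None:
                e.method_setter(e.object_parent, converted)

class Vehicle:                             # hypothetical application object
    def __init__(self):
        self.year = 0

v = Vehicle()
table = VariableTable()
table.add(VariableEntry(1, v, int, "2004",
                        method_setter=lambda obj, val: setattr(obj, "year", val)))
table.update(1, "2005")   # a changed byte sequence in a later message
table.flush()             # updates applied only after the end state
assert v.year == 2005
```

Note how the update to the application object is deferred until `flush`; this deferral is exactly the overhead the discussion below argues is unnecessary.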
Since this approach is based on Deltarser, the points discussed in the previous section about Deltarser also apply to this DDS approach. In addition, there are some differences in how our DDS approach updates the application objects that are also crucial to its effectiveness when compared to this DDS approach. 7
6 Java does not support pointers to functions; therefore, a wrapper object is needed.
7 Note that our implementation is C++ based, and some implementation approaches cannot be directly translated to Java due to its restrictiveness. Nevertheless, we believe it is possible to implement our approach in Java, but some low-level implementation details would have to change. For example, the code would have to explicitly implement its own stack, since it may not be possible to directly access a thread's stack in Java without using native methods.
The most important difference is that we do not maintain a variable table. Instead, deserialization pointers (Section 4.11.3, Chapter 4) are stored as part of the state in a checkpoint and are used to determine the addresses of application objects while updating them. Each deserialization routine handles a particular type of application object 8 and maintains enough information about an object to allow it to properly construct it or partially re-deserialize it. For example, ArrayElement's deserialization routine stores the array index on the matching stack. Construction and updating of objects takes place while the new contents are deserialized, and not after the incoming message is fully processed. The Java equivalent of our DDS approach cannot use pointers. However, pointers can be emulated by using wrapper objects for primitive types and maintaining references to the wrapper objects and other, aggregate objects. Setter methods on the containing objects could also be used, as done in the DDS approach described in this section. The important thing to note is that there is no need to explicitly maintain a variable table, even for the DDS approach described in this section. This is because the information stored in a particular variable-table entry is available while the corresponding object is being deserialized; thus, updates to it are redundant. Variable tables induce a lot of unnecessary overhead, both performance-wise and memory-wise, that may offset the benefit of DDS, particularly for objects that can be deserialized relatively quickly. A variable table can also be very large.
For example, a 10K-element array of doubles would require the minimum of 10K and M_text variable-table entries (besides the memory required for storing the byte sequences). Also, for the variable table's performance to be acceptable, its implementation would have to allow for quick indexing as well as quick resizing (a hash table or map is a good candidate). This is because indexing into the table is performed whenever a value is changed in subsequent
8 Note that there is no reason that a deserialization routine could not handle multiple object types. In this case, it would have some information in the corresponding schema object indicating the actual type. In fact, the current deserialization routines do handle objects of multiple types. For example, ComplexElement's deserialization routine handles all C++ types constructed using the struct or class keywords, and not just a particular type (which is also possible to do).
Caching
messages. In addition, the serialized form of new objects may appear in subsequent messages, and those new objects may appear between the serialized forms of old, unchanged objects in the message. Thus, while an array, for example, allows for fast indexing, it cannot be used to efficiently implement the variable table, because inserting entries in the middle of the table can be very expensive in this case. Finally, note that postponing the update of the application objects until after the message has been fully processed adds even more unnecessary overhead, since the deserializer would have to iterate through all the entries in the variable table to determine which updates need to be reflected to the application objects.
Currently, our DDS implementation does not support maintaining separate copies of the application objects. Again, this is not a limitation of our DDS approach. We discuss this issue, how copying can be done easily (because bSOAP manages memory allocation), and possible alternatives for copying (some of which may not be possible using pure Java code) in Section 4.12 in Chapter 4.
Recently, there has been considerable work on caching of SOAP messages to increase performance or decrease communication overhead. Takase et al. [71] describe an architecture for two-level caching of SOAP response messages. In this architecture, both a SOAP client machine and a reverse proxy machine may cache responses. The request message is converted to a canonical XML format [15] before being stored in the cache, and a 160-bit SHA-1 hash value is computed for the canonical message to speed up cache lookup.
A major disadvantage of this caching approach is that identical messages (in structure and values) have to be sent for the cache to be effective. Also, the canonicalization of messages induces a typically unnecessary overhead, due to the relatively small number of SOAP toolkits in prominent use.
The paper describes an optimization where the client generates a message
in canonical format. This optimization can eliminate the (typically unnecessary) canonicalization overhead. However, it requires clients to be modified and to be able to determine the address of the reverse proxy. Furthermore, requiring clients to generate canonicalized messages may hinder the use of other optimization techniques, like differential serialization.
In more recent work, Takase et al. [72] discuss another approach for caching SOAP response messages. In this approach, the cache is stored at the client side and all of the cache management is done on the client side—the server is unaware that the client is caching responses. The paper assumes that the client knows which methods are safe to cache SOAP responses for, and thus there is no risk of getting an incorrect SOAP response from the cache. The paper discusses two approaches for representing cache keys. Both depend on the parameter values. The first is to use the SOAP message that results from serialization. The second is to directly use the parameter values, method name, and endpoint URI as cache keys. This allows serialization to be bypassed on cache hits.
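The second key representation can be sketched as follows; the endpoint URL, method name, and transport function are hypothetical stand-ins, not part of the cited work:

```python
cache = {}

def call(endpoint, method, params, transport):
    # Key built directly from call parameters: on a hit, neither
    # serialization nor the network round trip is needed.
    key = (endpoint, method, params)
    if key in cache:
        return cache[key]
    response = transport(endpoint, method, params)   # serialize + send
    cache[key] = response
    return response

calls = []
def fake_transport(endpoint, method, params):        # test double
    calls.append(method)
    return {"price": 90.5}

r1 = call("http://example.org/quote", "getQuote", ("IBM",), fake_transport)
r2 = call("http://example.org/quote", "getQuote", ("IBM",), fake_transport)
assert r1 == r2 and len(calls) == 1   # second call never left the client
```

Note that `params` must be hashable (a tuple here), which mirrors the paper's restriction of this technique to methods with simple parameter structures.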
In addition, the paper discusses three approaches for storing response messages in the cache. In the first approach, the response message is stored as is, in XML format. This approach requires the client to always deserialize the response message, even on a cache hit. In the second approach, XML parsing is avoided on a cache hit by storing the response message as a sequence of SAX parsing events. Finally, in the third approach, the application objects built by the deserializer are stored in the cache. This approach is better than the other two in that both XML parsing and the construction of the application objects are avoided on a cache hit. However, it induces the overhead of copying the application objects from the cache to the application. This is true even the first time the response message is inserted in the cache. In addition, it requires large amounts of memory, as two copies of the application objects exist.
The authors discuss an optimization where a reference to read-only objects is kept in the cache instead of copying the objects. This, however, burdens the application programmer, as it requires that the cache be explicitly notified by the
application about read-only objects. Additionally, it requires that the cache be notified when a read-only object is about to be deleted from the application's memory, so that it can evict the corresponding cache entry or, alternatively, copy the object.
The experimental studies suggest that storing references to read-only objects performs best, followed by copying, then SAX parsing events, and, finally, storing the XML form of the response. As with their previous work [71], the use of values in cache keys limits the benefit of caching to methods with simple structures and results in a large cache, as each different value, even for the same remote method, results in a new cache entry. Also, methods have to be repeatedly called with the same values in order to get any performance benefit.
Fernandez et al. [35] propose a caching scheme wherein a Web service is responsible for describing appropriate caching mechanisms for its response messages. This description occurs as a SOAP header extension in the SOAP messages and, thus, can be ignored by Web services that do not understand this caching scheme.
Their scheme supports two caching mechanisms: weak caching and strong caching. Weak caching does not guarantee that stale responses are never returned, whereas strong caching guarantees that no stale response is ever returned. Weak caching is implemented by having a Web service specify a Time-to-Live (TTL) value in its response message, which indicates how long the cached response remains valid.
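Weak caching with a service-supplied TTL can be sketched as follows; the cache structure and key format are illustrative assumptions:

```python
import time

class WeakCache:
    # Weak caching: a response may be served until its TTL expires;
    # staleness before expiry is tolerated by definition.
    def __init__(self):
        self.entries = {}   # key -> (response, absolute expiry time)

    def put(self, key, response, ttl_seconds):
        # ttl_seconds comes from the Web service's response message.
        self.entries[key] = (response, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        response, expires = entry
        if time.monotonic() >= expires:
            del self.entries[key]       # expired: force a fresh request
            return None
        return response

c = WeakCache()
c.put("getQuote:IBM", "<quote>90.5</quote>", ttl_seconds=60)
assert c.get("getQuote:IBM") == "<quote>90.5</quote>"
c.put("getQuote:SUN", "<quote>5.2</quote>", ttl_seconds=0)
assert c.get("getQuote:SUN") is None    # already expired on lookup
```

Using a monotonic clock avoids spurious expiry or retention when the wall clock is adjusted; the TTL is relative to receipt, so network latency already eats into it.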
For strong caching, the paper proposes using invalidation to prevent stale responses from being sent. Invalidation works by having a Web service notify its past clients of its intention to invalidate a response. The paper proposes two invalidation mechanisms. In relaxed invalidation, a Web service simply sends notification messages to its past clients before invalidating a response; it need not ensure that all, or any, of the clients actually received a notification message.
In secure invalidation, a Web service also sends notification messages to clients
before invalidating a response. However, in this case, the Web service waits for a response from each client before actually invalidating the response. To guarantee progress in case the Web service does not receive a response from a client, a TTL value is used: when the TTL value expires, the Web service invalidates the response whether or not it has received responses from all past clients. Without taking network latency into consideration, it is unclear how the scheme guarantees that responses can be safely invalidated when the TTL value expires, as the TTL value is set by the Web service and not by the clients.
An advantage of this caching scheme is that it charges a Web service with the responsibility of specifying an appropriate caching policy. This not only decouples caching from the transport protocol, but also solves a problem shared by most other caching schemes: how to decide whether a response message can be cached without violating the semantics of a Web service invocation (i.e., returning stale responses). This advantage, however, comes at the price of transparency, as Web service programmers have to explicitly describe appropriate caching policies for their Web service for any caching to occur.
request messages. Messages need not have identical input values (as opposed to [71, 72, 35]) or identical message structure (as opposed to [48]), or even come from the same client. It is fully SOAP compliant and does not need special support to be effective (as opposed to [71, 35, 48]). It optimizes both XML parsing and SOAP deserialization (as opposed to [48]). 10

Schema-specific parsers
An XML schema language is a set of rules for describing the legal contents of XML documents. For example, one rule may indicate that a certain element's name should be car and that the element should have two sub-elements: year and make. XML documents that conform to a certain XML schema are said to be instances of that XML schema. An XML document is said to be well-formed if it conforms to the XML language grammar described in the XML specification [17]. If the XML document also conforms to a certain XML schema, it is said to be valid with respect to that schema. 11
It is possible to exploit the information in an XML schema to develop optimized XML parsers for parsing instances of that schema. This approach to XML parsing is commonly referred to as schema-specific parsing, and parsers using this approach usually have the combined functionality of ensuring that a document is both valid and well-formed. Several approaches to schema-specific parsing have been described in the literature. In this section, we briefly describe three of them.
Chiu et al. [22] describe a compiler-based approach for schema-specific parsing. In this approach, an XML schema is translated into an intermediate representation called generalized automata. The intermediate representation can then be processed, after possibly going through optimization passes, to generate
10 Note that DDS effectiveness could be increased using some form of caching. Also, should DDS have support from SOAP serializers, its major overhead, that of detecting changes, could be significantly reduced.
11 These concepts are discussed in more detail in Section 2.1.5 in Chapter 2. They are repeated here for the reader's convenience.
a schema-specific parser written in any target language, like C++ or Java. Using a prototype implementation that accepts a small subset of the W3C XML Schema and generates a C++ parser, the authors demonstrate that this technique has great potential for improving the performance of XML parsing.
Van Engelen [74] describes a schema-specific parsing approach based on a two-level DFA. The lower-level DFA is charged with recognizing XML constructs and ensuring that the XML document is well-formed. The higher-level DFA is charged with ensuring that the XML document is a valid instance of the XML schema for which the DFA was generated. Some of the actions in the lower-level DFA are also dependent on the XML schema, such as actions for recognizing the tag names appearing within the schema. This two-level approach results in a significant reduction in the number of DFA states (and, therefore, in the generated code), at the cost of a potentially slight reduction in performance.
Finally, in a recent paper, Perkins et al. [60] go into many details about the challenges involved in designing a schema-specific parser generator. For example, they point out that due to the unspecified order in which attributes are allowed to appear within an XML start tag, some validity and well-formedness checks cannot be conclusively performed until the end of a start tag. 12
They also present an architecture for generating a schema-specific parser directly from the schema components defined by the W3C XML Schema specifications. 13 In this architecture, the generated parser is divided into a validation layer and a scanner. The validation layer drives the scanner, which has primitives for identifying XML constructs. Because validity and well-formedness checks cannot be performed until the end of a start tag, the scanner does not return to the validation layer until it has fully processed a tag 14 , including all of its attributes
12 One example where this may not be true is namespace alias definitions. It is possible for the parser to conclusively determine the value of a namespace alias once it encounters its definition within a start tag, since any similar definition within the same tag is an error.
13 The W3C XML Schema specifications define schema components as an abstract representation of schema constructs. That is, a schema component describes a W3C XML Schema construct without any reference to a concrete representation, such as XML.
14 Note that this may preclude some optimizations, particularly when the XML document has many attributes within start tags.
(in the case of a start tag). The validation layer is a recursive-descent parser that is specially generated for parsing XML documents conforming to a specific schema. The code of the validation layer has a parse function for every complex type, as well as a dispatch function for handling element-specific validation constraints, such as data type validation. The validation logic includes special code for handling common, simple use cases.
The scanner incorporates many optimizations based on two strategies: specialization and optimistic scanning. An example of an optimization that can be categorized as specialization is special close-tag scanning logic that looks for the start of a close tag, followed by the exact bytes already seen in the name of the start tag. An example of optimistic scanning is assuming that integers are represented as a string of digits, without any use of character entities. The scanner also has fast scanning primitives. For example, a read-tag primitive, for processing tags, comes in several flavors. One read-tag flavor handles closing tags, while another handles the case where the tag name is known, and a third handles the general case where the tag name is not known.
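The close-tag specialization can be sketched as follows; the function name and buffer layout are illustrative, not from the cited work:

```python
def scan_close_tag(buf: bytes, pos: int, open_name: bytes) -> int:
    # Specialized close-tag scan: instead of re-parsing a name character
    # by character, expect exactly b"</" + the bytes already seen in the
    # matching start tag + b">".
    expected = b"</" + open_name + b">"
    if buf.startswith(expected, pos):
        return pos + len(expected)        # new scan position
    raise ValueError("mismatched or malformed close tag")

doc = b"<year>2004</year>"
# Content scanning leaves us at index 10, just before "</year>".
assert scan_close_tag(doc, 10, b"year") == len(doc)
```

A single byte-string comparison both advances the scanner and verifies the well-formedness constraint that start- and end-tag names match.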
When considering DDS and schema-specific parsing as separate optimization techniques, DDS is the better alternative when XML documents tend to be similar. 15 This is because it can avoid both XML parsing costs and XML post-processing costs for similar portions within XML documents. On the other hand, when messages are very likely not to be similar, schema-specific parsing is better, because the overhead of DDS is likely to offset its benefit in this case.
In reality, however, schema-specific parsing is not an alternative optimization technique to DDS, but rather a complementary one. It is possible to build a DDS deserializer on top of a schema-specific parser. Specifically, DDS avoids parsing and construction of objects for unchanged portions of messages, while schema-specific parsing exploits the information in an XML schema to optimize the
Changing the message format
Approaches that retain textual XML format
Van Engelen [75] describes an approach for improving SOAP's performance for scientific computing, for which a software package called the gSOAP Numerical Task Library (gNTL) is being developed. This approach is based on W3C XML Schema's type extensibility, which allows new types to be derived from other types by restricting their value space.
In this approach, the W3C XML Schema base64 type is derived from to define new types that allow for efficient encoding of arrays of numerical types. For example, arrays of 32-bit integer and 32-bit floating-point types are encoded as strings of base64 digits that reflect their big-endian and IEEE 754 binary formats, respectively. Unlike SOAP arrays, no markup tags separate array elements. Instead, all array elements are encoded using a single base64 string. Appropriate type definitions are placed in a schema, and the encoded array carries its type in the message. This allows a receiver to convert the new types properly. For example, an int32 and an fp32 type are defined for arrays of 32-bit big-endian integers and arrays of 32-bit IEEE 754 floating-point numbers, respectively.
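The int32-style encoding can be sketched as follows. This is a minimal illustration of the idea (big-endian 32-bit values packed into one base64 string, no per-element tags); the function names are ours, not gNTL's:

```python
import base64
import struct

def encode_int32_array(values):
    # Pack each value as a 32-bit big-endian integer, then base64 the
    # whole buffer: a single string, with no per-element markup tags.
    raw = struct.pack(f">{len(values)}i", *values)
    return base64.b64encode(raw).decode("ascii")

def decode_int32_array(text):
    raw = base64.b64decode(text)
    return list(struct.unpack(f">{len(raw) // 4}i", raw))

values = [1, 2, 70000, -5]
encoded = encode_int32_array(values)
assert decode_int32_array(encoded) == values
```

The element count is implicit in the encoded length (4 bytes per value), which is part of what makes the format compact but also what ties sender and receiver to the same type definition.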
Van Engelen also studies using DIME attachments [42] to send binary data. He indicates that DIME attachments are more flexible than other techniques used to transfer binary data, such as FTP or GridFTP. This is because the SOAP message can be utilized to carry metadata, DIME allows multiple files to be attached, DIME attachments can be used to send dynamic data efficiently in streaming mode, and WS-Addressing [14] and WS-Security [56] can be utilized to route and protect SOAP messages with DIME attachments.
The performance study shows that the base64 encoding of numerical arrays results in a significant performance improvement when compared to SOAP arrays. The study also shows that this encoding method is better than Java RMI and slightly worse than using DIME attachments to send the arrays.
While using base64-based formats to encode arrays of numeric types improves performance and reduces bandwidth requirements, SOAP's attractive features, such as interoperability, are compromised. The use of W3C XML Schema extensions and type attributes to indicate the type within the SOAP message can help reduce the disadvantage of this approach. However, it requires that all receivers understand the new types. Another disadvantage of this encoding approach is that the elimination of array-element markup tags hinders other optimization techniques that take advantage of the hierarchical structure of XML messages, such as LCP and PXP (see Section 6.5 below). 17
DIME attachments share the interoperability disadvantage of the base64 encoding approach, as they require receivers to understand the binary attachment's format. In addition, using DIME attachments requires developers to learn and use two different APIs (a Web services API and an API for DIME attachments), and also results in a further reduction in interoperability due to the lack of a universal standard for sending binary data [20].
DDS, on the other hand, has the benefit of working on pure XML-based SOAP messages, and therefore allows SOAP to retain its attractive features. In addition, DDS does not require any API changes—programmers can interface with Web services using standard APIs.
Binary XML
Binary XML formats have recently gained momentum as an alternative XML format to address textual XML's shortcomings. The XML Binary Characterization Working Group [80] was formed to compile a list of XML properties and use cases in order to better understand and quantify the merits of binary XML formats. Several binary XML formats have been proposed that optimize for one or more properties or use cases. In this section, we discuss two binary XML formats.

Chiu et al. [20] present a binary XML format, Binary XML for Scientific Applications (BXSA), specifically designed for use in high-performance scientific applications. The goal of this format is to allow for high-performance processing and exchange of SOAP messages containing scientific data. The format contains direct support for the SOAP data model.

17 Both LCP's and PXP's XML parsing implementations can be modified to accommodate this. However, this may be undesirable from a software-engineering point of view, as the XML parser may have to depend on application-specific interpretations of the XML data.
A BXSA document consists of a sequence of frames of potentially different types. All frames begin with information that indicates the byte order of the data in the frame, the frame's type, and the frame's size (in bytes). The frame size and all integers within frames are encoded as variable-length integers. A variable-length integer allows for compactness by using the first leading bits to specify the number of bytes needed to encode the integer.
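One possible realization of such a variable-length integer is sketched below. The exact bit layout is our assumption (the text only says that leading bits give the byte count): here the top two bits of the first byte give the number of extra bytes (0-3), and the remaining 6 bits plus the extra bytes hold the value, big-endian:

```python
def encode_varint(n: int) -> bytes:
    # Top two bits of byte 0 = number of extra bytes; remaining 6 bits
    # and the extra bytes carry the value (big-endian). Values < 64 need
    # only a single byte, which is where the compactness comes from.
    assert 0 <= n < (1 << 30)
    extra = 0
    while n >= (1 << (6 + 8 * extra)):
        extra += 1
    out = bytearray(extra + 1)
    out[0] = (extra << 6) | (n >> (8 * extra))
    for i in range(extra):
        out[extra - i] = (n >> (8 * i)) & 0xFF
    return bytes(out)

def decode_varint(buf: bytes):
    extra = buf[0] >> 6
    n = buf[0] & 0x3F
    for b in buf[1:1 + extra]:
        n = (n << 8) | b
    return n, 1 + extra        # decoded value and bytes consumed

assert encode_varint(5) == b"\x05"          # small value: one byte
assert len(encode_varint(300)) == 2
assert decode_varint(encode_varint(300))[0] == 300
```

Because the length is known from the first byte, a reader never needs to scan ahead to find where the integer ends, unlike continuation-bit varint schemes.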
Some of BXSA's frame types are modeled after SOAP's data types. A leaf element frame corresponds to SOAP's simple types, which are encoded in XML as elements with no child elements. 18 This frame contains information about namespace alias definitions, the name of the element, an attribute list, the native data type that the element corresponds to, and a data value. The data value is encoded in native format, in accordance with the data type.
A compound element frame corresponds to XML elements with both element-only content (SOAP complex types) and mixed content. Its format is similar to that of a leaf element frame, with the data type and value replaced by an integer specifying the number of sub-frames. The sub-frames can be of any type and follow the compound element frame data in a BXSA document. Character data within a compound element is represented using a character-data frame, which consists of common frame data, followed by an integer specifying the number of characters in the character data, followed by the character data itself.
An array element frame corresponds to a SOAP array. Its format is similar to a leaf element's, with an additional integer specifying the number of elements in the array. The array's elements follow the frame data and are encoded in native format. Arrays of strings or polymorphic types cannot be encoded using an array element frame. The XML equivalent of an array element frame is that
18 The SOAP encoding is discussed in Section 2.2.4 in Chapter 2.
of an XML element with a sequence of child elements having an identical name and with content corresponding to the array element values. Currently, an array element frame does store the child elements' name.
Finally, a comment frame and a processing instruction frame encode information about comments and processing instructions, respectively. A comment frame has a format identical to a character-data frame. A processing instruction frame has a similar format, but with an additional length integer followed by a second sequence of characters. The first string is interpreted as the processing instruction's name, while the second string is interpreted as its content.
With direct support for SOAP arrays and storage of application values in native format, BXSA enables high-performance XML processing for scientific applications. This is particularly true when XML data is exchanged between homogeneous machines, as large arrays, for example, can be read directly into memory. For other data structures, or for heterogeneous machines, BXSA's processing speed is reduced, as processing has to be performed in sequence and conversion between incompatible data formats may take place. Both variable-length integers and the frame formats allow BXSA to have a compact format, which can help reduce bandwidth requirements or increase communication performance. This is because many of textual XML's syntactical constructs are not encoded in BXSA; for example, element closing tags and angle brackets are not encoded. This benefit is more pronounced with the direct support for SOAP arrays, which removes the need to delimit each array value with a pair of tags.
Werner et al. [82] describe an approach for generating a compact binary XML format using information in XML schemas, together with an implementation called Xenia. This approach is similar to the schema-specific parsing approaches discussed in Section 6.3 above, in that its implementation is based on a state machine constructed from an XML schema. However, in this approach, the primary goal of the state machine is to generate a compact binary XML format, though the state machine also allows for efficient serialization and deserialization of XML documents.
In this approach, an XML schema is processed to generate code for a deterministic push-down automaton (PDA). The resulting PDA reflects the structure of all valid XML instances of that schema. Transitions in the PDA are based on XML tag names. Thus, for the state corresponding to some element's start tag, there are transitions to states corresponding to the start tags of all of its child elements. Attributes are handled as a special kind of child element in the PDA, with their values appearing as character data of those elements.
Each transition from a state is labeled uniquely using binary codes. All transitions are currently assumed to have the same probability of occurring in XML instances; therefore, binary codes are generated using consecutive binary numbers. For example, if the schema indicates that elements a, b, c, and d are the only valid children of element X, then the state corresponding to the start tag of
X would contain four transitions labeled00, 01,10, and11.
The binary codes are used in generating the binary XML format. Serializing an XML tag corresponds to making a transition on the PDA: the PDA simply outputs the binary code corresponding to the transition. Typed character data are serialized in their native format. For example, if the XML schema indicates that the type of an element is integer, then the PDA would output a 32-bit integer in native format. When character data has to be serialized as a string, Xenia can use three approaches. The first is to write the character data unmodified. The second is to compress it using adaptive Huffman codes. The third is to group all strings into a string container, compress the string container using the Prediction by Partial Match (PPM) algorithm [27], and replace the actual strings with indexes into the string container.
Parsing works in the opposite manner. At any PDA state, the binary code in the document is compared to the binary codes corresponding to the transitions originating from that state. The matching transition indicates the XML tag in the document; at this point, a SAX parser, for example, can call the corresponding SAX handler. For states with a transition to a state corresponding to character data, the character data is read directly from the document and interpreted in a format appropriate to its data type.
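The binary-code idea above can be sketched as follows. This toy automaton numbers each state's valid child tags consecutively and emits them using the minimum fixed bit width, matching the a/b/c/d example; it is purely illustrative (Xenia's actual PDA is generated from an XML schema, and these class and method names are hypothetical).

```python
import math

class ToyPDA:
    """Toy sketch of Xenia-style binary tag codes: for each state, the
    valid child tags are numbered consecutively and emitted using the
    minimum fixed number of bits."""
    def __init__(self, children_of):
        # children_of: state -> ordered list of valid child tag names
        self.tags = children_of
        self.codes = {s: {t: i for i, t in enumerate(tags)}
                      for s, tags in children_of.items()}
        self.width = {s: max(1, math.ceil(math.log2(len(tags))))
                      for s, tags in children_of.items()}

    def encode(self, state, tag):
        """Serialize a start tag as the binary code of its transition."""
        return format(self.codes[state][tag], f'0{self.width[state]}b')

    def decode(self, state, bits):
        """Parse: match the leading bits against the state's transitions;
        returns (tag, remaining_bits)."""
        w = self.width[state]
        return self.tags[state][int(bits[:w], 2)], bits[w:]
```

With `ToyPDA({'X': ['a', 'b', 'c', 'd']})`, tag `c` under element `X` serializes to the two bits `10`, as in the example above.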
Replacing XML
Eisenhauer et al. [33] observe that the wire format strongly determines the performance and flexibility of communication systems. They develop a format, Native Data Representation (NDR), along with an implementation, Portable Binary I/O (PBIO), with the aim of being as flexible and descriptive as XML, yet allowing for high-performance encoding and decoding.
In NDR, a message conceptually consists of a stream of records, where each record has one or more fields and each field can correspond to a sub-record or a primitive type. Applications must provide PBIO with descriptions of the fields in each record. A field's description consists of its name, type, size, and its in-memory offset from the start of its record.
NDR is based on a receiver-makes-right communication paradigm, in which senders transmit data in their native format. The message is tagged with meta-information that describes this format. This meta-information includes the field descriptions provided by the application. When a machine receives a message, it inspects the message's meta-information to determine its format. Format correspondence within each record is based on field names. If the type, size, or offset of a particular field within a record does not match that provided by the application, PBIO uses conversion routines to convert the field's values to the appropriate format.
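A minimal sketch of receiver-makes-right conversion follows. The `Field` description and the function's interface are hypothetical stand-ins for PBIO's per-field meta-information (name, type, size, offset), not its actual API; the point is that fields are matched by name and rewritten into the receiver's layout and byte order only when needed.

```python
import struct
from collections import namedtuple

# Illustrative field description, loosely modeling NDR meta-information.
Field = namedtuple('Field', 'name fmt size offset')

def convert_record(data, sender_fields, sender_big_endian, receiver_fields):
    """Pull each field out of the sender's native-layout record and rewrite
    it into the receiver's expected layout, swapping byte order if needed."""
    s_end = '>' if sender_big_endian else '<'
    out = bytearray(sum(f.size for f in receiver_fields))
    recv = {f.name: f for f in receiver_fields}
    for f in sender_fields:
        value = struct.unpack_from(s_end + f.fmt, data, f.offset)[0]
        r = recv[f.name]                      # match fields by name
        struct.pack_into('=' + r.fmt, out, r.offset, value)
    return bytes(out)
```

If the sender's and receiver's layouts already agree, a real implementation would skip conversion entirely and use the record as is.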
To avoid sending format meta-information with each message, PBIO can use two approaches. In the first approach, meta-information for messages of a particular format is only sent with the first message of that format. The receiver caches the meta-information, and subsequent messages of the same format carry a 32-bit format token in place of the meta-information. A format token uniquely identifies a particular piece of meta-information during the lifetime of a connection. If messages of different formats are sent during the same connection, multiple format tokens are generated. When a connection is closed, the format tokens and cached meta-information used during that connection are invalidated.
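The per-connection caching scheme can be sketched as follows. The class, the CRC-based token derivation, and the method names are all illustrative assumptions, not PBIO's actual mechanism; the essential behavior is that tokens stand in for meta-information and become invalid when the connection closes.

```python
import zlib

class FormatTokenCache:
    """Per-connection cache sketch: the first message of a format carries
    full meta-information; later messages carry only a small token."""
    def __init__(self):
        self.by_token = {}

    def register(self, meta: bytes) -> int:
        token = zlib.crc32(meta)       # toy stand-in for a 32-bit token
        self.by_token[token] = meta
        return token

    def lookup(self, token: int) -> bytes:
        return self.by_token[token]

    def close_connection(self):
        self.by_token.clear()          # tokens are invalid after close
```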
In the second approach, PBIO registers format meta-information with a format server. The format server returns a 64-bit format token that can be used for identifying the registered meta-information, and messages carry the format token instead of the meta-information. When a machine receives a format token for the first time, it contacts the format server to get the corresponding meta-information, then stores it in a local cache. To reduce the number of such requests, the format server does not generate different format tokens for registration requests whose meta-information is identical to previously-registered meta-information. This approach allows caches to be better utilized than in the first approach, as format tokens outlive the connection in which they are used.
NDR and the PBIO implementation allow excellent performance to be achieved for several reasons. First, when the application objects occupy a contiguous region in memory (e.g., no pointers or dynamic arrays), senders can transmit data as is from memory. Second, format conversion only takes place when necessary; thus, receiving can be fast when no format discrepancy exists between the sender and the receiver. Finally, PBIO messages can be small when the number of different message formats is relatively small, as format tokens are used in place of meta-information.
At the same time, applications that use PBIO can be flexible. This is because they need not worry about any format incompatibility, as NDR's meta-information allows PBIO to seamlessly convert between incompatible formats. In addition, PBIO allows applications to inspect the meta-information within an incoming message, which allows applications to make run-time decisions about the use and processing of the message. Finally, PBIO determines the nature of format conversion at run-time; thus, it can allow for communication between programs that have no a priori knowledge of each other's data formats.
PBIO performs conversion by means of dynamic code generation. Conversion code is dynamically generated when a message of a particular format is first received, and the generated code is cached for use with subsequent messages of the same format. The generated code depends on the message data format. For example, if the meta-information indicates that integers are stored in big-endian format, then PBIO running on a little-endian machine would generate appropriate code to convert big-endian integers into little-endian ones.
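The generate-once, cache-and-reuse pattern can be sketched as follows. PBIO generates native machine code; this sketch instead builds and compiles a small Python routine at run time, specialized to the sender's byte order, purely to illustrate the idea (the function name and cache are assumptions).

```python
import struct

_conversion_cache = {}

def get_converter(sender_big_endian, count):
    """Sketch of dynamic code generation: the first time a format is seen,
    build (and cache) a conversion routine specialized for it."""
    key = (sender_big_endian, count)
    if key not in _conversion_cache:
        src_fmt = ('>' if sender_big_endian else '<') + f'{count}i'
        code = (f"def convert(data):\n"
                f"    return struct.unpack('{src_fmt}', data)\n")
        ns = {'struct': struct}
        exec(code, ns)                 # generate the routine at run time
        _conversion_cache[key] = ns['convert']
    return _conversion_cache[key]
```

Subsequent messages of the same format reuse the cached routine, so the generation cost is paid only once per format.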
While PBIO allows for high-performance communication, particularly for messages with contiguous in-memory representation (e.g., no pointers), as well as flexibility, it hampers interoperability. This is because NDR is not as widely adopted as XML. Furthermore, although programs need not know each other's data formats, they still have to know how to convert between those formats. Specifically, PBIO must make provisions for converting from each possible data format. In a system where programs truly have no a priori knowledge of each other's data formats, this should not be the case. For example, PBIO's dynamically generated conversion routines can convert between little- and big-endian integer formats. However, if a machine uses integers with byte ordering that is neither little endian nor big endian, it cannot communicate with other machines using PBIO unless PBIO has already made provisions to convert from that machine's non-traditional integer format.20 In other words, PBIO's interoperability level is limited by the expressiveness of its data format description.
Each of the two approaches with which PBIO allows machines to publish message formats, and, thus, achieve both flexibility and high performance, has its own problems. The first approach is only advantageous when machines exchange many messages with the same format over the same connection; therefore, some communication paradigms or limits on the number of open connections may render this approach useless. The problem here is generally not in exchanging larger messages,21 but in increasing the overhead of dynamic code generation when data formats are incompatible. While the second approach increases the effectiveness of format tokens, it introduces several other problems, including scalability, consistency, and fault-tolerance problems.
Note that NDR's meta-information is roughly equivalent to XML schemas. However, the XML schema management approach is superior to NDR's meta-information management approaches. This is because schema versioning is up to the schema's author and not to a third party (e.g., a format server).22 Therefore, there is no risk of inconsistency in case of format server crashes, for example.
In addition, the location of a schema document can be specified using a URI, and schemas need not be located on a single “format server.” This is, however, more
20 This is a somewhat contrived example, as most machines use either little- or big-endian formats (very old machines, like the PDP-11, are an exception). However, it demonstrates that this approach does not offer the level of interoperability of the approach used in XML, which is to use a least-common-denominator format: text. A more realistic example where PBIO's approach would break down is floating-point number formats.
21 To give an idea of how large meta-information can be, the estimated size of the meta-information for a message carrying data corresponding to the C++ struct MIO { int x; int y; double value; }; can be as large as 89 bytes.
22 URIs are typically used for schema versioning. The hierarchical naming approach in URIs makes version name clashes unlikely. For example, schemas authored by Binghamton University could all be prefixed with urn://www.binghamton.edu.
of an implementation issue (i.e., with PBIO) than a design issue.
Parallel XML parsing
The industry trend toward multi-core CPUs makes parallel XML parsing a viable approach for increasing XML parsing and deserialization performance. Lu et al. [50] describe a parallel approach to XML parsing called Parallel XML Parsing (PXP). In PXP, an XML message goes through two processing stages. In the first stage, the message is sequentially preparsed; in the second stage, the message is parsed in parallel.
Preparsing is a pass over the document to determine its logical structure. Preparsing is much faster than regular XML parsing, as the generated logical structure does not include all the information typically extracted from the document by parsing. The generated tree structure is called a skeleton, and is later used in distributing the real parsing work among the processors.
The skeleton is a graph with a node for each element. Each node contains information about the location of the start and end tags of the element and the number of child elements, as well as a link to the node corresponding to each child element.
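A skeleton node can be sketched as a small record (field names here are illustrative, not PXP's actual data structure). Note that structural questions used for partitioning, such as element counts, can be answered from the skeleton alone, without full parsing.

```python
from dataclasses import dataclass, field

@dataclass
class SkeletonNode:
    """Sketch of a PXP skeleton entry: tag locations plus child links."""
    start: int                  # byte offset of the element's start tag
    end: int                    # byte offset of the element's end tag
    children: list = field(default_factory=list)

def count_elements(node):
    """Element count under a node, computed from the skeleton only."""
    return 1 + sum(count_elements(c) for c in node.children)
```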
To distribute the parsing work among the processors, PXP can use two partitioning schemes. In static partitioning, the partitioning of the tree occurs once, before parsing the document, and does not change afterward. In dynamic partitioning, the partitioning occurs as the document is being parsed and changes depending on the load on each processor.
Static partitioning induces no overhead during parsing, since each processor's workload does not change during parsing. However, it is not always possible to statically partition the message in such a way that each processor performs a comparable amount of parsing work on a contiguous portion of the message without resulting in a strong dependency between the processors.25 Furthermore, the overhead of the partitioning that takes place before the document is parsed may offset the zero-overhead benefit of static partitioning during parsing.
25 This is true under the assumption that each processor performs the full parsing work in a single pass.
Special cases that allow for fast and balanced static partitioning do exist. In particular, messages consisting of large arrays encoded per the SOAP encoding can be easily statically partitioned by assigning a contiguous chunk of elements to each processor.
Dynamic partitioning, on the other hand, does not induce any overhead before parsing, but may induce overhead during parsing. Dynamic partitioning in PXP works as follows. Initially, one processor is assigned the root node and all other processors are idle. All the processors share a request queue, onto which idle processors can post requests for more work to do. Just before a processor parses a start tag, it checks the request queue. If an idle processor is found, the processor splits its workload (which consists of a set of sibling nodes) into two halves, keeping the second half for itself and assigning the first half to the processor that posted the request. The processor that posted the request also copies the parsing context of the processor from which it got its new workload, using it as its initial context to avoid any synchronization overhead.
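The work-splitting step can be sketched in a single-threaded form (real PXP runs this concurrently across processors; the function and its arguments are simplifications):

```python
from collections import deque

def split_work(worklist, request_queue):
    """Sketch of PXP's dynamic partitioning step: before parsing the next
    start tag, a busy processor checks the request queue and, if someone
    is waiting, hands over the first half of its remaining sibling nodes."""
    if request_queue and len(worklist) > 1:
        idle = request_queue.popleft()
        half = len(worklist) // 2
        handoff, keep = worklist[:half], worklist[half:]
        idle.extend(handoff)       # the requester gets the first half
        return keep                # this processor keeps the second half
    return worklist
```

Here an idle processor is represented simply by the list it will fill; in the real system the requester would also copy the donor's parsing context before starting.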
The advantage of PXP is that it not only reduces parsing time, but can also reduce the cost of conversion routines, since conversion can be performed in parallel. When compared to a sequential DDS, PXP is better to use when the messages tend not to be very similar. Roughly speaking, PXP should be better to use in the long run when the average number of bytes changed in the sequence of messages deserialized exceeds the average number of bytes that each processor deserializes.26
A possible disadvantage of PXP when compared to a sequential DDS is that the amount of useful parallelism (for which communication overhead is small and processors are busy) is limited when messages are small or when the depth of the skeleton is large relative to its breadth, particularly at the top levels of the skeleton. Similar disadvantages also exist in DDS: LCP may not be applicable for some messages, since it is dependent on the structure of the message, and
26 This ignores two significant sources of overhead in both approaches: checksumming and checkpointing in DDS, and preparsing and dynamic partitioning in PXP.
the overhead of DDS may be large when messages are small and the amount of change in those messages is also small.

It is also possible for DDS to use the ideas in PXP to improve performance. For example, changed data within a portion can be deserialized in parallel. Alternatively, a parallel DDS implementation may exploit the fact that each checkpoint provides all the context necessary to parse the following portion in order to process portions in parallel. Of course, since message portion sizes can change, a method similar to preparsing may be needed to re-synchronize the previously created checkpoints with respect to the new contents of the message.

Hardware acceleration
With the widespread use of XML in many applications, the need for high-performance XML parsers has increased significantly. Several hardware-based approaches for increasing XML performance have been proposed and implemented.
We discuss some of them in this section.
Lunteren et al. [77] describe an XML acceleration engine, the Zurich XML Accelerator (ZUXA), based on a programmable finite state machine called BART-FSM (B-FSM). The key features of B-FSM are that it allows for single-cycle state transitions, supports up to 32-bit input symbols, can produce more than 64 bits of output per transition, is programmable, and supports a large number of states through clustering.
The key structure in B-FSM is the transition rule table, which consists of a transition rule memory and a rule selector. The transition rule memory is a table that records all the rules that define the operation of the state machine. Each table entry contains a current state field, an input field, a flags field, a next state field, and an output field. The current state and input fields specify the state and input, respectively, for which the transition rule applies. The flags field is used to denote a don't-care state or input, which indicates that the rule applies for any state or any input, not just those specified in the current state or input fields. Finally, the output field specifies the data that is output when the rule matches and the transition occurs. Because of don't-care flags, multiple rules may match; in this case, only the rule with the highest priority is selected. A rule's priority is indicated by the location of the rule in the table: rules with higher priority are assigned lower indexes.
A state transition occurs at each cycle by selecting the appropriate rule from the transition rule memory. This is done by comparing the current state, which is stored in a register, and the input symbol against the rules in the transition rule memory. This process takes into account don't-care rules and the priority of the rules when multiple rules match. The process is optimized by designating m bits from the current state or input symbol as a hash index. Matching works by initially selecting at most 2^m rules that have all m bits equal to the corresponding bits in the current state or input symbol. This selection is realized by using a mask register with the appropriate bits set. After that, the (at most) 2^m rules are fed into the rule selector, which selects the matching rule among them. This complete rule selection process can take place in one cycle.27,28
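The rule-selection logic (don't-care matching plus index-based priority) can be sketched in software; the hash-based pre-selection and single-cycle hardware behavior are omitted, and the rule tuple layout is an assumption for illustration:

```python
# Sketch of B-FSM rule selection: a rule is (state, input, next_state,
# output), with None as a don't-care. The first match wins, so a lower
# index means a higher priority, as in the transition rule memory.
def select_rule(rules, state, symbol):
    for cur, inp, nxt, out in rules:
        if (cur is None or cur == state) and (inp is None or inp == symbol):
            return nxt, out
    raise LookupError('no matching transition rule')
```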
To support a large number of rules and still allow for fast transitions, rules can be partitioned into state clusters. Each state cluster contains its own transition rule table, and a table address register identifies the state cluster to which the current state belongs. To support transitions to rules in non-local state clusters, the transition rule entries for non-local transitions are extended to include a table address and a mask. The flags field is also extended to indicate the kind of transition (local or non-local). In the case of a non-local transition, the appropriate registers are updated as indicated by the table address and mask fields.
ZUXA works by combining a B-FSM with an instruction handler. The output fields in the B-FSM's transition rule tables encode instructions, along with their operands, that are executed by the instruction handler, and the results of executing instructions can be used in selecting state transitions, in addition to the current state and input. Thus, depending on the XML document, the B-FSM dynamically generates parsing code for the instruction handler to execute.

27 The paper indicates that this is possible for clock frequencies of up to 2 GHz. It is not clear how large the transition rule table can be for one-cycle transitions to remain possible at this clock rate.

28 This can be done by using a multiplexer to select the 2^m rules and using a structure similar to a content-addressable memory (CAM) [47] for the rule selector.
In ZUXA, the format of the B-FSM's transition rule table entry is modified by reinterpreting part of the input field as a conditions field. The conditions field specifies conditions to be evaluated by the rule selector before performing the transition. Specifically, a condition can resolve to true or false, and a rule can only be selected when all conditions associated with it are true. An example of a condition is match, which evaluates to true when the input character matches the current character in the character memory.29 Thus, the rule selector also performs simple execution of conditional instructions.
According to the paper, one of the problems with software parsing code on general-purpose processors is that it contains many difficult-to-predict branch instructions that result in many pipeline stalls. While this may be true, the comparison of ZUXA to general-purpose processors seems unfair, because ZUXA does not execute instructions in a pipelined fashion. In addition, conditional move instructions on general-purpose processors can eliminate branch instructions, at the cost of larger code size. Conditional move instructions can also allow parallel execution of conditions of various types, which is another advantage that ZUXA claims. Thus, the disadvantages of conditional move instructions must be weighed against any disadvantages of ZUXA (e.g., potentially large transition rule tables) for a fair comparison.
As ZUXA is ongoing work and no hardware implementation or instruction set exists, it is still too early to give an informed judgment. However, it seems most of ZUXA's benefit can come from encoding the parsing code in the form of a state machine table, the design of an instruction set architecture specific to parsing XML, in which instructions do not require complex execution logic, and the simplicity and efficiency of its data structures (e.g., the character memory is a

29 This is from an example given in the paper. In this example, the character memory is a memory area independent of the input stream. A read pointer points to the current character in this memory, and instructions can be used to modify the read pointer.
can help optimize parsing for XML applications interested in extracting specific content from an XML document, but such optimizations cannot be exploited by applications that process whole XML documents, like SOAP. Although IBM's and Dajeil's commercial products claim to accelerate XML and SOAP processing, it is not clear how this is performed, and whether conversion routines are accelerated or the host CPU still has to perform the conversions.

Overall, because of the lack of wide adoption of current hardware-based accelerators, it is still not clear whether they use the appropriate approaches to solve SOAP and XML processing performance problems. However, they are definitely faster than software-based solutions (at least for general-purpose XML parsing).

When such hardware acceleration products become more common and more evolved, they could make obsolete any software-based XML optimization techniques (including DDS). However, hardware-based accelerators may also make use of software-based XML optimization techniques to potentially reduce their complexity, power consumption, and cost. Meanwhile, it seems that software-based solutions are likely to remain important, and likely to help in increasing the deployment of SOAP and XML-based solutions. With this, they also pave the way for more innovative hardware acceleration approaches, as well as wider adoption of hardware-based SOAP and XML accelerators.

Miscellaneous optimizations
Davis et al. [32] study the performance of several SOAP toolkits. Their study shows that the way a SOAP toolkit prepares and sends SOAP messages can have an effect on latency and performance, particularly for small messages. For example, their study shows that the Nagle algorithm and the use of multiple system calls to send one logical message are sources of inefficiency. This is because when SOAP toolkits send SOAP messages using multiple system calls (which could result in multiple TCP packets being sent), the Nagle algorithm can have an adverse effect on performance, since it could delay all but the first packet.
They also conclude that XML parsing and formatting times can be sources of inefficiency. For example, they indicate that longer tag names for array XML elements can have a large impact on performance. In addition, they point out that binary encodings and HTTP chunking can improve SOAP's performance (when HTTP is used to transport SOAP messages).
Van Engelen [75] suggests using persistent connections, chunking, and pipelining to improve performance. He also evaluates the performance of compression and suggests not using compression except when the improvement in bandwidth utilization outweighs the overhead of compression. He notes also that, in some cases, SOAP compression is unlikely to result in any performance improvements because compression may already be applied in lower network layers.

Govindaraju et al. [40] show that SOAP by itself is not efficient enough for large-scale scientific applications. They observe that for small messages, SOAP can outperform some binary RMI implementations and is thus preferable in such cases. Based on this, they propose a multi-protocol RMI system, in which SOAP is a common-denominator protocol. In this system, clients and servers register the protocols that they support. When an application invokes a method, it is redirected to a meta-stub, which queries the registry to determine which protocol to use. Thus, protocols can be switched on each method invocation. Applications provide the RMI system with a policy for protocol selection.
Finally, Chiu et al. [21] analyze SOAP performance by breaking down serialization and deserialization into stages and studying the steps performed in each stage to determine how performance can be improved. They design an optimized SOAP deserializer based on two principles: minimizing the number of reads from the network and going over the data only once. Their XML parser uses tries for tag matching and calls tag-specific application handlers on tag matches; thus, the application can avoid examining tags again.
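The trie-based tag matching can be sketched as follows (the structure is illustrative; Chiu et al.'s parser matches tags incrementally from the input buffer, which this simplified dictionary-based trie only approximates):

```python
class TagTrie:
    """Sketch of trie-based tag matching: each known tag maps to a handler,
    matched one character at a time, so handlers fire on a match and the
    application need not re-examine tag strings."""
    def __init__(self):
        self.root = {}

    def add(self, tag, handler):
        node = self.root
        for ch in tag:
            node = node.setdefault(ch, {})
        node['$'] = handler            # '$' marks end-of-tag

    def match(self, tag):
        node = self.root
        for ch in tag:
            if ch not in node:
                return None            # unknown tag
            node = node[ch]
        return node.get('$')
```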
Furthermore, they confirm the observation of other researchers that HTTP/1.1 chunking and persistent connections can significantly improve performance, particularly for high-latency and high-bandwidth networks. Finally, they observe that for arrays of 1000 or more doubles, over 90% of CPU time is spent performing double-to-text conversions; for this reason, efforts on improving performance should consider improving the performance of numerical algorithms, rather than just XML parsing. This work identifies a fundamental problem with SOAP, and frames the opportunity for our differential serialization and deserialization optimizations.
We present differential deserialization (DDS), a novel technique for improving the performance of deserialization of SOAP messages. Differential deserialization is a receiver-side optimization technique that takes advantage of similarities between messages in an incoming message stream to a Web service. The bSOAP parser and deserializer checkpoints its state and calculates checksums over portions of incoming messages. When the corresponding portions of the next message match, bSOAP can avoid the expensive step of completely parsing and deserializing the incoming message; instead, it can use the results of the previously parsed message.
To do so, the deserializer runs in one of two modes, which we call regular mode and fast mode, and can switch back and forth between the two, as appropriate, while deserializing a message. In regular mode, the deserializer reads and processes all SOAP message contents, as a normal SOAP deserializer would, creating checkpoints and corresponding message portion checksums along the way. In fast mode, the deserializer considers the sequence of checksums (each corresponding to a disjoint portion of the message), and compares them against the sequence of checksums associated with the most recently received message for the same service.
If all of the checksums match (the best, but unrealistic, case where the next message is identical to the previous one), then the normal cost of deserializing is replaced by the cost of computing and comparing checksums, which is generally significantly lower. When a checksum mismatch occurs, signaling a difference between the incoming message and the corresponding portion of the previous message, the deserializer switches from fast mode to regular mode, and reads and converts that message portion's contents as it would otherwise have had to (without the DDS optimization). The deserializer can switch back to processing the message in fast mode if and when it recognizes that the deserializer state is the same as one that has been saved in a checkpoint. If the next message differs from the previous message in all message portions (the worst-case scenario), then a DDS-enabled deserializer runs slower than a normal deserializer, because it does the same work, plus the added work of calculating checksums and creating parser checkpoints.
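The fast-mode/regular-mode interplay can be sketched as follows. This is a heavily simplified model of DDS (no checkpointing, fixed portion boundaries, CRC-32 as the checksum): on a checksum match the previous portion's result is reused; on a mismatch the portion is parsed normally.

```python
import zlib

def deserialize_with_dds(portions, prev, parse_portion):
    """Sketch of the DDS fast path. `prev` is a list of
    (checksum, result) pairs from the previous message; returns the
    same structure for the current message, plus the results."""
    current, results = [], []
    for i, portion in enumerate(portions):
        c = zlib.crc32(portion)
        if i < len(prev) and prev[i][0] == c:
            results.append(prev[i][1])              # fast mode: reuse
        else:
            results.append(parse_portion(portion))  # regular mode: parse
        current.append((c, results[-1]))
    return current, results
```

With identical consecutive messages, `parse_portion` is never called on the second message; with a partly changed message, only the changed portions are parsed.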
To detect when the deserializer state is the same as one that has been saved in a checkpoint, the deserializer tracks checkpoints that are likely to match its state as it progresses through the deserialization of the message. An efficient mechanism, progressive matching, is used to determine the status of a checkpoint with respect to the deserializer (whether it was created at a position in the message ahead of the deserializer's current position). This mechanism also identifies stale checkpoints: checkpoints created while deserializing an application object that no longer exists in the incoming message.
We describe three mechanisms for checkpointing: full checkpointing (FCP), differential checkpointing (DCP), and lightweight checkpointing (LCP), which differ in their performance implications, flexibility, and memory requirements. Full checkpointing stores the full deserializer state when creating a checkpoint and, therefore, has large memory requirements. Differential checkpointing tracks changes in the deserializer's state as deserialization proceeds and stores only the differences between successive parser states. This results in significantly reduced memory requirements and DDS overhead, particularly when a relatively large number of checkpoints are created. Finally, lightweight checkpointing exploits the hierarchical structure of SOAP messages to create checkpoints with minimal state information. In particular, lightweight checkpointing creates checkpoints at locations where the parser state is known to be identical. The full state at those locations is stored only once, in a base checkpoint, and lightweight checkpoints, which store little state information, contain a reference to the base checkpoint. Lightweight checkpointing requires significantly less memory than both FCP and DCP, and can result in a significant reduction in DDS overhead. A disadvantage of LCP is that checkpoints cannot be created at arbitrary locations in the message, and its effectiveness depends on the base checkpoint creation policy. This disadvantage, however, does not typically arise for scientific data.
We present a thorough DDS performance study demonstrating that DDS shows potential for Web services that receive sequences of similar messages. Our performance study demonstrates that the improvement in deserialization time can be up to a factor of three for arrays of different types of data, and that the overhead can be small even when incoming array values are completely different. For example, LCP had a maximum overhead of about 28.7% when deserializing scientific messages containing arrays of doubles that completely change from one message to the next.