Welcome to part two of our little trip down PDF lane. While last month we focused primarily on understanding the structure of a PDF document, this time we’ll look at the problem of altering the contents of a PDF file from a more practical perspective.

The main thing to understand, before we move on to anything else, is that parsing a PDF file is a complex (but by no means complicated) endeavour, because the file is not intended for human consumption and does not follow a top-down logic. In other words, as we discovered last month, when parsing a PDF file one doesn’t start at the beginning and move down to the end of the file. In fact, the exact opposite is true.

Since we’ll often find ourselves jumping to various, completely arbitrary, positions in the document, the first decision that we need to make is how we’re going to access the data. While it is tempting to just load the entire file in memory, that’s usually not a good idea: a PDF can have pretty much any size, so by loading an entire document in memory we risk tying up large chunks of RAM and limiting our server’s ability to process a large number of requests.

Yet seeking to arbitrary locations in a document is not always easy, or even possible. Imagine, for example, that you’re accessing a PDF document via HTTP. In this case, you’d have to download the entire file before you could find out about any of its characteristics, since the offset of the cross-reference table appears at the end of the file.
Even in this case, I would recommend storing the document in a local file and then accessing the data through the filesystem.

The one notable exception to this rule is a special class of PDF documents known as “linearized PDF files”. A linearized PDF document contains a dictionary at the beginning of the file that provides the necessary facilities for determining the location of the first page in the file without having to read through the cross-reference table first. The structure of linearized PDF files is beyond the scope of this article, but you can find out more about it directly from the PDF specification document published by Adobe.

FEATURE
In the Belly of the Beast
Interpreting and Manipulating PDF Files
by Marco Tabini
May 2004 ● PHP Architect ● www.phparch.com

In last month’s issue, we examined the structure and contents of a PDF document in considerable detail. This month, we’ll actually write a PHP library capable of opening one and modifying its contents.

REQUIREMENTS
PHP: 4.3.0+
OS: Any
Applications: A PDF Reader (for testing)
Code Directory: pdf

Getting Started

The first thing we need to do in order to be able to interpret the contents of a PDF document is to determine where the cross-reference table and trailer dictionary are. This is quite easy if you consider that the format of the startxref pointer is fixed. For example, in my document it looks like the following:

startxref
53593
%%EOF

Thus, all we need to do is move to the end of the file, back up a few bytes and then find this sequence of data. As you can see from Listing 1 (findxref.php), this is readily accomplished by using a simple regular expression. Note how the regex pattern specification ends with a dollar sign, indicating that the resulting match must be anchored to the end of the data stream.
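The effect of the anchor can be checked on a contrived chunk of data that happens to contain two pointers. What follows is only a sketch: the sample data is made up, and the two patterns are simplified stand-ins for the one in Listing 1 rather than copies of it.

```php
<?php
// Hypothetical tail of a file that was updated once: the older pointer (1234)
// is followed by the newer one (53593). Made-up data, for illustration only.
$tail = "startxref\n1234\n%%EOF\nstartxref\n53593\n%%EOF\n";

// Simplified patterns (assumption: any mix of \r and \n separates the parts).
$unanchored = '/startxref[\r\n]+(\d+)[\r\n]+%%EOF[\r\n]*/';
$anchored   = '/startxref[\r\n]+(\d+)[\r\n]+%%EOF[\r\n]*$/';

// Without the anchor, preg_match() stops at the first (leftmost) match.
preg_match($unanchored, $tail, $m);
$first = (int) $m[1];

// With the anchor, only a pointer sitting at the very end can match.
preg_match($anchored, $tail, $m);
$last = (int) $m[1];

echo "unanchored: $first, anchored: $last\n";
```

The unanchored pattern reports the stale pointer, while the anchored one reports the pointer that actually closes the data.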
Even though we’re only taking fifty characters from the end of the file, I have added the anchor to prevent the regex engine from picking up a previous cross-reference table pointer by mistake.

If you’re wondering why the cross-reference table pointer is not saved to the document using a fixed format (say, for example, using 10 digits for the offset like the cross-reference entries themselves), you’re not alone. This decision is a bit of a mystery, but it’s something that we have to live with.

By the way, throughout the remainder of the article you’ll notice that I have created an individual include file for each of the functions that we will be writing. This is clearly not a good design practice, but it fulfills one important purpose: it keeps the listings in the article short and to the point. Thus, in the interest of clarity, I hope that you’ll forgive me and that, if you decide to use any of the code in your own projects, you will not follow the same layout.

Reading the Cross-reference Table

Now that we know where to look for it, it’s time to figure out how to read the cross-reference table itself. If we move to offset 53,593 of the file, we’ll find the following (only the first few entries are shown):

xref
0 22
0000000000 65535 f
0000000017 00000 n
0000005632 00000 n
0000005659 00000 n
0000006483 00000 n
0000053169 00000 n
0000006509 00000 n
0000039936 00000 n

The word xref is followed by a line that holds the number of the first object represented in the table (0 in this case) and the number of entries that follow (twenty-two); we’ll call this the “header” of the table. Next come the entries themselves: on each line, we have the offset at which the object can be found (10 digits), followed by the generation number (5 digits) and the letter n for objects that are in use or f for objects that are free.

There are a few important things to notice here. First of all, each set of data is conveniently laid out in a line of text, so that we can use the fgets() function to retrieve it.
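As a small illustration of the entry layout just described, here is a minimal sketch that splits one entry into its three fields. The parse_xref_entry() function is a hypothetical helper written for this example; it is not part of the article’s library.

```php
<?php
// Splits one cross-reference entry into its three fields.
// parse_xref_entry() is a hypothetical helper, not part of the article's code.
function parse_xref_entry($line)
{
    $line = rtrim($line);              // drop the end-of-line marker and padding

    // 10 digits, a space, 5 digits, a space, then "n" or "f"
    if (!preg_match('/^(\d{10}) (\d{5}) ([nf])$/', $line, $m)) {
        return false;
    }

    return array(
        'offset'     => (int) $m[1],   // leading zeros are dropped by the cast
        'generation' => (int) $m[2],
        'in_use'     => ($m[3] === 'n'),
    );
}

$entry = parse_xref_entry("0000005632 00000 n\r\n");
$free  = parse_xref_entry("0000000000 65535 f \n");
```

Here $entry describes an object in use at offset 5632, while $free describes the free entry that opens the table.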
However, you should keep in mind that the end-of-line marker used in the cross-reference table (but not necessarily elsewhere in the file) may well differ from the one used by the platform your script is running on. Every entry is exactly 20 bytes long and ends with a two-character sequence: a space followed by a carriage return, a space followed by a line feed, or a carriage return and line feed pair. Therefore, you must instruct the PHP interpreter to recognize these line endings, regardless of the platform.

This can be accomplished by turning on the auto_detect_line_endings INI directive (which became available as of PHP 4.3.0). We can do this directly from the code by first reading the current value, turning the directive on for the duration of our file operations and then restoring it to its original value. This sequence of operations is important, because it is possible that other portions of our script may depend on the directive being in a different state than the one we need it in.

Another gotcha when reading the cross-reference table is that there may be more than one block of entries; that is, once you’ve read all the entries, you could find another header followed by a new set of entries, or you could find the trailer dictionary. If we didn’t check for this possibility and simply assumed that the cross-reference table is always followed by a trailer, our code would be unable to read most documents that have been modified after their creation, since that’s the situation in which partial cross-reference tables are most likely to be found.

Listing 1 (findxref.php)

<?php

/*
 * Returns the offset of the most recent
 * cross-reference table in the file
 */

function pdf_find_xref ($f)
{
    // First, seek to the end of the file,
    // allowing for 50 bytes just so that
    // we have enough data to look into.

    fseek ($f, -50, SEEK_END);

    // Next, try to find the proper sequence
    // of data. Note that the information can be
    // separated by a Windows-style, Mac-style
    // or Unix-style newline. The newline after
    // %%EOF is optional, since some files end
    // right after the marker.

    $data = fread ($f, 50);

    if (!preg_match ('/startxref(?:\r\n|\r|\n)(\d+)(?:\r\n|\r|\n)%%EOF(?:\r\n|\r|\n)?$/', $data, $matches)) {
        die ("Unable to find pointer to xref table");
    }

    // If we get here, then we have the offset
    // where the most recently introduced xref
    // table is.

    return (int) $matches[1];
}

?>

As you can see in Listing 2 (readxref.php), the pdf_read_xref() function is a bit long, but otherwise quite simple. It is written to take full advantage of the fact that the cross-reference table is formatted using a very stylized layout, so that we can rely on the fastest and most convenient string functions provided by PHP.

The only aspect of this function that we have not explored is the segment of code near its end, which runs once all the entries of a section have been read. This is where our code reads the trailer dictionary; as you can see, it makes use of a few elements that I have not yet introduced (the pdf_context class and the pdf_read_value() function). However, if you leave the mechanics of how the information is retrieved aside for a moment, you’ll notice that the trailer dictionary ends up in an associative array.

If you remember from last month’s article, files that have been modified usually contain more than one cross-reference table; this is indicated by the presence of a /Prev key/value pair in the trailer, with a pointer to the beginning of the previous table. If this entry is present, the function simply recurses into itself until all the cross-reference tables present in the file are read. Note that any information in the older tables and trailers is not allowed to overwrite the data contained in the newer ones: for the tables, by the simple stratagem of checking that an entry is not already set; for the trailers, by merging the arrays in a particular order.
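The “newest data wins” rule can be sketched in isolation. The code below is an illustration under simplifying assumptions, not the article’s implementation: merge_xref_sections() is a hypothetical helper, the offsets are made up, and the trailers are reduced to plain key/value pairs.

```php
<?php
// merge_xref_sections() is a hypothetical helper. Each section maps
// object number => generation => offset; sections are listed newest first,
// which is the order in which following /Prev pointers would visit them.
function merge_xref_sections($sections)
{
    $xref = array();

    foreach ($sections as $section) {
        foreach ($section as $object => $generations) {
            foreach ($generations as $generation => $offset) {
                // Keep the first value seen: it comes from the newest table
                if (!isset($xref[$object][$generation])) {
                    $xref[$object][$generation] = $offset;
                }
            }
        }
    }

    return $xref;
}

// Made-up offsets: object 3 was rewritten by the update, object 4 was not.
$newest = array(3 => array(0 => 60210));
$oldest = array(3 => array(0 => 5659), 4 => array(0 => 6483));

$xref = merge_xref_sections(array($newest, $oldest));

// Trailers work the other way round with array_merge(): later arguments
// overwrite earlier ones for string keys, so the newest dictionary goes last.
$trailer = array_merge(
    array('/Size' => 22, '/Root' => '12 0 R'),   // older trailer
    array('/Size' => 25, '/Prev' => 53593)       // newer trailer
);
```

Object 3 ends up pointing at the rewritten copy, object 4 keeps its original offset, and the merged trailer carries the newer /Size together with the /Root entry that only the older trailer contained.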
Listing 2 (readxref.php)

<?php

/*
 * Reads a cross-reference table
 *
 * If $offset is provided and $start and $end are
 * set to null, the function will seek to $offset and
 * read the header of the xref table from there.
 * Otherwise, it will read the entries starting from
 * the current position in the file.
 * If more than one part of the xref table is present,
 * the function will recurse into itself as many times
 * as needed.
 */

function pdf_read_xref ($f, &$result, $offset, $start = null, $end = null)
{
    // Make sure that PHP keeps track of
    // the line endings properly: save the
    // current setting, then turn detection on

    $old_ini = ini_get ('auto_detect_line_endings');
    ini_set ('auto_detect_line_endings', true);

    // If we didn't get a start and end, we need
    // to get them from the document itself.

    if (is_null ($start) || is_null ($end)) {

        // Move to the start of the table

        fseek ($f, $offset);

        // Get a line of text from the file

        $data = trim (fgets ($f));

        // Make sure the xref marker is where we
        // expect it.

        if ($data !== 'xref') {
            die ("Unable to find xref table");
        }

        // Now get the next line and split
        // it across a single space character

        $data = explode (' ', trim (fgets ($f)));

        // Make sure the format is what we expected

        if (count ($data) != 2) {
            die ("Unexpected header in xref table");
        }

        // Calculate the start and end object
        // in the xref table

        $start = (int) $data[0];
        $end = $start + (int) $data[1];
    }

    if (!isset ($result['xref_location'])) {
        $result['xref_location'] = $offset;
    }

    if (!isset ($result['max_object']) || $end > $result['max_object']) {
        $result['max_object'] = $end;
    }

    // Now cycle through each object
    // pointer

    for (; $start < $end; $start++) {

        // Get a line of text from the
        // file and extract the proper
        // information out of there

        $data = trim (fgets ($f));

        $offset = substr ($data, 0, 10);
        $generation = substr ($data, 11, 5);

        // Newer tables are read first, so an
        // entry that is already set must not
        // be overwritten by an older one

        if (!isset ($result['xref'][$start][(int) $generation])) {
            $result['xref'][$start][(int) $generation] = (int) $offset;
        }
    }

    // Get the next line, which could either be the beginning
    // of the trailer dictionary or the header of another
    // xref section

    $data = trim (fgets ($f));

    if ($data === 'trailer') {

        // Read trailer dictionary. pdf_read_value()
        // returns an array whose element 0 is the
        // type and whose element 1 holds the entries

        $c = new pdf_context ($f);
        $trailer = pdf_read_value ($c);

        // Check whether there is a /Prev
        // entry, which indicates that there
        // is another xref table from before

        if (isset ($trailer[1]['/Prev'])) {
            pdf_read_xref ($f, $result, $trailer[1]['/Prev'][1]);

            // Merge the entries so that the newer
            // trailer overwrites the older one

            $result['trailer'][1] = array_merge ($result['trailer'][1], $trailer[1]);
        } else {
            $result['trailer'] = $trailer;
        }

    } else {

        // We have another xref segment
        // to read. Extract the start
        // and length, and recurse into
        // this function

        $data = explode (' ', $data);
        pdf_read_xref ($f, $result, null, $data[0], $data[0] + $data[1]);

    }

    // Restore the INI directive to its original value

    ini_set ('auto_detect_line_endings', $old_ini);
}

?>

Writing a PDF Lexer

Now that we know where the objects are (the cross-reference table gives us the location of every object in the file), it’s time to try and read them. We could, in theory, write a series of ad-hoc functions that try to read from the file and interpret its contents, but things are much easier if we, instead, make use of that wonderful computer science concept known as the lexer (also known as a tokenizer).

Our lexer will take the input from the PDF file and split it into individual tokens according to a particular set of rules. For example, if we were writing a lexer for reducing the contents of this article to a series of words (with every grammatical element representing a token), we would establish that a token is either a set of characters or a punctuation mark, assuming that whitespace and paragraph markers are of no importance to us.

Identifying tokens in a PDF file is quite simple in theory, although in practical terms you have to watch out for a few potential pitfalls. First the basics: the simplest form of delimiter is whitespace, which has no semantic value (meaning that it is used only for the purpose of delimiting tokens). Whitespace is composed of space characters, tabs, carriage returns and line feeds (the PDF specification also counts form feeds and null characters as whitespace).

This would be enough to cover most situations, but in some cases you’ll find that tokens are not delimited by whitespace at all. When some applications (including some of Adobe’s own) “optimize” a PDF file to reduce its size as much as possible, they remove whitespace characters where the distinction between two tokens is made obvious in another way. For example, consider the following snippet of PDF code that shows the beginning of a dictionary:

<< /Entry (Value) >>

The whitespace between << and /Entry is made unnecessary by the fact that the two tokens are made up of two completely different classes of characters. Since << can only appear outside of a literal string to indicate the beginning of a dictionary, the lexer should stop at the second open angular bracket and delimit a token before the next character, whatever that is.
Therefore, the snippet above could be rewritten as follows:

<</Entry (Value)>>

Clearly, whitespace isn’t enough to delimit a token; we must also keep in mind all the other character classes that can be used for the same purpose.

Listing 3 (tokenizer.php) shows our lexer, the pdf_read_token() function, which looks a lot more complicated than it really is. This file also contains the pdf_context class that we mentioned earlier, which the tokenizer also makes use of. The pdf_context class is used to create a wrapper around a file pointer that makes it possible to:

• Create a memory-based buffer for the file’s contents
• Keep track of the current offset in the buffered data and of the length of the buffer
• Maintain a stack of tokens that have been read from the file but not yet used

The necessity of creating a buffer here arises from the fact that we don’t want our tokenizer to read one single character at a time out of the file. By reading a fixed amount at a time and then accessing the data directly in memory, we can save ourselves a few expensive function calls. The token stack is actually used by the portion of the system that is responsible for interpreting the meaning of the tokens; more about that later.

Note that there is no compelling reason to store this information in a class, other than the convenience of the syntax that PHP offers for working with objects. You could just as easily store everything in an array and avoid OOP altogether, although, in my opinion, that would significantly complicate your code and make it easier to introduce bugs that would be tough to find and fix.

Going back to the pdf_read_token() function for a moment, you can see that it works in a very simple way: first, it skips any whitespace that is at the current offset in the file buffer. Next, it tries to determine the type of token that it is dealing with by looking at the first character.
The procedure used to then find the end of the token varies depending on the character class it belongs to: for array and literal string delimiters, a single character is all we need, whereas for hex string and dictionary delimiters we need to check one more character, since they both share the same initial angular bracket. For all the other types of tokens, we simply scan the buffer until we end up in a different character class.

Parsing the Data

Next on the list, we need to be able to understand the meaning of each token in the context of the PDF file, and this is the job of another great computer science construct: the parser.

Parsers can be very complicated, and are usually not coded by hand; in most cases, a developer would use a “parser generator” like YACC or Bison. These reduce the parser to a relatively complex finite-state machine that is flexible enough to accommodate certain types of languages. In our case, however, the parsing of a PDF file is simple enough that the entire process can be coded in just about 150 lines’ worth of PHP.

Before introducing another listing, however, let’s consider the types of data that we need to deal with. For the most part, they are simple to handle: for direct values, for example, we read as many tokens as we need from the file and store them in the appropriate data structures. In two cases, however, we need to make a distinction: strings and indirect objects.
Listing 3 (tokenizer.php)

<?php

/*
 * This class is used to
 * read data from the input
 * file in a buffered way
 * and to store unused tokens
 */

class pdf_context
{
    var $file;
    var $buffer;
    var $offset;
    var $length;

    var $stack;

    // Constructor

    function pdf_context ($f)
    {
        $this->file = $f;
        $this->reset();
    }

    // Optionally move the file
    // pointer to a new location
    // and reset the buffered data

    function reset ($pos = null)
    {
        if (!is_null ($pos)) {
            fseek ($this->file, $pos);
        }

        $this->buffer = fread ($this->file, 100);
        $this->offset = 0;
        $this->length = strlen ($this->buffer);
        $this->stack = array();
    }

    // Make sure that there is at least one
    // character beyond the current offset in
    // the buffer to prevent the tokenizer
    // from attempting to access data that does
    // not exist

    function ensure_content()
    {
        if ($this->offset >= $this->length - 1) {
            return $this->increase_length();
        } else {
            return true;
        }
    }

    // Forcefully read more data into the buffer

    function increase_length()
    {
        if (feof ($this->file)) {
            return false;
        } else {
            $this->buffer .= fread ($this->file, 100);
            $this->length = strlen ($this->buffer);
            return true;
        }
    }
}

/*
 * Reads a token from the file
 */

function pdf_read_token (&$c)
{
    // If there is a token available
    // on the stack, pop it out and
    // return it.

    if (count ($c->stack)) {
        return array_pop ($c->stack);
    }

    // Strip away any whitespace (space, tab,
    // CR, LF, form feed and null characters)

    do {
        if (!$c->ensure_content()) {
            return false;
        }
        $c->offset += strspn ($c->buffer, " \t\r\n\x0C\0", $c->offset);
    } while ($c->offset >= $c->length - 1);

    // Get the first character in the stream

    $char = $c->buffer[$c->offset++];

    switch ($char) {

        case '[' :
        case ']' :
        case '(' :
        case ')' :

            // This is either an array or literal string
            // delimiter. Return it

            return $char;

        case '<' :
        case '>' :

            // This could either be a hex string or
            // dictionary delimiter. Determine the
            // appropriate case and return the token

            if ($c->buffer[$c->offset] == $char) {
                if (!$c->ensure_content()) {
                    return false;
                }
                $c->offset++;
                return $char . $char;
            } else {
                return $char;
            }

        default :

            // This is "another" type of token (probably
            // a dictionary entry or a numeric value)
            // Find the end and return it.

            if (!$c->ensure_content()) {
                return false;
            }

            while (1) {

                // Determine the length of the token

                $pos = strcspn ($c->buffer, " \t\r\n\x0C\0[]<>()/", $c->offset);

                if ($c->offset + $pos < $c->length - 1) {
                    break;
                } else {
                    // If the script reaches this point,
                    // the token may span beyond the end
                    // of the current buffer. Therefore,
                    // we increase the size of the buffer
                    // and try again, just to be safe.
                    // If there is nothing left to read,
                    // the token ends with the file.

                    if (!$c->increase_length()) {
                        break;
                    }
                }
            }

            $result = substr ($c->buffer, $c->offset - 1, $pos + 1);

            $c->offset += $pos;
            return $result;
    }
}

?>

The problem with strings (and, particularly, with literal strings) is that they change the rules that our lexer has to follow in order to find the end of the token, because a closed parenthesis could be escaped by a backslash and, therefore, its presence alone does not indicate the end of the string.
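Because a backslash can itself be escaped, whether a given parenthesis really closes the string depends on how many backslashes sit directly in front of it. The following is a minimal sketch of that check; find_string_end() is a hypothetical helper, and it deliberately ignores any other rule that the PDF format may apply to parentheses inside strings.

```php
<?php
// find_string_end() is a hypothetical helper. $data holds the text that
// follows an opening parenthesis; the return value is the position of the
// parenthesis that closes the string, or false if there is none.
// Assumption: nested, unescaped parentheses are not considered here.
function find_string_end($data)
{
    $pos = 0;

    while (($pos = strpos($data, ')', $pos)) !== false) {

        // Count the backslashes immediately before the parenthesis
        $slashes = 0;
        for ($i = $pos - 1; $i >= 0 && $data[$i] === '\\'; $i--) {
            $slashes++;
        }

        // An even count (zero included) means the parenthesis is not escaped
        if ($slashes % 2 == 0) {
            return $pos;
        }

        $pos++;
    }

    return false;
}

$plain   = find_string_end('Value) >>');    // no backslash at all
$escaped = find_string_end('a\\)b) x');     // one backslash: skip the first ")"
$literal = find_string_end('a\\\\) x');     // two backslashes: ")" is the end
```

Keep in mind that, inside single quotes, PHP turns each pair of backslashes in the source into a single backslash, so the second sample contains one backslash and the third contains two.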
In a “traditional” lexer, this problem is taken care of by switching the machine to a new context in which a different set of rules apply. We could, in fact, do the very same thing to our lexer by creating a special case in the switch statement that is part of pdf_read_token() in Listing 3 and writing some additional code that looks for a parenthesis preceded by an even number of backslashes (zero included). Why an even number? Because the backslashes themselves can be escaped by prefixing them with another backslash. Therefore, an even number of backslashes means that they are all escaped and should be interpreted as literal characters, so that the last one does not escape the parenthesis, which becomes the string delimiter. The last in an odd number of backslashes right before a parenthesis becomes an “orphan” and escapes the parenthesis, thus preventing it from terminating the string.

Given that we only have a limited amount of space and I really wanted to keep things as simple as possible, however, I chose to implement the string parsing functionality inside the parser itself. When an open parenthesis token is returned by the tokenizer, the code simply keeps scanning the input until it finds an unescaped closed parenthesis.

The other problematic data elements are, as I mentioned above, indirect objects. Both object declarations and references are made up of three tokens. Therefore, once our parser encounters a numeric value, it won’t be able to tell whether it is part of a larger element until it has read at least one more token, and potentially two. The problem here is not with reading the tokens; it’s with what to do with them if, by any chance, the numeric value turns out to be… just a numeric value.
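The ambiguity is easy to see when short runs of tokens are classified by hand. In this sketch, classify_number() is a hypothetical helper that works on a ready-made array of tokens rather than on a file; it reports what the leading number turned out to be and how many tokens belong to it.

```php
<?php
// classify_number() is a hypothetical helper that looks at the start of a
// list of tokens whose first element is numeric. It returns the kind of
// element found and the number of tokens that belong to it.
function classify_number($tokens)
{
    if (isset($tokens[1], $tokens[2]) && is_numeric($tokens[1])) {
        if ($tokens[2] === 'obj') {
            return array('declaration', 3);
        }
        if ($tokens[2] === 'R') {
            return array('reference', 3);
        }
    }

    // Anything else: the first token stands alone, and whatever was read
    // after it to find that out still has to be dealt with
    return array('number', 1);
}

$a = classify_number(array('12', '0', 'R'));        // reference to object 12
$b = classify_number(array('12', '0', 'obj'));      // declaration of object 12
$c = classify_number(array('22', '/Root', '12'));   // a plain 22
$d = classify_number(array('595', '842', ']'));     // two numbers in an array
```

In the last two cases only one token is consumed, even though up to three had to be examined to reach that conclusion.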
We could, in theory, put the extra tokens “back in the buffer” by rolling back the offset pointer to the beginning of the second token, but that would be difficult to do, since we don’t really know how many whitespace characters were between the tokens to start with.

Therefore, we use a completely different approach: unused tokens are stored in a stack, which is part of the file context. When a new token is requested, pdf_read_token() checks whether anything is present in the stack and, if something is in there, it pops it out and returns it, without even reading one character from the file buffer.

You can see the end result of all our tribulations in Listing 4 (readvalue.php), which contains the pdf_read_value() function. You will also notice a number of constant definitions that look suspiciously like data types, and they are. Since we’ll be reading and writing data back and forth, we’ll need to keep track of the object types as we read them from the stream. To do so, each object is encapsulated in an array whose zeroth element indicates the type, while element 1 contains the actual value, which varies depending on the nature of the data (object references use elements 1 and 2 for the object ID and the generation number). Thus, for example, the trailer dictionary could look like this:

array (
    PDF_TYPE_DICTIONARY,
    array (
        '/Size' => array (PDF_TYPE_NUMERIC, 22),
        '/Root' => array (PDF_TYPE_OBJREF, 12, 0),
        '/Prev' => array (PDF_TYPE_NUMERIC, 54655)
    )
);

Not unlike some of its predecessors, pdf_read_value() looks a lot scarier than it actually is. The code is quite heavily commented, so I will limit myself to noting once more that each value is stored together with its type. This makes identifying the data type of a value practically immediate, which will turn out to be very important later on when we’ll need to write objects back to the file.

Before moving on to the next step, note that we make no provision in our lexer for reading stream data.
This is because we are not intent on interpreting everything that is stored in a PDF file, but only those elements that allow us to modify its contents. However, adding support for streams shouldn’t be too much of a problem: all you need is the ability to resolve object references, which we’ll add shortly, since the length of a stream is often expressed in that way.

Listing 4 (readvalue.php)

<?php

// Define various data types
// that we use throughout the system

define ('PDF_TYPE_NULL', 0);
define ('PDF_TYPE_NUMERIC', 1);
define ('PDF_TYPE_TOKEN', 2);
define ('PDF_TYPE_HEX', 3);
define ('PDF_TYPE_STRING', 4);
define ('PDF_TYPE_DICTIONARY', 5);
define ('PDF_TYPE_ARRAY', 6);
define ('PDF_TYPE_OBJDEC', 7);
define ('PDF_TYPE_OBJREF', 8);
define ('PDF_TYPE_OBJECT', 9);
define ('PDF_TYPE_STREAM', 10);

/*
 * Reads a value from the current
 * data stream
 */

function pdf_read_value (&$c, $token = null)
{
    // Get a token from the stream.

    if (is_null ($token)) {
        $token = pdf_read_token ($c);
    }

    if ($token === false) {
        return false;
    }

    switch ($token) {

        case '<' :

            // This is a hex string.
            // Read the value, then the terminator

            $s = pdf_read_token ($c);

            if ($s === false) {
                return false;
            }

            $term = pdf_read_token ($c);

            if ($term !== '>') {
                die ("Unexpected data after hex string");
            }

            return array (PDF_TYPE_HEX, $s);

        case '<<' :

            // This is a dictionary.

            $result = array();

            // Recurse into this function until we reach
            // the end of the dictionary.

            while (($key = pdf_read_token ($c)) !== '>>') {
                if ($key === false) {
                    return false;
                }

                if (($value = pdf_read_value ($c)) === false) {
                    return false;
                }

                $result[$key] = $value;
            }

            return array (PDF_TYPE_DICTIONARY, $result);

        case '[' :

            // This is an array.

            $result = array();

            // Recurse into this function until we reach
            // the end of the array.

            while (($token = pdf_read_token ($c)) !== ']') {
                if ($token === false) {
                    return false;
                }

                if (($value = pdf_read_value ($c, $token)) === false) {
                    return false;
                }

                $result[] = $value;
            }

            return array (PDF_TYPE_ARRAY, $result);

        case '(' :

            // This is a string

            $pos = $c->offset;

            while (1) {

                // Start by finding the next closed
                // parenthesis

                $pos = strpos ($c->buffer, ')', $pos);

                // If you can't find it, try
                // reading more data from the stream
                // and resume the search from where
                // the buffer used to end. Note that
                // strpos() returns false, not -1,
                // when nothing is found.

                if ($pos === false) {
                    $pos = $c->length;

                    if (!$c->increase_length()) {
                        return false;
                    }

                    continue;
                }

                // Make sure that there is no backslash before the parenthesis. If there is,
                // move on. Otherwise, return the string. Note that the value returned
                // includes the closing parenthesis.

                if ($c->buffer[$pos - 1] !== '\\') {
                    $result = substr ($c->buffer, $c->offset, $pos - $c->offset + 1);
                    $c->offset = $pos + 1;
                    return array (PDF_TYPE_STRING, $result);
                } else {
                    $pos++;
                }
            }

        default :

            if (is_numeric ($token)) {

                // A numeric token. Make sure that it is not part of something else.

                if (($tok2 = pdf_read_token ($c)) !== false) {
                    if (is_numeric ($tok2)) {

                        // Two numeric tokens in a row. In this case, we're probably in
                        // front of either an object reference or an object specification.
                        // Determine the case and return the data

                        if (($tok3 = pdf_read_token ($c)) !== false) {
                            switch ($tok3) {

                                case 'obj' :

                                    return array (PDF_TYPE_OBJDEC, (int) $token, (int) $tok2);

                                case 'R' :

                                    return array (PDF_TYPE_OBJREF, (int) $token, (int) $tok2);
                            }

(Listing 4 continues on page 41.)

Getting to the Root of the Problem

All the pieces are finally in place: we should now be able to read through the PDF file and interpret its contents, at least to the extent that we need in order to be able to append data to it. In order to demonstrate how the PDF functionality that we have built works, our goal is to open a PDF file and add a textual element to its first page.

Listing 5 (index.php) is our main script and, unfortunately, it’s too large to show here; you will, however, find it in the code associated with this article, so you will hopefully be able to follow me there.

Once we have declared a few variables that we’ll end up using throughout the script, we read the cross-reference table from the file, then immediately attempt to retrieve the Root object from it. Because the /Root entry inside the file trailer has to be an indirect object reference, we must find a way to retrieve the actual object data, as the reference itself won’t help us much. This is accomplished by the pdf_resolve_object() function, which you can see in Listing 6 as part of the objects.php include file.
The function can actually be used to determine whether any object is an indirect reference and resolve it to the actual object data, something that will come in handy at pretty much every step of the way.

As you can see, pdf_resolve_object() first checks to see if the value it has been passed is an indirect object reference. If it isn’t, the function has really nothing to do, other than returning right away. If, on the other hand, it did receive an indirect reference, it uses the cross-reference table to determine its position and starts reading it. The $encapsulate parameter determines how the object is returned to the caller; if it is set to true, pdf_resolve_object() stores the object ID and generation number in the array, so that effectively the object’s data is encapsulated inside another object of type PDF_TYPE_OBJECT. Otherwise, the direct value is returned, and all information regarding the object’s ID and generation number is lost. Both types of return values have their uses: if you want to retrieve an object with the intention of modifying it, you will probably want it encapsulated, so that you can later rewrite it back to the stream. If, on the other hand, you’re just trying to retrieve a value, as you would, for example, if you were reading a stream object and you wanted to determine its length, the non-encapsulated version will be easier to handle.

Speaking of retrieving streams, even though my code doesn’t perform that function (since I’m not writing a PDF reader), if you intend to add it, the pdf_resolve_object() function is safe to use because it saves the file pointer’s current position before reading the object and restores it afterwards.
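The save-and-restore pattern just described can be shown in isolation. What follows is only a minimal sketch: the helper name with_saved_position() is hypothetical, and an in-memory stream stands in for a real PDF file.

```php
<?php
// Hypothetical helper, not part of the article's listings: runs $reader
// at $offset, then puts the file pointer back where the caller left it.
function with_saved_position($fp, int $offset, callable $reader)
{
    $old_pos = ftell($fp);      // remember where the caller was
    fseek($fp, $offset);        // jump to the other object
    $result = $reader($fp);     // read whatever we came for
    fseek($fp, $old_pos);       // restore the caller's position

    return $result;
}

// An in-memory stream standing in for a PDF file (an assumption made
// for the demo): some "stream data", with the value 42 at offset 17.
$fp = fopen('php://memory', 'w+');
fwrite($fp, "stream-data-here|42");
fseek($fp, 7);                  // pretend we are in the middle of a stream

$length = with_saved_position($fp, 17, function ($fp) {
    return (int) fread($fp, 2); // the two digits stored at offset 17
});

$position_after = ftell($fp);   // same position as before the call
```

The caller never has to know that the pointer moved, which is what makes it safe to resolve a reference halfway through reading something else.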
If the function didn’t do so and you were reading a stream, resolving the /Length parameter could result in the file pointer being moved to a different location in the file, and you would be unable to read the rest of the stream.

Let’s go back to index.php. With the root object firmly in hand, we can now compile a list of all the pages contained in the document. To do so, we feed the /Pages element of the root dictionary to the pdf_read_pages() function, which you can see in Listing 7 (readpages.php). The reason why we have a separate function just to read through the /Pages element of the root object is that, as I mentioned in last month’s article, the pages could be nested in an arbitrary combination of /Page and /Pages dictionaries, so that we may need to recurse into the function several times in order to end up with an array that contains only page elements. It is important to understand that the structure of the page tree has nothing to do with the logical structure of the document: the order of the leaves gives the order of the pages, but intermediate /Pages nodes do not represent chapters or sections, and the page numbers shown to the user are not necessarily the positions in the list. The PDF specification provides a different set of facilities for determining how pages are numbered, but, technically speaking, you should only be interested in that if you want to display the contents of a document. In practical terms, I have never found an occasion in which the two didn’t line up; at most, there might be a fixed discrepancy because the document is an excerpt that starts from, say, page 25.

In our sample script, we only take into consideration page 1 (which is the zeroth element of the resulting pages array). We then use the pdf_find_resources() function, shown in Listing 8 (page.php), to retrieve the resources associated with the page.
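The recursion over nested /Pages nodes is easier to see on a toy structure. The sketch below is an illustration only: it assumes the tree has already been resolved into plain nested arrays, with no indirect references and none of the type tags used by the listings, and flatten_pages() and the label key are hypothetical names.

```php
<?php
// Simplified stand-in for a resolved page tree (an assumption made for
// illustration): every node is a plain array with a /Type entry, and
// intermediate nodes carry their children in /Kids.
function flatten_pages(array $node, array &$result): void
{
    foreach ($node['/Kids'] as $kid) {
        if ($kid['/Type'] === '/Pages') {
            flatten_pages($kid, $result);   // intermediate node: recurse
        } else {
            $result[] = $kid;               // leaf: an actual page
        }
    }
}

$tree = [
    '/Type' => '/Pages',
    '/Kids' => [
        ['/Type' => '/Page', 'label' => 'A'],
        ['/Type' => '/Pages', '/Kids' => [
            ['/Type' => '/Page', 'label' => 'B'],
            ['/Type' => '/Page', 'label' => 'C'],
        ]],
        ['/Type' => '/Page', 'label' => 'D'],
    ],
];

$pages = [];
flatten_pages($tree, $pages);

// The leaves come out in the order in which the tree lists them.
$labels = array_column($pages, 'label');
```

However deep the nesting goes, the caller ends up with a flat array that contains only page elements.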
For resources, too, we need a dedicated function because, as you may remember, the resource dictionary is an inheritable resource, so that if there isn’t one associated with the page itself, there may be one associated with its parent, or with its parent’s parent, and so on. In the case of the sample file that we were looking at last month, the same resource object is actually associated explicitly with every page and with their parent (the /Pages dictionary). This is rather redundant, and makes for poor optimization, but it is perfectly acceptable (for the record, the PDF was created on Linux by exporting an OpenOffice.org 1.1 file).

May 2004 ● PHP Architect ● www.phparch.com
FEATURE: In the Belly of the Beast

Listing 4 (continued)

                                // If we get to this point, that numeric value up
                                // there was just a numeric value. Push the extra
                                // tokens back into the stack and return the value.

                                array_push ($c->stack, $tok3);
                            }
                        }

                        array_push ($c->stack, $tok2);
                    }

                    return array (PDF_TYPE_NUMERIC, $token);
                } else {
                    // Just a token. Return it.

                    return array (PDF_TYPE_TOKEN, $token);
                }

        }
    }

    ?>

Next, we need to find the font resources, so that we can append our own to the existing ones. Finding the font resource dictionary is actually a lot simpler than any of its predecessors so far, since it’s either there, in which case we piggyback on it, or it isn’t, in which case we create our own and add it to the resources associated with the page. The only difficulty here is in finding a name for the font resource that doesn’t conflict with one that already exists. The approach that I have taken is to simply run through all the resources available and look at those called /Fx, where x is a numerical value.
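A minimal sketch of that scan follows. It assumes the names in the font dictionary are available as a plain list of strings; next_font_name() is a hypothetical helper rather than a function from the sample code. It looks only at names made of /F followed by digits and returns the name one past the highest number it finds.

```php
<?php
// Hypothetical helper: given the names used in a font resource
// dictionary, return an unused name of the form /F<n>.
function next_font_name(array $font_names): string
{
    $highest = 0;

    foreach ($font_names as $name) {
        // Only names made of /F followed by digits take part in the count.
        if (preg_match('/^\/F(\d+)$/', $name, $m)) {
            $highest = max($highest, (int) $m[1]);
        }
    }

    return '/F' . ($highest + 1);
}

$name  = next_font_name(['/F1', '/F10', '/Helv', '/F2']);
$first = next_font_name([]);    // nothing in use yet
```

Names that do not follow the pattern, such as /Helv above, are simply skipped, so they can never collide with the one we generate.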
The font resource we create is the next highest available; for example, if the highest font resource currently used is /F10, ours will be /F11. Note that this choice is entirely arbitrary: you can choose whatever combination you like, as long as it starts with a letter and not a digit. The font resource that we create and add to the font dictionary is the simplest possible one: it uses the Helvetica font, which must be supported by every PDF reader and, therefore, doesn’t need to be embedded in the document itself.

Listing 6

    <?php

    /*
     * Resolves an object reference,
     * ensuring that the result value
     * is always a direct object
     */

    function pdf_resolve_object (&$c, $obj_spec, $encapsulate = true)
    {
        global $xref_data;

        // Exit if we get invalid data

        if (!is_array ($obj_spec)) {
            return false;
        }

        if ($obj_spec[0] == PDF_TYPE_OBJREF) {

            // This is a reference, resolve it

            if (isset ($xref_data['xref'][$obj_spec[1]][$obj_spec[2]])) {

                // Save current file position
                // This is needed if you want to resolve
                // references while you're reading another object
                // (e.g.: if you need to determine the length
                // of a stream)

                $old_pos = ftell ($c->file);

                // Reposition the file pointer and
                // load the object header.

                $c->reset ($xref_data['xref'][$obj_spec[1]][$obj_spec[2]]);
                $header = pdf_read_value ($c);

                if ($header[0] != PDF_TYPE_OBJDEC ||
                    $header[1] != $obj_spec[1] ||
                    $header[2] != $obj_spec[2]) {
                    die ("Unable to find object ({$obj_spec[1]}, {$obj_spec[2]}) at expected location");
                }

                // If we're being asked to store all the information
                // about the object, we add the object ID and generation
                // number for later use

                if ($encapsulate) {
                    $result = array (
                        PDF_TYPE_OBJECT,
                        'obj' => $obj_spec[1],
                        'gen' => $obj_spec[2]
                    );
                } else {
                    $result = array();
                }

                // Now simply read the object data until
                // we encounter an end-of-object marker

                while(1) {
                    $value = pdf_read_value ($c);

                    if ($value === false) {
                        return false;
                    }

                    if ($value[0] == PDF_TYPE_TOKEN && $value[1] === 'endobj') {
                        break;
                    }

                    $result[] = $value;
                }

                $c->reset ($old_pos);

                return $result;
            }
        } else {
            return $obj_spec;
        }
    }

    /*
     * Generates a new object container
     * with the proper object ID and
     * a generation number of zero
     */

    function pdf_new_object()
    {
        global $xref_data;

        return array (
            PDF_TYPE_OBJECT,
            'obj' => $xref_data['max_object']++,
            'gen' => 0
        );
    }

    ?>

Graffiti on the Wall

We’ve now come to the part where we actually need to “write” some text on the document. Unfortunately, this involves a few steps.

First, the concept of drawing pretty much anything on a page requires a series of commands that PDF borrows from PostScript. In order for the reader to recognize them, we’ll have to encapsulate them in a stream, and add that stream to the contents of the page.

When drawing text on the screen, a certain number of transformations can be applied to it: translation (so that you can move the text to the location of your choice), rotation and scaling. In our case, we will only deal with the first two. The transformations are applied using a simple matrix; unfortunately, we do not have enough space here to go at length about how the matrix works, but the PDF specification document does a pretty good job of that, so I’ll refer you to it.
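For readers who want a concrete starting point before opening the specification, here is a small sketch of the rotation part of the matrix. It relies on the PDF specification’s definition of a counter-clockwise rotation by an angle θ, whose four entries are cos θ, sin θ, -sin θ and cos θ; rotation_entries() is a hypothetical helper and the rounding to four decimals is an arbitrary choice.

```php
<?php
// Hypothetical helper: the four rotation entries of a text matrix for a
// counter-clockwise rotation of $degrees, rounded to four decimals.
function rotation_entries(float $degrees): array
{
    $r = deg2rad($degrees);

    return array_map(
        function ($v) {
            return round($v, 4) + 0.0;  // "+ 0.0" turns -0.0 into 0.0
        },
        [cos($r), sin($r), -sin($r), cos($r)]
    );
}

$none    = rotation_entries(0);     // identity: the text is not rotated
$quarter = rotation_entries(90);    // text runs from bottom to top
```

With no rotation the entries are 1, 0, 0 and 1, which is why those four numbers show up so often in front of a pair of coordinates.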
Instead, let us focus on the commands used to apply the transformation itself; here’s an example:

    Ma Mb Mc Md x y Tm

Looks cryptic, doesn’t it? The first four elements of the matrix (Ma, Mb, Mc and Md) are used, in our case, to express the rotation that should be applied to the text. They can also be used to determine the scale, but, as I mentioned, that is beyond the scope of this article. The x and y parameters, on the other hand, indicate the coordinates at which we want the text to appear. Finally, Tm is the command itself, which tells the PDF interpreter to apply these values to the text transformation matrix.

As you have probably noticed, the format of this function call is the exact opposite of what we are used to in PHP (where we use function (param1, param2, …)). This format is called “Reverse Polish Notation” and is often used in machines that use a stack to store their parameters, such as the PostScript virtual machine on which the PDF specification is based.

Listing 7

    <?php

    // Creates a list of all the pages
    // that are present in a document

    function pdf_read_pages (&$c, &$pages, &$result)
    {
        // Get the kids dictionary

        $kids = pdf_resolve_object ($c, $pages[1][1]['/Kids']);

        foreach ($kids[1] as $v) {
            $pg = pdf_resolve_object ($c, $v);

            // Test the resolved object, not the reference to it.
            // The /Type value is itself a (type, token) pair, as
            // returned by pdf_read_value().

            if ($pg[1][1]['/Type'][1] === '/Pages') {

                // If one of the kids is an embedded
                // /Pages array, resolve it as well.

                pdf_read_pages ($c, $pg, $result);
            } else {
                $result[] = $pg;
            }
        }
    }

    ?>

Listing 8

    <?php

    /*
     * Finds the resources associated with a page
     */

    function pdf_find_resources (&$c, $obj)
    {
        $obj = pdf_resolve_object($c, $obj);

        // If the current object has a resources
        // dictionary associated with it, we use
        // it. Otherwise, we move back to its
        // parent object.

        if (isset ($obj[1][1]['/Resources'])) {
            return pdf_resolve_object ($c, $obj[1][1]['/Resources']);
        } else {
            if (!isset ($obj[1][1]['/Parent'])) {
                return false;
            } else {
                // The stream context must be passed along as well
                return pdf_find_resources ($c, $obj[1][1]['/Parent']);
            }
        }
    }

    ?>

Next, we’ll select a font that will be used to draw the text:

    /F11 10 Tf

The Tf command sets the current font resource to /F11, with a size of 10 points. Note that the font size can be a floating-point value, so that you could have text in size 12.5.

Before writing the text itself, we need to set the spacing between one line of text and the next. This is not as easy to determine as you may think, because it depends on where the baseline of the font resides and how the font itself is designed. From a practical perspective, I find that half the size of the font is a good empirical default that works on most occasions. The TL command below sets the line spacing to five points:

    5 TL

Finally, we can actually draw the text! This is done by using a combination of two commands. The text is actually drawn using the ' command (no, that’s not a mistake: the command is a single quotation mark). However, if a newline character is present in the text, it is simply ignored. So, we simply replace every occurrence of a newline character with the T* command, which causes the drawing pointer to be reset to the next line.

Finally, all we need to do is update the page’s /Contents array with a reference to our stream. Once again, we need to determine if there already is an array and what it contains, and act accordingly, so that we can add our own data to it.

Writing it All Back

The final step before we can call it a day consists of actually writing our changes back to the file so that they can be applied to the document. To do so, we first of all open the output file, which was defined at the beginning of the main script (Listing 5).
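Before moving on to the output file, it may help to see the text commands from the previous section pulled together in one place. The sketch below is not the sample script’s code: build_text_stream() is a hypothetical helper, the BT and ET operators that open and close a text object come from the PDF specification rather than from the discussion above, and it emits one ' command per line of text (the specification defines ' as “move to the next line, then show the string”) instead of substituting T*.

```php
<?php
// Hypothetical helper, not taken from the article's listings: builds the
// body of a content stream that draws $text starting near ($x, $y).
function build_text_stream(string $text, string $font, float $size,
                           float $x, float $y, float $leading): string
{
    $ops = [
        'BT',                                   // begin text object
        sprintf('1 0 0 1 %s %s Tm', $x, $y),    // no rotation, move to (x, y)
        sprintf('%s %s Tf', $font, $size),      // select font and size
        sprintf('%s TL', $leading),             // distance between lines
    ];

    foreach (explode("\n", $text) as $line) {
        // Backslashes and parentheses must be escaped inside ( ) strings.
        $escaped = strtr($line, ['\\' => '\\\\', '(' => '\\(', ')' => '\\)']);
        $ops[] = "($escaped) '";                // next line, then show string
    }

    $ops[] = 'ET';                              // end text object

    return implode("\n", $ops) . "\n";
}

$stream = build_text_stream("Hello (PHP)\nWorld", '/F11', 10, 72, 720, 5);
```

The resulting string is what would go between the stream and endstream keywords of the new object that the page’s /Contents array points to.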
Next, we call the pdf_write_objects() function to rewrite the objects that we modified back to the file. If you take a look at Listing 9 (writer.php), you’ll notice that this function is, essentially, the reverse of pdf_read_value(), since it first builds an indirect object “shell” and then fills it with the appropriate value.

There are two things here that are worth mentioning. First, the information that we write back to the file is not a “true” delta: the resources dictionary may not change at all, but we write it back anyway. This is not very efficient, but it will do if you’re only making small changes to a document, and it beats having to build a system that “remembers” what was changed.

Also, you have probably noticed that, whenever a new object is written to the stream, pdf_write_objects() “makes a note” of the file pointer’s current position. This comes in handy afterwards, when we rebuild the cross-reference table by calling pdf_write_xref(). Here, we create the proper entries one at a time. This process could be optimized by grouping those entries belonging to objects with IDs in sequence, but, again, if you’re dealing only with small changes it’s hardly worth the trouble.

Once the cross-reference table is in the file, pdf_write_xref() terminates by writing the trailer dictionary. In our case, it contains a pointer to the root object, which has not changed but must be there nonetheless, as well as a numeric value that declares the number of objects stored in the file and a pointer to the previous cross-reference table.

Figure 1

Where to Go From Here

That’s it!
As you can see, once one figures out how things work it doesn’t take too long to actually open and modify the contents of a PDF file programmatically, which, of course, makes the impression, apparently shared by many people, that PDF is a non-modifiable format quite strange.

Although the end result of our sample script is relatively simple (if you run it against the sample file that I included in last month’s code, which is once again included in this month’s for your convenience, and open the resulting PDF file, you should see something like the output in Figure 1), the foundation on which it is built is quite solid and can be expanded upon to provide additional functionality.

Before parting ways, I just want to share one final tidbit of information with you. Working with PDF files can be very frustrating, particularly if you work on Windows, because the Acrobat PDF viewer is about as useful for debugging as testing whether your house’s electrical circuit is working by sticking your fingers in the power outlet. However, there are ways around that. First, you can actually get Acrobat to provide you with more useful error messages by pressing the Control key while clicking on the OK button in the error window that appears when you try to load a corrupted file. Second, you can use a free PDF decoder (such as the one available online at www.planetpdf.com/mainpage.asp?webpageid=3463) to visually inspect the contents of your file and determine what is wrong with it.

Sometimes, however, it will be hard to figure bugs out. While I was writing this article, I lost lots of time debugging a problem that turned out to be just a spelling mistake but that caused Acrobat to crash without any useful error. This brings us to the last tool you’ll need plenty of: patience!
About the Author

Marco is the Publisher of (and a frequent contributor to) php|architect. When not posting bogus bugs to the PHP website just for the heck of it, he can be found trying to hack his computer into submission. You can write to him at marcot@phparch.com.

To Discuss this article: http://forums.phparch.com/145

… value. We could, in theory, put the extra tokens “back in the buffer” by rolling back the offset pointer in the buffer to the beginning of the second token, …

… PDF document contains a dictionary at the beginning of the file that provides the necessary facilities for determining the location of the first page in …