Welcome to part two of our little trip down PDF lane. While last month we focused primarily on understanding the structure of a PDF document, this time we'll look at the problem of altering the contents of a PDF file from a more practical perspective.

The main thing to understand, before we move on to anything else, is that parsing a PDF file is a complex (but by no means complicated) endeavour, because the file is not only not intended for human consumption, but it also does not follow a top-down logic. In other words, as we discovered last month, when parsing a PDF file one doesn't start at the beginning and move down to the end of the file. In fact, the exact opposite is true.

Since we'll often find ourselves jumping to various, completely arbitrary positions in the document, the first decision that we need to make is how we're going to access the data. While it is tempting to just load the entire file in memory, that's usually not a good idea: a PDF can be of pretty much any size, so loading an entire document in memory can tie up large chunks of RAM and limit our server's ability to process a large number of requests.

Yet, seeking to arbitrary locations in a document is not always easy, or even possible. Imagine, for example, that you're accessing a PDF document via HTTP. In this case, you'd have to download the entire file before you could find out about any of its characteristics, since the offset of the cross-reference table appears at the end of the file.
Even in this case, I would recommend storing the document in a local file and then accessing the data through the filesystem.

The one notable exception to this rule is a special class of PDF documents known as "linearized PDF files". A linearized PDF document contains a dictionary at the beginning of the file that provides the necessary facilities for determining the location of the first page in the file without having to read through the cross-reference table first. The structure of linearized PDF files is beyond the scope of this article, but you can find out more about it directly from the PDF specification document published by Adobe.

FEATURE
In the Belly of the Beast: Interpreting and Manipulating PDF Files
by Marco Tabini
May 2004 ● PHP Architect ● www.phparch.com

In last month's issue, we examined the structure and contents of a PDF document in considerable detail. This month, we'll actually write a PHP library capable of opening one and modifying its contents.

REQUIREMENTS
PHP: 4.3.0+
OS: Any
Applications: A PDF Reader (for testing)
Code Directory: pdf

Getting Started

The first thing we need to do in order to interpret the contents of a PDF document is determine where the cross-reference table and trailer dictionary are. This is quite easy if you consider that the layout of the startxref pointer is always the same. For example, in my document it looks like the following:

    startxref
    53593
    %%EOF

Thus, all we need to do is move to the end of the file, back up a few bytes and then find this sequence of data. As you can see from Listing 1 (findxref.php), this is readily accomplished by using a simple regular expression. Note how the regex pattern ends with a dollar sign, indicating that the match must be anchored to the end of the data stream.
Even though we're only taking fifty characters from the end of the file, I have added the anchor to prevent the regex engine from picking up a previous cross-reference table pointer by mistake.

If you're wondering why the cross-reference table pointer is not saved to the document using a fixed-width format (say, 10 digits for the offset, like the cross-reference entries themselves), you're not alone. This decision is a bit of a mystery, but it's something that we have to live with.

By the way: throughout the remainder of the article, you'll notice that I have created an individual include file for each of the functions that we will be writing. This is clearly not good design practice, but it fulfills one important purpose: it keeps the listings in the article short and to the point. Thus, in the interest of clarity, I hope that you'll forgive me and that, if you decide to use any of the code in your own projects, you will not follow the same layout.

Reading the Cross-reference Table

Now that we know where to look for it, it's time to figure out how to read the cross-reference table itself. If we move to offset 53,593 of the file, we'll find the following:

    xref
    0 22
    0000000000 65535 f
    0000000017 00000 n
    0000005632 00000 n
    0000005659 00000 n
    0000006483 00000 n
    0000053169 00000 n
    0000006509 00000 n
    0000039936 00000 n

The word xref is followed, on the next line, by the number of the first object represented in the table (0 in this case) and the number of entries that follow (twenty-two); we'll call this the "header" of the table. Next come the entries themselves: on each line, we have the offset at which the object can be found (10 digits), followed by the generation number (5 digits) and the letter n for objects that are in use or f for objects that are free.

There are a few important things to notice here. First of all, each set of data is conveniently laid out in a line of text, so that we can use the fgets() function to retrieve it.
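To make the fixed-width layout of an entry concrete, here is a minimal sketch of how a single line could be split into its three fields. The function name pdf_parse_xref_entry() and the keys of the returned array are hypothetical and are not part of the article's code; the sketch only assumes the layout described above (10-digit offset, 5-digit generation number, then n or f).

```php
<?php

/*
 * Hypothetical helper (not part of the article's listings):
 * splits one cross-reference entry into its fields, or
 * returns false if the line does not look like an entry.
 */
function pdf_parse_xref_entry ($line)
{
    // 10 digits, a space, 5 digits, a space, then "n" or "f";
    // whatever end-of-line marker follows is ignored.
    if (!preg_match ('/^(\d{10}) (\d{5}) ([nf])\s*$/', $line, $m)) {
        return false;
    }

    return array (
        'offset'     => (int) $m[1],
        'generation' => (int) $m[2],
        'in_use'     => ($m[3] === 'n'),
    );
}
```

Listing 2 takes the cheaper route of calling substr() at fixed positions; a regular expression is used here only because it also validates the line.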
You should keep in mind, however, that the end-of-line marker of a cross-reference entry is always two bytes long (a space followed by a carriage return or a line feed, or a carriage return/line feed pair), while the rest of the file can use Windows, Mac or Unix newlines. The fgets() function must therefore be able to recognize all of these conventions, regardless of the platform your script is running on. This can be accomplished by turning on the auto_detect_line_endings INI directive (which became available as of PHP 4.3.0). We can do this directly from the code by first reading the current value, turning the directive on for the duration of our file operations and then restoring it to its original value. This sequence of operations is important, because other portions of our script may depend on the directive being in a different state than the one we need.

Another gotcha when reading the cross-reference table is that there may be more than one block of entries: once you've read all the entries, you could find another header followed by a new set of entries, or you could find the trailer dictionary. If we didn't check for this possibility and simply assumed that the cross-reference table is always followed by a trailer, our code would be unable to read most documents that have been modified after their creation, since that's the situation in which partial cross-reference tables are most likely to be found.

As you can see in Listing 2 (readxref.php), the pdf_read_xref() function is a bit long, but otherwise quite simple. It is written to take full advantage of the fact that the cross-reference table is formatted using a very stylized layout, so that we can use the fastest and most convenient string functions provided by PHP. The only aspect of this function that we have not explored is the little segment of code that starts at line 92 and ends at line 108.
This is where our code reads the trailer dictionary; as you can see, it makes use of a few elements that I have not yet introduced (the pdf_context class and the pdf_read_value() function). However, if you leave the mechanics of how the information is retrieved aside for a moment, you'll notice that the trailer dictionary ends up in an associative array.

If you remember from last month's article, files that have been modified usually contain more than one cross-reference table; this is indicated by the presence of a /Prev key/value pair in the trailer, which points to the beginning of the previous table. If this entry is present, the function simply recurses into itself until all the cross-reference tables present in the file have been read. Note that the information in the older tables and trailers is never allowed to overwrite the data contained in the newer ones: for the cross-reference entries, we simply check that an entry is not already set before storing it; for the trailers, we merge the arrays in a particular order.

Listing 1 (findxref.php)

    <?php

    /*
     * Returns the offset of the most recent
     * cross-reference table in the file
     */

    function pdf_find_xref ($f)
    {
        // First, seek to the end of the file,
        // allowing for 50 bytes just so that
        // we have enough data to look into.

        fseek ($f, -50, SEEK_END);

        // Next, try to find the proper sequence
        // of data. Note that the information can be
        // separated by a Windows-style, Mac-style
        // or Unix-style newline, and that the newline
        // after the %%EOF marker may be missing.

        $data = fread ($f, 50);

        if (!preg_match ('/startxref(?:\r\n|\r|\n)(\d+)(?:\r\n|\r|\n)%%EOF(?:\r\n|\r|\n)?$/', $data, $matches)) {
            die ("Unable to find pointer to xref table");
        }

        // If we get here, then we have the offset
        // where the most recently introduced xref
        // table is.

        return (int) $matches[1];
    }

    ?>
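The rule that newer entries win can be illustrated in isolation, without touching a file. The sketch below is a simplified model rather than the article's implementation: it assumes each cross-reference section has already been read into an array that maps object numbers to generation/offset pairs, and that the sections are supplied newest first, which is the order in which the /Prev chain is followed. The function name pdf_merge_xref_sections() is hypothetical.

```php
<?php

/*
 * Hypothetical illustration: merges xref sections, given
 * newest first, so that an entry from a newer section is
 * never overwritten by one from an older section.
 * Each section maps object number => array (generation => offset).
 */
function pdf_merge_xref_sections ($sections)
{
    $merged = array();

    foreach ($sections as $section) {
        foreach ($section as $object => $generations) {
            foreach ($generations as $generation => $offset) {

                // Only store the entry if a newer section
                // has not already provided it

                if (!isset ($merged[$object][$generation])) {
                    $merged[$object][$generation] = $offset;
                }
            }
        }
    }

    return $merged;
}
```

This is the same isset() check that pdf_read_xref() performs entry by entry while it reads each table.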
Listing 2 (readxref.php)

      1  <?php
      2
      3  /*
      4   * Reads a cross-reference table
      5   *
      6   * If $start and $end are null, the function seeks to
      7   * $offset and reads the table header from there; otherwise
      8   * it reads entries from the current position in the file.
      9   * If more than one part of the xref table is present,
     10   * the function will recurse into itself as many times
     11   * as needed.
     12   */
     13
     14  function pdf_read_xref ($f, &$result, $offset, $start = null, $end = null)
     15  {
     16      // If we didn't get a start and end, we need
     17      // to get them from the document itself.
     18
     19      if (is_null ($start) || is_null ($end)) {
     20
     21          // Move to the start of the table
     22
     23          fseek ($f, $offset);
     24
     25          // Make sure that PHP keeps track of
     26          // the line endings properly
     27
     28          $old_ini = ini_get ('auto_detect_line_endings');
     29          ini_set ('auto_detect_line_endings', true);
     30          // Get a line of text from the file
     31
     32          $data = trim (fgets ($f));
     33
     34          // Make sure the xref marker is where we
     35          // expect it.
     36
     37          if ($data !== 'xref') {
     38              die ("Unable to find xref table");
     39          }
     40
     41          // Now get the next line and split
     42          // it across a single space character
     43
     44          $data = explode (' ', trim (fgets ($f)));
     45
     46          // Make sure the format is what we expected
     47
     48          if (count ($data) != 2) {
     49              die ("Unexpected header in xref table");
     50          }
     51
     52          // Calculate the start and end object
     53          // in the xref table
     54
     55          $start = (int) $data[0];
     56          $end = $start + (int) $data[1];
     57      }
     58
     59      if (!isset ($result['xref_location'])) {
     60          $result['xref_location'] = $offset;
     61      }
     62
     63      if (!isset ($result['max_object']) || $end > $result['max_object']) {
     64          $result['max_object'] = $end;
     65      }
     66
     67      // Now cycle through each object
     68      // pointer
     69
     70      for (; $start < $end; $start++) {
     71
     72          // Get a line of text from the
     73          // file and extract the proper
     74          // information out of there
     75
     76          $data = trim (fgets ($f));
     77
     78          $offset = substr ($data, 0, 10);
     79          $generation = substr ($data, 11, 5);
     80
     81          if (!isset ($result['xref'][$start][(int) $generation])) {
     82              $result['xref'][$start][(int) $generation] = (int) $offset;
     83          }
     84      }
     85
     86      // Get the next line, which could either be the beginning
     87      // of the trailer dictionary or the header of another
     88      // xref section
     89
     90      $data = trim (fgets ($f));
     91
     92      if ($data === 'trailer') {
     93
     94          // Read the trailer dictionary. The value comes back as
     95          // array (type, contents), so the keys live in element 1
     96          $c = new pdf_context ($f);
     97          $trailer = pdf_read_value ($c);
     98
     99          // Check whether there is a /Prev
    100          // entry, which indicates that there
    101          // is another xref table from before
    102
    103          if (isset ($trailer[1]['/Prev'])) {
    104              pdf_read_xref ($f, $result, $trailer[1]['/Prev'][1]);
    105              $result['trailer'][1] = array_merge ($result['trailer'][1], $trailer[1]);
    106          } else {
    107              $result['trailer'] = $trailer;
    108          }
    109
    110      } else {
    111
    112          // We have another xref segment
    113          // to read. Extract the start
    114          // and length, and recurse into
    115          // this function
    116
    117          $data = explode (' ', $data);
    118          pdf_read_xref ($f, $result, null, $data[0], $data[0] + $data[1]);
    119
    120      }
    121
    122      // Restore the line-ending setting if we changed it
    123
    124      if (isset ($old_ini)) {
    125          ini_set ('auto_detect_line_endings', $old_ini);
    126      }
    127  }
    128
    129  ?>

Writing a PDF Lexer

Now that we know where the objects are (the cross-reference table gives us the location of every object in the file), it's time to try and read them. We could, in theory, write a series of ad-hoc functions that try to read from the file and interpret its contents, but things are much easier if we instead make use of that wonderful computer science concept known as the lexer (also known as a tokenizer).

Our lexer will take the input from the PDF file and split it into individual tokens according to a particular set of rules. For example, if we were writing a lexer for reducing the contents of this article to a series of words (with every grammatical element representing a token), we would establish that a token is either a set of characters or a punctuation mark, assuming that whitespace and paragraph markers are of no importance to us.

Identifying tokens in a PDF file is quite simple in theory, although in practical terms you have to watch out for a few potential pitfalls. First the basics: the simplest form of delimiter is whitespace, which has no semantic value (meaning that it is used only for the purpose of delimiting tokens). Whitespace is composed of space characters, carriage returns and line feeds.

This would be enough to cover most situations, but in some cases you'll find that tokens are not delimited by whitespace. When some applications (including some of Adobe's own) "optimize" a PDF file to reduce its size as much as possible, they remove whitespace characters where the distinction between two tokens is made obvious in another way. For example, consider the following snippet of PDF code that shows a small dictionary:

    << /Entry (Value) >>

The whitespace between << and /Entry is made unnecessary by the fact that the two tokens are made up of two completely different classes of characters. Since << can only appear outside of a literal string to indicate the beginning of a dictionary, the lexer should stop at the second open angular bracket and delimit a token before the next character, whatever that is.
Therefore, the snippet above could be rewritten as follows:

    <</Entry (Value)>>

Clearly, whitespace isn't enough to delimit a token; we must also keep in mind all the other character classes that can be used for the same purpose. Listing 3 (tokenizer.php) shows our lexer, the pdf_read_token() function, which looks a lot more complicated than it really is. This file also contains the pdf_context class that we mentioned earlier, which the tokenizer also makes use of.

The pdf_context class is used to create a wrapper around a file pointer that makes it possible to:

• Create a memory-based buffer for the file's contents
• Keep track of the current offset in the buffer and of the length of the buffer
• Maintain a stack of tokens that have been read from the file but not yet used

The buffer is needed because we don't want our tokenizer to read one single character at a time out of the file. By reading a fixed amount at a time and then accessing the data directly in memory, we can save ourselves a few expensive function calls. The token stack is actually used by the portion of the system that is responsible for interpreting the meaning of the tokens; more about that later.

Note that there is no compelling reason to store this information in a class, other than the convenience of having a compact PHP syntax to work with. You could just as easily store everything in an array and avoid OOP altogether, although, in my opinion, that would significantly complicate your code and make it easier to introduce bugs that would be tough to find and fix.

Going back to the pdf_read_token() function for a moment, you can see that it works in a very simple way: first, it skips any whitespace that is at the current offset in the file buffer. Next, it tries to determine the type of token that it is dealing with by looking at the first character.
The procedure used to find the end of the token then varies depending on the character class it belongs to: for array and literal string delimiters, a single character is all we need, whereas for hex string and dictionary delimiters we need to check one more character, since they both share the same initial angular bracket. For all the other types of tokens, we simply scan the buffer until we end up in a different character class.

Parsing the Data

Next in the list, we need to be able to understand the meaning of each token in the context of the PDF file, and this is the job of another great computer science construct: the parser.

Parsers can be very complicated, and are usually not coded by hand; in most cases, a developer would use a "parser generator" like YACC or Bison. These reduce the parser to a relatively complex finite-state machine that is flexible enough to accommodate certain types of languages. In our case, however, the parsing of a PDF file is simple enough that the entire process can be coded in just about 150 lines' worth of PHP.

Before introducing another listing, however, let's consider the types of data that we need to deal with. For the most part, they are simple to handle: for direct values, for example, we read as many tokens as we need from the file and store them in the appropriate data structures. In two cases, however, we need to make a distinction: strings and indirect objects.
Listing 3 (tokenizer.php)

    <?php

    /*
     * This class is used to
     * read data from the input
     * file in a buffered way
     * and to store unused tokens
     */

    class pdf_context
    {
        var $file;
        var $buffer;
        var $offset;
        var $length;

        var $stack;

        // Constructor (PHP 4 style; on PHP 5 and
        // later, name this method __construct)

        function pdf_context ($f)
        {
            $this->file = $f;
            $this->reset();
        }

        // Optionally move the file
        // pointer to a new location
        // and reset the buffered data

        function reset ($pos = null)
        {
            if (!is_null ($pos)) {
                fseek ($this->file, $pos);
            }

            $this->buffer = fread ($this->file, 100);
            $this->offset = 0;
            $this->length = strlen ($this->buffer);
            $this->stack = array();
        }

        // Make sure that there is at least one
        // character beyond the current offset in
        // the buffer to prevent the tokenizer
        // from attempting to access data that does
        // not exist

        function ensure_content()
        {
            if ($this->offset >= $this->length - 1) {
                return $this->increase_length();
            } else {
                return true;
            }
        }

        // Forcefully read more data into the buffer

        function increase_length()
        {
            if (feof ($this->file)) {
                return false;
            } else {
                $this->buffer .= fread ($this->file, 100);
                $this->length = strlen ($this->buffer);
                return true;
            }
        }
    }

    /*
     * Reads a token from the file
     */

    function pdf_read_token (&$c)
    {
        // If there is a token available
        // on the stack, pop it out and
        // return it.

        if (count ($c->stack)) {
            return array_pop ($c->stack);
        }

        // Strip away any whitespace

        do {
            if (!$c->ensure_content()) {
                return false;
            }
            $c->offset += strspn ($c->buffer, " \n\r", $c->offset);
        } while ($c->offset >= $c->length - 1);

        // Get the first character in the stream

        $char = $c->buffer[$c->offset++];

        switch ($char) {

            case '[' :
            case ']' :
            case '(' :
            case ')' :

                // This is either an array or literal string
                // delimiter. Return it

                return $char;

            case '<' :
            case '>' :

                // This could either be a hex string or
                // dictionary delimiter. Make sure that the
                // next character is in the buffer before
                // looking at it, then return the token

                if ($c->offset >= $c->length) {
                    $c->increase_length();
                }

                if ($c->offset < $c->length && $c->buffer[$c->offset] == $char) {
                    $c->offset++;
                    return $char . $char;
                } else {
                    return $char;
                }

            default :

                // This is "another" type of token (probably
                // a dictionary entry or a numeric value)
                // Find the end and return it.

                if (!$c->ensure_content()) {
                    return false;
                }

                while (1) {

                    // Determine the length of the token

                    $pos = strcspn ($c->buffer, " []<>()\r\n/", $c->offset);

                    if ($c->offset + $pos < $c->length - 1) {
                        break;
                    }

                    // If the script reaches this point,
                    // the token may span beyond the end
                    // of the current buffer. Therefore,
                    // we increase the size of the buffer
                    // and try again, unless we have
                    // reached the end of the file.

                    if (!$c->increase_length()) {
                        break;
                    }
                }

                $result = substr ($c->buffer, $c->offset - 1, $pos + 1);

                $c->offset += $pos;
                return $result;
        }
    }

    ?>

The problem with strings (and, particularly, with literal strings) is that they change the rules that our lexer has to follow in order to find the end of the token, because a closed parenthesis could be escaped by a backslash and, therefore, its presence alone does not indicate the end of the string.
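As a hedged illustration of the escaping problem, the sketch below looks for the first closed parenthesis that is preceded by an even number of backslashes (zero included), since a backslash can itself be escaped by another backslash. The function name pdf_find_string_end() is hypothetical and is not part of the article's code; the sketch also ignores the balanced, unescaped parentheses that the PDF format allows inside a literal string.

```php
<?php

/*
 * Hypothetical helper: returns the position of the first
 * closed parenthesis at or after $start that is not escaped
 * by a backslash, or false if there is none.
 */
function pdf_find_string_end ($buffer, $start = 0)
{
    $pos = $start;

    while (($pos = strpos ($buffer, ')', $pos)) !== false) {

        // Count the backslashes immediately before
        // the parenthesis

        $slashes = 0;
        for ($i = $pos - 1; $i >= $start && $buffer[$i] === '\\'; $i--) {
            $slashes++;
        }

        // An even count means that the backslashes escape
        // each other and the parenthesis is a real delimiter

        if ($slashes % 2 == 0) {
            return $pos;
        }

        $pos++;
    }

    return false;
}
```

The function works on a string that is already in memory; a version that reads from the file buffer would also need to ask for more data when strpos() comes back empty-handed.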
In a "traditional" lexer, this problem is taken care of by switching the machine to a new context in which a different set of rules apply. We could, in fact, do the very same thing to our lexer by creating a special case in the switch statement that is part of pdf_read_token() in Listing 3 and writing some additional code that looks for a parenthesis preceded by an even number of backslashes (zero included).

Why an even number? Because the backslashes themselves can be escaped by prefixing them with another backslash. Therefore, an even number of backslashes means that they are all paired up and should be interpreted as literal characters, so that the last one does not escape the parenthesis, which becomes the string delimiter. The last in an odd number of backslashes right before a parenthesis becomes an "orphan" and escapes the parenthesis, thus preventing it from terminating the string.

Given that we only have a limited amount of space and I really wanted to keep things as simple as possible, however, I chose to implement the string parsing functionality inside the parser itself. When an open parenthesis token is returned by the tokenizer, the code simply keeps scanning the input buffer until it finds an unescaped closed parenthesis.

The other problematic data elements are, as I mentioned above, indirect objects. Both object declarations and references are made up of three tokens. Therefore, once our parser encounters a numeric value, it won't be able to tell whether it is part of a larger element until it has read at least one more token, and potentially two. The problem here is not with reading the tokens; it's with what to do with them if the numeric value turns out to be just a numeric value.
We could, in theory, put the extra tokens "back in the buffer" by rolling back the offset pointer to the beginning of the second token, but that would be difficult to do, since we don't really know how many whitespace characters were between the tokens to start with. Therefore, we use a completely different approach: unused tokens are stored in a stack, which is part of the file context. When a new token is requested, pdf_read_token() checks whether anything is present in the stack and, if so, pops it out and returns it without reading a single character from the file buffer.

You can see the end result of all our tribulations in Listing 4 (readvalue.php), which contains the pdf_read_value() function. You will also notice a number of constant definitions that look suspiciously like data types, and they are. Since we'll be reading and writing data back and forth, we'll need to keep track of the object types as we read them from the stream. To do so, each object is encapsulated in an array whose zeroth element indicates the type, while the following elements contain the actual value, which varies depending on the nature of the data. Thus, for example, the trailer dictionary could look like this:

    array (
        PDF_TYPE_DICTIONARY,
        array (
            '/Size' => array (PDF_TYPE_NUMERIC, 22),
            '/Root' => array (PDF_TYPE_OBJREF, 12, 0),
            '/Prev' => array (PDF_TYPE_NUMERIC, 54655)
        )
    );

Not unlike some of its predecessors, pdf_read_value() looks a lot scarier than it actually is. The code is quite heavily commented, so I will limit myself to noting that the type tag in the zeroth element makes identifying the data type of a value practically immediate, which will turn out to be very important later on when we need to write objects back to the file.
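The push-back idea is easier to see on a toy example. The sketch below is a simplified model and not the article's pdf_context: the tokens come from a pre-split array instead of a file buffer, the class name token_source and the function read_numeric_or_reference() are hypothetical, the 'ref' and 'numeric' tags stand in for the PDF_TYPE constants, and the code uses PHP 5+ syntax.

```php
<?php

/*
 * Hypothetical, simplified token source: tokens come from a
 * pre-split array, and unused tokens go onto a stack.
 */
class token_source
{
    public $tokens;
    public $stack = array();

    function __construct ($tokens)
    {
        $this->tokens = $tokens;
    }

    // Pushed-back tokens are served before new ones

    function read()
    {
        if (count ($this->stack)) {
            return array_pop ($this->stack);
        }
        return count ($this->tokens) ? array_shift ($this->tokens) : false;
    }

    function unread ($token)
    {
        $this->stack[] = $token;
    }
}

/*
 * Reads either an indirect reference ("12 0 R") or a plain
 * number, pushing back any token read ahead by mistake.
 */
function read_numeric_or_reference ($src)
{
    $first = $src->read();

    if ($first === false || !is_numeric ($first)) {
        if ($first !== false) {
            $src->unread ($first);
        }
        return false;
    }

    $second = $src->read();

    if ($second !== false && is_numeric ($second)) {
        $third = $src->read();

        if ($third === 'R') {
            return array ('ref', (int) $first, (int) $second);
        }

        // Not a reference: push the third token first, so
        // that the second one is the next to be popped

        if ($third !== false) {
            $src->unread ($third);
        }
    }

    if ($second !== false) {
        $src->unread ($second);
    }

    return array ('numeric', $first + 0);
}
```

Because the stack is last in, first out, the tokens have to be pushed back in reverse order; getting this wrong would silently swap two values.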
Before moving on to the next step, note that we make no provision in our lexer for reading stream data. This is because we are not intent on interpreting everything that is stored in a PDF file, only those elements that allow us to modify its contents. However, adding support for streams shouldn't be too much of a problem: all you need is the ability to resolve object references, which we'll add shortly, since the length of a stream is often expressed in that way.

Listing 4 (readvalue.php)

    <?php

    // Define various data types
    // that we use throughout the system

    define ('PDF_TYPE_NULL', 0);
    define ('PDF_TYPE_NUMERIC', 1);
    define ('PDF_TYPE_TOKEN', 2);
    define ('PDF_TYPE_HEX', 3);
    define ('PDF_TYPE_STRING', 4);
    define ('PDF_TYPE_DICTIONARY', 5);
    define ('PDF_TYPE_ARRAY', 6);
    define ('PDF_TYPE_OBJDEC', 7);
    define ('PDF_TYPE_OBJREF', 8);
    define ('PDF_TYPE_OBJECT', 9);
    define ('PDF_TYPE_STREAM', 10);

    /*
     * Reads a value from the current
     * data stream
     */

    function pdf_read_value (&$c, $token = null)
    {
        // Get a token from the stream.

        if (is_null ($token)) {
            $token = pdf_read_token ($c);
        }

        if ($token === false) {
            return false;
        }

        switch ($token) {

            case '<' :

                // This is a hex string.
                // Read the value, then the terminator

                $s = pdf_read_token ($c);

                if ($s === false) {
                    return false;
                }

                $term = pdf_read_token ($c);

                if ($term !== '>') {
                    die ("Unexpected data after hex string");
                }

                return array (PDF_TYPE_HEX, $s);

            case '<<' :

                // This is a dictionary.

                $result = array();

                // Recurse into this function until we reach
                // the end of the dictionary.

                while (($key = pdf_read_token ($c)) !== '>>') {
                    if ($key === false) {
                        return false;
                    }

                    if (($value = pdf_read_value ($c)) === false) {
                        return false;
                    }

                    $result[$key] = $value;
                }

                return array (PDF_TYPE_DICTIONARY, $result);

            case '[' :

                // This is an array.

                $result = array();

                // Recurse into this function until we reach
                // the end of the array.

                while (($token = pdf_read_token ($c)) !== ']') {
                    if ($token === false) {
                        return false;
                    }

                    if (($value = pdf_read_value ($c, $token)) === false) {
                        return false;
                    }

                    $result[] = $value;
                }

                return array (PDF_TYPE_ARRAY, $result);

            case '(' :

                // This is a string

                $pos = $c->offset;

                while (1) {

                    // Start by finding the next closed
                    // parenthesis. Note that strpos()
                    // returns false when nothing is found.

                    $next = strpos ($c->buffer, ')', $pos);

                    // If you can't find it, try
                    // reading more data from the stream

                    if ($next === false) {
                        if (!$c->increase_length()) {
                            return false;
                        }
                        continue;
                    }

                    $pos = $next;

                    // Make sure that there is no backslash before the parenthesis. If there is,
                    // move on. Otherwise, return the string.

                    if ($c->buffer[$pos - 1] !== '\\') {
                        $result = substr ($c->buffer, $c->offset, $pos - $c->offset + 1);
                        $c->offset = $pos + 1;
                        return array (PDF_TYPE_STRING, $result);
                    } else {
                        $pos++;
                    }
                }

            default :

                if (is_numeric ($token)) {

                    // A numeric token. Make sure that it is not part of something else.

                    if (($tok2 = pdf_read_token ($c)) !== false) {
                        if (is_numeric ($tok2)) {

                            // Two numeric tokens in a row. In this case, we're probably in
                            // front of either an object reference or an object specification.
                            // Determine the case and return the data

                            if (($tok3 = pdf_read_token ($c)) !== false) {
                                switch ($tok3) {

                                    case 'obj' :

                                        return array (PDF_TYPE_OBJDEC, (int) $token, (int) $tok2);

                                    case 'R' :

                                        return array (PDF_TYPE_OBJREF, (int) $token, (int) $tok2);
                                }

    Continued on page 41.

Getting to the Root of the Problem

All the pieces are finally in place: we should now be able to read through the PDF file and interpret its contents, at least to the extent that we need in order to append data to it. To demonstrate how the PDF functionality that we have built works, our goal is to open a PDF file and add a textual element to its first page.

Listing 5 (index.php) is our main script and, unfortunately, it's too large to show here; you will, however, find it in the code associated with this article, so you will hopefully be able to follow me there.

Once we have declared a few variables that we'll end up using throughout the script, we read the cross-reference table from the file, then immediately attempt to retrieve the Root object from it. Because the /Root entry inside the file trailer has to be an indirect object reference, we must find a way to retrieve the actual object data, as the reference itself won't help us much. This is accomplished by the pdf_resolve_object() function, which you can see in Listing 6 as part of the objects.php include file.
The function can actually be used to determine whether any object is an indirect reference and resolve it to the actual object data, something that will come in handy at pretty much every step of the way. As you can see, pdf_resolve_object() first checks to see if the value it has been passed is an indirect object reference. If it isn’t, the function has nothing to do other than return right away. If, on the other hand, it did receive an indirect reference, it uses the cross-reference table to determine the object’s position and starts reading it. The $encapsulate parameter determines how the object is returned to the caller: if it is set to true, pdf_resolve_object() stores the object ID and generation number in the array, so that effectively the object’s data is encapsulated inside another object of type PDF_TYPE_OBJECT. Otherwise, the direct value is returned, and all information regarding the object’s ID and generation number is lost. Both types of return values have their uses. If you want to retrieve an object with the intention of modifying it, you will probably want it encapsulated, so that you can later write it back to the stream. If, on the other hand, you’re just trying to retrieve a value (as you would, for example, if you were reading a stream object and wanted to determine its length), the non-encapsulated version will be easier to handle.

Speaking of retrieving streams, even though my code doesn’t perform that function (since I’m not writing a PDF reader), if you intend to add it, the pdf_resolve_object() function is safe to use because it saves the file pointer’s current position before reading the object and restores it afterwards.
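Listing 6 itself is not reproduced here, so the following is only a minimal sketch of the behaviour just described, not the article's actual code. The function name sketch_resolve_object(), the SK_ constants, the "id gen" keys of the cross-reference array, the $reader callback and the layout of the encapsulated array are all assumptions made for illustration.

```php
<?php
// Type codes follow Listing 4; the SK_ prefix marks them as part of this sketch.
const SK_TYPE_NUMERIC = 1;
const SK_TYPE_OBJREF  = 8;
const SK_TYPE_OBJECT  = 9;

/**
 * Hypothetical stand-in for pdf_resolve_object().
 *
 * $fh      open file handle
 * $xref    assumed format: array('id gen' => byte offset)
 * $value   a value array as returned by a reader such as pdf_read_value()
 * $reader  callback that parses one value at the current file position
 */
function sketch_resolve_object($fh, array $xref, array $value, callable $reader, $encapsulate = true)
{
    // Not an indirect reference: nothing to do.
    if ($value[0] !== SK_TYPE_OBJREF) {
        return $value;
    }

    $key = $value[1] . ' ' . $value[2];
    if (!isset($xref[$key])) {
        return false;
    }

    // Save the file pointer, read the object, then restore the pointer.
    $saved = ftell($fh);
    fseek($fh, $xref[$key]);
    $data = $reader($fh);
    fseek($fh, $saved);

    if ($encapsulate) {
        // Keep ID and generation number so the object can be written back later.
        return array(SK_TYPE_OBJECT, $value[1], $value[2], $data);
    }

    return $data;
}

// Demo with an in-memory "file": the object 7 0 lives at byte offset 14.
$fh = fopen('php://memory', 'w+');
fwrite($fh, "stream header\n42\nrest of stream");
fseek($fh, 6); // pretend we are in the middle of reading something else

$xref   = array('7 0' => 14);
$reader = function ($fh) {
    return array(SK_TYPE_NUMERIC, (int) trim(fgets($fh)));
};

$length = sketch_resolve_object($fh, $xref, array(SK_TYPE_OBJREF, 7, 0), $reader, false);
echo $length[1], ' (file pointer still at ', ftell($fh), ")\n";
```

Passing false as the last argument returns the bare value, which is the convenient form for a lookup such as a stream length; leaving it at true wraps the value together with its ID and generation number.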
If the function didn’t do so and you were reading a stream, resolving the /Length parameter could result in the file pointer being moved to a different location in the file, and you would be unable to read the rest of the stream.

Let’s go back to index.php. With the root object firmly in hand, we can now compile a list of all the pages contained in the document. To do so, we feed the /Pages element of the root dictionary to the pdf_read_pages() function, which you can see in Listing 7 (readpages.php). The reason why we have a separate function just to read through the /Pages element of the root object is that, as I mentioned in last month’s article, the pages could be nested in an arbitrary combination of /Page and /Pages dictionaries, so that we may need to recurse into the function several times in order to end up with an array that contains only page elements. It is important to understand that the order in which the pages are resolved by using this method doesn’t necessarily correspond to the logical order in which they will appear to the user; that is, the first page in the list is not necessarily the first page of the document. The PDF specification provides a different set of facilities for determining the logical page order but, technically speaking, you should only be interested in that if you want to display the contents of a document. In practical terms, I have never found an occasion in which the logical and physical page order didn’t coincide. At most, there might be a fixed discrepancy because the document is an excerpt that starts from, say, page 25, but the order of the pages should usually be the same.

In our sample script, we only take into consideration page 1 (which is the zeroth element of the resulting pages array). We then use the pdf_find_resources() function, shown in Listing 8 (page.php), to retrieve the resources associated with the page.
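The recursion over nested /Pages and /Page dictionaries described above can be sketched as follows. This is not Listing 7: the function name sketch_collect_pages() is made up, plain nested PHP arrays stand in for PDF dictionaries, and every kid is assumed to be already resolved, whereas real code would have to resolve each indirect reference first.

```php
<?php
// Hypothetical, simplified page tree walker. 'Type' and 'Kids' mirror the
// /Type and /Kids dictionary keys; 'Name' exists only to label the demo pages.

function sketch_collect_pages(array $node): array
{
    // A leaf of the tree: a single page.
    if ($node['Type'] === 'Page') {
        return array($node);
    }

    // A /Pages node: recurse into each kid and flatten the results.
    $pages = array();
    foreach ($node['Kids'] as $kid) {
        $pages = array_merge($pages, sketch_collect_pages($kid));
    }

    return $pages;
}

$tree = array(
    'Type' => 'Pages',
    'Kids' => array(
        array('Type' => 'Page', 'Name' => 'A'),
        array('Type' => 'Pages', 'Kids' => array(
            array('Type' => 'Page', 'Name' => 'B'),
            array('Type' => 'Page', 'Name' => 'C'),
        )),
    ),
);

$pages = sketch_collect_pages($tree);
echo count($pages), " pages found\n"; // prints "3 pages found"
```

The result is a flat array that contains only page elements, so picking "page 1" is simply a matter of reading element zero.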
Here, again, we need a dedicated function because, as you may remember, the resource dictionary is an inheritable ...

May 2004 ● PHP Architect ● www.phparch.com

FEATURE: In the Belly of the Beast

Listing 4 (continued)

                            // If we get to this point, that numeric value up
                            // there was just a numeric value. Push the extra
                            // tokens back into the stack and return the value.

                            array_push ($c->stack, $tok3);
                        }
                    }

                    array_push ($c->stack, $tok2);
                }

                return array (PDF_TYPE_NUMERIC, $token);
            } else {
                // Just a token. Return it.

                return array (PDF_TYPE_TOKEN, $token);
            }

    }
}

?>

... will be hard to figure bugs out. While I was writing this article, I lost lots of time debugging a problem that turned out to be just a spelling mistake but that caused Acrobat to crash without any useful error. This brings us to the last tool you’ll need plenty of: patience!

About the Author

Marco is the Publisher of (and a frequent contributor to) php|architect. When not posting bogus bugs to the PHP...

... drawing pointer to be reset to the next line. Finally, all we need to do is update the page’s /Contents array with a reference to our stream. Once again, we need to determine if there already is an array and what it contains, and act accordingly, so that we can add our own data to it.

Writing it All Back

The final step before we can call it a day consists of actually writing our changes back to the file so...

... trying to hack his computer into submission. You can write to him at marcot@phparch.com.

To Discuss this article: http://forums.phparch.com/145
... dictionary. In our case, it contains a pointer to the root object, which has not changed but must be there nonetheless, as well as a numeric value that declares the number of objects stored in the file and a pointer to the previous cross-reference table.

Where to Go From Here

That’s it! As you can see, once one figures out how things work it doesn’t take too long to actually open and modify the contents of...

... a series of commands that PDF borrows from PostScript. In order for the reader to recognize them, we’ll have to encapsulate them in a stream, and add that stream to the contents of the page. When drawing text on the screen, a certain number of transformations can be applied to it: translation (so that you can move the text to the location of your choice), rotation and scaling. In our case, we will only...

... express the rotation that should be applied to the text. They can also be used to determine the scale, but, as I mentioned, that is beyond the scope of this article. The x and y parameters, on the other hand, indicate the coordinates at which we want the text to appear. Finally, Tm is the command itself, which tells the PDF interpreter to apply these values to the text transformation matrix. As you have...

... day consists of actually writing our changes back to the file so that they can be applied to the document. To do so, we first of all open the output file, which was defined at the beginning of the main script (Listing 5). Next, we call the pdf_write_objects() function to rewrite the objects that we modified back to the file. If you take a look at Listing 9 (writer.php), you’ll notice that this function...
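As a rough illustration of the Tm command mentioned in the excerpts above, the sketch below builds its six operands for a plain rotation plus translation. The helper names are hypothetical, and the cosine/sine layout is the standard rotation matrix for the operand order a b c d x y rather than something taken from the article's own listing; scaling is left out, as in the article.

```php
<?php
// Hypothetical helpers, not part of the article's code.

// Rounds to four decimals and normalises a negative zero to plain zero.
function sketch_operand(float $value): float
{
    return round($value, 4) + 0.0;
}

/**
 * Builds "a b c d x y Tm" for text rotated counter-clockwise by $degrees
 * and positioned at ($x, $y), with no scaling applied.
 */
function sketch_text_matrix(float $degrees, float $x, float $y): string
{
    $cos = cos(deg2rad($degrees));
    $sin = sin(deg2rad($degrees));

    // Operand order: a = cos, b = sin, c = -sin, d = cos, then x and y.
    return sprintf(
        '%.4F %.4F %.4F %.4F %.2F %.2F Tm',
        sketch_operand($cos),
        sketch_operand($sin),
        sketch_operand(-$sin),
        sketch_operand($cos),
        $x,
        $y
    );
}

echo sketch_text_matrix(0, 72, 720), "\n";  // unrotated text
echo sketch_text_matrix(90, 72, 720), "\n"; // text running up the page
```

With an angle of zero the first four operands reduce to 1 0 0 1, so the command only moves the text to the given coordinates.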
... with a letter and not a digit. The font resource that we create and add to the font dictionary is the simplest possible one: it uses the Helvetica font, which must be supported by every PDF reader and, therefore, doesn’t need to be embedded in the document itself.

Graffiti on the Wall

We’ve now come to the part where we actually need to “write” some text on the document. Unfortunately, this involves a few...

... ways around that. First, you can actually get Acrobat to provide you with more useful error messages by pressing the Control key while clicking on the OK button in the error window that appears when you try to load a corrupted file. Second, you can use a free PDF decoder (such as the one available online at www.planetpdf.com/mainpage.asp?webpageid=3463) to visually inspect the contents of your file and determine...

Next, we’ll select a font that will be used to draw the text:

/F11 10 Tf

The Tf command sets the current font resource to /F11, with a size of 10 points. Note that the font size can be a floating-point value, so that you could have text in size 12.5. Before writing the text itself, we need to set the spacing between one line of text and the next. This is not as easy to determine as you may think, because it...

... able to read through the PDF file and interpret its contents, at least to the extent that we need in order to be able to append data to it. In order to...

... Table

Now that we know where to look for it, it’s time to figure out how to read the cross-reference table itself. If we move to offset 55,593 of the file, ...

Date posted: 21/12/2013, 11:15
